Handling Imbalanced Data
Abstract:
In real-world machine learning tasks, class imbalance is a common challenge: one class or data category (often the class of interest) is significantly rarer than the others. For example, in medical diagnosis tasks such as cancer detection, positive cases may represent only a small fraction of the dataset. Models trained on such imbalanced data tend to be biased toward the majority class, resulting in poor performance on the minority class, which is often the class that matters most in practice.
Addressing this imbalance is essential, especially when correct classification of the minority class carries high importance. The choice of performance metrics should reflect this priority—metrics like precision, recall, F1-score, or area under the precision-recall curve are more informative than accuracy in these settings.
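As a quick illustration (the label counts below are made up purely for demonstration), a classifier that misses most positives can still report high accuracy, while per-class precision, recall, and F1 reveal the problem:
from sklearn.metrics import accuracy_score, classification_report
# Hypothetical imbalanced task: 95 negatives, 5 positives; the model catches only 2 positives
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 95 + [1, 1, 0, 0, 0]
print("Accuracy:", accuracy_score(y_true, y_pred))        # 0.97, looks excellent
print(classification_report(y_true, y_pred, digits=3))    # recall for class 1 is only 0.400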
One of the most straightforward approaches to handling class imbalance is resampling:
- Undersampling reduces the size of the majority class.
- Oversampling increases the representation of the minority class, often by duplicating existing examples or synthetically generating new ones.
These techniques aim to balance the class distribution and help the model learn from both classes more effectively.
Techniques for Over- and Under-Sampling
Oversampling Techniques
1. Random Oversampling
Duplicates minority class samples randomly until class balance is achieved.
Pros: Simple, easy to implement
Cons: Risk of overfitting due to repeated samples
from imblearn.over_sampling import RandomOverSampler
from collections import Counter
from sklearn.datasets import make_classification
# Synthetic binary dataset with a 90/10 class split
X, y = make_classification(n_classes=2, weights=[0.9, 0.1], n_samples=1000, random_state=42)
print("Before:", Counter(y))
# Randomly duplicate minority-class samples until the classes are balanced
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)
print("After:", Counter(y_res))
2. SMOTE (Synthetic Minority Over-sampling Technique)
Generates synthetic samples by interpolating between existing minority class instances.
Pros: Avoids exact duplicates, reduces overfitting
Cons: May introduce noise if the minority class has a lot of variation
from imblearn.over_sampling import SMOTE
# Create synthetic minority-class samples by interpolation (reuses X, y from above)
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X, y)
print("After SMOTE:", Counter(y_smote))
SMOTE finds the k nearest minority-class neighbors of each minority sample and generates new samples along the line segments connecting them.
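To make that interpolation step concrete, here is a minimal sketch of how a single synthetic sample could be built (illustrative only, not imblearn's actual implementation); it reuses X and y from above, and the choice of 5 neighbors and a uniform interpolation factor are assumptions matching the usual description of SMOTE.
import numpy as np
from sklearn.neighbors import NearestNeighbors
rng = np.random.default_rng(42)
X_min = X[y == 1]                                  # minority-class samples
nn = NearestNeighbors(n_neighbors=6).fit(X_min)    # neighbor 0 of each point is the point itself
_, idx = nn.kneighbors(X_min)
i = rng.integers(len(X_min))                       # pick a random minority sample
j = rng.choice(idx[i, 1:])                         # one of its 5 nearest minority neighbors
lam = rng.random()                                 # interpolation factor in [0, 1)
x_new = X_min[i] + lam * (X_min[j] - X_min[i])     # new point on the segment between them
print("Synthetic sample:", x_new)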
Undersampling Techniques
1. Random Undersampling
Randomly removes instances from the majority class.
Pros: Simple and fast
Cons: Can lose important data, leading to underfitting
from imblearn.under_sampling import RandomUnderSampler
# Randomly drop majority-class samples until the classes are balanced
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X, y)
print("After Random Undersampling:", Counter(y_rus))
2. Tomek Links
Removes majority class examples that are very close to the minority class, cleaning ambiguous decision boundaries.
Pros: Helps improve class separation
Cons: Less aggressive than other undersampling methods
from imblearn.under_sampling import TomekLinks
# Remove majority-class samples that form Tomek Links with minority samples
tl = TomekLinks()
X_tl, y_tl = tl.fit_resample(X, y)
print("After Tomek Links:", Counter(y_tl))
Tomek Link: a pair of samples from different classes forms a Tomek Link if the two are each other's nearest neighbors. The majority-class sample of the pair is then removed.
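To make the definition concrete, here is a minimal sketch that detects Tomek Links with scikit-learn's NearestNeighbors (illustrative only, not how imblearn's TomekLinks is implemented); it reuses X and y from above.
from sklearn.neighbors import NearestNeighbors
nn = NearestNeighbors(n_neighbors=2).fit(X)        # neighbor 0 of each point is the point itself
_, idx = nn.kneighbors(X)
nearest = idx[:, 1]                                # index of each sample's nearest neighbor
links = [(i, j) for i, j in enumerate(nearest)     # mutual nearest neighbors from different classes
         if nearest[j] == i and y[i] != y[j] and i < j]
print("Tomek Links found:", len(links))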
Combined Sampling
You can combine over- and undersampling to leverage the strengths of both.
from imblearn.combine import SMOTETomek
# SMOTE oversampling followed by Tomek-link cleaning of the result
smote_tomek = SMOTETomek(random_state=42)
X_combined, y_combined = smote_tomek.fit_resample(X, y)
print("After SMOTE + Tomek:", Counter(y_combined))