Handling Imbalanced Data

Abstract:

In real-world machine learning tasks, class imbalance is a common challenge: one class (often the class of interest) is significantly rarer than the others. For example, in medical diagnosis tasks such as cancer detection, positive cases may make up only a small fraction of the dataset. Models trained on such imbalanced data tend to be biased toward the majority class, resulting in poor performance on the minority class, which is often the class that matters most.

Addressing this imbalance is essential, especially when correctly classifying the minority class carries high importance. The choice of performance metric should reflect this priority: precision, recall, F1-score, and area under the precision-recall curve are more informative than accuracy in these settings.
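As a quick illustration, here is a minimal sketch (using scikit-learn; the logistic regression model and the train/test split are placeholder choices for the example) showing how these metrics can be computed on an imbalanced dataset:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, average_precision_score

# Toy dataset with roughly a 90/10 class split
X, y = make_classification(n_classes=2, weights=[0.9, 0.1], n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]

# Per-class precision, recall, and F1 reveal minority-class performance that accuracy hides
print(classification_report(y_test, y_pred))
# Area under the precision-recall curve for the positive (minority) class
print("Average precision:", average_precision_score(y_test, y_score))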


One of the most straightforward approaches to handling class imbalance is resampling:
- Undersampling reduces the size of the majority class
- Oversampling increases the representation of the minority class, often by duplicating or synthetically generating new examples

These techniques aim to balance the class distribution and help the model learn from both classes more effectively.

Techniques for Over- and Under-Sampling

Oversampling Techniques

1. Random Oversampling

Duplicates minority class samples randomly until class balance is achieved.

Pros: Simple, easy to implement
Cons: Risk of overfitting due to repeated samples

from imblearn.over_sampling import RandomOverSampler
from collections import Counter
from sklearn.datasets import make_classification

X, y = make_classification(n_classes=2, weights=[0.9, 0.1], n_samples=1000, random_state=42)
print("Before:", Counter(y))

ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)
print("After:", Counter(y_res))

2. SMOTE (Synthetic Minority Over-sampling Technique)

Generates synthetic samples by interpolating between existing minority class instances.

Pros: Avoids exact duplicates, reduces overfitting
Cons: May introduce noise if the minority class has a lot of variation

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X, y)
print("After SMOTE:", Counter(y_smote))

SMOTE selects a minority class sample, finds its k nearest minority class neighbors, and generates new samples along the line segments connecting the sample to those neighbors.
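For intuition, here is a rough sketch of that interpolation step (an illustration built on scikit-learn's NearestNeighbors, not imbalanced-learn's internal implementation; it reuses X and y from the snippet above, where class 1 is the minority):

import numpy as np
from sklearn.neighbors import NearestNeighbors

X_min = X[y == 1]  # minority class samples only

# k nearest minority neighbors of each minority sample (the first neighbor is the sample itself)
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
_, idx = nn.kneighbors(X_min)

# New point: x_new = x_i + lam * (x_neighbor - x_i), with lam drawn from [0, 1)
rng = np.random.default_rng(42)
i = rng.integers(len(X_min))            # pick a minority sample
j = idx[i][rng.integers(1, k + 1)]      # pick one of its k neighbors
lam = rng.random()
x_new = X_min[i] + lam * (X_min[j] - X_min[i])  # synthetic sample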

Undersampling Techniques

1. Random Undersampling

Randomly removes instances from the majority class.

Pros: Simple and fast
Cons: Can lose important data, leading to underfitting

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X, y)
print("After Random Undersampling:", Counter(y_rus))

2. Tomek Links

Removes majority class examples that are very close to the minority class, cleaning ambiguous decision boundaries.

Pros: Helps improve class separation
Cons: Less aggressive than other undersampling methods, so it rarely balances the classes on its own

from imblearn.under_sampling import TomekLinks

tl = TomekLinks()
X_tl, y_tl = tl.fit_resample(X, y)
print("After Tomek Links:", Counter(y_tl))

Tomek Links: a pair of samples from different classes forms a Tomek link if each is the other's nearest neighbor. The majority class sample in each pair is removed.
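To make the definition concrete, here is a rough sketch of detecting such pairs with scikit-learn's NearestNeighbors (an illustration only, not the library's implementation; it reuses X and y from above):

from sklearn.neighbors import NearestNeighbors

# Nearest neighbor of every sample, excluding the sample itself
nn = NearestNeighbors(n_neighbors=2).fit(X)
_, idx = nn.kneighbors(X)
nearest = idx[:, 1]

# (i, j) is a Tomek link if i and j are mutual nearest neighbors with different labels
tomek_pairs = [(i, int(j)) for i, j in enumerate(nearest)
               if nearest[j] == i and y[i] != y[j] and i < j]
print("Tomek links found:", len(tomek_pairs))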

Combined Sampling

You can combine over- and undersampling to leverage the strengths of both. SMOTETomek, for example, first oversamples the minority class with SMOTE and then removes Tomek links to clean up the resulting decision boundary.

from imblearn.combine import SMOTETomek

smote_tomek = SMOTETomek(random_state=42)
X_combined, y_combined = smote_tomek.fit_resample(X, y)
print("After SMOTE + Tomek:", Counter(y_combined))
