While working with machine learning or deep learning algorithms, there is a real possibility that your model ends up biased simply because the dataset was never checked for the balance between its two classes.
Let me give you a quick example.
Here’s some sample code through which I have generated an imbalanced dataset (because everyone uses the “famous” credit-card fraud dataset).
This code generates a DataFrame with shape (1000, 6), and the value counts of y come out to roughly 900 samples of class 0 and 100 samples of class 1.
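The original snippet is not reproduced here, so below is a minimal sketch of how such a dataset could be built with scikit-learn's make_classification; the exact parameters (feature split, random_state) are my assumptions, not the original code.

```python
import pandas as pd
from sklearn.datasets import make_classification

# 1000 samples, 5 features + 1 target column -> DataFrame of shape (1000, 6),
# with roughly 90% of samples in class 0 and 10% in class 1.
X, y = make_classification(
    n_samples=1000,
    n_features=5,
    n_informative=3,
    n_redundant=1,
    weights=[0.9, 0.1],   # class imbalance: ~900 zeros, ~100 ones
    random_state=42,
)

df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])
df["y"] = y

print(df.shape)                # (1000, 6)
print(df["y"].value_counts())  # class 0: ~900, class 1: ~100
```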
A model trained on this data will be heavily biased, because 900 out of every 1000 training points have an output of 0.
The classification metrics for a simple Logistic Regression are as follows -
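The report itself is not reproduced here; a sketch of how it would be produced, continuing from the hypothetical df above, looks like this (expect near-perfect numbers):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X = df.drop(columns="y")
y = df["y"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)

# On heavily imbalanced data this report tends to look near-perfect,
# but the numbers are dominated by the majority class (class 0).
print(classification_report(y_test, model.predict(X_test)))
```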
This looks just too good! But there is definitely an issue here, because we should almost never accept that something is 100% correct, especially in machine learning models.
The problem is that the strong numbers we observe in the classification report are driven almost entirely by class 0.
Let us take a sample data point to understand this better.
According to its true label, this data point belongs to class 1. Let us test that out with the Logistic Regression model we have trained.
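A sketch of that check, picking a genuine class-1 point from the hypothetical test set above:

```python
# Take one genuine class-1 point from the test set and ask the model.
sample = X_test[y_test == 1].iloc[[0]]

print("true label :", 1)
print("prediction :", model.predict(sample))  # a biased model often returns [0]
```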
Oops! As expected, the model has classified the data point into class 0. This is exactly why we should not train on an imbalanced dataset, and even if we have one, we should handle the imbalance before moving on to any further step.
To handle an imbalanced dataset, we can employ the following methods -
Random UnderSampling
Random undersampling is a technique used to balance an imbalanced dataset by randomly removing samples from the majority class. The idea behind this technique is to reduce the number of samples in the majority class to match the number of samples in the minority class, thus achieving a balanced dataset (a short sketch follows the lists below).
Advantages:
Simple and easy to implement; this technique does not require any additional data.
Disadvantages:
This method can result in a loss of information, as it discards samples from the majority class that could be useful for training the model.
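A minimal sketch using imbalanced-learn's RandomUnderSampler, assuming the X_train, y_train from earlier:

```python
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class (class 0) samples until both classes
# are the same size as the minority class.
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X_train, y_train)

print(Counter(y_res))  # both classes now equal in size
```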
Random OverSampling
Random oversampling is a technique used to balance an imbalanced dataset by randomly duplicating samples from the minority class. The idea is to increase the number of samples in the minority class to match the number of samples in the majority class, creating a balanced dataset (a sketch follows the lists below).
Advantages:
Simple and easy to implement. It does not require additional data.
Helps to improve the performance of the model on the minority class by providing more samples for training.
Disadvantages:
This method can lead to overfitting, as it introduces duplicate samples, which may cause the model to memorize the training data rather than generalize to new data.
Randomly duplicating samples might not capture the true underlying distribution of the minority class, potentially leading to biased results.
The increased number of samples in the minority class can make the training process computationally expensive, especially for large datasets.
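A minimal sketch using imbalanced-learn's RandomOverSampler, again assuming the earlier X_train, y_train:

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

# Randomly duplicate minority-class (class 1) samples until both classes
# are the same size as the majority class.
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X_train, y_train)

print(Counter(y_res))  # both classes now equal in size
```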
SMOTETomek
SMOTETomek is a hybrid approach for handling imbalanced datasets. It combines two popular techniques: SMOTE and Tomek links.
SMOTE - Synthetic Minority Over-sampling Technique
It is an oversampling technique that generates synthetic samples for the minority class by interpolating between existing minority class samples.
It does this by selecting a random minority class sample, finding its k nearest neighbours, and then creating new samples along the line segments connecting the sample to those neighbours.
Once synthetic points are generated along those segments, the next minority sample is chosen and the process is repeated until the desired number of synthetic instances has been created.
Mathematically, if x_i represents an original minority class sample, x_j represents one of its k nearest neighbours, and alpha represents a random interpolation factor, the process of generating a synthetic sample (x_n or x_new) can be written as:
$$x_n = x_i + \alpha \cdot (x_j - x_i)$$
where:
x_n: The synthetic sample created.
x_i: The original minority class sample.
x_j: One of the k nearest neighbours of x_i.
alpha: A random number between 0 and 1, determining the position of the new sample along the line segment between x_i and x_j.
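A tiny worked example of this interpolation step, with made-up numbers:

```python
import numpy as np

x_i = np.array([2.0, 3.0])   # original minority class sample
x_j = np.array([4.0, 7.0])   # one of its k nearest neighbours
alpha = 0.5                  # random draw from [0, 1]

x_new = x_i + alpha * (x_j - x_i)
print(x_new)  # [3. 5.] -- halfway along the segment from x_i to x_j
```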
Tomek Links
It is an undersampling technique that identifies pairs of overlapping samples from the majority and minority classes and removes samples from those pairs (typically from the majority class).
In simple words, a Tomek link is a pair of samples from different classes that are each other's nearest neighbours, i.e. two points from opposite classes sitting very close together.
These pairs create noisy regions in the dataset and can lead to misclassification.
Tomek links appear wherever data points from class 0 and class 1 lie very close to each other (the overlapping region in the scatter plot that originally accompanied this section).
Once the Tomek links are identified, the majority class sample from each link is removed to reduce noise.
By removing the majority class points involved in Tomek links, the classifier's focus shifts away from noisy boundary regions, potentially leading to a more robust and accurate decision boundary. At the same time, this undersamples the majority class in a targeted and effective way.
It's worth noting that SMOTETomek is not a one-size-fits-all solution, and its effectiveness depends on the specific characteristics of the dataset. It should be used in conjunction with other techniques like feature selection and parameter tuning.
However, it is a powerful tool for addressing imbalanced datasets and can significantly improve the performance of machine learning models.
Moreover, SMOTETomek is a relatively easy-to-use approach that can be applied to different types of machine learning models. It has been shown to perform well in various applications such as fraud detection, credit scoring, and medical diagnosis.
One very important point to note: if your dataset is very large, balancing it with SMOTETomek can take ages, since the underlying nearest-neighbour search is expensive (and runs on the CPU, so even an RTX GPU will not speed it up). If this is the case, I would suggest passing the dataset in batches to reduce the time and computational cost.
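A minimal usage sketch with imbalanced-learn's SMOTETomek, again assuming the earlier X_train, y_train:

```python
from collections import Counter
from imblearn.combine import SMOTETomek

# SMOTE oversamples the minority class first, then samples involved
# in Tomek links are removed to clean up the class boundary.
smt = SMOTETomek(random_state=42)
X_res, y_res = smt.fit_resample(X_train, y_train)

print(Counter(y_res))
```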
SUMMARY -
SMOTETomek = SMOTE (over-sampling) + Tomek Links (under-sampling)
SMOTE - take a sample from the minority class, find its nearest neighbours (k-NN), and synthetically produce data points on the lines joining the point to its neighbours.
Tomek Link - a point from class A and a point from class B are very close to each other; identify such pairs and remove the point belonging to the majority class.
Not a one-size-fits-all solution, but effective alongside preprocessing, feature engineering, hyperparameter tuning, and careful metric selection (accuracy can be highly misleading; consider precision, recall, or the F-beta score).
Easy to use and very effective.
If the dataset is very large, send the data in batches.
I hope you find this article fun-to-learn and not complex. Happy Learning :)
For any suggestions or just a chat-
Raj Kulkarni - LinkedIn