What is class imbalance?
A dataset is said to have a class imbalance problem if there is a large difference between the number of observations in each class. For example, a classification problem of finding fraudulent transactions will have very few fraudulent transactions compared to non-fraudulent transactions.
Why is it a problem?
Because machine learning algorithms try to fit the data in the way that gives the highest accuracy. So, if our fraudulent transactions dataset has 1% fraudulent transactions, the model can simply predict every transaction as non-fraudulent and reach 99% accuracy without identifying a single fraudulent transaction, which is obviously not what we are looking for.
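To make the 99% figure concrete, here is a minimal sketch with made-up labels (1,000 transactions, 1% of them fraudulent) and a "model" that always predicts the majority class. The data and variable names are illustrative assumptions, not taken from a real dataset.

```python
# Hypothetical labels: 1 = fraudulent, 0 = non-fraudulent (1% fraud rate).
y_true = [1] * 10 + [0] * 990

# A "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for the fraud class: caught frauds / all actual frauds.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fraud_recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}, fraud recall = {fraud_recall:.2f}")
# prints: accuracy = 0.99, fraud recall = 0.00
```

The accuracy looks excellent even though not one fraudulent transaction is caught, which is why accuracy alone is a poor metric here.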
Ways to deal with class imbalance?
- Data Level approach
- Algorithmic approach
We will be using the Jigsaw Toxic Comment Classification dataset to illustrate the effect of various techniques. A sample of five observations is shown below:
The table has 8 columns: id, comment_text, and 6 toxicity labels. For simplicity, we will just predict whether a comment is bad or not. For that, let’s create a new column called “bad_commet” with value 1 for toxic and 0 for non-toxic comments.
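One way to build that column is sketched below in plain Python. The six label column names are assumed from the public Jigsaw data (they are not listed in the text above), and the two rows are made up for illustration.

```python
# Label column names as in the public Jigsaw data (an assumption here).
LABEL_COLUMNS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Two made-up rows standing in for the real data.
rows = [
    {"id": "a1", "comment_text": "Thanks for the fix!", "toxic": 0, "severe_toxic": 0,
     "obscene": 0, "threat": 0, "insult": 0, "identity_hate": 0},
    {"id": "b2", "comment_text": "You are an idiot.", "toxic": 1, "severe_toxic": 0,
     "obscene": 0, "threat": 0, "insult": 1, "identity_hate": 0},
]

for row in rows:
    # A comment counts as bad if any of the six labels is set.
    row["bad_commet"] = int(any(row[col] == 1 for col in LABEL_COLUMNS))

print([row["bad_commet"] for row in rows])  # prints: [0, 1]
```

With a pandas data frame, the same idea is usually expressed as a row-wise maximum over the six label columns.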
The approach we are going to follow is to first do some basic cleaning of the comments and then apply TF-IDF, whose output will be the input to the classification models.
Next, we split the data into train and test sets and transform both with a TF-IDF vectorizer.
Now let’s fit a simple logistic regression model and see how it performs on our train and test data using a classification report.
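The split, TF-IDF and logistic regression steps could look roughly like the sketch below. It assumes scikit-learn is installed and uses a tiny made-up corpus, so the numbers it prints say nothing about the real dataset; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Tiny made-up corpus (1 = toxic, 0 = non-toxic); the real data is far larger.
clean = [
    "thanks for the helpful edit", "great article, well sourced",
    "i fixed the typo in the intro", "please add a citation here",
    "the table looks good now", "can you review my change",
    "nice work on the summary", "i agree with this revision",
    "the link seems to be broken", "welcome to the project",
    "good point, i will update it", "this section needs an image",
]
toxic = [
    "you are a stupid idiot", "shut up you moron",
    "what a stupid useless idiot", "you moron, go away",
]
comments = clean + toxic
labels = [0] * len(clean) + [1] * len(toxic)

# Stratify so that both splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=0.25, stratify=labels, random_state=42
)

# Fit TF-IDF on the training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer(lowercase=True)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)

report = classification_report(y_test, model.predict(X_test_tfidf), zero_division=0)
print(report)
```

Fitting the vectorizer on the training split only keeps test-set vocabulary from leaking into the features.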
At first glance it might seem like we got pretty good results with 95% accuracy, but if you inspect all the metrics, the recall for the toxic class (class 1) is just 0.55. In other words, the model failed to identify almost half of the toxic comments, which is not what we want. This is exactly the problem with imbalanced data.
Now we will look at various methods to mitigate the issue of not identifying the toxic comments.
Data Level approaches
- Up Sampling
In up sampling, we pick random samples from the minority class (with replacement) and append them to the original dataset to give the minority class higher representation. The steps are described below.
First, we join the comments with their respective labels so that we can randomly sample minority instances.
There are 100341 non-toxic and 11358 toxic comments. Let’s randomly sample from the minority class (with replacement) and append the samples to the minority data frame.
Now our new dataset has an equal number of minority and majority instances. The next step is to apply the TF-IDF vectorizer and fit a model.
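The up sampling step can be sketched in plain Python as follows. The rows are made up, and the variable names and the fixed seed are illustrative assumptions.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Made-up training rows as (comment, label) pairs; 1 = toxic, 0 = non-toxic.
train = [(f"clean comment {i}", 0) for i in range(9)] + [
    (f"rude comment {i}", 1) for i in range(3)
]

majority = [row for row in train if row[1] == 0]
minority = [row for row in train if row[1] == 1]

# Draw minority rows with replacement until both classes are the same size.
extra = rng.choices(minority, k=len(majority) - len(minority))
upsampled = train + extra
rng.shuffle(upsampled)

n_toxic = sum(label for _, label in upsampled)
print(len(upsampled), n_toxic)  # prints: 18 9
```

Up sampling is normally applied to the training split only, so that the test set keeps the original class distribution.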
There are two things to be noticed here:
- In the test report we can see that the recall has improved significantly, from 0.55 to 0.81.
- Precision has dropped to 0.60; in other words, 40% of the comments we predicted as toxic were not actually toxic. Whether this is a problem depends on what the business needs.
Cons:
- Due to repeated instances of minority class observations, this method could lead to overfitting
- Down Sampling
Down sampling is the complete opposite of up sampling. Here we reduce the number of majority observations by random sampling.
Cons:
- The number of observations left after down sampling could be too low to train a model
- Could also lead to loss of important information from the dataset
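A minimal sketch of random down sampling, again with made-up rows and illustrative names, might look like this:

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Made-up training rows as (comment, label) pairs; 1 = toxic, 0 = non-toxic.
train = [(f"clean comment {i}", 0) for i in range(9)] + [
    (f"rude comment {i}", 1) for i in range(3)
]

majority = [row for row in train if row[1] == 0]
minority = [row for row in train if row[1] == 1]

# Keep a random subset of the majority class (sampled without replacement).
kept_majority = rng.sample(majority, k=len(minority))
downsampled = kept_majority + minority
rng.shuffle(downsampled)

print(len(downsampled))  # prints: 6
```

Here six of the nine majority rows are thrown away, which illustrates the information loss listed among the cons.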
- Synthetic Minority Over Sampling Technique(SMOTE)
SMOTE is used to reduce the overfitting that plain over sampling can cause. A sample of minority class observations is chosen, and similar synthetic instances are created (by interpolating between a chosen observation and its nearest minority class neighbors) and added to the original dataset.
Pros:
- Less prone to the overfitting caused by over sampling, as new instances are created rather than duplicated
- No loss of information like in downsampling
Cons:
- Creating new instances could lead to noise
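The core idea of SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration on made-up 2-D points, not a full implementation: the function name `smote_sketch` and its parameters are hypothetical, and on text data the points would be the TF-IDF vectors.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Pick a random position on the segment between base and neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Made-up minority class feature vectors.
minority_points = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.8), (1.2, 2.0)]
new_points = smote_sketch(minority_points, n_new=5)
print(new_points)
```

Every synthetic point lies on a line segment between two real minority points, which is why the new instances are similar to, but not copies of, the originals. Libraries such as imbalanced-learn provide a complete implementation.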
- Modified Synthetic Minority Over Sampling Technique (MSMOTE)
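MSMOTE is similar to SMOTE, but it also considers the distribution of the minority class and noise in the dataset.
It uses the distances between samples to split the minority class into three groups: safe samples, border samples, and latent noise samples. Synthetic instances are then generated from a randomly chosen one of the k nearest neighbors for safe samples and from the single nearest neighbor for border samples, while latent noise samples are left out.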
Algorithmic approaches
- Applying class weights
Many classification models let you assign a weight to each class of observations. For example, if our data has a 99%:1% imbalance, we can tell the model that misclassifying a minority instance is 99 times more costly than misclassifying a majority instance.
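As a rough illustration of where the factor of 99 comes from, the sketch below computes inverse-frequency class weights for made-up labels with a 99:1 split. The variable names are illustrative.

```python
from collections import Counter

# Made-up labels with a 99%:1% imbalance (1 = minority class).
y_train = [0] * 99 + [1] * 1

counts = Counter(y_train)
n_samples = len(y_train)
n_classes = len(counts)

# Inverse-frequency weights: n_samples / (n_classes * class_count).
class_weight = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

# How much more a minority instance counts than a majority instance.
ratio = class_weight[1] / class_weight[0]
print(class_weight[1], round(ratio, 6))  # prints: 50.0 99.0
```

scikit-learn applies this same formula when an estimator such as `LogisticRegression` is given `class_weight="balanced"`; an explicit dictionary like `{0: 1, 1: 99}` can be passed instead.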
- Bagging based techniques
(need to understand how this solves imbalance issue)
- Boosting based techniques
- AdaBoost technique
- Gradient Tree Boosting technique
- XGBoost technique