Handling Imbalanced Classes in Classification
An imbalanced class distribution in a dataset biases machine learning algorithms towards the majority class, which is usually not what we want. The bias comes from the cost function, which typically assigns the same loss to every misclassified instance regardless of its class frequency, so the cheapest way for the model to reduce total loss is to get the majority class right.
When a dataset is naturally highly imbalanced, the goal of classification is usually to detect the minority class rather than the majority class. For example, suppose we have diagnostic information for n individuals in a medical dataset and want to predict whether each individual has cancer. In that case, we focus on correctly predicting the cancer cases, which form the minority class in this example.
Recall and Precision metrics to measure performance
- Recall: When the goal is to minimize false negative predictions
- Example: We would rather flag a non-cancer case as cancer than miss a real cancer case.
- Metric:
Recall = TP / (TP + FN)
- Precision: When the goal is to minimize false positive predictions
- Example: A spam detector marks a valid email as spam, and you never get to see it. We would rather receive some spam than lose legitimate emails. Here, spam is the positive class, and false positives are what we want to avoid.
- Metric:
Precision = TP / (TP + FP)
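To make these definitions concrete, here is a minimal sketch that computes both metrics with scikit-learn; the labels are made up purely for illustration, with 1 marking the positive (cancer) class.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # 4 cancer cases (minority class)
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # one missed case (FN), one false alarm (FP)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("recall    =", tp / (tp + fn))   # 3 / (3 + 1) = 0.75
print("precision =", tp / (tp + fp))   # 3 / (3 + 1) = 0.75

# The same values via the built-in scorers
print(recall_score(y_true, y_pred), precision_score(y_true, y_pred))
```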
Because increasing recall typically lowers precision (and vice versa), we have to make a tradeoff. This is called the precision-recall tradeoff, and it can be visualized with a precision-recall curve.
To evaluate performance, we then pick a metric and a target value for it. Say we want recall to be at least 90%. Then every model we build should reach 90% recall, and among those that do, the one with the highest precision wins.
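A minimal sketch of this selection rule, assuming a binary classifier with predict_proba; the dataset and model are synthetic placeholders, and the 90% target comes from the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset: roughly 10% positives
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, proba)
meets_target = np.flatnonzero(recall[:-1] >= 0.90)        # operating points with >= 90% recall
best = meets_target[np.argmax(precision[meets_target])]   # highest precision among them
print(f"threshold={thresholds[best]:.2f} recall={recall[best]:.2f} precision={precision[best]:.2f}")
```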
Some of the techniques used to handle imbalanced datasets
- Resampling the dataset
- Under-sampling the majority class: we lose information by throwing away data.
- Over-sampling the minority class: we can over-sample the minority class by making multiple copies of its instances, and some algorithms, such as SMOTE, create synthetic minority-class examples instead. This artificial data may introduce new errors into the model, and it then becomes hard to tell whether an error comes from the real data or the synthetic data (a resampling sketch is given after this list).
- Increasing the penalty for misclassifying the minority class in the prediction algorithm. Most classification algorithms support this (a class-weighting sketch is given after this list):
  - scikit-learn: class_weight arg
  - XGBoost: scale_pos_weight arg
- Varying the decision threshold
Many classifiers in the scikit-learn package can also output class probabilities. To trade recall against precision, we can declare a prediction positive only when its probability exceeds a chosen threshold. Varying this decision threshold changes the default behavior, where the threshold is 0.5.
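A minimal sketch of threshold tuning; the synthetic dataset, logistic regression model, and the specific cutoff values are purely illustrative. Lowering the cutoff below the default 0.5 raises recall at the cost of precision.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)[:, 1]          # probability of the positive class
for threshold in (0.5, 0.3, 0.1):           # 0.5 reproduces the default behavior
    y_pred = (proba >= threshold).astype(int)
    print(threshold, recall_score(y, y_pred), precision_score(y, y_pred))
```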
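The resampling sketch referenced in the list above: one common way to do both under- and over-sampling (including SMOTE) is the third-party imbalanced-learn package, used here on a synthetic dataset for illustration.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Toy imbalanced dataset: roughly 5% positives
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Under-sampling: drop majority-class rows until the classes are balanced
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)

# Over-sampling: SMOTE creates synthetic minority-class examples
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)

print(Counter(y), Counter(y_under), Counter(y_smote))
```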
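The class-weighting sketch referenced in the list above, again on a synthetic dataset. Setting class_weight="balanced" in scikit-learn and scale_pos_weight to the negative-to-positive ratio in XGBoost are common choices; the XGBoost lines are left commented out in case the package is not installed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# scikit-learn: "balanced" weights errors inversely to class frequency,
# so mistakes on the minority class cost more
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

# XGBoost: scale_pos_weight is commonly set near (negative count / positive count)
# import xgboost as xgb
# ratio = (y == 0).sum() / (y == 1).sum()
# booster = xgb.XGBClassifier(scale_pos_weight=ratio).fit(X, y)
```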