Machine Learning: How to Handle Class Imbalance

February 13, 2021

Ken Hoffman

Introduction:

When building a classification model, you may get seemingly great results (a high accuracy score) only to realize that your model is predicting every observation as a single class. This is caused by class imbalance: a problem in machine learning where the number of observations in one class significantly outnumbers the number in another. To illustrate, say you have a two-class dataset with 50 diabetes patients and 5000 non-diabetes patients. A classification model trained on this data will tend to classify every patient as a non-diabetes patient because it can't pick up on the patterns that would indicate diabetes. Even though the model would end up with an accuracy score of roughly 99% just by classifying every patient as a non-diabetes patient, it is not actually doing its job of separating the two classes.

Many datasets will have an uneven number of instances in each class, and a small difference is usually acceptable. As a rule of thumb, if a two-class dataset is split more unevenly than about 65%/35%, then it should be treated as a dataset with class imbalance. If you are using a dataset with more than two classes and are unsure whether it is imbalanced, you can always run your models without any adjustments and check whether they are functioning appropriately or only predicting certain classes.
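
To see the problem concretely, here is a minimal sketch (assuming scikit-learn is installed) that mimics the 50-vs-5000 diabetes example with a baseline classifier that always predicts the majority class:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Simulated imbalanced data: ~99% majority class, ~1% minority class,
# mirroring 5000 non-diabetes vs. 50 diabetes patients.
X, y = make_classification(n_samples=5050, weights=[0.99], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# A "model" that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(accuracy_score(y_test, baseline.predict(X_test)))  # ~0.99, yet it learned nothing
```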

Ways to Handle Class Imbalance:

I) Use a Different Performance Metric

As discussed earlier, accuracy is not a good metric to use when there is class imbalance in your data. Some metrics that can be more helpful when handling class imbalance are listed below (see the sketch after this list):

  • Precision — the fraction of instances predicted as positive that are actually positive
  • Recall — the fraction of actual positive instances that the model correctly identifies
  • F1 Score — the harmonic mean of Precision and Recall
  • Confusion Matrix — a table that breaks down correct and incorrect predictions by class
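
As a quick illustration, here is a minimal sketch of computing these metrics with scikit-learn on hypothetical labels. The all-majority predictions score 80% accuracy, while precision, recall, and F1 for the minority class are all zero:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical ground truth: 2 positives among 10 samples,
# and a model that predicts the majority class every time.
y_true = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
y_pred = [0] * 10

print(confusion_matrix(y_true, y_pred))
# [[8 0]
#  [2 0]]  -> both positives misclassified, invisible to the 80% accuracy
print(classification_report(y_true, y_pred, zero_division=0))
# precision, recall, and F1 for class 1 are all 0.00
```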

II) Collect More Data

This might be self-explanatory, but by collecting more data you may be able to create a more balanced dataset. If more data is available that can help balance the classes, this is an easy and simple way to combat class imbalance.

III) Resample Your Dataset

You can resample your dataset to make it more balanced, either by adding copies of instances from the minority class (over-sampling) or by deleting instances from the majority class (under-sampling). Both approaches are simple and easy to implement. As a rule of thumb, use under-sampling when you have a very large dataset and over-sampling when your dataset is small. Regardless, it doesn't hurt to try both methods and see how your results are affected.
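
Here is a minimal sketch of both approaches, assuming the imbalanced-learn library (pip install imbalanced-learn) and the simulated 50-vs-5000 data from earlier:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=5050, weights=[0.99], random_state=42)
print(Counter(y))  # e.g. roughly 5000 majority vs. 50 minority instances

# Over-sampling: duplicate minority instances until the classes match.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print(Counter(y_over))  # both classes now equal to the majority count

# Under-sampling: discard majority instances until the classes match.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(Counter(y_under))  # both classes shrunk to the minority count
```

One caveat: resample only the training split, never the test set, so that duplicated rows cannot leak into your evaluation.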

IV) Generate Synthetic Samples

There are also algorithms that generate synthetic samples, such as SMOTE (Synthetic Minority Over-sampling Technique), and algorithms that clean up class overlap, such as Tomek links.

SMOTE is an over-sampling method that creates synthetic samples from the minority class. It works by selecting minority observations that are close to each other in feature space and interpolating new points along the line segments between them.

Tomek links work by detecting pairs of nearest-neighbor observations that belong to opposite classes and removing the majority-class instance from each pair. The goal of Tomek links is to sharpen the border between the minority and majority classes, making the minority region more distinct to the model.
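
Here is a minimal sketch of both techniques, again assuming imbalanced-learn and the simulated data from earlier; SMOTETomek chains the two, in the spirit of the Brownlee article on combining over- and under-sampling cited below:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek

X, y = make_classification(n_samples=5050, weights=[0.99], random_state=42)

# SMOTE: interpolate between nearby minority neighbors to create new samples.
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_sm))  # classes balanced with synthetic minority points

# Tomek links: remove majority members of cross-class nearest-neighbor pairs.
X_tl, y_tl = TomekLinks().fit_resample(X, y)
print(Counter(y_tl))  # slightly fewer majority instances, cleaner boundary

# Combined: SMOTE over-sampling followed by Tomek-link cleaning.
X_st, y_st = SMOTETomek(random_state=42).fit_resample(X, y)
print(Counter(y_st))
```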

References:

  • Team, Towards AI. “Dealing with Class Imbalance — Dummy Classifiers.” Towards AI — The Best of Tech, Science, and Engineering, 2 Aug. 2020.
  • Brownlee, Jason. “8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset.” Machine Learning Mastery, 15 Aug. 2020, machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/.
  • Brownlee, Jason. “A Gentle Introduction to Imbalanced Classification.” Machine Learning Mastery, 14 Jan. 2020, machinelearningmastery.com/what-is-imbalanced-classification/.
  • Brownlee, Jason. “How to Combine Oversampling and Undersampling for Imbalanced Classification.” Machine Learning Mastery, 4 Jan. 2021, machinelearningmastery.com/combine-oversampling-and-undersampling-for-imbalanced-classification/.
  • “Learning from Imbalanced Classes.” KDnuggets, www.kdnuggets.com/2016/08/learning-from-imbalanced-classes.html/2.
