Imbalanced Data Classification
Carmen Terei
A data set is imbalanced when its classes contain markedly different numbers of examples. For example, in a two-class data set, one class may hold 100, 1,000, 10,000 or more times as many examples as the other. Often, the under-represented classes are the ones of interest.
Classification of imbalanced data is a part of data mining. The problem is interesting and challenging to researchers because most standard data mining methods assume balanced data and do not apply well to imbalanced data. When these standard methods are applied to imbalanced data sets, they are overwhelmed by the instances of the majority class and tend to ignore the instances of the minority class. The result is high accuracy for the majority class but poor accuracy for the minority one.
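The effect described above can be illustrated with a minimal sketch. The class sizes (990 versus 10) and the always-predict-majority "classifier" are assumptions chosen for illustration only; they stand in for a learner that has been overwhelmed by the majority class.

```python
# Hypothetical data set: 990 majority (label 0) and 10 minority (label 1) examples.
y_true = [0] * 990 + [1] * 10

# A degenerate "classifier" that always predicts the majority class.
y_pred = [0] * len(y_true)


def class_accuracy(y_true, y_pred, label):
    """Fraction of the examples of one class that were predicted correctly."""
    idx = [i for i, y in enumerate(y_true) if y == label]
    return sum(y_pred[i] == label for i in idx) / len(idx)


# Overall accuracy looks excellent, yet no minority example is recognized.
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
majority_acc = class_accuracy(y_true, y_pred, 0)
minority_acc = class_accuracy(y_true, y_pred, 1)

print(f"overall={overall:.2f} majority={majority_acc:.2f} minority={minority_acc:.2f}")
```

Overall accuracy alone hides the failure here, because the minority class contributes only 1% of the examples.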
The approaches proposed to deal with the class imbalance problem fall into two main groups: internal approaches, which create new algorithms or modify existing ones to take the class imbalance into account, and external approaches, which preprocess the data to reduce the effect of the imbalance. Furthermore, cost-sensitive learning solutions combine the data-level (external) and algorithm-level (internal) approaches: they assign higher misclassification costs to samples in the minority class and seek to minimize the high-cost errors. Ensemble methods are also frequently adapted to imbalanced domains, either by combining the ensemble learning algorithm with a data-level approach that preprocesses the data before each classifier is trained, or by embedding a cost-sensitive framework in the ensemble learning process.
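As a rough illustration of two of these ideas, the sketch below shows random oversampling as one possible external (data-level) step and inverse-frequency class weights as one possible way to set per-class misclassification costs. Both techniques, the function names `random_oversample` and `inverse_frequency_weights`, and the toy data are assumptions made for this example; the text above does not prescribe a specific method.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen examples of smaller classes until every
    class has as many examples as the largest one (data-level approach)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def inverse_frequency_weights(y):
    """Assumed cost heuristic: weight = n_samples / (n_classes * class_count),
    so errors on rare classes cost more."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


# Hypothetical toy data: 95 majority and 5 minority examples.
X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5

X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)
weights = inverse_frequency_weights(y)

print(balanced_counts)  # both classes now have 95 examples
print(weights)          # the minority class gets the larger weight
```

Oversampling changes the training data and leaves the learner untouched, while the weights would be passed to a learner that accepts per-class costs. Which of the two suits a given problem depends on the learner and the data.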
The evaluation criterion is a key factor in assessing classification performance and in guiding classifier modeling. In imbalanced domains, classifier performance must be evaluated with metrics that take the class distribution into account. For example, instead of overall accuracy, we can use the true positive, true negative, false positive and false negative rates, computed from the confusion matrix. Further metrics are derived from these four, including the geometric mean of the true rates and the F-measure. There are also graphical approaches such as the Receiver Operating Characteristic (ROC) graph, which can be summarized by the Area Under the ROC Curve (AUC).
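The metrics named above can be computed as in the following minimal sketch. The labels, scores, the 0.5 decision threshold and the function names are assumptions for illustration. The F-measure is taken here as the balanced F1 variant, and the AUC is computed through its rank interpretation (the probability that a random positive example is scored above a random negative one, with ties counted as one half).

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) with `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def imbalance_metrics(y_true, y_pred, positive=1):
    """True rates, their geometric mean, and the F1 variant of the F-measure."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    return {"tpr": tpr, "tnr": tnr, "gmean": math.sqrt(tpr * tnr), "f1": f1}


def auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example: 2 minority (positive) and 8 majority examples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.8, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = imbalance_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
print(metrics, auc_value)
```

In this example the overall accuracy would be 0.8, while the true positive rate of 0.5 shows that half of the minority examples are missed. The AUC does not depend on the chosen threshold, since it uses only the ranking of the scores.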