Random forest for unbalanced multiple-class classification / Anna Sofia Kircher
Author: Kircher, Anna Sofia
Reviewer: Filzmoser, Peter
Published: Vienna, 2017
Extent: iii, 127 pages : illustrations, diagrams
Thesis: Technische Universität Wien, Diploma thesis, 2017
Note: Title differs, as translated by the author
Keywords (EN): Random Forest / Multiple-class classification / Unbalanced data / Error rates
URN: urn:nbn:at:at-ubtuw:1-100518 (persistent identifier)
The work is freely available
Random forest for unbalanced multiple-class classification [2.88 MB]
Abstract (English)

Random Forest is a cutting-edge method for unbalanced multiple-class classification. The main problem with unbalanced data is that the classifier tends to focus on the larger classes at the expense of the smaller ones. To overcome this skewness, three sampling methods (oversampling, undersampling, and a combination of both) are introduced and compared by the performance of the forest on a highly unbalanced data set with eleven classes. Oversampling appears to improve the performance of the forest dramatically, while undersampling often worsens it relative to classification on the unbalanced data. For the data set analysed here, however, a combination of both seems more adequate, because the accuracy gain from oversampling is much smaller on the test data than the dramatic improvement on the training data. The danger of overfitting is lower if the data set is not only oversampled but retains its original total size, with each class oversampled or undersampled to the same number of observations. Analysis of the data showed many noisy variables, which justified raising the number of variables available at each split (mtry) from the default to the midpoint between the classification default, mtry = sqrt(p), and the regression default, mtry = 2p/3, where p is the number of predictor variables.
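
The three sampling schemes and the mtry adjustment described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names (per_class_target, resample_indices, midpoint_mtry) are hypothetical, the per-class targets (largest class for oversampling, smallest class for undersampling, total size divided by the number of classes for the combination) are one plausible reading of the abstract, and rounding the midpoint down is an assumption. The regression default is passed in as a parameter because the abstract states 2p/3, whereas some common implementations (for example R's randomForest package) use p/3.

```python
import math
import random
from collections import Counter, defaultdict


def per_class_target(labels, strategy):
    """Number of observations each class should have after resampling."""
    counts = Counter(labels)
    if strategy == "over":       # grow every class to the largest class
        return max(counts.values())
    if strategy == "under":      # shrink every class to the smallest class
        return min(counts.values())
    if strategy == "combined":   # keep (roughly) the original total size
        return len(labels) // len(counts)  # any remainder is dropped
    raise ValueError(f"unknown strategy: {strategy!r}")


def resample_indices(labels, strategy, seed=0):
    """Return row indices giving every class the same number of observations."""
    rng = random.Random(seed)
    target = per_class_target(labels, strategy)

    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    chosen = []
    for idx in by_class.values():
        if len(idx) >= target:
            # Undersample: draw without replacement.
            chosen.extend(rng.sample(idx, target))
        else:
            # Oversample: keep all originals, add duplicates drawn with replacement.
            chosen.extend(idx)
            chosen.extend(rng.choices(idx, k=target - len(idx)))
    rng.shuffle(chosen)
    return chosen


def midpoint_mtry(p, regression_default):
    """Midpoint between sqrt(p) and a given regression default, rounded down."""
    classification_default = math.sqrt(p)
    return max(1, math.floor((classification_default + regression_default) / 2))


if __name__ == "__main__":
    toy_labels = ["a"] * 8 + ["b"] * 3 + ["c"] * 1  # toy data, not from the thesis
    for strategy in ("over", "under", "combined"):
        idx = resample_indices(toy_labels, strategy)
        print(strategy, Counter(toy_labels[i] for i in idx))
    print(midpoint_mtry(16, 2 * 16 / 3))  # regression default as stated in the abstract
```

In a setting like the one the abstract describes, such resampling would normally be applied to the training data only, so that the test data keep the original class distribution and the gap between training and test accuracy remains visible.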