Cost sensitive screening methods for binary classification

Schroeder, Fabian

doi:10.34726/hss.2018.48802

Record link:

https://doi.org/10.34726/hss.2018.48802
http://hdl.handle.net/20.500.12708/7661

Title:

Cost sensitive screening methods for binary classification

Citation:

Schroeder, F. (2018). Cost sensitive screening methods for binary classification [Dissertation, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2018.48802

reposiTUm DOI:

10.34726/hss.2018.48802

CatalogPlus:

AC15102749

Publication Type:

Thesis - Dissertation

Language:

English

Authors:

Schroeder, Fabian

Advisor:

Filzmoser, Peter

Organisational Unit:

E105 - Institut für Stochastik und Wirtschaftsmathematik

Date (published):

2018

Number of Pages:

145

Keywords:

linear discriminant analysis; cost sensitive

Abstract:

In high-dimensional classification tasks, e.g., case-control studies for biomarker detection, variable filters are a simple and scalable way to discard uninformative variables. Ignoring the operating conditions when searching for interesting variables can, however, lead to false conclusions. In this thesis I therefore investigate variable filters based on the expected prediction error of univariate classifiers. In general, the choice of classifier is likely to determine the properties of the resulting filter. An obvious parametric approach would be the Bayes classifier under the assumption of normally distributed classes. This filter, however, depends strongly on the distributional assumption and performs poorly when the data deviate from it. A non-parametric approach is therefore preferable for pre-selecting hundreds of thousands of variables of different origin. Instead of making an assumption about the class distributions, one could simply assume that the optimal classifier has a particular form, e.g., that the acceptance region for one class is a (possibly unbounded) interval. This assumption is likely to hold for a large number of distribution families. Moreover, the resulting methods have a further interesting property: the distribution of the test statistic under the null hypothesis can be determined exactly for finite samples. The computation is based on a fast recursive algorithm. In this thesis I investigate the proposed filtering methods analytically as well as with simulated and real data.

In high-dimensional classification tasks, e.g., case-control experiments for biomarker detection, variable filters constitute a simple and scalable method to discard uninformative variables. When screening for interesting variables, neglecting the operating conditions of the classification task can lead to false conclusions. This thesis therefore proposes filtering statistics based on the expected prediction error of a univariate classifier. The choice of classifier will generally determine the characteristics of the filtering method. An obvious parametric approach would be the Bayes classifier for a mixture of Gaussian class conditionals. This approach, however, relies heavily on the parametric assumptions, and any deviation from them will reduce the performance of the filter dramatically. Opting for a non-parametric classifier therefore seems reasonable when screening thousands of different variables. Thus, instead of assuming a parametric family for the class conditionals, I assume that the optimal classifier is a member of the family of threshold or interval classifiers. This assumption should hold for a large number of different distributions. Furthermore, these methods exhibit another interesting property: the exact finite-sample distribution of the test statistic under the null hypothesis of equal class conditional distributions can be obtained by means of a fast recursive algorithm. In this thesis, I study the proposed screening methods analytically and evaluate their screening characteristics by means of simulated as well as real data.
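
The idea of a filtering statistic built on the expected prediction error of a univariate threshold classifier can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `min_threshold_cost`, the parameters `cost_fp`, `cost_fn` and `prior_pos`, and the particular cost formula (prior-weighted false positive and false negative rates) are assumptions made here for illustration. The sketch covers threshold classifiers only, not interval classifiers, and it does not compute the exact null distribution mentioned in the abstract.

```python
from typing import Optional, Sequence


def min_threshold_cost(
    x: Sequence[float],
    y: Sequence[int],
    cost_fp: float = 1.0,
    cost_fn: float = 1.0,
    prior_pos: Optional[float] = None,
) -> float:
    """Smallest empirical expected cost of any threshold classifier on x.

    Hypothetical illustration: labels are 0/1, and the operating conditions
    are the misclassification costs and the prior of the positive class
    (the empirical class frequency is used when prior_pos is None).
    """
    n_pos = sum(1 for label in y if label == 1)
    n_neg = len(y) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    if prior_pos is None:
        prior_pos = n_pos / len(y)

    def cost(fpr: float, fnr: float) -> float:
        # Assumed cost model: prior-weighted error rates times their costs.
        return (1.0 - prior_pos) * cost_fp * fpr + prior_pos * cost_fn * fnr

    pairs = sorted(zip(x, y))
    pos_below = 0  # positives with value <= current threshold
    neg_below = 0  # negatives with value <= current threshold
    # Trivial rules: label everything positive, or everything negative.
    best = min(cost(1.0, 0.0), cost(0.0, 1.0))

    i = 0
    while i < len(pairs):
        value = pairs[i][0]
        # Consume all tied observations so a threshold never splits ties.
        while i < len(pairs) and pairs[i][0] == value:
            if pairs[i][1] == 1:
                pos_below += 1
            else:
                neg_below += 1
            i += 1
        # Rule A: predict positive when x > threshold.
        fnr_a = pos_below / n_pos
        fpr_a = (n_neg - neg_below) / n_neg
        # Rule B: predict positive when x <= threshold (rule A mirrored).
        fnr_b = 1.0 - fnr_a
        fpr_b = 1.0 - fpr_a
        best = min(best, cost(fpr_a, fnr_a), cost(fpr_b, fnr_b))
    return best
```

Under these assumptions, a variable with a lower value separates the classes better at the given operating conditions, so a screening step could rank variables by this value and keep those with the smallest cost. Changing `cost_fp`, `cost_fn` or `prior_pos` can change the ranking, which is the sense in which such a filter is cost sensitive.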

License:

In Copyright

Appears in Collections:

Thesis