<div class="csl-bib-body">
<div class="csl-entry">Baston, R. (2015). <i>Analysis of hubness and application of reduction methods on high dimensional datasets</i> [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2015.30460</div>
</div>
-
dc.identifier.uri
https://doi.org/10.34726/hss.2015.30460
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/2853
-
dc.description
Abstract in German
-
dc.description
Title differs according to the author's translation
-
dc.description.abstract
Hubness is a fairly new issue emerging from the domains of business intelligence and machine learning. It originates from an asymmetric distance relationship between two points in a dataset. In a high-dimensional dataset with a large number of points, this results in a small number of points that have many neighbors at fairly short distances, the so-called 'hub points', and many points that are far from all their neighbors, the 'anti-hubs'. This is why hub points occur frequently as neighbors of other points. Most other points, by contrast, rarely occur as neighbors and therefore have little influence on subsequent classification. An existing Matlab hubness toolbox can calculate the hubness of a dataset, but it contains no functions for handling large datasets. It also implements neither a method to calculate the distance matrix using different metrics nor a method to compare the resulting hubness values. I added these functions as part of the programming work for this thesis. This thesis aims to answer research questions about the hubness contained in datasets and the possibility of reducing it with new reduction methods, while verifying that the quality of the data does not suffer. Two of the reduction methods presented in this thesis, the exclusion of the biggest hub points and the projection onto a hypersphere, were proposed by Abdel Aziz Taha. The third reduction method applies a principal component analysis (PCA) to find the most important dimensions and to visualize the hub points and the data in a plot. After these new reduction methods were implemented and applied to artificial and non-artificial data, it became clear that they operate as expected and reduce the hubness. The PCA reduction method turned out not to be viable, however, since it reduced the quality of the data and of the retrieval system.
The other two methods reduce hubness without damaging the quality of the data or the retrieval system. Another general observation was that artificial datasets tended to contain more hubness than non-artificial datasets.
en
dc.language
English
-
dc.language.iso
en
-
dc.rights.uri
http://rightsstatements.org/vocab/InC/1.0/
-
dc.subject
Hubness
de
dc.subject
Business Intelligence
de
dc.subject
Classification
de
dc.subject
Hubness
en
dc.subject
Business Intelligence
en
dc.subject
Classification
en
dc.title
Analysis of hubness and application of reduction methods on high dimensional datasets
en
dc.title.alternative
Hubness Analyse in der Datenanalyse
de
dc.type
Thesis
en
dc.type
Hochschulschrift
de
dc.rights.license
In Copyright
en
dc.rights.license
Urheberrechtsschutz
de
dc.identifier.doi
10.34726/hss.2015.30460
-
dc.contributor.affiliation
TU Wien, Österreich
-
dc.rights.holder
Roscoe Baston
-
dc.publisher.place
Wien
-
tuw.version
vor
-
tuw.thesisinformation
Technische Universität Wien
-
dc.contributor.assistant
Lidy, Thomas
-
tuw.publication.orgunit
E188 - Institut für Softwaretechnik und Interaktive Systeme