Analysis of hubness and application of reduction methods on high dimensional datasets

Baston, Roscoe

doi:10.34726/hss.2015.30460

Record link:

https://doi.org/10.34726/hss.2015.30460
http://hdl.handle.net/20.500.12708/2853

Title:

Analysis of hubness and application of reduction methods on high dimensional datasets

Citation:

Baston, R. (2015). Analysis of hubness and application of reduction methods on high dimensional datasets [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2015.30460

reposiTUm DOI:

10.34726/hss.2015.30460

CatalogPlus:

AC12724489

Publication Type:

Thesis - Diplomarbeit

Language:

English

Authors:

Baston, Roscoe

Advisor:

Rauber, Andreas

Co-advisor:

Lidy, Thomas

Organisational Unit:

E188 - Institut für Softwaretechnik und Interaktive Systeme

Date (published):

2015

Number of Pages:

Keywords:

Hubness; Business Intelligence; Classification

Abstract:

Hubness is a fairly recent issue in the domains of business intelligence and machine learning that originates from an asymmetric distance relationship between points in a dataset. In a high-dimensional dataset with a large number of points, this results in a small number of points that have many neighbors at fairly short distances, the so-called 'hub points', and many points that are far from all of their neighbors, the 'anti-hubs'. Hub points therefore occur frequently as neighbors of other points, while most other points rarely occur as neighbors and thus have little influence on subsequent classification. A Matlab hubness toolbox exists that can calculate the hubness of a dataset; however, it does not contain functions for handling large datasets. The toolbox also implements neither a method to calculate the distance matrix using different metrics nor a method to compare the resulting hubness values. I added these functions as part of the programming work for this thesis. This thesis aims to answer research questions regarding the hubness contained in datasets and the possibility of reducing that hubness with new reduction methods while verifying that the quality of the data is not reduced. Two of the reduction methods examined in this thesis, the exclusion of the biggest hub points and the projection onto a hypersphere, were proposed by Abdel Aziz Taha. The third reduction method performs a principal component analysis (PCA) to find the most important dimensions and to visualize the hub points and the data in a plot. After these reduction methods were implemented and applied to artificial and non-artificial data, it became clear that they operate as expected and reduce hubness. The PCA reduction method turned out not to be a viable approach, however, since it reduced the quality of the data and the retrieval system. The other two methods reduce hubness without damaging the quality of the data or the retrieval system. Another general observation was that artificial datasets tended to contain more hubness than non-artificial datasets.
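To make the notion of hub points and anti-hubs concrete, the following is a minimal Python sketch, not the thesis's Matlab code. It assumes a common way of quantifying hubness: count how often each point appears among the k nearest neighbors of the other points (its k-occurrence) and take the skewness of those counts. The function names `k_occurrence` and `skewness`, the Euclidean metric, the random Gaussian data, and the values of `n` and `k` are all illustrative assumptions.

```python
import numpy as np


def k_occurrence(X, k=5):
    """Count how often each point is among the k nearest neighbors of other points."""
    # Pairwise squared Euclidean distances (assumed metric for this sketch).
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a point is never its own neighbor
    knn = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest neighbors per row
    return np.bincount(knn.ravel(), minlength=X.shape[0])


def skewness(counts):
    """Sample skewness of the k-occurrence counts; larger values suggest more hubness."""
    c = np.asarray(counts, dtype=float)
    centered = c - c.mean()
    std = c.std()
    return float(np.mean(centered ** 3) / std ** 3) if std > 0 else 0.0


# Illustrative data only: the same number of points in few and in many dimensions.
rng = np.random.default_rng(0)
n, k = 500, 5
low_dim = rng.standard_normal((n, 3))
high_dim = rng.standard_normal((n, 100))

n_low = k_occurrence(low_dim, k)
n_high = k_occurrence(high_dim, k)

print("skewness of k-occurrence, 3 dimensions:  ", skewness(n_low))
print("skewness of k-occurrence, 100 dimensions:", skewness(n_high))

# Candidate hub points: the points with the largest k-occurrence.
print("largest k-occurrence counts:", np.sort(n_high)[::-1][:5])
# Candidate anti-hubs: points that never occur as a neighbor.
print("points with k-occurrence 0:", int(np.sum(n_high == 0)))
```

Under these assumptions, excluding the biggest hub points would amount to dropping the rows with the largest counts and recomputing the skewness; how the thesis actually selects and removes hubs is not specified in this record.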

Additional information:

Abstract in German
Title differs according to the author's translation

License:

In Copyright

Appears in Collections:

Thesis