Analysis of hubness and application of reduction methods on high dimensional datasets

Baston, Roscoe

doi:10.34726/hss.2015.30460

DC Field

Value

Language

dc.contributor.advisor

Rauber, Andreas

dc.contributor.author

Baston, Roscoe

dc.date.accessioned

2020-06-28T04:13:05Z

dc.date.issued

2015

dc.date.submitted

2016-01

dc.identifier.citation

Baston, R. (2015). Analysis of hubness and application of reduction methods on high dimensional datasets [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2015.30460

dc.identifier.uri

https://doi.org/10.34726/hss.2015.30460

dc.identifier.uri

http://hdl.handle.net/20.500.12708/2853

dc.description

Abstract in German

dc.description

Deviating title according to the author's translation

dc.description.abstract

Hubness is a fairly new issue emerging from the domains of business intelligence and machine learning. It originates from an asymmetric distance relationship between two points inside a dataset. In a high dimensional dataset with a large number of points, this results in a small number of points that have many neighbors at fairly short distances, the so-called 'hub points', and many points that are far from all of their neighbors, the 'anti-hubs'. As a consequence, hub points occur frequently as neighbors of other points, whereas most other points rarely occur as neighbors and therefore have little influence on subsequent classification. There is a Matlab hubness toolbox that can calculate the hubness of a dataset; however, it does not contain functions for dealing with large datasets. It also implements neither a method to calculate the distance matrix using different metrics nor a method to compare the resulting hubness values. I added those functions in the course of the programming work for this thesis. This thesis aims to answer research questions regarding the hubness contained in datasets and the possibility of reducing that hubness using new reduction methods while verifying that the quality of the data is not reduced. Two of the reduction methods proposed in this thesis are the exclusion of the biggest hub points and the projection onto a hypersphere, both proposed by Abdel Aziz Taha. The third reduction method is a principal component analysis (PCA) that finds the most important dimensions and visualizes the hub points and the data in the form of a plot. After implementing these new reduction methods and applying them to artificial and non-artificial data, it became clear that the reduction methods operate as expected and reduce the hubness. The PCA reduction method turned out not to be a viable approach, though, since it reduced the quality of the data and of the retrieval system. The other two methods reduce hubness without damaging the quality of the data or the retrieval system. Another general observation was that artificial datasets tended to contain more hubness than non-artificial datasets.
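The ideas in the abstract can be illustrated with a minimal Python sketch. It is not the thesis code and not the Matlab toolbox: the function names are hypothetical, hubness is measured here as the skewness of the k-occurrence distribution (a commonly used measure, assumed rather than taken from this record), and the hypersphere projection is assumed to mean centering the data and scaling each point to unit length.

```python
import numpy as np


def k_occurrence(X, k=5):
    """Count how often each point appears among the k nearest
    neighbors of the other points (Euclidean distance)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared distances
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]  # k nearest neighbors per point
    return np.bincount(nn.ravel(), minlength=X.shape[0])


def hubness_skewness(counts):
    """Skewness of the k-occurrence counts; larger positive values
    indicate a few strong hubs and many anti-hubs."""
    c = np.asarray(counts, dtype=float)
    std = c.std()
    if std == 0.0:
        return 0.0
    return float(np.mean((c - c.mean()) ** 3) / std ** 3)


def project_to_hypersphere(X):
    """Assumed variant: center the data, then scale every point
    to unit Euclidean norm."""
    centered = X - X.mean(axis=0)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # leave points at the centroid unchanged
    return centered / norms


def remove_top_hubs(X, k=5, n_remove=10):
    """Drop the n_remove points with the highest k-occurrence."""
    counts = k_occurrence(X, k)
    keep = np.argsort(counts)[: X.shape[0] - n_remove]
    return X[np.sort(keep)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 100))  # artificial high dimensional data
    print("original: ", hubness_skewness(k_occurrence(X)))
    print("sphere:   ", hubness_skewness(k_occurrence(project_to_hypersphere(X))))
    print("hubs out: ", hubness_skewness(k_occurrence(remove_top_hubs(X))))
```

Comparing the three printed skewness values gives a rough impression of how each reduction step changes hubness on a given dataset. Whether data quality is preserved would have to be checked separately, for example with a classification or retrieval benchmark.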

dc.language

English

dc.language.iso

dc.rights.uri

http://rightsstatements.org/vocab/InC/1.0/

dc.subject

Hubness

dc.subject

Business Intelligence

dc.subject

Classification

dc.title

Analysis of hubness and application of reduction methods on high dimensional datasets

dc.title.alternative

Hubness Analyse in der Datenanalyse

dc.type

Thesis

dc.type

University thesis

dc.rights.license

In Copyright

dc.rights.license

Copyright protection

dc.identifier.doi

10.34726/hss.2015.30460

dc.contributor.affiliation

TU Wien, Austria

dc.rights.holder

Roscoe Baston

dc.publisher.place

Wien

tuw.version

vor

tuw.thesisinformation

Technische Universität Wien

dc.contributor.assistant

Lidy, Thomas

tuw.publication.orgunit

E188 - Institut für Softwaretechnik und Interaktive Systeme

dc.type.qualificationlevel

Diploma

dc.identifier.libraryid

AC12724489

dc.description.numberOfPages

dc.identifier.urn

urn:nbn:at:at-ubtuw:1-83799

dc.thesistype

Diplomarbeit

dc.thesistype

Diploma Thesis

dc.rights.identifier

In Copyright

dc.rights.identifier

Copyright protection

tuw.advisor.staffStatus

staff

tuw.assistant.staffStatus

exstaff

item.openairecristype

http://purl.org/coar/resource_type/c_bdcc

item.cerifentitytype

Publications

item.fulltext

with Fulltext

item.languageiso639-1

item.openairetype

master thesis

item.openaccessfulltext

Open Access

item.mimetype

application/pdf

item.grantfulltext

open

crisitem.author.dept

TU Wien

Appears in Collections:

Thesis

Fulltext (Version of Record (published version))

Adobe PDF

(606.87 kB)

In Copyright
