Reproduzierbarkeit von Rankings in einem dynamischen Dokumentkorpus / by Hannes Bösch
Other titles
Reproducible Ranking of Retrieval Results in a Dynamically Changing Document Corpus
Author: Bösch, Hannes
Reviewer: Rauber, Andreas
Published: Vienna, 2017
Extent: 154 pages
Thesis: Technische Universität Wien, diploma thesis, 2017
Thesis not yet received by the library; data not verified
Variant title as translated by the author
Keywords (DE): Informationssuche / Textsuche / Reproduzierbare Ergebnisliste von Dokumenten / Spaltenbasierte Datenbanken / Inverted Index / Retrieval Index Performance / MonetDB / Apache Lucene / Zitieren von Daten / BM25 / Wikipedia / Dokument-Zerteilung / IR Prototyping
Keywords (EN): Information Retrieval / Text Retrieval / Reproducible Retrieval Ranked Lists / Column-Store Database / Inverted Index / Retrieval Index Performance / MonetDB / Apache Lucene / Data Citation / BM25 / Wikipedia / Document Parsing / IR Prototyping
URN: urn:nbn:at:at-ubtuw:1-98748
The work is freely available
Reproduzierbarkeit von Rankings in einem dynamischen Dokumentkorpus [2.96 MB]
Abstract (English)

The core structure of (probabilistic) information retrieval (IR) systems does not make retrieval result rankings reproducible. When the underlying data changes, IR indices change with it, and the history of tf-idf values in particular is hard to preserve. As a result, the same query may produce different results once the collection has been updated. Little research addresses reproducibility in IR, although it would be desirable in fields such as research or patent applications. The first step in this direction is to make subsets of documents in a dynamically evolving data environment unambiguously identifiable. This can be achieved with structured data and a data schema suitable for scalable data citation (cf. section 3). This approach suggests maintaining a history of evolving data by tagging data records with timestamps and keeping a version history for each update to the collection. Conventional row-stores cannot handle the large data volumes and statistics aggregations that IR applications require. The column-store architecture, in contrast, is designed for analytical workloads and has already been proposed for IR prototyping (cf. section 4.2), an approach for building retrieval indices on top of an RDBMS. This thesis combines the concepts of IR prototyping and data citation to enhance retrieval indices so that they achieve reproducibility. It addresses how database schemas have to be shaped and whether these models are efficient enough to meet today's requirements for retrieval systems. The results are promising.
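
The timestamp and version-history idea described in the abstract can be illustrated with a minimal Python sketch. It is not the schema or implementation from the thesis: the names `DocVersion`, `snapshot` and `bm25_ranking`, the in-memory list standing in for a column-store table, the whitespace tokenization, and the BM25 variant with parameters `k1=1.2` and `b=0.75` are all assumptions made for illustration. The point it shows is that every collection statistic (document frequency, average length) is computed only from the versions visible at the requested timestamp, so later updates cannot change an earlier ranking.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DocVersion:
    """One version of a document (hypothetical record layout)."""
    doc_id: str
    text: str
    valid_from: int           # timestamp at which this version was added
    valid_to: Optional[int]   # timestamp at which it was superseded; None = current


def snapshot(history, ts):
    """Return the document versions that were visible at timestamp ts."""
    return [v for v in history
            if v.valid_from <= ts and (v.valid_to is None or ts < v.valid_to)]


def bm25_ranking(history, query, ts, k1=1.2, b=0.75):
    """Rank the documents visible at ts using statistics from that snapshot only."""
    docs = {v.doc_id: v.text.lower().split() for v in snapshot(history, ts)}
    n = len(docs)
    if n == 0:
        return []
    avg_len = sum(len(tokens) for tokens in docs.values()) / n
    scores = {}
    for doc_id, tokens in docs.items():
        score = 0.0
        for term in query.lower().split():
            tf = tokens.count(term)
            if tf == 0:
                continue
            # Document frequency within the snapshot, not the live collection.
            df = sum(1 for other in docs.values() if term in other)
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    # Break score ties by doc_id so the order is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Illustrative data: d2 is replaced by a new version at timestamp 5.
history = [
    DocVersion("d1", "column store index", 1, None),
    DocVersion("d2", "row store database", 1, 5),
    DocVersion("d2", "column store column index", 5, None),
    DocVersion("d3", "inverted index for retrieval", 3, None),
]

ranking_at_2 = bm25_ranking(history, "column index", ts=2)
ranking_at_6 = bm25_ranking(history, "column index", ts=6)
```

Calling `bm25_ranking` with the same query and timestamp returns the same list no matter how many versions are appended to `history` afterwards, provided new versions carry later timestamps. A citation of a result list would therefore only need to record the query and the timestamp.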