Bibliographic Metadata

Adaptierung von Word Embeddings für domänenspezifisches Information Retrieval / von Sebastian Hofstätter
Additional Titles
Adapting Word embeddings for domain-specific information retrieval
Author: Hofstätter, Sebastian
Censor: Rekabsaz, Navid ; Hanbury, Allan
Published: Wien, 2018
Description: xiii, 68 Seiten : Diagramme
Institutional Note: Technische Universität Wien, Diplomarbeit, 2018
Abstract in German
Document type: Thesis (Diplom)
Keywords (DE): Informationsrückgewinnung / Word Embeddings / Word2Vec / Globaler Kontext / Verwandte Wörter
Keywords (EN): Information Retrieval / Word Embeddings / Word2Vec / Global context / Related terms
URN: urn:nbn:at:at-ubtuw:1-108442
The work is publicly available
Adaptierung von Word Embeddings für domänenspezifisches Information Retrieval [1.89 MB]
Abstract (English)

Search engines rank documents based on their relevance to a given query, but ranking with exact word matches alone might miss relevant results. Expanding a document retrieval query with similar words gained from a word embedding offers great potential for better query results. The expanded search space makes it possible to retrieve relevant documents even if they do not contain the actual query terms. An additional word improves the query results only if it is relevant to the topic of the search. As observed by previous studies, an essential problem in using an out-of-the-box word embedding for document retrieval is that some of the added similar words have a negative impact on the retrieval performance. We create word-embedding-based similarity models, which are used to expand query words in domain-specific Information Retrieval. To do so, we adapt an existing Skip-gram word embedding with additional information gained from different contexts, which we incorporate into the embedding with Retrofitting. We experiment with two kinds of external resources: Latent Semantic Indexing and semantic lexicons. We also study various techniques for combining two different external resources. We first analyze changes in the local neighborhoods of query terms and global differences between the original and retrofitted vector spaces. We then evaluate the effect of the changed word embeddings on domain-specific retrieval test collections. We report improved results on some test collections. In conclusion, we show that in two out of three test collections, incorporating external resources significantly improves the results over using an out-of-the-box word embedding.
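
The two ideas the abstract combines, retrofitting an embedding with an external resource and expanding query terms with their nearest neighbours, can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes the iterative update rule from the retrofitting literature (Faruqui et al., 2015), with a weight of 1 on the original vector and 1/degree on each neighbour. The function names, the toy two-dimensional vectors, the lexicon, and the similarity threshold of 0.7 are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrofit(vectors, neighbors, iterations=10, alpha=1.0):
    """Pull each word vector towards its neighbours in an external resource.

    Assumed weighting: alpha for the original vector, beta = 1/degree for
    each neighbour. Words without neighbours keep their original vector.
    """
    new = {word: list(vec) for word, vec in vectors.items()}
    for _ in range(iterations):
        for word, original in vectors.items():
            linked = [n for n in neighbors.get(word, ()) if n in new and n != word]
            if not linked:
                continue
            beta = 1.0 / len(linked)
            denom = alpha + beta * len(linked)
            new[word] = [
                (alpha * original[i] + beta * sum(new[n][i] for n in linked)) / denom
                for i in range(len(original))
            ]
    return new


def expand_query(terms, vectors, threshold=0.7):
    """Map each query term to the words whose cosine similarity meets the threshold."""
    expanded = {}
    for term in terms:
        if term not in vectors:
            expanded[term] = []
            continue
        similar = [
            (word, cosine(vectors[term], vec))
            for word, vec in vectors.items()
            if word != term
        ]
        similar = [(word, sim) for word, sim in similar if sim >= threshold]
        expanded[term] = [word for word, _ in sorted(similar, key=lambda p: -p[1])]
    return expanded


# Toy data (hypothetical): "heart" and "cardiac" start out orthogonal,
# and the external lexicon links them.
toy_vectors = {
    "heart": [1.0, 0.0],
    "cardiac": [0.0, 1.0],
    "bank": [-1.0, 0.2],
}
toy_lexicon = {"heart": ["cardiac"], "cardiac": ["heart"]}

retrofitted = retrofit(toy_vectors, toy_lexicon)
before = expand_query(["heart"], toy_vectors)
after = expand_query(["heart"], retrofitted)
```

In this toy setting, "cardiac" is not an expansion candidate for "heart" in the original space but becomes one after retrofitting, while the unlinked word "bank" stays excluded. Filtering by a threshold rather than taking a fixed number of neighbours is one simple way to keep unrelated words out of the expanded query.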
