Using health statistics to improve medical and health search

Sierek, Tawan

doi:10.34726/hss.2015.14861

Record link:

https://doi.org/10.34726/hss.2015.14861
http://hdl.handle.net/20.500.12708/2093

Title:

Using health statistics to improve medical and health search

Citation:

Sierek, T. (2015). Using health statistics to improve medical and health search [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2015.14861

reposiTUm DOI:

10.34726/hss.2015.14861

CatalogPlus:

AC12296844

Publication Type:

Thesis - Diplomarbeit

Language:

English

Authors:

Sierek, Tawan

Advisor:

Hanbury, Allan

Organisational Unit:

E188 - Institut für Softwaretechnik und Interaktive Systeme

Date (published):

2015

Number of Pages:

Keywords:

information retrieval; medical search; ranking

Abstract:

Treating physicians often obtain additional information from information retrieval (IR) systems. However, the ever-growing number of scientific texts in medicine makes finding relevant information increasingly difficult. More and more laypeople also inform themselves about medical topics on the internet, often using a search engine as their starting point. Both types of users benefit from the effectiveness of IR techniques, which form the core of search engines. An important step is the ranking of search results. Our goal is to improve ranking for medical search queries by taking statistics on various diseases into account. We assume that a medical article about a particular disease is more relevant the more frequently that disease occurs. Building on this assumption, we also believe that the ranking of search results can be adapted to a patient profile by taking age and sex into account. It is well known that some diseases occur with different frequency in men and women, and in younger and older people. To the best of our knowledge, no scientific work addresses IR techniques based on health statistics. We develop a probabilistic model that incorporates an epidemiological measure and a patient profile. Based on the formal model, we implement a prototype. The prototype re-ranks the results of a state-of-the-art search engine by mapping the search results to ICD-9-CM codes. Depending on the probability of a diagnosis with the same code for a given patient profile, a search result is ranked higher or lower. The prototype is evaluated and tested on two test collections from two recently organized evaluation campaigns.
We establish baselines with the best-performing IR methods implemented in a widely used open source search engine. Within one test collection, our experiments show a minimal improvement, which is, however, not statistically significant. The prototype maps search results to ICD-9-CM codes based on Wikipedia articles, which serve as ground truth. Because training data is sparse, we cannot evaluate this critical step, and our results are therefore affected. We propose further research based on the formal model, but with test collections of manually annotated documents.

Healthcare professionals often consult information retrieval (IR) systems for additional information when treating a patient. But they face an ever-growing body of scientific literature, which makes it harder to find the relevant citations or articles for a given clinical case. Non-professionals now commonly seek health information on their own, often starting at a web search engine. Both types of users benefit from the effectiveness of IR techniques, which are essential for web search engines and for retrieval systems that access bibliographic databases. A critical part is the ranking process, which determines which articles or web pages are more relevant than others and should therefore be ranked higher. Our goal is to improve the ranking process in health searches by taking available health statistics into account. We assume that it is beneficial for the user if text documents that cover more frequent diseases are ranked higher than others. Based on this assumption, we also believe that health search can be contextualized by adapting the ranking to a patient profile that contains age and sex data. It is common knowledge that a number of diseases are unequally distributed among men and women, as well as among young and old people. To the best of our knowledge, IR approaches based on health statistics are not covered in the scientific literature. We develop a probabilistic model that incorporates an epidemiological measure and a patient profile. We implement a prototype based on the formal model. The prototype re-ranks the top 150 results of a state-of-the-art system. It maps the documents to ICD-9-CM codes and, depending on the probability of a diagnosis with the same code for a patient with a given profile, ranks each document higher or lower. The prototype is evaluated using the test collections of two recent evaluation campaigns in the health domain.
We establish baselines with the best-performing IR methods of a widely used open source search engine. Within one of the test collections, our experiments show only a minor improvement over the baseline, and this improvement is not statistically significant. Our prototype maps documents to ICD-9-CM codes automatically, but relies only on Wikipedia articles as the ground truth. Because training data is this sparse, we cannot evaluate this crucial step, and our results are therefore biased. We suggest conducting further research based on our formal model, but with test collections of manually annotated documents.
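
The re-ranking step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's probabilistic model, which the abstract does not spell out. It assumes a simple linear interpolation between the normalized baseline score and a normalized diagnosis probability for the patient profile. The names `Result`, `rerank`, `prevalence`, `weight`, the profile encoding, and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical profile encoding: (age_group, sex), e.g. ("65+", "F").
Profile = Tuple[str, str]


@dataclass(frozen=True)
class Result:
    doc_id: str
    score: float              # non-negative score from the baseline engine
    icd9_code: Optional[str]  # ICD-9-CM code mapped to the document, if any


def rerank(
    results: List[Result],
    prevalence: Dict[Tuple[str, str, str], float],
    profile: Profile,
    top_k: int = 150,
    weight: float = 0.5,
) -> List[Result]:
    """Re-rank the first top_k results; the rest keep their original order.

    prevalence maps (icd9_code, age_group, sex) to an assumed probability
    of that diagnosis for the profile. Documents without a code get 0.0.
    """
    head, tail = results[:top_k], results[top_k:]
    if not head:
        return list(results)

    priors = [prevalence.get((r.icd9_code,) + profile, 0.0) for r in head]
    # Normalize both signals to [0, 1] so that weight is meaningful.
    max_score = max(r.score for r in head) or 1.0
    max_prior = max(priors) or 1.0

    combined = [
        (1.0 - weight) * (r.score / max_score) + weight * (p / max_prior)
        for r, p in zip(head, priors)
    ]
    ordered = sorted(zip(combined, head), key=lambda pair: pair[0], reverse=True)
    return [r for _, r in ordered] + tail


# Illustrative data only; the probabilities are made up.
baseline = [
    Result("a", 2.0, "250.00"),
    Result("b", 1.9, "410.9"),
    Result("c", 1.0, None),
]
stats = {("250.00", "65+", "F"): 0.02, ("410.9", "65+", "F"): 0.08}

print([r.doc_id for r in rerank(baseline, stats, ("65+", "F"))])
# prints ['b', 'a', 'c']
```

With `weight=0.0` the baseline order is returned unchanged, which makes it easy to compare the re-ranked list against the baseline on the same queries.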

Additional information:

Abstract in German

License:

In Copyright

Appears in Collections:

Thesis