Document image analysis preprocessing of low-quality and sparsely inscribed documents

Kleber, Florian

doi:10.34726/hss.2014.23118

DC Field

Value

Language

dc.contributor.advisor

Sablatnig, Robert

dc.contributor.author

Kleber, Florian

dc.date.accessioned

2020-06-29T16:42:26Z

dc.date.issued

2014

dc.date.submitted

2014-05

dc.identifier.citation

Kleber, F. (2014). Document image analysis preprocessing of low-quality and sparsely inscribed documents [Dissertation, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2014.23118

dc.identifier.uri

https://doi.org/10.34726/hss.2014.23118

dc.identifier.uri

http://hdl.handle.net/20.500.12708/7466

dc.description

Abstract in German

dc.description.abstract

Owing to the increasing digitization of library holdings, manuscript collections (ancient manuscripts), and hand-filled forms, there is a need for automated processing of digital document images. Projects such as Google Books of Google Inc. or IMPACT (Improving Access to Text) require automated document analysis systems. The preprocessing steps in document image analysis include binarization (separation into foreground and background) and detection of the document orientation. Form classification allows form fields to be extracted using the meta-information (position of the form fields) of known form types. Binarization and the correction of the global orientation are essential preprocessing steps for layout analysis as well as for character recognition (OCR). Form classification allows documents to be sorted and is also a preprocessing step for layout analysis (e.g. form dropout). This dissertation addresses these three document preprocessing steps, focusing on poorly preserved (historical, ancient) documents and on documents with little content (few words). The developed methodology can, for example, be applied to document fragments, which makes it possible to reconstruct "torn" documents. The methods investigated are evaluated with state-of-the-art metrics and compared with methods presented in contests.

dc.description.abstract

The mass digitization of libraries, national archives, and museums requires automated processing of the acquired image data, that is, document analysis, both for further preparation (indexing, word spotting) and for improving access to the content. Projects and institutions dealing with the digitization of documents include, among others, the manuscript research center of Graz University (Vestigia), Improving Access to Text (IMPACT), and Google Books of Google Inc. Document preprocessing is one of the most important steps of document image analysis and is defined as noise removal and binarization, i.e. foreground/background separation. An additional preprocessing step is skew estimation, which can be based on the binarized image or on the original gray-value image. Documents left uncorrected can degrade the performance of Optical Character Recognition (OCR) and segmentation (layout analysis) methods. Document classification can be used for automated indexing in digital libraries, for example by identifying all "Table of Contents" pages, and it enables document retrieval on large document image databases. By classifying document types, a priori knowledge (the position of text boxes) can be incorporated into the document image analysis system, thus facilitating higher-level document analysis. While binarization and skew estimation are classical preprocessing steps, form classification is added as a preprocessing step within this thesis. The research in this thesis deals with these three preprocessing steps for ancient and historical documents with sparsely inscribed information (printed or written text). Historical documents can be degraded (e.g. faded ink or noise such as background stains) or fragmented due to their storage conditions. The methods are evaluated using state-of-the-art metrics and are compared to methods from current document image analysis contests on binarization and skew estimation.

dc.language

English

dc.language.iso

dc.rights.uri

http://rightsstatements.org/vocab/InC/1.0/

dc.subject

Document Analysis

dc.subject

Binarization

dc.subject

Skew Estimation

dc.subject

Form Classification

dc.title

Document image analysis preprocessing of low-quality and sparsely inscribed documents

dc.type

Thesis

dc.type

Hochschulschrift

dc.rights.license

In Copyright

dc.rights.license

Urheberrechtsschutz

dc.identifier.doi

10.34726/hss.2014.23118

dc.contributor.affiliation

TU Wien, Österreich

dc.rights.holder

Florian Kleber

tuw.version

vor

tuw.thesisinformation

Technische Universität Wien

tuw.publication.orgunit

E183 - Institut für Rechnergestützte Automation

dc.type.qualificationlevel

Doctoral

dc.identifier.libraryid

AC11682958

dc.description.numberOfPages

132

dc.identifier.urn

urn:nbn:at:at-ubtuw:1-62369

dc.thesistype

Dissertation

tuw.author.orcid

0000-0001-8351-5066

dc.rights.identifier

In Copyright

dc.rights.identifier

Urheberrechtsschutz

tuw.advisor.staffStatus

staff

tuw.advisor.orcid

0000-0003-4195-1593

item.openaccessfulltext

Open Access

item.grantfulltext

open

item.cerifentitytype

Publications

item.mimetype

application/pdf

item.openairecristype

http://purl.org/coar/resource_type/c_db06

item.languageiso639-1

item.openairetype

doctoral thesis

item.fulltext

with Fulltext

crisitem.author.dept

E193-01 - Forschungsbereich Computer Vision

crisitem.author.orcid

0000-0001-8351-5066

crisitem.author.parentorg

E193 - Institut für Visual Computing and Human-Centered Technology

Appears in Collections:

Thesis

Fulltext (Version of Record (published version))

Adobe PDF

(14.06 MB)

In Copyright
