Text detection and recognition in natural scene images have applications in computer vision systems such as license plate detection, automatic street-sign translation, image retrieval, and assistance for visually impaired people. Scene text, however, may exhibit complex backgrounds, image blur, partial occlusion, varying font styles, image noise, and varying illumination. Scene text recognition is therefore a challenging computer vision problem. This work addresses dictionary-driven end-to-end scene text recognition, which is divided into a text detection problem and a text recognition problem. For text detection, an AdaBoost sliding-window classifier is used to detect text at multiple scales. The effectiveness of several feature sets for this classifier is compared and evaluated; a modified Local Ternary Pattern (LTP) feature set proves most effective. In a post-processing stage, Maximally Stable Extremal Regions (MSER) are detected and labeled as text or non-text. Text regions are grouped into text lines, and text lines are split into words by a word-splitting method built upon k-means clustering and linear Support Vector Machines (SVMs). For text recognition, a deep Convolutional Neural Network (CNN) trained with backpropagation is used as a one-dimensional sliding-window classifier. To avoid overfitting, the network is regularized with dropout. The recognition responses are fed into a Viterbi-style algorithm that finds the most plausible word in a dictionary. The influence of the training set size and of the size of the convolutional layers is evaluated. The presented system outperforms state-of-the-art methods on the ICDAR 2003 and 2011 datasets in the text detection (F-score: 74.2% / 76.7%), dictionary-driven cropped-word recognition (F-score: 87.1% / 87.1%), and dictionary-driven end-to-end recognition (F-score: 72.6% / 72.3%) tasks.
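To illustrate the kind of feature the detector relies on, the following is a minimal sketch of a plain Local Ternary Pattern descriptor. It implements the standard LTP of Tan and Triggs, not the modified variant used in this work (whose details are not given here); the threshold `t` and the 8-neighbour layout are the usual defaults.

```python
import numpy as np

def ltp_codes(img, t=5):
    """Plain LTP codes for the interior pixels of a grayscale image.

    Each of the 8 neighbours n of a centre pixel c is quantized into
    three levels: +1 if n >= c + t, -1 if n <= c - t, else 0.  The
    ternary pattern is then split into an "upper" binary pattern
    (the +1 positions) and a "lower" binary pattern (the -1
    positions), the usual trick to keep the code histograms small.
    """
    img = np.asarray(img, dtype=np.int32)
    c = img[1:-1, 1:-1]
    # 8-neighbourhood offsets, clockwise from the top-left neighbour
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    upper = np.zeros_like(c)
    lower = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(offsets):
        # shifted view of the image aligned with the centre pixels
        n = img[1 + dy:img.shape[0] - 1 + dy,
                1 + dx:img.shape[1] - 1 + dx]
        upper |= (n >= c + t).astype(np.int32) << bit
        lower |= (n <= c - t).astype(np.int32) << bit
    return upper, lower
```

In a detection pipeline, histograms of such codes over sliding-window cells would form the feature vector fed to the AdaBoost classifier.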
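The dictionary-driven decoding step can be sketched with a small dynamic program. This is a hypothetical illustration, not the exact algorithm of this work: it assumes the sliding-window CNN emits per-window character log-probabilities `logp[t][c]`, and scores each dictionary word under a Viterbi-style alignment in which every character occupies one or more consecutive windows.

```python
def word_score(logp, word, alphabet):
    """Best log-probability of aligning `word` to the window
    responses `logp` (a list of per-window rows, one entry per
    alphabet symbol), with each character spanning >= 1 window."""
    T, L = len(logp), len(word)
    if L == 0 or L > T:
        return float("-inf")
    idx = [alphabet.index(ch) for ch in word]
    NEG = float("-inf")
    # dp[t][j]: best score using the first t windows for the
    # first j characters of the word
    dp = [[NEG] * (L + 1) for _ in range(T + 1)]
    dp[0][0] = 0.0
    for t in range(1, T + 1):
        for j in range(1, L + 1):
            # either start character j here, or extend it
            prev = max(dp[t - 1][j - 1], dp[t - 1][j])
            if prev != NEG:
                dp[t][j] = prev + logp[t - 1][idx[j - 1]]
    return dp[T][L]

def best_word(logp, dictionary, alphabet):
    """Return the dictionary word with the highest alignment score."""
    return max(dictionary, key=lambda w: word_score(logp, w, alphabet))
```

Scoring every dictionary entry this way and keeping the argmax yields the "most plausible word" described above; real systems would prune the dictionary first for speed.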