SPA-Sentences

Training and evaluation of Spanish handwriting recognition systems

Abstract

The global trend toward process automation is especially strong when it comes to migrating paper documents to digital media. Beyond digitization itself, the effective processing and transcription of documents, many of them handwritten and/or of low resolution, makes it possible to exploit information that would otherwise be inaccessible. Handwritten text recognition tools are based on machine learning techniques that ultimately require corpora for training. Although corpora of modern handwritten text exist for languages such as English and French, the same is not true for Spanish, which differs enough to require a specific corpus. SPA-Sentences is a corpus of handwritten Spanish phrases for training and evaluating handwriting recognition systems in Spanish. It consists of handwritten phrases extracted from 1,617 forms produced by the same number of writers. In total there are 13,691 sentences containing around 100,000 word instances, with a vocabulary of 3,288 words. These data allow recognition systems to be trained effectively. The corpus includes scanned images of the forms, together with line-level segmentation information and manually supervised transcriptions. A set of Python programs for extracting the information from the image files and their corresponding XML files is also provided. With SPA-Sentences, handwriting recognition systems for Latin alphabets, and for Spanish in particular, can be trained and fine-tuned, and recognition systems can be evaluated under conditions that are standard for the scientific community.
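As a rough illustration of how line segmentation and transcriptions could be read from an XML annotation file, the following Python sketch parses a small example. The element and attribute names (`form`, `line`, `x`, `y`, `width`, `height`, `transcription`) and the sample sentences are assumptions invented for this illustration; they are not taken from the corpus, whose actual XML schema and extraction programs may differ.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical annotation layout, invented for this sketch.
# The real SPA-Sentences XML schema may use different names.
SAMPLE_XML = """\
<form id="form-0001" writer="w0001">
  <line id="l1" x="120" y="340" width="1800" height="95">
    <transcription>El perro corre por el parque</transcription>
  </line>
  <line id="l2" x="118" y="455" width="1750" height="90">
    <transcription>Mañana iremos a la playa</transcription>
  </line>
</form>
"""


@dataclass
class LineAnnotation:
    line_id: str
    box: tuple  # (left, upper, right, lower) in pixels
    text: str


def parse_form(xml_text):
    """Return one LineAnnotation per <line> element in the form."""
    root = ET.fromstring(xml_text)
    annotations = []
    for node in root.iter("line"):
        x, y = int(node.get("x")), int(node.get("y"))
        w, h = int(node.get("width")), int(node.get("height"))
        text = (node.findtext("transcription") or "").strip()
        annotations.append(LineAnnotation(node.get("id"), (x, y, x + w, y + h), text))
    return annotations


lines = parse_form(SAMPLE_XML)
for ln in lines:
    print(ln.line_id, ln.box, ln.text)
```

Each bounding box is expressed as (left, upper, right, lower), so it could be passed to an image library's crop function to cut the corresponding line out of the scanned form and pair it with its transcription.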

Technical specifications

Type of technology

SOFTWARE

Inventors

Person in charge

Castro Bleda, María José