Optical Character Recognition (OCR)

An Optical Character Recognition (OCR) system converts scanned paper documents into editable, searchable text. The engine first analyses the structure of the document image and divides the page into elements such as blocks of text, tables, and images. Within these blocks, it identifies character image patterns and generates hypotheses about which characters they may represent. These hypotheses are combined into candidate variations at the character, word, and line levels, each with an associated probability. Finally, the engine searches this set of weighted hypotheses for the most likely combination of characters, words, and lines, which yields a textual representation of the image.
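The final search step can be illustrated with a toy example. The sketch below is not the engine's actual implementation: the character hypotheses, the word probabilities, the `UNKNOWN_WORD_PROB` constant, and the `best_word` function are all hypothetical, and the search is exhaustive, whereas a practical engine would normally prune the candidates. It only shows how character-level and word-level probabilities might be combined to pick the most likely reading.

```python
import math
from itertools import product

# Hypothetical character hypotheses for three character positions,
# each given as (character, probability). Illustrative values only.
char_hypotheses = [
    [("c", 0.6), ("e", 0.4)],
    [("a", 0.7), ("o", 0.3)],
    [("t", 0.5), ("l", 0.5)],
]

# Hypothetical word-level probabilities, a tiny stand-in for a language model.
word_prior = {"cat": 0.5, "cot": 0.3, "eat": 0.2}

# Assumed small probability for words that are not in word_prior.
UNKNOWN_WORD_PROB = 1e-6


def best_word(hypotheses, prior):
    """Return the most likely word and its log-probability score.

    Every combination of character hypotheses is scored as the sum of the
    character log-probabilities plus the word log-probability.
    """
    best, best_score = None, -math.inf
    for combo in product(*hypotheses):
        word = "".join(ch for ch, _ in combo)
        score = sum(math.log(p) for _, p in combo)
        score += math.log(prior.get(word, UNKNOWN_WORD_PROB))
        if score > best_score:
            best, best_score = word, score
    return best, best_score


word, score = best_word(char_hypotheses, word_prior)
print(word)  # cat
```

Working in log-probabilities avoids numerical underflow when many small probabilities are multiplied, which matters once the same idea is applied to whole lines rather than a three-letter word.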

Evaluation of OCR:

(As reported in Hocking, J. and Puttkammer, M.J., 2016. Optical character recognition for South African languages. In 2016 Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech), November 2016, pp. 1-5. IEEE.)

Language	Character Error Rate (CER) %
Afrikaans	0.52
isiNdebele	1.00
isiXhosa	1.92
isiZulu	0.91
Sesotho sa Leboa	0.23
Sesotho	1.01
Setswana	0.24
Siswati	0.50
Tshivenḓa	0.34
Xitsonga	0.66

Note: English uses the default Tesseract engine, which is available at https://github.com/tesseract-ocr/tesseract
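For readers who want to reproduce a figure like those in the table above, the following is a minimal sketch of how a character error rate is commonly computed: the edit (Levenshtein) distance between the reference text and the OCR output, divided by the length of the reference, expressed as a percentage. This is an assumption about the metric; the exact procedure used in the cited evaluation may differ (for example in how whitespace is treated). The function names `levenshtein` and `cer_percent` are illustrative.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the processed prefix of
    # `reference` and the first j characters of `hypothesis`.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            cost = 0 if ref_ch == hyp_ch else 1
            curr.append(min(
                prev[j] + 1,         # character missing from the hypothesis
                curr[j - 1] + 1,     # extra character in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer_percent(reference, hypothesis):
    """Character error rate as a percentage of the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)


# One substituted character in a four-character reference.
print(cer_percent("abcd", "abxd"))  # 25.0
```

Under this definition, a CER of 0.52% corresponds to roughly one character error in every 190 characters of reference text.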