Optical Character Recognition proceeds in three main stages, each comprising a few steps. Combining NLP with OCR is particularly powerful because it improves the accuracy of the output. We will go through the different steps and explore how NLP can significantly enhance OCR performance.
1. Pre-Processing
Scanned images are not necessarily in pristine condition: they can be low quality, blurred, distorted, and so on. Enhancing the image and suppressing undesirable elements help reduce the risk of error and are integral to the whole process. This is especially significant for documents containing sensitive information, which must be converted with no errors at all. Given its importance, many techniques are applied to make this process as accurate as possible. Below are some of the most common ones.
– Binarization
Colors in images can affect the recognition of characters. One common technique is to convert the image to black and white only, with pixel values of 0 and 255 respectively. Since text generally contrasts with its background, the values of text pixels differ significantly from those of background pixels. Binarization is thus a simple and effective way to distinguish text elements from the background.
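As a minimal sketch, here is how binarization with Otsu's automatic thresholding might look using OpenCV (the file names are placeholders):

```python
import cv2

# Load the scanned page as grayscale (placeholder file name).
image = cv2.imread("scanned_page.png", cv2.IMREAD_GRAYSCALE)

# Otsu's method picks the threshold automatically; every pixel
# ends up as either 0 (black) or 255 (white).
_, binary = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

cv2.imwrite("binarized_page.png", binary)
```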
– De-skew (Skew Correction)
When scanned, some documents are not correctly aligned. Detecting the skew and rotating the image accordingly is another essential pre-processing technique. Horizontally aligned text is easier to recognize.
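A common recipe for estimating and correcting skew with OpenCV is sketched below. It assumes dark text on a light background; note also that the angle convention of cv2.minAreaRect changed in OpenCV 4.5, so the sign handling may need adjusting for your version:

```python
import cv2
import numpy as np

def deskew(gray):
    # Invert-threshold so text pixels become the foreground.
    _, thresh = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Fit a rotated rectangle around all text pixels;
    # its angle estimates the skew of the page.
    coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle
    # Rotate the page around its center to undo the skew.
    h, w = gray.shape
    matrix = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    return cv2.warpAffine(gray, matrix, (w, h),
                          flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
```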
– Noise Suppression
Eliminating noise and distortion is a common technique to enhance image quality and make text more easily recognizable. For example, some printed letters can be more intense than others, and images can contain noise elements that interfere with text recognition. Sharpening the edges of letters and smoothing the image help text stand out from the other image elements.
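As a rough sketch, a median blur followed by unsharp masking is one way to suppress speckle while keeping letter edges crisp (the file name and parameters are illustrative):

```python
import cv2

gray = cv2.imread("scanned_page.png", cv2.IMREAD_GRAYSCALE)

# Median blur suppresses salt-and-pepper speckle while
# preserving the edges of the letters.
denoised = cv2.medianBlur(gray, 3)

# Unsharp masking: subtract a blurred copy to re-sharpen
# the character strokes that smoothing softened.
blurred = cv2.GaussianBlur(denoised, (0, 0), sigmaX=2)
sharpened = cv2.addWeighted(denoised, 1.5, blurred, -0.5, 0)

cv2.imwrite("cleaned_page.png", sharpened)
```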
– Layout Analysis
Text can sometimes be found inside tables and divided into paragraphs and columns. The ability to identify the different blocks of text is called layout analysis, or “zoning”.
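For illustration, Tesseract exposes its own layout analysis through pytesseract; this sketch (assuming Tesseract and pytesseract are installed, with a placeholder file name) groups the recognized words by the block and paragraph they were assigned to:

```python
import pytesseract
from pytesseract import Output
from PIL import Image

image = Image.open("scanned_page.png")

# image_to_data returns one row per detected element, including
# the block and paragraph each word belongs to.
data = pytesseract.image_to_data(image, output_type=Output.DICT)

blocks = {}
for i, word in enumerate(data["text"]):
    if word.strip():
        key = (data["block_num"][i], data["par_num"][i])
        blocks.setdefault(key, []).append(word)

for (block, par), words in sorted(blocks.items()):
    print(f"block {block}, paragraph {par}: {' '.join(words)}")
```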
2. Segmentation and Feature Extraction
As the name implies, segmentation breaks the pre-processed image into subparts. It takes place at three main levels: lines, words, and characters. To achieve this, the image has to be projected both horizontally (for lines) and vertically (for words and characters).
– Line level
The point of line segmentation is to divide the text into lines. The trick is to correctly identify the spacing between lines, which serves as the divider. For that, we have to take into account the text pixels and background pixels mentioned earlier: pixel rows that contain words have a far higher concentration of text pixels than the spacing rows between them. That is precisely the idea behind line segmentation.
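A minimal sketch of this idea, assuming the page was binarized with text as white (255) on a black (0) background, uses a horizontal projection profile:

```python
import numpy as np

def segment_lines(binary):
    """Split a binarized page (text = 255, background = 0) into
    line images using a horizontal projection profile."""
    # Count the text pixels in each pixel row of the page.
    profile = (binary > 0).sum(axis=1)
    lines, start = [], None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                       # a text line begins
        elif count == 0 and start is not None:
            lines.append(binary[start:y])   # an empty row ends it
            start = None
    if start is not None:
        lines.append(binary[start:])
    return lines
```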
– Word level
The next level is dividing segmented lines into words. The process is quite similar, only with vertical projection this time. The idea is to identify the spaces between words based on the concentration of background pixels.
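Continuing the sketch above, word segmentation projects each line image vertically; the min_gap threshold, which distinguishes inter-word gaps from inter-character gaps, is an illustrative assumption to tune for a given font and resolution:

```python
def segment_words(line, min_gap=8):
    """Split one line image into word images; runs of at least
    min_gap background columns are treated as word boundaries."""
    profile = (line > 0).sum(axis=0)   # text pixels per column
    words, start, gap = [], None, 0
    for x, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = x              # a word begins
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:         # gap wide enough: word ended
                words.append(line[:, start:x - gap + 1])
                start, gap = None, 0
    if start is not None:
        words.append(line[:, start:])
    return words
```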
– Character level
After segmenting text into lines and words, it is possible to further segment words into characters, though it is not often necessary. Whenever characters are already spaced, such as in most typed documents, character segmentation is not needed. It becomes essential in the case of handwriting, for instance.
– Feature Extraction
Arguably the most crucial step of any OCR pipeline, feature extraction retrieves the relevant information from the pre-processed documents. At this point, the scanned images are assumed to have been enhanced and divided into manageable segments.
This is the aspect of OCR that can benefit greatly from NLP and machine learning techniques. Extracting relevant information requires “making sense” of the text data, that is, understanding the context of the various text elements present in the document. This can be approached in a couple of ways.
– Named Entity Recognition (NER)
This NLP technique identifies key elements in a text, called entities, and places them in predefined categories. These entities can be words or groups of words. Entity detection can be used to identify names, organizations, geographic locations, monetary values, etc. Entities can denote not only semantic categories such as animals, colors, foods, and diseases, but also syntactic elements such as noun phrases, verbs, adjectives, and other parts of speech.
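As a small illustration with spaCy, assuming the en_core_web_sm model is installed and using a made-up snippet of OCR output:

```python
import spacy

# Assumes the small English model is available:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Made-up example of text recovered from a scanned invoice.
ocr_output = ("Invoice issued by Acme Corp on March 3, 2021 "
              "for $1,250.00, payable to John Smith in Boston.")

doc = nlp(ocr_output)
for ent in doc.ents:
    print(ent.text, "->", ent.label_)   # e.g. "Acme Corp -> ORG"
```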
– Document Classification
Document classification usually goes hand in hand with NER. After the relevant sections of the document have been extracted and categorized, NLP uses pre-trained models and predefined categories to classify the entire document based on the detected entities. Documents can be anything: emails, social media posts, invoices, contracts, articles, and so on.
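A toy sketch with scikit-learn shows the general shape of such a classifier; the training snippets and labels below are invented, and a real system would be trained on a labeled corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented training set: two document categories.
texts = [
    "Invoice number 4521, total due $980, payment terms 30 days",
    "Total amount payable upon receipt of this invoice",
    "This agreement is entered into by and between the parties",
    "The parties agree to the terms set forth in this contract",
]
labels = ["invoice", "invoice", "contract", "contract"]

# TF-IDF features feed a simple linear classifier.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, labels)

print(classifier.predict(["Invoice 7788: balance of $3,400 due by June 1"]))
```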
3. Post-Processing
Ideally, the recognition process would be flawless and post-processing would not be needed, but that is rarely the case. After the scanned image has been cleaned and its key information retrieved and classified, post-processing takes place to further improve the accuracy of OCR. For instance, spelling errors, entity misidentifications, and grammatical mistakes may occur and need to be addressed. There are various NLP and machine learning methods that help achieve that, namely:
– Dictionary-based: The OCR output is compared to a set of words pertaining to a predefined lexicon. This is especially effective when the document has already been categorized. For instance, spelling mistakes and slightly mismatched words can be corrected thanks to dictionary-based post-processing.
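A minimal sketch using Python's standard difflib, with an invented mini-lexicon for documents already classified as invoices:

```python
import difflib

# Tiny invented lexicon for an already-categorized document type.
lexicon = ["invoice", "total", "amount", "payment", "balance", "due"]

def correct(token, cutoff=0.8):
    """Replace a token with its closest lexicon entry,
    if one is similar enough; otherwise keep it as-is."""
    if token.lower() in lexicon:
        return token
    matches = difflib.get_close_matches(token.lower(), lexicon,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else token

ocr_tokens = ["Invoiee", "tota1", "amouni", "due"]
print([correct(t) for t in ocr_tokens])
# ['invoice', 'total', 'amount', 'due']
```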