Insights Into Natural Language Processing and Optical Character Recognition

Jun 7, 2022

As technology advances, the gap between various disciplines narrows. Many of the technological breakthroughs were made possible thanks to multidisciplinary approaches. Combining multiple fields is often a very effective way to solve complex problems that would not be solved otherwise. One such discipline is the intriguing field of Natural Language Processing (NLP) which combines Computer Science and Linguistics, among other things. To put it simply, NLP interprets human language and renders it understandable by the machine. That very idea, however, is far from simple. 

In this article, we will provide a comprehensive examination of Natural Language Processing, focusing on how it makes use of the remarkable technology of OCR. 

Roots of NLP



To understand the origins of one of the most defining technological fields, we will have to rewind a little, particularly to the conception of language itself. Natural Language Processing stems from a very similar process that exists in the human mind. Humankind’s ability to perceive, understand, and produce language is indeed fascinating and has interested many thinkers and researchers throughout history. The topic has been studied from many perspectives: psychological, behaviorist, anthropological, and sociological, to name a few. This attempt to understand how human language works makes up the vast field of Linguistics, or the science of language. The first real research in NLP was, reasonably enough, inspired by Linguistics.


Then, technology advanced and computers emerged. Just as humans communicated with one another, we attempted to emulate and reproduce this ability with machines. At first, it was a very one-sided communication, one where we tell the machine exactly what to do and it responds accordingly, based on a set of rules we’d already defined. Next, we began imagining what it would be like if the machines ‘talked back’, that is, what if they actually understood human language the way we do and were able to communicate with it. The vision that all barriers of communication between Man and machine should be broken is behind the fascinating field of Natural Language Processing.


NLP, What Is It All About?


Natural Language Processing is one of the fundamental elements of the broader universe of Artificial Intelligence, much like humankind’s language ability is part of the larger whole of human intelligence. NLP is the aspect of AI responsible for the understanding and interpretation of human language, both written and spoken. In order to make human-generated speech or text understandable by the computer, NLP takes advantage of other advanced technologies such as machine learning and deep learning. This combination aims to help the machine understand not only the structure of what is said or written, but also the full meaning and context behind it.


NLP, What Are the Challenges?


The main challenge NLP faces is that language is often ambiguous and open to multiple interpretations. This is one of the intrinsic characteristics of human language. For instance, a word can have multiple meanings, and sentences can be interpreted differently depending on the context, which makes language difficult for computers to process. Resolving these ambiguities, referred to as disambiguation, and revealing the intent behind words and sentences is one of the main challenges of Natural Language Processing.


The sheer diversity of human languages, and even the variation within a single language, creates some big difficulties. Languages differ substantially in their lexicons and structures. Although linguistic research has covered hundreds of languages over time, some languages have had more resources than others. Naturally, NLP models have been trained mostly on the most common, widely spoken of them. With many low-resource languages in the world, it is very challenging to build universal NLP applications incorporating all human languages. Multilingualism is at the heart of NLP's aims, and tremendous research is going into machine translation and multilingual embeddings to overcome this challenge.


From NLP to OCR


Data that can be interpreted by NLP is not necessarily typed into the computer, readily encoded for direct processing. Text can be handwritten or printed on paper and can be found in physical or digital images. We know that traditionally, text is encoded in the computer with each character represented by a specific numeric value. However, that naturally does not apply to text found outside the confines of 0s and 1s inside the computer, that is, in physical documents and images.

Precisely this need to convert heaps of textual information into encoded text gave rise to Optical Character Recognition (OCR), a technology that often goes hand in hand with NLP.


Optical Character Recognition (OCR)

The technology that detects text contained in scanned images and converts it into machine-readable text is known as Optical Character Recognition. Scanned images can be documents, printed or handwritten, or photos containing text. The process is quite simple to understand, though many techniques go into its implementation. It largely involves singling out letters and numbers and arranging them into a logical order. Enhanced by NLP, machine learning and deep learning algorithms, text recognition and conversion can be automated without the need to manually enter any sort of data. We will explain the process in more depth in the following section.


Stages of OCR


Optical Character Recognition undergoes three main stages, each comprising a couple of steps. The combination of NLP and OCR in particular is very powerful as it improves the accuracy of the output. We will go through the different steps and explore how NLP can significantly enhance the OCR performance.


1. Pre-Processing

Scanned images are not necessarily in pristine condition. Images can be low quality, blurred, distorted, and so on. Enhancing the image and suppressing undesirable elements reduces the risk of error and is integral to the whole process. This is especially significant when it comes to documents containing sensitive information that must be converted without error. Given its importance, many techniques are applied to make this process as accurate as possible. Below are some of the most common:


– Binarization

Colors in images can affect the recognition of characters. One common technique is to convert the image into black and white only, so that every pixel takes one of two values: 0 (black) or 255 (white). Since text generally contrasts with its background, the values of text pixels and those of background pixels differ significantly, and a threshold between them separates the two. Binarization is therefore an effective way to distinguish text from the background.
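
As a minimal sketch of this idea, assume the image is already a grayscale grid of values from 0 to 255 stored as nested Python lists. The fixed threshold of 128 is an illustrative assumption; practical systems usually choose the threshold adaptively for each image.

```python
def binarize(gray, threshold=128):
    """Map each grayscale pixel to 0 (black) or 255 (white).

    `gray` is a list of rows of integers in 0..255. The default threshold
    of 128 is an illustrative assumption, not a recommended value.
    """
    return [[255 if pixel >= threshold else 0 for pixel in row] for row in gray]


# Dark text (low values) on a light background (high values).
sample = [
    [250, 240, 30],
    [245, 20, 235],
]
binary = binarize(sample)
```

After this step, every text pixel is exactly 0 and every background pixel is exactly 255, which is the representation the later segmentation steps rely on.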


– De-skew (Skew Correction)

When scanned, some documents are not correctly aligned. Detecting the skew and rotating the image accordingly is another essential pre-processing technique. Horizontally aligned text is easier to recognize.
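
One possible way to detect the skew, sketched here under simplifying assumptions, is to try a handful of candidate angles and keep the one that packs the text pixels into the fewest rows. The function name, the candidate range of -10 to 10 degrees, and the scoring rule are all illustrative choices rather than a standard API.

```python
import math


def estimate_skew_degrees(text_pixels, candidates=range(-10, 11)):
    """Return the candidate angle (in degrees) that best aligns the text.

    `text_pixels` is a list of (x, y) coordinates of text pixels, with y
    growing downward. For each candidate angle the points are rotated back
    by that angle and counted per row. Well-aligned text yields a few tall
    peaks, so the sum of squared row counts is highest at the true skew.
    """
    def score(angle_deg):
        a = math.radians(angle_deg)
        rows = {}
        for x, y in text_pixels:
            row = round(-x * math.sin(a) + y * math.cos(a))
            rows[row] = rows.get(row, 0) + 1
        return sum(count * count for count in rows.values())

    return max(candidates, key=score)


# A single synthetic "text line" sloping by 5 degrees.
line = [(x, round(x * math.tan(math.radians(5)))) for x in range(100)]
skew = estimate_skew_degrees(line)
```

Once the angle is known, the image would be rotated by the opposite amount; that rotation step is left out of this sketch.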


– Noise Suppression 

Eliminating noise and distortion is a common technique to enhance the image quality and make text more easily recognizable. For example, some printed letters can be darker than others. Images can also contain noise elements that interfere with text recognition. Sharpening the edges of letters and smoothing the image helps text stand out from the other image elements.
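
A median filter is one widely used smoothing technique for removing isolated specks. The sketch below is a plain-Python illustration that assumes a grayscale image stored as nested lists and a fixed 3x3 neighborhood.

```python
def median_filter(image):
    """Replace each pixel with the median of its 3x3 neighborhood.

    `image` is a list of rows of grayscale values. Neighborhoods are
    clipped at the borders, so edge pixels use fewer neighbors.
    """
    height, width = len(image), len(image[0])
    result = []
    for r in range(height):
        new_row = []
        for c in range(width):
            window = sorted(
                image[rr][cc]
                for rr in range(max(0, r - 1), min(height, r + 2))
                for cc in range(max(0, c - 1), min(width, c + 2))
            )
            new_row.append(window[len(window) // 2])
        result.append(new_row)
    return result


# A white page with one isolated dark speck in the middle.
noisy = [[255] * 5 for _ in range(5)]
noisy[2][2] = 0
cleaned = median_filter(noisy)
```

The isolated speck disappears because it is outvoted by its neighbors, while strokes that are several pixels wide survive the filter.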


– Layout Analysis

Text can sometimes be found inside tables and divided into paragraphs and columns. The ability to identify the different blocks of text is called layout analysis, or “zoning”.


2. Recognition (Segmentation & Feature Extraction) 

– Segmentation

As the name implies, segmentation breaks the pre-processed image into subparts. It takes place on three main levels: lines, words, and characters. To achieve that, the image has to be both horizontally projected (lines) and vertically projected (words and characters).


– Line level 

The point of line segmentation is to divide the text into lines. The trick here is to correctly identify the spacing between lines, since those gaps act as the dividers. For that, we take into account the text pixels and the background pixels mentioned earlier. Pixel rows that cross words have a much larger concentration of text pixels than the rows that fall in the spacing between lines. That is precisely the idea behind line segmentation.
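
The horizontal projection described above can be sketched as follows, assuming a binarized image stored as nested lists in which 0 marks text and 255 marks background. The function name and the rule that any row with at least one text pixel belongs to a line are simplifying assumptions.

```python
def segment_lines(binary):
    """Return (start_row, end_row) pairs, end exclusive, one per text line.

    `binary` is a list of rows where 0 marks a text pixel and 255 marks
    background. A row belongs to a text line when it contains at least
    one text pixel; rows without any act as dividers.
    """
    # Horizontal projection: number of text pixels in each row.
    profile = [sum(1 for pixel in row if pixel == 0) for row in binary]
    lines, start = [], None
    for index, count in enumerate(profile):
        if count > 0 and start is None:
            start = index
        elif count == 0 and start is not None:
            lines.append((start, index))
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines


page = [
    [255, 255, 255, 255],
    [255, 0, 0, 255],
    [0, 0, 255, 255],
    [255, 255, 255, 255],
    [255, 0, 255, 0],
    [255, 255, 255, 255],
]
line_ranges = segment_lines(page)
```

On noisy scans, a small positive threshold on the row count would be more robust than requiring exactly zero text pixels.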


– Word level

The next level is dividing segmented lines into words. The process is quite similar, only with vertical projection this time. The idea is to identify the spaces between words based on the concentration of background pixels.
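
A sketch of the vertical projection for one segmented line follows. It assumes the same 0/255 representation, and it assumes that a gap must be at least `min_gap` columns wide to count as a space between words; both the parameter name and its default of 2 are hypothetical.

```python
def segment_words(line_rows, min_gap=2):
    """Return (start_col, end_col) pairs, end exclusive, one per word.

    `line_rows` holds the pixel rows of one text line, with 0 for text and
    255 for background. Runs of at least `min_gap` empty columns are treated
    as word boundaries; narrower gaps are assumed to separate characters
    inside a word.
    """
    width = len(line_rows[0])
    # Vertical projection: number of text pixels in each column.
    profile = [sum(1 for row in line_rows if row[c] == 0) for c in range(width)]
    words, start, gap = [], None, 0
    for col, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = col
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                words.append((start, col - gap + 1))
                start, gap = None, 0
    if start is not None:
        words.append((start, width - gap))
    return words


T, B = 0, 255  # text and background pixel values
text_line = [[B, T, T, B, T, B, B, B, T, T, B]]
word_ranges = segment_words(text_line)
```

The one-column gap at column 3 is kept inside the first word, while the three-column gap starting at column 5 splits the line into two words.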


– Character level 

After segmenting text into lines and words, it is possible to further segment words into characters, though this is often unnecessary. Whenever characters are already clearly spaced, such as in most typed documents, character segmentation is not needed. It becomes essential in the case of handwriting, for instance.


– Feature Extraction

Arguably the most crucial stage of any OCR pipeline, feature extraction allows the retrieval of relevant information from the pre-processed documents. The scanned images are assumed to be enhanced and effectively divided into manageable segments.

This is the aspect of OCR that can benefit greatly from NLP and machine learning techniques. Extracting relevant information requires “making sense” of the text data, that is, understanding the context of the various text elements present in the document. This can be approached in a couple of ways.


– Named Entity Recognition (NER) 

This NLP technique identifies key elements in a text, called entities, and places them in predefined categories. These entities can be words or groups of words. Entity detection can be used to identify names, organizations, geographic locations, monetary values, etc. Entities can denote not only semantic categories such as animals, colors, foods, and diseases but also syntactic elements such as noun phrases, verbs, adjectives, and other parts of speech.
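
To make the idea concrete, here is a deliberately small rule-based sketch that tags two entity categories with regular expressions. The category names and patterns are hypothetical; production systems cover many more categories and typically rely on trained models rather than hand-written patterns.

```python
import re

# Hypothetical patterns for two entity categories.
ENTITY_PATTERNS = {
    "MONEY": re.compile(r"[$€£]\s?\d+(?:,\d{3})*(?:\.\d{2})?"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def find_entities(text):
    """Return (category, matched_text) pairs in order of appearance."""
    found = []
    for category, pattern in ENTITY_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), category, match.group()))
    return [(category, value) for _, category, value in sorted(found)]


entities = find_entities("Invoice dated 2022-06-07, total due: $1,250.00")
```

Here `entities` holds the date and the amount, each paired with its category, ready to be passed on to later steps.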


– Document Classification

Document classification usually goes hand in hand with NER. After extracting the relevant sections from the document and categorizing them, NLP makes use of pre-trained models and pre-defined categories to classify the entire document, based on the detected entities. Documents can be anything: emails, social media posts, invoices, contracts, articles, and so on.
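
The sketch below illustrates classification based on detected entities, with a simple overlap score standing in for a pre-trained model. The document categories, the entity labels, and the mapping between them are all hypothetical.

```python
from collections import Counter

# Hypothetical mapping from document categories to the entity labels
# that hint at them.
CATEGORY_HINTS = {
    "invoice": {"MONEY", "DATE", "INVOICE_NUMBER"},
    "contract": {"PARTY", "DATE", "SIGNATURE"},
    "email": {"EMAIL_ADDRESS", "SUBJECT"},
}


def classify_document(entity_labels):
    """Pick the category whose hints overlap most with the detected labels.

    Returns "unknown" when no category matches any label.
    """
    counts = Counter(entity_labels)
    scores = {
        category: sum(counts[label] for label in hints)
        for category, hints in CATEGORY_HINTS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


label = classify_document(["MONEY", "MONEY", "DATE", "INVOICE_NUMBER"])
```

A trained classifier would replace the scoring rule, but the flow stays the same: entities go in, a document category comes out.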


3. Post-Processing

Ideally, the recognition process would be flawless and post-processing would not be needed, but that is rarely the case. After the scanned image has been cleaned and its key information retrieved and classified, post-processing takes place to further improve the accuracy of OCR. For instance, spelling errors, grammatical mistakes, and entity misidentifications may occur and need to be addressed. Various NLP and machine learning methods help achieve that, namely:


– Dictionary-based: The OCR output is compared to a set of words from a predefined lexicon. This is especially effective when the document is already categorized. For instance, spelling mistakes and slightly mismatched words can be corrected thanks to dictionary-based post-processing.


– Rule-based: Just as a rule-based approach can be used to detect entities and classify documents, it can be used to correct mistakes, especially recurrent ones. While it’s impossible to set up rules for all possible errors, rule-based post-processing can significantly reduce the need for human intervention.


– Machine learning-based: Given an adequate amount of training data, machine learning-based post-processing can greatly improve the accuracy of OCR systems. The OCR output is compared with many similar, appropriately labeled documents, which constitute the training data. A model trained on them can correct errors and reduce the need for manual correction.
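
The first two approaches can be sketched with Python's standard library alone. The lexicon, the single look-alike rule (the letters O and l read in place of the digits 0 and 1), and the similarity cutoff of 0.8 are illustrative assumptions.

```python
import difflib

# Hypothetical lexicon for an invoice-like document category.
LEXICON = ["invoice", "total", "amount", "payment", "date"]


def fix_digit_confusions(token):
    """Rule-based step: in tokens that are mostly digits, map the
    look-alike letters O and l to the digits 0 and 1."""
    digits = sum(ch.isdigit() for ch in token)
    if digits * 2 > len(token):
        return token.replace("O", "0").replace("l", "1")
    return token


def fix_with_lexicon(token, cutoff=0.8):
    """Dictionary-based step: replace an alphabetic token with its closest
    lexicon entry when the similarity is at least `cutoff`."""
    if not token.isalpha() or token.lower() in LEXICON:
        return token
    matches = difflib.get_close_matches(token.lower(), LEXICON, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def post_process(text):
    """Apply the rule-based step, then the dictionary-based step, per token."""
    tokens = text.split()
    return " ".join(fix_with_lexicon(fix_digit_confusions(t)) for t in tokens)


corrected = post_process("invoce paymnt 1O5")
```

Tokens that resemble nothing in the lexicon are left unchanged, so the sketch errs on the side of not "correcting" words it does not know.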

Final Thoughts

Despite the challenges and limitations, OCR is a significant breakthrough that has helped many industries in their digitization efforts. Large quantities of text can be processed quickly and accurately, using a variety of tools and techniques. This is a huge advantage compared to traditional, manual data entry: it saves considerable time, cost, and effort.

As with pretty much all modern technologies, the success of OCR depends, to a great extent, on the advancement of other fields of research. Natural Language Processing and OCR are a particularly powerful combination. NLP enriches the OCR process, allowing relevant information to be extracted from documents, which is often the main purpose of using OCR in the first place. Thanks to NLP and other AI-based technologies, OCR has made a huge leap from traditional systems managed by humans to fully automated systems that can merge with the most cutting-edge technology to solve more complex problems and reach new heights.