Building An NLP Project From Zero To Hero (1): Project Overview

Dec 7, 2021

Whether it’s receipts, contracts, financial documents, or invoices, automating information retrieval will help you increase your business’s efficiency and productivity at a fraction of the cost. However, this feat would not be possible without text annotation. While natural language processing (NLP) tasks such as NER or relation extraction have been widely used for information retrieval in unstructured text, analyzing structured documents such as invoices, receipts, and contracts is a more complicated endeavor.


First, there is not much semantic context around the entities we want to extract (e.g. price, seller, tax) that can be used to train an NLP model. Second, the document layout changes frequently from one invoice to another, which causes traditional NLP tasks such as NER to perform poorly on structured documents. That being said, structured text, such as an invoice, contains rich spatial information about its entities. This spatial information can be used to create a 2-D position embedding that denotes the relative position of a token within a document. More recently, Microsoft released a new model, LayoutLM, to jointly model interactions between text and layout information across scanned document images. It achieved new state-of-the-art results on several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24), and document image classification (from 93.07 to 94.42).


[Figure: An example of one-hot encoding text]

Next, we convert every document into a set of vectors that serves as the input for our model to predict a label y (let us say a sentiment or topic classification task for the sake of simplicity). The simplest approach is to concatenate the one-hot vectors of each word contained within the document.
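To make this concrete, here is a minimal sketch of one-hot encoding and document-vector concatenation in plain Python (the tiny vocabulary and document are invented for illustration):

```python
# Toy vocabulary (illustrative only).
vocab = ["invoice", "total", "tax", "seller", "price"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a one-hot vector for `word` over the toy vocabulary."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

# A document vector is just the one-hot vectors of its words, concatenated.
document = ["invoice", "total", "tax"]
doc_vector = [bit for word in document for bit in one_hot(word)]

print(one_hot("total"))   # [0, 1, 0, 0, 0]
print(len(doc_vector))    # 3 words * 5 vocabulary entries = 15
```

Note how only one entry per word is non-zero; everything else is padding.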

 

As you may have noticed, this approach is not effective, mainly because of the large amount of memory needed to store all the vectors for every document.

 

Imagine you have a vocabulary of 10,000 words, and some sentences or documents that stretch out to 100 words. The document vector will then be of length 10,000 × 100 = 1,000,000 values. Most importantly, 99.99% of these values are just 0, which brings nothing useful to the model. Furthermore, this representation oversimplifies the complexity of language, which requires more attention to the meaning and context of words.
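The arithmetic above can be checked in a couple of lines:

```python
# Back-of-the-envelope check of the numbers above (pure arithmetic).
vocab_size = 10_000
doc_length = 100

values = vocab_size * doc_length   # total entries in the document vector
nonzero = doc_length               # exactly one 1 per word position
sparsity = 1 - nonzero / values

print(values)    # 1000000
print(sparsity)  # 0.9999, i.e. 99.99% of the entries are zeros
```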

Of course, there exist much better text representation techniques, such as Term Frequency–Inverse Document Frequency (TF-IDF) and word embeddings. We will not delve into them for now, as they will only be needed in a future article when we design and train our model.
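As a small teaser, here is a minimal TF-IDF computation in plain Python (the toy corpus and the `tf_idf` helper are invented for illustration; a real project would use a library such as scikit-learn):

```python
import math

# Toy corpus of pre-tokenized documents (illustrative only).
corpus = [
    ["invoice", "total", "tax"],
    ["invoice", "seller", "price"],
    ["total", "price", "price"],
]

def tf_idf(term, doc, corpus):
    """Term frequency in `doc` weighted by inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)          # docs containing term
    idf = math.log(len(corpus) / df)
    return tf * idf

# "tax" appears in only 1 of 3 documents, so it is weighted up;
# "invoice" appears in 2 of 3, so it is weighted down.
print(tf_idf("tax", corpus[0], corpus))       # ~0.366
print(tf_idf("invoice", corpus[0], corpus))   # ~0.135
```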

 

All you should retain from this section is that in NLP, we need to encode our text into vectors, i.e. mathematically structured data. I believe this is what lies at the core of NLP; the rest is either details or sits at the intersection of other fields like Machine Learning and Linguistics. To process your text, you will need to understand linguistic concepts like stopwords, part-of-speech tags, and tokenization. To train your model, you will need statistical models like support vector machines (SVMs) and neural networks (NNs).
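As a quick illustration of two of those linguistic concepts, here is a naive tokenizer and stopword filter (the tiny stopword list and the `tokenize` helper are invented for illustration; libraries such as NLTK or spaCy ship real tokenizers and full stopword lists):

```python
# A deliberately tiny stopword list (illustrative only).
STOPWORDS = {"the", "is", "a", "of", "and", "to", "in"}

def tokenize(text):
    """Naive whitespace tokenizer with lowercasing and punctuation stripping."""
    return [token.strip(".,!?").lower() for token in text.split()]

def remove_stopwords(tokens):
    """Drop tokens that carry little meaning on their own."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = tokenize("The total of the invoice is due in December.")
print(remove_stopwords(tokens))  # ['total', 'invoice', 'due', 'december']
```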

 

Why Is NLP Important?

 

“Well, NLP is cool and stuff, but how can we leverage it to improve our businesses more efficiently? How does it differ from more traditional techniques?”

 

As we have said before, NLP allows machines to effectively understand and manipulate human language. With it, you can automate many tasks and improve their speed and scale: data labeling, translation, customer feedback, and text analysis. Applying NLP to real-world cases, and not just for research purposes, will bring a significant competitive advantage to many businesses.

Consider an interesting story reported by HealthCatalyst. In 2005, Indiana University Health (IU Health) implemented a machine learning early-warning system to identify unusual trends in the emergency department (ED). At some point, it detected an abnormal number of patients sharing the same specific symptoms (dizziness, confusion, nausea, etc.). At first, the existing data did not show anything unusual, unlike the early-warning system. Later, it was revealed that these individuals lived in the same apartment complex and that their heater was malfunctioning, causing them to get sick from carbon monoxide poisoning.

 

This ability to analyze massive amounts of data, specifically unstructured data, is a game-changer. From this little story, we can see how the model was able to point its developers in the right direction in their analysis of the problem at hand. It did not provide the full answer, but it helped them pinpoint a ‘black swan’ hidden in plain sight, as the existing data did not include anything about this phenomenon.

 

Another fascinating story is that of Kasisto. Founded in 2015, the company created a chatbot called KAI that helps banking and financial organizations develop their own chatbots, which assist their customers with receiving services and managing their finances. These chatbots, of course, are built using NLP.

For example, a bank can feed KAI data containing transaction records and account details in order to train a model for customer support. Given enough data and a moderate amount of training time, the chatbot will be able to answer questions and fulfill services in the chat interface. You can ask it simple questions like “What is my largest transaction so far?”, or ask for a recommendation for a particular need and it will share the links you require. It can also redirect customers to human service agents when needed.

NLP has also entered the legal domain: companies like Ross Intelligence, which uses IBM Watson, have developed natural language query interfaces so that you can ask questions as if a lawyer were there to answer them.

 


[Figure: Most popular uses of NLP]

Now, these are just a few stories among many. I hope you can see why one should seriously consider adopting NLP. So now, let us take an overview of what we will be learning in this series!

Project Overview

Say you have a collection of documents (PDF, XML, or even plain text) that you want to analyze thoroughly. For example, you want to detect all the entities present within the entire corpus. You can decide to train a Named Entity Recognition (NER) model. You can annotate your text manually or use text annotation tools. The annotated documents are then fed to the NER model so that it can finally perform the desired analysis.

For this series, we will be training a custom NER model to use for stock news analysis. We will also give special care to the data labeling part: data labeling, or data annotation, is crucial in Machine Learning. Garbage in, garbage out.

Here is the outline of the series:

 

  1. Project Overview
  2. Data Collection
  3. Data Preprocessing
  4. Data Labeling
  5. Model Training
  6. Model Deployment
  7. Model Monitoring
  8. Text Mining


Each part of this series will have its own article. We will try to keep a gentle tone and not complicate things more than they should be.

Conclusion

 

This series is aimed mainly at those who know at least the basics of NLP but are struggling to get to the next level. We will also try to make the series friendly for non-technical folks, especially those who want to leverage its power for their businesses. UBIAI, a company that specializes in data annotation and custom NLP models, will share some of its tips throughout the series. Feel free to contact us at admin@100.21.53.251 or on Twitter.

Stay tuned and see you in the next article!

 

UBIAI