ubiai deep learning

MEDICAL REPORT USING NER & OCR WITH EASYOCR

Aug 9, 2022

Abstract :

 

Healthcare organizations around the country are turning to optical character recognition (OCR) software to go paperless and improve patient care. Claims Capture, for example, is an intelligent, accurate and highly scalable data capture and document processing solution that drastically reduces an organization’s reliance on paper-based processes and the errors associated with manual data entry.

Medical records are important resources in which patients’ diagnosis and treatment activities in hospitals are documented. In recent years, many medical institutions have done significant work in archiving electronic medical records, and handwritten medical records are gradually being replaced by digital ones. Many researchers strive to extract medical knowledge from this digital data, to use that knowledge to help medical professionals understand potential causes of various symptoms, and to build medical decision support systems.

Medical named entity recognition (NER) is an important technique that has recently received attention in medical communities for extracting named entities, such as diseases, drugs, and anatomical parts, from medical texts such as surgery reports and examination documents.

In this article, we describe how to extract text from image files related to COVID-19 and recognize three entity types (PATHOGEN, MEDICALCONDITION, MEDICINE) in this unstructured text by fine-tuning a model with spaCy transformers, and finally generate a summary that gathers this information about the disease.

 

Named Entity Recognition :

 

Named Entity Recognition is a common problem in NLP dealing with identifying and classifying named entities.

A named entity is a real-life object that can be identified and referred to by a name. A place, a person, a country, or an organization can be a named entity. For example, Microsoft is an organization and Asia is a geographic entity.

Raw or unstructured data is processed and, with the help of named entity recognition, its contents can be labeled and classified into different entity types. An NER system is built using linguistic approaches and statistical methods.

An NER model first identifies an entity and then assigns it to the most suitable class.
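
To make these two steps concrete (locate a mention, then assign it a class), here is a minimal sketch that uses a hand-written lookup table instead of a statistical model. The `GAZETTEER` entries and the `find_entities` helper are hypothetical names chosen for illustration; a trained NER model can generalize to mentions it has never seen, which a lookup cannot do.

```python
import re

# Hypothetical lookup table: lowercase surface form -> entity label.
GAZETTEER = {
    "microsoft": "ORG",
    "asia": "LOC",
}

def find_entities(text):
    """Return (start_char, end_char, label) for every gazetteer hit in text."""
    entities = []
    for match in re.finditer(r"\w+", text):
        label = GAZETTEER.get(match.group().lower())
        if label is not None:
            entities.append((match.start(), match.end(), label))
    return entities

sentence = "Microsoft opened a new office in Asia."
for start, end, label in find_entities(sentence):
    print(sentence[start:end], "->", label)
```

The character offsets returned here have the same `(start, end, label)` shape as the annotations used later to train the spaCy model.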

 

Named Entity Recognition with spaCy :

 

spaCy is an open-source natural language processing library with a fast statistical entity recognition system. The NER methods available in spaCy assign a label to spans of text and classify them as described above.

spaCy also lets us add arbitrary classes to the entity recognition system and update the model with new examples. We can train on our own data for business-specific needs and prepare the model as necessary.

 

Spacy Transformers :

 

Transformers are a particular architecture for deep learning models that revolutionized natural language processing. The defining characteristic for a Transformer is the self-attention mechanism. Using it, each word learns how related it is to the other words in a sequence.
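
As a rough illustration of self-attention, the sketch below computes scaled dot-product attention weights for three toy token vectors in plain Python. The vectors are made up, and the learned query, key, and value projections of a real Transformer are left out for brevity, so this shows only the step in which each token scores how related it is to the others.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product attention where queries, keys and values are the inputs themselves."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        all_weights.append(weights)
        # Each output is a weighted mix of all token vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)])
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # made-up 2-d embeddings
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row holds the attention one token pays to every token in the sequence; the first token attends more to the similar second token than to the dissimilar third one.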

Transformers are a family of neural network architectures that compute dense, context-sensitive representations for the tokens in your documents. Downstream models in your pipeline can then use these representations as input features to improve their predictions. You can connect multiple components to a single transformer model, with any or all of those components giving feedback to the transformer to fine-tune it to your tasks.

spaCy’s transformer support interoperates with PyTorch and the HuggingFace transformers library, giving you access to thousands of pretrained models for your pipelines. There are many great guides to transformer models, but for practical purposes, you can simply think of them as drop-in replacements that let you achieve higher accuracy in exchange for higher training and runtime costs.

NER WITH SPACY TRANSFORMERS AND OCR WITH EASYOCR

 

Optical Character Recognition (OCR) :

Optical character recognition (OCR) is also referred to as text recognition. An OCR program extracts and repurposes data from scanned documents, camera images and image-only PDFs. OCR software singles out letters in the image, puts them into words and then puts the words into sentences, thus enabling access to and editing of the original content. It also eliminates the need for manual data entry.

EasyOCR :

 
 

EasyOCR is a Python package that allows computer vision developers to perform optical character recognition with little effort.
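
By default, EasyOCR's `readtext` returns one entry per detected text region in the form `(bounding_box, text, confidence)`, where the bounding box is a list of four corner points. The sketch below works on a hand-written result of that shape so that it runs without the package or an image; the sample values and the `join_confident_text` helper with its `min_confidence` threshold are assumptions made for illustration.

```python
# Hand-written stand-in for the list returned by easyocr.Reader.readtext().
sample_result = [
    ([[10, 10], [200, 10], [200, 40], [10, 40]], "COVID-19 is caused by", 0.93),
    ([[10, 50], [180, 50], [180, 80], [10, 80]], "SARS-CoV-2", 0.88),
    ([[10, 90], [60, 90], [60, 120], [10, 120]], "~~|", 0.12),
]

def join_confident_text(result, min_confidence=0.5):
    """Keep detections at or above min_confidence and join them line by line."""
    lines = [text for _box, text, confidence in result if confidence >= min_confidence]
    return "\n".join(lines)

print(join_confident_text(sample_result))
```

Filtering on the confidence score is optional, but it can keep obvious OCR noise out of the text that is later passed to the NER model.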

 

 

MEDICAL REPORT SUMMARY :

In our case, we will extract text from our input images and predict three entity types (PATHOGEN, MEDICALCONDITION, MEDICINE) based on our dataset using a custom spaCy NER model, and finally generate a summary that contains these entities with their labels.

 

1- OCR detection and text extraction :

 

Let’s start by installing the easyocr package and importing it :

 

				
!pip install easyocr
import easyocr

Install the OpenCV library :

!pip install opencv-python
import cv2

Import matplotlib :

import matplotlib.pyplot as plt

Define the OCR text extraction function :

############### OCR TEXT EXTRACTION FUNCTION ###############
def ocr_extraction(image_path):
    reader = easyocr.Reader(['en'])  # English recognition model
    # paragraph=False keeps one (bbox, text, confidence) entry per detected region
    result = reader.readtext(image_path, paragraph=False)
    text = ''
    for res in result:
        text += res[1] + '\n'  # res[1] is the recognized string
    return text, result  # full text plus the raw detections

Define the contour drawing function :

############### OCR CONTOURS DETECTION FUNCTION ###############
def draw_contours(image_path, result):
    img = cv2.imread(image_path)
    for detection in result:
        # detection[0] holds the four corners of the bounding box
        top_left = tuple(int(v) for v in detection[0][0])
        bottom_right = tuple(int(v) for v in detection[0][2])
        img = cv2.rectangle(img, top_left, bottom_right, (0, 255, 0), 3)
    plt.figure(figsize=(10, 10))
    plt.axis('off')
    plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))  # OpenCV loads BGR, matplotlib expects RGB
    plt.show()

text, result = ocr_extraction('img/rap4.jpg')
print('text --> \n', text)
draw_contours('img/rap4.jpg', result)
				
			

 

2- Train a custom spaCy NER model :

 

To train the model, we need relevant data with proper annotations. Here I use a medical entities dataset, ‘medicine.json’. First, install spaCy and import what we need :

 

!pip install spacy
import json
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans
from tqdm import tqdm
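
The `create_train_spacy` function used in this section reads a `document` string and a list of `annotation` entries with `start`, `end`, and `label` from each record. A minimal sketch of a record with that layout follows, together with a sanity check on the character offsets. The sample sentence, its offsets, and the `check_record` helper are made up for illustration and are not taken from the actual `medicine.json`.

```python
import json

# Made-up record that follows the keys the loading code expects.
sample_json = """
[
  {
    "document": "Ibuprofen can relieve fever caused by influenza virus.",
    "annotation": [
      {"start": 0, "end": 9, "label": "Medicine"},
      {"start": 22, "end": 27, "label": "MedicalCondition"},
      {"start": 38, "end": 53, "label": "Pathogen"}
    ]
  }
]
"""

def check_record(record):
    """Return the annotated substrings, raising if an offset is out of range."""
    text = record["document"]
    found = []
    for ann in record["annotation"]:
        if not 0 <= ann["start"] < ann["end"] <= len(text):
            raise ValueError(f"bad offsets: {ann}")
        found.append((text[ann["start"]:ann["end"]], ann["label"].upper()))
    return found

records = json.loads(sample_json)
for surface, label in check_record(records[0]):
    print(f"{label:17} {surface}")
```

Printing the substring behind every annotation is a quick way to spot offsets that are shifted by a character before any training time is spent.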

 

 

In the create_train_spacy function below, we extract each text and its corresponding annotations and build structured training data: for every text, the entity labels with their corresponding character spans.

spaCy uses the DocBin class for annotated data, so we have to create DocBin objects for our training examples. DocBin efficiently serializes the information from a collection of Doc objects. It is faster and produces smaller data sizes than pickle, and allows the user to deserialize without executing arbitrary Python code. The indices of some entities overlap; spaCy provides a utility function, filter_spans, to deal with this.
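
Overlaps have to be resolved because `doc.ents` does not accept overlapping spans. The sketch below mirrors, in plain Python over `(start, end, label)` character offsets, the rule documented for `filter_spans`: when spans overlap, the (first) longest one is kept. The helper name `drop_overlaps` and the sample offsets are hypothetical; the real utility works on spaCy `Span` objects.

```python
def drop_overlaps(spans):
    """Keep the longest span among overlapping ones; ties go to the earliest start."""
    # Longest first, then earliest start.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept, taken = [], set()
    for start, end, label in ordered:
        positions = range(start, end)
        if not any(p in taken for p in positions):
            kept.append((start, end, label))
            taken.update(positions)
    return sorted(kept)  # back in document order

spans = [(0, 9, "PATHOGEN"), (4, 9, "MEDICINE"), (20, 25, "MEDICALCONDITION")]
print(drop_overlaps(spans))
```

In this made-up example the shorter `MEDICINE` span lies inside the longer `PATHOGEN` span, so only the longer one survives.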

 

				
# CREATE SPACY FILE TO TRAIN MODEL AND GET TRAIN DATA TEXT FUNCTION #
def create_train_spacy(json_file_path):
    with open(json_file_path, 'r') as f:
        data = json.load(f)
    training_data = {'classes': ['MEDICINE', 'MEDICALCONDITION', 'PATHOGEN'], 'annotations': []}
    # collect each text with its (start, end, label) entity offsets
    for example in data:
        temp_dict = {}
        temp_dict['text'] = example['document']
        temp_dict['entities'] = []
        for annotation in example['annotation']:
            start = annotation['start']
            end = annotation['end']
            label = annotation['label'].upper()
            temp_dict['entities'].append((start, end, label))
        training_data['annotations'].append(temp_dict)
    nlp = spacy.blank("en")  # blank English pipeline, used only for tokenization
    doc_bin = DocBin()
    for training_example in tqdm(training_data['annotations']):
        text = training_example['text']
        labels = training_example['entities']
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in labels:
            # returns None when the offsets cannot be aligned to token boundaries
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is None:
                print("Skipping entity")
            else:
                ents.append(span)
        filtered_ents = filter_spans(ents)  # drop overlapping spans
        doc.ents = filtered_ents
        doc_bin.add(doc)
    doc_bin.to_disk("./training_data.spacy")  # save the DocBin object
    # training data as a list of {'text': ..., 'entities': [(start_char, end_char, label)]}
    return training_data
				
			

 

I use a GPU (with transformers) to set up my model. To do the same, install spacy-transformers by following the guide below :

https://spacy.io/usage/embeddings-transformers

The DocBin stores the training data in the spaCy format that we need to train a model. We can then either write a config file by hand for our use case or quickly generate a base config on spaCy’s training quickstart page: https://spacy.io/usage/training

 


 

You can choose the number of training epochs based on your computer’s performance. I chose 50 epochs and set this in the config.cfg file :

 

training_data=create_train_spacy('medicine.json')
!python -m spacy init fill-config base_config.cfg config.cfg
# I set 50 epochs in config.cfg (max_epochs under [training])
 

 

Train the model :

!python -m spacy train config.cfg --output ./ --paths.train ./training_data.spacy --paths.dev ./training_data.spacy
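
Note that the command above passes the same file as the training and the development set, so the scores reported during training are measured on data the model has already seen. One way around this, sketched here under the assumption that the annotated examples are available as a Python list, is to hold out a fraction before building two separate `.spacy` files. The `split_examples` helper, the 80/20 ratio, and the seed are illustrative choices, not part of the original workflow.

```python
import random

def split_examples(examples, dev_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and return (train, dev) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    dev_size = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[dev_size:], shuffled[:dev_size]

examples = [f"example {i}" for i in range(10)]  # placeholders for annotated records
train, dev = split_examples(examples)
print(len(train), "training examples,", len(dev), "dev examples")
```

Each list could then be written to its own DocBin file and passed to `--paths.train` and `--paths.dev` respectively.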

 

3- NER for an unstructured medical document :

We create the display functions as follows :

 

				
import random
from spacy import displacy

###### COLOR GENERATOR FUNCTION ######
def color_gen():  # generates and returns a random hex color
    random_number = random.randint(0, 16777215)  # 16777215 = 256*256*256 - 1, the largest RGB value
    # zero-pad to six hex digits so that small numbers still give a valid color
    return '#' + format(random_number, '06x')

###### DISPLAY DOCUMENT FUNCTION ######
def display_doc(doc):
    colors = {ent.label_: color_gen() for ent in doc.ents}
    options = {"ents": [ent.label_ for ent in doc.ents], "colors": colors}
    # render the recognized entities inline (Jupyter notebook)
    displacy.render(doc, style='ent', options=options, jupyter=True)


Let’s load the best-performing model and test it on a piece of text. The output below was generated in a Jupyter notebook :



nlp_ner = spacy.load("model-best")
doc=nlp_ner(text)
display_doc(doc)
				
			

 

4- Generate a PDF summary of the medical report with the entities and their labels :

Install the fpdf package :

 

				
!pip install fpdf
from fpdf import FPDF

Create a function that builds the details dictionary {label: [entities]} :

# DETAILS EXTRACTION FUNCTION OF DOCUMENT (LABEL -> ENTITIES) #
def details_dict(doc):
    details = {}
    for ent in doc.ents:
        entity_text = ent.text.strip()
        if ent.label_ not in details:
            details[ent.label_] = [entity_text]
        elif entity_text not in details[ent.label_]:
            details[ent.label_].append(entity_text)
    return details  # maps each label to its unique entity texts
Create a function that saves this details dictionary to the details.txt file :

# TEXT FILE FUNCTION TO SAVE DETAILS #
def create_file_txt(dict_variable):
    # one line per label: "LABEL : entity1 , entity2"
    with open("details.txt", "w") as text_file:
        for label, entities in dict_variable.items():
            text_file.write(label.upper() + ' : ' + ' , '.join(entities) + '\n')

Create a function that generates SUMMARY.pdf, the medical report summary, from the details.txt file created above :

# PDF SUMMARY OF THE MEDICAL REPORT FUNCTION #
def create_summary_pdf(file_txt_path):
    pdf = FPDF()
    pdf.add_page()
    # title lines, centered
    pdf.set_font("Arial", size=15)
    pdf.cell(200, 10, txt="HEALTHCARE", ln=1, align='C')
    pdf.cell(200, 10, txt="A CLINICAL REPORT SUMMARY", ln=1, align='C')
    pdf.ln(10)  # vertical space below the title
    # body: one wrapped paragraph per line of the details file
    pdf.set_font("Arial", size=10)
    with open(file_txt_path, "r") as f:
        for line in f:
            pdf.multi_cell(0, 5, txt='\n' + line)
    # save the PDF
    pdf.output("SUMMARY.pdf")

details = details_dict(doc)
create_file_txt(details)
create_summary_pdf("details.txt")
				
			

 

5- Generate a PDF summary of the medical reports based on OCR and NER for multiple files :

I have stored multiple medical report images in the ‘img/’ folder, so let’s apply NER to all of these images and generate one summary for all of the documents :

 
				
import os

Create a function that lists all the files in the ‘img/’ folder :

# FILES PATH EXTRACTION #
def files_path(dir_path):
    # sorted names of the files directly inside dir_path
    return sorted(
        name for name in os.listdir(dir_path)
        if os.path.isfile(os.path.join(dir_path, name))
    )

Let’s display our entities and generate the summary at the same time :

nlp_ner = spacy.load("model-best")
d = {}        # merged {label: [entities]} over all images
resume = ''   # concatenated text of all images
result_files = files_path('img')
for file in result_files:
    image_path = os.path.join('img', file)
    text, result = ocr_extraction(image_path)
    resume += text + '\n'
    doc = nlp_ner(text)
    details = details_dict(doc)
    for label in details:
        if label not in d:
            d[label] = details[label]
        else:
            for itm in details[label]:
                if itm not in d[label]:
                    d[label].append(itm)
doc = nlp_ner(resume)
display_doc(doc)
create_file_txt(d)
create_summary_pdf("details.txt")
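
The merging loop above can also be factored into a small helper, which makes it easy to check on its own. A minimal sketch, where `merge_details` is a hypothetical name and the sample dictionaries are made up:

```python
def merge_details(merged, new_details):
    """Add entities from new_details to merged, skipping ones already listed."""
    for label, entities in new_details.items():
        known = merged.setdefault(label, [])
        for entity in entities:
            if entity not in known:
                known.append(entity)
    return merged

summary = {}
merge_details(summary, {"PATHOGEN": ["SARS-CoV-2"], "MEDICINE": ["remdesivir"]})
merge_details(summary, {"PATHOGEN": ["SARS-CoV-2", "coronavirus"]})
print(summary)
```

Inside the loop over images, the call would take the accumulated dictionary and the per-image result of `details_dict`.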
 
				
			

 

Conclusion :

Running OCR on scanned documents and camera images, and recognizing medical named entities in unlabeled texts and medical records, are challenging tasks. This article presents a way to build a dictionary of medical entities from such documents. In this tutorial, we used EasyOCR to extract unstructured text from scanned documents, trained a transformer-based model to predict our three entity types, and created a PDF file that summarizes the documents.