Building a Job Entity Recognizer Using Amazon Comprehend
Jul 24, 2020
With the advent of Natural Language Processing (NLP), traditional job searches based on static keywords are becoming less desirable because of their inaccuracy and will eventually become obsolete. While a traditional search engine performs simple keyword matching, an NLP-based search engine extracts named entities, key phrases, sentiment, and so on to enrich the documents with metadata, and then runs search queries against that metadata. In this tutorial, we will build a machine learning model that uses Named Entity Recognition (NER) to extract entities such as skills, diploma, and diploma major from job descriptions.
We will use the Amazon Comprehend custom entity recognizer to extract entities from job descriptions. There are two ways to train the model (see documentation):
- Entity List: Provide a list of words with their associated entity type
- Annotation: Provide the location of the word in the document and its entity type so Amazon Comprehend can train on both the entity and its context
Providing an entity list is usually the fastest way to train the model, but it typically results in lower accuracy. We therefore chose the Annotation method to get the most accurate results. This method requires manually annotating hundreds of documents, which can be very time-consuming, so choosing the right annotation tool is important. In this tutorial, we used the UBIAI text annotation tool (free in its beta version) because it comes with extensive features such as:
- ML auto-annotation
- Dictionary and regex auto-annotation
- Team collaboration to share annotation tasks
- Direct export of annotation to Amazon Comprehend format
For more information about the UBIAI annotation tool, please visit the documentation page and my previous post “How to Automate Job Search Using Named Entity Recognition-Part 1”.
Using a combination of manual, dictionary, and ML auto-annotation, I was able to produce 200 annotations per entity type, the minimum required by Amazon Comprehend to train a model, in a few hours. Once done with the annotation, I exported it in Amazon Comprehend format using the “Annotations” option.
The downloaded annotation file includes each document as a separate txt file, plus a csv file that contains 5 columns:
- File: The name of the txt file containing the document (one of the files included in the download from the UBIAI website)
- Line: The line number containing the entity
- Begin Offset: The character offset in the input text (relative to the beginning of the line) that shows where the entity begins. The first character is at position 0.
- End Offset: The character offset in the input text that shows where the entity ends
- Type: User defined entity type
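To make the column semantics concrete, here is a minimal Python sketch that reads an annotation csv and recovers the annotated text from each document. It rests on several assumptions that are not stated above: the file name `job1.txt`, the sample sentence, and the entity labels `SKILLS`, `DIPLOMA` and `DIPLOMA_MAJOR` are made up for illustration, line numbers are treated as starting at 0, and End Offset is treated as exclusive (like a Python slice). Check these against your own export before relying on them.

```python
import csv
import io
from collections import Counter


def extract_entities(annotation_csv, documents):
    """Return (entity_text, entity_type) pairs described by the annotation rows.

    annotation_csv: text of a csv with the header
        File,Line,Begin Offset,End Offset,Type
    documents: dict mapping file name -> full document text
    """
    results = []
    reader = csv.DictReader(io.StringIO(annotation_csv), skipinitialspace=True)
    for row in reader:
        lines = documents[row["File"]].splitlines()
        line = lines[int(row["Line"])]  # assumption: line numbers start at 0
        begin = int(row["Begin Offset"])
        end = int(row["End Offset"])  # assumption: end offset is exclusive
        results.append((line[begin:end], row["Type"]))
    return results


def count_per_type(annotation_csv):
    """Count annotations per entity type, e.g. to check a per-type minimum."""
    reader = csv.DictReader(io.StringIO(annotation_csv), skipinitialspace=True)
    return Counter(row["Type"] for row in reader)


# Hypothetical sample data, for illustration only.
sample_docs = {
    "job1.txt": "We need a Python developer.\nBachelor in Computer Science required."
}
sample_csv = (
    "File,Line,Begin Offset,End Offset,Type\n"
    "job1.txt,0,10,16,SKILLS\n"
    "job1.txt,1,0,8,DIPLOMA\n"
    "job1.txt,1,12,28,DIPLOMA_MAJOR\n"
)

for text, entity_type in extract_entities(sample_csv, sample_docs):
    print(f"{entity_type}: {text}")
```

Running a check like this over the exported files is a cheap way to catch misaligned offsets before uploading anything, and `count_per_type` can confirm that every entity type reaches the required number of annotations.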
The next step is to upload the documents and the annotation csv file to an Amazon S3 bucket and configure the Amazon Comprehend custom NER model.
Training a Custom NER Model with Amazon Comprehend:
Training a custom NER model in Amazon Comprehend is fairly quick and easy. First, I create a folder in Amazon S3 named “Document list” to hold all the training and testing documents (Note: Amazon Comprehend requires 1000 documents for training and testing). In a separate folder, “Annotation entities list”, I upload the csv file containing the annotation output from the UBIAI tool.
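The same setup can also be driven from code instead of the console. The sketch below only builds the request parameters for the boto3 Comprehend `create_entity_recognizer` call, so it runs without AWS credentials. The bucket name `my-ner-bucket`, the file name `annotations.csv`, the recognizer name, the IAM role ARN, and the entity labels are all hypothetical placeholders; only the two folder names come from the setup described above.

```python
def build_recognizer_request(name, role_arn, bucket, entity_types,
                             documents_prefix="Document list/",
                             annotations_key="Annotation entities list/annotations.csv",
                             language_code="en"):
    """Build keyword arguments for Comprehend's create_entity_recognizer.

    documents_prefix and annotations_key mirror the two S3 folders used in
    this tutorial; the csv file name is an assumption.
    """
    return {
        "RecognizerName": name,
        "DataAccessRoleArn": role_arn,
        "LanguageCode": language_code,
        "InputDataConfig": {
            "EntityTypes": [{"Type": t} for t in entity_types],
            "Documents": {"S3Uri": f"s3://{bucket}/{documents_prefix}"},
            "Annotations": {"S3Uri": f"s3://{bucket}/{annotations_key}"},
        },
    }


# Hypothetical values, for illustration only.
request = build_recognizer_request(
    name="job-entity-recognizer",
    role_arn="arn:aws:iam::123456789012:role/ComprehendDataAccess",
    bucket="my-ner-bucket",
    entity_types=["SKILLS", "DIPLOMA", "DIPLOMA_MAJOR"],
)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("comprehend").create_entity_recognizer(**request)
```

Keeping the request construction separate from the API call makes it easy to inspect the S3 paths and entity types before starting a training job.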