How to Build a Knowledge Graph with Neo4J and Transformers

Nov 21, 2021

In my previous article, “Building a Knowledge Graph for Job Search using BERT Transformer”, we explored how to create a knowledge graph from job descriptions using entities and relations extracted by a custom transformer model. While we were able to get great visuals of our nodes and relations using the Python library networkX, the actual graph lived in Python memory and wasn’t stored in a database. This is problematic when building a scalable application that has to store an ever-growing knowledge graph. This is where Neo4j excels: it lets you store the graph in a fully functional database that can manage large amounts of data. In addition, Neo4j’s Cypher language is rich, easy to use, and very intuitive.


In this article, I will show how to build a knowledge graph from job descriptions using a fine-tuned transformer-based Named Entity Recognition (NER) model and spaCy’s relation extraction model. The method described here can be applied to many other fields, such as biomedicine, finance, and healthcare.


Below are the steps we are going to take:

  • Load our fine-tuned transformer NER and spaCy relation extraction models in Google Colab
  • Create a Neo4j Sandbox and add our entities and relations
  • Query the graph to find the best job matches for a target resume, the most popular skills, and the highest skill co-occurrences


For more information on how to generate training data using UBIAI and how to fine-tune the NER and relation extraction models, check out the articles below:


  1. Introducing UBIAI: Easy-to-Use Text Annotation for NLP Applications
  2. How to Train a Joint Entities and Relation Extraction Classifier using BERT Transformer with spaCy 3
  3. How to Fine-Tune BERT Transformer with spaCy 3


The dataset of job descriptions is publicly available on Kaggle.

At the end of this tutorial, we will be able to create a knowledge graph as shown below.

[Image: the resulting knowledge graph]

And in graph visual:

[Image: the knowledge graph visualization]

Named Entity and Relation Extraction

First, we load the dependencies for the NER and relation extraction models, as well as the NER model itself, which was previously fine-tuned to extract skills, diplomas, diploma majors, and years of experience:

!pip install -U pip setuptools wheel
!python -m spacy project clone tutorials/rel_component
!pip install -U spacy-nightly --pre
!pip install -U spacy transformers

# restart the runtime after installing the dependencies
import spacy
nlp = spacy.load("[PATH_TO_THE_MODEL]/model-best")
Load the jobs dataset from which we want to extract the entities and relations:
import pandas as pd

def get_all_documents():
    # each row of the CSV holds one job description
    df = pd.read_csv("/content/drive/MyDrive/job_DB1_1_29.csv", sep='"', header=None)
    documents = []
    for index, row in df.iterrows():
        documents.append(str(row[0]))
    return documents

documents = get_all_documents()
documents = documents[:]  # optionally slice here to work on a subset
Extract the entities from the jobs dataset:

import hashlib
import re

def extract_ents(documents, nlp):
  docs = list()
  for doc in nlp.pipe(documents, disable=["tagger", "parser"]):
      dictionary = dict.fromkeys(["text", "annotations"])
      dictionary["text"] = str(doc)
      dictionary['text_sha256'] = hashlib.sha256(dictionary["text"].encode('utf-8')).hexdigest()
      annotations = []

      for e in doc.ents:
        ent_id = hashlib.sha256(str(e.text).encode('utf-8')).hexdigest()
        ent = {"start": e.start_char, "end": e.end_char, "label": e.label_, "label_upper": e.label_.upper(), "text": e.text, "id": ent_id}
        if e.label_ == "EXPERIENCE":
          # store the integer number of years (e.g. "5+ years" -> 5)
          match = re.search(r"\d+", e.text)
          if match:
            ent["years"] = int(match.group())

        annotations.append(ent)

      dictionary["annotations"] = annotations
      docs.append(dictionary)
  return docs

parsed_ents = extract_ents(documents, nlp)
We can take a look at some of the extracted entities before we feed them to our relation extraction model:
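For instance, a quick inspection loop like the following (a minimal sketch, not part of the pipeline itself; the document index is arbitrary) prints the (text, label) pairs for one of the parsed documents:

# print the (text, label) pairs extracted from one job description
# parsed_ents[0] is an arbitrary choice for illustration
print([(ann["text"], ann["label_upper"]) for ann in parsed_ents[0]["annotations"]])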

[('stock market analysis', 'SKILLS'),
 ('private investor', 'SKILLS'),
 ('C++', 'SKILLS'),
 ('Investment Software', 'SKILLS'),
 ('MS Windows', 'SKILLS'),
 ('web development', 'SKILLS'),
 ('Computer Science', 'DIPLOMA_MAJOR'),
 ('AI', 'SKILLS'),
 ('software development', 'SKILLS'),
 ('coding', 'SKILLS'),
 ('C', 'SKILLS'),
 ('C++', 'SKILLS'),
 ('Visual Studio', 'SKILLS'),
 ('2 years', 'EXPERIENCE'),
 ('C/C++ development', 'SKILLS'),
 ('data compression', 'SKILLS'),
 ('financial markets', 'SKILLS'),
 ('financial calculation', 'SKILLS'),
 ('GUI design', 'SKILLS'),
 ('Windows development', 'SKILLS'),
 ('MFC', 'SKILLS'),
 ('Win', 'SKILLS'),
 ('HTTP', 'SKILLS'),
 ('TCP/IP', 'SKILLS'),
 ('sockets', 'SKILLS'),
 ('network programming', 'SKILLS'),
 ('System administration', 'SKILLS')]
We are now ready to predict the relations. First, load the relation extraction model; make sure to change the directory to rel_component/scripts to access all the necessary scripts for the relation model.

cd rel_component/

import random
import typer
from pathlib import Path
import spacy
from spacy.tokens import DocBin, Doc
from spacy.training.example import Example

# make the factory work
from rel_pipe import make_relation_extractor, score_relations

# make the config work
from rel_model import create_relation_model, create_classification_layer, create_instances, create_tensors

# load the fine-tuned relation extraction model
nlp2 = spacy.load("/content/drive/MyDrive/training_rel_roberta/model-best")

def extract_relations(documents, nlp, nlp2):
  predicted_rels = list()
  for doc in nlp.pipe(documents, disable=["tagger", "parser"]):
    source_hash = hashlib.sha256(doc.text.encode('utf-8')).hexdigest()
    # run the relation extraction pipeline on the NER-annotated doc
    for name, proc in nlp2.pipeline:
      doc = proc(doc)

    for value, rel_dict in doc._.rel.items():
      for e in doc.ents:
        for b in doc.ents:
          if e.start == value[0] and b.start == value[1]:
            max_key = max(rel_dict, key=rel_dict.get)
            e_id = hashlib.sha256(str(e).encode('utf-8')).hexdigest()
            b_id = hashlib.sha256(str(b).encode('utf-8')).hexdigest()
            # keep only high-confidence predictions
            if rel_dict[max_key] >= 0.9:
              predicted_rels.append({'head': e_id, 'tail': b_id, 'type': max_key, 'source': source_hash})
  return predicted_rels

predicted_rels = extract_relations(documents, nlp, nlp2)
Predicted relations:
entities: ('5+ years', 'software engineering') --> predicted relation: {'DEGREE_IN': 9.5471655e-08, 'EXPERIENCE_IN': 0.9967771}
entities: ('5+ years', 'technical management') --> predicted relation: {'DEGREE_IN': 1.1285037e-07, 'EXPERIENCE_IN': 0.9961034}
entities: ('5+ years', 'designing') --> predicted relation: {'DEGREE_IN': 1.3603304e-08, 'EXPERIENCE_IN': 0.9989103}
entities: ('4+ years', 'performance management') --> predicted relation: {'DEGREE_IN': 6.748373e-08, 'EXPERIENCE_IN': 0.92884386}

Neo4J

We are now ready to load our jobs dataset and the extracted data into the Neo4j database.

documents = get_all_documents()
documents = documents[:]
parsed_ents = extract_ents(documents, nlp)
predicted_rels = extract_relations(documents, nlp, nlp2)

# basic neo4j query function
from neo4j import GraphDatabase
import pandas as pd

host = 'bolt://[your_host_address]'
user = 'neo4j'
password = '[your_password]'
driver = GraphDatabase.driver(host, auth=(user, password))

def neo4j_query(query, params=None):
    with driver.session() as session:
        result = session.run(query, params)
        return pd.DataFrame([r.values() for r in result], columns=result.keys())
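Before loading any data, it may be worth verifying that the driver can actually reach the sandbox; a trivial query through the helper (a quick sanity check, not part of the original walkthrough) suffices:

# should print a one-row DataFrame with column 'ok' = 1 if the connection works
print(neo4j_query("RETURN 1 AS ok"))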
Next, we add the documents, entities, and relations to the knowledge graph. Note that we need to extract the integer number of years from the name of the EXPERIENCE entity and store it as a property.
#clean your current neo4j sandbox db (remove everything)
neo4j_query("""
MATCH (n) DETACH DELETE n;
""")

#Create the main node
neo4j_query("""
MERGE (l:LaborMarket {name:"Labor Market"})
RETURN l
""")

#add entities to KG: skills, experience, diploma, diploma major
neo4j_query("""
MATCH (l:LaborMarket)
UNWIND $data as row
MERGE (o:Offer {id:row.text_sha256})
SET o.text = row.text
MERGE (l)-[:HAS_OFFER]->(o)
WITH o, row.annotations as entities
UNWIND entities as entity
MERGE (e:Entity {id:entity.id})
ON CREATE SET
              e.name = entity.text,
              e.label = entity.label_upper
MERGE (o)-[m:MENTIONS]->(e)
ON CREATE SET m.count = 1
ON MATCH SET m.count = m.count + 1
WITH e
CALL apoc.create.addLabels( id(e), [ e.label ] )
YIELD node
REMOVE node.label
RETURN node
""", {'data': parsed_ents})

#Get the name of each EXPERIENCE entity
res = neo4j_query("""
MATCH (e:EXPERIENCE)
RETURN e.id as id, e.name as name
""")

#Extract the integer number of years from the EXPERIENCE name
import re
def get_years(name):
  return int(re.findall(r"\d+", name)[0])

res["years"] = res.name.map(lambda name: get_years(name))
data = res.to_dict('records')

#Add property 'years' to entity EXPERIENCE
neo4j_query("""
UNWIND $data as row
MATCH (e:EXPERIENCE {id:row.id})
SET e.years = row.years
RETURN e.name as name, e.years as years
""", {"data": data})

#Add relations to KG
neo4j_query("""
UNWIND $data as row
MATCH (source:Entity {id: row.head})
MATCH (target:Entity {id: row.tail})
MATCH (offer:Offer {id: row.source})
MERGE (source)-[:REL]->(r:Relation {type: row.type})-[:REL]->(target)
MERGE (offer)-[:MENTIONS]->(r)
""", {'data': predicted_rels})
Now begins the fun part. The knowledge graph is ready, and we can start running queries. Let’s run a query to find the best job matches for a target profile:

#Show the best match in a table
other_id = "8de6e42ddfbc2a8bd7008d93516c57e50fa815e64e387eb2fc7a27000ae904b6"

query = """
MATCH (o1:Offer {id:$id})-[m1:MENTIONS]->(s:Entity)<-[m2:MENTIONS]-(o2:Offer)
RETURN DISTINCT o1.id as Source, o2.id as Proposed_Offer, count(*) as freq, collect(s.name) as common_terms
ORDER BY freq DESC
LIMIT $limit
"""
res = neo4j_query(query, {"id": other_id, "limit": 3})
res

#In the neo4j browser, use this query to show the graph of the best matched job
"""MATCH (o1:Offer {id:"8de6e42ddfbc2a8bd7008d93516c57e50fa815e64e387eb2fc7a27000ae904b6"})-[m1:MENTIONS]->(s:Entity)<-[m2:MENTIONS]-(o2:Offer)
WITH o1, s, o2, count(*) as freq
MATCH (o1)--(s)
RETURN collect(o2)[0], o1, s, max(freq)"""

Results in tabular form showing the common entities:

[Image: best job matches and common entities in tabular form]

And in graph visual:

[Image: best job matches in graph form]

While this dataset is composed of only 29 job descriptions, the method described here can be applied to large-scale datasets with thousands of jobs. With only a few lines of code, we can instantly extract the best job matches for a target profile.

Let’s find out the most in-demand skills:
query = """
MATCH (s:SKILLS)<-[:MENTIONS]-(o:Offer)
RETURN s.name as skill, count(o) as freq
ORDER BY freq DESC
LIMIT 10
"""
res = neo4j_query(query)
res
 
[Image: the ten most in-demand skills]

And the skills that require the highest years of experience:
query = """
MATCH (s:SKILLS)--(r:Relation)--(e:EXPERIENCE) where r.type = "EXPERIENCE_IN"
return s.name as skill,e.years as years
ORDER BY years DESC
LIMIT 10
"""
res = neo4j_query(query)
res
re
 
[Image: skills requiring the highest years of experience]

Web development and support require the highest years of experience, followed by security setup.

Finally, let’s check which pairs of skills co-occur the most:
neo4j_query("""
MATCH (s1:SKILLS)<-[:MENTIONS]-(:Offer)-[:MENTIONS]->(s2:SKILLS)
WHERE id(s1) < id(s2)
RETURN s1.name as skill1, s2.name as skill2, count(*) as cooccurrence
ORDER BY cooccurrence
DESC LIMIT 5
""")
[Image: top skill co-occurrences]

Conclusion:

In this post, we described how to leverage a transformer-based NER model and spaCy’s relation extraction model to create a knowledge graph with Neo4j. In addition to information extraction, the graph topology can be used as an input to other machine learning models.
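For example, a simple topological feature such as each skill’s mention degree can be exported straight from the graph into a DataFrame and fed to a downstream model (a minimal sketch; the feature choice here is purely illustrative):

# export each skill's degree (number of offers mentioning it) as a feature table
feature_df = neo4j_query("""
MATCH (s:SKILLS)<-[m:MENTIONS]-(:Offer)
RETURN s.name AS skill, count(m) AS degree
ORDER BY degree DESC
""")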

Combining NLP with Neo4j’s graph database will accelerate information discovery in many domains, with notable applications in healthcare and biomedicine.

If you have any questions or want to create custom models for your specific case, leave a note below or send us an email at admin@100.21.53.251.

Follow us on Twitter @UBIAI5

UBIAI