How to Build a Knowledge Graph with Neo4J and Transformers

Nov 21, 2021

In my previous article, “Building a Knowledge Graph for Job Search Using BERT Transformer”, we explored how to create a knowledge graph from job descriptions using entities and relations extracted by a custom transformer model. While we were able to get great visuals of our nodes and relations using the Python library networkX, the actual graph lived in Python memory and wasn’t stored in a database. This becomes a problem when you are building a scalable application that has to store an ever-growing knowledge graph. This is where Neo4j excels: it stores the graph in a fully functional database that can manage large amounts of data. In addition, Neo4j’s Cypher query language is rich, easy to use and very intuitive.

In this article, I will show how to build a knowledge graph from job descriptions using a fine-tuned transformer-based Named Entity Recognition (NER) model and spaCy’s relation extraction model. The method described here can be applied to many other fields, such as biomedical, finance and healthcare.

Below are the steps we are going to take:

  • Load our fine-tuned transformer NER model and spaCy relation extraction model in Google Colab
  • Create a Neo4j Sandbox and add our entities and relations
  • Query our graph to find the highest job match for a target resume, the three most popular skills and the highest skill co-occurrences

For more information on how to generate training data using UBIAI and how to fine-tune the NER and relation extraction models, check out the articles below:

  1. Introducing UBIAI: Easy-to-Use Text Annotation for NLP Applications
  2. How to Train a Joint Entities and Relation Extraction Classifier using BERT Transformer with spaCy 3
  3. How to Fine-Tune BERT Transformer with spaCy 3

The dataset of job descriptions is publicly available on Kaggle.

At the end of this tutorial, we will be able to create a knowledge graph as shown below.


And in graph visual:


Named Entity and Relation Extraction

First, we load the dependencies for the NER and relation models, as well as the NER model itself, which was previously fine-tuned to extract skills, diploma, diploma major and years of experience:

!pip install -U pip setuptools wheel
!pip install -U spacy-nightly --pre
!pip install -U spacy transformers
!python -m spacy project clone tutorials/rel_component

#restart the runtime after installation of deps
import spacy

nlp = spacy.load("[PATH_TO_THE_MODEL]/model-best")
Load the jobs dataset from which we want to extract the entities and relations:
import pandas as pd

def get_all_documents():
    df = pd.read_csv("/content/drive/MyDrive/job_DB1_1_29.csv", sep='"', header=None)
    documents = []
    for index, row in df.iterrows():
        # each row holds one job description in the first column
        documents.append(str(row[0]))
    return documents

documents = get_all_documents()
documents = documents[:]
  • Extract entities from the jobs dataset:
import hashlib

def extract_ents(documents, nlp):
  docs = list()
  for doc in nlp.pipe(documents, disable=["tagger", "parser"]):
      dictionary = dict.fromkeys(["text", "annotations"])
      dictionary["text"] = str(doc)
      dictionary['text_sha256'] = hashlib.sha256(dictionary["text"].encode('utf-8')).hexdigest()
      annotations = []

      for e in doc.ents:
        ent_id = hashlib.sha256(str(e.text).encode('utf-8')).hexdigest()
        ent = {"start": e.start_char, "end": e.end_char, "label": e.label_, "label_upper": e.label_.upper(), "text": e.text, "id": ent_id}
        if e.label_ == "EXPERIENCE":
          #store the leading digit of e.g. "2 years" as an integer
          ent["years"] = int(e.text[0])
        annotations.append(ent)

      dictionary["annotations"] = annotations
      docs.append(dictionary)
  return docs

parsed_ents = extract_ents(documents, nlp)
We can take a look at some of the extracted entities before we feed them to our relation extraction model:

[('stock market analysis', 'SKILLS'),
 ('private investor', 'SKILLS'),
 ('C++', 'SKILLS'),
 ('Investment Software', 'SKILLS'),
 ('MS Windows', 'SKILLS'),
 ('web development', 'SKILLS'),
 ('Computer Science', 'DIPLOMA_MAJOR'),
 ('AI', 'SKILLS'),
 ('software development', 'SKILLS'),
 ('coding', 'SKILLS'),
 ('C', 'SKILLS'),
 ('C++', 'SKILLS'),
 ('Visual Studio', 'SKILLS'),
 ('2 years', 'EXPERIENCE'),
 ('C/C++ development', 'SKILLS'),
 ('data compression', 'SKILLS'),
 ('financial markets', 'SKILLS'),
 ('financial calculation', 'SKILLS'),
 ('GUI design', 'SKILLS'),
 ('Windows development', 'SKILLS'),
 ('MFC', 'SKILLS'),
 ('Win', 'SKILLS'),
 ('HTTP', 'SKILLS'),
 ('TCP/IP', 'SKILLS'),
 ('sockets', 'SKILLS'),
 ('network programming', 'SKILLS'),
 ('System administration', 'SKILLS')]
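Notice that each entity ID in the extraction code is a deterministic SHA-256 hash of the entity text, so the same skill mentioned in different job descriptions resolves to the same graph node. A self-contained sketch of that idea:

```python
import hashlib

def ent_id(text):
    # SHA-256 of the entity text: stable across documents and runs
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# the same surface form always maps to the same node id
assert ent_id("C++") == ent_id("C++")
# no normalization is applied here, so casing variants get distinct ids
assert ent_id("Python") != ent_id("python")
```

This is what lets the MERGE statements later in the tutorial deduplicate entities across offers without any extra bookkeeping.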
We are now ready to predict the relations. First, load the relation extraction model, and make sure to change the directory to rel_component/scripts to access all the necessary scripts for the relation model.

cd rel_component/
import random
import typer
from pathlib import Path
import spacy
from spacy.tokens import DocBin, Doc
from spacy.training import Example

# make the factory work
from rel_pipe import make_relation_extractor, score_relations

# make the config work
from rel_model import create_relation_model, create_classification_layer, create_instances, create_tensors

#restart the runtime after installation of deps
nlp2 = spacy.load("/content/drive/MyDrive/training_rel_roberta/model-best")

def extract_relations(documents,nlp,nlp2):
  predicted_rels = list()
  for doc in nlp.pipe(documents, disable=["tagger", "parser"]):
    source_hash = hashlib.sha256(doc.text.encode('utf-8')).hexdigest()
    for name, proc in nlp2.pipeline:
          doc = proc(doc)

    for value, rel_dict in doc._.rel.items():
      for e in doc.ents:
        for b in doc.ents:
          if e.start == value[0] and b.start == value[1]:
            max_key = max(rel_dict, key=rel_dict.get)
            e_id = hashlib.sha256(str(e).encode('utf-8')).hexdigest()
            b_id = hashlib.sha256(str(b).encode('utf-8')).hexdigest()
            if rel_dict[max_key] >= 0.9:
              #print(f" entities: {e.text, b.text} --> predicted relation: {rel_dict}")
              predicted_rels.append({'head': e_id, 'tail': b_id, 'type': max_key, 'source': source_hash})
  return predicted_rels

predicted_rels = extract_relations(documents,nlp,nlp2)
Predicted relations:
 entities: ('5+ years', 'software engineering') --> predicted relation: {'DEGREE_IN': 9.5471655e-08, 'EXPERIENCE_IN': 0.9967771}
 entities: ('5+ years', 'technical management') --> predicted relation: {'DEGREE_IN': 1.1285037e-07, 'EXPERIENCE_IN': 0.9961034}
 entities: ('5+ years', 'designing') --> predicted relation: {'DEGREE_IN': 1.3603304e-08, 'EXPERIENCE_IN': 0.9989103}
 entities: ('4+ years', 'performance management') --> predicted relation: {'DEGREE_IN': 6.748373e-08, 'EXPERIENCE_IN': 0.92884386}
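The extraction loop keeps only the most probable relation label per entity pair, and only when its score clears 0.9. That selection logic can be isolated in a small helper (the score dictionaries below are illustrative, mirroring the printed output):

```python
def best_relation(rel_dict, threshold=0.9):
    # keep the label with the highest predicted probability...
    max_key = max(rel_dict, key=rel_dict.get)
    # ...but only when it clears the confidence threshold
    return max_key if rel_dict[max_key] >= threshold else None

print(best_relation({"DEGREE_IN": 9.5471655e-08, "EXPERIENCE_IN": 0.9967771}))  # EXPERIENCE_IN
print(best_relation({"DEGREE_IN": 0.40, "EXPERIENCE_IN": 0.55}))                # None
```

Raising or lowering the threshold trades precision against recall: 0.9 keeps the graph clean at the cost of dropping borderline predictions.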



We are now ready to load our jobs dataset and the extracted data into the Neo4j database.

documents = get_all_documents()
documents = documents[:]
parsed_ents = extract_ents(documents,nlp)
predicted_rels = extract_relations(documents,nlp,nlp2)

#basic neo4j query function
from neo4j import GraphDatabase
import pandas as pd

host = 'bolt://[your_host_address]'
user = 'neo4j'
password = '[your_password]'
driver = GraphDatabase.driver(host,auth=(user, password))

def neo4j_query(query, params=None):
    with driver.session() as session:
        result =, params)
        return pd.DataFrame([r.values() for r in result], columns=result.keys())
Next, we add the documents, entities and relations to the knowledge graph. Note that we extract the integer number of years from the name of the EXPERIENCE entity and store it as a property.
#clean your current neo4j sandbox db (remove everything)
neo4j_query("""
MATCH (n) DETACH DELETE n
""")

#Create a first main node
neo4j_query("""
MERGE (l:LaborMarket {name:"Labor Market"})
RETURN l
""")

#add entities to KG: skills, experience, diploma, diploma major
neo4j_query("""
MATCH (l:LaborMarket)
UNWIND $data as row
MERGE (o:Offer {id: row.text_sha256})
SET o.text = row.text
MERGE (l)-[:HAS_OFFER]->(o)
WITH o, row.annotations as entities
UNWIND entities as entity
MERGE (e:Entity {id:})
ON CREATE SET = entity.text,
              e.label = entity.label_upper
MERGE (o)-[m:MENTIONS]->(e)
ON CREATE SET m.count = 1
ON MATCH SET m.count = m.count + 1
WITH e as e
CALL apoc.create.addLabels( id(e), [ e.label ] )
YIELD node
REMOVE node.label
""", {'data': parsed_ents})

#Get the ids and names of the EXPERIENCE entities
res = neo4j_query("""
MATCH (e:EXPERIENCE)
RETURN as id, as name
""")

# Extract the number of years from the EXPERIENCE name and store it in the property years
import re
def get_years(name):
  return re.findall(r"\d+", name)[0]

res["years"] = res["name"].map(lambda name: get_years(name))
data = res.to_dict('records')

#Add property 'years' to entity EXPERIENCE
neo4j_query("""
UNWIND $data as row
MATCH (e:EXPERIENCE {id:})
SET e.years = row.years
RETURN as name, e.years as years
""", {'data': data})
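The year-extraction helper relies on the \d+ regex pulling the first run of digits out of the entity name. A quick check of that behavior on a few invented examples:

```python
import re

def get_years(name):
    # first run of digits in the entity text, e.g. "5+ years" -> "5"
    return re.findall(r"\d+", name)[0]

print(get_years("5+ years"))                 # 5
print(get_years("2 years"))                  # 2
print(get_years("10+ years of experience"))  # 10
```

Note it returns a string; Cypher comparisons on years work either way, but you could wrap it in int() if you want numeric ordering guarantees.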

#Add relations to KG
neo4j_query("""
UNWIND $data as row
MATCH (source:Entity {id: row.head})
MATCH (target:Entity {id: row.tail})
MATCH (offer:Offer {id: row.source})
MERGE (source)-[:REL]->(r:Relation {type: row.type})-[:REL]->(target)
MERGE (offer)-[:MENTIONS]->(r)
""", {'data': predicted_rels})
Now begins the fun part. We are ready to launch the knowledge graph and run queries. Let’s run a query to find the best job match to a target profile:

#Show the best match in a table
other_id = "8de6e42ddfbc2a8bd7008d93516c57e50fa815e64e387eb2fc7a27000ae904b6"

query = """
MATCH (o1:Offer {id:$id})-[m1:MENTIONS]->(s:Entity)<- [m2:MENTIONS]-(o2:Offer)
RETURN DISTINCT as Source, as Proposed_Offer, count(*) as freq, collect( as common_terms
res = neo4j_query(query,{"id":other_id,"limit":3})

#In the neo4j browser, use this query to show the graph of the best matched job
"""MATCH (o1:Offer {id:"8de6e42ddfbc2a8bd7008d93516c57e50fa815e64e387eb2fc7a27000ae904b6"})-[m1:MENTIONS]->(s:Entity)<-[m2:MENTIONS]-(o2:Offer)
WITH o1, s, o2, count(*) as freq
MATCH (o1)--(s)
RETURN collect(o2)[0], o1, s, max(freq)"""
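What the Cypher match does, counting shared MENTIONS entities between the target offer and every other offer, can be mirrored in plain Python over sets of entity ids. A small sketch with invented offers:

```python
from collections import Counter

# invented offers mapped to the entity ids they mention
offers = {
    "offer_a": {"c++", "python", "tcp/ip"},
    "offer_b": {"python", "sql"},
    "offer_c": {"c++", "python", "sockets", "tcp/ip"},
}

def best_matches(target, offers):
    # rank the other offers by how many mentioned entities they share with the target
    freq = Counter({name: len(offers[target] & ents)
                    for name, ents in offers.items() if name != target})
    return freq.most_common()

print(best_matches("offer_a", offers))  # [('offer_c', 3), ('offer_b', 1)]
```

The database version scales far better, of course: Neo4j only traverses offers connected through shared entities instead of comparing every pair.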

Results in tabular form showing the common entities:


And in graph visual:




While this dataset was composed of only 29 job descriptions, the method described here can be applied to large-scale datasets with thousands of jobs. With only a few lines of code, we can extract the highest job match for a target profile instantaneously.

Let’s find out the most in-demand skills:



query = """
RETURN as skill, count(o) as freq
res = neo4j_query(query)

And the skills that require the highest years of experience:


query = """
MATCH (s:SKILLS)--(r:Relation)--(e:EXPERIENCE) where r.type = "EXPERIENCE_IN"
return as skill,e.years as years
res = neo4j_query(query)

Web development and support require the highest years of experience, followed by security setup.

Finally, let’s check which pairs of skills co-occur the most:


query = """
MATCH (s1:SKILLS)<-[:MENTIONS]-(o:Offer)-[:MENTIONS]->(s2:SKILLS)
WHERE id(s1) < id(s2)
RETURN as skill1, as skill2, count(*) as cooccurrence
ORDER BY cooccurrence DESC
"""
res = neo4j_query(query)
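The same co-occurrence count can be reproduced in Python with itertools.combinations over each offer's skill set (sample skill sets invented for illustration):

```python
from collections import Counter
from itertools import combinations

# invented skill sets, one per job offer
offers = [
    {"C++", "C", "Visual Studio"},
    {"C++", "C", "GUI design"},
    {"C++", "Visual Studio"},
]

pair_counts = Counter()
for skills in offers:
    # count every unordered skill pair once per offer
    for s1, s2 in combinations(sorted(skills), 2):
        pair_counts[(s1, s2)] += 1

# ('C', 'C++') and ('C++', 'Visual Studio') each co-occur in two offers
print(pair_counts.most_common(2))
```

Sorting each skill set before pairing plays the same role as the WHERE id(s1) < id(s2) clause: it ensures each pair is counted once, in a canonical order.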


In this post, we described how to leverage transformer-based NER and spaCy’s relation extraction models to create a knowledge graph with Neo4j. In addition to information extraction, the graph topology can be used as an input to other machine learning models.

Combining NLP with Neo4j’s graph database is going to accelerate information discovery in many domains, with notable applications in healthcare and biomedicine.

If you have any questions or want to create custom models for your specific case, leave a note below or send us an email at admin@

Follow us on Twitter @UBIAI5