How to Build a Knowledge Graph with Neo4J and Transformers

Nov 21, 2021

In my previous article, “Building a Knowledge Graph for Job Search using BERT Transformer”, we explored how to create a knowledge graph from job descriptions using entities and relations extracted by a custom transformer model. While we were able to get great visuals of our nodes and relations using the Python library networkX, the actual graph lived in Python memory and wasn’t stored in a database. This is problematic when building a scalable application that has to store an ever-growing knowledge graph. This is where Neo4j excels: it lets you store the graph in a fully functional database that can manage large amounts of data. In addition, Neo4j’s Cypher language is rich, easy to use, and very intuitive.


In this article, I will show how to build a knowledge graph from job descriptions using a fine-tuned transformer-based Named Entity Recognition (NER) model and spaCy’s relation extraction model. The method described here can be applied to many other fields, such as biomedicine, finance, and healthcare.


Below are the steps we are going to take:

  • Load our fine-tuned transformer NER and spaCy relation extraction models in Google Colab
  • Create a Neo4j Sandbox and add our entities and relations
  • Query the graph to find the best job matches for a target resume, the most popular skills, and the highest skill co-occurrences


For more information on how to generate training data using UBIAI and how to fine-tune the NER and relation extraction models, check out the articles below:


  1. Introducing UBIAI: Easy-to-Use Text Annotation for NLP Applications
  2. How to Train a Joint Entities and Relation Extraction Classifier using BERT Transformer with spaCy 3
  3. How to Fine-Tune BERT Transformer with spaCy 3


The dataset of job descriptions is publicly available on Kaggle.

At the end of this tutorial, we will be able to create a knowledge graph as shown below.

[Image: the resulting knowledge graph]

And in graph visual:

[Image: the knowledge graph visualization]

Named Entity and Relation Extraction

First, we load the dependencies for the NER and relation extraction models, as well as the NER model itself, which was previously fine-tuned to extract skills, diplomas, diploma majors, and years of experience:

!pip install -U pip setuptools wheel
!python -m spacy project clone tutorials/rel_component
!pip install -U spacy-nightly --pre
!pip install -U spacy transformers

# restart the runtime after installing the dependencies
import spacy
nlp = spacy.load("[PATH_TO_THE_MODEL]/model-best")
Load the jobs dataset from which we want to extract the entities and relations:
import pandas as pd

def get_all_documents():
    # each row of the CSV holds one job description
    df = pd.read_csv("/content/drive/MyDrive/job_DB1_1_29.csv", sep='"', header=None)
    documents = []
    for index, row in df.iterrows():
        documents.append(str(row[0]))
    return documents

documents = get_all_documents()
documents = documents[:]  # optionally slice here to work on a subset
Extract the entities from the jobs dataset:

import hashlib
import re

def extract_ents(documents, nlp):
  docs = list()
  for doc in nlp.pipe(documents, disable=["tagger", "parser"]):
      dictionary = dict.fromkeys(["text", "annotations"])
      dictionary["text"] = str(doc)
      dictionary['text_sha256'] = hashlib.sha256(dictionary["text"].encode('utf-8')).hexdigest()
      annotations = []

      for e in doc.ents:
        ent_id = hashlib.sha256(str(e.text).encode('utf-8')).hexdigest()
        ent = {"start": e.start_char, "end": e.end_char, "label": e.label_, "label_upper": e.label_.upper(), "text": e.text, "id": ent_id}
        if e.label_ == "EXPERIENCE":
          # store the integer number of years (e.g. "5+ years" -> 5)
          match = re.search(r"\d+", e.text)
          if match:
            ent["years"] = int(match.group())

        annotations.append(ent)

      dictionary["annotations"] = annotations
      docs.append(dictionary)
  return docs

parsed_ents = extract_ents(documents, nlp)
We can take a look at some of the extracted entities before we feed them to our relation extraction model:
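For instance, a quick inspection loop like the following (a minimal sketch, not part of the pipeline itself; the document index is arbitrary) prints the (text, label) pairs for one of the parsed documents:

# print the (text, label) pairs extracted from one job description
# parsed_ents[0] is an arbitrary choice for illustration
print([(ann["text"], ann["label_upper"]) for ann in parsed_ents[0]["annotations"]])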

[('stock market analysis', 'SKILLS'),
 ('private investor', 'SKILLS'),
 ('C++', 'SKILLS'),
 ('Investment Software', 'SKILLS'),
 ('MS Windows', 'SKILLS'),
 ('web development', 'SKILLS'),
 ('Computer Science', 'DIPLOMA_MAJOR'),
 ('AI', 'SKILLS'),
 ('software development', 'SKILLS'),
 ('coding', 'SKILLS'),
 ('C', 'SKILLS'),
 ('C++', 'SKILLS'),
 ('Visual Studio', 'SKILLS'),
 ('2 years', 'EXPERIENCE'),
 ('C/C++ development', 'SKILLS'),
 ('data compression', 'SKILLS'),
 ('financial markets', 'SKILLS'),
 ('financial calculation', 'SKILLS'),
 ('GUI design', 'SKILLS'),
 ('Windows development', 'SKILLS'),
 ('MFC', 'SKILLS'),
 ('Win', 'SKILLS'),
 ('HTTP', 'SKILLS'),
 ('TCP/IP', 'SKILLS'),
 ('sockets', 'SKILLS'),
 ('network programming', 'SKILLS'),
 ('System administration', 'SKILLS')]
We are now ready to predict the relations. First, load the relation extraction model; make sure to change the directory to rel_component/scripts to access all the necessary scripts for the relation model.

cd rel_component/

import random
import typer
from pathlib import Path
import spacy
from spacy.tokens import DocBin, Doc
from spacy.training.example import Example

# make the factory work
from rel_pipe import make_relation_extractor, score_relations

# make the config work
from rel_model import create_relation_model, create_classification_layer, create_instances, create_tensors

# load the fine-tuned relation extraction model
nlp2 = spacy.load("/content/drive/MyDrive/training_rel_roberta/model-best")

def extract_relations(documents, nlp, nlp2):
  predicted_rels = list()
  for doc in nlp.pipe(documents, disable=["tagger", "parser"]):
    source_hash = hashlib.sha256(doc.text.encode('utf-8')).hexdigest()
    # run the relation extraction pipeline on the NER-annotated doc
    for name, proc in nlp2.pipeline:
      doc = proc(doc)

    for value, rel_dict in doc._.rel.items():
      for e in doc.ents:
        for b in doc.ents:
          if e.start == value[0] and b.start == value[1]:
            max_key = max(rel_dict, key=rel_dict.get)
            e_id = hashlib.sha256(str(e).encode('utf-8')).hexdigest()
            b_id = hashlib.sha256(str(b).encode('utf-8')).hexdigest()
            # keep only high-confidence predictions
            if rel_dict[max_key] >= 0.9:
              predicted_rels.append({'head': e_id, 'tail': b_id, 'type': max_key, 'source': source_hash})
  return predicted_rels

predicted_rels = extract_relations(documents, nlp, nlp2)
Predicted relations:
entities: ('5+ years', 'software engineering') --> predicted relation: {'DEGREE_IN': 9.5471655e-08, 'EXPERIENCE_IN': 0.9967771}
entities: ('5+ years', 'technical management') --> predicted relation: {'DEGREE_IN': 1.1285037e-07, 'EXPERIENCE_IN': 0.9961034}
entities: ('5+ years', 'designing') --> predicted relation: {'DEGREE_IN': 1.3603304e-08, 'EXPERIENCE_IN': 0.9989103}
entities: ('4+ years', 'performance management') --> predicted relation: {'DEGREE_IN': 6.748373e-08, 'EXPERIENCE_IN': 0.92884386}

Neo4J

We are now ready to load our jobs dataset and the extracted data into the Neo4j database.

documents = get_all_documents()
documents = documents[:]
parsed_ents = extract_ents(documents, nlp)
predicted_rels = extract_relations(documents, nlp, nlp2)

# basic neo4j query function
from neo4j import GraphDatabase
import pandas as pd

host = 'bolt://[your_host_address]'
user = 'neo4j'
password = '[your_password]'
driver = GraphDatabase.driver(host, auth=(user, password))

def neo4j_query(query, params=None):
    with driver.session() as session:
        result = session.run(query, params)
        return pd.DataFrame([r.values() for r in result], columns=result.keys())
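Before loading any data, it may be worth verifying that the driver can actually reach the sandbox; a trivial query through the helper (a quick sanity check, not part of the original walkthrough) suffices:

# should print a one-row DataFrame with column 'ok' = 1 if the connection works
print(neo4j_query("RETURN 1 AS ok"))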
Next, we add the documents, entities, and relations to the knowledge graph. Note that we need to extract the integer number of years from the name of the EXPERIENCE entity and store it as a property.
#clean your current neo4j sandbox db (remove everything)
neo4j_query("""
MATCH (n) DETACH DELETE n;
""")

#Create the main node
neo4j_query("""
MERGE (l:LaborMarket {name:"Labor Market"})
RETURN l
""")

#add entities to KG: skills, experience, diploma, diploma major
neo4j_query("""
MATCH (l:LaborMarket)
UNWIND $data as row
MERGE (o:Offer {id:row.text_sha256})
SET o.text = row.text
MERGE (l)-[:HAS_OFFER]->(o)
WITH o, row.annotations as entities
UNWIND entities as entity
MERGE (e:Entity {id:entity.id})
ON CREATE SET
              e.name = entity.text,
              e.label = entity.label_upper
MERGE (o)-[m:MENTIONS]->(e)
ON CREATE SET m.count = 1
ON MATCH SET m.count = m.count + 1
WITH e
CALL apoc.create.addLabels( id(e), [ e.label ] )
YIELD node
REMOVE node.label
RETURN node
""", {'data': parsed_ents})

#Get the name of each EXPERIENCE entity
res = neo4j_query("""
MATCH (e:EXPERIENCE)
RETURN e.id as id, e.name as name
""")

#Extract the integer number of years from the EXPERIENCE name
import re
def get_years(name):
  return int(re.findall(r"\d+", name)[0])

res["years"] = res.name.map(lambda name: get_years(name))
data = res.to_dict('records')

#Add property 'years' to entity EXPERIENCE
neo4j_query("""
UNWIND $data as row
MATCH (e:EXPERIENCE {id:row.id})
SET e.years = row.years
RETURN e.name as name, e.years as years
""", {"data": data})

#Add relations to KG
neo4j_query("""
UNWIND $data as row
MATCH (source:Entity {id: row.head})
MATCH (target:Entity {id: row.tail})
MATCH (offer:Offer {id: row.source})
MERGE (source)-[:REL]->(r:Relation {type: row.type})-[:REL]->(target)
MERGE (offer)-[:MENTIONS]->(r)
""", {'data': predicted_rels})
Now begins the fun part. The knowledge graph is ready, and we can start running queries. Let’s run a query to find the best job matches for a target profile:

#Show the best match in a table
other_id = "8de6e42ddfbc2a8bd7008d93516c57e50fa815e64e387eb2fc7a27000ae904b6"

query = """
MATCH (o1:Offer {id:$id})-[m1:MENTIONS]->(s:Entity)<-[m2:MENTIONS]-(o2:Offer)
RETURN DISTINCT o1.id as Source, o2.id as Proposed_Offer, count(*) as freq, collect(s.name) as common_terms
ORDER BY freq DESC
LIMIT $limit
"""
res = neo4j_query(query, {"id": other_id, "limit": 3})
res

#In the neo4j browser, use this query to show the graph of the best matched job
"""MATCH (o1:Offer {id:"8de6e42ddfbc2a8bd7008d93516c57e50fa815e64e387eb2fc7a27000ae904b6"})-[m1:MENTIONS]->(s:Entity)<-[m2:MENTIONS]-(o2:Offer)
WITH o1, s, o2, count(*) as freq
MATCH (o1)--(s)
RETURN collect(o2)[0], o1, s, max(freq)"""

Results in tabular form showing the common entities:

[Image: best job matches and common entities in tabular form]

And in graph visual:

[Image: best job matches in graph form]

While this dataset is composed of only 29 job descriptions, the method described here can be applied to large-scale datasets with thousands of jobs. With only a few lines of code, we can instantly extract the best job matches for a target profile.

Let’s find out the most in-demand skills:
query = """
MATCH (s:SKILLS)<-[:MENTIONS]-(o:Offer)
RETURN s.name as skill, count(o) as freq
ORDER BY freq DESC
LIMIT 10
"""
res = neo4j_query(query)
res
 
[Image: the ten most in-demand skills]

And the skills that require the highest years of experience:
query = """
MATCH (s:SKILLS)--(r:Relation)--(e:EXPERIENCE) where r.type = "EXPERIENCE_IN"
return s.name as skill,e.years as years
ORDER BY years DESC
LIMIT 10
"""
res = neo4j_query(query)
res
re
 
[Image: skills requiring the highest years of experience]

Web development and support require the highest years of experience, followed by security setup.

Finally, let’s check which pairs of skills co-occur the most:
neo4j_query("""
MATCH (s1:SKILLS)<-[:MENTIONS]-(:Offer)-[:MENTIONS]->(s2:SKILLS)
WHERE id(s1) < id(s2)
RETURN s1.name as skill1, s2.name as skill2, count(*) as cooccurrence
ORDER BY cooccurrence
DESC LIMIT 5
""")
[Image: top skill co-occurrences]

Conclusion:

In this post, we described how to leverage a transformer-based NER model and spaCy’s relation extraction model to create a knowledge graph with Neo4j. In addition to information extraction, the graph topology can be used as an input to other machine learning models.
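For example, a simple topological feature such as each skill’s mention degree can be exported straight from the graph into a DataFrame and fed to a downstream model (a minimal sketch; the feature choice here is purely illustrative):

# export each skill's degree (number of offers mentioning it) as a feature table
feature_df = neo4j_query("""
MATCH (s:SKILLS)<-[m:MENTIONS]-(:Offer)
RETURN s.name AS skill, count(m) AS degree
ORDER BY degree DESC
""")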

Combining NLP with Neo4j’s graph database will accelerate information discovery in many domains, with notable applications in healthcare and biomedicine.

If you have any questions or want to create custom models for your specific case, leave a note below or send us an email at admin@100.21.53.251.

Follow us on Twitter @UBIAI5

UBIAI