If you are considering natural language processing (NLP), spaCy is a go-to solution. This free and open-source library offers extensive built-in capabilities and has become increasingly popular for processing and analyzing data in the NLP domain.
Through this article you will learn:
Getting started
What is spaCy
Installing spaCy
Statistical models
Read a text
Sentence boundary detection using spaCy
Tokenization using spaCy
Stop words removal using spaCy
Named entity recognition using spaCy
Sentence similarity using spaCy
LLM integration using spaCy
Integrating a generative pre-trained transformer model from OpenAI
Named entity recognition using an open-source model from Hugging Face
Entity linking using spaCy
Relation extraction using spaCy
Entity resolution using spaCy
spaCy is an open-source natural language processing (NLP) library for Python. It is designed specifically for tasks related to processing and analyzing human language, such as part-of-speech tagging, named entity recognition, and syntactic parsing. spaCy is built with a focus on efficiency, speed, and ease of use.
You can install the library with pip from your command line, or directly in a notebook:
import sys
!{sys.executable} -m pip install spacy
!{sys.executable} -m spacy download en_core_web_sm
spaCy provides statistical models tailored for various languages, available as separate Python modules for installation. These models serve as robust engines within spaCy, proficient in executing multiple NLP tasks like part-of-speech tagging, named entity recognition, and dependency parsing.
To obtain these models specifically designed for the English language, execute the following code:
!python3 -m spacy download en_core_web_sm
!python3 -m spacy download en_core_web_lg
To process a given input string using spaCy and access its linguistic annotations, you start by loading the English language model ‘en_core_web_sm’. Then you can use the following snippet to read the text ‘UBIAI is cool’:
import spacy
nlp = spacy.load('en_core_web_sm')
text = 'UBIAI is cool'
doc = nlp(text)
Sentence Boundary Detection (SBD) is a natural language processing (NLP) task that involves identifying the boundaries of sentences in a given text. The goal is to determine where each sentence begins and ends within the text. Accurate sentence boundary detection is crucial for various NLP applications, such as part-of-speech tagging, named entity recognition, and syntactic analysis, as these tasks often rely on sentences as basic processing units.
In this example, spaCy processes the input text “UBIAI is cool. This article is about Spacy.”, then the doc.sents property is used to iterate over the identified sentences, printing each sentence separately.
import spacy
# Load the English language model
nlp = spacy.load('en_core_web_sm')
# Process an input text
doc = nlp("UBIAI is cool. This article is about Spacy.")
# Extract sentences using the sents property
sentences = list(doc.sents)
# Print each sentence on its own line
for i, sentence in enumerate(sentences):
    print(i, sentence)
As output we get:
0 UBIAI is cool.
1 This article is about Spacy.
Tokenization involves breaking down input text into individual units, referred to as tokens, which can include words, punctuation marks, and spaces.
spaCy provides various attributes on the Token class that expose detailed information about each token, for example text, is_alpha, is_punct, is_space, lemma_, and pos_:
import spacy
# Load the English language model
nlp = spacy.load("en_core_web_sm")
# Process an input text
doc = nlp("UBIAI is cool.")
# Iterate over the tokens
for token in doc:
    # Print token attributes
    print(token, token.is_alpha, token.is_punct, token.is_space)
As a result we get:
UBIAI True False False
is True False False
cool True False False
. False True False
Stop words removal is a text preprocessing technique that involves eliminating common words that are considered to be of little value in helping to understand the meaning of a text. These words, known as stop words, typically include common words such as “the,” “and,” “is,” and “in,” which appear frequently across various texts and do not contribute significant semantic meaning.
import spacy
nlp = spacy.load("en_core_web_sm")
# Process an input text
doc = nlp("UBIAI is cool.")
# Print only the tokens that are not stop words
for token in doc:
    if not token.is_stop:
        print(token)
As a result we get (“is” is filtered out as a stop word):
UBIAI
cool
.
Named Entity Recognition (NER) is a natural language processing (NLP) task that involves identifying and classifying entities (such as names of people, organizations, locations, dates, and more) in a given text. The goal of NER is to extract structured information from unstructured text and identify specific entities within the text.
spaCy can identify named entities within a document by asking the model for predictions. However, it’s essential to note that a model’s accuracy hinges on the examples it was trained on. Named Entity Recognition (NER) might not always yield perfect results, and some fine-tuning may be necessary for your specific use case.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Larry Page founded Google")
# Text and label of each named entity span
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Larry Page', 'PERSON'), ('Google', 'ORG')]
Similarity is established by comparing word vectors, also known as “word embeddings,” which are multi-dimensional representations of word meanings. spaCy seamlessly incorporates dense, real-valued vectors that capture distributional similarity information, enabling efficient analysis of semantic relationships between words.
import spacy
nlp = spacy.load("en_core_web_lg")
doc = nlp("dog cat banana afskfsd")
for token in doc:
    # Print the token text, whether the token has a vector,
    # the vector's norm, and whether the token is out-of-vocabulary
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
Large Language Models (LLMs) showcase robust natural language understanding capabilities. With minimal examples, and sometimes even none, an LLM can be directed to execute tailored Natural Language Processing (NLP) tasks, encompassing text classification, named entity recognition, and beyond.
Combining extensive language models with spaCy is achievable through spacy-llm. This integration allows you to enjoy the advantages of both approaches. You can efficiently set up a pipeline with components driven by LLM prompts and seamlessly incorporate components using alternative methods. As your project evolves, you have the flexibility to substitute specific LLM-powered components or transition entirely according to your needs.
To do so, start by installing the spacy-llm library:
python -m pip install spacy-llm
To leverage OpenAI’s models, first create an account and generate a new API key, which grants you access to the model; spacy-llm reads the key from the OPENAI_API_KEY environment variable. For further details, consult the OpenAI website.
Once that is done, create a config.cfg file that will serve as your configuration file:
[nlp]
# Specify the Language of the LLM => english
lang = "en"
# Specify the pipeline you want => a Large language model
pipeline = ["llm"]
[components]
[components.llm]
factory = "llm"
[components.llm.task]
@llm_tasks = "spacy.TextCat.v2"
labels = ["COMPLIMENT", "INSULT"]
[components.llm.model]
# Then specify the model => GPT3.5
@llm_models = "spacy.GPT-3-5.v1"
# Then specify the model’s configuration => GPT3.5
config = {"temperature": 0.0}
Then run your script:
from spacy_llm.util import assemble
# Assemble the config file
nlp = assemble("config.cfg")
# Prompt it
doc = nlp("You look gorgeous!")
print(doc.cats)
Other models are available for use. For instance, you can run named entity recognition with an open-source model from Hugging Face. Create a config.cfg file that will serve as your configuration file:
[nlp]
# The language of the LLM and the pipeline
lang = "en"
pipeline = ["llm"]
[components]
[components.llm]
factory = "llm"
[components.llm.task]
# The component you want to work with
@llm_tasks = "spacy.NER.v3"
# Specify your NER labels
labels = ["PERSON", "ORGANISATION", "LOCATION"]
[components.llm.model]
# One example open-source model pulled from Hugging Face
@llm_models = "spacy.Dolly.v1"
name = "dolly-v2-3b"
Then run your script:
from spacy_llm.util import assemble
nlp = assemble("config.cfg")
doc = nlp("Jack and Jill rode up the hill in Les Deux Alpes")
print([(ent.text, ent.label_) for ent in doc.ents])
An entity linker is a component in Natural Language Processing (NLP) systems that aims to link or associate named entities mentioned in a text with unique identifiers in a knowledge base. The goal is to ground these entities in a specific context or real-world knowledge.
Entity linking in spaCy involves connecting named entities identified in a text to unique identifiers or entries in a knowledge base. This connection grounds the identified entities in a specific context, providing additional information and facilitating a deeper understanding of the text. spaCy facilitates entity linking through its EntityLinker component, which can be added to the pipeline alongside a knowledge base that stores candidate entities and their identifiers.
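Assuming spaCy v3.5+, the knowledge-base side of this process can be sketched with the InMemoryLookupKB class. This covers only the lookup half of entity linking: a trained EntityLinker component would still be needed to pick the right candidate in context. The identifier Q95 and the toy vectors below are illustrative:

```python
import spacy
from spacy.kb import InMemoryLookupKB

nlp = spacy.blank("en")

# Build a tiny in-memory knowledge base; entity_vector_length is the
# size of the entity embeddings (3 here only for the sketch)
kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=3)

# Register an entity under a unique identifier
kb.add_entity(entity="Q95", freq=100, entity_vector=[0.1, 0.2, 0.3])

# Register an alias: the surface form "Google" may refer to Q95
kb.add_alias(alias="Google", entities=["Q95"], probabilities=[1.0])

# Inspect what the knowledge base now contains
print(kb.get_entity_strings())  # registered entity IDs
print(kb.get_alias_strings())   # known surface forms
print(len(kb.get_alias_candidates("Google")))  # number of candidates
```

A real pipeline would populate the knowledge base from an external source (such as Wikidata) and train the EntityLinker to disambiguate between candidates.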
Relation extraction with spaCy involves the identification and classification of relationships between entities mentioned in a text. spaCy’s capabilities in relation extraction are often harnessed through custom rule-based approaches, machine learning models, or a combination of both. Users can employ spaCy’s Matcher or DependencyMatcher to create rules that capture specific syntactic or semantic patterns indicative of relationships between entities.
Alternatively, machine learning models like spaCy’s Named Entity Recognition (NER) can be fine-tuned or extended to predict relations between entities based on labeled training data. By leveraging the linguistic and contextual insights provided by spaCy, relation extraction becomes a crucial component in understanding the connections and associations between entities within a given text.
This capability finds applications in various domains, including information retrieval, knowledge graph construction, and improving the overall depth of natural language understanding in diverse NLP applications.
For the given string “UBIAI offers intelligent text Annotation services”, a potential knowledge graph would link the entity “UBIAI” through an “offers” relation to “intelligent text Annotation services”.
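As a sketch of the rule-based approach, a Matcher pattern can capture that “offers” relation and split the match into a subject and an object. The pattern and trigger word below are illustrative choices, and a blank pipeline is used so no model download is needed:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One word, the trigger "offers", then one or more words
pattern = [
    {"IS_ALPHA": True},
    {"LOWER": "offers"},
    {"IS_ALPHA": True, "OP": "+"},
]
# greedy="LONGEST" keeps only the longest of the overlapping matches
matcher.add("OFFERS", [pattern], greedy="LONGEST")

doc = nlp("UBIAI offers intelligent text Annotation services")
for match_id, start, end in matcher(doc):
    # Split the matched span at the trigger word into subject and object
    trigger = next(t for t in doc[start:end] if t.lower_ == "offers")
    subject = doc[start:trigger.i].text
    obj = doc[trigger.i + 1:end].text
    print((subject, "offers", obj))
    # ('UBIAI', 'offers', 'intelligent text Annotation services')
```

The extracted (subject, relation, object) triples are exactly the edges needed to build a knowledge graph like the one described above.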
Entity resolution in spaCy involves the process of identifying and consolidating references to the same real-world entities within a given text. This is particularly useful when dealing with variations, synonyms, or ambiguous mentions of entities. While spaCy does not have a dedicated entity resolution component, its capabilities, such as Named Entity Recognition (NER) and linguistic features, can be leveraged for entity resolution tasks.
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Apple Inc. is a tech giant. The company is headquartered in Cupertino."
# Apply NER to identify entities
doc = nlp(text)
# Custom rules or heuristics for entity resolution
entity_mapping = {
    "tech giant": "ORG",
    "company": "ORG",
    "Cupertino": "GPE"
}
# Resolve entities based on custom rules
resolved_entities = [entity_mapping.get(ent.text, ent.text) for ent in doc.ents]
print(resolved_entities)
As an output you’ll get:
In conclusion, spaCy stands out as a versatile and widely adopted tool in the field of Natural Language Processing (NLP). Its comprehensive capabilities, ranging from tokenization and part-of-speech tagging to named entity recognition and dependency parsing, make it a go-to library for a diverse array of NLP tasks.