
Getting Started with Basic Text Processing using SpaCy Models

Dec 14th, 2023

If you are considering natural language processing (NLP), spaCy is a go-to solution. This free and open-source library offers extensive built-in capabilities, making it increasingly popular for processing and analyzing text data in the NLP domain.

Through this article you will learn: 

  1. Getting started

    • What is spaCy

    • Installing SpaCy

    • Statistical Models 

    • Read a Text

  2. Sentence boundary Detection using SpaCy

  3. Tokenization using SpaCy

  4. Stop words removal using SpaCy

  5. Named Entity Recognition using SpaCy

  6. Sentences similarity using SpaCy

  7. LLMs integration using SpaCy

    • Integrating a generative pre-trained transformer model from OpenAI

    • Named entity recognition using an open-source model from Hugging Face

  8. Entity linking using SpaCy

  9. Relation extraction using SpaCy

  10. Entity resolution using SpaCy

1- Getting started:

A- What is spaCy:

spaCy is an open-source natural language processing (NLP) library for Python. It is designed specifically for tasks related to processing and analyzing human language, such as part-of-speech tagging, named entity recognition, and syntactic parsing. spaCy is built with a focus on efficiency, speed, and ease of use.

B- Installing SpaCy:

You can install the library with pip from your command line or directly in a notebook:

import sys
!{sys.executable} -m pip install spacy
!{sys.executable} -m spacy download en_core_web_sm

Note that the 'en' shortcut used by older tutorials was deprecated in spaCy 3; the full model name 'en_core_web_sm' is used instead.


C- Statistical Models :

spaCy provides statistical models tailored for various languages, available as separate Python modules for installation. These models serve as robust engines within spaCy, proficient in executing multiple NLP tasks like part-of-speech tagging, named entity recognition, and dependency parsing.

 

To obtain these models specifically designed for the English language, execute the following code:

!python3 -m spacy download en_core_web_sm
!python3 -m spacy download en_core_web_lg

D- Read a Text :

To process an input string with spaCy and access its linguistic annotations, you start by loading the English language model 'en_core_web_sm'. Then you can use the following snippet to process the text 'UBIAI is cool':

import spacy
load_model = spacy.load('en_core_web_sm')
text = 'UBIAI is cool'
nlp = load_model(text)

2- Sentence Boundary Detection using SpaCy :

Sentence Boundary Detection (SBD) is a natural language processing (NLP) task that involves identifying the boundaries of sentences in a given text. The goal is to determine where each sentence begins and ends within the text. Accurate sentence boundary detection is crucial for various NLP applications, such as part-of-speech tagging, named entity recognition, and syntactic analysis, as these tasks often rely on sentences as basic processing units.

In this example, spaCy processes the input text "UBIAI is cool. This article is about Spacy.", then the sents property is used to iterate over the identified sentences, printing each one separately.

import spacy
# Load the English language model
load_model = spacy.load('en_core_web_sm')
# Create a Doc object from an input text
nlp = load_model("UBIAI is cool. This article is about Spacy.")
# Extract sentences using the sents property
sentences = list(nlp.sents)
# Print each sentence on its own line
for i, sentence in enumerate(sentences):
    print(i, sentence)

As output we get:

0 UBIAI is cool.
1 This article is about Spacy.

3- Tokenization using SpaCy :

Tokenization involves breaking down input text into individual units, referred to as tokens, which can include words, punctuation marks, and spaces.

spaCy provides various attributes on the Token class that give more detailed information about each token. Some of these attributes are:

  • token.is_alpha: detects whether the token consists of alphabetic characters.
  • token.is_punct: detects whether the token is a punctuation symbol.
  • token.is_space: detects whether the token is whitespace.

import spacy
# Load the English language model
load_model = spacy.load("en_core_web_sm")
# Create a Doc object from an input text
nlp = load_model("UBIAI is cool.")
# Iterate over the tokens and print each token's attributes
for token in nlp:
    print(token, token.is_alpha, token.is_punct, token.is_space)

As a result we get:

UBIAI True False False
is True False False
cool True False False
. False True False

4- Stop words removal using SpaCy :

Stop words removal is a text preprocessing technique that involves eliminating common words that are considered to be of little value in helping to understand the meaning of a text. These words, known as stop words, typically include common words such as “the,” “and,” “is,” and “in,” which appear frequently across various texts and do not contribute significant semantic meaning.

import spacy
load_model = spacy.load("en_core_web_sm")

# Create a Doc object
nlp = load_model("UBIAI is cool.")

# Print only the tokens that are not stop words
for token in nlp:
    if not token.is_stop:
        print(token)

As a result we get:

UBIAI
cool
.

5- Named Entity Recognition using SpaCy :

Named Entity Recognition (NER) is a natural language processing (NLP) task that involves identifying and classifying entities (such as names of people, organizations, locations, dates, and more) in a given text. The goal of NER is to extract structured information from unstructured text and identify specific entities within the text.

spaCy can identify named entities within a document by soliciting predictions from the model. However, it’s essential to note that the effectiveness of the models hinges on the examples they were trained on. Named Entity Recognition (NER) might not always yield perfect results, and adjustments to the tuning may be necessary based on your specific use case.

import spacy

load_model = spacy.load("en_core_web_sm")
doc = load_model("Larry Page founded Google")
# Print the text and label of each named entity span
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Larry Page', 'PERSON'), ('Google', 'ORG')]

As output we get:

[('Larry Page', 'PERSON'), ('Google', 'ORG')]

6- Sentences similarity using SpaCy :

Similarity is established by comparing word vectors, also known as “word embeddings,” which are multi-dimensional representations of word meanings. spaCy seamlessly incorporates dense, real-valued vectors that capture distributional similarity information, enabling efficient analysis of semantic relationships between words.

import spacy
load_model = spacy.load("en_core_web_lg")

nlp = load_model("dog cat banana afskfsd")

for token in nlp:
    # Print the token text, whether it has a vector, the vector's norm,
    # and whether the token is out-of-vocabulary
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)



  • token.text: Prints the text of the token.
  • token.has_vector: A boolean indicating whether the token has a vector representation in the model’s vocabulary. If True, the token has a vector; if False, it does not.
  • token.vector_norm: The Euclidean norm (magnitude) of the token’s vector representation. This can provide a sense of the length or “strength” of the vector.
  • token.is_oov: A boolean indicating whether the token is out-of-vocabulary (OOV), i.e., whether it is not recognized by the spaCy model. If True, the token is out-of-vocabulary; if False, it is part of the vocabulary.
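The similarity scores spaCy computes between Doc, Span, or Token objects (e.g. via doc1.similarity(doc2), with a model that ships word vectors such as en_core_web_lg) are, by default, cosine similarities. As a minimal sketch of what happens under the hood, here is the cosine computed directly with NumPy; the 3-dimensional vectors are made up for illustration (real spaCy vectors in en_core_web_lg have 300 dimensions):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0 = orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy "embeddings" (illustrative only)
dog = np.array([1.0, 0.9, 0.1])
cat = np.array([0.9, 1.0, 0.2])
banana = np.array([0.1, 0.2, 1.0])

print(cosine_similarity(dog, cat))     # high: similar meanings
print(cosine_similarity(dog, banana))  # low: unrelated meanings
```

With real vectors loaded, load_model("dog").similarity(load_model("cat")) performs essentially this computation on the averaged token vectors.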

7- LLMs integration using SpaCy :

Large Language Models (LLMs) showcase robust natural language understanding capabilities. With minimal examples, and sometimes even none, an LLM can be directed to execute tailored Natural Language Processing (NLP) tasks, encompassing text classification, named entity recognition, and beyond.

 

Combining extensive language models with spaCy is achievable through spacy-llm. This integration allows you to enjoy the advantages of both approaches. You can efficiently set up a pipeline with components driven by LLM prompts and seamlessly incorporate components using alternative methods. As your project evolves, you have the flexibility to substitute specific LLM-powered components or transition entirely according to your needs.

 

To do so, start by installing the spacy-llm library:

python -m pip install spacy-llm

A- Integrating a generative pre-trained transformer model from OpenAI:

To leverage OpenAI’s models, the first step is to create an account and generate a new API key, which grants you access to the model; spacy-llm reads the key from the OPENAI_API_KEY environment variable. For further details, consult the OpenAI website.

Once that is done, create a config.cfg file that will serve as your configuration file.

				
[nlp]
# Language of the pipeline => English
lang = "en"
# The pipeline contains a single LLM component
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
# Text classification task with two labels
@llm_tasks = "spacy.TextCat.v2"
labels = ["COMPLIMENT", "INSULT"]

[components.llm.model]
# Specify the model => GPT-3.5
@llm_models = "spacy.GPT-3-5.v1"
# Model configuration: temperature 0 for deterministic output
config = {"temperature": 0.0}

Then run your script:

from spacy_llm.util import assemble
# Assemble the pipeline from the config file
nlp = assemble("config.cfg")
# Run the pipeline on an input text
doc = nlp("You look gorgeous!")
print(doc.cats)

 

Other models are available for use:

  • spacy.GPT-4.v2: the GPT-4 model family
  • spacy.Text-Davinci.v2: the Text Davinci model family
  • spacy.Text-Curie.v2: the Text Curie model family
  • spacy.Azure.v1: Azure’s OpenAI models
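Switching between these models only requires changing the [components.llm.model] block in config.cfg; the task and the rest of the pipeline stay the same. For example, to use GPT-4 instead (a sketch, keeping the same temperature setting as above):

```ini
[components.llm.model]
@llm_models = "spacy.GPT-4.v2"
config = {"temperature": 0.0}
```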

B- Named entity recognition using an open-source model from Hugging Face:

Create a config.cfg file that will serve as your configuration file:

				
[nlp]
# The language of the pipeline
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
# The task you want the LLM to perform
@llm_tasks = "spacy.NER.v3"
# Specify your NER labels
labels = ["PERSON", "ORGANISATION", "LOCATION"]

[components.llm.model]
# An open-source model hosted on Hugging Face (Dolly, as an illustration)
@llm_models = "spacy.Dolly.v1"
name = "dolly-v2-3b"

Then run your script:

from spacy_llm.util import assemble

nlp = assemble("config.cfg")
doc = nlp("Jack and Jill rode up the hill in Les Deux Alpes")
print([(ent.text, ent.label_) for ent in doc.ents])
  • spacy.NER.v3: Implements Chain-of-Thought reasoning for NER extraction – obtains higher accuracy than v1 or v2.
  • spacy.NER.v2: Builds on v1 and additionally supports defining the provided labels with explicit descriptions.
  • spacy.NER.v1: The original version of the built-in NER task supports both zero-shot and few-shot prompting.

8- Entity linking using SpaCy:

An entity linker is a component in Natural Language Processing (NLP) systems that aims to link or associate named entities mentioned in a text with unique identifiers in a knowledge base. The goal is to ground these entities in a specific context or real-world knowledge. 

Entity linking in spaCy connects the named entities identified in a text to unique identifiers or entries in a knowledge base. This grounds the identified entities in a specific context, provides additional information, and facilitates a deeper understanding of the text. spaCy supports entity linking through its EntityLinker component, which can be added to the pipeline. Here’s a simplified outline of the entity linking process using spaCy:

  • Named Entity Recognition (NER): Before entity linking, the text is processed through spaCy’s Named Entity Recognition to identify and classify entities like persons, organizations, and locations.
  • Knowledge Base (KB): A knowledge base containing entries for entities, their unique identifiers, and additional information is required. Examples of knowledge bases include Wikipedia or custom databases.
  • EntityLinker Configuration: Create an instance of the EntityLinker component and add it to the spaCy pipeline. Configure the EntityLinker with the path to the knowledge base.
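As a minimal sketch of the knowledge-base step, the snippet below builds a tiny in-memory KB with spaCy's InMemoryLookupKB (available since spaCy 3.5). The entity IDs ('Q95', 'Q312') are Wikidata-style identifiers and the 3-dimensional vectors are toy values chosen purely for illustration:

```python
import spacy
from spacy.kb import InMemoryLookupKB

nlp = spacy.blank("en")  # a blank pipeline is enough to supply a Vocab

# Create a knowledge base whose entity vectors have 3 dimensions (toy size)
kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=3)

# Register entities with a frequency and an (illustrative) embedding
kb.add_entity(entity="Q95", freq=100, entity_vector=[1.0, 0.0, 0.0])   # Google
kb.add_entity(entity="Q312", freq=80, entity_vector=[0.0, 1.0, 0.0])   # Apple Inc.

# Map a surface form ("alias") to candidate entities with prior probabilities
kb.add_alias(alias="Apple", entities=["Q312"], probabilities=[0.9])

print(kb.get_size_entities(), kb.get_size_aliases())
```

An EntityLinker component added with nlp.add_pipe("entity_linker") can then be trained against such a knowledge base to disambiguate mentions in context.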

 

9- Relation extraction using SpaCy:

Relation extraction with spaCy involves the identification and classification of relationships between entities mentioned in a text. SpaCy’s capabilities in relation extraction are often harnessed through custom rule-based approaches, machine learning models, or a combination of both. Users can employ spaCy’s Matcher or DependencyMatcher to create rules that capture specific syntactic or semantic patterns indicative of relationships between entities. 

Alternatively, machine learning models like spaCy’s Named Entity Recognition (NER) can be fine-tuned or extended to predict relations between entities based on labeled training data. By leveraging the linguistic and contextual insights provided by spaCy, relation extraction becomes a crucial component in understanding the connections and associations between entities within a given text. 

This capability finds applications in various domains, including information retrieval, knowledge graph construction, and improving the overall depth of natural language understanding in diverse NLP applications.

For the string “UBIAI offers intelligent text Annotation services”, a potential knowledge graph would link the entity “UBIAI” to “intelligent text Annotation services” through the relation “offers”.
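A rule-based sketch of this idea using spaCy's Matcher is shown below. It uses a blank English pipeline so no trained model is needed, and the single-verb pattern (matching the literal word "offers") plus the "everything before the verb is the subject, everything after is the object" heuristic are deliberately naive assumptions for illustration:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Match the relation-bearing verb "offers"
matcher.add("OFFERS_RELATION", [[{"LOWER": "offers"}]])

doc = nlp("UBIAI offers intelligent text Annotation services")

# Naively treat everything before the verb as subject, everything after as object
for match_id, start, end in matcher(doc):
    subject = doc[:start].text
    obj = doc[end:].text
    print((subject, doc[start:end].text, obj))
```

In practice you would anchor such patterns on part-of-speech tags or dependency relations (via DependencyMatcher) rather than literal words, which requires a trained pipeline such as en_core_web_sm.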

10- Entity Resolution using SpaCy :

Entity resolution in spaCy involves the process of identifying and consolidating references to the same real-world entities within a given text. This is particularly useful when dealing with variations, synonyms, or ambiguous mentions of entities. While spaCy does not have a dedicated entity resolution component, its capabilities, such as Named Entity Recognition (NER) and linguistic features, can be leveraged for entity resolution tasks.

				
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Apple Inc. is a tech giant. The company is headquartered in Cupertino."

# Apply NER to identify entities
doc = nlp(text)

# Custom rules or heuristics for entity resolution
entity_mapping = {
    "tech giant": "ORG",
    "company": "ORG",
    "Cupertino": "GPE"
}

# Resolve entities based on custom rules
resolved_entities = [entity_mapping.get(ent.text, ent.text) for ent in doc.ents]

print(resolved_entities)

As output, you’ll get the list of entity texts after the custom mapping has been applied (the exact entities depend on the model’s predictions).

Conclusion

In conclusion, spaCy stands out as a versatile and widely adopted tool in the field of Natural Language Processing (NLP). Its comprehensive capabilities, ranging from tokenization and part-of-speech tagging to named entity recognition and dependency parsing, make it a go-to library for a diverse array of NLP tasks.
