NLP Techniques: Tokenization, POS Tagging and NER

Nov 6th, 2023

In today’s world, where human-computer interaction is increasingly important, NLP plays a crucial role in bridging the gap between humans and machines. Imagine a computer not only understanding your words but also responding in a meaningful way – that’s the magic of NLP.


Have you ever wondered how your favorite virtual assistant understands your voice commands or how chatbots engage in seemingly natural conversations with you? It’s all thanks to NLP. 


In this article, we will delve into the fascinating world of NLP techniques.


Our journey into NLP will focus on three key techniques: Tokenization, Part-of-Speech Tagging, and Named Entity Recognition.


By the end of this article, you’ll have a comprehensive understanding of these techniques, and you’ll appreciate how they enable machines to process human language effectively.

NLP techniques:

Tokenization: the first of the NLP techniques

In the NLP field, tokenization plays a central role in the processing pipeline. It is the process of disassembling a given text into its smallest units, known as tokens, which can include punctuation marks, words, and even numbers.


But why do we need tokenization? 


Tokenization enables us to analyze the frequency of words within the entire text by segmenting it into these tokens. This, in turn, forms the basis for creating models based on these word frequencies. Additionally, tokenization allows us to label tokens according to their word type, a concept we will dig into further when discussing Part of Speech Tagging.
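As a rough illustration of the frequency idea, the sketch below counts tokens using only Python’s standard library. The regular expression is a simplified stand-in for a real tokenizer, and the sample text is made up for the example.

```python
import re
from collections import Counter

# Made-up sample text for illustration
text = "The cat sat on the mat. The mat was warm."

# Simplified stand-in for a tokenizer: lowercase the text, then keep
# runs of letters, digits, or apostrophes as tokens (punctuation is dropped)
tokens = re.findall(r"[a-z0-9']+", text.lower())

# Count how often each token occurs
frequencies = Counter(tokens)

print(tokens)
print(frequencies.most_common(2))  # [('the', 3), ('mat', 2)]
```

Counts like these are the raw material for frequency-based models such as bag-of-words representations.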


For the practical part of this article, we will use NLTK, a comprehensive Python library for NLP, and walk through each technique step by step. Before starting, make sure NLTK is properly installed and ready for use.
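NLTK is typically installed with `pip install nltk`. One quick way to confirm the installation, assuming a standard Python environment, is to import the package and print its version:

```python
import nltk

# If this import succeeds, NLTK is installed and importable
print(nltk.__version__)
```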


We’ll begin by addressing the process of splitting text into sentences through sentence tokenization:
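The following is a minimal sketch of sentence tokenization, not necessarily the exact code the article originally showed. It assumes NLTK’s `PunktSentenceTokenizer` with its default, untrained parameters, so it runs without downloading any data. The more common entry point, `nltk.sent_tokenize`, relies on a pretrained Punkt model fetched with `nltk.download` and copes better with abbreviations such as “Dr.”. The sample text is made up.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Made-up sample text for illustration
text = (
    "Tokenization comes first. It splits text into smaller units. "
    "Those units are called tokens."
)

# Untrained tokenizer: it knows no abbreviations, so a period at the end
# of a word is generally treated as a sentence boundary
sentence_tokenizer = PunktSentenceTokenizer()
sentences = sentence_tokenizer.tokenize(text)

for sentence in sentences:
    print(sentence)
```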


Now, let’s delve deeper into the idea of dividing text into words:
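Here is a minimal sketch of word tokenization, assuming NLTK’s `TreebankWordTokenizer`, which works on a single sentence and needs no downloaded data. The widely used `nltk.word_tokenize` wrapper applies a similar tokenizer after first splitting sentences with the Punkt model. The sample sentence is made up.

```python
from nltk.tokenize import TreebankWordTokenizer

# Made-up sample sentence for illustration
sentence = "The 3 tokens are words, numbers, and punctuation."

word_tokenizer = TreebankWordTokenizer()
tokens = word_tokenizer.tokenize(sentence)

# Words, the number, each comma, and the final period become separate tokens
print(tokens)
# ['The', '3', 'tokens', 'are', 'words', ',', 'numbers', ',', 'and', 'punctuation', '.']
```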



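Once text is split into words, the words themselves are often normalized. Stemming chops off word endings (and, in some stemmers, prefixes) to reduce words to a base form. This is particularly useful when you want to treat related word forms, such as “connect” and “connected”, as the same item. Let’s take a look at an example: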
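A minimal stemming sketch follows, assuming the Porter stemmer, one of several stemmers that ship with NLTK; the article’s original snippet may have used a different one. The word list is made up.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Made-up word list for illustration
words = ["running", "connected", "studies"]

for word in words:
    print(word, "->", stemmer.stem(word))
# running -> run
# connected -> connect
# studies -> studi
```

Note that “studi” is not an English word: a stemmer applies suffix-stripping rules and does not check the result against a dictionary.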


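As a result, we obtain the stem of each word. Because stemming works by applying rules rather than looking words up, a stem is not always a valid dictionary word.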


We can use stemming to prepare text data for various NLP tasks, such as text classification or sentiment analysis, by reducing the complexity of the words used. It’s a fundamental technique that plays a role in simplifying and standardizing text data.


Lemmatization, unlike stemming, takes the part of speech and meaning of a word into account and returns its dictionary form, so the result is always a valid word from the language’s lexicon. Here’s an example:


As you can see, lemmatization reduces each word to its dictionary base form (its lemma), taking grammatical context and meaning into account.

Important NLP techniques: POS Tagging

Part of Speech Tagging is a fundamental linguistic process where words in a given text are labeled according to their respective word types, such as nouns, adjectives, adverbs, verbs, and more.


This tagging assigns each word a specific grammatical category, which helps in understanding the structure and meaning of the text.


This process is a crucial step in natural language processing, as it enables computers to understand the syntactic and semantic structure of text, facilitating various language analysis tasks.

Let’s get the tagger model: 


The output is a list of tuples, where each tuple consists of a word and its corresponding part-of-speech tag.


What are the Main Parts of Speech?


In traditional English grammar, words are commonly grouped into eight primary parts of speech according to their function in a sentence. They are listed below with the corresponding base tag from the Penn Treebank tagset, which NLTK’s default tagger uses:


  • Noun (NN): A word for a person, place, thing, or idea.
  • Verb (VB): A word for an action or something that happens.
  • Adjective (JJ): A word that describes a noun or pronoun.
  • Adverb (RB): A word that describes a verb, adjective, or another adverb.
  • Pronoun (PRP): A word that stands in for a noun.
  • Conjunction (CC): A word that links words, phrases, or clauses together.
  • Preposition (IN): A word showing how a noun or pronoun relates to other parts of a sentence.
  • Interjection (UH): A word or phrase used to express strong emotions.


Understanding these fundamental parts of speech is essential for analyzing and interpreting language, both for humans and in natural language processing applications.
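To show how these tags can be used in practice, the sketch below groups words by category using only the standard library. The `(word, tag)` pairs are written by hand in the shape a POS tagger returns, and `TAG_NAMES` and `coarse_category` are hypothetical helpers introduced for this example.

```python
from collections import defaultdict

# Hand-written (word, tag) pairs in the shape a POS tagger returns;
# illustrative only, not actual tagger output
tagged = [
    ("She", "PRP"), ("quickly", "RB"), ("read", "VBD"), ("the", "DT"),
    ("long", "JJ"), ("report", "NN"), ("and", "CC"), ("smiled", "VBD"),
]

# Base tags from the list above mapped to readable names
TAG_NAMES = {
    "NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb",
    "PRP": "pronoun", "CC": "conjunction", "IN": "preposition",
    "UH": "interjection",
}

def coarse_category(tag):
    """Map a Penn Treebank tag to one of the listed categories, else 'other'."""
    # Finer-grained tags extend the base tag (VBD, NNS, JJR), so match by prefix
    for base_tag, name in TAG_NAMES.items():
        if tag.startswith(base_tag):
            return name
    return "other"

words_by_category = defaultdict(list)
for word, tag in tagged:
    words_by_category[coarse_category(tag)].append(word)

print(dict(words_by_category))
```

Tags outside the eight listed categories, such as `DT` for determiners, fall into “other” here.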


Important NLP techniques: Named Entity Recognition

Named Entity Recognition (NER) might sound like a complicated concept, but it’s actually quite straightforward and incredibly useful. Let’s break it down in simple terms.


What is NER?


NER stands for Named Entity Recognition. It’s a process that helps computers identify and classify specific things, or “entities,” in text. These entities can be names of people, places, organizations, dates, percentages, currency, and more. Essentially, NER helps computers understand what’s important in a piece of writing.


NER uses clever techniques to recognize these entities. It’s a bit like a detective looking for clues. It searches for patterns and context to figure out what’s what. 

For example, if it spots a word that starts with a capital letter and sits next to other capitalized words, it might guess that the words together form a person’s name.
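The capitalization clue can be sketched as a toy heuristic with a regular expression. This is only an illustration of the idea and not how NLTK’s entity recognizer works; `candidate_entities` is a hypothetical helper and the sample sentence is made up.

```python
import re

def candidate_entities(text):
    """Return runs of consecutive capitalized words as entity candidates."""
    # One capitalized word, optionally followed by more capitalized words
    pattern = r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*"
    return re.findall(pattern, text)

# Made-up sample sentence for illustration
text = "Last week Marie Curie visited Paris with Pierre."

print(candidate_entities(text))
# ['Last', 'Marie Curie', 'Paris', 'Pierre']
```

The false positive “Last” shows why capitalization alone is not enough: practical NER systems also weigh the surrounding context, and they assign a type such as person or location to each entity.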


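With NLTK, the output of Named Entity Recognition (NER) is the series of words from the text, in which recognized entities are grouped together and labeled with a type such as PERSON, ORGANIZATION, or GPE (geopolitical entity).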


Conclusion: NLP techniques

As we wrap up our exploration of NLP techniques, we’ve uncovered the magic behind Tokenization, POS Tagging, and Named Entity Recognition (NER).


These techniques aren’t just tools; they’re the keys to understanding the language-machine connection. In the ever-evolving world of NLP, mastering these techniques empowers us to unlock the full potential of language and technology, making our communication with machines smoother and more intuitive.


In short, tokenization splits text into manageable units, POS tagging assigns grammatical roles to words, and NER, akin to a detective, uncovers the named entities concealed within text.


We extend our gratitude to you for joining us on this journey.