Build An NLP Project From Zero To Hero (3): Data Preprocessing
Dec 26, 2021
We continue the series, and in this article we focus on Data Preprocessing. It is often deemed even more important than training the model itself: garbage in, garbage out.
Data Preprocessing is the process of transforming raw data into an understandable format that suits your task. Naturally, we cannot work with raw data directly, let alone demand that machines understand it. It is important that our data quality stays above a certain threshold; we will elaborate on that later.
First, we will talk about Data Preprocessing in general, and then we will explain the techniques we used.
So let us dive in!
A Gentle Introduction to Data Preprocessing
Raw data seems chaotic, ambiguous, and unclear. It needs to be preprocessed in order for it to be useful to train the model.
One good example is training a sentiment analysis model, a model that predicts whether the sentiment of a text is positive, negative, or neutral. You encode the text as vectors and train your model, but it seems that your model is not improving.
You decide to peek at the most frequent words in the model’s vocabulary, and you find words like “I”, “they”, “is”, “and”, and so on. These are stopwords that bring no new information to your model. As a result, the model has not been provided meaningful embedding vectors from which it can recognize the sentiment behind each text.
This example is a little bit oversimplified but I think it is good to understand why preprocessing your data is so important especially in NLP.
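To make the stopword idea concrete, here is a minimal sketch of filtering them out before building a vocabulary. The stopword list below is a tiny hypothetical sample; libraries such as NLTK or spaCy ship much fuller lists.

```python
# A tiny, hypothetical stopword list for illustration only.
STOPWORDS = {"i", "they", "is", "and", "the", "a", "to", "it"}

def remove_stopwords(text):
    """Keep only tokens that may carry sentiment-relevant information."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]

print(remove_stopwords("I think the movie is great and they loved it"))
# ['think', 'movie', 'great', 'loved']
```

With the stopwords gone, the remaining tokens are the ones a sentiment model can actually learn from.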
The general steps of Data Preprocessing in a Machine Learning project:
– Exploratory Data Analysis: or EDA, the initial step of investigating the data. It allows you to discover clues and patterns that will help you make the right decisions in the next steps.
– Data Cleaning: Data can be ‘dirty’: it can contain missing values and unwanted errors or ‘noise’. You must remove or correct these anomalies.
– Data Integration: If you work with multiple datasets, you should aim to merge or unify all of them into one single source. This step requires knowing the metadata of each dataset well, detecting common entities and elements across the datasets, and resolving conflicting data value representations (such as date and time formats).
– Data Reduction: Data can be voluminous in size and complexity. This step aims to reduce the size without sacrificing the valuable information contained within the data. Techniques like dimensionality reduction and data compression are very helpful in this case. A very simple example is to use only the most relevant columns in a dataset for your problem.
– Data Transformation: This step involves changing the data into a more accessible form or structure. There are many techniques like Data Smoothing, Aggregation of data as a summary, Discretization of continuous data as intervals, and Normalization of data to be contained in a predefined range of values.
– Data Labeling: It is the process of identifying raw data (images, text files, videos, etc.) and adding one or more meaningful and informative labels to provide context so that a machine learning model can learn from it.
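As a tiny illustration of the cleaning and transformation steps above, here is a sketch on a made-up DataFrame (all column names and values are hypothetical):

```python
import pandas as pd

# A toy DataFrame standing in for a raw dataset (values are made up).
df = pd.DataFrame({"price": [10.0, None, 30.0, 20.0],
                   "note": ["ok", "ok", None, "bad"]})

# Data Cleaning: fill the missing numeric value with the column mean.
df["price"] = df["price"].fillna(df["price"].mean())

# Data Transformation: min-max normalization into the [0, 1] range.
df["price_norm"] = (df["price"] - df["price"].min()) / (df["price"].max() - df["price"].min())

print(df["price_norm"].tolist())
# [0.0, 0.5, 1.0, 0.5]
```

Which steps you apply, and how, depends entirely on the dataset and the task at hand.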
You will not need to use every step and every technique. It all depends on the requirements of your project. For our project, our data is text, so we need to know the steps of Text Preprocessing:
Tokenization: Dividing a text string into smaller parts or “tokens”. Tokens can be words, characters, or subwords; hence, tokenization comes in three forms: word, character, and subword (n-gram character) tokenization. Tokens are the building blocks of natural language and the most common unit for processing raw text; their ultimate purpose is to build the vocabulary of a given corpus.
Normalization: It aims to put all text in a predefined standard, such as converting all characters to lowercase and lemmatizing tokens.
Denoising: Removing any undesired ‘noise’ from the data, for example extra whitespace and special characters such as punctuation.
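The three steps above can be sketched as small standalone functions. This is a simplified illustration: real pipelines usually rely on libraries like NLTK or spaCy, and the normalization step here only lowercases (lemmatization is left out for brevity).

```python
import re

def denoise(text):
    """Noise removal: strip punctuation and collapse extra whitespace."""
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def normalize(text):
    """Normalization: lowercase everything (lemmatization omitted for brevity)."""
    return text.lower()

def tokenize(text):
    """Word-level tokenization: split on whitespace."""
    return text.split()

raw = "Hello,   World!!  NLP   is FUN."
print(tokenize(normalize(denoise(raw))))
# ['hello', 'world', 'nlp', 'is', 'fun']
```

Note that the functions compose in more than one order, which leads us to the next point.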
You can see the similarities between the general steps of data preprocessing and text preprocessing: denoising is a form of Data Cleaning, and normalization is a type of Data Transformation.
Sometimes these three steps overlap, and there is no strict order: you can consider tokenization a form of normalization, or perform denoising before tokenization.
The steps mentioned in this section serve as a guideline or roadmap for your future NLP projects, so you won’t get lost or forget what should be done!
Now, it is time to apply this knowledge, and for every technique we use, we will certainly explain our reasons for choosing it.
Text Preprocessing for an NER model
Our objective is to train a Named Entity Recognition (NER) model, and for that we need annotated data.
The data we are working with is the financial tweets dataset from Kaggle.
We need to preprocess this data to obtain a clean and rich corpus for our model.
Prepare your workspace
We have two datasets: stockerbot-export.csv, which contains the text of tweets, and stocks_cleaned.csv which contains the ticker symbol for every company included in the tweets. A ticker or a stock symbol is a unique series of letters assigned to a security for trading purposes. For example, the ticker for Apple is ‘AAPL’.
import pandas as pd

# Some rows are not parsed correctly; use error_bad_lines=False to skip them.
# (In pandas >= 1.3 this option is deprecated in favor of on_bad_lines='skip'.)
tweets = pd.read_csv('/content/stockerbot-export.csv', error_bad_lines=False)
tickers = pd.read_csv('/content/stocks_cleaned.csv')