Speed Up Data Labeling Using Clustering-Enhanced Data Labeling
Oct 9, 2022
When you are working with datasets that are largely unstructured and unclassified, annotation might look impossible at first glance. Fortunately, there are tools and techniques that can help us achieve this goal.
- The importance of data labeling
- Classification vs Clustering
- Proposed solution
- An automated solution: Auto-Labeling with UBIAI
- Some relevant business cases
This article is aimed at both technical (data scientists, ML engineers…) and non-technical readers (product managers, project managers, business owners…) and will provide you with solutions to optimize your workflow.
What’s supervised learning?
Supervised learning is a task in the spectrum of pattern recognition and machine learning methods in which the goal is to train a model from labeled data points. The learned model is then applied to an unseen test set and the method is validated based on how successful it was in assigning test data to different classes.
What’s unsupervised learning?
Unsupervised learning is a task in the spectrum of pattern recognition and machine learning methods in which the initial dataset is a bulk of unlabeled data that can be analyzed and grouped into clusters whose members share similar properties.
What is data labeling?
Data labeling — also known as data annotation, tagging, or classification — is the process of preparing datasets for algorithms that learn to recognize repetitive patterns in labeled data. Once enough labeled data has been processed by the algorithm, it can begin to identify the same patterns in datasets that haven’t been labeled. As you rinse and repeat this process, the algorithms behind AI and machine learning solutions grow smarter and more efficient. In short, the main goal of labeling is tagging a group of samples with one or more labels that describe the data.
Labeling typically takes a set of unlabeled data and augments each item with additional information, called tags.
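As a toy illustration (the field names here are hypothetical, not from any particular tool), an unlabeled sample and its labeled counterpart might look like this in Python:

```python
# An unlabeled sample: raw text only
unlabeled = {"text": "The battery died after two hours."}

# The same sample after labeling: augmented with a tag describing it
labeled = {"text": "The battery died after two hours.", "label": "negative"}
```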
The importance of data labeling
Since unlabeled datasets are unstructured and contain less descriptive information, they are easier to acquire or generate and require much less human effort to create. Labeled datasets, in contrast, are more expensive to produce but also more reliable.
However, since the data provided to the unsupervised techniques are unlabeled, there is no clear way to validate the quality of this approach which influences the reliability of the trained model.
On the other hand, Artificial intelligence (AI) is only as good as the data that it is trained with; the quality of the data that an AI algorithm is trained with correlates directly with its success.
Given these characteristics of each type of data, to effectively deploy AI models in real-world applications it is important that application stakeholders know how confident a model is in the predictions it is making. This can be traced all the way back to the data labeling stage, and it is therefore key to ensure that workers involved in the labeling process are assessed for quality assurance purposes.
With a robust quality assurance process in place, an AI model has a much higher chance of learning and achieving what it is designed to do. This follows from the principle of ‘garbage in, garbage out’ — the quality of the output is determined by the quality of the input.
Classification vs Clustering
Both clustering and classification are methods of pattern identification used in machine learning to categorize objects into different classes based on their features.
There are similarities between these two data science grouping techniques, but the main difference is that the classification method uses predefined classes to which objects are assigned, whereas the clustering method groups objects into non-predefined classes by identifying similarities and dissimilarities between them. Classification is used with labeled data and is geared towards supervised learning, while clustering is used with unlabeled data and is geared towards unsupervised learning.
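To make the distinction concrete, here is a minimal pure-Python sketch (the data and the distance threshold are made up for illustration): classification assigns a new point to one of the predefined classes using labeled examples, while clustering discovers groups in unlabeled points on its own:

```python
# Classification: predefined classes, labeled training data (1-nearest-neighbour)
labeled = [(1.0, "short"), (1.2, "short"), (9.8, "long"), (10.1, "long")]

def classify(x):
    # Assign x to the class of its nearest labeled example
    return min(labeled, key=lambda pair: abs(pair[0] - x))[1]

# Clustering: no labels at all; groups emerge from gaps in the data itself
points = [1.0, 1.2, 9.8, 10.1, 1.1]
threshold = 3.0
clusters = []
for p in sorted(points):
    if clusters and p - clusters[-1][-1] < threshold:
        clusters[-1].append(p)  # close to the previous point: same group
    else:
        clusters.append([p])    # big gap: start a new group
```

The classifier can only ever answer "short" or "long"; the clustering loop never saw those names and simply found two groups.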
Which one to choose?
For smaller datasets, manual annotation and organization are feasible and even optimal. However, as your data begins to scale, annotation, classification, and categorization become exponentially harder. Clustering — depending on the algorithm you’re using — can cut down your annotation and classification time because it’s less interested in specific outcomes and more concerned with the categorization itself. For instance, speech recognition algorithms produce millions of data points that would take hundreds of hours to fully annotate. Clustering algorithms can reduce the total work time and give you answers faster. So clustering is the optimal solution in some of these cases.
As a solution to speed up the annotation process and make it less expensive and time-consuming, let’s start with a large, unstructured data set and use clustering to help us in the labeling process.
Clustering large datasets is perhaps the most rewarding application of this analysis tool, thanks to the amount of work it takes off your hands. As with other unsupervised learning tools, clustering can take large datasets and, without instruction, quickly organize them into something more usable. The best part is that if you’re not looking to perform a massive analysis, clustering can give you fast answers about your data.
The proposed solution offers a unique way of leveraging word embeddings to perform text clustering. This technique tackles one of the biggest problems of text mining, the curse of dimensionality, to give more efficient clustering of textual data; it is especially suitable for big textual datasets.
Explaining the data flow: How can embedding and clustering speed up annotation?
The main idea behind the solution: embed every text with a sentence encoder, reduce the embeddings to two dimensions, then cluster similar texts so they can be labeled in bulk.
A recently proposed model for generating contextual embeddings can be very helpful for us: Bidirectional Encoder Representations from Transformers (BERT). BERT is a complex neural network architecture that is trained on a large corpus of books and English Wikipedia.
Steps to follow
First, let’s install the necessary packages with pip. Note that the UMAP library is published on PyPI as `umap-learn`, even though it is imported as `umap`:

```python
!pip install -U sentence-transformers
!pip install -U umap-learn
```

Second, we should import the libraries that we need:

```python
import pandas as pd
import umap
from sentence_transformers import SentenceTransformer
```

Now we should load the sentence encoder. SentenceTransformers is a Python framework for state-of-the-art sentence, text, and image embeddings. The initial work is described in their paper “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.”

```python
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
```

Now let’s load the initial dataset:

```python
dataset = pd.read_csv("dataset.csv")[['text']]
sentences = dataset['text'].tolist()
```

Using the SentenceTransformer, we can calculate the embeddings that form the N-dimensional representation:

```python
sentence_embeddings = model.encode(sentences)
```

Using UMAP, we can reduce the previous result to two dimensions:

```python
reducer = umap.UMAP(n_components=2)
var_tfm = reducer.fit_transform(sentence_embeddings)
```

Now we apply the coordinates and save the result to a CSV file:

```python
dataset['x'] = var_tfm[:, 0]
dataset['y'] = var_tfm[:, 1]
dataset.to_csv("ready.csv")
```
Finally, converting all of the text (string format) into numerical vectors makes it very easy and accurate to measure similarity (or dissimilarity) between texts. This kind of semantic comparison was not very accurate before the introduction of models like BERT.
Now we can use clustering to group similar data, which makes the process of labeling much easier and more accurate.
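The clustering step itself is not shown above, so here is a minimal sketch of how the 2-D coordinates could be grouped. It uses a small k-means written from scratch in NumPy (in practice you might reach for `sklearn.cluster.KMeans` instead), and the demo blobs stand in for the UMAP output; the number of clusters `k` is an assumption you would tune by inspecting the 2-D plot:

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Minimal NumPy k-means; returns one cluster id per point."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Farthest-point initialization: spreads the starting centroids out
    centroids = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        dist_to_nearest = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centroids], axis=0
        )
        centroids.append(points[dist_to_nearest.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels

# Demo on three well-separated 2-D blobs (a stand-in for the UMAP output)
rng = np.random.default_rng(1)
blobs = np.vstack(
    [rng.normal(center, 0.1, size=(20, 2)) for center in [(0, 0), (5, 5), (0, 5)]]
)
cluster_ids = kmeans(blobs, k=3)
```

Each resulting cluster can then be inspected and, if coherent, labeled in bulk rather than sample by sample.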
An automated solution: Auto-Labeling with UBIAI
UBIAI has an auto-labeling feature that decreases human effort and saves time and money on data labeling. The tool offers an option for document auto-annotation by using ML models, dictionaries, and also a Rule-Based approach. It easily allows auto-annotation of entities such as time, location, date, product, person, etc., after uploading the text from native files.