Active Learning
Jan 4, 2023
Active learning is a special case of machine learning in which a learning algorithm can interactively query a user (or some other information source) to label new data points with the desired outputs. It is sometimes referred to as optimal experimental design.
It is an important technique for building a decent machine learning model while keeping the amount of labeled data to a minimum, by selecting the data points that appear to be the most important.
This technique is also considered in situations where labeling is difficult or time-consuming. Passive learning, or the conventional way through which a large quantity of labeled data is created by a human oracle, requires enormous efforts in terms of man hours.
In a successful active learning system, the algorithm is able to choose the most informative data points through some defined metric, subsequently passing them to a human labeler and progressively adding them to the training set.
A diagrammatic representation is shown below:

Why do we need active learning?
The idea of active learning is inspired by the known concept that not all data points are equally important for training a model. Just have a look at the data points shown below. They form two clusters with a decision boundary in between.

Now assume a scenario with tens of thousands of unlabeled data points to learn from. Labeling all of those points manually would be cumbersome and expensive. If, to mitigate this, a random subset of the data is selected and labeled for model training, we would most likely end up with a model with sub-par performance, as you can observe in the image below. The catch is that the decision boundary learned from a random sample can be poorly placed, which leads to lower accuracy and other diminished performance metrics.

But what if we somehow manage to select a bunch of data points near the decision boundary and help the model to learn selectively? This would be the preferred way of selecting samples from the given unlabelled dataset. This is how the concept of active learning originated and evolved.

Active learning using sampling techniques
Active learning using sampling can be boiled down to the following steps:
a. Label a subsample of the data using a human oracle.
b. Train a relatively light model on the labeled data.
c. Use the model to predict the class of every remaining unlabelled data point.
d. Assign a score to every unlabelled data point based on the model outputs.
e. Choose a subsample based on these scores and send it out for labeling (the size of the subsample could depend on the available labeling budget, resources, and time).
f. Retrain the model on the cumulative labeled dataset.
Repeat steps c-f until the model approaches the desired level of performance. At a later stage, you can increase the model complexity as well. In our case, we are going to use this scenario to build the active learning model.
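The loop described above can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the data is one-dimensional, the "light model" is a single threshold, the human oracle is simulated by a function with a made-up true boundary at 6.2, and the seed set is assumed to contain both classes. All function names here are hypothetical.

```python
import math


def train(labeled):
    """Fit a tiny 1-D model: the midpoint between the two labeled classes."""
    highest_negative = max(x for x, y in labeled.items() if y == 0)
    lowest_positive = min(x for x, y in labeled.items() if y == 1)
    return (highest_negative + lowest_positive) / 2.0


def predict_proba(threshold, x):
    """Probability of class 1, as a sigmoid of the signed distance to the threshold."""
    return 1.0 / (1.0 + math.exp(-(x - threshold)))


def uncertainty(threshold, x):
    """Score a point: higher means the model is less sure about its class."""
    p = predict_proba(threshold, x)
    return 1.0 - max(p, 1.0 - p)


def active_learning(pool, oracle, seed, batch_size=2, rounds=4):
    labeled = {x: oracle(x) for x in seed}                 # step a
    unlabeled = [x for x in pool if x not in labeled]
    threshold = train(labeled)                             # step b
    for _ in range(rounds):
        if not unlabeled:
            break
        # steps c-d: predict and score every remaining unlabelled point
        ranked = sorted(unlabeled, key=lambda x: uncertainty(threshold, x), reverse=True)
        for x in ranked[:batch_size]:                      # step e
            labeled[x] = oracle(x)
        unlabeled = [x for x in unlabeled if x not in labeled]
        threshold = train(labeled)                         # step f
    return threshold, labeled


pool = [i * 0.5 for i in range(21)]        # 0.0, 0.5, ..., 10.0


def simulated_oracle(x):
    """Stand-in for the human labeler (assumed true boundary at 6.2)."""
    return int(x > 6.2)


threshold, labeled = active_learning(pool, simulated_oracle, seed=[pool[0], pool[-1]])
print(threshold, len(labeled))
```

With only 10 of the 21 points labeled, the learned threshold ends up close to the simulated boundary, because each round spends its labeling budget on the points the current model is least sure about.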
Pool-based sampling
In this case, the data samples are chosen from a pool of unlabelled data based on the informative value scores and sent for manual labeling. Unlike stream-based sampling, oftentimes, the entire unlabelled dataset is scrutinized for the selection of the best instances.
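The difference between the two selection modes can be illustrated with a short sketch. The function names and the informativeness scores below are hypothetical; in practice the scores would come from the model.

```python
def pool_based_select(pool, score, k):
    """Score every sample in the pool and return the k highest-scoring ones."""
    return sorted(pool, key=score, reverse=True)[:k]


def stream_based_select(stream, score, threshold):
    """Look at one sample at a time and query it only if its score is high enough."""
    for sample in stream:
        if score(sample) >= threshold:
            yield sample


# Hypothetical informativeness scores (higher = more worth labeling).
scores = {"doc_a": 0.10, "doc_b": 0.45, "doc_c": 0.30, "doc_d": 0.48}

pool_choice = pool_based_select(list(scores), scores.get, k=2)
stream_choice = list(stream_based_select(iter(scores), scores.get, threshold=0.4))
print(pool_choice)    # ['doc_d', 'doc_b']
print(stream_choice)  # ['doc_b', 'doc_d']
```

Pool-based selection needs the whole pool up front so it can rank it, while the stream-based variant makes an immediate keep-or-skip decision per sample and never compares samples with each other.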
Least confidence
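This strategy allows an active learner to select the unlabeled data samples for which the model is least confident in its prediction or class assignment. The least confidence (LC) value of a sample is one minus the probability of its most likely class, so if the model predicted 0.5 for the class with the highest probability, the LC value becomes 1 - 0.5 = 0.5.
The relation can be written as LC(x) = 1 - P(ŷ | x), where ŷ is the class with the highest predicted probability for x. The samples with the largest LC values are selected for labeling.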
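As a small worked example, the sketch below scores a few samples by least confidence, assuming the common definition in which the LC value is one minus the highest predicted class probability. The sample names and probability vectors are invented for illustration.

```python
def least_confidence(probabilities):
    """LC value: one minus the probability of the most likely class."""
    return 1.0 - max(probabilities)


def select_least_confident(predictions, n_samples):
    """Return the ids of the n_samples predictions with the highest LC value."""
    ranked = sorted(
        predictions.items(),
        key=lambda item: least_confidence(item[1]),
        reverse=True,
    )
    return [sample_id for sample_id, _ in ranked[:n_samples]]


# Hypothetical class probabilities predicted for three unlabeled samples.
predictions = {
    "sample_1": [0.90, 0.05, 0.05],
    "sample_2": [0.50, 0.30, 0.20],
    "sample_3": [0.40, 0.35, 0.25],
}

print(least_confidence(predictions["sample_2"]))              # 0.5
print(select_least_confident(predictions, n_samples=2))       # ['sample_3', 'sample_2']
```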
Active Learning and Data Annotation
As can be observed from the fundamentals of the active learning approach, this method reduces the total amount of labeled data needed for a model to perform well. This means that the time and cost of the data labeling process are greatly reduced, as only a fraction of the dataset is labeled.
However, the tasks of data annotation and model training are often handled separately, and by different organizations. Hence the interaction of both the processes is a challenge that often becomes hard to tackle, owing to the confidentiality and privacy of the data and processes.
Active learning is often used together with online or iterative learning during data annotation, in human-in-the-loop approaches. Active learning is then responsible for fetching the most useful data, while iterative learning enhances model performance as annotation continues, allowing a machine agent to assist the human annotators.

Creating an Active Learning Model for NER Annotation using spaCy
Named entity recognition (NER) is an important task in natural language processing (NLP), as it allows us to identify and classify named entities such as people, organizations, locations, and more in text data. However, manually annotating a large corpus of text data for NER can be time-consuming and labor-intensive.
One way to make the process more efficient is to use active learning, a machine learning technique that involves training a model on a small amount of labeled data, and then selectively choosing the most informative examples to be labeled by a human annotator. By iteratively labeling and training the model, we can improve its performance over time, while requiring fewer human annotations.
In this article, we will learn how to create an active learning model for NER annotation using the popular spaCy library in Python.
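First, we load a pretrained spaCy pipeline and define a small set of labeled examples. Each example pairs a text with the character offsets and labels of its entities:
import spacy
nlp = spacy.load('en_core_web_sm')
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
Next, we can train the model using spaCy’s nlp.update() method. In spaCy v3, this method takes a batch of Example objects and an optimizer and performs a single update step, so we call it in a loop for the desired number of iterations (the number of times the model should be trained on the data):
from spacy.training import Example
# Convert the (text, annotations) pairs into Example objects
examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in TRAIN_DATA]
optimizer = nlp.resume_training()  # keep the pretrained weights
with nlp.select_pipes(enable="ner"):  # update only the NER component
    for _ in range(20):  # 20 training iterations
        nlp.update(examples, sgd=optimizer)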
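Because the entity annotations are plain character offsets, an off-by-one error silently produces a wrong span. A quick sanity check, sketched here in plain Python with a hypothetical helper named entity_spans, is to slice each text with its offsets and confirm that the result is the intended entity:

```python
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]


def entity_spans(text, annotations):
    """Return the (surface text, label) pair that each (start, end, label) offset points to."""
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


for text, annotations in TRAIN_DATA:
    print(entity_spans(text, annotations))
# [('Shaka Khan', 'PERSON')]
# [('London', 'LOC'), ('Berlin', 'LOC')]
```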
Once the model is trained, we can check its performance on a test set of data. This will allow us to determine how well the model is able to identify named entities in unseen text. For a quick visual inspection, we can use spaCy’s displacy.render() method, which provides a visual representation of the entities identified by the model:
from spacy import displacy
doc = nlp("Who is Shaka Khan?")
displacy.render(doc, style="ent")
Then, we can wrap the trained model in an active learning loop. The create_loop() function used below is not part of spaCy’s public API; it stands for a user-defined or platform-provided helper that takes the trained NER model and the training data as arguments:
# nlp is the NER model trained above
active_model = create_loop(nlp, TRAIN_DATA)
Next, we can use the active_model to select the most informative examples from our dataset and add them to the annotation queue on UBIAI. We can do this using the active_model’s sample() and annotate() methods. The sample() method selects the most informative examples from the dataset, and the annotate() method adds them to the queue for human annotation on UBIAI. Like create_loop(), both methods belong to the wrapper object rather than to spaCy itself.
The number of examples selected and added to the queue can be controlled using the sample() method’s n_samples argument. For example, to add 10 examples to the queue for human annotation on UBIAI, we can use the following code (session is assumed to be an annotation session object that was created beforehand):
selected_examples = active_model.sample(n_samples=10)
active_model.annotate(selected_examples, session=session)
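Since the wrapper is not a spaCy object, it may help to see what such an object could look like. The class below is a purely hypothetical stand-in, not the API of spaCy or UBIAI: it takes a confidence function instead of a model, ranks unlabeled texts by ascending confidence, and keeps a local list in place of a remote annotation queue. The texts and confidence values are invented.

```python
class ActiveLoop:
    """Hypothetical stand-in for an active learning wrapper around a model."""

    def __init__(self, confidence_fn, unlabeled_texts):
        self.confidence_fn = confidence_fn      # maps a text to a model confidence
        self.unlabeled = list(unlabeled_texts)
        self.queue = []                         # texts waiting for human annotation

    def sample(self, n_samples=10):
        """Return the n_samples texts the model is least confident about."""
        return sorted(self.unlabeled, key=self.confidence_fn)[:n_samples]

    def annotate(self, examples):
        """Move the selected texts from the unlabeled pool to the annotation queue."""
        for text in examples:
            if text in self.unlabeled:
                self.unlabeled.remove(text)
                self.queue.append(text)


# Made-up confidences standing in for real model outputs.
toy_confidence = {
    "Apple hired Tim.": 0.95,
    "Jordan is lovely in spring.": 0.40,
    "Amazon is burning.": 0.55,
}

loop = ActiveLoop(toy_confidence.get, toy_confidence)
selected = loop.sample(n_samples=2)
loop.annotate(selected)
print(selected)        # ['Jordan is lovely in spring.', 'Amazon is burning.']
print(loop.unlabeled)  # ['Apple hired Tim.']
```

Passing the confidence function in from outside keeps the sketch independent of any particular model; with a real NER model, that function would be derived from the model's predictions.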
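Finally, we can compute an average confidence score per document and pick the least confident documents for the next annotation round. Here, annotated_examples is assumed to be a list of dictionaries, one per predicted entity, with 'document_id' and 'confidence' keys:
def calculate_doc_scores(annotated_examples):
    # Collect the confidence of every entity, grouped by document
    doc_scores = {}
    for example in annotated_examples:
        doc_id = example['document_id']
        if doc_id not in doc_scores:
            doc_scores[doc_id] = []
        doc_scores[doc_id].append(example['confidence'])
    # Replace each list of confidences with its average
    for doc_id, scores in doc_scores.items():
        num_entities = len(scores)
        total_score = sum(scores)
        doc_scores[doc_id] = total_score / num_entities
    return doc_scores

doc_scores = calculate_doc_scores(annotated_examples)
# Sort ascending so the least confident documents come first
sorted_doc_scores = sorted(doc_scores.items(), key=lambda x: x[1])
# Keep the 10% least confident documents (at least one, so small sets are not skipped)
top_10_percent_docs = sorted_doc_scores[:max(1, int(len(sorted_doc_scores) * 0.1))]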
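To see what this ranking produces, here is a compact, self-contained variant run on made-up records. The document_id and confidence fields follow the snippet above; the function name mean_confidence_per_doc and all values are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_confidence_per_doc(annotated_examples):
    """Average the entity confidences of each document."""
    grouped = defaultdict(list)
    for example in annotated_examples:
        grouped[example["document_id"]].append(example["confidence"])
    return {doc_id: mean(scores) for doc_id, scores in grouped.items()}


# Hypothetical per-entity confidences for three documents.
annotated_examples = [
    {"document_id": "doc_1", "confidence": 0.9},
    {"document_id": "doc_1", "confidence": 0.7},
    {"document_id": "doc_2", "confidence": 0.4},
    {"document_id": "doc_3", "confidence": 0.6},
    {"document_id": "doc_3", "confidence": 0.5},
]

doc_scores = mean_confidence_per_doc(annotated_examples)
# Least confident documents first
ranked_doc_ids = sorted(doc_scores, key=doc_scores.get)
print(ranked_doc_ids)  # ['doc_2', 'doc_3', 'doc_1']
```

The documents at the front of the list are the ones the model is least sure about, which makes them the natural candidates for the next round of human annotation.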