Annotate text from Google Sheet using Google Apps Script and Machine Learning APIs
SEP 19, 2022
Given a large unlabeled data set, data scientists quickly find that labeling it by hand is tedious and time-consuming. This is the problem that active learning was developed to address.
- What is active learning in ML?
- Active learning query strategies, pros and cons.
- Active learning frameworks and tools.
1. What is active learning in ML?
Labeling data is often the most time-consuming part of a Machine Learning project, by some commonly quoted estimates taking up to about 80% of the total effort. Active learning turns the large volume of unlabeled data into an advantage: it is usually described as a type of semi-supervised learning, where a small set of labeled data and a large amount of unlabeled data are used together to train a supervised model.
The active learning technique starts with a small set of manually labeled data called “the seed”. The model is first trained on this seed. The trained learner then makes predictions on the remaining unlabeled data and assigns each instance a priority score that reflects how much the model is expected to learn from its label. The highest-priority instances are labeled and added to the training set, the model is retrained, and the priority scores are updated. This cycle is repeated until the model performs well enough or the labeling budget runs out.
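The loop described above can be sketched in a few lines of Python. This is only a minimal illustration under strong assumptions: the data is one-dimensional and perfectly separable, the "model" is a toy threshold classifier, and `oracle` is a hypothetical stand-in for the human annotator. None of these names come from a real library.

```python
def fit_threshold(labeled):
    """Train a toy 1-D model: the boundary sits halfway between the
    largest point labeled 0 and the smallest point labeled 1."""
    highest_0 = max(x for x, y in labeled.items() if y == 0)
    lowest_1 = min(x for x, y in labeled.items() if y == 1)
    return (highest_0 + lowest_1) / 2


def predict(threshold, x):
    """Classify a point with the current threshold."""
    return int(x >= threshold)


def priority(threshold, x):
    """Higher score = closer to the boundary = more informative."""
    return -abs(x - threshold)


def oracle(x):
    """Hypothetical annotator: the hidden true boundary is at 6.2."""
    return int(x >= 6.2)


def active_learning_loop(pool, seed, rounds):
    labeled = dict(seed)                        # start from the seed
    unlabeled = [x for x in pool if x not in labeled]
    for _ in range(rounds):
        threshold = fit_threshold(labeled)      # (re)train the model
        # score every unlabeled point and pick the highest priority
        query = max(unlabeled, key=lambda x: priority(threshold, x))
        labeled[query] = oracle(query)          # ask the annotator
        unlabeled.remove(query)
    return fit_threshold(labeled), labeled


pool = [i * 0.5 for i in range(21)]             # 0.0, 0.5, ..., 10.0
seed = {0.0: 0, 10.0: 1}                        # one example per class
threshold, labeled = active_learning_loop(pool, seed, rounds=5)
```

With only five queries on top of the two seed labels, the toy learner narrows its boundary down to the gap that contains the true one, because every query is placed where the current model is least sure.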
2. Active learning query strategies, Pros & Cons
As mentioned earlier, active learning is called “active” because the model is allowed to pose queries during training. A query is usually an unlabeled data instance that is sent to an oracle (a human annotator, expert, or scientist) to be annotated.
The three main active learning strategies are:
Stream-based selective sampling:
This method is used when we have a large database or a continuous stream of data. The model is first trained on a small labeled subset of the data, the training data set. The remaining unlabeled instances are then presented to the model one at a time, and for each instance the model decides whether or not to query its label from the oracle.
This approach is inexpensive, but it has a major disadvantage: because each decision is made separately, without comparing the instance against the rest of the data, performance can be limited.
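As a rough sketch of the per-instance decision, the snippet below queries the oracle only when the model's confidence in its own prediction falls below a cutoff. The logistic `predict_proba` function, the `threshold` value, and the labeling `budget` are all illustrative assumptions, and for brevity the model is not retrained after each query as it would be in practice.

```python
import math


def predict_proba(x, boundary=5.0):
    """Hypothetical stand-in model: P(class 1 | x) from a logistic curve."""
    return 1.0 / (1.0 + math.exp(-(x - boundary)))


def should_query(x, threshold=0.2):
    """Decide, for this single instance, whether to ask for its label."""
    p = predict_proba(x)
    uncertainty = 1.0 - max(p, 1.0 - p)   # least-confidence score in [0, 0.5]
    return uncertainty > threshold


def selective_sampling(stream, budget):
    """Scan the stream once and collect the instances worth labeling."""
    queried = []
    for x in stream:
        if len(queried) >= budget:        # stop when the budget is spent
            break
        if should_query(x):
            queried.append(x)
    return queried


stream = [0.5, 4.8, 9.1, 5.3, 2.0, 6.0, 4.0]
queried = selective_sampling(stream, budget=3)
print(queried)  # [4.8, 5.3, 6.0]
```

Note that 4.0 is never examined: the budget is already used up by the time it arrives, which illustrates how separate, order-dependent decisions can miss useful instances.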
Pool-based sampling:
In pool-based sampling, a large pool of unlabeled data is available. The model scores the samples in the pool with some “informativeness” measure and ranks them, and the X best-ranked samples are sent to the oracle to be labeled. The drawback of this method is the amount of memory and computation it requires, since the whole pool has to be evaluated in every round.
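One common choice of “informativeness” measure is the entropy of the model's predicted class probabilities: the flatter the distribution, the less sure the model is. The sketch below ranks a small pool this way. The probability values are made up for illustration, and `select_batch` is a hypothetical helper, not a library function.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_batch(pool_probs, k):
    """Return the indices of the k most uncertain samples in the pool."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,                      # most uncertain first
    )
    return ranked[:k]


# Illustrative model outputs for four unlabeled samples, three classes each.
pool_probs = [
    [0.98, 0.01, 0.01],   # confident prediction
    [0.40, 0.35, 0.25],   # fairly uncertain
    [0.70, 0.20, 0.10],   # moderately confident
    [0.34, 0.33, 0.33],   # almost uniform: most uncertain
]
batch = select_batch(pool_probs, k=2)
print(batch)  # [3, 1]
```

Because every sample in the pool has to be scored before the ranking can be made, the cost of each round grows with the size of the pool.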
Membership query synthesis:
Generally, this approach is used when only a minimal data set is available. Instead of selecting existing instances, the learner generates its own instances that it believes would be most beneficial for the model’s training. These generated, unlabeled instances are then sent to the oracle to be labeled. The downside of this method is the high chance of producing instances that the oracle cannot identify or label reliably, whereas its advantage is that it suits problems where data instances are easy to generate.
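One simple way to synthesize queries, shown here purely as an illustration, is to generate the midpoint between a known negative and a known positive example and ask the oracle to label it, repeatedly halving the gap around the decision boundary. The `oracle` rule and the starting points are hypothetical, and real query synthesis methods are usually more elaborate.

```python
def synthesize_query(neg, pos):
    """Create a new instance halfway between two opposite-class points."""
    return [(a + b) / 2 for a, b in zip(neg, pos)]


def oracle(x):
    """Hypothetical annotator: class 1 when the feature sum exceeds 1.0."""
    return int(sum(x) > 1.0)


def refine_boundary(neg, pos, n_queries):
    """Generate n_queries synthetic instances that close in on the boundary."""
    queries = []
    for _ in range(n_queries):
        q = synthesize_query(neg, pos)    # the learner invents an instance
        label = oracle(q)                 # the oracle labels it
        queries.append((q, label))
        if label == 1:
            pos = q                       # tighten from the positive side
        else:
            neg = q                       # tighten from the negative side
    return neg, pos, queries


neg, pos, queries = refine_boundary([0.0, 0.0], [2.0, 2.0], n_queries=4)
print(neg, pos)  # [0.5, 0.5] [0.625, 0.625]
```

The example works because any point in this numeric feature space is meaningful to the oracle. For data such as text or images, an interpolated instance may not look like anything a human can label, which is the weakness described above.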
3. Active learning frameworks and tools
- modAL: an active learning framework for Python 3. It is built on top of scikit-learn, which makes it flexible and easy to use. modAL’s main strength is that it supports many active learning strategies, such as pool-based sampling, stream-based selective sampling, and query synthesis.
- UBIAI: a robust labeling tool for Natural Language Processing (NLP). It is widely used because its platform is simple and requires no coding knowledge.
- libact: a Python package for pool-based active learning, designed to make active learning easier for general users. The package not only implements several popular active learning strategies but also features the active-learning-by-learning meta-algorithm, which helps users automatically select the best strategy on the fly.