Can Weak Labeling Replace Human-Labeled Data? A step-by-step comparison between weak and full supervision

Jun 7, 2022

In recent years, Natural Language Processing (NLP) has advanced significantly thanks to deep learning models. Real-world NLP applications, ranging from intelligent chatbots to automated data extraction from unstructured documents, are becoming more prevalent and bringing real business value to many companies. However, these models still require hand-labeled training data to fine-tune them to specific business use cases. It can take months to gather this data and even longer to label it, especially if a domain expert is needed and there are multiple classes to identify within the text. As you can imagine, this can become a real adoption barrier for many businesses, since subject matter experts are hard to find and expensive.


To address this problem, researchers have adopted weak forms of supervision, such as heuristically generated labeling functions and external knowledge bases, to programmatically label the data. While this approach holds a lot of promise, its impact on model performance in comparison with full supervision remains unclear.

In this tutorial, we will generate two training datasets from job descriptions: one generated with weak labeling and a second generated by hand labeling using UBIAI. We will then compare model performance on a NER task that aims to extract skills, experience, diploma, and diploma major from job descriptions. The data and the notebook are available in my GitHub repo.


With weak supervision, the user defines a set of functions and rules that assign a noisy label, that is, a label that may not be correct, to unlabeled data. The labeling functions may take the form of patterns such as regular expressions, dictionaries, ontologies, pre-trained machine learning models, or crowd annotations.

Weak supervision pipelines have three components: (1) user-defined labeling functions and heuristic functions, (2) a statistical model which takes as input the labels from the functions, and outputs probabilistic labels, and (3) a machine learning model that is trained on the probabilistic training labels from the statistical model.
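As a toy illustration of the first two stages, the sketch below uses two made-up labeling functions and a simple majority vote in place of the statistical model (skweak's actual aggregator, an HMM, appears later in this tutorial):

```python
from collections import Counter

# Two made-up token-level labeling functions (illustrative only)
def lf_years(token):
    # flags tokens like "5+" that typically precede "years"
    return "EXPERIENCE" if token.rstrip("+").isdigit() else None

def lf_skills(token, skills={"python", "sql"}):
    return "SKILLS" if token.lower() in skills else None

def aggregate(tokens, lfs):
    # majority vote over the labeling functions' outputs,
    # standing in for the statistical aggregation model
    labels = []
    for tok in tokens:
        votes = [lab for lf in lfs if (lab := lf(tok)) is not None]
        labels.append(Counter(votes).most_common(1)[0][0] if votes else "O")
    return labels

print(aggregate("5+ years of Python".split(), [lf_years, lf_skills]))
# ['EXPERIENCE', 'O', 'O', 'SKILLS']
```

The resulting (noisy) token labels are what the final machine learning model would be trained on.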

(Figure: "Is Weak Labeling Capable of Replacing Human-Labeled Data?" Image by Author)

Labeling Functions

To perform the weak labeling, we will write a set of functions that encode dictionaries, patterns, knowledge bases, and rules related to the corpus we would like to label. In this tutorial, we will add functions that auto-label the entities SKILLS, EXPERIENCE, DIPLOMA, and DIPLOMA_MAJOR from job descriptions. After applying those functions to the unlabeled data, the results will be aggregated into a single, probabilistic annotation layer using a statistical model provided by the skweak library.

First, we will create a dictionary of skills, Skills_Data.json, and use it in our function lf3 to annotate the SKILLS entity. The dictionary was obtained from a publicly available dataset.

import json, re
import spacy
from skweak import heuristics, gazetteers, aggregation, utils

nlp = spacy.load('en_core_web_md', disable=['ner'])
# Assumes Skills_Data.json holds a list of skill strings
with open('Skills_Data.json', encoding='UTF-8') as f:
    skills = json.load(f)
tries = {"SKILLS": gazetteers.Trie([skill.split() for skill in skills])}
lf3 = gazetteers.GazetteerAnnotator("skills", tries)
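To see what the gazetteer is doing conceptually, here is a stand-alone sketch that scans token windows against a small inline phrase list; the skill list and the sentence are made up, and skweak implements the same idea efficiently with a trie:

```python
# Toy gazetteer: match token windows against a phrase dictionary
skills = {("python",), ("machine", "learning"), ("sql",)}
max_len = max(len(s) for s in skills)

def match_skills(tokens):
    spans = []
    for i in range(len(tokens)):
        # try the longest phrases first
        for n in range(max_len, 0, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in skills:
                spans.append((i, i + n, "SKILLS"))
                break
    return spans

tokens = "Experience with Python and machine learning".split()
print(match_skills(tokens))  # [(2, 3, 'SKILLS'), (4, 6, 'SKILLS')]
```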
For the EXPERIENCE entity, we use a regex pattern to capture the number of years of experience:

# Function for experience detection (uses a regex)
def experience_detector(doc):
    expression = r'\d+\+ years'
    for match in re.finditer(expression, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        if span is not None:
            yield span.start, span.end, "EXPERIENCE"

lf1 = heuristics.FunctionAnnotator("experience", experience_detector)
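A quick sanity check of the experience pattern on a made-up sentence:

```python
import re

# Made-up job description sentence to exercise the EXPERIENCE regex
text = "We are looking for someone with 5+ years of NLP experience."
print(re.findall(r'\d+\+ years', text))  # ['5+ years']
```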
For the entities DIPLOMA and DIPLOMA_MAJOR, we use a publicly available dataset from Kaggle and regex:

import json

# Load the diploma and diploma-major dictionaries
with open('Diploma_Dic.json', 'r', encoding='UTF-8') as f:
    DIPLOMA = json.load(f)
with open('Diploma_Major_Dic.json', encoding='UTF-8') as f:
    DIPLOMA_MAJOR = json.load(f)

# Create the diploma function
def Diploma_fun(doc):
    for key in DIPLOMA:
        # re.escape so dictionary entries are matched literally
        for match in re.finditer(re.escape(key), doc.text, re.IGNORECASE):
            start, end = match.span()
            span = doc.char_span(start, end)
            if span is not None:
                yield span.start, span.end, "DIPLOMA"

lf4 = heuristics.FunctionAnnotator("Diploma", Diploma_fun)

# Create the diploma-major function
def Diploma_major_fun(doc):
    for key in DIPLOMA_MAJOR:
        for match in re.finditer(re.escape(key), doc.text, re.IGNORECASE):
            start, end = match.span()
            span = doc.char_span(start, end)
            if span is not None:
                yield span.start, span.end, "DIPLOMA_MAJOR"

lf2 = heuristics.FunctionAnnotator("Diploma_major", Diploma_major_fun)
# Function for diploma-major detection (uses a regex)
def diploma_major_detector(doc):
    expression = re.compile(r"(Ph\.D|MS|Master|BA|Bachelor|BS)\S* in (\S+)")
    for match in re.finditer(expression, doc.text):
        start, end = match.span(2)  # group 2 holds the major
        span = doc.char_span(start, end)
        if span is not None:
            yield span.start, span.end, "DIPLOMA_MAJOR"

# distinct annotator name so lf2's spans are not overwritten
lf5 = heuristics.FunctionAnnotator("Diploma_major_regex", diploma_major_detector)
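And a quick check of the diploma-major pattern, again on a made-up requirement line:

```python
import re

# Made-up requirement line to exercise the diploma-major regex
pattern = re.compile(r"(Ph\.D|MS|Master|BA|Bachelor|BS)\S* in (\S+)")
text = "Candidates should hold a Master in Statistics or similar."
match = pattern.search(text)
print(match.group(2))  # Statistics
```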

We apply all the functions to the corpus and use skweak's statistical model to resolve their (dis)agreements and auto-label the data.

docs = []
with open('Corpus.txt', 'r', encoding='UTF-8') as f:
    for text in f:
        if len(text) != 1:  # skip empty lines
            doc = nlp(text)
            # apply every labeling function to the document
            for lf in (lf1, lf2, lf3, lf4, lf5):
                doc = lf(doc)
            docs.append(doc)

from skweak import aggregation
model = aggregation.HMM("hmm", ["DIPLOMA", "DIPLOMA_MAJOR", "EXPERIENCE", "SKILLS"])
docs = model.fit_and_aggregate(docs)
We are finally ready to train the model! We chose to train a spaCy model since it integrates easily with the skweak library, but we could of course use any other model, such as transformers. The annotated datasets are available in the GitHub repo.

for doc in docs:
    doc.ents = doc.spans["hmm"]
utils.docbin_writer(docs, "train.spacy")

!python -m spacy train config.cfg --output ./output --paths.train train.spacy --paths.dev train.spacy


We are now ready to run the training on both datasets, fully hand-labeled and weakly labeled, each with an equal number of documents:

Hand-labeled dataset model performance:

================================== Results ==================================

TOK      100.00
NER P     74.27
NER R     80.10
NER F     77.08
SPEED      4506

=============================== NER (per type) ===============================

                    P       R       F
DIPLOMA         85.71   66.67   75.00
DIPLOMA_MAJOR   33.33   16.67   22.22
EXPERIENCE      81.82   81.82   81.82
SKILLS          74.05   83.03   78.29

Weakly-labeled dataset model performance:

================================== Results ==================================

TOK      100.00
NER P     31.78
NER R     17.80
NER F     22.82
SPEED      2711

=============================== NER (per type) ===============================

                    P       R       F
DIPLOMA         33.33   22.22   26.67
DIPLOMA_MAJOR   14.29   50.00   22.22
EXPERIENCE     100.00   27.27   42.86
SKILLS          33.77   15.76   21.49

Interestingly, the model trained on the hand-labeled dataset outperforms the weakly labeled one by a wide margin: an overall F-score of 0.77 for full supervision versus 0.22 for weak supervision. If we dig deeper, we find that the performance gap also holds at the entity level (except for the EXPERIENCE entity, where the weakly supervised model reaches perfect precision, albeit with much lower recall).
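As a sanity check on the reported numbers, each F-score above is the harmonic mean of the corresponding precision and recall:

```python
def f1(p, r):
    # F-score as the harmonic mean of precision and recall
    return 2 * p * r / (p + r)

print(round(f1(74.27, 80.10), 2))  # hand-labeled: ~77.08
print(round(f1(31.78, 17.80), 2))  # weakly labeled: ~22.82
```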

By adding more labeling functions, such as crowd annotations, model-based labeling, rules, and dictionaries, we would expect model performance to improve, but it is unclear whether it will ever match that of data labeled by subject matter experts. Moreover, figuring out the correct auto-labeling functions is an iterative and ad-hoc process. This issue is exacerbated with highly technical datasets, such as medical notes, legal documents, or scientific articles, where simple labeling functions can fail to capture the domain knowledge that users want to encode.

In this tutorial, we demonstrated a step-by-step comparison between models trained on weakly labeled data and hand-labeled data. We have shown that, in this specific use case, the performance of the model trained on the weakly labeled dataset is significantly lower than that of the fully supervised approach. This certainly does not mean weak labeling is not useful: we can use it to pre-annotate a dataset and bootstrap a labeling project, but we cannot rely on it for fully unsupervised labeling.