In this article, we dive into Masked Language Modeling (MLM) and explore how TensorFlow, Google’s open-source machine learning framework, can be leveraged to implement powerful MLM models. We’ll discuss the fundamentals of MLM, its applications in NLP, and step-by-step instructions to build and train an MLM model using TensorFlow.
So let’s dive in!
Masked Language Modeling (MLM) is a widely used deep learning method in Natural Language Processing (NLP), especially for pretraining Transformer encoders like BERT and RoBERTa (GPT-2, by contrast, is trained with a causal language modeling objective, not MLM). In MLM, a portion of the input tokens is masked, i.e. replaced with a special token ([MASK]), and the model is trained to predict the original tokens from the surrounding context. This approach helps the model learn the context and relationships between words in a sentence. MLM is self-supervised, meaning the labels come from the input text itself, which makes it a versatile pretraining step for downstream tasks like text classification and question answering.
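To see the objective in action before any fine-tuning, here is a minimal sketch (assuming the transformers library is installed) that asks a pretrained distilroberta-base checkpoint, the same one we fine-tune below, to fill in a masked token:

from transformers import pipeline

# distilroberta-base marks masked positions with the special <mask> token.
fill = pipeline("fill-mask", model="distilroberta-base")

# The pipeline predicts the most likely tokens for the masked position;
# the top prediction here should be something like " Paris".
print(fill("The capital of France is <mask>.", top_k=1))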
!pip install transformers datasets evaluate
An optional step is to connect to your Hugging Face account so that you can later push the fine-tuned model to the Hub.
from huggingface_hub import notebook_login
notebook_login()
Now we’ll load a subset of the Quora dataset using the load_dataset function from the datasets library. This subset contains the first 500 examples of the training split, and the loaded dataset is stored in the quora variable.
from datasets import load_dataset
quora = load_dataset("quora", split="train[:500]")
quora
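If everything loaded correctly, this should print a summary along these lines (the Quora dataset stores each pair of questions in a nested questions field, plus a duplicate label):

Dataset({
    features: ['questions', 'is_duplicate'],
    num_rows: 500
})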
We’ll split the quora dataset into training and test sets using the train_test_split method, allocating 20% of the data to the test set (test_size=0.2).
quora = quora.train_test_split(test_size=0.2)
quora["train"][0]
Here, we’re using the AutoTokenizer class from the transformers library to load a tokenizer for the DistilRoBERTa model. This tokenizer is capable of processing text into tokens that can be fed into the model.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
We then flatten the quora dataset. Each example stores both questions in a nested questions field; flattening promotes those nested fields to top-level columns (questions.id and questions.text), which makes them easier to work with in the preprocessing step below.
quora = quora.flatten()
quora["train"][0]
Now, we define a preprocess_function that tokenizes the text using the tokenizer initialized earlier. It takes a batch of examples, joins the two questions in each example into a single string, and tokenizes the result. The map method applies this function to the quora dataset, processing it in batches across four worker processes (num_proc=4). The remove_columns argument drops the original text columns from the processed dataset.
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["questions.text"]])
tokenized_quora = quora.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=quora["train"].column_names,
)
We set up a DataCollatorForLanguageModeling object from the Hugging Face Transformers library. By specifying mlm_probability=0.15, we configure it to randomly mask 15% of the tokens in each input sequence during training: the collator takes the tokenized input, replaces randomly selected token IDs with the mask token, and keeps the original IDs as the labels the model has to predict.
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf")
For example, if we tokenize this input text:
Hello, how are you?
the tokenizer returns a sequence of token IDs. If we then run that sequence through the collator, the IDs at the masked positions are replaced with the tokenizer’s mask token ID. The exact values depend on the tokenizer; for distilroberta-base, tokenizer.mask_token_id is 50264, so that value is what marks a masked position in the output.
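You can reproduce this yourself with a small sketch (it assumes the tokenizer and data_collator defined above; note that which positions get masked is random, so rerun it if nothing gets masked):

# Tokenize a single sentence and run it through the collator.
sample = tokenizer("Hello, how are you?")
batch = data_collator([sample])

# input_ids: some positions are replaced with tokenizer.mask_token_id.
print(batch["input_ids"])

# labels: the original IDs at the masked positions, -100 everywhere else
# (positions with -100 are ignored by the loss).
print(batch["labels"])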
Another approach to annotation is to use the UBIAI annotation tool. Start by importing the document you want to annotate from the Documents section.
Then go to Settings and set up a new entity; for this example, the entity will represent the masked token.
Now all you have to do is navigate to the annotation section and start annotating the document.
UBIAI offers a visual representation of the annotation process, so every annotated word is visually distinguishable.
Finally, export your annotated data, and voilà!
The UBIAI annotation tool provides a user-friendly interface and numerous features, giving you flexibility in choosing which tokens to mask and letting you focus on the parts of the text that matter, which streamlines the annotation process.
We create an AdamW optimizer with a learning rate of 2e-5 and a weight decay rate of 0.01 for training the model. We then initialize a masked language model using the TFAutoModelForMaskedLM class from the transformers library, loading the pretrained DistilRoBERTa base weights.
from transformers import create_optimizer, AdamWeightDecay
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
from transformers import TFAutoModelForMaskedLM
model = TFAutoModelForMaskedLM.from_pretrained("distilroberta-base")
We then prepare TensorFlow datasets (tf_train_set and tf_test_set) from tokenized_quora for training and evaluation, respectively. The training set is shuffled (shuffle=True) and the test set is not (shuffle=False). Each batch contains 16 examples, and the data_collator function is used to collate (and mask) the batches.
tf_train_set = model.prepare_tf_dataset(
    tokenized_quora["train"],  # the tokenized dataset prepared above
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)
tf_test_set = model.prepare_tf_dataset(
    tokenized_quora["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
We then compile the model with the optimizer defined above and train it using the fit method, with tf_train_set as training data and tf_test_set as validation data. Transformers models compute an appropriate loss internally, which is why compile gets no loss argument. The model is trained for 3 epochs, and the PushToHubCallback callback saves the model and tokenizer and pushes them to the Hugging Face Hub.
import tensorflow as tf
model.compile(optimizer=optimizer) # No loss argument!
from transformers.keras_callbacks import PushToHubCallback
callback = PushToHubCallback(
    output_dir="my_awesome_quora_mlm_model",
    tokenizer=tokenizer,
)
model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback])
For inference, we use the pipeline function from the Transformers library to create a fill-mask pipeline, which can fill in masked tokens in a given text. The mask_filler pipeline is initialized with the fine-tuned MLM model (my_awesome_quora_mlm_model). Note that the input text must contain the <mask> placeholder. When mask_filler is called with the text and top_k=3, it predicts the 3 most likely tokens to fill the <mask>.
text = "UBIAI is a company."
from transformers import pipeline
mask_filler = pipeline("fill-mask", "my_awesome_quora_mlm_model")
mask_filler(text, top_k=3)
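Each candidate comes back as a dictionary; the output should look something like this (the scores and tokens depend on your fine-tuned model):

[{'score': ..., 'token': ..., 'token_str': '...', 'sequence': 'UBIAI is a ... company.'},
 ...]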
We initialize a tokenizer for the fine-tuned MLM model (my_awesome_quora_mlm_model), use it to tokenize the text input into TensorFlow tensors (inputs), and then locate the index of the masked token (<mask>) in the tokenized input (mask_token_index) for further processing.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("my_awesome_quora_mlm_model")
inputs = tokenizer(text, return_tensors="tf")
mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]  # position of the first <mask> in the sequence
Now, we run the model to compute the logits at the masked position, extract the 3 most probable tokens from those logits, and print the text with the mask replaced by each of the top 3 candidates.
from transformers import TFAutoModelForMaskedLM
model = TFAutoModelForMaskedLM.from_pretrained("my_awesome_quora_mlm_model")
logits = model(**inputs).logits
mask_token_logits = logits[0, mask_token_index, :]
top_3_tokens = tf.math.top_k(mask_token_logits, 3).indices.numpy()
for token in top_3_tokens:
    print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))
So for the text "UBIAI is a <mask> company.", the loop prints the three candidate completions.
In conclusion, this article has provided a comprehensive overview of Masked Language Modeling (MLM) using TensorFlow and the Transformers library. We’ve explored how MLM can be used to train language models to predict missing words in a sentence, leading to significant advancements in natural language processing tasks. Through practical examples and code snippets, we’ve demonstrated how to implement and train an MLM model using TensorFlow, fine-tune it on a custom dataset like Quora, and utilize it for text generation and prediction tasks.