In this article, we dive into Masked Language Modeling (MLM) and explore how TensorFlow, Google’s open-source machine learning framework, can be leveraged to implement powerful MLM models. We’ll discuss the fundamentals of MLM, its applications in NLP, and step-by-step instructions to build and train an MLM model using TensorFlow.
So let’s dive in!
Masked Language Modeling (MLM) is a widely used deep learning method in Natural Language Processing (NLP), especially for pretraining Transformer encoders like BERT and RoBERTa (GPT-2, by contrast, is trained with a causal language modeling objective, not MLM). In MLM, a portion of the input tokens is masked, i.e. replaced with a special token ([MASK]), and the model is trained to predict the original tokens from the surrounding context. This approach helps the model learn the context and relationships between words in a sentence. MLM is self-supervised, meaning the labels come from the input text itself, which makes it a versatile pretraining step for downstream tasks like text classification and question answering.
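To see the objective in action before any fine-tuning, here is a minimal sketch (assuming the transformers library is installed) that asks a pretrained distilroberta-base checkpoint, the same one we fine-tune below, to fill in a masked token:

from transformers import pipeline

# distilroberta-base marks masked positions with the special <mask> token.
fill = pipeline("fill-mask", model="distilroberta-base")

# The pipeline predicts the most likely tokens for the masked position;
# the top prediction here should be something like " Paris".
print(fill("The capital of France is <mask>.", top_k=1))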
!pip install transformers datasets evaluate
An optional step is to connect to your Hugging Face account so that you can later push the fine-tuned model to the Hub.
from huggingface_hub import notebook_login
notebook_login()
Now we’ll load a subset of the Quora dataset using the load_dataset function from the datasets library. This subset contains the first 500 examples of the training split, and the loaded dataset is stored in the quora variable.
from datasets import load_dataset
quora = load_dataset("quora", split="train[:500]")
quora
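If everything loaded correctly, this should print a summary along these lines (the Quora dataset stores each pair of questions in a nested questions field, plus a duplicate label):

Dataset({
    features: ['questions', 'is_duplicate'],
    num_rows: 500
})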
We’ll split the quora dataset into training and test sets using the train_test_split method, allocating 20% of the data to the test set (test_size=0.2).
quora = quora.train_test_split(test_size=0.2)
quora["train"][0]
Here, we’re using the AutoTokenizer class from the transformers library to load a tokenizer for the DistilRoBERTa model. This tokenizer is capable of processing text into tokens that can be fed into the model.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
We then flatten the quora dataset. Each example stores both questions in a nested questions field; flattening promotes those nested fields to top-level columns (questions.id and questions.text), which makes them easier to work with in the preprocessing step below.
quora = quora.flatten()
quora["train"][0]
Now, we define a preprocess_function that tokenizes the text using the tokenizer initialized earlier. It takes a batch of examples, joins the two questions in each example into a single string, and tokenizes the result. The map method applies this function to the quora dataset, processing it in batches across four worker processes (num_proc=4). The remove_columns argument drops the original text columns from the processed dataset.
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["questions.text"]])
tokenized_quora = quora.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=quora["train"].column_names,
)
We set up a DataCollatorForLanguageModeling object from the Hugging Face Transformers library. By specifying mlm_probability=0.15, we configure it to randomly mask 15% of the tokens in each input sequence during training: the collator takes the tokenized input, replaces randomly selected token IDs with the mask token, and keeps the original IDs as the labels the model has to predict.
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf")
For example, if we tokenize this input text:
Hello, how are you?
the tokenizer returns a sequence of token IDs. If we then run that sequence through the collator, the IDs at the masked positions are replaced with the tokenizer’s mask token ID. The exact values depend on the tokenizer; for distilroberta-base, tokenizer.mask_token_id is 50264, so that value is what marks a masked position in the output.
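You can reproduce this yourself with a small sketch (it assumes the tokenizer and data_collator defined above; note that which positions get masked is random, so rerun it if nothing gets masked):

# Tokenize a single sentence and run it through the collator.
sample = tokenizer("Hello, how are you?")
batch = data_collator([sample])

# input_ids: some positions are replaced with tokenizer.mask_token_id.
print(batch["input_ids"])

# labels: the original IDs at the masked positions, -100 everywhere else
# (positions with -100 are ignored by the loss).
print(batch["labels"])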
Another approach to annotation is to use the UBIAI annotation tool. Start by importing the document you want to annotate from the Documents section.
Then go to Settings and set up a new entity; for this example, the entity will represent the masked token.
Now all you have to do is navigate to the annotation section and start annotating the document.
UBIAI offers a visual representation of the annotation process, so every annotated word is visually distinguishable.
Finally, export your annotated data, and voilà!
The UBIAI annotation tool provides a user-friendly interface and numerous features, giving you flexibility in choosing which tokens to mask and letting you focus on the parts of the text that matter, which streamlines the annotation process.
We create an AdamW optimizer with a learning rate of 2e-5 and a weight decay rate of 0.01 for training the model. We then initialize a masked language model using the TFAutoModelForMaskedLM class from the transformers library, loading the pretrained DistilRoBERTa base weights.
from transformers import create_optimizer, AdamWeightDecay
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
from transformers import TFAutoModelForMaskedLM
model = TFAutoModelForMaskedLM.from_pretrained("distilroberta-base")
We then prepare TensorFlow datasets (tf_train_set and tf_test_set) from tokenized_quora for training and evaluation, respectively. The training set is shuffled (shuffle=True) and the test set is not (shuffle=False). Each batch contains 16 examples, and the data_collator function is used to collate (and mask) the batches.
tf_train_set = model.prepare_tf_dataset(
    tokenized_quora["train"],  # the tokenized dataset prepared above
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)
tf_test_set = model.prepare_tf_dataset(
    tokenized_quora["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
We then compile the model with the optimizer defined above and train it using the fit method, with tf_train_set as training data and tf_test_set as validation data. Transformers models compute an appropriate loss internally, which is why compile gets no loss argument. The model is trained for 3 epochs, and the PushToHubCallback callback saves the model and tokenizer and pushes them to the Hugging Face Hub.
import tensorflow as tf
model.compile(optimizer=optimizer) # No loss argument!
from transformers.keras_callbacks import PushToHubCallback
callback = PushToHubCallback(
    output_dir="my_awesome_quora_mlm_model",
    tokenizer=tokenizer,
)
model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback])
For inference, we use the pipeline function from the Transformers library to create a fill-mask pipeline, which can fill in masked tokens in a given text. The mask_filler pipeline is initialized with the fine-tuned MLM model (my_awesome_quora_mlm_model). Note that the input text must contain the <mask> placeholder. When mask_filler is called with the text and top_k=3, it predicts the 3 most likely tokens to fill the <mask>.
text = "UBIAI is a company."
from transformers import pipeline
mask_filler = pipeline("fill-mask", "my_awesome_quora_mlm_model")
mask_filler(text, top_k=3)
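Each candidate comes back as a dictionary; the output should look something like this (the scores and tokens depend on your fine-tuned model):

[{'score': ..., 'token': ..., 'token_str': '...', 'sequence': 'UBIAI is a ... company.'},
 ...]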
We initialize a tokenizer for the fine-tuned MLM model (my_awesome_quora_mlm_model), use it to tokenize the text input into TensorFlow tensors (inputs), and then locate the index of the masked token (<mask>) in the tokenized input (mask_token_index) for further processing.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("my_awesome_quora_mlm_model")
inputs = tokenizer(text, return_tensors="tf")
mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]  # position of the first <mask> in the sequence
Now, we run the model to compute the logits at the masked position, extract the 3 most probable tokens from those logits, and print the text with the mask replaced by each of the top 3 candidates.
from transformers import TFAutoModelForMaskedLM
model = TFAutoModelForMaskedLM.from_pretrained("my_awesome_quora_mlm_model")
logits = model(**inputs).logits
mask_token_logits = logits[0, mask_token_index, :]
top_3_tokens = tf.math.top_k(mask_token_logits, 3).indices.numpy()
for token in top_3_tokens:
    print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))
So for the text "UBIAI is a <mask> company.", the loop prints the three candidate completions.
In conclusion, this article has provided a comprehensive overview of Masked Language Modeling (MLM) using TensorFlow and the Transformers library. We’ve explored how MLM can be used to train language models to predict missing words in a sentence, leading to significant advancements in natural language processing tasks. Through practical examples and code snippets, we’ve demonstrated how to implement and train an MLM model using TensorFlow, fine-tune it on a custom dataset like Quora, and utilize it for text generation and prediction tasks.