Mistral AI Introduces Mixtral 8x7B: A Revolutionary AI Model
Mar 11th, 2024
Mistral AI stands at the forefront of innovation, dedicated to providing the developer community with cutting-edge open models. In the dynamic landscape of artificial intelligence, progress demands more than mere replication of established architectures and training methods. True advancement necessitates the introduction of novel models that not only push the boundaries of performance but also empower the community to explore new avenues of invention and application.
In line with this commitment, Mistral AI proudly announces the release of Mixtral 8x7B, a groundbreaking sparse mixture-of-experts model (SMoE) with open weights, released under the permissive Apache 2.0 license.
In this article, we will dive into Mixtral 8x7B and demonstrate its capabilities in real-life scenarios. Specifically, we will cover:
1- What is Mixtral 8x7B
2- Performance compared to state of the art
3- Mixtral 8x7B for Question Answering
3.1 – Configuration
3.2 – Dependencies
3.3 – Model Initialization
3.4 – Inference
4- Mixtral 8x7B for Text Classification
4.1 – Required packages
4.2 – Load Dataset
4.3 – Load model from Hugging Face
4.4 – Create a template for the LLM
4.5 – Inference with LLMChain
5- Conclusion
1- What is Mixtral 8x7B:
Mixtral represents a significant leap forward in AI capabilities. It surpasses its predecessors, including Llama 2 70B, setting new benchmarks with roughly six times faster inference while maintaining superior quality. It emerges as the premier open-weight model, striking an optimal balance between cost and performance, and it matches or outperforms GPT-3.5, the generative model developed by OpenAI, across most standard benchmarks.
Mixtral showcases a range of capabilities that position it as a transformative force in the AI domain. It gracefully handles contexts of up to 32k tokens, giving users substantial capacity for comprehensive analysis and generation. Catering to diverse linguistic needs, it supports English, French, Italian, German, and Spanish, ensuring versatility and accessibility across global communities. The model also demonstrates strong proficiency in code generation, helping developers streamline their workflows and enhance productivity. Beyond these inherent capabilities, it can be fine-tuned into specialized models tailored to specific tasks; notably, the instruction-following variant achieves an impressive score of 8.3 on MT-Bench.
2- Performance compared to state-of-the-art:
Mixtral matches or even surpasses the performance of Llama 2 70B, as well as GPT-3.5, across a wide array of benchmarks.
In assessing the quality-versus-inference-budget tradeoff, it becomes evident that Mistral 7B and Mixtral 8x7B form a family of highly efficient models when contrasted with their Llama 2 counterparts. Characterized by strong performance and optimized resource utilization, these models underscore Mistral AI’s commitment to delivering cutting-edge solutions that strike an optimal balance between quality and computational efficiency.
3- Mixtral 8x7B for Question Answering:
3.1 Configuration:
In the Python class CFG, we define the chatbot’s configuration. The boolean variable is_interactive is set to False by default, indicating non-interactive mode, and the list variable prompts contains example queries the chatbot will handle.
class CFG:
    ## Whether to interact with the chatbot interactively
    is_interactive = False

    prompts = [
        "Could you build an IMDB text classifier using Tensorflow?",
        "Could you build an IMDB text classifier using Pytorch?"
    ]
Here, we use Git to clone the mixtral-offloading repository from GitHub, which contains the offloading code for Mixtral. We then move into the cloned repository and install the dependencies listed in requirements.txt with pip, along with the triton library, which the offloading functionality likely relies on. Finally, we use the Hugging Face CLI to quietly download the lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo model into a local directory named Mixtral-8x7B-Instruct-v0.1-offloading-demo.
!git clone https://github.com/dvmazur/mixtral-offloading
!cd mixtral-offloading && pip install -q -r requirements.txt
!pip install triton
!huggingface-cli download lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo --quiet --local-dir Mixtral-8x7B-Instruct-v0.1-offloading-demo
3.2 Dependencies:
We first import the os module to set the CUDA_VISIBLE_DEVICES environment variable, specifying which GPU devices should be visible to CUDA. Here, we’ve set it to devices 0 and 1. We also import the sys module for system-specific parameters and functions, and the time module to track time. We append the path to the mixtral-offloading directory to the Python path using sys.path.append() to enable importing modules from this directory.
Next, we import necessary modules from PyTorch, Hugging Face’s transformers library, and other relevant packages. Specifically, we import classes and functions related to quantization (BaseQuantizeConfig), offloading configuration (OffloadConfig), and model building (build_model). Additionally, we import utilities for logging and progress tracking (hf_logging, trange from tqdm.auto).
This setup enables us to configure GPU visibility, import required modules, and set up the necessary components for model building and quantization.
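A minimal sketch of these imports, based on the description above, might look like the following. The exact module paths for BaseQuantizeConfig, OffloadConfig, QuantConfig, and build_model are assumptions based on the layout of the mixtral-offloading repository and the hqq library.
import os

# Make GPU devices 0 and 1 visible to CUDA
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import sys
import time

# Allow importing modules from the cloned repository
sys.path.append("mixtral-offloading")

import torch
from transformers import AutoConfig, AutoTokenizer, TextStreamer
from transformers.utils import logging as hf_logging
from tqdm.auto import trange

# Quantization and offloading utilities (module paths assumed from the repository layout)
from hqq.core.quantize import BaseQuantizeConfig
from src.build_model import OffloadConfig, QuantConfig, build_model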
3.3 Model initialization:
Now, we define several variables and configurations for model setup and quantization:
model_name: Specifies the name of the original Mixtral model to be used.
quantized_model_name: Specifies the name of the quantized Mixtral model for offloading.
state_path: Specifies the directory path where the quantized model’s state is stored.
We then initialize configuration objects for offloading, attention layers, and feed-forward networks, specifying parameters such as the number of bits for quantization, group size, and whether to quantize zero values and scales. Additionally, we configure offloading parameters such as the main and offload sizes per layer, as well as buffer size.
The tokenizer object is created using the AutoTokenizer class from Hugging Face’s transformers library, initialized with the specified model_name.
Finally, the build_model function is called with the provided device, quantization configuration, offload configuration, and state path.
model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
quantized_model_name = "lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo"
state_path = "Mixtral-8x7B-Instruct-v0.1-offloading-demo"
config = AutoConfig.from_pretrained(quantized_model_name)
device = torch.device("cuda")
##### Change this to 5 if you have only 12 GB of GPU VRAM #####
offload_per_layer = 4
num_experts = config.num_local_experts
offload_config = OffloadConfig(
    main_size=config.num_hidden_layers * (num_experts - offload_per_layer),
    offload_size=config.num_hidden_layers * offload_per_layer,
    buffer_size=4,
    offload_per_layer=offload_per_layer,
)
attn_config = BaseQuantizeConfig(
    nbits=4,
    group_size=64,
    quant_zero=True,
    quant_scale=True,
)
attn_config["scale_quant_params"]["group_size"] = 256

ffn_config = BaseQuantizeConfig(
    nbits=2,
    group_size=16,
    quant_zero=True,
    quant_scale=True,
)
quant_config = QuantConfig(ffn_config=ffn_config, attn_config=attn_config)

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = build_model(
    device=device,
    quant_config=quant_config,
    offload_config=offload_config,
    state_path=state_path,
)
This call constructs and returns the offloaded Mixtral model, ready for use on the specified device.
3.4 Inference:
Now let’s run the model. This is done in a loop that first checks whether the chatbot is operating interactively or using predefined prompts. If interactive mode is enabled, it asks the user for input; otherwise, it works through the prompts stored in the CFG.prompts list.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

past_key_values = None
sequence = None
seq_len = 0
i = 0

while True:
    if CFG.is_interactive:
        print("User: ", end="")
        user_input = input()
    else:
        print("User: ", end="")
        if i >= len(CFG.prompts):
            break
        user_input = CFG.prompts[i]
        i += 1
        print(f"{user_input}\n")
Once the user input is obtained, it’s tokenized using the tokenizer provided. If this is the first iteration or there are no previous key values, an attention mask is created to cover the input sequence length. Otherwise, it computes the new sequence length based on the input size and the previous key values.
    user_entry = dict(role="user", content=user_input)
    input_ids = tokenizer.apply_chat_template([user_entry], return_tensors="pt").to(device)

    if past_key_values is None:
        attention_mask = torch.ones_like(input_ids)
    else:
        seq_len = input_ids.size(1) + past_key_values[0][0][0].size(1)
        attention_mask = torch.ones([1, seq_len - 1], dtype=torch.int, device=device)

    print("Mixtral: ", end="")
    begin = time.time()
The Mixtral model then generates a response based on the provided input, utilizing the generate method. Various generation parameters are specified, including temperature, top-p sampling, and maximum token length. The elapsed time for generating the response is recorded and displayed.
    result = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        past_key_values=past_key_values,
        streamer=streamer,
        do_sample=True,
        temperature=0.9,
        top_p=0.9,
        max_new_tokens=1024,
        pad_token_id=tokenizer.eos_token_id,
        return_dict_in_generate=True,
        output_hidden_states=True,
    )
    elapsed = time.time() - begin
    print(f"Elapsed time: {elapsed:.2f}s")
    print("\n")

    sequence = result["sequences"]
    past_key_values = result["past_key_values"]
Example 1:
Q: Could you build an IMDB text classifier using Tensorflow?
Example 2:
Q: Could you build an IMDB text classifier using Pytorch?
4- Mixtral 8x7B for Text Classification:
For this section, we will be using the Text Document Classification dataset: https://www.kaggle.com/datasets/sunilthite/text-document-classification-dataset.
4.1 - Required packages:
We start by installing the required packages. It is worth noting that we will be using LangChain in this example, as we believe it is becoming increasingly popular for working with LLMs.
!pip install langchain
We begin by importing the necessary libraries for data manipulation, including pandas for handling data frames and numpy for numerical computations. We also suppress any warning messages for cleaner output.
Additionally, we import the HuggingFaceHub and LLMChain classes from the langchain framework. These facilitate interaction with Hugging Face’s model hub and enable the creation of language generation chains.
Furthermore, we import the PromptTemplate module for constructing prompts. Lastly, we import the os module to handle the saving of the Hugging Face access token in the environment.
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
from langchain import HuggingFaceHub
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
import os # To save hugging face access token in environment
4.2 - Load dataset:
data = pd.read_csv("/kaggle/input/text-document-classification-dataset/df_file.csv")
data.head()
# update the data by decoding the integer labels in the 'Label' column into categorical labels for easier inference
vis_df = data
vis_df['Label'] = vis_df['Label'].map({0: 'Politics', 1: 'Sport', 2: 'Technology', 3: 'Entertainment', 4: 'Business'})
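As a quick sanity check, you can confirm that the mapping produced the expected categorical labels:
# Peek at the decoded labels and their distribution
vis_df['Label'].value_counts()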
4.3 - Load model from Hugging Face Hub:
The parameters for the model are:
- Temperature: Temperature in text generation refers to the degree of randomness or diversity in the generated outputs. Higher temperatures result in more diverse and creative responses, but they also increase the likelihood of the model straying from the context provided. Conversely, lower temperatures produce more focused and deterministic responses, as the model tends to stick closely to the most likely prediction. A temperature of 0 means that only the most likely tokens are selected, eliminating randomness entirely.
- Top-p Sampling: Top-p sampling is a method used to select tokens during text generation based on their probabilities. It considers the cumulative probability of tokens until it reaches a predefined threshold, typically denoted as “p”. This approach limits the number of choices available to the model, helping to prevent overly diverse or nonsensical outputs by focusing on the most likely tokens.
- Max New Tokens: Max new tokens represent the maximum number of tokens to generate in the output sequence, excluding those present in the prompt. This parameter determines the size of the generated response. It serves as a practical limit to the length of the generated text, ensuring that the output remains manageable and relevant.
- Repetition Penalty: Repetition penalty is a technique employed to discourage the generation of tokens that have recently appeared in the generated text. By penalizing or reducing the probability of repeating tokens, the model is encouraged to produce more diverse and non-repetitive outputs. This helps to enhance the overall quality and coherence of the generated text by promoting variety and avoiding redundancy.
os.environ['HUGGINGFACEHUB_API_TOKEN'] = 'your_hugging_hub_token'
llm = HuggingFaceHub(
    repo_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
    model_kwargs={
        "temperature": 0.9,
        "top_p": 0.95,
        "max_new_tokens": 250,
        "repetition_penalty": 1.1,
    },
)
4.4 - Create a Template for LLM Prompt:
The purpose of the template is to classify news text into predefined categories: Politics, Sport, Technology, Entertainment, or Business. The template instructs the model to refrain from writing code and to return a single-word answer indicating the category to which the given news text belongs.
template = """
Act as a highly intelligent news chatbot and classify the given news text into one of the following categories only 1. Politics 2.Sport 3.Technology 4.Entertainment 5.Business
Do not code. Return only one word answer with only the category name that the given news text belongs to
News text: {news}
"""
prompt = PromptTemplate(input_variables=["news"], template=template)
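Before wiring the prompt into a chain, it can help to render the template for a sample text and inspect exactly what the model will receive; the headline below is purely illustrative:
# Render the template with a hypothetical news snippet to inspect the final prompt
sample_news = "The government announced new election reforms on Tuesday."
print(prompt.format(news=sample_news))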
We initialize a language model chain, denoted as chain, using the LLMChain class. This chain incorporates a language model (llm) and a prompt template (prompt).
Additionally, a post-processing function named process_llm_output is defined to extract the category name from the LLM output. This function iterates through a list of predefined categories and compares them with the generated answer, returning the matched category. This ensures that only the category name is extracted from the LLM output, disregarding any additional tokens generated along with the answer.
chain = LLMChain(llm=llm, prompt=prompt)

# The predefined categories referenced by the post-processing function
categories = ['Politics', 'Sport', 'Technology', 'Entertainment', 'Business']

# post-processing function to extract only the category name from the LLM output,
# in case extra tokens are generated in addition to the answer
def process_llm_output(answer):
    for category in categories:
        if category.lower() in answer.lower():
            return category
4.5 - Inference with LLMChain:
data.iloc[0]['Text']
news_text = data.iloc[0]['Text']
chain.run(news_text)
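As a rough sketch of how the pieces fit together, the chain and the post-processing function can be applied to a handful of rows and compared against the decoded labels (the sample size of five is arbitrary):
# Classify the first few documents and compare predictions with the ground-truth labels
for idx in range(5):
    news_text = vis_df.iloc[idx]['Text']
    raw_answer = chain.run(news_text)
    predicted = process_llm_output(raw_answer)
    print(f"Predicted: {predicted} | Actual: {vis_df.iloc[idx]['Label']}")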
5- Conclusion:
In conclusion, this article discussed Mixtral 8x7B, a cutting-edge AI model known for its exceptional performance and versatility. By matching or outperforming state-of-the-art models across various benchmarks, Mixtral 8x7B demonstrates its strengths in tasks like question answering and text classification, with a straightforward implementation path. With its robust architecture and wide-ranging applications, Mixtral 8x7B stands poised to revolutionize AI solutions, driving progress and enabling breakthroughs across diverse domains.