
Fine-tune an LLM for agentic reasoning to demonstrate better performance than vanilla LLMs

Jan 21st, 2025


In this tutorial, we’ll guide you through building a reasoning agent: a conversational AI system capable of selecting the right specialized tools to answer specific questions. The core idea is to fine-tune two smaller LLaMA 3.2-3B models on reasoning-specific datasets for math and coding tasks. These fine-tuned models will then be integrated as tools, enabling a larger model to act as a smart agent that determines which model to use for a given query.

What Makes a Reasoning Agent Different?

Vanilla LLMs are impressive but struggle with reasoning-intensive tasks, especially in domains like math. These models can sometimes hallucinate information, provide irrelevant answers, or fail to follow logical reasoning steps, particularly in tasks that require step-by-step calculations, complex problem-solving, or code generation.

 

Here’s why they fall short:

  • Generalization Over Specificity: They are trained broadly, which limits their ability to excel in specialized scenarios.
  • Hallucinations: LLMs often generate incorrect information confidently.
  • Lack of Structured Problem-Solving: Complex reasoning tasks, like multi-step calculations or debugging code, require following logical steps, which vanilla models often overlook.

The Need for Fine-Tuning for Reasoning

To improve the performance of large language models (LLMs) on reasoning tasks, we use CoT-based fine-tuning, a process where the model is trained on reasoning-specific datasets (in our case, chain-of-thought datasets). These datasets are designed to help the model learn the steps needed to answer questions systematically, following a logical progression rather than jumping straight to a final answer.

What is Chain of Thought (CoT) Reasoning?


Chain of Thought reasoning is a technique where the model is trained to break down problems into intermediate steps, much like how humans think through complex problems.

 

For example:

Math Query:

“If a train travels at 60 km/h for 2 hours and 40 km/h for 1 hour, what is the total distance traveled?”
The model reasons as follows:

"For the first 2 hours at 60 km/h, the distance is 60×2=120 km."

"For the next 1 hour at 40 km/h, the distance is 40×1=40 km."

"Total distance traveled is 120+40=160 km."

Or

Coding Query:

"Write a Python function to calculate the factorial of a number."
The model reasons:

"The factorial of a number n is defined as the product of all positive integers less than or equal to n."
"For example, factorial(3) = 3×2×1 = 6."
"The code needs a loop to multiply the numbers from 1 to n."
[Generates Code].
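For illustration, the generated code might look something like this (a minimal sketch, not the model's literal output):

def factorial(n):
    """Return n! by multiplying the integers from 1 to n."""
    result = 1
    for i in range(1, n + 1):
        result *= i
    return result

print(factorial(3))  # 6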

CoT allows models to “think out loud,” significantly improving their problem-solving ability.

Fine-tune and evaluate your model with UbiAI

  • Prepare your high-quality training data
  • Train best-in-class LLMs: build domain-specific models that truly understand your context and fine-tune effortlessly, no coding required
  • Deploy with just a few clicks: go from a fine-tuned model to a live API endpoint with a single click
  • Optimize with confidence: unlock instant, scalable ROI by monitoring and analyzing model performance to ensure peak accuracy and tailored outcomes

The Limits of Fine-Tuning

Fine-tuning is undoubtedly a lifesaver when it comes to improving the performance of LLMs in specialized reasoning tasks. However, it has limitations: A fine-tuned model excels only in the domain it was trained on (e.g., math, coding, etc.), making it difficult to generalize across multiple domains.

What If You Need Multi-Domain Specialization?

This is where AI agent reasoning comes into play: a unique approach to multi-domain reasoning.

What Are AI Agents?

AI agents are advanced AI systems that have the ability to use outside tools to enhance their functionality. These tools can range from calculators to APIs, databases, or even smaller, fine-tuned models. By combining external resources, AI agents overcome the limitations of traditional LLMs. With function calling, AI agents can dynamically decide which tool to use based on the user’s query.

Our Approach

In this tutorial, we will combine fine-tuning with AI agents to build the ultimate reasoning agent. The idea is simple:

 

1. We’ll fine-tune two smaller LLaMA models:

  • Model 1 (Math): Trained on the AI-MO/NuminaMath-CoT dataset.
  • Model 2 (Coding): Trained on the DONG19/CoT_code_instruction_dataset.

This step teaches each model to reason step by step within its assigned domain.

 

2. Using function calling, we’ll integrate these two models as tools for a larger conversational AI system. This agent will analyze user queries, decide which specialized model to call, and provide enriched, accurate responses.

 

3. We’ll test the agent with various queries, showcasing its ability to outperform vanilla LLMs in reasoning tasks.

Part 1: Fine-Tuning the Smaller Models

Before building the AI agent, let’s quickly revisit the fine-tuning process for the smaller LLaMA models. If you’ve already followed my previous tutorials, you’ll be familiar with the steps. However, we’ll briefly go over the key points to ensure we’re on the same page.

 

To get started with fine-tuning, we first need to install the Unsloth library, which provides an easy interface for fine-tuning models. You can install it using pip:

				
					%%capture
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
				
			
				
					from unsloth import FastLanguageModel
import torch
				
			
				
max_seq_length = 2048  # maximum context length used during fine-tuning
dtype = None           # auto-detect (float16 on a T4, bfloat16 on newer GPUs)
load_in_4bit = True    # load the weights in 4-bit precision to fit limited GPU memory
				
			

Let’s pick a model. Since we’re working with limited resources, we’ll choose a quantized version of the LLaMA 3.2-3B model. The quantized model uses 4-bit precision, which reduces the model size and memory usage without compromising too much on performance. This makes it feasible to fine-tune the model without requiring massive computational resources.
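As a rough back-of-the-envelope check (a sketch only; it ignores activation memory, the KV cache, and quantization overhead):

params = 3_000_000_000        # ~3B parameters in Llama 3.2-3B
fp16_gb = params * 2 / 1e9    # 2 bytes per parameter in fp16  -> ~6 GB
int4_gb = params * 0.5 / 1e9  # 4 bits = 0.5 bytes per parameter -> ~1.5 GB
print(f"fp16 weights ~ {fp16_gb:.1f} GB, 4-bit weights ~ {int4_gb:.1f} GB")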

				
# Some of the 4-bit quantized models available from Unsloth (listed for reference)
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",
    "unsloth/Mistral-Small-Instruct-2409",
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",

    "unsloth/Llama-3.2-1B-bnb-4bit",
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
]

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,

)
				
			

To fine-tune the model efficiently, we will use LoRA (Low-Rank Adaptation), a method that updates only a small fraction of the model's parameters through low-rank adapter matrices. LoRA reduces the computational cost of fine-tuning and lets us focus on the parts of the model that matter most for the reasoning tasks.

				
					model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
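# Optional sanity check (a quick sketch): LoRA should leave only a small
# fraction of the weights trainable.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")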
				
			

For fine-tuning, we will use predefined prompts to map the input and output of the datasets. These prompts will ensure that the model is trained to generate reasoning steps for both math and coding tasks.

				
					alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Answer the following question:

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    inputs       = examples["instruction"]
    outputs      = examples["output"]
    texts = []
    for input, output in zip(inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("DONG19/CoT_code_instruction_dataset", split = "train")
# Keep only the examples whose "input" field is empty, so the prompt template needs a single input slot
dataset = dataset.filter(lambda example: example["input"].strip() == "")
				
			
				
					dataset = dataset.map(formatting_prompts_func, batched = True,)
				
			
				
dataset = dataset.select(range(5000))  # use a 5,000-example subset to keep training time manageable
				
			
				
					print(dataset['text'][11])
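# The printed record should look roughly like the filled-in template below
# (the content shown here is illustrative, not an actual row from the dataset):
#
#   Below is an instruction that describes a task, paired with an input that provides further context. ...
#   ### Instruction:
#   Answer the following question:
#   ### Input:
#   Write a function to check whether a string is a palindrome.
#   ### Response:
#   To check for a palindrome, compare the string with its reverse ... <|eot_id|>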
				
			

Once the data is prepared, we can set up the trainer and start the fine-tuning process. We will use a standard training loop to fine-tune the models on the math and coding datasets.

				
					from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 320,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
				
			
				
					trainer_stats = trainer.train()
				
			
				
					import os
os.environ['HF_token'] = ''
HF_token = os.getenv('HF_token')
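# Paste your Hugging Face write token above. In Colab you can also store it as a
# secret and read it instead of hard-coding it, for example:
#   from google.colab import userdata
#   HF_token = userdata.get('HF_TOKEN')  # assumes you created a secret named HF_TOKEN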
				
			

After training, we’ll save the fine-tuned models to Hugging Face for easy integration into our conversational agent.

				
					model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
model.push_to_hub("melekmessoussi/Coding_Model_LoRA", token = HF_token)
tokenizer.push_to_hub("melekmessoussi/Coding_Model_LoRA", token = HF_token)
				
			

Before moving on to Part 2, make sure to repeat this first part for as many models (tools) as you want the agent to use; a sketch of the changes needed for the math model is shown below. I only built two as a proof of concept, but you can go wild and have fun with it.
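For the math model, the pipeline is identical; only the dataset and the prompt mapping change. A minimal sketch (assuming the AI-MO/NuminaMath-CoT columns are named "problem" and "solution"; check the dataset card and adjust if they differ):

from datasets import load_dataset

math_dataset = load_dataset("AI-MO/NuminaMath-CoT", split="train")

def format_math_prompts(examples):
    texts = []
    for problem, solution in zip(examples["problem"], examples["solution"]):
        # The EOS token is still required so generation knows where to stop
        texts.append(alpaca_prompt.format(problem, solution) + EOS_TOKEN)
    return {"text": texts}

math_dataset = math_dataset.map(format_math_prompts, batched=True)
math_dataset = math_dataset.select(range(5000))  # keep a subset, as with the coding data

Then re-run the same SFTTrainer setup on math_dataset and push the result (for example as melekmessoussi/Math_Model_LoRA, the name used in Part 2).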

Part 2: Building the AI Agent with Function Calling

Now that we have fine-tuned our models, let’s move on to building the conversational AI agent that will use the fine-tuned models as tools to answer user queries.

				
					from unsloth import FastLanguageModel
from transformers import TextStreamer
				
			
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!

Importing the Models

First, we need to import the fine-tuned models. We will use FastLanguageModel to load the models efficiently.

				
					Math = "melekmessoussi/Math_Model_LoRA"
Code = "melekmessoussi/Coding_Model_LoRA"
				
			
				
					Code_model, Code_tokenizer = FastLanguageModel.from_pretrained(
    model_name= Code,
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

FastLanguageModel.for_inference(Code_model)
				
			
==((====))== Unsloth 2025.1.5: Fast Llama patching. Transformers 4.47.1. GPU: Tesla T4 (14.7 GB). Torch 2.5.1+cu121, CUDA 12.1, Triton 3.1.0, Xformers 0.0.29.post1.
PeftModelForCausalLM( ... LlamaForCausalLM with LoRA (r=16) adapters attached to q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj in all 28 decoder layers ... )
				
					Math_model, Math_tokenizer = FastLanguageModel.from_pretrained(
    model_name= Math,
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)


FastLanguageModel.for_inference(Math_model)
				
			
(The same Unsloth banner and PeftModelForCausalLM summary are printed again, this time for the math model.)

Preparing the Tools (Models)

We need to define the tools that the AI agent will use. These tools are functions that will call the fine-tuned models for math and code reasoning. We will create helper functions for math and code queries that utilize the models to generate responses.

				
					alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Answer the following question:

### Input:
{}

### Response:
{}"""

				
			
				
def parse_response(output):
    # Keep only the text between the "### Response:" marker and the end-of-turn token
    if isinstance(output, list):
        output = output[0]
    start_marker = "### Response:\n"
    end_marker = "<|eot_id|>"
    if start_marker in output and end_marker in output:
        start_index = output.index(start_marker) + len(start_marker)
        end_index = output.index(end_marker)
        response = output[start_index:end_index].strip()
        return response
    return None
				
			
				
					def get_Coding_answers(question, alpaca_prompt=alpaca_prompt, model=Code_model, tokenizer=Code_tokenizer):

    inputs = tokenizer(
        [
            alpaca_prompt.format(
                question,
                "",
            )
        ], return_tensors="pt"
    ).to("cuda")

    outputs = model.generate(**inputs, max_new_tokens = 128, temperature = 0.1, use_cache = True)
    decoded = tokenizer.batch_decode(outputs)
    return parse_response(decoded)
				
			
				
					def get_Math_answers(question, alpaca_prompt=alpaca_prompt, model=Math_model, tokenizer=Math_tokenizer):


    inputs = tokenizer(
        [
            alpaca_prompt.format(
                question,
                "",
            )
        ], return_tensors="pt"
    ).to("cuda")

    outputs = model.generate(**inputs, max_new_tokens = 128, temperature = 0.1, use_cache = True)
    decoded = tokenizer.batch_decode(outputs)
    return parse_response(decoded)
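# Optional: try the tools directly before wiring them into the agent, to confirm each
# fine-tuned model returns a clean reasoning chain (the questions below are just test queries).
print(get_Math_answers("What is 15% of 200?"))
print(get_Coding_answers("Write a Python function that reverses a string."))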
				
			
				
Coding_tool = {
    "type": "function",
    "function": {
        "name": "get_Coding_answers",
        "description": "Provides coding solutions and answers to coding queries.",
        "parameters": {
            "type": "object",
            "properties": {
                "question": {
                    "type": "string",
                    "description": "The coding question or topic that needs an answer."
                }
            },
            "required": ["question"]
        }
    }
}

Math_tool = {
    "type": "function",
    "function": {
        "name": "get_Math_answers",
        "description": "Provides math solutions and answers to math-related queries.",
        "parameters": {
            "type": "object",
            "properties": {
                "question": {
                    "type": "string",
                    "description": "The math question or topic that needs an answer."
                }
            },
            "required": ["question"]
        }
    }
}
				
			

Creating the Conversational AI Agent

Now, let’s create the main logic for our conversational agent. The agent will listen for user input, determine whether a function call is needed, and use the appropriate model to answer the question.

 

The AI agent will not return the fine-tuned model's answer directly; instead, it uses that answer as the basis for its own response, very much like how RAG works.

 

What is important to understand here is that the fine-tuned models do the reasoning, while the larger agent model uses that reasoning chain to answer the user's query.

				
					!pip install rich
!pip install groq
				
			
				
					import os
from groq import Groq
from google.colab import userdata
import json
from rich import print
				
			
				
					client = Groq(api_key = '')
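# Paste your Groq API key above, or (in Colab) store it as a secret and read it with the
# userdata helper imported earlier, for example:
#   client = Groq(api_key=userdata.get('GROQ_API_KEY'))  # assumes a secret named GROQ_API_KEY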
				
			
				
					history = [
    {
        "role": "system",
        "content": "You are a chat assistant with specialized tools for answering questions in the fields of Coding and Math. only answer math related questions using the get_Math_answers tool and coding instructions using the get_Coding_answers tool. rely only on the results from the tools to answer then give a short direct answer."
    }
]

tools = [Coding_tool, Math_tool]
MODEL = "llama-3.2-90b-vision-preview"
				
			

Let’s set up our agent and test it.

				
					while True:
    user_input = input("You: ")
    if user_input.lower() == "exit":
        print("Ending conversation. Goodbye!")
        break

    history.append({"role": "user", "content": user_input})

    response = client.chat.completions.create(
        model=MODEL,
        messages=history,
        stream=False,
        tools=tools,
        tool_choice="auto",
        temperature=0,
        max_tokens=4096
    )

    response_message = response.choices[0].message
    tool_calls = response_message.tool_calls
    history.append({"role": "assistant", "content": response_message.content})

    if response_message.content is not None:
      print(f"[red]Assistant: {response_message.content}[/red]")

    if tool_calls:
        available_functions = {
            "get_Coding_answers": get_Coding_answers,
            "get_Math_answers": get_Math_answers,
        }

        for tool_call in tool_calls:
            function_name = tool_call.function.name
            function_to_call = available_functions.get(function_name)

            if function_to_call:
                function_args = json.loads(tool_call.function.arguments)
                function_response = function_to_call(
                    question=function_args.get("question")
                )

                history.append({
                    "role": "tool",
                    "name": function_name,
                    "content": function_response,
                    "tool_call_id": tool_call.id
                })

                print(f"[blue][{function_name} Tool]: {function_response}[/blue]")

        second_response = client.chat.completions.create(
            model=MODEL,
            temperature=0,
            messages=history
        )

        final_response = second_response.choices[0].message.content
        print(f"[green]Assistant: {final_response}[/green]")
        history.append({"role": "assistant", "content": final_response})
				
			
You: hello 
Assistant: I'm here to help with Math and Coding questions. What's on your mind?
You: Stella wanted to buy a new dress for the upcoming dance. At the store, she found out that the dress she wanted was $50. The store was offering a certain discount on everything in the store, and the final cost of the dress was $35. What was the percentage of the discount offered by the store?
[get_Math_answers Tool]: The discount on the item is $50 - $35 = $15. To find the percentage discount, divide the discount by the original price and multiply by 100:

\[
\left(\frac{\$15}{\$50}\right) \times 100\% = 30\%
\]

Thus, the response is $\boxed{30\%}$.
Assistant: The store offered a 30% discount.
You: Create a function that takes an array as an argument and returns the sum of all the elements in the array.
[get_Coding_answers Tool]: To complete the task, you need to create a function that accepts an array as an argument. Within the function, you should calculate the sum of all the elements in the array. Finally, you should return the calculated sum.
Assistant: def sum_array_elements(array):
    return sum(array)
You: How many different triangles can be formed having a perimeter of 7 units if each side must have integral length?
[get_Math_answers Tool]: There are 2 different triangles that can be formed with a perimeter of 7 units. The triangles are $\{1, 2, 4\}$ and $\{2, 2, 3\}$.

Thus, the final answer is $\boxed{2}$.
Assistant: There are 2 different triangles that can be formed with a perimeter of 7 units.
You: exit
Ending conversation. Goodbye!

The agent should be able to select the right model for math or coding questions and provide detailed, reasoned answers. As you can see from our test, that is exactly what happened for the questions I asked:

  • When the agent is not asked anything related to math or coding, it answers directly.

 

  • When it is asked a math-related question, it calls the get_Math_answers function to get the reasoning for the answer (the blue text is the reasoning from that function).

 

  • When it is asked a coding-related question, it calls the get_Coding_answers function to get the reasoning for the answer (the blue text is the reasoning from that function).

For comparison, let's now run the same questions through the vanilla model, with no tools attached:
				
					history = []

while True:
    user_input = input("You: ")

    if user_input.lower() == "exit":
        break

    history.append({"role": "user", "content": user_input})

    chat_completion = client.chat.completions.create(
        messages=history,
        model="llama-3.2-90b-vision-preview",
    )

    assistant_response = chat_completion.choices[0].message.content
    print(f"Assistant: {assistant_response}")

    history.append({"role": "assistant", "content": assistant_response})
				
			
You: Stella wanted to buy a new dress for the upcoming dance. At the store, she found out that the dress she wanted was $50. The store was offering a certain discount on everything in the store, and the final cost of the dress was $35. What was the percentage of the discount offered by the store?
Assistant: 15 percent or 15%.
You: are you sure?
Assistant: original price of the dress: $50
discounted price of the dress: $35
difference: $15
percentage of discount: 30%
You: okay make me a function that will sum up 2 numbers
Assistant: ```python
def sum_numbers(a, b):
    return a + b
```

You can use this function like this:

```python
result = sum_numbers(2, 3)
print(result)  # Outputs: 5
```
You: exit

As you can see, the vanilla model got the math question wrong on its first attempt, illustrating how weak vanilla LLMs are at this type of reasoning. AI agents might just be the future of reasoning in AI, providing a modular and scalable way to handle multi-domain tasks. By combining smaller, specialized models through function calling, these agents achieve a level of versatility and accuracy that clearly surpasses vanilla LLMs, making them an ideal approach for reasoning-intensive applications.
