

From Neural Nets to Transformers: How We Got to Fine-Tuning Large Language Models (LLMs)

APRIL 8th, 2025

The rise of Large Language Models (LLMs) has transformed the way we interact with technology, bringing artificial intelligence into everyday conversations and making it more accessible than ever before. What once seemed like science fiction is now a natural part of our routines, with machines capable of writing our emails, answering our questions, and performing tasks that were formerly reserved for humans. But this transformation did not happen overnight. The shift from simple rule-based systems to modern LLMs is the result of years of progress in the field of natural language processing (NLP).

To fully understand how we got to today’s era of fine-tuning LLMs, we need to trace the evolution of the underlying technologies, starting with neural networks and moving on to the breakthroughs that led to the transformer-based models we now adapt using techniques like supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and Low-Rank Adaptation (LoRA).

The Foundations: Neural Networks in NLP

Neural networks are computational models loosely inspired by the structure of the human brain. They are composed of layers of artificial neurons that transform their input through a series of mathematical operations. During training, a neural network learns to recognize patterns in data by adjusting the weights of the connections between neurons, which allows the model to make predictions or decisions on new data.
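To make "adjusting the weights during training" concrete, here is a minimal sketch of a single linear neuron trained with gradient descent. The toy task (recovering y = 2*x1 - 3*x2), the learning rate, and the number of steps are assumptions chosen for illustration, not something taken from a real model.

```python
import numpy as np

# Hypothetical toy task: learn y = 2*x1 - 3*x2 from 100 examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -3.0])

w = np.zeros(2)        # connection weights, adjusted during training
learning_rate = 0.1    # assumed value for this sketch

for _ in range(200):
    pred = X @ w                          # forward pass: current predictions
    grad = 2 * X.T @ (pred - y) / len(y)  # gradient of the mean squared error
    w -= learning_rate * grad             # move the weights to reduce the error

print(w)  # ends up very close to [2, -3]
```

Real networks stack many such neurons with nonlinear activations, but the training loop follows the same pattern: predict, measure the error, and update the weights.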

In the early days of artificial intelligence, neural networks were simple and could not effectively handle the complexity of sequential data such as language. While they were capable of image recognition and classification tasks, they struggled with tasks requiring a deeper understanding of context, such as language translation, question answering, or text generation.


Recurrent Neural Networks (RNNs) and LSTMs

Recurrent neural networks (RNNs) were created to address the issue of sequential dependencies in language. Unlike traditional Feedforward Neural Networks (FNNs), where information flows in one direction, RNNs have feedback loops that allow information to persist across time steps, enabling them to “remember” previous inputs and consider them when processing subsequent ones.
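The feedback loop can be sketched in a few lines of NumPy. This is a simplified Elman-style RNN under assumed sizes and random weights; the names `rnn_forward`, `W_xh`, `W_hh`, and `b_h` are illustrative and not tied to any particular library.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a simple RNN over a sequence and return the hidden state at every step."""
    h = np.zeros(W_hh.shape[0])  # initial hidden state ("memory")
    states = []
    for x_t in inputs:
        # The new state depends on the current input AND the previous state.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 4, 3, 5  # assumed toy dimensions
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
sequence = rng.normal(size=(seq_len, input_size))

hidden_states = rnn_forward(sequence, W_xh, W_hh, b_h)
print(hidden_states.shape)  # (5, 3): one hidden vector per time step
```

Because each hidden state is computed from the previous one, the sequence has to be processed one step at a time.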

Despite being popular and effective for handling sequential data, RNNs still faced two major challenges:

 

  • Vanishing and Exploding Gradients: When training an RNN, the gradients used to adjust the model’s weights can become too small (vanishing gradients) or too large (exploding gradients), especially when sequences are long.

  • Short-Term Memory: Even though RNNs have feedback loops designed to remember information, they still struggled with long-term dependencies; they had a hard time holding on to information over long sequences, like an entire paragraph or a full document, which limited their effectiveness in more complex tasks.
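A rough way to see why long sequences cause trouble: during backpropagation through time, the gradient is multiplied by a similar factor at every step. The sketch below uses a single scalar factor as a stand-in for that repeated multiplication, which is a deliberate simplification (in a real RNN the factor is a matrix that changes from step to step).

```python
def gradient_scale(per_step_factor, num_steps):
    """Size of a gradient after being multiplied by the same factor at every time step."""
    return per_step_factor ** num_steps

for steps in (5, 50):
    shrinking = gradient_scale(0.5, steps)  # factor below 1: the gradient vanishes
    growing = gradient_scale(1.5, steps)    # factor above 1: the gradient explodes
    print(steps, shrinking, growing)
```

Over 5 steps both values stay manageable, but over 50 steps the first collapses toward zero and the second grows to hundreds of millions, which is why information from early in a long sequence has so little influence on learning.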

 

These particular issues prompted the development of Long Short-Term Memory (LSTM) networks. LSTMs are a particular type of RNN that use memory cells and gates to selectively recall or forget information, allowing them to handle long-range dependencies better than traditional RNNs and reducing the risk of vanishing gradients.
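The gating idea can be sketched as a single LSTM step. This follows the commonly used formulation with forget, input, and output gates; the stacked weight layout and the function name `lstm_step` are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x_t] to four stacked gate pre-activations."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:n])          # forget gate: how much old memory to keep
    i = sigmoid(z[n:2 * n])      # input gate: how much new information to write
    o = sigmoid(z[2 * n:3 * n])  # output gate: how much memory to expose
    g = np.tanh(z[3 * n:4 * n])  # candidate values to write into memory
    c = f * c_prev + i * g       # updated memory cell
    h = o * np.tanh(c)           # updated hidden state
    return h, c

rng = np.random.default_rng(0)
hidden_size, input_size = 3, 2  # assumed toy dimensions
W = rng.normal(scale=0.5, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)
h, c = lstm_step(rng.normal(size=input_size), np.zeros(hidden_size), np.zeros(hidden_size), W, b)
print(h.shape, c.shape)  # (3,) (3,)
```

The memory cell is updated additively (`f * c_prev + i * g`), so when the forget gate stays close to 1 the stored information, and the gradient flowing through it, can survive many time steps.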


The Transformer Architecture

The transformer architecture marked a turning point in how we process language with artificial intelligence. Unlike previous models, transformers take a completely different approach to handling sequences, eliminating the need for recurrence and step-by-step processing. Instead, they process sequences using a mechanism known as self-attention, which makes it faster and easier to capture relationships between distant words.

Self-Attention and Multi-Head Attention

The key innovation of the transformer architecture is the self-attention mechanism, which enables the model to understand the relationships between different words in a sentence, no matter how far apart they are.

  • Words as Vectors: Each word in a sequence is converted into a vector—a numerical representation of the word’s meaning. These word vectors are then transformed into three types of vectors: Query (Q), Key (K), and Value (V).
 
  • Calculating Similarity: To understand how words relate to each other, the model compares the Query (Q) of one word with the Key (K) of every word in the sequence. This comparison produces a score that tells the model how much attention to give to each word based on its relevance to the current word.
 
  • Weighting Words: The scores (called attention scores) determine how much importance each word gets. In the sentence “The cat sat on the mat,” the word “cat” would pay more attention to “sat” because they are closely related in meaning. The attention score between “cat” and “sat” would be higher than between “cat” and “on.”
 
  • Final Word Representation: After calculating the attention scores, the model uses them to weight the Value (V) vectors. This means that words that are more relevant in context have a greater impact on the final representation of each word. For example, the representation of “cat” would draw more on the value of “sat” than on that of “on.”
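The four steps above can be sketched in NumPy as scaled dot-product self-attention. The softmax normalization and the division by the square root of the key dimension follow the standard transformer formulation; the random projection matrices and the toy dimensions are assumptions standing in for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X holds one row per word. Returns the new representations and the attention weights."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # words as Query, Key, and Value vectors
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # similarity of every query with every key
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights              # weighted sum of the value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 8, 4              # e.g. the six words of "The cat sat on the mat"
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, attention = self_attention(X, W_q, W_k, W_v)
print(output.shape, attention.shape)  # (6, 4) (6, 6)
```

Row i of `attention` says how much word i attends to every word in the sequence. With random weights the pattern is meaningless; in a trained model these are the scores that would link "cat" to "sat".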
 
 
 
Multi-head attention improves upon this mechanism by running several attention processes in parallel, each with its own learned weights. That way, each “head” captures different aspects of word relationships: one attention head might focus on syntactic structure, while another may capture semantic relationships. By combining the outputs of these heads, the transformer is able to build richer and more nuanced representations of the sequence.
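A minimal sketch of multi-head attention, assuming the usual approach of splitting the model dimension evenly across heads and mixing the concatenated results with an output projection. `W_o` and the toy dimensions are illustrative names and values, not taken from a specific model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # one score matrix per head
    heads = softmax(scores, axis=-1) @ V                 # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                  # combine the heads

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # (6, 8)
```

Each head works on its own slice of the projected vectors, so the heads can learn different attention patterns at roughly the same cost as one full-width head.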

Positional Encoding

 
Transformers process entire sequences of words all at once. This parallel processing creates a problem: self-attention on its own has no notion of word order, yet order is crucial to meaning in language. To deal with this, transformers use a mechanism called positional encoding. The idea is to add extra information to each word’s vector that tells the model where in the sequence the word lies. The positional encodings are often computed from sinusoidal functions of different frequencies, which gives every position its own distinct pattern.
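As a sketch, the sinusoidal variant can be computed as follows. It uses sine on even dimensions and cosine on odd dimensions with a base of 10000, as in the original transformer formulation, and it assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of position encodings (d_model must be even)."""
    positions = np.arange(seq_len)[:, None]   # 0, 1, 2, ... as a column
    dims = np.arange(0, d_model, 2)[None, :]  # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)              # even dimensions
    pe[:, 1::2] = np.cos(angles)              # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=8)
print(pe.shape)  # (10, 8)
print(pe[0])     # position 0: sin(0) = 0 and cos(0) = 1, alternating
```

The encoding is added to the word vectors before the first attention layer, so two identical words at different positions enter the model with different representations.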
 

Feedforward Networks and Normalization

After the self-attention step, transformers pass the representation of each word through a feedforward neural network. This network is applied to every position independently and adds a nonlinear transformation that further refines what attention has gathered. Layer normalization is also used to keep the values flowing through the network in a consistent range, which stabilizes training and improves overall performance.
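A minimal sketch of this sublayer follows. The residual connection (adding the input back before normalizing) and the ReLU activation follow the original transformer design but are not described in the text above, so treat them as assumptions. The learnable scale and shift of layer normalization are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row (one word) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between, applied to every position separately."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def ffn_sublayer(x, W1, b1, W2, b2):
    # Assumed post-norm arrangement: add the input back, then normalize.
    return layer_norm(x + feed_forward(x, W1, b1, W2, b2))

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 6, 8, 32  # assumed toy dimensions
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

y = ffn_sublayer(x, W1, b1, W2, b2)
print(y.shape)  # (6, 8)
```

Each output row has mean close to 0 and standard deviation close to 1, whatever the scale of the feedforward output, which is what keeps deep stacks of these layers trainable.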

Making LLMs Smarter and More Useful

Once we have a pretrained transformer model, like GPT, the next step is adapting it to specific tasks—a process known as fine-tuning. But how does LLM fine-tuning work?

What is Supervised Fine-Tuning (SFT)?

Supervised fine-tuning is one of the most common methods used to adapt a general-purpose language model to a domain-specific task. For instance, if you want to fine-tune a GPT-style model to write legal contracts, you would use a dataset of labeled examples (for example, contract prompts paired with the expected completions). The model then learns to produce outputs aligned with that structure.

This is what people refer to when they say they’re doing SFT: taking a pretrained model and training it further on labeled input-output pairs so that it performs well on the target task.
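In practice, the training signal in SFT is usually a next-token cross-entropy loss computed on the expected output. The sketch below assumes the common convention of masking out the prompt tokens so that only the completion contributes to the loss; the function name `sft_loss` and the toy token IDs are hypothetical.

```python
import numpy as np

def sft_loss(logits, target_ids, loss_mask):
    """Mean cross-entropy over the positions where loss_mask is 1."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_log_probs = log_probs[np.arange(len(target_ids)), target_ids]
    return -(token_log_probs * loss_mask).sum() / loss_mask.sum()

vocab_size = 5
target_ids = np.array([1, 3, 2, 4])  # hypothetical token IDs: prompt, prompt, completion, completion
loss_mask = np.array([0, 0, 1, 1])   # only completion tokens count toward the loss

# A model that predicts every token with equal probability scores log(vocab_size).
uniform_logits = np.zeros((4, vocab_size))
print(sft_loss(uniform_logits, target_ids, loss_mask))

# A model that is confident about the correct completion tokens scores close to 0.
confident_logits = np.zeros((4, vocab_size))
confident_logits[np.arange(4), target_ids] = 20.0
print(sft_loss(confident_logits, target_ids, loss_mask))
```

Fine-tuning lowers this loss on the labeled pairs, which moves the model toward producing the expected completions.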

Fine-Tuning LLaMA, Mistral, and Other Open Models

Open-source models like LLaMA and Mistral 7B have made it easier than ever to experiment with fine-tuning at scale. Whether you’re trying to fine-tune Mistral 7B for legal QA or fine-tune LLaMA for summarizing scientific papers, the principle is the same—customize a general model using your data.

 

Techniques like LoRA (Low-Rank Adaptation) make this even more efficient by freezing the pretrained weights and training only a pair of small low-rank matrices added alongside selected layers. This drastically reduces the number of trainable parameters and the memory needed for training, which allows fine-tuning even on modest hardware.
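The core idea can be sketched for a single weight matrix. The dimensions, the rank, and the scaling factor `alpha` below are assumed values for illustration; initializing `B` to zero is the usual convention, so the adapted layer starts out identical to the pretrained one.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 4                  # assumed toy sizes

W = rng.normal(size=(d_out, d_in))             # pretrained weight, kept frozen
A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable low-rank factor
B = np.zeros((d_out, rank))                    # trainable, zero at the start

def lora_forward(x, W, A, B, alpha=8):
    """Frozen layer output plus a scaled low-rank update: x W^T + (alpha / r) x A^T B^T."""
    scale = alpha / A.shape[0]
    return x @ W.T + scale * (x @ A.T @ B.T)

x = rng.normal(size=(2, d_in))
print(np.allclose(lora_forward(x, W, A, B), x @ W.T))  # True: B is zero, so nothing changes yet

frozen = W.size
trainable = A.size + B.size
print(frozen, trainable)  # 4096 512
```

Only `A` and `B` receive gradient updates. Here that is 512 values instead of 4096, and the saving grows quickly for the much larger matrices in a real LLM.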
