Large Language Models (LLMs) are advanced artificial intelligence systems designed to understand, generate, and manipulate human-like text.
Powered by deep learning architectures like transformers, models such as OpenAI’s GPT, Google’s Bard, and Meta’s LLaMA have demonstrated unprecedented capabilities in tasks ranging from answering complex questions to writing essays, debugging code, and even creating poetry.
These models are trained on vast amounts of text data, enabling them to capture the nuances of language and produce outputs that often feel remarkably human.
LLM evaluation is essential to ensure these powerful AI systems are reliable, safe, and ethical. Here’s why it matters:
LLMs are used in high-stakes applications like healthcare and education. Evaluation ensures their outputs are accurate and appropriate, preventing harmful mistakes.
LLMs can inherit biases from training data, leading to unfair or discriminatory outputs. Evaluation helps detect and address these issues, promoting fairness.
Evaluation allows developers to measure improvements, compare models, and benchmark against state-of-the-art systems.
Not all LLMs are equal. Evaluation helps stakeholders choose the right model for their needs, balancing performance, efficiency, and ethical considerations.
In short, LLM evaluation builds trust in AI by ensuring models are reliable, fair, and aligned with human values. Next, we’ll explore the key metrics used to evaluate LLMs.
Evaluating LLMs is no simple task. Unlike traditional AI systems, where metrics like accuracy or error rates suffice, LLMs require a more nuanced approach. Their outputs are often subjective, context-dependent, and open to interpretation. For example, two very different summaries of the same article can both be perfectly good, so there is no single “correct” output to compare against.
These challenges highlight the need for robust evaluation frameworks that go beyond simple metrics and consider the multifaceted nature of human language.
In this blog, we’ll explore best practices for evaluating LLMs, covering evaluation approaches, key metrics, frameworks and tools, and a hands-on case study with Llama 3.1 8B.
By the end of this guide, you’ll have a clear understanding of how to assess LLMs effectively, ensuring they deliver value while minimizing risks. Let’s dive in!
Evaluating large language models (LLMs) requires a multifaceted approach that combines different methodologies to gain a comprehensive understanding of model performance.
Each approach offers unique strengths and limitations, making them complementary rather than mutually exclusive. Let’s explore the four primary evaluation paradigms used in the field today.
Human evaluation remains the gold standard for assessing LLM outputs, particularly for subjective qualities that are difficult to measure automatically.
Key characteristics:
Captures nuanced aspects of language like relevance, helpfulness, and naturalness that automated metrics often miss.
Typically involves annotators rating model outputs on predefined criteria using Likert scales or preference judgments.
Crucial for evaluating creative content, instruction-following, and alignment with human values.
Limitations:
• Expensive and time-consuming to implement at scale
• Subject to annotator biases and inconsistencies
• Difficult to standardize across different evaluation campaigns
Best practices:
• Use clear rubrics and evaluation criteria
• Employ multiple annotators per example and measure inter-annotator agreement (see the sketch below)
• Consider specialized annotators for domain-specific tasks
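To make the inter-annotator agreement step concrete, here is a minimal sketch that scores agreement between two annotators with Cohen’s kappa. It assumes scikit-learn is installed, and the Likert ratings are made up for illustration.

# Minimal sketch: inter-annotator agreement with Cohen's kappa (requires scikit-learn; ratings are hypothetical)
from sklearn.metrics import cohen_kappa_score

# 1-5 Likert ratings from two annotators over the same ten model outputs
annotator_a = [5, 4, 4, 2, 3, 5, 1, 4, 3, 2]
annotator_b = [5, 4, 3, 2, 3, 4, 1, 4, 2, 2]

# Quadratic weighting treats near-misses (4 vs. 5) as milder than large disagreements
kappa = cohen_kappa_score(annotator_a, annotator_b, weights="quadratic")
print(f"Quadratic-weighted Cohen's kappa: {kappa:.2f}")

Values close to 1 indicate strong agreement; values near 0 suggest the rubric or annotator training needs work.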
Automated metrics provide scalable, reproducible measurements that can be applied to large datasets without human intervention.
Common automated metrics:
Compare model outputs to human-written references (BLEU, ROUGE, BERTScore)
Evaluate outputs without needing human references (perplexity, coherence scores); see the sketch after this list
Using other LLMs to evaluate outputs (e.g., GPT-4 for scoring)
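To illustrate the reference-free idea, the sketch below computes perplexity with a small causal language model from the transformers library; GPT-2 is chosen purely for illustration, and any causal LM would work.

# Minimal sketch: reference-free perplexity scoring (GPT-2 chosen only for illustration)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ppl_tokenizer = AutoTokenizer.from_pretrained("gpt2")
ppl_model = AutoModelForCausalLM.from_pretrained("gpt2")
ppl_model.eval()

def perplexity(text: str) -> float:
    # Score the text with the model's own language-modeling loss
    enc = ppl_tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = ppl_model(**enc, labels=enc["input_ids"])
    # Perplexity is the exponential of the mean negative log-likelihood per token
    return torch.exp(out.loss).item()

print(perplexity("The cat sat on the mat."))
print(perplexity("Mat the on sat cat the."))

Lower perplexity means the text looks more natural to the scoring model; note that this measures fluency, not factual correctness.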
Strengths and weaknesses:
When to use:
• For rapid iteration during development
• As part of a broader evaluation strategy
• When monitoring performance changes over time
Standardized benchmark datasets provide consistent testing grounds to compare different models on specific capabilities.
Popular LLM benchmarks:
Adversarial testing deliberately probes model weaknesses by designing inputs specifically meant to cause failures.
Approaches include:
The field of LLM evaluation continues to evolve rapidly, with new approaches gaining traction.
The most effective evaluation strategies combine several of these paradigms rather than relying on any single one.
Remember that evaluation should be tailored to your specific use case and deployment context.
For customer-facing applications, human evaluation on real-world queries may be most valuable, while research contexts might benefit more from standardized benchmarks.
Selecting the right metrics is crucial for meaningful LLM evaluation.
Different metrics capture different aspects of model performance, and understanding their strengths and limitations is essential for comprehensive assessment.
This section explores the primary categories of metrics used in LLM evaluation.
Accuracy metrics measure how closely model outputs match expected answers or references.
These are particularly important for tasks with clear correct answers.
Limitation: plain accuracy becomes much less informative when classes are highly imbalanced, as the toy example below shows.
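With a 9-to-1 class imbalance, a model that always predicts the majority class scores 90% accuracy but an F1 of zero for the minority class. The sketch uses the Hugging Face evaluate library, and the labels are invented for illustration.

# Minimal sketch: accuracy vs. F1 on an imbalanced toy set (labels are hypothetical)
import evaluate

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

references = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # nine negatives, one positive
predictions = [0] * 10                        # model always predicts the majority class

print(accuracy.compute(predictions=predictions, references=references))  # accuracy: 0.9
print(f1.compute(predictions=predictions, references=references))        # f1: 0.0 for the positive class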
Fluency metrics assess how natural, grammatical, and human-like the language produced by the model is.
These metrics evaluate whether model outputs are relevant to the input and consistent with known facts or the model’s previous statements.
May not capture nuanced aspects of relevance
These metrics assess potential harms, biases, and unintended behaviors in model outputs.
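As one way to automate part of this, the sketch below scores candidate outputs with the toxicity measurement shipped in the evaluate library (it wraps a hate-speech classifier under the hood); the module name and output format are assumptions to verify against the library’s documentation.

# Minimal sketch: flagging toxic outputs with the evaluate library's toxicity measurement
# (module name and return format assumed; check the evaluate docs)
import evaluate

toxicity = evaluate.load("toxicity", module_type="measurement")

candidates = [
    "Thanks for your question. Here is the information you asked for.",
    "You are completely useless and should give up.",
]

scores = toxicity.compute(predictions=candidates)
for text, score in zip(candidates, scores["toxicity"]):
    print(f"{score:.3f}  {text}")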
New metrics continue to be developed as the field evolves.
When selecting metrics for your evaluation:
1. Consider your use case: Different applications require different metrics
2. Use multiple metrics: No single metric captures all aspects of performance
3. Balance automated and human evaluation: Use automated metrics for efficiency, but validate with human judgment
4. Establish baselines: Compare against human performance and other models
5. Track progress over time: Monitor how metrics change as you iterate on your model
Remember that metrics are tools to guide improvement, not ends in themselves.
The ultimate goal is to create models that provide value to users, which sometimes requires looking beyond traditional metrics.
The rapid development of LLMs has sparked the creation of numerous evaluation frameworks and tools.
This section explores the most widely used and effective tools available for comprehensive LLM evaluation.
HELM provides one of the most comprehensive frameworks for evaluating language models across multiple dimensions.
Key features:
Best for:
Website: HELM at Stanford CRFM
A flexible and extensible codebase for evaluating language models on a wide range of tasks and benchmarks.
Key features:
Best for:
GitHub: EleutherAI lm-evaluation-harness
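For a feel of how a run looks, here is a sketch of invoking the harness from a notebook cell on a single benchmark task; the package name, flags, and task name are written from memory and should be checked against the project’s README before use.

# Sketch of a harness run from a notebook cell (verify flags and task names against the README)
!pip install lm_eval
!lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.1-8B --tasks hellaswag --batch_size 8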
Part of the popular HuggingFace ecosystem, this library provides evaluation metrics for NLP tasks.
Key features:
Best for:
Documentation: HuggingFace Evaluate
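As a quick taste of the API, the sketch below bundles several classification metrics into a single object with evaluate.combine; the predictions and labels are made up.

# Minimal sketch: combining several metrics into one call with evaluate.combine (toy labels)
import evaluate

clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

predictions = [1, 0, 1, 1, 0, 1]
references = [1, 0, 0, 1, 0, 1]

print(clf_metrics.compute(predictions=predictions, references=references))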
An open-source library focused on evaluating LLM applications and chains, particularly relevant for RAG systems.
Key features:
Best for:
Evaluating production LLM applications and RAG systems.
GitHub: TruLens
A comprehensive benchmark for evaluating knowledge and reasoning across 57 subjects.
Key features:
Best for:
A collaborative benchmark with over 200 tasks designed to probe model capabilities beyond standard metrics.
Key features:
Best for:
Focused on evaluating instruction-following capabilities through win rates against reference models.
Key features:
Best for:
Comparing instruction-following capabilities between models.
A comprehensive platform for testing and evaluating production LLM applications.
Key features:
Best for:
An ML observability platform with extensive LLM evaluation capabilities.
Key features:
Best for:
An evaluation platform built for production AI systems with comprehensive metrics.
Key features:
Best for:
Focuses on comparing models based on helpfulness and harmlessness preferences.
Key features:
Best for:
Integrated platform for tracking and visualizing LLM evaluations.
Key features:
Best for:
An evaluation framework designed for evaluating model capabilities and safety.
Key features:
Best for:
Most organizations benefit from combining multiple evaluation tools:
1. Start with comprehensive frameworks like HELM or EleutherAI’s Harness for broad capability assessment
2. Add specialized tools for your specific use cases, such as TruLens for RAG applications
3. Implement continuous evaluation using tools that integrate with your MLOps pipeline
4. Complement with human evaluation for critical aspects not well captured by automated metrics
When choosing evaluation tools, consider:
• Integration with your tech stack
• Coverage of metrics relevant to your use case
• Cost and resource requirements
• Ease of use and documentation quality
• Community support and active development
• Scalability for your evaluation needs
Remember that the best evaluation strategy combines multiple tools and approaches, as no single tool can capture all aspects of LLM performance.
Evaluating chatbots requires a focus on various metrics to ensure the model serves its purpose effectively.
For customer service, the main criteria often include:
How well does the chatbot answer the user’s queries? Is it relevant, helpful, and clear?
This can be evaluated using relevance and factual consistency metrics.
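One lightweight way to approximate relevance is to embed the user’s query and the chatbot’s answer and compare them with cosine similarity. The sketch below uses sentence-transformers and a small embedding model; both are illustrative choices, not tools used elsewhere in this post.

# Minimal sketch: approximating answer relevance with embedding similarity
# (sentence-transformers and the model name are illustrative choices)
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

query = "How do I reset my account password?"
answers = [
    "You can reset your password from the login page by clicking 'Forgot password'.",
    "Our headquarters are in Berlin and the company was founded in 2010.",
]

query_emb = embedder.encode(query, convert_to_tensor=True)
answer_embs = embedder.encode(answers, convert_to_tensor=True)

# Higher cosine similarity suggests the answer stays on topic for the query
for answer, score in zip(answers, util.cos_sim(query_emb, answer_embs)[0]):
    print(f"{score.item():.2f}  {answer}")

A similarity check like this catches off-topic answers but not factual errors, so it should be paired with a factual-consistency check or human review.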
In research, the evaluation of LLMs follows a structured methodology to ensure reproducibility and fairness.
Major AI labs like OpenAI, Google Research, and DeepMind follow rigorous evaluation frameworks to assess their LLMs.
This comprehensive analysis explores the text generation and summarization capabilities of Meta’s LLaMA 3.1 8B model.
Using industry-standard metrics like BLEU, ROUGE, and BERTScore, we quantitatively evaluate the model’s performance while providing real-world examples of its outputs.
Our findings reveal an interesting disconnect: while semantic understanding (measured by BERTScore) is remarkably strong, lexical precision and structural integrity (measured by BLEU and ROUGE) show significant room for improvement.
Complete with visualizations and side-by-side comparisons of generated vs. expected outputs, this evaluation provides valuable insights for researchers and practitioners looking to implement or fine-tune large language models for specific text generation tasks.
First, let’s install the necessary packages for our evaluation:
# Installing required packages for model loading, evaluation, and visualization
!pip install torch transformers evaluate matplotlib seaborn huggingface_hub rouge_score bert_score
Now, let’s import the libraries we’ll use and authenticate with Hugging Face Hub:
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import evaluate
import torch
from huggingface_hub import login
# Log in to Hugging Face (enter your token when prompted)
login()
We’ll load the Llama 3.1 8B model with appropriate configurations for optimal performance:
# ---- MODEL LOADING ----
# Load model with GPU optimization
model_name = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    padding_side='left'  # Important for decoder-only models
)

# Set padding token for the tokenizer
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # Use half-precision for efficiency
    device_map="auto"           # Automatically determine best device placement
)
model.config.pad_token_id = tokenizer.pad_token_id
Next, we’ll define functions for text generation and summarization with few-shot prompting:
# ---- IMPROVED MODEL INFERENCE FUNCTIONS ----
def generate_text_batch(prompts, max_new_tokens=50):
    # Use few-shot prompting with explicit examples
    few_shot_prefix = """
Answer the following questions directly and concisely in 1-2 sentences:
Q: What is machine learning?
A: Machine learning is a branch of AI that enables computers to learn from data and make predictions without explicit programming. It uses statistical techniques to improve performance on specific tasks.
Q: How does a combustion engine work?
A: Combustion engines work by burning fuel in a confined space to create expanding gases that move pistons. This mechanical motion is then converted to power vehicles or machinery.
Now answer this question:
Q: """

    results = []
    # Process each prompt individually for better control
    for prompt in prompts:
        full_prompt = few_shot_prefix + prompt + "\nA:"
        inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)

        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # Use greedy decoding for more predictable results
            num_return_sequences=1
        )

        # Extract just the answer part
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        answer = response.split("\nA:")[-1].strip()
        results.append(answer)

    return results
def summarize_with_llm_batch(documents):
    # Use few-shot prompting with examples of good summaries
    few_shot_prefix = """
Summarize the following texts in a single concise sentence:
Text: Solar panels convert sunlight directly into electricity through the photovoltaic effect. The technology has become more efficient and affordable in recent years, leading to widespread adoption.
Summary: Solar panels convert sunlight to electricity using the photovoltaic effect, becoming more efficient and affordable over time.
Text: The Great Barrier Reef is experiencing severe coral bleaching due to rising ocean temperatures. Scientists warn that without immediate action on climate change, the reef could suffer irreversible damage.
Summary: Rising ocean temperatures are causing severe coral bleaching in the Great Barrier Reef, threatening permanent damage.
Now summarize this text:
Text: """

    results = []
    # Process each document individually
    for doc in documents:
        full_prompt = few_shot_prefix + doc + "\nSummary:"
        inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)

        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=30,  # Shorter limit to force conciseness
            do_sample=False,    # Deterministic output
            num_return_sequences=1
        )

        # Extract just the summary
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        summary = response.split("\nSummary:")[-1].strip()

        # Cut off at the first period to ensure single sentence
        period_idx = summary.find('.')
        if period_idx > 0:
            summary = summary[:period_idx + 1]
        results.append(summary)

    return results
Few-Shot Approach: We implement in-context learning with examples that demonstrate the desired output style.
This explicit conditioning is more effective than simply stating instructions, because it gives the model concrete examples to imitate.
We’ll create test datasets for both text generation and summarization tasks:
# Task 1: Text Generation
prompts_gen = [
"Explain the importance of renewable energy sources.",
"Describe the process of photosynthesis.",
"What are the causes and effects of global warming?",
]
references_gen = [
"Renewable energy sources are vital because they reduce greenhouse gas emissions and dependence on fossil fuels.",
"Photosynthesis is the process by which plants convert sunlight, carbon dioxide, and water into energy in the form of glucose.",
"Global warming is caused by the release of greenhouse gases and has effects such as rising sea levels and extreme weather events."
]
# Task 2: Text Summarization
documents_sum = [
"Renewable energy sources, including solar, wind, and hydropower, offer significant benefits for the planet. "
"They help reduce greenhouse gas emissions, mitigate climate change, and provide sustainable energy solutions.",
"Photosynthesis is a crucial process in which plants use sunlight to convert carbon dioxide and water into glucose and oxygen. "
"This process provides energy for plants and oxygen for other living organisms.",
"Global warming is primarily caused by human activities, such as burning fossil fuels and deforestation. "
"Its consequences include rising sea levels, more frequent extreme weather events, and loss of biodiversity."
]
references_sum = [
"Renewable energy reduces emissions and mitigates climate change.",
"Photosynthesis converts sunlight, carbon dioxide, and water into glucose.",
"Global warming leads to rising sea levels and extreme weather."
]
Now we’ll run the model on our datasets:
# ---- MODEL INFERENCE ----
# Run a small warm-up to initialize CUDA
print("Warming up GPU...")
_ = generate_text_batch(["Hello, world!"], max_new_tokens=10)
# Generate responses for both tasks
print("Generating responses...")
predictions_gen = generate_text_batch(prompts_gen, max_new_tokens=50)
predictions_sum = summarize_with_llm_batch(documents_sum)
The warm-up run serves to initialize CUDA context and allocate memory before main execution, preventing timing inconsistencies from first-time CUDA initialization overhead.
Warming up GPU...
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Generating responses...
We’ll evaluate the model using standard NLP metrics.
# ---- EVALUATION ----
# Load evaluation metrics
print("Loading evaluation metrics...")
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")
# Evaluate text generation
print("\nText Generation Evaluation Results:")
bleu_score_gen = bleu.compute(predictions=predictions_gen, references=references_gen)
rouge_score_gen = rouge.compute(predictions=predictions_gen, references=references_gen)
bertscore_result_gen = bertscore.compute(predictions=predictions_gen, references=references_gen, lang="en")
# Display generation results immediately
print(f"BLEU Score: {bleu_score_gen['bleu']:.4f}")
print(f"ROUGE-1: {rouge_score_gen['rouge1']:.4f}, ROUGE-2: {rouge_score_gen['rouge2']:.4f}, ROUGE-L: {rouge_score_gen['rougeL']:.4f}")
print(f"BERTScore (F1): {sum(bertscore_result_gen['f1']) / len(bertscore_result_gen['f1']):.4f}")
# Evaluate text summarization
print("\nText Summarization Evaluation Results:")
bleu_score_sum = bleu.compute(predictions=predictions_sum, references=references_sum)
rouge_score_sum = rouge.compute(predictions=predictions_sum, references=references_sum)
bertscore_result_sum = bertscore.compute(predictions=predictions_sum, references=references_sum, lang="en")
# Display summarization results immediately
print(f"BLEU Score: {bleu_score_sum['bleu']:.4f}")
print(f"ROUGE-1: {rouge_score_sum['rouge1']:.4f}, ROUGE-2: {rouge_score_sum['rouge2']:.4f}, ROUGE-L: {rouge_score_sum['rougeL']:.4f}")
print(f"BERTScore (F1): {sum(bertscore_result_sum['f1']) / len(bertscore_result_sum['f1']):.4f}")
We’re using the Hugging Face evaluate library because it provides standardized implementations of NLP metrics, ensuring consistency and reproducibility.
We selected these three metrics specifically for complementary reasons:
This multi-metric approach compensates for the known limitations of any single metric in isolation.
For the text generation task, we compute all three metrics on the same set of generated outputs rather than regenerating predictions for each metric, which is more efficient.
We specify lang="en" for BERTScore so that it uses the appropriate underlying language model.
The immediate display of results provides quick feedback, and we format to 4 decimal places for readability while maintaining sufficient precision.
For BERTScore, we calculate the average F1 score across all samples since BERTScore returns individual scores for each prediction-reference pair, unlike BLEU and ROUGE which return aggregate scores.
We maintain the same evaluation structure for summarization to enable direct comparison with the generation task. This consistency is crucial for valid comparative analysis.
Though summarization typically emphasizes ROUGE metrics (especially ROUGE-L for capturing the longest common subsequence), we include all metrics to enable comprehensive comparison.
Loading evaluation metrics...
Text Generation Evaluation Results:
BLEU Score: 0.2115
ROUGE-1: 0.5707, ROUGE-2: 0.3633, ROUGE-L: 0.4847
BERTScore (F1): 0.9396
Text Summarization Evaluation Results:
BLEU Score: 0.0597
ROUGE-1: 0.4444, ROUGE-2: 0.2383, ROUGE-L: 0.4044
BERTScore (F1): 0.9209
Let’s examine the generated outputs compared to the references:
# ---- PRINT DETAILED RESULTS ----
print("\n--- TEXT GENERATION TASK RESULTS ---")
for i in range(len(prompts_gen)):
    print(f"\nPrompt {i+1}: {prompts_gen[i]}")
    print(f"Generated Output: {predictions_gen[i]}")
    print(f"Expected Output: {references_gen[i]}")

print("\n--- TEXT SUMMARIZATION TASK RESULTS ---")
for i in range(len(documents_sum)):
    print(f"\nOriginal Document {i+1}: {documents_sum[i]}")
    print(f"Generated Summary: {predictions_sum[i]}")
    print(f"Expected Summary: {references_sum[i]}")
Quantitative metrics alone can be misleading without qualitative context.
This section provides side-by-side comparisons of inputs, predictions, and references for qualitative assessment.
The structured format makes it easy to spot differences between model outputs and references, providing context for interpreting the numerical metrics.
This is especially important since LLMs often produce semantically correct answers with different phrasings that quantitative metrics might penalize.
--- TEXT GENERATION TASK RESULTS ---
Prompt 1: Explain the importance of renewable energy sources.
Generated Output: Renewable energy sources are important because they provide clean, sustainable power that reduces our reliance on fossil fuels and their associated environmental impacts. They also help mitigate climate change by reducing greenhouse gas emissions.
Expected Output: Renewable energy sources are vital because they reduce greenhouse gas emissions and dependence on fossil fuels.
Prompt 2: Describe the process of photosynthesis.
Generated Output: Photosynthesis is the process by which plants and some other organisms use sunlight to convert carbon dioxide and water into glucose and oxygen. It is a complex process involving several steps, including the absorption of light energy, the production of ATP and NADPH,
Expected Output: Photosynthesis is the process by which plants convert sunlight, carbon dioxide, and water into energy in the form of glucose.
Prompt 3: What are the causes and effects of global warming?
Generated Output: Global warming is caused by the release of greenhouse gases into the atmosphere, primarily from human activities such as burning fossil fuels. The effects include rising temperatures, melting ice caps, and more frequent extreme weather events.
Expected Output: Global warming is caused by the release of greenhouse gases and has effects such as rising sea levels and extreme weather events.
--- TEXT SUMMARIZATION TASK RESULTS ---
Original Document 1: Renewable energy sources, including solar, wind, and hydropower, offer significant benefits for the planet. They help reduce greenhouse gas emissions, mitigate climate change, and provide sustainable energy solutions.
Generated Summary: Renewable energy sources like solar, wind, and hydropower offer environmental benefits by reducing greenhouse gas emissions and mitigating climate change while providing sustainable energy solutions
Expected Summary: Renewable energy reduces emissions and mitigates climate change.
Original Document 2: Photosynthesis is a crucial process in which plants use sunlight to convert carbon dioxide and water into glucose and oxygen. This process provides energy for plants and oxygen for other living organisms.
Generated Summary: Photosynthesis is a process where plants use sunlight to make glucose and oxygen from carbon dioxide and water, providing energy for plants and oxygen for other organisms.
Expected Summary: Photosynthesis converts sunlight, carbon dioxide, and water into glucose.
Original Document 3: Global warming is primarily caused by human activities, such as burning fossil fuels and deforestation. Its consequences include rising sea levels, more frequent extreme weather events, and loss of biodiversity.
Generated Summary: Human activities like burning fossil fuels and deforestation cause global warming, leading to rising sea levels, extreme weather, and biodiversity loss.
Expected Summary: Global warming leads to rising sea levels and extreme weather.
The evaluation of Llama 3.1 8B reveals interesting insights about its capabilities across different NLP tasks:
For text generation, we achieved a BLEU score of around 0.21, while summarization yielded a lower score of about 0.06.
The relatively low BLEU scores indicate challenges with exact n-gram matching between model outputs and human references.
This isn’t necessarily a critical flaw—BLEU prioritizes exact matches, while LLMs often produce semantically equivalent but lexically different outputs.
The ROUGE metrics show moderate performance with ROUGE-1 at 0.57 for generation and 0.44 for summarization, while ROUGE-L is around 0.48 and 0.40 respectively.
These scores indicate the model captures a fair portion of the reference content, but there’s still substantial divergence in the exact phrasing and structure.
The BERTScore F1 values are notably high at around 0.94 for generation and 0.92 for summarization.
This suggests that while the surface form (exact wording) differs from references, the semantic meaning is largely preserved. BERTScore’s contextual embeddings capture these semantic similarities that n-gram based metrics miss.
Text generation performed better than summarization across all metrics.
This suggests the model may find it easier to expand on concepts (generation) than to compress information while preserving key points (summarization).
The summarization task requires more complex reasoning about information importance and conciseness.
The high BERTScore with lower BLEU/ROUGE scores reveals an important characteristic of modern LLMs:
They excel at capturing meaning but express it in their own words rather than reproducing exact reference phrasing.
This makes them valuable for creative content generation and information reformulation, even if they don’t achieve perfect metric scores on traditional benchmarks.
The evaluation demonstrates that Llama 3.1 8B is a capable model with stronger semantic understanding than exact reproduction abilities, which aligns with its intended use cases in creative text generation and flexible information processing.
Evaluating LLMs is crucial for ensuring their accuracy, safety, and relevance in real-world applications.
By combining automated metrics, human judgment, and continuous testing, we can identify strengths and mitigate risks such as bias and toxicity.
Adopting a comprehensive evaluation framework leads to more reliable and ethical AI models.
As the field evolves, staying updated with new tools and practices will help improve model performance and transparency.
Incorporating robust evaluation practices is essential for responsible AI development and the successful deployment of LLMs in various industries.