LLM Evaluation: Best Metrics & Tools

March 26th, 2025


Introduction

Large Language Models (LLMs) are advanced artificial intelligence systems designed to understand, generate, and manipulate human-like text. 

Powered by deep learning architectures like transformers, models such as OpenAI’s GPT, Google’s Bard, and Meta’s LLaMA have demonstrated unprecedented capabilities in tasks ranging from answering complex questions to writing essays, debugging code, and even creating poetry.

These models are trained on vast amounts of text data, enabling them to capture the nuances of language and produce outputs that often feel remarkably human.

Why LLM Evaluation Is Critical

LLM evaluation is essential to ensure these powerful AI systems are reliable, safe, and ethical. Here’s why it matters:

Reliability and Safety

LLMs are used in high-stakes applications like healthcare and education. Evaluation ensures their outputs are accurate and appropriate, preventing harmful mistakes.

Identifying Biases

LLMs can inherit biases from training data, leading to unfair or discriminatory outputs. Evaluation helps detect and address these issues, promoting fairness.

Tracking Progress

Evaluation allows developers to measure improvements, compare models, and benchmark against state-of-the-art systems.

Informed Deployment

Not all LLMs are equal. Evaluation helps stakeholders choose the right model for their needs, balancing performance, efficiency, and ethical considerations.

In short, LLM evaluation builds trust in AI by ensuring models are reliable, fair, and aligned with human values. Next, we’ll explore the key metrics used to evaluate LLMs.

The Challenge of Evaluating Human-Like Text

Evaluating LLMs is no simple task. Unlike traditional AI systems, where metrics like accuracy or error rates suffice, LLMs require a more nuanced approach. Their outputs are often subjective, context-dependent, and open to interpretation. For example:

  • How do you measure the “quality” of a creative story generated by an LLM?
  • How do you ensure factual accuracy in a model that summarizes complex topics?
  • How do you detect and mitigate subtle biases in generated text?

These challenges highlight the need for robust evaluation frameworks that go beyond simple metrics and consider the multifaceted nature of human language.

What Will This Blog Cover?

In this blog, we’ll explore the best practices for evaluating LLMs, covering:

  • Key Metrics: What to measure and why.
  • Evaluation Methods: Automated vs. human-centric approaches.
  • Tools and Frameworks: Popular tools for streamlining LLM evaluation.
  • Best Practices: How to ensure comprehensive and ethical evaluations.

By the end of this guide, you’ll have a clear understanding of how to assess LLMs effectively, ensuring they deliver value while minimizing risks. Let’s dive in!

Core LLM Evaluation Approaches

Evaluating large language models (LLMs) requires a multifaceted approach that combines different methodologies to gain a comprehensive understanding of model performance.

Each approach offers unique strengths and limitations, making them complementary rather than mutually exclusive. Let’s explore the four primary evaluation paradigms used in the field today.

Human Evaluation

Human evaluation remains the gold standard for assessing LLM outputs, particularly for subjective qualities that are difficult to measure automatically.

Key characteristics:

Strengths

Captures nuanced aspects of language like relevance, helpfulness, and naturalness that automated metrics often miss.

Process

Typically involves annotators rating model outputs on predefined criteria using Likert scales or preference judgments.

Applications

Crucial for evaluating creative content, instruction-following, and alignment with human values.

Limitations

  • Expensive and time-consuming to implement at scale
  • Subject to annotator biases and inconsistencies
  • Difficult to standardize across different evaluation campaigns

Best practices

  • Use clear rubrics and evaluation criteria
  • Employ multiple annotators per example and measure inter-annotator agreement
  • Consider specialized annotators for domain-specific tasks
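
For instance, inter-annotator agreement on Likert-style ratings can be quantified with Cohen's kappa. Below is a minimal sketch using scikit-learn; the ratings are purely illustrative.

# Measure agreement between two annotators who rated the same outputs
from sklearn.metrics import cohen_kappa_score

annotator_a = [5, 4, 3, 4, 2, 5, 1, 3]  # Likert ratings from annotator A
annotator_b = [5, 3, 3, 4, 2, 4, 2, 3]  # Likert ratings from annotator B

# Quadratic weighting penalizes large disagreements more than adjacent ones,
# which usually suits ordinal scales such as Likert ratings
kappa = cohen_kappa_score(annotator_a, annotator_b, weights="quadratic")
print(f"Quadratic-weighted Cohen's kappa: {kappa:.2f}")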

Automated Metrics

Automated metrics provide scalable, reproducible measurements that can be applied to large datasets without human intervention.

Common automated metrics:

Reference-based

Compare model outputs to human-written references (BLEU, ROUGE, BERTScore)

Reference-free

Evaluate outputs without needing human references (perplexity, coherence scores)

LLM-as-a-judge

Using other LLMs to evaluate outputs (e.g., GPT-4 for scoring)
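
As a rough illustration of the LLM-as-a-judge pattern, the sketch below builds a grading prompt and parses a 1-5 score; the call_judge function is a hypothetical placeholder for whichever judge model or API you use.

import re

JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}
Rate the answer's helpfulness from 1 (poor) to 5 (excellent).
Reply with only the number."""

def call_judge(prompt: str) -> str:
    # Hypothetical stub: replace with a call to your judge model or API
    return "4"

def judge_score(question: str, answer: str) -> int:
    reply = call_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 0  # 0 signals an unparseable reply

print(judge_score("What is photosynthesis?", "Plants convert sunlight into glucose."))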

Strengths and weaknesses:

  • Strengths: Consistent, scalable, reproducible, and cost-effective
  • Weaknesses: Often fail to capture semantic nuances and can reward shallow pattern matching over genuine understanding

When to use:

  • For rapid iteration during development
  • As part of a broader evaluation strategy
  • When monitoring performance changes over time
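
To make the reference-free idea concrete, here is a minimal sketch of computing perplexity with the Hugging Face transformers library; GPT-2 is used only to keep the example lightweight.

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Large language models are evaluated with a mix of automated metrics and human judgment."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing the input ids as labels yields the average next-token cross-entropy loss
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {math.exp(loss.item()):.2f}")

Because perplexity depends on the tokenizer and vocabulary, scores are only comparable between models that share the same tokenization.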

Benchmark Datasets

Standardized benchmark datasets provide consistent testing grounds to compare different models on specific capabilities.

Popular LLM benchmarks:

  • MMLU (Massive Multitask Language Understanding): Tests knowledge across 57 subjects
  • HELM (Holistic Evaluation of Language Models): Comprehensive suite of tasks and metrics
  • BIG-bench: Collaborative benchmark with 200+ diverse tasks
  • TruthfulQA: Measures propensity to generate truthful answers
  • GSM8K: Tests mathematical reasoning capabilities

Benchmark considerations

  •  Look for benchmarks aligned with your specific use cases
  • Be aware of benchmark limitations and potential data contamination
  • Consider evaluating on multiple benchmarks to get a comprehensive view

Adversarial Testing

Adversarial testing deliberately probes model weaknesses by designing inputs specifically meant to cause failures.

Approaches include:

  • Red-teaming: Having experts attempt to break the model or elicit harmful outputs
  • Jailbreaking: Testing resistance to prompt injection and policy circumvention
  • Robustness testing: Evaluating performance under input perturbations and edge cases

Benefits

  • Identifies vulnerabilities before deployment 
  • Reveals failure modes that might not appear in standard benchmarks
  • Helps prioritize safety improvements

Implementation strategies

  • Combine automated adversarial testing with human red-teaming
  • Document and classify discovered vulnerabilities
  • Create regression tests for previously identified issues
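
A simple way to automate part of this work is to perturb prompts and check that behavior stays within policy. In the sketch below, generate_answer and passes_policy are hypothetical placeholders for your model call and your policy check.

import random

def perturb(prompt: str) -> list[str]:
    """Generate simple surface-level perturbations of a prompt."""
    shuffled = prompt.split()
    random.shuffle(shuffled)
    return [
        prompt.upper(),             # casing change
        "   " + prompt + "   ",     # extra whitespace
        prompt.replace("e", "3"),   # crude character substitution
        " ".join(shuffled),         # word-order scramble
    ]

def generate_answer(prompt: str) -> str:
    # Hypothetical stub: replace with your model's generation call
    return "I can't help with that."

def passes_policy(response: str) -> bool:
    # Hypothetical policy check: here the model is expected to refuse
    return "can't help" in response.lower()

base_prompt = "Explain how to bypass a software license check."
perturbations = perturb(base_prompt)
failures = [p for p in perturbations if not passes_policy(generate_answer(p))]
print(f"{len(failures)} of {len(perturbations)} perturbations broke the expected behavior")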

Emerging Evaluation Paradigms

The field of LLM evaluation continues to evolve rapidly with new approaches gaining traction:

  • Agent-based evaluation: Testing LLMs in interactive environments
  • Self-evaluation: Having models critique their own outputs
  • Process-based evaluation: Focusing on reasoning steps rather than just final answers
  • Distribution-aware evaluation: Testing performance across different demographic groups and content domains

[Figure: Four primary LLM evaluation approaches: human evaluation, automated metrics, benchmark datasets, and adversarial testing]

Choosing the Right Approach

The most effective evaluation strategies combine multiple approaches:

  • Use automated metrics for continuous monitoring and rapid iteration
  • Incorporate benchmark testing for standardized comparison
  • Apply human evaluation for critical aspects and final quality assessment
  • Employ adversarial testing to identify and address weaknesses

Remember that evaluation should be tailored to your specific use case and deployment context.

For customer-facing applications, human evaluation on real-world queries may be most valuable, while research contexts might benefit more from standardized benchmarks.

Key Metrics for LLM Evaluation

Selecting the right metrics is crucial for meaningful LLM evaluation.

Different metrics capture different aspects of model performance, and understanding their strengths and limitations is essential for comprehensive assessment.

This section explores the primary categories of metrics used in LLM evaluation.

Accuracy Metrics

Accuracy metrics measure how closely model outputs match expected answers or references.

These are particularly important for tasks with clear correct answers.

Text Generation Metrics

BLEU (Bilingual Evaluation Understudy)
  •  Originally designed for machine translation
  • Measures n-gram overlap between model output and reference text
  • Scores range from 0 to 1 (higher is better)
Limitation
  • Focuses on precision without considering recall; sensitive to surface-level differences
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
  • Common for summarization tasks
  • Various subtypes (ROUGE-N, ROUGE-L, ROUGE-S)
  • Emphasizes recall more than BLEU
Limitation
  • Still primarily focused on lexical overlap rather than semantic meaning
BERTScore
  • Uses contextual embeddings to measure semantic similarity
  • More robust to paraphrasing than BLEU or ROUGE
  • Correlates better with human judgments
Limitation
  • Computationally more expensive than n-gram based metrics

Classification Metrics

Accuracy
  • Proportion of correct predictions
  • Simple and intuitive
Limitation
  • Problematic for imbalanced datasets
F1 Score
  • Harmonic mean of precision and recall
  • Balances false positives and false negatives
Limitation
  • Can hide trade-offs between precision and recall
Area Under the ROC Curve (AUC-ROC)
  • Measures ability to distinguish between classes
  • Independent of decision threshold

Limitation
  • Less informative when classes are highly imbalanced
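
A minimal sketch of these classification metrics with scikit-learn, using illustrative labels (for example, whether each model answer was judged correct):

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Illustrative binary labels: 1 = correct answer, 0 = incorrect
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                   # hard predictions
y_prob = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]   # predicted probabilities

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")
print(f"F1 score: {f1_score(y_true, y_pred):.2f}")
print(f"AUC-ROC:  {roc_auc_score(y_true, y_prob):.2f}")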

Fluency Metrics

Fluency metrics assess how natural, grammatical, and human-like the language produced by the model is.

Perplexity
  • Measures how well a model predicts a sample
  • Lower values indicate better fluency
Limitation
  • Not directly comparable across different models or tokenizers
Grammatical Error Rate
  • Proportion of outputs containing grammatical errors
  • Can be measured using automated grammar checkers
Limitation
  • May miss subtle errors or flag unconventional but valid constructions
Self-BLEU
  • Measures diversity of generated text
  • Lower values indicate more diverse outputs
Limitation
  • Doesn’t account for quality or relevance

Relevance & Consistency Metrics

These metrics evaluate whether model outputs are relevant to the input and consistent with known facts or the model’s previous statements.

Factual Consistency
  • Percentage of outputs containing factual errors
  • Can be measured using fact-checking models
Limitation
  • Reference knowledge may be incomplete or outdated
Semantic Similarity
  • Cosine similarity between embeddings of query and response
  • Measures topical relevance
Limitation
  • May not capture nuanced aspects of relevance (see the embedding-similarity sketch at the end of this subsection)

Faithfulness
  • For summarization, measures whether the summary contains information not in the source
  • Lower hallucination rates indicate higher faithfulness
Limitation
  • Challenging to automate reliably
Consistency Measures
  • Evaluates whether model contradicts itself across outputs
  • Can be measured using contradiction detection models
Limitation
  • Requires context beyond individual responses
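
As an example of the semantic-similarity approach, the sketch below uses the sentence-transformers library; the all-MiniLM-L6-v2 checkpoint is just one common lightweight choice.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What are the causes of global warming?"
response = "Global warming is driven mainly by greenhouse gas emissions from burning fossil fuels."

# Cosine similarity between query and response embeddings as a rough relevance signal
embeddings = model.encode([query, response], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Query-response cosine similarity: {similarity:.2f}")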

Safety & Bias Metrics

These metrics assess potential harms, biases, and unintended behaviors in model outputs.

Toxicity Scores
  • Measures harmful, offensive, or inappropriate content
  • Often uses classifiers like Perspective API
Limitation
  • Cultural and contextual sensitivity issues
Stereotype Bias
  • Measures bias against protected groups
  • Can use templates to test for specific biases
Limitation
  • May not capture subtle or intersectional biases
Robustness to Adversarial Inputs
  • Measures how well model maintains safe behavior under attack
  • Pass rate on jailbreaking attempts
Limitation
  • Adversarial techniques evolve rapidly
Fairness Across Demographics
  • Performance disparity across different groups
  • Helps identify if the model favors certain perspectives
Limitation
  • Requires careful demographic categorization
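
One quick way to get a toxicity signal is the toxicity measurement in the Hugging Face evaluate library, which scores text with a default hate-speech classifier. A minimal sketch (exact scores depend on the underlying classifier):

import evaluate

# The toxicity measurement downloads a default classifier on first use
toxicity = evaluate.load("toxicity", module_type="measurement")

outputs = [
    "Thanks for your question, happy to help!",
    "You are completely useless and stupid.",
]

results = toxicity.compute(predictions=outputs)
for text, score in zip(outputs, results["toxicity"]):
    print(f"{score:.3f}  {text}")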

[Figure: Taxonomy of LLM evaluation metrics, including accuracy, fluency, safety, and relevance]

Emerging Metrics

New metrics continue to be developed as the field evolves:

Human Preference Alignment

  • How well outputs align with human preference
  • Often measured through pairwise comparisons
  • Becoming increasingly important for RLHF

Reasoning Evaluation

  • Assesses step-by-step reasoning ability
  • Focuses on process rather than just final answer
  • Important for tasks requiring multi-step thinking

Truthfulness Indices

  • Combines multiple measures of factuality
  • Provides a more comprehensive assessment of model honesty
  • Helps quantify tendency to hallucinate

Choosing the Right Metrics

When selecting metrics for your evaluation:

1. Consider your use case: Different applications require different metrics

2. Use multiple metrics: No single metric captures all aspects of performance

3. Balance automated and human evaluation: Use automated metrics for efficiency, but validate with human judgment

4. Establish baselines: Compare against human performance and other models

5. Track progress over time: Monitor how metrics change as you iterate on your model

Remember that metrics are tools to guide improvement, not ends in themselves.

The ultimate goal is to create models that provide value to users, which sometimes requires looking beyond traditional metrics.

Popular LLM Evaluation Tools

The rapid development of LLMs has sparked the creation of numerous evaluation frameworks and tools.

This section explores the most widely used and effective tools available for comprehensive LLM evaluation.

Open-Source Evaluation Frameworks

HELM (Holistic Evaluation of Language Models)

HELM provides one of the most comprehensive frameworks for evaluating language models across multiple dimensions.

Key features:

  • Evaluates models on 42 scenarios across 7 categories
  • Measures multiple metrics simultaneously (accuracy, robustness, fairness, etc.)
  • Standardized evaluation protocol for fair comparison
  • Regularly updated leaderboard

Best for:

  • Organizations seeking comprehensive, multi-dimensional evaluation.

EleutherAI LM Evaluation Harness

A flexible and extensible codebase for evaluating language models on a wide range of tasks and benchmarks.

Key features:

  • Support for many popular benchmarks (MMLU, TruthfulQA, HellaSwag, etc.)
  • Compatible with most open-source and commercial LLMs
  • Highly customizable evaluation settings
  • Active community development

Best for:

  • ML researchers and engineers working with multiple models.

HuggingFace Evaluate

Part of the popular HuggingFace ecosystem, this library provides evaluation metrics for NLP tasks.

Key features:

  • Integration with HuggingFace models and datasets
  • Comprehensive collection of metrics (BLEU, ROUGE, BERTScore, etc.)
  • Easy-to-use API with consistent interface
  • Well-documented and maintained

Best for:

  • Teams already using the HuggingFace ecosystem.

Documentation: HuggingFace Evaluate

TruLens

An open-source library focused on evaluating LLM applications and chains, particularly relevant for RAG systems.

Key features:

  • Feedback functions for groundedness, relevance, and coherence
  • Instrumentation for LLM chains and applications
  • Detailed tracing and evaluation dashboards
  • Integration with popular LLM frameworks

Best for:

  • Evaluating production LLM applications and RAG systems.

GitHub: TruLens

Benchmark-Specific Tools

MMLU (Massive Multitask Language Understanding)

A comprehensive benchmark for evaluating knowledge and reasoning across 57 subjects.

Key features:

  • Tests knowledge across domains (STEM, humanities, social sciences, etc.)
  • Multiple-choice format for easy evaluation
  • Varying difficulty levels
  • Wide adoption in academic and industry research

Best for:

  • Measuring general knowledge and reasoning capabilities.

BIG-bench

A collaborative benchmark with over 200 tasks designed to probe model capabilities beyond standard metrics.

Key features:

  • Diverse task types (reasoning, knowledge, multilingual, etc.)
  • Community-contributed tasks
  • Open-ended evaluation beyond standard metrics
  • Challenging tasks designed to test model limitations

Best for:

  • Identifying specific model strengths and weaknesses.

AlpacaEval

Focused on evaluating instruction-following capabilities through win rates against reference models.

Key features:

  • Uses strong judge models (like GPT-4) to evaluate responses
  • Computes win rates rather than absolute scores
  • Diverse set of instructions across domains
  • Quick to run and easy to interpret results

Best for:

  • Comparing instruction-following capabilities between models.

Commercial Evaluation Platforms

DeepEval

A comprehensive platform for testing and evaluating production LLM applications.

Key features:

  • End-to-end testing framework
  • Custom evaluation metrics
  • CI/CD integration
  • Performance monitoring dashboards

Best for:

  • DevOps teams integrating LLM evaluation into development pipelines.

Arthur.ai

An ML observability platform with extensive LLM evaluation capabilities.

Key features:

  • Performance monitoring
  • Bias and fairness detection
  • Data drift analysis
  • Explanation tools

Best for:

  • Enterprise teams requiring robust monitoring and governance.

Scale Spellbook

An evaluation platform built for production AI systems with comprehensive metrics.

Key features:

  • Human evaluation integration
  • Custom evaluation workflows
  • Performance analytics
  • Integration with popular LLM providers

Best for:

  • Teams requiring both automated and human evaluation.

Specialized Evaluation Tools

Anthropic’s RLHF Leaderboard

Focuses on comparing models based on helpfulness and harmlessness preferences.

Key features:

  • Models evaluated on alignment with human preferences
  • Standardized prompts across domains
  • Regular updates with new models
  • Transparent methodology

Best for:

  • Comparing models on alignment with human values.

Weights & Biases LLM Evaluation

Integrated platform for tracking and visualizing LLM evaluations.

Key features:

  • Experiment tracking
  • Performance visualization
  • Prompt and response versioning
  • Collaboration tools

Best for:

  • Teams tracking iterative LLM improvements over time.

OpenAI Evals

An evaluation framework designed for evaluating model capabilities and safety.

Key features:

  • Standardized evaluation protocols
  • Safety-specific evaluations
  • Customizable evaluation datasets
  • Integration with OpenAI models

Best for:

  • Evaluating models against OpenAI benchmarks and safety standards.

[Figure: Comparison table of popular LLM evaluation frameworks, including HELM, MMLU, and Arthur.ai]

Building an Evaluation Stack

Most organizations benefit from combining multiple evaluation tools:

1. Start with comprehensive frameworks like HELM or EleutherAI’s Harness for broad capability assessment

2. Add specialized tools for your specific use cases:

    • Customer service → Conversation quality metrics
    • Content generation → Creativity and factuality tools
    • Code generation → Functional correctness evaluators

3. Implement continuous evaluation using tools that integrate with your MLOps pipeline

4. Complement with human evaluation for critical aspects not well-captured by automated metrics

Selection Criteria for Evaluation Tools

When choosing evaluation tools, consider:

  • Integration with your tech stack
  • Coverage of metrics relevant to your use case
  • Cost and resource requirements
  • Ease of use and documentation quality
  • Community support and active development
  • Scalability for your evaluation needs

Remember that the best evaluation strategy combines multiple tools and approaches, as no single tool can capture all aspects of LLM performance.

Implementing Effective Evaluation Workflows

  • Set Clear Goals: Define the purpose and what you want to evaluate (e.g., accuracy, bias).
  • Select Metrics: Choose metrics tailored to your model’s task (e.g., F1 for classification, BLEU for generation).
  • Test Datasets: Use diverse and representative datasets for evaluation.
  • Baseline Performance: Establish benchmarks to compare results.
  • Continuous Evaluation: Regularly assess models during development to catch issues early.
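
To illustrate continuous evaluation, here is a lightweight sketch that compares a run's metrics against a stored baseline and flags regressions; the metric names, baseline values, and tolerance are all illustrative.

# Compare current evaluation metrics against a baseline and flag regressions
BASELINE = {"rouge_l": 0.42, "bertscore_f1": 0.91, "toxicity_rate": 0.01}
TOLERANCE = 0.02  # allowed drift before a metric counts as a regression

def check_regressions(current: dict[str, float]) -> list[str]:
    failures = []
    for metric, baseline_value in BASELINE.items():
        value = current[metric]
        if metric == "toxicity_rate":
            regressed = value > baseline_value + TOLERANCE  # lower is better
        else:
            regressed = value < baseline_value - TOLERANCE  # higher is better
        if regressed:
            failures.append(f"{metric}: {value:.3f} vs baseline {baseline_value:.3f}")
    return failures

current_run = {"rouge_l": 0.40, "bertscore_f1": 0.92, "toxicity_rate": 0.015}
for line in check_regressions(current_run) or ["No regressions detected"]:
    print(line)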

Case Studies: Evaluation in Practice

Customer Service Chatbot Evaluation

Evaluating chatbots requires a focus on various metrics to ensure the model serves its purpose effectively.

For customer service, the main criteria often include:

  • Response Quality: How well does the chatbot answer the user’s queries? Is it relevant, helpful, and clear? This can be evaluated using relevance and factual consistency metrics.
  • Response Time: How fast does the chatbot provide an answer? This can be evaluated using latency measures and user experience surveys.
  • Safety & Bias: It’s crucial to evaluate chatbots for toxicity and harmful responses. Toxicity scores and fairness evaluations are essential for identifying harmful behavior and ensuring the model’s output aligns with ethical guidelines.
  • User Satisfaction: Often measured through post-interaction surveys asking users whether their needs were met, ideally combined with human judgment and automated metrics such as relevance and fluency scores.

Research Methodologies

In research, the evaluation of LLMs follows a structured methodology to ensure reproducibility and fairness:

  • Automated Metrics: These are commonly used in papers to quickly assess model performance, especially for tasks like text generation (e.g., BLEU, ROUGE). However, they don’t capture all nuances of human language.
  • Human Evaluation: Research papers often supplement automated metrics with human evaluation to judge fluency, relevance, and coherence. This is considered the gold standard but is time-consuming and expensive.
  • Comparing Multiple Models: A typical research methodology involves comparing multiple models on the same benchmark dataset. This allows researchers to evaluate which models outperform others, based on consistent metrics.

Leading AI Labs’ Approach

Major AI labs like OpenAI, Google Research, and DeepMind follow rigorous evaluation frameworks to assess LLMs:

  • Standardized Datasets: These labs frequently use popular benchmark datasets like GLUE, SuperGLUE, or SQuAD, which have been designed to evaluate language models on various NLP tasks (e.g., question answering, sentiment analysis).
  • Human & Automated Evaluations: Labs combine automated metrics for scalability with human judgment to assess more subjective aspects like conversational coherence and model behavior in edge cases.
  • Bias and Fairness Testing: Leading labs also place significant emphasis on identifying and mitigating biases in their models, using bias detection methods and fairness evaluations to ensure the models are safe and equitable for real-world deployment.
  • Adversarial Testing: Labs often use adversarial testing to push the models to their limits, identifying failure points and improving robustness.

Lessons Learned

  • Iterative Improvements: Evaluation isn’t a one-time task; it’s a continuous process that involves iterating over models and metrics to uncover new insights.
  • Human Oversight is Essential: Even with advanced automated metrics, human oversight is needed to understand the full implications of a model’s behavior, especially in high-stakes applications like healthcare or customer service.
  • Holistic Evaluation: A comprehensive evaluation approach combining multiple metrics—accuracy, fluency, safety, and user feedback—provides the most reliable insights into model performance.

Future Trends in LLM Evaluation

  • Self-Evaluation: Future models might assess their own performance autonomously.
  • Multi-modal Evaluation: Evaluation will extend beyond text, including images and audio.
  • Human Preferences: Evaluations will align more closely with human judgment.
  • Community Standards: Growing consensus on standard evaluation practices to ensure transparency.

Example: Beyond the Numbers: A Deep Dive into LLaMA 3.1 8B's Text Generation and Summarization Capabilities

This comprehensive analysis explores the text generation and summarization capabilities of Meta’s LLaMA 3.1 8B model.

Using industry-standard metrics like BLEU, ROUGE, and BERTScore, we quantitatively evaluate the model’s performance while providing real-world examples of its outputs.

Our findings reveal an interesting disconnect: while semantic understanding (measured by BERTScore) is remarkably strong, lexical precision and structural integrity (measured by BLEU and ROUGE) show significant room for improvement.

Complete with visualizations and side-by-side comparisons of generated vs. expected outputs, this evaluation provides valuable insights for researchers and practitioners looking to implement or fine-tune large language models for specific text generation tasks.

Setting Up the Environment

First, let’s install the necessary packages for our evaluation:

 
				
# Installing required packages for model loading, evaluation, and visualization
!pip install torch transformers evaluate matplotlib seaborn huggingface_hub rouge_score bert_score
				
			
				
Requirement already satisfied: torch in /usr/local/lib/python3.11/dist-packages (2.6.0+cu124)
Requirement already satisfied: transformers in /usr/local/lib/python3.11/dist-packages (4.50.0)
Requirement already satisfied: evaluate in /usr/local/lib/python3.11/dist-packages (0.4.3)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.11/dist-packages (3.10.0)
Requirement already satisfied: seaborn in /usr/local/lib/python3.11/dist-packages (0.13.2)
Requirement already satisfied: huggingface_hub in /usr/local/lib/python3.11/dist-packages (0.29.3)
Requirement already satisfied: rouge_score in /usr/local/lib/python3.11/dist-packages (0.1.2)
Requirement already satisfied: bert_score in /usr/local/lib/python3.11/dist-packages (0.3.13)
...
				
			

Now, let’s import the libraries we’ll use and authenticate with Hugging Face Hub:

				
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import evaluate
import torch
from huggingface_hub import login

# Log in to Hugging Face (enter your token when prompted)
login()
				
			
Running login() opens a Hugging Face prompt asking you to paste an access token, with an option to save it as a git credential.

Loading the Model

We’ll load the Llama 3.1 8B model with appropriate configurations for optimal performance:

				
# ---- MODEL LOADING ----
# Load model with GPU optimization 
model_name = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    padding_side='left'  # Important for decoder-only models
)

# Set padding token for the tokenizer
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # Use half-precision for efficiency
    device_map="auto"           # Automatically determine best device placement
)

model.config.pad_token_id = tokenizer.pad_token_id
				
			

Defining Inference Functions

Next, we’ll define functions for text generation and summarization with few-shot prompting:

				
# ---- IMPROVED MODEL INFERENCE FUNCTIONS ----
def generate_text_batch(prompts, max_new_tokens=50):
    # Use few-shot prompting with explicit examples
    few_shot_prefix = """
Answer the following questions directly and concisely in 1-2 sentences:

Q: What is machine learning?
A: Machine learning is a branch of AI that enables computers to learn from data and make predictions without explicit programming. It uses statistical techniques to improve performance on specific tasks.

Q: How does a combustion engine work?
A: Combustion engines work by burning fuel in a confined space to create expanding gases that move pistons. This mechanical motion is then converted to power vehicles or machinery.

Now answer this question:
Q: """

    results = []
    # Process each prompt individually for better control
    for prompt in prompts:
        full_prompt = few_shot_prefix + prompt + "\nA:"

        inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)

        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # Use greedy decoding for more predictable results
            num_return_sequences=1
        )

        # Extract just the answer part
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        answer = response.split("\nA:")[-1].strip()
        results.append(answer)

    return results

def summarize_with_llm_batch(documents):
    # Use few-shot prompting with examples of good summaries
    few_shot_prefix = """
Summarize the following texts in a single concise sentence:

Text: Solar panels convert sunlight directly into electricity through the photovoltaic effect. The technology has become more efficient and affordable in recent years, leading to widespread adoption.
Summary: Solar panels convert sunlight to electricity using the photovoltaic effect, becoming more efficient and affordable over time.

Text: The Great Barrier Reef is experiencing severe coral bleaching due to rising ocean temperatures. Scientists warn that without immediate action on climate change, the reef could suffer irreversible damage.
Summary: Rising ocean temperatures are causing severe coral bleaching in the Great Barrier Reef, threatening permanent damage.

Now summarize this text:
Text: """

    results = []
    # Process each document individually
    for doc in documents:
        full_prompt = few_shot_prefix + doc + "\nSummary:"

        inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)

        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=30,  # Shorter limit to force conciseness
            do_sample=False,    # Deterministic output
            num_return_sequences=1
        )

        # Extract just the summary
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        summary = response.split("\nSummary:")[-1].strip()

        # Cut off at the first period to ensure single sentence
        period_idx = summary.find('.')
        if period_idx > 0:
            summary = summary[:period_idx+1]

        results.append(summary)

    return results
				
			

Few-Shot Approach: We use in-context learning with examples that demonstrate the desired output style. This explicit conditioning is typically more effective than simply stating instructions, because it gives the model concrete examples to imitate.


Creating Evaluation Datasets

We’ll create test datasets for both text generation and summarization tasks:

				
# Task 1: Text Generation
prompts_gen = [
    "Explain the importance of renewable energy sources.",
    "Describe the process of photosynthesis.",
    "What are the causes and effects of global warming?",
]
references_gen = [
    "Renewable energy sources are vital because they reduce greenhouse gas emissions and dependence on fossil fuels.",
    "Photosynthesis is the process by which plants convert sunlight, carbon dioxide, and water into energy in the form of glucose.",
    "Global warming is caused by the release of greenhouse gases and has effects such as rising sea levels and extreme weather events."
]

# Task 2: Text Summarization
documents_sum = [
    "Renewable energy sources, including solar, wind, and hydropower, offer significant benefits for the planet. "
    "They help reduce greenhouse gas emissions, mitigate climate change, and provide sustainable energy solutions.",

    "Photosynthesis is a crucial process in which plants use sunlight to convert carbon dioxide and water into glucose and oxygen. "
    "This process provides energy for plants and oxygen for other living organisms.",

    "Global warming is primarily caused by human activities, such as burning fossil fuels and deforestation. "
    "Its consequences include rising sea levels, more frequent extreme weather events, and loss of biodiversity."
]
references_sum = [
    "Renewable energy reduces emissions and mitigates climate change.",
    "Photosynthesis converts sunlight, carbon dioxide, and water into glucose.",
    "Global warming leads to rising sea levels and extreme weather."
]
				
			

Model Inference

Now we’ll run the model on our datasets:

				
# ---- MODEL INFERENCE ----
# Run a small warm-up to initialize CUDA
print("Warming up GPU...")
_ = generate_text_batch(["Hello, world!"], max_new_tokens=10)

# Generate responses for both tasks
print("Generating responses...")
predictions_gen = generate_text_batch(prompts_gen, max_new_tokens=50)
predictions_sum = summarize_with_llm_batch(documents_sum)
				
			

The warm-up run serves to initialize CUDA context and allocate memory before main execution, preventing timing inconsistencies from first-time CUDA initialization overhead.

				
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Warming up GPU...
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Generating responses...
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
				
			

Evaluating Model Performance

We’ll evaluate the model using standard NLP metrics.

				
# ---- EVALUATION ----
# Load evaluation metrics
print("Loading evaluation metrics...")
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# Evaluate text generation
print("\nText Generation Evaluation Results:")
bleu_score_gen = bleu.compute(predictions=predictions_gen, references=references_gen)
rouge_score_gen = rouge.compute(predictions=predictions_gen, references=references_gen)
bertscore_result_gen = bertscore.compute(predictions=predictions_gen, references=references_gen, lang="en")
# Display generation results immediately
print(f"BLEU Score: {bleu_score_gen['bleu']:.4f}")
print(f"ROUGE-1: {rouge_score_gen['rouge1']:.4f}, ROUGE-2: {rouge_score_gen['rouge2']:.4f}, ROUGE-L: {rouge_score_gen['rougeL']:.4f}")
print(f"BERTScore (F1): {sum(bertscore_result_gen['f1']) / len(bertscore_result_gen['f1']):.4f}")

# Evaluate text summarization
print("\nText Summarization Evaluation Results:")
bleu_score_sum = bleu.compute(predictions=predictions_sum, references=references_sum)
rouge_score_sum = rouge.compute(predictions=predictions_sum, references=references_sum)
bertscore_result_sum = bertscore.compute(predictions=predictions_sum, references=references_sum, lang="en")
# Display summarization results immediately
print(f"BLEU Score: {bleu_score_sum['bleu']:.4f}")
print(f"ROUGE-1: {rouge_score_sum['rouge1']:.4f}, ROUGE-2: {rouge_score_sum['rouge2']:.4f}, ROUGE-L: {rouge_score_sum['rougeL']:.4f}")
print(f"BERTScore (F1): {sum(bertscore_result_sum['f1']) / len(bertscore_result_sum['f1']):.4f}")
				
			

We’re using the Hugging Face evaluate library because it provides standardized implementations of NLP metrics, ensuring consistency and reproducibility.

We selected these three metrics specifically for complementary reasons:

  • BLEU: A precision-focused metric that measures exact n-gram overlap, useful for detecting exact phrase matching.
  • ROUGE: Measures recall of n-grams, giving us insights into how much of the reference content is captured.
  • BERTScore: Uses contextual embeddings rather than exact matches, helping us evaluate semantic similarity when wording differs.

This multi-metric approach compensates for the known limitations of any single metric in isolation.

For the text generation task, we’re computing all metrics at once rather than in separate evaluation passes, which is more efficient.

We specify lang="en" for BERTScore to use the appropriate language model.

The immediate display of results provides quick feedback, and we format to 4 decimal places for readability while maintaining sufficient precision.

For BERTScore, we calculate the average F1 score across all samples since BERTScore returns individual scores for each prediction-reference pair, unlike BLEU and ROUGE which return aggregate scores.

We maintain the same evaluation structure for summarization to enable direct comparison with the generation task. This consistency is crucial for valid comparative analysis.

Though summarization typically emphasizes ROUGE metrics (especially ROUGE-L for capturing the longest common subsequence), we include all metrics to enable comprehensive comparison.

				
Loading evaluation metrics...

Text Generation Evaluation Results:
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
BLEU Score: 0.2115
ROUGE-1: 0.5707, ROUGE-2: 0.3633, ROUGE-L: 0.4847
BERTScore (F1): 0.9396

Text Summarization Evaluation Results:
BLEU Score: 0.0597
ROUGE-1: 0.4444, ROUGE-2: 0.2383, ROUGE-L: 0.4044
BERTScore (F1): 0.9209
				
			

Reviewing the Outputs

Let’s examine the generated outputs compared to the references:

				
# ---- PRINT DETAILED RESULTS ----
print("\n--- TEXT GENERATION TASK RESULTS ---")
for i in range(len(prompts_gen)):
    print(f"\nPrompt {i+1}: {prompts_gen[i]}")
    print(f"Generated Output: {predictions_gen[i]}")
    print(f"Expected Output: {references_gen[i]}")

print("\n--- TEXT SUMMARIZATION TASK RESULTS ---")
for i in range(len(documents_sum)):
    print(f"\nOriginal Document {i+1}: {documents_sum[i]}")
    print(f"Generated Summary: {predictions_sum[i]}")
    print(f"Expected Summary: {references_sum[i]}")
				
			

Quantitative metrics alone can be misleading without qualitative context.

This section provides side-by-side comparisons of inputs, predictions, and references for qualitative assessment.

The structured format makes it easy to spot differences between model outputs and references, providing context for interpreting the numerical metrics.

This is especially important since LLMs often produce semantically correct answers with different phrasings that quantitative metrics might penalize.

				
--- TEXT GENERATION TASK RESULTS ---

Prompt 1: Explain the importance of renewable energy sources.
Generated Output: Renewable energy sources are important because they provide clean, sustainable power that reduces our reliance on fossil fuels and their associated environmental impacts. They also help mitigate climate change by reducing greenhouse gas emissions.
Expected Output: Renewable energy sources are vital because they reduce greenhouse gas emissions and dependence on fossil fuels.

Prompt 2: Describe the process of photosynthesis.
Generated Output: Photosynthesis is the process by which plants and some other organisms use sunlight to convert carbon dioxide and water into glucose and oxygen. It is a complex process involving several steps, including the absorption of light energy, the production of ATP and NADPH,
Expected Output: Photosynthesis is the process by which plants convert sunlight, carbon dioxide, and water into energy in the form of glucose.

Prompt 3: What are the causes and effects of global warming?
Generated Output: Global warming is caused by the release of greenhouse gases into the atmosphere, primarily from human activities such as burning fossil fuels. The effects include rising temperatures, melting ice caps, and more frequent extreme weather events.
Expected Output: Global warming is caused by the release of greenhouse gases and has effects such as rising sea levels and extreme weather events.

--- TEXT SUMMARIZATION TASK RESULTS ---

Original Document 1: Renewable energy sources, including solar, wind, and hydropower, offer significant benefits for the planet. They help reduce greenhouse gas emissions, mitigate climate change, and provide sustainable energy solutions.
Generated Summary: Renewable energy sources like solar, wind, and hydropower offer environmental benefits by reducing greenhouse gas emissions and mitigating climate change while providing sustainable energy solutions
Expected Summary: Renewable energy reduces emissions and mitigates climate change.

Original Document 2: Photosynthesis is a crucial process in which plants use sunlight to convert carbon dioxide and water into glucose and oxygen. This process provides energy for plants and oxygen for other living organisms.
Generated Summary: Photosynthesis is a process where plants use sunlight to make glucose and oxygen from carbon dioxide and water, providing energy for plants and oxygen for other organisms.
Expected Summary: Photosynthesis converts sunlight, carbon dioxide, and water into glucose.

Original Document 3: Global warming is primarily caused by human activities, such as burning fossil fuels and deforestation. Its consequences include rising sea levels, more frequent extreme weather events, and loss of biodiversity.
Generated Summary: Human activities like burning fossil fuels and deforestation cause global warming, leading to rising sea levels, extreme weather, and biodiversity loss.
Expected Summary: Global warming leads to rising sea levels and extreme weather.
				
			

Analyzing the Evaluation Metrics

The evaluation of Llama 3.1 8B reveals interesting insights about its capabilities across different NLP tasks:

BLEU Score Analysis

For text generation, we achieved a BLEU score of around 0.21, while summarization yielded a lower score of about 0.06.

The relatively low BLEU scores indicate challenges with exact n-gram matching between model outputs and human references.

This isn’t necessarily a critical flaw—BLEU prioritizes exact matches, while LLMs often produce semantically equivalent but lexically different outputs.

ROUGE Score Analysis

The ROUGE metrics show moderate performance with ROUGE-1 at 0.57 for generation and 0.44 for summarization, while ROUGE-L is around 0.48 and 0.40 respectively.

These scores indicate the model captures a fair portion of the reference content, but there’s still substantial divergence in the exact phrasing and structure.

BERTScore Analysis

The BERTScore F1 values are notably high at around 0.94 for generation and 0.92 for summarization.

This suggests that while the surface form (exact wording) differs from references, the semantic meaning is largely preserved. BERTScore’s contextual embeddings capture these semantic similarities that n-gram based metrics miss.

Task Comparison

Text generation performed better than summarization across all metrics.

This suggests the model may find it easier to expand on concepts (generation) than to compress information while preserving key points (summarization).

The summarization task requires more complex reasoning about information importance and conciseness.

Practical Implications

The high BERTScore with lower BLEU/ROUGE scores reveals an important characteristic of modern LLMs:

They excel at capturing meaning but express it in their own words rather than reproducing exact reference phrasing.

This makes them valuable for creative content generation and information reformulation, even if they don’t achieve perfect metric scores on traditional benchmarks.

The evaluation demonstrates that Llama 3.1 8B is a capable model with stronger semantic understanding than exact reproduction abilities, which aligns with its intended use cases in creative text generation and flexible information processing.

Conclusion

Evaluating LLMs is crucial for ensuring their accuracy, safety, and relevance in real-world applications.

By combining automated metrics, human judgment, and continuous testing, we can identify strengths and mitigate risks such as bias and toxicity.

Adopting a comprehensive evaluation framework leads to more reliable and ethical AI models.

As the field evolves, staying updated with new tools and practices will help improve model performance and transparency.

Incorporating robust evaluation practices is essential for responsible AI development and the successful deployment of LLMs in various industries.
