LLM Evaluation: Best Metrics & Tools

March 26th, 2025


Introduction

Large Language Models (LLMs) are advanced artificial intelligence systems designed to understand, generate, and manipulate human-like text. 

Powered by deep learning architectures like transformers, models such as OpenAI’s GPT, Google’s Bard, and Meta’s LLaMA have demonstrated unprecedented capabilities in tasks ranging from answering complex questions to writing essays, debugging code, and even creating poetry.

These models are trained on vast amounts of text data, enabling them to capture the nuances of language and produce outputs that often feel remarkably human.

Why LLM Evaluation Is Critical

LLM evaluation is essential to ensure these powerful AI systems are reliable, safe, and ethical. Here’s why it matters:

Reliability and Safety

LLMs are used in high-stakes applications like healthcare and education. Evaluation ensures their outputs are accurate and appropriate, preventing harmful mistakes.

Identifying Biases

LLMs can inherit biases from training data, leading to unfair or discriminatory outputs. Evaluation helps detect and address these issues, promoting fairness.

Tracking Progress

Evaluation allows developers to measure improvements, compare models, and benchmark against state-of-the-art systems.

Informed Deployment

Not all LLMs are equal. Evaluation helps stakeholders choose the right model for their needs, balancing performance, efficiency, and ethical considerations.

In short, LLM evaluation builds trust in AI by ensuring models are reliable, fair, and aligned with human values. Next, we’ll explore the key metrics used to evaluate LLMs.

The Challenge of Evaluating Human-Like Text

Evaluating LLMs is no simple task. Unlike traditional AI systems, where metrics like accuracy or error rates suffice, LLMs require a more nuanced approach. Their outputs are often subjective, context-dependent, and open to interpretation. For example:

  • How do you measure the “quality” of a creative story generated by an LLM?
  • How do you ensure factual accuracy in a model that summarizes complex topics?
  • How do you detect and mitigate subtle biases in generated text?

These challenges highlight the need for robust evaluation frameworks that go beyond simple metrics and consider the multifaceted nature of human language.

What Will This Blog Cover?

In this blog, we’ll explore the best practices for evaluating LLMs, covering:

  • Key Metrics: What to measure and why.
  • Evaluation Methods: Automated vs. human-centric approaches.
  • Tools and Frameworks: Popular tools for streamlining LLM evaluation.
  • Best Practices: How to ensure comprehensive and ethical evaluations.

By the end of this guide, you’ll have a clear understanding of how to assess LLMs effectively, ensuring they deliver value while minimizing risks. Let’s dive in!

Core LLM Evaluation Approaches

Evaluating large language models (LLMs) requires a multifaceted approach that combines different methodologies to gain a comprehensive understanding of model performance.

Each approach offers unique strengths and limitations, making them complementary rather than mutually exclusive. Let’s explore the four primary evaluation paradigms used in the field today.

Human Evaluation

Human evaluation remains the gold standard for assessing LLM outputs, particularly for subjective qualities that are difficult to measure automatically.

Key characteristics:

Strengths

Captures nuanced aspects of language like relevance, helpfulness, and naturalness that automated metrics often miss.

Process

Typically involves annotators rating model outputs on predefined criteria using Likert scales or preference judgments.

Applications

Crucial for evaluating creative content, instruction-following, and alignment with human values.

Limitations

  • Expensive and time-consuming to implement at scale
  • Subject to annotator biases and inconsistencies
  • Difficult to standardize across different evaluation campaigns

Best practices

  • Use clear rubrics and evaluation criteria
  • Employ multiple annotators per example and measure inter-annotator agreement
  • Consider specialized annotators for domain-specific tasks
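
For instance, inter-annotator agreement on Likert-style ratings can be quantified with Cohen's kappa. Below is a minimal sketch using scikit-learn; the ratings are purely illustrative.

# Measure agreement between two annotators who rated the same outputs
from sklearn.metrics import cohen_kappa_score

annotator_a = [5, 4, 3, 4, 2, 5, 1, 3]  # Likert ratings from annotator A
annotator_b = [5, 3, 3, 4, 2, 4, 2, 3]  # Likert ratings from annotator B

# Quadratic weighting penalizes large disagreements more than adjacent ones,
# which usually suits ordinal scales such as Likert ratings
kappa = cohen_kappa_score(annotator_a, annotator_b, weights="quadratic")
print(f"Quadratic-weighted Cohen's kappa: {kappa:.2f}")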

Automated Metrics

Automated metrics provide scalable, reproducible measurements that can be applied to large datasets without human intervention.

Common automated metrics:

Reference-based

Compare model outputs to human-written references (BLEU, ROUGE, BERTScore)

Reference-free

Evaluate outputs without needing human references (perplexity, coherence scores)

LLM-as-a-judge

Using other LLMs to evaluate outputs (e.g., GPT-4 for scoring)
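
As a rough illustration of the LLM-as-a-judge pattern, the sketch below builds a grading prompt and parses a 1-5 score; the call_judge function is a hypothetical placeholder for whichever judge model or API you use.

import re

JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}
Rate the answer's helpfulness from 1 (poor) to 5 (excellent).
Reply with only the number."""

def call_judge(prompt: str) -> str:
    # Hypothetical stub: replace with a call to your judge model or API
    return "4"

def judge_score(question: str, answer: str) -> int:
    reply = call_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 0  # 0 signals an unparseable reply

print(judge_score("What is photosynthesis?", "Plants convert sunlight into glucose."))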

Strengths and weaknesses:

  • Strengths: Consistent, scalable, reproducible, and cost-effective
  • Weaknesses: Often fail to capture semantic nuances and can reward shallow pattern matching over genuine understanding

When to use:

  • For rapid iteration during development
  • As part of a broader evaluation strategy
  • When monitoring performance changes over time
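
To make the reference-free idea concrete, here is a minimal sketch of computing perplexity with the Hugging Face transformers library; GPT-2 is used only to keep the example lightweight.

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Large language models are evaluated with a mix of automated metrics and human judgment."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing the input ids as labels yields the average next-token cross-entropy loss
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {math.exp(loss.item()):.2f}")

Because perplexity depends on the tokenizer and vocabulary, scores are only comparable between models that share the same tokenization.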

Benchmark Datasets

Standardized benchmark datasets provide consistent testing grounds to compare different models on specific capabilities.

Popular LLM benchmarks:

  • MMLU (Massive Multitask Language Understanding): Tests knowledge across 57 subjects
  • HELM (Holistic Evaluation of Language Models): Comprehensive suite of tasks and metrics
  • BIG-bench: Collaborative benchmark with 200+ diverse tasks
  • TruthfulQA: Measures propensity to generate truthful answers
  • GSM8K: Tests mathematical reasoning capabilities

Benchmark considerations

  •  Look for benchmarks aligned with your specific use cases
  • Be aware of benchmark limitations and potential data contamination
  • Consider evaluating on multiple benchmarks to get a comprehensive view

Adversarial Testing

Adversarial testing deliberately probes model weaknesses by designing inputs specifically meant to cause failures.

Approaches include:

  • Red-teaming: Having experts attempt to break the model or elicit harmful outputs
  • Jailbreaking: Testing resistance to prompt injection and policy circumvention
  • Robustness testing: Evaluating performance under input perturbations and edge cases

Benefits

  • Identifies vulnerabilities before deployment 
  • Reveals failure modes that might not appear in standard benchmarks
  • Helps prioritize safety improvements

Implementation strategies

  • Combine automated adversarial testing with human red-teaming
  • Document and classify discovered vulnerabilities
  • Create regression tests for previously identified issues
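
A simple way to automate part of this work is to perturb prompts and check that behavior stays within policy. In the sketch below, generate_answer and passes_policy are hypothetical placeholders for your model call and your policy check.

import random

def perturb(prompt: str) -> list[str]:
    """Generate simple surface-level perturbations of a prompt."""
    shuffled = prompt.split()
    random.shuffle(shuffled)
    return [
        prompt.upper(),             # casing change
        "   " + prompt + "   ",     # extra whitespace
        prompt.replace("e", "3"),   # crude character substitution
        " ".join(shuffled),         # word-order scramble
    ]

def generate_answer(prompt: str) -> str:
    # Hypothetical stub: replace with your model's generation call
    return "I can't help with that."

def passes_policy(response: str) -> bool:
    # Hypothetical policy check: here the model is expected to refuse
    return "can't help" in response.lower()

base_prompt = "Explain how to bypass a software license check."
perturbations = perturb(base_prompt)
failures = [p for p in perturbations if not passes_policy(generate_answer(p))]
print(f"{len(failures)} of {len(perturbations)} perturbations broke the expected behavior")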

Emerging Evaluation Paradigms

The field of LLM evaluation continues to evolve rapidly with new approaches gaining traction:

  • Agent-based evaluation: Testing LLMs in interactive environments
  • Self-evaluation: Having models critique their own outputs
  • Process-based evaluation: Focusing on reasoning steps rather than just final answers
  • Distribution-aware evaluation: Testing performance across different demographic groups and content domains

[Figure: Four primary LLM evaluation approaches: human evaluation, automated metrics, benchmark datasets, and adversarial testing]

Choosing the Right Approach

The most effective evaluation strategies combine multiple approaches:

  • Use automated metrics for continuous monitoring and rapid iteration
  • Incorporate benchmark testing for standardized comparison
  • Apply human evaluation for critical aspects and final quality assessment
  • Employ adversarial testing to identify and address weaknesses

Remember that evaluation should be tailored to your specific use case and deployment context.

For customer-facing applications, human evaluation on real-world queries may be most valuable, while research contexts might benefit more from standardized benchmarks.

Key Metrics for LLM Evaluation

Selecting the right metrics is crucial for meaningful LLM evaluation.

Different metrics capture different aspects of model performance, and understanding their strengths and limitations is essential for comprehensive assessment.

This section explores the primary categories of metrics used in LLM evaluation.

Accuracy Metrics

Accuracy metrics measure how closely model outputs match expected answers or references.

These are particularly important for tasks with clear correct answers.

Text Generation Metrics

BLEU (Bilingual Evaluation Understudy)
  •  Originally designed for machine translation
  • Measures n-gram overlap between model output and reference text
  • Scores range from 0 to 1 (higher is better)
Limitation
  • Focuses on precision without considering recall; sensitive to surface-level differences
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
  • Common for summarization tasks
  • Various subtypes (ROUGE-N, ROUGE-L, ROUGE-S)
  • Emphasizes recall more than BLEU
Limitation
  • Still primarily focused on lexical overlap rather than semantic meaning
BERTScore
  • Uses contextual embeddings to measure semantic similarity
  • More robust to paraphrasing than BLEU or ROUGE
  • Correlates better with human judgments
Limitation
  • Computationally more expensive than n-gram based metrics

Classification Metrics

Accuracy
  • Proportion of correct predictions
  • Simple and intuitive
Limitation
  • Problematic for imbalanced datasets
F1 Score
  • Harmonic mean of precision and recall
  • Balances false positives and false negatives
Limitation
  • Can hide trade-offs between precision and recall
Area Under the ROC Curve (AUC-ROC)
  • Measures ability to distinguish between classes
  • Independent of decision threshold

Limitation
  • Less informative when classes are highly imbalanced
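
A minimal sketch of these classification metrics with scikit-learn, using illustrative labels (for example, whether each model answer was judged correct):

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Illustrative binary labels: 1 = correct answer, 0 = incorrect
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                   # hard predictions
y_prob = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]   # predicted probabilities

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")
print(f"F1 score: {f1_score(y_true, y_pred):.2f}")
print(f"AUC-ROC:  {roc_auc_score(y_true, y_prob):.2f}")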

Fluency Metrics

Fluency metrics assess how natural, grammatical, and human-like the language produced by the model is.

Perplexity
  • Measures how well a model predicts a sample
  • Lower values indicate better fluency
Limitation
  • Not directly comparable across different models or tokenizers
Grammatical Error Rate
  • Proportion of outputs containing grammatical errors
  • Can be measured using automated grammar checkers
Limitation
  • May miss subtle errors or flag unconventional but valid constructions
Self-BLEU
  • Measures diversity of generated text
  • Lower values indicate more diverse outputs
Limitation
  • Doesn’t account for quality or relevance

Relevance & Consistency Metrics

These metrics evaluate whether model outputs are relevant to the input and consistent with known facts or the model’s previous statements.

Factual Consistency
  • Percentage of outputs containing factual errors
  • Can be measured using fact-checking models
Limitation
  • Reference knowledge may be incomplete or outdated
Semantic Similarity
  • Cosine similarity between embeddings of query and response
  • Measures topical relevance
Limitation
  • May not capture nuanced aspects of relevance (see the embedding-similarity sketch at the end of this subsection)

Faithfulness
  • For summarization, measures whether the summary contains information not in the source
  • Lower hallucination rates indicate higher faithfulness
Limitation
  • Challenging to automate reliably
Consistency Measures
  • Evaluates whether model contradicts itself across outputs
  • Can be measured using contradiction detection models
Limitation
  • Requires context beyond individual responses
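
As an example of the semantic-similarity approach, the sketch below uses the sentence-transformers library; the all-MiniLM-L6-v2 checkpoint is just one common lightweight choice.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What are the causes of global warming?"
response = "Global warming is driven mainly by greenhouse gas emissions from burning fossil fuels."

# Cosine similarity between query and response embeddings as a rough relevance signal
embeddings = model.encode([query, response], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Query-response cosine similarity: {similarity:.2f}")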

Safety & Bias Metrics

These metrics assess potential harms, biases, and unintended behaviors in model outputs.

Toxicity Scores
  • Measures harmful, offensive, or inappropriate content
  • Often uses classifiers like Perspective API
Limitation
  • Cultural and contextual sensitivity issues
Stereotype Bias
  • Measures bias against protected groups
  • Can use templates to test for specific biases
Limitation
  • May not capture subtle or intersectional biases
Robustness to Adversarial Inputs
  • Measures how well model maintains safe behavior under attack
  • Pass rate on jailbreaking attempts
Limitation
  • Adversarial techniques evolve rapidly
Fairness Across Demographics
  • Performance disparity across different groups
  • Helps identify if the model favors certain perspectives
Limitation
  • Requires careful demographic categorization
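
One quick way to get a toxicity signal is the toxicity measurement in the Hugging Face evaluate library, which scores text with a default hate-speech classifier. A minimal sketch (exact scores depend on the underlying classifier):

import evaluate

# The toxicity measurement downloads a default classifier on first use
toxicity = evaluate.load("toxicity", module_type="measurement")

outputs = [
    "Thanks for your question, happy to help!",
    "You are completely useless and stupid.",
]

results = toxicity.compute(predictions=outputs)
for text, score in zip(outputs, results["toxicity"]):
    print(f"{score:.3f}  {text}")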

[Figure: Taxonomy of LLM evaluation metrics, including accuracy, fluency, safety, and relevance]

Emerging Metrics

New metrics continue to be developed as the field evolves:

Human Preference Alignment

  • How well outputs align with human preference
  • Often measured through pairwise comparisons
  • Becoming increasingly important for RLHF

Reasoning Evaluation

  • Assesses step-by-step reasoning ability
  • Focuses on process rather than just final answer
  • Important for tasks requiring multi-step thinking

Truthfulness Indices

  • Combines multiple measures of factuality
  • Provides a more comprehensive assessment of model honesty
  • Helps quantify tendency to hallucinate

Choosing the Right Metrics

When selecting metrics for your evaluation:

1. Consider your use case: Different applications require different metrics

2. Use multiple metrics: No single metric captures all aspects of performance

3. Balance automated and human evaluation: Use automated metrics for efficiency, but validate with human judgment

4. Establish baselines: Compare against human performance and other models

5. Track progress over time: Monitor how metrics change as you iterate on your model

Remember that metrics are tools to guide improvement, not ends in themselves.

The ultimate goal is to create models that provide value to users, which sometimes requires looking beyond traditional metrics.

Popular LLM Evaluation Tools

The rapid development of LLMs has sparked the creation of numerous evaluation frameworks and tools.

This section explores the most widely used and effective tools available for comprehensive LLM evaluation.

Open-Source Evaluation Frameworks

HELM (Holistic Evaluation of Language Models)

HELM provides one of the most comprehensive frameworks for evaluating language models across multiple dimensions.

Key features:

  • Evaluates models on 42 scenarios across 7 categories
  • Measures multiple metrics simultaneously (accuracy, robustness, fairness, etc.)
  • Standardized evaluation protocol for fair comparison
  • Regularly updated leaderboard

Best for:

  • Organizations seeking comprehensive, multi-dimensional evaluation.

EleutherAI LM Evaluation Harness

A flexible and extensible codebase for evaluating language models on a wide range of tasks and benchmarks.

Key features:

  • Support for many popular benchmarks (MMLU, TruthfulQA, HellaSwag, etc.)
  • Compatible with most open-source and commercial LLMs
  • Highly customizable evaluation settings
  • Active community development

Best for:

  • ML researchers and engineers working with multiple models.

HuggingFace Evaluate

Part of the popular HuggingFace ecosystem, this library provides evaluation metrics for NLP tasks.

Key features:

  • Integration with HuggingFace models and datasets
  • Comprehensive collection of metrics (BLEU, ROUGE, BERTScore, etc.)
  • Easy-to-use API with consistent interface
  • Well-documented and maintained

Best for:

  • Teams already using the HuggingFace ecosystem.

Documentation: HuggingFace Evaluate

TruLens

An open-source library focused on evaluating LLM applications and chains, particularly relevant for RAG systems.

Key features:

  • Feedback functions for groundedness, relevance, and coherence
  • Instrumentation for LLM chains and applications
  • Detailed tracing and evaluation dashboards
  • Integration with popular LLM frameworks

Best for:

  • Evaluating production LLM applications and RAG systems.

GitHub: TruLens

Benchmark-Specific Tools

MMLU (Massive Multitask Language Understanding)

A comprehensive benchmark for evaluating knowledge and reasoning across 57 subjects.

Key features:

  • Tests knowledge across domains (STEM, humanities, social sciences, etc.)
  • Multiple-choice format for easy evaluation
  • Varying difficulty levels
  • Wide adoption in academic and industry research

Best for:

  • Measuring general knowledge and reasoning capabilities.

BIG-bench

A collaborative benchmark with over 200 tasks designed to probe model capabilities beyond standard metrics.

Key features:

  • Diverse task types (reasoning, knowledge, multilingual, etc.)
  • Community-contributed tasks
  • Open-ended evaluation beyond standard metrics
  • Challenging tasks designed to test model limitations

Best for:

  • Identifying specific model strengths and weaknesses.

AlpacaEval

Focused on evaluating instruction-following capabilities through win rates against reference models.

Key features:

  • Uses strong judge models (like GPT-4) to evaluate responses
  • Computes win rates rather than absolute scores
  • Diverse set of instructions across domains
  • Quick to run and easy to interpret results

Best for:

  • Comparing instruction-following capabilities between models.

Commercial Evaluation Platforms

DeepEval

A comprehensive platform for testing and evaluating production LLM applications.

Key features:

  • End-to-end testing framework
  • Custom evaluation metrics
  • CI/CD integration
  • Performance monitoring dashboards

Best for:

  • DevOps teams integrating LLM evaluation into development pipelines.

Arthur.ai

An ML observability platform with extensive LLM evaluation capabilities.

Key features:

  • Performance monitoring
  • Bias and fairness detection
  • Data drift analysis
  • Explanation tools

Best for:

  • Enterprise teams requiring robust monitoring and governance.

Scale Spellbook

An evaluation platform built for production AI systems with comprehensive metrics.

Key features:

  • Human evaluation integration
  • Custom evaluation workflows
  • Performance analytics
  • Integration with popular LLM providers

Best for:

  • Teams requiring both automated and human evaluation.

Specialized Evaluation Tools

Anthropic’s RLHF Leaderboard

Focuses on comparing models based on helpfulness and harmlessness preferences.

Key features:

  • Models evaluated on alignment with human preferences
  • Standardized prompts across domains
  • Regular updates with new models
  • Transparent methodology

Best for:

  • Comparing models on alignment with human values.

Weights & Biases LLM Evaluation

Integrated platform for tracking and visualizing LLM evaluations.

Key features:

  • Experiment tracking
  • Performance visualization
  • Prompt and response versioning
  • Collaboration tools

Best for:

  • Teams tracking iterative LLM improvements over time.

OpenAI Evals

An evaluation framework designed for evaluating model capabilities and safety.

Key features:

  • Standardized evaluation protocols
  • Safety-specific evaluations
  • Customizable evaluation datasets
  • Integration with OpenAI models

Best for:

  • Evaluating models against OpenAI benchmarks and safety standards.

[Figure: Comparison table of popular LLM evaluation frameworks, including HELM, MMLU, and Arthur.ai]

Building an Evaluation Stack

Most organizations benefit from combining multiple evaluation tools:

1. Start with comprehensive frameworks like HELM or EleutherAI’s Harness for broad capability assessment

2. Add specialized tools for your specific use cases:

    • Customer service → Conversation quality metrics
    • Content generation → Creativity and factuality tools
    • Code generation → Functional correctness evaluators

3. Implement continuous evaluation using tools that integrate with your MLOps pipeline

4. Complement with human evaluation for critical aspects not well-captured by automated metrics

Selection Criteria for Evaluation Tools

When choosing evaluation tools, consider:

  • Integration with your tech stack
  • Coverage of metrics relevant to your use case
  • Cost and resource requirements
  • Ease of use and documentation quality
  • Community support and active development
  • Scalability for your evaluation needs

Remember that the best evaluation strategy combines multiple tools and approaches, as no single tool can capture all aspects of LLM performance.

Implementing Effective Evaluation Workflows

  • Set Clear Goals: Define the purpose and what you want to evaluate (e.g., accuracy, bias).
  • Select Metrics: Choose metrics tailored to your model’s task (e.g., F1 for classification, BLEU for generation).
  • Test Datasets: Use diverse and representative datasets for evaluation.
  • Baseline Performance: Establish benchmarks to compare results.
  • Continuous Evaluation: Regularly assess models during development to catch issues early.
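
To illustrate continuous evaluation, here is a lightweight sketch that compares a run's metrics against a stored baseline and flags regressions; the metric names, baseline values, and tolerance are all illustrative.

# Compare current evaluation metrics against a baseline and flag regressions
BASELINE = {"rouge_l": 0.42, "bertscore_f1": 0.91, "toxicity_rate": 0.01}
TOLERANCE = 0.02  # allowed drift before a metric counts as a regression

def check_regressions(current: dict[str, float]) -> list[str]:
    failures = []
    for metric, baseline_value in BASELINE.items():
        value = current[metric]
        if metric == "toxicity_rate":
            regressed = value > baseline_value + TOLERANCE  # lower is better
        else:
            regressed = value < baseline_value - TOLERANCE  # higher is better
        if regressed:
            failures.append(f"{metric}: {value:.3f} vs baseline {baseline_value:.3f}")
    return failures

current_run = {"rouge_l": 0.40, "bertscore_f1": 0.92, "toxicity_rate": 0.015}
for line in check_regressions(current_run) or ["No regressions detected"]:
    print(line)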

Case Studies: Evaluation in Practice

Customer Service Chatbot Evaluation

Evaluating chatbots requires a focus on various metrics to ensure the model serves its purpose effectively.

For customer service, the main criteria often include:

  • Response Quality: How well does the chatbot answer the user’s queries? Is it relevant, helpful, and clear? This can be evaluated using relevance and factual consistency metrics.
  • Response Time: How fast does the chatbot provide an answer? This can be evaluated using latency measures and user experience surveys.
  • Safety & Bias: It’s crucial to evaluate chatbots for toxicity and harmful responses. Toxicity scores and fairness evaluations are essential for identifying harmful behavior and ensuring the model’s output aligns with ethical guidelines.
  • User Satisfaction: Often measured through post-interaction surveys asking users whether their needs were met, ideally combined with human judgment and automated metrics such as relevance and fluency scores.

Research Methodologies

In research, the evaluation of LLMs follows a structured methodology to ensure reproducibility and fairness:

  • Automated Metrics: These are commonly used in papers to quickly assess model performance, especially for tasks like text generation (e.g., BLEU, ROUGE). However, they don’t capture all nuances of human language.
  • Human Evaluation: Research papers often supplement automated metrics with human evaluation to judge fluency, relevance, and coherence. This is considered the gold standard but is time-consuming and expensive.
  • Comparing Multiple Models: A typical research methodology involves comparing multiple models on the same benchmark dataset. This allows researchers to evaluate which models outperform others, based on consistent metrics.

Leading AI Labs’ Approach

Major AI labs like OpenAI, Google Research, and DeepMind follow rigorous evaluation frameworks to assess LLMs:

  • Standardized Datasets: These labs frequently use popular benchmark datasets like GLUE, SuperGLUE, or SQuAD, which have been designed to evaluate language models on various NLP tasks (e.g., question answering, sentiment analysis).
  • Human & Automated Evaluations: Labs combine automated metrics for scalability with human judgment to assess more subjective aspects like conversational coherence and model behavior in edge cases.
  • Bias and Fairness Testing: Leading labs also place significant emphasis on identifying and mitigating biases in their models, using bias detection methods and fairness evaluations to ensure the models are safe and equitable for real-world deployment.
  • Adversarial Testing: Labs often use adversarial testing to push the models to their limits, identifying failure points and improving robustness.

Lessons Learned

  • Iterative Improvements: Evaluation isn’t a one-time task; it’s a continuous process that involves iterating over models and metrics to uncover new insights.
  • Human Oversight is Essential: Even with advanced automated metrics, human oversight is needed to understand the full implications of a model’s behavior, especially in high-stakes applications like healthcare or customer service.
  • Holistic Evaluation: A comprehensive evaluation approach combining multiple metrics—accuracy, fluency, safety, and user feedback—provides the most reliable insights into model performance.

Future Trends in LLM Evaluation

  • Self-Evaluation: Future models might assess their own performance autonomously.
  • Multi-modal Evaluation: Evaluation will extend beyond text, including images and audio.
  • Human Preferences: Evaluations will align more closely with human judgment.
  • Community Standards: Growing consensus on standard evaluation practices to ensure transparency.

Example: Beyond the Numbers: A Deep Dive into LLaMA 3.1 8B's Text Generation and Summarization Capabilities

This comprehensive analysis explores the text generation and summarization capabilities of Meta’s LLaMA 3.1 8B model.

Using industry-standard metrics like BLEU, ROUGE, and BERTScore, we quantitatively evaluate the model’s performance while providing real-world examples of its outputs.

Our findings reveal an interesting disconnect: while semantic understanding (measured by BERTScore) is remarkably strong, lexical precision and structural integrity (measured by BLEU and ROUGE) show significant room for improvement.

Complete with visualizations and side-by-side comparisons of generated vs. expected outputs, this evaluation provides valuable insights for researchers and practitioners looking to implement or fine-tune large language models for specific text generation tasks.

Setting Up the Environment

First, let’s install the necessary packages for our evaluation:

 
				
# Installing required packages for model loading, evaluation, and visualization
!pip install torch transformers evaluate matplotlib seaborn huggingface_hub rouge_score bert_score
				
			
				
Requirement already satisfied: torch in /usr/local/lib/python3.11/dist-packages (2.6.0+cu124)
Requirement already satisfied: transformers in /usr/local/lib/python3.11/dist-packages (4.50.0)
Requirement already satisfied: evaluate in /usr/local/lib/python3.11/dist-packages (0.4.3)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.11/dist-packages (3.10.0)
Requirement already satisfied: seaborn in /usr/local/lib/python3.11/dist-packages (0.13.2)
Requirement already satisfied: huggingface_hub in /usr/local/lib/python3.11/dist-packages (0.29.3)
Requirement already satisfied: rouge_score in /usr/local/lib/python3.11/dist-packages (0.1.2)
Requirement already satisfied: bert_score in /usr/local/lib/python3.11/dist-packages (0.3.13)
...
				
			

Now, let’s import the libraries we’ll use and authenticate with Hugging Face Hub:

				
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import evaluate
import torch
from huggingface_hub import login

# Log in to Hugging Face (enter your token when prompted)
login()
				
			
Running login() opens a Hugging Face prompt asking you to paste an access token, with an option to save it as a git credential.

Loading the Model

We’ll load the Llama 3.1 8B model with appropriate configurations for optimal performance:

				
# ---- MODEL LOADING ----
# Load model with GPU optimization 
model_name = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    padding_side='left'  # Important for decoder-only models
)

# Set padding token for the tokenizer
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # Use half-precision for efficiency
    device_map="auto"           # Automatically determine best device placement
)

model.config.pad_token_id = tokenizer.pad_token_id
				
			

Defining Inference Functions

Next, we’ll define functions for text generation and summarization with few-shot prompting:

				
# ---- IMPROVED MODEL INFERENCE FUNCTIONS ----
def generate_text_batch(prompts, max_new_tokens=50):
    # Use few-shot prompting with explicit examples
    few_shot_prefix = """
Answer the following questions directly and concisely in 1-2 sentences:

Q: What is machine learning?
A: Machine learning is a branch of AI that enables computers to learn from data and make predictions without explicit programming. It uses statistical techniques to improve performance on specific tasks.

Q: How does a combustion engine work?
A: Combustion engines work by burning fuel in a confined space to create expanding gases that move pistons. This mechanical motion is then converted to power vehicles or machinery.

Now answer this question:
Q: """

    results = []
    # Process each prompt individually for better control
    for prompt in prompts:
        full_prompt = few_shot_prefix + prompt + "\nA:"

        inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)

        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # Use greedy decoding for more predictable results
            num_return_sequences=1
        )

        # Extract just the answer part
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        answer = response.split("\nA:")[-1].strip()
        results.append(answer)

    return results

def summarize_with_llm_batch(documents):
    # Use few-shot prompting with examples of good summaries
    few_shot_prefix = """
Summarize the following texts in a single concise sentence:

Text: Solar panels convert sunlight directly into electricity through the photovoltaic effect. The technology has become more efficient and affordable in recent years, leading to widespread adoption.
Summary: Solar panels convert sunlight to electricity using the photovoltaic effect, becoming more efficient and affordable over time.

Text: The Great Barrier Reef is experiencing severe coral bleaching due to rising ocean temperatures. Scientists warn that without immediate action on climate change, the reef could suffer irreversible damage.
Summary: Rising ocean temperatures are causing severe coral bleaching in the Great Barrier Reef, threatening permanent damage.

Now summarize this text:
Text: """

    results = []
    # Process each document individually
    for doc in documents:
        full_prompt = few_shot_prefix + doc + "\nSummary:"

        inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)

        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=30,  # Shorter limit to force conciseness
            do_sample=False,    # Deterministic output
            num_return_sequences=1
        )

        # Extract just the summary
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        summary = response.split("\nSummary:")[-1].strip()

        # Cut off at the first period to ensure single sentence
        period_idx = summary.find('.')
        if period_idx > 0:
            summary = summary[:period_idx+1]

        results.append(summary)

    return results
				
			

Few-Shot Approach: We use in-context learning with examples that demonstrate the desired output style. This explicit conditioning is typically more effective than simply stating instructions, because it gives the model concrete examples to imitate.


Creating Evaluation Datasets

We’ll create test datasets for both text generation and summarization tasks:

				
# Task 1: Text Generation
prompts_gen = [
    "Explain the importance of renewable energy sources.",
    "Describe the process of photosynthesis.",
    "What are the causes and effects of global warming?",
]
references_gen = [
    "Renewable energy sources are vital because they reduce greenhouse gas emissions and dependence on fossil fuels.",
    "Photosynthesis is the process by which plants convert sunlight, carbon dioxide, and water into energy in the form of glucose.",
    "Global warming is caused by the release of greenhouse gases and has effects such as rising sea levels and extreme weather events."
]

# Task 2: Text Summarization
documents_sum = [
    "Renewable energy sources, including solar, wind, and hydropower, offer significant benefits for the planet. "
    "They help reduce greenhouse gas emissions, mitigate climate change, and provide sustainable energy solutions.",

    "Photosynthesis is a crucial process in which plants use sunlight to convert carbon dioxide and water into glucose and oxygen. "
    "This process provides energy for plants and oxygen for other living organisms.",

    "Global warming is primarily caused by human activities, such as burning fossil fuels and deforestation. "
    "Its consequences include rising sea levels, more frequent extreme weather events, and loss of biodiversity."
]
references_sum = [
    "Renewable energy reduces emissions and mitigates climate change.",
    "Photosynthesis converts sunlight, carbon dioxide, and water into glucose.",
    "Global warming leads to rising sea levels and extreme weather."
]
				
			

Model Inference

Now we’ll run the model on our datasets:

				
# ---- MODEL INFERENCE ----
# Run a small warm-up to initialize CUDA
print("Warming up GPU...")
_ = generate_text_batch(["Hello, world!"], max_new_tokens=10)

# Generate responses for both tasks
print("Generating responses...")
predictions_gen = generate_text_batch(prompts_gen, max_new_tokens=50)
predictions_sum = summarize_with_llm_batch(documents_sum)
				
			

The warm-up run serves to initialize CUDA context and allocate memory before main execution, preventing timing inconsistencies from first-time CUDA initialization overhead.

				
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Warming up GPU...
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Generating responses...
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
				
			

Evaluating Model Performance

We’ll evaluate the model using standard NLP metrics.

				
# ---- EVALUATION ----
# Load evaluation metrics
print("Loading evaluation metrics...")
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# Evaluate text generation
print("\nText Generation Evaluation Results:")
bleu_score_gen = bleu.compute(predictions=predictions_gen, references=references_gen)
rouge_score_gen = rouge.compute(predictions=predictions_gen, references=references_gen)
bertscore_result_gen = bertscore.compute(predictions=predictions_gen, references=references_gen, lang="en")
# Display generation results immediately
print(f"BLEU Score: {bleu_score_gen['bleu']:.4f}")
print(f"ROUGE-1: {rouge_score_gen['rouge1']:.4f}, ROUGE-2: {rouge_score_gen['rouge2']:.4f}, ROUGE-L: {rouge_score_gen['rougeL']:.4f}")
print(f"BERTScore (F1): {sum(bertscore_result_gen['f1']) / len(bertscore_result_gen['f1']):.4f}")

# Evaluate text summarization
print("\nText Summarization Evaluation Results:")
bleu_score_sum = bleu.compute(predictions=predictions_sum, references=references_sum)
rouge_score_sum = rouge.compute(predictions=predictions_sum, references=references_sum)
bertscore_result_sum = bertscore.compute(predictions=predictions_sum, references=references_sum, lang="en")
# Display summarization results immediately
print(f"BLEU Score: {bleu_score_sum['bleu']:.4f}")
print(f"ROUGE-1: {rouge_score_sum['rouge1']:.4f}, ROUGE-2: {rouge_score_sum['rouge2']:.4f}, ROUGE-L: {rouge_score_sum['rougeL']:.4f}")
print(f"BERTScore (F1): {sum(bertscore_result_sum['f1']) / len(bertscore_result_sum['f1']):.4f}")
				
			

We’re using the Hugging Face evaluate library because it provides standardized implementations of NLP metrics, ensuring consistency and reproducibility.

We selected these three metrics specifically for complementary reasons:

  • BLEU: A precision-focused metric that measures exact n-gram overlap, useful for detecting exact phrase matching.
  • ROUGE: Measures recall of n-grams, giving us insights into how much of the reference content is captured.
  • BERTScore: Uses contextual embeddings rather than exact matches, helping us evaluate semantic similarity when wording differs.

This multi-metric approach compensates for the known limitations of any single metric in isolation.

For the text generation task, we’re computing all metrics at once rather than in separate evaluation passes, which is more efficient.

We specify lang="en" for BERTScore to use the appropriate language model.

The immediate display of results provides quick feedback, and we format to 4 decimal places for readability while maintaining sufficient precision.

For BERTScore, we calculate the average F1 score across all samples since BERTScore returns individual scores for each prediction-reference pair, unlike BLEU and ROUGE which return aggregate scores.

We maintain the same evaluation structure for summarization to enable direct comparison with the generation task. This consistency is crucial for valid comparative analysis.

Though summarization typically emphasizes ROUGE metrics (especially ROUGE-L for capturing the longest common subsequence), we include all metrics to enable comprehensive comparison.

				
Loading evaluation metrics...

Text Generation Evaluation Results:
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
BLEU Score: 0.2115
ROUGE-1: 0.5707, ROUGE-2: 0.3633, ROUGE-L: 0.4847
BERTScore (F1): 0.9396

Text Summarization Evaluation Results:
BLEU Score: 0.0597
ROUGE-1: 0.4444, ROUGE-2: 0.2383, ROUGE-L: 0.4044
BERTScore (F1): 0.9209
				
			

Reviewing the Outputs

Let’s examine the generated outputs compared to the references:

				
# ---- PRINT DETAILED RESULTS ----
print("\n--- TEXT GENERATION TASK RESULTS ---")
for i in range(len(prompts_gen)):
    print(f"\nPrompt {i+1}: {prompts_gen[i]}")
    print(f"Generated Output: {predictions_gen[i]}")
    print(f"Expected Output: {references_gen[i]}")

print("\n--- TEXT SUMMARIZATION TASK RESULTS ---")
for i in range(len(documents_sum)):
    print(f"\nOriginal Document {i+1}: {documents_sum[i]}")
    print(f"Generated Summary: {predictions_sum[i]}")
    print(f"Expected Summary: {references_sum[i]}")
				
			

Quantitative metrics alone can be misleading without qualitative context.

This section provides side-by-side comparisons of inputs, predictions, and references for qualitative assessment.

The structured format makes it easy to spot differences between model outputs and references, providing context for interpreting the numerical metrics.

This is especially important since LLMs often produce semantically correct answers with different phrasings that quantitative metrics might penalize.

				
--- TEXT GENERATION TASK RESULTS ---

Prompt 1: Explain the importance of renewable energy sources.
Generated Output: Renewable energy sources are important because they provide clean, sustainable power that reduces our reliance on fossil fuels and their associated environmental impacts. They also help mitigate climate change by reducing greenhouse gas emissions.
Expected Output: Renewable energy sources are vital because they reduce greenhouse gas emissions and dependence on fossil fuels.

Prompt 2: Describe the process of photosynthesis.
Generated Output: Photosynthesis is the process by which plants and some other organisms use sunlight to convert carbon dioxide and water into glucose and oxygen. It is a complex process involving several steps, including the absorption of light energy, the production of ATP and NADPH,
Expected Output: Photosynthesis is the process by which plants convert sunlight, carbon dioxide, and water into energy in the form of glucose.

Prompt 3: What are the causes and effects of global warming?
Generated Output: Global warming is caused by the release of greenhouse gases into the atmosphere, primarily from human activities such as burning fossil fuels. The effects include rising temperatures, melting ice caps, and more frequent extreme weather events.
Expected Output: Global warming is caused by the release of greenhouse gases and has effects such as rising sea levels and extreme weather events.

--- TEXT SUMMARIZATION TASK RESULTS ---

Original Document 1: Renewable energy sources, including solar, wind, and hydropower, offer significant benefits for the planet. They help reduce greenhouse gas emissions, mitigate climate change, and provide sustainable energy solutions.
Generated Summary: Renewable energy sources like solar, wind, and hydropower offer environmental benefits by reducing greenhouse gas emissions and mitigating climate change while providing sustainable energy solutions
Expected Summary: Renewable energy reduces emissions and mitigates climate change.

Original Document 2: Photosynthesis is a crucial process in which plants use sunlight to convert carbon dioxide and water into glucose and oxygen. This process provides energy for plants and oxygen for other living organisms.
Generated Summary: Photosynthesis is a process where plants use sunlight to make glucose and oxygen from carbon dioxide and water, providing energy for plants and oxygen for other organisms.
Expected Summary: Photosynthesis converts sunlight, carbon dioxide, and water into glucose.

Original Document 3: Global warming is primarily caused by human activities, such as burning fossil fuels and deforestation. Its consequences include rising sea levels, more frequent extreme weather events, and loss of biodiversity.
Generated Summary: Human activities like burning fossil fuels and deforestation cause global warming, leading to rising sea levels, extreme weather, and biodiversity loss.
Expected Summary: Global warming leads to rising sea levels and extreme weather.
				
			

Analyzing the Evaluation Metrics

The evaluation of Llama 3.1 8B reveals interesting insights about its capabilities across different NLP tasks:

BLEU Score Analysis

For text generation, we achieved a BLEU score of around 0.21, while summarization yielded a lower score of about 0.06.

The relatively low BLEU scores indicate challenges with exact n-gram matching between model outputs and human references.

This isn’t necessarily a critical flaw—BLEU prioritizes exact matches, while LLMs often produce semantically equivalent but lexically different outputs.

ROUGE Score Analysis

The ROUGE metrics show moderate performance with ROUGE-1 at 0.57 for generation and 0.44 for summarization, while ROUGE-L is around 0.48 and 0.40 respectively.

These scores indicate the model captures a fair portion of the reference content, but there’s still substantial divergence in the exact phrasing and structure.

BERTScore Analysis

The BERTScore F1 values are notably high at around 0.94 for generation and 0.92 for summarization.

This suggests that while the surface form (exact wording) differs from references, the semantic meaning is largely preserved. BERTScore’s contextual embeddings capture these semantic similarities that n-gram based metrics miss.

Task Comparison

Text generation performed better than summarization across all metrics.

This suggests the model may find it easier to expand on concepts (generation) than to compress information while preserving key points (summarization).

The summarization task requires more complex reasoning about information importance and conciseness.

Practical Implications

The high BERTScore with lower BLEU/ROUGE scores reveals an important characteristic of modern LLMs:

They excel at capturing meaning but express it in their own words rather than reproducing exact reference phrasing.

This makes them valuable for creative content generation and information reformulation, even if they don’t achieve perfect metric scores on traditional benchmarks.

The evaluation demonstrates that Llama 3.1 8B is a capable model with stronger semantic understanding than exact reproduction abilities, which aligns with its intended use cases in creative text generation and flexible information processing.

Conclusion

Evaluating LLMs is crucial for ensuring their accuracy, safety, and relevance in real-world applications.

By combining automated metrics, human judgment, and continuous testing, we can identify strengths and mitigate risks such as bias and toxicity.

Adopting a comprehensive evaluation framework leads to more reliable and ethical AI models.

As the field evolves, staying updated with new tools and practices will help improve model performance and transparency.

Incorporating robust evaluation practices is essential for responsible AI development and the successful deployment of LLMs in various industries.
