The discontinuation of Hugging Face’s Open LLM Leaderboard has left a gap in the community for standardized evaluation of large language models (LLMs). To address this, I developed the LLM Evaluation Framework, a comprehensive and modular tool designed to facilitate reproducible and extensible benchmarking of LLMs across various tasks and benchmarks.
The LLM Evaluation Framework can be found on my GitHub account: LLM Evaluation Framework
🧩 Empowering Transparent and Reproducible LLM Evaluations
The Open LLM Leaderboard was instrumental in providing a centralized platform for evaluating and comparing LLMs. Its retirement has underscored the need for tools that allow researchers and developers to conduct their own evaluations with transparency and consistency. The LLM Evaluation Framework aims to fill this void by offering:
Modular Design: Inspired by microservice architecture, enabling easy integration and customization.
Multiple Model Backends: Support for Hugging Face (hf) and vLLM backends, allowing flexibility in model loading and inference.
Quantization Support: Evaluate quantized models (e.g., 4-bit, 8-bit with hf, AWQ with vLLM) to assess performance under resource constraints.
Comprehensive Benchmarks: Includes support for standard benchmarks like MMLU, GSM8K, BBH, and more.
Leaderboard Replication: Easily run evaluations mimicking the Open LLM Leaderboard setup with standardized few-shot settings.
Flexible Configuration: Customize evaluations via CLI arguments or programmatic usage.
Detailed Reporting: Generates JSON results and Markdown reports for easy analysis.
Parallelism: Leverages vLLM for efficient inference, including tensor parallelism across multiple GPUs.
🚀 Getting Started
Installation
- Clone the Repository:
!git clone https://github.com/mattdepaolis/llm-evaluation.git
!cd llm-evaluation
- Set Up a Virtual Environment:
!python -m venv .venv
!source .venv/bin/activate # On Windows use `.venv\Scripts\activate`
- Install Dependencies:
!pip install -e lm-evaluation-harness
!pip install torch numpy tqdm transformers accelerate bitsandbytes sentencepiece
!pip install vllm # If you plan to use the vLLM backend
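Optionally, before launching a full run, you can confirm that the core dependencies import cleanly and that a GPU is visible. This quick check is not part of the framework itself, just a convenience:
# Optional sanity check (not part of the framework): verify the core dependencies
# import and that a CUDA device is visible.
import torch
import transformers
print(f"PyTorch {torch.__version__}, Transformers {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")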
🧪 Example: Evaluating Your Model on the LEADERBOARD Benchmark
Using the Command-Line Interface (CLI)
Let’s illustrate how the LLM Evaluation Framework simplifies benchmarking by replicating the popular Hugging Face Open LLM Leaderboard setup—particularly useful given its recent discontinuation. Here’s a practical CLI example that runs the complete leaderboard evaluation:
!python llm_eval_cli.py \
--model hf \
--model_name meta-llama/Llama-2-13b-chat-hf \
--leaderboard \
--device cuda \
--gpu_memory_utilization 0.9 # Adjust based on your GPU availability
With this simple command, the framework evaluates your model across several key benchmarks including BBH, GPQA, MMLU-Pro, MUSR, IFEval, and Math-lvl-5, automatically configuring the appropriate few-shot examples for each benchmark.
Using as a Python Library
Integrate the evaluation logic directly into your Python scripts:
from llm_eval import evaluate_model
# Run the evaluation
results, output_path = evaluate_model(
    model_type="hf",
    model_name="mistralai/Ministral-8B-Instruct-2410",
    tasks=["leaderboard"],
    num_samples=1,
    device="cuda",
    quantize=True,
    quantization_method="4bit",
    preserve_default_fewshot=True  # This ensures the correct few-shot settings for each benchmark task
)
# Print the paths to the results and report
print(f"Results saved to: {output_path}")
# The report path is derived from the output path
import os
from llm_eval.reporting.report_generator import get_reports_dir
# Get the base filename without extension
basename = os.path.basename(output_path)
basename = os.path.splitext(basename)[0]
# Construct the report path
reports_dir = get_reports_dir()
report_path = os.path.join(reports_dir, f"{basename}_report.md")
if os.path.exists(report_path):
    print(f"Report generated at: {report_path}")
else:
    print("Report was not generated. Check if there were any errors during evaluation.")
📊 Reporting and Results
The framework generates:
JSON Results: Detailed results for each task, including individual sample predictions (if applicable), metrics, and configuration details, saved in the results/ directory.
Markdown Reports: A summary report aggregating scores across tasks, generated in the reports/ directory.
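Both outputs are plain files, so they are easy to post-process. As a minimal sketch (assuming the JSON follows the lm-evaluation-harness convention of a top-level "results" mapping from task name to metrics, which this framework builds on), you could load the latest result file like this:
import json
from pathlib import Path
# Minimal sketch: load the most recent JSON result file from the results/ directory.
# Assumes an lm-evaluation-harness-style layout with a top-level "results" mapping;
# the exact schema produced by this framework may differ in detail.
results_dir = Path("results")
latest = max(results_dir.glob("*.json"), key=lambda p: p.stat().st_mtime)
with latest.open() as f:
    data = json.load(f)
for task, metrics in data.get("results", {}).items():
    print(task, metrics)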
📄 How the Evaluation Report Looks
When you run an evaluation using the LLM Evaluation Framework, it generates comprehensive yet easy-to-understand reports in both Markdown and JSON formats. Here’s a broad overview of what you can expect from the Markdown report:
1. 📊 Summary of Metrics
This section offers a concise table summarizing your model’s performance across each task evaluated. Each row clearly indicates:
• Task: The specific benchmark or task evaluated (e.g., leaderboard_bbh_boolean_expressions).
• Metric: The evaluation metric employed (e.g., accuracy, exact match).
• Value: Your model’s performance score on that task.
This summary makes it easy to quickly gauge overall performance across multiple tasks at a glance.
2. 📈 Normalized Scores
To provide clearer insights, the framework calculates normalized scores, presenting a straightforward percentage-based representation of your model’s performance relative to established benchmarks. Each benchmark will show:
• Benchmark: Name of the benchmark.
• Score: Normalized percentage score.
This helps you quickly pinpoint your model’s relative strengths and identify areas needing improvement.
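If you want to reproduce such a normalization yourself, the sketch below shows one common convention: rescaling a raw score between the random-guessing baseline and a perfect score, as the Open LLM Leaderboard does. The framework's exact formula may differ.
def normalize_score(raw: float, random_baseline: float) -> float:
    """Rescale a raw score to 0-100 relative to a random-guessing baseline.
    Illustrative only; the framework's own normalization may differ in detail."""
    if raw <= random_baseline:
        return 0.0
    return 100.0 * (raw - random_baseline) / (1.0 - random_baseline)
# Example: a 4-option multiple-choice task (random baseline 0.25) with raw accuracy 0.62
print(f"{normalize_score(0.62, 0.25):.1f}%")  # 49.3%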
3. 🔍 Task Samples (Detailed Examples)
The detailed samples section gives you valuable qualitative insights into your model’s performance by presenting clear examples directly from evaluated tasks. Each example includes:
• Question: The evaluation sample question posed to your model.
• Ground Truth: The expected correct answer.
• Model Response: Your model’s exact response, explicitly marked as correct or incorrect.
These detailed examples are especially useful for conducting error analysis, allowing you to dive deeper into how your model handles specific questions or scenarios.
⚙️ Customization
Beyond these default outputs, the reporting mechanism in this framework is highly customizable. You can easily extend or modify report generation logic to meet specialized requirements or incorporate additional analysis, enabling deeper and more tailored insights into your model’s performance.
By providing structured and comprehensive reports, this framework empowers you to effectively evaluate, understand, and communicate the strengths and limitations of your large language models.
🔧 Extending the Framework
The modular design makes it easy to add new functionality:
- Adding New Tasks/Benchmarks:
- Define the task configuration in llm_eval/tasks/task_registry.py or a similar configuration file.
- Ensure the task is compatible with the lm-evaluation-harness structure or adapt it.
- Supporting New Model Backends:
- Create a new model handler class in llm_eval/models/ inheriting from a base model class (if applicable).
- Implement the required methods for loading, inference, etc.
- Register the new backend type (a minimal handler skeleton is sketched after this list).
- Customizing Reporting:
- Modify the report generation logic in llm_eval/reporting/ to change the format or content of the Markdown/JSON outputs.
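To make the backend bullet above more concrete, here is a rough skeleton of what a new handler might look like. Every name in it (MyBackendHandler, the method signatures) is a hypothetical placeholder; consult the existing classes in llm_eval/models/ for the framework's real base class and registration mechanism.
# Hypothetical skeleton of a new model backend. All names here are placeholders,
# not the framework's actual API; check llm_eval/models/ for the real interface.
from typing import List

class MyBackendHandler:  # in the framework this would inherit from the base model class
    def __init__(self, model_name: str, device: str = "cuda", **kwargs):
        self.model_name = model_name
        self.device = device
        # Load your model and tokenizer here.

    def generate(self, prompts: List[str], **gen_kwargs) -> List[str]:
        # Run batched inference and return one completion per prompt.
        raise NotImplementedError

    def loglikelihood(self, context: str, continuation: str) -> float:
        # Return log P(continuation | context), as multiple-choice tasks require.
        raise NotImplementedError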
🤝 Contributing
Contributions are welcome! Please follow standard practices:
- Fork the repository.
- Create a new branch for your feature or bug fix (git checkout -b feature/my-new-feature).
- Make your changes and commit them (git commit -am 'Add some feature').
- Push to the branch (git push origin feature/my-new-feature).
- Create a new Pull Request.