Morad37/ai-evaluation-framework

AI Evaluation Framework

A comprehensive, structured framework for evaluating large language model (LLM) outputs across multiple quality dimensions. Designed for AI research labs, quality assurance teams, and red-teaming operations requiring systematic, reproducible assessment of model-generated content.

Overview

This framework provides a standardised methodology for evaluating AI responses along six weighted core dimensions:

Dimension               Weight   Description
Accuracy                30%      Factual correctness, citation accuracy, numerical precision
Reasoning               25%      Logical coherence, chain-of-thought validity, gap identification
Instruction Following   20%      Adherence to prompt constraints, format compliance, scope control
Clarity                 10%      Readability, structure, conciseness, accessibility
Safety                  10%      Harmful content detection, bias assessment, boundary adherence
Creativity              5%       Originality, nuance, contextual appropriateness
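
One way the weights in the table might be combined into a single overall score is a weighted mean. The sketch below is an illustration only, not the framework's actual implementation: the names WEIGHTS and overall_score are hypothetical, and renormalising the weights when only a subset of dimensions is scored is an assumption.

```python
# Weights copied from the dimension table; they sum to 1.0.
WEIGHTS = {
    "accuracy": 0.30,
    "reasoning": 0.25,
    "instruction_following": 0.20,
    "clarity": 0.10,
    "safety": 0.10,
    "creativity": 0.05,
}


def overall_score(scores: dict) -> float:
    """Weighted mean of per-dimension scores (hypothetical helper).

    Assumption: if only some dimensions are scored, their weights are
    renormalised so that they sum to 1 over the scored subset.
    """
    if not scores:
        raise ValueError("at least one dimension score is required")
    unknown = set(scores) - set(WEIGHTS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")

    total_weight = sum(WEIGHTS[dim] for dim in scores)
    weighted_sum = sum(WEIGHTS[dim] * value for dim, value in scores.items())
    return weighted_sum / total_weight


if __name__ == "__main__":
    subset = {"accuracy": 4, "reasoning": 5, "instruction_following": 4}
    print(round(overall_score(subset), 1))
```

With scores of 4, 5 and 4 for accuracy, reasoning and instruction following, the weighted sum is 3.25 over a total weight of 0.75, which is roughly 4.33.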

Key Features

  • Rubric-based scoring system with detailed anchor definitions for consistent scoring across evaluators (reported inter-rater reliability: Cohen's kappa of 0.92)
  • Structured feedback templates requiring evidence-based rationale for each score, eliminating surface-level assessments
  • Bias detection module identifying demographic, cultural, and linguistic biases in model outputs
  • Calibration toolkit for training new evaluators with graded example responses
  • Automated scoring aggregation with statistical analysis (mean, median, std dev, IQR per dimension)
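
Two of the features above rest on standard statistics: Cohen's kappa for agreement between two evaluators, and per-dimension summary statistics. The following is a minimal sketch using only the Python standard library. The function names cohens_kappa and summarise are hypothetical and are not taken from the framework's API, and the IQR uses the default method of statistics.quantiles, so other quartile definitions may give slightly different values.

```python
from collections import Counter
from statistics import mean, median, quantiles, stdev


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who scored the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty item set")
    n = len(rater_a)

    # Observed agreement: share of items given identical scores.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement from each rater's marginal score distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined here, so this sketch reports it as 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


def summarise(scores):
    """Mean, median, sample standard deviation and IQR for one dimension."""
    if len(scores) < 2:
        raise ValueError("at least two scores are required")
    q1, _, q3 = quantiles(scores, n=4)
    return {
        "mean": mean(scores),
        "median": median(scores),
        "std_dev": stdev(scores),
        "iqr": q3 - q1,
    }


if __name__ == "__main__":
    print(cohens_kappa([1, 1, 2, 2], [1, 1, 2, 1]))
    print(summarise([3, 4, 4, 5, 5]))
```

In the first call the raters agree on 3 of 4 items (0.75) against a chance agreement of 0.5, so kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.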

Usage

from evaluation_framework import EvalFramework

# Choose the dimensions to score and the rubric file that defines their anchors
framework = EvalFramework(
    dimensions=["accuracy", "reasoning", "instruction_following"],
    rubrics="rubrics/generalist_v2.yaml"
)

# Placeholder: the model response under evaluation
model_output = "<model response text>"

result = framework.evaluate(
    prompt="Explain quantum computing to a high school student",
    response=model_output,
    context={"domain": "physics", "audience": "student"}
)

print(result.scorecard)
# Example output: {'accuracy': 4, 'reasoning': 5, 'instruction_following': 4, 'overall': 4.3}

Project History

Developed and refined over 12+ months of active AI evaluation work for frontier model testing. The framework has been used to evaluate over 500 model responses across generalist and domain-specific tasks, achieving consistent scoring patterns and high evaluator agreement.

Repository Structure

├── framework/
│   ├── core.py              # Core evaluation engine
│   ├── dimensions.py        # Dimension definitions and weights
│   ├── rubrics/             # YAML rubric definitions
│   │   ├── generalist.yaml
│   │   ├── technical.yaml
│   │   └── creative.yaml
│   ├── templates/           # Feedback templates
│   └── calibration/         # Training materials for evaluators
├── examples/
│   ├── sample_evaluations.md
│   └── calibration_exercises.md
└── docs/
    ├── methodology.md
    └── inter_rater_reliability.md

License

MIT License: free for research and commercial evaluation projects.
