A comprehensive, structured framework for evaluating large language model (LLM) outputs across multiple quality dimensions. Designed for AI research labs, quality assurance teams, and red-teaming operations requiring systematic, reproducible assessment of model-generated content.
This framework provides a standardised methodology for evaluating AI responses along six core dimensions:
| Dimension | Weight | Description |
|---|---|---|
| Accuracy | 30% | Factual correctness, citation accuracy, numerical precision |
| Reasoning | 25% | Logical coherence, chain-of-thought validity, gap identification |
| Instruction Following | 20% | Adherence to prompt constraints, format compliance, scope control |
| Clarity | 10% | Readability, structure, conciseness, accessibility |
| Safety | 10% | Harmful content detection, bias assessment, boundary adherence |
| Creativity | 5% | Originality, nuance, contextual appropriateness |
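The weights in the table sum to 100%. How per-dimension scores are combined into an overall score is not spelled out here; one plausible approach is a weighted mean that renormalizes the weights when only a subset of dimensions is scored. The sketch below assumes that approach. `DIMENSION_WEIGHTS` and `overall_score` are hypothetical names used for illustration, not part of the `evaluation_framework` API.

```python
# Hypothetical helper: not part of the evaluation_framework API.
# Weights (in percent) are copied from the dimension table above.
DIMENSION_WEIGHTS = {
    "accuracy": 30,
    "reasoning": 25,
    "instruction_following": 20,
    "clarity": 10,
    "safety": 10,
    "creativity": 5,
}


def overall_score(scores):
    """Return the weighted mean of per-dimension scores.

    Assumption: weights are renormalized over the dimensions actually
    scored, so a partial scorecard stays on the same scale as its inputs.
    """
    if not scores:
        raise ValueError("at least one dimension score is required")
    unknown = set(scores) - set(DIMENSION_WEIGHTS)
    if unknown:
        raise KeyError(f"unknown dimensions: {sorted(unknown)}")
    total_weight = sum(DIMENSION_WEIGHTS[d] for d in scores)
    weighted_sum = sum(DIMENSION_WEIGHTS[d] * s for d, s in scores.items())
    return weighted_sum / total_weight


# Three of the six dimensions scored on the same scale
partial = {"accuracy": 4, "reasoning": 5, "instruction_following": 4}
print(round(overall_score(partial), 1))  # prints 4.3
```

Renormalizing keeps a three-dimension scorecard comparable to a six-dimension one. A real implementation might instead require all dimensions or treat missing ones differently.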
- Rubric-based scoring system with detailed anchor definitions for consistent scoring across evaluators (inter-rater reliability: Cohen's kappa of 0.92)
- Structured feedback templates requiring evidence-based rationale for each score, eliminating surface-level assessments
- Bias detection module identifying demographic, cultural, and linguistic biases in model outputs
- Calibration toolkit for training new evaluators with graded example responses
- Automated scoring aggregation with statistical analysis (mean, median, std dev, IQR per dimension)
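The per-dimension statistics named in the last item (mean, median, standard deviation, IQR) can be sketched with Python's standard `statistics` module. This is a minimal illustration, not the framework's own aggregation code: `summarize` is a hypothetical helper, the scores are made-up sample data, and the sample standard deviation and the default "exclusive" quartile method are assumed choices.

```python
import statistics
from typing import Dict, List


def summarize(scores_by_dimension: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Compute mean, median, sample std dev, and IQR for each dimension."""
    summary = {}
    for dimension, scores in scores_by_dimension.items():
        if len(scores) < 2:
            # stdev and quartiles are not meaningful for a single score
            raise ValueError(f"{dimension}: need at least two scores")
        # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3
        q1, _, q3 = statistics.quantiles(scores, n=4)
        summary[dimension] = {
            "mean": statistics.mean(scores),
            "median": statistics.median(scores),
            "std_dev": statistics.stdev(scores),
            "iqr": q3 - q1,
        }
    return summary


# Made-up scores from five evaluations, for illustration only
collected = {
    "accuracy": [4, 5, 3, 4, 4],
    "reasoning": [5, 4, 4, 5, 3],
}
for dimension, stats in summarize(collected).items():
    print(dimension, stats)
```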
from evaluation_framework import EvalFramework, EvalDimension

# Configure the framework with the dimensions to score and a rubric file
framework = EvalFramework(
    dimensions=["accuracy", "reasoning", "instruction_following"],
    rubrics="rubrics/generalist_v2.yaml"
)

# model_output is the model response text under evaluation (obtained elsewhere)
result = framework.evaluate(
    prompt="Explain quantum computing to a high school student",
    response=model_output,
    context={"domain": "physics", "audience": "student"}
)

print(result.scorecard)
# Output: {'accuracy': 4, 'reasoning': 5, 'instruction_following': 4, 'overall': 4.3}

Developed and refined over more than 12 months of active AI evaluation work for frontier model testing. The framework has been used to evaluate over 500 model responses across generalist and domain-specific tasks, achieving consistent scoring patterns and high evaluator agreement.
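Evaluator agreement of the kind reported here is commonly measured with Cohen's kappa, which corrects the raw agreement rate between two raters for the agreement expected by chance. The following is a minimal sketch of the unweighted statistic for two raters; `cohens_kappa` is a hypothetical helper and the ratings are made-up, so it does not reproduce the framework's reported figure or its exact procedure.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Degenerate case: both raters used one identical label throughout.
        # Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up rubric scores from two evaluators on six responses
evaluator_1 = [4, 5, 3, 4, 2, 5]
evaluator_2 = [4, 5, 3, 3, 2, 5]
print(round(cohens_kappa(evaluator_1, evaluator_2), 2))
```

For ordinal rubric scores, a weighted kappa that penalizes near-misses less than large disagreements may be a better fit; the unweighted form is shown for simplicity.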
├── framework/
│   ├── core.py              # Core evaluation engine
│   ├── dimensions.py        # Dimension definitions and weights
│   ├── rubrics/             # YAML rubric definitions
│   │   ├── generalist.yaml
│   │   ├── technical.yaml
│   │   └── creative.yaml
│   ├── templates/           # Feedback templates
│   └── calibration/         # Training materials for evaluators
├── examples/
│   ├── sample_evaluations.md
│   └── calibration_exercises.md
└── docs/
    ├── methodology.md
    └── inter_rater_reliability.md
MIT — Free for research and commercial evaluation projects.