Add LLM Evaluation Feature #90

Merged
siawchen merged 10 commits into opea-project:main from wanhakim:wwanarif/eval
May 15, 2026
@wanhakim
Collaborator

Introduces a full evaluation suite for testing and benchmarking deployed GenAI workflows within the Studio.

What Is New

studio-eval microservice (new FastAPI service)

  • Manage evaluation datasets: create manually via JSON or JSONL upload, or synthesize from documents (PDF, DOCX) using DeepEval and Ollama.
  • Run evaluations against active sandboxes with configurable metrics. Answer Relevancy and Faithfulness are currently supported, and further metrics can be added through DeepEval's framework.
  • Support background job execution with real-time progress tracking (polling-based).
  • Persist runs, datasets, entries, and per-entry metric results (SQLite-backed).
  • Expose CRUD APIs for runs (/eval/runs) and datasets (/eval/datasets).
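
The background execution and polling-based progress tracking described above can be illustrated with a minimal, standard-library-only sketch. It is not the studio-eval implementation: `RunState`, `RUNS`, `start_run`, and `poll_until_done` are hypothetical names, state is kept in memory rather than in SQLite, and the per-entry evaluation is passed in as a plain callable.

```python
import threading
import time
import uuid
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RunState:
    """In-memory progress record for one evaluation run (hypothetical shape)."""
    total: int
    completed: int = 0
    status: str = "pending"  # pending -> running -> completed | failed
    error: Optional[str] = None
    lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def snapshot(self) -> dict:
        """Return a consistent copy of the state, as a polling endpoint might."""
        with self.lock:
            progress = self.completed / self.total if self.total else 1.0
            return {
                "status": self.status,
                "completed": self.completed,
                "total": self.total,
                "progress": progress,
                "error": self.error,
            }


# Assumption: a process-local registry; a real service would persist this.
RUNS: dict = {}


def start_run(entries: list, evaluate_entry: Callable) -> tuple:
    """Start evaluating entries on a worker thread; return (run_id, thread)."""
    run_id = uuid.uuid4().hex
    state = RunState(total=len(entries))
    RUNS[run_id] = state

    def worker() -> None:
        with state.lock:
            state.status = "running"
        try:
            for entry in entries:
                evaluate_entry(entry)  # e.g. score one dataset entry
                with state.lock:
                    state.completed += 1
            with state.lock:
                state.status = "completed"
        except Exception as exc:  # record the failure instead of losing it
            with state.lock:
                state.status = "failed"
                state.error = str(exc)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return run_id, thread


def poll_until_done(run_id: str, interval: float = 0.05, timeout: float = 5.0) -> dict:
    """Client-side polling loop: re-read the snapshot until a terminal status."""
    deadline = time.monotonic() + timeout
    while True:
        snap = RUNS[run_id].snapshot()
        if snap["status"] in ("completed", "failed"):
            return snap
        if time.monotonic() > deadline:
            raise TimeoutError(f"run {run_id} did not finish within {timeout}s")
        time.sleep(interval)
```

Polling keeps the client simple (one repeatable read request) at the cost of some latency between a state change and the moment the UI sees it; the interval above is an arbitrary illustrative value.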

Frontend UI updates (studio-frontend)

  • Add a new Evaluation section with Execution and Dataset tabs.
  • Dataset tab supports upload, manual entry creation, synthesis configuration, and synthesis progress.
  • Execution tab supports run creation (sandbox and judge model selection), progress tracking, and metric summaries.
  • RunDetailsModal shows per-entry scores, reason drill-down, and configuration snapshot.
  • Add ModelSelect UI to browse and pull Ollama judge models directly from the UI.
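
As a rough illustration of how the metric summaries in the Execution tab could relate to the per-entry scores in RunDetailsModal, the sketch below aggregates per-entry results into a per-metric mean and pass rate. The record shape (`entry_id`, `metric`, `score`), the `summarize` helper, and the 0.5 pass threshold are assumptions made for this example, not details taken from the PR.

```python
from collections import defaultdict
from statistics import mean


def summarize(results: list, threshold: float = 0.5) -> dict:
    """Group per-entry scores by metric and compute simple aggregates.

    `results` is assumed to be a list of dicts with "metric" and "score" keys.
    `threshold` is an assumed pass mark, not a value confirmed by the PR.
    """
    scores_by_metric = defaultdict(list)
    for row in results:
        scores_by_metric[row["metric"]].append(row["score"])

    summary = {}
    for metric, scores in scores_by_metric.items():
        passed = sum(1 for s in scores if s >= threshold)
        summary[metric] = {
            "count": len(scores),
            "mean": mean(scores),
            "pass_rate": passed / len(scores),
        }
    return summary


# Example with made-up scores for two entries.
example = [
    {"entry_id": 1, "metric": "answer_relevancy", "score": 0.9},
    {"entry_id": 2, "metric": "answer_relevancy", "score": 0.4},
    {"entry_id": 1, "metric": "faithfulness", "score": 1.0},
]
run_summary = summarize(example)
```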

studio-backend proxy layer

  • Add evaluation_router.py to proxy /evaluation/{path} requests to studio-eval.
  • Extend sandbox APIs to support sandbox discovery for evaluation workflows.
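
The path mapping behind a proxy like this can be sketched as a small pure function. This is an illustration only: the `STUDIO_EVAL_URL` variable, its default value, and the `upstream_url` helper are hypothetical, and the sketch assumes the part after `/evaluation/` is forwarded to studio-eval unchanged, which the PR description does not spell out.

```python
import os
from typing import Optional
from urllib.parse import quote, urlencode

PROXY_PREFIX = "/evaluation/"

# Assumption: the upstream base URL comes from an environment variable.
STUDIO_EVAL_URL = os.environ.get("STUDIO_EVAL_URL", "http://studio-eval:8000")


def upstream_url(request_path: str, query: Optional[dict] = None,
                 base: str = STUDIO_EVAL_URL) -> str:
    """Map an incoming /evaluation/{path} request to a studio-eval URL."""
    if not request_path.startswith(PROXY_PREFIX):
        raise ValueError(f"not a proxied path: {request_path!r}")

    tail = request_path[len(PROXY_PREFIX):]

    # Refuse dot segments so the proxy cannot be steered outside the tail path.
    if any(segment in (".", "..") for segment in tail.split("/")):
        raise ValueError(f"path traversal rejected: {request_path!r}")

    # quote() keeps "/" intact by default and escapes unsafe characters.
    url = f"{base.rstrip('/')}/{quote(tail)}"
    if query:
        url += "?" + urlencode(query)
    return url
```

Under these assumptions, a request to `/evaluation/eval/runs` would be forwarded to the `/eval/runs` endpoint of studio-eval. An actual router would also forward the HTTP method, headers, and body.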

Infrastructure Changes

  • Development setup: update the local compose and nginx config templates, and add .env.example entries for the evaluation stack.
  • Kubernetes setup: update manifests and playbooks to deploy studio-eval and Ollama in cluster environments.
  • CI pipeline: update image build and push workflow to include studio-eval artifacts.

Bug Fixes and Improvements

  • Fix deprecated TGI API usage that broke existing workflows.
  • Improve the UX of the tracer view under Observability in the workflow section.

wanhakim and others added 7 commits April 13, 2026 01:50
Signed-off-by: wwanarif <wan.abdul.hakim.b.wan.arif@intel.com>
Signed-off-by: wwanarif <wan.abdul.hakim.b.wan.arif@intel.com>
Signed-off-by: wwanarif <wan.abdul.hakim.b.wan.arif@intel.com>
Signed-off-by: wwanarif <wan.abdul.hakim.b.wan.arif@intel.com>
Signed-off-by: wwanarif <wan.abdul.hakim.b.wan.arif@intel.com>
Co-authored-by: Copilot <copilot@github.com>
Signed-off-by: wwanarif <wan.abdul.hakim.b.wan.arif@intel.com>
Co-authored-by: Copilot <copilot@github.com>
Signed-off-by: wwanarif <wan.abdul.hakim.b.wan.arif@intel.com>
Copilot AI review requested due to automatic review settings May 12, 2026 06:41
Contributor

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

@wanhakim wanhakim requested a review from siawchen May 12, 2026 07:17
wanhakim and others added 3 commits May 13, 2026 08:08
Co-authored-by: Copilot <copilot@github.com>
Signed-off-by: wwanarif <wan.abdul.hakim.b.wan.arif@intel.com>
Co-authored-by: Copilot <copilot@github.com>
Signed-off-by: wwanarif <wan.abdul.hakim.b.wan.arif@intel.com>
Signed-off-by: wwanarif <wan.abdul.hakim.b.wan.arif@intel.com>
Collaborator

@siawchen siawchen left a comment

lgtm

@siawchen siawchen merged commit c77d967 into opea-project:main May 15, 2026
4 of 5 checks passed