Streamlit applications for exploring EVA results.
Interactive dashboard for visualizing and comparing results. Launch it with:

```bash
streamlit run apps/analysis.py
```

By default, the app looks for runs in the `output/` directory. You can change this in the sidebar or by setting the `EVA_OUTPUT_DIR` environment variable:
```bash
EVA_OUTPUT_DIR=path/to/results streamlit run apps/analysis.py
```

The app provides three views:

**Cross-Run Comparison** — Compare aggregate metrics across multiple runs. Filter by model, provider, and pipeline type. Includes an EVA scatter plot (accuracy vs. experience) and per-metric bar charts.
**Run Overview** — Drill into a single run: per-category metric breakdowns, score distributions, and a full records table with per-metric scores.

**Record Detail** — Deep-dive into individual conversation records:
- Audio playback (mixed recording)
- Transcript with color-coded speaker turns
- Metric scores with explanations
- Conversation trace: tool calls, LLM calls, and audit log entries with a timeline view
- Database state diff (expected vs. actual)
- User goal, persona, and ground truth from the evaluation record
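The database state diff above compares an expected snapshot against the actual one. A minimal sketch of that kind of comparison, assuming snapshots are plain dicts (`diff_db_state` is an illustrative name, not the app's actual API):

```python
def diff_db_state(expected: dict, actual: dict) -> dict:
    """Return entries that were added, removed, or changed between two snapshots."""
    added = {k: actual[k] for k in actual.keys() - expected.keys()}
    removed = {k: expected[k] for k in expected.keys() - actual.keys()}
    changed = {
        k: {"expected": expected[k], "actual": actual[k]}
        for k in expected.keys() & actual.keys()
        if expected[k] != actual[k]
    }
    return {"added": added, "removed": removed, "changed": changed}
```

An empty diff (all three sub-dicts empty) would mean the conversation left the database exactly in its expected state.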
Sidebar controls:

- **Output Directory** — Path to the directory containing run folders
- **View** — Switch between the three views above
- **Run Selection** — Pick a run (with metadata summary)
- **Record Selection** — Pick a record within the selected run
- **Trial Selection** — If a record has multiple trials, pick one
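The run and record pickers imply a nested folder layout. One way that discovery might look, as a sketch assuming an `<output>/<run>/<record>` directory structure (which may not match EVA's actual on-disk format; both function names are hypothetical):

```python
from pathlib import Path

def list_runs(output_dir: Path) -> list[str]:
    """List run folder names, sorted newest-first by name (one folder per run assumed)."""
    return sorted((p.name for p in output_dir.iterdir() if p.is_dir()), reverse=True)

def list_records(output_dir: Path, run: str) -> list[str]:
    """List record folders inside a run (assumed layout: <output>/<run>/<record>)."""
    return sorted(p.name for p in (output_dir / run).iterdir() if p.is_dir())
```

The sidebar would populate its selectboxes from these lists, narrowing from run to record (and, where present, to trial).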


