
You can filter between these types using the **Hosted** and **Local** tabs at the top of the datasets list.

## What is a Hosted Dataset?

A Hosted Dataset is a collection of test cases stored on the Logfire server. Each row in a hosted dataset is one **case**, with inputs, an expected output, and optional metadata. You can also define a **schema** for the whole hosted dataset that constrains each case, ensuring every case has the correct structure.

```
+-------------------------------------------------------------------+
| Hosted Dataset |
| |
| +--------------------------------+ +-----------------------+ |
| | Case #1 | | Schema (Optional) | |
| | Input | | Input | |
| | Expected Output | | Expected Output | |
| | Metadata | | Metadata | |
| +--------------------------------+ | | |
| +--------------------------------+ | | |
| | Case #2 | | | |
| +--------------------------------+ | | |
| +--------------------------------+ | | |
| | Case #3 | | | |
| +--------------------------------+ +-----------------------+ |
+-------------------------------------------------------------------+
```
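The schema's role can be sketched in plain Python. This is illustrative only: the field names and the `conforms` helper are assumptions for the sketch, not the real Logfire data model.

```python
# Minimal sketch of how a dataset-level schema constrains each case.
# Field names and types here are illustrative assumptions, not the real model.
REQUIRED_FIELDS = {"inputs": str, "expected_output": str}

def conforms(case: dict) -> bool:
    """A case conforms if every required field is present with the right type."""
    return all(
        name in case and isinstance(case[name], typ)
        for name, typ in REQUIRED_FIELDS.items()
    )

good = {"inputs": "What is 2 + 2?", "expected_output": "4", "metadata": {}}
bad = {"inputs": "What is 2 + 2?"}  # missing expected_output

print(conforms(good))  # True
print(conforms(bad))   # False
```

With a schema in place, every case in the hosted dataset is guaranteed to have the same shape, which keeps experiments comparable across cases.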

Hosted datasets integrate into the broader [pydantic-evals](https://ai.pydantic.dev/evals/) data model:

```
Hosted Dataset (1) ─────────── (Many) Case
│ │
│ │
└── (Many) Experiment ──── (Many) Case results
├── (1) Task
└── (Many) Evaluator
```

A single hosted dataset contains many cases. Over time, you run multiple experiments against the same hosted dataset — each experiment executes every case against a task and scores the results with evaluators.
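That relationship can be sketched as a toy loop in plain Python. This is not the pydantic-evals API; `run_experiment`, `exact_match`, and the toy task are assumed names for illustration only.

```python
def exact_match(output, expected):
    """Toy evaluator: score 1.0 when the task output equals the expected output."""
    return 1.0 if output == expected else 0.0

def run_experiment(cases, task, evaluators):
    """Run every case through the task, then score each result with every evaluator."""
    results = []
    for case in cases:
        output = task(case["inputs"])
        scores = {ev.__name__: ev(output, case["expected_output"]) for ev in evaluators}
        results.append({"inputs": case["inputs"], "output": output, "scores": scores})
    return results

cases = [
    {"inputs": "2 + 2", "expected_output": "4"},
    {"inputs": "3 + 3", "expected_output": "6"},
]
task = lambda expr: str(eval(expr))  # toy task standing in for an AI system
results = run_experiment(cases, task, [exact_match])
print([r["scores"]["exact_match"] for r in results])  # [1.0, 1.0]
```

Each call to `run_experiment` corresponds to one experiment: the cases stay fixed while the task or evaluators change between runs, so scores are comparable over time.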

## How Cases Get Into a Hosted Dataset

There are several ways to populate a hosted dataset with cases:

- **From Live View**: Find an interesting trace or span in production and save it as a single case. You pick an existing hosted dataset or create a new one, review the extracted inputs and outputs, then add it. This is the easiest way to turn real-world usage into test cases. See [Adding Cases from Traces](ui.md#adding-cases-from-traces) for a walkthrough.
- **Manually in the UI**: Add cases one by one through the dataset's Cases tab. Useful when you want to hand-craft specific edge cases. See [Managing Cases](ui.md#managing-cases) for details.
- **Via the SDK**: Create cases programmatically with Python — either by pushing a full local `pydantic-evals` dataset or by adding individual cases. See the [SDK Guide](sdk.md) for details.

Adding from Live View usually creates one new case per span; importing via the SDK can be done in bulk.
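To illustrate the bulk path, here is a minimal plain-Python sketch of assembling many cases at once before an import. The `CaseDraft` class and the commented-out `client.add_cases` call are hypothetical names for this sketch, not the real SDK; see the SDK Guide for the actual API.

```python
from dataclasses import dataclass, field

@dataclass
class CaseDraft:
    """Hypothetical stand-in for a case awaiting import (not the real SDK type)."""
    inputs: str
    expected_output: str
    metadata: dict = field(default_factory=dict)

def build_cases(pairs):
    """Turn (input, expected) pairs into case drafts for a bulk import."""
    return [CaseDraft(inputs=i, expected_output=e) for i, e in pairs]

drafts = build_cases([
    ("What is 2 + 2?", "4"),
    ("Capital of Japan?", "Tokyo"),
])
# A hypothetical client call would then push these in one request, e.g.:
# client.add_cases("my-hosted-dataset", drafts)  # illustrative name only
```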

## Why Datasets?

When evaluating AI systems, you need test cases that reflect real-world usage. Datasets solve several problems: