diff --git a/use-cases/falcon-ai-context-aware-debugging.ipynb b/use-cases/falcon-ai-context-aware-debugging.ipynb new file mode 100644 index 0000000..3b2d45a --- /dev/null +++ b/use-cases/falcon-ai-context-aware-debugging.ipynb @@ -0,0 +1,335 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Context-Aware Trace Debugging with Falcon AI: From Error to Fix in Minutes\n", + "\n", + "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/future-agi/cookbooks/blob/cookbook/quickstart-notebooks/use-cases/falcon-ai-context-aware-debugging.ipynb)\n", + "[![View on GitHub](https://img.shields.io/badge/View_on_GitHub-181717?logo=github&logoColor=white)](https://github.com/future-agi/cookbooks/blob/cookbook/quickstart-notebooks/use-cases/falcon-ai-context-aware-debugging.ipynb)\n", + "\n", + "| Time | Difficulty |\n", + "|------|------------|\n", + "| 15 min | Beginner |\n", + "\n", + "You launched a research assistant for your team last week. It searches your internal paper database and synthesizes summaries with citations. This morning a colleague pings you: \"this paper you cited doesn't exist.\" You open the dashboard, find the trace, and there it is. A confidently-formatted citation that the model invented because the search tool returned no results for the topic.\n", + "\n", + "You do not need a full regression suite right now. You do not need to build a dataset or run a sweep. You need to know what broke in this one trace and what to change so it stops happening. Fast.\n", + "\n", + "The slow version of this is familiar: open the trace, expand the span tree, scroll through 6 nested spans, copy the system prompt out, copy the model output out, diff them in your head, write the fix in a notebook, run it, look at the new trace, repeat if it didn't work. Easily an hour for one trace, longer if you context-switch.\n", + "\n", + "This is the failure mode that got a New York lawyer sanctioned in 2023 for citing six cases that ChatGPT had completely fabricated. The pattern is the same: a search returns nothing, the model fills the gap with plausible-looking output, and a downstream user trusts it.\n", + "\n", + "This notebook walks the fast version, powered by FutureAGI's **Tracing** + **Falcon AI** stack working together. You open Falcon AI directly on the failing trace, and the chat input shows the trace as a context chip automatically. No copy-pasting trace IDs, no re-establishing \"which trace are we talking about\" between turns. You ask one open question, drill into the span with `/analyze-trace-errors`, get a verbatim prompt fix from `/fix-with-falcon`, paste it into your code, re-run the same query, watch the agent refuse instead of fabricate. End to end in under 15 minutes.\n", + "\n", + "**Prerequisites:**\n", + "- FutureAGI account: [app.futureagi.com](https://app.futureagi.com)\n", + "- API keys: `FI_API_KEY` and `FI_SECRET_KEY` (see [Get your API keys](https://docs.futureagi.com/docs/admin-settings))\n", + "- OpenAI API key (`OPENAI_API_KEY`)\n", + "- Python 3.10+" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Install\n", + "\n", + "The simplest way to run this notebook is in **Google Colab** (click the badge at the top). Colab has Python 3.11 and the `%pip install` cell below works out of the box.\n", + "\n", + "If you're running locally, you need Python 3.10+ (`fi-instrumentation-otel` won't import on 3.9)." 
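, + "\n", + "\n", + "A quick way to confirm the interpreter version before you install (works in any cell):\n", + "\n", + "```python\n", + "import sys\n", + "assert sys.version_info >= (3, 10), \"fi-instrumentation-otel needs Python 3.10+\"\n", + "```"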
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install fi-instrumentation-otel traceai-openai openai" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "os.environ[\"FI_API_KEY\"] = \"your-fi-api-key\"\n", + "os.environ[\"FI_SECRET_KEY\"] = \"your-fi-secret-key\"\n", + "os.environ[\"OPENAI_API_KEY\"] = \"your-openai-key\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 1: Build a research assistant with a small knowledge base\n", + "\n", + "The agent has one tool, `search_papers`, that returns hits from a tiny three-paper mock database. The system prompt is intentionally permissive: it tells the model to answer with citations, but it does not say what to do when the search comes back empty. That gap is where the failure lives." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "from openai import OpenAI\n", + "\n", + "client = OpenAI()\n", + "\n", + "SYSTEM_PROMPT = \"\"\"You are a research assistant for an ML research team.\n", + "Answer questions using the search_papers tool. Provide citations to support your claims.\"\"\"\n", + "\n", + "TOOLS = [\n", + " {\n", + " \"type\": \"function\",\n", + " \"function\": {\n", + " \"name\": \"search_papers\",\n", + " \"description\": \"Search the team's internal database of ML papers\",\n", + " \"parameters\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"query\": {\"type\": \"string\", \"description\": \"Topic or keyword to search for\"},\n", + " },\n", + " \"required\": [\"query\"],\n", + " },\n", + " },\n", + " },\n", + "]\n", + "\n", + "\n", + "def search_papers(query: str) -> dict:\n", + " db = {\n", + " \"transformer\": [\n", + " {\"title\": \"Attention Is All You Need\", \"authors\": \"Vaswani et al.\", \"year\": 2017, \"venue\": \"NeurIPS\"},\n", + " ],\n", + " \"diffusion\": [\n", + " {\"title\": \"Denoising Diffusion Probabilistic Models\", \"authors\": \"Ho et al.\", \"year\": 2020, \"venue\": \"NeurIPS\"},\n", + " ],\n", + " \"rlhf\": [\n", + " {\"title\": \"Training language models to follow instructions with human feedback\", \"authors\": \"Ouyang et al.\", \"year\": 2022, \"venue\": \"NeurIPS\"},\n", + " ],\n", + " }\n", + " q = query.lower()\n", + " for keyword, papers in db.items():\n", + " if keyword in q:\n", + " return {\"results\": papers, \"total\": len(papers)}\n", + " return {\"results\": [], \"total\": 0}\n", + "\n", + "\n", + "TOOL_MAP = {\"search_papers\": search_papers}\n", + "\n", + "\n", + "def handle_message(messages: list) -> str:\n", + " response = client.chat.completions.create(\n", + " model=\"gpt-4o-mini\",\n", + " messages=[{\"role\": \"system\", \"content\": SYSTEM_PROMPT}] + messages,\n", + " tools=TOOLS,\n", + " )\n", + " msg = response.choices[0].message\n", + "\n", + " if msg.tool_calls:\n", + " tool_messages = [msg]\n", + " for tc in msg.tool_calls:\n", + " result = TOOL_MAP[tc.function.name](**json.loads(tc.function.arguments))\n", + " tool_messages.append({\n", + " \"role\": \"tool\",\n", + " \"tool_call_id\": tc.id,\n", + " \"content\": json.dumps(result),\n", + " })\n", + " followup = client.chat.completions.create(\n", + " model=\"gpt-4o-mini\",\n", + " messages=[{\"role\": \"system\", \"content\": SYSTEM_PROMPT}] + messages + tool_messages,\n", + " tools=TOOLS,\n", + " )\n", + " return 
followup.choices[0].message.content\n", + "\n", + " return msg.content" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Three topics covered (transformers, diffusion, RLHF). Anything else hits the empty-result branch." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Add tracing so Falcon AI can read the spans" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from fi_instrumentation import register, FITracer, using_user, using_session\n", + "from fi_instrumentation.fi_types import ProjectType\n", + "from traceai_openai import OpenAIInstrumentor\n", + "\n", + "trace_provider = register(\n", + " project_type=ProjectType.OBSERVE,\n", + " project_name=\"research-assistant-debug\",\n", + ")\n", + "OpenAIInstrumentor().instrument(tracer_provider=trace_provider)\n", + "tracer = FITracer(trace_provider.get_tracer(\"research-assistant-debug\"))\n", + "\n", + "\n", + "@tracer.agent(name=\"research_assistant\")\n", + "def traced_handle(user_id: str, session_id: str, messages: list) -> str:\n", + " with using_user(user_id), using_session(session_id):\n", + " return handle_message(messages)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`@tracer.agent` makes the entire request show up as one parent span with the OpenAI calls and tool calls nested underneath. That nesting is what `/fix-with-falcon` needs in order to read the verbatim system prompt and model output later." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3: Trigger the failing trace\n", + "\n", + "Two queries: one inside the knowledge base and one outside. The second one is the trace you'll debug." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# In-database query: should work cleanly\n", + "traced_handle(\n", + " user_id=\"alice\",\n", + " session_id=\"session-good\",\n", + " messages=[{\"role\": \"user\", \"content\": \"What's the seminal paper on transformers?\"}],\n", + ")\n", + "\n", + "# Outside-database query: should expose the failure\n", + "answer = traced_handle(\n", + " user_id=\"alice\",\n", + " session_id=\"session-bad\",\n", + " messages=[{\"role\": \"user\", \"content\": \"What are the key papers on contrastive learning for self-supervised vision?\"}],\n", + ")\n", + "print(answer)\n", + "\n", + "trace_provider.force_flush()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": "Look closely at what the model returned for the second query. The agent likely acknowledged the empty database (\"there are currently no papers available... However, I can provide you with general insights\") and then named three specific papers (SimCLR, MoCo, BYOL) with one-line descriptions, framed as \"key papers often referenced.\" The hedge phrasing makes the response sound careful, but the names and descriptions are not grounded in any tool result. The system prompt told the model to provide citations and never told it what to do when the tool returned nothing, so the model filled the gap from its training data and dressed it up as helpfulness.\n\nOpen **Tracing** in the dashboard, select `research-assistant-debug`, and click into the second trace. The span tree shows the empty tool result and the fabricated content side by side. That contradiction is the bug." 
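 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": "If you want to confirm from inside the notebook that the invented citations had nothing to ground on, call the tool directly with the failing topic. The mock database only knows three keywords, so this query falls through to the empty-result branch:\n\n```python\n# The mock DB has no entry for this topic, so any papers named in the\n# answer cannot have come from a tool result.\nprint(search_papers(\"contrastive learning for self-supervised vision\"))\n# -> {'results': [], 'total': 0}\n```"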
+ }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 4: Debug the trace conversationally using page context\n", + "\n", + "**This step is done in the dashboard, not the notebook.**\n", + "\n", + "Open the failing trace in the **Tracing** Feed. Click into it so the trace detail page is the active view. Now press `Cmd+K` (Mac) or `Ctrl+K` (Windows) to open the Falcon AI sidebar.\n", + "\n", + "Look at the chat input. There is a **context chip** above the message box showing the current trace ID (something like `trace 7ab8c\u2026`). You did not type that. Falcon AI saw what page you were on and attached it to the conversation. Every question you ask in this chat will be answered against that specific trace until you remove or replace the chip.\n", + "\n", + "> **Tip.** Without page context, you would have to start every question with \"Look at trace 7ab8c\u2026 in project research-assistant-debug, \u2026\" and re-paste the ID for each follow-up. With it, you ask the question and Falcon AI already knows what you mean. This is the difference between debugging by chat and debugging by chat that knows what you are looking at.\n", + "\n", + "You will run three turns in the same chat. Each turn builds on the previous one, and the trace context carries through automatically.\n", + "\n", + "**Turn 1 (\u22480:30 in): the open question.** Start by asking what went wrong, the way you would ask a teammate looking over your shoulder.\n", + "\n", + "> What went wrong with this trace?\n", + "\n", + "Falcon AI reads the trace summary and gives an exploratory diagnosis: the empty-result tool call, the fallback to general knowledge, and a list of likely root causes (data, retrieval, indexing).\n", + "\n", + "![Falcon AI sidebar opened on the failing trace, with the trace context chip in the chat input and an exploratory diagnosis of the empty search result](https://fi-cookbook-assets.s3.ap-south-1.amazonaws.com/use-cases/falcon-ai-context-aware-debugging/turn-1-open-question.png)\n", + "\n", + "Notice the angle. The first-turn response treats \"what went wrong\" as a question about the system as a whole and leads with the data and retrieval layer. That is a reasonable default; in production, an empty result on a famous topic is more often a retrieval bug than an agent behavior bug. For our case, the database is intentionally tiny (three papers), so the retrieval is working correctly and the real failure is the agent's response to an empty result. The next turn narrows to the agent.\n", + "\n", + "**Turn 2 (\u22482:00 in): drill into the span with `/analyze-trace-errors`.** In the same chat, type:\n", + "\n", + "> /analyze-trace-errors\n", + "\n", + "Falcon AI runs the analyze-trace-errors skill against the trace already in context. It calls `explore_trace_legacy` and `read_trace_span(exact=True)` on the LLM span, submits structured findings, and writes a quality scorecard.\n", + "\n", + "Two things to notice. First, the skill captures both layers of the failure (retrieval returned nothing, then the agent hallucinated) instead of picking one. 
Second, the recommended prompt fix already previews the next turn's diff but as advice rather than a copy-pasteable change: that is the difference between `/analyze-trace-errors` (diagnosis with suggestions) and `/fix-with-falcon` (one concrete prompt change you can paste).\n", + "\n", + "![Falcon AI showing the structured /analyze-trace-errors output with category findings, severity, and a quality scorecard for the same trace](https://fi-cookbook-assets.s3.ap-south-1.amazonaws.com/use-cases/falcon-ai-context-aware-debugging/turn-2-analyze-trace-errors.png)\n", + "\n", + "**Turn 3 (\u22484:00 in): get the prompt diff with `/fix-with-falcon`.** Type:\n", + "\n", + "> /fix-with-falcon\n", + "\n", + "Falcon AI runs the fix-with-falcon skill, reads the verbatim system prompt and model output one more time (it does not trust your description of the failure, only the spans), and returns the fix in a fixed format.\n", + "\n", + "![Falcon AI fix-with-falcon output for the same trace showing What happened, Root cause in the agent, and a verbatim Current vs Replace with prompt diff](https://fi-cookbook-assets.s3.ap-south-1.amazonaws.com/use-cases/falcon-ai-context-aware-debugging/turn-3-fix-with-falcon.png)\n", + "\n", + "Two things worth noticing about this output. First, Falcon AI quoted the system prompt **verbatim** from the LLM span, not from a guess: the OpenAI auto-instrumentor captured the system message in this run, so the diff is grounded in actual span content. Second, the response references the scorecard from Turn 2: the chat remembered what it discovered two turns ago and used those numbers to predict the impact. That is page context plus conversation memory paying off compounding.\n", + "\n", + "Wall-clock so far: open question to verbatim prompt diff, three turns, about **5 minutes**. None of the turns required you to type a trace ID, paste a span ID, or repeat what the bug was." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 5: Apply the fix and verify with the same query" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "SYSTEM_PROMPT = \"\"\"You are a research assistant for an ML research team. Answer questions using the search_papers tool. Provide citations to support your claims. If search_papers returns no results or an empty list, respond ONLY with: \"I could not find any papers in the database matching your query. Please try a different search term.\" Do NOT use general knowledge, describe papers from memory, or answer without citations.\"\"\"\n\n# Re-run the exact same failing query\nverify = traced_handle(\n user_id=\"alice\",\n session_id=\"session-bad-verify\",\n messages=[{\"role\": \"user\", \"content\": \"What are the key papers on contrastive learning for self-supervised vision?\"}],\n)\nprint(verify)\n\n# Sanity-check the in-database query still works\nok = traced_handle(\n user_id=\"alice\",\n session_id=\"session-good-verify\",\n messages=[{\"role\": \"user\", \"content\": \"What's the seminal paper on transformers?\"}],\n)\nprint(ok)\n\ntrace_provider.force_flush()" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": "After the fix the contrastive-learning query returns the verbatim refusal, and the transformer query still pulls Vaswani et al. 2017 from the tool result. Both are now grounded in what `search_papers` actually returned.\n\nOpen the new traces in the dashboard. 
The span tree for the contrastive-learning trace now shows the empty tool result followed by the refusal, with no fabricated content in between. Same input, same metric (faithfulness on the citation content), opposite outcome.\n\nWall-clock from the moment your colleague pinged you to the moment the fix is verified: roughly **8 to 10 minutes**. The hour-long version of this loop (read the spans by hand, write the fix, test, repeat) is the version you do not run today.\n\nWant to ask Falcon AI to confirm the fix worked? Open the new failing-query trace and type \"did the fix from the previous trace land?\" The chat will compare the two spans and tell you. The page-context awareness carries across traces in the same conversation." + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What you solved\n", + "\n", + "You took a single hallucinated citation, opened Falcon AI directly on the trace, and ran a three-turn conversation that walked from open question to verbatim prompt diff: page context attached the trace automatically, `/analyze-trace-errors` quoted the offending span, and `/fix-with-falcon` returned a paste-ready Current vs Replace with diff. You applied the fix in code, re-ran the same query, and watched the agent refuse instead of fabricate. The manual version of this loop (open spans, scroll, copy out, diff in your head, write the fix, test, repeat) is roughly an hour. The page-context version closes in about ten minutes.\n", + "\n", + "## Next steps\n", + "\n", + "- **Want to lock this fix in as a regression test?** Capture the failing trace as a dataset and run evals against it: see [Building Evaluation Datasets from Production Traces](https://docs.futureagi.com/docs/cookbook/use-cases/falcon-ai-eval-datasets-from-traces).\n", + "- **Want the same loop across many traces, not just one?** Run the batch version with `/analyze-trace-errors` on the whole project, then `/build-dataset`, `/run-evaluations`, and `/fix-with-falcon`: see [End-to-End with Falcon AI](https://docs.futureagi.com/docs/cookbook/use-cases/falcon-ai-end-to-end).\n", + "- **Need a refresher on tracing setup?** See [Manual Tracing](https://docs.futureagi.com/docs/cookbook/quickstart/manual-tracing) for span decorators, metadata tagging, and prompt template tracking." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.10" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/use-cases/falcon-ai-end-to-end.ipynb b/use-cases/falcon-ai-end-to-end.ipynb new file mode 100644 index 0000000..b3b56a0 --- /dev/null +++ b/use-cases/falcon-ai-end-to-end.ipynb @@ -0,0 +1,516 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# End-to-End with Falcon AI: Trace, Debug, Evaluate, Dataset, Fix in One Workflow\n", + "\n", + "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/future-agi/cookbooks/blob/cookbook/quickstart-notebooks/use-cases/falcon-ai-end-to-end.ipynb)\n", + "[![View on GitHub](https://img.shields.io/badge/View_on_GitHub-181717?logo=github&logoColor=white)](https://github.com/future-agi/cookbooks/blob/cookbook/quickstart-notebooks/use-cases/falcon-ai-end-to-end.ipynb)\n", + "\n", + "| Time | Difficulty |\n", + "|------|------------|\n", + "| 30 min | Beginner |\n", + "\n", + "You shipped a small support agent yesterday. 
This morning, three users say it gave them confidently wrong answers about return windows. You open the dashboard and see hundreds of traces. Reading them by hand is not realistic. Spinning up a separate eval pipeline before you even know what's broken is overkill. Writing a fix without seeing the actual prompt the model was running is guessing.\n", + "\n", + "The usual debugging flow forces you to context-switch across five tabs: traces to find a bad request, evals to score it, datasets to capture it, the prompt page to look at the system prompt, and a notebook to draft the fix. Each tab is one more thing to keep in your head. By the time you have a fix, you've forgotten which span started this.\n", + "\n", + "What if one chat could hold the whole loop? Open Falcon AI, ask it to find the failures, group them, save them as a dataset, score them with the right evals, and propose a concrete prompt diff, all in the same conversation. The dashboard renders the artifacts (datasets, eval runs, prompt diffs) as completion cards underneath each step.\n", + "\n", + "This notebook walks through that loop end-to-end on a small support agent, using FutureAGI's full ecosystem (**Tracing**, **Evals**, **Datasets**, **Prompts**, and **Falcon AI**) wired together so the dataset, eval runs, and prompt diffs you produce in chat all become first-class platform entities you can return to later. You will instrument the agent with Tracing, generate a batch of mixed-quality requests, then drive the rest of the workflow from a single Falcon AI chat: `/analyze-trace-errors` to debug, `/build-dataset` to capture the failing cases, `/run-evaluations` to score them, and `/fix-with-falcon` to get a verbatim prompt fix you can paste back into your code.\n", + "\n", + "**Prerequisites:**\n", + "- FutureAGI account: [app.futureagi.com](https://app.futureagi.com)\n", + "- API keys: `FI_API_KEY` and `FI_SECRET_KEY` (see [Get your API keys](https://docs.futureagi.com/docs/admin-settings))\n", + "- OpenAI API key (`OPENAI_API_KEY`)\n", + "- Python 3.10+" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Install\n", + "\n", + "The simplest way to run this notebook is in **Google Colab** (click the badge at the top). Colab has Python 3.11 and the `%pip install` cell below works out of the box.\n", + "\n", + "If you're running locally, you need Python 3.10+ (`fi-instrumentation-otel` won't import on 3.9)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install 
fi-instrumentation-otel traceai-openai openai\n", + "\n", + "# If the imports below fail with ModuleNotFoundError after this install (Colab / Jupyter\n", + "# sometimes caches the old environment), restart the kernel:\n", + "# Runtime \u2192 Restart session (Colab)\n", + "# Kernel \u2192 Restart kernel (Jupyter / VS Code)\n", + "# Then re-run from this cell onward." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "os.environ[\"FI_API_KEY\"] = \"your-fi-api-key\"\n", + "os.environ[\"FI_SECRET_KEY\"] = \"your-fi-secret-key\"\n", + "os.environ[\"OPENAI_API_KEY\"] = \"your-openai-key\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 1: Build a small agent with intentional weaknesses\n", + "\n", + "The point of this notebook is to drive the workflow from Falcon AI, not to build a perfect agent. So we use a deliberately thin support agent: a short system prompt that does not forbid speculation, two tool stubs, and `gpt-4o-mini`. The thin prompt is what produces the failing traces we want Falcon AI to find." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "from openai import OpenAI\n", + "\n", + "client = OpenAI()\n", + "\n", + "SYSTEM_PROMPT = \"\"\"You are a customer support assistant for an electronics store.\n", + "Answer customer questions about products and orders. Use the tools when relevant.\"\"\"\n", + "\n", + "TOOLS = [\n", + " {\n", + " \"type\": \"function\",\n", + " \"function\": {\n", + " \"name\": \"search_products\",\n", + " \"description\": \"Search the product catalog\",\n", + " \"parameters\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"query\": {\"type\": \"string\", \"description\": \"Search query\"},\n", + " },\n", + " \"required\": [\"query\"],\n", + " },\n", + " },\n", + " },\n", + " {\n", + " \"type\": \"function\",\n", + " \"function\": {\n", + " \"name\": \"get_order_status\",\n", + " \"description\": \"Look up order status by order ID\",\n", + " \"parameters\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"order_id\": {\"type\": \"string\", \"description\": \"The order ID\"},\n", + " },\n", + " \"required\": [\"order_id\"],\n", + " },\n", + " },\n", + " },\n", + "]\n", + "\n", + "\n", + "def search_products(query: str) -> dict:\n", + " return {\n", + " \"results\": [\n", + " {\"id\": \"P-101\", \"name\": \"Wireless Headphones\", \"price\": 79.99},\n", + " {\"id\": \"P-205\", \"name\": \"USB-C Hub\", \"price\": 45.00},\n", + " ],\n", + " }\n", + "\n", + "\n", + "def get_order_status(order_id: str) -> dict:\n", + " return {\n", + " \"order_id\": order_id,\n", + " \"status\": \"shipped\",\n", + " \"tracking\": \"1Z999AA10123456784\",\n", + " \"estimated_delivery\": \"2026-05-04\",\n", + " }\n", + "\n", + "\n", + "TOOL_MAP = {\"search_products\": search_products, \"get_order_status\": get_order_status}\n", + "\n", + "\n", + "def handle_message(messages: list) -> str:\n", + " response = client.chat.completions.create(\n", + " model=\"gpt-4o-mini\",\n", + " messages=[{\"role\": \"system\", \"content\": SYSTEM_PROMPT}] + messages,\n", + " tools=TOOLS,\n", + " )\n", + " msg = response.choices[0].message\n", + "\n", + " if msg.tool_calls:\n", + " tool_messages = [msg]\n", + " for tc in msg.tool_calls:\n", + " result = TOOL_MAP[tc.function.name](**json.loads(tc.function.arguments))\n", + " tool_messages.append({\n", + " \"role\": 
\"tool\",\n", + " \"tool_call_id\": tc.id,\n", + " \"content\": json.dumps(result),\n", + " })\n", + " followup = client.chat.completions.create(\n", + " model=\"gpt-4o-mini\",\n", + " messages=[{\"role\": \"system\", \"content\": SYSTEM_PROMPT}] + messages + tool_messages,\n", + " tools=TOOLS,\n", + " )\n", + " return followup.choices[0].message.content\n", + "\n", + " return msg.content" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The agent will answer product and order questions fine. The places it will fumble are predictable: refund-policy questions (no tool, no instruction to refuse), comparisons that need details the tool doesn't return, and anything that asks \"is this a good deal?\". That is on purpose. Those are the traces Falcon AI will pick up later." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Add tracing so Falcon AI has something to read\n", + "\n", + "Falcon AI works on traces. No traces, nothing to debug. Three lines of instrumentation send every LLM call and tool invocation to the platform as structured spans." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "ename": "ModuleNotFoundError", + "evalue": "No module named 'fi_instrumentation'", + "output_type": "error", + "traceback": [ + "\u001b[31m---------------------------------------------------------------------------\u001b[39m", + "\u001b[31mModuleNotFoundError\u001b[39m Traceback (most recent call last)", + "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[15]\u001b[39m\u001b[32m, line 1\u001b[39m\n\u001b[32m----> \u001b[39m\u001b[32m1\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m fi_instrumentation \u001b[38;5;28;01mimport\u001b[39;00m register, FITracer, using_user, using_session\n\u001b[32m 2\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m fi_instrumentation.fi_types \u001b[38;5;28;01mimport\u001b[39;00m ProjectType\n\u001b[32m 3\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m traceai_openai \u001b[38;5;28;01mimport\u001b[39;00m OpenAIInstrumentor\n\u001b[32m 4\u001b[39m \n", + "\u001b[31mModuleNotFoundError\u001b[39m: No module named 'fi_instrumentation'" + ] + } + ], + "source": [ + "from fi_instrumentation import register, FITracer, using_user, using_session\n", + "from fi_instrumentation.fi_types import ProjectType\n", + "from traceai_openai import OpenAIInstrumentor\n", + "\n", + "trace_provider = register(\n", + " project_type=ProjectType.OBSERVE,\n", + " project_name=\"falcon-ai-end-to-end\",\n", + ")\n", + "OpenAIInstrumentor().instrument(tracer_provider=trace_provider)\n", + "tracer = FITracer(trace_provider.get_tracer(\"falcon-ai-end-to-end\"))\n", + "\n", + "\n", + "@tracer.agent(name=\"support_assistant\")\n", + "def traced_handle(user_id: str, session_id: str, messages: list) -> str:\n", + " with using_user(user_id), using_session(session_id):\n", + " return handle_message(messages)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`@tracer.agent` makes the entire request show up as one parent span in the dashboard, with the OpenAI calls and tool calls nested underneath. `using_user` / `using_session` tag each trace so Falcon AI can later filter by who hit it.\n", + "\n", + "See [Manual Tracing](https://docs.futureagi.com/docs/cookbook/quickstart/manual-tracing) for span decorators, metadata tagging, and prompt template tracking." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3: Generate a batch of mixed traces\n", + "\n", + "You need enough traces for Falcon AI to find a pattern. Ten requests is the floor: a mix that the thin prompt will partly handle and partly fumble. A handful of these are deliberately outside the tool surface (refund window, comparison, recommendation) so the model has nothing to ground on and is forced to either guess or refuse. The thin system prompt doesn't tell it to refuse, so guessing is what we get, and that's the failure mode we want to see surface in the next step." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "test_queries = [\n", + "    # Tool-able, should be fine\n", + "    \"Show me wireless headphones\",\n", + "    \"Where is order ORD-12345?\",\n", + "    \"What's the price of the USB-C Hub?\",\n", + "    \"Track order ORD-99877 please\",\n", + "    # No tool, no rule against speculating, these will likely fail\n", + "    \"What's your return policy for opened headphones?\",\n", + "    \"Is the USB-C Hub compatible with a 2019 MacBook Pro?\",\n", + "    \"Which is better value, the headphones or the hub?\",\n", + "    \"Can you ship to Germany?\",\n", + "    \"How long is the warranty on the headphones?\",\n", + "    \"Will my order ORD-12345 arrive before my birthday on May 5th?\",\n", + "]\n", + "\n", + "for i, query in enumerate(test_queries):\n", + "    answer = traced_handle(\n", + "        user_id=f\"user-{100 + i}\",\n", + "        session_id=f\"session-{i}\",\n", + "        messages=[{\"role\": \"user\", \"content\": query}],\n", + "    )\n", + "    print(f\"Q: {query}\")\n", + "    print(f\"A: {answer[:140]}\\n\")\n", + "\n", + "trace_provider.force_flush()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The no-tool responses (return policy, compatibility, warranty) are typically pure invention. The agent has no return-policy tool and no compatibility data. Falcon AI is about to find that out.\n", + "\n", + "Open **Tracing** \u2192 select `falcon-ai-end-to-end`. You should see ten parent traces, each with the nested OpenAI and tool spans." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 4: Open Falcon AI on the project and analyze the failures\n", + "\n", + "**This step is done in the dashboard, not the notebook.**\n", + "\n", + "Stay on the Tracing page for `falcon-ai-end-to-end` so Falcon AI picks up the project as context automatically. Press `Cmd+K` (Mac) or `Ctrl+K` (Windows) to open the sidebar.\n", + "\n", + "Type `/` and pick **Analyze Trace Errors** from the slash command picker, or just type:\n", + "\n", + "> Analyze trace errors in this project\n", + "\n", + "Falcon AI runs `analyze_project_traces` on the whole project in the background. 
The skill explores each trace, classifies issues against an error taxonomy (Hallucination, Wrong Intent, Tool Misuse, Dropped Context, Instruction Adherence, etc.), submits findings, and scores each trace 1 to 5.\n", + "\n", + "![Falcon AI sidebar showing the analyze trace errors completion card with per-trace scores and the dominant error category](https://fi-cookbook-assets.s3.ap-south-1.amazonaws.com/use-cases/falcon-ai-end-to-end/step-4-analyze-trace-errors.png)\n", + "\n", + "Two failure modes typically surface from one analysis: a **content** problem (hallucinated specifics from the model, which the prompt change in step 7 addresses) and an **instrumentation** problem (missing tool spans, which your agent code can fix by wrapping each tool with `@tracer.tool` so every retrieval shows up in the trace). Both come out of the same `/analyze-trace-errors` run.\n", + "\n", + "Switch to the **Feed** tab in Tracing to see the same findings rendered per-trace, with the exact span and the verbatim quote that triggered each finding." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 5: Capture the failing traces as a dataset\n", + "\n", + "**This step is done in the same Falcon AI chat as Step 4.**\n", + "\n", + "You found the bad traces. Now lock them in as a regression set so any future fix is evaluated against the same failures, not a new sample. In the same Falcon AI conversation, type:\n", + "\n", + "> Build me a dataset called `falcon-demo-failures` with the queries from the traces flagged with Hallucinated Content. Columns: `query` (text), `agent_output` (text), `failure_category` (text).\n", + "\n", + "Falcon AI runs the build-dataset skill: `create_dataset` \u2192 `add_columns` \u2192 `add_dataset_rows`, pulling the row contents from the traces it just analyzed. A completion card appears in the chat with a link to the new dataset.\n", + "\n", + "![Falcon AI completion card showing the new falcon-demo-failures dataset with row count and link](https://fi-cookbook-assets.s3.ap-south-1.amazonaws.com/use-cases/falcon-ai-end-to-end/step-5-build-dataset.png)\n", + "\n", + "Open **Datasets** \u2192 `falcon-demo-failures` to confirm the rows. This dataset is now your regression baseline." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 6: Score the dataset to get a numerical baseline\n", + "\n", + "**Same conversation.** Now run evaluations on the dataset so you have a number to beat after the fix.\n", + "\n", + "> Run `factual_accuracy` and `completeness` evals on the `falcon-demo-failures` dataset.\n", + "\n", + "Falcon AI runs the run-evaluations skill: `add_dataset_eval` to attach each eval template, then `run_dataset_evals` to score every row, then `get_dataset_eval_stats` to summarize the results in the chat.\n", + "\n", + "![Falcon AI evaluation results card showing per-row scores for factual_accuracy and completeness](https://fi-cookbook-assets.s3.ap-south-1.amazonaws.com/use-cases/falcon-ai-end-to-end/step-6-run-evaluations.png)\n", + "\n", + "The split between the two evals is what tells the story. **factual_accuracy** is on the floor because the model invented return windows, warranties, and compatibility statements. **completeness** is perfect because the agent does fully address each question, even when the answer is invented. As we noticed in step 4, the prompt doesn't tell the model not to do that, so there's no instruction to violate. That's the gap the prompt fix needs to close."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 7: Ask Falcon AI to fix the agent\n", + "\n", + "**Same conversation.** Open one of the worst-scoring traces (the warranty or compatibility one) from the Feed. With that trace as context, type:\n", + "\n", + "> /fix-with-falcon\n", + "\n", + "The fix-with-falcon skill has a strict shape: gate-check that there is actually a failure, read the verbatim system prompt and model output from the span (`read_trace_span(exact=True)`), and then return one concrete change in a fixed format (*Current* then *Replace with*) under 400 words.\n", + "\n", + "![Falcon AI fix-with-falcon output with sections for What happened, Root cause in the agent, The fix (current vs replace with), and Expected score improvement](https://fi-cookbook-assets.s3.ap-south-1.amazonaws.com/use-cases/falcon-ai-end-to-end/step-7-fix-with-falcon.png)\n", + "\n", + "A few things to notice in the fix card. Falcon AI cross-references *all* the hallucination-flagged traces before proposing one change (it doesn't fix per-trace), and when the OpenAI auto-instrumentor didn't capture the literal system message in span attributes, the skill is honest about it: the **Current** block is flagged as inferred. The proposed fix is still sound because the failure mode (ungrounded specifics) is independent of the exact wording of the original prompt. The skill is also hard-constrained to one change to the agent's own configuration: no new tool, no model swap, no guardrail layer." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 8: Apply the fix and verify the dataset scores recover\n", + "\n", + "Drop the new prompt into your code, re-run the same ten queries, and re-run the same evals on the same dataset. Same inputs and same metrics give you a real before/after." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "SYSTEM_PROMPT = \"\"\"You are a customer support assistant for an electronics store.\n", + "Answer customer questions about products and orders. Use the tools when relevant.\n", + "\n", + "STRICT GROUNDING RULE: You must NEVER state specific order details (tracking numbers, delivery dates, order status), product prices, or policy specifics unless that data was returned by a tool call in this conversation. If no tool result is available, respond:\n", + "\"I don't have access to that information right now. Please check your order confirmation email or contact support at [support channel].\" Do not estimate, infer, or fabricate any order or product data.\"\"\"\n", + "\n", + "# Re-run the same queries\n", + "for i, query in enumerate(test_queries):\n", + "    traced_handle(\n", + "        user_id=f\"user-{200 + i}\",\n", + "        session_id=f\"verify-session-{i}\",\n", + "        messages=[{\"role\": \"user\", \"content\": query}],\n", + "    )\n", + "\n", + "trace_provider.force_flush()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Back in Falcon AI, in the same conversation:\n", + "\n", + "> Re-run the same evals on `falcon-demo-failures` and compare to the previous run.\n", + "\n", + "Sample after-fix scores (your numbers will vary):\n", + "\n", + "| Eval | Before | After |\n", + "|---|---|---|\n", + "| **factual_accuracy** | 1 / 5 | 5 / 5 |\n", + "| **completeness** | 5 / 5 | 5 / 5 |\n", + "\n", + "factual_accuracy recovered because the agent no longer fabricates. completeness stays at 5 / 5 because the refusal still fully addresses the user's question (it tells them what's happening and offers a next step). The dataset now serves as a permanent regression check. Any future prompt change can be re-scored against `falcon-demo-failures` in one chat message.\n", + "\n", + "Save this conversation. The next time someone asks \"why does our support agent refuse the warranty question?\" the full audit trail (failing traces, dataset, eval scores, prompt diff) is one click away in your Falcon AI history." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What you solved\n", + "\n", + "You took a support agent that was confidently inventing return policies and walked the entire fix loop inside one Falcon AI conversation: found the failures, captured them as a regression dataset, scored them, applied a single prompt change, and verified the scores recovered.
No tab-switching, no separate notebook, no guessing about which span produced which output.\n", + "\n", + "Trace \u2192 Debug \u2192 Evaluate \u2192 Dataset \u2192 Fix, all driven from one chat panel, with every artifact (dataset, eval run, prompt diff) saved as a clickable completion card you can return to.\n", + "\n", + "## Next steps\n", + "\n", + "- **Have just one bad trace and want to fix it conversationally?** See [Context-Aware Trace Debugging](https://docs.futureagi.com/docs/cookbook/use-cases/falcon-ai-context-aware-debugging) for the single-trace, three-turn debug flow.\n", + "- **Want a more thoughtful eval dataset (balanced sampling, ground-truth labels, NEEDS_REVIEW flags)?** See [Building Evaluation Datasets from Production Traces](https://docs.futureagi.com/docs/cookbook/use-cases/falcon-ai-eval-datasets-from-traces).\n", + "- **Need a deeper read on per-trace quality scoring?** See [Error Feed](https://docs.futureagi.com/docs/error-feed) for the per-trace quality scoring and error-category drilldown." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/use-cases/falcon-ai-eval-datasets-from-traces.ipynb b/use-cases/falcon-ai-eval-datasets-from-traces.ipynb new file mode 100644 index 0000000..85540be --- /dev/null +++ b/use-cases/falcon-ai-eval-datasets-from-traces.ipynb @@ -0,0 +1,347 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Building Evaluation Datasets from Production Traces with Falcon AI\n", + "\n", + "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/future-agi/cookbooks/blob/cookbook/quickstart-notebooks/use-cases/falcon-ai-eval-datasets-from-traces.ipynb)\n", + "[![View on GitHub](https://img.shields.io/badge/View_on_GitHub-181717?logo=github&logoColor=white)](https://github.com/future-agi/cookbooks/blob/cookbook/quickstart-notebooks/use-cases/falcon-ai-eval-datasets-from-traces.ipynb)\n", + "\n", + "| Time | Difficulty |\n", + "|------|------------|\n", + "| 25 min | Intermediate |\n", + "\n", + "Your email triage agent has been live for two weeks. It's classifying support inbox emails into urgent, billing, technical, general, and spam. Every now and then a customer complaint slips into the wrong queue and someone has to escalate it manually. You'd like to fix the prompt, but first you need a way to measure: a test set you can rerun every time you change anything.\n", + "\n", + "Synthetic test cases will not cut it. The emails you would invent on a whiteboard are too clean. Real production traffic has angry customers, multi-issue emails, vague timing words, sarcasm, and edge cases you would never think to write. That variety is exactly what catches subtle prompt regressions, and you already have it sitting in your trace history.\n", + "\n", + "The slow way to harvest it is familiar: scroll through traces, copy promising ones into a spreadsheet, write expected categories by hand, save as CSV, hope the file does not go stale. 
Two hours later you have a one-off dataset that someone will never update.\n", + "\n", + "The fast way is to use FutureAGI's full ecosystem (**Tracing** for the raw production signal, **Falcon AI** to read your traces and surface failure patterns, **Datasets** to hold the curated rows, **Evals** to score them) wired together in one chat. Falcon AI selects a balanced set of rows for evaluation, suggests ground-truth labels, and persists the result as a real dataset on the platform. The dataset is reusable. Every future prompt change can be re-scored against it in one chat message. This notebook walks that loop end-to-end.\n", + "\n", + "**Prerequisites:**\n", + "- FutureAGI account: [app.futureagi.com](https://app.futureagi.com)\n", + "- API keys: `FI_API_KEY` and `FI_SECRET_KEY` (see [Get your API keys](https://docs.futureagi.com/docs/admin-settings))\n", + "- OpenAI API key (`OPENAI_API_KEY`)\n", + "- Python 3.10+" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Install\n", + "\n", + "The simplest way to run this notebook is in **Google Colab** (click the badge at the top). Colab has Python 3.11 and the `%pip install` cell below works out of the box.\n", + "\n", + "If you're running locally, you need Python 3.10+ (`fi-instrumentation-otel` won't import on 3.9)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install fi-instrumentation-otel traceai-openai openai" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "os.environ[\"FI_API_KEY\"] = \"your-fi-api-key\"\n", + "os.environ[\"FI_SECRET_KEY\"] = \"your-fi-secret-key\"\n", + "os.environ[\"OPENAI_API_KEY\"] = \"your-openai-key\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 1: Build a small email triage agent\n", + "\n", + "The agent has one tool, `classify_email`, that records the category and a short reasoning string. We force the tool call with `tool_choice` so every trace has the same structured shape (one LLM span plus one tool span). The system prompt is intentionally thin: it lists the categories but says nothing about how to handle hostile tone, multi-issue emails, or vague urgency words. Those gaps are where the production failures live." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "from openai import OpenAI\n", + "\n", + "client = OpenAI()\n", + "\n", + "CATEGORIES = [\"urgent\", \"billing\", \"technical\", \"general\", \"spam\"]\n", + "\n", + "SYSTEM_PROMPT = f\"\"\"You are an email triage assistant for a SaaS company's support inbox.\n", + "Classify each incoming email into one of these categories: {', '.join(CATEGORIES)}.\n", + "Use the classify_email tool to record your classification.\"\"\"\n", + "\n", + "TOOLS = [\n", + " {\n", + " \"type\": \"function\",\n", + " \"function\": {\n", + " \"name\": \"classify_email\",\n", + " \"description\": \"Record the chosen category for an incoming email\",\n", + " \"parameters\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"category\": {\n", + " \"type\": \"string\",\n", + " \"enum\": CATEGORIES,\n", + " \"description\": \"Email category\",\n", + " },\n", + " \"reasoning\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"One-sentence justification for the chosen category\",\n", + " },\n", + " },\n", + " \"required\": [\"category\", \"reasoning\"],\n", + " },\n", + " },\n", + " }\n", + "]\n", + "\n", + "\n", + "def classify_email(category: str, reasoning: str) -> dict:\n", + " return {\"recorded\": True, \"category\": category, \"reasoning\": reasoning}\n", + "\n", + "\n", + "TOOL_MAP = {\"classify_email\": classify_email}\n", + "\n", + "\n", + "def handle_message(email_text: str) -> dict:\n", + " response = client.chat.completions.create(\n", + " model=\"gpt-4o-mini\",\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n", + " {\"role\": \"user\", \"content\": email_text},\n", + " ],\n", + " tools=TOOLS,\n", + " tool_choice={\"type\": \"function\", \"function\": {\"name\": \"classify_email\"}},\n", + " )\n", + " msg = response.choices[0].message\n", + " tc = msg.tool_calls[0]\n", + " args = json.loads(tc.function.arguments)\n", + " return {\"category\": args[\"category\"], \"reasoning\": args[\"reasoning\"]}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Add tracing so each classification becomes a row candidate" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from fi_instrumentation import register, FITracer, using_user, using_session\n", + "from fi_instrumentation.fi_types import ProjectType\n", + "from traceai_openai import OpenAIInstrumentor\n", + "\n", + "trace_provider = register(\n", + " project_type=ProjectType.OBSERVE,\n", + " project_name=\"email-triage-prod\",\n", + ")\n", + "OpenAIInstrumentor().instrument(tracer_provider=trace_provider)\n", + "tracer = FITracer(trace_provider.get_tracer(\"email-triage-prod\"))\n", + "\n", + "\n", + "@tracer.agent(name=\"email_triage\")\n", + "def traced_handle(user_id: str, session_id: str, email_text: str) -> dict:\n", + " with using_user(user_id), using_session(session_id):\n", + " return handle_message(email_text)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Two things to know about why the tracing matters here. First, every classification gets its own parent trace tagged with the `user_id` and `session_id`, so Falcon AI can later filter and group when building the dataset. 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3: Generate a varied batch of production-like traces\n", + "\n", + "Real production has variety. The synthetic batch below mirrors what you would actually see in a SaaS support inbox: clear cases, multi-issue emails, hostile tone over a small problem, vague timing language, and a couple of cases that are deliberately ambiguous. The thin prompt will get the easy ones right and stumble on the rest. That mix is what makes the eval dataset worth building." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "emails = [\n", + "    # Clear category, agent should nail these\n", + "    (\"Production is down. Payment processing has been failing for 30 minutes.\", \"ops-001\"),\n", + "    (\"I was charged twice for my July invoice. Please refund the duplicate.\", \"fin-201\"),\n", + "    (\"The export-to-CSV button does not work in Safari but works in Chrome.\", \"qa-310\"),\n", + "    (\"How do I invite a teammate to my workspace?\", \"user-414\"),\n", + "    (\"Make $$$ from home! Click here NOW: bit.ly/scam-link\", \"spam-001\"),\n", + "\n", + "    # Ambiguous: hostile tone over a small issue, multi-issue, vague timing\n", + "    (\"WORST SERVICE EVER. I have been on hold for 2 hours. CALL ME BACK.\", \"user-501\"),\n", + "    (\"I have a billing question and also my login is not working since yesterday.\", \"user-502\"),\n", + "    (\"Hey, just wondering, is the platform GDPR compliant?\", \"user-503\"),\n", + "    (\"I need someone to call me ASAP about an enterprise contract.\", \"user-504\"),\n", + "    (\"URGENT: My password reset email is not arriving.\", \"user-505\"),\n", + "\n", + "    # Time-sensitive but quiet, business-critical but soft language\n", + "    (\"Hi team, gentle reminder we have a board meeting Friday and the dashboard has been broken since Monday.\", \"user-601\"),\n", + "    (\"Why am I being charged $499 when I signed up for the $49 plan? Please fix this or I am canceling.\", \"user-602\"),\n", + "    (\"Your platform deleted all my data. I want my money back AND damages.\", \"legal-701\"),\n", + "\n", + "    # Edge cases\n", + "    (\"When will the dark mode feature be released?\", \"user-801\"),\n", + "    (\"I forgot my admin password, please reset it.\", \"user-802\"),\n", + "]\n", + "\n", + "for i, (text, base_id) in enumerate(emails):\n", + "    result = traced_handle(\n", + "        user_id=base_id,\n", + "        session_id=f\"sess-{i:03d}\",\n", + "        email_text=text,\n", + "    )\n", + "    print(f\"[{base_id}] {result['category']:<10} | {text[:80]}\")\n", + "\n", + "trace_provider.force_flush()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After the loop above runs, look at the categories your console printed for each row and compare them against what your team would route them to. The interesting rows are the ones where you disagree with the agent: the hostile-tone email (`user-501`), the cancellation-threat billing dispute (`user-602`), the multi-issue email (`user-502`), and similar gray-zone cases. Those are the rows worth flagging for human review and including in your eval set, alongside the easy-pass rows that confirm the agent isn't regressing on the basics. Either way, model output varies between runs, so trust what your console actually printed rather than the specific labels above.\n",
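+ "\n", + "If you want to jot down your own calls before opening the dashboard, a few lines of plain Python are enough. Everything in the sketch below is a hypothetical placeholder: fill in `agent_choice` from whatever your loop printed and `your_call` from your team's routing policy.\n", + "\n", + "```python\n", + "# Hypothetical review notes: agent_choice is what your console showed,\n", + "# your_call is what your team would route the email to.\n", + "review_notes = {\n", + "    \"user-501\": {\"agent_choice\": \"general\", \"your_call\": \"urgent\"},   # hostile tone, SLA complaint\n", + "    \"user-502\": {\"agent_choice\": \"billing\", \"your_call\": \"billing\"},  # multi-issue email\n", + "    \"user-602\": {\"agent_choice\": \"billing\", \"your_call\": \"urgent\"},   # cancellation threat\n", + "}\n", + "to_flag = [row for row, note in review_notes.items() if note[\"agent_choice\"] != note[\"your_call\"]]\n", + "print(\"Rows to flag for the eval set:\", to_flag)\n", + "```\n",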
+ "\n", + "Open **Tracing** in the dashboard and select `email-triage-prod`. You should see fifteen traces, each with the parent agent span and the LLM span underneath (plus a `classify_email` tool span if you wrapped the tool in step 2); the forced tool call and its arguments sit on the LLM span. Those traces are your raw material." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 4: Explore the failure landscape with Falcon AI\n", + "\n", + "**This step is done in the dashboard, not the notebook.**\n", + "\n", + "Open Falcon AI on the project (Cmd+K on Mac, Ctrl+K on Windows). The context chip should show the `email-triage-prod` project automatically. Start with one open question:\n", + "\n", + "> What categories did my agent assign across these traces, and which ones look like misclassifications?\n", + "\n", + "Falcon AI calls `search_traces` and `read_trace_span` across the project, returns a category histogram, and flags traces where the category looks off given the email content (your wording and counts will vary).\n", + "\n", + "![Falcon AI sidebar showing the per-category distribution and flagged misclassifications for the email-triage-prod project](https://fi-cookbook-assets.s3.ap-south-1.amazonaws.com/use-cases/falcon-ai-eval-datasets-from-traces/step-4-explore-failures.png)\n", + "\n", + "Two things matter here. First, Falcon AI's \"likely misclassifications\" are not ground truth; they are a strong starting point. You will confirm them in step 6. Second, this exploration is what tells you the dataset needs balancing rules: include the misclassifications **and** the easy successes, so the eval can detect both regressions and false positives." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 5: Build the dataset with explicit curation criteria\n", + "\n", + "**Same conversation.** Type a `/build-dataset` request that bakes in your curation rules. The Falcon AI build-dataset skill orchestrates the relevant dataset tools (such as `create_dataset`, `add_columns`, and `add_dataset_rows`) against the traces in context. The exact sequence and parameter shapes can change as the platform evolves; what stays stable is the outcome: a named dataset with the columns and rows you described.\n", + "\n", + "> /build-dataset\n", + ">\n", + "> Build a dataset called `email-triage-eval-v1`. Pull rows from the `email-triage-prod` traces in this project. Selection criteria: include at least 2 traces from each category (urgent, billing, technical, general, spam) plus the 3 likely misclassifications you flagged in the previous turn. Total target: 12-15 rows. Columns:\n", + "> - `email_text` (text) - the user message\n", + "> - `predicted_category` (text) - what the agent chose\n", + "> - `agent_reasoning` (text) - the reasoning string from the tool call\n", + "> - `trace_id` (text) - so we can trace any failure back\n", + "\n", + "Falcon AI confirms the dataset shape, runs the three tool calls in order, and returns a completion card with a link to **Datasets \u2192 email-triage-eval-v1**. A typical row count for these criteria is around 13: 2 from each category (10) plus the 3 misclassifications.
Your exact count will depend on which traces Falcon AI selected.\n", + "\n", + "![Falcon AI completion card for the email-triage-eval-v1 dataset showing per-category coverage and the flagged misclassifications that were included](https://fi-cookbook-assets.s3.ap-south-1.amazonaws.com/use-cases/falcon-ai-eval-datasets-from-traces/step-5-build-dataset.png)\n", + "\n", + "The curation rules matter more than the row count. A dataset that is 90% successes will not catch regressions; a dataset that is 90% failures will not catch false positives. The \"at least 2 from each category plus the misclassifications\" rule gives both classes meaningful coverage with very few rows.\n", + "\n", + "If your trace volume is much larger than this example, replace the \"at least 2 from each category\" rule with a stratified sample: \"Sample 5% of each category proportionally, with a floor of 5 rows per category.\" Falcon AI accepts that phrasing in the same `/build-dataset` prompt." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 6: Add a ground truth column for the eval to score against\n", + "\n", + "**Same conversation.** The `predicted_category` column is what the agent chose. To turn the dataset into an eval, you need an `expected_category` column, which is what the agent **should have** chosen. Falcon AI can suggest these but cannot fully replace human review for the close calls.\n", + "\n", + "> Add a column `expected_category` (text) to `email-triage-eval-v1`. For each row, propose the correct category based on the email text. For rows where the correct category is genuinely ambiguous (e.g., hostile tone over a small issue, multi-issue emails), use the value `NEEDS_REVIEW` and add a one-sentence note in a new column `review_note` (text) explaining why.\n", + "\n", + "Falcon AI runs `add_columns` for the two new columns and populates them per row. A typical split for this dataset is roughly 10 rows with confident `expected_category` values and 3 rows tagged `NEEDS_REVIEW` (the misclassifications from step 4, plus the GDPR question). Your split will depend on which rows Falcon AI judges ambiguous.\n", + "\n", + "![Falcon AI per-row preview of the new expected_category and review_note columns with NEEDS_REVIEW flags on the genuinely ambiguous rows](https://fi-cookbook-assets.s3.ap-south-1.amazonaws.com/use-cases/falcon-ai-eval-datasets-from-traces/step-6-ground-truth-column.png)\n", + "\n", + "This split is the dataset's most important feature. The 10 confident rows give you a regression baseline you can score automatically. The 3 review rows tell you exactly where to spend 5 minutes of human judgment instead of trying to write a rule for the gray zone. Open the dataset in **Datasets \u2192 email-triage-eval-v1**, click each `NEEDS_REVIEW` row, and decide:\n", + "\n", + "| Row | Email | Falcon AI note | Your call |\n", + "|---|---|---|---|\n", + "| `user-501` | \"WORST SERVICE EVER. 2 hours on hold.\" | Hostile tone, but the underlying issue (long hold) is unclear | `urgent` if your team treats SLA complaints as escalations, otherwise `general` |\n", + "| `user-602` | \"Why am I charged $499... 
I am canceling\" | Billing dispute plus retention risk | `urgent` for retention-sensitive teams, `billing` otherwise |\n", + "| `user-503` | \"is the platform GDPR compliant\" | Compliance question, may need legal | `general` if legal handles it via your standard escalation, custom value `legal` if you want a separate queue |\n", + "\n", + "Once you've made these calls, edit the rows in the Datasets UI (or ask Falcon AI to update them via `add_dataset_rows`). The dataset now has full ground truth." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 7: Validate the dataset by running evals on it\n", + "\n", + "**Same conversation.** A dataset is only as useful as the evals it can run. Score the current agent's predictions against the ground truth you just added. Describe the goal in plain English so Falcon AI picks the right template from your workspace's catalog (template names vary across workspaces; describing the goal beats guessing the name).\n", + "\n", + "> Run an evaluation on `email-triage-eval-v1` that checks whether `predicted_category` exactly matches `expected_category` for each row. Use the eval template from this workspace that best fits a string-equality check between two columns.\n", + "\n", + "Falcon AI runs the run-evaluations skill: it picks an exact-match-style template (typical names include `correctness`, `output_match`, or a custom equality eval), attaches it via `add_dataset_eval`, runs it with `run_dataset_evals`, and summarizes with `get_dataset_eval_stats`.\n", + "\n", + "![Falcon AI eval run output showing the per-row predicted vs expected category and pass/fail/skip verdict for email-triage-eval-v1](https://fi-cookbook-assets.s3.ap-south-1.amazonaws.com/use-cases/falcon-ai-eval-datasets-from-traces/step-7-run-evaluations.png)\n", + "\n", + "The failures are the rows we expected to fail: the misclassifications plus the enterprise-contract escalation. The 9 pass rows are the easy categories. **Both the failure pattern and the pass pattern are what you want.** A regression test where every row passes is not testing anything; a regression test where every row fails is just noisy.\n", + "\n", + "Now the dataset has compounding value. Any time you change the system prompt, run this same eval against `email-triage-eval-v1` and compare the pass rate. The first change you'll likely make is adding tone and escalation rules to the prompt, which should bring `user-501` and `user-602` into the pass column without regressing the 9 that already pass.\n", + "\n", + "> **Tip.** Why describe the goal instead of naming a template? Different workspaces have different eval catalogs. Asking by goal lets Falcon AI map your need to the right template (and if nothing fits, it'll tell you and offer to create a custom one)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What you solved\n\nYou took fifteen production traces, asked Falcon AI which ones looked off, used `/build-dataset` to curate a balanced 13-row eval set, added a ground-truth column with explicit `NEEDS_REVIEW` flags for the gray-zone rows, and ran a `correctness` eval to lock in a numerical baseline. 
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What you solved\n\nYou took fifteen production traces, asked Falcon AI which ones looked off, used `/build-dataset` to curate a balanced 13-row eval set, added a ground-truth column with explicit `NEEDS_REVIEW` flags for the gray-zone rows, and ran a `correctness` eval to lock in a numerical baseline. The dataset is now a permanent regression check that any future prompt change can be scored against in one chat message.\n\nProduction traces, curated and ground-truthed in one Falcon AI conversation, become a reusable eval dataset that catches both regressions (rows that used to pass and now fail) and unfixed failures (rows that used to fail and still fail).\n\n- **\"Synthetic test cases miss the real failure modes\"**: production traces include the angry customers, multi-issue emails, and vague timing language a whiteboard never produces\n- **\"My one-off CSV is going stale\"**: the dataset lives on the platform; new prompt versions re-score against the same rows in one chat\n- **\"How do I balance the dataset?\"**: explicit curation criteria in the `/build-dataset` prompt (at least N per category, plus the misclassifications)\n- **\"Where do I draw the line on ground truth?\"**: confident rows get auto-labeled, gray-zone rows get `NEEDS_REVIEW`, you spend five minutes on the close calls instead of debating every row\n- **\"Did the next prompt change actually help?\"**: re-run `correctness` on `email-triage-eval-v1`, compare pass rates, no manual scoring\n\n## Next steps\n\nThe natural follow-up is to **fix the system prompt** so the misclassifications pass on the next run, and use this dataset to prove the fix worked without breaking the rows that already pass:\n\n1. Add tone and escalation rules to the email triage system prompt (e.g., \"if the email expresses cancellation intent or contains hostile language, classify as `urgent`\"); the sketch at the end of step 7 is one possible starting point.\n2. Re-run the same emails through the agent so new traces land in `email-triage-prod`.\n3. Ask Falcon AI: *\"Re-run the `correctness` eval on `email-triage-eval-v1` against the new agent runs and compare pass rates.\"*\n4. Watch `user-501` and `user-602` move from Fail to Pass while the 9 baseline rows stay green.\n\nFor the full prompt-fix loop (`/fix-with-falcon` returning a verbatim diff you paste into your code), see [End-to-End with Falcon AI](https://docs.futureagi.com/docs/cookbook/use-cases/falcon-ai-end-to-end). For drilling into a single failing trace conversationally, see [Context-Aware Trace Debugging](https://docs.futureagi.com/docs/cookbook/use-cases/falcon-ai-context-aware-debugging)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.10" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}