136 changes: 136 additions & 0 deletions docs/examples/tools/python_plotting_repair.py
@@ -0,0 +1,136 @@
# pytest: ollama, e2e, qualitative
"""Repair plotting code with Python-tool and plotting-specific requirements."""

import tempfile
import traceback
from pathlib import Path

import mellea
from mellea.backends import ModelOption
from mellea.backends.tools import MelleaTool
from mellea.stdlib.requirements import (
    python_plotting_requirements,
    python_tool_requirements,
)
from mellea.stdlib.sampling import SOFAISamplingStrategy
from mellea.stdlib.tools import local_code_interpreter
from mellea.stdlib.tools.interpreter import ExecutionResult


def python(code: str) -> ExecutionResult:
    """Execute Python code.

    Args:
        code: Python code to execute

    Returns:
        Execution result containing stdout, stderr, and success status
    """
    return local_code_interpreter(code)


def main():
    """Run the plotting repair example."""
    with tempfile.TemporaryDirectory() as tmpdir:
        output_path = str(Path(tmpdir) / "plot.png")

        m = mellea.start_session(context_type="chat")

        requirements = [
            *python_tool_requirements(allowed_imports=["numpy", "matplotlib", "math"]),
            *python_plotting_requirements(output_path=output_path),
        ]

        sampling_strategy = SOFAISamplingStrategy(
            s1_solver_backend=m.backend,
            s2_solver_backend=m.backend,
            s2_solver_mode="fresh_start",
            loop_budget=3,
            feedback_strategy="first_error",
        )

        task_summary = (
            f"Create a plot of sin(x) for x in 0..2π and save it to {output_path}"
        )

        print("=" * 70)
        print("Testing plotting-code repair with Python tool requirements")
        print("=" * 70)
        print(f"Task: {task_summary}\n")

        try:
            result = m.instruct(
                task_summary,
                requirements=requirements,
                strategy=sampling_strategy,
                return_sampling_results=True,
                tool_calls=True,
                model_options={ModelOption.TOOLS: [MelleaTool.from_callable(python)]},
Contributor commented:

I had previously commented asking when a Python tool would exist. I now understand how you are using it. Can you clarify whether this is expected? That is, when models generate Python code, is it common practice to do so through a tool like this? I would have assumed they generate it in their normal generation mode, without structured output.

Contributor commented:

Additionally, to support the current workflow, where we only look for Python in the python tool calls, we should have a much more built-out Python tool for this purpose. Otherwise, we should default to asking the model to generate Python and looking at its output.
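
For comparison, the free-form alternative could look roughly like the sketch below. It assumes the model wraps its code in Markdown fences; `extract_python_blocks` is a hypothetical helper written for illustration and is not part of Mellea.

```python
import re

# Build the fence marker programmatically so this snippet can sit inside a fence.
FENCE = "`" * 3

# Opening fence with an optional "python"/"py" tag, then capture lazily up to
# the next fence. Deliberately naive: blocks tagged with other languages are
# not handled robustly.
_BLOCK_RE = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_python_blocks(model_output: str) -> list[str]:
    """Return the body of each untagged or Python-tagged fenced block."""
    return [match.group(1).strip() for match in _BLOCK_RE.finditer(model_output)]


sample = f"Here is the plot code:\n{FENCE}python\nprint('hi')\n{FENCE}\nDone."
print(extract_python_blocks(sample))
```

Either way the caller ends up holding a plain string of candidate Python; the difference is only where that string is read from.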

Member Author commented:

Here is the response from Claude Code. It seems correct and explains this well.

Yes, this is expected behavior and follows standard LLM patterns. Here's what's happening:

Why use a tool?
When tool_calls=True is passed to instruct() (line 83), the model generates structured tool calls as part of its normal generation. This is different from free-form code generation because:

  1. Sampling validation comes first — The code inside the tool call is validated by requirements (lines 55-58) before execution
  2. Explicit execution control — Tools are only invoked after passing validation (line 108: _call_tools())

Is this common practice?
Yes. Most production LLM systems that generate code use this pattern:

  • Model generates code wrapped in tool calls
  • The tool call args are validated structurally
  • Tool is explicitly invoked by the caller
  • Results are inspected/handled by application logic

Without a tool, you'd have to parse code from free-form text and trust it immediately, with no structured validation step.

This example shows the full lifecycle — if you just wanted direct code generation without the tool wrapper, you'd skip tool_calls=True and _call_tools(), but you'd lose the safety validation layer.
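
The validate-then-execute ordering described above can be sketched in isolation. This is an illustration only, not Mellea's implementation: `is_valid_python`, `run_if_valid`, and `fake_execute` are hypothetical names, and the single validator is a bare syntax check.

```python
import ast
from typing import Callable, Iterable, Optional


def is_valid_python(code: str) -> bool:
    """Cheap structural check: does the string parse as Python at all?"""
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    return True


def run_if_valid(
    code: str,
    validators: Iterable[Callable[[str], bool]],
    execute: Callable[[str], str],
) -> Optional[str]:
    """Call `execute` only when every validator accepts the code."""
    if all(validator(code) for validator in validators):
        return execute(code)
    return None  # rejected: nothing was executed


def fake_execute(code: str) -> str:
    # Stand-in for a real interpreter call; it only reports what it was given.
    return f"executed {len(code)} characters"


print(run_if_valid("x = 1 + 1", [is_valid_python], fake_execute))
print(run_if_valid("x = = 1", [is_valid_python], fake_execute))
```

The gate takes a string and does not care whether that string came from a tool call's arguments or from free-form output.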

Contributor commented:

The crux of my comment is still unaddressed: we don't have a standard Mellea Python tool. If we want to go this route, we ought to have one.

> Why use a tool?
> When tool_calls=True is passed to instruct() (line 83), the model generates structured tool calls as part of its normal generation. This is different from free-form code generation because:
>
> Sampling validation comes first: the code inside the tool call is validated by requirements (lines 55-58) before execution

Sampling strategies (and Mellea generally) don't call tools by default, so this is true of either approach.

> Explicit execution control: tools are only invoked after passing validation (line 108: _call_tools())

Same as above.

> Is this common practice?
> Yes. Most production LLM systems that generate code use this pattern:
>
>   • Model generates code wrapped in tool calls
>   • The tool call args are validated structurally
>   • Tool is explicitly invoked by the caller
>   • Results are inspected/handled by application logic
>
> Without a tool, you'd have to parse code from free-form text and trust it immediately, with no structured validation step.

But the output is functionally free-form. Your code tool only requests a string, which could be any text. We are still parsing it. We aren't accepting some JSON schema that defines the total grammar of the Python language.
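
To make that concrete, here is a minimal sketch of the parsing a `code: str` argument still needs once it arrives. It assumes an import allow-list like the one passed to `python_tool_requirements` in this example; `check_code_arg` is a hypothetical helper and does not claim to match how those requirements are implemented.

```python
import ast


def check_code_arg(code: str, allowed_imports: set[str]) -> list[str]:
    """Return a list of problems found in a tool call's `code` string."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        # The tool schema only says "string", so arbitrary text can land here.
        return [f"not valid Python: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            continue
        for module in modules:
            # Compare only the top-level package, e.g. "matplotlib.pyplot".
            if module.split(".")[0] not in allowed_imports:
                problems.append(f"import not allowed: {module}")
    return problems


allowed = {"numpy", "matplotlib", "math"}
print(check_code_arg("import numpy as np\nimport matplotlib.pyplot as plt", allowed))
print(check_code_arg("import os", allowed))
print(check_code_arg("Sure! Here is the code:", allowed))
```

The same function would work unchanged on code extracted from free-form output, which is the point being made: the tool wrapper changes how the string is delivered, not how much checking it needs.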

            )

            print(f"\nResult: {'SUCCESS' if result.success else 'FAILED'}\n")

            if result.success:
                print("✓ Model successfully generated and executed plotting code")
                print("\nFinal generated code:")
                print("-" * 70)
                print(result.result.value)
                print("-" * 70)

                if Path(output_path).exists():
                    file_size = Path(output_path).stat().st_size
                    print(f"\n✓ Output file created: {output_path}")
                    print(f" File size: {file_size} bytes")
                else:
                    print(f"\n✗ Output file not found: {output_path}")

                print(f"\nRepair iterations: {len(result.sample_validations)}")
                for attempt_idx, validations in enumerate(result.sample_validations, 1):
                    passed = sum(1 for _, val in validations if val.as_bool())
                    total = len(validations)
                    status = "✓" if passed == total else "✗"
                    print(
                        f" {status} Attempt {attempt_idx}: {passed}/{total} "
                        f"requirements passed"
                    )

                    for req, val in validations:
                        if not val.as_bool():
                            print(f" - {req.description}")
                            if val.reason:
                                reason_preview = val.reason[:100].replace("\n", " ")
                                print(f" Error: {reason_preview}...")

            else:
                print("✗ Failed to generate working plotting code after all attempts\n")
                print("Last attempt output:")
                print("-" * 70)
                print(result.result.value)
                print("-" * 70)

                print(f"\nFailure history ({len(result.sample_validations)} attempts):")
                for attempt_idx, validations in enumerate(result.sample_validations, 1):
                    failed_count = sum(1 for _, val in validations if not val.as_bool())
                    if failed_count > 0:
                        print(f"\n Attempt {attempt_idx}:")
                        for req, val in validations:
                            if not val.as_bool():
                                print(f" - {req.description}")
                                if val.reason:
                                    reason_lines = val.reason.split("\n")[:2]
                                    for line in reason_lines:
                                        print(f" {line}")

        except Exception as e:
            print(f"✗ Exception during sampling: {e}")
            traceback.print_exc()

        print("\n" + "=" * 70)
        print("Test completed")
        print("=" * 70)


if __name__ == "__main__":
    main()

# Made with Bob
Contributor commented:

This is the only "Made with Bob" marker I found, but please search for and remove any others.

4 changes: 4 additions & 0 deletions mellea/stdlib/requirements/__init__.py
@@ -3,7 +3,9 @@
# Import from core for ergonomics.
from ...core import Requirement, ValidationResult, default_output_to_bool
from .md import as_markdown_list, is_markdown_list, is_markdown_table
from .plotting import python_plotting_requirements
from .python_reqs import PythonExecutionReq
from .python_tools import python_tool_requirements
from .requirement import (
    ALoraRequirement,
    LLMaJRequirement,
@@ -26,6 +28,8 @@
    "default_output_to_bool",
    "is_markdown_list",
    "is_markdown_table",
    "python_plotting_requirements",
    "python_tool_requirements",
    "req",
    "reqify",
    "requirement_check_to_bool",
9 changes: 9 additions & 0 deletions mellea/stdlib/requirements/plotting/__init__.py
@@ -0,0 +1,9 @@
"""Plotting-specific requirements for Python tool validation.
Provides matplotlib and plotting-focused requirement factories separate from
generic Python tool requirements.
"""

from .matplotlib import python_plotting_requirements

__all__ = ["python_plotting_requirements"]