diff --git a/skills/.system/openai-docs/references/gpt-5p4-prompting-guide.md b/skills/.system/openai-docs/references/gpt-5p4-prompting-guide.md index dc4ebde4..72f49099 100644 --- a/skills/.system/openai-docs/references/gpt-5p4-prompting-guide.md +++ b/skills/.system/openai-docs/references/gpt-5p4-prompting-guide.md @@ -1,102 +1,82 @@ -# GPT-5.4 prompting upgrade guide +# Prompt guidance for GPT-5.4 -Use this guide when prompts written for older models need to be adapted for GPT-5.4 during an upgrade. Start lean: keep the model-string change narrow, preserve the original task intent, and add only the smallest prompt changes needed to recover behavior. +GPT-5.4, our newest mainline model, is designed to balance long-running task performance, stronger control over style and behavior, and more disciplined execution across complex workflows. Building on advances from GPT-5 through GPT-5.3-Codex, GPT-5.4 improves token efficiency, sustains multi-step workflows more reliably, and performs well on long-horizon tasks. -## Default upgrade posture +GPT-5.4 is designed for production-grade assistants and agents that need strong multi-step reasoning, evidence-rich synthesis, and reliable performance over long contexts. It is especially effective when prompts clearly specify the output contract, tool-use expectations, and completion criteria. In practice, the biggest gains come from choosing the right reasoning effort for the task, using explicit grounding and citation rules, and giving the model a precise definition of what "done" looks like. This guide focuses on prompt patterns and migration practices that preserve those efficiency wins. For model capabilities, API parameters, and broader migration guidance, see [our latest model guide](https://developers.openai.com/api/docs/guides/latest-model). -- Start with `model string only` whenever the old prompt is already short, explicit, and task-bounded. 
-- Move to `model string + light prompt rewrite` only when regressions appear in completeness, persistence, citation quality, verification, or verbosity. -- Prefer one or two targeted prompt additions over a broad rewrite. -- Treat reasoning effort as a last-mile knob. Start lower, then increase only after prompt-level fixes and evals. -- Before increasing reasoning effort, first add a completeness contract, a verification loop, and tool persistence rules - depending on the usage case. -- If the workflow clearly depends on implementation changes rather than prompt changes, treat it as blocked for prompt-only upgrade guidance. -- Do not classify a case as blocked just because the workflow uses tools; block only if the upgrade requires changing tool definitions, wiring, or other implementation details. +When troubleshooting cases where GPT-5.4 treats an intermediate update as the + final answer, verify your integration preserves the assistant message `phase` + field correctly. See [Phase parameter](#phase-parameter) for details. 
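The integration check above can be sketched in Python. This is a hypothetical sketch: the item shapes and the `phase` values (`"commentary"`, `"final"`) are illustrative assumptions, not the exact Responses API wire format.

```python
# Hypothetical sketch: rebuilding Responses API input items while
# preserving the assistant `phase` field. Item shapes and phase values
# ("commentary", "final") are illustrative, not the exact wire format.

def build_replay_items(history, new_user_text):
    """Replay prior turns, keeping `phase` on assistant items only."""
    replay = []
    for item in history:
        kept = {"role": item["role"], "content": item["content"]}
        if item["role"] == "assistant" and "phase" in item:
            # Dropping `phase` here is what lets a working preamble be
            # mistaken for the final answer on the next turn.
            kept["phase"] = item["phase"]
        replay.append(kept)
    # New user input never carries a `phase` field.
    replay.append({"role": "user", "content": new_user_text})
    return replay
```

If you use `previous_response_id` instead of replaying assistant items yourself, this bookkeeping is handled server-side.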
-## Behavioral differences to account for +## Understand GPT-5.4 behavior -Current GPT-5.4 upgrade guidance suggests these strengths: +### Where GPT-5.4 is strongest -- stronger personality and tone adherence, with less drift over long answers -- better long-horizon and agentic workflow stamina -- stronger spreadsheet, finance, and formatting tasks -- more efficient tool selection and fewer unnecessary calls by default -- stronger structured generation and classification reliability +GPT-5.4 tends to work especially well in these areas: -The main places where prompt guidance still helps are: +- Strong personality and tone adherence, with less drift over long answers +- Agentic workflow robustness, with a stronger tendency to stick with multi-step work, retry, and complete agent loops end to end +- Evidence-rich synthesis, especially in long-context or multi-tool workflows +- Instruction adherence in modular, skill-based, and block-structured prompts when the contract is explicit +- Long-context analysis across large, messy, or multi-document inputs +- Batched or parallel tool calling while maintaining tool-call accuracy +- Spreadsheet, finance, and Excel workflows that need instruction following, formatting fidelity, and stronger self-verification -- retrieval-heavy workflows that need persistent tool use and explicit completeness -- research and citation discipline -- verification before irreversible or high-impact actions -- terminal and tool workflow hygiene -- defaults and implied follow-through -- verbosity control for compact, information-dense answers +### Where explicit prompting still helps -Start with the smallest set of instructions that preserves correctness. Add the prompt blocks below only for workflows that actually need them. 
+Even with those strengths, GPT-5.4 benefits from more explicit guidance in a few recurring patterns: -## Prompt rewrite patterns +- Low-context tool routing early in a session, when tool selection can be less reliable +- Dependency-aware workflows that need explicit prerequisite and downstream-step checks +- Reasoning effort selection, where higher effort is not always better and the right choice depends on task shape, not intuition +- Research tasks that require disciplined source collection and consistent citations +- Irreversible or high-impact actions that require verification before execution +- Terminal or coding-agent environments where tool boundaries must stay clear -| Older prompt pattern | GPT-5.4 adjustment | Why | Example addition | -| --- | --- | --- | --- | -| Long, repetitive instructions that compensate for weaker instruction following | Remove duplicate scaffolding and keep only the constraints that materially change behavior | GPT-5.4 usually needs less repeated steering | Replace repeated reminders with one concise rule plus a verification block | -| Fast assistant prompt with no verbosity control | Keep the prompt as-is first; add a verbosity clamp only if outputs become too long | Many GPT-4o or GPT-4.1 upgrades work with just a model-string swap | Add `output_verbosity_spec` only after a verbosity regression | -| Tool-heavy agent prompt that assumes the model will keep searching until complete | Add persistence and verification rules | GPT-5.4 may use fewer tool calls by default for efficiency | Add `tool_persistence_rules` and `verification_loop` | -| Tool-heavy workflow where later actions depend on earlier lookup or retrieval | Add prerequisite and missing-context rules before action steps | GPT-5.4 benefits from explicit dependency-aware routing when context is still thin | Add `dependency_checks` and `missing_context_gating` | -| Retrieval workflow with several independent lookups | Add selective parallelism guidance | GPT-5.4 is strong 
at parallel tool use, but should not parallelize dependent steps | Add `parallel_tool_calling` | -| Batch workflow prompt that often misses items | Add an explicit completeness contract | Item accounting benefits from direct instruction | Add `completeness_contract` | -| Research prompt that needs grounding and citation discipline | Add research, citation, and empty-result recovery blocks | Multi-pass retrieval is stronger when the model is told how to react to weak or empty search results | Add `research_mode`, `citation_rules`, and `empty_result_handling`; add `tool_persistence_rules` when retrieval tools are already in use | -| Coding or terminal prompt with shell misuse or early stop failures | Keep the same tool surface and add terminal hygiene and verification instructions | Tool-using coding workflows are not blocked just because tools exist; they usually need better prompt steering, not host rewiring | Add `terminal_tool_hygiene` and `verification_loop`, optionally `tool_persistence_rules` | -| Multi-agent or support-triage workflow with escalation or completeness requirements | Add one lightweight control block for persistence, completeness, or verification | GPT-5.4 can be more efficient by default, so multi-step support flows benefit from an explicit completion or verification contract | Add at least one of `tool_persistence_rules`, `completeness_contract`, or `verification_loop` | +These patterns are observed defaults, not guarantees. Start with the smallest prompt that passes your evals, and add blocks only when they fix a measured failure mode. -## Prompt blocks +## Use core prompt patterns -Use these selectively. Do not add all of them by default. +### Keep outputs compact and structured -### `output_verbosity_spec` +To improve token efficiency with GPT-5.4, constrain verbosity and enforce structured output through clear output contracts. 
In practice, this acts as an additional control layer alongside the `verbosity` parameter in the Responses API, allowing you to guide both how much the model writes and how it structures the output. -Use when: +```xml + +- Return exactly the sections requested, in the requested order. +- If the prompt defines a preamble, analysis block, or working section, do not treat it as extra output. +- Apply length limits only to the section they are intended for. +- If a format is required (JSON, Markdown, SQL, XML), output only that format. + -- the upgraded model gets too wordy -- the host needs compact, information-dense answers -- the workflow benefits from a short overview plus a checklist - -```text - -- Default: 3-6 sentences or up to 6 bullets. -- If the user asked for a doc or report, use headings with short bullets. -- For multi-step tasks: - - Start with 1 short overview paragraph. - - Then provide a checklist with statuses: [done], [todo], or [blocked]. + +- Prefer concise, information-dense writing. - Avoid repeating the user's request. -- Prefer compact, information-dense writing. - +- Keep progress updates brief. +- Do not shorten the answer so aggressively that required evidence, reasoning, or completion checks are omitted. + ``` -### `default_follow_through_policy` +### Set clear defaults for follow-through -Use when: +Users often change the task, format, or tone mid-conversation. To keep the assistant aligned, define clear rules for when to proceed, when to ask, and how newer instructions override earlier defaults. -- the host expects the model to proceed on reversible, low-risk steps -- the upgraded model becomes too conservative or asks for confirmation too often +Use a default follow-through policy like this: -```text +```xml -- If the user's intent is clear and the next step is reversible and low-risk, proceed without asking permission. 
-- Only ask permission if the next step is: +- If the user’s intent is clear and the next step is reversible and low-risk, proceed without asking. +- Ask permission only if the next step is: (a) irreversible, - (b) has external side effects, or - (c) requires missing sensitive information or a choice that materially changes outcomes. -- If proceeding, state what you did and what remains optional. + (b) has external side effects (for example sending, purchasing, deleting, or writing to production), or + (c) requires missing sensitive information or a choice that would materially change the outcome. +- If proceeding, briefly state what you did and what remains optional. ``` -### `instruction_priority` - -Use when: +Make instruction priority explicit: -- users often change task shape, format, or tone mid-conversation -- the host needs an explicit override policy instead of relying on defaults - -```text +```xml - User instructions override default style, tone, formatting, and initiative preferences. - Safety, honesty, privacy, and permission constraints do not yield. @@ -105,145 +85,144 @@ Use when: ``` -### `tool_persistence_rules` +Higher-priority developer or system instructions remain binding. -Use when: +**Guidance:** When instructions change mid-conversation, make the update explicit, scoped, and local. State what changed, what still applies, and whether the change affects the next turn or the rest of the conversation. -- the workflow needs multiple retrieval or verification steps -- the model starts stopping too early because it is trying to save tool calls +### Handle mid-conversation instruction updates -```text - -- Use tools whenever they materially improve correctness, completeness, or grounding. -- Do not stop early just to save tool calls. -- Keep calling tools until: - (1) the task is complete, and - (2) verification passes. -- If a tool returns empty or partial results, retry with a different strategy. 
- -``` +For mid-conversation updates, use explicit, scoped steering messages that state: -### `dig_deeper_nudge` +1. Scope +2. Override +3. Carry forward -Use when: +```text + +For the next response only: +- Do not complete the task. +- Only produce a plan. +- Keep it to 5 bullets. + +All earlier instructions still apply unless they conflict with this update. + +``` -- the model is too literal or stops at the first plausible answer -- the task is safety- or accuracy-sensitive and needs a small initiative nudge before raising reasoning effort +If the task itself changes, say so directly: ```text - -- Do not stop at the first plausible answer. -- Look for second-order issues, edge cases, and missing constraints. -- If the task is safety- or accuracy-critical, perform at least one verification step. - + +The task has changed. +Previous task: complete the workflow. +Current task: review the workflow and identify risks only. + +Rules for this turn: +- Do not execute actions. +- Do not call destructive tools. +- Return exactly: + 1. Main risks + 2. Missing information + 3. Recommended next step + ``` -### `dependency_checks` +### Make tool use persistent when correctness depends on it -Use when: +Use explicit rules to keep tool use thorough, dependency-aware, and appropriately paced, especially in workflows where later actions rely on earlier retrieval or verification. A common failure mode is skipping prerequisites because the right end state seems obvious. -- later actions depend on prerequisite lookup, memory retrieval, or discovery steps -- the model may be tempted to skip prerequisite work because the intended end state seems obvious +GPT-5.4 can be less reliable at tool routing early in a session, when context is still thin. Prompt for prerequisites, dependency checks, and exact tool intent. -```text +```xml + +- Use tools whenever they materially improve correctness, completeness, or grounding. 
+- Do not stop early when another tool call is likely to materially improve correctness or completeness.
+- Keep calling tools until:
+  (1) the task is complete, and
+  (2) verification passes (see the verification loop below).
+- If a tool returns empty or partial results, retry with a different strategy.
+
+```
+
+This matters most when the final action depends on earlier lookup or retrieval steps, where skipping prerequisites is the most common failure mode.
+
+```xml
-- Before taking an action, check whether prerequisite discovery, lookup, or memory retrieval is required.
+- Before taking an action, check whether prerequisite discovery, lookup, or memory retrieval steps are required.
- Do not skip prerequisite steps just because the intended final action seems obvious.
-- If a later step depends on the output of an earlier one, resolve that dependency first.
+- If the task depends on the output of a prior step, resolve that dependency first.
```
-### `parallel_tool_calling`
-
-Use when:
+Prompt for parallelism when the work is independent and wall-clock time matters. Prompt for sequencing when dependencies, ambiguity, or irreversible actions matter more than speed.
-- the workflow has multiple independent retrieval steps
-- wall-clock time matters but some steps still need sequencing
-
-```text
+```xml
- When multiple retrieval or lookup steps are independent, prefer parallel tool calls to reduce wall-clock time.
-- Do not parallelize steps with prerequisite dependencies or where one result determines the next action.
-- After parallel retrieval, pause to synthesize before making more calls.
+- Do not parallelize steps that have prerequisite dependencies or where one result determines the next action.
+- After parallel retrieval, pause to synthesize the results before making more calls.
- Prefer selective parallelism: parallelize independent evidence gathering, not speculative or redundant tool use.
``` -### `completeness_contract` +### Force completeness on long-horizon tasks -Use when: +For multi-step workflows, a common failure mode is incomplete execution: the model finishes after partial coverage, misses items in a batch, or treats empty or narrow retrieval as final. GPT-5.4 becomes more reliable when the prompt defines explicit completion rules and recovery behavior. -- the task involves batches, lists, enumerations, or multiple deliverables -- missing items are a common failure mode +Coverage can be achieved through sequential or parallel retrieval, but completion rules should remain explicit either way. -```text +```xml -- Deliver all requested items. -- Maintain an itemized checklist of deliverables. -- For lists or batches: - - state the expected count, - - enumerate items 1..N, - - confirm that none are missing before finalizing. +- Treat the task as incomplete until all requested items are covered or explicitly marked [blocked]. +- Keep an internal checklist of required deliverables. +- For lists, batches, or paginated results: + - determine expected scope when possible, + - track processed items or pages, + - confirm coverage before finalizing. - If any item is blocked by missing data, mark it [blocked] and state exactly what is missing. ``` -### `empty_result_handling` - -Use when: - -- the workflow frequently performs search, CRM, logs, or retrieval steps -- no-results failures are often false negatives - -```text - -If a lookup returns empty or suspiciously small results: -- Do not conclude that no results exist immediately. -- Try at least 2 fallback strategies, such as a broader query, alternate filters, or another source. 
+For workflows where empty, partial, or noisy retrieval is common: + +```xml + +If a lookup returns empty, partial, or suspiciously narrow results: +- do not immediately conclude that no results exist, +- try at least one or two fallback strategies, + such as: + - alternate query wording, + - broader filters, + - a prerequisite lookup, + - or an alternate source or tool, - Only then report that no results were found, along with what you tried. - + ``` -### `verification_loop` - -Use when: +### Add a verification loop before high-impact actions -- the workflow has downstream impact -- accuracy, formatting, or completeness regressions matter +Once the workflow appears complete, add a lightweight verification step before returning the answer or taking an irreversible action. This helps catch requirement misses, grounding issues, and format drift before commit. -```text +```xml Before finalizing: - Check correctness: does the output satisfy every requirement? -- Check grounding: are factual claims backed by retrieved sources or tool output? +- Check grounding: are factual claims backed by the provided context or tool outputs? - Check formatting: does the output match the requested schema or style? - Check safety and irreversibility: if the next step has external side effects, ask permission first. ``` -### `missing_context_gating` - -Use when: - -- required context is sometimes missing early in the workflow -- the model should prefer retrieval over guessing - -```text +```xml -- If required context is missing, do not guess. -- Prefer the appropriate lookup tool when the context is retrievable; ask a minimal clarifying question only when it is not. +- If required context is missing, do NOT guess. +- Prefer the appropriate lookup tool when the missing context is retrievable; ask a minimal clarifying question only when it is not. - If you must proceed, label assumptions explicitly and choose a reversible action. 
``` -### `action_safety` - -Use when: +For agents that actively take actions, add a short execution frame: -- the agent will actively take actions through tools -- the host benefits from a short pre-flight and post-flight execution frame - -```text +```xml - Pre-flight: summarize the intended action and parameters in 1-2 lines. - Execute via tool. @@ -251,32 +230,41 @@ Use when: ``` -### `citation_rules` +## Handle specialized workflows -Use when: +### Choose image detail explicitly for vision and computer use -- the workflow produces cited answers -- fabricated citations or wrong citation formats are costly +If your workflow depends on visual precision, specify the image `detail` level in the prompt or integration instead of relying on `auto`. Use `high` for standard high-fidelity image understanding. Use `original` for large, dense, or spatially sensitive images, especially [computer use, localization, OCR, and click-accuracy tasks](https://developers.openai.com/api/docs/guides/tools-computer-use) on `gpt-5.4` and future models. Use `low` only when speed and cost matter more than fine detail. For more details on image detail levels, see the [Images and Vision guide](https://developers.openai.com/api/docs/guides/images-vision). -```text +### Lock research and citations to retrieved evidence + +When citation quality matters, make both the source boundary and the format requirement explicit. This helps reduce fabricated references, unsupported claims, and citation-format drift. + +```xml -- Only cite sources that were actually retrieved in this session. +- Only cite sources retrieved in the current workflow. - Never fabricate citations, URLs, IDs, or quote spans. -- If you cannot find a source for a claim, say so and either: - - soften the claim, or - - explain how to verify it with tools. - Use exactly the citation format required by the host application. +- Attach citations to the specific claims they support, not only at the end. 
``` -### `research_mode` +```xml + +- Base claims only on provided context or tool outputs. +- If sources conflict, state the conflict explicitly and attribute each side. +- If the context is insufficient or irrelevant, narrow the answer or say you cannot support the claim. +- If a statement is an inference rather than a directly supported fact, label it as an inference. + +``` + +If your application requires inline citations, require inline citations. If it requires footnotes, require footnotes. The key is to lock the format and prevent the model from improvising unsupported references. -Use when: +### Research mode -- the workflow is research-heavy -- the host uses web search or retrieval tools +Push GPT-5.4 into a disciplined research mode. Use this pattern for research, review, and synthesis tasks. Do not force it onto short execution tasks or simple deterministic transforms. -```text +```xml - Do research in 3 passes: 1) Plan: list 3-6 sub-questions to answer. @@ -288,11 +276,9 @@ Use when: If your host environment uses a specific research tool or requires a submit step, combine this with the host's finalization contract. -### `structured_output_contract` +### Clamp strict output formats -Use when: - -- the host depends on strict JSON, SQL, or other structured output +For SQL, JSON, or other parse-sensitive outputs, tell GPT-5.4 to emit only the target format and check it before finishing. 
```text @@ -304,12 +290,7 @@ Use when: ``` -### `bbox_extraction_spec` - -Use when: - -- the workflow extracts OCR boxes, document regions, or other coordinates -- layout drift or missed dense regions are common failure modes +If you are extracting document regions or OCR boxes, define the coordinate system and add a drift check: ```text @@ -320,114 +301,299 @@ Use when: ``` -### `terminal_tool_hygiene` +### Keep tool boundaries explicit in coding and terminal agents -Use when: +In coding agents, GPT-5.4 works better when the rules for shell access and file editing are unambiguous. This is especially important when you expose tools like [Shell](https://developers.openai.com/api/docs/guides/tools-shell) or [Apply patch](https://developers.openai.com/api/docs/guides/tools-apply-patch). -- the prompt belongs to a terminal-based or coding-agent workflow -- tool misuse or shell misuse has been observed +### User updates -```text +GPT-5.4 does well with brief, outcome-based updates. Reuse the user-updates pattern from the 5.2 guide, but pair it with explicit completion and verification requirements. + +Recommended update spec: + +```xml + +- Only update the user when starting a new major phase or when something changes the plan. +- Each update: 1 sentence on outcome + 1 sentence on next step. +- Do not narrate routine tool calls. +- Keep the user-facing status short; keep the work exhaustive. + +``` + +For coding agents, see the Prompting patterns for coding tasks section below for more specific guidance. + +### Prompting patterns for coding tasks + +**Autonomy and persistence** + +GPT-5.4 is generally more thorough end to end than earlier mainline models on coding and tool-use tasks, so you often need less explicit "verify everything" prompting. Still, for high-stakes changes such as production, migrations, or security work, keep a lightweight verification clause. 
+ +```xml + +Persist until the task is fully handled end-to-end within the current turn whenever feasible: do not stop at analysis or partial fixes; carry changes through implementation, verification, and a clear explanation of outcomes unless the user explicitly pauses or redirects you. + +Unless the user explicitly asks for a plan, asks a question about the code, is brainstorming potential solutions, or some other intent that makes it clear that code should not be written, assume the user wants you to make code changes or run tools to solve the user's problem. In these cases, it's bad to output your proposed solution in a message, you should go ahead and actually implement the change. If you encounter challenges or blockers, you should attempt to resolve them yourself. + +``` + +**Intermediary updates** + +Keep updates sparse and high-signal. In coding tasks, prefer updates at key points. + +```xml + +- Intermediary updates go to the `commentary` channel. +- User updates are short updates while you are working. They are not final answers. +- Use 1-2 sentence updates to communicate progress and new information while you work. +- Do not begin responses with conversational interjections or meta commentary. Avoid openers such as acknowledgements ("Done -", "Got it", or "Great question") or similar framing. +- Before exploring or doing substantial work, send a user update explaining your understanding of the request and your first step. Avoid commenting on the request or starting with phrases such as "Got it" or "Understood." +- Provide updates roughly every 30 seconds while working. +- When exploring, explain what context you are gathering and what you learned. Vary sentence structure so the updates do not become repetitive. +- When working for a while, keep updates informative and varied, but stay concise. +- When work is substantial, provide a longer plan after you have enough context. 
This is the only update that may be longer than 2 sentences and may contain formatting.
+- Before file edits, explain what you are about to change.
+- While thinking, keep the user informed of progress without narrating every tool call. Even if you are not taking actions, send frequent progress updates rather than going silent, especially if you are thinking for more than a short stretch.
+- Keep the tone of progress updates consistent with the assistant's overall personality.
+
+```
+
+**Formatting**
+
+GPT-5.4 often defaults to more structured formatting and may overuse bullet lists. If you want a clean final response, explicitly clamp list shape.
+
+```xml
+Never use nested bullets. Keep lists flat (single level). If you need hierarchy, split into separate lists or sections; if a line would normally carry a nested bullet, include that content on its own line immediately after it. For numbered lists, only use the `1. 2. 3.` style markers (with a period), never `1)`.
+```
+
+**Frontend tasks**
+
+Use this only when additional frontend guidance is useful.
+
+```xml
+
+When doing frontend design tasks, avoid generic, overbuilt layouts.
+
+Use these hard rules:
+- One composition: The first viewport must read as one composition, not a dashboard, unless it is a dashboard.
+- Brand first: On branded pages, the brand or product name must be a hero-level signal, not just nav text or an eyebrow. No headline should overpower the brand.
+- Brand test: If the first viewport could belong to another brand after removing the nav, the branding is too weak.
+- Full-bleed hero only: On landing pages and promotional surfaces, the hero image should usually be a dominant edge-to-edge visual plane or background. Do not default to inset hero images, side-panel hero images, rounded media cards, tiled collages, or floating image blocks unless the existing design system clearly requires them.
+- Hero budget: The first viewport should usually contain only the brand, one headline, one short supporting sentence, one CTA group, and one dominant image. Do not place stats, schedules, event listings, address blocks, promos, "this week" callouts, metadata rows, or secondary marketing content there. +- No hero overlays: Do not place detached labels, floating badges, promo stickers, info chips, or callout boxes on top of hero media. +- Cards: Default to no cards. Never use cards in the hero unless they are the container for a user interaction. If removing a border, shadow, background, or radius does not hurt interaction or understanding, it should not be a card. +- One job per section: Each section should have one purpose, one headline, and usually one short supporting sentence. +- Real visual anchor: Imagery should show the product, place, atmosphere, or context. +- Reduce clutter: Avoid pill clusters, stat strips, icon rows, boxed promos, schedule snippets, and competing text blocks. +- Use motion to create presence and hierarchy, not noise. Ship 2-3 intentional motions for visually led work, and prefer Framer Motion when it is available. + +Exception: If working within an existing website or design system, preserve the established patterns, structure, and visual language. + +``` + +```xml -- Only run shell commands through the terminal tool. -- Never try to "run" tool names as shell commands. -- If a patch or edit tool exists, use it directly instead of emulating it in bash. +- Only run shell commands via the terminal tool. +- Never "run" tool names as shell commands. +- If a patch or edit tool exists, use it directly; do not attempt it in bash. - After changes, run a lightweight verification step such as ls, tests, or a build before declaring the task done. ``` -### `user_updates_spec` +### Document localization and OCR boxes -Use when: +For bbox tasks, be explicit about coordinate conventions and add drift tests. 
-- the workflow is long-running and user updates matter +```xml + +- Use the specified coordinate format exactly (for example [x1,y1,x2,y2] normalized 0..1). +- For each bbox, include: page, label, text snippet, confidence. +- Add a vertical-drift sanity check: + - ensure bboxes align with the line of text (not shifted up or down). +- If dense layout, process page by page and do a second pass for missed items. + +``` -```text - -- Only update the user when starting a new major phase or when the plan changes. -- Each update should contain: - - 1 sentence on what changed, - - 1 sentence on the next step. -- Do not narrate routine tool calls. -- Keep the user-facing update short, even when the actual work is exhaustive. - +### Use runtime and API integration notes + +For long-running or tool-heavy agents, the runtime contract matters as much as the prompt contract. + +#### Phase parameter + +For GPT-5.4, `gpt-5.3-codex`, and later Responses models, the `phase` field can +help in the small number of long-running or tool-heavy flows where preambles or +other intermediate assistant updates are mistaken for the final answer. + +- `phase` is optional at the API level, but it is highly recommended. Best-effort inference may exist server-side, but explicit round-tripping of `phase` is strictly better. +- Use `phase` for long-running or tool-heavy agents that may emit commentary before tool calls or before a final answer. +- Preserve `phase` when replaying prior assistant items so the model can distinguish working commentary from the completed answer. This matters most in multi-step flows with preambles, tool-related updates, or multiple assistant messages in the same turn. +- Do not add `phase` to user messages. +- If you use `previous_response_id`, that is usually the simplest path, since OpenAI can often recover prior state without manually replaying assistant items. +- If you replay assistant history yourself, preserve the original `phase` values. 
+- Missing or dropped `phase` can cause preambles to be interpreted as final answers and degrade behavior on those multi-step tasks.
+
+### Preserve behavior in long sessions
+
+Compaction unlocks a significantly longer effective context window: user conversations can persist for many turns without hitting context limits or degrading long-context performance, and agents can run very long trajectories that would otherwise exceed a typical context window on long-running, complex tasks.
+
+If you are using [Compaction](https://developers.openai.com/api/docs/guides/compaction) in the Responses API, compact after major milestones, treat compacted items as opaque state, and keep prompts functionally identical after compaction. The endpoint is ZDR compatible and returns an `encrypted_content` item that you can pass into future requests. GPT-5.4 tends to remain more coherent and reliable over longer, multi-turn conversations with fewer breakdowns as sessions grow.
+
+For more guidance, see the [`/responses/compact` API reference](https://developers.openai.com/api/docs/api-reference/responses/compact).
+
+### Control personality for customer-facing workflows
+
+GPT-5.4 can be steered more effectively when you separate persistent personality from per-response writing controls. This is especially useful for customer-facing workflows such as emails, support replies, announcements, and blog-style content.
+
+- **Personality (persistent):** sets the default tone, verbosity, and decision style across the session.
+- **Writing controls (per response):** define the channel, register, formatting, and length for a specific artifact.
+- **Reminder:** personality should not override task-specific output requirements. If the user asks for JSON, return JSON.
+
+For natural, high-quality prose, the highest-leverage controls are:
+
+- Give the model a clear persona.
+- Specify the channel and emotional register.
+- Explicitly ban formatting when you want prose.
+- Use hard length limits.
+
+```xml
+
+- Persona: <persona description>
+- Channel: <channel, e.g. support email>
+- Emotional register: <desired register>
+  "not <register to avoid>"
+- Formatting: <formatting rules>
+- Length: <hard length limit>
+- Default follow-through: if the request is clear and low-risk, proceed without asking permission.
+
+```
+
+For more personality patterns you can lift directly, see the [Prompt Personalities cookbook](https://developers.openai.com/cookbook/examples/gpt-5/prompt_personalities).
+
+**Professional memo mode**
+
+For memos, reviews, and other professional writing tasks, general writing instructions are often not enough. These workflows benefit from explicit guidance on specificity, domain conventions, synthesis, and calibrated certainty.
+
+```xml
+
+- Write in a polished, professional memo style.
+- Use exact names, dates, entities, and authorities when supported by the record.
+- Follow domain-specific structure if one is requested.
+- Prefer precise conclusions over generic hedging.
+- When uncertainty is real, tie it to the exact missing fact or conflicting source.
+- Synthesize across documents rather than summarizing each one independently.
+
+```
+
+This mode is especially useful for legal, policy, research, and executive-facing writing, where the goal is not just fluency, but disciplined synthesis and clear conclusions.
+
+## Tune reasoning and migration
+
+### Treat reasoning effort as a last-mile knob
+
+Reasoning effort is not one-size-fits-all. Treat it as a last-mile tuning knob, not the primary way to improve quality. In many cases, stronger prompts, clear output contracts, and lightweight verification loops recover much of the performance teams might otherwise seek through higher reasoning settings.
+
+Recommended defaults:
+
+- `none`: Best for fast, cost-sensitive, latency-sensitive tasks where the model does not need to think.
+- `low`: Works well for latency-sensitive tasks where a small amount of thinking can produce a meaningful accuracy gain, especially with complex instructions.
+- `medium` or `high`: Reserve for tasks that truly require stronger reasoning and can absorb the latency and cost tradeoff. Choose between them based on how much performance gain your task gets from additional reasoning.
+- `xhigh`: Avoid as a default unless your evals show clear benefits. It is best suited for long, agentic, reasoning-heavy tasks where maximum intelligence matters more than speed or cost.
+
+In practice, most teams should default to the `none`, `low`, or `medium` range.
+
+Start with `none` for execution-heavy workloads such as workflow steps, field extraction, support triage, and short structured transforms.
+
+Start with `medium` or higher for research-heavy workloads such as long-context synthesis, multi-document review, conflict resolution, and strategy writing. With `medium` and a well-engineered prompt, you can recover a large share of the performance you might otherwise seek at higher settings.
+
+For GPT-5.4 workloads, `none` can already perform well on action-selection and tool-discipline tasks. If your workload depends on nuanced interpretation, such as implicit requirements, ambiguity, or cancelled-tool-call recovery, start with `low` or `medium` instead.
+
+Before increasing reasoning effort, first add:
+
+- `<completeness_contract>`
+- `<verification_loop>`
+- `<tool_persistence_rules>`
+
+If the model still feels too literal or stops at the first plausible answer, add an initiative nudge before raising reasoning effort:
+
+```xml
+
+- Don’t stop at the first plausible answer.
+- Look for second-order issues, edge cases, and missing constraints.
+- If the task is safety or accuracy critical, perform at least one verification step.
+
```
-If you are using [Compaction](https://developers.openai.com/api/docs/guides/compaction) in the Responses API, compact after major milestones, treat compacted items as opaque state, and keep prompts functionally identical after compaction.
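As a sketch, the reasoning-effort defaults above can be captured in a small routing helper. The task-shape labels here are illustrative, not an API concept; the chosen value would be passed as `reasoning: { effort }` in a Responses API call.

```javascript
// Starting reasoning-effort defaults by task shape, mirroring the
// recommendations above. Task-shape names are illustrative only.
const DEFAULT_EFFORT = {
  "workflow-step": "none", // execution-heavy: extraction, triage, transforms
  "latency-sensitive-complex": "low", // small thinking budget, complex instructions
  "research": "medium", // long-context synthesis, multi-document review
  "long-horizon-agent": "medium",
};

function pickReasoningEffort(taskShape) {
  // Unknown shapes fall back to "medium"; raise effort only after
  // prompt-level fixes and evals show it is needed.
  return DEFAULT_EFFORT[taskShape] ?? "medium";
}

console.log(pickReasoningEffort("workflow-step")); // "none"
```

Treat the mapping as a starting point, not a policy: evals on your own workload decide whether any entry moves up or down a notch.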
+### Migrate prompts to GPT-5.4 one change at a time + +Use the same one-change-at-a-time discipline as the 5.2 guide: switch model first, pin `reasoning_effort`, run evals, then iterate. + +These starting points work well for many migrations: -## Responses `phase` guidance +| Current setup | Suggested GPT-5.4 start | Notes | +| ------------------------- | ---------------------------------- | ------------------------------------------------------------------- | +| `gpt-5.2` | Match the current reasoning effort | Preserve the existing latency and quality profile first, then tune. | +| `gpt-5.3-codex` | Match the current reasoning effort | For coding workflows, keep the reasoning effort the same. | +| `gpt-4.1` or `gpt-4o` | `none` | Keep snappy behavior, and increase only if evals regress. | +| Research-heavy assistants | `medium` or `high` | Use explicit research multi-pass and citation gating. | +| Long-horizon agents | `medium` or `high` | Add tool persistence and completeness accounting. | -For long-running Responses workflows, preambles, or tool-heavy agents that replay assistant items, review whether `phase` is already preserved. +### Small-model guidance for `gpt-5.4-mini` and `gpt-5.4-nano` -- If the host already round-trips `phase`, keep it intact during the upgrade. -- If the host uses `previous_response_id` and does not manually replay assistant items, note that this may reduce manual `phase` handling needs. -- If reliable GPT-5.4 behavior would require adding or preserving `phase` and that would need code edits, treat the case as blocked for prompt-only or model-string-only migration guidance. +`gpt-5.4-mini` and `gpt-5.4-nano` are highly steerable, but they are less likely than larger models to infer missing steps, resolve ambiguity implicitly, or package outputs the way you intended unless you specify that behavior directly. In practice, prompts for smaller models are often a bit longer and more explicit. 
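The migration table above can be encoded as a simple lookup for choosing a starting effort. This is a sketch: "match the current reasoning effort" is modeled by passing the effort the integration already uses, and evals remain the real arbiter.

```javascript
// Suggested GPT-5.4 starting reasoning effort by current setup,
// mirroring the migration table above.
function suggestGpt54Start(currentModel, currentEffort = "medium") {
  if (currentModel === "gpt-5.2" || currentModel === "gpt-5.3-codex") {
    // Preserve the existing latency and quality profile first, then tune.
    return currentEffort;
  }
  if (currentModel === "gpt-4.1" || currentModel === "gpt-4o") {
    // Keep snappy behavior; increase only if evals regress.
    return "none";
  }
  // Research-heavy assistants and long-horizon agents start at medium or high.
  return "medium";
}

console.log(suggestGpt54Start("gpt-4o")); // "none"
```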
-## Example upgrade profiles +**How `gpt-5.4-mini` differs** -### GPT-5.2 +- `gpt-5.4-mini` is more literal and makes fewer assumptions. +- It is strong when the task is clearly structured, but weaker on implicit workflows and ambiguity handling. +- By default, it may try to keep the conversation going with a follow-up question unless you suppress that behavior explicitly. -- Use `gpt-5.4` -- Match the current reasoning effort first -- Preserve the existing latency and quality profile before tuning prompt blocks -- If the repo does not expose the exact setting, emit `same` as the starting recommendation +**Prompting `gpt-5.4-mini`** -### GPT-5.3-Codex +- Put critical rules first. +- Specify the full execution order when tool use or side effects matter. +- Do not rely on "you MUST" alone. Use structural scaffolding such as numbered steps, decision rules, and explicit action definitions. +- Separate "do the action" from "report the action." +- Show the correct flow, not just the final format. +- Define ambiguity behavior explicitly: when to ask, abstain, or proceed. +- Specify packaging directly: answer length, whether to ask a follow-up question, citation style, and section order. +- Be careful with `output nothing else`. Prefer scoped instructions such as `after the final JSON, output nothing further`. -- Use `gpt-5.4` -- Match the current reasoning effort first -- If you need Codex-style speed and efficiency, add verification blocks before increasing reasoning effort -- If the repo does not expose the exact setting, emit `same` as the starting recommendation +**Prompting `gpt-5.4-nano`** -### GPT-4o or GPT-4.1 assistant +- Use `gpt-5.4-nano` only for narrow, well-bounded tasks. +- Prefer closed outputs: labels, enums, short JSON, or fixed templates. +- Avoid multi-step orchestration unless the flow is extremely constrained. +- Route ambiguous or planning-heavy tasks to a stronger model instead of over-prompting `gpt-5.4-nano`. 
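The routing advice above can be sketched as a tiny dispatcher. The task fields and categories here are assumptions for illustration, not part of any API.

```javascript
// Illustrative model router: only narrow, well-bounded tasks go to
// gpt-5.4-nano; ambiguity and planning-heavy work go to a stronger model.
function pickModel(task) {
  const closedOutput = ["label", "enum", "short-json", "template"].includes(
    task.outputKind
  );
  if (closedOutput && !task.ambiguous && !task.planningHeavy) {
    return "gpt-5.4-nano";
  }
  if (!task.ambiguous && !task.planningHeavy) {
    return "gpt-5.4-mini"; // clearly structured, explicit prompts
  }
  return "gpt-5.4"; // route ambiguity and planning to a stronger model
}

console.log(
  pickModel({ outputKind: "enum", ambiguous: false, planningHeavy: false })
); // "gpt-5.4-nano"
```

The point of the sketch is the shape of the decision, not the thresholds: routing logic like this is cheaper and more reliable than over-prompting a small model into work it handles poorly.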
-- Use `gpt-5.4`
-- Start with `none` reasoning effort
-- Add `output_verbosity_spec` only if output becomes too verbose
+**Good default pattern**
-### Long-horizon agent
+1. Task
+2. Critical rule
+3. Exact step order
+4. Edge cases or clarification behavior
+5. Output format
+6. One correct example
-- Use `gpt-5.4`
-- Start with `medium` reasoning effort
-- Add `tool_persistence_rules`
-- Add `completeness_contract`
-- Add `verification_loop`
+**Avoid**
-### Research workflow
+- Implied next steps
+- Unspecified edge cases
+- Schema-only prompts for tool workflows
+- Generic instructions without structure
-- Use `gpt-5.4`
-- Start with `medium` reasoning effort
-- Add `research_mode`
-- Add `citation_rules`
-- Add `empty_result_handling`
-- Add `tool_persistence_rules` when the host already uses web or retrieval tools
-- Add `parallel_tool_calling` when the retrieval steps are independent
+### Web search and deep research
-### Support triage or multi-agent workflow
+If you are migrating a research agent in particular, make these prompt updates before increasing reasoning effort:
-- Use `gpt-5.4`
-- Prefer `model string + light prompt rewrite` over `model string only`
-- Add at least one of `tool_persistence_rules`, `completeness_contract`, or `verification_loop`
-- Add more only if evals show a real regression
+- Add `<research_mode>`
+- Add `<citation_rules>`
+- Add `<empty_result_handling>`
+- Increase `reasoning_effort` one notch only after prompt fixes.
-### Coding or terminal workflow
+You can start from the 5.2 research block and then layer in citation gating and finalization contracts as needed.
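Citation gating can also be checked mechanically before an answer ships. A minimal sketch, assuming `[n]`-style citation markers; adapt the pattern and sentence splitting to your own citation rules.

```javascript
// Minimal citation-gating check: flag sentences in a research answer
// that carry no [n]-style citation marker. The marker format is an
// assumption for this sketch, not a required convention.
function uncitedSentences(answer) {
  return answer
    .split(/(?<=[.!?])\s+/) // naive sentence split on terminal punctuation
    .map((s) => s.trim())
    .filter((s) => s.length > 0 && !/\[\d+\]/.test(s));
}

const draft = "GPT-5.4 was announced recently [1]. It is faster.";
console.log(uncitedSentences(draft)); // ["It is faster."]
```

A gate like this pairs well with a finalization contract: if any sentence comes back uncited, the agent revises or re-searches instead of finalizing.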
-- Use `gpt-5.4` -- Keep the model-string change narrow -- Match the current reasoning effort first if you are upgrading from GPT-5.3-Codex -- Add `terminal_tool_hygiene` -- Add `verification_loop` -- Add `dependency_checks` when actions depend on prerequisite lookup or discovery -- Add `tool_persistence_rules` if the agent stops too early -- Review whether `phase` is already preserved for long-running Responses flows or assistant preambles -- Do not classify this as blocked just because the workflow uses tools; block only if the upgrade requires changing tool definitions or wiring -- If the repo already uses Responses plus tools and no required host-side change is shown, prefer `model_string_plus_light_prompt_rewrite` over `blocked` +GPT-5.4 performs especially well when the task requires multi-step evidence gathering, long-context synthesis, and explicit prompt contracts. In practice, the highest-leverage prompt changes are choosing reasoning effort by task shape, defining exact output and citation formats, adding dependency-aware tool rules, and making completion criteria explicit. The model is often strong out of the box, but it is most reliable when prompts clearly specify how to search, how to verify, and what counts as done. -## Prompt regression checklist +## Next steps -- Check whether the upgraded prompt still preserves the original task intent. -- Check whether the new prompt is leaner, not just longer. -- Check completeness, citation quality, dependency handling, verification behavior, and verbosity. -- For long-running Responses agents, check whether `phase` handling is already in place or needs implementation work. -- Confirm that each added prompt block addresses an observed regression. -- Remove prompt blocks that are not earning their keep. +- Read [our latest model guide](https://developers.openai.com/api/docs/guides/latest-model) for model capabilities, parameters, and API compatibility details. 
+- Read [Prompt engineering](https://developers.openai.com/api/docs/guides/prompt-engineering) for broader prompting strategies that apply across model families. +- Read [Compaction](https://developers.openai.com/api/docs/guides/compaction) if you are building long-running GPT-5.4 sessions in the Responses API. \ No newline at end of file diff --git a/skills/.system/openai-docs/references/upgrading-to-gpt-5p4.md b/skills/.system/openai-docs/references/upgrading-to-gpt-5p4.md index 7a6775f4..5e0ad296 100644 --- a/skills/.system/openai-docs/references/upgrading-to-gpt-5p4.md +++ b/skills/.system/openai-docs/references/upgrading-to-gpt-5p4.md @@ -2,6 +2,14 @@ Use this guide when the user explicitly asks to upgrade an existing integration to GPT-5.4. Pair it with current OpenAI docs lookups. The default target string is `gpt-5.4`. +## Freshness check + +Before applying this bundled guide, run `node scripts/resolve-latest-model-info.js` from the OpenAI Docs skill directory. + +- If the command returns `modelSlug: "gpt-5p4"`, continue with this bundled guide and use `references/gpt-5p4-prompting-guide.md` when prompt updates are needed. +- If the command returns a different `modelSlug`, fetch both the returned `migrationGuideUrl` and `promptingGuideUrl` and use them as the current source of truth instead of the bundled references. +- If the command fails, the metadata is missing, or either remote guide cannot be fetched, continue with the bundled GPT-5.4 references and say the remote freshness check was unavailable. 
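The `modelSlug` value the freshness check prints follows a simple convention: dots in the model name become `p`, matching the bundled reference filenames (`gpt-5p4-*.md`). A sketch mirroring the script's own transform:

```javascript
// Mirror of the slug transform in resolve-latest-model-info.js:
// dots in the model name become "p" (gpt-5.4 -> gpt-5p4).
function modelToSkillSlug(model) {
  return model.trim().replace(/\./g, "p");
}

console.log(modelToSkillSlug("gpt-5.4")); // "gpt-5p4"
```

This is why a returned `modelSlug` other than `gpt-5p4` signals that the bundled references are stale and the remote guides should be fetched instead.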
+ ## Upgrade posture Upgrade with the narrowest safe change set: diff --git a/skills/.system/openai-docs/scripts/resolve-latest-model-info.js b/skills/.system/openai-docs/scripts/resolve-latest-model-info.js new file mode 100755 index 00000000..2a47ff05 --- /dev/null +++ b/skills/.system/openai-docs/scripts/resolve-latest-model-info.js @@ -0,0 +1,147 @@ +#!/usr/bin/env node + +const fs = require("node:fs/promises"); +const path = require("node:path"); + +const DEFAULT_URL = + "https://developers.openai.com/api/docs/guides/latest-model.md"; +const DEFAULT_BASE_URL = "https://developers.openai.com"; + +function parseArgs(argv) { + const args = { + source: process.env.LATEST_MODEL_URL || DEFAULT_URL, + baseUrl: process.env.LATEST_MODEL_BASE_URL || DEFAULT_BASE_URL, + }; + + for (let i = 2; i < argv.length; i += 1) { + const arg = argv[i]; + if (arg === "--source" || arg === "--url") { + args.source = argv[i + 1]; + i += 1; + } else if (arg === "--base-url") { + args.baseUrl = argv[i + 1]; + i += 1; + } + } + + return args; +} + +async function readSource(source) { + if (source.startsWith("file://")) { + return fs.readFile(new URL(source), "utf8"); + } + + if (!/^https?:\/\//.test(source)) { + return fs.readFile(path.resolve(source), "utf8"); + } + + const response = await fetch(source, { + headers: { accept: "text/markdown,text/plain,*/*" }, + }); + + if (!response.ok) { + throw new Error(`failed to fetch ${source}: ${response.status}`); + } + + return response.text(); +} + +function parseIndentedInfo(lines, startIndex) { + const info = {}; + + for (let i = startIndex + 1; i < lines.length; i += 1) { + const line = lines[i]; + if (!line.trim()) { + continue; + } + + const match = line.match(/^ {2}([A-Za-z][A-Za-z0-9_-]*):\s*(.+?)\s*$/); + if (!match) { + break; + } + + info[match[1]] = match[2].replace(/^["']|["']$/g, ""); + } + + return info; +} + +function parseFlatInfo(block) { + const info = {}; + + for (const line of block.split(/\r?\n/)) { + const match = 
line.match(/^([A-Za-z][A-Za-z0-9_-]*):\s*(.+?)\s*$/);
+    if (match) {
+      info[match[1]] = match[2].replace(/^["']|["']$/g, "");
+    }
+  }
+
+  return info;
+}
+
+function extractLatestModelInfo(markdown) {
+  const lines = markdown.split(/\r?\n/);
+  const latestModelInfoIndex = lines.findIndex((line) =>
+    /^latestModelInfo:\s*$/.test(line)
+  );
+
+  if (latestModelInfoIndex >= 0) {
+    return parseIndentedInfo(lines, latestModelInfoIndex);
+  }
+
+  // Fallback: latestModelInfo supplied as flat key/value pairs inside an
+  // HTML comment in the markdown.
+  const commentMatch = markdown.match(
+    /<!--\s*latestModelInfo\s*([\s\S]*?)-->/m
+  );
+  if (commentMatch) {
+    return parseFlatInfo(commentMatch[1]);
+  }
+
+  return undefined;
+}
+
+function modelToSkillSlug(model) {
+  return model.trim().replace(/\./g, "p");
+}
+
+function absoluteUrl(baseUrl, value) {
+  return new URL(value, baseUrl).toString();
+}
+
+function normalizeInfo(info, baseUrl) {
+  const model = info?.model?.trim();
+  const migrationGuide = info?.migrationGuide?.trim();
+  const promptingGuide = info?.promptingGuide?.trim();
+
+  if (!model || !migrationGuide || !promptingGuide) {
+    throw new Error(
+      "latestModelInfo must include model, migrationGuide, and promptingGuide"
+    );
+  }
+
+  return {
+    model,
+    modelSlug: modelToSkillSlug(model),
+    migrationGuideUrl: absoluteUrl(baseUrl, migrationGuide),
+    promptingGuideUrl: absoluteUrl(baseUrl, promptingGuide),
+  };
+}
+
+async function main() {
+  const { source, baseUrl } = parseArgs(process.argv);
+  const markdown = await readSource(source);
+  const info = extractLatestModelInfo(markdown);
+
+  if (!info) {
+    throw new Error(`latestModelInfo block not found in ${source}`);
+  }
+
+  process.stdout.write(
+    `${JSON.stringify(normalizeInfo(info, baseUrl), null, 2)}\n`
+  );
+}
+
+main().catch((error) => {
+  console.error(error.message);
+  process.exit(1);
+});