Skip to content

hashbulla/deep-research

Repository files navigation

Deep Research

Intelligence-grade multi-source research as a Claude Code skill — source-graded, autonomous-capable, evidence-anchored.

Claude Code Skill 7 Phases Admiralty 100+ Sources Tavily MCP

MIT License Runtime Any Language


Point it at a research question. It decomposes the question into orthogonal sub-questions, retrieves across a tiered domain registry with Tavily, grades every source on the NATO Admiralty 2×6 matrix, and hands you a report where every claim traces to a URL and an explicit confidence label. Exhaustive runs reach 100+ sources. Quality is held to absolute gates — groundedness, source-quality, corroboration, and a decorrelated entailment judge — verified per run, not against an external product.

Why does this exist? LLM research agents confidently synthesize from SEO farms unless you force source discipline. Anthropic's own internal research system found that "early agents consistently chose SEO-optimized content farms over academic PDFs, even when authoritative sources were retrievable." The fix is not better prompting alone. It is a source-grading harness that runs before the synthesis step — tier registry, score threshold, Admiralty reliability, CRAAP authority check, LLM-as-judge rerank, CRAG grounding loop. This skill bakes all of that into a 7-phase pipeline that runs autonomously, pausing for a single clarifying question only when the query is genuinely ambiguous — so an orchestrating agent can drive it unattended, yet an ambiguous question is refined with you before a single Tavily credit is spent.


What You Get

Four artifacts in your invocation directory, written atomically at the end of the run.

research-plan.md         # the run's plan, written in Phase 0
research-report.md       # final synthesis, inline citations, confidence tags
research-sources.json    # every cited source, Admiralty-graded
research-evidence.json   # claim → source mapping, credibility 1–6

Exactly four — always. The optional --suggest-tooling flag adds a fifth file (research-toolbox.md) written by a separate companion skill, not the engine; the four-artifact contract stays intact.

research-report.md excerpt

# Impact of the EU AI Act on open-source model providers in 2026

> Research date: 2026-04-17 · Length: short
> Source count: 16/28 · Tier 1/2 share: 94% · Median date: 2025-09-08

## Executive summary

- GPAI obligations under Articles 53–55 entered application on 2 August 2025,
  with systemic-risk provisions applying above the 10²⁵ FLOPs threshold.[^1][^2]
- The open-source exemption (Article 2(5g)) excludes free and open-source
  GPAI models from several transparency obligations unless they meet the
  systemic-risk threshold.[^1][^3] [CONFIRMED]

## 1. GPAI provisions in force in 2026

Under Article 53 of Regulation (EU) 2024/1689 [...].[^1][^4] The European AI
Office published its Code of Practice on 2025-07-10 [...].[^5] [CONFIRMED]

## Contradictions & open debates

The scope of "sufficiently detailed summary" of training data remains disputed.
The Commission's July 2025 template[^5] is interpreted by Meta[^6] as [...],
while Mozilla[^7] argues [...]. [POSSIBLY TRUE — contested]

## Needs Verification

- Claim that compliance costs exceed €1M for small open-source providers —
  rests on a single trade-press source[^12] without regulatory corroboration.

## Sources

[^1]: Regulation (EU) 2024/1689 — eur-lex.europa.eu — Tier 1, Admiralty A1
[^2]: European AI Office GPAI guidance — digital-strategy.ec.europa.eu — Tier 1, A1
[...]

research-evidence.json schema

{
  "claim_id": "C001",
  "claim_text": "GPAI obligations under Articles 53–55 entered application on 2 August 2025.",
  "supporting_source_ids": ["S001", "S002"],
  "contradicting_source_ids": [],
  "admiralty_credibility": 1,
  "label": "CONFIRMED",
  "corroboration_count": 2,
  "independent_tier12_count": 2,
  "primary_source_present": true
}

Every claim in the report has a record. No URL is fabricated; no claim is silent about its provenance.


Quick Start

Install

gh repo clone hashbulla/deep-research ~/.claude/skills/deep-research

Claude Code discovers the skill automatically. No restart needed.

Run

# Standard run, language inferred from the question
/deep-research impact of EU AI Act on open-source model providers in 2026

# Exhaustive run in French, with recency and custom domains
/deep-research --length exhaustive --lang fr \
  --since 2025 --domains anthropic.com,mistral.ai \
  comparaison LangGraph / CrewAI / AutoGen / Claude Agent SDK

# Narrow factual with recency gate
/deep-research --since 2025 prompt caching cost-performance tradeoffs

Prerequisites

Requirement Why Check
Claude Code Runtime for the skill claude --version
Opus Synthesis quality and Admiralty discipline benefit from top-tier reasoning /model opus
Tavily MCP Every retrieval call. WebSearch is fallback only. Visible in /mcp
gh CLI Installing from this repo, and GitHub deep research (SOTA-repo discovery). Absent → graceful Tavily degradation gh auth status
python3 Runs scripts/verify_gates.py (stdlib-only, zero network) for deterministic gate verification python3 --version
Context7 MCP Version-current library docs on technical runs naming a dependency. Absent → graceful Tavily degradation Visible in /mcp
Newsletter corpus Folds the maintainer's curated daily briefs into work-relevant runs as a routing signal. Absent → graceful Tavily degradation ls ~/.claude/deep-research/newsletter-corpus/

Verify Tavily is registered before invoking:

claude mcp list | grep tavily

If the grep returns no match, register the Tavily remote MCP server at user scope (see Troubleshooting). The skill halts at Phase 0 with an explicit error if mcp__tavily__* tools are not visible.

Invocation flags

Flag Values Default Effect
--length short | standard | exhaustive standard Calibrates sub-question count, recall breadth, source target (15–25 / 35–60 / 100+)
--lang ISO 639-1 inferred Output language of research-report.md
--since YYYY or YYYY-MM-DD inferred Lower bound on source publication date
--domains comma list tier profile Additional allowlist appended to the tier profile
--exclude comma list tier blocklist Additional blocklist
--profile academic | technical | current-affairs | mixed inferred Selects include_domains baseline from the tier registry
--min-corroboration int ≥ 1 2 Min independent Tier 1/2 sources required to mark a claim CONFIRMED
--model opus | fable opus Synthesis tier — Claude-Code-native (session model + subagent overrides, zero API keys). Fable 5 is opt-in at ~2× cost (details)
--confidential flag off Confidential-path run: subagents receive neutral references only; rigor escalates (details)
--rigor standard | critical standard (critical implied by --confidential) Verification depth — entailment-judge scope, refuse-if-no-source, mandatory anchors, sycophancy probe (details)
--suggest-tooling flag off After Phase 6 completes, delegate the finished run to the suggest-tooling sibling skill, which proposes work-relevant Claude Code skills, plugins, and MCP servers and writes research-toolbox.md. Default OFF — runs are byte-identical without it. The engine still emits exactly the four artifacts; suggest-tooling is a separate skill that writes the 5th file and never auto-installs anything.

Pipeline

flowchart TD
    Start(["/deep-research <question>"]) --> P0

    subgraph P0["Phase 0 — Query Architect"]
        B1[Parse question & flags] --> RF{"ambiguity signal<br/>or safety trigger?<br/>methodology §9"}
        RF -- "yes" --> ASK["AskUserQuestion<br/>one round — refine"]
        RF -- "no — autonomous" --> B2[Classify: academic/technical/current/mixed]
        ASK --> B2
        B2 --> B3[Decompose into sub-questions<br/>factual · contextual · contradictory · recency]
        B3 --> B4[Assemble tier profile<br/>+ include_domains preview]
        B4 --> B5[Write research-plan.md]
    end

    P0 --> P1

    subgraph P1["Phase 1 — Broad Retrieval"]
        direction LR
        R1["tavily_search<br/>search_depth=advanced"] --> R2["tavily_map<br/>for domain discovery"]
        R2 --> R3["Paced under 20 req/min"]
        R3 --> R4["Conditional sources<br/>declared in the plan:<br/>GitHub · academic · Context7 · newsletter-signal<br/>→ graceful Tavily degradation"]
    end

    P1 --> P2

    subgraph P2["Phase 2 — Source Grading"]
        direction LR
        S1["score &gt; 0.7"] --> S2["Tier 1–4 classification"]
        S2 --> S2b["MBFC credibility overlay<br/>flag / downgrade at the margin<br/>(optional, user-scope dataset)"]
        S2b --> S3["CRAAP Currency + Authority"]
        S3 --> S4["Admiralty A–F"]
        S4 --> S5["Dedupe canonical URL<br/>punycode check"]
    end

    P2 --> P3

    subgraph P3["Phase 3 — Precision Rerank"]
        direction LR
        J1["LLM-as-judge pointwise<br/>≤10 docs per sub-question"] --> J2["Primary vs secondary"]
        J2 --> J3["Top 5–7 selected"]
    end

    P3 --> P4

    subgraph P4["Phase 4 — Deep Extract & Synthesis"]
        direction LR
        E1["tavily_research<br/>mini / pro"] --> E2["tavily_extract<br/>extract_depth=advanced"]
        E2 --> E3["Write research-report.md<br/>surgical quotes only"]
    end

    P4 --> P5

    subgraph P5["Phase 5 — Grounding Validation (CRAG)"]
        direction LR
        C0["Decorrelated entailment judge<br/>different Claude model, claim + span only<br/>scope by rigor profile"] --> C1["groundedness ≥ 0.95?<br/>corroboration ≥ 0.80?"]
        C1 --> C2{gates pass}
        C2 -- "no, &lt;2 iters" --> Req["rewrite query →<br/>tavily_search supplement"]
        Req --> C1
        C2 -- "yes, or 2 iters done" --> C3["move failing claims<br/>to Needs Verification"]
    end

    P5 --> P6

    subgraph P6["Phase 6 — Confidence Annotation"]
        direction LR
        Z1["Admiralty credibility 1–6<br/>per claim"] --> Z2["CONFIRMED / PROBABLY TRUE /<br/>POSSIBLY TRUE → main body"]
        Z2 --> Z3["DOUBTFUL / IMPROBABLE /<br/>UNVERIFIED → Needs Verification"]
    end

    P6 --> Done(["4 artifacts written atomically"])

    style RF fill:#FEF3C7,stroke:#D97706,color:#92400E
    style P1 fill:#DBEAFE,stroke:#3B82F6
    style P2 fill:#FEF9C3,stroke:#CA8A04
    style P3 fill:#FEF9C3,stroke:#CA8A04
    style P4 fill:#DCFCE7,stroke:#059669
    style P5 fill:#FEE2E2,stroke:#EF4444
    style P6 fill:#F3F4F6,stroke:#6B7280
    style Start fill:#7C3AED,stroke:#7C3AED,color:#fff
    style Done fill:#059669,stroke:#059669,color:#fff
Loading

Pre-flight Refinement

The skill runs autonomously by default: Phase 0 writes research-plan.md and proceeds straight to retrieval. It pauses for a single AskUserQuestion round only when the query trips a named ambiguity signal — no scope boundary, undefined comparison axis, ambiguous timeframe, unspecified depth, or undefined audience/jurisdiction — or a safety trigger (a sub-Tier-2 --domains entry, or a likely-false premise under --rigor critical). A well-formed query never pauses, so an orchestrating agent can drive the skill unattended; an ambiguous one is refined with you before a single Tavily credit is spent. The checklist is authoritative in references/methodology.md §9.

Refinement Fires when Time
Refine A named ambiguity signal or safety trigger trips — otherwise the run is fully autonomous 0–2 min

Source Grading

Three overlapping disciplines, applied deterministically in Phase 2 and probabilistically in Phase 3.

1. The tier registry

Every domain classified before any content enters the synthesis prompt. Based on Anthropic's internal finding that unconstrained agents drift toward SEO content.

Tier Examples Admiralty reliability Usage
1 arxiv.org, pubmed.ncbi.nlm.nih.gov, nature.com, *.gov, *.europa.eu, who.int A Primary sources; preferred in include_domains
2 anthropic.com, openai.com, reuters.com, ft.com, docs.python.org, gartner.com B Default retrieval baseline (Tier 1+2 union)
3 techcrunch.com, wired.com, arstechnica.com, substack.com (institution-affiliated) C Acceptable only with corroboration from Tier 1/2
4 reddit.com, x.com, linkedin.com, medium.com D–F Never primary. Social-signal pointers only, in a labeled Signals subsection

Full list in references/methodology.md §6.

2. NATO Admiralty 2×6 matrix

Two orthogonal axes: reliability of the source, credibility of the information after corroboration. Every cited source carries a reliability letter; every claim in research-evidence.json carries a credibility digit.

flowchart LR
    Tier["Domain tier<br/>(1–4)"] --> Rel["Reliability<br/>A / B / C / D / E / F"]
    Rel --> Pair["Source record<br/>e.g. A1, B2, C3"]
    Corr["Independent<br/>corroborating sources"] --> Cred["Credibility<br/>1 / 2 / 3 / 4 / 5 / 6"]
    Contra["Contradicting<br/>sources"] --> Cred
    Cred --> Pair
    Pair --> Label["Claim label<br/>CONFIRMED / PROBABLY TRUE /<br/>POSSIBLY TRUE / …"]

    style Tier fill:#DBEAFE,stroke:#3B82F6
    style Rel fill:#DBEAFE,stroke:#3B82F6
    style Corr fill:#FEF3C7,stroke:#D97706
    style Cred fill:#FEF3C7,stroke:#D97706
    style Contra fill:#FEE2E2,stroke:#EF4444
    style Pair fill:#DCFCE7,stroke:#059669
    style Label fill:#059669,stroke:#059669,color:#fff
Loading

Credibility rules are deterministic — no LLM fluency-weighted guessing. The table renders the normative precedence cascade in references/methodology.md §4.1 (first matching row wins; the cascade wins on any divergence):

Credibility Condition (first match wins) Label
1 ≥2 independent Tier 1/2 sources agree, no Tier 1/2 contradictor CONFIRMED
2 ≥1 Tier 1 source and no contradictor; or ≥2 Tier 1/2 with exactly 1 contradictor PROBABLY TRUE
3 Single Tier 1/2 source, no contradictor (Tier 3 corroboration does not upgrade) POSSIBLY TRUE
4 ≥1 Tier 1/2 support and ≥1 equally authoritative contradictor DOUBTFUL
5 Contradicted by ≥2 Tier 1/2 sources IMPROBABLE
6 Only Tier 3/4 support, or zero supporting sources UNVERIFIED

Labels 4/5/6 cannot appear in the main report body — they route to the Needs Verification section with an explicit reason.

3. Quality gates

Applied post-synthesis. Failing a gate triggers a CRAG re-query loop (up to 2 iterations per sub-question) before any claim ships.

Gate Threshold Failure action
Groundedness ≥ 0.95 CRAG re-query for unsupported claims
Source quality ≥ 0.80 Tier 1/2 Expand allowlist, re-retrieve
Coverage ≥ 0.90 sub-questions Add follow-up sub-question
Freshness Median within --since window Add recency sub-question
Corroboration rate ≥ 0.80 Re-query; else → Needs Verification

Full thresholds in references/quality-gate.md.


Architecture

Seven-phase orchestrator, single SKILL.md entry point, methodology externalized into reference files loaded on demand.

graph LR
    O["SKILL.md<br><i>Orchestrator (7 phases)</i>"]

    O -. Phase 0 reads .-> M["references/methodology.md<br><i>Operational spec (report distillation)</i>"]
    O -. Phase 0 reads .-> T["references/tool-routing.md<br><i>Tavily binding table</i>"]
    O -. Phase 0 reads .-> P["references/research-plan-template.md<br><i>Phase 0 scaffold</i>"]
    O -. Phase 4 reads .-> S["references/report-structure.md<br><i>Output + JSON schemas</i>"]
    O -. Phase 5 reads .-> Q["references/quality-gate.md<br><i>Deterministic thresholds</i>"]
    O -. any phase .-> A["references/anti-patterns.md<br><i>Forbidden behaviors</i>"]

    O -- calls --> TVS(["mcp__tavily__tavily_search"])
    O -- calls --> TVR(["mcp__tavily__tavily_research"])
    O -- calls --> TVE(["mcp__tavily__tavily_extract"])
    O -- calls --> TVM(["mcp__tavily__tavily_map"])

    O -- writes --> plan[("research-plan.md")]
    O -- writes --> report[("research-report.md")]
    O -- writes --> sources[("research-sources.json")]
    O -- writes --> evidence[("research-evidence.json")]
Loading

File structure

~/.claude/skills/deep-research/
├── .claude/CLAUDE.md                      # Maintainer spec anchor — invariants, gotchas, conventions
├── SKILL.md                               # Orchestrator — 7 phases, pre-flight refinement, provenance block
├── deep-research-report.md                # Methodology source of truth (cited below)
├── scripts/
│   ├── verify_gates.py                    # Deterministic gate verification (stdlib-only, zero network)
│   ├── github_rank.py                     # Composite GitHub-repo ranking (scoring only, zero network)
│   └── academic_graph.py                  # Dual-track paper ranking + BibTeX/RIS export (zero network)
└── references/
    ├── methodology.md                     # Full distillation — tier registry, Admiralty, CRAAP, CRAG
    ├── tool-routing.md                    # Tavily MCP tool selection per intent
    ├── report-structure.md                # research-report.md structure + JSON schemas
    ├── quality-gate.md                    # Deterministic thresholds, CRAG triggers
    ├── anti-patterns.md                   # Non-negotiables (no fabricated URLs, no WebSearch, etc.)
    ├── research-plan-template.md          # Phase 0 scaffold
    ├── model-tiers.md                     # Model-tier policy (opus default, fable opt-in)
    ├── github-research.md                 # GitHub SOTA-repo discovery (sharding, expert prior, fake-star gate)
    ├── academic-research.md               # Scholarly pipeline (open graph, dual-track, OA-only ingestion)
    ├── newsletter-signal.md               # Curated-feed routing source (local FTS5 corpus, never cited)
    └── examples.md                        # Worked examples (read on demand)

Design decisions

Decision Choice Rationale
Primary retrieval tavily_search search_depth=advanced Per-phase control; Tavily already reranks semantic chunks
Synthesis for narrow sub-questions tavily_research model=mini|pro Delegates the inner Perplexity-style loop where useful
Stage-2 rerank LLM-as-judge (≤10 docs) Cohere Rerank / cross-encoders not available as MCP; judge approximates cross-encoder accuracy
Pre-context filtering Inline Claude reasoning on Tavily results Anthropic's Dynamic Filtering is API-side only; inline filtering achieves equivalent discipline
Source grading NATO Admiralty 2×6 Intelligence-grade provenance, usable by humans, deterministic
Contradiction handling Dedicated report section Report §1 stage 4 — never silent, never auto-resolve between equally authoritative sources
Gate verification scripts/verify_gates.py (stdlib-only, zero network) Counts, ratios, medians, cascade conformance, and the CWD-report SHA-256 are script-verified at runtime — LLM-self-reported metrics are not gates
Conditional retrieval sources GitHub / academic / Context7 / newsletter-signal, gated at Phase 0 Declared in the plan, fire only when relevant; any absent MCP/CLI/credential/corpus degrades gracefully to Tavily-only and is recorded in the Methodology note
Newsletter-signal grading Routing signal, never a citable record The brief seeds retrieval; the pointed-to URL is graded normally and tagged surfaced via newsletter-signal corpus <date>. Avoids circular "my own digest said so" authority
Rigor profiles standard default · critical (implied by --confidential) The everyday instrument stays fast; high-stakes runs escalate to entailment-on-every-claim, refuse-if-no-source, mandatory anchors, and a Phase-0 sycophancy probe
Model selection Claude-Code-native (session model + subagent overrides), zero API keys Consumers are Claude Max users without keys; Opus 4.8 default, Fable 5 opt-in via --model, no SDK client
Fidelity judge Decorrelated subagent on a different Claude model, claim + span only Breaks same-model self-evaluation circularity without an external key; external Gemini/GPT judges were evaluated and dropped (zero-key contract)

Companion skill: tooling recommender

suggest-tooling is a sibling skill in this repository. It consumes a finished /deep-research run and proposes the Claude Code skills, plugins, and MCP servers worth adopting to act on what the research surfaced — relevance-ranked against your work profile and trust-graded for supply-chain safety. It never installs anything.

Property Detail
Invoke /suggest-tooling <run-dir>, or set --suggest-tooling on a deep-research run to delegate automatically at the end of Phase 6
Six discovery channels GitHub, the MCP Registry, Claude Code marketplaces, Vercel skills, Smithery, and awesome-* lists (seed-only) — each independently degradable
Trust tiers VERIFIED · MAINTAINED · COMMUNITY · CAUTION, derived deterministically from official/verified-namespace status, signed provenance, maintenance recency, and a fake-signal divergence gate — kept orthogonal to the relevance rank
Never auto-installs Install commands are shown as text, never executed. Every listing and README is untrusted data (anti-pattern A6) — parsed for identifiers, never obeyed
Output research-toolbox.md + a research-toolbox.json sidecar (the 5th artifact)
Determinism Ranking and tiering run entirely in suggest-tooling/scripts/marketplace_rank.py — stdlib-only, zero-network, zero-LLM — reusing the fake-star divergence gate extracted from scripts/github_rank.py

It was itself designed by dogfooding a /deep-research run on the discovery landscape; that graded report is kept under docs/superpowers/ as the design's evidence base.


Troubleshooting

Tavily MCP not registered

The skill halts at Phase 0 if mcp__tavily__* tools are not visible. Register the Tavily remote MCP server at user scope:

claude mcp add --scope user tavily --transport http https://mcp.tavily.com/mcp/

Then re-invoke.

Tavily rate limit (429)

Research endpoint is capped at 20 req/min. The skill backs off at 30s → 60s → 120s. On persistent 429, affected sub-questions degrade automatically to tavily_search + manual decomposition. Documented in the Methodology note of the final report.

Exhaustive run came back with < 100 sources

The skill runs a single expansion round — allowlist broadened to the Tier 1+2 union, 2–4 contextual/recency sub-questions added — before proceeding to Phase 4. If it still falls short after that round, it proceeds anyway and the Methodology note documents why (e.g., narrow topic, paywall-dominant domain). The 100+ target is a quality calibration, not a hard contract.

A claim I expected to see ended up in Needs Verification

The corroboration threshold is --min-corroboration 2 by default. A single Tier 1 source with no second independent corroborator yields credibility 2 (PROBABLY TRUE); a single Tier 2 source yields credibility 3 — both still in the main body with inline tags. Credibility 4–6 (contradicted, only Tier 3/4 support, etc.) routes to Needs Verification. Raise --min-corroboration 3 for stricter runs, or examine research-evidence.json for the exact support graph.

I want to override the tier profile

Use --profile academic|technical|current-affairs|mixed or pass --domains directly. User-specified domains are unioned with the tier profile; they are never silently dropped. Any --domains entry below Tier 2 trips a one-round AskUserQuestion confirmation in Phase 0 before retrieval, and the decision is recorded in research-plan.md.

Output language mismatch

--lang wins over the question's language. The skill preserves original-language proper nouns (legislation names, institution names, paper titles) inside the translated report to keep citation traceability.


Extending

Goal How
Tune the tier registry Edit references/methodology.md §6. Add domains to Tier 1/2; rebuild the include_domains preview at the top of the plan template.
Adjust quality gates Edit references/quality-gate.md. Thresholds are deterministic; raising groundedness to 0.98 simply triggers more CRAG iterations.
Add a sub-question category Edit SKILL.md Phase 0 step 4 and mirror in references/research-plan-template.md.
Change default length calibration Edit the "Length calibration" table in references/methodology.md.
Swap Tavily for another MCP Edit references/tool-routing.md and the Phase 1 / Phase 4 call templates in SKILL.md. Keep the grading phases intact — they are MCP-agnostic.
Feed the newsletter-signal corpus Drop redacted briefs/YYYY-MM.jsonl files (schema: tests/schema/newsletter-corpus-record.schema.json) into ~/.claude/deep-research/newsletter-corpus/ — user-scope, outside this repo, sibling to experts.yaml. A producer (e.g. a newsletter agent) commits them via the GitHub Contents API; the skill reads a local clone. See references/newsletter-signal.md.
Tune the tooling recommender Edit suggest-tooling/references/tooling-categories.md (the closed category taxonomy) and the hat weights in ~/.claude/deep-research/tooling-hats.json (user-scope; template ships as suggest-tooling/tooling-hats.json.example). The category set is parity-checked against the ranker in CI.

Roadmap

Shipped

Status Capability
Done 7-phase pipeline with autonomous pre-flight refinement
Done NATO Admiralty 2×6 grading with deterministic credibility assignment
Done CRAG grounding loop (2 iterations max, graceful Needs-Verification fallback)
Done Unicode homograph defense (punycode normalization of every domain)
Done 100+ source exhaustive calibration
0.3.0 Conditional retrieval sources — GitHub SOTA-repo discovery, academic open-graph pipeline (OpenAlex / arXiv / Semantic Scholar), Context7 docs
0.4.0 Newsletter-signal conditional source — folds the maintainer's curated daily briefs into work-relevant runs via a local FTS5 corpus (routing signal, never cited)
0.3.0 Claude-Code-native model tiers (Opus default, Fable 5 opt-in) + critical rigor profile
0.3.0 Decorrelated entailment judge (different Claude model) — breaks LLM-as-judge circularity with no external key
0.3.0 MBFC credibility overlay on the tier registry (flag / downgrade at the margin)
0.3.0 Citation graph export — BibTeX / RIS for academic handoff
0.3.0 Five-layer eval harness + frozen benchmark test set, secret-gated CI judges
Done suggest-tooling companion skill + --suggest-tooling delegation flag — trust-graded skill/plugin/MCP recommender across six discovery channels, never auto-installing

Considered / deferred

Status Item Why
Deferred Exa findSimilar / Valyu full-text academic index Evaluated during the 0.3.0 academic-pipeline work; the open-graph pipeline (OpenAlex / arXiv / Semantic Scholar) shipped instead. Revisit only if a recall benchmark shows a material gap, and only as an optional, key-gated source
Dropped External-model judge (Gemini / GPT-4) Superseded by the decorrelated Claude judge above; the zero-API-key contract (consumers are Claude Max users) rules out external providers
Dropped Quality benchmark vs Perplexity Deep Research The harness ships with absolute quality gates instead — groundedness, source-quality, corroboration, entailment — tracked per run over a frozen test set. Perplexity is no longer the reference comparator

Research Foundation

The full methodology is in deep-research-report.md — a standalone intelligence brief on state-of-the-art web search techniques for AI agents. The references/methodology.md file is a near-verbatim distillation of that report, with every skill rule back-referenced to its source section ([R§n]).

The report synthesizes these sources, each directly wired into a specific part of the skill:

Source Used for Skill location
Perplexity Deep Research 5-stage retrieve-reason-refine loop, tiered source preference SKILL.md Phase 0–6 shape
Anthropic · Multi-agent research system Orchestrator-worker pattern, "start wide then narrow", SEO-farm failure mode Phase 1 parallel retrieval, prompt structure
Anthropic · Web search tool (domain controls) allowed_domains / blocked_domains, Unicode homograph defense Phase 2 domain normalization
Anthropic · Building effective agents Extended / interleaved thinking, source quality as first-class metric Phase 0 decomposition, Phase 3 rerank
Anthropic · Agent evaluation guide (2026) Groundedness + Coverage + Source Quality triad Phase 5 CRAG gates
Tavily API documentation search_depth=advanced, include_domains, score filter, Research endpoint Phase 1, Phase 4 tool routing
Corrective RAG (CRAG) Post-synthesis claim grading and re-query loop Phase 5
LevelRAG / PRISM query decomposition CoT decomposition patterns for multi-hop queries Phase 0
NATO Admiralty source grading A–F × 1–6 matrix for intelligence-grade provenance Phase 2, Phase 6
CRAAP Test Currency · Relevance · Authority · Accuracy · Purpose automation Phase 2 filter gates
OSINT five-step validation Independent corroboration, primary-source chain, confidence-trap defense Phase 5, anti-patterns

A curated domain tier registry (≈ 60 Tier 1 domains + ≈ 40 Tier 2) is embedded in references/methodology.md §6. Full bibliography and methodology rationale in deep-research-report.md.


Repository layout

deep-research/
├── .claude/CLAUDE.md                      # maintainer spec anchor
├── README.md                              # this file
├── LICENSE                                # MIT
├── SKILL.md                               # skill entry point
├── deep-research-report.md                # methodology source of truth
├── scripts/
│   ├── verify_gates.py                    # deterministic gate verification (stdlib-only, zero network)
│   ├── github_rank.py                     # composite GitHub-repo ranking + exported fake_star_gate() (zero network)
│   ├── academic_graph.py                  # dual-track paper ranking + BibTeX/RIS export (zero network)
│   ├── newsletter_search.py               # newsletter-signal corpus search (in-memory FTS5, zero network)
│   └── eval_harness/                      # 5-layer verification harness (judge prompts + secret-gated CI runner)
├── experts.yaml.example                   # anonymous template — the real seed lives user-scope, outside the repo
├── references/
│   ├── methodology.md
│   ├── tool-routing.md
│   ├── report-structure.md
│   ├── quality-gate.md
│   ├── anti-patterns.md
│   ├── research-plan-template.md
│   ├── model-tiers.md                     # model-tier policy (opus default, fable opt-in)
│   ├── github-research.md                 # GitHub SOTA-repo discovery
│   ├── academic-research.md               # scholarly open-graph pipeline
│   ├── newsletter-signal.md               # curated-feed routing source (local FTS5 corpus)
│   └── examples.md                        # worked examples (read on demand)
├── suggest-tooling/                       # companion skill — trust-graded tooling recommender
│   ├── SKILL.md                           # entry point, 6-connector orchestration, no-auto-install
│   ├── references/                        # tooling-discovery / tooling-categories / toolbox-output
│   ├── scripts/marketplace_rank.py        # composite rank + trust tiers (stdlib-only, zero network)
│   ├── tooling-hats.json.example          # hat-weight template (real file lives user-scope)
│   └── evals/                             # loading + e2e fixtures for the five failure modes
├── docs/superpowers/                      # design specs, plans, and the dogfood research run
├── examples/eu-ai-act-2026/               # end-to-end fixture (4 artifacts, gate-conformant)
├── evals/                                 # loading / progressive / e2e + sycophancy-probes + benchmark-testset + rubric
├── CHANGELOG.md                           # semver release history (append-only)
├── gotchas-log.md                         # maintainer traps + perishable-asset cadences
├── tests/                                 # cross-reference / provenance / schema / invariant checks
│   ├── check-cross-references.sh
│   ├── check-provenance.sh
│   ├── check-schema.sh
│   ├── check-example-invariants.sh
│   ├── check-newsletter-search.sh
│   ├── check-marketplace-rank.sh          # suggest-tooling ranker unit + regression checks
│   ├── schema/{research-sources,research-evidence}.schema.json
│   └── fixtures/ → examples/eu-ai-act-2026/*.json
└── .github/workflows/validate.yml         # CI — runs the check suite on push + PR

Sync model

The canonical copy is this repository. If you symlink instead of clone (e.g., ln -s "$PWD" ~/.claude/skills/deep-research from a working directory that holds the repo), edits propagate to Claude Code immediately. The default gh repo clone hashbulla/deep-research ~/.claude/skills/deep-research install creates a regular clone — commit upstream from that directory to publish changes.


MIT

About

A human-gated, 7-phase agentic deep-research skill for Claude Code. Fans out across the open web (Tavily), GitHub, academic open-graphs (OpenAlex/arXiv/Semantic Scholar), and live library docs — then grades every source on the NATO Admiralty A–F scale, runs deterministic quality gates, and emits four cited artifacts.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors