Auto-generate evidence definitions from ClinVar expert panel and GWAS tier-1 data #40

Closed
steve228uk wants to merge 30 commits into main from claude/add-clingen-evidence-w5sm6

@steve228uk

  • Add makeGenericDefinition() factory to evidencePack.ts so high-quality
    ClinVar/GWAS findings get the same full curated-card treatment as the
    12 hand-crafted EVIDENCE_DEFINITIONS (genotype-specific summaries, tone,
    coverage tracking)

  • Add src/lib/autoDefinitions.ts (generated by buildDefinitions.ts) that
    exports AUTO_DEFINITION_PARAMS as plain data; evidencePack.ts maps it
    through the factory, avoiding circular imports

  • Add scripts/buildDefinitions.ts: mines ClinVar for expert-panel/practice-
    guideline reviewed variants (3★/4★) and GWAS Catalog for p ≤ 1e-10
    associations with replication and a numeric OR/Beta; writes autoDefinitions.ts

  • Replace hand-maintained records.json with buildDefinitionRecords() in
    buildEvidencePack.ts, which auto-extracts PMIDs and notes from source data
    for each EVIDENCE_DEFINITION marker set

  • Bulk record quality improvements in buildEvidencePack.ts:

    • GWAS: restore missing https:// prefix on all URLs
    • GWAS: deduplicate genes array before slicing
    • GWAS: extract OR/Beta → repute, CI + sample size in detail,
      risk allele frequency → frequencyNote, author/journal/accession in notes,
      p-value-based evidence tier (≤ 1e-10 = high)
    • ClinVar: extract PMIDs from OtherIDs column, NumberSubmitters in
      detail, LastEvaluated in notes
  • Wire evidence:definitions:build into evidence:update and
    evidence:update:monthly pipelines in package.json
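
For illustration, the GWAS selection rule described above (p ≤ 1e-10, replicated, numeric OR/Beta) could be written as a small predicate. This is a minimal sketch, not the code in buildDefinitions.ts; the `GwasAssociation` shape and the `isHighTier` and `isTier1Gwas` names are assumptions made for the example.

```typescript
// Hypothetical shape: the real build scripts read these values from
// GWAS Catalog TSV columns rather than from a typed object.
interface GwasAssociation {
  pValue: number;
  replicated: boolean;
  orBeta: number | null; // null when the catalog value is missing or non-numeric
}

const HIGH_TIER_P_THRESHOLD = 1e-10;

// True when the p-value alone qualifies for the "high" evidence tier.
function isHighTier(pValue: number): boolean {
  return pValue <= HIGH_TIER_P_THRESHOLD;
}

// True when an association meets all three mining criteria from the description.
function isTier1Gwas(a: GwasAssociation): boolean {
  return (
    isHighTier(a.pValue) &&
    a.replicated &&
    a.orBeta !== null &&
    Number.isFinite(a.orBeta)
  );
}
```

Keeping the threshold in one constant is one way to avoid the kind of drift between build scripts that a later commit in this PR fixes.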

https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP

claude and others added 28 commits April 27, 2026 22:52
… tier-1 data

https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
- New scripts/tsvUtils.ts: splitTsv, tsvRows, column, normalizeRsid,
  singleBaseAllele, extractRsids, extractPmids — removes 4 copies of
  identical streaming/parsing helpers across build scripts
- New scripts/definitionMarkers.ts: DEFINITION_MARKERS, MANUAL_RSIDS,
  DEFINITION_TITLES — single source of truth for the 12 hand-crafted
  definition rsids (was duplicated between buildDefinitions and
  buildEvidencePack with a sync hazard)
- Fix renderParams() in buildDefinitions: was emitting makeGenericDefinition()
  calls in generated output but autoDefinitions.ts only imports the type,
  which would cause a runtime error when the array is populated
- Fix clinvarRecord() detail field: removed duplicate "Review status: X"
  that appeared in both detail and notes[0]
- Fix gwasRecords() detail: use pre-parsed pValue variable instead of
  re-reading the column
- Fix buildDefinitionRecords() titles: use human-readable DEFINITION_TITLES
  lookup instead of machine ID strings like "medical-apoe"
- Remove redundant !== "" check in frequencyNote (short-circuit already
  handles empty strings)
- Align GWAS evidence tier threshold: buildDefinitions now uses p≤1e-10
  for "high" (was p≤1e-20), matching buildEvidencePack
- Remove deduplicateById() from buildDefinitions: ClinVar and GWAS use
  structurally distinct id prefixes and each source is already internally
  deduplicated, so the function was a no-op
- Optimize mineGwas(): store only 7 needed fields in bestByRsid instead
  of entire row objects, reducing memory pressure on large GWAS files

https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
After the evidence pack build runs, autoDefinitions.ts can contain
tens of thousands of TypeScript object literals. When vitest imports
evidencePack.ts (e.g. via reportEngine.test.ts), V8 parses the entire
generated file at startup, consuming 4GB+ of heap and aborting.

Add a test alias that redirects any import ending in /autoDefinitions
to an empty stub. The test suite only exercises the 12 hand-crafted
definitions and the framework, not the auto-generated associations.

https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
The previous fix had two problems:

1. The regex find /\/autoDefinitions$/ matched correctly, but Vite uses
   String.replace(find, replacement) internally, which replaces only the
   matched substring, not the full specifier. So "./autoDefinitions" became
   "." + absoluteStubPath, an invalid path that caused "Does the file exist?".

2. A Vite alias is less reliable than vi.mock() in the setup file for
   intercepting transitive imports; vi.mock() works at the module registry
   level before any file loading occurs.

Changes:
- src/test/setup.ts: add vi.mock("../lib/autoDefinitions", ...) with empty
  AUTO_DEFINITION_PARAMS array; this is the primary intercept mechanism
- vite.config.ts: fix the alias regex to /^.*\/autoDefinitions(\.ts)?$/
  (full-string anchored) so String.replace swaps the entire specifier
  correctly; kept as belt-and-suspenders for Vite's import-analysis stage
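
The difference between the two patterns can be reproduced with plain String.replace, independent of Vite. This is a small sketch; `stubPath` and `applyAlias` are hypothetical names, and the stub location is made up for the example.

```typescript
// Hypothetical stub path; the real one is whatever the alias points to.
const stubPath = "/repo/src/test/autoDefinitions.stub.ts";

// Mimics a regex alias: the specifier is rewritten with String.prototype.replace,
// so only the matched substring is swapped for the replacement.
function applyAlias(specifier: string, find: RegExp, replacement: string): string {
  return specifier.replace(find, replacement);
}

const unanchored = /\/autoDefinitions$/;
const anchored = /^.*\/autoDefinitions(\.ts)?$/;

// The leading "." survives because it is outside the match.
const broken = applyAlias("./autoDefinitions", unanchored, stubPath);

// The whole specifier is matched, so the whole specifier is replaced.
const fixed = applyAlias("./autoDefinitions", anchored, stubPath);

// Unrelated specifiers do not match and pass through unchanged.
const untouched = applyAlias("./evidencePack", anchored, stubPath);

console.log(broken); // "./repo/src/test/autoDefinitions.stub.ts"
console.log(fixed); // "/repo/src/test/autoDefinitions.stub.ts"
console.log(untouched); // "./evidencePack"
```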

https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
The evidence build generates autoDefinitions.ts with tens of thousands of
TypeScript object literals. tsc -b allocates 4GB+ inferring types for them;
vite build then OOMs bundling them. Bundling this data is also wrong for
production — 50k definitions would add ~10-50MB to the JS bundle that every
user downloads.

Evidence data belongs in the lazy-loaded shards, not compiled TypeScript.
The 12 hand-crafted EVIDENCE_DEFINITIONS with custom multi-marker evaluate
functions are the only entries that justify TS code; everything else is
already covered by the ClinVar/GWAS bulk shard records.

Changes:
- evidencePack.ts: remove AUTO_DEFINITION_PARAMS import and spread from
  EVIDENCE_DEFINITIONS; makeGenericDefinition and GenericDefinitionParams
  are kept for future use
- package.json: remove evidence:definitions:build from evidence:update and
  evidence:update:monthly pipelines; keep it as a standalone script so it
  can still be run as a research/development tool
- update-evidence-pack.yml: add NODE_OPTIONS --max-old-space-size=6144 to
  test and build steps as belt-and-suspenders for the full source tree

https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
- Report engine: compute tone/coverage/summary dynamically from riskAllele
  + matched genotype instead of using static record fields. Coverage is now
  "missing" when the marker wasn't in the upload; tone escalates to caution/good
  when the user carries risk copies.
- Schema: add riskSummary? to EvidencePackRecord for genotype-specific narrative.
- Build pipeline: populate riskSummary for expert-panel ClinVar and tier-1 GWAS
  (p ≤ 1e-10, replicated) records in buildEvidencePack.ts.
- Move riskSummaryForClinvar/riskSummaryForGwas to tsvUtils.ts so both build
  scripts share them without duplication.
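
As a rough sketch of the dynamic computation described in the first bullet, the function below counts risk-allele copies in the matched genotype and derives coverage and tone from that count. The type names, the `"found"` coverage value, and the `protective` flag are assumptions for illustration; only `"missing"`, `"caution"`, and `"good"` come from the commit text.

```typescript
type Coverage = "missing" | "found"; // "found" is a placeholder name
type Tone = "neutral" | "caution" | "good";

interface DynamicFields {
  coverage: Coverage;
  copies: number; // number of risk-allele copies in the genotype (0, 1, or 2)
  tone: Tone;
}

// genotype is e.g. "AG", or undefined when the marker was not in the upload.
function computeDynamicFields(
  riskAllele: string,
  genotype: string | undefined,
  protective = false,
): DynamicFields {
  if (genotype === undefined) {
    return { coverage: "missing", copies: 0, tone: "neutral" };
  }
  const copies = genotype.split("").filter((allele) => allele === riskAllele).length;
  // Tone only escalates when at least one copy of the allele is carried.
  const tone: Tone = copies === 0 ? "neutral" : protective ? "good" : "caution";
  return { coverage: "found", copies, tone };
}
```

For example, `computeDynamicFields("A", "AG")` reports one copy with a caution tone, while an absent marker reports missing coverage and stays neutral.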

https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
Sets qualityTier: "tier-1" on expert-panel ClinVar entries and replicated
GWAS records (p ≤ 1e-10), mirroring the riskSummary population sites.
UI badge logic can key off this field without any further build changes.

https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
Wraps fetchJson with up to 5 attempts (2s/4s/8s/16s delays) for 429
and 5xx responses, which covers the 502s seen in CI. Network errors
are also retried. Non-retryable status codes still fail immediately.

https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
- Raise ClinVar and GWAS caps from 50k to 100k
- Deduplicate GWAS records by (rsid, primary trait) before capping, so
  the best single association per SNP-trait pair fills the pack rather
  than duplicate rows from different studies
- Add scripts/syncCpic.ts: fetches CPIC level A/B gene-drug pairs and
  rsid-annotated variants from the CPIC API, caches to .evidence-cache/cpic/
- Add scripts/syncPharmgkb.ts: downloads PharmGKB clinical annotations ZIP,
  parses levels 1A-2B into .evidence-cache/pharmgkb/annotations.json
- Add buildCpicRecords() and buildPharmgkbRecords() in buildEvidencePack.ts
- Add evidence:cpic:sync and evidence:pharmgkb:sync scripts; wire both into
  evidence:update and evidence:update:monthly pipelines
- Add PharmGKB to sourceMetadata and source checksums
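
The dedupe-before-cap step could look roughly like the sketch below. It assumes "best" means the lowest p-value, which the commit message does not state, and the `GwasRow` shape and `dedupeAndCap` name are hypothetical.

```typescript
// Hypothetical minimal shape; the real GWAS rows carry many more fields.
interface GwasRow {
  rsid: string;
  trait: string;
  pValue: number;
}

// Keep the strongest association per (rsid, trait) pair, then apply the cap
// to the deduplicated list so duplicates cannot crowd out distinct findings.
function dedupeAndCap(rows: GwasRow[], cap: number): GwasRow[] {
  const best = new Map<string, GwasRow>();
  for (const row of rows) {
    const key = row.rsid + "\t" + row.trait.toLowerCase();
    const seen = best.get(key);
    // Assumption: the lowest p-value is the "best" association.
    if (seen === undefined || row.pValue < seen.pValue) {
      best.set(key, row);
    }
  }
  return Array.from(best.values())
    .sort((a, b) => a.pValue - b.pValue)
    .slice(0, cap);
}
```

Capping after deduplication matters because a cap applied first could fill the pack with repeated rows for a handful of well-studied SNPs.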

https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
- Add PharmGKB and ClinGen to SOURCE_LIBRARY with evidence notes,
  population notes, chip caveats, and disclaimers
- Update DataSourcesCard: add Drug response row (PharmGKB, CPIC) and
  ClinGen to Clinical row
- Update iconForSource: PharmGKB → target, ClinGen → shield; fix order
  so "clin" check doesn't shadow more specific patterns
- Add scripts/syncClinGen.ts: downloads ClinGen gene-disease validity
  CSV, filters to Definitive/Strong/Moderate, caches to .evidence-cache/clingen/
- Add buildClinGenRecords() in buildEvidencePack.ts: builds gene→rsid
  map from ClinVar records then creates one record per gene-disease pair
  with tier-1 on Definitive/Strong classifications
- Add ClinGen to sourceMetadata, sourceChecksums, main() pipeline
- Add evidence:clingen:sync script; wire into update pipelines

https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
Caches ~/.bun/install/cache keyed on bun.lock hash to skip package
reinstallation on unchanged dependencies. Caches evidence source
directories (excluding large dbsnp VCFs) with a monthly key so
workflow_dispatch re-runs within the same month skip re-downloading
sources the sync scripts would otherwise re-fetch.

https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
…arallelize builders

- Add splitCsv() to tsvUtils.ts and use it in syncClinGen (fixes header parsing
  for quoted commas; removes inline CSV loop)
- Extract fetchWithRetry() to fetchUtils.ts; remove duplicate retry loops from
  syncCpic and syncSnpedia
- Parallelize the two CPIC API calls and their writeFile calls in syncCpic
- Inline GWAS deduplication into the streaming loop to eliminate the intermediate
  raw[] array
- Parallelize sourceChecksums() with Promise.all across all six source files
- Run all independent record builders concurrently in main() via Promise.all;
  ClinGen still awaits after ClinVar since it depends on the result
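
One way to run independent builders concurrently while keeping the ClinVar to ClinGen dependency is sketched below. The builder functions are stand-ins with fixed return values, not the real ones in buildEvidencePack.ts, and the actual main() may order its awaits differently.

```typescript
type EvidenceRecord = { id: string; gene?: string };

// Stand-in builders; each resolves to a small fixed list for the example.
const buildClinvar = async (): Promise<EvidenceRecord[]> => [
  { id: "clinvar-1", gene: "APOE" },
];
const buildGwas = async (): Promise<EvidenceRecord[]> => [{ id: "gwas-1" }];
const buildClinGen = async (clinvar: EvidenceRecord[]): Promise<EvidenceRecord[]> =>
  clinvar
    .filter((r) => r.gene !== undefined)
    .map((r) => ({ id: "clingen-" + r.gene, gene: r.gene }));

async function buildAll(): Promise<EvidenceRecord[]> {
  const clinvarPromise = buildClinvar();
  // ClinGen chains off the ClinVar result, so it waits for ClinVar only
  // and does not hold up the other builders.
  const clinGenPromise = clinvarPromise.then(buildClinGen);
  const [clinvar, gwas, clinGen] = await Promise.all([
    clinvarPromise,
    buildGwas(),
    clinGenPromise,
  ]);
  return [...clinvar, ...gwas, ...clinGen];
}
```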

https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
The PostgREST API returns 400 when a column name doesn't match the schema.
guidelineName doesn't exist (CPIC uses snake_case); url is also uncertain.
Simplify the select to the three columns we actually need: genesymbol,
drugname, level. Always use the fallback cpicpgx.org/guidelines/ URL.

https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
- Fetch /pair without select or filter to avoid 400s from column-name
  mismatches; filter A/B client-side instead
- Try /variant then /allele as fallback for rsid data (table name varies
  across API versions); return [] if both endpoints fail
- Use Promise.allSettled so one endpoint failure doesn't crash the other
- Always write cache files (empty arrays if needed) and exit 0, so
  the monthly evidence pipeline continues even when CPIC API is down
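
The fallback and isolation behaviour in the bullets above might be sketched as follows. The fetchers are injected so the example stays self-contained; `firstSuccessful` and `syncSketch` are hypothetical names, not functions from syncCpic.ts.

```typescript
type Fetcher<T> = () => Promise<T[]>;

// Try each source in order and return the first result; [] if every one fails.
async function firstSuccessful<T>(sources: Fetcher<T>[]): Promise<T[]> {
  for (const source of sources) {
    try {
      return await source();
    } catch {
      // Fall through to the next source.
    }
  }
  return [];
}

// Run both fetches independently so a failure in one does not discard the other.
async function syncSketch<P, V>(
  fetchPairs: Fetcher<P>,
  variantSources: Fetcher<V>[],
): Promise<{ pairs: P[]; variants: V[] }> {
  const [pairs, variants] = await Promise.allSettled([
    fetchPairs(),
    firstSuccessful(variantSources),
  ]);
  return {
    pairs: pairs.status === "fulfilled" ? pairs.value : [],
    variants: variants.status === "fulfilled" ? variants.value : [],
  };
}
```

Because both fields always resolve to an array, a caller can write the cache files unconditionally, which matches the "always write, exit 0" behaviour in the last bullet.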

https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
CPIC is fast and can fail early; SNPedia is slow so moving it later
avoids waiting an hour before seeing a CPIC failure.

https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
… rsids

Research findings (cpicpgx/cpic-data wiki):
- Correct endpoint is /pair_view, not /pair — raw pair table has drugid not drugname
- Level column is cpiclevel, not level
- No /variant endpoint exists; rsid data is in allele_definition joined via
  allele_location_value → sequence_location(rsid)

Changes:
- fetchPairs: /pair_view?cpiclevel=in.(A,B) — server-side filter, correct table
- fetchVariants: allele_definition with embedded resource join for rsid extraction,
  deduplicates (rsid, gene) pairs, returns [] gracefully if the join fails

https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
…Retry retry bug

fetchUtils.ts: 4xx errors were being retried 5 times because the throw inside
the try block was caught by the same iteration's catch. Replace throw with break
so non-retryable errors exit the loop immediately without burning 30s of backoff.
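
A minimal sketch of the corrected control flow follows. It is not the code in fetchUtils.ts: the request and the sleep are injected as parameters so the example runs without a network, and `withRetry`, `HttpResult`, and `isRetryable` are names invented here.

```typescript
type HttpResult = { ok: boolean; status: number };

// Retryable statuses per the earlier commit: 429 and 5xx.
function isRetryable(status: number): boolean {
  return status === 429 || status >= 500;
}

async function withRetry(
  attempt: () => Promise<HttpResult>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts = 5,
): Promise<HttpResult> {
  let lastError: unknown = new Error("no attempts made");
  for (let i = 0; i < maxAttempts; i++) {
    try {
      const res = await attempt();
      if (res.ok) return res;
      lastError = new Error("HTTP " + res.status);
      // Non-retryable status: leave the loop rather than throwing here,
      // because a throw would land in the catch below and be retried.
      if (!isRetryable(res.status)) break;
    } catch (err) {
      lastError = err; // network error: fall through to the backoff and retry
    }
    if (i < maxAttempts - 1) await sleep(2000 * 2 ** i); // 2s, 4s, 8s, 16s
  }
  throw lastError;
}
```

With this shape a 404 fails after a single attempt, while a persistent 502 uses all five attempts and the full 30 seconds of backoff.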

syncCpic.ts: Two fixes to the allele_definition embedded resource query:
- PostgREST requires columns selected from intermediate table before embedding
  a nested resource; change to allele_location_value(*,sequence_location(dbsnpid))
- The rsid column in sequence_location is "dbsnpid", not "rsid" (from CPIC wiki)

syncClinGen.ts: ClinGen CSV headers are all-uppercase (CLASSIFICATION, GENE SYMBOL,
DISEASE LABEL, ONLINE REPORT). Normalize all header keys with toUpperCase() when
building the row map so lookups match regardless of case. Adds FINAL CLASSIFICATION
as a fallback header name for the classification column.
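
The case-insensitive lookup with a fallback header could be sketched like this. `toRowMap` and `pick` are hypothetical helper names; the header strings are the ones listed in the commit message.

```typescript
// Build a row map keyed by trimmed, upper-cased header names.
function toRowMap(headers: string[], cells: string[]): Map<string, string> {
  const row = new Map<string, string>();
  headers.forEach((header, i) => {
    row.set(header.trim().toUpperCase(), cells[i] ?? "");
  });
  return row;
}

// Return the first non-empty value among the candidate header names.
function pick(row: Map<string, string>, ...names: string[]): string {
  for (const name of names) {
    const value = row.get(name.toUpperCase());
    if (value) return value;
  }
  return "";
}

const exampleRow = toRowMap(
  ["Gene Symbol", "FINAL CLASSIFICATION"],
  ["BRCA1", "Definitive"],
);
console.log(pick(exampleRow, "CLASSIFICATION", "FINAL CLASSIFICATION")); // "Definitive"
```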

https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
CPIC and ClinGen removed from the evidence-sources cache: both are fast
API calls (seconds) with no large files, and caching them caused the broken
April run's empty [] files to be restored every subsequent run this month.

Added workflow_dispatch force_refresh boolean input: when true, passes
--force to all sync scripts so cached heavy sources (ClinVar, GWAS, SNPedia,
PharmGKB) are also re-downloaded. Expanded the run step to wire this up.

https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
Surfaces the specific clinical annotation level (1A, 1B, 2A, 2B) as a
colour-coded badge in both the card list and inspector panel, rather than
embedding it in the title string. Level hierarchy: 1A (green, guideline-backed)
> 1B (blue, high evidence) > 2A (amber) > 2B (muted amber). The title no longer
includes the redundant "(PharmGKB level X)" suffix.

https://claude.ai/code/session_014eBd1rwo69GTnRLLMpqsBV
… 0 results

Logs the parsed CSV headers and a sample of classification column values
whenever parseClinGenCsv returns 0 records, so the next CI run reveals
whether the issue is header mismatch or unexpected classification strings.
Also strips UTF-8 BOM and adds DISEASE ID (MONDO) as primary diseaseId key.

https://claude.ai/code/session_01HJ9uGWdTFXf7Mq2owtJSq8
Manual-trigger-only workflow that runs syncClinGen.ts --force and prints
the first 50 lines of the resulting JSON so we can see what classification
values the API is actually returning.

https://claude.ai/code/session_01HJ9uGWdTFXf7Mq2owtJSq8
@vercel

vercel Bot commented Apr 30, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| deana | Ready | Preview, Comment | Apr 30, 2026 8:50pm |


workflow_dispatch is invisible in the UI unless the file is on main.
Add a push trigger scoped to this branch so it fires automatically.

https://claude.ai/code/session_01HJ9uGWdTFXf7Mq2owtJSq8
The ClinGen API was returning an HTML page (Cloudflare bot protection)
instead of CSV because the request had no User-Agent header. Add a
browser-compatible User-Agent. Also throw early if the response body
starts with '<' so the script fails loudly rather than writing 0 records.
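
A guard of the kind described here, combined with the BOM stripping from the earlier logging commit, might look like the sketch below. `assertCsvBody` is a hypothetical name and the error wording is invented for the example.

```typescript
// Strip a leading UTF-8 BOM and reject bodies that look like markup.
function assertCsvBody(body: string): string {
  const text = body.replace(/^\uFEFF/, "").trimStart();
  if (text.startsWith("<")) {
    // An HTML challenge or error page would otherwise parse to zero records.
    throw new Error("Expected CSV but received markup; check the User-Agent header");
  }
  return text;
}

console.log(assertCsvBody("\uFEFFGENE SYMBOL,CLASSIFICATION")); // "GENE SYMBOL,CLASSIFICATION"
```

Failing at this point keeps an HTML page from being cached as an empty record list.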

https://claude.ai/code/session_01HJ9uGWdTFXf7Mq2owtJSq8

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fbb4c5c59f

Comment on lines +1021 to +1023
markerIds: rsids.slice(0, 3),
genes: [cls.gene],
title: `${cls.gene} / ${cls.disease} (ClinGen ${cls.classification})`,

P1: Require variant-level criteria for ClinGen matches

These ClinGen records are built from markerIds only and do not set riskAllele or genotype, so matchEvidenceRecords will treat any observed genotype at one of these rsids as a match. In practice, users can receive a pathogenic/caution ClinGen card just because their file contains a marker in the gene, even when they do not carry a disease-associated allele, which is a high-risk false-positive regression for medical findings.
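
One possible shape for the guard this comment asks for is sketched below. It is not the repository's matchEvidenceRecords; the `ClinGenLikeRecord` type and `carriesRiskAllele` name are assumptions, and it simplifies by using a single risk allele for all markers in a record.

```typescript
// Hypothetical record shape: markerIds as in the snippet above, plus an
// optional riskAllele that the reviewed code does not set.
interface ClinGenLikeRecord {
  markerIds: string[];
  riskAllele?: string;
}

// Match only when an observed genotype at one of the markers contains the
// risk allele. Records without variant-level criteria never match.
function carriesRiskAllele(
  record: ClinGenLikeRecord,
  genotypes: Map<string, string>,
): boolean {
  const allele = record.riskAllele;
  if (allele === undefined) return false;
  return record.markerIds.some((id) => {
    const genotype = genotypes.get(id);
    return genotype !== undefined && genotype.includes(allele);
  });
}
```

Under this rule, a user whose file merely contains a marker in the gene would not receive the card unless the genotype includes the disease-associated allele.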


steve228uk closed this Apr 30, 2026
steve228uk deleted the claude/add-clingen-evidence-w5sm6 branch May 1, 2026 19:23