Auto-generate evidence definitions from ClinVar expert panel and GWAS tier-1 data #40
steve228uk wants to merge 30 commits
… tier-1 data
- Add makeGenericDefinition() factory to evidencePack.ts so high-quality
ClinVar/GWAS findings get the same full curated-card treatment as the
12 hand-crafted EVIDENCE_DEFINITIONS (genotype-specific summaries, tone,
coverage tracking)
- Add src/lib/autoDefinitions.ts (generated by buildDefinitions.ts) that
exports AUTO_DEFINITION_PARAMS as plain data; evidencePack.ts maps it
through the factory, avoiding circular imports
- Add scripts/buildDefinitions.ts: mines ClinVar for expert-panel/practice-
guideline reviewed variants (3★/4★) and GWAS Catalog for p ≤ 1e-10
associations with replication and a numeric OR/Beta; writes autoDefinitions.ts
- Replace hand-maintained records.json with buildDefinitionRecords() in
buildEvidencePack.ts, which auto-extracts PMIDs and notes from source data
for each EVIDENCE_DEFINITION marker set
- Bulk record quality improvements in buildEvidencePack.ts:
  - GWAS: restore missing https:// prefix on all URLs
  - GWAS: deduplicate genes array before slicing
  - GWAS: extract OR/Beta → repute, CI + sample size in detail,
    risk allele frequency → frequencyNote, author/journal/accession in notes,
    p-value-based evidence tier (≤ 1e-10 = high)
  - ClinVar: extract PMIDs from OtherIDs column, NumberSubmitters in
    detail, LastEvaluated in notes
- Wire evidence:definitions:build into evidence:update and
evidence:update:monthly pipelines in package.json
https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
- New scripts/tsvUtils.ts: splitTsv, tsvRows, column, normalizeRsid, singleBaseAllele, extractRsids, extractPmids; removes 4 copies of identical streaming/parsing helpers across build scripts
- New scripts/definitionMarkers.ts: DEFINITION_MARKERS, MANUAL_RSIDS, DEFINITION_TITLES; single source of truth for the 12 hand-crafted definition rsids (was duplicated between buildDefinitions and buildEvidencePack with a sync hazard)
- Fix renderParams() in buildDefinitions: was emitting makeGenericDefinition() calls in generated output but autoDefinitions.ts only imports the type, which would cause a runtime error when the array is populated
- Fix clinvarRecord() detail field: removed duplicate "Review status: X" that appeared in both detail and notes[0]
- Fix gwasRecords() detail: use pre-parsed pValue variable instead of re-reading the column
- Fix buildDefinitionRecords() titles: use human-readable DEFINITION_TITLES lookup instead of machine ID strings like "medical-apoe"
- Remove redundant !== "" check in frequencyNote (short-circuit already handles empty strings)
- Align GWAS evidence tier threshold: buildDefinitions now uses p≤1e-10 for "high" (was p≤1e-20), matching buildEvidencePack
- Remove deduplicateById() from buildDefinitions: ClinVar and GWAS use structurally distinct id prefixes and each source is already internally deduplicated, so the function was a no-op
- Optimize mineGwas(): store only 7 needed fields in bestByRsid instead of entire row objects, reducing memory pressure on large GWAS files
https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
After the evidence pack build runs, autoDefinitions.ts can contain tens of thousands of TypeScript object literals. When vitest imports evidencePack.ts (e.g. via reportEngine.test.ts), V8 parses the entire generated file at startup, consuming 4GB+ of heap and aborting.
Add a test alias that redirects any import ending in /autoDefinitions to an empty stub. The test suite only exercises the 12 hand-crafted definitions and the framework, not the auto-generated associations.
https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
The previous fix had two bugs:
1. The regex find /\/autoDefinitions$/ matched correctly but Vite uses
String.replace(find, replacement) internally, which only replaces the
matched substring — not the full specifier. So "./autoDefinitions" became
"." + absoluteStubPath, an invalid path that caused "Does the file exist?".
2. vi.mock() in the setup file is more reliable than a vite alias for
intercepting transitive imports — it works at the module registry level
before any file loading occurs.
Changes:
- src/test/setup.ts: add vi.mock("../lib/autoDefinitions", ...) with empty
AUTO_DEFINITION_PARAMS array; this is the primary intercept mechanism
- vite.config.ts: fix the alias regex to /^.*\/autoDefinitions(\.ts)?$/
(full-string anchored) so String.replace swaps the entire specifier
correctly; kept as belt-and-suspenders for Vite's import-analysis stage
https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
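The alias bug described in this commit can be reproduced with plain String.replace, without Vite. This is a minimal sketch: the two regexes come from the commit message, while `stubPath` is a hypothetical absolute path chosen only for illustration.

```typescript
// Hypothetical absolute path to the empty stub module (not from the PR).
const stubPath = "/repo/src/test/autoDefinitions.stub.ts";

// Original alias pattern: matches only the trailing "/autoDefinitions".
const partialPattern = /\/autoDefinitions$/;
// Fixed alias pattern: anchored so the match covers the whole specifier.
const anchoredPattern = /^.*\/autoDefinitions(\.ts)?$/;

// String.replace swaps only the matched substring, so the leading "."
// of the relative specifier survives and corrupts the result.
const brokenResult = "./autoDefinitions".replace(partialPattern, stubPath);

// With the anchored pattern the entire specifier is replaced.
const fixedResult = "./autoDefinitions".replace(anchoredPattern, stubPath);
const fixedWithExtension = "../lib/autoDefinitions.ts".replace(anchoredPattern, stubPath);

console.log(brokenResult); // "./repo/src/test/autoDefinitions.stub.ts"
console.log(fixedResult); // "/repo/src/test/autoDefinitions.stub.ts"
console.log(fixedWithExtension); // "/repo/src/test/autoDefinitions.stub.ts"
```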
The evidence build generates autoDefinitions.ts with tens of thousands of TypeScript object literals. tsc -b allocates 4GB+ inferring types for them; vite build then OOMs bundling them.
Bundling this data is also wrong for production: 50k definitions would add ~10-50MB to the JS bundle that every user downloads. Evidence data belongs in the lazy-loaded shards, not compiled TypeScript. The 12 hand-crafted EVIDENCE_DEFINITIONS with custom multi-marker evaluate functions are the only entries that justify TS code; everything else is already covered by the ClinVar/GWAS bulk shard records.
Changes:
- evidencePack.ts: remove AUTO_DEFINITION_PARAMS import and spread from EVIDENCE_DEFINITIONS; makeGenericDefinition and GenericDefinitionParams are kept for future use
- package.json: remove evidence:definitions:build from evidence:update and evidence:update:monthly pipelines; keep it as a standalone script so it can still be run as a research/development tool
- update-evidence-pack.yml: add NODE_OPTIONS --max-old-space-size=6144 to test and build steps as belt-and-suspenders for the full source tree
https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
- Report engine: compute tone/coverage/summary dynamically from riskAllele + matched genotype instead of using static record fields. Coverage is now "missing" when the marker wasn't in the upload; tone escalates to caution/good when the user carries risk copies.
- Schema: add riskSummary? to EvidencePackRecord for genotype-specific narrative.
- Build pipeline: populate riskSummary for expert-panel ClinVar and tier-1 GWAS (p ≤ 1e-10, replicated) records in buildEvidencePack.ts.
- Move riskSummaryForClinvar/riskSummaryForGwas to tsvUtils.ts so both build scripts share them without duplication.
https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
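As a rough illustration of the dynamic tone/coverage computation described above, the sketch below counts risk-allele copies in a matched genotype. It is not the report engine's code: the `RiskRecord` shape, the `protective` flag, the "observed" coverage value, and the "neutral" tone are assumptions; only `riskAllele`, "missing", "caution", and "good" appear in the commit message.

```typescript
// "missing" is named in the commit message; "observed" is an assumed counterpart.
type Coverage = "observed" | "missing";
// "caution" and "good" are named in the commit message; "neutral" is assumed.
type Tone = "neutral" | "caution" | "good";

interface RiskRecord {
  riskAllele: string;
  // Hypothetical flag: true when carrying the allele is favourable.
  protective?: boolean;
}

interface Assessment {
  coverage: Coverage;
  tone: Tone;
  riskCopies: number;
}

// Count how many bases of a genotype such as "CT" equal the risk allele.
function countRiskCopies(genotype: string, riskAllele: string): number {
  const target = riskAllele.toUpperCase();
  return genotype.toUpperCase().split("").filter((base) => base === target).length;
}

function assess(record: RiskRecord, genotype: string | undefined): Assessment {
  // The marker was not in the upload, so nothing can be said about the user.
  if (genotype === undefined) {
    return { coverage: "missing", tone: "neutral", riskCopies: 0 };
  }
  const riskCopies = countRiskCopies(genotype, record.riskAllele);
  const tone: Tone = riskCopies === 0 ? "neutral" : record.protective ? "good" : "caution";
  return { coverage: "observed", tone, riskCopies };
}
```

For example, `assess({ riskAllele: "T" }, "CT")` reports one risk copy with a caution tone, while passing `undefined` as the genotype yields coverage "missing".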
Sets qualityTier: "tier-1" on expert-panel ClinVar entries and replicated GWAS records (p ≤ 1e-10), mirroring the riskSummary population sites. UI badge logic can key off this field without any further build changes. https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
Wraps fetchJson with up to 5 attempts (2s/4s/8s/16s delays) for 429 and 5xx responses, which covers the 502s seen in CI. Network errors are also retried. Non-retryable status codes still fail immediately. https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
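The retry policy above can be sketched as follows. This is an illustrative version under stated assumptions, not the repository's fetchJson wrapper: `FetchLike`, `isRetryable`, `backoffDelayMs`, and `fetchJsonWithRetry` are hypothetical names, and the fetch and sleep functions are injected so the loop can be exercised without a network.

```typescript
// Minimal response shape the sketch needs; the global fetch satisfies it.
type FetchLike = (url: string) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

const MAX_ATTEMPTS = 5;

// 429 and 5xx are retried; every other failing status is treated as final.
function isRetryable(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// Delay before the next attempt: 2s, 4s, 8s, 16s after attempts 1 to 4.
function backoffDelayMs(attempt: number): number {
  return 2000 * 2 ** (attempt - 1);
}

async function fetchJsonWithRetry(
  url: string,
  fetchImpl: FetchLike,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<unknown> {
  let lastError: unknown = new Error(`No attempts made for ${url}`);
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    let response: Awaited<ReturnType<FetchLike>> | undefined;
    try {
      response = await fetchImpl(url);
    } catch (error) {
      lastError = error; // Network error: fall through to the backoff below.
    }
    if (response) {
      if (response.ok) return response.json();
      const httpError = new Error(`HTTP ${response.status} for ${url}`);
      // Thrown outside the try block, so it cannot be swallowed and retried.
      if (!isRetryable(response.status)) throw httpError;
      lastError = httpError;
    }
    if (attempt < MAX_ATTEMPTS) await sleep(backoffDelayMs(attempt));
  }
  throw lastError;
}
```

Keeping the status check outside the `try` is a deliberate choice here: a throw for a non-retryable status inside the same `try` would land in its own `catch` and be retried.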
- Raise ClinVar and GWAS caps from 50k to 100k
- Deduplicate GWAS records by (rsid, primary trait) before capping, so the best single association per SNP-trait pair fills the pack rather than duplicate rows from different studies
- Add scripts/syncCpic.ts: fetches CPIC level A/B gene-drug pairs and rsid-annotated variants from the CPIC API, caches to .evidence-cache/cpic/
- Add scripts/syncPharmgkb.ts: downloads PharmGKB clinical annotations ZIP, parses levels 1A-2B into .evidence-cache/pharmgkb/annotations.json
- Add buildCpicRecords() and buildPharmgkbRecords() in buildEvidencePack.ts
- Add evidence:cpic:sync and evidence:pharmgkb:sync scripts; wire both into evidence:update and evidence:update:monthly pipelines
- Add PharmGKB to sourceMetadata and source checksums
https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
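The dedupe-before-cap step can be sketched like this. It assumes that "best" means the smallest p-value, which the commit message does not state; `GwasRow` and `dedupeBestAssociation` are hypothetical names used only for illustration.

```typescript
interface GwasRow {
  rsid: string;
  trait: string; // primary trait
  pValue: number;
  study: string;
}

// Keep one row per (rsid, trait) pair, then apply the cap.
// Assumption: the "best" association is the one with the smallest p-value.
function dedupeBestAssociation(rows: Iterable<GwasRow>, cap: number): GwasRow[] {
  const bestByKey = new Map<string, GwasRow>();
  for (const row of rows) {
    // NUL separator avoids accidental collisions between rsid and trait text.
    const key = `${row.rsid}\u0000${row.trait.toLowerCase()}`;
    const current = bestByKey.get(key);
    if (current === undefined || row.pValue < current.pValue) {
      bestByKey.set(key, row);
    }
  }
  // Strongest associations first, so the cap drops the weakest pairs.
  return Array.from(bestByKey.values())
    .sort((a, b) => a.pValue - b.pValue)
    .slice(0, cap);
}
```

Because the map is filled row by row, the same logic works inside a streaming loop without holding every raw row in memory.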
- Add PharmGKB and ClinGen to SOURCE_LIBRARY with evidence notes, population notes, chip caveats, and disclaimers
- Update DataSourcesCard: add Drug response row (PharmGKB, CPIC) and ClinGen to Clinical row
- Update iconForSource: PharmGKB → target, ClinGen → shield; fix order so "clin" check doesn't shadow more specific patterns
- Add scripts/syncClinGen.ts: downloads ClinGen gene-disease validity CSV, filters to Definitive/Strong/Moderate, caches to .evidence-cache/clingen/
- Add buildClinGenRecords() in buildEvidencePack.ts: builds gene→rsid map from ClinVar records then creates one record per gene-disease pair with tier-1 on Definitive/Strong classifications
- Add ClinGen to sourceMetadata, sourceChecksums, main() pipeline
- Add evidence:clingen:sync script; wire into update pipelines
https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
Caches ~/.bun/install/cache keyed on bun.lock hash to skip package reinstallation on unchanged dependencies. Caches evidence source directories (excluding large dbsnp VCFs) with a monthly key so workflow_dispatch re-runs within the same month skip re-downloading sources the sync scripts would otherwise re-fetch. https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
…arallelize builders
- Add splitCsv() to tsvUtils.ts and use it in syncClinGen (fixes header parsing for quoted commas; removes inline CSV loop)
- Extract fetchWithRetry() to fetchUtils.ts; remove duplicate retry loops from syncCpic and syncSnpedia
- Parallelize the two CPIC API calls and their writeFile calls in syncCpic
- Inline GWAS deduplication into the streaming loop to eliminate the intermediate raw[] array
- Parallelize sourceChecksums() with Promise.all across all six source files
- Run all independent record builders concurrently in main() via Promise.all; ClinGen still awaits after ClinVar since it depends on the result
https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
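To show why a naive split on "," breaks on quoted commas, here is a minimal quote-aware line splitter. It is a sketch of the general technique, not the actual splitCsv() in tsvUtils.ts; `splitCsvLine` is a hypothetical name, and it handles a single line only (no embedded newlines).

```typescript
// Split one CSV line into fields, honouring double-quoted fields and
// the "" escape for a literal quote inside a quoted field.
function splitCsvLine(line: string): string[] {
  const fields: string[] = [];
  let field = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i);
    if (inQuotes) {
      if (ch === '"') {
        if (line.charAt(i + 1) === '"') {
          field += '"'; // Escaped quote: emit one quote, skip the second.
          i++;
        } else {
          inQuotes = false; // Closing quote.
        }
      } else {
        field += ch; // Commas inside quotes are ordinary characters.
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      fields.push(field);
      field = "";
    } else {
      field += ch;
    }
  }
  fields.push(field);
  return fields;
}
```

A header such as `GENE SYMBOL,"DISEASE, LABEL",CLASSIFICATION` yields three fields here, where `line.split(",")` would yield four.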
The PostgREST API returns 400 when a column name doesn't match the schema. guidelineName doesn't exist (CPIC uses snake_case); url is also uncertain. Simplify the select to the three columns we actually need: genesymbol, drugname, level. Always use the fallback cpicpgx.org/guidelines/ URL. https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
- Fetch /pair without select or filter to avoid 400s from column-name mismatches; filter A/B client-side instead
- Try /variant then /allele as fallback for rsid data (table name varies across API versions); return [] if both endpoints fail
- Use Promise.allSettled so one endpoint failure doesn't crash the other
- Always write cache files (empty arrays if needed) and exit 0, so the monthly evidence pipeline continues even when CPIC API is down
https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
CPIC is fast and can fail early; SNPedia is slow so moving it later avoids waiting an hour before seeing a CPIC failure. https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
… rsids
Research findings (cpicpgx/cpic-data wiki):
- Correct endpoint is /pair_view, not /pair; the raw pair table has drugid, not drugname
- Level column is cpiclevel, not level
- No /variant endpoint exists; rsid data is in allele_definition joined via allele_location_value → sequence_location(rsid)
Changes:
- fetchPairs: /pair_view?cpiclevel=in.(A,B) (server-side filter, correct table)
- fetchVariants: allele_definition with embedded resource join for rsid extraction, deduplicates (rsid, gene) pairs, returns [] gracefully if the join fails
https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
…Retry retry bug
fetchUtils.ts: 4xx errors were being retried 5 times because the throw inside the try block was caught by the same iteration's catch. Replace throw with break so non-retryable errors exit the loop immediately without burning 30s of backoff.
syncCpic.ts: Two fixes to the allele_definition embedded resource query:
- PostgREST requires columns selected from intermediate table before embedding a nested resource; change to allele_location_value(*,sequence_location(dbsnpid))
- The rsid column in sequence_location is "dbsnpid", not "rsid" (from CPIC wiki)
syncClinGen.ts: ClinGen CSV headers are all-uppercase (CLASSIFICATION, GENE SYMBOL, DISEASE LABEL, ONLINE REPORT). Normalize all header keys to toUpperCase() when building the row map so lookups match regardless of case. Adds FINAL CLASSIFICATION as a fallback variant.
https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
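The case-insensitive header lookup for the ClinGen CSV could look roughly like the sketch below. `buildRowMap` and `classificationOf` are hypothetical helper names; the header names and the FINAL CLASSIFICATION fallback come from the commit message, and the BOM strip mirrors a later commit in this thread.

```typescript
// Build a lookup from normalized (trimmed, upper-cased) header to cell value.
function buildRowMap(headers: string[], values: string[]): Map<string, string> {
  const row = new Map<string, string>();
  headers.forEach((header, index) => {
    // Strip a leading UTF-8 BOM, which otherwise sticks to the first header.
    const key = header.replace(/^\uFEFF/, "").trim().toUpperCase();
    row.set(key, (values[index] ?? "").trim());
  });
  return row;
}

// Prefer CLASSIFICATION and fall back to FINAL CLASSIFICATION.
function classificationOf(row: Map<string, string>): string {
  return row.get("CLASSIFICATION") ?? row.get("FINAL CLASSIFICATION") ?? "";
}
```

With this shape, a file whose headers arrive as `Gene Symbol` or `GENE SYMBOL` resolves to the same key, so a casing change upstream does not silently produce zero records.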
CPIC and ClinGen removed from the evidence-sources cache: both are fast API calls (seconds) with no large files, and caching them caused the broken April run's empty [] files to be restored every subsequent run this month. Added workflow_dispatch force_refresh boolean input: when true, passes --force to all sync scripts so cached heavy sources (ClinVar, GWAS, SNPedia, PharmGKB) are also re-downloaded. Expanded the run step to wire this up. https://claude.ai/code/session_014kiKiWBBmhBaRM7Jz99WTP
Surfaces the specific clinical annotation level (1A, 1B, 2A, 2B) as a colour-coded badge in both the card list and inspector panel, rather than embedding it in the title string. Level hierarchy: 1A (green, guideline-backed) > 1B (blue, high evidence) > 2A (amber) > 2B (muted amber). The title no longer includes the redundant "(PharmGKB level X)" suffix. https://claude.ai/code/session_014eBd1rwo69GTnRLLMpqsBV
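One way to model the level hierarchy described above is a small lookup table. The colours and labels for 1A and 1B follow the commit message; the `LEVEL_BADGES` table, the `rank` field, and `stripLevelSuffix` are hypothetical and do not reflect the actual component code.

```typescript
type PharmgkbLevel = "1A" | "1B" | "2A" | "2B";

interface LevelBadge {
  colour: string;
  rank: number; // Lower rank means stronger evidence.
}

// Colours as described in the commit message; rank encodes 1A > 1B > 2A > 2B.
const LEVEL_BADGES: Record<PharmgkbLevel, LevelBadge> = {
  "1A": { colour: "green", rank: 0 },
  "1B": { colour: "blue", rank: 1 },
  "2A": { colour: "amber", rank: 2 },
  "2B": { colour: "muted amber", rank: 3 },
};

// Remove a trailing "(PharmGKB level X)" suffix from a title, if present.
function stripLevelSuffix(title: string): string {
  return title.replace(/\s*\(PharmGKB level [12][AB]\)\s*$/, "");
}

// Sort levels from strongest to weakest evidence.
function sortByLevel(levels: PharmgkbLevel[]): PharmgkbLevel[] {
  return levels.slice().sort((a, b) => LEVEL_BADGES[a].rank - LEVEL_BADGES[b].rank);
}
```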
… 0 results
Logs the parsed CSV headers and a sample of classification column values whenever parseClinGenCsv returns 0 records, so the next CI run reveals whether the issue is header mismatch or unexpected classification strings. Also strips UTF-8 BOM and adds DISEASE ID (MONDO) as primary diseaseId key.
https://claude.ai/code/session_01HJ9uGWdTFXf7Mq2owtJSq8
Manual-trigger-only workflow that runs syncClinGen.ts --force and prints the first 50 lines of the resulting JSON so we can see what classification values the API is actually returning. https://claude.ai/code/session_01HJ9uGWdTFXf7Mq2owtJSq8
workflow_dispatch is invisible in the UI unless the file is on main. Add a push trigger scoped to this branch so it fires automatically. https://claude.ai/code/session_01HJ9uGWdTFXf7Mq2owtJSq8
The ClinGen API was returning an HTML page (Cloudflare bot protection) instead of CSV because the request had no User-Agent header. Add a browser-compatible User-Agent. Also throw early if the response body starts with '<' so the script fails loudly rather than writing 0 records. https://claude.ai/code/session_01HJ9uGWdTFXf7Mq2owtJSq8
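The fail-loudly guard described above can be sketched as a small validation step applied to the response body before parsing. `assertLooksLikeCsv` is a hypothetical name, and the error wording is illustrative only.

```typescript
// Reject bodies that look like HTML (for example a bot-protection page)
// instead of silently parsing them into zero records.
function assertLooksLikeCsv(body: string): string {
  // Drop a UTF-8 BOM and leading whitespace before inspecting the first character.
  const text = body.replace(/^\uFEFF/, "").trimStart();
  if (text.startsWith("<")) {
    throw new Error(
      `Expected CSV but received markup starting with: ${text.slice(0, 40)}`,
    );
  }
  return text;
}
```

Checking only the first character is a cheap heuristic: a CSV header row is not expected to begin with `<`, while an HTML challenge page does.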
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fbb4c5c59f
    markerIds: rsids.slice(0, 3),
    genes: [cls.gene],
    title: `${cls.gene} / ${cls.disease} (ClinGen ${cls.classification})`,
Require variant-level criteria for ClinGen matches
These ClinGen records are built from markerIds only and do not set riskAllele or genotype, so matchEvidenceRecords will treat any observed genotype at one of these rsids as a match. In practice, users can receive a pathogenic/caution ClinGen card just because their file contains a marker in the gene, even when they do not carry a disease-associated allele, which is a high-risk false-positive regression for medical findings.
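To make the concern concrete, the sketch below contrasts presence-only matching with matching that requires a risk allele. It is a simplified model, not the project's matchEvidenceRecords: the `EvidenceRecordLike` shape, both function names, and the use of a single `riskAllele` for all markers are assumptions for illustration.

```typescript
interface EvidenceRecordLike {
  markerIds: string[];
  riskAllele?: string;
}

// Presence-only matching: any observed genotype at a listed marker counts.
function matchesByPresence(record: EvidenceRecordLike, genotypes: Map<string, string>): boolean {
  return record.markerIds.some((id) => genotypes.has(id));
}

// Variant-level matching: records without a risk allele never match, and
// records with one match only when the genotype actually carries it.
function matchesByRiskAllele(record: EvidenceRecordLike, genotypes: Map<string, string>): boolean {
  if (record.riskAllele === undefined) return false;
  const risk = record.riskAllele.toUpperCase();
  return record.markerIds.some((id) => {
    const genotype = genotypes.get(id);
    return genotype !== undefined && genotype.toUpperCase().indexOf(risk) !== -1;
  });
}
```

Under this model a gene-level record with only `markerIds` matches every user whose file contains the marker under presence-only matching, and no one under variant-level matching, which is the false-positive risk the review points at.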