perf(import): bulk-load tab-delim matrix + genetic_entity inserts by inodb · Pull Request #160 · cBioPortal/cbioportal-core

inodb · 2026-05-29T19:49:01Z

Summary

Two bulk-load fixes for v7 tab-delim matrix imports:

Enable ClickHouseBulkLoader for every tab-delim matrix import in ImportTabDelimData (mRNA + z-scores, methylation, RPPA + z-scores, log2 CNA, generic assay, …). Previously only the discretized CNA profile got the bulk path.
Bulk-load genetic_entity inserts in DaoGeneticEntity.addNewGeneticEntity, plus an explicit ClickHouseBulkLoader.flushAll() in ImportGenericAssayEntity.importData() so that buffered entities are visible to the SELECT inside ImportTabDelimData (both run in the same JVM for GENERIC_ASSAY profiles).

Measured impact

Full import of TCGA Pancan ACC (~92 samples, 17 data files including methylation_hm450 with ~400K rows) against an in-cluster ClickHouse:

Path	Wall time
Before: per-row JDBC inserts everywhere	25+ min, with several files producing no output for 5–10 min each
After: bulk loader for both genetic_alteration + genetic_entity	1m52s end-to-end

Representative per-file times after the fix: data_log2_cna.txt 1.5s, data_methylation_hm27_hm450_merged.txt 2.8s, data_methylation_hm450.txt ~5s, mRNA z-score files 1.5s each.

Heap requirement

The bulk loader buffers pendingRecords plus a TSV byte[] in memory. For TCGA Pancan-scale matrices the peak is ~2–4 GB; the default container-aware JVM heap (~25% of container limit) OOMs on methylation_hm450. Importer Jobs should set -Xmx6g (we use JAVA_OPTS=-Xmx6g + an 8 GB container memory limit).
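To make the arithmetic concrete: with an 8 GB container limit and no -Xmx, a 25% default works out to about 2 GB of heap, which sits at or below the ~2–4 GB peak quoted above. The following is a minimal sketch of a pre-flight check an importer could run. The class name, the helper methods, and the 4 GiB threshold are all assumptions for illustration and are not part of cbioportal-core.

```java
/** Hypothetical pre-flight heap check; the 4 GiB threshold is an assumption taken from the figures above. */
final class HeapCheckSketch {
    static final long GIB = 1024L * 1024L * 1024L;

    /** Approximate default max heap when -Xmx is unset: 25% of the container limit. */
    static long defaultHeapBytes(long containerLimitBytes) {
        return containerLimitBytes / 4;
    }

    /** True when the heap can hold the loader's estimated peak. */
    static boolean heapSufficient(long maxHeapBytes, long peakBytes) {
        return maxHeapBytes >= peakBytes;
    }

    public static void main(String[] args) {
        long peak = 4 * GIB; // upper end of the ~2-4 GB peak (assumed threshold)
        long maxHeap = Runtime.getRuntime().maxMemory(); // may report slightly below -Xmx
        System.out.printf("max heap: %.1f GiB%n", maxHeap / (double) GIB);
        if (!heapSufficient(maxHeap, peak)) {
            System.err.println("Heap is below the estimated bulk-load peak; consider -Xmx6g");
        }
    }
}
```

Run inside the importer container, a check like this would warn before the methylation_hm450 file is read rather than after an OutOfMemoryError.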

Safety

flushAll() at the end of ImportTabDelimData.importDataInternal is already gated on isBulkLoad(), so turning the loader on unconditionally means every matrix import is also flushed at the end.
DaoGeneticEntity.addNewGeneticEntity reserves the ID via ClickHouseAutoIncrement.nextId() before buffering, so callers downstream get a valid id immediately.
flushAll() iterates loaders in insertion order; genetic_entity is always touched before genetic_alteration for a given gene, so the flush order matches the dependency.
The discretized CNA existingCnaEvents population is untouched.
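The ID-reservation and flush-order points above can be illustrated with a small model. This is a sketch under stated assumptions, not the actual ClickHouseBulkLoader or ClickHouseAutoIncrement code: the class, its methods, and the use of a LinkedHashMap keyed by table name are hypothetical stand-ins chosen to show why "first touched, first flushed" matches the genetic_entity → genetic_alteration dependency.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical model: IDs are reserved before buffering, and tables flush in first-touched order. */
final class OrderedFlushSketch {
    private final AtomicLong seq = new AtomicLong(0); // stands in for an ID sequence
    // LinkedHashMap iterates keys in insertion order, i.e. the order tables were first touched.
    private final Map<String, List<String>> buffers = new LinkedHashMap<>();

    private void buffer(String table, String row) {
        buffers.computeIfAbsent(table, t -> new ArrayList<>()).add(row);
    }

    /** Reserve the ID first, then buffer the row, so the caller gets a usable ID immediately. */
    long addEntity(String stableId) {
        long id = seq.incrementAndGet();
        buffer("genetic_entity", id + "\t" + stableId);
        return id;
    }

    /** Buffer a dependent row that refers to an already reserved entity ID. */
    void addAlteration(long entityId, String values) {
        buffer("genetic_alteration", entityId + "\t" + values);
    }

    /** Returns the table names in the order they would be flushed, then clears the buffers. */
    List<String> flushAll() {
        List<String> order = new ArrayList<>(buffers.keySet());
        buffers.clear();
        return order;
    }
}
```

In this model, calling addEntity before addAlteration for the first row fixes the flush order for the whole import, even though later rows interleave the two tables.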

Test plan

CI integration tests pass
Manual import of a non-discrete matrix profile (mRNA) verifies rows land in genetic_alteration
Manual import of a generic-assay profile (e.g. methylation_hm450) verifies rows in genetic_alteration + entities in genetic_entity

Temporary branch to produce cbioportal/cbioportal:demo-bulk-load-fix bundling cBioPortal/cbioportal-core#160 (bulk-load all tab-delim matrix imports). Used by hub.cbioportal.org to validate the perf fix end-to-end while the upstream PR is in review.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Temporary image bundling cBioPortal/cbioportal-core#160 (bulk-load all tab-delim matrix imports). It fixes the per-row JDBC INSERT bottleneck that made non-discrete matrix profiles (mRNA, methylation, RPPA, log2 CNA, z-scores) take ~9 min each instead of seconds. Revert to v7.0.x once the upstream PR lands and the next release is tagged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Enable ClickHouseBulkLoader for every tab-delim matrix import in ImportTabDelimData (mRNA + z-scores, methylation, RPPA + z-scores, log2 CNA, generic assay, …). The loader streams rows via INSERT ... FORMAT TSVWithNames: one round-trip per profile instead of one JDBC INSERT per gene. For a TCGA Pancan-shaped matrix (~20K genes × ~92 samples) this is the difference between seconds and minutes per profile. flushAll() at the end of importData() is already gated on isBulkLoad(), so flipping the loader on also auto-flushes. The discretized CNA existingCnaEvents path is untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
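The buffering idea in this commit message can be sketched as follows. This is an illustrative stand-in, not the real ClickHouseBulkLoader: the class name and methods are hypothetical, and no database connection is made. It only shows how many per-row calls collapse into one payload with a header line followed by tab-separated rows, which is the shape TSVWithNames expects.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not the real ClickHouseBulkLoader: buffer rows, emit one TSV payload. */
final class TsvBatchSketch {
    private final String[] columns;
    private final List<String[]> pending = new ArrayList<>();
    private int roundTrips = 0;

    TsvBatchSketch(String... columns) {
        this.columns = columns.clone();
    }

    /** Buffer one row in memory; nothing is sent yet. */
    void add(String... values) {
        if (values.length != columns.length) {
            throw new IllegalArgumentException("expected " + columns.length + " values");
        }
        pending.add(values.clone());
    }

    /** Escape backslash, tab, and newline, which are special in tab-separated formats. */
    static String escape(String v) {
        return v.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n");
    }

    /**
     * Render a header line plus every buffered row as one payload.
     * In a real loader this string would be the body of a single INSERT.
     */
    String flush() {
        StringBuilder sb = new StringBuilder(String.join("\t", columns)).append('\n');
        for (String[] row : pending) {
            for (int i = 0; i < row.length; i++) {
                if (i > 0) sb.append('\t');
                sb.append(escape(row[i]));
            }
            sb.append('\n');
        }
        pending.clear();
        roundTrips++; // one simulated round-trip per flush, regardless of row count
        return sb.toString();
    }

    int roundTrips() { return roundTrips; }
    int pendingCount() { return pending.size(); }
}
```

With this shape, a ~20K-gene matrix would mean ~20K add() calls and a single flush(), at the cost of holding every row plus the rendered payload in memory until the flush.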

…icAssayEntity
Mirror the genetic_alteration bulk-load path in addNewGeneticEntity so the loader-aware code path is used whenever bulkLoadOn() is active. For GENERIC_ASSAY profiles, ImportProfileData runs ImportGenericAssayEntity and ImportTabDelimData in the same JVM, where the latter builds its stable-id→entity-id map from a SELECT. Without an explicit flush between those two steps, the entities buffered by the former are invisible to the SELECT and all rows get skipped. Add ClickHouseBulkLoader.flushAll() at the end of ImportGenericAssayEntity.importData() so the rows are in the DB before the next importer's lookup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
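The visibility problem described here can be reproduced in a toy model. In the sketch below, two in-memory maps stand in for the genetic_entity table and the loader's buffer. All names are hypothetical, and the lookup method only imitates the role of the stable-id→entity-id SELECT.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of why buffered entities must be flushed before a same-JVM lookup. */
final class FlushVisibilitySketch {
    private final Map<String, Integer> table = new HashMap<>();         // rows already in the DB
    private final Map<String, Integer> pending = new LinkedHashMap<>(); // rows still buffered
    private int nextId = 1;

    /** Buffer a new entity; its ID is known to the caller, but the row is not in the table yet. */
    int addEntity(String stableId) {
        int id = nextId++;
        pending.put(stableId, id);
        return id;
    }

    /** Move every buffered row into the table. */
    void flushAll() {
        table.putAll(pending);
        pending.clear();
    }

    /** Stands in for the SELECT that builds the stable-id to entity-id map; it sees flushed rows only. */
    Map<String, Integer> selectStableIdToEntityId() {
        return new HashMap<>(table);
    }

    /** Count matrix rows whose stable ID resolves; unresolved rows are skipped. */
    static int importRows(Map<String, Integer> lookup, String... stableIds) {
        int imported = 0;
        for (String id : stableIds) {
            if (lookup.containsKey(id)) imported++;
        }
        return imported;
    }
}
```

Building the lookup before flushAll() yields an empty map, so every matrix row is skipped. Building it after the flush resolves all of them.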

bulkLoadOn/Off is the established pattern (7 production callers: DaoGene, DaoReferenceGenomeGene, ImportGeneData, ImportCopyNumberSegmentData, ImportMicroRNAIDs, MutSigReader, and ConsoleUtil; plus the integration tests). Leaving the loader on after a flush makes the global flag sticky for any downstream DAO call in the same JVM that has a bulk-load branch (e.g. addNewGeneticEntity via DaoGeneset.addGeneset, which does not pre-emptively call bulkLoadOff()).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
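A small model shows why the sticky flag matters. The class below is a hypothetical stand-in for a loader with a global on/off switch. The flushAndRestore() helper is an assumed name that mirrors the "flush, then bulkLoadOff()" step described in this commit; it is not an existing cbioportal-core method.

```java
/** Hypothetical model of a global bulk-load flag and a DAO insert that branches on it. */
final class StickyFlagSketch {
    private boolean bulkLoad = false;
    private int buffered = 0; // rows waiting in the loader
    private int written = 0;  // rows that actually reached the table

    void bulkLoadOn() { bulkLoad = true; }
    void bulkLoadOff() { bulkLoad = false; }
    boolean isBulkLoad() { return bulkLoad; }

    /** Write out everything that is currently buffered. */
    void flushAll() {
        written += buffered;
        buffered = 0;
    }

    /** DAO-style insert with a bulk-load branch: buffers when the flag is on, writes otherwise. */
    void insert() {
        if (bulkLoad) buffered++; else written++;
    }

    /** Flush and turn the flag off so later callers get the direct-write path (assumed helper). */
    void flushAndRestore() {
        if (isBulkLoad()) {
            flushAll();
            bulkLoadOff();
        }
    }

    int buffered() { return buffered; }
    int written() { return written; }
}
```

If an importer calls only flushAll(), a later insert() from an unrelated caller is buffered and never written unless something flushes again. With flushAndRestore(), the same later insert() is written directly.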

Copilot

Pull request overview

This PR improves ClickHouse-backed import performance by switching tab-delimited “matrix” profile imports (mRNA, methylation, RPPA, CNA log2, generic assay matrices, etc.) onto the ClickHouseBulkLoader path and extending bulk loading to genetic_entity inserts so generic-assay entity creation no longer bottlenecks large imports.

Changes:

Enable ClickHouseBulkLoader for all tab-delimited matrix imports in ImportTabDelimData.
Buffer genetic_entity inserts via ClickHouseBulkLoader in DaoGeneticEntity.addNewGeneticEntity.
Flush buffered inserts after generic-assay entity import so subsequent same-JVM SELECTs can observe newly inserted entities.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
`src/main/java/org/mskcc/cbio/portal/scripts/ImportTabDelimData.java`	Turns bulk loading on for all matrix imports to avoid per-row JDBC inserts.
`src/main/java/org/mskcc/cbio/portal/scripts/ImportGenericAssayEntity.java`	Flushes buffered inserts at the end of entity import to ensure same-JVM visibility for follow-on matrix import SELECTs.
`src/main/java/org/mskcc/cbio/portal/dao/DaoGeneticEntity.java`	Adds a bulk-load path for `genetic_entity` inserts while reserving IDs up-front.


+        long entityId = ClickHouseAutoIncrement.nextId("seq_genetic_entity");
+        geneticEntity.setId((int) entityId);
+


+        // Flush any buffered genetic_entity inserts so that a subsequent SELECT
+        // (e.g. GenericAssayMetaUtils.buildGenericAssayStableIdToEntityIdMap)
+        // in the same JVM sees them. ImportProfileData runs this method
+        // immediately before ImportTabDelimData for GENERIC_ASSAY profiles.
+        // bulkLoadOff() restores the global flag — ImportTabDelimData will
+        // turn it back on for the matrix import.
+        if (ClickHouseBulkLoader.isBulkLoad()) {


inodb force-pushed the bulk-load-all-tab-delim-imports branch from bbf5e96 to 49577f9 · May 29, 2026 20:13

inodb changed the title ~~perf(import): bulk-load all tab-delim matrix data, not just discretized CNA~~ perf(import): bulk-load all tab-delim matrix data · May 29, 2026

inodb requested a review from sheridancbio · May 29, 2026 21:13

inodb force-pushed the bulk-load-all-tab-delim-imports branch 2 times, most recently from 6c67739 to 09a51fc · May 30, 2026 01:31

inodb force-pushed the bulk-load-all-tab-delim-imports branch from 09a51fc to 15fce5e · May 30, 2026 01:44

inodb changed the title ~~perf(import): bulk-load all tab-delim matrix data~~ perf(import): bulk-load tab-delim matrix + genetic_entity inserts · May 30, 2026

inodb requested a review from Copilot · May 30, 2026 09:27

Copilot started reviewing on behalf of inodb · May 30, 2026 09:27

Copilot AI reviewed May 30, 2026


inodb wants to merge 3 commits into cBioPortal:main from inodb:bulk-load-all-tab-delim-imports
