perf(import): bulk-load tab-delim matrix + genetic_entity inserts#160

Open: inodb wants to merge 3 commits into cBioPortal:main from inodb:bulk-load-all-tab-delim-imports

inodb (Member) commented May 29, 2026

Summary

Two bulk-load fixes for v7 tab-delim matrix imports:

  1. Enable ClickHouseBulkLoader for every tab-delim matrix import in ImportTabDelimData (mRNA + z-scores, methylation, RPPA + z-scores, log2 CNA, generic assay, …). Previously only the discretized CNA profile got the bulk path.
  2. Bulk-load genetic_entity inserts in DaoGeneticEntity.addNewGeneticEntity, plus an explicit ClickHouseBulkLoader.flushAll() in ImportGenericAssayEntity.importData() so that buffered entities are visible to the SELECT inside ImportTabDelimData (both run in the same JVM for GENERIC_ASSAY profiles).
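
The visibility problem behind item 2 can be illustrated with a small stand-alone model. This is only a sketch, not the cbioportal-core code: `BufferedTableSketch` and its `insert`, `flush`, and `select` methods are hypothetical names, the entity id `ENTITY_A` is made up, and an in-memory map stands in for the ClickHouse table.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a bulk-loaded table: inserts are buffered
// and only become visible to lookups after flush().
final class BufferedTableSketch {
    private final List<String> pending = new ArrayList<>();          // buffered rows, not yet written
    private final Map<String, Integer> committed = new HashMap<>();  // what a SELECT would see
    private int nextId = 1;

    // Reserve the id immediately, but only buffer the row.
    int insert(String stableId) {
        int id = nextId++;
        pending.add(stableId + "\t" + id);
        return id;
    }

    // Move every buffered row into the "table" and empty the buffer.
    void flush() {
        for (String row : pending) {
            String[] cols = row.split("\t");
            committed.put(cols[0], Integer.parseInt(cols[1]));
        }
        pending.clear();
    }

    // Models the stable-id to entity-id lookup done by the next importer.
    Integer select(String stableId) {
        return committed.get(stableId);
    }

    public static void main(String[] args) {
        BufferedTableSketch table = new BufferedTableSketch();
        int id = table.insert("ENTITY_A");
        System.out.println("before flush: " + table.select("ENTITY_A"));
        table.flush();
        System.out.println("after flush: " + table.select("ENTITY_A") + " (reserved id " + id + ")");
    }
}
```

In this model the lookup misses the entity until `flush()` runs, which is the situation the explicit `flushAll()` call is meant to prevent.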

Measured impact

Full import of TCGA Pancan ACC (~92 samples, 17 data files including methylation_hm450 with ~400K rows) against an in-cluster ClickHouse:

| Path | Wall time |
| --- | --- |
| Before: per-row JDBC inserts everywhere | 25+ min, multiple files silent 5–10 min each |
| After: bulk loader for both genetic_alteration + genetic_entity | 1m52s end-to-end |

Representative per-file times after the fix: data_log2_cna.txt 1.5s, data_methylation_hm27_hm450_merged.txt 2.8s, data_methylation_hm450.txt ~5s, mRNA z-score files 1.5s each.

Heap requirement

The bulk loader buffers pendingRecords plus a TSV byte[] in memory. For TCGA Pancan-scale matrices the peak is ~2–4 GB; the default container-aware JVM heap (~25% of container limit) OOMs on methylation_hm450. Importer Jobs should set -Xmx6g (we use JAVA_OPTS=-Xmx6g + an 8 GB container memory limit).
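
A pre-flight check along these lines could make an undersized heap fail fast instead of OOMing mid-import. This is a hedged sketch and not part of the PR: `HeapCheckSketch` and `MIN_HEAP_BYTES` are hypothetical names. The threshold is set a little below 6 GiB on the assumption that `Runtime.maxMemory()` can report slightly less than the `-Xmx` value with some garbage collectors.

```java
// Hypothetical pre-flight heap check; names and threshold are illustrative only.
final class HeapCheckSketch {
    // 5.5 GiB: a little under the suggested -Xmx6g (assumption, see lead-in).
    static final long MIN_HEAP_BYTES = 11L * 512 * 1024 * 1024;

    static boolean heapIsSufficient(long maxHeapBytes, long requiredBytes) {
        return maxHeapBytes >= requiredBytes;
    }

    public static void main(String[] args) {
        // maxMemory() returns the maximum heap the JVM will attempt to use, in bytes.
        long maxHeap = Runtime.getRuntime().maxMemory();
        long maxHeapMiB = maxHeap / (1024 * 1024);
        if (heapIsSufficient(maxHeap, MIN_HEAP_BYTES)) {
            System.out.println("Heap OK: " + maxHeapMiB + " MiB");
        } else {
            System.out.println("Heap may be too small for large matrices: " + maxHeapMiB
                    + " MiB; consider JAVA_OPTS=-Xmx6g");
        }
    }
}
```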

Safety

  • flushAll() at the end of ImportTabDelimData.importDataInternal is already gated on isBulkLoad(), so turning the loader on unconditionally also means the buffered rows are flushed automatically at the end.
  • DaoGeneticEntity.addNewGeneticEntity reserves the ID via ClickHouseAutoIncrement.nextId() before buffering, so callers downstream get a valid id immediately.
  • flushAll() iterates loaders in insertion order; genetic_entity is always touched before genetic_alteration for a given gene, so the flush order matches the dependency.
  • The discretized CNA existingCnaEvents population is untouched.
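
The ID-reservation and flush-order points can be modelled in a few lines. This is a sketch under stated assumptions, not the actual ClickHouseBulkLoader: `FlushOrderSketch`, `reserveId`, and `buffer` are hypothetical names, the rows are made up, and a `LinkedHashMap` is assumed as the structure that preserves insertion order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of per-table buffers flushed in first-touched order.
final class FlushOrderSketch {
    // LinkedHashMap iterates in insertion order, so the first table touched is flushed first.
    private final Map<String, List<String>> buffers = new LinkedHashMap<>();
    private long nextEntityId = 1;

    // The id is handed out before anything is buffered, so callers can use it right away.
    long reserveId() {
        return nextEntityId++;
    }

    void buffer(String table, String row) {
        buffers.computeIfAbsent(table, t -> new ArrayList<>()).add(row);
    }

    // Returns the table names in the order they would be flushed, then clears the buffers.
    List<String> flushAll() {
        List<String> flushedTables = new ArrayList<>(buffers.keySet());
        buffers.clear();
        return flushedTables;
    }

    public static void main(String[] args) {
        FlushOrderSketch loaders = new FlushOrderSketch();
        long entityId = loaders.reserveId();
        loaders.buffer("genetic_entity", entityId + "\tGENE");
        loaders.buffer("genetic_alteration", "1\t" + entityId + "\t0.5,0.7,");
        System.out.println(loaders.flushAll());
    }
}
```

Because the entity row is buffered before the alteration row that refers to it, the entity table comes first in the flush order in this model.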

Test plan

  • CI integration tests pass
  • Manual import of a non-discrete matrix profile (mRNA) verifies rows land in genetic_alteration
  • Manual import of a generic-assay profile (e.g. methylation_hm450) verifies rows in genetic_alteration + entities in genetic_entity

inodb added a commit to cBioPortal/cbioportal that referenced this pull request May 29, 2026
Temporary branch to produce cbioportal/cbioportal:demo-bulk-load-fix
bundling cBioPortal/cbioportal-core#160 (bulk-load all tab-delim
matrix imports). Used by hub.cbioportal.org to validate the perf fix
end-to-end while the upstream PR is in review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
inodb added a commit to knowledgesystems/knowledgesystems-k8s-deployment that referenced this pull request May 29, 2026
Temporary image bundling cBioPortal/cbioportal-core#160 (bulk-load all
tab-delim matrix imports) — fixes the per-row JDBC INSERT bottleneck
that made non-discrete matrix profiles (mRNA, methylation, RPPA, log2
CNA, z-scores) take ~9 min each instead of seconds. Revert to v7.0.x
once upstream PR lands + next release tags.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Enable ClickHouseBulkLoader for every tab-delim matrix import in
ImportTabDelimData (mRNA + z-scores, methylation, RPPA + z-scores,
log2 CNA, generic assay, …). The loader streams rows via
INSERT ... FORMAT TSVWithNames — one round-trip per profile instead
of one JDBC INSERT per gene. For a TCGA Pancan-shaped matrix
(~20K genes × ~92 samples) this is the difference between seconds
and minutes per profile.

flushAll() at the end of importData() is already gated on isBulkLoad(),
so flipping the loader on also auto-flushes. The discretized CNA
existingCnaEvents path is untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
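
For readers unfamiliar with the format named in the commit message above: a TSVWithNames payload is a tab-separated header row of column names followed by one tab-separated line per record, with tabs, newlines, and backslashes inside values escaped. The following is a minimal sketch of building such a payload, not the loader's actual code. `TsvWithNamesSketch` is a hypothetical name, and the column names and values are assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical builder for a TSVWithNames-style payload (header row + data rows).
final class TsvWithNamesSketch {
    // Escape backslash first so the escapes added for tab/newline are not escaped again.
    static String escape(String field) {
        return field.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n");
    }

    static byte[] buildPayload(List<String> columns, List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append(String.join("\t", columns)).append('\n');   // header row with column names
        for (List<String> row : rows) {
            for (int i = 0; i < row.size(); i++) {
                if (i > 0) {
                    sb.append('\t');
                }
                sb.append(escape(row.get(i)));
            }
            sb.append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Column names and values are made up for illustration.
        byte[] payload = buildPayload(
                List.of("genetic_profile_id", "genetic_entity_id", "values"),
                List.of(List.of("7", "101", "0.12,0.98,"), List.of("7", "102", "0.50,0.33,")));
        System.out.print(new String(payload, StandardCharsets.UTF_8));
    }
}
```

Holding the whole payload as a single `byte[]` is also why memory use grows with matrix size.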
inodb force-pushed the bulk-load-all-tab-delim-imports branch from bbf5e96 to 49577f9 on May 29, 2026 20:13
inodb changed the title from "perf(import): bulk-load all tab-delim matrix data, not just discretized CNA" to "perf(import): bulk-load all tab-delim matrix data" on May 29, 2026
inodb requested a review from sheridancbio on May 29, 2026 21:13
inodb force-pushed the bulk-load-all-tab-delim-imports branch 2 times, most recently from 6c67739 to 09a51fc on May 30, 2026 01:31
…icAssayEntity

Mirror the genetic_alteration bulk-load path in addNewGeneticEntity so
the loader-aware code path is used whenever bulkLoadOn() is active.

For GENERIC_ASSAY profiles, ImportProfileData runs ImportGenericAssayEntity
and ImportTabDelimData in the same JVM, where the latter builds its
stable-id→entity-id map from a SELECT. Without an explicit flush between
those two steps, the entities buffered by the former are invisible to
the SELECT and all rows get skipped. Add ClickHouseBulkLoader.flushAll()
at the end of ImportGenericAssayEntity.importData() so the rows are in
the DB before the next importer's lookup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
inodb force-pushed the bulk-load-all-tab-delim-imports branch from 09a51fc to 15fce5e on May 30, 2026 01:44
inodb changed the title from "perf(import): bulk-load all tab-delim matrix data" to "perf(import): bulk-load tab-delim matrix + genetic_entity inserts" on May 30, 2026
inodb requested a review from Copilot on May 30, 2026 09:27
bulkLoadOn/Off is the established pattern (7 production callers — DaoGene,
DaoReferenceGenomeGene, ImportGeneData, ImportCopyNumberSegmentData,
ImportMicroRNAIDs, MutSigReader, ConsoleUtil — and the integration tests).
Leaving the loader in the "on" state after a flush leaves the global flag
sticky for any downstream DAO call in the same JVM that has a bulk-load
branch (e.g. addNewGeneticEntity via DaoGeneset.addGeneset, which does not
pre-emptively call bulkLoadOff()).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
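
The sticky-flag concern in the commit message above can be sketched with a stand-in for the JVM-wide flag. `BulkFlagSketch` and `runWithBulkLoad` are hypothetical names, and wrapping the step in try/finally is an addition of this sketch rather than something the commit describes. It only shows one way to guarantee the flag is off again when an import step ends.

```java
// Hypothetical stand-in for a JVM-wide bulk-load flag.
final class BulkFlagSketch {
    private static boolean bulkLoad = false;

    static void bulkLoadOn() { bulkLoad = true; }
    static void bulkLoadOff() { bulkLoad = false; }
    static boolean isBulkLoad() { return bulkLoad; }

    // Turn the flag on for one import step and always turn it off afterwards,
    // so later DAO calls in the same JVM do not inherit the "on" state.
    static void runWithBulkLoad(Runnable importStep) {
        bulkLoadOn();
        try {
            importStep.run();
        } finally {
            bulkLoadOff();
        }
    }

    public static void main(String[] args) {
        runWithBulkLoad(() -> System.out.println("during import: " + isBulkLoad()));
        System.out.println("after import: " + isBulkLoad());
    }
}
```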

Copilot AI left a comment


Pull request overview

This PR improves ClickHouse-backed import performance by switching tab-delimited “matrix” profile imports (mRNA, methylation, RPPA, CNA log2, generic assay matrices, etc.) onto the ClickHouseBulkLoader path and extending bulk loading to genetic_entity inserts so generic-assay entity creation no longer bottlenecks large imports.

Changes:

  • Enable ClickHouseBulkLoader for all tab-delimited matrix imports in ImportTabDelimData.
  • Buffer genetic_entity inserts via ClickHouseBulkLoader in DaoGeneticEntity.addNewGeneticEntity.
  • Flush buffered inserts after generic-assay entity import so subsequent same-JVM SELECTs can observe newly inserted entities.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/main/java/org/mskcc/cbio/portal/scripts/ImportTabDelimData.java | Turns bulk loading on for all matrix imports to avoid per-row JDBC inserts. |
| src/main/java/org/mskcc/cbio/portal/scripts/ImportGenericAssayEntity.java | Flushes buffered inserts at the end of entity import to ensure same-JVM visibility for follow-on matrix import SELECTs. |
| src/main/java/org/mskcc/cbio/portal/dao/DaoGeneticEntity.java | Adds a bulk-load path for genetic_entity inserts while reserving IDs up-front. |


Comment on lines +33 to +35
long entityId = ClickHouseAutoIncrement.nextId("seq_genetic_entity");
geneticEntity.setId((int) entityId);
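
The quoted lines narrow a long sequence value to int with a plain cast, which silently wraps if the value ever exceeds Integer.MAX_VALUE. A minimal sketch of a checked alternative using `Math.toIntExact` follows. `IdNarrowingSketch` and `toEntityId` are hypothetical names, and whether the sequence can realistically reach that range is not stated in this thread.

```java
// Hypothetical helper showing a checked long-to-int conversion.
final class IdNarrowingSketch {
    // Math.toIntExact throws ArithmeticException instead of wrapping on overflow.
    static int toEntityId(long sequenceValue) {
        return Math.toIntExact(sequenceValue);
    }

    public static void main(String[] args) {
        System.out.println(toEntityId(42L));
        try {
            toEntityId(Integer.MAX_VALUE + 1L);
        } catch (ArithmeticException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```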

Comment on lines +252 to +258
// Flush any buffered genetic_entity inserts so that a subsequent SELECT
// (e.g. GenericAssayMetaUtils.buildGenericAssayStableIdToEntityIdMap)
// in the same JVM sees them. ImportProfileData runs this method
// immediately before ImportTabDelimData for GENERIC_ASSAY profiles.
// bulkLoadOff() restores the global flag — ImportTabDelimData will
// turn it back on for the matrix import.
if (ClickHouseBulkLoader.isBulkLoad()) {