perf(import): bulk-load tab-delim matrix + genetic_entity inserts#160
Open
inodb wants to merge 3 commits into
Open
perf(import): bulk-load tab-delim matrix + genetic_entity inserts#160inodb wants to merge 3 commits into
inodb wants to merge 3 commits into
Conversation
inodb
added a commit
to cBioPortal/cbioportal
that referenced
this pull request
May 29, 2026
Temporary branch to produce cbioportal/cbioportal:demo-bulk-load-fix bundling cBioPortal/cbioportal-core#160 (bulk-load all tab-delim matrix imports). Used by hub.cbioportal.org to validate the perf fix end-to-end while the upstream PR is in review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
inodb
added a commit
to knowledgesystems/knowledgesystems-k8s-deployment
that referenced
this pull request
May 29, 2026
Temporary image bundling cBioPortal/cbioportal-core#160 (bulk-load all tab-delim matrix imports) — fixes the per-row JDBC INSERT bottleneck that made non-discrete matrix profiles (mRNA, methylation, RPPA, log2 CNA, z-scores) take ~9 min each instead of seconds. Revert to v7.0.x once upstream PR lands + next release tags. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Enable ClickHouseBulkLoader for every tab-delim matrix import in ImportTabDelimData (mRNA + z-scores, methylation, RPPA + z-scores, log2 CNA, generic assay, …). The loader streams rows via INSERT ... FORMAT TSVWithNames — one round-trip per profile instead of one JDBC INSERT per gene. For a TCGA Pancan-shaped matrix (~20K genes × ~92 samples) this is the difference between seconds and minutes per profile. flushAll() at the end of importData() is already gated on isBulkLoad(), so flipping the loader on also auto-flushes. The discretized CNA existingCnaEvents path is untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bbf5e96 to
49577f9
Compare
6c67739 to
09a51fc
Compare
…icAssayEntity Mirror the genetic_alteration bulk-load path in addNewGeneticEntity so the loader-aware code path is used whenever bulkLoadOn() is active. For GENERIC_ASSAY profiles, ImportProfileData runs ImportGenericAssayEntity and ImportTabDelimData in the same JVM, where the latter builds its stable-id→entity-id map from a SELECT. Without an explicit flush between those two steps, the entities buffered by the former are invisible to the SELECT and all rows get skipped. Add ClickHouseBulkLoader.flushAll() at the end of ImportGenericAssayEntity.importData() so the rows are in the DB before the next importer's lookup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
09a51fc to
15fce5e
Compare
bulkLoadOn/Off is the established pattern (7 production callers — DaoGene, DaoReferenceGenomeGene, ImportGeneData, ImportCopyNumberSegmentData, ImportMicroRNAIDs, MutSigReader, ConsoleUtil — and the integration tests). Leaving the loader in the "on" state after a flush leaves the global flag sticky for any downstream DAO call in the same JVM that has a bulk-load branch (e.g. addNewGeneticEntity via DaoGeneset.addGeneset, which does not pre-emptively call bulkLoadOff()). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR improves ClickHouse-backed import performance by switching tab-delimited “matrix” profile imports (mRNA, methylation, RPPA, CNA log2, generic assay matrices, etc.) onto the ClickHouseBulkLoader path and extending bulk loading to genetic_entity inserts so generic-assay entity creation no longer bottlenecks large imports.
Changes:
- Enable
ClickHouseBulkLoaderfor all tab-delimited matrix imports inImportTabDelimData. - Buffer
genetic_entityinserts viaClickHouseBulkLoaderinDaoGeneticEntity.addNewGeneticEntity. - Flush buffered inserts after generic-assay entity import so subsequent same-JVM SELECTs can observe newly inserted entities.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/main/java/org/mskcc/cbio/portal/scripts/ImportTabDelimData.java |
Turns bulk loading on for all matrix imports to avoid per-row JDBC inserts. |
src/main/java/org/mskcc/cbio/portal/scripts/ImportGenericAssayEntity.java |
Flushes buffered inserts at the end of entity import to ensure same-JVM visibility for follow-on matrix import SELECTs. |
src/main/java/org/mskcc/cbio/portal/dao/DaoGeneticEntity.java |
Adds a bulk-load path for genetic_entity inserts while reserving IDs up-front. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+33
to
+35
| long entityId = ClickHouseAutoIncrement.nextId("seq_genetic_entity"); | ||
| geneticEntity.setId((int) entityId); | ||
|
|
Comment on lines
+252
to
+258
| // Flush any buffered genetic_entity inserts so that a subsequent SELECT | ||
| // (e.g. GenericAssayMetaUtils.buildGenericAssayStableIdToEntityIdMap) | ||
| // in the same JVM sees them. ImportProfileData runs this method | ||
| // immediately before ImportTabDelimData for GENERIC_ASSAY profiles. | ||
| // bulkLoadOff() restores the global flag — ImportTabDelimData will | ||
| // turn it back on for the matrix import. | ||
| if (ClickHouseBulkLoader.isBulkLoad()) { |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two bulk-load fixes for v7 tab-delim matrix imports:
ClickHouseBulkLoaderfor every tab-delim matrix import inImportTabDelimData(mRNA + z-scores, methylation, RPPA + z-scores, log2 CNA, generic assay, …). Previously only the discretized CNA profile got the bulk path.genetic_entityinserts inDaoGeneticEntity.addNewGeneticEntity, plus an explicitClickHouseBulkLoader.flushAll()inImportGenericAssayEntity.importData()so that buffered entities are visible to the SELECT insideImportTabDelimData(both run in the same JVM for GENERIC_ASSAY profiles).Measured impact
Full import of TCGA Pancan ACC (~92 samples, 17 data files including methylation_hm450 with ~400K rows) against an in-cluster ClickHouse:
Representative per-file times after the fix:
data_log2_cna.txt1.5s,data_methylation_hm27_hm450_merged.txt2.8s,data_methylation_hm450.txt~5s, mRNA z-score files 1.5s each.Heap requirement
The bulk loader buffers
pendingRecordsplus a TSV byte[] in memory. For TCGA Pancan-scale matrices the peak is ~2–4 GB; the default container-aware JVM heap (~25% of container limit) OOMs on methylation_hm450. Importer Jobs should set-Xmx6g(we useJAVA_OPTS=-Xmx6g+ an 8 GB container memory limit).Safety
flushAll()at the end ofImportTabDelimData.importDataInternalis already gated onisBulkLoad(); flipping the loader on unconditionally also auto-flushes at the end.DaoGeneticEntity.addNewGeneticEntityreserves the ID viaClickHouseAutoIncrement.nextId()before buffering, so callers downstream get a valid id immediately.flushAll()iterates loaders in insertion order;genetic_entityis always touched beforegenetic_alterationfor a given gene, so the flush order matches the dependency.existingCnaEventspopulation is untouched.Test plan
genetic_alterationgenetic_alteration+ entities ingenetic_entity