Hackathon: Initial draft for resource data model by dippindots · Pull Request #12010 · cBioPortal/cbioportal

dippindots · 2026-03-05T19:46:37Z

Fix # (see https://help.github.com/en/articles/closing-issues-using-keywords)

Describe changes proposed in this pull request:

a
b

Checks

The commit log is comprehensible. It follows 7 rules of great commit messages. We can fix this during merge by using a squash+merge if necessary
Has tests or has a separate issue that describes the types of test that should be created. If no test is included it should explicitly be mentioned in the PR why there is no test.
Is this PR adding logic based on one or more clinical attributes? If yes, please make sure validation for this attribute is also present in the data validation / data loading layers (in backend repo) and documented in File-Formats Clinical data section!
Make sure your PR has one of the labels defined in https://github.com/cBioPortal/cbioportal/blob/master/.github/release-drafter.yml

Any screenshots or GIFs?

If this is a new visual feature please add a before/after screenshot or gif
here with e.g. Giphy CAPTURE or Peek

Notify reviewers

Read our Pull request merging
policy. It can help to figure out who worked on the
file before you. Please use git blame <filename> to determine that
and notify them either through slack or by assigning them as a reviewer on the PR

dippindots · 2026-03-05T19:57:41Z

Here are the key topics worth discussing before moving to implementation:

1. Controlled vocabulary for `TYPE`

Should TYPE values (e.g., H_AND_E, IHC, CT) be a curated list enforced somewhere, or truly free-text with no validation?
Who owns the list, and how do new types get added?
Does TYPE drive any rendering behavior in the UI (e.g., iframe vs download link vs embedded viewer)?

2. Metadata on `GROUP` nodes

Currently only ITEM nodes carry TYPE and METADATA. Should GROUP nodes also have metadata?
Example: a tissue block GROUP might want to store anatomical site, collection date, or pathology notes alongside it.

3. Tree depth limits

Is there a practical maximum depth the UI should support?
Unlimited depth is possible in the schema but could create UX problems if curators go too deep.

4. Cross-entity queries

Should users be able to query across patients — e.g., “show all H_AND_E slides in this cohort”?
If yes, the current per-patient tree model works, but the API and UI need explicit design for it.

5. Incremental updates vs full reload

When a study is re-imported, how are existing resource_node rows handled — full delete and reload, or upsert/diff?
GROUP_PATH uniqueness: if two import files define the same path for the same patient, do they merge or conflict?

6. Backward compatibility timeline

How long do the old resource_sample, resource_patient, and resource_study tables and their API endpoints stay alive for existing consumers?
Is there a versioned migration path or a hard cutover?

7. API shape

Should the API return the full tree in one response, or lazy-load children on expand?
For cohort-level pages (not single patient), how should imaging resources be surfaced?

8. Who creates `GROUP` nodes — curators or the system?

Current plan: importer auto-creates GROUPs from GROUP_PATH.
Is there a use case where curators need to explicitly define and name a GROUP with its own metadata before adding items to it?

- Add METADATA and PRIORITY to node semantics table - Expand import file format from 3 to 5 optional columns (add METADATA, PRIORITY) - Add METADATA examples to data file example rows - Fix migration SQL: use RESOURCE_ID (not rd.DISPLAY_NAME) as DISPLAY_NAME - Expand migration SQL to cover all 3 tables (was placeholder comment) - Fix resource_study migration to use CANCER_STUDY_ID as ENTITY_INTERNAL_ID - Clarify GROUP node DISPLAY_NAME = path segment string in importer logic - Update implementation step 3 to mention METADATA and PRIORITY handling Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Entity ID columns are optional — their presence depends on resource scope: - STUDY resources: no PATIENT_ID or SAMPLE_ID - PATIENT resources: PATIENT_ID required, no SAMPLE_ID - SAMPLE resources: both PATIENT_ID and SAMPLE_ID required Entity type is resolved from resource_definition.RESOURCE_TYPE, not column presence. Updated column table and added a study-resource example. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

sonarqubecloud · 2026-03-11T16:09:01Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

dippindots · 2026-03-12T14:26:14Z

PR Mismatch Findings

Sources reviewed:

Plan: Hackathon: Initial draft for resource data model #12010
Backend schema+mapper: Hackathon Schema Updates -- Resource Data Model #12018
Importer: cbioportal-core#119
Validator: cbioportal-core#120
File format docs: Hackathon: Updated resource data file format #12022
Frontend: cbioportal-frontend#5437

1. [Backend + Importer] Mapper queries columns that don't exist in the schema

Files: ResourceDataMapper.xml (PR #12018), DaoResourceData.java + ImportResourceData.java (PR core#119)

Problem: The MyBatis mapper's selectNodeFields SQL block queries:

resource_node.node_type   AS "nodeType"
resource_node.parent_id   AS "parentId"

But cgds.sql / migration.sql (same PR) defines resource_node with GROUP_PATH varchar(255) — no NODE_TYPE and no PARENT_ID columns. The importer also inserts group_path as a flat string rather than resolving it to parent rows.

The plan (PR #12010) designed the schema with PARENT_ID + NODE_TYPE ENUM('GROUP','ITEM'). The Java model (NodeType.java, ResourceData.nodeType, ResourceData.parentId) and mapper were written for that design, but the schema and importer drifted to a flat GROUP_PATH string instead.

Will cause: SQL runtime error the moment any resource API endpoint is called.

Fix: Resolve `GROUP_PATH` to `PARENT_ID` at import time

The mapper and Java model already implement the PARENT_ID design correctly. The fix is confined to the schema and importer only.

Schema changes (cgds.sql + migration.sql in PR #12018):

Replace GROUP_PATH varchar(255) with:
- NODE_TYPE ENUM('GROUP','ITEM') NOT NULL
- PARENT_ID bigint DEFAULT NULL with FOREIGN KEY (PARENT_ID) REFERENCES resource_node(ID) ON DELETE CASCADE
Add a unique index on (RESOURCE_ID, CANCER_STUDY_ID, ENTITY_TYPE, ENTITY_INTERNAL_ID, PARENT_ID, DISPLAY_NAME) to allow efficient GROUP node upsert (two rows with the same parent + display name at the same scope are the same folder).

Importer changes (DaoResourceData.java + ImportResourceData.java in PR core#119):

Change the INSERT statement: replace group_path with node_type and parent_id.
Add a findOrCreateGroupNode() helper in DaoResourceData that does a SELECT-then-INSERT for each GROUP node:

// SELECT ID ... WHERE node_type='GROUP' AND display_name=segment AND parent_id IS [NULL / = currentParentId]
// If not found: INSERT GROUP node, return generated ID

In ImportResourceData, before inserting each ITEM row, walk the GROUP_PATH segments to resolve the final parentId:

segments = GROUP_PATH.split("/")   // e.g. ["CT 2023-01-15", "Series 1: Axial T2"]
currentParentId = null

for each segment:
    currentParentId = DaoResourceData.findOrCreateGroupNode(
        resourceId, cancerStudyId, entityType, entityInternalId,
        segment, currentParentId)

INSERT ITEM node with parent_id = currentParentId, node_type = 'ITEM'

Keep an in-memory cache Map<String, Long> keyed on "resourceId|entityId|parentId|segment" within each import run. This avoids redundant DB lookups when many rows share the same folder (e.g. dozens of slides all under Block A – Primary Tumor).
ClickHouse bulk load path: GROUP nodes require a SELECT to get the generated ID before inserting children, so GROUP resolution must run outside the bulk loader. Only ITEM rows should go through ClickHouseBulkLoader; GROUP inserts always use the regular JDBC path.

What does NOT need to change:

ResourceDataMapper.xml — already correct for PARENT_ID / NODE_TYPE
ResourceData.java, NodeType.java — already correct
File format docs (PR Hackathon: Updated resource data file format #12022) — GROUP_PATH stays as the user-facing column name
Validator (PR core#120) — no changes needed
Frontend — receives parentId in API responses, which it already handles in ResourceNodeBase

2. [Importer] `DISPLAY_NAME` is never read from the TSV — hardcoded to `resourceId`

File: ImportResourceData.java (PR core#119)
Problem: The importer calls:

DaoResourceData.addPatientDatum(internalId, cancerStudyId, resourceId, resourceId, resourceURL, metadata, groupPath);
//                                                                        ↑ displayName arg is just resourceId again

There is no DISPLAY_NAME_COLUMN_NAME constant and no findDisplayNameColumn(). So whatever users put in the DISPLAY_NAME column of their data files is silently ignored, and resourceId is stored as the display name instead.

Contradicts: File format docs (PR #12022) and validator (PR core#120) both support DISPLAY_NAME as a meaningful optional column.

3. [Importer] `TYPE` column is not read from TSV and not stored in DB

File: ImportResourceData.java + DaoResourceData.java (PR core#119)
Problem: The DB INSERT statement only covers 8 columns:

resource_id, cancer_study_id, entity_type, entity_internal_id,
display_name, url, metadata, group_path

TYPE is completely absent — no column constant, no findTypeColumn(), not in the INSERT.

Contradicts: The DB schema has TYPE varchar(64), file format docs document it, and the validator validates it. Data in TYPE column will be silently dropped on import.

4. [Importer] `PRIORITY` column is not read from TSV and not stored in DB

File: ImportResourceData.java + DaoResourceData.java (PR core#119)
Problem: Same as TYPE — PRIORITY has no column constant, no findPriorityColumn(), and is absent from the INSERT statement. The DB will always get the schema default of 0 regardless of what the data file says.

Contradicts: DB schema has PRIORITY int(11) DEFAULT 0, file format docs document it, and the validator validates it (checks it's an integer).

5. [Frontend] `ResourceNodeRow` is missing the `priority` field

File: ResourceNodeTypes.ts (PR #5437)
Problem: The TSV row interface:

interface ResourceNodeRow {
  patientId: string; sampleId: string; resourceId: string;
  url: string; displayName: string; type?: string;
  groupPath?: string; metadata?: Record<string, any>;
  // no priority field
}

priority is defined on ResourceNodeBase (the tree node type) but missing from ResourceNodeRow (the parsed TSV row). So the TSV parser won't capture it and ordering will be undefined.

6. [Frontend] Data fetched from hardcoded `localhost:3000`, bypassing the backend API entirely

Files: Two separate components in PR #5437
Problem:

const RESOURCE_BASE_URL = '//localhost:3000/';
const response = await fetch(`${RESOURCE_BASE_URL}${tsvFile}`);

This appears in two places. The frontend parses the raw TSV file client-side and never calls the REST API that the backend (PR #12018) built. This only works in local dev. In production, the localhost:3000 URL will fail.

Required work: Wire the frontend to the actual /api/... resource endpoints instead of fetching TSVs directly.

7. [Frontend ↔ API] `metadata` type mismatch — object vs string

Problem: Frontend types metadata as Record<string, any> (already-parsed JS object). The backend ResourceData.java stores and returns it as String. When the frontend is eventually wired to the real API, calling .stain on a raw JSON string will silently fail or throw.

Required work: Frontend needs a JSON.parse(metadata) step when consuming the API response, or the API should serialize it as a JSON object.

initial plan

a4e3483

dippindots and others added 3 commits March 11, 2026 11:42

plan: use resource_definition.DISPLAY_NAME as migrated node DISPLAY_NAME

371a199

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

dippindots changed the title ~~Initial draft for resource data model~~ Hackathon: Initial draft for resource data model Mar 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Hackathon: Initial draft for resource data model#12010

Hackathon: Initial draft for resource data model#12010
dippindots wants to merge 4 commits into
cBioPortal:masterfrom
dippindots:resource-data-model-design

dippindots commented Mar 5, 2026

Uh oh!

dippindots commented Mar 5, 2026

Uh oh!

sonarqubecloud Bot commented Mar 11, 2026

Uh oh!

dippindots commented Mar 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

dippindots commented Mar 5, 2026

Checks

Any screenshots or GIFs?

Notify reviewers

Uh oh!

dippindots commented Mar 5, 2026

1. Controlled vocabulary for TYPE

2. Metadata on GROUP nodes

3. Tree depth limits

4. Cross-entity queries

5. Incremental updates vs full reload

6. Backward compatibility timeline

7. API shape

8. Who creates GROUP nodes — curators or the system?

Uh oh!

sonarqubecloud Bot commented Mar 11, 2026

Quality Gate passed

Uh oh!

dippindots commented Mar 12, 2026

PR Mismatch Findings

1. [Backend + Importer] Mapper queries columns that don't exist in the schema

Fix: Resolve GROUP_PATH to PARENT_ID at import time

2. [Importer] DISPLAY_NAME is never read from the TSV — hardcoded to resourceId

3. [Importer] TYPE column is not read from TSV and not stored in DB

4. [Importer] PRIORITY column is not read from TSV and not stored in DB

5. [Frontend] ResourceNodeRow is missing the priority field

6. [Frontend] Data fetched from hardcoded localhost:3000, bypassing the backend API entirely

7. [Frontend ↔ API] metadata type mismatch — object vs string

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. Controlled vocabulary for `TYPE`

2. Metadata on `GROUP` nodes

8. Who creates `GROUP` nodes — curators or the system?

Fix: Resolve `GROUP_PATH` to `PARENT_ID` at import time

2. [Importer] `DISPLAY_NAME` is never read from the TSV — hardcoded to `resourceId`

3. [Importer] `TYPE` column is not read from TSV and not stored in DB

4. [Importer] `PRIORITY` column is not read from TSV and not stored in DB

5. [Frontend] `ResourceNodeRow` is missing the `priority` field

6. [Frontend] Data fetched from hardcoded `localhost:3000`, bypassing the backend API entirely

7. [Frontend ↔ API] `metadata` type mismatch — object vs string