Skip to content

Hackathon: Initial draft for resource data model#12010

Draft
dippindots wants to merge 4 commits into
cBioPortal:masterfrom
dippindots:resource-data-model-design
Draft

Hackathon: Initial draft for resource data model#12010
dippindots wants to merge 4 commits into
cBioPortal:masterfrom
dippindots:resource-data-model-design

Conversation

@dippindots
Copy link
Copy Markdown
Collaborator

Fix # (see https://help.github.com/en/articles/closing-issues-using-keywords)

Describe changes proposed in this pull request:

  • a
  • b

Checks

Any screenshots or GIFs?

If this is a new visual feature please add a before/after screenshot or gif
here with e.g. Giphy CAPTURE or Peek

Notify reviewers

Read our Pull request merging
policy
. It can help to figure out who worked on the
file before you. Please use git blame <filename> to determine that
and notify them either through slack or by assigning them as a reviewer on the PR

@dippindots
Copy link
Copy Markdown
Collaborator Author

Here are the key topics worth discussing before moving to implementation:


1. Controlled vocabulary for TYPE

  • Should TYPE values (e.g., H_AND_E, IHC, CT) be a curated list enforced somewhere, or truly free-text with no validation?
  • Who owns the list, and how do new types get added?
  • Does TYPE drive any rendering behavior in the UI (e.g., iframe vs download link vs embedded viewer)?

2. Metadata on GROUP nodes

  • Currently only ITEM nodes carry TYPE and METADATA. Should GROUP nodes also have metadata?
  • Example: a tissue block GROUP might want to store anatomical site, collection date, or pathology notes alongside it.

3. Tree depth limits

  • Is there a practical maximum depth the UI should support?
  • Unlimited depth is possible in the schema but could create UX problems if curators go too deep.

4. Cross-entity queries

  • Should users be able to query across patients — e.g., “show all H_AND_E slides in this cohort”?
  • If yes, the current per-patient tree model works, but the API and UI need explicit design for it.

5. Incremental updates vs full reload

  • When a study is re-imported, how are existing resource_node rows handled — full delete and reload, or upsert/diff?
  • GROUP_PATH uniqueness: if two import files define the same path for the same patient, do they merge or conflict?

6. Backward compatibility timeline

  • How long do the old resource_sample, resource_patient, and resource_study tables and their API endpoints stay alive for existing consumers?
  • Is there a versioned migration path or a hard cutover?

7. API shape

  • Should the API return the full tree in one response, or lazy-load children on expand?
  • For cohort-level pages (not single patient), how should imaging resources be surfaced?

8. Who creates GROUP nodes — curators or the system?

  • Current plan: importer auto-creates GROUPs from GROUP_PATH.
  • Is there a use case where curators need to explicitly define and name a GROUP with its own metadata before adding items to it?

dippindots and others added 3 commits March 11, 2026 11:42
- Add METADATA and PRIORITY to node semantics table
- Expand import file format from 3 to 5 optional columns (add METADATA, PRIORITY)
- Add METADATA examples to data file example rows
- Fix migration SQL: use RESOURCE_ID (not rd.DISPLAY_NAME) as DISPLAY_NAME
- Expand migration SQL to cover all 3 tables (was placeholder comment)
- Fix resource_study migration to use CANCER_STUDY_ID as ENTITY_INTERNAL_ID
- Clarify GROUP node DISPLAY_NAME = path segment string in importer logic
- Update implementation step 3 to mention METADATA and PRIORITY handling

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Entity ID columns are optional — their presence depends on resource scope:
- STUDY resources: no PATIENT_ID or SAMPLE_ID
- PATIENT resources: PATIENT_ID required, no SAMPLE_ID
- SAMPLE resources: both PATIENT_ID and SAMPLE_ID required

Entity type is resolved from resource_definition.RESOURCE_TYPE, not
column presence. Updated column table and added a study-resource example.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@sonarqubecloud
Copy link
Copy Markdown

@dippindots
Copy link
Copy Markdown
Collaborator Author

PR Mismatch Findings

Sources reviewed:


1. [Backend + Importer] Mapper queries columns that don't exist in the schema

Files: ResourceDataMapper.xml (PR #12018), DaoResourceData.java + ImportResourceData.java (PR core#119)

Problem: The MyBatis mapper's selectNodeFields SQL block queries:

resource_node.node_type   AS "nodeType"
resource_node.parent_id   AS "parentId"

But cgds.sql / migration.sql (same PR) defines resource_node with GROUP_PATH varchar(255) — no NODE_TYPE and no PARENT_ID columns. The importer also inserts group_path as a flat string rather than resolving it to parent rows.

The plan (PR #12010) designed the schema with PARENT_ID + NODE_TYPE ENUM('GROUP','ITEM'). The Java model (NodeType.java, ResourceData.nodeType, ResourceData.parentId) and mapper were written for that design, but the schema and importer drifted to a flat GROUP_PATH string instead.

Will cause: SQL runtime error the moment any resource API endpoint is called.

Fix: Resolve GROUP_PATH to PARENT_ID at import time

The mapper and Java model already implement the PARENT_ID design correctly. The fix is confined to the schema and importer only.

Schema changes (cgds.sql + migration.sql in PR #12018):

  • Replace GROUP_PATH varchar(255) with:
    • NODE_TYPE ENUM('GROUP','ITEM') NOT NULL
    • PARENT_ID bigint DEFAULT NULL with FOREIGN KEY (PARENT_ID) REFERENCES resource_node(ID) ON DELETE CASCADE
  • Add a unique index on (RESOURCE_ID, CANCER_STUDY_ID, ENTITY_TYPE, ENTITY_INTERNAL_ID, PARENT_ID, DISPLAY_NAME) to allow efficient GROUP node upsert (two rows with the same parent + display name at the same scope are the same folder).

Importer changes (DaoResourceData.java + ImportResourceData.java in PR core#119):

  1. Change the INSERT statement: replace group_path with node_type and parent_id.
  2. Add a findOrCreateGroupNode() helper in DaoResourceData that does a SELECT-then-INSERT for each GROUP node:
// SELECT ID ... WHERE node_type='GROUP' AND display_name=segment AND parent_id IS [NULL / = currentParentId]
// If not found: INSERT GROUP node, return generated ID
  1. In ImportResourceData, before inserting each ITEM row, walk the GROUP_PATH segments to resolve the final parentId:
segments = GROUP_PATH.split("/")   // e.g. ["CT 2023-01-15", "Series 1: Axial T2"]
currentParentId = null

for each segment:
    currentParentId = DaoResourceData.findOrCreateGroupNode(
        resourceId, cancerStudyId, entityType, entityInternalId,
        segment, currentParentId)

INSERT ITEM node with parent_id = currentParentId, node_type = 'ITEM'
  1. Keep an in-memory cache Map<String, Long> keyed on "resourceId|entityId|parentId|segment" within each import run. This avoids redundant DB lookups when many rows share the same folder (e.g. dozens of slides all under Block A – Primary Tumor).

  2. ClickHouse bulk load path: GROUP nodes require a SELECT to get the generated ID before inserting children, so GROUP resolution must run outside the bulk loader. Only ITEM rows should go through ClickHouseBulkLoader; GROUP inserts always use the regular JDBC path.

What does NOT need to change:

  • ResourceDataMapper.xml — already correct for PARENT_ID / NODE_TYPE
  • ResourceData.java, NodeType.java — already correct
  • File format docs (PR Hackathon: Updated resource data file format #12022) — GROUP_PATH stays as the user-facing column name
  • Validator (PR core#120) — no changes needed
  • Frontend — receives parentId in API responses, which it already handles in ResourceNodeBase

2. [Importer] DISPLAY_NAME is never read from the TSV — hardcoded to resourceId

File: ImportResourceData.java (PR core#119)
Problem: The importer calls:

DaoResourceData.addPatientDatum(internalId, cancerStudyId, resourceId, resourceId, resourceURL, metadata, groupPath);
//                                                                        ↑ displayName arg is just resourceId again

There is no DISPLAY_NAME_COLUMN_NAME constant and no findDisplayNameColumn(). So whatever users put in the DISPLAY_NAME column of their data files is silently ignored, and resourceId is stored as the display name instead.

Contradicts: File format docs (PR #12022) and validator (PR core#120) both support DISPLAY_NAME as a meaningful optional column.


3. [Importer] TYPE column is not read from TSV and not stored in DB

File: ImportResourceData.java + DaoResourceData.java (PR core#119)
Problem: The DB INSERT statement only covers 8 columns:

resource_id, cancer_study_id, entity_type, entity_internal_id,
display_name, url, metadata, group_path

TYPE is completely absent — no column constant, no findTypeColumn(), not in the INSERT.

Contradicts: The DB schema has TYPE varchar(64), file format docs document it, and the validator validates it. Data in TYPE column will be silently dropped on import.


4. [Importer] PRIORITY column is not read from TSV and not stored in DB

File: ImportResourceData.java + DaoResourceData.java (PR core#119)
Problem: Same as TYPE — PRIORITY has no column constant, no findPriorityColumn(), and is absent from the INSERT statement. The DB will always get the schema default of 0 regardless of what the data file says.

Contradicts: DB schema has PRIORITY int(11) DEFAULT 0, file format docs document it, and the validator validates it (checks it's an integer).


5. [Frontend] ResourceNodeRow is missing the priority field

File: ResourceNodeTypes.ts (PR #5437)
Problem: The TSV row interface:

interface ResourceNodeRow {
  patientId: string; sampleId: string; resourceId: string;
  url: string; displayName: string; type?: string;
  groupPath?: string; metadata?: Record<string, any>;
  // no priority field
}

priority is defined on ResourceNodeBase (the tree node type) but missing from ResourceNodeRow (the parsed TSV row). So the TSV parser won't capture it and ordering will be undefined.


6. [Frontend] Data fetched from hardcoded localhost:3000, bypassing the backend API entirely

Files: Two separate components in PR #5437
Problem:

const RESOURCE_BASE_URL = '//localhost:3000/';
const response = await fetch(`${RESOURCE_BASE_URL}${tsvFile}`);

This appears in two places. The frontend parses the raw TSV file client-side and never calls the REST API that the backend (PR #12018) built. This only works in local dev. In production, the localhost:3000 URL will fail.

Required work: Wire the frontend to the actual /api/... resource endpoints instead of fetching TSVs directly.


7. [Frontend ↔ API] metadata type mismatch — object vs string

Problem: Frontend types metadata as Record<string, any> (already-parsed JS object). The backend ResourceData.java stores and returns it as String. When the frontend is eventually wired to the real API, calling .stain on a raw JSON string will silently fail or throw.

Required work: Frontend needs a JSON.parse(metadata) step when consuming the API response, or the API should serialize it as a JSON object.

@dippindots dippindots changed the title Initial draft for resource data model Hackathon: Initial draft for resource data model Mar 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant