Add grouped validation summary to stderr output by DanielSebagala · Pull Request #142 · cBioPortal/cbioportal-core

DanielSebagala · 2026-04-16T04:04:46Z

Adds a ValidationSummaryHandler to validateData.py that prints a categorized summary of all errors and warnings at the end of validation.

Problem

The current validator streams messages chronologically as they're found. For large studies with many issues, curators have to scroll through the full log to understand what went wrong and how often each issue occurred. The per-file summaries (line count with warnings/errors) help, but they don't aggregate across the full study or group by message type.

Solution

A new ValidationSummaryHandler (a standard logging.Handler subclass) collects all ERROR and WARNING messages during validation and prints a grouped, counted summary to stderr after validation completes:

------------------------------------------------------------
VALIDATION SUMMARY
------------------------------------------------------------
ERRORS (8):
  [4] Normal sample id not in list of sample ids...
  [2] Value of numeric attribute is not a real number
  [1] No case list found with stable_id 'brca_tcga_pub_all'
  [1] datatype definition for attribute 'DFS_MONTHS' must be NUMBER

WARNINGS (10):
  [4] Unrecognized field in meta file
  [1] Missing clinical data for a patient associated with samples
  [1] Given value for Variant_Classification column is not one of the expected values
------------------------------------------------------------
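
The collect-then-summarize idea described above can be sketched roughly as follows. This is not the code from the PR: the class name SummaryHandlerSketch, the format_summary() method, and the choice to group on the unformatted message template are all assumptions made for illustration. Only logging.Handler and collections.Counter are real standard-library APIs.

```python
import logging
import sys
from collections import Counter


class SummaryHandlerSketch(logging.Handler):
    """Hypothetical stand-in for ValidationSummaryHandler; not the PR's code."""

    def __init__(self):
        # Only WARNING and above reach emit(); lower levels are filtered out.
        super().__init__(level=logging.WARNING)
        self.errors = Counter()
        self.warnings = Counter()

    def emit(self, record):
        # Assumption: group on the unformatted message template, so records
        # that differ only in their arguments land in the same bucket.
        key = str(record.msg)
        if record.levelno >= logging.ERROR:
            self.errors[key] += 1
        else:
            self.warnings[key] += 1

    def format_summary(self):
        rule = "-" * 60
        lines = [rule, "VALIDATION SUMMARY", rule]
        for title, counts in (("ERRORS", self.errors), ("WARNINGS", self.warnings)):
            if not counts:
                continue
            lines.append(f"{title} ({sum(counts.values())}):")
            # most_common() yields entries in descending order of count.
            for message, count in counts.most_common():
                lines.append(f"  [{count}] {message}")
            lines.append("")
        if lines[-1] == "":
            lines.pop()  # no blank line directly before the closing rule
        lines.append(rule)
        return "\n".join(lines)


logger = logging.getLogger("summary_sketch")
logger.setLevel(logging.DEBUG)
logger.propagate = False
handler = SummaryHandlerSketch()
logger.addHandler(handler)

logger.error("Value of numeric attribute is not a real number")
logger.error("Value of numeric attribute is not a real number")
logger.warning("Unrecognized field in meta file")
logger.info("Validation complete")  # below WARNING, so never counted

summary = handler.format_summary()
print(summary, file=sys.stderr)
```

Keeping one Counter per severity makes both the section totals and the per-message counts cheap to compute, and most_common() gives the frequency ordering without a separate sort step.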

Changes

New ValidationSummaryHandler class (~50 lines) after MaxLevelTrackingHandler
Two lines in main_validate() to register the handler
Summary flush and print before returning exit status
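
A minimal sketch of how such wiring might look, assuming a generic entry point. The names CollectingHandler, run_checks, and main_validate_sketch are hypothetical, and the exit codes used here (1 for errors, 0 otherwise) are placeholders rather than the validator's actual status values.

```python
import logging
import sys


class CollectingHandler(logging.Handler):
    """Hypothetical minimal collector standing in for ValidationSummaryHandler."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.records = []

    def emit(self, record):
        self.records.append(record)


def run_checks(logger):
    # Placeholder for the real validation work.
    logger.warning("Unrecognized field in meta file")
    logger.error("Value of numeric attribute is not a real number")


def main_validate_sketch():
    logger = logging.getLogger("wiring_sketch")
    logger.setLevel(logging.INFO)
    logger.propagate = False

    # Registration: create the handler and attach it to the logger.
    summary_handler = CollectingHandler()
    logger.addHandler(summary_handler)

    try:
        run_checks(logger)
    finally:
        # Detach so repeated calls do not stack handlers on the same logger.
        logger.removeHandler(summary_handler)

    n_errors = sum(r.levelno >= logging.ERROR for r in summary_handler.records)
    n_warnings = len(summary_handler.records) - n_errors

    # Print the summary last, just before the exit status is returned.
    print(f"errors={n_errors} warnings={n_warnings}", file=sys.stderr)
    return 1 if n_errors else 0  # placeholder exit codes


exit_status = main_validate_sketch()
```

Because the handler is purely additive, existing handlers on the same logger keep receiving every record unchanged.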

Testing

Tested against study_various_issues, study_es_3, and study_wr_clin; all produce correct grouped summaries
156 existing unit tests pass (2 pre-existing failures unrelated to this change)
No changes to existing logging behavior — the summary is additive, printed to stderr alongside the existing Validation of data {status} line

Adds ValidationSummaryHandler, which collects all errors and warnings during validation and prints a categorized summary at the end, grouped by severity and sorted by frequency. This makes it easier for curators to triage validation output at a glance rather than scanning a chronological log stream.

DanielSebagala wants to merge 1 commit into cBioPortal:main from DanielSebagala:feature/validation-summary
