Add grouped validation summary to stderr output #142

Open
DanielSebagala wants to merge 1 commit into
cBioPortal:main from
DanielSebagala:feature/validation-summary

Adds a ValidationSummaryHandler to validateData.py that prints a categorized summary of all errors and warnings at the end of validation.

Problem

The current validator streams messages chronologically as they're found. For large studies with many issues, curators have to scroll through the full log to understand what went wrong and how often. The per-file summaries (line count with warnings/errors) help, but don't aggregate across the full study or group by message type.

Solution

A new ValidationSummaryHandler (a standard logging.Handler subclass) collects all ERROR and WARNING messages during validation and prints a grouped, counted summary to stderr after validation completes:

------------------------------------------------------------
VALIDATION SUMMARY
------------------------------------------------------------
ERRORS (8):
  [4] Normal sample id not in list of sample ids...
  [2] Value of numeric attribute is not a real number
  [1] No case list found with stable_id 'brca_tcga_pub_all'
  [1] datatype definition for attribute 'DFS_MONTHS' must be NUMBER

WARNINGS (10):
  [4] Unrecognized field in meta file
  [1] Missing clinical data for a patient associated with samples
  [1] Given value for Variant_Classification column is not one of the expected values
------------------------------------------------------------
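For reviewers who want to see the shape of the approach without opening the diff, here is a minimal sketch of a collecting handler. It is not the code in this PR: the class name SummaryHandlerSketch and its methods are hypothetical, and it assumes that grouping on the unformatted record.msg is enough to put messages of the same kind in one bucket.

```python
import logging
import sys
from collections import Counter


class SummaryHandlerSketch(logging.Handler):
    """Hypothetical stand-in for ValidationSummaryHandler (illustration only)."""

    def __init__(self):
        # Only records at WARNING or above reach emit().
        super().__init__(level=logging.WARNING)
        self.counts = {logging.ERROR: Counter(), logging.WARNING: Counter()}

    def emit(self, record):
        # Assumption: group on the unformatted message template, so records
        # that differ only in their arguments land in the same bucket.
        bucket = self.counts.get(record.levelno)
        if bucket is not None:  # e.g. CRITICAL records are ignored here
            bucket[str(record.msg)] += 1

    def format_summary(self):
        rule = "-" * 60
        sections = []
        for levelno, title in ((logging.ERROR, "ERRORS"),
                               (logging.WARNING, "WARNINGS")):
            bucket = self.counts[levelno]
            if not bucket:
                continue
            section = ["%s (%d):" % (title, sum(bucket.values()))]
            # most_common() yields the messages sorted by descending count.
            for message, count in bucket.most_common():
                section.append("  [%d] %s" % (count, message))
            sections.append("\n".join(section))
        return "\n".join([rule, "VALIDATION SUMMARY", rule,
                          "\n\n".join(sections), rule])

    def print_summary(self, stream=None):
        print(self.format_summary(), file=stream or sys.stderr)
```

Because the handler only counts records and never formats or forwards them, attaching it should leave the existing chronological log output untouched.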

Changes

  • New ValidationSummaryHandler class (~50 lines), defined after MaxLevelTrackingHandler
  • Two lines in main_validate() to register the handler
  • Summary flush and print before returning exit status
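The wiring described in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the change to main_validate() itself: CountingHandler, run_with_summary, and the returned status value are all hypothetical names and conventions chosen for the example.

```python
import logging
from collections import Counter


class CountingHandler(logging.Handler):
    """Tiny hypothetical collector standing in for the summary handler."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.counts = Counter()

    def emit(self, record):
        self.counts[(record.levelname, str(record.msg))] += 1


def run_with_summary(validate, logger, stream):
    """Register the collector, run validation, then print the summary."""
    collector = CountingHandler()
    logger.addHandler(collector)
    try:
        validate(logger)
    finally:
        # Detach so repeated runs do not stack handlers on the logger.
        logger.removeHandler(collector)
    # Print the summary before the status is returned to the caller.
    for (level, message), count in collector.counts.most_common():
        stream.write("[%d] %s: %s\n" % (count, level, message))
    # Illustrative status only: non-zero when any ERROR was collected.
    return 1 if any(level == "ERROR" for level, _ in collector.counts) else 0
```

Passing sys.stderr as the stream would match the behavior described in this PR; taking the stream as a parameter just makes the sketch easy to exercise in a test.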

Testing

  • Tested against study_various_issues, study_es_3, study_wr_clin — all produce correct grouped summaries
  • 156 existing unit tests pass (2 pre-existing failures unrelated to this change)
  • No changes to existing logging behavior — the summary is additive, printed to stderr alongside the existing Validation of data {status} line

Adds ValidationSummaryHandler that collects all errors and warnings
during validation and prints a categorized summary at the end, grouped
by severity and sorted by frequency. This makes it easier for curators
to triage validation output at a glance rather than scanning a
chronological log stream.

Example output:
  ERRORS (8):
    [4] Normal sample id not in list...
    [2] Value of numeric attribute is not a real number
  WARNINGS (10):
    [4] Unrecognized field in meta file
    [1] Missing clinical data for a patient...