
feat: validate clinical attribute values against the 255-character limit (#87) #158

Open
officialasishkumar wants to merge 1 commit into cBioPortal:main from officialasishkumar:feature/validate-clinical-attribute-value-length


@officialasishkumar

Summary

Clinical attribute values longer than 255 characters are silently truncated on load, with no feedback from either the validator or the loader. For example, in the aml_ohsu_2022 study the SURFACE_ANTIGENS_IMMUNOHISTOCHEMICAL_STAINS value for sample aml_ohsu_2022_2010_BA2304 is cut off mid-sentence, losing clinical detail.

The root cause is that clinical attribute values are stored in the clinical_patient.ATTR_VALUE and clinical_sample.ATTR_VALUE columns, both defined as varchar(255):

CREATE TABLE `clinical_patient` ( ... `ATTR_VALUE` varchar(255) NOT NULL, ... );
CREATE TABLE `clinical_sample`  ( ... `ATTR_VALUE` varchar(255) NOT NULL, ... );

Changes

  • Add MAX_CLINICAL_ATTRIBUTE_VALUE_LENGTH = 255, mirroring the existing MAX_SAMPLE_STABLE_ID_LENGTH constant.
  • In ClinicalValidator.checkLine() (the base class shared by SampleClinicalValidator and PatientClinicalValidator), raise an error when a value exceeds the maximum length, so the issue is caught during validation — before the data is loaded and truncated. This mirrors the existing 255-character check already applied to HGVSp_Short in the mutation validator.
  • Add a unit test and a data fixture (data_clin_value_too_long.txt) verifying the boundary: a value of exactly 255 characters is accepted, while a 256-character value is rejected (reported with the correct line and column number).
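As a rough illustration of the check described above, here is a minimal standalone sketch. It is not the code from this PR: the `find_overlong_values` function and the `LengthError` tuple are hypothetical names used only for this example, and the real check lives inside `ClinicalValidator.checkLine()` and reports through the validator's own logging. The sketch also assumes that Python's `len()` (which counts Unicode code points) is the right measure to compare against the column limit.

```python
from typing import List, NamedTuple, Sequence

# Mirrors the constant named in this PR; redefined here so the sketch is self-contained.
MAX_CLINICAL_ATTRIBUTE_VALUE_LENGTH = 255


class LengthError(NamedTuple):
    """Hypothetical record of one over-long value (not the validator's real type)."""
    line_number: int
    column_number: int  # 1-based, as in the reported error
    attribute: str
    length: int


def find_overlong_values(
    line_number: int,
    header: Sequence[str],
    values: Sequence[str],
    max_length: int = MAX_CLINICAL_ATTRIBUTE_VALUE_LENGTH,
) -> List[LengthError]:
    """Return one LengthError per value strictly longer than max_length."""
    errors = []
    for column_number, (attribute, value) in enumerate(zip(header, values), start=1):
        # A value of exactly max_length characters still fits, so only '>' is an error.
        if len(value) > max_length:
            errors.append(
                LengthError(line_number, column_number, attribute, len(value))
            )
    return errors


# Boundary behaviour: 255 characters pass, 256 characters are flagged.
header = ["SAMPLE_ID", "NOTES"]
accepted = find_overlong_values(6, header, ["S1", "x" * 255])
rejected = find_overlong_values(7, header, ["S2", "x" * 256])
print(accepted)
print(rejected)
```

The strict `>` comparison is what makes a 255-character value acceptable while a 256-character value is rejected, matching the boundary the new unit test exercises.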

This addresses the validator side of the issue. The validator is the recommended gatekeeper in the import workflow (validate → load), so once validation enforces the limit, the normal pipeline no longer reaches the silent truncation.

Test plan

Ran the full validator test suite (as in CI, python:3.9):

./test_scripts.sh
Ran 183 tests
OK (skipped=1)

The new test:

ClinicalValuesTestCase.test_clinical_attribute_value_too_long

Fixes #87

Clinical attribute values are stored in the `clinical_patient.ATTR_VALUE` and
`clinical_sample.ATTR_VALUE` columns, which are defined as varchar(255). When a
value exceeds this length it is silently truncated on load (for example, long
free-text columns such as SURFACE_ANTIGENS_IMMUNOHISTOCHEMICAL_STAINS in the
aml_ohsu_2022 study), resulting in loss of clinical detail without any feedback.

Add a check in ClinicalValidator.checkLine() that raises an error when any
clinical attribute value is longer than the maximum supported length, so the
problem is surfaced during validation before the data is loaded. The check
covers both sample- and patient-level clinical files. The 255-character limit
is exposed as the MAX_CLINICAL_ATTRIBUTE_VALUE_LENGTH constant, mirroring the
existing MAX_SAMPLE_STABLE_ID_LENGTH constant.

Adds a unit test (and a data fixture) verifying the boundary: a value of exactly
255 characters is accepted while a 256-character value is rejected.

Fixes cBioPortal#87
Development

Successfully merging this pull request may close these issues.

Add Warning/Error When Clinical Attribute Values Are Truncated Due to Length Limits
