Proposal: clinician led failure mechanism layer for MedHELM medical AI safety evaluation #27
goktugozkanmd
started this conversation in
Ideas
Replies: 1 comment
-
|
Thank you, Dr. Özkan, for your detailed proposal, and for engaging through this discussion prior to opening a pull request. We confirm that we are currently reviewing your submission, including the proposed failure mechanism framing, rubric structure, and potential integration pathways with MedHELM. Your careful positioning of this work as synthetic, clinician-led, and preliminary is appreciated. Once our review is complete, we will return with a formal outcome and specific guidance on the most appropriate next step within the MedHELM contribution process. We appreciate your contribution to advancing safe and rigorous medical AI evaluation. |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
-
Hello MedHELM maintainers,
I am a physician building the Medical AI Failure Atlas, an open source clinician led resource for medical language model safety evaluation. The current release is a single physician authored synthetic draft pending additional clinician review. It is not clinical advice, it is not clinical validation, and it is not a model ranking. All scenarios are synthetic and contain no patient data.
The public repository is here:
https://github.com/v0id-lab/medical-ai-failure-atlas
The larger direction is to develop a failure mechanism layer that can complement clinical task benchmarks. Instead of asking only whether a model recognizes the correct risk, the resource asks whether the model handles the boundary after risk recognition: does it avoid unsafe remote protocol detail, acknowledge missing patient variables, separate triage from bedside treatment protocols, and give safe actionable next steps.
I am posting this as a proposal and scoping question because the MedHELM repository has Discussions enabled, and the contribution guide asks that potential changes be discussed before a pull request when the change affects core functionality.
The first failure mechanism I would like to discuss is unsafe precision after correct escalation. A model can recognize urgent clinical risk, recommend escalation, and still give protocol level details that should depend on bedside assessment, missing patient variables, local policy, or specialist oversight.
At this stage I treat these as hypothesized failure mechanisms from clinical reasoning and preliminary triage, not as empirically validated categories.
The current public sample includes three synthetic examples:
I drafted a metric candidate called
remote_rescue_protocol_boundarywith five rubric dimensions:emergency_risk_recognitionunsafe_precision_avoidancemissing_variable_awarenesstriage_protocol_separationsafe_actionabilityThe relevant public files are:
data/medhelm_remote_rescue_metric_v0_1.jsondata/failure_atlas_external_sample_v0_1.jsonldocs/MEDHELM_CROSSWALK_DRAFT.mddocs/MEDHELM_REMOTE_RESCUE_BOUNDARY_METRIC_PACKAGE_DRAFT.mddata/scoring_rubric_v0_1.jsonMy main question is whether this direction would be useful to MedHELM as:
If this is directionally useful, I would value guidance on the minimum shape that would be worth preparing next: more synthetic cases, a clearer category crosswalk, a different rubric structure, or a specific contribution format.
I do not want to open a pull request before maintainer guidance, because the current resource is still a draft and I want the contribution shape to respect the MedHELM roadmap.
For reference, the MedHELM paper I am aligning with is Bedi et al., Nature Medicine 32, 943 to 951, 2026, DOI 10.1038/s41591-025-04151-2.
Thank you for building an open evaluation ecosystem for medical AI.
Göktug Özkan, MD
Department of Internal Medicine, Kutahya Emet Dr. Fazil Dogan State Hospital, Kutahya, Turkiye
ORCID: 0000-0002-5022-9124
Beta Was this translation helpful? Give feedback.
All reactions