feat: Add HiCu model and MIMIC4ICD10Coding task for ICD-10 coding #947
Open
matthew-ardi wants to merge 4 commits into sunlabuiuc:master
Pull request overview
This PR adds an end-to-end ICD-10 coding pipeline for MIMIC-IV by introducing a new task for extracting discharge-note text + ICD-10 labels, and a new HiCu model implementing hierarchical curriculum learning over an ICD-10 hierarchy.
Changes:
- Added `MIMIC4ICD10Coding` task with simple tokenization and ICD-10-only filtering.
- Added `HiCu` model (MultiResCNN encoder + hierarchical decoder + asymmetric loss) with ICD-10 hierarchy utilities.
- Added tests, API docs entries, and an example script demonstrating curriculum vs flat training.
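The "simple tokenization and ICD-10-only filtering" in the task can be pictured with a minimal sketch. It is not the PR's implementation: the function names (`tokenize`, `icd10_labels`, `build_sample`), the row layout (dicts with `icd_code` and `icd_version` keys), and the output keys are all hypothetical. Only the ICD-10 filter, the dedup, the whitespace split, and the 4000-token cap come from the PR description.

```python
def tokenize(text, max_tokens=4000):
    """Whitespace-tokenize a note and truncate it (4000 is the cap named in the PR)."""
    return text.split()[:max_tokens]


def icd10_labels(diagnosis_rows):
    """Keep ICD-10 codes only, deduplicated, in first-seen order.

    Assumption: each row is a dict with 'icd_code' and 'icd_version' keys.
    The version is compared as a string so that 10 and "10" both match.
    """
    codes = (
        row["icd_code"]
        for row in diagnosis_rows
        if str(row["icd_version"]) == "10"
    )
    # dict.fromkeys preserves insertion order, giving a deterministic dedup.
    return list(dict.fromkeys(codes))


def build_sample(visit_id, note_text, diagnosis_rows):
    """Return one sample per admission, or None if it has no text or no ICD-10 codes."""
    tokens = tokenize(note_text)
    labels = icd10_labels(diagnosis_rows)
    if not tokens or not labels:
        return None
    return {"visit_id": visit_id, "text": tokens, "icd_codes": labels}
```

Admissions coded only in ICD-9 produce no sample in this sketch, which is one plausible reading of "ICD-10-only filtering".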
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/core/test_mimic4_icd10_coding.py | Unit tests covering sample extraction, ICD-10 filtering, dedup, and tokenization behavior. |
| tests/core/test_hicu.py | Unit tests covering HiCu initialization, forward/backward, depth switching, weight transfer, ASL, and hierarchy builder. |
| pyhealth/tasks/medical_coding.py | Implements MIMIC4ICD10Coding task and _tokenize_clinical_text helper. |
| pyhealth/tasks/__init__.py | Exposes MIMIC4ICD10Coding in the public tasks namespace. |
| pyhealth/models/hicu.py | Adds the HiCu model, hierarchical decoder, encoder, ASL, and ICD-10 hierarchy utilities. |
| pyhealth/models/__init__.py | Exposes HiCu-related classes in the public models namespace. |
| examples/mimic4_icd10_coding_hicu.py | End-to-end runnable example for synthetic + real MIMIC-IV, including curriculum experiments. |
| docs/api/tasks/pyhealth.tasks.MIMIC4ICD10Coding.rst | New API doc page for the ICD-10 coding task. |
| docs/api/tasks.rst | Adds the ICD-10 coding task to the tasks API index. |
| docs/api/models/pyhealth.models.HiCu.rst | New API doc page for HiCu and its components. |
| docs/api/models.rst | Adds HiCu to the models API index. |
…ing, deterministic ordering, visit_id, and memory-efficient label mappings
Agent-Logs-Url: https://github.com/matthew-ardi/PyHealth/sessions/4752d079-651a-4fe1-9faa-3a2025813f50
Co-authored-by: matthew-ardi <25186507+matthew-ardi@users.noreply.github.com>
Contributor
Matthew Ardi · NetID `mardi2` · CS 598 DL4H, Spring 2026
Type
Model + Task (reuses existing `MIMIC4Dataset`).
Paper
Ren et al., HiCu: Leveraging Hierarchy for Curriculum Learning in Automated ICD Coding, ML4H 2022 — https://arxiv.org/abs/2208.02301
Summary
- `pyhealth.models.HiCu`: MultiResCNN encoder + per-label attention decoder with a 3-level ICD-10 hierarchy (chapter → 3-char category → full code). `set_depth(d)` activates depth `d` and copies parent weights into child positions so each curriculum stage starts warm. Uses Asymmetric Loss (Ben-Baruch et al. 2020).
- `pyhealth.tasks.MIMIC4ICD10Coding`: one sample per admission from discharge notes + `diagnoses_icd`. Filters to `icd_version == "10"` since MIMIC-IV mixes ICD-9 and ICD-10. Whitespace-tokenized, truncated to 4000 tokens.
- … `model.set_depth()` between stages), so the model class stays stateless and works with the existing `Trainer`.
Design decisions
- `scatter_add_` index buffer instead of a dense one-hot matrix (~6.6 MB → ~18 KB per depth on a 2,281-code vocab).
Verification
Verified on MIMIC-IV dev mode (1,000 patients / 907 samples / 2,281 codes) on an M2 Max with MPS.
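The warm start performed by `set_depth(d)` and the index-buffer design decision can be illustrated with a plain-Python sketch. This is not the PR's code: `build_parent_index` and `transfer_weights` are hypothetical helpers, the toy codes are made up for the example, and only the category → full-code step is shown, using the fact that an ICD-10 category is the first three characters of a code. The byte count assumes one 64-bit integer index per code, which happens to agree with the ~18 KB figure quoted above.

```python
def build_parent_index(child_codes, parent_codes):
    """For each full code, store the position of its 3-character category.

    The result is a single integer per child, instead of a
    (num_children x num_parents) one-hot matrix.
    """
    position = {code: i for i, code in enumerate(parent_codes)}
    return [position[code[:3]] for code in child_codes]


def transfer_weights(parent_weights, parent_index):
    """Initialise every child row as a copy of its parent's row (warm start)."""
    return [list(parent_weights[j]) for j in parent_index]


# Toy hierarchy (hypothetical codes, written without dots).
categories = ["E11", "I10"]
full_codes = ["E119", "E1165", "I10"]

parent_index = build_parent_index(full_codes, categories)
category_weights = [[0.1, 0.2], [0.3, 0.4]]  # one weight row per category
code_weights = transfer_weights(category_weights, parent_index)

# Size of the index buffer for a 2,281-code vocabulary,
# assuming 8-byte (int64) indices.
index_buffer_bytes = 2281 * 8
```

In a tensor library the same gather would typically be a single indexing operation over the weight matrix, so the buffer is the only extra state that has to be kept per depth.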
Ablation (dev-mode, for path verification)
Schedule
`5 / 10 / 30` epochs across depths.
These are training losses on a small dev subset; absolute numbers aren't comparable to published benchmarks, and ASL depresses the scale by construction. The useful signal is the ~3× gap between ASL and BCE, which matches the paper's motivation. Proper benchmarking would need a held-out test split and micro-F1 / macro-F1 / P@k.
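Why ASL "depresses the scale by construction" can be seen from a per-label sketch of the loss in Ben-Baruch et al. (2020). This is a scalar illustration only, not the PR's implementation. The hyperparameters (`gamma_pos=0`, `gamma_neg=4`, `margin=0.05`) are assumed values taken from common usage of the loss, since the PR text does not state which ones it uses.

```python
import math


def bce(p, y, eps=1e-8):
    """Binary cross-entropy for one label with predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def asl(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Asymmetric loss for one label (hyperparameter values are assumptions).

    Positives: focal-style term (1 - p)^gamma_pos * -log(p).
    Negatives: the probability is shifted down by `margin` and clipped at 0,
    then weighted by p_m^gamma_neg, so easy negatives contribute almost nothing.
    """
    if y == 1:
        p = max(p, eps)
        return -((1.0 - p) ** gamma_pos) * math.log(p)
    p_m = max(p - margin, 0.0)
    return -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))


# An easy negative below the margin is ignored entirely by ASL,
# while BCE still assigns it a small positive loss.
easy_negative_asl = asl(0.03, 0)
easy_negative_bce = bce(0.03, 0)
```

With thousands of labels per note and only a handful positive, most terms are easy negatives, so the summed ASL value sits well below the summed BCE value even when the two models rank codes similarly. That is why the two losses are not comparable on absolute scale.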