feat: Add HiCu model and MIMIC4ICD10Coding task for ICD-10 coding #947

Open
matthew-ardi wants to merge 4 commits into sunlabuiuc:master from matthew-ardi:hicu_auto_icd_coding

matthew-ardi (Contributor) commented Apr 5, 2026

Matthew Ardi · NetID mardi2 · CS 598 DL4H, Spring 2026

Type

Model + Task (reuses existing MIMIC4Dataset).

Paper

Ren et al., HiCu: Leveraging Hierarchy for Curriculum Learning in Automated ICD Coding, ML4H 2022 — https://arxiv.org/abs/2208.02301

Summary

  • pyhealth.models.HiCu — MultiResCNN encoder + per-label attention decoder with a 3-level ICD-10 hierarchy (chapter → 3-char category → full code). set_depth(d) activates depth d and copies parent weights into child positions so each curriculum stage starts warm. Uses Asymmetric Loss (Ben-Baruch et al. 2020).
  • pyhealth.tasks.MIMIC4ICD10Coding — one sample per admission from discharge notes + diagnoses_icd. Filters to icd_version == "10" since MIMIC-IV mixes ICD-9 and ICD-10. Whitespace-tokenized, truncated to 4000 tokens.
  • Curriculum training loop lives in the example (model.set_depth() between stages), so the model class stays stateless and works with the existing Trainer.
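
The hierarchy and warm start described above can be sketched in plain Python. This is an illustrative sketch, not the PR's tensor code: `build_levels` and `warm_start` are hypothetical names, weights are nested lists rather than parameters, and the first letter of a code stands in for its chapter (real ICD-10 chapters are code ranges, so this is a simplification).

```python
def build_levels(full_codes):
    """Return label vocabularies for three depths plus child-to-parent index maps."""
    full = sorted(set(full_codes))
    categories = sorted({code[:3] for code in full})
    # Assumption: first letter as a chapter proxy (real chapters are code ranges).
    chapters = sorted({cat[0] for cat in categories})
    cat_to_chapter = [chapters.index(cat[0]) for cat in categories]
    full_to_cat = [categories.index(code[:3]) for code in full]
    return [chapters, categories, full], [cat_to_chapter, full_to_cat]


def warm_start(parent_weights, child_to_parent):
    """Initialise each child row with a copy of its parent's row."""
    return [list(parent_weights[p]) for p in child_to_parent]


# MIMIC-IV stores ICD codes without the dot, e.g. "I2510" for I25.10.
codes = ["I10", "I2510", "E119", "E1165", "J189"]
levels, maps = build_levels(codes)
chapter_weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # rows for E, I, J
category_weights = warm_start(chapter_weights, maps[0])
print(levels[1])         # ['E11', 'I10', 'I25', 'J18']
print(category_weights)  # [[0.1, 0.2], [0.3, 0.4], [0.3, 0.4], [0.5, 0.6]]
```

Both `I10` and `I25` start from the chapter `I` row, which is the sense in which each curriculum stage starts warm.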

Design decisions

  • ICD-10 on MIMIC-IV instead of the paper's ICD-9 on MIMIC-III — current coding standard, newer dataset.
  • 3 depths (chapter / category / full) instead of 5. Covers the ICD-10 structure cleanly; extending to 5 is a follow-up.
  • Label remapping uses a 1-D scatter_add_ index buffer instead of a dense one-hot matrix (~6.6 MB → ~18 KB per depth on a 2,281-code vocab).
  • Dropped Poincaré embeddings — paper reports marginal gains, and it keeps the PR dependency-free.
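
The index-buffer remapping can be illustrated without tensors. The sketch below assumes the buffer stores one parent index per fine-grained code and that labels are multi-hot vectors; `coarsen_labels` is a hypothetical name, and the loop plays the role that `scatter_add_` would play on a tensor.

```python
def coarsen_labels(fine_labels, child_to_parent, num_parents):
    """Map a multi-hot vector over fine codes to a multi-hot vector over parents."""
    coarse = [0.0] * num_parents
    for child, value in enumerate(fine_labels):
        # Scatter-add: accumulate each child's label into its parent's slot.
        coarse[child_to_parent[child]] += value
    # Several active children can hit one parent, so clamp back to {0, 1}.
    return [min(v, 1.0) for v in coarse]


child_to_parent = [0, 0, 1, 2, 3]  # five fine codes, four parents
print(coarsen_labels([1, 1, 0, 0, 1], child_to_parent, 4))  # [1.0, 0.0, 0.0, 1.0]
```

Assuming 64-bit indices, one entry per code on a 2,281-code vocabulary is 2,281 × 8 bytes, roughly 18 KB, which is consistent with the figure quoted above.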

Verification

Verified on MIMIC-IV dev mode (1 000 patients / 907 samples / 2 281 codes) on an M2 Max with MPS.

Ablation (dev-mode, for path verification)

Schedule: 5 / 10 / 30 epochs across the three depths.

| Config | Final train loss |
| --- | --- |
| Curriculum + ASL | 0.0007 |
| Flat + ASL | 0.0002 |
| Curriculum + BCE | 0.0020 |
| Curriculum + ASL (100 filters) | 0.0001 |

These are training losses on a small dev subset — absolute numbers aren't comparable to published benchmarks, and ASL depresses the scale by construction. The useful signal is the ~3× gap between ASL and BCE, which matches the paper's motivation. Proper benchmarking would need a held-out test split and micro-F1 / macro-F1 / P@k.
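
To make the scale remark concrete, here is a per-element comparison of BCE and Asymmetric Loss as defined by Ben-Baruch et al. It is a minimal sketch: the `gamma_pos`, `gamma_neg`, and `margin` values are illustrative assumptions, not the settings used in this PR.

```python
import math

EPS = 1e-8  # keeps log() away from 0


def bce(p, y):
    """Binary cross-entropy for one predicted probability p and label y."""
    p = min(max(p, EPS), 1 - EPS)
    return -math.log(p) if y == 1 else -math.log(1 - p)


def asl(p, y, gamma_pos=1.0, gamma_neg=4.0, margin=0.05):
    """Asymmetric loss for one element (hyperparameters are assumed values)."""
    if y == 1:
        p = min(max(p, EPS), 1 - EPS)
        # Positive term: focal-style down-weighting of easy positives.
        return -((1 - p) ** gamma_pos) * math.log(p)
    # Negative term: shift the probability by the margin, then down-weight.
    p_m = min(max(p - margin, 0.0), 1 - EPS)
    return -(p_m ** gamma_neg) * math.log(1 - p_m)


for p in (0.1, 0.5, 0.9):
    print(p, round(bce(p, 0), 6), round(asl(p, 0), 6))
```

For every probability and label, the ASL term is the BCE-style log term multiplied by a factor in [0, 1] (with the negative probability also shifted down by the margin), so the ASL value never exceeds the BCE value for the same prediction.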

matthew-ardi marked this pull request as ready for review April 5, 2026 20:40
Copilot AI review requested due to automatic review settings April 5, 2026 20:40
matthew-ardi marked this pull request as draft April 5, 2026 20:41
Copilot AI left a comment

Pull request overview

This PR adds an end-to-end ICD-10 coding pipeline for MIMIC-IV by introducing a new task for extracting discharge-note text + ICD-10 labels, and a new HiCu model implementing hierarchical curriculum learning over an ICD-10 hierarchy.

Changes:

  • Added MIMIC4ICD10Coding task with simple tokenization and ICD-10-only filtering.
  • Added HiCu model (MultiResCNN encoder + hierarchical decoder + asymmetric loss) with ICD-10 hierarchy utilities.
  • Added tests, API docs entries, and an example script demonstrating curriculum vs flat training.
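
As a rough illustration of the task behaviour summarized above, the sketch below builds one sample from a discharge note and its `diagnoses_icd` rows. `build_sample` and the output keys are hypothetical, and dropping admissions with no ICD-10 codes is an assumption; only the `icd_code` / `icd_version` column names, the ICD-10 filter, whitespace tokenization, and the 4000-token cap come from the PR description.

```python
def build_sample(visit_id, note_text, diagnosis_rows, max_tokens=4000):
    """Return one admission-level sample, or None if nothing usable remains."""
    # Keep ICD-10 rows only; a sorted set gives dedup plus deterministic order.
    codes = sorted(
        {row["icd_code"] for row in diagnosis_rows if str(row["icd_version"]) == "10"}
    )
    tokens = note_text.split()[:max_tokens]  # whitespace tokenization + truncation
    if not codes or not tokens:
        return None  # assumption: skip admissions without ICD-10 labels or text
    return {"visit_id": visit_id, "text": tokens, "icd_codes": codes}


rows = [
    {"icd_code": "I10", "icd_version": "10"},
    {"icd_code": "4019", "icd_version": "9"},   # ICD-9 row, filtered out
    {"icd_code": "I10", "icd_version": "10"},   # duplicate, collapsed
]
sample = build_sample("v1", "Patient admitted  with chest pain", rows)
print(sample["icd_codes"])  # ['I10']
print(sample["text"])       # ['Patient', 'admitted', 'with', 'chest', 'pain']
```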

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/core/test_mimic4_icd10_coding.py | Unit tests covering sample extraction, ICD-10 filtering, dedup, and tokenization behavior. |
| tests/core/test_hicu.py | Unit tests covering HiCu initialization, forward/backward, depth switching, weight transfer, ASL, and hierarchy builder. |
| pyhealth/tasks/medical_coding.py | Implements MIMIC4ICD10Coding task and _tokenize_clinical_text helper. |
| pyhealth/tasks/__init__.py | Exposes MIMIC4ICD10Coding in the public tasks namespace. |
| pyhealth/models/hicu.py | Adds the HiCu model, hierarchical decoder, encoder, ASL, and ICD-10 hierarchy utilities. |
| pyhealth/models/__init__.py | Exposes HiCu-related classes in the public models namespace. |
| examples/mimic4_icd10_coding_hicu.py | End-to-end runnable example for synthetic + real MIMIC-IV, including curriculum experiments. |
| docs/api/tasks/pyhealth.tasks.MIMIC4ICD10Coding.rst | New API doc page for the ICD-10 coding task. |
| docs/api/tasks.rst | Adds the ICD-10 coding task to the tasks API index. |
| docs/api/models/pyhealth.models.HiCu.rst | New API doc page for HiCu and its components. |
| docs/api/models.rst | Adds HiCu to the models API index. |

Claude AI and others added 2 commits April 5, 2026 21:13
…ing, deterministic ordering, visit_id, and memory-efficient label mappings

Agent-Logs-Url: https://github.com/matthew-ardi/PyHealth/sessions/4752d079-651a-4fe1-9faa-3a2025813f50

Co-authored-by: matthew-ardi <25186507+matthew-ardi@users.noreply.github.com>
matthew-ardi marked this pull request as ready for review April 12, 2026 21:41

3 participants