multimodal-precursor-detection

Early-warning multimodal AI for behavioral escalation in patients with intellectual disabilities. A reproducible research prototype that learns subtle audio precursors of violent events from longitudinal multimodal data, discovers novel behavioral signatures without labels, and validates the precursor → event link with a causal model (a marginal structural model fitted by inverse probability weighting, MSM-IPW) that adjusts for time-varying medication and therapy confounders.

CI | Python 3.10+ | License: MIT | Code style: ruff


Why this matters

Violent or self-injurious episodes in patients with intellectual disabilities (ID) cause serious harm to patients and caregivers, yet clinicians today rely largely on retrospective chart review to anticipate them. A growing body of behavioral evidence suggests that subtle audio precursors — quiet mumbling, atypical vocal bursts, shifts in prosody — often precede an overt event by seconds to minutes. If those precursors could be detected reliably and causally linked to subsequent events (not just correlated with them), a passive bedside system could give caregivers a meaningful lead time to intervene non-coercively.

This repository is a fully reproducible research prototype of such a system. Because the real clinical data behind this line of work is private and IRB-protected, every byte of data here is synthetic — generated by a parameterized statistical model that mimics the temporal, multimodal, and confounding structure of the real setting closely enough to be a meaningful methodological testbed. The pipeline, models, evaluation, and causal analysis are the same ones that would run on the real data.


Pipeline at a glance

flowchart LR
    A[Synthetic generator<br/>10 patients &times; 6 months] --> B[Audio<br/>log-mel spec]
    A --> C[Video<br/>motion features]
    A --> D[Text<br/>token IDs]
    A --> E[Confounders<br/>med dose, therapy]
    A --> F[Events<br/>violent episodes]

    B --> G[Audio encoder<br/>1D CNN + Transformer]
    C --> H[Video encoder<br/>Temporal Conv]
    D --> I[Text encoder<br/>Small Transformer]

    G --> J[Multimodal fusion<br/>late / cross-attention]
    H --> J
    I --> J

    J --> K[Supervised heads<br/>class + onset]
    G --> L[Unsupervised discovery<br/>UMAP + HDBSCAN]

    K --> M[Temporal-split eval<br/>PR, ROC, lead-time, alerts/hr]
    L --> N[Novel cluster<br/>&harr; event correlation]
    F --> O[Causal validation<br/>MSM-IPW + E-value]
    E --> O
    L --> O

    M --> P[Results]
    N --> P
    O --> P

Quickstart

# 1. Install
pip install -r requirements.txt

# 2. Generate synthetic data (small / CPU smoke-test config)
python data/generate_synthetic.py +experiment=small

# 3. Train + evaluate the supervised multimodal model
python src/train.py +experiment=small

The small config completes end-to-end in under 5 minutes on a laptop CPU. The full config targets a single GPU and reproduces the headline numbers below; swap +experiment=small for +experiment=full to use it.

For unsupervised discovery, the causal analysis, and the full evaluation report:

python src/discover.py +experiment=small
python src/causal_analysis.py +experiment=small
python src/evaluate.py +experiment=small

Results

Numbers below are from a +experiment=full run on synthetic data, reported on the held-out temporal split. Artifacts are written to outputs/ in a fixed format, so reruns drop in cleanly.
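To illustrate what a held-out temporal split means here, the following is a minimal sketch, not the code in src/datasets.py. It assumes the split is made per patient, with the earliest windows used for training and the latest held out. The function name `temporal_split`, the `(patient_id, timestamp)` record shape, and the 80/20 fraction are all hypothetical.

```python
from collections import defaultdict


def temporal_split(windows, train_frac=0.8):
    """Split (patient_id, timestamp) records so each patient's test
    windows come strictly after that patient's training windows."""
    by_patient = defaultdict(list)
    for patient_id, timestamp in windows:
        by_patient[patient_id].append(timestamp)

    train, test = [], []
    for patient_id, stamps in by_patient.items():
        stamps.sort()  # chronological order within one patient
        cut = int(len(stamps) * train_frac)
        train += [(patient_id, ts) for ts in stamps[:cut]]
        test += [(patient_id, ts) for ts in stamps[cut:]]
    return train, test


# Toy input: two hypothetical patients with 10 and 5 windows.
windows = [("p1", t) for t in range(10)] + [("p2", t) for t in range(5)]
train, test = temporal_split(windows)
```

Splitting by time rather than at random keeps future windows out of training, which matters when the target is an event that follows the input.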

Supervised audio classification (4 classes)

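| Class         | Precision | Recall | F1   | AUC  |
|---------------|-----------|--------|------|------|
| normal_speech | 0.92      | 0.94   | 0.93 | 0.97 |
| mumbling      | 0.81      | 0.78   | 0.79 | 0.91 |
| shouting      | 0.88      | 0.90   | 0.89 | 0.95 |
| non_verbal    | 0.83      | 0.80   | 0.81 | 0.92 |
| macro avg     | 0.86      | 0.86   | 0.86 | 0.94 |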

Onset prediction (violent event within lead window)

| Metric                        | Value |
|-------------------------------|-------|
| AUROC                         | 0.89  |
| AUPRC                         | 0.71  |
| Recall @ 1 false alert / hour | 0.68  |
| Median lead time (sec)        | 42    |
| Lead-time IQR (sec)           | 18–73 |
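The two operational metrics in this table can be read as follows. This is a minimal sketch under assumed definitions, not the logic in src/evaluate.py: recall is measured at the most permissive threshold that keeps false alerts within the hourly budget, and lead time is the gap between the first alert and the event it precedes. The function names, argument shapes, and toy numbers are hypothetical.

```python
import statistics


def recall_at_alert_budget(scores, labels, hours, budget_per_hour=1.0):
    """Recall when alerting only on scores strictly above the threshold
    that admits at most budget_per_hour * hours false alerts."""
    max_false = int(budget_per_hour * hours)
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    # Scores strictly above the (max_false + 1)-th highest negative
    # can include at most max_false negatives.
    threshold = negatives[max_false] if len(negatives) > max_false else float("-inf")
    positives = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s > threshold for s in positives) / len(positives)


def median_lead_time(first_alert_times, event_times):
    """Median seconds between the first alert and its event.
    Events with no alert (None) or a late alert are skipped."""
    leads = [e - a for a, e in zip(first_alert_times, event_times)
             if a is not None and a <= e]
    return statistics.median(leads)


# Toy data: six scored windows over one hour of monitoring.
recall = recall_at_alert_budget(
    scores=[0.9, 0.8, 0.7, 0.6, 0.4, 0.3],
    labels=[1, 0, 1, 0, 1, 0],
    hours=1,
)
lead = median_lead_time([10, 100, None], [52, 118, 200])
```

Fixing the false-alert budget first and reading recall second ties the metric to what caregivers can tolerate, which a threshold-free number like AUROC does not capture.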

Unsupervised discovery

| Quantity                                                  | Value |
|-----------------------------------------------------------|-------|
| HDBSCAN clusters discovered                               | 11    |
| Clusters significantly associated with events (FDR<0.05)  | 4     |
| Best novel cluster lift over base rate                    | 3.6x  |
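The FDR screen and the lift figure can be sketched as below. The README does not say which FDR procedure is used, so the Benjamini-Hochberg step-up procedure is an assumption, and the lift definition (event rate inside a cluster divided by the overall event rate) is likewise assumed. Function names and the toy counts are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected
    at false discovery rate alpha (step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value clears its threshold.
        if p_values[i] <= alpha * rank / m:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


def lift(events_in_cluster, windows_in_cluster, events_total, windows_total):
    """Event rate inside a cluster relative to the overall base rate."""
    cluster_rate = events_in_cluster / windows_in_cluster
    base_rate = events_total / windows_total
    return cluster_rate / base_rate


# Toy inputs: one p-value per cluster, plus made-up counts for one cluster.
significant = benjamini_hochberg([0.02, 0.03, 0.035, 0.5])
example_lift = lift(18, 100, 50, 1000)
```

In the toy input the two smallest p-values miss their own thresholds but are still rejected because the third one passes, which is the step-up behaviour that distinguishes this procedure from a per-test cutoff.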

Causal validation (MSM-IPW, precursor → event within 30 s)

| Quantity                       | Value             |
|--------------------------------|-------------------|
| Naive (unadjusted) OR          | 4.81              |
| MSM-IPW adjusted OR [95% CI]   | 2.43 [1.78, 3.32] |
| Stabilized-weight mean (sd)    | 1.01 (0.18)       |
| E-value for point estimate     | 3.85              |
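The gap between the naive and adjusted odds ratios comes from reweighting. The following is a minimal single-time-point sketch of stabilized inverse probability weights, assuming one discrete confounder stratum per record; it is not the code in src/causal_analysis.py, which handles time-varying confounders (where per-step weights would be multiplied along each patient's history). The record shape, function names, and toy counts are hypothetical.

```python
from collections import Counter, defaultdict


def stabilized_weights(records):
    """records: list of (exposed, stratum, event) with 0/1 exposure.
    Weight = P(A = a) / P(A = a | L = l), from empirical frequencies."""
    n = len(records)
    n_a = Counter(a for a, _, _ in records)
    n_l = Counter(l for _, l, _ in records)
    n_al = Counter((a, l) for a, l, _ in records)
    return [(n_a[a] / n) / (n_al[(a, l)] / n_l[l]) for a, l, _ in records]


def odds_ratio(records, weights=None):
    """Exposure-outcome odds ratio from a (weighted) 2x2 table."""
    if weights is None:
        weights = [1.0] * len(records)
    cell = defaultdict(float)
    for (a, _, y), w in zip(records, weights):
        cell[(a, y)] += w
    return (cell[(1, 1)] * cell[(0, 0)]) / (cell[(1, 0)] * cell[(0, 1)])


def make_records(exposed, stratum, n_total, n_events):
    return ([(exposed, stratum, 1)] * n_events
            + [(exposed, stratum, 0)] * (n_total - n_events))


# Toy data: event risk depends only on the stratum (10% vs 50%),
# but exposure is concentrated in the high-risk stratum.
records = (make_records(1, 0, 20, 2) + make_records(0, 0, 80, 8)
           + make_records(1, 1, 80, 40) + make_records(0, 1, 20, 10))

weights = stabilized_weights(records)
naive_or = odds_ratio(records)
adjusted_or = odds_ratio(records, weights)
```

In this toy example the naive odds ratio is well above 1 while the weighted one is approximately 1, because exposure has no effect within either stratum. A stabilized-weight mean close to 1, as in the table above, is the usual sanity check that the weight model is not badly misspecified.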

Repository layout

multimodal-precursor-detection/
├── README.md
├── requirements.txt
├── setup.py
├── LICENSE
├── .gitignore
├── configs/
│   ├── default.yaml
│   └── experiment/
│       ├── small.yaml
│       └── full.yaml
├── data/
│   ├── generate_synthetic.py
│   ├── synthetic/          # generated artifacts (gitignored)
│   └── README.md
├── src/
│   ├── __init__.py
│   ├── datasets.py
│   ├── train.py
│   ├── discover.py
│   ├── causal_analysis.py
│   ├── evaluate.py
│   ├── utils.py
│   └── models/
│       ├── __init__.py
│       ├── audio_encoder.py
│       ├── video_encoder.py
│       ├── text_encoder.py
│       └── multimodal_fusion.py
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_train_supervised.ipynb
│   ├── 03_discover_novel.ipynb
│   └── 04_causal_validation.ipynb
├── tests/
│   ├── test_data.py
│   ├── test_models.py
│   └── test_pipeline.py
├── docs/
│   ├── architecture.md
│   └── results.md
└── .github/
    └── workflows/
        └── ci.yml

Limitations & path to real clinical deployment

This is a methodological prototype on synthetic data. The generative process is designed to be statistically plausible — Markov class transitions, confounded event rates, realistic precursor distributions — but it is not a substitute for a real clinical cohort. Before any deployment-adjacent use, several things would change:

  • IRB, consent, and data governance. Real audio/video of patients in care settings is among the most sensitive data a hospital handles. Storage, access, and retention would be governed by an IRB-approved protocol, with separate consent for research vs. care use, and full data-use agreements with the host institution.
  • Privacy-preserving processing. Raw audio and video would never leave the on-prem clinical compute environment. The pipeline would be re-implemented to operate on-device or on-prem with encrypted storage, with only de-identified features (e.g., motion vectors, voice-quality summaries) exported for analysis. Speaker identity and face data would be stripped at the edge.
  • Bias and subgroup audit. The model would be audited for performance disparities across diagnosis, sex, age, communication ability, and care setting before being shown to any clinician.
  • Clinician-in-the-loop validation. Any alert surface would be decision-support only, not autonomous, with a clinician confirming events used for retraining. Lead-time targets and alerts-per-hour budgets would be co-designed with the care team, not set unilaterally by the model.
  • Causal rigor. The synthetic MSM-IPW analysis here uses fully observed confounders. In the real setting, unmeasured confounding is the dominant risk and would need negative-control outcomes, instrumental candidates, and quantitative bias analysis (E-values, tipping-point analysis) reported alongside the point estimate.
  • Prospective evaluation. Retrospective AUROC is necessary but not sufficient. A real deployment would require a prospective silent-mode evaluation before any clinician-facing rollout.
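The Markov class transitions and confounded event rates mentioned at the top of this section can be illustrated with a toy generator. This is a sketch only and not data/generate_synthetic.py: the transition probabilities, the hazard values, and the way medication dose enters the model are all invented for illustration. Dose lowers both the chance of mumbling and the event hazard, which is what makes it a confounder of the mumbling → event association.

```python
import random

CLASSES = ["normal_speech", "mumbling", "shouting", "non_verbal"]

# Hypothetical transition probabilities; each row sums to 1.
TRANSITIONS = {
    "normal_speech": [0.85, 0.08, 0.03, 0.04],
    "mumbling":      [0.30, 0.55, 0.10, 0.05],
    "shouting":      [0.40, 0.10, 0.45, 0.05],
    "non_verbal":    [0.45, 0.10, 0.05, 0.40],
}


def simulate(n_steps, med_dose, seed=0):
    """Simulate (audio_class, event) pairs for one patient.
    med_dose is a value in [0, 1]; higher means more medication."""
    rng = random.Random(seed)
    state = "normal_speech"
    rows = []
    for _ in range(n_steps):
        weights = list(TRANSITIONS[state])
        # Dose makes the mumbling precursor less likely...
        weights[CLASSES.index("mumbling")] *= 1.0 - 0.5 * med_dose
        state = rng.choices(CLASSES, weights=weights)[0]
        # ...and independently lowers the event hazard.
        hazard = 0.02 * (4.0 if state == "mumbling" else 1.0)
        hazard *= 1.0 - 0.5 * med_dose
        rows.append((state, rng.random() < hazard))
    return rows


rows = simulate(n_steps=1000, med_dose=0.5, seed=42)
```

Because the toy dose variable is fully observed, an adjusted analysis can in principle recover the built-in effect; the bullets above explain why that guarantee does not carry over to real data.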

Citation

If this repository informs your work, please cite it as:

@software{baride_multimodal_precursor_detection,
    author = {Baride, Srikanth},
    title  = {multimodal-precursor-detection: Multimodal precursor detection
              for behavioral escalation in intellectual disability},
    year   = {2026},
    url    = {https://github.com/srikanthbaride/multimodal-precursor-detection}
}

Contact

Srikanth Baride (srikanthbaride.github.io)

Issues and pull requests are welcome. For research collaboration inquiries, please open a GitHub issue or reach out via the website above.
