Early-warning multimodal AI for behavioral escalation in patients with intellectual disabilities. A reproducible research prototype that learns subtle audio precursors of violent events from longitudinal multimodal data, discovers novel behavioral signatures without labels, and validates the precursor → event link with a marginal structural model fit by inverse probability weighting (MSM-IPW), which adjusts for time-varying medication and therapy confounders.
Violent or self-injurious episodes in patients with intellectual disabilities (ID) cause serious harm to patients and caregivers, yet clinicians today rely largely on retrospective chart review to anticipate them. A growing body of behavioral evidence suggests that subtle audio precursors — quiet mumbling, atypical vocal bursts, shifts in prosody — often precede an overt event by seconds to minutes. If those precursors could be detected reliably and causally linked to subsequent events (not just correlated with them), a passive bedside system could give caregivers a meaningful lead time to intervene non-coercively.
This repository is a fully reproducible research prototype of such a system. Because the real clinical data behind this line of work is private and IRB-protected, every byte of data here is synthetic — generated by a parameterized statistical model that mimics the temporal, multimodal, and confounding structure of the real setting closely enough to be a meaningful methodological testbed. The pipeline, models, evaluation, and causal analysis are the same ones that would run on the real data.
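
To make the idea of a parameterized generator concrete, here is a minimal sketch of one way to produce a class sequence with Markov transitions and an event rate confounded by medication dose. It is not the code in data/generate_synthetic.py: the transition matrix, the event rates, the `low_dose` flag, and the function name `simulate_patient` are all illustrative assumptions. Only the four class names are taken from the results table in this README.

```python
import random

# Class names taken from the results table; everything numeric below is made up.
CLASSES = ["normal_speech", "mumbling", "shouting", "non_verbal"]

# Hypothetical transition probabilities: row = current class, columns follow CLASSES.
TRANSITIONS = {
    "normal_speech": [0.90, 0.06, 0.02, 0.02],
    "mumbling":      [0.50, 0.35, 0.10, 0.05],
    "shouting":      [0.40, 0.10, 0.45, 0.05],
    "non_verbal":    [0.55, 0.10, 0.05, 0.30],
}

# Hypothetical per-step event probability for each audio class.
BASE_EVENT_RATE = {
    "normal_speech": 0.002,
    "mumbling": 0.02,
    "shouting": 0.05,
    "non_verbal": 0.01,
}


def simulate_patient(n_steps, seed):
    """Simulate one patient's audio classes, a binary confounder, and events."""
    rng = random.Random(seed)
    state = "normal_speech"
    rows = []
    for t in range(n_steps):
        # Assumed confounder: a low medication dose makes mumbling more likely
        # AND raises the event rate, so mumbling and events are confounded.
        low_dose = rng.random() < 0.3
        weights = list(TRANSITIONS[state])
        if low_dose:
            weights[1] *= 2.0  # index 1 is "mumbling"; choices() renormalizes
        state = rng.choices(CLASSES, weights=weights, k=1)[0]
        p_event = BASE_EVENT_RATE[state] * (3.0 if low_dose else 1.0)
        rows.append({
            "t": t,
            "audio_class": state,
            "low_dose": low_dose,
            "event": rng.random() < p_event,
        })
    return rows
```

Because the dose flag influences both the precursor class and the event rate, a naive comparison of event rates after mumbling versus other classes overstates the precursor's effect, which is the situation the causal analysis is meant to handle.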
flowchart LR
A[Synthetic generator<br/>10 patients × 6 months] --> B[Audio<br/>log-mel spec]
A --> C[Video<br/>motion features]
A --> D[Text<br/>token IDs]
A --> E[Confounders<br/>med dose, therapy]
A --> F[Events<br/>violent episodes]
B --> G[Audio encoder<br/>1D CNN + Transformer]
C --> H[Video encoder<br/>Temporal Conv]
D --> I[Text encoder<br/>Small Transformer]
G --> J[Multimodal fusion<br/>late / cross-attention]
H --> J
I --> J
J --> K[Supervised heads<br/>class + onset]
G --> L[Unsupervised discovery<br/>UMAP + HDBSCAN]
K --> M[Temporal-split eval<br/>PR, ROC, lead-time, alerts/hr]
L --> N[Novel cluster<br/>↔ event correlation]
F --> O[Causal validation<br/>MSM-IPW + E-value]
E --> O
L --> O
M --> P[Results]
N --> P
O --> P
# 1. Install
pip install -r requirements.txt
# 2. Generate synthetic data (small / CPU smoke-test config)
python data/generate_synthetic.py +experiment=small
# 3. Train + evaluate the supervised multimodal model
python src/train.py +experiment=small

The small config completes end-to-end in under 5 minutes on a laptop CPU. The full config targets a single GPU and reproduces the headline numbers below; swap +experiment=small for +experiment=full to use it.
For unsupervised discovery, the causal analysis, and the full evaluation report:
python src/discover.py +experiment=small
python src/causal_analysis.py +experiment=small
python src/evaluate.py +experiment=small

Numbers below are from a --config-name=full run on synthetic data, reported on the held-out temporal split. Artifacts are written to outputs/ and the format is fixed so reruns drop in cleanly.
Supervised audio classification (4 classes)
| Class | Precision | Recall | F1 | AUC |
|---|---|---|---|---|
| normal_speech | 0.92 | 0.94 | 0.93 | 0.97 |
| mumbling | 0.81 | 0.78 | 0.79 | 0.91 |
| shouting | 0.88 | 0.90 | 0.89 | 0.95 |
| non_verbal | 0.83 | 0.80 | 0.81 | 0.92 |
| macro avg | 0.86 | 0.86 | 0.86 | 0.94 |
Onset prediction (violent event within lead window)
| Metric | Value |
|---|---|
| AUROC | 0.89 |
| AUPRC | 0.71 |
| Recall @ 1 false alert / hour | 0.68 |
| Median lead time (sec) | 42 |
| Lead-time IQR (sec) | 18–73 |
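
The alert-budget and lead-time metrics above can be computed in several ways, and this README does not spell out the definitions that src/evaluate.py uses. The sketch below assumes one plausible set: an alert fires whenever a window's score reaches the threshold, an alert is false if no event onset follows within the lead window, and an event's lead time is measured from the earliest alert inside that window. The function names and these definitions are assumptions for illustration.

```python
from statistics import median, quantiles


def alert_metrics(times, scores, onsets, threshold, lead_window, duration_sec):
    """Return (recall, false alerts per hour, lead times) at one threshold."""
    alerts = [t for t, s in zip(times, scores) if s >= threshold]
    # An alert at time t is false if no onset falls in (t, t + lead_window].
    false_alerts = sum(
        1 for t in alerts if not any(t < o <= t + lead_window for o in onsets)
    )
    lead_times = []
    for o in onsets:
        prior = [t for t in alerts if o - lead_window <= t < o]
        if prior:
            lead_times.append(o - min(prior))  # earliest warning wins
    recall = len(lead_times) / len(onsets) if onsets else 0.0
    return recall, false_alerts / (duration_sec / 3600.0), lead_times


def recall_at_budget(times, scores, onsets, lead_window, duration_sec,
                     budget_per_hour=1.0):
    """Best recall over all thresholds whose false-alert rate fits the budget."""
    best = (0.0, None, [])
    for thr in sorted(set(scores), reverse=True):
        recall, fa_rate, leads = alert_metrics(
            times, scores, onsets, thr, lead_window, duration_sec
        )
        if fa_rate <= budget_per_hour and recall > best[0]:
            best = (recall, thr, leads)
    return best


def lead_time_summary(lead_times):
    """Median and interquartile range; needs at least two lead times."""
    q1, _, q3 = quantiles(lead_times, n=4, method="inclusive")
    return median(lead_times), (q1, q3)
```

`recall_at_budget` returns the recall, the threshold that achieved it, and the per-event lead times, which can be passed to `lead_time_summary`. Counting every above-threshold window as a separate alert is the strictest choice; a deployed system would more likely merge consecutive alerts, which lowers the false-alert rate.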
Unsupervised discovery
| Quantity | Value |
|---|---|
| HDBSCAN clusters discovered | 11 |
| Clusters significantly associated with events (FDR<0.05) | 4 |
| Best novel cluster lift over base rate | 3.6x |
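
For readers unfamiliar with the two quantities in this table, the sketch below shows how a cluster's lift over the base rate and a Benjamini-Hochberg FDR decision are typically computed. The per-cluster p-values are taken as given, since this README does not state which association test src/discover.py runs; the function names are hypothetical.

```python
def cluster_lift(cluster_ids, event_flags, cluster):
    """Event rate inside one cluster divided by the overall event rate."""
    base_rate = sum(event_flags) / len(event_flags)
    in_cluster = [e for c, e in zip(cluster_ids, event_flags) if c == cluster]
    return (sum(in_cluster) / len(in_cluster)) / base_rate


def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            rejected[i] = True
    return rejected
```

A lift of 3.6x means windows assigned to that cluster are followed by an event 3.6 times as often as windows overall. The FDR correction matters because testing 11 clusters at an uncorrected 0.05 level would be expected to flag some by chance alone.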
Causal validation (MSM-IPW, precursor → event within 30 s)
| Quantity | Value |
|---|---|
| Naive (unadjusted) OR | 4.81 |
| MSM-IPW adjusted OR [95% CI] | 2.43 [1.78, 3.32] |
| Stabilized-weight mean (sd) | 1.01 (0.18) |
| E-value for point estimate | 3.85 |
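
The gap between the naive and adjusted odds ratios comes from reweighting. The following is a deliberately simplified sketch of stabilized inverse probability weights: a single time point, one binary confounder, and empirical frequencies in place of a fitted propensity model. A real MSM multiplies such weights over time and models the exposure given the full confounder history. The toy counts and function names are invented for illustration and are unrelated to the numbers in the table.

```python
from collections import Counter


def stabilized_weights(exposure, confounder):
    """Weight = P(A = a) / P(A = a | L = l), from empirical frequencies."""
    n = len(exposure)
    marginal = Counter(exposure)
    joint = Counter(zip(exposure, confounder))
    strata = Counter(confounder)
    return [
        (marginal[a] / n) / (joint[(a, l)] / strata[l])
        for a, l in zip(exposure, confounder)
    ]


def weighted_odds_ratio(exposure, outcome, weights):
    """Odds ratio from the weighted 2x2 table of exposure by outcome."""
    cells = Counter()
    for a, y, w in zip(exposure, outcome, weights):
        cells[(a, y)] += w
    return (cells[(1, 1)] * cells[(0, 0)]) / (cells[(1, 0)] * cells[(0, 1)])


def expand(rows):
    """Turn (exposure, confounder, outcome, count) rows into flat lists."""
    exposure, confounder, outcome = [], [], []
    for a, l, y, count in rows:
        exposure += [a] * count
        confounder += [l] * count
        outcome += [y] * count
    return exposure, confounder, outcome


# Toy data: (precursor, low_dose, event, count). Within each dose stratum the
# event rate is identical with and without the precursor, so the true effect
# is null and any naive association is pure confounding.
TOY = [
    (1, 0, 1, 2), (1, 0, 0, 18), (0, 0, 1, 8), (0, 0, 0, 72),
    (1, 1, 1, 40), (1, 1, 0, 40), (0, 1, 1, 10), (0, 1, 0, 10),
]
```

On this toy data the unweighted odds ratio is about 3.3, while the weighted odds ratio is 1.0, recovering the null effect. The stabilized weights average to 1, which is the property that the weight mean reported in the table checks.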
multimodal-precursor-detection/
├── README.md
├── requirements.txt
├── setup.py
├── LICENSE
├── .gitignore
├── configs/
│   ├── default.yaml
│   └── experiment/
│       ├── small.yaml
│       └── full.yaml
├── data/
│   ├── generate_synthetic.py
│   ├── synthetic/              # generated artifacts (gitignored)
│   └── README.md
├── src/
│   ├── __init__.py
│   ├── datasets.py
│   ├── train.py
│   ├── discover.py
│   ├── causal_analysis.py
│   ├── evaluate.py
│   ├── utils.py
│   └── models/
│       ├── __init__.py
│       ├── audio_encoder.py
│       ├── video_encoder.py
│       ├── text_encoder.py
│       └── multimodal_fusion.py
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_train_supervised.ipynb
│   ├── 03_discover_novel.ipynb
│   └── 04_causal_validation.ipynb
├── tests/
│   ├── test_data.py
│   ├── test_models.py
│   └── test_pipeline.py
├── docs/
│   ├── architecture.md
│   └── results.md
└── .github/
    └── workflows/
        └── ci.yml
This is a methodological prototype on synthetic data. The generative process is designed to be statistically plausible — Markov class transitions, confounded event rates, realistic precursor distributions — but it is not a substitute for a real clinical cohort. Before any deployment-adjacent use, several things would change:
- IRB, consent, and data governance. Real audio/video of patients in care settings is among the most sensitive data a hospital handles. Storage, access, and retention would be governed by an IRB-approved protocol, with separate consent for research vs. care use, and full data-use agreements with the host institution.
- Privacy-preserving processing. Raw audio and video would never leave the on-prem clinical compute environment. The pipeline would be re-implemented to operate on-device or on-prem with encrypted storage, with only de-identified features (e.g., motion vectors, voice-quality summaries) exported for analysis. Speaker identity and face data would be stripped at the edge.
- Bias and subgroup audit. The model would be audited for performance disparities across diagnosis, sex, age, communication ability, and care setting before being shown to any clinician.
- Clinician-in-the-loop validation. Any alert surface would be decision-support only, not autonomous, with a clinician confirming events used for retraining. Lead-time targets and alerts-per-hour budgets would be co-designed with the care team, not set unilaterally by the model.
- Causal rigor. The synthetic MSM-IPW analysis here uses fully observed confounders. In the real setting, unmeasured confounding is the dominant risk and would need negative-control outcomes, instrumental candidates, and quantitative bias analysis (E-values, tipping-point analysis) reported alongside the point estimate.
- Prospective evaluation. Retrospective AUROC is necessary but not sufficient. A real deployment would require a prospective silent-mode evaluation before any clinician-facing rollout.
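
As a pointer for the quantitative bias analysis mentioned under causal rigor, the E-value has a closed form for a risk ratio: RR + sqrt(RR × (RR − 1)), with protective estimates inverted first. The sketch below implements only that formula. For an odds ratio with a common outcome, a conversion to an approximate risk ratio is usually applied first; which conversion src/causal_analysis.py applies is not stated in this README, so the function takes a risk ratio as input and is not claimed to reproduce the value in the results table.

```python
import math


def e_value(risk_ratio):
    """E-value for a risk ratio: RR + sqrt(RR * (RR - 1)), with RR >= 1."""
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    # Protective estimates (RR < 1) are inverted before applying the formula.
    rr = risk_ratio if risk_ratio >= 1 else 1.0 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1.0))
```

The result is the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need with both the precursor and the event to fully explain away the estimate. Applying the same function to the confidence limit closest to 1 gives the E-value for the interval.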
If this repository informs your work, please cite it as:
@software{baride_multimodal_precursor_detection,
author = {Baride, Srikanth},
title = {multimodal-precursor-detection: Multimodal precursor detection
for behavioral escalation in intellectual disability},
year = {2026},
url = {https://github.com/srikanthbaride/multimodal-precursor-detection}
}

Srikanth Baride (srikanthbaride.github.io)
Issues and pull requests are welcome. For research collaboration inquiries, please open a GitHub issue or reach out via the website above.