MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making
Zhi Rui Tam, Yun-Nung Chen — National Taiwan University, 2025
As large language models transition from text-based interfaces to audio interactions in clinical settings, they might introduce new vulnerabilities through paralinguistic cues in audio. We evaluated these models on 170 clinical cases, each synthesized into speech from 36 distinct voice profiles spanning variations in age, gender, and emotion. Our findings reveal a severe modality bias: surgical recommendations for audio inputs varied by as much as 35% compared to identical text-based inputs, with one model providing 80% fewer recommendations. Further analysis uncovered age disparities of up to 12% between young and elderly voices, which persisted in most models despite chain-of-thought prompting. While explicit reasoning successfully eliminated gender bias, the impact of emotion was not detected due to poor recognition performance. These results demonstrate that audio LLMs are susceptible to making clinical decisions based on a patient's voice characteristics rather than medical evidence.
The MedVoiceBias dataset contains 170 DDXPlus clinical cases synthesized into 36 voice profiles (age × gender × emotion) using Sesame-1B CSM. It is hosted on HuggingFace and loaded automatically by all evaluation scripts — no manual download required.
- Dataset: `theblackcat102/MedVoiceBias`
- Each split corresponds to one voice profile (e.g. `old_female`, `young_male`, `expresso_happy`)
- Fields: `qid`, `PATIENT_PROFILE`, `PATHOLOGY`, `audio`, `whisper_v3`
Voice profile categories:
| Category | Profiles |
|---|---|
| Age (CommonVoice) | young_female, young_male, old_female, old_male |
| Emotion (Expresso) | expresso_happy, expresso_laughing, expresso_sad, expresso_confused, expresso_enunciated, expresso_whisper |
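A minimal loading sketch using the HuggingFace `datasets` library; the split names come from the table above and the field names from the list above, while how the `audio` column decodes is an assumption:

```python
from datasets import load_dataset

# Each voice profile is a separate split of the dataset.
cases = load_dataset("theblackcat102/MedVoiceBias", split="old_female")

example = cases[0]
print(example["qid"], example["PATHOLOGY"])  # case id and ground-truth pathology
print(sorted(example.keys()))                # audio, PATHOLOGY, PATIENT_PROFILE, qid, whisper_v3
```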
```bash
pip install -r requirements.txt
```

Additional setup for local models:

```bash
# DeSTA2.5
pip install desta

# Qwen2.5-Omni
pip install qwen-omni-utils
```

API keys required (set as environment variables):

```bash
export GEMINI_API_KEY=...
export GCP_PROJECT_NAME=...   # for Gemini via Vertex AI
export OAI_AUDIO_KEY=...      # for GPT-4o-mini
```

All scripts are run as Python modules from the AudioDecision/ directory.
Cohort detection is a prerequisite: it confirms that a model can perceive age, gender, and emotion from audio before the bias experiments are run (Table 2).
```bash
python -m eval.eval_cohort_detection \
    --model_name gemini-2.5-flash \
    --profile_name old_female
```

Text column:

```bash
# Direct Answer (DA)
python -m eval.eval_surgery_text \
    --model_name gemini-2.5-flash \
    --profile_name old_female \
    --mode da

# Chain-of-Thought (CoT)
python -m eval.eval_surgery_text \
    --model_name gemini-2.5-flash \
    --profile_name old_female \
    --mode cot
```

Text + Profile column:

```bash
python -m eval.eval_surgery_text \
    --model_name gemini-2.5-flash \
    --profile_name old_female \
    --mode da \
    --with_profile
```

ASR column (row['whisper_v3']):

```bash
python -m eval.eval_surgery_asr \
    --model_name gemini-2.5-flash \
    --profile_name old_female \
    --mode da
```

Audio column:

```bash
# DA with audio
python -m eval.eval_surgery_audio_da \
    --model_name gemini-2.5-flash \
    --profile_name old_female \
    --use_audio

# CoT with audio
python -m eval.eval_surgery_audio_cot \
    --model_name gemini-2.5-flash \
    --profile_name old_female \
    --use_audio
```

Run all evaluations:

```bash
bash scripts/eval_all_models.sh
```

Generate the paper tables:

```bash
bash scripts/run_analysis.sh
```

Or individually:
```bash
python analysis/report_demographic_bias.py   # Tables 3 & 4: age/gender bias
```

Results are written to logging/ as .jsonl files.
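Each line of a log is one JSON record; a minimal sketch for inspecting the logs (only the logging/ location and the .jsonl format come from above, no per-record fields are assumed):

```python
import json
from pathlib import Path

# Count the records in every .jsonl log produced by the evaluation scripts.
for path in sorted(Path("logging").glob("*.jsonl")):
    with path.open() as f:
        records = [json.loads(line) for line in f if line.strip()]
    print(f"{path.name}: {len(records)} records")
```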
| --model_name | --series | Type |
|---|---|---|
| gpt-4o-mini | openai | API |
| gemini-2.0-flash | gemini | API |
| gemini-2.5-flash | gemini | API |
| Qwen/Qwen2.5-Omni-3B | qwen_omni | Local |
| Qwen/Qwen2.5-Omni-7B | qwen_omni | Local |
| mistralai/Voxtral-Mini-3B-2507 | voxtral | Local (vLLM) |
| DeSTA-ntu/DeSTA2.5-Audio-Llama-3.1-8B | desta_2_5 | Local |
```
AudioDecision/
├── llms/                               # Model interfaces
│   ├── utils.py                        # get_llm() factory + retry logic
│   ├── gemini.py                       # Google Gemini (Vertex AI)
│   ├── oai.py                          # OpenAI GPT-4o
│   ├── desta_2_5.py                    # DeSTA2.5-Audio
│   ├── qwen_omni.py                    # Qwen2.5-Omni
│   ├── qwen3_omni.py                   # Qwen3-Omni
│   └── voxtral.py                      # Voxtral (via vLLM)
├── eval/                               # Evaluation scripts
│   ├── utils.py                        # Shared constants & answer parsers
│   ├── eval_surgery_text.py            # Text / Text+Profile columns
│   ├── eval_surgery_asr.py             # ASR column (row['whisper_v3'])
│   ├── eval_surgery_audio_da.py        # Audio column, Direct Answer
│   ├── eval_surgery_audio_cot.py       # Audio column, Chain-of-Thought
│   ├── eval_surgery_audio_da_gt.py     # Ablation: audio + GT voice desc, DA
│   ├── eval_surgery_audio_cot_gt.py    # Ablation: audio + GT voice desc, CoT
│   ├── eval_surgery_audio_cot_pred.py  # Ablation: audio + predicted voice desc
│   └── eval_cohort_detection.py        # Demographic detection prerequisite
├── analysis/
│   └── report_demographic_bias.py      # Age & gender bias tables (Tables 3-4)
├── scripts/
│   ├── eval_all_models.sh              # Run all evaluations
│   └── run_analysis.sh                 # Generate paper tables
└── requirements.txt
```
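For orientation, llms/utils.py provides the get_llm() factory that pairs a --series value with a model wrapper and adds retry logic. The self-contained sketch below only illustrates that pattern; the class names, signatures, and retry policy are assumptions, not the repository's actual code:

```python
import time

# Illustrative series-keyed factory with retries, mirroring the role
# described for llms/utils.get_llm(); names and signatures are assumptions.
_REGISTRY = {}

def register(series):
    """Associate a model wrapper class with a --series value."""
    def wrap(cls):
        _REGISTRY[series] = cls
        return cls
    return wrap

@register("echo")
class EchoLLM:
    """Stand-in wrapper; real wrappers would call Gemini, OpenAI, Qwen, etc."""
    def __init__(self, model_name):
        self.model_name = model_name

    def generate(self, prompt):
        return f"[{self.model_name}] {prompt}"

def get_llm(series, model_name, max_retries=3, base_delay=1.0):
    """Build the wrapper for `series` and wrap .generate with exponential backoff."""
    llm = _REGISTRY[series](model_name)
    raw_generate = llm.generate

    def generate(*args, **kwargs):
        for attempt in range(max_retries):
            try:
                return raw_generate(*args, **kwargs)
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(base_delay * 2 ** attempt)

    llm.generate = generate
    return llm

if __name__ == "__main__":
    llm = get_llm("echo", "demo-model")
    print(llm.generate("Does this patient need surgery?"))
```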
```bibtex
@article{tam2025medvoicebias,
  title         = {MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making},
  author        = {Tam, Zhi Rui and Chen, Yun-Nung},
  year          = {2025},
  eprint        = {2511.06592},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}
```