MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making
Zhi Rui Tam, Yun-Nung Chen — National Taiwan University, 2025
As large language models transition from text-based interfaces to audio interactions in clinical settings, they might introduce new vulnerabilities through paralinguistic cues in audio. We evaluated these models on 170 clinical cases, each synthesized into speech from 36 distinct voice profiles spanning variations in age, gender, and emotion. Our findings reveal a severe modality bias: surgical recommendations for audio inputs varied by as much as 35% compared to identical text-based inputs, with one model providing 80% fewer recommendations. Further analysis uncovered age disparities of up to 12% between young and elderly voices, which persisted in most models despite chain-of-thought prompting. While explicit reasoning successfully eliminated gender bias, the impact of emotion was not detected due to poor recognition performance. These results demonstrate that audio LLMs are susceptible to making clinical decisions based on a patient's voice characteristics rather than medical evidence.
The MedVoiceBias dataset contains 170 DDXPlus clinical cases synthesized into 36 voice profiles (age × gender × emotion) using Sesame-1B CSM. It is hosted on HuggingFace and loaded automatically by all evaluation scripts — no manual download required.
- Dataset: `theblackcat102/MedVoiceBias`
- Each split corresponds to one voice profile (e.g. `old_female`, `young_male`, `expresso_happy`)
- Fields: `qid`, `PATIENT_PROFILE`, `PATHOLOGY`, `audio`, `whisper_v3`
Voice profile categories:
| Category | Profiles |
|---|---|
| Age (CommonVoice) | young_female, young_male, old_female, old_male |
| Emotion (Expresso) | expresso_happy, expresso_laughing, expresso_sad, expresso_confused, expresso_enunciated, expresso_whisper |
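A minimal loading sketch using the HuggingFace `datasets` library; the split names come from the table above and the field names from the list above, while how the `audio` column decodes is an assumption:

```python
from datasets import load_dataset

# Each voice profile is a separate split of the dataset.
cases = load_dataset("theblackcat102/MedVoiceBias", split="old_female")

example = cases[0]
print(example["qid"], example["PATHOLOGY"])  # case id and ground-truth pathology
print(sorted(example.keys()))                # audio, PATHOLOGY, PATIENT_PROFILE, qid, whisper_v3
```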
```bash
pip install -r requirements.txt
```

Additional setup for local models:

```bash
# DeSTA2.5
pip install desta

# Qwen2.5-Omni
pip install qwen-omni-utils
```

API keys required (set as environment variables):

```bash
export GEMINI_API_KEY=...
export GCP_PROJECT_NAME=...   # for Gemini via Vertex AI
export OAI_AUDIO_KEY=...      # for GPT-4o-mini
```

All scripts are run as Python modules from the AudioDecision/ directory.
Cohort detection is a prerequisite: it confirms that a model can perceive age, gender, and emotion from audio before the bias experiments are run (Table 2).
```bash
python -m eval.eval_cohort_detection \
    --model_name gemini-2.5-flash \
    --profile_name old_female
```

Text column:

```bash
# Direct Answer (DA)
python -m eval.eval_surgery_text \
    --model_name gemini-2.5-flash \
    --profile_name old_female \
    --mode da

# Chain-of-Thought (CoT)
python -m eval.eval_surgery_text \
    --model_name gemini-2.5-flash \
    --profile_name old_female \
    --mode cot
```

Text + Profile column:

```bash
python -m eval.eval_surgery_text \
    --model_name gemini-2.5-flash \
    --profile_name old_female \
    --mode da \
    --with_profile
```

ASR column (row['whisper_v3']):

```bash
python -m eval.eval_surgery_asr \
    --model_name gemini-2.5-flash \
    --profile_name old_female \
    --mode da
```

Audio column:

```bash
# DA with audio
python -m eval.eval_surgery_audio_da \
    --model_name gemini-2.5-flash \
    --profile_name old_female \
    --use_audio

# CoT with audio
python -m eval.eval_surgery_audio_cot \
    --model_name gemini-2.5-flash \
    --profile_name old_female \
    --use_audio
```

Run all evaluations:

```bash
bash scripts/eval_all_models.sh
```

Generate the paper tables:

```bash
bash scripts/run_analysis.sh
```

Or individually:
```bash
python analysis/report_demographic_bias.py   # Tables 3 & 4: age/gender bias
```

Results are written to logging/ as .jsonl files.
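Each line of a log is one JSON record; a minimal sketch for inspecting the logs (only the logging/ location and the .jsonl format come from above, no per-record fields are assumed):

```python
import json
from pathlib import Path

# Count the records in every .jsonl log produced by the evaluation scripts.
for path in sorted(Path("logging").glob("*.jsonl")):
    with path.open() as f:
        records = [json.loads(line) for line in f if line.strip()]
    print(f"{path.name}: {len(records)} records")
```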
| --model_name | --series | Type |
|---|---|---|
| gpt-4o-mini | openai | API |
| gemini-2.0-flash | gemini | API |
| gemini-2.5-flash | gemini | API |
| Qwen/Qwen2.5-Omni-3B | qwen_omni | Local |
| Qwen/Qwen2.5-Omni-7B | qwen_omni | Local |
| mistralai/Voxtral-Mini-3B-2507 | voxtral | Local (vLLM) |
| DeSTA-ntu/DeSTA2.5-Audio-Llama-3.1-8B | desta_2_5 | Local |
```
AudioDecision/
├── llms/                               # Model interfaces
│   ├── utils.py                        # get_llm() factory + retry logic
│   ├── gemini.py                       # Google Gemini (Vertex AI)
│   ├── oai.py                          # OpenAI GPT-4o
│   ├── desta_2_5.py                    # DeSTA2.5-Audio
│   ├── qwen_omni.py                    # Qwen2.5-Omni
│   ├── qwen3_omni.py                   # Qwen3-Omni
│   └── voxtral.py                      # Voxtral (via vLLM)
├── eval/                               # Evaluation scripts
│   ├── utils.py                        # Shared constants & answer parsers
│   ├── eval_surgery_text.py            # Text / Text+Profile columns
│   ├── eval_surgery_asr.py             # ASR column (row['whisper_v3'])
│   ├── eval_surgery_audio_da.py        # Audio column, Direct Answer
│   ├── eval_surgery_audio_cot.py       # Audio column, Chain-of-Thought
│   ├── eval_surgery_audio_da_gt.py     # Ablation: audio + GT voice desc, DA
│   ├── eval_surgery_audio_cot_gt.py    # Ablation: audio + GT voice desc, CoT
│   ├── eval_surgery_audio_cot_pred.py  # Ablation: audio + predicted voice desc
│   └── eval_cohort_detection.py        # Demographic detection prerequisite
├── analysis/
│   └── report_demographic_bias.py      # Age & gender bias tables (Tables 3-4)
├── scripts/
│   ├── eval_all_models.sh              # Run all evaluations
│   └── run_analysis.sh                 # Generate paper tables
└── requirements.txt
```
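For orientation, llms/utils.py provides the get_llm() factory that pairs a --series value with a model wrapper and adds retry logic. The self-contained sketch below only illustrates that pattern; the class names, signatures, and retry policy are assumptions, not the repository's actual code:

```python
import time

# Illustrative series-keyed factory with retries, mirroring the role
# described for llms/utils.get_llm(); names and signatures are assumptions.
_REGISTRY = {}

def register(series):
    """Associate a model wrapper class with a --series value."""
    def wrap(cls):
        _REGISTRY[series] = cls
        return cls
    return wrap

@register("echo")
class EchoLLM:
    """Stand-in wrapper; real wrappers would call Gemini, OpenAI, Qwen, etc."""
    def __init__(self, model_name):
        self.model_name = model_name

    def generate(self, prompt):
        return f"[{self.model_name}] {prompt}"

def get_llm(series, model_name, max_retries=3, base_delay=1.0):
    """Build the wrapper for `series` and wrap .generate with exponential backoff."""
    llm = _REGISTRY[series](model_name)
    raw_generate = llm.generate

    def generate(*args, **kwargs):
        for attempt in range(max_retries):
            try:
                return raw_generate(*args, **kwargs)
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(base_delay * 2 ** attempt)

    llm.generate = generate
    return llm

if __name__ == "__main__":
    llm = get_llm("echo", "demo-model")
    print(llm.generate("Does this patient need surgery?"))
```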
```bibtex
@article{tam2025medvoicebias,
  title         = {MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making},
  author        = {Tam, Zhi Rui and Chen, Yun-Nung},
  year          = {2025},
  eprint        = {2511.06592},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}
```