Towards Temporally Explainable Dysarthric Speech Clarity Assessment

Accepted at Interspeech 2025

This repository contains the implementation code for our paper "Towards Temporally Explainable Dysarthric Speech Clarity Assessment," which presents a three-stage framework for automated mispronunciation evaluation in dysarthric speech.

Paper

Supplementary material

Overview

Our framework provides:

Overall clarity scoring of passages or utterances
Mispronunciation localization to identify regions with reduced intelligibility
Mispronunciation type classification into specific error categories

We systematically analyze pretrained Automatic Speech Recognition (ASR) models across these three stages, offering clinically relevant insights for automating actionable feedback in pronunciation assessment.

Framework Architecture

Audio Input → ASR Processing → Error Analysis → Clinical Feedback
    ↓              ↓              ↓              ↓
Stage 1:     Stage 2:        Stage 3:      Applications:
Clarity      Temporal        Error Type    Speech Therapy
Scoring      Localization    Classification    Support

Installation

Requirements

Python 3.8+
PyTorch 2.0+
OpenAI API key (for label categorization)

Setup

git clone https://github.com/augmented-human-lab/interspeech25_speechtherapy.git
cd interspeech25_speechtherapy
pip install -r requirements.txt

Usage Guide

Pipeline Overview

Our analysis pipeline consists of 7 sequential scripts that process audio data through the complete framework:

0_generate_asr_outputs.py → 1_generate_error_reports.py → 2_forced_alignment.py → 
3_llm_label_categorization.py → 4_error_localization.py → 5_evaluation_matrices.py → 
6_error_matrix_analysis.py

Script Descriptions

Stage 1: ASR Processing

0_generate_asr_outputs.py - Generate ASR transcriptions using Whisper models

python scripts/0_generate_asr_outputs.py

Processes audio files with multiple Whisper model sizes (tiny, base, small, medium, large, turbo)
Outputs timestamped JSON transcriptions
Input: Audio files (.wav)
Output: ASR results with word-level timestamps

1_generate_error_reports.py - Compare ASR outputs with ground truth transcripts

python scripts/1_generate_error_reports.py

Aligns ASR transcriptions with original transcripts
Identifies substitution, deletion, and insertion errors
Input: ASR JSON files, reference transcripts
Output: Detailed error reports with timestamps

Stage 2: Temporal Analysis

2_forced_alignment.py - Perform word-level forced alignment

python scripts/2_forced_alignment.py --audio_dir ./data/raw_audio --output_dir ./data/alignments

Uses Wav2Vec2 for precise word-level timing
Input: Audio files
Output: Word-level alignment files

3_llm_label_categorization.py - Categorize therapist labels using LLM

export OPENAI_API_KEY="your-api-key-here"
python scripts/3_llm_label_categorization.py --input_dir ./data/label/origin --output_dir ./data/label/grouped

Automatically categorizes therapist annotations into 5 error types:
- Substitution Errors
- Deletion Errors
- Insertion Errors
- Repetition Errors
- Prosodic Errors
Input: Raw therapist label files
Output: Categorized labels in Excel format

Stage 3: Evaluation and Analysis

4_error_localization.py - Compute temporal matching metrics

python scripts/4_error_localization.py --asr_dir ./evaluation/error_reports --label_dir ./data/label/grouped --output_dir ./evaluation/final_counts

Matches ASR-detected errors with therapist labels temporally
Calculates TP, FP, FN counts for each error type
Input: Error reports, grouped therapist labels
Output: Matching logs and count matrices

5_evaluation_matrices.py - Generate performance metrics and visualizations

python scripts/5_evaluation_matrices.py --input ./evaluation/final_counts/final_counts_with_matching_logs.xlsx --output_dir ./evaluation/figures

Computes precision, recall, and F1 scores
Generates comparison plots across models and error types
Input: TP/FP/FN count matrices
Output: Performance metrics and visualization plots

6_error_matrix_analysis.py - Create confusion matrices

python scripts/6_error_matrix_analysis.py --asr_dir ./evaluation/error_reports --label_dir ./data/label/grouped --output_dir ./evaluation/matrix --visualize

Generates confusion matrices between ASR and therapist classifications
Creates summary statistics by model and speaker
Input: Error reports, therapist labels
Output: Confusion matrices and heatmap visualizations

Running the Complete Pipeline

For a full analysis, run the scripts sequentially:

# Stage 1: ASR Processing
python scripts/0_generate_asr_outputs.py
python scripts/1_generate_error_reports.py

# Stage 2: Alignment and Categorization  
python scripts/2_forced_alignment.py
python scripts/3_llm_label_categorization.py

# Stage 3: Evaluation
python scripts/4_error_localization.py
python scripts/5_evaluation_matrices.py
python scripts/6_error_matrix_analysis.py --visualize

Key Findings

ASR-based clarity scores strongly correlate with dysarthria severity levels (r > 0.9)
Larger Whisper models achieve higher precision in temporal localization
Substitution and deletion errors are detected more accurately than prosodic errors
70.1% exact error match rate between ASR and therapist annotations

Repository Structure

├── scripts/                 # Analysis pipeline scripts
│   ├── 0_generate_asr_outputs.py
│   ├── 1_generate_error_reports.py
│   ├── 2_forced_alignment.py
│   ├── 3_llm_label_categorization.py
│   ├── 4_error_localization.py
│   ├── 5_evaluation_matrices.py
│   └── 6_error_matrix_analysis.py
├── plots/                   # Generated visualization plots
├── requirements.txt         # Python dependencies
└── README.md               # This file

Expected Data Structure

When using your own data, organize it as follows:

data/
├── raw_audio/              # Original audio recordings (.wav)
├── transcripts/            # Reference transcripts (grandfather.txt, rainbow.txt)
├── label/
│   ├── origin/            # Original therapist annotations (.txt)
│   └── grouped/           # LLM-categorized labels (.xlsx)
├── asr/                   # ASR output files (.json)
└── alignments/            # Forced alignment results (.txt)

Clinical Applications

This framework enables:

Automated speech assessment for patients with dysarthria
Targeted therapy recommendations based on specific error types
Progress tracking through objective clarity measurements
Scalable evaluation reducing therapist workload

Citation

If you use this code in your research, please cite our paper:

@inproceedings{your2025interspeech,
  title={Towards Temporally Explainable Dysarthric Speech Clarity Assessment},
  author={Seohyun Park*, Chitralekha Gupta*, Michelle Kah Yian Kwan, Xinhui Fung, Alexander Wenjun Yip, Suranga Nanayakkara (*equal contributors)},
  booktitle={Proceedings of Interspeech 2025},
  year={2025}
}

Contact

For questions about the code or methodology, please contact [email protected] .

Data Availability

The dysarthric speech dataset with expert therapist annotations will be made available for research purposes following institutional review and approval processes. Please check back for updates or contact us for information about data access.

License

This repository is licensed under the GNU General Public License Version 3.

Note: This repository currently contains the analysis scripts and visualization tools. The complete dataset will be released pending ethical approval and privacy considerations.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Towards Temporally Explainable Dysarthric Speech Clarity Assessment

Overview

Framework Architecture

Installation

Requirements

Setup

Usage Guide

Pipeline Overview

Script Descriptions

Stage 1: ASR Processing

Stage 2: Temporal Analysis

Stage 3: Evaluation and Analysis

Running the Complete Pipeline

Key Findings

Repository Structure

Expected Data Structure

Clinical Applications

Citation

Contact

Data Availability

License

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
docs		docs
plots		plots
scripts		scripts
README.md		README.md
requirements.txt		requirements.txt

augmented-human-lab/interspeech25_speechtherapy

Folders and files

Latest commit

History

Repository files navigation

Towards Temporally Explainable Dysarthric Speech Clarity Assessment

Overview

Framework Architecture

Installation

Requirements

Setup

Usage Guide

Pipeline Overview

Script Descriptions

Stage 1: ASR Processing

Stage 2: Temporal Analysis

Stage 3: Evaluation and Analysis

Running the Complete Pipeline

Key Findings

Repository Structure

Expected Data Structure

Clinical Applications

Citation

Contact

Data Availability

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages