Human Health Sounds

An interactive visualization to organize thousands of human health sounds via t-SNE

Example Visualization

About

Human health sounds — like coughing, sneezing, wheezing, and laughing — carry valuable diagnostic information. These sounds vary widely across individuals but can reveal deep insights into respiratory and overall health.

Analyzing these sounds purely through their acoustic properties offers an efficient tool for healthcare. For instance, a model could compare an individual's throat-clearing sound to typical throat-clearing patterns in healthy populations to potentially diagnose an illness.

This project takes a first step toward that goal by clustering human health sounds with machine learning. We organize thousands of human health sounds from the open-source VocalSound dataset across six classes: cough, sneeze, sniff, sigh, throat clearing, and laughter. The visualization is built entirely through unsupervised learning, in this case plain t-SNE. No labels (such as sound type or speaker identity) were provided; the resulting map is based purely on acoustic features. We observe that similar sounds naturally cluster together, demonstrating that even a simple unsupervised method can uncover clear structure given high-quality embeddings from an audio foundation model.

The project provides an interactive grid visualization of clustered audio clips. Users can click on images to view metadata, click and drag to play several related clips simultaneously, and filter by metadata to discover patterns.

Additionally, users can record their own audio and view similar clips in the grid. Such a tool enables healthcare experts to compare new audio against pre-existing data, revealing underlying patterns and supporting accurate diagnoses. Note that when the Flask server is not running locally, the first recording may take about a minute to process; subsequent recordings are processed quickly.
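Conceptually, this matching step is a nearest-neighbor search in embedding space. Below is a minimal sketch of that idea using cosine similarity over precomputed HeAR embeddings; the function name and array shapes are illustrative, not the exact logic in backend/match_audio.py.

import numpy as np

def find_similar_clips(query_embedding, stored_embeddings, top_k=5):
    """Return indices of the top_k stored clips most similar to the query.

    query_embedding: 1D array for the newly recorded clip.
    stored_embeddings: 2D array (n_clips, embedding_dim) of precomputed HeAR embeddings.
    """
    query = query_embedding / np.linalg.norm(query_embedding)
    stored = stored_embeddings / np.linalg.norm(stored_embeddings, axis=1, keepdims=True)
    similarities = stored @ query                         # cosine similarity per stored clip
    return np.argsort(similarities)[::-1][:top_k]         # most similar first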

A demo can be viewed from the following link: https://hishambhatti.github.io/human-health-sounds/

Usage

Here we describe the basic pipeline for transforming raw audio files into an effective visualization. If you want to create a similar visualization for a different audio dataset, follow the same instructions with your own audio folder.

human-health-sounds/
├── Notebooks/
│   ├── Audio_Processing.ipynb
│   ├── HeAR_embeddings.ipynb
│   ├── t-SNE_and_grid_clustering.ipynb
│   └── requirements.txt
├── ca-cough-ony/  # React frontend
│   ├── public/
│   ├── src/
│   └── package.json
├── backend/ # Flask server for audio recording feature
│   ├── models--google--hear/
│   ├── app.py
│   ├── match_audio.py
│   └── ...
└── README.md

Backend Setup (Python)

To set up the Python backend, enter the Notebooks folder by running cd Notebooks. Create a virtual environment and install the required dependencies.

Some commands to create a virtual environment

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

After installing dependencies, run the following notebooks in order, modifying the folder name to point to your audio data:

First, save your dataset locally. In our example, we have a source directory vs_release_16k/audio_16k.

  1. Audio_Preprocessing.ipynb: Run the cells for Pre-Processing
  • Pre-processes the human health audio data to create audio suitable for the HeAR model
  • Trims silence, removes short or quiet files, and caps the length of each clip
  • Creates spectrograms for each audio clip
  2. HeAR_embeddings.ipynb
  • Uses Google’s HeAR model (via Hugging Face) to generate embeddings
  • Tests the embeddings on the preprocessed data
  3. t-SNE_and_grid_clustering.ipynb (see the sketch after this list)
  • Runs the t-SNE algorithm to cluster the HeAR embeddings, searching over various perplexities
  • Runs the LAP solver to convert the t-SNE output into a 2D grid
  • Saves the output as a JSON file for the frontend visualization
  4. Audio_Preprocessing.ipynb: Run the cells for Post-Processing
  • Arranges spectrograms into a single large grid for the frontend visualization
  • Combines individual audio clips into a single file, adding start and end times to the metadata
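For orientation, here is a condensed sketch of what the clustering and grid-assignment step does, based on the description above: project the HeAR embeddings to 2D with t-SNE, then snap the points onto a square grid with a linear-assignment (LAP) solver. The grid size, perplexity, and choice of solver below are illustrative assumptions and may differ from the notebook.

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.manifold import TSNE

def embeddings_to_grid(embeddings, grid_side=32, perplexity=30, seed=0):
    """Project HeAR embeddings to 2D with t-SNE, then snap them onto a square grid.

    embeddings: array of shape (n_clips, embedding_dim), with n_clips <= grid_side**2.
    Returns an (n_clips, 2) array giving each clip's assigned (row, col) grid cell.
    """
    # 1. t-SNE: unsupervised 2D projection of the embeddings.
    xy = TSNE(n_components=2, perplexity=perplexity, random_state=seed).fit_transform(embeddings)
    xy = (xy - xy.min(0)) / (xy.max(0) - xy.min(0))            # normalize to [0, 1]

    # 2. Centers of the grid cells, also in [0, 1].
    ticks = np.linspace(0, 1, grid_side)
    cells = np.array([(r, c) for r in range(grid_side) for c in range(grid_side)])
    centers = np.stack([ticks[cells[:, 0]], ticks[cells[:, 1]]], axis=1)

    # 3. LAP: assign each point to a unique cell, minimizing total squared distance.
    #    (A dedicated solver such as lapjv is faster for very large grids.)
    cost = ((xy[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)
    return cells[cols[np.argsort(rows)]]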

Frontend Setup (React)

To build the client-side React app, make sure you are in the ca-cough-ony folder. Then install Node.js and run npm install.

Place the generated JSON file (either vocalsound_wav.json or vocalsound_mp3.json) into the ca-cough-ony/src folder. Copy the spectrogram grid (precomposed_grid_32.png), the combined audio file (all_sounds_combined.wav or all_sounds_combined.mp3), and the processed audio folders into the public/ directory. Then run:

npm run dev

Backend Server (for Audio Recording feature)

If you want to enable the audio recording feature locally (optional), enter the /backend folder by running cd backend from the main directory. Then, to start the Flask server, run:

flask run

Currently, the frontend connects to a backend hosted on Google Cloud. To connect to a local backend instead, uncomment the line const BACKEND_URL = "http://127.0.0.1:5000/get-grid-indices" in ca-cough-ony/src/components/RecordPanel.jsx.
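For reference, a minimal sketch of what a /get-grid-indices endpoint could look like is shown below. The route matches the URL above, but the request field name, the embeddings file, and the helpers embed_audio() and find_similar_clips() are hypothetical placeholders, not the actual code in backend/app.py.

import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical: precomputed HeAR embeddings saved by the notebooks.
stored_embeddings = np.load("embeddings.npy")              # shape (n_clips, embedding_dim)

@app.route("/get-grid-indices", methods=["POST"])
def get_grid_indices():
    audio_file = request.files["audio"]                    # clip recorded in the browser (assumed field name)
    embedding = embed_audio(audio_file)                    # hypothetical helper: run HeAR on the clip
    indices = find_similar_clips(embedding, stored_embeddings, top_k=5)  # e.g., the matching sketch above
    return jsonify({"grid_indices": [int(i) for i in indices]})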

Results

Below is a visualization of our generated t-SNE cloud and the LAP 2D grid for the processed VocalSound audio clips.

Example t-SNE Cloud Example Grid

As you can see, sound types are generally clustered together. The misgroupings often come from mislabelings in VocalSound or from poor audio quality. Feel free to explore them yourself!

Credit

Developed by Hisham Bhatti, working with Zhihan Zhang, at the Ubiquitous Computing Lab in the University of Washington's Paul G. Allen School of Computer Science & Engineering. We thank Jake Garrison for discussions.

This project was based on Bird Sounds by Google Creative Lab, but is built with modern tooling and designed so that others can test it with their own datasets. In particular, it takes inspiration from several related notebooks.

The core embedding model is Google’s HeAR model, available on Hugging Face

The dataset used is VocalSound, an open-source collection of human health sounds.

Built With

Backend Logic: Python, Jupyter

Backend Server (for audio recording): Flask, Google Cloud, Docker

Frontend: HTML, CSS, JavaScript, JSON

Frameworks: Node.js, React, Tailwind CSS, Vite

Libraries: Pandas, scikit-learn, Matplotlib, NumPy

Tools: Hugging Face, GitHub Pages

Disclaimer

We do not store audio information provided by users. This tool is for research only.
