An interactive visualization to organize thousands of human health sounds via t-SNE
Human health sounds — like coughing, sneezing, wheezing, and laughing — carry valuable diagnostic information. These sounds vary widely across individuals but can reveal deep insights into respiratory and overall health.
Understanding these sounds purely through their acoustic properties offers an efficient tool for healthcare. For instance, a model can compare an individual's throat-clearing sound to typical patterns of throat clearing for healthy populations to potentially diagnose an illness.
This project takes a first step toward that goal by clustering human health sounds using machine learning. We organize thousands of human health sounds from the open-source VocalSound dataset into six classes: cough, sneeze, sniff, sigh, throat clearing, and laughter. This visualization is built entirely through unsupervised learning, in this case simply t-SNE. No labels (such as sound type or speaker identity) were provided; the resulting map is based purely on acoustic features. We observe that similar sounds naturally cluster together, demonstrating that even a simple unsupervised method can uncover clear structure given high-quality embeddings from audio foundation models.
The project provides an interactive grid visualization of clustered audio clips. Users can click on images to view metadata, click and drag to play several related clips simultaneously, and filter by metadata to discover patterns.
Additionally, users can record their own audio and view the most similar clips in the grid. Such a tool lets healthcare experts compare new audio with pre-existing data, revealing underlying patterns and supporting accurate diagnoses. Note that when the Flask server is not running locally, the first recording may take about a minute to process; subsequent recordings should be processed quickly.
A demo can be viewed from the following link: https://hishambhatti.github.io/human-health-sounds/
Here we describe the basic pipeline for transforming raw audio files into an effective visualization. If you want to create a similar visualization for a different audio dataset, follow these instructions with a different audio folder.
human-health-sounds/
├── Notebooks/
│   ├── Audio_Processing.ipynb
│   ├── HeAR_embeddings.ipynb
│   ├── t-SNE_and_grid_clustering.ipynb
│   └── requirements.txt
├── ca-cough-ony/              # React frontend
│   ├── public/
│   ├── src/
│   └── package.json
├── backend/                   # Flask server for audio recording feature
│   ├── models--google--hear/
│   ├── app.py
│   ├── match_audio.py
│   └── ...
└── README.md
To set up the Python environment for the notebooks, enter the Notebooks folder by running cd Notebooks. Create a virtual environment and install the required dependencies.
Example commands to create a virtual environment and install dependencies:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
After installing the dependencies, run the following notebooks in order, modifying the folder name to point at your audio data (minimal Python sketches of each stage follow the list below):
First, save your dataset locally. In our example, we have a source directory vs_release_16k/audio_16k.
- Audio_Preprocessing.ipynb: Run the cells for Pre-Processing
- Pre-processes the human health audio data to create audio suitable for the HeAR model
- Trims silence, removes short/quiet files, and caps the length of each clip
- Creates spectrograms for each audio clip
- HeAR_embeddings.ipynb
- Uses Google’s HeAR model (via Hugging Face) to generate embeddings
- Tests embeddings on the preprocessed data
- t-SNE_and_grid_clustering.ipynb
- Runs the t-SNE algorithm to cluster the HeAR embeddings, searching over various perplexities
- Runs the LAP solver to convert the t-SNE output into a 2D grid
- Saves the output as a JSON for the frontend visualization
- Audio_Preprocessing.ipynb: Run the cells for Post-Processing
- Arranges spectrograms into a single large grid for frontend visualization
- Combines individual audio clips into a single file, adding start and end times in metadata
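As a rough illustration of the pre-processing stage, here is a minimal sketch that trims silence, drops clips that are too short or too quiet, caps clip length, and renders a mel-spectrogram thumbnail per clip. The thresholds (TOP_DB, MIN_DURATION_S, MAX_DURATION_S, MIN_RMS) and output folder names are illustrative assumptions, not the notebook's actual settings.

```python
# Hedged sketch of the pre-processing stage; the pre-processing notebook is authoritative.
# All thresholds below are illustrative assumptions.
from pathlib import Path

import librosa
import matplotlib.pyplot as plt
import numpy as np
import soundfile as sf

SRC_DIR = Path("vs_release_16k/audio_16k")   # source directory from the README
OUT_DIR = Path("processed_audio")
SPEC_DIR = Path("spectrograms")
SR = 16_000            # HeAR expects 16 kHz audio
TOP_DB = 30            # silence-trimming threshold (assumed)
MIN_DURATION_S = 0.5   # drop clips shorter than this after trimming (assumed)
MAX_DURATION_S = 10.0  # cap clip length (assumed)
MIN_RMS = 1e-3         # drop clips that are essentially silent (assumed)

OUT_DIR.mkdir(exist_ok=True)
SPEC_DIR.mkdir(exist_ok=True)

for wav_path in sorted(SRC_DIR.glob("*.wav")):
    y, _ = librosa.load(wav_path, sr=SR, mono=True)

    # Trim leading/trailing silence and cap the clip length.
    y, _ = librosa.effects.trim(y, top_db=TOP_DB)
    y = y[: int(MAX_DURATION_S * SR)]

    # Skip clips that are too short or too quiet to be useful.
    if len(y) < MIN_DURATION_S * SR or np.sqrt(np.mean(y ** 2)) < MIN_RMS:
        continue

    sf.write(OUT_DIR / wav_path.name, y, SR)

    # Render a mel-spectrogram thumbnail for the grid visualization.
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_mels=64)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    plt.imsave(SPEC_DIR / f"{wav_path.stem}.png", mel_db, origin="lower", cmap="magma")
```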
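The embedding stage can be sketched as follows, assuming the HeAR model (google/hear on Hugging Face) has already been loaded as in HeAR_embeddings.ipynb and is exposed as a callable embed_fn mapping a batch of fixed-length 16 kHz waveforms to embedding vectors. The 2-second window is HeAR's documented input length; the notebook's loading and invocation code is authoritative.

```python
# Hedged sketch of the embedding stage. `embed_fn` stands in for the HeAR model
# loaded in HeAR_embeddings.ipynb (an assumption, not the real loading code):
# it should map a float32 array of shape (batch, CLIP_SAMPLES) to (batch, dim).
from pathlib import Path

import librosa
import numpy as np

SR = 16_000
CLIP_SECONDS = 2            # HeAR operates on 2-second windows
CLIP_SAMPLES = SR * CLIP_SECONDS

def load_fixed_length(path: Path) -> np.ndarray:
    """Load a processed clip and pad/crop it to exactly CLIP_SAMPLES samples."""
    y, _ = librosa.load(path, sr=SR, mono=True)
    if len(y) < CLIP_SAMPLES:
        y = np.pad(y, (0, CLIP_SAMPLES - len(y)))
    return y[:CLIP_SAMPLES].astype(np.float32)

def embed_dataset(audio_dir: str, embed_fn, batch_size: int = 32) -> tuple[list[str], np.ndarray]:
    """Embed every clip in audio_dir with the provided HeAR callable."""
    paths = sorted(Path(audio_dir).glob("*.wav"))
    names, embeddings = [p.name for p in paths], []
    for start in range(0, len(paths), batch_size):
        batch = np.stack([load_fixed_length(p) for p in paths[start:start + batch_size]])
        embeddings.append(np.asarray(embed_fn(batch)))
    return names, np.concatenate(embeddings, axis=0)

# Example usage (hear_model is the loaded HeAR callable, a placeholder name here):
# names, embs = embed_dataset("processed_audio", embed_fn=hear_model)
# np.save("hear_embeddings.npy", embs)
```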
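The 2D layout can be reproduced in spirit with scikit-learn's TSNE followed by a linear-assignment step that snaps each point to a unique grid cell. The sketch below uses SciPy's linear_sum_assignment as the LAP solver, a single assumed perplexity, and an assumed square grid; the notebook searches over several perplexities and may use a different solver.

```python
# Hedged sketch of t-SNE + linear-assignment gridding
# (t-SNE_and_grid_clustering.ipynb is the authoritative version).
import json

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from sklearn.manifold import TSNE

embs = np.load("hear_embeddings.npy")            # (n_clips, dim) HeAR embeddings
n = len(embs)
side = int(np.ceil(np.sqrt(n)))                  # assumed square grid

# 1) Project embeddings to 2-D with t-SNE (perplexity here is one illustrative value).
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embs)
xy = (xy - xy.min(axis=0)) / (xy.max(axis=0) - xy.min(axis=0))   # normalize to [0, 1]

# 2) Build the target grid positions, also normalized to [0, 1].
gx, gy = np.meshgrid(np.linspace(0, 1, side), np.linspace(0, 1, side))
grid = np.stack([gx.ravel(), gy.ravel()], axis=1)[:n]

# 3) Solve the linear assignment problem: each point gets a unique grid cell
#    while the total point-to-cell distance is minimized.
cost = cdist(xy, grid, metric="sqeuclidean")
rows, cols = linear_sum_assignment(cost)

# 4) Save the clip -> grid-cell mapping for the frontend.
assignment = {int(r): {"col": int(cols[i] % side), "row": int(cols[i] // side)}
              for i, r in enumerate(rows)}
with open("grid_assignment.json", "w") as f:
    json.dump(assignment, f)
```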
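Finally, a rough sketch of the post-processing stage: pasting the spectrogram thumbnails into one large sheet and concatenating the clips into a single audio file while recording each clip's start and end time. Thumbnail size, grid side length, and intermediate file names here are assumptions; the notebook's Post-Processing cells produce the actual precomposed_grid_32.png and combined audio used by the frontend.

```python
# Hedged sketch of the post-processing stage; the notebook's Post-Processing
# cells define the real layout and file names.
import json
from pathlib import Path

import numpy as np
import soundfile as sf
from PIL import Image

SR = 16_000
THUMB = 32                     # spectrogram thumbnail edge length in pixels (assumed)
SIDE = 32                      # grid side length (assumed)

with open("grid_assignment.json") as f:
    assignment = json.load(f)  # clip index -> {"row": r, "col": c}

clips = sorted(Path("processed_audio").glob("*.wav"))

# 1) Paste each clip's spectrogram into its assigned cell of one big image.
sheet = Image.new("RGB", (SIDE * THUMB, SIDE * THUMB))
for idx, cell in assignment.items():
    thumb = Image.open(Path("spectrograms") / f"{clips[int(idx)].stem}.png")
    thumb = thumb.convert("RGB").resize((THUMB, THUMB))
    sheet.paste(thumb, (cell["col"] * THUMB, cell["row"] * THUMB))
sheet.save("precomposed_grid_32.png")

# 2) Concatenate all clips into one file, recording start/end times for playback.
audio, metadata, cursor = [], [], 0.0
for clip in clips:
    y, _ = sf.read(clip, dtype="float32")
    audio.append(y)
    metadata.append({"file": clip.name, "start": cursor, "end": cursor + len(y) / SR})
    cursor += len(y) / SR
sf.write("all_sounds_combined.wav", np.concatenate(audio), SR)
with open("combined_metadata.json", "w") as f:   # assumed metadata file name
    json.dump(metadata, f)
```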
To build the client-side React app, make sure you are in the ca-cough-ony folder. Then install Node.js and run npm install.
Place the generated JSON file (either vocalsound_wav.json or vocalsound_mp3.json) into the ca-cough-ony/src folder. Copy the spectrogram grid (precomposed_grid_32.png), the combined audio file (all_sounds_combined.wav or all_sounds_combined.mp3), and the processed audio folders into the public/ directory. Then run:
npm run dev
If you want to enable the audio recording feature locally (optional), enter the backend folder by running cd backend from the main directory. Then, to start the Flask server, run:
flask run
Currently, the frontend connects to a backend hosted on Google Cloud. To connect locally instead, uncomment the line const BACKEND_URL = "http://127.0.0.1:5000/get-grid-indices" in ca-cough-ony/src/components/RecordPanel.jsx.
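For orientation, below is a minimal sketch of what such a /get-grid-indices endpoint might look like: it accepts an uploaded recording, embeds it, and returns the grid indices of the most similar clips by cosine similarity. The embed_fn placeholder, the hear_embeddings.npy file, and the "audio" form-field name are assumptions for illustration; the project's actual app.py and match_audio.py are authoritative.

```python
# Hedged sketch of a local /get-grid-indices endpoint; the real backend/app.py and
# match_audio.py may differ in route details, field names, and response shape.
import io

import librosa
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
SR = 16_000
CLIP_SAMPLES = SR * 2                            # HeAR's 2-second input window
DATASET_EMBS = np.load("hear_embeddings.npy")    # embeddings in grid order (assumed file)

def embed_fn(batch: np.ndarray) -> np.ndarray:
    """Placeholder for the HeAR model that the backend loads."""
    raise NotImplementedError("Load the HeAR model here (see HeAR_embeddings.ipynb).")

@app.route("/get-grid-indices", methods=["POST"])
def get_grid_indices():
    # Decode the uploaded recording ("audio" field name is an assumption).
    raw = request.files["audio"].read()
    y, _ = librosa.load(io.BytesIO(raw), sr=SR, mono=True)

    # Pad/crop to the fixed window length and embed the recording.
    y = np.pad(y, (0, max(0, CLIP_SAMPLES - len(y))))[:CLIP_SAMPLES].astype(np.float32)
    query = np.asarray(embed_fn(y[np.newaxis, :]))[0]

    # Rank dataset clips by cosine similarity and return the closest grid indices.
    sims = DATASET_EMBS @ query / (
        np.linalg.norm(DATASET_EMBS, axis=1) * np.linalg.norm(query) + 1e-9
    )
    top = np.argsort(-sims)[:9].tolist()
    return jsonify({"indices": top})
```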
Below is a visualization of our generated t-SNE point cloud and the LAP 2D grid for the processed VocalSound audio clips.
(Left: t-SNE point cloud. Right: LAP 2D grid.)
As you can see, sound types are generally clustered together. The misgroupings often stem from mislabeled clips in VocalSound or from poor audio quality. Feel free to explore them yourself!
Developed by Hisham Bhatti, working with Zhihan Zhang, at the Ubiquitous Computing Lab at the University of Washington's Paul G. Allen School of Computer Science & Engineering. We thank Jake Garrison for discussion.
This project was based on Bird Sounds by Google Creative Lab, but designed with modern tooling and for others to test with their own datasets. In particular, below are some resources that I took inspiration from:
The core embedding model is Google’s HeAR model, available on Hugging Face
The dataset used is VocalSound, an open-source collection of human health sounds.
Backend Server (for Audio recording):
We do not store audio information provided by users. This tool is for research only.


