Phoneme Detection and Classifier Model Codes #238

51 changes: 51 additions & 0 deletions applications/KWS_Phoneme/README.md
@@ -0,0 +1,51 @@
# Phoneme-based Keyword Spotting (KWS)

# Project Description
There are two major issues in existing KWS systems: (a) they are not robust to heavy background noise and random utterances, and (b) they require a lot of collected data, which makes adding a new keyword cumbersome. Tackling these issues from a different perspective, we propose a two-stage scheme: a model first predicts phonemes, which are in turn used for phoneme-based keyword classification.

First, we train a phoneme classification model that produces the phoneme transcription of an input speech snippet. For training this phoneme classifier, we use a large public speech dataset such as LibriSpeech. The public dataset can be aligned (that is, we can obtain phoneme labels for each speech snippet) using the Montreal Forced Aligner. We also add reverberation and additive noise to the speech samples from the public dataset so that the phoneme classifier is robust to various accents, background noise, and varied environments. In this project, we predict a phoneme every 10 ms, which is the standard frame rate. You can find the aligned LibriSpeech dataset we used for training here.

In the second stage, we use the predicted phoneme outputs from the phoneme classifier to predict the keyword. We train a 1-layer FastGRNN classifier that takes the phoneme transcription as input and predicts the keyword. Since the phoneme classifier is trained to account for diverse accents, background noise, and environments, the keyword classifier can be trained on a small number of Text-To-Speech (TTS) samples generated using any standard TTS API from cloud services such as Azure, Google Cloud, or AWS.

This gives two advantages: (a) because the phoneme model absorbs the variation in accents and background noise, the lightweight keyword classifier needs only a small number of keyword samples, and (b) empirically, this method detected keywords from as far as 9 ft away. Further, the phoneme model is small, around 250k parameters, and can fit on a Cortex-M7 micro-controller.
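
As a concrete illustration, here is a minimal sketch of the two-stage inference flow in PyTorch. The class definitions, feature dimensions, and the plain GRU standing in for FastGRNN are all illustrative assumptions, not this PR's code; the actual models live in train_phoneme.py and train_classifier.py.
```
import torch
import torch.nn as nn

# Illustrative dimensions only; not taken from this PR.
FEAT_DIM, NUM_PHONEMES, NUM_KEYWORDS = 80, 41, 30

class PhonemeClassifier(nn.Module):
    """Stage 1 stand-in: frame-level phoneme posteriors, one frame per 10 ms."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(FEAT_DIM, 128, batch_first=True)
        self.out = nn.Linear(128, NUM_PHONEMES)

    def forward(self, x):                 # x: (batch, frames, FEAT_DIM)
        h, _ = self.rnn(x)
        return self.out(h)                # (batch, frames, NUM_PHONEMES)

class KeywordClassifier(nn.Module):
    """Stage 2 stand-in: a GRU in place of the 1-layer FastGRNN."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(NUM_PHONEMES, 64, batch_first=True)
        self.out = nn.Linear(64, NUM_KEYWORDS)

    def forward(self, p):                 # p: (batch, frames, NUM_PHONEMES)
        _, h = self.rnn(p)
        return self.out(h[-1])            # (batch, NUM_KEYWORDS)

speech_frames = torch.randn(1, 100, FEAT_DIM)   # ~1 s of features at a 10 ms hop
with torch.no_grad():
    posteriors = PhonemeClassifier()(speech_frames).softmax(-1)
    print(KeywordClassifier()(posteriors).argmax(-1))
```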

# Training the Phoneme Classifier
1) Train a phoneme classification model on a public speech dataset such as LibriSpeech.
2) The training speech dataset can be labelled using the Montreal Forced Aligner.
3) Speech snippets are convolved with reverberation files, and additive noise drawn from YouTube or other open sources is mixed in.
4) We also add white Gaussian noise at various SNRs; a sketch of this augmentation follows the list.
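
Below is a hedged sketch of steps 3 and 4, assuming 16 kHz mono float arrays and SNR-based scaling of the noise; the actual augmentation logic lives in the PR's data-loader and may differ in detail.
```
import numpy as np
from scipy.signal import fftconvolve

def augment(speech, rir, noise, snr_db, rir_chance=0.5, rng=np.random):
    """Illustrative reverb + additive-noise + white-Gaussian-noise augmentation."""
    if rng.random() < rir_chance:                        # step 3: reverberation
        speech = fftconvolve(speech, rir)[: len(speech)]
    noise = noise[: len(speech)]
    speech_pow = np.mean(speech ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))
    speech = speech + scale * noise                      # step 3: additive noise at snr_db
    wgn_snr_db = rng.choice([0, 5, 10, 25, 100])         # step 4: white Gaussian noise
    wgn_scale = np.sqrt(speech_pow / 10 ** (wgn_snr_db / 10))
    return speech + wgn_scale * rng.standard_normal(len(speech))

# Smoke test with synthetic signals
speech = np.random.randn(16000)             # 1 s of "speech" at 16 kHz
rir = np.exp(-np.linspace(0, 8, 2000))      # toy exponentially decaying impulse response
out = augment(speech, rir, np.random.randn(16000), snr_db=5)
```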

# Training the KWS Model
1) Our method takes the speech snippet as input and passes it through the phoneme classifier.
2) Keywords are detected by training a keyword classifier on the detected phonemes.
3) For training the keyword classifier, we use the Azure and Google Text-To-Speech APIs to generate the training data (keyword snippets).
4) For example, if you want to train a keyword classifier for the keywords in the Google30 dataset, generate TTS samples from the Azure/Google-Cloud/AWS API for each of the 30 keywords. The TTS samples for each keyword must be stored in a separate folder named after the keyword, as illustrated below; further details are given in the classifier-model-training use case.
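
For instance, using the folder names from the classifier-training command further below, a hypothetical layout would look like this (only two of the 30 keyword subfolders shown):
```
/path/to/train_and_test_data_folders/
    google30_azure_tts/
        yes/    <- .wav TTS samples for the keyword "yes"
        no/
        ...
    google30_google_tts/
        yes/
        no/
        ...
    google30_test/
        yes/
        no/
        ...
```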

# Sample Use Cases

## Phoneme Model Training
The following command can be used to instantiate and train the phoneme model.
```
python train_phoneme.py --base_path=/path/to/librispeech_data/ --rir_base_path=/path/to/reverb_files/ --additive_base_path=/path/to/additive_noises/ --snr_samples="0,5,10,25,100,100" --rir_chance=0.5
```
Some important command line arguments:
1) base_path : Path to the speech data folder. The data in this folder should be organized as expected by the data-loader code here.
2) rir_base_path, additive_base_path : Paths to the reverb and additive-noise files.
3) snr_samples : List of SNRs at which the additive noise is mixed in.
4) rir_chance : Probability of applying the reverberation operation to a given speech sample; see the sketch after this list.
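
A hedged reading of how these two flags are presumably consumed per training sample; the real sampling logic is in the data-loader, and treating an SNR of 100 as effectively clean is our interpretation:
```
import random

snr_samples = [int(s) for s in "0,5,10,25,100,100".split(",")]   # --snr_samples
snr_db = random.choice(snr_samples)    # listing 100 twice biases toward near-clean samples
apply_reverb = random.random() < 0.5   # --rir_chance=0.5
```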

## Classifier Model Training
The following command can be used to instantiate and train the classifier model.
```
python train_classifier.py --base_path=/path/to/train_and_test_data_folders/ --train_data_folders=google30_azure_tts,google30_google_tts --test_data_folders=google30_test --phoneme_model_load_ckpt=/path/to/checkpoint/x.pt --rir_base_path=/mnt/reverb_noise_sampled/ --additive_base_path=/mnt/add_noises_sampled/ --synth
```
Some important command line arguments:

1) base_path : Path to the train and test data folders.
2) train_data_folders, test_data_folders : Each of these folders should contain the .wav files for each keyword in a separate subfolder, according to the data-loader here.
3) phoneme_model_load_ckpt : Full path of the checkpoint file used to load weights into the instantiated phoneme model; see the sketch after this list.
4) rir_base_path, additive_base_path : Paths to the reverb and additive-noise files.
5) synth : Boolean flag specifying whether reverberation and noise addition should be applied.
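
For reference, a hedged sketch of what consuming phoneme_model_load_ckpt presumably looks like; the stand-in module, the checkpoint schema, and whether the phoneme model stays frozen during classifier training are all assumptions:
```
import torch
import torch.nn as nn

phoneme_model = nn.GRU(80, 128, batch_first=True)   # stand-in for the real phoneme model
state = torch.load("/path/to/checkpoint/x.pt", map_location="cpu")
phoneme_model.load_state_dict(state)                # restore the pre-trained weights
phoneme_model.eval()                                # assumed fixed front end during classifier training
```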

Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT license.
22 changes: 22 additions & 0 deletions applications/KWS_Phoneme/auxiliary_files/README.md
@@ -0,0 +1,22 @@
# Python scripts to help download and down-sample the additive noise data from YouTube videos

Run the following command to download the CSV file that indexes the YouTube additive-noise data:

```
wget http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/balanced_train_segments.csv
```
After the CSV file has been downloaded, run the extraction script to download the actual audio data:
```
python download_youtube_data.py --csv_file=/path/to/csv_file.csv --target_folder=/path/to/target/folder/
```

Please check [Google's Audioset data page](https://research.google.com/audioset/download.html) for further details.

The downloaded files need to be converted to 16 kHz for our pipeline. Run the following:
```
python convert_sampling_rate.py --source_folder=/path/to/target/folder/ --target_folder=/path/to/target/16KHz_folder/ --fs=16000 --log_rate=100
```
The script can convert the sampling rate of any .wav file to the specified --fs; for our application we use 16 kHz only. --log_rate controls how often progress is logged: one line is printed every log_rate files.

Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT license.
45 changes: 45 additions & 0 deletions applications/KWS_Phoneme/auxiliary_files/convert_sampling_rate.py
@@ -0,0 +1,45 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import os
import librosa
import soundfile as sf
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--source_folder', default=None, required=True)
parser.add_argument('--target_folder', default=None, required=True)
parser.add_argument('--fs', type=int, default=16000)
parser.add_argument('--log_rate', type=int, default=1000)
args = parser.parse_args()

source_folder = args.source_folder
target_folder = args.target_folder
fs = args.fs
log_rate = args.log_rate
print(f'Source Folder :: {source_folder}\nTarget Folder :: {target_folder}\nSampling Frequency :: {fs}', flush=True)

source_files = []
target_files = []
list_completed = []

# Collect the .wav files in the source folder and build matching target paths
for f in os.listdir(source_folder):
    if f.lower().endswith('.wav'):
        source_files.append(os.path.join(source_folder, f))
        target_files.append(os.path.join(target_folder, f))
print(f'Saved all the file paths, Number of files = {len(source_files)}', flush=True)

# Convert the files to args.fs:
# read with librosa and write the mono-channel audio using soundfile
print(f'Converting all files to {fs/1000} kHz', flush=True)
for i, file_path in enumerate(source_files):
    y, sr = librosa.load(file_path, sr=fs, mono=True)
    sf.write(target_files[i], y, sr)
    list_completed.append(target_files[i])
    if i % log_rate == 0:
        print(f'File Number {i+1}, Shape of Audio {y.shape}, Sampling Frequency {sr}', flush=True)

print(f'Number of Files saved {len(list_completed)}')
print('Done', flush=True)
42 changes: 42 additions & 0 deletions applications/KWS_Phoneme/auxiliary_files/download_youtube_data.py
@@ -0,0 +1,42 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import csv
import os
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--csv_file', default=None, required=True)
parser.add_argument('--target_folder', default=None, required=True)
args = parser.parse_args()

with open(args.csv_file, 'r') as csv_f:
    reader = csv.reader(csv_f, skipinitialspace=True)
    # Skip the 3 header lines
    next(reader)
    next(reader)
    next(reader)
    for row in reader:
        # Logging
        print(row, flush=True)
        # ID of the YouTube video
        YouTube_ID = row[0]              # e.g. "-0RWZT-miFs"
        start_time = int(float(row[1]))  # e.g. 420
        end_time = int(float(row[2]))    # e.g. 430
        # Construct a downloadable link
        YouTube_link = "https://youtu.be/" + YouTube_ID
        # Output filename
        output_file = f"{args.target_folder}/ID_{YouTube_ID}.wav"
        # Start time in hrs:min:sec format
        start_sec = start_time % 60
        start_min = (start_time // 60) % 60
        start_hrs = start_time // 3600
        # End time in hrs:min:sec format
        end_sec = end_time % 60
        end_min = (end_time // 60) % 60
        end_hrs = end_time // 3600
        # Start- and end-time arguments passed to the post-processor
        time_args = f"-ss {start_hrs}:{start_min}:{start_sec} -to {end_hrs}:{end_min}:{end_sec}"
        # Download the audio track, trim it to the segment, and move it to the output file
        os.system(f"youtube-dl -x -q --audio-format wav --postprocessor-args '{time_args}' {YouTube_link}" + " --exec 'mv {} " + f"{output_file}'")
        print('', flush=True)