Phoneme Detection and Classifier Model Codes #238

51 changes: 51 additions & 0 deletions applications/KWS_Phoneme/README.md
@@ -0,0 +1,51 @@
# Phoneme-based Keyword Spotting (KWS)

# Project Description
There are two major issues in existing KWS systems: (a) they are not robust to heavy background noise and random utterances, and (b) they require a lot of collected data, which makes adding a new keyword cumbersome. Tackling these issues from a different perspective, we propose a two-stage scheme: a model first predicts phonemes, which are in turn used for phoneme-based keyword classification.

First, we train a phoneme classification model that produces the phoneme transcription of an input speech snippet. For training this phoneme classifier, we use a large public speech dataset such as LibriSpeech. The public dataset can be aligned (that is, we can obtain phoneme labels for each speech snippet) using the Montreal Forced Aligner. We also add reverberation and additive noise to the speech samples from the public dataset so that the phoneme classifier is robust to various accents, background noise, and varied environments. In this project, we predict a phoneme every 10 ms, which is the standard frame rate. You can find the aligned LibriSpeech dataset we used for training here.

In the second stage, we use the predicted phoneme outputs from the phoneme classifier to predict the keyword. We train a 1-layer FastGRNN classifier that takes the phoneme transcription as input and predicts the keyword. Since the phoneme classifier is trained to account for diverse accents, background noise, and environments, the keyword classifier can be trained on a small number of Text-To-Speech (TTS) samples generated using any standard TTS API from cloud services such as Azure, Google Cloud, or AWS.

This gives two advantages: (a) because the phoneme model absorbs the variation in accents and background noise, the lightweight keyword classifier needs only a small number of keyword samples, and (b) empirically, this method detected keywords from as far as 9 ft away. Further, the phoneme model is small, around 250k parameters, and can fit on a Cortex-M7 micro-controller.
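
As a concrete illustration, here is a minimal sketch of the two-stage inference flow in PyTorch. The class definitions, feature dimensions, and the plain GRU standing in for FastGRNN are all illustrative assumptions, not this PR's code; the actual models live in train_phoneme.py and train_classifier.py.
```
import torch
import torch.nn as nn

# Illustrative dimensions only; not taken from this PR.
FEAT_DIM, NUM_PHONEMES, NUM_KEYWORDS = 80, 41, 30

class PhonemeClassifier(nn.Module):
    """Stage 1 stand-in: frame-level phoneme posteriors, one frame per 10 ms."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(FEAT_DIM, 128, batch_first=True)
        self.out = nn.Linear(128, NUM_PHONEMES)

    def forward(self, x):                 # x: (batch, frames, FEAT_DIM)
        h, _ = self.rnn(x)
        return self.out(h)                # (batch, frames, NUM_PHONEMES)

class KeywordClassifier(nn.Module):
    """Stage 2 stand-in: a GRU in place of the 1-layer FastGRNN."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(NUM_PHONEMES, 64, batch_first=True)
        self.out = nn.Linear(64, NUM_KEYWORDS)

    def forward(self, p):                 # p: (batch, frames, NUM_PHONEMES)
        _, h = self.rnn(p)
        return self.out(h[-1])            # (batch, NUM_KEYWORDS)

speech_frames = torch.randn(1, 100, FEAT_DIM)   # ~1 s of features at a 10 ms hop
with torch.no_grad():
    posteriors = PhonemeClassifier()(speech_frames).softmax(-1)
    print(KeywordClassifier()(posteriors).argmax(-1))
```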

# Training the Phoneme Classifier
1) Train a phoneme classification model on a public speech dataset such as LibriSpeech.
2) The training speech dataset can be labelled using the Montreal Forced Aligner.
3) Speech snippets are convolved with reverberation files, and additive noise drawn from YouTube or other open sources is mixed in.
4) We also add white Gaussian noise at various SNRs; a sketch of this augmentation follows the list.
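
Below is a hedged sketch of steps 3 and 4, assuming 16 kHz mono float arrays and SNR-based scaling of the noise; the actual augmentation logic lives in the PR's data-loader and may differ in detail.
```
import numpy as np
from scipy.signal import fftconvolve

def augment(speech, rir, noise, snr_db, rir_chance=0.5, rng=np.random):
    """Illustrative reverb + additive-noise + white-Gaussian-noise augmentation."""
    if rng.random() < rir_chance:                        # step 3: reverberation
        speech = fftconvolve(speech, rir)[: len(speech)]
    noise = noise[: len(speech)]
    speech_pow = np.mean(speech ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))
    speech = speech + scale * noise                      # step 3: additive noise at snr_db
    wgn_snr_db = rng.choice([0, 5, 10, 25, 100])         # step 4: white Gaussian noise
    wgn_scale = np.sqrt(speech_pow / 10 ** (wgn_snr_db / 10))
    return speech + wgn_scale * rng.standard_normal(len(speech))

# Smoke test with synthetic signals
speech = np.random.randn(16000)             # 1 s of "speech" at 16 kHz
rir = np.exp(-np.linspace(0, 8, 2000))      # toy exponentially decaying impulse response
out = augment(speech, rir, np.random.randn(16000), snr_db=5)
```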

# Training the KWS Model
1) Our method takes the speech snippet as input and passes it through the phoneme classifier.
2) Keywords are detected by training a keyword classifier on the detected phonemes.
3) For training the keyword classifier, we use the Azure and Google Text-To-Speech APIs to generate the training data (keyword snippets).
4) For example, if you want to train a keyword classifier for the keywords in the Google30 dataset, generate TTS samples from the Azure/Google-Cloud/AWS API for each of the 30 keywords. The TTS samples for each keyword must be stored in a separate folder named after the keyword, as illustrated below; further details are given in the classifier-model-training use case.
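
For instance, using the folder names from the classifier-training command further below, a hypothetical layout would look like this (only two of the 30 keyword subfolders shown):
```
/path/to/train_and_test_data_folders/
    google30_azure_tts/
        yes/    <- .wav TTS samples for the keyword "yes"
        no/
        ...
    google30_google_tts/
        yes/
        no/
        ...
    google30_test/
        yes/
        no/
        ...
```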

# Sample Use Cases

## Phoneme Model Training
The following command can be used to instantiate and train the phoneme model.
```
python train_phoneme.py --base_path=/path/to/librispeech_data/ --rir_base_path=/path/to/reverb_files/ --additive_base_path=/path/to/additive_noises/ --snr_samples="0,5,10,25,100,100" --rir_chance=0.5
```
Some important command line arguments:
1) base_path : Path to the speech data folder. The data in this folder should be organized as expected by the data-loader code here.
2) rir_base_path, additive_base_path : Paths to the reverb and additive-noise files.
3) snr_samples : List of SNRs at which the additive noise is mixed in.
4) rir_chance : Probability of applying the reverberation operation to a given speech sample; see the sketch after this list.
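
A hedged reading of how these two flags are presumably consumed per training sample; the real sampling logic is in the data-loader, and treating an SNR of 100 as effectively clean is our interpretation:
```
import random

snr_samples = [int(s) for s in "0,5,10,25,100,100".split(",")]   # --snr_samples
snr_db = random.choice(snr_samples)    # listing 100 twice biases toward near-clean samples
apply_reverb = random.random() < 0.5   # --rir_chance=0.5
```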

## Classifier Model Training
The following command can be used to instantiate and train the classifier model.
```
python train_classifier.py --base_path=/path/to/train_and_test_data_folders/ --train_data_folders=google30_azure_tts,google30_google_tts --test_data_folders=google30_test --phoneme_model_load_ckpt=/path/to/checkpoint/x.pt --rir_base_path=/mnt/reverb_noise_sampled/ --additive_base_path=/mnt/add_noises_sampled/ --synth
```
Some important command line arguments:

1) base_path : Path to the train and test data folders.
2) train_data_folders, test_data_folders : Each of these folders should contain the .wav files for each keyword in a separate subfolder, according to the data-loader here.
3) phoneme_model_load_ckpt : Full path of the checkpoint file used to load weights into the instantiated phoneme model; see the sketch after this list.
4) rir_base_path, additive_base_path : Paths to the reverb and additive-noise files.
5) synth : Boolean flag specifying whether reverberation and noise addition should be applied.
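
For reference, a hedged sketch of what consuming phoneme_model_load_ckpt presumably looks like; the stand-in module, the checkpoint schema, and whether the phoneme model stays frozen during classifier training are all assumptions:
```
import torch
import torch.nn as nn

phoneme_model = nn.GRU(80, 128, batch_first=True)   # stand-in for the real phoneme model
state = torch.load("/path/to/checkpoint/x.pt", map_location="cpu")
phoneme_model.load_state_dict(state)                # restore the pre-trained weights
phoneme_model.eval()                                # assumed fixed front end during classifier training
```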

Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT license.
22 changes: 22 additions & 0 deletions applications/KWS_Phoneme/auxiliary_files/README.md
@@ -0,0 +1,22 @@
# Python scripts to help download and down-sample the additive noise data from YouTube videos

Run the following command to download the CSV file that indexes the YouTube additive-noise data:

```
wget http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/balanced_train_segments.csv
```
After the CSV file has been downloaded, run the extraction script to download the actual audio data:
```
python download_youtube_data.py --csv_file=/path/to/csv_file.csv --target_folder=/path/to/target/folder/
```

Please check [Google's Audioset data page](https://research.google.com/audioset/download.html) for further details.

The downloaded files need to be converted to 16 kHz for our pipeline. Run the following:
```
python convert_sampling_rate.py --source_folder=/path/to/target/folder/ --target_folder=/path/to/target/16KHz_folder/ --fs=16000 --log_rate=100
```
The script can convert the sampling rate of any .wav file to the specified --fs; for our application we use 16 kHz only. --log_rate controls how often progress is logged: one line is printed every log_rate files.

Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT license.
45 changes: 45 additions & 0 deletions applications/KWS_Phoneme/auxiliary_files/convert_sampling_rate.py
@@ -0,0 +1,45 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import os
import librosa
import soundfile as sf
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--source_folder', default=None, required=True)
parser.add_argument('--target_folder', default=None, required=True)
parser.add_argument('--fs', type=int, default=16000)
parser.add_argument('--log_rate', type=int, default=1000)
args = parser.parse_args()

source_folder = args.source_folder
target_folder = args.target_folder
fs = args.fs
log_rate = args.log_rate
print(f'Source Folder :: {source_folder}\nTarget Folder :: {target_folder}\nSampling Frequency :: {fs}', flush=True)

source_files = []
target_files = []
list_completed = []

# Collect the .wav files in the source folder and build matching target paths
for f in os.listdir(source_folder):
    if f.lower().endswith('.wav'):
        source_files.append(os.path.join(source_folder, f))
        target_files.append(os.path.join(target_folder, f))
print(f'Saved all the file paths, Number of files = {len(source_files)}', flush=True)

# Convert the files to args.fs:
# read with librosa and write the mono-channel audio using soundfile
print(f'Converting all files to {fs/1000} kHz', flush=True)
for i, file_path in enumerate(source_files):
    y, sr = librosa.load(file_path, sr=fs, mono=True)
    sf.write(target_files[i], y, sr)
    list_completed.append(target_files[i])
    if i % log_rate == 0:
        print(f'File Number {i+1}, Shape of Audio {y.shape}, Sampling Frequency {sr}', flush=True)

print(f'Number of Files saved {len(list_completed)}')
print('Done', flush=True)
42 changes: 42 additions & 0 deletions applications/KWS_Phoneme/auxiliary_files/download_youtube_data.py
@@ -0,0 +1,42 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

import csv
import os
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--csv_file', default=None, required=True)
parser.add_argument('--target_folder', default=None, required=True)
args = parser.parse_args()

with open(args.csv_file, 'r') as csv_f:
    reader = csv.reader(csv_f, skipinitialspace=True)
    # Skip the 3 header lines
    next(reader)
    next(reader)
    next(reader)
    for row in reader:
        # Logging
        print(row, flush=True)
        # ID of the YouTube video
        YouTube_ID = row[0]              # e.g. "-0RWZT-miFs"
        start_time = int(float(row[1]))  # e.g. 420
        end_time = int(float(row[2]))    # e.g. 430
        # Construct a downloadable link
        YouTube_link = "https://youtu.be/" + YouTube_ID
        # Output filename
        output_file = f"{args.target_folder}/ID_{YouTube_ID}.wav"
        # Start time in hrs:min:sec format
        start_sec = start_time % 60
        start_min = (start_time // 60) % 60
        start_hrs = start_time // 3600
        # End time in hrs:min:sec format
        end_sec = end_time % 60
        end_min = (end_time // 60) % 60
        end_hrs = end_time // 3600
        # Start- and end-time arguments passed to the post-processor
        time_args = f"-ss {start_hrs}:{start_min}:{start_sec} -to {end_hrs}:{end_min}:{end_sec}"
        # Download the audio track, trim it to the segment, and move it to the output file
        os.system(f"youtube-dl -x -q --audio-format wav --postprocessor-args '{time_args}' {YouTube_link}" + " --exec 'mv {} " + f"{output_file}'")
        print('', flush=True)