Dataperf-Selection-Speech is a challenge hosted by DataPerf.org that measures the performance of dataset selection algorithms. The model training component is frozen; participants can only improve accuracy by selecting the best training set. The benchmark is intended to encompass the tasks of dataset cleaning and coreset selection for a keyword spotting application. As a participant, you will submit your proposed list of training samples to the leaderboard on DynaBench, where the model is trained, evaluated, and scored.
Getting Started: Jump to our introductory colab below!
Component Ownership Diagram:
You are given a training dataset for spoken word classification, and your goal is to produce an algorithm that selects a subset of M examples (a coreset) from this dataset*. Evaluation proceeds by training a fixed model (`sklearn.ensemble.VotingClassifier` with various constituent classifiers) on your chosen subset, and then scoring the model's predictions on fixed test data via the `sklearn.metrics.f1_score` metric with `average="macro"`. We average the score over 10 random seeds (located in `workspace/dataperf_speech_config.yaml`) to produce the final score.
* M is user-defined, but DynaBench will host two leaderboards per language, with coreset size caps of 25 and 60.
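To make the scoring procedure concrete, the sketch below mirrors the description above: train a frozen `VotingClassifier` on the selected coreset and report the macro F1 score averaged over several seeds. The constituent classifiers, seeds, and data loading shown here are placeholders; the official versions live in the challenge's evaluation code (`eval.py`) and `workspace/dataperf_speech_config.yaml`.

```python
# Hedged sketch of the scoring procedure; see eval.py for the official version.
# X_train/y_train are the embeddings and labels of your selected coreset, and
# X_eval/y_eval are the fixed evaluation embeddings and labels (placeholders).
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.svm import SVC


def score_coreset(X_train, y_train, X_eval, y_eval, seeds=range(10)):
    scores = []
    for seed in seeds:
        # The challenge fixes its own constituent classifiers and seeds in
        # workspace/dataperf_speech_config.yaml; these two are illustrative only.
        clf = VotingClassifier(
            estimators=[
                ("lr", LogisticRegression(max_iter=1000, random_state=seed)),
                ("svm", SVC(probability=True, random_state=seed)),
            ],
            voting="soft",
        )
        clf.fit(X_train, y_train)
        scores.append(f1_score(y_eval, clf.predict(X_eval), average="macro"))
    return float(np.mean(scores))
```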
For each language, the challenge includes two leaderboards on DynaBench (six leaderboards in total). Each leaderboard corresponds to a language and a fixed maximum number of training samples (your submission can specify fewer samples than the maximum coreset size).
The training dataset consists of embedding vectors produced by a pretrained keyword spotting model (model checkpoint weights) for five target words in each of three languages (English, Portuguese, and Indonesian) taken from the Multilingual Spoken Words Corpus. The classifier also includes a `nontarget` category representing unknown words that are not among the five target words. To train and evaluate the classifier's ability to recognize nontarget words, we include a large set of embedding vectors drawn from each respective language. The total number of target and nontarget samples for each language is shown in the figure below:
Solutions should be algorithmic in nature (i.e., they should not involve human-in-the-loop audio sample listening and selection). We warmly encourage open-source submissions. If a participant team does not wish to open-source their solution, we ask that they allow the DataPerf organization to independently verify their solution and approach to ensure it is within the challenge rules.
The challenge is hosted on dataperf.org and will run from March 30 2023 through May 26 2023. Participants can submit solutions to DynaBench.
Launch our introductory notebook on Google Colab, which walks through performing coreset selection with our baseline algorithm and running our evaluation script on the coresets for English, Portuguese, and Indonesian.
Below, we provide additional documentation for each step of the above colab (downloading the data, coreset selection, and evaluation).
In case bugs or concerns are found, we will include a description of any changes to the evaluation metric, datasets, or support code here. Participants can re-submit their solutions to a new round on DynaBench which will reflect these changes.
- May 16 2023: Added averaging across multiple seeds during evaluation
- April 26 2023: Evaluation dataset publicly released
- March 30 2023: Challenge launch
Download the challenge data:
python utils/download_data.py --output_path workspace/data
This will automatically download and extract the train and eval embeddings for English, Indonesian, and Portuguese.
Run and evaluate the baseline selection algorithm. The target language can be changed by modifying the `--language` argument (English: `en`, Indonesian: `id`, Portuguese: `pt`). The training set size can be changed by modifying the `--train_size` argument. In particular, for each language you will run your training set selection algorithm twice, once for each `--train_size` leaderboard; in other words, you will perform six coreset generations in total per submission to DynaBench.
Run selection:
python -m selection.main --language en --train_size 25
This will write out `en_25_train.json` into the directory specified by `--outdir` (the default is the `workspace/` directory), where `25` refers to the maximum size of the coreset.
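Since each submission ultimately needs six coresets (three languages, two sizes each), you may find it convenient to script the runs. The sketch below uses only the documented `--language` and `--train_size` arguments; the output files follow the `{lang}_{size}_train.json` pattern described later in this document.

```python
# Convenience sketch: generate all six coresets for a submission by invoking the
# documented selection entry point once per language/size combination.
import subprocess

for language in ["en", "id", "pt"]:
    for train_size in [25, 60]:
        subprocess.run(
            [
                "python", "-m", "selection.main",
                "--language", language,
                "--train_size", str(train_size),
            ],
            check=True,
        )
```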
You can run evaluation locally on your training set, but please note the following:
Please see the challenge rules on dataperf.org for more details. In particular, we ask you not to optimize your result using any of the challenge evaluation data: optimization (e.g., cross-validation) should be performed on the samples in `allowed_training_set.yaml` for each language, and solutions should not be optimized against any of the samples listed in `eval.yaml` for any of the languages.
Since this speech challenge is fully open, there is no hidden test set. A locally computed evaluation score is unofficial but should match the results on DynaBench; it is included here solely to allow double-checking of DynaBench-computed results if necessary. Official evaluations are performed only on DynaBench. The following command performs local (offline) evaluation:
python eval.py --language en --train_size 25
This will output the macro F1 score of a model trained on the selected training set, evaluated against the official evaluation samples.
To develop your own selection algorithm:
- Create a new `selection.py` algorithm in `selection/implementations` which subclasses `TrainingSetSelection` (a hypothetical skeleton is sketched after this list).
- Implement `select()` in your class to use your selection algorithm.
- Change `selection_algorithm_module` and `selection_algorithm_class` in `workspace/dataperf_speech_config.yaml` to match the name of your selection implementation.
- Optionally, add experiment configs to `workspace/dataperf_speech_config.yaml` (these can be accessed via `self.config` in your selection implementation).
- Run your selection strategy and submit your results to DynaBench.
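As a rough illustration of the steps above, a new implementation might look like the following. The base-class import path, the config key, and the return type of `select()` are assumptions here; mirror the baseline implementation in `selection/implementations` for the real API.

```python
# Hypothetical skeleton for selection/implementations/my_selection.py.
# The import path and the config key below are assumptions; copy the baseline
# implementation and adapt it rather than relying on these exact names.
from selection.selection import TrainingSetSelection  # assumed import path


class MySelection(TrainingSetSelection):
    def select(self):
        # self.config exposes workspace/dataperf_speech_config.yaml, so any
        # experiment settings you add there are available here.
        budget = self.config.get("train_set_size_limit")  # hypothetical key name
        # Rank or sample candidates from the allowed training samples here and
        # return a training set in the same form as the baseline's select().
        raise NotImplementedError
```

After implementing the class, point `selection_algorithm_module` and `selection_algorithm_class` in the config at your module and class, as described above.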
Once participants are satisfied with their selection algorithm, they should submit their `{lang}_{size}_train.json` files to DynaBench. A separate file is required for each language and training set size combination (six in total).
Each supported language has the following files:
- `train_vectors`: The directory that contains the embedding vectors that can be selected for training. The file structure follows the pattern `train_vectors/en/left.parquet`. Each parquet file contains a "clip_id" column and a "mswc_embedding_vector" column.
- `eval_vectors`: The directory that contains the embedding vectors that are used for evaluation. The structure is identical to `train_vectors`.
- `allowed_train_set.yaml`: A file that specifies which sample IDs are valid training samples. The file contains the following structure: `{"targets": {"left": [list]}, "nontargets": [list]}`.
- `eval.yaml`: The evaluation set for eval.py. It follows the same structure as `allowed_train_set.yaml`. Participants should never use this data for training set selection algorithm development.
- `{lang}_{size}_train.json`: The file produced by `selection:main` that specifies the language-specific training set for eval.py.
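For orientation, the sketch below loads one language's training pool following the file descriptions above. The paths, the `en` language code, and the keyword `left` are illustrative; adjust them to wherever the download step placed your files.

```python
# Hedged sketch: inspect one language's training pool. Paths are illustrative.
import pandas as pd
import yaml

TRAIN_PARQUET = "workspace/data/train_vectors/en/left.parquet"  # per-keyword vectors
ALLOWED_YAML = "workspace/data/en/allowed_train_set.yaml"       # valid training sample IDs

# Each parquet file holds "clip_id" and "mswc_embedding_vector" columns.
train_df = pd.read_parquet(TRAIN_PARQUET)
print(train_df.columns.tolist())
print(len(train_df["mswc_embedding_vector"].iloc[0]))  # 1024-element embedding

# Sample IDs that may be selected for training.
with open(ALLOWED_YAML) as f:
    allowed = yaml.safe_load(f)
print(list(allowed["targets"].keys()))  # target keywords for this language
print(len(allowed["nontargets"]))       # number of allowed nontarget samples
```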
All languages share the following files:
- `dataperf_speech_config.yaml`: This file contains the configuration for the dataperf-speech-example workflow. Participants can extend this configuration file as needed.
- `mswc_vectors`: The unified directory of all embedding vectors. This directory can be used to generate new `train_vectors` and `eval_vectors` directories.
- `train_audio`: The directory of wav files that can optionally be used in the selection algorithm.
To use the raw audio in selection in addition to the embedding vectors (see the sketch after this list):
- Download the .wav version of the MSWC dataset.
- Pass the MSWC audio directory to `selection:main` as the `audio_dir` argument.
- Access the raw audio of a sample in a selection implementation with the `['audio']` key.
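A minimal illustration of the last step, assuming the sample objects your selection implementation iterates over behave like dictionaries (the helper name and waveform dtype are assumptions):

```python
# Hypothetical example: use the raw waveform (available under the 'audio' key
# when --audio_dir is passed to selection:main) to filter near-silent clips.
import numpy as np


def loud_enough(sample, threshold=1e-3):
    """Return True if the clip's mean energy is above a small threshold."""
    waveform = np.asarray(sample["audio"], dtype=np.float32)
    return float(np.mean(waveform ** 2)) > threshold
```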
Participants may use the MLCube workflow to simplify development on the user's machine and increase reproducibility.
To run the baseline selection algorithm:
Create a Python environment and install the MLCube Docker runner:
virtualenv -p python3 ./env && source ./env/bin/activate && pip install mlcube-docker
Run download task (only required once):
mlcube run --task=download -Pdocker.build_strategy=always
Run selection:
mlcube run --task=select -Pdocker.build_strategy=always
Run offline evaluation:
mlcube run --task=evaluate -Pdocker.build_strategy=always
- Keyword spotting model (KWS model): Also referred to as a wakeword, hotword, or voice trigger detection model, this is a small ML speech model that is designed to recognize a small vocabulary of spoken words or phrases (e.g., Siri, Google Voice Assistant, Alexa)
- Target sample: An example 1-second audio clip of a keyword used to train or evaluate a keyword-spotting model
- Nontarget sample: A 1-second audio clip of a word outside the KWS model's vocabulary, used to train or measure the model's ability to minimize false-positive detections on non-keywords.
- MSWC dataset: the Multilingual Spoken Words Corpus, a dataset of 340,000 spoken words in 50 languages.
- Embedding vector representation: An n-dimensional vector which provides a feature representation of an audio word. We have trained a large classifier on keywords in MSWC, and we provide a 1024-element feature vector taken from the penultimate layer of the classifier.