This datasheet is inspired by the Datasheets for Datasets paper.
Q1) For what purpose was the dataset created ? Was there a specific task in mind ? Was there a specific gap that needed to be filled ?
Ans. This is a dataset for intent classification from (Indian English) speech, and covers 14 coarse-grained intents from the banking domain. While other datasets have approached this task, here we provide a much larger training dataset (>650 samples per intent) to train models in an end-to-end fashion. We also provide anonymised speaker information to help answer questions around model robustness and bias.
Q2) Who created the dataset and on behalf of which entity ?
Ans. The (internal) Operations team at Skit was involved in the generation of the dataset, and provided their information for (anonymous) release. Unnati was involved in the curation of utterance templates, and Kriti and Manas were involved in the planning and collection of utterances - using an internal tool called sandbox. These contributors worked on this dataset as part of the Conversational UX and ML teams at Skit.
Q3) Who funded the creation of the dataset ?
Ans. Skit funded the creation of this dataset.
Q4) What do the instances that comprise the dataset consist of ?
Ans. The intent dataset is split across train.csv and test.csv. In both, individual instances consist of the following fields:
- id
- intent_class
- template
- audio_path
- speaker_id
You can trace more information on the intents, using the shared intent_class field in intent_info.csv:
- intent_class
- intent_name
- description
You can trace more information on the speakers, using the shared speaker_id field in speaker_info.csv:
- speaker_id
- native_language
- languages_spoken
- places_lived
- gender
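The lookups described above can be sketched with the standard library alone. This is a minimal sketch, assuming the four CSV files have header rows matching the field names listed here and sit in a single directory; the function names (read_rows, attach_metadata, load_split) and the data_dir layout are hypothetical and not part of the release.

```python
import csv
from pathlib import Path


def read_rows(path):
    """Read a CSV file into a list of dicts, one per row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def attach_metadata(instances, intents, speakers):
    """Add intent and speaker details to each instance row.

    Lookups use the shared intent_class and speaker_id fields.
    Rows without a match are kept, just without the extra columns.
    """
    intent_by_class = {row["intent_class"]: row for row in intents}
    speaker_by_id = {row["speaker_id"]: row for row in speakers}
    enriched = []
    for row in instances:
        merged = dict(row)
        merged.update(intent_by_class.get(row["intent_class"], {}))
        merged.update(speaker_by_id.get(row["speaker_id"], {}))
        enriched.append(merged)
    return enriched


def load_split(data_dir, split="train"):
    """Load one split ("train" or "test") with metadata attached.

    Assumption: all four CSV files sit directly under data_dir.
    """
    root = Path(data_dir)
    return attach_metadata(
        read_rows(root / f"{split}.csv"),
        read_rows(root / "intent_info.csv"),
        read_rows(root / "speaker_info.csv"),
    )
```

Each returned row then carries the instance fields together with intent_name, description and the speaker attributes, so it can be filtered or grouped without further joins.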
Q5) How many instances are there in total (of each type, if appropriate) ?
Ans. In all there are 11845 samples, across the train and test splits:
- test.csv has a total of 1400 samples, with exactly 100 samples per intent
- train.csv has a total of 10445 samples, with at least 650 samples per intent
The 11 speakers are distributed across the dataset, but unequally. However:
- each intent has data from all speakers
- the speakers are stratified across the train and test split - for each intent independently
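The counts and speaker properties above can be checked on the loaded rows. This is a minimal sketch, assuming each row is a dict with intent_class and speaker_id keys (for example, as produced by csv.DictReader); the helper names are hypothetical, and the share gap is just one simple way to look at stratification, not the procedure used to build the splits.

```python
from collections import Counter, defaultdict


def samples_per_intent(rows):
    """Count rows for each intent_class."""
    return Counter(row["intent_class"] for row in rows)


def speaker_shares(rows):
    """Map each intent_class to {speaker_id: fraction of that intent's rows}."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["intent_class"]][row["speaker_id"]] += 1
    return {
        intent: {spk: n / sum(per_speaker.values()) for spk, n in per_speaker.items()}
        for intent, per_speaker in counts.items()
    }


def intents_missing_speakers(rows):
    """Return the intents that lack data from at least one speaker."""
    all_speakers = {row["speaker_id"] for row in rows}
    shares = speaker_shares(rows)
    return sorted(i for i, per_speaker in shares.items() if set(per_speaker) != all_speakers)


def max_share_gap(train_rows, test_rows):
    """Largest train/test difference in any speaker's share within one intent."""
    train, test = speaker_shares(train_rows), speaker_shares(test_rows)
    gap = 0.0
    for intent in set(train) | set(test):
        tr, te = train.get(intent, {}), test.get(intent, {})
        for spk in set(tr) | set(te):
            gap = max(gap, abs(tr.get(spk, 0.0) - te.get(spk, 0.0)))
    return gap
```

If the splits match the description, samples_per_intent should give 100 for every intent in test.csv and at least 650 in train.csv, intents_missing_speakers should return an empty list, and max_share_gap should be small.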
Some statistics on the speakers are provided below. More granular information can be found in speaker_info.csv:
- Native languages: Hindi (4), Bengali (3), Kannada (2), Malayalam (1), Punjabi (1)
- Languages spoken: Hindi, English, Bengali, Odia, Kannada, Punjabi, Malayalam, Bihari, Marathi
- Indian states lived in: Bihar, Odisha, Karnataka, West Bengal, Punjab, Kerala, Jharkhand, Maharashtra
Q6) Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set ?
Ans. It is a sample. For each intent, our Conversational UX team generated a list of templates. These are meant to be a (satisfactory) representation of the variations in utterances seen in human speech. The speakers used these templates as a guide when generating data. So, this dataset is limited by the templates and by the variations that speakers added (spontaneously).
Q7) Are there recommended data splits (e.g., training, development/validation, testing) ?
Ans. The recommended split into train and test sets is provided as train.csv and test.csv respectively.
Q8) Are there any errors, sources of noise, or redundancies in the dataset?
Ans. There could be channel noise present in the dataset, because the data was generated through telephone calls. However, background noise will not be as prevalent as in real-world scenarios, since these telephone calls were made in a semi-controlled environment.
Q9) Other comments.
Ans. Speakers were responsible for generating variations in utterances, using the template field as a guide. So, there could be some unintentional overlap across the content of utterances.
Q10) How was the data associated with each instance acquired ?
Ans. Members of the (internal) Operations team generated each utterance, using the associated template field as a guide and injecting their own variations into the speech utterance.
Q11) Who was involved in the data collection process and how were they compensated ?
Ans. The data was generated by the (internal) Operations team and they are/were full-time employees.
Q12) Over what timeframe was the data collected ?
Ans. This data was collected over a time period of 1 month.
Q13) Was any preprocessing/cleaning/labelling of the data done ?
Ans. Audio instances in the dataset were auto-labelled with their associated intent and template fields. For more information on this, refer to the documentation of sandbox.
Q14) Has the dataset been used for any tasks already ?
Ans. It has been used to benchmark models for the task of intent classification from speech.
Q15) What (other) tasks could the dataset be used for ?
Ans. We provide speaker characteristics. So, this dataset could be used for alternate classification tasks from speech, such as gender or native-language classification.
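Relabelling for such a task amounts to swapping the intent label for a speaker attribute. This is a minimal sketch, assuming instance rows and speaker rows are dicts keyed by the field names listed under Q4; the function name relabel_by_speaker_field is hypothetical.

```python
def relabel_by_speaker_field(instances, speakers, field="gender"):
    """Return (audio_path, label) pairs, with the label read from speaker_info.

    field is a column of speaker_info.csv, e.g. "gender" or "native_language".
    Rows whose speaker_id is missing from speakers are skipped.
    """
    label_by_speaker = {row["speaker_id"]: row[field] for row in speakers}
    return [
        (row["audio_path"], label_by_speaker[row["speaker_id"]])
        for row in instances
        if row["speaker_id"] in label_by_speaker
    ]
```

Because every speaker appears in both train.csv and test.csv, a speaker-attribute model evaluated on the provided split has already heard each test speaker during training; a split by speaker_id may be a better fit for these tasks.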
Q16) Will the dataset be distributed under a copyright or other intellectual property (IP) license ?
Ans. This dataset is being distributed under a CC BY-NC license.
Q17) Who will be maintaining the dataset ?
Ans. The research team at Skit will be maintaining the dataset. They can be contacted by sending an email to [email protected].
Q18) Will the dataset be updated in the future (e.g., to correct labelling errors, add new instances, delete instances) ?
Ans. In case there are errors, we will try to collate and share an updated version every 3 months. We also plan to add more instances and variations to the dataset, to make it more robust.