The Mozilla Common Voice project is the world's largest open source voice data project. From around 2019, data contributors to Common Voice have been able to specify an accent in their Profile data. However, the accents they could select from were constrained to an enumerated list. In early 2022, the Profile user interface was changed to allow data contributors to self-specify accents.
However, this change presents challenges to ML practitioners and those using accent data for downstream applications. Firstly, the accent data is free text which is comma delimited. However, the contributor-specified accent descriptors may also contain commas, requiring the use of regular expressions to separate accent descriptors accurately. Secondly, because the accent data is contributor-specified, there is no restrictive taxonomy.
Contributors may specify accents without the use of a constraining taxonomy - for example "Midwestern United States English", "Midwest", "Midwestern, and "Midwestern United States" all represent the same accent descriptor. This necessitated a set of heuristics to merge identical descriptors. Thirdly, contributors may specify accent descriptors that are compounds. For example, "heavy Cantonese" actually contains two descriptors - "heavy" to indicate accent strength, and "Cantonese" as a geographical regional marker.
This problem is significant due to the large number of actors who consume the Common Voice datasets; it is widely used in other datasets such as People's Speech (Galvez, D., Diamos, G., Ciro, J., Cerón, J. F., Achorn, K., Gopi, A., & Reddi, V. J. (2021). The People's Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage. arXiv preprint arXiv:2111.09344.), and is used as an evaluation dataset for large automatic speech recognition toolsets such as Whisper (Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.). Tooling is needed to be able to work with free-text accent data, across the 100+ languages available in Common Voice.
The tools here include:
cvaccents.py
- a set of Python classes to representAccent
,AccentCollection
andAccentDescriptor
cvaccents-v{Mozilla CV dataset version}-{(optional) language}
- Jupyter notebook that provides a worked example of extracting AccentDescriptor information and applying heuristics withcyvaccents.py
. This notebook also creates nodes and edgesJSON
files suitable for data visualisation in network diagrams.- Data visualisations are provided as shown here for English and for Kiswahili.
- The Common Voice English dataset is very large; we were fortunate enough to have access to a GPU cluster to run the Jupyter notebook, but the dataset is so large that it will likely stretch the limits of a single machine.
- Ethical considerations of this work are detailed in our accompanying paper.
-
This taxonomy was developed from an English dataset, and we have applied it to Kiswahili. We invite other researchers and practitioners to apply it to other languages to validate its utility; or amend it as needed for other languages.
-
This taxonomy has not been used to empirically assess voice datasets or models for bias; we anticipate doing this in future work.
The cvaccents-v13.ipynb
notebook provides the following taxonomy of Accent Descriptors. Practitioners may wish to extend this taxonomy for their own purposes, or may use this taxonomy as a structure with which to assess voice datasets or models for accent bias.
Taxonomic category | Count |
---|---|
Geographic descriptors | |
- Supranational region | 20 |
- Country | 50 |
- Subnational region | 76 |
- City | 20 |
- Other | 1 |
Register | 12 |
First or other language marker | 13 |
Accent strength descriptor | 14 |
Phonetic descriptors | |
- Specific phonetic changes | 3 |
- Rhoticity | 1 |
- Inflection | 1 |
Vocal quality descriptor | 10 |
Mixed or variable accent | 4 |
Uncertainty marker | 1 |
Accent effects due to physical change | 1 |
-
Fork this repo under your own GitHub username, or clone this repo into your environment with the command:
git clone https://github.com/KathyReid/cv-analysis-for-bias
-
Use your environment's instructions and create a virtual environment. This will be:
python3 -m venv [directory]
if you usepip
but will differ if you useconda
. -
Activate your virtual environment. If you use
pip
this will be:
cd [directory]
source /bin/activate
-
Install the required packages. If you use
pip
the commaned is:python3 -m pip install -r requirements.txt
but will differ if you useconda
. -
Launch Jupyter
jupyter notebook
- You can then use the notebook from within the Jupyter environment.
@inproceedings{10.1145/3617694.3623258,
author = {Reid, Kathy and Williams, Elizabeth T.},
title = {Common Voice and Accent Choice: Data Contributors Self-Describe Their Spoken Accents in Diverse Ways},
year = {2023},
isbn = {9798400703812},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3617694.3623258},
doi = {10.1145/3617694.3623258},
abstract = {The use of machine learning (ML)-powered speech technologies has increased significantly in recent years [40, 56, 72]. The datasets used for training speech models often represent demographic features of the speaker – such as gender, age, and accent. These axes are frequently used to evaluate the training set and model for bias [52]. Here, we focus on how accent is represented in voice data due to the adverse consequences of accent bias. We perform document analysis on several voice datasets to identify how accents are currently represented. We then analyse and visualise speaker-described accents from Mozilla’s Common Voice (CV) v13 English dataset, forming an emergent taxonomy of accent descriptors. We repeat this process using the CV v13 Kiswahili dataset, demonstrating that the taxonomy has use beyond English. We find that accents are currently represented in ways that are geographically, and predominantly, nationally bound. While this pattern is also shown in speaker-described accents from CV, a more diverse set of descriptors is revealed. This work provides some early evidence for re-thinking how accents are represented in datasets intended for ML applications. Our tooling is open-sourced, and we invite further work that uses our taxonomy to assess accent bias in speech data and models.},
booktitle = {Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization},
articleno = {35},
numpages = {10},
keywords = {bias corpora, accent data, data visualization, bias, accent bias, metadata, datasets, dataset documentation, accent recognition, voice data, speech data},
location = {Boston, MA, USA},
series = {EAAMO '23}
}
https://dl.acm.org/doi/10.1145/3617694.3623258
These tools use the Mozilla Public License (MPL) to align with Mozilla's broader ecosystem.
- 16 categories identified
- 164 individual accents identified
- 297 relationships between accents identified
Related data visualisation on the Observable platform
The key changes in this version are:
-
The number of categories identified in the data have increased from 16 in the first version, to 20 in this one. The four additional categories are:
- Linguistic heritage of speaker - indicating the speaker's language acquisition or immersion heritage, such as time spent in a location, or being born or raised in a location.
- Socio-economic marker - indicating a speaker's association with a socio-economic group or class, such as Middle Class.
- Hybrid dialect - indicating the speaker speaks using a dialect where two languages have come into contact - such as Denglish (German - Deutsch - and English) and Hinglish (Hindi and English, spoken in India).
- Generational marker - indicating the speaker's association with a generation, belying their age range, such as Gen Z.
-
The number of individual accents identified has increased from 164 in the first version, to 235 in this one.
-
The number of relationships between individual accents, which indicate a co-occurrence between speaker-described accents, such as "German" and "England English", has increased from 297 in the first version, to 515 in this one.