Commit b80e4c6

Merge dev into master
Merges all changes that make the repo prepare data for multiple datasets
1 parent 3c818b9 commit b80e4c6

23 files changed

+609
-135
lines changed

.gitignore

Lines changed: 1 addition & 6 deletions
@@ -1,6 +1 @@
-__pycache__/
-wav/
-rttm/
-uems/
-lists/
-*.tar.gz
+**/*.pyc

README.md

Lines changed: 15 additions & 10 deletions
@@ -1,17 +1,22 @@
-# AISHELL-4 for Pyannote
+# Dataset setup scripts for pyannote
 
-This repository automatically downloads the AISHELL-4 dataset and set it up to be used with pyannote-database.
+This repository aims to centralize scripts that prepare datasets to be used with [pyannote-audio](https://github.com/pyannote/pyannote-audio) (more precisely, with its [pyannote-database](https://github.com/pyannote/pyannote-database) dependency).
 
-It will generate two subsets of the original training data : 'train' and 'dev', as the original dataset only has training and test data (defaults are 80% train, 20% dev).
+Currently available :
+- [AISHELL4](aishell4)
+- [MSDWild](msdwild)
 
-## Instruction
+To set up a dataset, refer to the `README.md` in its folder.
 
-Run `setup.sh` to download and extract the files.
+Each dataset comes with its predefined `database.yml`, containing pyannote-database protocol(s) with already defined train+dev+test sets for out-of-the-box *speaker diarization* usage.
+How these subsets are defined is entirely configurable.
 
-If you want to change the subsets generated from the original training dataset, change the `CUSTOM_TRAIN_SUBSETS` variable in `generate_uris.py` and run `python generate_uris.py`. If you add/remove subsets, don't forget to edit database.yml accordingly.
+## FAQ
+### How do I change the train/dev split / How do I define my own subsets ?
 
-## Credits
+Head to the `generate_uris.py` of the desired dataset, and edit `your_subset_creation_logic()`.
+In particular, check `compute_uri_subsets_files(...)` and `compute_uri_subsets_time(...)` in [scripts/uri.py](scripts/uri.py), which allow you to split according to the number of files or time desired in the subsets.
 
-- AISHELL-4 (CC BY-SA 4.0) :
-  - Dataset: https://www.openslr.org/111/
-  - Original website : http://www.aishelltech.com/aishell_4
+This split can be absolute (= I want X files in subset1 / I want X hours in subset1) or relative (I want X% of the files in subset1 / I want X% of the hours in subset1).
+
+Don't forget to update the database.yml file accordingly.
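
The FAQ above distinguishes absolute splits (a number of files) from relative ones (a fraction of the files). The implementation of `compute_uri_subsets_files(...)` is not shown on this page, so the following is only a minimal sketch of the idea. `split_by_file_count` and its parameters are hypothetical names, not the repository's API, and the shuffle-then-slice strategy is an assumption.

```python
import math
import random


def split_by_file_count(uris, subsets, mode="relative", seed=42):
    """Hypothetical helper: split `uris` into named subsets by file count.

    `subsets` maps a subset name to a fraction of the files (mode="relative")
    or to a number of files (mode="absolute"). A value of math.inf means
    "everything that is left", so it should come last in the mapping.
    """
    shuffled = sorted(uris)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(shuffled)

    result, cursor = {}, 0
    for name, amount in subsets.items():
        if math.isinf(amount):
            count = len(shuffled) - cursor
        elif mode == "relative":
            count = round(amount * len(shuffled))
        else:
            count = int(amount)
        result[name] = shuffled[cursor:cursor + count]
        cursor += count
    return result


uris = [f"file{i:03d}" for i in range(10)]  # placeholder URIs
relative = split_by_file_count(uris, {"dev": 0.2, "train": math.inf})
absolute = split_by_file_count(uris, {"dev": 3, "train": math.inf}, mode="absolute")
print(len(relative["dev"]), len(relative["train"]))  # 2 8
print(len(absolute["dev"]), len(absolute["train"]))  # 3 7
```

Whatever subset names are produced this way still need matching entries in `database.yml`.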

aishell4/.gitignore

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+__pycache__/
+wav/
+rttm/
+uems/
+lists/
+*.tar.gz

aishell4/README.md

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+# AISHELL-4 for Pyannote
+
+These scripts automatically download the AISHELL-4 dataset and set it up to be used with pyannote-database.
+
+They generate two subsets from the original `train` set : `custom_train` and `custom_dev`, as the original dataset only has training and test data.
+Defaults are 12h for `custom_dev`, and what's left (~92h) for `custom_train`.
+
+The out-of-the-box protocol for pyannote.audio training is `AISHELL4.SpeakerDiarization.Custom`.
+
+## Instruction
+
+Run `setup.sh` to download and extract the files.
+
+
+## Original sets info
+
+| subset | # files | total length |
+|---|----|----|
+| train | 191 | 104h46m |
+| test | 20 | 12h34m |
+
+## Credits
+
+- AISHELL-4 (CC BY-SA 4.0) :
+  - Dataset: https://www.openslr.org/111/
+  - Original website : http://www.aishelltech.com/aishell_4

database.yml renamed to aishell4/database.yml

Lines changed: 12 additions & 3 deletions
@@ -4,13 +4,22 @@ Databases:
 Protocols:
   AISHELL4:
     SpeakerDiarization:
-      only_words:
+      Custom:
         train:
-          uri: lists/train.txt
+          uri: lists/custom_train.txt
           annotation: rttm/{uri}.rttm
           annotated: uems/{uri}.uem
         development:
-          uri: lists/dev.txt
+          uri: lists/custom_dev.txt
+          annotation: rttm/{uri}.rttm
+          annotated: uems/{uri}.uem
+        test:
+          uri: lists/test.txt
+          annotation: rttm/{uri}.rttm
+          annotated: uems/{uri}.uem
+      Original:
+        train:
+          uri: lists/train.txt
           annotation: rttm/{uri}.rttm
           annotated: uems/{uri}.uem
         test:
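
Each protocol entry above pairs a URI list with path templates containing a `{uri}` placeholder, which is filled in once per file. As a rough illustration of that substitution, here is a small sketch. `resolve_paths` is a hypothetical helper, `example_uri` is a made-up URI, and treating the paths as relative to the `aishell4` folder is an assumption.

```python
from pathlib import Path

# Path templates copied from the `Custom` protocol entries above.
TEMPLATES = {
    "annotation": "rttm/{uri}.rttm",
    "annotated": "uems/{uri}.uem",
}


def resolve_paths(uri, root="."):
    """Hypothetical helper: fill in `{uri}` for every template."""
    return {
        key: Path(root) / template.format(uri=uri)
        for key, template in TEMPLATES.items()
    }


paths = resolve_paths("example_uri", root="aishell4")
print(paths["annotation"].as_posix())  # aishell4/rttm/example_uri.rttm
print(paths["annotated"].as_posix())   # aishell4/uems/example_uri.uem
```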

aishell4/generate_uems.py

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+UEM_OUT="uems/"
+ALL_URIS_FILE = "lists/all.txt"
+RTTM_FOLDER = 'rttm'
+
+import glob
+from pathlib import Path
+import sys
+
+sys.path.append("../")
+from scripts.io import read_stringlist_from_file
+from scripts.uem import generate_uems_for_uris
+
+
+def main():
+    all_uris = read_stringlist_from_file(ALL_URIS_FILE)
+    generate_uems_for_uris(RTTM_FOLDER, UEM_OUT, all_uris)
+
+
+if __name__ == "__main__":
+    main()
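
The body of `generate_uems_for_uris` lives in `scripts/uem.py` and is not shown on this page. As a hedged sketch of what deriving a UEM from an RTTM can look like, the hypothetical helper below marks the file as annotated from 0 s to the end of the last speaker turn. That choice is an assumption; the real script may use the audio duration or another rule. Only the standard RTTM (`SPEAKER uri 1 start duration ...`) and UEM (`uri 1 start end`) line layouts are relied on.

```python
def uem_line_from_rttm(uri, rttm_lines):
    """Hypothetical helper: one UEM line spanning 0 s to the end of the last turn."""
    end = 0.0
    for line in rttm_lines:
        fields = line.split()
        if len(fields) < 5 or fields[0] != "SPEAKER":
            continue  # skip blank or non-speaker lines
        start, duration = float(fields[3]), float(fields[4])
        end = max(end, start + duration)
    return f"{uri} 1 0.000 {end:.3f}"


# Made-up RTTM content for a made-up URI.
rttm = [
    "SPEAKER example_uri 1 0.50 2.00 <NA> <NA> spk1 <NA> <NA>",
    "SPEAKER example_uri 1 2.25 4.00 <NA> <NA> spk2 <NA> <NA>",
]
print(uem_line_from_rttm("example_uri", rttm))  # example_uri 1 0.000 6.250
```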

aishell4/generate_uris.py

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
+# Generates URIs from filenames
+FILES_SOURCE="wav/*.flac"
+RESULT_DIR = "lists"
+UEM_TEMPLATE = "uems/{uri}.uem"
+
+SEED=42
+
+import glob
+import math
+from pathlib import Path
+import random
+import sys
+sys.path.append("../")
+
+
+from scripts.uri import compute_uri_subsets_files, compute_uri_subsets_time
+from scripts.io import write_stringlist_to_file
+
+
+def is_aishell_test_file(filename: str):
+    return filename.startswith('L') or filename.startswith('M') or filename.startswith('S')
+
+def your_subset_creation_logic():
+    # Original subsets
+    all_uris = [Path(filename).stem for filename in glob.glob(FILES_SOURCE)]
+    all_train_uris = [uri for uri in all_uris if not is_aishell_test_file(uri)] # 191 files, 104h46m
+    all_test_uris = [uri for uri in all_uris if is_aishell_test_file(uri)] # 20 files, 12h34m
+
+    write_stringlist_to_file(Path(RESULT_DIR) / "train.txt", all_train_uris)
+    write_stringlist_to_file(Path(RESULT_DIR) / "test.txt", all_test_uris)
+
+    # Custom subsets !
+    subsets_time_ratio = {'custom_dev':60*60*12.0, 'custom_train':math.inf} # aim for about the same size as test : 12h
+
+    computed_subsets_uri = [
+        compute_uri_subsets_time(all_train_uris, UEM_TEMPLATE, subsets_time_ratio, mode="absolute")
+    ]
+
+    for computed_subsets in computed_subsets_uri:
+        for subsetname, subseturis in computed_subsets.items():
+            write_stringlist_to_file(Path(RESULT_DIR) / (subsetname+".txt"), subseturis)
+
+
+if __name__ == '__main__':
+    all_uris = [Path(filename).stem for filename in glob.glob(FILES_SOURCE)]
+    write_stringlist_to_file(Path(RESULT_DIR) / "all.txt", all_uris, sort=True)
+
+    if len(sys.argv) > 1 and sys.argv[1] == "index":
+        print("Only created complete URIs index : all.txt")
+        sys.exit(0)
+    else:
+        your_subset_creation_logic()
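
The custom split above asks `compute_uri_subsets_time(..., mode="absolute")` for 12 hours of `custom_dev` and sends everything else to `custom_train` via `math.inf`. That function's implementation is not part of this page, so the sketch below only illustrates one way a duration budget could be filled. `split_by_time` is a hypothetical name, the greedy first-fit strategy is an assumption, and the durations are invented (the real script reads them from UEM files).

```python
import math
import random


def split_by_time(durations, budgets, seed=42):
    """Hypothetical helper: greedily fill each subset up to a budget in seconds.

    `durations` maps each URI to its annotated duration in seconds.
    `budgets` maps a subset name to its budget; math.inf takes whatever is left.
    A file that fits in no subset is left out.
    """
    uris = sorted(durations)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(uris)

    result = {name: [] for name in budgets}
    filled = {name: 0.0 for name in budgets}
    for uri in uris:
        for name, budget in budgets.items():
            if filled[name] + durations[uri] <= budget:
                result[name].append(uri)
                filled[name] += durations[uri]
                break
    return result, filled


# Ten invented 30-minute files and a 2-hour dev budget.
durations = {f"file{i:02d}": 1800.0 for i in range(10)}
computed, filled = split_by_time(
    durations, {"custom_dev": 2 * 60 * 60.0, "custom_train": math.inf}
)
print(len(computed["custom_dev"]), len(computed["custom_train"]))  # 4 6
```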

setup.sh renamed to aishell4/setup.sh

Lines changed: 10 additions & 7 deletions
@@ -2,10 +2,10 @@
 
 
 echo "Downloading ..."
-wget -nc "https://www.openslr.org/resources/111/train_L.tar.gz"
-wget -nc "https://www.openslr.org/resources/111/train_M.tar.gz"
-wget -nc "https://www.openslr.org/resources/111/train_S.tar.gz"
-wget -nc "https://www.openslr.org/resources/111/test.tar.gz"
+wget -c "https://www.openslr.org/resources/111/train_L.tar.gz"
+wget -c "https://www.openslr.org/resources/111/train_M.tar.gz"
+wget -c "https://www.openslr.org/resources/111/train_S.tar.gz"
+wget -c "https://www.openslr.org/resources/111/test.tar.gz"
 
 echo "Extracting train_L"
 tar -xf train_L.tar.gz
@@ -37,10 +37,13 @@ mv test/TextGrid/* rttm/
 rm -rd test/
 
 
-echo "Generating URI lists ..."
-python generate_uris.py
+echo "Generating URI index ..."
+python generate_uris.py index
 
 echo "Generating UEM files ..."
 python generate_uems.py
 
-echo "Done !"
+echo "Generating URI lists ..."
+python generate_uris.py
+
+echo "Done !"

generate_uems.py

Lines changed: 0 additions & 50 deletions
This file was deleted.

generate_uris.py

Lines changed: 0 additions & 59 deletions
This file was deleted.
