FrenchKrab
diff --git a/.gitignore b/.gitignore
Lines changed: 1 addition & 6 deletions
diff --git a/README.md b/README.md
Lines changed: 15 additions & 10 deletions
diff --git a/aishell4/.gitignore b/aishell4/.gitignore
Lines changed: 6 additions & 0 deletions
diff --git a/aishell4/README.md b/aishell4/README.md
Lines changed: 26 additions & 0 deletions
diff --git a/database.yml b/aishell4/database.yml (database.yml renamed to aishell4/database.yml)
Lines changed: 12 additions & 3 deletions
diff --git a/aishell4/generate_uems.py b/aishell4/generate_uems.py
Lines changed: 20 additions & 0 deletions
diff --git a/aishell4/generate_uris.py b/aishell4/generate_uris.py
Lines changed: 52 additions & 0 deletions
diff --git a/setup.sh b/aishell4/setup.sh (setup.sh renamed to aishell4/setup.sh)
Lines changed: 10 additions & 7 deletions
diff --git a/generate_uems.py b/generate_uems.py
Lines changed: 0 additions & 50 deletions
diff --git a/generate_uris.py b/generate_uris.py
Lines changed: 0 additions & 59 deletions
@@ -1,6 +1 @@
-__pycache__/
-wav/
-rttm/
-uems/
-lists/
-*.tar.gz
+**/*.pyc
@@ -1,17 +1,22 @@
-# AISHELL-4 for Pyannote
+# Dataset setup scripts for pyannote
 
-This repository automatically downloads the AISHELL-4 dataset and set it up to be used with pyannote-database.
+This repository aims to centralize scripts that prepare datasets to be used with [pyannote-audio](https://github.com/pyannote/pyannote-audio) (more precisely, with its [pyannote-database](https://github.com/pyannote/pyannote-database) dependency).
 
-It will generate two subsets of the original training data : 'train' and 'dev', as the original dataset only has training and test data (defaults are 80% train, 20% dev).
+Currently available : 
+- [AISHELL4](aishell4)
+- [MSDWild](msdwild)
 
-## Instruction
+To set up a dataset, refer to the `README.md` in its folder.
 
-Run `setup.sh` to download and extract the files.
+Each dataset comes with its predefined `database.yml`, containing pyannote-database protocol(s) with already defined train+dev+test sets for out-of-the-box *speaker diarization* usage.
+How these subsets are defined is entirely configurable.
 
-If you want to change the subsets generated from the original training dataset, change the `CUSTOM_TRAIN_SUBSETS` variable in `generate_uris.py` and run `python generate_uris.py`. If you add/remove subsets, don't forget to edit database.yml accordingly.
+## FAQ
+### How do I change the train/dev split or define my own subsets?
 
-## Credits
+Head to the `generate_uris.py` of the desired dataset, and edit `your_subset_creation_logic()`.
+In particular check `compute_uri_subsets_files(...)` and `compute_uri_subsets_time(...)` in [scripts/uri.py](scripts/uri.py), which allow you to split according to the number of files or time desired in the subsets. 
 
-- AISHELL-4 (CC BY-SA 4.0) : 
-    - Dataset: https://www.openslr.org/111/
-    - Original website : http://www.aishelltech.com/aishell_4
+The split can be absolute (X files or X hours in subset1) or relative (X% of the files or X% of the hours in subset1).
+
+Don't forget to update the database.yml file accordingly.
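The difference between an absolute and a relative time-based split, as described in the README hunk above, can be illustrated with a minimal sketch. The function `split_by_time` below is hypothetical and is not the repository's `compute_uri_subsets_time` (whose implementation is not shown in this diff); it assumes durations are already known per URI and fills subsets greedily in a seeded random order.

```python
import math
import random


def split_by_time(durations, targets, mode="absolute", seed=42):
    """Greedily assign URIs to subsets until each reaches its time target.

    durations: dict mapping URI -> duration in seconds.
    targets:   dict mapping subset name -> target. In "absolute" mode the
               target is in seconds; in "relative" mode it is a fraction of
               the total duration. math.inf means "take whatever is left".
    """
    if mode not in ("absolute", "relative"):
        raise ValueError(f"unknown mode: {mode}")

    total = sum(durations.values())
    uris = sorted(durations)            # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(uris)

    subsets = {name: [] for name in targets}
    filled = {name: 0.0 for name in targets}
    for uri in uris:
        # Put the URI in the first subset that has not reached its target yet.
        # A URI is dropped if every subset is already full.
        for name, target in targets.items():
            if mode == "absolute" or math.isinf(target):
                limit = target
            else:
                limit = target * total
            if filled[name] < limit:
                subsets[name].append(uri)
                filled[name] += durations[uri]
                break
    return subsets


# Hypothetical data: eight one-hour files.
durations = {f"file{i}": 3600.0 for i in range(8)}
absolute = split_by_time(durations, {"custom_dev": 2 * 3600.0, "custom_train": math.inf})
relative = split_by_time(durations, {"custom_dev": 0.25, "custom_train": math.inf}, mode="relative")
```

With these inputs both calls request two hours for `custom_dev`: the first as an explicit number of seconds, the second as 25% of the eight-hour total.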
@@ -0,0 +1,6 @@
+__pycache__/
+wav/
+rttm/
+uems/
+lists/
+*.tar.gz
@@ -0,0 +1,26 @@
+# AISHELL-4 for Pyannote
+
+These scripts automatically download the AISHELL-4 dataset and set it up to be used with pyannote-database.
+
+They generate two subsets from the original `train` set, `custom_train` and `custom_dev`, because the original dataset only has training and test data.
+Defaults are 12h for `custom_dev`, and what's left (~92h) for `custom_train`.
+
+The out-of-the-box protocol for pyannote.audio training is `AISHELL4.SpeakerDiarization.Custom`.
+
+## Instruction
+
+Run `setup.sh` to download and extract the files.
+
+
+## Original sets info
+
+| subset | # files | total length |
+|---|----|----|
+| train | 191 | 104h46m |
+| test | 20 | 12h34m |
+
+## Credits
+
+- AISHELL-4 (CC BY-SA 4.0) : 
+    - Dataset: https://www.openslr.org/111/
+    - Original website : http://www.aishelltech.com/aishell_4
@@ -4,13 +4,22 @@ Databases:
 Protocols:
   AISHELL4:
     SpeakerDiarization:
-      only_words:
+      Custom:
         train:
-          uri: lists/train.txt
+          uri: lists/custom_train.txt
           annotation: rttm/{uri}.rttm
           annotated: uems/{uri}.uem
         development:
-          uri: lists/dev.txt
+          uri: lists/custom_dev.txt
+          annotation: rttm/{uri}.rttm
+          annotated: uems/{uri}.uem
+        test:
+          uri: lists/test.txt
+          annotation: rttm/{uri}.rttm
+          annotated: uems/{uri}.uem
+      Original:
+        train:
+          uri: lists/train.txt
           annotation: rttm/{uri}.rttm
           annotated: uems/{uri}.uem
         test:
 
@@ -0,0 +1,20 @@
+UEM_OUT="uems/"
+ALL_URIS_FILE = "lists/all.txt"
+RTTM_FOLDER = 'rttm'
+
+import glob
+from pathlib import Path
+import sys
+
+sys.path.append("../")
+from scripts.io import read_stringlist_from_file
+from scripts.uem import generate_uems_for_uris
+
+
+def main():
+    all_uris = read_stringlist_from_file(ALL_URIS_FILE)
+    generate_uems_for_uris(RTTM_FOLDER, UEM_OUT, all_uris)
+
+
+if __name__ == "__main__":
+    main()
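The body of `generate_uems_for_uris` lives in `scripts/uem.py` and is not part of this diff. As a rough illustration of what deriving a UEM from an RTTM file can look like, here is a minimal sketch; the helper `uem_line_from_rttm` is hypothetical, and it assumes the annotated region runs from 0 s to the end of the last speech turn, which may differ from what the repository actually does.

```python
def uem_line_from_rttm(rttm_text, uri):
    """Build one UEM line covering 0 s up to the end of the last speech turn."""
    end = 0.0
    for line in rttm_text.splitlines():
        fields = line.split()
        # RTTM fields: SPEAKER <uri> <channel> <onset> <duration> ...
        if len(fields) < 5 or fields[0] != "SPEAKER" or fields[1] != uri:
            continue
        onset, duration = float(fields[3]), float(fields[4])
        end = max(end, onset + duration)
    # UEM fields: <uri> <channel> <start> <end>
    return f"{uri} 1 {0.0:.3f} {end:.3f}"


# Hypothetical two-turn RTTM for a file named "a".
rttm = (
    "SPEAKER a 1 0.50 1.25 <NA> <NA> spk1 <NA> <NA>\n"
    "SPEAKER a 1 3.00 2.00 <NA> <NA> spk2 <NA> <NA>\n"
)
uem = uem_line_from_rttm(rttm, "a")
```

The resulting line is what a file matched by the `uems/{uri}.uem` template in `database.yml` would contain under this assumption.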
@@ -0,0 +1,52 @@
+# Generates URIs from filenames
+FILES_SOURCE="wav/*.flac"
+RESULT_DIR = "lists"
+UEM_TEMPLATE = "uems/{uri}.uem"
+
+SEED=42
+
+import glob
+import math
+from pathlib import Path
+import random
+import sys
+sys.path.append("../")
+
+
+from scripts.uri import compute_uri_subsets_files, compute_uri_subsets_time
+from scripts.io import write_stringlist_to_file
+
+
+def is_aishell_test_file(filename: str):
+    return filename.startswith('L') or filename.startswith('M') or filename.startswith('S')
+
+def your_subset_creation_logic():
+    # Original subsets
+    all_uris = [Path(filename).stem for filename in glob.glob(FILES_SOURCE)]
+    all_train_uris = [uri for uri in all_uris if not is_aishell_test_file(uri)] # 191 files, 104h46m
+    all_test_uris = [uri for uri in all_uris if is_aishell_test_file(uri)]  # 20 files, 12h34m
+
+    write_stringlist_to_file(Path(RESULT_DIR) / "train.txt", all_train_uris)
+    write_stringlist_to_file(Path(RESULT_DIR) / "test.txt", all_test_uris)
+
+    # Custom subsets !
+    subsets_time_ratio = {'custom_dev':60*60*12.0, 'custom_train':math.inf} # aim for about the same size as test : 12h
+        
+    computed_subsets_uri = [
+        compute_uri_subsets_time(all_train_uris, UEM_TEMPLATE, subsets_time_ratio, mode="absolute")
+    ]
+
+    for computed_subsets in computed_subsets_uri:
+        for subsetname, subseturis in computed_subsets.items():
+            write_stringlist_to_file(Path(RESULT_DIR) / (subsetname+".txt"), subseturis)
+
+
+if __name__ == '__main__':
+    all_uris = [Path(filename).stem for filename in glob.glob(FILES_SOURCE)]
+    write_stringlist_to_file(Path(RESULT_DIR) / "all.txt", all_uris, sort=True)
+
+    if len(sys.argv) > 1 and sys.argv[1] == "index":
+        print("Only created complete URIs index : all.txt")
+        sys.exit()
+    else:
+        your_subset_creation_logic()
@@ -2,10 +2,10 @@
 
 
 echo "Downloading ..."
-wget -nc "https://www.openslr.org/resources/111/train_L.tar.gz"
-wget -nc "https://www.openslr.org/resources/111/train_M.tar.gz"
-wget -nc "https://www.openslr.org/resources/111/train_S.tar.gz"
-wget -nc "https://www.openslr.org/resources/111/test.tar.gz"
+wget -c "https://www.openslr.org/resources/111/train_L.tar.gz"
+wget -c "https://www.openslr.org/resources/111/train_M.tar.gz"
+wget -c "https://www.openslr.org/resources/111/train_S.tar.gz"
+wget -c "https://www.openslr.org/resources/111/test.tar.gz"
 
 echo "Extracting train_L"
 tar -xf train_L.tar.gz
@@ -37,10 +37,13 @@ mv test/TextGrid/* rttm/
 rm -rd test/
 
 
-echo "Generating URI lists ..."
-python generate_uris.py
+echo "Generating URI index ..."
+python generate_uris.py index
 
 echo "Generating UEM files ..."
 python generate_uems.py
 
-echo "Done !"
+echo "Generating URI lists ..."
+python generate_uris.py
+
+echo "Done !"