Commit 132b229 (1 parent: ed77c06)

update calibration/finetuning/index and add lots of data/uem files

598 files changed: +21351 −34 lines


.github/workflows/publish.yml (deleted, −25 lines)

README.md (new file, +14 lines)

```markdown
# On the calibration of powerset speaker diarization models

[Alexis Plaquet](https://frenchkrab.github.io/) and [Hervé Bredin](https://herve.niderb.fr)

Proc. InterSpeech 2024.

> End-to-end neural diarization models have usually relied on a multilabel-classification formulation of the speaker diarization problem. Recently, a powerset multiclass formulation has beaten state-of-the-art on multiple datasets. In this paper, we propose to study the calibration of a powerset speaker diarization model, and explore some of its uses. We study the calibration in-domain, as well as out-of-domain, and explore the data in low-confidence regions. The reliability of model confidence is then tested in practice: we use the confidence of the pretrained model to selectively create training and validation subsets out of unannotated data, and compare this to random selection. We find that top-label confidence can be used to reliably predict high-error regions. Moreover, training on low-confidence regions provides a better calibrated model, and validating on low-confidence regions can be more annotation-efficient than random regions.

[Read the paper (TODO)](https://www.isca-speech.org/archive/interspeech_2023/plaquet23_interspeech.html)

[Browse the companion website](https://frenchkrab.github.io/IS2024-powerset-calibration/)

## Citations

To be added.
```
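The abstract describes ranking data by the pretrained model's top-label confidence and annotating the least confident regions first. As an illustrative toy only (the frame probabilities, region granularity, and function names below are invented, not the paper's pipeline), that selection step can be sketched as:

```python
def top_label_confidence(probs):
    """Confidence of each frame: probability of its most likely powerset class."""
    return [max(p) for p in probs]

def lowest_confidence_regions(probs, k):
    """Indices of the k frames the model is least sure about (annotation candidates)."""
    conf = top_label_confidence(probs)
    return sorted(range(len(conf)), key=lambda i: conf[i])[:k]

# Toy per-frame probabilities over 3 powerset classes (e.g. silence / spk1 / spk1+spk2).
frame_probs = [
    [0.05, 0.90, 0.05],  # confident
    [0.40, 0.35, 0.25],  # uncertain: a good annotation candidate
    [0.01, 0.01, 0.98],  # very confident
]
candidates = lowest_confidence_regions(frame_probs, 1)  # picks the uncertain frame
```

A real pipeline would presumably aggregate frame-level confidence over longer regions before selecting; this sketch ranks individual frames only.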

calibration.qmd (+45 −6 lines)

```diff
@@ -5,14 +5,13 @@ about:
   links:
     - icon: github
       text: Github
-      href: https://github.com/FrenchKrab
+      href: https://github.com/FrenchKrab/IS2024-powerset-calibration
     - icon: book
       text: Google Scholar
       href: https://scholar.google.com/citations?user=7gJ465gAAAAJ
 ---
-
 # Raw results table
 
 We could not include the raw result table in the paper. We show it here, and include some additional metrics (Expected Calibration Error using different binning schemes and bin counts). It is pretty clear that the bins used to compute the ECE do not have a huge impact on the metric.
```
```diff
@@ -56,18 +55,58 @@ The paper contains two scatter plots for DER / ECE. Here we grouped all datasets
 
 # Reliability diagrams
 
+Here are reliability diagrams for all 11 DIHARD 3 domains. The paper only shows uniform binning, but we also propose diagrams for adaptive binning.
+We put the figures under foldable sections since they take a lot of vertical space.
+
 ## Uniform binning with 10 bins
 
 <!-- 09c_view_calibration_eval.ipynb with BINNING_METHOD='uniform' -->
-::: {.callout-note appearance="detail" collapse=true}
-# Using uniform binning with 10 bins
+::: {.callout-note appearance="detail" collapse=true title="Using uniform binning with 10 bins"}
 ![](site_media/calibration/reliability_uniform10bins.png)
 :::
 
 <!-- 09c_view_calibration_eval.ipynb with BINNING_METHOD='adaptive' -->
 ## Adaptive binning with 10 bins
 
-::: {.callout-note appearance="detail" collapse=true}
-# Using adaptive binning with 10 bins
+Note that the X axis is not linear at all. Since most predictions are confident, the higher bins contain very similar confidence values.
+
+::: {.callout-note appearance="detail" collapse=true title="Using adaptive binning with 10 bins"}
 ![](site_media/calibration/reliability_adaptive10bins.png)
 :::
+
+# Analysis of low-confidence regions
+
+We sample low-confidence data (left column) and random regions of data (right column), and compare the composition of the data as well as the model performance. As usual we provide the figures for all DIHARD domains instead of a select few.
+
+## Data composition
+
+<!-- 21_selected_al_analysis.ipynb -->
+::: {.callout-note appearance="detail" collapse=true title="Data composition of low-confidence regions"}
+![](site_media/calibration/data_composition.png)
+:::
+
+## Model performance (DER)
+
+<!-- 21_selected_al_analysis_der.ipynb -->
+::: {.callout-note appearance="detail" collapse=true title="DER on low-confidence regions"}
+![](site_media/calibration/der_analysis.png)
+:::
+
+# Reproducibility
+
+Pretrained model checkpoint downloads:
+
+- [Github](https://github.com/FrenchKrab/IS2024-powerset-calibration/tree/master/data/calibration/[email protected])
+- [HuggingFace (mirror)](https://huggingface.co/aplaquet/IS2024-powerset-calibration/blob/main/pretrained%40epoch109.ckpt)
+
+Composition of the training dataset:
+
+- [pyannote.database protocol specifications](https://github.com/FrenchKrab/IS2024-powerset-calibration/tree/master/data/calibration/database.yml)
+
+Parquet inference files, containing model probabilities and targets for all of the datasets:
+
+- [.parquet inference files](https://huggingface.co/aplaquet/IS2024-powerset-calibration/tree/main/model_inference)
```
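The ECE variants compared in the raw results table differ only in how confidence values are grouped into bins. As a minimal sketch (not the notebook code): uniform binning splits [0, 1] into equal-width bins, adaptive binning puts an equal number of samples in each bin, and ECE is the sample-weighted average gap between per-bin accuracy and per-bin mean confidence:

```python
def ece(confidences, correct, n_bins=10, binning="uniform"):
    """Expected Calibration Error under uniform (equal-width) or
    adaptive (equal-frequency) binning."""
    pairs = sorted(zip(confidences, correct))
    n = len(pairs)
    if binning == "uniform":
        # Equal-width bins over [0, 1]; confidence 1.0 falls in the last bin.
        bins = [[] for _ in range(n_bins)]
        for c, y in pairs:
            bins[min(int(c * n_bins), n_bins - 1)].append((c, y))
    else:
        # Adaptive: consecutive equal-frequency slices of the sorted samples.
        bins = [pairs[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]
    total = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(y for _, y in b) / len(b)
            total += len(b) / n * abs(accuracy - avg_conf)
    return total
```

When most predictions are highly confident, as noted for the adaptive reliability diagrams, the adaptive bins bunch up near 1.0, which is consistent with the two schemes rarely disagreeing much on the final number.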

data/calibration/database.yml (new file, +54 lines)

```yaml
Requirements:
  - database_eie.yml
  - database_dihard_1file.yml
  - database_dihard_1file2min.yml
  - database_dihard_1file-1_2min.yml
  - database_dihard_1file-2_2min.yml
  - database_dihard_1file-3_2min.yml

Protocols:
  X:
    SpeakerDiarization:
      Pretraining_2023-12_no-DIHARD:
        train:
          AISHELL.SpeakerDiarization.Adaptation: [train, ]
          AliMeeting.SpeakerDiarization.Adaptation: [train, ]
          AMI.SpeakerDiarization.Adaptation: [train, ]
          AMI-SDM.SpeakerDiarization.Adaptation: [train, ]
          AVA-AVD.SpeakerDiarization.Adaptation: [train, ]
          CALLHOME.SpeakerDiarization.Adaptation: [train, ]
          DISPLACE.SpeakerDiarization.Adaptation: [train, ]
          # DIHARD.SpeakerDiarization.Adaptation: [train, ]
          Ego4D.SpeakerDiarization.Adaptation: [train, ]
          MSDWILD.SpeakerDiarization.Adaptation: [train, ]
          RAMC.SpeakerDiarization.Adaptation: [train, ]
          REPERE.SpeakerDiarization.Adaptation: [train, ]
          VoxConverse.SpeakerDiarization.Adaptation: [train, ]
        development:
          AISHELL.SpeakerDiarization.Adaptation: [development, ]
          AliMeeting.SpeakerDiarization.Adaptation: [development, ]
          AMI.SpeakerDiarization.Adaptation: [development, ]
          AMI-SDM.SpeakerDiarization.Adaptation: [development, ]
          AVA-AVD.SpeakerDiarization.Adaptation: [development, ]
          CALLHOME.SpeakerDiarization.Adaptation: [development, ]
          DISPLACE.SpeakerDiarization.Adaptation: [development, ]
          # DIHARD.SpeakerDiarization.Adaptation: [development, ]
          Ego4D.SpeakerDiarization.Adaptation: [development, ]
          MSDWILD.SpeakerDiarization.Adaptation: [development, ]
          RAMC.SpeakerDiarization.Adaptation: [development, ]
          REPERE.SpeakerDiarization.Adaptation: [development, ]
          VoxConverse.SpeakerDiarization.Adaptation: [development, ]
        test:
          AISHELL.SpeakerDiarization.Benchmark: [test, ]
          AliMeeting.SpeakerDiarization.Benchmark: [test, ]
          AMI.SpeakerDiarization.Benchmark: [test, ]
          AMI-SDM.SpeakerDiarization.Benchmark: [test, ]
          AVA-AVD.SpeakerDiarization.Benchmark: [test, ]
          CALLHOME.SpeakerDiarization.Benchmark: [test, ]
          # DISPLACE.SpeakerDiarization.Benchmark: [test, ]
          # DIHARD.SpeakerDiarization.Benchmark: [test, ]
          # Ego4D.SpeakerDiarization.Benchmark: [test, ]
          MSDWILD.SpeakerDiarization.Benchmark: [test, ]
          RAMC.SpeakerDiarization.Benchmark: [test, ]
          REPERE.SpeakerDiarization.Benchmark: [test, ]
          VoxConverse.SpeakerDiarization.Benchmark: [test, ]
```
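The `X.SpeakerDiarization.Pretraining_2023-12_no-DIHARD` meta-protocol above simply concatenates the requested subsets of each listed per-dataset protocol. A plain-Python sketch of that composition logic (the file URIs below are hypothetical; pyannote.database itself performs this merging when given the YAML):

```python
# Hypothetical per-dataset protocols mapping subset name -> file URIs.
protocols = {
    "AMI.SpeakerDiarization.Adaptation": {"train": ["ES2002a"], "development": ["ES2011a"]},
    "VoxConverse.SpeakerDiarization.Adaptation": {"train": ["abjxc"], "development": ["afjiv"]},
}

def compose(spec):
    """Concatenate the requested subsets of each source protocol,
    as an X meta-protocol does."""
    out = {}
    for subset, sources in spec.items():
        files = []
        for protocol_name, wanted_subsets in sources.items():
            for s in wanted_subsets:
                files.extend(protocols[protocol_name][s])
        out[subset] = files
    return out

# Mirrors the shape of the YAML above: each meta-subset lists
# source protocols and which of their subsets to include.
spec = {
    "train": {
        "AMI.SpeakerDiarization.Adaptation": ["train"],
        "VoxConverse.SpeakerDiarization.Adaptation": ["train"],
    },
}
merged = compose(spec)
```

The commented-out DIHARD entries in the YAML are what make this the "no-DIHARD" pretraining protocol: DIHARD is held out so it can serve as out-of-domain data.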

data/calibration/[email protected]

16.9 MB
Binary file not shown.
