Official implementation of
- Adaptive Convolutional Neural Network for Text-Independent Speaker Recognition
by Seong-Hu Kim, Yong-Hwa Park @ Human Lab, Mechanical Engineering Department, KAIST
Accepted paper at Interspeech 2021.
This code was written mainly with reference to the baseline code.
We apply two scaling maps, one for the frequency axis and one for the time axis, to the adaptive kernel in the ACNN module. The adaptive kernel is created by element-wise multiplication of each output channel of the content-invariant kernel with the scaling matrix formed from these two maps. The structure of the proposed ACNN module for speaker recognition is shown below.
This module is applied to VGG-M and ResNet for text-independent speaker recognition.
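For concreteness, here is a minimal PyTorch sketch of this mechanism: a shared content-invariant kernel is rescaled per output channel by frequency- and time-axis scaling maps predicted from the input. The pooling choice, sigmoid gating, and the small prediction layers (`freq_scale`, `time_scale`) are illustrative assumptions, not the exact implementation in this repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveConv2d(nn.Module):
    """Sketch of an ACNN-style layer: the content-invariant kernel is
    rescaled per sample and per output channel by two axis-wise scaling
    maps (frequency and time). Layer sizes are illustrative only."""

    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        # Content-invariant kernel, shared across all inputs.
        self.weight = nn.Parameter(
            torch.randn(out_ch, in_ch, kernel_size, kernel_size) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        # Small heads that predict per-output-channel scaling vectors
        # along the frequency and time axes from pooled global context.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.freq_scale = nn.Linear(in_ch, out_ch * kernel_size)
        self.time_scale = nn.Linear(in_ch, out_ch * kernel_size)

    def forward(self, x):                        # x: (B, C, freq, time)
        b, c, _, _ = x.shape
        ctx = self.pool(x).flatten(1)            # (B, C) global context
        sf = torch.sigmoid(self.freq_scale(ctx)).view(b, -1, 1, self.k, 1)
        st = torch.sigmoid(self.time_scale(ctx)).view(b, -1, 1, 1, self.k)
        # Scaling matrix = outer product of the two axis-wise maps;
        # adaptive kernel = element-wise product with the shared kernel.
        kernel = self.weight.unsqueeze(0) * sf * st   # (B, O, C, k, k)
        # Grouped-conv trick: apply a different kernel to each sample.
        x = x.view(1, b * c, *x.shape[2:])
        kernel = kernel.reshape(-1, c, self.k, self.k)
        out = F.conv2d(x, kernel, padding=self.k // 2, groups=b)
        return out.view(b, -1, *out.shape[2:]) + self.bias.view(1, -1, 1, 1)

if __name__ == "__main__":
    layer = AdaptiveConv2d(in_ch=1, out_ch=16)
    spec = torch.randn(2, 1, 40, 100)            # (batch, ch, freq, time)
    print(layer(spec).shape)                     # torch.Size([2, 16, 40, 100])
```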
- pytorch >= 1.4.0
- torchaudio >= 0.4.0
- numpy >= 1.18
We used the VoxCeleb1 dataset in this paper. You can download the dataset by referring to the VoxCeleb site. All data should be gathered in one folder, and the dataset directories should be set in 'train_model.yaml'.
You can train the model and save it in the 'exps' folder by running:
python train_model.py
You need to adjust the training parameters in 'train_model.yaml' before training.
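Before launching a run, you can load and inspect the configuration; this is only a convenience sketch, and the actual key names are whatever 'train_model.yaml' defines.

```python
# Minimal sketch: print the training configuration before running.
# Requires PyYAML (pip install pyyaml). The key names in the comment
# below are guesses; check train_model.yaml for the real schema.
import yaml

with open('train_model.yaml') as f:
    cfg = yaml.safe_load(f)

# Dataset directories might appear under keys such as 'train_path'
# or 'test_path' -- verify against the file itself.
for key, value in cfg.items():
    print(f'{key}: {value}')
```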
Network | Top-1 (%) | Top-5 (%) | EER (%) | C_det |
---|---|---|---|---|
Adaptive VGG-M (N=18) | 86.51 | 95.31 | 5.68 | 0.510 |
Adaptive ResNet18 (N=18) | 85.84 | 95.29 | 6.18 | 0.589 |
Pretrained models are provided in the 'pretrained_model' folder. Separate example code for verification with these models is not provided.
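As a starting point, the sketch below shows one common way to run verification with a saved checkpoint: extract embeddings for two utterances and score them by cosine similarity. The encoder class, checkpoint path, and feature shapes are placeholders, not this repo's actual API; substitute the real network and a checkpoint from 'pretrained_model'.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in encoder; replace with the actual model class from this repo.
class DummyEncoder(nn.Module):
    def __init__(self, emb_dim=512):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(16, emb_dim)

    def forward(self, x):                        # x: (B, 1, freq, time)
        return self.fc(self.conv(x).flatten(1))  # (B, emb_dim)

model = DummyEncoder()
# Hypothetical checkpoint name -- use the real file in 'pretrained_model':
# model.load_state_dict(torch.load('pretrained_model/<checkpoint>.pt',
#                                  map_location='cpu'))
model.eval()

with torch.no_grad():
    emb_a = model(torch.randn(1, 1, 40, 300))    # enrollment utterance
    emb_b = model(torch.randn(1, 1, 40, 300))    # test utterance
    score = F.cosine_similarity(emb_a, emb_b, dim=-1)  # higher = same speaker
print(float(score))
```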
@inproceedings{kim21_interspeech,
author={Seong-Hu Kim and Yong-Hwa Park},
title={{Adaptive Convolutional Neural Network for Text-Independent Speaker Recognition}},
year=2021,
booktitle={Proc. Interspeech 2021},
pages={66--70},
doi={10.21437/Interspeech.2021-65}
}
Please contact Seong-Hu Kim at [email protected] with any questions.