
Identification of bird songs (Kaggle Competition)
Analysis, Extraction of spectrogram image information


Team



Abstract

BirdCLEF 2021 is a Kaggle competition[1] that aims to classify bird songs by species. This task is very complex because the recordings are very noisy. We propose to extract the mel-spectrogram from the audio files and use a convolutional neural network to perform the classification. ResNet models seem to be the most promising, although pre-training these networks does not appear to be beneficial. ResNet 34 obtains a test accuracy of 52.48%. This suggests that the task is feasible but that there is still room for improvement.



Introduction

Problem presentation

The identification of bird calls is a complex task that falls under audio classification. It normally requires two steps: 1) detecting the sound signals and 2) classifying them. This task is rarely simple since the audio files contain a great deal of noise that is unrelated to what we are trying to classify.


A model that can adequately recognize birds based on their calls could be useful in locating species without having to recognize them visually. Many birds can be difficult to observe due to their multiple camouflage techniques, which motivates this audio classification project. By installing song recognition devices at key locations, it would be possible, for example, to identify the departures and arrivals of the migrations of certain birds or even to delimit the territories that each species occupies.



Dataset presentation

The dataset contains audio files for 397 different bird species. Each audio file is of a variable length ranging from 10 seconds to 2 minutes. These recordings are files with the .ogg extension. In addition, it should be noted that the number of recordings is not constant from species to species. The most represented class contains about 500 recordings while the least represented contains only about twenty. The recordings are very noisy in nature since they are recorded in the natural habitat of the species. We can therefore hear, for example, the sound of the wind in the leaves.
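As a quick illustration, the following sketch inspects the competition data to count recordings per species and check the duration of one file. The folder layout (one sub-folder per species under train_short_audio/) is an assumption based on the Kaggle dataset, not code from the project itself.

```python
import os
from collections import Counter

import librosa  # used only to load one .ogg file and compute its duration

DATA_DIR = "train_short_audio"  # assumed path to the competition audio folder

# Count recordings per species to quantify the class imbalance.
counts = Counter()
for species in sorted(os.listdir(DATA_DIR)):
    species_dir = os.path.join(DATA_DIR, species)
    if os.path.isdir(species_dir):
        counts[species] = len(os.listdir(species_dir))

print("Number of species:", len(counts))
print("Most represented:", counts.most_common(1))
print("Least represented:", counts.most_common()[-1])

# Duration (in seconds) of one arbitrary recording.
first_species = next(iter(counts))
first_file = os.listdir(os.path.join(DATA_DIR, first_species))[0]
audio, sr = librosa.load(os.path.join(DATA_DIR, first_species, first_file), sr=None)
print("Duration of", first_file, ":", round(len(audio) / sr, 1), "seconds")
```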



Difference between the project and the competition

The goal of the Kaggle competition is to classify bird songs in a 10-minute audio sequence for each 5-second period contained in that sequence. We modified the goal of the competition to obtain a more classical classification task: identifying a bird song within a single 5-second period. This made it easier to separate the training and test data. In addition, this approach was closer to what we had seen in class, so we could put most of our effort into experimenting with different architectures rather than spending a major part of the time on data processing.
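Below is a minimal sketch of how such fixed 5-second training examples could be cut out of a longer recording. The sampling rate, the maximum number of segments per file and the zero-padding of short recordings are illustrative assumptions, not the project's exact preprocessing code.

```python
import librosa
import numpy as np

SAMPLE_RATE = 32000                       # assumed sampling rate
SEGMENT_SAMPLES = SAMPLE_RATE * 5         # 5-second windows

def extract_segments(path, max_segments=3):
    """Return up to `max_segments` non-overlapping 5-second windows."""
    audio, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    segments = []
    for start in range(0, len(audio) - SEGMENT_SAMPLES + 1, SEGMENT_SAMPLES):
        segments.append(audio[start:start + SEGMENT_SAMPLES])
        if len(segments) == max_segments:
            break
    if not segments:                      # recording shorter than 5 seconds
        padded = np.zeros(SEGMENT_SAMPLES, dtype=np.float32)
        padded[:len(audio)] = audio
        segments.append(padded)
    return segments
```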



State of the art

Several classical machine learning models have been used to tackle audio classification. For example, support vector machines have been used to differentiate between two English accents[2] and, in another case, to classify a recording according to 8 different emotions[3]. However, since we are looking to use deep learning to solve the bird song classification problem, we focus on those methods instead.


Several authors recommend deep networks with convolutional layers. These networks are, however, mainly used for image classification tasks, which is why several authors first extract spectrograms from the recordings. Spectrograms have been used, for example, to identify the predominant instrument in a musical sequence[4] or for speech recognition[5].
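The following sketch shows how a 5-second audio segment can be turned into a mel-spectrogram image with librosa[6]; the parameter values (n_fft, hop_length, n_mels) are illustrative assumptions rather than the exact settings used in the project.

```python
import librosa
import numpy as np

def audio_to_mel_image(segment, sample_rate=32000,
                       n_fft=2048, hop_length=512, n_mels=128):
    """Convert a 1-D audio segment into a 2-D mel-spectrogram in decibels."""
    mel = librosa.feature.melspectrogram(
        y=segment, sr=sample_rate,
        n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Rescale to [0, 1] so the array can be treated as a grayscale image.
    return (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
```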



Discussion

Discussion of results

We tested several ResNet models (18 and 34) and found that ResNet 34 gave better results. Larger models such as ResNet 50 or ResNet 101 could not be tested because the Kaggle notebook did not allow us to allocate enough memory to train them. Deeper architectures might have yielded slightly better results.
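As a sketch of the setup described above (and not the exact training code), torchvision's ResNet-34 can be adapted to our 397 bird classes and to single-channel spectrogram inputs as follows; the input size of 128 x 313 is an illustrative assumption.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 397

# weights=None trains from scratch; weights="IMAGENET1K_V1" would load the
# pre-trained weights (torchvision >= 0.13; older versions use pretrained=...).
model = models.resnet34(weights=None)
# Replace the first convolution to accept 1-channel spectrograms instead of RGB.
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# Replace the final fully connected layer with one output per species.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Sanity check with a dummy batch of grayscale mel-spectrograms.
dummy = torch.randn(4, 1, 128, 313)
print(model(dummy).shape)  # torch.Size([4, 397])
```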



Table 1: Summary table of the different architectures tested


The tests on the VGG 11 architecture were done with the same hyperparameters as those used for ResNet 34, but the results are very disappointing: training and validation accuracy remain below 1%. This suggests a problem with the network's initialization that prevents it from learning the training set. The second experiment that did not work well was the use of a mask to binarize our images. Trying several different thresholds might have produced better results with this approach.
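For reference, the binarization idea amounts to keeping only the pixels above some threshold so that the bird signal stands out from the background noise. The threshold value below is an arbitrary assumption; several values would have to be tried.

```python
import numpy as np

def binarize_spectrogram(mel_db, threshold=0.6):
    """mel_db: mel-spectrogram already rescaled to [0, 1]; returns a binary mask."""
    return (mel_db >= threshold).astype(np.float32)
```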


We can also conclude that using pre-trained networks was not useful for our classification task. The very different nature of our images compared to those of the ImageNet database is, in our opinion, the main reason why this approach performs poorly.


The different techniques aimed at reducing overfitting (early stopping, dropout and an adaptive learning rate) seem to improve our results on the test set: ResNet 34 obtains a test accuracy of 49.33% when trained without these techniques versus 52.48% with them.
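The sketch below illustrates this regularization setup with PyTorch's ReduceLROnPlateau scheduler and a simple early-stopping counter on the validation loss. The stand-in model, the random data and the patience values are assumptions used only to keep the example self-contained.

```python
import torch
import torch.nn as nn

# Tiny stand-in for ResNet 34, with dropout as mentioned above.
model = nn.Sequential(nn.Flatten(), nn.Dropout(0.3), nn.Linear(128 * 313, 397))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2)

# Random stand-ins for one training batch and one validation batch.
x_train, y_train = torch.randn(8, 1, 128, 313), torch.randint(0, 397, (8,))
x_val, y_val = torch.randn(8, 1, 128, 313), torch.randint(0, 397, (8,))

best_val_loss, epochs_without_improvement = float("inf"), 0
for epoch in range(50):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(x_train), y_train)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(x_val), y_val).item()
    scheduler.step(val_loss)          # lower the learning rate on a plateau

    if val_loss < best_val_loss:
        best_val_loss, epochs_without_improvement = val_loss, 0
        torch.save(model.state_dict(), "best_model.pt")
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= 5:   # early stopping
            break
```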



Possible improvements

The first improvement that could be implemented is to extract the mel-spectrogram of the audio files in color rather than in grayscale. This would probably produce a better contrast between the signals of interest and the noise contained in the images, which would allow the networks to better classify the data.
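One simple way to do this (a sketch, not something that was implemented in the project) is to apply a matplotlib colormap to the grayscale mel-spectrogram, producing a 3-channel image; the choice of the magma colormap is an arbitrary assumption.

```python
import matplotlib.pyplot as plt
import numpy as np

def to_color(mel_db):
    """mel_db: 2-D array scaled to [0, 1]; returns an (H, W, 3) RGB array."""
    rgba = plt.get_cmap("magma")(mel_db)      # apply the colormap -> (H, W, 4)
    return rgba[..., :3].astype(np.float32)   # drop the alpha channel
```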


In order to address the class imbalance problem, it would be possible to extract more than 3 mel-spectrograms per audio file for underrepresented classes and only 3 for classes with a large number of recordings. We could also use a metric other than accuracy to compare our models. Since we have a very large number of classes, top-k accuracy or the F1 score could be appropriate.
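As an illustration, a top-k accuracy metric (a sketch, not part of the project's code) credits a prediction when the true species appears among the k highest-scoring classes, which is more forgiving with 397 classes.

```python
import torch

def top_k_accuracy(logits, targets, k=5):
    """logits: (N, num_classes) scores; targets: (N,) integer class indices."""
    top_k = logits.topk(k, dim=1).indices              # (N, k) best classes
    correct = (top_k == targets.unsqueeze(1)).any(dim=1)
    return correct.float().mean().item()
```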


Using multiple networks with different architectures and performing a majority vote to obtain the final prediction could also be a way to achieve better performance.
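A minimal sketch of such a majority vote is shown below; the list of models is a placeholder for networks that would have to be trained separately.

```python
import torch

def majority_vote(models, batch):
    """Return, for each example in the batch, the class chosen by most models."""
    with torch.no_grad():
        votes = torch.stack([m(batch).argmax(dim=1) for m in models])  # (M, N)
    return votes.mode(dim=0).values                                    # (N,)
```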



Conclusion

Although the task of classifying bird songs is not a simple problem to solve, the best performing model achieves an accuracy of 52.48% on the test set. This may seem low, but since the data is very noisy and the number of classes is very high, we consider this performance to be reasonable. By having access to more powerful computing resources and implementing different deeper architectures, we could probably achieve better results. However, we believe that color extraction and more sophisticated image processing would give us the greatest performance gains.



References

[1] BirdCLEF 2021, the Kaggle competition organized by the Cornell Lab of Ornithology.


[2] Pedersen C, Diederich J (2007) Accent classification using support vector machines. In: Annual IEEE/ACIS International Conference on Computer and Information Science, pp 444–449.


[3] Shegokar P, Sircar P (2016) Continuous wavelet transform based speech emotion recognition. In: International Conference on Signal Processing and Communication Systems, pp 1–8.


[4] Han Y, Kim J, Lee K (2017) Deep convolutional neural networks for predominant instrument recognition in polyphonic music. IEEE/ACM Trans Audio Speech Lang Process 25(1):208–221.


[5] Hannun AY, Case C, Casper J, Catanzaro B, Diamos G, Elsen E, Prenger R, Satheesh S, Sengupta S, Coates A, Ng AY (2014) Deep Speech: scaling up end-to-end speech recognition.


[6] Roberts L (2020) Understanding the Mel Spectrogram. Medium.com, March 5, 2020.