Passive acoustic monitoring (PAM) is an essential tool for biodiversity conservation, but it generates vast amounts of audio data that are challenging to analyze. This project aims to automate and improve bird species identification.
The BIRDeep Bird Song Detector is part of the BIRDeep project, aimed at monitoring bird communities with PAM through deep learning in Doñana National Park.
This repository contains the code, data links, and project resources associated with the research paper:
“A Bird Song Detector for improving bird identification through Deep Learning: a case study from Doñana”
Alba Márquez-Rodríguez, Miguel Ángel Mohedano-Munoz, Manuel J. Marín-Jiménez, Eduardo Santamaría-García, Giulia Bastianelli, Pedro Jordano, Irene Mendoza
arXiv:2503.15576 · Accepted in Ecological Informatics
In the paper A Bird Song Detector for improving bird identification through Deep Learning: a case study from Doñana, we propose a deep learning pipeline for automated bird song detection and species classification using audio recordings from Doñana National Park (SW Spain). The pipeline combines a YOLOv8-based detector with a fine-tuned version of BirdNET, significantly improving identification accuracy in Doñana soundscapes. The following figure illustrates the pipeline proposed in our study:
Figure: Pipeline used for the development of our Bird Song Detector. The process was divided into three main stages:
- Preprocess: AudioMoth recorders were deployed in Doñana to collect audio data as part of the BIRDeep project. Recordings were annotated by experts and split into training, validation, and test sets.
- Bird Song Detector: A YOLOv8-based model was trained to detect segments containing bird vocalizations (presence/absence). It was applied to the test set to extract segments with potential bird vocalizations.
- Classifier: BirdNET was fine-tuned on expert-labeled data from Doñana. Its embeddings were used to train additional ML algorithms. These were validated and tested on the detected segments, resulting in improved species classification (higher True Positives, fewer False Negatives).
If you use this repository, please cite the preprint:
@misc{marquezrodriguez2025birdsongdetectorimproving,
title={A Bird Song Detector for improving bird identification through Deep Learning: a case study from Doñana},
author={Alba Márquez-Rodríguez and Miguel Ángel Mohedano-Munoz and Manuel J. Marín-Jiménez and Eduardo Santamaría-García and Giulia Bastianelli and Pedro Jordano and Irene Mendoza},
year={2025},
eprint={2503.15576},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2503.15576},
}
The preprint is available at https://arxiv.org/abs/2503.15576. The paper has been accepted in Ecological Informatics; we will post the published article as soon as it is available.
The dataset used in this research is publicly available via Hugging Face:
🔗 BIRDeep_AudioAnnotations on Hugging Face
Read more in the Data section below.
The repository is organized as follows:
- `Bird Classifiers/`: Contains the code and outputs of the bird classifiers used in the project, including the BirdNET classifier, embeddings for machine-learning-based classifiers, and other deep learning architectures.
  - `BirdNET/`: BirdNET-generated models, training plots, and predictions from some of the different models tested.
  - `models/`: The final classifiers used in the project.
  - `Scripts/`: Scripts used for data generation and training of the classifiers. Evaluation scripts are kept together in the general `Scripts/` folder.
- `BIRDeep Song Detector/`: Core structure and files for the Bird Song Detector, including the trainings and the pre-trained and fine-tuned model data.
  - `runs/detect/`: Output files from the Bird Song Detector, including model predictions and performance metrics.
- `Data/`: The audio data and annotations used for training and evaluation (see the BIRDeep_AudioAnnotations dataset), as well as images generated for the Bird Song Detector and the Deep Learning classifiers.
- `Research/`: Information collected during the literature review. Only a base research README is provided and much information is missing; for more detail, please refer to the manuscripts.
- `Scripts/`: Jupyter notebooks for data preprocessing and exploratory data analysis.
- `README.md`: This file.
Data were collected using automatic audio recording devices (AudioMoths) in three different habitats in Doñana National Park. Nine recorders are distributed across three habitats (marshland, scrubland, and ecotone) and run continuously on a duty cycle of 1 minute of recording followed by 9 minutes of pause, i.e. 1 minute is recorded out of every 10, at a sampling rate of 32 kHz. Approximately 500 minutes of audio data were recorded. Recording times prioritised the periods when birds are most active, from a few hours before dawn until midday, to capture as many songs as possible. Locations of the recorders are shown in the following map:
Expert annotators labeled 461 minutes of audio data, identifying bird vocalizations and other relevant sounds. Annotations are provided in a standard format with start time, end time, and frequency range for each bird vocalization.
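For orientation, an annotation table of this kind can be loaded and summarized with pandas. This is only a hedged sketch: the file name and column names below are illustrative placeholders, so check the Hugging Face dataset card for the exact schema.

```python
import pandas as pd

# Placeholder file and column names; the real dataset schema may differ.
annotations = pd.read_csv("annotations.csv")  # e.g. columns: audio_file, start_time, end_time, low_freq, high_freq, species

# Duration of each annotated vocalization in seconds
annotations["duration_s"] = annotations["end_time"] - annotations["start_time"]

# Quick per-species summary of annotation durations
print(annotations.groupby("species")["duration_s"].describe())
```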
🔗 Check the dataset BIRDeep_AudioAnnotations on Hugging Face
The idea behind this methodology is that Deep Learning models can learn to identify and classify bird species from Mel spectrograms, which are image-like representations of audio data, and that a general pre-trained model can achieve good results when Transfer Learning is used to adapt it to a specific problem.
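As a minimal illustration of turning a recording into a Mel spectrogram image, the snippet below uses librosa with placeholder file names and parameters; it is a generic sketch, not the exact preprocessing code of this repository.

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load a recording (placeholder path) at the project's 32 kHz sampling rate
y, sr = librosa.load("example_recording.wav", sr=32000)

# Compute a Mel spectrogram and convert power to decibels
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=16000)
S_db = librosa.power_to_db(S, ref=np.max)

# Save it as an image that an object detector could consume
fig, ax = plt.subplots(figsize=(10, 4))
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="mel", fmax=16000, ax=ax)
ax.set_axis_off()
fig.savefig("example_spectrogram.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)
```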
According to the original BirdNET paper: "In summary, BirdNET achieved a mean average precision of 0.791 for single-species recordings, a F0.5 score of 0.414 for annotated soundscapes, and an average correlation of 0.251 with hotspot observation across 121 species and 4 years of audio data." In other words, on real-world soundscape recordings, which are the kind of audio we work with, BirdNET's performance is far from optimal. The same paper reports that "the most common sources of false-positive detections were other vocalizing animals (e.g., insects, anurans, mammals), geophysical noise (e.g., wind, rain, thunder), human vocal and non-vocal sounds (e.g., whistling, footsteps, speech), anthropogenic sounds typically encountered in urban areas (e.g., cars, airplanes, sirens), and electronic recorder noise." Non-bird sounds of this kind are well represented in the Google AudioSet, one of the largest collections of human-labeled sounds, spanning a wide range of classes organized in an ontology (Gemmeke et al., 2017). Since BirdNET can produce many false positives, adding a Bird Song Detector step beforehand can reduce their number. We follow an idea from DeepFaune, where a first step based on MegaDetector separates empty camera-trap images from those containing animals, so that the classifier is applied only to samples that are likely True Positives, reducing the number of False Positives in the classifier.
- Based on YOLOv8
- Trained on spectrograms annotated with vocalizations
- Achieved a test mAP50 of ~0.30 using full-frequency bounding boxes and a reduced ESC-50 dataset
- Confidence threshold optimized at 0.15 for our study case
- Custom Classifier BirdNET v2.4 with Doñana-specific data
- Extracted embeddings (1024-dim) for training other models
- Classifiers trained: Basic Neural Network, Random Forest, ResNet50, MobileNetV2
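To make the two-stage pipeline concrete, here is a hedged sketch of how detection and classification could be chained at inference time. The weight and file paths are placeholders, the ultralytics YOLO API is assumed, and the classifier is a generic scikit-learn model trained on pre-exported BirdNET embeddings; it is not the repository's exact code.

```python
import numpy as np
import joblib
from ultralytics import YOLO

# Stage 1: Bird Song Detector (YOLOv8) on a spectrogram image (placeholder paths)
detector = YOLO("runs/detect/train/weights/best.pt")
results = detector.predict("example_spectrogram.png", conf=0.15)  # threshold used in the study
boxes = results[0].boxes.xyxy.cpu().numpy()  # detected segments as (x1, y1, x2, y2) pixels

# Stage 2: species classification of the detected segments.
# We assume 1024-dim BirdNET embeddings were already extracted for each detection
# and a Random Forest was trained on them (hypothetical file names).
clf = joblib.load("models/random_forest_on_birdnet_embeddings.joblib")
segment_embeddings = np.load("detected_segment_embeddings.npy")  # shape: (n_detections, 1024)
species = clf.predict(segment_embeddings)
print(list(zip(boxes.tolist(), species)))
```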
The main finding is that the available data were not sufficient to train a fully robust detection model. After several experiments, further improvements proved difficult, so future work on the detector is needed. The largest gain came from moving from combined temporal-and-frequency detections to temporal-only detections, i.e. training with bounding boxes that span the entire frequency spectrum and expecting detections that span it as well.
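To illustrate the temporal-only setup, the sketch below converts a time-interval annotation into a normalized YOLO label that spans the entire frequency axis. The clip duration, class id, and label conventions are assumptions for illustration, not the repository's exact conversion code.

```python
def time_interval_to_yolo_label(start_s: float, end_s: float,
                                clip_duration_s: float = 60.0,
                                class_id: int = 0) -> str:
    """Convert a temporal annotation into a YOLO box covering the full frequency range.

    YOLO format: "class x_center y_center width height", all normalized to [0, 1].
    """
    x_center = ((start_s + end_s) / 2) / clip_duration_s
    width = (end_s - start_s) / clip_duration_s
    # The box spans the whole image height, i.e. the entire frequency spectrum.
    y_center, height = 0.5, 1.0
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Example: a vocalization annotated from 12.4 s to 15.1 s in a 60 s clip
print(time_interval_to_yolo_label(12.4, 15.1))
```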
Because the model also struggled with empty instances (True Negatives and False Positives), Data Augmentation techniques were added to mitigate this. First, background audios were edited for training by modifying their intensity and adding noise; this helped, but not significantly. Later, audios from the ESC-50 library, which contains focal environmental sounds, were added after removing bird sounds such as crows and hens. In the first trainings the network did not learn and ended up classifying every instance as empty, because the ESC-50 audios vastly outnumbered those from the dataset of interest. The number of ESC-50 audios was then reduced to reach a balance, which improved the results, although not very significantly.
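For illustration, the following is a minimal numpy/librosa sketch of the kind of background augmentation described above (random intensity change plus added noise); parameters and file paths are assumptions, not the exact augmentation code used in the experiments.

```python
import numpy as np
import librosa

def augment_background(y, gain_db_range=(-6.0, 6.0), noise_snr_db=20.0, seed=None):
    """Randomly rescale intensity and add white noise to a background clip."""
    rng = np.random.default_rng(seed)
    # Random gain in decibels
    gain_db = rng.uniform(*gain_db_range)
    y = y * (10.0 ** (gain_db / 20.0))
    # White noise at a fixed signal-to-noise ratio
    signal_power = np.mean(y ** 2) + 1e-12
    noise_power = signal_power / (10.0 ** (noise_snr_db / 10.0))
    y = y + rng.normal(0.0, np.sqrt(noise_power), size=y.shape)
    return np.clip(y, -1.0, 1.0)

# Placeholder path; ESC-50 clips with bird classes (e.g., crow, hen) would be
# removed before using the rest as extra "empty" training examples.
y, sr = librosa.load("esc50_example.wav", sr=32000)
y_aug = augment_background(y, seed=0)
```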
The best detector model achieves an mAP50 of 0.29756 on the training set; validation performance was around X.XX (to be completed), and test performance was similar to validation.
Performance of the classifier alone vs. the Bird Song Detector + Classifier pipeline is as follows:
Classifier | Bird Song Detector | Acc. | Macro Prec. | Macro Rec. | Macro F1 | Weighted Prec. | Weighted Rec. | Weighted F1 | Idx Pred/Ann |
---|---|---|---|---|---|---|---|---|---|
BirdNET fine-tuned | ❌ | 0.21 | 0.12 | 0.14 | 0.11 | 0.18 | 0.21 | 0.17 | 1.8046 |
**BirdNET fine-tuned** | **✅** | **0.30** | **0.21** | **0.14** | **0.13** | **0.37** | **0.30** | **0.28** | **0.9183** |
Random Forest | ❌ | 0.19 | 0.10 | 0.10 | 0.08 | 0.19 | 0.19 | 0.15 | 0.9059 |
**Random Forest** | **✅** | **0.29** | **0.11** | **0.12** | **0.10** | **0.24** | **0.29** | **0.23** | **0.5435** |
ResNet50 | ❌ | 0.02 | 0.00 | 0.03 | 0.00 | 0.00 | 0.02 | 0.00 | 3.2682 |
**ResNet50** | **✅** | **0.08** | **0.01** | **0.05** | **0.01** | **0.01** | **0.08** | **0.02** | **0.6306** |
MobileNetV2 | ❌ | 0.02 | 0.01 | 0.04 | 0.01 | 0.01 | 0.02 | 0.01 | 3.2682 |
**MobileNetV2** | **✅** | **0.08** | **0.01** | **0.04** | **0.01** | **0.02** | **0.08** | **0.02** | **0.6306** |
Note: All metrics are better at higher values, except for Idx Pred/Ann, which is optimal when closer to 1. Bold rows indicate performance improvement when using the Bird Song Detector.
✅ Using the Bird Song Detector improves classification across all models.
- Python 3.8 or higher
- Required Python packages (listed in `environment.yml`)
If you want to reproduce this project, you can start by setting up the Conda environment. Follow these steps:
1. Clone this repository to your local machine:
   `git clone https://github.com/GrunCrow/BIRDeep_NeuralNetworks`
2. Navigate to the project's directory:
   `cd BIRDeep_NeuralNetworks`
3. Create a Conda environment using the provided `environment.yml` file:
   `conda env create -f environment.yml`
   This will create a Conda environment named "BIRDeep" with the required dependencies.
4. Activate the Conda environment:
   `conda activate BIRDeep`
A lighter alternative version of this repository, Bird-Song-Detector on GitHub, is available and includes only the most essential resources:
- Trained YOLO-based Bird Song Detector model
- Scripts to apply the detector
- A basic demo application to test the model
This repository supports the same research described in
"A Bird Song Detector for Improving Bird Identification through Deep Learning: A Case Study from Doñana",
accepted in Ecological Informatics.
It is optimized for users who want to quickly test or apply the detector without downloading the full research stack.
This project is licensed under the MIT License. See the LICENSE file for details.
Stay tuned for updates and advancements in our pursuit to understand and classify bird songs more accurately with the help of deep learning and neural networks.
This work has received financial support from the BIRDeep project (TED2021-129871A-I00), funded by MICIU/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR.