Passive acoustic monitoring (PAM) is an essential tool for biodiversity conservation, but it generates vast amounts of audio data that are challenging to analyze. This project aims to automate and improve bird species identification.
The BIRDeep Bird Song Detector is part of the BIRDeep project, aimed at monitoring bird communities with PAM through deep learning in Doñana National Park.
This repository contains the code, data links, and project resources associated with the research paper:
“A Bird Song Detector for improving bird identification through Deep Learning: a case study from Doñana”
Alba Márquez-Rodríguez, Miguel Ángel Mohedano-Munoz, Manuel J. Marín-Jiménez, Eduardo Santamaría-García, Giulia Bastianelli, Pedro Jordano, Irene Mendoza
arXiv:2503.15576 · Accepted in Ecological Informatics
In the paper A Bird Song Detector for improving bird identification through Deep Learning: a case study from Doñana, we propose a deep learning pipeline for automated bird song detection and species classification using audio recordings from Doñana National Park (SW Spain). The pipeline combines a YOLOv8-based detector with a fine-tuned version of BirdNET, significantly improving identification accuracy in Doñana soundscapes. The following figure illustrates the pipeline proposed in our study:
Figure: Pipeline used for the development of our Bird Song Detector. The process was divided into three main stages:
- Preprocess: AudioMoth recorders were deployed in Doñana to collect audio data as part of the BIRDeep project. Recordings were annotated by experts and split into training, validation, and test sets.
- Bird Song Detector: A YOLOv8-based model was trained to detect segments containing bird vocalizations (presence/absence). It was applied to the test set to extract segments with potential bird vocalizations.
- Classifier: BirdNET was fine-tuned on expert-labeled data from Doñana. Its embeddings were used to train additional ML algorithms. These were validated and tested on the detected segments, resulting in improved species classification (higher True Positives, fewer False Negatives).
If you use this repository, please cite the preprint:
@misc{marquezrodriguez2025birdsongdetectorimproving,
title={A Bird Song Detector for improving bird identification through Deep Learning: a case study from Doñana},
author={Alba Márquez-Rodríguez and Miguel Ángel Mohedano-Munoz and Manuel J. Marín-Jiménez and Eduardo Santamaría-García and Giulia Bastianelli and Pedro Jordano and Irene Mendoza},
year={2025},
eprint={2503.15576},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2503.15576},
}
The preprint is available at https://arxiv.org/abs/2503.15576. The paper has been accepted in Ecological Informatics; we will post the published article as soon as it is available.
The dataset used in this research is publicly available via Hugging Face:
🔗 BIRDeep_AudioAnnotations on Hugging Face
Read more in the Data section below.
The repository is organized as follows:
- `Bird Classifiers/`: Contains the code and outputs of the bird classifiers used in the project, including the BirdNET classifier, embeddings for machine-learning-based classifiers, and other deep learning architectures.
  - `BirdNET/`: BirdNET-generated models, training plots, and predictions from some of the different models tested.
  - `models/`: The final classifiers used in the project.
  - `Scripts/`: Scripts used for data generation and training of the classifiers. Evaluation scripts are kept together in the general `Scripts/` folder.
- `BIRDeep Song Detector/`: Core structure and files for the Bird Song Detector, including the trainings and the pre-trained and fine-tuned model data.
  - `runs/detect/`: Output files from the Bird Song Detector, including model predictions and performance metrics.
- `Data/`: The audio data and annotations used for training and evaluation (see the BIRDeep_AudioAnnotations dataset), as well as images generated for the Bird Song Detector and the Deep Learning classifiers.
- `Research/`: Information collected during the literature review. Only a base research README is provided and much information is missing; for more detail, please refer to the manuscripts.
- `Scripts/`: Jupyter notebooks for data preprocessing and exploratory data analysis.
- `README.md`: This file.
Data were collected using automatic audio recording devices (AudioMoths) in three different habitats in Doñana National Park. Nine recorders are distributed across three habitats (marshland, scrubland, and ecotone) and run continuously on a duty cycle of 1 minute of recording followed by 9 minutes of pause, i.e. 1 minute is recorded out of every 10, at a sampling rate of 32 kHz. Approximately 500 minutes of audio data were recorded. Recording times prioritised the periods when birds are most active, from a few hours before dawn until midday, to capture as many songs as possible. Locations of the recorders are shown in the following map:
Expert annotators labeled 461 minutes of audio data, identifying bird vocalizations and other relevant sounds. Annotations are provided in a standard format with start time, end time, and frequency range for each bird vocalization.
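For orientation, an annotation table of this kind can be loaded and summarized with pandas. This is only a hedged sketch: the file name and column names below are illustrative placeholders, so check the Hugging Face dataset card for the exact schema.

```python
import pandas as pd

# Placeholder file and column names; the real dataset schema may differ.
annotations = pd.read_csv("annotations.csv")  # e.g. columns: audio_file, start_time, end_time, low_freq, high_freq, species

# Duration of each annotated vocalization in seconds
annotations["duration_s"] = annotations["end_time"] - annotations["start_time"]

# Quick per-species summary of annotation durations
print(annotations.groupby("species")["duration_s"].describe())
```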
🔗 Check the dataset BIRDeep_AudioAnnotations on Hugging Face
The idea behind this methodology is that Deep Learning models can learn to identify and classify bird species from Mel spectrograms, which are image-like representations of audio data, and that a general pre-trained model can achieve good results when Transfer Learning is used to adapt it to a specific problem.
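As a minimal illustration of turning a recording into a Mel spectrogram image, the snippet below uses librosa with placeholder file names and parameters; it is a generic sketch, not the exact preprocessing code of this repository.

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load a recording (placeholder path) at the project's 32 kHz sampling rate
y, sr = librosa.load("example_recording.wav", sr=32000)

# Compute a Mel spectrogram and convert power to decibels
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=16000)
S_db = librosa.power_to_db(S, ref=np.max)

# Save it as an image that an object detector could consume
fig, ax = plt.subplots(figsize=(10, 4))
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="mel", fmax=16000, ax=ax)
ax.set_axis_off()
fig.savefig("example_spectrogram.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)
```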
According to the original BirdNET paper: "In summary, BirdNET achieved a mean average precision of 0.791 for single-species recordings, a F0.5 score of 0.414 for annotated soundscapes, and an average correlation of 0.251 with hotspot observation across 121 species and 4 years of audio data." In other words, on real-world soundscape recordings, which are the kind of audio we work with, BirdNET's performance is far from optimal. The same paper reports that "the most common sources of false-positive detections were other vocalizing animals (e.g., insects, anurans, mammals), geophysical noise (e.g., wind, rain, thunder), human vocal and non-vocal sounds (e.g., whistling, footsteps, speech), anthropogenic sounds typically encountered in urban areas (e.g., cars, airplanes, sirens), and electronic recorder noise." Non-bird sounds of this kind are well represented in the Google AudioSet, one of the largest collections of human-labeled sounds, spanning a wide range of classes organized in an ontology (Gemmeke et al., 2017). Since BirdNET can produce many false positives, adding a Bird Song Detector step beforehand can reduce their number. We follow an idea from DeepFaune, where a first step based on MegaDetector separates empty camera-trap images from those containing animals, so that the classifier is applied only to samples that are likely True Positives, reducing the number of False Positives in the classifier.
- Based on YOLOv8
- Trained on spectrograms annotated with vocalizations
- Achieved a test mAP50 of ~0.30 using full-frequency bounding boxes and a reduced ESC-50 dataset
- Confidence threshold optimized at 0.15 for our study case
- Custom Classifier BirdNET v2.4 with Doñana-specific data
- Extracted embeddings (1024-dim) for training other models
- Classifiers trained: Basic Neural Network, Random Forest, ResNet50, MobileNetV2
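To make the two-stage pipeline concrete, here is a hedged sketch of how detection and classification could be chained at inference time. The weight and file paths are placeholders, the ultralytics YOLO API is assumed, and the classifier is a generic scikit-learn model trained on pre-exported BirdNET embeddings; it is not the repository's exact code.

```python
import numpy as np
import joblib
from ultralytics import YOLO

# Stage 1: Bird Song Detector (YOLOv8) on a spectrogram image (placeholder paths)
detector = YOLO("runs/detect/train/weights/best.pt")
results = detector.predict("example_spectrogram.png", conf=0.15)  # threshold used in the study
boxes = results[0].boxes.xyxy.cpu().numpy()  # detected segments as (x1, y1, x2, y2) pixels

# Stage 2: species classification of the detected segments.
# We assume 1024-dim BirdNET embeddings were already extracted for each detection
# and a Random Forest was trained on them (hypothetical file names).
clf = joblib.load("models/random_forest_on_birdnet_embeddings.joblib")
segment_embeddings = np.load("detected_segment_embeddings.npy")  # shape: (n_detections, 1024)
species = clf.predict(segment_embeddings)
print(list(zip(boxes.tolist(), species)))
```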
The main finding is that the available data were not sufficient to train a fully robust detection model. After several experiments, further improvements proved difficult, so future work on the detector is needed. The largest gain came from moving from combined temporal-and-frequency detections to temporal-only detections, i.e. training with bounding boxes that span the entire frequency spectrum and expecting detections that span it as well.
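To illustrate the temporal-only setup, the sketch below converts a time-interval annotation into a normalized YOLO label that spans the entire frequency axis. The clip duration, class id, and label conventions are assumptions for illustration, not the repository's exact conversion code.

```python
def time_interval_to_yolo_label(start_s: float, end_s: float,
                                clip_duration_s: float = 60.0,
                                class_id: int = 0) -> str:
    """Convert a temporal annotation into a YOLO box covering the full frequency range.

    YOLO format: "class x_center y_center width height", all normalized to [0, 1].
    """
    x_center = ((start_s + end_s) / 2) / clip_duration_s
    width = (end_s - start_s) / clip_duration_s
    # The box spans the whole image height, i.e. the entire frequency spectrum.
    y_center, height = 0.5, 1.0
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Example: a vocalization annotated from 12.4 s to 15.1 s in a 60 s clip
print(time_interval_to_yolo_label(12.4, 15.1))
```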
Because the model also struggled with empty instances (True Negatives and False Positives), Data Augmentation techniques were added to mitigate this. First, background audios were edited for training by modifying their intensity and adding noise; this helped, but not significantly. Later, audios from the ESC-50 library, which contains focal environmental sounds, were added after removing bird sounds such as crows and hens. In the first trainings the network did not learn and ended up classifying every instance as empty, because the ESC-50 audios vastly outnumbered those from the dataset of interest. The number of ESC-50 audios was then reduced to reach a balance, which improved the results, although not very significantly.
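For illustration, the following is a minimal numpy/librosa sketch of the kind of background augmentation described above (random intensity change plus added noise); parameters and file paths are assumptions, not the exact augmentation code used in the experiments.

```python
import numpy as np
import librosa

def augment_background(y, gain_db_range=(-6.0, 6.0), noise_snr_db=20.0, seed=None):
    """Randomly rescale intensity and add white noise to a background clip."""
    rng = np.random.default_rng(seed)
    # Random gain in decibels
    gain_db = rng.uniform(*gain_db_range)
    y = y * (10.0 ** (gain_db / 20.0))
    # White noise at a fixed signal-to-noise ratio
    signal_power = np.mean(y ** 2) + 1e-12
    noise_power = signal_power / (10.0 ** (noise_snr_db / 10.0))
    y = y + rng.normal(0.0, np.sqrt(noise_power), size=y.shape)
    return np.clip(y, -1.0, 1.0)

# Placeholder path; ESC-50 clips with bird classes (e.g., crow, hen) would be
# removed before using the rest as extra "empty" training examples.
y, sr = librosa.load("esc50_example.wav", sr=32000)
y_aug = augment_background(y, seed=0)
```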
The best detector model achieves an mAP50 of 0.29756 on the training set; validation performance was around X.XX (to be completed), and test performance was similar to validation.
Performance of the classifier alone vs. the Bird Song Detector + Classifier pipeline is as follows:
Classifier | Bird Song Detector | Acc. | Macro Prec. | Macro Rec. | Macro F1 | Weighted Prec. | Weighted Rec. | Weighted F1 | Idx Pred/Ann |
---|---|---|---|---|---|---|---|---|---|
BirdNET fine-tuned | ❌ | 0.21 | 0.12 | 0.14 | 0.11 | 0.18 | 0.21 | 0.17 | 1.8046 |
**BirdNET fine-tuned** | **✅** | **0.30** | **0.21** | **0.14** | **0.13** | **0.37** | **0.30** | **0.28** | **0.9183** |
Random Forest | ❌ | 0.19 | 0.10 | 0.10 | 0.08 | 0.19 | 0.19 | 0.15 | 0.9059 |
**Random Forest** | **✅** | **0.29** | **0.11** | **0.12** | **0.10** | **0.24** | **0.29** | **0.23** | **0.5435** |
ResNet50 | ❌ | 0.02 | 0.00 | 0.03 | 0.00 | 0.00 | 0.02 | 0.00 | 3.2682 |
**ResNet50** | **✅** | **0.08** | **0.01** | **0.05** | **0.01** | **0.01** | **0.08** | **0.02** | **0.6306** |
MobileNetV2 | ❌ | 0.02 | 0.01 | 0.04 | 0.01 | 0.01 | 0.02 | 0.01 | 3.2682 |
**MobileNetV2** | **✅** | **0.08** | **0.01** | **0.04** | **0.01** | **0.02** | **0.08** | **0.02** | **0.6306** |
Note: All metrics are better at higher values, except for Idx Pred/Ann, which is optimal when closer to 1. Bold rows indicate performance improvement when using the Bird Song Detector.
✅ Using the Bird Song Detector improves classification across all models.
- Python 3.8 or higher
- Required Python packages (listed in `environment.yml`)
If you want to reproduce this project, you can start by setting up the Conda environment. Follow these steps:
1. Clone this repository to your local machine:
   `git clone https://github.com/GrunCrow/BIRDeep_NeuralNetworks`
2. Navigate to the project's directory:
   `cd BIRDeep_NeuralNetworks`
3. Create a Conda environment using the provided `environment.yml` file:
   `conda env create -f environment.yml`
   This will create a Conda environment named "BIRDeep" with the required dependencies.
4. Activate the Conda environment:
   `conda activate BIRDeep`
A lighter alternative version of this repository, Bird-Song-Detector on GitHub, is available and includes only the most essential resources:
- Trained YOLO-based Bird Song Detector model
- Scripts to apply the detector
- A basic demo application to test the model
This repository supports the same research described in
"A Bird Song Detector for Improving Bird Identification through Deep Learning: A Case Study from Doñana",
accepted in Ecological Informatics.
It is optimized for users who want to quickly test or apply the detector without downloading the full research stack.
This project is licensed under the MIT License. See the LICENSE file for details.
Stay tuned for updates and advancements in our pursuit to understand and classify bird songs more accurately with the help of deep learning and neural networks.
This work has received financial support from the BIRDeep project (TED2021-129871A-I00), funded by MICIU/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR.