Convert BirdNET models (Keras/H5 format) to Apple Core ML format (.mlpackage) for on-device inference on iOS/macOS devices.
This repository provides tools to convert BirdNET's bird detection models to Apple's Core ML format, enabling:
- Audio Model Conversion: Convert the main BirdNET acoustic model that identifies bird species from audio
- Metadata Model Conversion: Convert the location/time-based model that filters species by geographic occurrence
- Model Verification: Tools to validate that converted models produce results identical to the originals
Use the automated setup script:
cd BirdNET-CoreML
./setup_environment.sh
source venv/bin/activate
This creates a clean virtual environment with all required dependencies.
The main conversion script handles both Keras/H5 models and TensorFlow SavedModel directories:
# Convert the included Keras/H5 model
python coreml_export/convert_keras_to_coreml.py \
--in_path coreml_export/input/audio-model.h5 \
--out_path coreml_export/output/audio-model.mlpackage
# Convert from a SavedModel directory
python coreml_export/convert_keras_to_coreml.py \
--in_path path/to/savedmodel/ \
--out_path output/model.mlpackage
Options:
- --target: Minimum deployment target (default: ios15)
- --keep_fp32: Keep 32-bit precision (not recommended; models are already optimized for FP16)
- --melspec_layer_file: Specify the MelSpecLayerSimple implementation (default: MelSpecLayerSimple_fixed.py for CoreML compatibility)
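For example, a conversion that keeps full FP32 precision might look like this (the output filename here is only an illustration):
python coreml_export/convert_keras_to_coreml.py \
  --in_path coreml_export/input/audio-model.h5 \
  --out_path coreml_export/output/audio-model-fp32.mlpackage \
  --target ios15 \
  --keep_fp32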
BirdNET uses a metadata model to filter predictions based on location and time of year:
# Convert the metadata model
python coreml_export/convert_meta_model_to_coreml.py \
--input coreml_export/input/meta-model.h5 \
--output coreml_export/output/metadata-model.mlpackage
The converted audio model expects:
- Input: 3-second mono audio @ 48 kHz (144,000 samples)
- Output: Probability scores for 6,522 bird species
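As a minimal sketch of preparing that input in Python (librosa is just one option, "recording.wav" is a placeholder, and the (1, 144000) batch shape is an assumption about the converted model's expected input layout):
import librosa
import numpy as np

# Load 3 seconds of mono audio resampled to 48 kHz and zero-pad to exactly 144,000 samples
audio, _ = librosa.load("recording.wav", sr=48000, mono=True, duration=3.0)
audio = np.pad(audio, (0, max(0, 144000 - len(audio))))
audio_data = audio[np.newaxis, :].astype(np.float32)  # shape (1, 144000)
The resulting audio_data can then be passed to the converted model as in the example below.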
Use the metadata model to filter predictions by location:
import coremltools

from coreml_export.meta_utils import (
    get_species_priors,
    filter_by_location,
    load_coreml_meta_model
)
# Load models
audio_model = coremltools.models.MLModel("audio-model.mlpackage")
meta_model = load_coreml_meta_model("metadata-model.mlpackage")
# Get audio predictions
audio_scores = audio_model.predict({"input": audio_data})
# Filter by location (latitude, longitude, week_of_year)
filtered_scores, filtered_labels = filter_by_location(
audio_scores, species_labels,
latitude=40.7128, longitude=-74.0060, week=12,
meta_model=meta_model
)
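A quick way to inspect the result (this assumes filtered_scores is a 1-D array aligned with filtered_labels, which is an assumption about the meta_utils return format):
import numpy as np

# Print the five highest-scoring species after the location/week filter
top = np.argsort(filtered_scores)[::-1][:5]
for i in top:
    print(f"{filtered_labels[i]}: {filtered_scores[i]:.3f}")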
The repository includes verification tools:
Compare predictions between different model formats:
# Compare Keras vs CoreML on test audio files
python verification/compare_model_predictions.py \
--model1 coreml_export/input/audio-model.h5 \
--model2 coreml_export/output/audio-model.mlpackage \
--audio_dir verification/bird_sounds \
--output_csv comparison_results.csv
Test the metadata model with 17 geographic test cases:
python verification/verify_meta_models.py
BirdNET-CoreML/
├── coreml_export/ # Main conversion scripts
│ ├── convert_keras_to_coreml.py # Audio model converter
│ ├── convert_meta_model_to_coreml.py # Metadata model converter
│ ├── meta_utils.py # Location filtering utilities
│ ├── input/ # Input models and resources
│ │ ├── audio-model.h5 # Pre-trained audio model
│ │ ├── meta-model.h5 # Pre-trained metadata model
│ │ └── labels/ # Species labels in multiple languages
│ └── output/ # Converted CoreML models
├── verification/ # Model validation tools
│ ├── compare_model_predictions.py # Compare outputs across formats
│ ├── verify_meta_models.py # Test metadata model
│ └── bird_sounds/ # Test audio samples
├── requirements.txt # Python dependencies
├── setup_environment.sh # Automated setup script
└── deprecated/ # Legacy scripts no longer needed
└── custom_layers.py # Legacy SimpleSpecLayer (no longer used)
- Audio Model: ResNet-based architecture with custom spectrogram preprocessing
- Input: Raw audio waveform (144,000 samples @ 48kHz)
- Custom Layers: MelSpecLayerSimple_fixed.py - a modified mel-spectrogram layer that avoids CoreML-incompatible operations
- Output: 6,522 bird species probabilities
The original MelSpecLayerSimple layer used TensorFlow's tf.abs() operation on complex spectrograms, which isn't supported by CoreML. The fixed version manually computes the magnitude spectrum using sqrt(real² + imag²), making it compatible with CoreML conversion while producing identical results.
The models use FP16 (16-bit floating point) precision by default, which provides:
- ~50MB model size (vs ~100MB for FP32)
- Faster inference on Apple Neural Engine
- Negligible impact on accuracy
Initial experiments with FP32 showed no meaningful accuracy improvements, so FP16 is recommended.
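As a rough sketch of how this precision choice is expressed with coremltools (a simplified stand-in for what the conversion script does; keras_model represents the already-loaded BirdNET Keras model, whose loading of custom layers is omitted here):
import coremltools as ct

# FP16 is coremltools' default precision for ML Program models;
# passing ct.precision.FLOAT32 instead would keep full 32-bit weights.
mlmodel = ct.convert(
    keras_model,
    convert_to="mlprogram",
    minimum_deployment_target=ct.target.iOS15,
    compute_precision=ct.precision.FLOAT16,
)
mlmodel.save("audio-model.mlpackage")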
The model identifies 6,522 bird species globally, with labels available in 27 languages.
If you encounter import errors:
- Delete your virtual environment
- Run ./setup_environment.sh to create a fresh environment
You may see: TensorFlow version 2.15.0 has not been tested with coremltools
This warning can be safely ignored - the conversion has been thoroughly tested.
The converter automatically handles BirdNET's custom layers. If you encounter issues:
- Ensure the correct MelSpecLayerSimple implementation is used (typically MelSpecLayerSimple_fixed.py); the fixed version avoids CoreML-incompatible operations while maintaining identical functionality
- The SimpleSpecLayer from custom_layers.py existed in early versions of BirdNET but is no longer used in current models
SimpleSpecLayer (deprecated):
- Creates a basic linear spectrogram (not mel-scaled)
- Uses STFT to get frequency bins linearly spaced
- Output shape: (257, 384) - likely 257 frequency bins
- Simple magnitude scaling and normalization
- No frequency warping to match human perception
MelSpecLayerSimple:
- Creates a mel-spectrogram (perceptually-scaled frequencies)
- Uses mel filterbank to convert linear frequencies to mel scale
- Includes frequency range parameters (fmin, fmax)
- More sophisticated preprocessing (normalization to [-1, 1] range)
- Better suited for audio ML tasks involving human-relevant sounds
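For illustration only, a minimal TensorFlow sketch of the mel-filterbank step described above (the bin counts, fmin, and fmax values are placeholders, not BirdNET's actual parameters):
import tensorflow as tf

# Toy linear-magnitude spectrogram: 100 frames x 257 STFT bins
linear_spec = tf.random.uniform([100, 257])

# Weight matrix that maps linear frequency bins onto mel-spaced bins
mel_matrix = tf.signal.linear_to_mel_weight_matrix(
    num_mel_bins=96,            # placeholder
    num_spectrogram_bins=257,   # matches the STFT bin count
    sample_rate=48000,
    lower_edge_hertz=150.0,     # fmin (placeholder)
    upper_edge_hertz=15000.0,   # fmax (placeholder)
)
mel_spec = tf.matmul(linear_spec, mel_matrix)  # shape (100, 96)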
Why MelSpecLayerSimple_fixed.py was needed
The key issue is on line 51 of the fixed version:
real = tf.math.real(complex_spec)
imag = tf.math.imag(complex_spec)
spec = tf.math.sqrt(real * real + imag * imag)
The original MelSpecLayerSimple.py likely used:
spec = tf.abs(complex_spec)  # This doesn't work with CoreML!
CoreML doesn't support the ComplexAbs operation that TensorFlow generates when you call tf.abs() on complex numbers. The "fixed" version manually computes the magnitude using the mathematical formula |z| = sqrt(real² + imag²), which CoreML can handle since it only uses basic arithmetic operations.
Evolution of BirdNET models
It appears BirdNET's preprocessing evolved in stages:
- Early versions: Used SimpleSpecLayer (basic spectrograms)
- Later versions: Switched to MelSpecLayerSimple (mel spectrograms) for better performance
- CoreML conversion: Required the "fixed" version to avoid unsupported operations
This evolution makes sense because mel spectrograms are generally superior for audio ML tasks - they better match how humans (and likely birds) perceive frequency differences, with higher resolution in lower frequencies where most biological sounds occur.
The successful conversion of BirdNET’s pioneering work in bioacoustics to Apple’s CoreML has been a long-standing dream of mine. It would not have been possible without the incredible work of the BirdNET team—thank you. I also can’t overstate the value of large language models in helping me understand and work through topics and technologies that would have taken me months to ramp up on. I simply wouldn’t have had the time otherwise. I look forward to pursuing more passion projects on Apple devices using these technologies.
- BirdNET team for the original models
- Models
- BirdNET-CoreML (No longer public, under MIT)