This repository contains all the code and data necessary to build the wildfire dataset. This dataset is then used to train our ML models.
Install uv
with pipx
:
pipx install uv
Create a virtualenv and install the dependencies with uv
:
uv sync
Activate the uv
virutalenv:
source .venv/bin/activate
Get the wildfire datasets with dvc
:
dvc get . data/processed
Pull all the data with dvc
:
dvc pull
Note: One needs to configure their dvc remote and get access to our remote data storage. Please ask somebody from the team to give you access.
Run the pipeline to build the dataset:
dvc repro
The whole repository is organized as a data pipeline that can be run to generate the different datasets.
The Data pipeline is organized with a dvc.yaml file.
This section list and describes all the DVC stages that are defined in the dvc.yaml file:
- data_pyro_sdis_testset: Turn the parquet files of the pyro-sdis-testset dataset into a regular ultralytics folder structure.
- predictions_wise_wolf_pyro_sdis_val: Run inference on all images from the
pyro-sdis val split with the
wise_wolf
model. - predictions_legendary_field_pyro_sdis_val: Run inference on all images
from the pyro-sdis val split with the
legendary_field
model. - predictions_wise_wolf_FP_2024: Run inference on all images from the
FP_2024 dataset with the
wise_wolf
model. - crops_wise_wolf_pyro_sdis_val: Generate crops from the predictions of the
wise_wolf
model on the pyro-sdis val split. - crops_wise_wolf_FP_2024: Generate crops from the predictions of the
wise_wolf
model on the FP_2024 dataset.
- filter_data_pyrosdis_smoke: Keep only the fire smokes from the
pyro-sdis
dataset - remove the background images. - filter_data_figlib_smoke: Keep only the fire smokes from the
FIGLIB_ANNOTATED_RESIZED
dataset - remove the background images. - filter_data_pyronear_ds_smoke: Keep only the fire smokes from the
pyronear-ds-03-2024
dataset - remove the background images. - filter_data_false_positives_FP_2024: Keep only the false positives that
the
wise_wolf
has made on theFP_2024
dataset.
- split_data_figlib: Split the
FIGLIB_ANNOTATED_RESIZED
dataset into train/val/test sets. - split_data_false_positives_FP_2024: Split the false postives dataset into train/val/test sets.
- merge_smoke_datasets: Merge the different data sources of fire smokes and split into the train/val/test sets.
- make_train_val_wildfire_dataset: Make the train/val
wildfire
dataset using the previous stages. - make_test_wildfire_dataset: Make the test
wildfire
dataset using the previous stages.
- make_temporal_train_val_dataset: Make the train/val temporal
wildfire
dataset using the previous stages. - make_temporal_test_dataset: Make the test temporal
wildfire
dataset using the previous stages.
- analyze_wildfire_dataset: Run some analyses on the generated dataset to check for data leakage, data distribution, and background images. Some interactive plots are also generated and exported.
The datasets below are the foundation of our data pipeline and are the source of truth.
- FIGLIB_ANNOTATED_RESIZED: re-annotated dataset from the Fire Ignition images Library.
- DS_fp: All the collected false positives of the Pyronear System before 2024.
- FP_2024: All the collected false positives of the Pyronear System in 2024.
- pyro-sdis: Pyro-SDIS is a dataset designed for wildfire smoke detection using AI models. It is developed in collaboration with the Fire and Rescue Services (SDIS) in France and the dedicated volunteers of the Pyronear association. It contains only detected fires by the Pyronear System.
- pyronear-ds-03-2024: Dataset of fire smokes as a mix of different public datasets and synthetic images. It also includes temporal sequences of fire events.
- pyro-sdis-testset: Private dataset used for evaluating the final performances of the ML models.
- Test_dataset_2025: built from Test_DS by adding extra false positives.
- Test_DS: The initial and curated test dataset.
All the folders located in ./data/interim/
are intermediary results needed to
build up the final datasets. They are versioned with DVC.
Many artifacts and datasets can be found here: from cropped image areas, to filtered datasets to focus on false positives for instance.
- false_positives: curated and annotated dataset containing false positives from the pyronear systems.
The final datasets are located in ./data/processed/
:
- π₯ wildfire: the train/val dataset used to train our ML models. It follows the ultralytics format.
- π₯ wildfire_test: the test dataset used to evaluate the performance of our ML models.
- β° wildfire_temporal: the train/val dataset used to train our temporal ML models.
- β° wildfire_temporal_test: the test dataset used to evaluate the performance of our temporal ML models and the pyronear engine.
Once the datasets are generated and stored in the ./data/processed/
directory, various reports are created to visualize the data. These reports
break down the datasets across different dimensions, allowing for a quick
assessment of whether the various data splits are logical and meaningful.
These reports live under ./data/reporting/
.
Scripts are located in the ./scripts
folder.
Most scripts are connected via the dvc.yaml configuration file. Others are utility scripts that can be used to perform various tasks.
Fetch a detection sequences by its sequence-id directly from the Pyronear platform API.
export PLATFORM_API_ENDPOINT="https://alertapi.pyronear.org"
export PLATFORM_LOGIN=sdis-07
export PLATFORM_PASSWORD=XXX
export PLATFORM_ADMIN_LOGIN=XXX
export PLATFORM_ADMIN_PASSWORD=XXX
uv run python ./scripts/platform_train_loop/fetch_platform_sequence_id.py \
--save-dir ./data/raw/pyronear-platform/sequences/my-sequence-5347/ \
--sequence-id 5347
Note: Make sure to use an admin login/password as well as a regular login/password. The admin level access is needed to fetch information about the organizations and properly name the detection images locally.
Fetch detection sequences directly from the Pyronear platform API.
Fetch all the detection sequences for sdis-07
and save them in the specified
directory:
export PLATFORM_API_ENDPOINT="https://alertapi.pyronear.org"
export PLATFORM_LOGIN=sdis-07
export PLATFORM_PASSWORD=XXX
export PLATFORM_ADMIN_LOGIN=XXX
export PLATFORM_ADMIN_PASSWORD=XXX
uv run python ./scripts/platform_train_loop/fetch_platform_sequences.py \
--save-dir ./data/raw/pyronear-platform/sequences/sdis-07/ \
--date-from 2025-05-01 \
--date-end 2025-06-01
Note: Make sure to use an admin login/password as well as a regular login/password. The admin level access is needed to fetch information about the organizations and properly name the detection images locally.
- π legendary_field: yolov8s object detection model, first performant model trained in 2019. to detect fire smoke.
- πΊ wise_wolf: yolov11s object detection model, trained on
2024-04-26
using a larger dataset.
The script to release a new version of the model is located in
./scripts/release.py
.
Make sure to set your GITHUB_ACCESS_TOKEN
as an env variable in your shell
before running the following script:
export GITHUB_ACCESS_TOKEN=XXX
uv run python ./scripts/release.py \
--version v1.3.5 \
--github-owner earthtoolsmaker \
--github-repo pyro-dataset
This will create a new release in the github repository with and upload an archive of the datasets to a private S3 repository. The link to the dataset is displayed in the release summary.
uv run pytest tests/