Pyro Dataset

This repository contains all the code and data necessary to build the wildfire dataset. This dataset is then used to train our ML models.

Setup

🐍 Python dependencies

Install uv with pipx:

pipx install uv

Create a virtualenv and install the dependencies with uv:

uv sync

Activate the uv virutalenv:

source .venv/bin/activate

🍜 Data dependencies

Get the wildfire datasets with dvc:

dvc get . data/processed

Pull all the data with dvc:

dvc pull

Note: One needs to configure their dvc remote and get access to our remote data storage. Please ask somebody from the team to give you access.

Run the pipeline to build the dataset:

dvc repro

Data Pipeline

The whole repository is organized as a data pipeline that can be run to generate the different datasets.

The Data pipeline is organized with a dvc.yaml file.

DVC stages

This section list and describes all the DVC stages that are defined in the dvc.yaml file:

⛱️ Data Preparation

data_pyro_sdis_testset: Turn the parquet files of the pyro-sdis-testset dataset into a regular ultralytics folder structure.

🧠 Model Inference

predictions_wise_wolf_pyro_sdis_val: Run inference on all images from the pyro-sdis val split with the wise_wolf model.
predictions_legendary_field_pyro_sdis_val: Run inference on all images from the pyro-sdis val split with the legendary_field model.
predictions_wise_wolf_FP_2024: Run inference on all images from the FP_2024 dataset with the wise_wolf model.
crops_wise_wolf_pyro_sdis_val: Generate crops from the predictions of the wise_wolf model on the pyro-sdis val split.
crops_wise_wolf_FP_2024: Generate crops from the predictions of the wise_wolf model on the FP_2024 dataset.

🚭 Filtering

filter_data_pyrosdis_smoke: Keep only the fire smokes from the pyro-sdis dataset - remove the background images.
filter_data_figlib_smoke: Keep only the fire smokes from the FIGLIB_ANNOTATED_RESIZED dataset - remove the background images.
filter_data_pyronear_ds_smoke: Keep only the fire smokes from the pyronear-ds-03-2024 dataset - remove the background images.
filter_data_false_positives_FP_2024: Keep only the false positives that the wise_wolf has made on the FP_2024 dataset.

🍞 Data Splitting

split_data_figlib: Split the FIGLIB_ANNOTATED_RESIZED dataset into train/val/test sets.
split_data_false_positives_FP_2024: Split the false postives dataset into train/val/test sets.
merge_smoke_datasets: Merge the different data sources of fire smokes and split into the train/val/test sets.

🧬 Dataset Creation

Wildfire Dataset

make_train_val_wildfire_dataset: Make the train/val wildfire dataset using the previous stages.
make_test_wildfire_dataset: Make the test wildfire dataset using the previous stages.

Temporal Dataset

make_temporal_train_val_dataset: Make the train/val temporal wildfire dataset using the previous stages.
make_temporal_test_dataset: Make the test temporal wildfire dataset using the previous stages.

🔎 Dataset Analysis

analyze_wildfire_dataset: Run some analyses on the generated dataset to check for data leakage, data distribution, and background images. Some interactive plots are also generated and exported.

Data

Raw

The datasets below are the foundation of our data pipeline and are the source of truth.

FIGLIB_ANNOTATED_RESIZED: re-annotated dataset from the Fire Ignition images Library.
DS_fp: All the collected false positives of the Pyronear System before 2024.
FP_2024: All the collected false positives of the Pyronear System in 2024.
pyro-sdis: Pyro-SDIS is a dataset designed for wildfire smoke detection using AI models. It is developed in collaboration with the Fire and Rescue Services (SDIS) in France and the dedicated volunteers of the Pyronear association. It contains only detected fires by the Pyronear System.
pyronear-ds-03-2024: Dataset of fire smokes as a mix of different public datasets and synthetic images. It also includes temporal sequences of fire events.
pyro-sdis-testset: Private dataset used for evaluating the final performances of the ML models.
Test_dataset_2025: built from Test_DS by adding extra false positives.
Test_DS: The initial and curated test dataset.

Interim

All the folders located in ./data/interim/ are intermediary results needed to build up the final datasets. They are versioned with DVC.

Many artifacts and datasets can be found here: from cropped image areas, to filtered datasets to focus on false positives for instance.

false_positives: curated and annotated dataset containing false positives from the pyronear systems.

Processed

The final datasets are located in ./data/processed/:

🔥 wildfire: the train/val dataset used to train our ML models. It follows the ultralytics format.
🔥 wildfire_test: the test dataset used to evaluate the performance of our ML models.
⏰ wildfire_temporal: the train/val dataset used to train our temporal ML models.
⏰ wildfire_temporal_test: the test dataset used to evaluate the performance of our temporal ML models and the pyronear engine.

Reporting

Once the datasets are generated and stored in the ./data/processed/ directory, various reports are created to visualize the data. These reports break down the datasets across different dimensions, allowing for a quick assessment of whether the various data splits are logical and meaningful.

These reports live under ./data/reporting/.

Scripts

Scripts are located in the ./scripts folder.

Most scripts are connected via the dvc.yaml configuration file. Others are utility scripts that can be used to perform various tasks.

fetch_platform_sequence_id.py

Fetch a detection sequences by its sequence-id directly from the Pyronear platform API.

export PLATFORM_API_ENDPOINT="https://alertapi.pyronear.org"
export PLATFORM_LOGIN=sdis-07
export PLATFORM_PASSWORD=XXX
export PLATFORM_ADMIN_LOGIN=XXX
export PLATFORM_ADMIN_PASSWORD=XXX

uv run python ./scripts/platform_train_loop/fetch_platform_sequence_id.py \
  --save-dir ./data/raw/pyronear-platform/sequences/my-sequence-5347/ \
  --sequence-id 5347

Note: Make sure to use an admin login/password as well as a regular login/password. The admin level access is needed to fetch information about the organizations and properly name the detection images locally.

fetch_platform_sequences.py

Fetch detection sequences directly from the Pyronear platform API.

Fetch all the detection sequences for sdis-07 and save them in the specified directory:

export PLATFORM_API_ENDPOINT="https://alertapi.pyronear.org"
export PLATFORM_LOGIN=sdis-07
export PLATFORM_PASSWORD=XXX
export PLATFORM_ADMIN_LOGIN=XXX
export PLATFORM_ADMIN_PASSWORD=XXX

uv run python ./scripts/platform_train_loop/fetch_platform_sequences.py \
  --save-dir ./data/raw/pyronear-platform/sequences/sdis-07/ \
  --date-from 2025-05-01 \
  --date-end 2025-06-01

Note: Make sure to use an admin login/password as well as a regular login/password. The admin level access is needed to fetch information about the organizations and properly name the detection images locally.

🧠 Models

🌈 legendary_field: yolov8s object detection model, first performant model trained in 2019. to detect fire smoke.
🐺 wise_wolf: yolov11s object detection model, trained on 2024-04-26 using a larger dataset.

🌎 Release the datasets

The script to release a new version of the model is located in ./scripts/release.py. Make sure to set your GITHUB_ACCESS_TOKEN as an env variable in your shell before running the following script:

export GITHUB_ACCESS_TOKEN=XXX
uv run python ./scripts/release.py \
  --version v1.3.5 \
  --github-owner earthtoolsmaker \
  --github-repo pyro-dataset

This will create a new release in the github repository with and upload an archive of the datasets to a private S3 repository. The link to the dataset is displayed in the release summary.

Run the tests

uv run pytest tests/

Name		Name	Last commit message	Last commit date
Latest commit History 220 Commits
.dvc		.dvc
data		data
notebooks		notebooks
scripts		scripts
src/pyro_dataset		src/pyro_dataset
tests		tests
.dvcignore		.dvcignore
.gitattributes		.gitattributes
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
TODO.md		TODO.md
dvc.lock		dvc.lock
dvc.yaml		dvc.yaml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Pyro Dataset

Setup

🐍 Python dependencies

🍜 Data dependencies

Data Pipeline

DVC stages

⛱️ Data Preparation

🧠 Model Inference

🚭 Filtering

🍞 Data Splitting

🧬 Dataset Creation

Wildfire Dataset

Temporal Dataset

🔎 Dataset Analysis

Data

Raw

Interim

Processed

Reporting

Scripts

fetch_platform_sequence_id.py

fetch_platform_sequences.py

🧠 Models

🌎 Release the datasets

Run the tests

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

pyronear/pyro-dataset

Folders and files

Latest commit

History

Repository files navigation

Pyro Dataset

Setup

🐍 Python dependencies

🍜 Data dependencies

Data Pipeline

DVC stages

⛱️ Data Preparation

🧠 Model Inference

🚭 Filtering

🍞 Data Splitting

🧬 Dataset Creation

Wildfire Dataset

Temporal Dataset

🔎 Dataset Analysis

Data

Raw

Interim

Processed

Reporting

Scripts

fetch_platform_sequence_id.py

fetch_platform_sequences.py

🧠 Models

🌎 Release the datasets

Run the tests

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages