# Indaba 2025 Hackathon: Hack the Carbon (Kaggle, hosted by InstaDeep) 🏆 3rd Place Solution 🏆
This repository contains our solution (team azza Osman) that placed 3rd in the Deep Learning Indaba Hackathon "Hack the Carbon", hosted by InstaDeep on Kaggle.

Our main approach was "bigger is better", but we also addressed several significant issues in the InstaGeo codebase. We additionally attempted to exploit a potential data leakage.
## About the Competition

Hack the Carbon is a geospatial machine learning challenge focused on estimating biomass and carbon stocks from Earth observation data. The task is a supervised regression problem where models predict biomass or a closely related proxy from multi-sensor satellite inputs (image-to-image regression).
- Input: Preprocessed geospatial chips provided by the competition baseline: 6-band, 256 × 256 chips spanning East Africa, derived from Sentinel-2 imagery with three temporal steps.
- Output: Predicted biomass/carbon density values for each test example in the required submission format.
- Objective: Learn spatial patterns that map remote sensing signals to biomass/carbon targets to enable scalable carbon accounting.
- Evaluation: Performance is scored on the hidden test set using the competition's regression metric (e.g., RMSE) as defined on the leaderboard.
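For concreteness, here is a minimal sketch of the chip shapes and the RMSE metric as we understand them; the `(bands, time, height, width)` layout and the `rmse` helper are illustrative assumptions, not the official evaluation code:

```python
import numpy as np

def rmse(pred: np.ndarray, target: np.ndarray) -> float:
    """Root mean squared error over all valid pixels (assumed metric)."""
    mask = np.isfinite(target)  # skip any missing/invalid target pixels
    return float(np.sqrt(np.mean((pred[mask] - target[mask]) ** 2)))

# One input chip: 6 spectral bands x 3 temporal steps x 256 x 256 pixels
# (assumed layout for illustration).
chip = np.random.rand(6, 3, 256, 256).astype(np.float32)

# Dense per-pixel biomass prediction and target (image-to-image regression).
pred = np.random.rand(256, 256).astype(np.float32)
target = np.random.rand(256, 256).astype(np.float32)
print(f"RMSE: {rmse(pred, target):.4f}")
```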
## What This Repository Contains

- Competition solution code built on top of `InstaGeo-E2E-Geospatial-ML`.
- Modifications to InstaGeo to support the larger `Prithvi-EO-2.0-600M` configuration and to stabilize training/inference.
- Dataset pipeline fixes, notably ensuring data loader workers respect the configuration.
- Experiment artifacts and configs integrating WandB, ensuring the leaderboard results are reproducible.
- A notebook exploring the data leakage (see the "Where is the data leakage?" section below).
## Key Modifications to InstaGeo

- Larger model support: adapted the `InstaGeo` model and config to accommodate a bigger backbone and higher-capacity settings (see `instageo/model/configs/biomass.yaml` and related training code).
- Data pipeline fixes: resolved an issue where the dataset/data loader workers were effectively constrained, making the loader run with a single worker regardless of the provided configuration. The configured value is now honored end-to-end (see the sketch after this list).
- Training stability and logging:
  - Improved learning rate scheduling and monitoring.
  - Reduced excessive logs; improved progress visibility.
  - Integrated config upload to experiment tracking (e.g., W&B) for reproducibility.
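A minimal sketch of the data loader fix and the config upload, assuming an OmegaConf-style `cfg` with hypothetical keys (`cfg.train.batch_size`, `cfg.dataloader.num_workers`); the actual InstaGeo code paths may differ:

```python
from omegaconf import OmegaConf
from torch.utils.data import DataLoader
import wandb

def build_dataloader(dataset, cfg):
    # Honor the configured worker count end-to-end; previously the loader
    # effectively ran with a single worker regardless of the config.
    return DataLoader(
        dataset,
        batch_size=cfg.train.batch_size,         # assumed config key
        num_workers=cfg.dataloader.num_workers,  # assumed config key
        shuffle=True,
        pin_memory=True,
    )

def init_tracking(cfg):
    # Upload the fully resolved config with each run so any leaderboard
    # result can be reproduced from the logged experiment.
    return wandb.init(
        project="hack-the-carbon",  # hypothetical project name
        config=OmegaConf.to_container(cfg, resolve=True),
    )
```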
## Getting Started

- Create and activate your environment using the provided dependencies under `InstaGeo-E2E-Geospatial-ML/requirements.txt` (or the project's `pyproject.toml`).
- Configure your run via the YAML files in `instageo/model/configs/` (e.g., `biomass.yaml`). A config-loading sketch follows this list.
- Launch training or inference using the `instageo/model/run.py` entry points according to your setup.
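As a sketch, the config can be loaded and overridden programmatically before launching a run; the keys below are hypothetical examples, not the exact schema of `biomass.yaml`:

```python
from omegaconf import OmegaConf

# Load the run configuration and merge a few overrides before training.
# The keys are illustrative; check biomass.yaml for the real schema.
cfg = OmegaConf.load("instageo/model/configs/biomass.yaml")
overrides = OmegaConf.create({
    "dataloader": {"num_workers": 8},              # now honored end-to-end
    "model": {"backbone": "Prithvi-EO-2.0-600M"},  # larger backbone setting
})
cfg = OmegaConf.merge(cfg, overrides)
print(OmegaConf.to_yaml(cfg))
```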
## Where is the data leakage?

The premise is that since this data was pulled from Sentinel-2, and since biomass is an important derived quantity, there is a high probability that the data is publicly available. Our intent was that if we found it first, we would share it, to prevent anyone from gaining an unfair advantage over others by overfitting the test set.
## Acknowledgements

- Built on top of `InstaGeo-E2E-Geospatial-ML` and the competition starter resources.
- Thanks to the Deep Learning Indaba and InstaDeep Kaggle organizers and community.