Welcome to the BELKA-mini challenge — a scaled-down version of the BELKA competition from NeurIPS 2024. In this hackathon-style challenge, your goal is to predict whether a small molecule binds to a specific protein target based on its SMILES representation.
This repo provides everything you need to get started.
File / Folder | Description |
---|---|
small_belka_splits/ |
Cleaned and balanced train.csv , val.csv , test.csv files |
small_belka_splits_900k/ |
A larger version of the problem; use it if you can, computationally. |
starter_model.ipynb |
A baseline model using ECFPs + Random Forest |
belka_hint_exploration.ipynb |
A deeper exploration notebook (shared later during the hackathon) |
submission.csv (optional) |
Example format for leaderboard submission |
BELKA-mini-Overview.md |
A detailed overview of the challenge |
BELKA-mini-Dataset-Description.md |
A detailed description of the dataset |
README.md |
You’re here! |
Given:
- A small molecule's SMILES string (representing its structure)
- The protein target name
Your task is to:
- Predict whether the molecule will bind to the target.
This is a binary classification problem. Binding is highly imbalanced (less than 1% of molecules bind), so careful modeling and evaluation are key!
Each row contains:
id
: unique identifierbuildingblock1/2/3_smiles
: SMILES for each componentmolecule_smiles
: full molecule SMILES (including triazine core)protein_name
: one of three targets (HSA, BRD4, sEH)binds
: 1 if the molecule binds the protein, 0 otherwiseis_novel
: (test only) 1 if the molecule includes building blocks not seen in the training set
Submissions are scored using average precision (AP).
Final score = average AP per protein + novelty group (as in the original BELKA challenge).
- Open
starter_model.ipynb
- Run the featurization and baseline training
- Submit predictions on the
test.csv
molecules
You’ll be using:
- ECFP fingerprints (via RDKit)
- Protein one-hot encoding
- A simple Random Forest
This notebook is designed to work directly on Kaggle with no extra setup.
Once you’ve got a baseline, check out belka_hint_exploration.ipynb
:
- Analyze class imbalance
- Explore novel vs. known molecules
- Evaluate predictions by protein or novelty
- Learn how the test set was constructed for generalization
Use this notebook to improve how you evaluate and tune your models.
- The test set contains molecules from new parts of chemical space
- A good model should generalize, not just memorize
- Try:
- Graph neural networks (GNNs)
- 3D-aware models
- Contrastive learning on SMILES
- Pre-trained chemical embeddings (ChemBERTa, Mol2Vec)
Your submission should be a CSV with:
id,binds
123456789,0.013
123456790,0.872
...
Submit to the platform’s leaderboard for evaluation!
This mini-challenge is derived from the original BELKA dataset by Leash Biosciences, restructured for educational use.
Special thanks to NeurIPS 2024 + BELKA organizers for sharing this awesome dataset.