GitHub - DataHackIL/odyssey_2025_belka: 🧬 BELKA-mini: Predict Small Molecule Binding

🧬 BELKA-mini: Predict Small Molecule Binding

Welcome to the BELKA-mini challenge — a scaled-down version of the BELKA competition from NeurIPS 2024. In this hackathon-style challenge, your goal is to predict whether a small molecule binds to a specific protein target based on its SMILES representation.

This repo provides everything you need to get started.

📁 Repository Contents

| File / Folder | Description |
| --- | --- |
| `small_belka_splits/` | Cleaned and balanced `train.csv`, `val.csv`, `test.csv` files |
| `small_belka_splits_900k/` | A larger version of the problem; use it if you have the compute for it |
| `starter_model.ipynb` | A baseline model using ECFPs + Random Forest |
| `belka_hint_exploration.ipynb` | A deeper exploration notebook (shared later during the hackathon) |
| `submission.csv` (optional) | Example format for leaderboard submission |
| `BELKA-mini-Overview.md` | A detailed overview of the challenge |
| `BELKA-mini-Dataset-Description.md` | A detailed description of the dataset |
| `README.md` | You’re here! |

🎯 Challenge Overview

Given:

- A small molecule's SMILES string (representing its structure)
- The protein target name

Your task is to predict whether the molecule will bind to the target.

This is a binary classification problem. The classes are highly imbalanced (fewer than 1% of molecules bind), so careful modeling and evaluation are key!
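
To see why plain accuracy is a poor guide under this kind of imbalance, here is a toy sketch. The 1-in-200 positive rate is made up for illustration and is not a statistic of the dataset.

```python
# Toy illustration: with rare positives, a model that never predicts
# "binds" still reaches very high accuracy while finding no binders.
# The 1-in-200 positive rate is an invented example value.
labels = [1] + [0] * 199          # 0.5% positives
predictions = [0] * len(labels)   # trivial "never binds" model

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
recall = sum(p == 1 and y == 1 for p, y in zip(predictions, labels)) / sum(labels)

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
# prints: accuracy=0.995, recall=0.000
```

A ranking-based metric that focuses on the positive class avoids rewarding this trivial model.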

🔬 Dataset Structure

Each row contains:

- `id`: unique identifier
- `buildingblock1_smiles`, `buildingblock2_smiles`, `buildingblock3_smiles`: SMILES for each building block
- `molecule_smiles`: full molecule SMILES (including triazine core)
- `protein_name`: one of three targets (HSA, BRD4, sEH)
- `binds`: 1 if the molecule binds the protein, 0 otherwise
- `is_novel`: (test only) 1 if the molecule includes building blocks not seen in the training set
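
As a rough sketch of working with these columns, the snippet below parses a tiny inline CSV and reports the fraction of binders per protein. The sample rows and SMILES strings are invented placeholders, and `load_rows` and `positive_rate_by_protein` are hypothetical helper names; with the real data you would pass an open handle to `train.csv` instead.

```python
import csv
import io
from collections import Counter

# Invented stand-in for train.csv: only the column names follow the list above.
SAMPLE_CSV = """id,buildingblock1_smiles,buildingblock2_smiles,buildingblock3_smiles,molecule_smiles,protein_name,binds
1,CCO,CCN,CCC,CCOCCNCCC,BRD4,0
2,CCO,CCN,CCC,CCOCCNCCC,HSA,1
3,CCO,CCN,CCC,CCOCCNCCC,sEH,0
"""

def load_rows(handle):
    """Read rows as dicts and convert the binds label to an int."""
    rows = []
    for row in csv.DictReader(handle):
        row["binds"] = int(row["binds"])
        rows.append(row)
    return rows

def positive_rate_by_protein(rows):
    """Return {protein_name: fraction of rows with binds == 1}."""
    totals, positives = Counter(), Counter()
    for row in rows:
        totals[row["protein_name"]] += 1
        positives[row["protein_name"]] += row["binds"]
    return {protein: positives[protein] / totals[protein] for protein in totals}

rows = load_rows(io.StringIO(SAMPLE_CSV))
rates = positive_rate_by_protein(rows)
```

Checking the positive rate per protein early is a cheap way to see how much signal each target offers.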

🧪 Evaluation Metric

Submissions are scored using average precision (AP).
AP is computed separately for each combination of protein and novelty group, and the final score is the mean of those AP values (as in the original BELKA challenge).
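
A minimal sketch of this kind of grouped scoring follows. The row layout and the function names are hypothetical, ties in the scores are broken by input order, and a group without positives is scored as 0.0 by assumption. The organizers' scoring code may differ in those details; `sklearn.metrics.average_precision_score` is the usual library routine for AP.

```python
from collections import defaultdict

def average_precision(y_true, y_score):
    """Mean of precision@k over the ranks k that hold a positive.

    Simplified: tied scores keep their input order.
    """
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    # Assumption: a group with no positives contributes 0.0.
    return sum(precisions) / len(precisions) if precisions else 0.0

def grouped_mean_ap(rows):
    """rows: iterable of (protein_name, is_novel, binds, predicted_score)."""
    groups = defaultdict(lambda: ([], []))
    for protein, is_novel, binds, score in rows:
        labels, scores = groups[(protein, is_novel)]
        labels.append(binds)
        scores.append(score)
    per_group = {key: average_precision(*vals) for key, vals in groups.items()}
    return sum(per_group.values()) / len(per_group), per_group

example_rows = [
    ("BRD4", 0, 1, 0.9), ("BRD4", 0, 0, 0.8),
    ("BRD4", 0, 1, 0.7), ("BRD4", 0, 0, 0.1),
    ("HSA", 1, 0, 0.6), ("HSA", 1, 1, 0.4),
]
final_score, per_group = grouped_mean_ap(example_rows)
print(round(final_score, 4), per_group)
```

Because every group carries equal weight, a model that does well only on known molecules for one protein can still score poorly overall.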

🚀 How to Get Started

- Open `starter_model.ipynb`
- Run the featurization and baseline training
- Submit predictions on the `test.csv` molecules

You’ll be using:

- ECFP fingerprints (via RDKit)
- Protein one-hot encoding
- A simple Random Forest
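
The three ingredients above can be sketched roughly as follows. This is not the contents of the starter notebook: the fingerprint radius (2), fingerprint size (2048 bits), forest settings, and toy molecules are all assumptions chosen for illustration, and `featurize` is a hypothetical helper name. It requires RDKit, NumPy, and scikit-learn.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from sklearn.ensemble import RandomForestClassifier

PROTEINS = ["HSA", "BRD4", "sEH"]
# Assumed settings: radius 2 (ECFP4-like) and 2048 bits.
_morgan = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

def featurize(smiles, protein):
    """Concatenate a Morgan bit fingerprint with a one-hot protein vector."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    fingerprint = _morgan.GetFingerprintAsNumPy(mol).astype(np.uint8)
    one_hot = np.array([protein == p for p in PROTEINS], dtype=np.uint8)
    return np.concatenate([fingerprint, one_hot])

# Toy (smiles, protein, binds) triples, invented for illustration only.
toy_train = [
    ("CCO", "BRD4", 0),
    ("c1ccccc1", "BRD4", 1),
    ("CCN", "HSA", 0),
    ("c1ccccc1O", "sEH", 1),
]
X = np.stack([featurize(s, p) for s, p, _ in toy_train])
y = np.array([label for _, _, label in toy_train])

# class_weight="balanced" is one simple way to react to rare positives.
model = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0)
model.fit(X, y)
scores = model.predict_proba(X)[:, 1]  # predicted probability of binds == 1
```

On the real data you would apply `featurize` to the `molecule_smiles` and `protein_name` columns and score a held-out split rather than the training rows.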

This notebook is designed to work directly on Kaggle with no extra setup.

💡 Hint Notebook (Advanced)

Once you’ve got a baseline, check out `belka_hint_exploration.ipynb`:

- Analyze class imbalance
- Explore novel vs. known molecules
- Evaluate predictions by protein or novelty
- Learn how the test set was constructed for generalization

Use this notebook to improve how you evaluate and tune your models.

🧠 Tips for Participants

- The test set contains molecules from new parts of chemical space.
- A good model should generalize, not just memorize.
Try:
- Graph neural networks (GNNs)
- 3D-aware models
- Contrastive learning on SMILES
- Pre-trained chemical embeddings (ChemBERTa, Mol2Vec)
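
Whichever model you try, it helps to validate on molecules built from unseen building blocks, mirroring the `is_novel` idea. The sketch below holds out a random subset of building blocks and sends every row that uses one of them to validation. This split procedure, the 20% fraction, and the function name are assumptions for illustration, not a description of how the official test set was built.

```python
import random

BLOCK_COLUMNS = (
    "buildingblock1_smiles",
    "buildingblock2_smiles",
    "buildingblock3_smiles",
)

def split_by_building_block(rows, holdout_fraction=0.2, seed=0):
    """Split rows so that no held-out building block appears in train.

    rows: list of dicts keyed by the building block column names.
    Returns (train_rows, val_rows, held_out_blocks).
    """
    blocks = sorted({row[col] for row in rows for col in BLOCK_COLUMNS})
    if not blocks:
        return [], [], set()
    rng = random.Random(seed)
    n_holdout = min(len(blocks), max(1, round(len(blocks) * holdout_fraction)))
    held_out = set(rng.sample(blocks, n_holdout))

    train, val = [], []
    for row in rows:
        uses_held_out = any(row[col] in held_out for col in BLOCK_COLUMNS)
        (val if uses_held_out else train).append(row)
    return train, val, held_out

# Toy rows with placeholder block names instead of real SMILES.
toy_rows = [
    {"buildingblock1_smiles": a, "buildingblock2_smiles": b, "buildingblock3_smiles": c}
    for a in ("A1", "A2", "A3")
    for b in ("B1", "B2")
    for c in ("C1", "C2")
]
train_rows, val_rows, held_out_blocks = split_by_building_block(toy_rows)
```

A score that drops sharply on such a split compared with a random split suggests the model is memorizing building blocks.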

📤 Submitting Predictions

Your submission should be a CSV with:

id,binds
123456789,0.013
123456790,0.872
...

Submit to the platform’s leaderboard for evaluation!
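
A small sketch of writing a file in this format with the standard library is shown below. `write_submission` is a hypothetical helper name, and the ids and scores are the example values from above; swap the in-memory buffer for an open file such as `submission.csv` when writing to disk.

```python
import csv
import io

def write_submission(handle, ids, scores):
    """Write an id,binds CSV with one predicted score per test row."""
    if len(ids) != len(scores):
        raise ValueError("ids and scores must have the same length")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "binds"])
    for row_id, score in zip(ids, scores):
        writer.writerow([row_id, score])

buffer = io.StringIO()
write_submission(buffer, [123456789, 123456790], [0.013, 0.872])
print(buffer.getvalue())
```

Since the metric is average precision, what matters is how the scores rank the molecules, so avoid rounding them so coarsely that many rows tie.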

🙌 Credits

This mini-challenge is derived from the original BELKA dataset by Leash Biosciences, restructured for educational use.

Special thanks to the NeurIPS 2024 and BELKA organizers for sharing this awesome dataset.
