End-to-end scikit-learn pipeline to predict patient survival from the Bone Marrow Transplantation dataset (UCI). It demonstrates clean preprocessing, dimensionality reduction, and model selection—all reproducibly wired into a single Pipeline.
- Load ARFF data with `scipy.io.arff` → pandas DataFrame
- Clean and type-cast columns, encode binary features to 0/1
- Column-wise preprocessing: categorical (impute + OneHotEncode) and numeric (impute + scale)
- Dimensionality reduction with PCA
- Classification with Logistic Regression
- Hyperparameter tuning via GridSearchCV
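For example, the loading step reduces to something like the following (a minimal sketch; `script.py` may decode and encode columns slightly differently):

```python
import pandas as pd
from scipy.io import arff

# Read the ARFF file into a structured NumPy array, then wrap it in a DataFrame
data, meta = arff.loadarff("bone-marrow.arff")
df = pd.DataFrame(data)

# ARFF nominal attributes arrive as byte strings; decode them to plain str
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.decode("utf-8")

# Coerce everything to numeric where possible; unparsable entries become NaN,
# and binary "0"/"1" strings become numeric 0/1
df = df.apply(pd.to_numeric, errors="coerce")
```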
- `bone-marrow.arff` — dataset in ARFF format (expected at repo root)
- `script.py` — standalone Python script that builds, trains, and tunes a classifier
- `bone_marrow_pipeline.ipynb` — interactive Jupyter notebook with EDA, visualizations, and detailed explanations
- `project_overview.md` — brief project background
- `requirements.txt` — Python dependencies
- `LICENSE` — MIT License
- `README.md` — this guide
- Create and activate a virtual environment (optional but recommended):

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  python -m pip install --upgrade pip
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  # or: pip install numpy pandas scikit-learn scipy
  ```

- Run the pipeline:
Option A: Python script (quick, command-line)

```bash
python script.py
```

Option B: Jupyter Notebook (interactive, with visualizations)

```bash
jupyter notebook bone_marrow_pipeline.ipynb
# or use: jupyter lab bone_marrow_pipeline.ipynb
```

You should see output like:
- Unique-value counts per column
- Names of columns with missing values
- Baseline pipeline accuracy on the test set
- The best model (after GridSearchCV) and its hyperparameters
- Final test-set accuracy of the best model
Key settings you can tweak in `script.py` (illustrated in the sketch below):
- Train/test split: `test_size=0.2`, `random_state=42`
- PCA components in the search space: `pca__n_components`
- Logistic Regression strength: `clf__C`
- Preprocessing choices inside `cat_vals` and `num_vals`
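For orientation, the tuning grid targets the pipeline's named steps (`pca` and `clf`); the value lists below are illustrative, not necessarily the exact ones in `script.py`:

```python
# Keys follow the "<step>__<param>" convention GridSearchCV uses on a Pipeline
search_space = {
    "pca__n_components": [5, 10, 15],
    "clf__C": [0.01, 0.1, 1.0, 10.0],
}
```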
To add more models, extend `search_space` with alternative estimators (e.g., `RandomForestClassifier`) and corresponding hyperparameters, and swap the pipeline's `clf` as needed.
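One way to do this (a sketch assuming the final pipeline step is named `clf`, as above) is to pass `GridSearchCV` a list of grids, each of which also swaps the estimator itself:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Each dict is a separate grid; setting "clf" directly swaps the estimator
search_space = [
    {
        "clf": [LogisticRegression(max_iter=1000)],
        "clf__C": [0.01, 0.1, 1.0, 10.0],
        "pca__n_components": [5, 10, 15],
    },
    {
        "clf": [RandomForestClassifier(random_state=42)],
        "clf__n_estimators": [100, 300],
        "clf__max_depth": [None, 5, 10],
        "pca__n_components": [5, 10, 15],
    },
]
```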
- Load: `bone-marrow.arff` → DataFrame
- Drop `Disease` column (dataset-specific cleanup)
- Coerce columns to numeric (`errors='coerce'`), encode binary columns to 0/1
- Split to `X` (features) and `y` (`survival_status`), drop `survival_time` from `X`
- Identify categorical vs numeric columns by cardinality (≤7 unique → categorical)
- Build preprocessing with `ColumnTransformer`
- Fit a Pipeline: preprocess → PCA → LogisticRegression
- Tune PCA and `C` with GridSearchCV (5-fold CV), as in the condensed sketch below
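A condensed sketch of those steps, assuming `df` is the cleaned DataFrame from the loading example above and `search_space` is defined as shown earlier (imputation strategies and grid values are illustrative choices, not necessarily those in `script.py`):

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = df.drop(columns=["Disease"])            # dataset-specific cleanup
y = df["survival_status"]
X = df.drop(columns=["survival_status", "survival_time"])

# Cardinality-based split: <= 7 unique values -> treat as categorical
cat_vals = [c for c in X.columns if X[c].nunique() <= 7]
num_vals = [c for c in X.columns if c not in cat_vals]

preprocess = ColumnTransformer([
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # sparse_output=False keeps the matrix dense so PCA can consume it
        # (requires scikit-learn >= 1.2)
        ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ]), cat_vals),
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ]), num_vals),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

search = GridSearchCV(pipe, search_space, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Test accuracy:", search.score(X_test, y_test))
```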
- File not found: Ensure `bone-marrow.arff` exists at the repository root.
- Different schema: If your ARFF doesn't have `survival_status`/`survival_time`, or includes/omits `Disease`, adjust the column operations in `script.py`.
- Convergence warnings: Increase `max_iter` in `LogisticRegression()` (e.g., `max_iter=1000`); features are already scaled, so also check the class balance (quick check below).
- SciPy/NumPy mismatches: Upgrade both packages to compatible versions.
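For the class-balance check, a one-liner is usually enough (assuming `y` holds `survival_status` as in the sketch above):

```python
# Proportion of each survival_status class; strong imbalance can make plain
# accuracy misleading and is worth knowing before tuning
print(y.value_counts(normalize=True))
```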
- Interactive exploration: Use `bone_marrow_pipeline.ipynb` for detailed EDA, visualizations, and step-by-step experimentation
- Add metrics: precision/recall/F1, ROC AUC, confusion matrix (already in the notebook)
- Persist models: `joblib.dump(best_model, 'models/best_model.joblib')` (code included in the notebook)
- Inference script: load the saved pipeline and run predictions on new CSV/JSON (see the sketch after this list)
- Experiment tracking: log results to CSV, MLflow, or Weights & Biases
- Testing: smoke test for pipeline fit and basic assertions on output shapes
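A minimal sketch of the persist-and-predict flow (the `models/` path and `new_patients.csv` are hypothetical names; since the saved object is the whole fitted pipeline, new records only need the same raw feature columns as `X`):

```python
from pathlib import Path

import joblib
import pandas as pd

# Save the best fitted pipeline found by GridSearchCV
Path("models").mkdir(exist_ok=True)
best_model = search.best_estimator_
joblib.dump(best_model, "models/best_model.joblib")

# Later, in an inference script: reload the pipeline and score new records
model = joblib.load("models/best_model.joblib")
new_data = pd.read_csv("new_patients.csv")   # hypothetical file with the same feature columns as X
print(model.predict(new_data))
```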
This repo assumes the columns `survival_status` (target) and `survival_time` exist, and removes `Disease`. If your ARFF differs, update the column selection and preprocessing accordingly.
This project is licensed under the MIT License - see the LICENSE file for details.