This repository contains a machine learning pipeline developed to predict the survival of passengers aboard the Titanic. It is part of a series of projects undertaken during a Data Science internship at CodSoft.
The objective of this project is to build a supervised machine learning model that predicts whether a passenger survived the Titanic disaster. This is a classic binary classification task on structured (tabular) data. The solution uses a Random Forest classifier, preceded by preprocessing (imputation, encoding, scaling) and feature engineering.
- Files Used:
  - train.csv: Labeled training data
  - test.csv: Unlabeled test data for prediction
  - gender_submission.csv: Sample submission format
- Handling Missing Values:
  - Age: Mean imputation
  - Fare: Median imputation
  - Embarked: Mode imputation
- Encoding:
  - Categorical variables encoded using OneHotEncoder
- Feature Scaling:
  - Standardization of numerical features using StandardScaler
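The preprocessing steps listed above could be wired together roughly as follows. This is a minimal sketch, not the notebook's actual code: the use of ColumnTransformer and SimpleImputer, the step names, the inclusion of Sex as a categorical column, and the tiny stand-in DataFrame are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny stand-in frame (assumption); the real data would come from train.csv.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Embarked": ["S", "C", np.nan, "S"],
    "Sex": ["male", "female", "female", "male"],
})

# Age: mean imputation, then standardization.
age_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Fare: median imputation, then standardization.
fare_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: mode imputation, then one-hot encoding.
cat_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0.0 forces a dense NumPy array as output.
preprocessor = ColumnTransformer(
    [
        ("age", age_steps, ["Age"]),
        ("fare", fare_steps, ["Fare"]),
        ("cat", cat_steps, ["Embarked", "Sex"]),
    ],
    sparse_threshold=0.0,
)

X = preprocessor.fit_transform(df)
print(X.shape)
```

Keeping each column's imputer and scaler inside one transformer means the statistics are learned only from the data passed to fit, which avoids leaking test-set information during cross-validation.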
New informative features were created to improve model performance:
- FamilySize: Combines SibSp and Parch to reflect total family size
- IsAlone: Binary indicator of whether the passenger was traveling alone
- Title: Extracted from the name to capture social status
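One plausible way to derive these three features with pandas is sketched below. The exact formulas are assumptions: in particular, adding 1 to count the passenger themselves and the regular expression used to pull the title out of Name are common choices for this dataset, not details confirmed by this README.

```python
import pandas as pd

# Three sample rows in the format of the Kaggle Titanic data.
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
    ],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
})

# Siblings/spouses + parents/children + the passenger (the +1 is an assumption).
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# 1 if the passenger has no family aboard, else 0.
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# Names look like "Surname, Title. Given names"; capture the text between
# the comma and the first period.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(df[["FamilySize", "IsAlone", "Title"]])
```

Rare titles are often grouped into a single category before encoding so that the one-hot step does not produce many near-empty columns; whether this project does so is not stated.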
- Model Used: RandomForestClassifier
- Pipeline: Preprocessing and classification are combined using Pipeline
- Model Evaluation:
  - Accuracy on training data
  - Cross-validation scores to check generalization
  - Feature importance analysis
- Predictions are saved as:
  - titanic_predictions.csv: Basic submission format
  - titanic_predictions_detailed.csv: Includes prediction probabilities and confidence
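The two output files could be produced along these lines. This is only a sketch: the PassengerId and Survived columns follow the sample submission format, but the Survival_Probability and Confidence column names, the definition of confidence as the highest class probability, the toy model, and the temporary output directory are all assumptions.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy training and test data (assumption) so the example runs on its own.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(10, 3))
passenger_ids = np.arange(892, 902)  # illustrative IDs

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)  # columns follow clf.classes_, here [0, 1]

# Basic file: same two columns as gender_submission.csv.
basic = pd.DataFrame({"PassengerId": passenger_ids, "Survived": pred})

# Detailed file: adds the probability of survival and the probability of the
# predicted class (hypothetical column names).
detailed = basic.assign(
    Survival_Probability=proba[:, 1],
    Confidence=proba.max(axis=1),
)

out_dir = tempfile.mkdtemp()  # written to a temp directory for this sketch
basic_path = os.path.join(out_dir, "titanic_predictions.csv")
detailed_path = os.path.join(out_dir, "titanic_predictions_detailed.csv")
basic.to_csv(basic_path, index=False)
detailed.to_csv(detailed_path, index=False)
```

Passing index=False keeps the pandas row index out of the file, so the basic CSV has exactly the two columns of the sample submission.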
- Python 3.x
- Jupyter Notebook
- Pandas, NumPy
- Scikit-learn
- Matplotlib, Seaborn
- This project is one of the three required tasks for the Data Science Internship at CodSoft.
- I relied on the Data Science course by Krish Naik and various YouTube tutorials to guide my learning and implementation throughout this project.
As part of the CodSoft internship, I will also complete two additional data science projects:
- Movie Rating Prediction
- Iris Flower Classification (or another task from the provided list)