- Overview
- Dataset
- Model Description
- Methodology
- Model Training and Evaluation
- Model Comparison
- Confusion Matrix
- Inference from Confusion Matrices
- Code Organization
- Prerequisites
- Running the Code
- Streamlit Deployment
This project applies several machine learning classifiers to SONAR return data to predict whether an object is a rock or a mine, and compares the models to determine which one performs best.
The SONAR Rock vs Mine Prediction dataset consists of features derived from SONAR return data. It includes 208 instances, each with 60 numerical features representing the energy within frequency bands. The target labels are 'M' (mines) and 'R' (rocks).
Feature | Description |
---|---|
60 numerical values | Features representing energy levels over frequency bands |
'M' / 'R' | Class labels for mines and rocks |
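As a minimal sketch of loading data in this shape, assuming a headerless CSV with the 60 feature columns followed by one label column. The file layout and the helper name `load_sonar` are assumptions for illustration, not taken from the project; an in-memory sample with made-up values stands in for the real file.

```python
import io

import pandas as pd


def load_sonar(source):
    """Read a headerless SONAR CSV: columns 0-59 are features, column 60 is the label."""
    df = pd.read_csv(source, header=None)
    if df.shape[1] != 61:
        raise ValueError(f"expected 61 columns, got {df.shape[1]}")
    return df


# Tiny in-memory stand-in for the real file (two made-up rows).
sample = io.StringIO(
    ",".join(["0.02"] * 60) + ",R\n" + ",".join(["0.05"] * 60) + ",M\n"
)
sonar_df = load_sonar(sample)

# Count how many samples carry each class label.
label_counts = sonar_df[60].value_counts()
```

With the real file, `source` would be its path, and `label_counts` gives a quick check of how many mines and rocks are present.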
This project employs several classification models, focusing on:
- Logistic Regression: A simple yet effective model for binary classification.
- Support Vector Classifier (SVC): Known for its effectiveness in high-dimensional spaces.
- Decision Tree Classifier: A non-parametric supervised learning method.
- Random Forest Classifier: An ensemble method that combines multiple decision trees.
The overall methodology of the project can be summarized in the following steps:
- Data Loading: Load the dataset into a pandas DataFrame.
- Data Visualization and Summary: Explore the dataset using descriptive statistics and visualizations to understand patterns.
- Data Preprocessing: Separate features and labels, and split the dataset into training and testing sets.
- Model Training: Train various classification models using the training data.
- Model Evaluation: Evaluate each model's performance using accuracy metrics and confusion matrices.
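The preprocessing step above can be sketched as follows. The 10% test fraction, the stratified split, and the `random_state` value are assumptions; the README does not state them, although a 10% split of 208 samples yields 21 test samples, which matches the totals in the confusion matrices reported in this README. Random synthetic data stands in for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the dataset's shape: 208 rows, 60 features plus a label column.
rng = np.random.default_rng(0)
sonar_df = pd.DataFrame(rng.random((208, 60)))
sonar_df[60] = ["M"] * 104 + ["R"] * 104  # class counts chosen for illustration only

# Separate the features (columns 0-59) from the labels (column 60).
X = sonar_df.drop(columns=60)
Y = sonar_df[60]

# Hold out 10% of the samples for testing, keeping class proportions similar.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.1, stratify=Y, random_state=1
)
```

Stratifying on `Y` keeps the ratio of mines to rocks roughly equal in both splits, which matters when the test set is this small.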
Each classifier is trained on the training split. For example, the Logistic Regression model is trained as follows:
from sklearn.linear_model import LogisticRegression
# Fit a default Logistic Regression on the training features and labels
model = LogisticRegression()
model.fit(X_train, Y_train)
The accuracy results for the tested models are as follows:
Model | Accuracy |
---|---|
Logistic Regression | 76.19% |
Support Vector Classifier | 80.95% |
Decision Tree Classifier | 71.43% |
Random Forest Classifier | 76.19% |
Each model is trained on the same training split and compared by its accuracy on the same test split.
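import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Candidate classifiers, all with default hyperparameters
models = {
    "Logistic Regression": LogisticRegression(),
    "Support Vector Classifier": SVC(),
    "Decision Tree Classifier": DecisionTreeClassifier(),
    "Random Forest Classifier": RandomForestClassifier(),
}

# Fit each model on the training split and score it on the test split
accuracy_scores = {}
for name, model in models.items():
    classifier = model.fit(X_train, Y_train)
    pred = classifier.predict(X_test)
    accuracy_scores[name] = accuracy_score(Y_test, pred)

# Collect the scores into a table for side-by-side comparison
classification_model_df = pd.DataFrame(list(accuracy_scores.items()), columns=["Model", "Accuracy"])
print(classification_model_df)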
The confusion matrices for Logistic Regression and the Support Vector Classifier are as follows:
Logistic Regression:
[[9 2]
[3 7]]
Support Vector Classifier:
[[10 1]
[ 3 7]]
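A minimal sketch of how such a matrix can be produced with scikit-learn. In scikit-learn's convention, rows index the true class and columns index the predicted class, ordered by the `labels` argument. The README does not say how its matrices were generated, so it is worth confirming the orientation before reading false positives and false negatives off the off-diagonal cells. The label lists below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration; in the project these would be
# Y_test and a model's predictions on X_test.
y_true = ["M", "M", "M", "R", "R"]
y_pred = ["M", "R", "M", "M", "R"]

# labels=["M", "R"] pins the row and column order explicitly.
cm = confusion_matrix(y_true, y_pred, labels=["M", "R"])

# Rows are true classes, columns are predicted classes.
mines_as_mines, mines_as_rocks = cm[0]
rocks_as_mines, rocks_as_rocks = cm[1]
```

Passing `labels` explicitly avoids relying on the default sorted order of the class names.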
The confusion matrix for the Logistic Regression model is:
[[9 2]
[3 7]]
- True Positives (TP): 9 (Mines correctly predicted as mines)
- True Negatives (TN): 7 (Rocks correctly predicted as rocks)
- False Positives (FP): 2 (Rocks incorrectly predicted as mines)
- False Negatives (FN): 3 (Mines incorrectly predicted as rocks)
Interpretations from the Confusion Matrix:
- Accuracy: The model classifies 16 of the 21 test samples correctly, an accuracy of approximately 76.19%.
- Precision: The precision for the positive class (mines) is calculated as:
  $$\text{Precision} = \frac{TP}{TP + FP} = \frac{9}{9 + 2} \approx 0.818$$
  Among all instances predicted as mines, 81.8% were actually mines.
- Recall: The recall for the positive class (mines) is calculated as:
  $$\text{Recall} = \frac{TP}{TP + FN} = \frac{9}{9 + 3} = 0.75$$
  The model correctly identifies 75% of actual mines.
- F1 Score: The F1 score is the harmonic mean of precision and recall:
  $$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot 0.818 \cdot 0.75}{0.818 + 0.75} \approx 0.783$$
  This gives an F1 score of about 78.3%.
- Error Balance: The model misclassified 2 rocks as mines (false positives) and 3 mines as rocks (false negatives), so it misses mines slightly more often than it raises false alarms.
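The arithmetic above can be checked with a few lines of plain Python. This is a sketch that takes the four counts exactly as listed for Logistic Regression; the helper name `binary_metrics` is made up for this example.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts as listed above for Logistic Regression.
lr_metrics = binary_metrics(tp=9, tn=7, fp=2, fn=3)
for name, value in lr_metrics.items():
    print(f"{name}: {value:.3f}")
```

Running this prints an accuracy of 0.762, precision of 0.818, recall of 0.750, and F1 of 0.783, the last computed from unrounded precision and recall.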
The confusion matrix for the Support Vector Classifier is:
[[10 1]
[ 3 7]]
- True Positives (TP): 10 (Mines correctly predicted as mines)
- True Negatives (TN): 7 (Rocks correctly predicted as rocks)
- False Positives (FP): 1 (Rock incorrectly predicted as a mine)
- False Negatives (FN): 3 (Mines incorrectly predicted as rocks)
Interpretations from the Confusion Matrix:
- Accuracy: The SVC classifies 17 of the 21 test samples correctly, an accuracy of approximately 80.95%, higher than the Logistic Regression model's 76.19%.
- Precision: The precision for the positive class (mines) is calculated as:
  $$\text{Precision} = \frac{TP}{TP + FP} = \frac{10}{10 + 1} \approx 0.909$$
  Of the instances predicted as mines, 90.9% were indeed mines.
- Recall: The recall for the positive class (mines) is calculated as:
  $$\text{Recall} = \frac{TP}{TP + FN} = \frac{10}{10 + 3} \approx 0.769$$
  The model correctly identifies 76.9% of actual mines.
- F1 Score: The F1 score is the harmonic mean of precision and recall:
  $$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot 0.909 \cdot 0.769}{0.909 + 0.769} \approx 0.833$$
  This gives an F1 score of about 83.3%.
- Error Balance: The SVC produces only 1 false positive, which accounts for its high precision, but it still misses 3 mines (false negatives), so recall remains its weaker metric.
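Because precision and recall share the same numerator, the F1 score can also be written directly in terms of the counts, which avoids rounding the intermediate values. A short derivation, applied to the SVC counts listed above:

```latex
\begin{align*}
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
     = \frac{2\,TP}{2\,TP + FP + FN} \\
    &= \frac{2 \cdot 10}{2 \cdot 10 + 1 + 3} = \frac{20}{24} \approx 0.833
\end{align*}
```

The same identity applied to the Logistic Regression counts gives 18/23, or approximately 0.783.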
The code is organized into the following logical sections:
- Data Importing and Preprocessing
- Model Training and Evaluation
- Model Comparison
- Visualization and Prediction
To run the code successfully, ensure you have the following libraries installed:
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn
You can install them using pip:
pip install pandas numpy scikit-learn matplotlib seaborn
To execute the project:
- Make sure you have all the required libraries installed.
- Open the Jupyter Notebook containing the code snippets provided in this README.
- Run the cells sequentially to train the models and evaluate their performances.
The project has been deployed as a web application using Streamlit. You can access the interactive application through the following link:
This app allows users to input SONAR data and receive real-time predictions regarding whether the object is a rock or a mine.
- Input Form: Enter the values for SONAR returns in the provided input fields.
- Prediction: Click the "Predict" button to receive the prediction of whether the input corresponds to a rock or a mine.
- Results Display: The application displays the result on the screen.
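The input, predict, and display flow above could look roughly like the sketch below. It is not the deployed app's code: the single comma-separated text box, the helper names `parse_sonar_input` and `run_app`, and the way the fitted model is passed in are all assumptions made for illustration.

```python
import numpy as np

N_FEATURES = 60  # the dataset has 60 numerical features per sample


def parse_sonar_input(text):
    """Parse comma-separated SONAR readings into a (1, 60) array for model.predict."""
    values = [float(v) for v in text.split(",") if v.strip()]
    if len(values) != N_FEATURES:
        raise ValueError(f"expected {N_FEATURES} values, got {len(values)}")
    return np.asarray(values).reshape(1, -1)


def run_app(model):
    """Hypothetical Streamlit front end; `model` is any fitted scikit-learn classifier."""
    import streamlit as st  # imported here so the parser works without Streamlit installed

    st.title("SONAR Rock vs Mine Prediction")
    raw = st.text_area("Enter 60 comma-separated SONAR values")
    if st.button("Predict"):
        try:
            features = parse_sonar_input(raw)
        except ValueError as err:
            # Covers both non-numeric entries and a wrong number of values.
            st.error(str(err))
        else:
            label = model.predict(features)[0]
            st.success("Mine" if label == "M" else "Rock")
```

Keeping the parsing in its own function means the input validation can be tested without starting a Streamlit session.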