Cancer is one of the most dangerous diseases. Although cures exist for some types of cancer, treatment is very costly and not everyone can afford it, so diagnosing cancer in its early stages is the best thing we can do. This project classifies different types of cancer using several machine learning algorithms.
For a detailed analysis, please check this article.
- Naive Bayes
- Logistic Regression
- Support Vector Machine
- Random Forest
- Stacking Classifier (NB, SVM, LR)
- Voting Classifier
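These classifiers can be wired together with scikit-learn roughly as in the sketch below. This is only an illustrative outline; the hyperparameters and variable names are assumptions, not the exact code used in the notebooks.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier

# Base models (hyperparameters are illustrative)
nb = MultinomialNB()
lr = LogisticRegression(max_iter=1000)
svm = SVC(kernel="linear", probability=True)  # probability=True so log loss can be computed
rf = RandomForestClassifier(n_estimators=200, random_state=42)

# Stacking Classifier (NB, SVM, LR) with Logistic Regression as the meta-learner
stack = StackingClassifier(
    estimators=[("nb", nb), ("svm", svm), ("lr", lr)],
    final_estimator=LogisticRegression(max_iter=1000),
)

# Voting Classifier over the same base models; "soft" averages the predicted
# probabilities ("hard" would take a plain majority vote instead)
vote = VotingClassifier(
    estimators=[("nb", nb), ("svm", svm), ("lr", lr)],
    voting="soft",
)

# Each model is then fit on the training features and scored with log loss
# on the cross-validation and test splits.
```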
The data is taken from Kaggle. Use this link to download it.
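For reference, a minimal loading sketch is shown below, assuming the data is the Kaggle "Personalized Medicine: Redefining Cancer Treatment" release with its `training_variants` and `training_text` files; adjust the paths and parsing to the files you actually download.

```python
import pandas as pd

# Gene/variation metadata: columns ID, Gene, Variation, Class (assumed layout)
variants = pd.read_csv("training_variants")

# Clinical text: a header row followed by lines of the form "ID||Text" (assumed layout)
text = pd.read_csv(
    "training_text",
    sep=r"\|\|",
    engine="python",
    skiprows=1,
    names=["ID", "TEXT"],
)

# Join the two sources on the shared ID column
data = variants.merge(text, on="ID", how="left")
print(data.shape)
```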
TFIDF_approaches.ipynb
- This file contains the code and results of all machine learning algorithms with TF-IDF vectorization (unigrams and bigrams) of the data.

LR with 4 grams.ipynb
- This file contains a detailed analysis of Logistic Regression with one-hot encoding and response coding using 4-grams.

LR AND LSTM approach.ipynb
- This file contains the results for a two-layer LSTM model.

TFIDF_2000.ipynb
- This file contains the analysis using only the top 2000 features. For the best results, check this notebook.
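As a rough sketch of the featurization behind these notebooks (not the exact code), TF-IDF vectors over unigrams and bigrams can be built with scikit-learn as follows; `train_text`/`test_text` are assumed to be the preprocessed document lists.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Unigrams + bigrams over the clinical text; min_df drops very rare n-grams
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=3)
X_train = tfidf.fit_transform(train_text)  # learn the vocabulary on the training split only
X_test = tfidf.transform(test_text)        # reuse that vocabulary for the test split

# TFIDF_2000.ipynb restricts the analysis to the top 2000 features,
# presumably via something like:
tfidf_top = TfidfVectorizer(ngram_range=(1, 2), max_features=2000)
```

The two-layer LSTM mentioned in `LR AND LSTM approach.ipynb` could look roughly like the Keras sketch below; the layer sizes, vocabulary size, and class count are assumptions.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 20000   # illustrative vocabulary size
num_classes = 9      # assumed number of target classes

lstm_model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=128),
    LSTM(64, return_sequences=True),  # first LSTM layer passes the full sequence on
    LSTM(64),                         # second LSTM layer condenses it into one vector
    Dense(num_classes, activation="softmax"),
])
lstm_model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```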
Results with BOW vectorization:

ALGORITHM USED | TRAIN LOG LOSS | CV LOG LOSS | TEST LOG LOSS | MISCLASSIFIED POINTS (%) |
---|---|---|---|---|
NB(ONE HOT) | 0.90 | 1.27 | 1.21 | 39.84 |
KNN(RESPONSE CODING) | 0.705 | 1.130 | 1.002 | 39.47 |
LR+BALANCING(ONE HOT) | 0.614 | 1.143 | 1.048 | 34.77 |
LR+IMBALANCING(ONE HOT) | 0.628 | 1.185 | 1.054 | 36.28 |
SVM+BALANCING(ONE HOT) | 0.739 | 1.132 | 1.063 | 36.47 |
RF(ONE HOT) | 0.703 | 1.192 | 1.097 | 36.84 |
RF(RESPONSE CODING) | 0.052 | 1.325 | 1.211 | 46.8 |
STACKING CLASSIFIER(NB,LR,SVM) | 0.663 | 1.177 | 1.081 | 36.24 |
Observation: The RF (response coding) model is overfitting. Considering all the business constraints mentioned in this blog, LR + balancing performs better than all the other models.
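The tables report multiclass log loss and the percentage of misclassified points; for reference, both can be computed roughly as below for any fitted scikit-learn classifier `clf` that exposes `predict_proba` (variable names are illustrative).

```python
import numpy as np
from sklearn.metrics import log_loss

probs = clf.predict_proba(X_test)                     # per-class probability estimates
test_log_loss = log_loss(y_test, probs, labels=clf.classes_)

pred_labels = clf.classes_[np.argmax(probs, axis=1)]  # most probable class per point
misclassified_pct = 100 * np.mean(pred_labels != np.asarray(y_test))

print(f"test log loss: {test_log_loss:.3f}, misclassified: {misclassified_pct:.2f}%")
```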
Results with TF-IDF vectorization:

ALGORITHM USED | TRAIN LOG LOSS | CV LOG LOSS | TEST LOG LOSS | MISCLASSIFIED POINTS (%) |
---|---|---|---|---|
NB(ONE HOT) | 0.906 | 1.229 | 1.225 | 41 |
LR+BALANCING(ONE HOT) | 0.57 | 1.114 | 1.113 | 36.84 |
LR+IMBALANCING(ONE HOT) | 0.56 | 1.13 | 1.12 | 36.09 |
LINEAR SVM(ONE HOT) | 0.68 | 1.15 | 1.19 | 37.78 |
RF(ONE HOT) | 0.64 | 1.12 | 1.16 | 37.7 |
RF(RESPONSE CODING) | 0.05 | 1.38 | 1.44 | 48.8 |
STACKING CLASSIFIER(NB,LR,SVM) | 0.63 | 1.16 | 1.19 | 38.90 |
MAXIMUM VOTING CLASSIFIER(NB,LR,SVM) | 0.63 | 1.16 | 1.19 | 40.10 |
Observation: Compared with the BOW vectorizer, the TF-IDF vectorizer gives better results; LR outperforms all the other algorithms, and RF is overfitting again. So let's look at a detailed analysis of LR.
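Since LR + balancing comes out on top, a minimal sketch of that setup is shown below, assuming class balancing is done via scikit-learn's `class_weight="balanced"` and that the probabilities are calibrated before computing log loss; the hyperparameters are illustrative, not the values used in the notebooks.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV

# class_weight="balanced" reweights every class inversely to its frequency,
# which counteracts the class imbalance in the data
base_lr = LogisticRegression(class_weight="balanced", C=0.1, max_iter=1000)

# Sigmoid (Platt) calibration keeps predict_proba well behaved for log loss
clf = CalibratedClassifierCV(base_lr, method="sigmoid", cv=3)
clf.fit(X_train, y_train)

cv_probs = clf.predict_proba(X_cv)  # evaluate with log loss on the CV split
```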
Logistic Regression with bigrams:

ALGORITHM USED | TRAIN LOG LOSS | CV LOG LOSS | TEST LOG LOSS | MISCLASSIFIED POINTS (%) |
---|---|---|---|---|
LR(BIGRAMS) | 0.83 | 1.17 | 1.19 | 40.7 |

Logistic Regression with 4-grams:

ALGORITHM USED | TRAIN LOG LOSS | CV LOG LOSS | TEST LOG LOSS | MISCLASSIFIED POINTS (%) |
---|---|---|---|---|
LR(4 GRAMS) | 0.439 | 0.957 | 0.982 | 35.4 |
LR with 4-grams and the top 2000 features outperformed all the other algorithms, with a log loss below 1.0.
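A minimal sketch of that best-performing configuration is given below, under the assumption that it combines 1- to 4-gram TF-IDF features capped at 2000 with a class-balanced Logistic Regression; the exact preprocessing and hyperparameters in the notebook may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# 1- to 4-grams, keeping only the 2000 most frequent features
tfidf = TfidfVectorizer(ngram_range=(1, 4), max_features=2000)
X_train = tfidf.fit_transform(train_text)
X_test = tfidf.transform(test_text)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

print("test log loss:", log_loss(y_test, model.predict_proba(X_test), labels=model.classes_))
```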
Feel free to open a pull request if you get good results on this dataset.
Don't forget to star the repo if you liked the analysis.
This case study is part of this course.