Machine Learning Practice

Practice exercises that apply statistical machine learning techniques to a range of datasets.

For more detail and examples on deep learning, check out my Deep Learning repository.

Because GitHub does not currently render LaTeX, you can use the Google Chrome extension TeX All the Things (github) to read the notes.

Environment

Using Python 3

(Most relative links are resolved from the repository root.)

Dependencies

numpy: For low-level math operations
pandas: For data manipulation
sklearn (Scikit Learn): For evaluation metrics and some data preprocessing

For comparison purposes

sklearn: For machine learning models
cvxopt: For convex optimization problems (for SVM)
For gradient boosting

For visualization

Mlxtend
matplotlib
- matplotlib.pyplot
- mpl_toolkits.mplot3d

For evaluation

surprise: A Python scikit for building and analyzing recommender systems

NLP related

gensim: Topic Modelling
hmmlearn: Hidden Markov Models in Python, with scikit-learn like API
jieba: Chinese text segmentation library
pyHanLP: Chinese NLP library (Python API)
nltk: Natural Language Toolkit
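
A quick way to see which of the packages listed above are available in the current environment is sketched below. The import names in `IMPORT_NAMES` are assumptions based on how these packages are usually imported (for example, scikit-learn as `sklearn`), and `check_dependencies` is a hypothetical helper, not part of this repository.

```python
from importlib.util import find_spec

# Assumed top-level import names for the dependencies listed above.
IMPORT_NAMES = [
    "numpy", "pandas", "sklearn", "cvxopt", "mlxtend", "matplotlib",
    "surprise", "gensim", "hmmlearn", "jieba", "pyhanlp", "nltk",
]


def check_dependencies(names=IMPORT_NAMES):
    """Return a dict mapping each import name to True if it can be located."""
    # find_spec returns None when a top-level module cannot be found,
    # without actually importing the module.
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, available in check_dependencies().items():
        print(f"{name}: {'ok' if available else 'missing'}")
```

Packages reported as missing can then be installed with pip before running the notebooks.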

Projects

Subject	Technique / Task	Dataset	Solution	Notes
Letter Recognition	kNN / Classification	Letter Recognition Datasets (File)	kNN From Scratch, kNN Scikit Learn	Notes
Page Blocks Classification	Decision Tree / Classification	Page Blocks Classification Data Set (File)	Decision Tree (CART) From Scratch, Decision Tree Scikit Learn	Notes
CSM	Linear Regression / Regression	CSM Dataset (2014 and 2015) (File)	Linear Regression From Scratch, Linear Regression Scikit Learn, Linear Regression PyTorch NN	Notes
Nursery	Naive Bayes / Classification	Nursery Data Set (File)	Gaussian Naive Bayes From Scratch, Gaussian Naive Bayes Scikit Learn	Notes
Post-Operative Patient	SVM (cvxopt) / Binary Classification	Post-Operative Patient Data Set (File, Simplified)	SVM From Scratch (using cvxopt and simplified dataset), SVM Scikit Learn	Notes
Student Performance	AdaBoost / Classification	Student Performance Data Set (File)	AdaBoost From Scratch, AdaBoost Scikit Learn	Notes
Sales Transactions	k-Means / Clustering	Sales Transactions Dataset Weekly (File)	k-Means From Scratch, k-Means Scikit Learn	Notes
Frequent Itemset Mining	FP-Growth / Frequent Itemsets Mining	Retail Market Basket Data Set (File)	FP-Growth From Scratch	Notes
Automobile	PCA / Dimensionality Reduction	Automobile Data Set (File)	PCA From Scratch, PCA Scikit Learn	Notes
Anonymous Microsoft Web Data	SVD / Recommendation System	Anonymous Microsoft Web Data Data Set (File, Ratings Matrix (by R))	SVD From Scratch, R Notebook - IBCF Recommender System	Notes
Handwriting Digit	SVM (SMO) / Binary & Multi-class Classification	MNIST (File)	Binary SVM From Scratch, Multi-class (OVR) SVM From Scratch	Notes
Chinese Text Segmentation	HMM (EM) / Text Segmentation & POS Tagging	File	HMM From Scratch, HMM hmmlearn, Compare with Jieba and HanLP	-
Document Similarity and LSI	VSM, SVD / LSI	Corpus of the People's Daily (File)	VSM From Scratch, VSM Gensim, SVD/LSI Gensim	Notes
Click and Conversion Prediction	Logistic Regression / Recommendation System	Ali-CCP (file too large to include, about 20 GB)		Notes
LightGBM & XGBoost & CatBoost Practice	Boosting Tree / Classification	Social Network Ads (File)	LightGBM, XGBoost	Notes
Kaggle Elo	LightGBM / Feature Engineering	Elo Merchant Category Recommendation	LightGBM Project
DCIC 2019	XGBoost / Feature Engineering	Failure Prediction of Concrete Piston for Concrete Pump Vehicles	XGBoost Project
Epinions CLiMF	Collaborative Filtering / Recommendation System	Epinions	CLiMF From Scratch, CLiMF TensorFlow	Notes, PaperResearch
Iris EM	EM Algorithm / Clustering	Iris Data Set	EM From Scratch	Notes
Iris Logistic	Logistic Regression / Classification	Iris Data Set	Logistic Regression From Scratch, Logistic Regression Scikit Learn, SVM (used for comparison)	Notes
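
Most projects in the table pair a "From Scratch" solution with a library-based one. As a rough illustration of what a from-scratch solution can look like, here is a minimal kNN classifier using only the standard library. It is a sketch under simple assumptions (Euclidean distance, majority vote, made-up toy data), not the code from the Letter Recognition project; `knn_predict` is a hypothetical name.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(x, query), label)
        for x, label in zip(train_X, train_y)
    ]
    # Keep the k closest training points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among them is the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data, made up for illustration only.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["A", "A", "A", "B", "B", "B"]

prediction = knn_predict(train_X, train_y, (1.1, 1.0), k=3)
```

The scikit-learn counterparts in the table serve as a reference to check that a from-scratch version behaves sensibly on the same data.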

Machine Learning Categories

Consider the learning task

Supervised Learning
- Classification - Discrete
- Regression - Continuous
Unsupervised Learning
- Clustering - Discrete
- Dimensionality Reduction - Continuous
- Association Rule Learning
Semi-supervised Learning
- Semi-Clustering
- Semi-Classification
Reinforcement Learning

Consider the learning model

Discriminative Model
- Discriminative Function
- Probabilistic Discriminative Model
Generative Model

Consider the desired output of an ML system

Classification
- Logistic Regression (optimization algo.)
  - Multinomial/Softmax Regression (SMR)
- k-Nearest Neighbors (kNN)
- Support Vector Machine (SVM) - Derivation (optimization algo.)
- Naive Bayes
- Decision Tree (ID3, C4.5, CART)
Regression
- Linear Regression - Derivation (optimization algo.)
- Tree (CART)
Clustering
- k-Means
- Hierarchical Clustering
- DBSCAN
Association Rule Learning
- Apriori
- Eclat
- FP-growth - Frequent itemsets mining
Dimensionality Reduction
- Principal Component Analysis (PCA)
- Singular Value Decomposition (SVD) - LSA, LSI, Recommendation System
- Canonical Correlation Analysis (CCA)
- Isomap (nonlinear)
- Locally Linear Embedding (LLE) (nonlinear)
- Laplacian Eigenmaps (nonlinear)

Ensemble Method (Meta-algorithm)

Bagging
- Random Forests
Boosting
- AdaBoost <- With some basic boosting notes
- Gradient Boosting
  - Gradient Boosting Decision Tree (GBDT) (aka. Multiple Additive Regression Tree (MART))
- XGBoost
- LightGBM

NLP Related

Hidden Markov Model (HMM) - Sequence Labeling Problem
Conditional Random Field (CRF) - Classification Problem (e.g. Sentiment Analysis)

Backbone

Maximum Entropy Model (MEM)
Bayesian Network (aka. Probabilistic Directed Acyclic Graphical Model)

Others

Probabilistic Latent Semantic Analysis (PLSA)
Latent Dirichlet Allocation (LDA)
Vector Space Model (VSM)
Radial Basis Function (RBF) Network
Isolation Forest
- paper
One-Class SVM

Heuristic Algorithm (Optimization Method)

SMO --> SVM
EM --> HMM, etc.
GIS == improved ==> IIS --> MEM

Machine Learning Concepts

General Case

Categorized

Classification
- Data Preprocessing
  - Label Encoding
- Real-world Problem
  - Cost-sensitive Learning
  - Classification Imbalance
- Evaluation Metrics
  - Classification Metrics
- Binary to Multi-class Extension
Regression
- Evaluation Metrics
  - Regression Metrics
Clustering
- Evaluation Metrics
  - Clustering Metrics
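
To make the evaluation metrics above concrete, the sketch below computes a few common ones by hand: accuracy, precision, recall and F1 for binary classification, and mean squared error for regression. The function names and toy labels are illustrative assumptions. The linked notes may cover more metrics, and clustering metrics are not shown here.

```python
def binary_classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels, made up for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_classification_metrics(y_true, y_pred)
```

On imbalanced classes, accuracy alone can be misleading, which is why precision, recall and F1 are usually reported alongside it.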

Specific Field

Data Mining - Knowledge Discovery
Feature Engineering
- Training optimization
  - Memory usage
  - Evaluation time complexity
Recommendation System
- Collaborative Filtering (CF)
Information Retrieval - Topic Modelling
- Latent Semantic Analysis (LSA/LSI/SVD)
- Latent Dirichlet Allocation (LDA)
- Random Projections (RP)
- Hierarchical Dirichlet Process (HDP)
- word2vec

Machine Learning Mathematics

Topic

Kernel Usages
Convex Optimization
Distance/Similarity Measurement - basis of clustering and recommendation system
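
As a small illustration of the distance and similarity measures mentioned above, the sketch below implements Euclidean distance (a typical choice for clustering such as k-Means) and cosine similarity (a typical choice for comparing rating or term vectors in recommendation). These are two common examples chosen for illustration; the linked notes may cover others.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Convention used in this sketch: similarity with a zero vector is 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


d = euclidean_distance((0.0, 0.0), (3.0, 4.0))
s = cosine_similarity((1.0, 2.0), (2.0, 4.0))
```

Cosine similarity ignores vector length, so two users who rate the same items proportionally are treated as similar even if one rates consistently higher.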

Application

(from A to Z)

Decision Tree
- Entropy
HMM
- Markov Chain
Naive Bayes
- Bayes' Theorem
PCA
- Orthogonal Transformations
- Eigenvalues
SVD
- Eigenvalues
SVM
- Convex Optimization
- Constrained Optimization
- Lagrange Multipliers
- Kernel

Books Recommendation

Machine Learning

Machine Learning in Action
- Source Code
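Statistical Learning Methods (統計學習方法) by Li Hang (李航)
Machine Learning (機器學習) by Zhou Zhihua (周志華), also known as the Watermelon Book (西瓜書)
- datawhalechina/pumpkin-book - Pumpkin Book (南瓜書)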
  - github
Python Machine Learning
- Source Code
Introduction to Machine Learning, 3rd edition
- Solution Manual
- Previous editions: 1st, 2nd
Automated Machine Learning: Methods, Systems, Challenges (AutoML)
- pdf

Mathematics

Linear Algebra with Applications (Steven Leon)
Convex Optimization (Stephen Boyd & Lieven Vandenberghe)
Numerical Linear Algebra (L. Trefethen & D. Bau III)

Resources

Tutorial

Videos

Documentation

Interactive Learning

MOOC

Stanford Andrew Ng - CS229
- Coursera

GitHub

Machine Learning from Scratch (eriklindernoren/ML-From-Scratch)
Avik-Jain/100-Days-Of-ML-Code - 100 Days of ML Coding
- Chinese version
ddbourgin/numpy-ml - Machine learning, in numpy
Machine learning Resources
microsoft/recommenders - Best Practices on Recommendation Systems
dformoso/machine-learning-mindmap
- pdf

Textbook Implementation

Machine Learning in Action
- Jack-Cherish/Machine-Learning
Learning From Data (Hsuan-Tien Lin, 林軒田)
- Doraemonzzz/Learning-from-data
Statistical Learning Methods (統計學習方法) by Li Hang (李航)
Stanford Andrew Ng CS229
- fengdu78/Coursera-ML-AndrewNg-Notes
- fengdu78/deeplearning_ai_books
NTU Hung-Yi Lee
- LeeML-Notes

Datasets

Competition

Global

Kaggle Competition
KDDCup
- KDD Cup 2019: AutoML for Temporal Relational Data
  - 4Paradigm
  - CodaLab
CodaLab

Taiwan

Open Data

China

Machine Learning Platform

Machine Learning Tool

AutoML
- Auto-sklearn
  - github
Optuna - A hyperparameter optimization framework
- github
Hyperopt - Distributed Asynchronous Hyper-parameter Optimization
- github
- hyperopt-sklearn

(Online) Development Environment

jupyter notebook

Extension plugin - pip install jupyter_contrib_nbextensions
- VIM binding
- Codefolding
- ExecuteTime
- Notify
Jupyter Theme - pip install --upgrade jupyterthemes

daviddwlee84/MachineLearningPractice