Machine Learning & Deep Learning Notebook Starters

A comprehensive collection of Jupyter notebook templates for ML/DL projects with complete data preprocessing pipelines for various data types.

Notebooks Overview

1. ML_DL_General_Starter.ipynb

Purpose: Foundation template with all essential imports and setup

Contents:

  • Core data science libraries (NumPy, Pandas, Matplotlib, Seaborn)
  • Scikit-learn preprocessing and metrics
  • TensorFlow/Keras and PyTorch setup
  • GPU configuration and memory management
  • Random seed functions for reproducibility
  • Utility functions for plotting and data analysis
  • Standard directory structure setup

Use When: Starting any new ML/DL project
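
A minimal sketch of the reproducibility setup listed above (the helper name set_all_seeds is illustrative, not necessarily the notebook's exact API):

import os
import random
import numpy as np
import tensorflow as tf
import torch

def set_all_seeds(seed=42):
    # Seed every RNG the starter touches so runs are repeatable
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    torch.manual_seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

set_all_seeds(42)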


2. Tabular_Data_CSV_Lists.ipynb

Purpose: Complete pipeline for tabular data to neural networks

Contents:

  • Loading data from CSV files and Python lists/dictionaries
  • Data exploration and cleaning
  • Handling missing values (numeric and categorical)
  • Categorical feature encoding (one-hot, label encoding)
  • Train/validation/test splits with stratification
  • Feature scaling (StandardScaler, MinMaxScaler)
  • Converting to TensorFlow tensors and datasets
  • Converting to PyTorch tensors and DataLoaders
  • Example neural network architectures
  • Data persistence with pickle

Use When:

  • Working with structured data (Excel, CSV, databases)
  • Regression or classification on tabular features
  • Customer data, financial data, IoT sensor data

Example Use Cases:

  • Customer churn prediction
  • House price prediction
  • Credit risk assessment
  • Sensor failure detection
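
An end-to-end sketch of the flow this notebook covers, shown on the PyTorch side (the file path, 'target' column name, and integer classification labels are placeholder assumptions):

import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from torch.utils.data import DataLoader, TensorDataset

df = pd.read_csv('data.csv')                        # placeholder path
y = df['target'].values                             # placeholder target column
X = pd.get_dummies(df.drop(columns='target')).astype('float32').values  # one-hot encode categoricals

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

scaler = StandardScaler().fit(X_train)              # fit on training data only
X_train, X_val = scaler.transform(X_train), scaler.transform(X_val)

train_ds = TensorDataset(torch.tensor(X_train, dtype=torch.float32),
                         torch.tensor(y_train, dtype=torch.long))
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)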

3. Image_Data_Processing.ipynb

Purpose: Complete pipeline for image data to CNNs

Contents:

  • Loading images from directories (ImageFolder structure)
  • Loading individual images (PIL, OpenCV, Keras)
  • Working with numpy arrays of images
  • Image preprocessing and normalization strategies
  • Data augmentation (rotation, flip, zoom, color jitter)
  • TensorFlow/Keras pipelines (ImageDataGenerator, tf.data)
  • PyTorch pipelines (Custom Dataset, transforms, DataLoader)
  • CNN architectures from scratch
  • Transfer learning with pre-trained models (ResNet, VGG, MobileNet, EfficientNet)
  • Visualization utilities

Use When:

  • Computer vision tasks
  • Image classification
  • Object detection preprocessing
  • Medical imaging

Example Use Cases:

  • Face emotion recognition
  • Medical image diagnosis
  • Product defect detection
  • Plant disease classification
  • Animal species identification
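
A hedged sketch of the tf.data side of this pipeline (the 'data/train' directory, image size, and augmentation choices are placeholders):

import tensorflow as tf

# Expects an ImageFolder-style layout: data/train/<class_name>/*.jpg
train_ds = tf.keras.utils.image_dataset_from_directory(
    'data/train', image_size=(224, 224), batch_size=32)

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip('horizontal'),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])
normalize = tf.keras.layers.Rescaling(1.0 / 255)

# Normalize, augment (training only), and prefetch for throughput
train_ds = (train_ds
            .map(lambda x, y: (augment(normalize(x), training=True), y))
            .prefetch(tf.data.AUTOTUNE))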

4. Time_Series_Text_Data.ipynb

Purpose: Sequence data processing for RNNs, LSTMs, and NLP models

Contents:

Time Series Section:

  • Loading time series from CSV and creating synthetic data
  • Creating windowed sequences for forecasting
  • Chronological train/val/test splits
  • Feature scaling for sequences
  • LSTM, GRU, and Conv1D models
  • PyTorch sequence models
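
A minimal version of the windowing helper, matching the create_sequences call shown under Common Patterns below (the notebook's implementation may differ):

import numpy as np

def create_sequences(data, sequence_length=30, forecast_horizon=1):
    # Slide a window over a 1-D series to build (X, y) pairs for forecasting
    X, y = [], []
    for i in range(len(data) - sequence_length - forecast_horizon + 1):
        X.append(data[i:i + sequence_length])
        y.append(data[i + sequence_length:i + sequence_length + forecast_horizon])
    return np.array(X), np.array(y)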

Text Data Section:

  • Text preprocessing (cleaning, lowercasing, removing punctuation)
  • Tokenization and vocabulary building
  • Sequence padding and truncation
  • TensorFlow/Keras text pipelines
  • PyTorch text processing
  • Text classification models (LSTM, CNN, BiLSTM)
  • Pre-trained embeddings (GloVe integration)
  • Prediction utilities
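
A minimal sketch of the tokenize-and-pad step on the Keras side (vocabulary size and maxlen are illustrative starting points; this uses the legacy tf.keras.preprocessing API the notebook is built on):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# texts: a list of raw strings from your own dataset
tokenizer = Tokenizer(num_words=10000, oov_token='<OOV>')
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, maxlen=100, padding='post', truncating='post')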

Use When:

  • Time series forecasting
  • Sentiment analysis
  • Text classification
  • Sequence prediction

Example Use Cases:

  • Stock price prediction
  • Weather forecasting
  • Cryptocurrency price prediction
  • Product review sentiment analysis
  • Spam detection
  • News classification

Quick Start

Installation

# Core requirements
pip install numpy pandas matplotlib seaborn scikit-learn

# Deep Learning
pip install tensorflow torch torchvision

# Optional (for enhanced functionality)
pip install nltk opencv-python pillow

Basic Workflow

  1. Choose the appropriate notebook based on your data type
  2. Copy the notebook to your project directory
  3. Adjust paths to point to your data
  4. Run cells sequentially to prepare your data
  5. Modify model architectures as needed
  6. Train and evaluate your models

Example: Working with CSV Data

# 1. Load the Tabular_Data_CSV_Lists notebook
# 2. Update the data path
import pandas as pd

csv_path = 'path/to/your/data.csv'
df = pd.read_csv(csv_path)

# 3. Specify your target column
target_column = 'your_target_column'

# 4. Run preprocessing cells
# 5. The data will be automatically converted to tensors
# 6. Use the example models or build your own

Data Type Quick Reference

Data Type   | Notebook       | Key Libraries             | Model Types
------------|----------------|---------------------------|--------------------------
CSV/Excel   | #2 Tabular     | pandas, sklearn           | MLP, Random Forest
Images      | #3 Image       | PIL, OpenCV               | CNN, ResNet, EfficientNet
Time Series | #4 Time Series | pandas                    | LSTM, GRU, Conv1D
Text        | #4 Text        | nltk, keras.preprocessing | LSTM, CNN, Transformers

Common Patterns

Data Loading

# CSV
df = pd.read_csv('data.csv')

# Images from directory
train_ds = tf.keras.utils.image_dataset_from_directory('data/train/')

# Time series windowing (create_sequences is sketched in the Time Series section above)
X, y = create_sequences(data, sequence_length=30, forecast_horizon=1)

# Text
texts = df['text'].tolist()
labels = df['label'].tolist()

Preprocessing

# Scaling
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Image normalization
images = images / 255.0  # [0, 1] range

# Text tokenization
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(texts)

Model Training

# TensorFlow/Keras
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=10, restore_best_weights=True)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=5)

history = model.fit(
    train_data,
    validation_data=val_data,
    epochs=50,
    callbacks=[early_stopping, reduce_lr]
)

# PyTorch
for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_loader, criterion, optimizer)
    val_loss = validate(model, val_loader, criterion)

Best Practices

1. Always Set Random Seeds

np.random.seed(42)
tf.random.set_seed(42)
torch.manual_seed(42)

2. Fit the Scaler on Training Data Only

# ✅ Correct
scaler.fit(X_train)  # Fit only on training data
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)  # Transform using training statistics

# ❌ Wrong
scaler.fit(X)  # Don't fit on all data

3. Preserve Temporal Order for Time Series

# ✅ Correct - chronological split
train_size = int(0.7 * len(data))
X_train = X[:train_size]

# ❌ Wrong - random split breaks temporal dependencies
X_train, X_test = train_test_split(X, shuffle=True)

4. Data Augmentation Only on Training

# Training with augmentation
train_augmentation = True

# Validation/Test without augmentation
val_augmentation = False
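
In torchvision terms, a minimal sketch of this split (the specific transforms are illustrative, not a recommendation):

from torchvision import transforms

# Training split: random augmentation
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
])

# Validation/test splits: deterministic preprocessing only
val_transform = transforms.Compose([
    transforms.ToTensor(),
])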

5. Monitor GPU Usage

# TensorFlow
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# PyTorch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

Model Architecture Guidelines

Tabular Data (MLPs)

  • Start with 2-3 hidden layers
  • Use ReLU activation
  • Add Dropout (0.3-0.5) for regularization
  • Batch Normalization between layers
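
A hedged Keras sketch of these guidelines (num_features and num_classes are placeholders for your data):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(num_features,)),   # num_features: placeholder
    layers.Dense(128, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.4),
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.4),
    layers.Dense(num_classes, activation='softmax'),  # num_classes: placeholder
])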

Images (CNNs)

  • Start with pre-trained models (ResNet, EfficientNet)
  • Use data augmentation
  • Global Average Pooling instead of Flatten
  • Fine-tune last few layers first
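
A hedged sketch of that transfer-learning recipe (ResNet50 is chosen for illustration; num_classes is a placeholder):

import tensorflow as tf

base = tf.keras.applications.ResNet50(
    include_top=False, weights='imagenet', input_shape=(224, 224, 3))
base.trainable = False  # freeze the base; train only the new head first

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),  # GAP instead of Flatten
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(num_classes, activation='softmax'),
])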

Time Series (RNNs/LSTMs)

  • 2-3 LSTM/GRU layers with 64-128 units
  • Bidirectional for better context
  • Use Dropout (0.2-0.3)
  • Consider Conv1D for long sequences
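
A hedged Keras sketch following these guidelines (sequence_length, num_features, and forecast_horizon are placeholders):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(sequence_length, num_features)),
    layers.LSTM(128, return_sequences=True),
    layers.Dropout(0.2),
    layers.LSTM(64),
    layers.Dropout(0.2),
    layers.Dense(forecast_horizon),  # one output per forecast step
])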

Text (NLP)

  • Embedding dimension: 128-300
  • LSTM/GRU with 64-128 units
  • Bidirectional for better understanding
  • Consider pre-trained embeddings (GloVe, Word2Vec)
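
And a matching Keras sketch for a text classifier (vocabulary size, embedding dimension, and units are illustrative starting points):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=10000, output_dim=128),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dropout(0.3),
    layers.Dense(num_classes, activation='softmax'),  # num_classes: placeholder
])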

Common Issues & Solutions

Issue: Out of Memory (GPU)

Solutions:

  • Reduce batch size
  • Use gradient accumulation
  • Enable memory growth (TensorFlow)
  • Use mixed precision training
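
For the gradient-accumulation tip, a minimal PyTorch sketch (reusing the model, criterion, optimizer, and train_loader names from the training loop shown earlier):

accum_steps = 4  # effective batch size = batch_size * accum_steps
optimizer.zero_grad()
for step, (xb, yb) in enumerate(train_loader):
    loss = criterion(model(xb), yb) / accum_steps  # scale so gradients average correctly
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()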

Issue: Overfitting

Solutions:

  • Add Dropout layers
  • Use data augmentation
  • Reduce model complexity
  • Increase training data
  • Add L2 regularization

Issue: Poor Convergence

Solutions:

  • Adjust learning rate
  • Use learning rate schedulers
  • Check data normalization
  • Verify loss function matches task
  • Try different optimizers (Adam, SGD, RMSprop)
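
For the scheduler tip, the Keras side is covered by the reduce_lr callback in the training example above; a PyTorch equivalent might look like this sketch (reusing names from the earlier training loop):

import torch

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=5)
for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_loader, criterion, optimizer)
    val_loss = validate(model, val_loader, criterion)
    scheduler.step(val_loss)  # reduce LR when validation loss stops improving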

Issue: Class Imbalance

Solutions:

  • Use stratified splits
  • Apply class weights
  • Oversample minority class (SMOTE)
  • Undersample majority class
  • Use appropriate metrics (F1, ROC-AUC)
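
A sketch of the class-weights approach using scikit-learn's compute_class_weight (assumes integer labels in y_train):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
class_weight = dict(enumerate(weights))  # Keras expects {class_index: weight}
model.fit(X_train, y_train, class_weight=class_weight, epochs=50)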

Hyperparameter Starting Points

Learning Rate

  • Adam: 0.001 (default)
  • SGD: 0.01 - 0.1
  • Fine-tuning: 0.0001 - 0.00001

Batch Size

  • Small datasets: 16-32
  • Medium datasets: 32-64
  • Large datasets: 64-256
  • Images: 32-64
  • Text: 32-128

Epochs

  • Start with 50-100
  • Use EarlyStopping (patience=10-20)
  • Monitor validation loss

Dropout

  • Dense layers: 0.3-0.5
  • RNN/LSTM: 0.2-0.3
  • After Conv layers: 0.25-0.5


Contributing

Feel free to:

  • Add new preprocessing techniques
  • Include additional model architectures
  • Share optimization tips
  • Report issues or bugs
  • Suggest improvements

Notes

  • All notebooks include both TensorFlow and PyTorch implementations
  • Code is heavily commented for learning purposes
  • Examples use synthetic data - replace with your actual data
  • Models are starting points - tune for your specific use case
  • Always validate on held-out test set before deployment

Learning Path

  1. Start with General Starter - Set up environment
  2. Try Tabular Data - Understand basic preprocessing
  3. Move to Images - Learn data augmentation
  4. Tackle Sequences - Master time series and text

Performance Tips

  1. Use GPU when available
  2. Batch your data properly
  3. Use mixed precision training (on compatible GPUs)
  4. Enable XLA compilation (TensorFlow)
  5. Use DataLoader num_workers (PyTorch)
  6. Profile your code to find bottlenecks
  7. Use prefetching for data loading
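
A quick sketch of tips 5 and 7 (worker count and batch size are starting points, not tuned values):

import tensorflow as tf
from torch.utils.data import DataLoader

# PyTorch: load batches in parallel worker processes
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True,
                          num_workers=4, pin_memory=True)

# TensorFlow: overlap data preparation with training
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)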

Happy Model Building!

For questions or issues, refer to the individual notebook documentation or check the official framework documentation.

About

Data preparation and manipulation are essential when training neural networks, whatever format your dataset may be. This repo contains starter notebooks for prepping common dataset types so you can get to the fun stuff faster. TensorFlow and PyTorch included. Happy training!
