Machine Learning & Deep Learning Notebook Starters

A comprehensive collection of Jupyter notebook templates for ML/DL projects with complete data preprocessing pipelines for various data types.

Notebooks Overview

1. ML_DL_General_Starter.ipynb

Purpose: Foundation template with all essential imports and setup

Contents:

  • Core data science libraries (NumPy, Pandas, Matplotlib, Seaborn)
  • Scikit-learn preprocessing and metrics
  • TensorFlow/Keras and PyTorch setup
  • GPU configuration and memory management
  • Random seed functions for reproducibility
  • Utility functions for plotting and data analysis
  • Standard directory structure setup

Use When: Starting any new ML/DL project
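
A minimal sketch of the reproducibility setup listed above (the helper name set_all_seeds is illustrative, not necessarily the notebook's exact API):

import os
import random
import numpy as np
import tensorflow as tf
import torch

def set_all_seeds(seed=42):
    # Seed every RNG the starter touches so runs are repeatable
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    torch.manual_seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

set_all_seeds(42)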


2. Tabular_Data_CSV_Lists.ipynb

Purpose: Complete pipeline for tabular data to neural networks

Contents:

  • Loading data from CSV files and Python lists/dictionaries
  • Data exploration and cleaning
  • Handling missing values (numeric and categorical)
  • Categorical feature encoding (one-hot, label encoding)
  • Train/validation/test splits with stratification
  • Feature scaling (StandardScaler, MinMaxScaler)
  • Converting to TensorFlow tensors and datasets
  • Converting to PyTorch tensors and DataLoaders
  • Example neural network architectures
  • Data persistence with pickle

Use When:

  • Working with structured data (Excel, CSV, databases)
  • Regression or classification on tabular features
  • Customer data, financial data, IoT sensor data

Example Use Cases:

  • Customer churn prediction
  • House price prediction
  • Credit risk assessment
  • Sensor failure detection
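
An end-to-end sketch of the flow this notebook covers, shown on the PyTorch side (the file path, 'target' column name, and integer classification labels are placeholder assumptions):

import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from torch.utils.data import DataLoader, TensorDataset

df = pd.read_csv('data.csv')                        # placeholder path
y = df['target'].values                             # placeholder target column
X = pd.get_dummies(df.drop(columns='target')).astype('float32').values  # one-hot encode categoricals

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

scaler = StandardScaler().fit(X_train)              # fit on training data only
X_train, X_val = scaler.transform(X_train), scaler.transform(X_val)

train_ds = TensorDataset(torch.tensor(X_train, dtype=torch.float32),
                         torch.tensor(y_train, dtype=torch.long))
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)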

3. Image_Data_Processing.ipynb

Purpose: Complete pipeline for image data to CNNs

Contents:

  • Loading images from directories (ImageFolder structure)
  • Loading individual images (PIL, OpenCV, Keras)
  • Working with numpy arrays of images
  • Image preprocessing and normalization strategies
  • Data augmentation (rotation, flip, zoom, color jitter)
  • TensorFlow/Keras pipelines (ImageDataGenerator, tf.data)
  • PyTorch pipelines (Custom Dataset, transforms, DataLoader)
  • CNN architectures from scratch
  • Transfer learning with pre-trained models (ResNet, VGG, MobileNet, EfficientNet)
  • Visualization utilities

Use When:

  • Computer vision tasks
  • Image classification
  • Object detection preprocessing
  • Medical imaging

Example Use Cases:

  • Face emotion recognition
  • Medical image diagnosis
  • Product defect detection
  • Plant disease classification
  • Animal species identification
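
A hedged sketch of the tf.data side of this pipeline (the 'data/train' directory, image size, and augmentation choices are placeholders):

import tensorflow as tf

# Expects an ImageFolder-style layout: data/train/<class_name>/*.jpg
train_ds = tf.keras.utils.image_dataset_from_directory(
    'data/train', image_size=(224, 224), batch_size=32)

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip('horizontal'),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])
normalize = tf.keras.layers.Rescaling(1.0 / 255)

# Normalize, augment (training only), and prefetch for throughput
train_ds = (train_ds
            .map(lambda x, y: (augment(normalize(x), training=True), y))
            .prefetch(tf.data.AUTOTUNE))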

4. Time_Series_Text_Data.ipynb

Purpose: Sequence data processing for RNNs, LSTMs, and NLP models

Contents:

Time Series Section:

  • Loading time series from CSV and creating synthetic data
  • Creating windowed sequences for forecasting
  • Chronological train/val/test splits
  • Feature scaling for sequences
  • LSTM, GRU, and Conv1D models
  • PyTorch sequence models
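
A minimal version of the windowing helper, matching the create_sequences call shown under Common Patterns below (the notebook's implementation may differ):

import numpy as np

def create_sequences(data, sequence_length=30, forecast_horizon=1):
    # Slide a window over a 1-D series to build (X, y) pairs for forecasting
    X, y = [], []
    for i in range(len(data) - sequence_length - forecast_horizon + 1):
        X.append(data[i:i + sequence_length])
        y.append(data[i + sequence_length:i + sequence_length + forecast_horizon])
    return np.array(X), np.array(y)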

Text Data Section:

  • Text preprocessing (cleaning, lowercasing, removing punctuation)
  • Tokenization and vocabulary building
  • Sequence padding and truncation
  • TensorFlow/Keras text pipelines
  • PyTorch text processing
  • Text classification models (LSTM, CNN, BiLSTM)
  • Pre-trained embeddings (GloVe integration)
  • Prediction utilities
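
A minimal sketch of the tokenize-and-pad step on the Keras side (vocabulary size and maxlen are illustrative starting points; this uses the legacy tf.keras.preprocessing API the notebook is built on):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# texts: a list of raw strings from your own dataset
tokenizer = Tokenizer(num_words=10000, oov_token='<OOV>')
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, maxlen=100, padding='post', truncating='post')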

Use When:

  • Time series forecasting
  • Sentiment analysis
  • Text classification
  • Sequence prediction

Example Use Cases:

  • Stock price prediction
  • Weather forecasting
  • Cryptocurrency price prediction
  • Product review sentiment analysis
  • Spam detection
  • News classification

Quick Start

Installation

# Core requirements
pip install numpy pandas matplotlib seaborn scikit-learn

# Deep Learning
pip install tensorflow torch torchvision

# Optional (for enhanced functionality)
pip install nltk opencv-python pillow

Basic Workflow

  1. Choose the appropriate notebook based on your data type
  2. Copy the notebook to your project directory
  3. Adjust paths to point to your data
  4. Run cells sequentially to prepare your data
  5. Modify model architectures as needed
  6. Train and evaluate your models

Example: Working with CSV Data

# 1. Load the Tabular_Data_CSV_Lists notebook
# 2. Update the data path
import pandas as pd

csv_path = 'path/to/your/data.csv'
df = pd.read_csv(csv_path)

# 3. Specify your target column
target_column = 'your_target_column'

# 4. Run preprocessing cells
# 5. The data will be automatically converted to tensors
# 6. Use the example models or build your own

Data Type Quick Reference

Data Type   | Notebook       | Key Libraries             | Model Types
------------|----------------|---------------------------|--------------------------
CSV/Excel   | #2 Tabular     | pandas, sklearn           | MLP, Random Forest
Images      | #3 Image       | PIL, OpenCV               | CNN, ResNet, EfficientNet
Time Series | #4 Time Series | pandas                    | LSTM, GRU, Conv1D
Text        | #4 Text        | nltk, keras.preprocessing | LSTM, CNN, Transformers

Common Patterns

Data Loading

# CSV
df = pd.read_csv('data.csv')

# Images from directory
train_ds = tf.keras.utils.image_dataset_from_directory('data/train/')

# Time series windowing (create_sequences is sketched in the Time Series section above)
X, y = create_sequences(data, sequence_length=30, forecast_horizon=1)

# Text
texts = df['text'].tolist()
labels = df['label'].tolist()

Preprocessing

# Scaling
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Image normalization
images = images / 255.0  # [0, 1] range

# Text tokenization
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(texts)

Model Training

# TensorFlow/Keras
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=10, restore_best_weights=True)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=5)

history = model.fit(
    train_data,
    validation_data=val_data,
    epochs=50,
    callbacks=[early_stopping, reduce_lr]
)

# PyTorch
for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_loader, criterion, optimizer)
    val_loss = validate(model, val_loader, criterion)

Best Practices

1. Always Set Random Seeds

np.random.seed(42)
tf.random.set_seed(42)
torch.manual_seed(42)

2. Fit the Scaler on Training Data Only

# ✅ Correct
scaler.fit(X_train)  # Fit only on training data
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)  # Transform using training statistics

# ❌ Wrong
scaler.fit(X)  # Don't fit on all data

3. Preserve Temporal Order for Time Series

# ✅ Correct - chronological split
train_size = int(0.7 * len(data))
X_train = X[:train_size]

# ❌ Wrong - random split breaks temporal dependencies
X_train, X_test = train_test_split(X, shuffle=True)

4. Data Augmentation Only on Training

# Training with augmentation
train_augmentation = True

# Validation/Test without augmentation
val_augmentation = False
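
In torchvision terms, a minimal sketch of this split (the specific transforms are illustrative, not a recommendation):

from torchvision import transforms

# Training split: random augmentation
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
])

# Validation/test splits: deterministic preprocessing only
val_transform = transforms.Compose([
    transforms.ToTensor(),
])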

5. Monitor GPU Usage

# TensorFlow
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# PyTorch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

Model Architecture Guidelines

Tabular Data (MLPs)

  • Start with 2-3 hidden layers
  • Use ReLU activation
  • Add Dropout (0.3-0.5) for regularization
  • Batch Normalization between layers
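
A hedged Keras sketch of these guidelines (num_features and num_classes are placeholders for your data):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(num_features,)),   # num_features: placeholder
    layers.Dense(128, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.4),
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.4),
    layers.Dense(num_classes, activation='softmax'),  # num_classes: placeholder
])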

Images (CNNs)

  • Start with pre-trained models (ResNet, EfficientNet)
  • Use data augmentation
  • Global Average Pooling instead of Flatten
  • Fine-tune last few layers first
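
A hedged sketch of that transfer-learning recipe (ResNet50 is chosen for illustration; num_classes is a placeholder):

import tensorflow as tf

base = tf.keras.applications.ResNet50(
    include_top=False, weights='imagenet', input_shape=(224, 224, 3))
base.trainable = False  # freeze the base; train only the new head first

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),  # GAP instead of Flatten
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(num_classes, activation='softmax'),
])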

Time Series (RNNs/LSTMs)

  • 2-3 LSTM/GRU layers with 64-128 units
  • Bidirectional for better context
  • Use Dropout (0.2-0.3)
  • Consider Conv1D for long sequences
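
A hedged Keras sketch following these guidelines (sequence_length, num_features, and forecast_horizon are placeholders):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(sequence_length, num_features)),
    layers.LSTM(128, return_sequences=True),
    layers.Dropout(0.2),
    layers.LSTM(64),
    layers.Dropout(0.2),
    layers.Dense(forecast_horizon),  # one output per forecast step
])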

Text (NLP)

  • Embedding dimension: 128-300
  • LSTM/GRU with 64-128 units
  • Bidirectional for better understanding
  • Consider pre-trained embeddings (GloVe, Word2Vec)
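
And a matching Keras sketch for a text classifier (vocabulary size, embedding dimension, and units are illustrative starting points):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=10000, output_dim=128),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dropout(0.3),
    layers.Dense(num_classes, activation='softmax'),  # num_classes: placeholder
])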

Common Issues & Solutions

Issue: Out of Memory (GPU)

Solutions:

  • Reduce batch size
  • Use gradient accumulation
  • Enable memory growth (TensorFlow)
  • Use mixed precision training
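
For the gradient-accumulation tip, a minimal PyTorch sketch (reusing the model, criterion, optimizer, and train_loader names from the training loop shown earlier):

accum_steps = 4  # effective batch size = batch_size * accum_steps
optimizer.zero_grad()
for step, (xb, yb) in enumerate(train_loader):
    loss = criterion(model(xb), yb) / accum_steps  # scale so gradients average correctly
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()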

Issue: Overfitting

Solutions:

  • Add Dropout layers
  • Use data augmentation
  • Reduce model complexity
  • Increase training data
  • Add L2 regularization

Issue: Poor Convergence

Solutions:

  • Adjust learning rate
  • Use learning rate schedulers
  • Check data normalization
  • Verify loss function matches task
  • Try different optimizers (Adam, SGD, RMSprop)
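
For the scheduler tip, the Keras side is covered by the reduce_lr callback in the training example above; a PyTorch equivalent might look like this sketch (reusing names from the earlier training loop):

import torch

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=5)
for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_loader, criterion, optimizer)
    val_loss = validate(model, val_loader, criterion)
    scheduler.step(val_loss)  # reduce LR when validation loss stops improving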

Issue: Class Imbalance

Solutions:

  • Use stratified splits
  • Apply class weights
  • Oversample minority class (SMOTE)
  • Undersample majority class
  • Use appropriate metrics (F1, ROC-AUC)
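
A sketch of the class-weights approach using scikit-learn's compute_class_weight (assumes integer labels in y_train):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
class_weight = dict(enumerate(weights))  # Keras expects {class_index: weight}
model.fit(X_train, y_train, class_weight=class_weight, epochs=50)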

Hyperparameter Starting Points

Learning Rate

  • Adam: 0.001 (default)
  • SGD: 0.01 - 0.1
  • Fine-tuning: 0.0001 - 0.00001

Batch Size

  • Small datasets: 16-32
  • Medium datasets: 32-64
  • Large datasets: 64-256
  • Images: 32-64
  • Text: 32-128

Epochs

  • Start with 50-100
  • Use EarlyStopping (patience=10-20)
  • Monitor validation loss

Dropout

  • Dense layers: 0.3-0.5
  • RNN/LSTM: 0.2-0.3
  • After Conv layers: 0.25-0.5


Contributing

Feel free to:

  • Add new preprocessing techniques
  • Include additional model architectures
  • Share optimization tips
  • Report issues or bugs
  • Suggest improvements

Notes

  • All notebooks include both TensorFlow and PyTorch implementations
  • Code is heavily commented for learning purposes
  • Examples use synthetic data - replace with your actual data
  • Models are starting points - tune for your specific use case
  • Always validate on held-out test set before deployment

Learning Path

  1. Start with General Starter - Set up environment
  2. Try Tabular Data - Understand basic preprocessing
  3. Move to Images - Learn data augmentation
  4. Tackle Sequences - Master time series and text

Performance Tips

  1. Use GPU when available
  2. Batch your data properly
  3. Use mixed precision training (on compatible GPUs)
  4. Enable XLA compilation (TensorFlow)
  5. Use DataLoader num_workers (PyTorch)
  6. Profile your code to find bottlenecks
  7. Use prefetching for data loading
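
A quick sketch of tips 5 and 7 (worker count and batch size are starting points, not tuned values):

import tensorflow as tf
from torch.utils.data import DataLoader

# PyTorch: load batches in parallel worker processes
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True,
                          num_workers=4, pin_memory=True)

# TensorFlow: overlap data preparation with training
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)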

Happy Model Building!

For questions or issues, refer to the individual notebook documentation or check the official framework documentation.

About

Data preparation and manipulation are essential when training neural networks, whatever format your dataset may be. This repo contains starter notebooks for prepping common dataset types so you can get to the fun stuff faster. TensorFlow and PyTorch included. Happy training!
