A comprehensive collection of Jupyter notebook templates for ML/DL projects with complete data preprocessing pipelines for various data types.
## 1. General Starter

**Purpose:** Foundation template with all essential imports and setup.

**Contents:**
- Core data science libraries (NumPy, Pandas, Matplotlib, Seaborn)
- Scikit-learn preprocessing and metrics
- TensorFlow/Keras and PyTorch setup
- GPU configuration and memory management
- Random seed functions for reproducibility
- Utility functions for plotting and data analysis
- Standard directory structure setup
**Use When:** Starting any new ML/DL project.
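For reference, the seed setup these templates rely on might look like the following minimal sketch (the helper name `set_seed` is illustrative, not necessarily the template's exact API):

```python
import os
import random

import numpy as np
import tensorflow as tf
import torch

def set_seed(seed: int = 42) -> None:
    """Seed every RNG the notebooks touch so results are repeatable."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines
```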
## 2. Tabular Data (CSV & Lists)

**Purpose:** Complete pipeline from tabular data to neural networks.

**Contents:**
- Loading data from CSV files and Python lists/dictionaries
- Data exploration and cleaning
- Handling missing values (numeric and categorical)
- Categorical feature encoding (one-hot, label encoding)
- Train/validation/test splits with stratification
- Feature scaling (StandardScaler, MinMaxScaler)
- Converting to TensorFlow tensors and datasets
- Converting to PyTorch tensors and DataLoaders
- Example neural network architectures
- Data persistence with pickle
**Use When:**
- Working with structured data (Excel, CSV, databases)
- Regression or classification on tabular features
- Customer data, financial data, IoT sensor data
**Example Use Cases:**
- Customer churn prediction
- House price prediction
- Credit risk assessment
- Sensor failure detection
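A condensed sketch of that pipeline, assuming pandas and scikit-learn (the column names `num_col`, `cat_col`, and `target` are placeholders for your own schema):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data.csv")                                  # your CSV file
df["num_col"] = df["num_col"].fillna(df["num_col"].median())  # impute numeric NaNs
df = pd.get_dummies(df, columns=["cat_col"])                  # one-hot encode

X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)         # stratified split

scaler = StandardScaler().fit(X_train)                        # fit on train only
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```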
## 3. Image Data

**Purpose:** Complete pipeline from image data to CNNs.

**Contents:**
- Loading images from directories (ImageFolder structure)
- Loading individual images (PIL, OpenCV, Keras)
- Working with numpy arrays of images
- Image preprocessing and normalization strategies
- Data augmentation (rotation, flip, zoom, color jitter)
- TensorFlow/Keras pipelines (ImageDataGenerator, tf.data)
- PyTorch pipelines (Custom Dataset, transforms, DataLoader)
- CNN architectures from scratch
- Transfer learning with pre-trained models (ResNet, VGG, MobileNet, EfficientNet)
- Visualization utilities
**Use When:**
- Computer vision tasks
- Image classification
- Object detection preprocessing
- Medical imaging
**Example Use Cases:**
- Face emotion recognition
- Medical image diagnosis
- Product defect detection
- Plant disease classification
- Animal species identification
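As one example of the directory-based loading pattern, a minimal PyTorch sketch (the `data/train` layout, image size, and batch size are placeholders):

```python
import torch
from torchvision import datasets, transforms

train_tfms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),                # augmentation: train set only
    transforms.ToTensor(),                            # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

train_ds = datasets.ImageFolder("data/train", transform=train_tfms)
train_loader = torch.utils.data.DataLoader(train_ds, batch_size=32,
                                           shuffle=True, num_workers=2)
```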
## 4. Sequence Data (Time Series & Text)

**Purpose:** Sequence data processing for RNNs, LSTMs, and NLP models.

**Contents:**

*Time Series:*
- Loading time series from CSV and creating synthetic data
- Creating windowed sequences for forecasting
- Chronological train/val/test splits
- Feature scaling for sequences
- LSTM, GRU, and Conv1D models
- PyTorch sequence models
*Text Data:*
- Text preprocessing (cleaning, lowercasing, removing punctuation)
- Tokenization and vocabulary building
- Sequence padding and truncation
- TensorFlow/Keras text pipelines
- PyTorch text processing
- Text classification models (LSTM, CNN, BiLSTM)
- Pre-trained embeddings (GloVe integration)
- Prediction utilities
**Use When:**
- Time series forecasting
- Sentiment analysis
- Text classification
- Sequence prediction
**Example Use Cases:**
- Stock price prediction
- Weather forecasting
- Cryptocurrency price prediction
- Product review sentiment analysis
- Spam detection
- News classification
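Windowing is the heart of the time-series half. A sketch of the helper, matching the `create_sequences` call in the Common Patterns section below (the notebook's exact signature may differ):

```python
import numpy as np

def create_sequences(data, sequence_length=30, forecast_horizon=1):
    """Slice a 1-D series into (window, target) pairs for forecasting."""
    X, y = [], []
    for i in range(len(data) - sequence_length - forecast_horizon + 1):
        X.append(data[i : i + sequence_length])                      # input window
        y.append(data[i + sequence_length + forecast_horizon - 1])   # value to predict
    return np.array(X), np.array(y)
```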
## Installation

```bash
# Core requirements
pip install numpy pandas matplotlib seaborn scikit-learn
# Deep Learning
pip install tensorflow torch torchvision
# Optional (for enhanced functionality)
pip install nltk opencv-python pillow
```

## Getting Started

- Choose the appropriate notebook based on your data type
- Copy the notebook to your project directory
- Adjust paths to point to your data
- Run cells sequentially to prepare your data
- Modify model architectures as needed
- Train and evaluate your models

### Quick Example (Tabular)

```python
# 1. Load the Tabular_Data_CSV_Lists notebook
# 2. Update the data path
csv_path = 'path/to/your/data.csv'
df = pd.read_csv(csv_path)
# 3. Specify your target column
target_column = 'your_target_column'
# 4. Run preprocessing cells
# 5. The data will be automatically converted to tensors
# 6. Use the example models or build your own
```

## Notebook Comparison

| Data Type | Notebook | Key Libraries | Model Types |
|---|---|---|---|
| CSV/Excel | #2 Tabular | pandas, sklearn | MLP, Random Forest |
| Images | #3 Image | PIL, OpenCV | CNN, ResNet, EfficientNet |
| Time Series | #4 Time Series | pandas | LSTM, GRU, Conv1D |
| Text | #4 Text | nltk, keras.preprocessing | LSTM, CNN, Transformers |

## Common Patterns

### Data Loading

```python
# CSV
df = pd.read_csv('data.csv')
# Images from directory
train_ds = tf.keras.preprocessing.image_dataset_from_directory('data/train/')
# Time series windowing
X, y = create_sequences(data, sequence_length=30, forecast_horizon=1)
# Text
texts = df['text'].tolist()
labels = df['label'].tolist()
```

### Preprocessing

```python
# Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
# Image normalization
images = images / 255.0 # [0, 1] range
# Text tokenization
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(texts)
```

### Training

```python
# TensorFlow/Keras
history = model.fit(
    train_data,
    validation_data=val_data,
    epochs=50,
    callbacks=[early_stopping, reduce_lr]
)
# PyTorch
for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_loader, criterion, optimizer)
    val_loss = validate(model, val_loader, criterion)
```

## Best Practices

### Reproducibility

```python
np.random.seed(42)
tf.random.set_seed(42)
torch.manual_seed(42)
```

### Fit Scalers on Training Data Only

```python
# ✅ Correct
scaler.fit(X_train) # Fit only on training data
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val) # Transform using training statistics
# ❌ Wrong
scaler.fit(X)  # Don't fit on all data
```

### Split Time Series Chronologically

```python
# ✅ Correct - chronological split
train_size = int(0.7 * len(data))
X_train = X[:train_size]
# ❌ Wrong - random split breaks temporal dependencies
X_train, X_test = train_test_split(X, shuffle=True)
```

### Augment Only the Training Set

```python
# Training with augmentation
train_augmentation = True
# Validation/Test without augmentation
val_augmentation = False
```

### GPU Configuration

```python
# TensorFlow
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
# PyTorch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
```

## Architecture Guidelines

### Tabular (MLP)

- Start with 2-3 hidden layers
- Use ReLU activation
- Add Dropout (0.3-0.5) for regularization
- Batch Normalization between layers
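
Put together, a minimal Keras sketch of these defaults (assumes a feature matrix `X_train`; layer widths and the binary head are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = X_train.shape[1]  # number of input columns

model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(128, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.4),
    layers.Dense(64, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.4),
    layers.Dense(1, activation="sigmoid"),  # binary classification head
])
```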

### Images (CNN)

- Start with pre-trained models (ResNet, EfficientNet)
- Use data augmentation
- Global Average Pooling instead of Flatten
- Fine-tune last few layers first
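
For instance, a Keras transfer-learning sketch following these tips (`num_classes` and the input size are placeholders):

```python
from tensorflow import keras

num_classes = 10  # set to your number of classes

base = keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
base.trainable = False  # freeze the backbone; unfreeze the last blocks later

model = keras.Sequential([
    base,
    keras.layers.GlobalAveragePooling2D(),  # instead of Flatten
    keras.layers.Dropout(0.3),
    keras.layers.Dense(num_classes, activation="softmax"),
])
```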

### Time Series (RNN)

- 2-3 LSTM/GRU layers with 64-128 units
- Bidirectional for better context
- Use Dropout (0.2-0.3)
- Consider Conv1D for long sequences
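
A matching Keras sketch (window length and feature count are placeholder dimensions):

```python
from tensorflow import keras
from tensorflow.keras import layers

sequence_length, n_features = 30, 1  # placeholder window shape

model = keras.Sequential([
    keras.Input(shape=(sequence_length, n_features)),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Dropout(0.2),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dropout(0.2),
    layers.Dense(1),  # one-step forecast head
])
```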

### Text (NLP)

- Embedding dimension: 128-300
- LSTM/GRU with 64-128 units
- Bidirectional for better understanding
- Consider pre-trained embeddings (GloVe, Word2Vec)
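
And a text-classification sketch sized to the 10,000-word tokenizer shown earlier (the binary sigmoid head is illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Embedding(input_dim=10000, output_dim=128),  # vocab size from the tokenizer
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),  # binary sentiment head
])
```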

## Troubleshooting

### Out of Memory (OOM)

Solutions:
- Reduce batch size
- Use gradient accumulation
- Enable memory growth (TensorFlow)
- Use mixed precision training
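
Mixed precision, for example, can be enabled like this (a sketch; assumes the usual `model`/`criterion`/`optimizer` objects from your training loop, and uses the `torch.cuda.amp` spelling from pre-2.4 PyTorch):

```python
import torch

# TensorFlow: enable mixed precision globally
from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy("mixed_float16")

# PyTorch: autocast the forward pass and scale gradients
scaler = torch.cuda.amp.GradScaler()
optimizer.zero_grad()
with torch.cuda.amp.autocast():
    loss = criterion(model(inputs), targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```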

### Overfitting

Solutions:
- Add Dropout layers
- Use data augmentation
- Reduce model complexity
- Increase training data
- Add L2 regularization
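
In Keras, dropout and an L2 penalty can be added per layer, e.g. as a reusable block (values are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

regularized_block = keras.Sequential([
    layers.Dense(64, activation="relu",
                 kernel_regularizer=keras.regularizers.l2(1e-4)),  # L2 weight penalty
    layers.Dropout(0.5),  # drop half the activations during training
])
```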

### Model Not Converging

Solutions:
- Adjust learning rate
- Use learning rate schedulers
- Check data normalization
- Verify loss function matches task
- Try different optimizers (Adam, SGD, RMSprop)
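
A typical Keras starting point: set an explicit learning rate and let a scheduler back it off when validation loss plateaus (values are illustrative; this is one way to build the `reduce_lr` callback used in the training snippet above):

```python
from tensorflow import keras

optimizer = keras.optimizers.Adam(learning_rate=1e-3)  # lower this if loss diverges
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                              patience=5, min_lr=1e-6)
```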

### Class Imbalance

Solutions:
- Use stratified splits
- Apply class weights
- Oversample minority class (SMOTE)
- Undersample majority class
- Use appropriate metrics (F1, ROC-AUC)
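
Class weights, for instance, take one line with scikit-learn (assumes integer labels 0..k-1 in `y_train`):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

weights = compute_class_weight("balanced", classes=np.unique(y_train), y=y_train)
class_weight = dict(enumerate(weights))  # {class index: weight}
model.fit(X_train, y_train, epochs=50, class_weight=class_weight)
```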

## Hyperparameter Starting Points

### Learning Rate

- Adam: 0.001 (default)
- SGD: 0.01 - 0.1
- Fine-tuning: 0.0001 - 0.00001

### Batch Size

- Small datasets: 16-32
- Medium datasets: 32-64
- Large datasets: 64-256
- Images: 32-64
- Text: 32-128

### Epochs

- Start with 50-100
- Use EarlyStopping (patience=10-20)
- Monitor validation loss
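
The matching Keras callback (one way to build the `early_stopping` object used in the training snippet above; the patience value is illustrative):

```python
from tensorflow import keras

early_stopping = keras.callbacks.EarlyStopping(monitor="val_loss", patience=15,
                                               restore_best_weights=True)
```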

### Dropout Rate

- Dense layers: 0.3-0.5
- RNN/LSTM: 0.2-0.3
- After Conv layers: 0.25-0.5

## Contributing

Feel free to:
- Add new preprocessing techniques
- Include additional model architectures
- Share optimization tips
- Report issues or bugs
- Suggest improvements

## Notes

- All notebooks include both TensorFlow and PyTorch implementations
- Code is heavily commented for learning purposes
- Examples use synthetic data - replace with your actual data
- Models are starting points - tune for your specific use case
- Always validate on held-out test set before deployment

## Suggested Learning Path

1. Start with **General Starter** to set up your environment
2. Try **Tabular Data** to understand basic preprocessing
3. Move to **Images** to learn data augmentation
4. Tackle **Sequences** to master time series and text

## Performance Tips

- Use GPU when available
- Batch your data properly
- Use mixed precision training (on compatible GPUs)
- Enable XLA compilation (TensorFlow)
- Use DataLoader num_workers (PyTorch)
- Profile your code to find bottlenecks
- Use prefetching for data loading
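
Two of those tips in code (assumes a `tf.data.Dataset` named `train_ds` and a PyTorch `Dataset` named `train_dataset`):

```python
import tensorflow as tf
from torch.utils.data import DataLoader

# TensorFlow: batch, then overlap the input pipeline with training
train_ds = train_ds.batch(32).prefetch(tf.data.AUTOTUNE)

# PyTorch: parallel workers and pinned memory for faster host-to-GPU copies
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True,
                          num_workers=4, pin_memory=True)
```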
Happy Model Building!
For questions or issues, refer to the individual notebook documentation or check the official framework documentation.