This repository contains a PyTorch-based pipeline for imputing missing values in multivariate time series, specifically oceanographic datasets, using a 3-layer bidirectional LSTM with enhanced feature engineering. It includes automatic gap detection, model-based imputation, accuracy reports, and visualisations, and is intended for cleaning scientific data.
Automatic missing value detection and imputation for each column
Enhanced feature engineering:
--> Temporal encoding (hour, day-of-year)
--> Rolling statistics (mean, std)
--> Lag and difference features
--> Exponential moving averages
--> Physical gradients (temperature and salinity profiles)
Multiple visualisation utilities: Analyse gap filling and compare before/after filling with plots and heatmaps
Accuracy evaluation: Reports fill rates and imputation accuracy
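The feature types listed above can be sketched with pandas as follows. This is a minimal illustration, not the repository's `create_enhanced_features` implementation: the helper name `add_basic_features`, the column name `Temp_0m`, and the window, lag, and span values are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd


def add_basic_features(df: pd.DataFrame, col: str, window: int = 3) -> pd.DataFrame:
    """Hypothetical helper: derive simple features for one column.

    Assumes `df` has a DatetimeIndex. Window, lag, and span are illustrative.
    """
    out = df.copy()
    # Temporal encoding
    out["hour"] = out.index.hour
    out["day_of_year"] = out.index.dayofyear
    # Rolling statistics; min_periods=1 yields a value as soon as one
    # observation is inside the window
    roll = out[col].rolling(window, min_periods=1)
    out[f"{col}_roll_mean"] = roll.mean()
    out[f"{col}_roll_std"] = roll.std()
    # Lag and first difference
    out[f"{col}_lag1"] = out[col].shift(1)
    out[f"{col}_diff1"] = out[col].diff()
    # Exponential moving average
    out[f"{col}_ema"] = out[col].ewm(span=window, adjust=False).mean()
    return out


# Small synthetic hourly series with one gap (not real data)
idx = pd.date_range("2024-01-01", periods=6, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({"Temp_0m": [10.0, 10.5, np.nan, 11.0, 11.2, 11.1]}, index=idx)
features = add_basic_features(demo, "Temp_0m")
print(features.round(3))
```

Lag and difference features are themselves missing next to a gap, so a real pipeline would need to decide how to handle those rows before training.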
Clone this repository
Install the required packages:
    pip install pandas numpy matplotlib scikit-learn torch openpyxl
(openpyxl is the engine pandas uses to read and write .xlsx files.)
Prepare your data:
Ensure your Excel file matches the expected format.
Update the path in your script to point to your data location.
Usage
Main Imputation Script
The primary script performs:
Data loading
Feature engineering
Model training and gap filling for each target column with missing data
Saving the filled dataset to Excel
Comprehensive visualisations
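Deciding which columns need imputation, and where their gaps lie, could look like the following. This is a minimal sketch on a small synthetic frame; `find_gaps` and the column names are hypothetical and are not taken from the repository's code.

```python
import numpy as np
import pandas as pd


def find_gaps(series: pd.Series) -> list:
    """Return (start, length) for each run of consecutive NaNs, by position."""
    is_nan = series.isna().to_numpy()
    gaps = []
    start = None
    for i, missing in enumerate(is_nan):
        if missing and start is None:
            start = i                       # a new gap opens here
        elif not missing and start is not None:
            gaps.append((start, i - start))  # the gap closed just before i
            start = None
    if start is not None:                   # gap runs to the end of the series
        gaps.append((start, len(is_nan) - start))
    return gaps


# Synthetic example frame (not real data)
demo = pd.DataFrame({
    "Temp_0m": [10.0, np.nan, np.nan, 11.0, 11.2],
    "Salt_0m": [35.0, 35.1, 35.1, np.nan, 35.2],
    "Depth": [0.0, 0.0, 0.0, 0.0, 0.0],
})

missing_counts = demo.isna().sum()
# Only columns that actually contain missing values become imputation targets
target_cols = missing_counts[missing_counts > 0].index.tolist()
gaps = {col: find_gaps(demo[col]) for col in target_cols}
print(target_cols)
print(gaps)
```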
Place your script in the repository directory and ensure you adjust paths as necessary.
    import pandas as pd
    df = pd.read_excel(r"path_to_your_data/TS_Ssp.xlsx", sheet_name='Actuals', usecols='A,E:M,Q:Y')
Run the script:
    python your_script.py
The output will include:
Console logs of process details, fill rates, and model accuracy
A filled Excel file: enhanced_lstm_oceanographic_filled.xlsx
Multiple pop-up plots for visual analysis
EnhancedLSTMImputer: Handles all preprocessing, training, and gap filling.
TimeSeriesDataset: Custom PyTorch dataset for efficient minibatch creation.
EnhancedLSTM: Model class with three LSTM layers and dense layers.
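To make the roles of the dataset and model classes concrete, here is a rough sketch of what such a pair can look like in PyTorch. The class names `WindowDataset` and `SketchLSTM`, the hidden size, the dropout rate, and the shape of the dense head are assumptions for illustration; the repository's `TimeSeriesDataset` and `EnhancedLSTM` may differ in all of these.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset


class WindowDataset(Dataset):
    """Hypothetical stand-in for TimeSeriesDataset: (window, target) pairs."""

    def __init__(self, windows, targets):
        self.windows = torch.as_tensor(windows, dtype=torch.float32)
        self.targets = torch.as_tensor(targets, dtype=torch.float32)

    def __len__(self):
        return len(self.windows)

    def __getitem__(self, i):
        return self.windows[i], self.targets[i]


class SketchLSTM(nn.Module):
    """Hypothetical stand-in for EnhancedLSTM; all sizes are illustrative."""

    def __init__(self, n_features, lstm_units=32, dropout=0.2):
        super().__init__()
        # Three stacked bidirectional LSTM layers
        self.lstm = nn.LSTM(
            n_features, lstm_units, num_layers=3,
            batch_first=True, bidirectional=True, dropout=dropout,
        )
        # Dense head; input is 2 * lstm_units because of the two directions
        self.head = nn.Sequential(
            nn.Linear(2 * lstm_units, lstm_units),
            nn.ReLU(),
            nn.Linear(lstm_units, 1),
        )

    def forward(self, x):
        out, _ = self.lstm(x)         # (batch, seq_len, 2 * lstm_units)
        return self.head(out[:, -1])  # predict from the last time step


torch.manual_seed(0)
# Random tensors standing in for 8 windows of 24 steps and 5 features
ds = WindowDataset(torch.randn(8, 24, 5), torch.randn(8, 1))
loader = DataLoader(ds, batch_size=4)
model = SketchLSTM(n_features=5)
x, y = next(iter(loader))
pred = model(x)
print(pred.shape)
```

Reading the output at the last time step is one simple choice; pooling over all time steps is another option a model like this could use.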
Visualisation utilities:
visualize_gap_filling
visualize_before_after_comparison
plot_missing_data_heatmap
The main script:
Reports initial/final missing values
Runs imputation and accuracy assessment
Calls the visualisation utilities
Before/after scatterplots to directly compare raw and imputed data.
Heatmaps showing the spatial distribution of missing values before and after filling.
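A before/after missing-data heatmap of the kind described above can be drawn with matplotlib roughly as follows. This is a sketch under assumptions: `sketch_missing_heatmap` is a hypothetical helper, not the repository's `plot_missing_data_heatmap`, and linear interpolation stands in for the model's output. The `Agg` backend is selected only so the example runs without a display; drop that line and call `plt.show()` to get a pop-up window.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def sketch_missing_heatmap(before: pd.DataFrame, after: pd.DataFrame):
    """Hypothetical helper: side-by-side missingness masks (dark = missing)."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
    panels = zip(axes, (before, after), ("Before filling", "After filling"))
    for ax, frame, title in panels:
        mask = frame.isna().to_numpy().astype(int)  # 1 where a value is missing
        ax.imshow(mask, aspect="auto", cmap="Greys",
                  interpolation="nearest", vmin=0, vmax=1)
        ax.set_title(title)
        ax.set_xticks(range(frame.shape[1]))
        ax.set_xticklabels(frame.columns, rotation=45, ha="right")
        ax.set_xlabel("Column")
    axes[0].set_ylabel("Row (time step)")
    fig.tight_layout()
    return fig


# Synthetic frame with interior gaps (not real data)
before = pd.DataFrame({
    "Temp_0m": [10.0, np.nan, 11.0, 11.2],
    "Salt_0m": [35.0, 35.1, np.nan, 35.2],
})
after = before.interpolate()  # stand-in for the model-filled data
fig = sketch_missing_heatmap(before, after)
```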
Sequence Length & LSTM Units: Tune parameters like sequence_length and lstm_units according to your dataset's characteristics.
Features: Add/remove derived features in the create_enhanced_features function to suit your domain specifics.
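To see why `sequence_length` matters, it helps to look at how a series is cut into training windows. The sketch below assumes each window of `sequence_length` steps is paired with the value that follows it; `make_windows` is a hypothetical helper and the repository may build its samples differently.

```python
import numpy as np


def make_windows(values, sequence_length):
    """Return (X, y): X[i] holds `sequence_length` consecutive steps,
    y[i] is the value immediately after that window."""
    values = np.asarray(values, dtype=float)
    n = len(values) - sequence_length
    if n <= 0:
        raise ValueError("series must be longer than sequence_length")
    X = np.stack([values[i:i + sequence_length] for i in range(n)])
    y = values[sequence_length:]
    return X, y


series = np.arange(10.0)  # toy series 0.0 .. 9.0
X_short, y_short = make_windows(series, sequence_length=3)
X_long, y_long = make_windows(series, sequence_length=8)
print(X_short.shape, X_long.shape)
```

A longer window gives the model more context per sample but leaves fewer samples from the same series, which is one reason the value is worth tuning per dataset.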
GPU acceleration is supported if CUDA is available.
Training prints progress messages and uses early stopping to avoid running unnecessary epochs.
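The two notes above, device selection and early stopping, typically combine in a training loop along these lines. This is a generic sketch, not the repository's training code: the tiny linear model, the synthetic data, and the `patience` and `max_epochs` values are assumptions for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Use the GPU when CUDA is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Synthetic regression data standing in for real features (illustrative only)
X = torch.randn(64, 4)
y = X.sum(dim=1, keepdim=True)
X_train, y_train = X[:48].to(device), y[:48].to(device)
X_val, y_val = X[48:].to(device), y[48:].to(device)

model = nn.Linear(4, 1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

patience, max_epochs = 5, 200  # assumed values
best_val, bad_epochs, epochs_run = float("inf"), 0, 0

for epoch in range(max_epochs):
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()
    epochs_run = epoch + 1

    if epoch % 20 == 0:
        print(f"epoch {epoch}: train={loss.item():.4f} val={val_loss:.4f}")

    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0   # improvement: reset the counter
    else:
        bad_epochs += 1
        if bad_epochs >= patience:           # no improvement for `patience` epochs
            print(f"early stopping at epoch {epoch}")
            break
```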
Column names should follow a consistent prefix convention (e.g., Temp_, Salt_) so that gradient features are generated correctly.
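The reason the prefixes matter is that gradient features are built by grouping columns that share one. A minimal sketch of that idea follows, assuming the columns for each variable are already ordered by depth; the helper `add_profile_gradients` and the suffixes `0m`, `10m`, `20m` are hypothetical and only the `Temp_` and `Salt_` prefixes come from this README.

```python
import pandas as pd


def add_profile_gradients(df: pd.DataFrame, prefix: str) -> pd.DataFrame:
    """Hypothetical helper: difference between adjacent columns sharing `prefix`.

    Assumes the matching columns are already ordered from shallow to deep.
    """
    cols = [c for c in df.columns if c.startswith(prefix)]
    out = df.copy()
    for upper, lower in zip(cols[:-1], cols[1:]):
        name = f"{prefix}grad_{upper[len(prefix):]}_{lower[len(prefix):]}"
        out[name] = df[lower] - df[upper]  # deeper minus shallower
    return out


# Synthetic profile data (not real measurements)
demo = pd.DataFrame({
    "Temp_0m": [12.0, 12.5],
    "Temp_10m": [11.0, 11.0],
    "Temp_20m": [9.5, 10.0],
    "Salt_0m": [35.0, 35.0],
    "Salt_10m": [35.5, 35.25],
})
with_grads = add_profile_gradients(add_profile_gradients(demo, "Temp_"), "Salt_")
print(with_grads.filter(like="grad"))
```

A column named without the expected prefix would simply be left out of the grouping, so no gradient would be produced for it.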
Filled Data: Saved as enhanced_lstm_oceanographic_filled.xlsx
Accuracy Report: Printed to the console, showing per-column fill rate and "imputation accuracy" (relative mean).
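A per-column report of this kind could be computed roughly as below. The fill rate is simply the share of originally missing values that were filled. The README does not define "relative mean", so the second metric here is only one possible reading, stated as an assumption: it compares the mean of the imputed values with the mean of the observed values. The helper `fill_report` and its output column names are hypothetical.

```python
import numpy as np
import pandas as pd


def fill_report(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-column fill rate and a relative-mean score."""
    rows = []
    for col in before.columns:
        was_missing = before[col].isna()
        n_before = int(was_missing.sum())
        if n_before == 0:
            continue  # nothing to impute in this column
        n_after = int(after[col].isna().sum())
        observed_mean = before[col].mean()               # NaNs are skipped
        imputed_mean = after.loc[was_missing, col].mean()
        # ASSUMPTION: "relative mean" accuracy read as closeness of the two
        # means; undefined when the observed mean is zero.
        rel_mean = 100.0 * (1.0 - abs(imputed_mean - observed_mean) / abs(observed_mean))
        rows.append({
            "column": col,
            "missing_before": n_before,
            "missing_after": n_after,
            "fill_rate_pct": 100.0 * (n_before - n_after) / n_before,
            "rel_mean_acc_pct": rel_mean,
        })
    return pd.DataFrame(rows).set_index("column")


# Synthetic before/after frames (not real results)
before = pd.DataFrame({"Temp_0m": [10.0, np.nan, 12.0, np.nan],
                       "Salt_0m": [35.0, 35.1, 35.2, 35.3]})
after = pd.DataFrame({"Temp_0m": [10.0, 11.0, 12.0, 11.0],
                      "Salt_0m": [35.0, 35.1, 35.2, 35.3]})
report = fill_report(before, after)
print(report)
```

A score based on means alone cannot tell whether individual imputed values are close to the truth; masking known values and comparing against them is a common way to check that separately.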
This project is provided for research and educational purposes.
Feel free to fork, extend, and adapt for your own time series imputation needs!