This repository contains the code and configuration used in the paper "CIPHER: Scalable Time Series Analysis for Physical Sciences with Application to Solar Wind Phenomena". The goal of this project is to apply indexable Symbolic Aggregate approXimation (iSAX) techniques to identify and cluster patterns in solar wind time series data.
- Environment Setup
- Project Structure
- Cache
- Data
- Creating Catalogs (First Step)
- Running Experiments
- Parallel Computing Setup
- Run Dask in a Distributed Cluster
- References and Acknowledgments
This project runs on Python 3.10+ and uses mamba (or conda) as the environment manager.
# Create the environment
mamba env create -f environment.yml
# Activate the environment
mamba activate solarwind-isax

It is recommended to use Visual Studio Code (VSCode) with the following extensions:
- Python
- Command Variable
- Remote Development (optional if working on remote servers)
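Once the environment is active, a quick import check can confirm the core dependencies resolved correctly. This is a minimal sketch, assuming environment.yml provides dask, pandas, and tqdm; adjust the package list to your actual dependencies:

# Sanity check for the activated environment.
import sys

print(f"Python {sys.version.split()[0]}")  # expect 3.10 or newer

for pkg in ("dask", "pandas", "tqdm"):     # assumed dependencies; edit as needed
    try:
        mod = __import__(pkg)
        print(f"{pkg} {getattr(mod, '__version__', 'unknown')}")
    except ImportError:
        print(f"{pkg} missing -- check environment.yml")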
Ensure your project directory looks like this:
.
├── .vscode/
│   └── launch.json
├── cache/
├── data/
│   └── catalog/
├── sw-data/
│   └── nasaomnireader/
├── environment.yml
└── clustering/
You can create these folders manually or using:
mkdir -p .vscode data/catalog sw-data cache

Add a cache folder in the root directory named cache.
This is where the temporary files generated by iSAX experiments will be stored.
You can change the path of the cache folder in .vscode/launch.json under the argument -cache_folder.
This project works with both PSP (Parker Solar Probe) and OMNI data.
To set up your data, create a new folder inside sw-data/ for each dataset and place the corresponding files there.
Example:
sw-data/
├── nasaomnireader/
└── psp/
💡 This process only needs to be done once, before running any experiments.
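Before generating catalogs, a small check like the one below can confirm the data landed where the project expects it. This is only a convenience sketch based on the folder layout above:

# List the expected data folders and report how many files each contains.
from pathlib import Path

for folder in ("sw-data/nasaomnireader", "sw-data/psp"):
    path = Path(folder)
    files = list(path.iterdir()) if path.is_dir() else []
    status = f"{len(files)} file(s)" if files else "missing or empty"
    print(f"{folder}: {status}")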
- Download the input data (in-situ solar wind data). You can use the NASA OMNI dataset or equivalent data compatible with the nasaomnireader.
- Place the data inside:
  sw-data/nasaomnireader/
- In VSCode:
  - Open the project.
  - Activate your virtual environment.
  - Install local dependencies:
    pip install -e .
- Configure .vscode/launch.json with your local data path. Example configuration to generate a catalog:
{
"name": "iSAX generate catalog",
"type": "python",
"request": "launch",
"console": "integratedTerminal",
"module": "${command:extension.commandvariable.file.relativeFileDotsNoExtension}",
"cwd": "${workspaceFolder}",
"justMyCode": false,
"args": [
"-start_year", "1994",
"-stop_year", "2023",
"-instrument", "omni",
"-data_path", "/absolute/path/to/your/nasaomnireader",
"-histogram"
]
}

- In the VSCode Debug panel, select "iSAX generate catalog" and run it.
Two CSV files will be generated inside data/catalog/.
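To sanity-check the result, you can load the generated CSVs with pandas. The glob pattern below is an assumption; the exact file names depend on the instrument and year range you configured:

# Inspect the catalog files produced in data/catalog/.
from pathlib import Path
import pandas as pd

for csv_path in sorted(Path("data/catalog").glob("*.csv")):
    df = pd.read_csv(csv_path)
    print(csv_path.name, df.shape)   # file name and (rows, columns)
    print(df.head())                 # first few catalog entries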
- Ensure the generated catalog (.csv) is referenced correctly inside your experiment script (e.g., run_isax_experiments_sf_cluster.py).
- Update the cache folder path in your .vscode/launch.json configuration:
  "-cache_folder", "/path/to/cache/isax_cache_experiment/"
- Run the experiment:
  python run_isax_experiments_sf_cluster.py
You'll see a progress bar (tqdm) in the terminal.
Depending on parameters and time range, the process may take several hours.
Warnings during execution are expected and do not indicate errors.
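If the warning output makes the progress bar hard to read, you can optionally silence non-critical warnings near the top of the experiment script. This is purely cosmetic and an assumption about your preference, not a requirement of the code:

# Optional: suppress non-critical warnings during long experiment runs.
import warnings

warnings.filterwarnings("ignore")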
You can execute your experiments in parallel using Dask.
# Install the package locally
pip install -e .
# Start a scheduler
nohup dask-scheduler > scheduler.log 2>&1 &
# Start a worker
nohup dask-worker tcp://127.0.0.1:8786 --nworkers 8 --nthreads 1 > worker.log 2>&1 &

To access the Dask dashboard locally:
ssh -L 8787:localhost:8787 [your_server_name]

Then open http://localhost:8787/graph in your browser.
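Once the scheduler and workers are running, a short check from Python confirms they registered correctly. This is a minimal sketch, assuming the scheduler address used above:

# Confirm the local Dask scheduler and its workers are reachable.
from dask.distributed import Client

client = Client("tcp://127.0.0.1:8786")               # scheduler started above
n_workers = len(client.scheduler_info()["workers"])
print(f"Connected to scheduler with {n_workers} worker(s)")
print(client.dashboard_link)                          # dashboard URL (port 8787)
client.close()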
You can further improve performance by running your code in a distributed Dask cluster. This allows multiple machines or GPUs to share the workload efficiently.
pip install -e .
pip install "dask[distributed]"
nohup dask-scheduler > scheduler.log 2>&1 &

This runs the scheduler in the background on port 8786, with a web dashboard available at port 8787.

nohup dask-worker tcp://127.0.0.1:8786 > worker1.log 2>&1 &

To view Dask's dashboard from your local machine:
ssh -L 8787:localhost:8787 [server_name]
# Example:
# ssh -L 8787:localhost:8787 fdl-daniela

nohup dask-worker tcp://127.0.0.1:8786 --nworkers 8 --nthreads 1 > worker1.log 2>&1 &

# GPU 0
CUDA_VISIBLE_DEVICES=0 dask-cuda-worker tcp://127.0.0.1:8786 --nthreads 8 > worker1.log 2>&1 &
# GPU 1
CUDA_VISIBLE_DEVICES=1 dask-cuda-worker tcp://127.0.0.1:8786 > worker2.log 2>&1 &
# All workers in one command:
dask-cuda-worker tcp://127.0.0.1:8786 --device-memory-limit=16GB

You can launch multiple workers on the same or different machines.
from dask.distributed import Client
client = Client("tcp://127.0.0.1:8786") # IP/hostname of your scheduler
print(client)

All Dask operations will now use the external distributed cluster.
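As a quick smoke test before launching a full experiment, you can push a trivial computation through the cluster. The square function below is purely illustrative:

# Smoke test: run a trivial computation on the distributed cluster.
from dask.distributed import Client

client = Client("tcp://127.0.0.1:8786")

def square(x):
    return x * x

futures = client.map(square, range(8))   # schedule a few small tasks
print(client.gather(futures))            # -> [0, 1, 4, 9, 16, 25, 36, 49]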
To stop all Dask processes:
pkill -f dask-scheduler
pkill -f dask-worker

🔧 Note: Whenever you switch branches, change configuration parameters, or restart your environment, restart the Dask processes as well.
This work is part of Heliolab 2026 | FDL Decoding Solar Wind Challenge.
Base code inspired by:
This project is distributed under the MIT License. Please cite the associated paper if you use this code in your research.