Merge branch 'develop' into feature/improve-dataloader-memory
HCookie authored Nov 15, 2024
2 parents 3b69b33 + d0a8866 commit 937943b
Showing 27 changed files with 808 additions and 338 deletions.
9 changes: 8 additions & 1 deletion CHANGELOG.md
@@ -11,6 +11,7 @@ Keep it human-readable, your future self will thank you!
## [Unreleased](https://github.com/ecmwf/anemoi-training/compare/0.2.2...HEAD)

### Fixed
- Rename loss_scaling to variable_loss_scaling [#138](https://github.com/ecmwf/anemoi-training/pull/138)
- Refactored callbacks. [#60](https://github.com/ecmwf/anemoi-training/pulls/60)
- Updated docs [#115](https://github.com/ecmwf/anemoi-training/pull/115)
- Fix enabling LearningRateMonitor [#119](https://github.com/ecmwf/anemoi-training/pull/119)
@@ -20,15 +21,20 @@ Keep it human-readable, your future self will thank you!
- Save entire config in mlflow
### Added
- Included more loss functions and allowed configuration [#70](https://github.com/ecmwf/anemoi-training/pull/70)
- Include option to use datashader and optimised asynchronous callbacks [#102](https://github.com/ecmwf/anemoi-training/pull/102)
- Fix that applies the metric_ranges in the post-processed variable space [#116](https://github.com/ecmwf/anemoi-training/pull/116)
- Allow updates to scalars [#137](https://github.com/ecmwf/anemoi-training/pulls/137)
- Add `without` subsetting in ScaleTensor
- Sub-hour datasets [#63](https://github.com/ecmwf/anemoi-training/pull/63)
- Add synchronisation workflow [#92](https://github.com/ecmwf/anemoi-training/pull/92)
- Feat: Anemoi Profiler compatible with mlflow and using Pytorch (Kineto) Profiler for memory report [#38](https://github.com/ecmwf/anemoi-training/pull/38/)

- New limited area config file added, limited_area.yaml. [#134](https://github.com/ecmwf/anemoi-training/pull/134/)
- New stretched grid config added, stretched_grid.yaml [#133](https://github.com/ecmwf/anemoi-training/pull/133)

### Changed
- Renamed frequency keys in callbacks configuration. [#118](https://github.com/ecmwf/anemoi-training/pull/118)
- Modified training configuration to support max_steps and tied lr iterations to max_steps by default [#67](https://github.com/ecmwf/anemoi-training/pull/67)
- Merged node & edge trainable feature callbacks into one. [#135](https://github.com/ecmwf/anemoi-training/pull/135)

## [0.2.2 - Maintenance: pin python <3.13](https://github.com/ecmwf/anemoi-training/compare/0.2.1...0.2.2) - 2024-10-28

@@ -107,6 +113,7 @@ Keep it human-readable, your future self will thank you!
- Updated configuration examples in documentation and corrected links - [#46](https://github.com/ecmwf/anemoi-training/pull/46)
- Remove credential prompt from mlflow login, replace with seed refresh token via web - [#78](https://github.com/ecmwf/anemoi-training/pull/78)
- Update CODEOWNERS
- Change how mlflow measures CPU Memory usage - [#94](https://github.com/ecmwf/anemoi-training/pull/94)

## [0.1.0 - Anemoi training - First release](https://github.com/ecmwf/anemoi-training/releases/tag/0.1.0) - 2024-08-16

22 changes: 21 additions & 1 deletion docs/modules/diagnostics.rst
@@ -51,12 +51,32 @@ parameters to plot, as well as the plotting frequency, and
asynchronicity.

Setting ``config.diagnostics.plot.asynchronous`` means that the model
training doesn't stop whilst the callbacks are being evaluated)
training doesn't stop whilst the callbacks are being evaluated. This is
useful for large models where the plotting can take a long time. The
plotting module uses asynchronous callbacks via `asyncio` and
`concurrent.futures.ThreadPoolExecutor` to handle plotting tasks without
blocking the main application. A dedicated event loop runs in a separate
background thread, allowing plotting tasks to be offloaded to worker
threads. This setup keeps the main thread responsive, handling
plot-related tasks asynchronously and efficiently in the background.
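
As an illustration of this pattern only (a minimal sketch; the names below
are not the actual anemoi-training functions or classes), the background
event loop and worker pool can be wired up roughly as follows:

.. code:: python

   import asyncio
   import threading
   from concurrent.futures import ThreadPoolExecutor

   # Worker pool for the blocking plotting work, plus a dedicated event
   # loop living in a background thread so the training loop never blocks.
   _executor = ThreadPoolExecutor(max_workers=1)
   _loop = asyncio.new_event_loop()
   threading.Thread(target=_loop.run_forever, daemon=True).start()


   async def _plot_async(plot_fn, *args):
       # Offload the blocking matplotlib/datashader call to a worker thread.
       await asyncio.get_running_loop().run_in_executor(_executor, plot_fn, *args)


   def submit_plot(plot_fn, *args) -> None:
       # Called from the training (main) thread; returns immediately while
       # the plot is produced in the background.
       asyncio.run_coroutine_threadsafe(_plot_async(plot_fn, *args), _loop)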

There is an additional flag in the plotting callbacks to control the
rendering method for geospatial plots, offering a trade-off between
performance and detail. When `datashader` is set to True, Datashader is
used for rendering, which accelerates plotting through efficient
hex-binning, particularly useful for large datasets. This approach can
produce smoother-looking plots due to the aggregation of data points. If
`datashader` is set to False, matplotlib.scatter is used, which provides
sharper and more detailed visuals but may be slower for large datasets.
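
For orientation only, the two rendering paths differ roughly as sketched
below; the helper name, column names and canvas size are assumptions for
this example and not the actual callback code:

.. code:: python

   import datashader as ds
   import datashader.transfer_functions as tf
   import matplotlib.pyplot as plt
   import pandas as pd


   def render(df: pd.DataFrame, use_datashader: bool):
       if use_datashader:
           # Datashader path: aggregate all points onto a fixed-size canvas,
           # then shade the aggregate -- fast even for very large datasets.
           canvas = ds.Canvas(plot_width=800, plot_height=400)
           agg = canvas.points(df, "lon", "lat", agg=ds.mean("value"))
           return tf.shade(agg)
       # Matplotlib path: draw every point individually -- sharper detail,
       # but increasingly slow as the number of points grows.
       fig, ax = plt.subplots()
       ax.scatter(df["lon"], df["lat"], c=df["value"], s=1)
       return fig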

**Note** - this asynchronous behaviour is only available for the
plotting callbacks.

.. code:: yaml

   plot:
     asynchronous: True # Whether to plot asynchronously
     datashader: True # Whether to use datashader for plotting (faster)
     frequency: # Frequency of the plotting
       batch: 750
       epoch: 5
2 changes: 1 addition & 1 deletion docs/modules/losses.rst
@@ -66,7 +66,7 @@ define whether to include them in the loss function by setting
Currently, the following scalars are available for use:

- ``variable``: Scale by the feature/variable weights as defined in the
config ``config.training.loss_scaling``.
config ``config.training.variable_loss_scaling``.
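
For example, the default training configuration touched in this commit
(``src/anemoi/training/config/training/default.yaml``) enables this scalar
as follows:

.. code:: yaml

   training_loss:
     _target_: anemoi.training.losses.mse.WeightedMSELoss
     scalars: ['variable']
     ignore_nans: False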

********************
Validation Metrics
4 changes: 2 additions & 2 deletions docs/user-guide/training.rst
@@ -172,8 +172,8 @@ by setting ``config.data.normaliser``, such that:

It is possible to change the weighting given to each of the variables in
the loss function by changing
``config.training.loss_scaling.pl.<pressure level variable>`` and
``config.training.loss_scaling.sfc.<surface variable>``.
``config.training.variable_loss_scaling.pl.<pressure level variable>``
and ``config.training.variable_loss_scaling.sfc.<surface variable>``.
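
A short sketch of the relevant block (structure taken from the default
training config in this commit; the particular variables and values are
illustrative only):

.. code:: yaml

   variable_loss_scaling:
     default: 1
     pl:
       q: 0.6    # per-variable weight on pressure levels
       t: 6
     sfc:
       2t: 3.5   # per-variable weight for surface fields
       10u: 0.1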

It is also possible to change the scaling given to the pressure levels
using ``config.training.pressure_level_scaler``. For almost all
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -41,9 +41,10 @@ dynamic = [ "version" ]

dependencies = [
"anemoi-datasets>=0.4",
"anemoi-graphs",
"anemoi-graphs>=0.4",
"anemoi-models>=0.3",
"anemoi-utils[provenance]>=0.4.4",
"datashader>=0.16.3",
"einops>=0.6.1",
"hydra-core>=1.3",
"matplotlib>=3.7.1",
4 changes: 2 additions & 2 deletions src/anemoi/training/config/diagnostics/plot/detailed.yaml
@@ -1,4 +1,5 @@
asynchronous: True # Whether to plot asynchronously
datashader: True # Choose which technique to use for plotting
frequency: # Frequency of the plotting
  batch: 750
  epoch: 5
@@ -24,8 +25,7 @@ precip_and_related_fields: [tp, cp]

callbacks:
  # Add plot callbacks here
  - _target_: anemoi.training.diagnostics.callbacks.plot.GraphNodeTrainableFeaturesPlot
  - _target_: anemoi.training.diagnostics.callbacks.plot.GraphEdgeTrainableFeaturesPlot
  - _target_: anemoi.training.diagnostics.callbacks.plot.GraphTrainableFeaturesPlot
    every_n_epochs: 5
  - _target_: anemoi.training.diagnostics.callbacks.plot.PlotLoss
    # group parameters by categories when visualizing contributions to the loss
@@ -24,8 +24,7 @@ precip_and_related_fields: [tp, cp]

callbacks:
  # Add plot callbacks here
  - _target_: anemoi.training.diagnostics.callbacks.plot.GraphNodeTrainableFeaturesPlot
  - _target_: anemoi.training.diagnostics.callbacks.plot.GraphEdgeTrainableFeaturesPlot
  - _target_: anemoi.training.diagnostics.callbacks.plot.GraphTrainableFeaturesPlot
    every_n_epochs: 5
  - _target_: anemoi.training.diagnostics.callbacks.plot.PlotLoss
    # group parameters by categories when visualizing contributions to the loss
1 change: 1 addition & 0 deletions src/anemoi/training/config/diagnostics/plot/simple.yaml
@@ -1,4 +1,5 @@
asynchronous: True # Whether to plot asynchronously
datashader: True # Choose which technique to use for plotting
frequency: # Frequency of the plotting
  batch: 750
  epoch: 10
60 changes: 60 additions & 0 deletions src/anemoi/training/config/graph/limited_area.yaml
@@ -0,0 +1,60 @@
---
overwrite: True

data: "data"
hidden: "hidden"

nodes:
  # Data nodes
  data:
    node_builder:
      _target_: anemoi.graphs.nodes.ZarrDatasetNodes
      dataset: ${dataloader.training.dataset}
    attributes: ${graph.attributes.nodes}
  # Hidden nodes
  hidden:
    node_builder:
      _target_: anemoi.graphs.nodes.LimitedAreaTriNodes # options: ZarrDatasetNodes, NPZFileNodes, TriNodes
      resolution: 5 # grid resolution for npz (o32, o48, ...)
      reference_node_name: ${graph.data}
      mask_attr_name: cutout

edges:
  # Encoder configuration
  - source_name: ${graph.data}
    target_name: ${graph.hidden}
    edge_builder:
      _target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
      cutoff_factor: 0.6 # only for cutoff method
    attributes: ${graph.attributes.edges}
  # Processor configuration
  - source_name: ${graph.hidden}
    target_name: ${graph.hidden}
    edge_builder:
      _target_: anemoi.graphs.edges.MultiScaleEdges
      x_hops: 1
    attributes: ${graph.attributes.edges}
  # Decoder configuration
  - source_name: ${graph.hidden}
    target_name: ${graph.data}
    target_mask_attr_name: cutout
    edge_builder:
      _target_: anemoi.graphs.edges.KNNEdges # options: KNNEdges, CutOffEdges
      num_nearest_neighbours: 3 # only for knn method
    attributes: ${graph.attributes.edges}


attributes:
  nodes:
    area_weight:
      _target_: anemoi.graphs.nodes.attributes.AreaWeights # options: Area, Uniform
      norm: unit-max # options: l1, l2, unit-max, unit-sum, unit-std
    cutout:
      _target_: anemoi.graphs.nodes.attributes.CutOutMask
  edges:
    edge_length:
      _target_: anemoi.graphs.edges.attributes.EdgeLength
      norm: unit-std
    edge_dirs:
      _target_: anemoi.graphs.edges.attributes.EdgeDirection
      norm: unit-std
63 changes: 63 additions & 0 deletions src/anemoi/training/config/graph/stretched_grid.yaml
@@ -0,0 +1,63 @@
# Stretched grid graph config intended to be used with a cutout dataset.
# The stretched mesh resolution used here is intended for o96 global resolution with 10km
# limited area resolution.
overwrite: False

data: "data"
hidden: "hidden"

nodes:
  data:
    node_builder:
      _target_: anemoi.graphs.nodes.ZarrDatasetNodes
      dataset: ${dataloader.training.dataset}
    attributes:
      area_weight:
        _target_: anemoi.graphs.nodes.attributes.AreaWeights
        norm: unit-max
      cutout:
        _target_: anemoi.graphs.nodes.attributes.CutOutMask
  hidden:
    node_builder:
      _target_: anemoi.graphs.nodes.StretchedTriNodes
      lam_resolution: 8
      global_resolution: 5
      reference_node_name: ${graph.data}
      mask_attr_name: cutout
      margin_radius_km: 11
    attributes:
      area_weights:
        _target_: anemoi.graphs.nodes.attributes.AreaWeights
        norm: unit-max

edges:
  # Encoder
  - source_name: ${graph.data}
    target_name: ${graph.hidden}
    edge_builder:
      _target_: anemoi.graphs.edges.KNNEdges
      num_nearest_neighbours: 12
    attributes: ${graph.attributes.edges}
  # Processor
  - source_name: ${graph.hidden}
    target_name: ${graph.hidden}
    edge_builder:
      _target_: anemoi.graphs.edges.MultiScaleEdges
      x_hops: 1
    attributes: ${graph.attributes.edges}
  # Decoder
  - source_name: ${graph.hidden}
    target_name: ${graph.data}
    edge_builder:
      _target_: anemoi.graphs.edges.KNNEdges
      num_nearest_neighbours: 3
    attributes: ${graph.attributes.edges}

attributes:
  edges:
    edge_length:
      _target_: anemoi.graphs.edges.attributes.EdgeLength
      norm: unit-max
    edge_dirs:
      _target_: anemoi.graphs.edges.attributes.EdgeDirection
      norm: unit-std
7 changes: 5 additions & 2 deletions src/anemoi/training/config/training/default.yaml
Expand Up @@ -46,7 +46,8 @@ training_loss:
  # loss class to initialise
  _target_: anemoi.training.losses.mse.WeightedMSELoss
  # Scalars to include in loss calculation
  # Available scalars include, 'variable'
  # Available scalars include:
  # - 'variable': See `variable_loss_scaling` for more information
  scalars: ['variable']
  ignore_nans: False

@@ -85,7 +86,9 @@ lr:
  # in order to keep a constant global_lr
  # global_lr = local_lr * num_gpus_per_node * num_nodes / gpus_per_model

loss_scaling:
# Variable loss scaling
# 'variable' must be included in `scalars` in the losses for this to be applied.
variable_loss_scaling:
  default: 1
  pl:
    q: 0.6 #1
3 changes: 1 addition & 2 deletions src/anemoi/training/data/datamodule.py
@@ -114,8 +114,7 @@ def ds_train(self) -> NativeGridDataset:

    @cached_property
    def ds_valid(self) -> NativeGridDataset:
        r = self.rollout
        r = max(r, self.config.dataloader.get("validation_rollout", 1))
        r = max(self.rollout, self.config.dataloader.get("validation_rollout", 1))

        assert self.config.dataloader.training.end < self.config.dataloader.validation.start, (
            f"Training end date {self.config.dataloader.training.end} is not before"
37 changes: 17 additions & 20 deletions src/anemoi/training/diagnostics/callbacks/__init__.py
@@ -56,8 +56,8 @@ def nestedget(conf: DictConfig, key: str, default: Any) -> Any:
]


def _get_checkpoint_callback(config: DictConfig) -> list[AnemoiCheckpoint] | None:
"""Get checkpointing callback."""
def _get_checkpoint_callback(config: DictConfig) -> list[AnemoiCheckpoint]:
"""Get checkpointing callbacks."""
if not config.diagnostics.get("enable_checkpointing", True):
return []

@@ -89,6 +89,7 @@ def _get_checkpoint_callback(config: DictConfig) -> list[AnemoiCheckpoint] | None:
n_saved,
)

checkpoint_callbacks = []
if not config.diagnostics.profiler:
for save_key, (
name,
@@ -97,29 +98,27 @@ def _get_checkpoint_callback(config: DictConfig) -> list[AnemoiCheckpoint] | None:
) in ckpt_frequency_save_dict.items():
if save_frequency is not None:
LOGGER.debug("Checkpoint callback at %s = %s ...", save_key, save_frequency)
return (
checkpoint_callbacks.append(
# save_top_k: the save_top_k flag can either save the best or the last k checkpoints
# depending on the monitor flag on ModelCheckpoint.
# See https://lightning.ai/docs/pytorch/stable/common/checkpointing_intermediate.html for reference
[
AnemoiCheckpoint(
config=config,
filename=name,
save_last=True,
**{save_key: save_frequency},
# if save_top_k == k, last k models saved; if save_top_k == -1, all models are saved
save_top_k=save_n_models,
monitor="step",
mode="max",
**checkpoint_settings,
),
]
AnemoiCheckpoint(
config=config,
filename=name,
save_last=True,
**{save_key: save_frequency},
# if save_top_k == k, last k models saved; if save_top_k == -1, all models are saved
save_top_k=save_n_models,
monitor="step",
mode="max",
**checkpoint_settings,
),
)
LOGGER.debug("Not setting up a checkpoint callback with %s", save_key)
else:
# the tensorboard logger + pytorch profiler cause pickling errors when writing checkpoints
LOGGER.warning("Profiling is enabled - will not write any training or inference model checkpoints!")
return None
return checkpoint_callbacks


def _get_config_enabled_callbacks(config: DictConfig) -> list[Callback]:
@@ -180,9 +179,7 @@ def get_callbacks(config: DictConfig) -> list[Callback]:
trainer_callbacks: list[Callback] = []

# Get Checkpoint callback
checkpoint_callback = _get_checkpoint_callback(config)
if checkpoint_callback is not None:
trainer_callbacks.extend(checkpoint_callback)
trainer_callbacks.extend(_get_checkpoint_callback(config))

# Base callbacks
trainer_callbacks.extend(