Merge branch 'develop' into feature/improve-dataloader-memory
HCookie authored Nov 15, 2024
2 parents 3b69b33 + d0a8866 commit 937943b
Showing 27 changed files with 808 additions and 338 deletions.
9 changes: 8 additions & 1 deletion CHANGELOG.md
@@ -11,6 +11,7 @@ Keep it human-readable, your future self will thank you!
## [Unreleased](https://github.com/ecmwf/anemoi-training/compare/0.2.2...HEAD)

### Fixed
- Rename loss_scaling to variable_loss_scaling [#138](https://github.com/ecmwf/anemoi-training/pull/138)
- Refactored callbacks. [#60](https://github.com/ecmwf/anemoi-training/pulls/60)
- Updated docs [#115](https://github.com/ecmwf/anemoi-training/pull/115)
- Fix enabling LearningRateMonitor [#119](https://github.com/ecmwf/anemoi-training/pull/119)
@@ -20,15 +21,20 @@ Keep it human-readable, your future self will thank you!
- Save entire config in mlflow
### Added
- Included more loss functions and allowed configuration [#70](https://github.com/ecmwf/anemoi-training/pull/70)
- Include option to use datashader and optimised asynchronous callbacks [#102](https://github.com/ecmwf/anemoi-training/pull/102)
- Fix that applies the metric_ranges in the post-processed variable space [#116](https://github.com/ecmwf/anemoi-training/pull/116)
- Allow updates to scalars [#137](https://github.com/ecmwf/anemoi-training/pulls/137)
- Add `without` subsetting in ScaleTensor
- Sub-hour datasets [#63](https://github.com/ecmwf/anemoi-training/pull/63)
- Add synchronisation workflow [#92](https://github.com/ecmwf/anemoi-training/pull/92)
- Feat: Anemoi Profiler compatible with mlflow and using Pytorch (Kineto) Profiler for memory report [#38](https://github.com/ecmwf/anemoi-training/pull/38/)

- New limited area config file added, limited_area.yaml. [#134](https://github.com/ecmwf/anemoi-training/pull/134/)
- New stretched grid config added, stretched_grid.yaml [#133](https://github.com/ecmwf/anemoi-training/pull/133)

### Changed
- Renamed frequency keys in callbacks configuration. [#118](https://github.com/ecmwf/anemoi-training/pull/118)
- Modified training configuration to support max_steps and tied lr iterations to max_steps by default [#67](https://github.com/ecmwf/anemoi-training/pull/67)
- Merged node & edge trainable feature callbacks into one. [#135](https://github.com/ecmwf/anemoi-training/pull/135)

## [0.2.2 - Maintenance: pin python <3.13](https://github.com/ecmwf/anemoi-training/compare/0.2.1...0.2.2) - 2024-10-28

@@ -107,6 +113,7 @@ Keep it human-readable, your future self will thank you!
- Updated configuration examples in documentation and corrected links - [#46](https://github.com/ecmwf/anemoi-training/pull/46)
- Remove credential prompt from mlflow login, replace with seed refresh token via web - [#78](https://github.com/ecmwf/anemoi-training/pull/78)
- Update CODEOWNERS
- Change how mlflow measures CPU Memory usage - [#94](https://github.com/ecmwf/anemoi-training/pull/94)

## [0.1.0 - Anemoi training - First release](https://github.com/ecmwf/anemoi-training/releases/tag/0.1.0) - 2024-08-16

22 changes: 21 additions & 1 deletion docs/modules/diagnostics.rst
@@ -51,12 +51,32 @@ parameters to plot, as well as the plotting frequency, and
asynchronicity.

Setting ``config.diagnostics.plot.asynchronous`` means that the model
training doesn't stop whilst the callbacks are being evaluated)
training doesn't stop whilst the callbacks are being evaluated. This is
useful for large models where the plotting can take a long time. The
plotting module uses asynchronous callbacks via `asyncio` and
`concurrent.futures.ThreadPoolExecutor` to handle plotting tasks without
blocking the main application. A dedicated event loop runs in a separate
background thread, allowing plotting tasks to be offloaded to worker
threads. This setup keeps the main thread responsive, handling
plot-related tasks asynchronously and efficiently in the background.
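
As an illustration of this pattern only (a minimal sketch; the names below
are not the actual anemoi-training functions or classes), the background
event loop and worker pool can be wired up roughly as follows:

.. code:: python

   import asyncio
   import threading
   from concurrent.futures import ThreadPoolExecutor

   # Worker pool for the blocking plotting work, plus a dedicated event
   # loop living in a background thread so the training loop never blocks.
   _executor = ThreadPoolExecutor(max_workers=1)
   _loop = asyncio.new_event_loop()
   threading.Thread(target=_loop.run_forever, daemon=True).start()


   async def _plot_async(plot_fn, *args):
       # Offload the blocking matplotlib/datashader call to a worker thread.
       await asyncio.get_running_loop().run_in_executor(_executor, plot_fn, *args)


   def submit_plot(plot_fn, *args) -> None:
       # Called from the training (main) thread; returns immediately while
       # the plot is produced in the background.
       asyncio.run_coroutine_threadsafe(_plot_async(plot_fn, *args), _loop)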

There is an additional flag in the plotting callbacks to control the
rendering method for geospatial plots, offering a trade-off between
performance and detail. When `datashader` is set to True, Datashader is
used for rendering, which accelerates plotting through efficient
hex-binning, particularly useful for large datasets. This approach can
produce smoother-looking plots due to the aggregation of data points. If
`datashader` is set to False, matplotlib.scatter is used, which provides
sharper and more detailed visuals but may be slower for large datasets.
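
For orientation only, the two rendering paths differ roughly as sketched
below; the helper name, column names and canvas size are assumptions for
this example and not the actual callback code:

.. code:: python

   import datashader as ds
   import datashader.transfer_functions as tf
   import matplotlib.pyplot as plt
   import pandas as pd


   def render(df: pd.DataFrame, use_datashader: bool):
       if use_datashader:
           # Datashader path: aggregate all points onto a fixed-size canvas,
           # then shade the aggregate -- fast even for very large datasets.
           canvas = ds.Canvas(plot_width=800, plot_height=400)
           agg = canvas.points(df, "lon", "lat", agg=ds.mean("value"))
           return tf.shade(agg)
       # Matplotlib path: draw every point individually -- sharper detail,
       # but increasingly slow as the number of points grows.
       fig, ax = plt.subplots()
       ax.scatter(df["lon"], df["lat"], c=df["value"], s=1)
       return fig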

**Note** - this asynchronous behaviour is only available for the
plotting callbacks.

.. code:: yaml

   plot:
     asynchronous: True # Whether to plot asynchronously
     datashader: True # Whether to use datashader for plotting (faster)
     frequency: # Frequency of the plotting
       batch: 750
       epoch: 5
2 changes: 1 addition & 1 deletion docs/modules/losses.rst
@@ -66,7 +66,7 @@ define whether to include them in the loss function by setting
Currently, the following scalars are available for use:

- ``variable``: Scale by the feature/variable weights as defined in the
config ``config.training.loss_scaling``.
config ``config.training.variable_loss_scaling``.
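
For example, the default training configuration touched in this commit
(``src/anemoi/training/config/training/default.yaml``) enables this scalar
as follows:

.. code:: yaml

   training_loss:
     _target_: anemoi.training.losses.mse.WeightedMSELoss
     scalars: ['variable']
     ignore_nans: False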

********************
Validation Metrics
4 changes: 2 additions & 2 deletions docs/user-guide/training.rst
@@ -172,8 +172,8 @@ by setting ``config.data.normaliser``, such that:

It is possible to change the weighting given to each of the variables in
the loss function by changing
``config.training.loss_scaling.pl.<pressure level variable>`` and
``config.training.loss_scaling.sfc.<surface variable>``.
``config.training.variable_loss_scaling.pl.<pressure level variable>``
and ``config.training.variable_loss_scaling.sfc.<surface variable>``.
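
A short sketch of the relevant block (structure taken from the default
training config in this commit; the particular variables and values are
illustrative only):

.. code:: yaml

   variable_loss_scaling:
     default: 1
     pl:
       q: 0.6    # per-variable weight on pressure levels
       t: 6
     sfc:
       2t: 3.5   # per-variable weight for surface fields
       10u: 0.1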

It is also possible to change the scaling given to the pressure levels
using ``config.training.pressure_level_scaler``. For almost all
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -41,9 +41,10 @@ dynamic = [ "version" ]

dependencies = [
"anemoi-datasets>=0.4",
"anemoi-graphs",
"anemoi-graphs>=0.4",
"anemoi-models>=0.3",
"anemoi-utils[provenance]>=0.4.4",
"datashader>=0.16.3",
"einops>=0.6.1",
"hydra-core>=1.3",
"matplotlib>=3.7.1",
4 changes: 2 additions & 2 deletions src/anemoi/training/config/diagnostics/plot/detailed.yaml
@@ -1,4 +1,5 @@
asynchronous: True # Whether to plot asynchronously
datashader: True # Choose which technique to use for plotting
frequency: # Frequency of the plotting
  batch: 750
  epoch: 5
@@ -24,8 +25,7 @@ precip_and_related_fields: [tp, cp]

callbacks:
  # Add plot callbacks here
  - _target_: anemoi.training.diagnostics.callbacks.plot.GraphNodeTrainableFeaturesPlot
  - _target_: anemoi.training.diagnostics.callbacks.plot.GraphEdgeTrainableFeaturesPlot
  - _target_: anemoi.training.diagnostics.callbacks.plot.GraphTrainableFeaturesPlot
    every_n_epochs: 5
  - _target_: anemoi.training.diagnostics.callbacks.plot.PlotLoss
    # group parameters by categories when visualizing contributions to the loss
@@ -24,8 +24,7 @@ precip_and_related_fields: [tp, cp]

callbacks:
  # Add plot callbacks here
  - _target_: anemoi.training.diagnostics.callbacks.plot.GraphNodeTrainableFeaturesPlot
  - _target_: anemoi.training.diagnostics.callbacks.plot.GraphEdgeTrainableFeaturesPlot
  - _target_: anemoi.training.diagnostics.callbacks.plot.GraphTrainableFeaturesPlot
    every_n_epochs: 5
  - _target_: anemoi.training.diagnostics.callbacks.plot.PlotLoss
    # group parameters by categories when visualizing contributions to the loss
1 change: 1 addition & 0 deletions src/anemoi/training/config/diagnostics/plot/simple.yaml
@@ -1,4 +1,5 @@
asynchronous: True # Whether to plot asynchronously
datashader: True # Choose which technique to use for plotting
frequency: # Frequency of the plotting
  batch: 750
  epoch: 10
60 changes: 60 additions & 0 deletions src/anemoi/training/config/graph/limited_area.yaml
@@ -0,0 +1,60 @@
---
overwrite: True

data: "data"
hidden: "hidden"

nodes:
  # Data nodes
  data:
    node_builder:
      _target_: anemoi.graphs.nodes.ZarrDatasetNodes
      dataset: ${dataloader.training.dataset}
    attributes: ${graph.attributes.nodes}
  # Hidden nodes
  hidden:
    node_builder:
      _target_: anemoi.graphs.nodes.LimitedAreaTriNodes # options: ZarrDatasetNodes, NPZFileNodes, TriNodes
      resolution: 5 # grid resolution for npz (o32, o48, ...)
      reference_node_name: ${graph.data}
      mask_attr_name: cutout

edges:
  # Encoder configuration
  - source_name: ${graph.data}
    target_name: ${graph.hidden}
    edge_builder:
      _target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
      cutoff_factor: 0.6 # only for cutoff method
    attributes: ${graph.attributes.edges}
  # Processor configuration
  - source_name: ${graph.hidden}
    target_name: ${graph.hidden}
    edge_builder:
      _target_: anemoi.graphs.edges.MultiScaleEdges
      x_hops: 1
    attributes: ${graph.attributes.edges}
  # Decoder configuration
  - source_name: ${graph.hidden}
    target_name: ${graph.data}
    target_mask_attr_name: cutout
    edge_builder:
      _target_: anemoi.graphs.edges.KNNEdges # options: KNNEdges, CutOffEdges
      num_nearest_neighbours: 3 # only for knn method
    attributes: ${graph.attributes.edges}


attributes:
  nodes:
    area_weight:
      _target_: anemoi.graphs.nodes.attributes.AreaWeights # options: Area, Uniform
      norm: unit-max # options: l1, l2, unit-max, unit-sum, unit-std
    cutout:
      _target_: anemoi.graphs.nodes.attributes.CutOutMask
  edges:
    edge_length:
      _target_: anemoi.graphs.edges.attributes.EdgeLength
      norm: unit-std
    edge_dirs:
      _target_: anemoi.graphs.edges.attributes.EdgeDirection
      norm: unit-std
63 changes: 63 additions & 0 deletions src/anemoi/training/config/graph/stretched_grid.yaml
@@ -0,0 +1,63 @@
# Stretched grid graph config intended to be used with a cutout dataset.
# The stretched mesh resolution used here is intended for o96 global resolution with 10km
# limited area resolution.
overwrite: False

data: "data"
hidden: "hidden"

nodes:
  data:
    node_builder:
      _target_: anemoi.graphs.nodes.ZarrDatasetNodes
      dataset: ${dataloader.training.dataset}
    attributes:
      area_weight:
        _target_: anemoi.graphs.nodes.attributes.AreaWeights
        norm: unit-max
      cutout:
        _target_: anemoi.graphs.nodes.attributes.CutOutMask
  hidden:
    node_builder:
      _target_: anemoi.graphs.nodes.StretchedTriNodes
      lam_resolution: 8
      global_resolution: 5
      reference_node_name: ${graph.data}
      mask_attr_name: cutout
      margin_radius_km: 11
    attributes:
      area_weights:
        _target_: anemoi.graphs.nodes.attributes.AreaWeights
        norm: unit-max

edges:
  # Encoder
  - source_name: ${graph.data}
    target_name: ${graph.hidden}
    edge_builder:
      _target_: anemoi.graphs.edges.KNNEdges
      num_nearest_neighbours: 12
    attributes: ${graph.attributes.edges}
  # Processor
  - source_name: ${graph.hidden}
    target_name: ${graph.hidden}
    edge_builder:
      _target_: anemoi.graphs.edges.MultiScaleEdges
      x_hops: 1
    attributes: ${graph.attributes.edges}
  # Decoder
  - source_name: ${graph.hidden}
    target_name: ${graph.data}
    edge_builder:
      _target_: anemoi.graphs.edges.KNNEdges
      num_nearest_neighbours: 3
    attributes: ${graph.attributes.edges}

attributes:
  edges:
    edge_length:
      _target_: anemoi.graphs.edges.attributes.EdgeLength
      norm: unit-max
    edge_dirs:
      _target_: anemoi.graphs.edges.attributes.EdgeDirection
      norm: unit-std
7 changes: 5 additions & 2 deletions src/anemoi/training/config/training/default.yaml
Expand Up @@ -46,7 +46,8 @@ training_loss:
  # loss class to initialise
  _target_: anemoi.training.losses.mse.WeightedMSELoss
  # Scalars to include in loss calculation
  # Available scalars include, 'variable'
  # Available scalars include:
  # - 'variable': See `variable_loss_scaling` for more information
  scalars: ['variable']
  ignore_nans: False

@@ -85,7 +86,9 @@ lr:
  # in order to keep a constant global_lr
  # global_lr = local_lr * num_gpus_per_node * num_nodes / gpus_per_model

loss_scaling:
# Variable loss scaling
# 'variable' must be included in `scalars` in the losses for this to be applied.
variable_loss_scaling:
  default: 1
  pl:
    q: 0.6 #1
3 changes: 1 addition & 2 deletions src/anemoi/training/data/datamodule.py
@@ -114,8 +114,7 @@ def ds_train(self) -> NativeGridDataset:

    @cached_property
    def ds_valid(self) -> NativeGridDataset:
        r = self.rollout
        r = max(r, self.config.dataloader.get("validation_rollout", 1))
        r = max(self.rollout, self.config.dataloader.get("validation_rollout", 1))

        assert self.config.dataloader.training.end < self.config.dataloader.validation.start, (
            f"Training end date {self.config.dataloader.training.end} is not before"
37 changes: 17 additions & 20 deletions src/anemoi/training/diagnostics/callbacks/__init__.py
@@ -56,8 +56,8 @@ def nestedget(conf: DictConfig, key: str, default: Any) -> Any:
]


def _get_checkpoint_callback(config: DictConfig) -> list[AnemoiCheckpoint] | None:
"""Get checkpointing callback."""
def _get_checkpoint_callback(config: DictConfig) -> list[AnemoiCheckpoint]:
"""Get checkpointing callbacks."""
if not config.diagnostics.get("enable_checkpointing", True):
return []

@@ -89,6 +89,7 @@ def _get_checkpoint_callback(config: DictConfig) -> list[AnemoiCheckpoint] | None:
n_saved,
)

checkpoint_callbacks = []
if not config.diagnostics.profiler:
for save_key, (
name,
@@ -97,29 +98,27 @@ def _get_checkpoint_callback(config: DictConfig) -> list[AnemoiCheckpoint] | None:
) in ckpt_frequency_save_dict.items():
if save_frequency is not None:
LOGGER.debug("Checkpoint callback at %s = %s ...", save_key, save_frequency)
return (
checkpoint_callbacks.append(
# save_top_k: the save_top_k flag can either save the best or the last k checkpoints
# depending on the monitor flag on ModelCheckpoint.
# See https://lightning.ai/docs/pytorch/stable/common/checkpointing_intermediate.html for reference
[
AnemoiCheckpoint(
config=config,
filename=name,
save_last=True,
**{save_key: save_frequency},
# if save_top_k == k, last k models saved; if save_top_k == -1, all models are saved
save_top_k=save_n_models,
monitor="step",
mode="max",
**checkpoint_settings,
),
]
AnemoiCheckpoint(
config=config,
filename=name,
save_last=True,
**{save_key: save_frequency},
# if save_top_k == k, last k models saved; if save_top_k == -1, all models are saved
save_top_k=save_n_models,
monitor="step",
mode="max",
**checkpoint_settings,
),
)
LOGGER.debug("Not setting up a checkpoint callback with %s", save_key)
else:
# the tensorboard logger + pytorch profiler cause pickling errors when writing checkpoints
LOGGER.warning("Profiling is enabled - will not write any training or inference model checkpoints!")
return None
return checkpoint_callbacks


def _get_config_enabled_callbacks(config: DictConfig) -> list[Callback]:
@@ -180,9 +179,7 @@ def get_callbacks(config: DictConfig) -> list[Callback]:
trainer_callbacks: list[Callback] = []

# Get Checkpoint callback
checkpoint_callback = _get_checkpoint_callback(config)
if checkpoint_callback is not None:
trainer_callbacks.extend(checkpoint_callback)
trainer_callbacks.extend(_get_checkpoint_callback(config))

# Base callbacks
trainer_callbacks.extend(