
[ENH] Precompute data to massively accelerate training in GPU #1850


Open · wants to merge 17 commits into base: main

Conversation


@jobs-git jobs-git commented May 27, 2025

Proof of concept

https://drive.google.com/file/d/1go7wEMqcejRl7YecHYhIFAhO6YF3T0Kd/view?usp=sharing

Results on Training

with pre-compute: 53.662221 s
vanilla: 265.779447 s

that's roughly a 5x speedup (about 490% of the vanilla throughput), far from 2000%, but with this update developers can reduce cost and hardware requirements by roughly 4x!

Results on Hyperparameter Optimization

with pre-compute: 41.021749 s
vanilla: 209.937233 s

that's roughly a 5x speedup (about 510% of the vanilla throughput).

Description

Fixes: #1849 #1426 #1860
Partially Fixes: sktime/sktime#8278
Closes: #1846

Supersedes: #806

The primary issue here is that `__getitem__()` performs pre-processing that would typically be done before training actually starts.

As a result, the GPU is frequently idle, which slows down training completion.

It is well known that a GPU achieves higher throughput the busier it can be kept, but the pre-processing done in `__getitem__()` every time an item is retrieved severely limits this.

This commit ensures that pre-processing is performed prior to training. This is achieved as follows:

  1. setting `TimeSeriesDataset(..., precompute=True)` activates the call to the `__precompute__()` function inside `to_dataloader()`
  2. the `__precompute__()` function collects the pre-computed items from `__item_tensor__()` and stores them in a cache; it relies on the sampler to retrieve the data indices
  3. `__item_tensor__()` is the unmodified algorithm of the original `__getitem__()`, ensuring equivalent outcomes
  4. the new `__getitem__()` retrieves items from the cache in order, because the sampler has already been consumed once by `__precompute__()`, so relying on `idx` could yield a different index sequence

Note: TimeSeriesDataset defaults to the slow path, so the vanilla method can still be used (a minimal usage sketch follows below).
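A minimal usage sketch of the proposed flag; the dataset columns and lengths below are made up for illustration, and `precompute` is the new argument introduced by this PR, not part of the released API:

```python
import pandas as pd

from pytorch_forecasting import TimeSeriesDataSet

# toy data: one series with 100 time steps (illustrative only)
data = pd.DataFrame(
    {
        "time_idx": list(range(100)),
        "value": [float(i) for i in range(100)],
        "group": ["a"] * 100,
    }
)

dataset = TimeSeriesDataSet(
    data,
    time_idx="time_idx",
    target="value",
    group_ids=["group"],
    max_encoder_length=24,
    max_prediction_length=6,
    time_varying_unknown_reals=["value"],
    precompute=True,  # proposed flag; defaults to False (the original slow path)
)

# to_dataloader() triggers __precompute__() and fills the cache before training;
# batch_sampler and shuffle are overridden because TimeSynchronizedBatchSampler is forced
train_dataloader = dataset.to_dataloader(train=True, batch_size=64)
```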

### WIP help:
~~@fkiraly ~~
- [ ] Not sure where to get the batch indices if sampler=None, so I had to force the use of TimeSynchronizedBatchSampler. Guidance on this will be appreciated.

Resolution: when precompute=True in TimeSeriesDataset, TimeSynchronizedBatchSampler is used regardless of the sampler set in to_dataloader.

Caveats and Limitations

  • This feature stores the precomputed data in RAM, so enough RAM must be available, otherwise an out-of-memory error will occur (a rough sizing sketch follows after this list).

    Some form of super-batching could be used if precompute must still be used (though it is unclear how much that would help); however, that would be beyond the scope of this PR.

    Recommendation: use precompute=False to fall back to the original slow path.

  • For ultra-large-scale datasets, FSDP may be utilized via pytorch-lightning distributed computing, but that is also beyond the scope of this PR.

    Recommendation: use precompute=False to fall back to the original slow path.

  • Using precompute=True ignores two settings, batch_sampler and shuffle, since it sets the sampler to TimeSynchronizedBatchSampler and shuffle to False.

    Recommendation: use precompute=False to keep the original slow path in case you would like to set those two settings.

    Could this be implemented? Yes, of course, but it is beyond the scope of this PR. It would also make TimeSeriesDataset more complex to use when precompute=True is enabled, and would increase memory requirements.
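As a rough aid for the first caveat, a back-of-the-envelope sizing check; every number below is a placeholder rather than a measurement from this PR:

```python
# rough RAM estimate for the precompute cache (all numbers are placeholders)
n_samples = 500_000                    # number of (encoder, decoder) windows in the dataset
encoder_length, prediction_length = 24, 6
n_features = 20                        # real/categorical columns per time step
bytes_per_value = 4                    # float32

bytes_per_sample = (encoder_length + prediction_length) * n_features * bytes_per_value
total_gb = n_samples * bytes_per_sample / 1e9
print(f"approximate cache size: {total_gb:.1f} GB")  # ~1.2 GB for these numbers
```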

Checklist

  • Linked issues (if existing)
  • Amended changelog for large changes (and added myself there as contributor)
  • Added/modified tests
  • Used pre-commit hooks when committing to ensure that code is compliant with hooks. Install hooks with pre-commit install.
    To run hooks independent of commit, execute pre-commit run --all-files

@fkiraly fkiraly added the enhancement New feature or request label May 27, 2025
Collaborator

@fkiraly fkiraly left a comment

Nice!

FYI @phoeenniixx, @PranavBhatP, @xandie985 - is this something we need to take into account in v2? We should also think about adding profiling checks.

Code quality (linting) is failing; this can be fixed by using pre-commit or automated formatting. pytorch-forecasting does not have this in the docs (we should add it), but it is the same as in sktime:
https://www.sktime.net/en/stable/developer_guide/coding_standards.html

@fkiraly
Collaborator

fkiraly commented May 28, 2025

FYI @agobbifbk

@agobbifbk

Agreed, most of the time you can fit your data in memory, and we should include the precomputation possibility in the d2 layer. We should already have the correct indexes computed; it is just a matter of creating the tensors according to those indexes.

When testing tuning with pytorch-lightning, it is not able to perform hyperparameter optimization and reports: Failed to compute suggestion for learning rate because there are not enough points. Increase the loop iteration limits or the size of your dataset/dataloader.

When you say hyperparameter optimization, do you mean the learning rate using trainer.tune, or do you really mean the hyperparameters of the model?

@jobs-git
Author

> Agreed, most of the time you can fit your data in memory, and we should include the precomputation possibility in the d2 layer. We should already have the correct indexes computed; it is just a matter of creating the tensors according to those indexes.

> When testing tuning with pytorch-lightning, it is not able to perform hyperparameter optimization and reports: Failed to compute suggestion for learning rate because there are not enough points. Increase the loop iteration limits or the size of your dataset/dataloader.

> When you say hyperparameter optimization, do you mean the learning rate using trainer.tune, or do you really mean the hyperparameters of the model?

This is not an issue anymore; I had missed removing a fit call (used during my testing) that I placed before the Tuner.

@phoeenniixx
Contributor

phoeenniixx commented May 28, 2025

Hi @jobs-git, I see you are still facing some linting issues. May I suggest setting up pre-commit in your local repo? The process is similar to sktime, as mentioned by @fkiraly:
https://www.sktime.net/en/stable/developer_guide/coding_standards.html#setting-up-local-code-quality-checks

I would suggest setting up pre-commit locally and then checking whether your code quality fails using

pre-commit run --files <path-to-your-files>

This will reduce your effort considerably (it automatically fixes some issues, so you do not have to make those changes yourself).

@phoeenniixx
Contributor

This way you do not need to wait here and make changes based on the errors shown in the code-quality CI workflows :)


codecov bot commented May 28, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (main@01f97c4). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1850   +/-   ##
=======================================
  Coverage        ?   87.42%           
=======================================
  Files           ?       68           
  Lines           ?     6618           
  Branches        ?        0           
=======================================
  Hits            ?     5786           
  Misses          ?      832           
  Partials        ?        0           
Flag Coverage Δ
cpu 87.42% <100.00%> (?)
pytest 87.42% <100.00%> (?)


@jobs-git jobs-git changed the title from "[WIP] [BUG] fix precompute data to massively accelerate training in GPU" to "[WIP] [ENH] fix precompute data to massively accelerate training in GPU" May 30, 2025
@jobs-git jobs-git changed the title from "[WIP] [ENH] fix precompute data to massively accelerate training in GPU" to "[ENH] fix precompute data to massively accelerate training in GPU" Jun 1, 2025
@jobs-git jobs-git changed the title from "[ENH] fix precompute data to massively accelerate training in GPU" to "[ENH] Precompute data to massively accelerate training in GPU" Jun 1, 2025
@fkiraly
Collaborator

fkiraly commented Jun 3, 2025

is this ready for review?

@jobs-git
Author

jobs-git commented Jun 3, 2025

> is this ready for review?

Yes, please. Inputs are welcome.

Collaborator

@fkiraly fkiraly left a comment

Thanks - while we wait for reviews, can you kindly add tests? For TimeSeriesDataSet in isolation, and in integration with the networks?

@jobs-git jobs-git requested a review from fkiraly June 6, 2025 17:50
Returns:
tuple[dict[str, torch.Tensor], torch.Tensor]: x and y for model
"""
if self.precompute:
Collaborator

can you explain why this is correct if self.precompute?

The index idx is not used at all, so this looks wrong.

Author

It is being used inside __item_tensor__, which is just a rename of the original __getitem__; its input parameters are unchanged, including the fact that it accepts idx.

But the mechanism by which it gets idx is different.

Author

When self.precompute=True, __item_tensor__ receives the idx values first, so by the time __getitem__ receives idx it would already be a different set. Using it would change the actual behavior of the original __getitem__, so we do not use it; instead we follow the order from self.precompute_cache, which contains the correct/original order of idx from the first call.

When self.precompute=True, the order in which idx is obtained by __item_tensor__ is as follows:

  1. create the Dataloader by calling to_dataloader
  2. this calls __precompute__
  3. __precompute__ follows TimeSynchronizedBatchSampler to get idx
  4. the idx sequences are passed, in order, to __item_tensor__

Caveat:
So when the Dataloader generates its own idx from its own sampler, that idx is not used. If we used the Dataloader's idx we would NOT get the benefit of this PR, as that idx is generated in real time.

So, instead of retrieving from the cache, the effect would be that we are back to computing the original __getitem__ on each call, which is the slow path all over again, starving the GPU; that is precisely the source of the slowness.

However, if a user wishes to use the Dataloader's idx, the default simply skips precomputation and follows the original slow path, __getitem__ -> __item_tensor__, which is the same as the vanilla __getitem__ anyway.

Collaborator

> When self.precompute=True, __item_tensor__ receives the idx values first, so by the time __getitem__ receives idx it would already be a different set.

Sorry, I still do not get it. Can you explain which dunders are called externally, and which are only used internally? I thought that __getitem__ gets called, so when it does not use idx, the information is not passed on.

Author

@jobs-git jobs-git Jun 7, 2025

TLDR
Does this PR use idx from the Dataloader? YES and NO. If using the defaults (precompute=False), then YES. But if using the accelerated path (precompute=True), then NO.

EXPLANATIONS

  • Vanilla

    Dataloader -> pass idx -> __getitem__() -> use idx
  • PR - Revised vanilla - slow path (precompute=False)

    Dataloader -> pass idx -> __getitem__() -> pass idx ->
    __itemtensor__() -> use idx

    Notes:

    • __getitem__() is revised so it can switch between the vanilla and accelerated paths based on the precompute value; in this case it uses the vanilla path.
    • __itemtensor__() is just the renamed __getitem__(), so this is the usual calculation path when precompute=False

    Verdict:

    • idx from the Dataloader is used, as usual.
  • PR - Accelerated path (precompute=True)

    Two steps happen here:

    • a. Pre-computation routine

      to_dataloader() -> __precompute__() -> generates idx from TSDS ->
      loop over every idx -> pass each idx -> __itemtensor__() -> use idx ->
      save data to cache -> move to the next idx -> repeat

      This is essentially how the Dataloader uses __getitem__(): it calls __getitem__(), passes idx to it, and repeats until all idx values exhaust the data. What I did in this PR is use a for loop to retrieve the idx values in advance and exhaust the data, so we can pre-calculate everything without calling the Dataloader (see the sketch after this list).

    • b. Dataloader call

      Since we have cached the results of __itemtensor__() in advance, meaning __itemtensor__() has already pre-computed the data for each specific idx, we just retrieve them.

      Dataloader -> pass idx -> __getitem__() -> idx ignored -> use cache sequence

      Note:

      • notice that we do not need to call __itemtensor__() this time, since it has already done its job

      Verdict:

      • the Dataloader's idx is no longer essential, since another idx source has been used, specifically TimeSynchronizedBatchSampler
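To make the two paths above concrete, here is a runnable toy sketch. The real PR operates on TimeSeriesDataSet and returns tensors; the class and helper names below (PrecomputeDemo, index_order, _cursor) are illustrative assumptions rather than the PR's actual code:

```python
from typing import List


class PrecomputeDemo:
    """Toy stand-in for the dataset; it only illustrates the control flow above."""

    def __init__(self, precompute: bool = False):
        self.precompute = precompute
        self.precompute_cache: List[int] = []
        self._cursor = 0  # hypothetical position tracker, not necessarily what the PR uses

    def __item_tensor__(self, idx: int) -> int:
        # stands in for the original, expensive per-item pre-processing of __getitem__
        return idx * 10

    def __precompute__(self, index_order: List[int]) -> None:
        # walk the sampler-provided index order once and cache every computed item
        self.precompute_cache = [self.__item_tensor__(i) for i in index_order]
        self._cursor = 0

    def __getitem__(self, idx: int) -> int:
        if self.precompute:
            # accelerated path: serve cached items in the precomputed order and
            # ignore the idx handed over by the DataLoader
            item = self.precompute_cache[self._cursor]
            self._cursor += 1
            return item
        # slow path: identical to the vanilla behaviour
        return self.__item_tensor__(idx)


# precompute=True: to_dataloader() would call __precompute__() with the sampler's order
ds = PrecomputeDemo(precompute=True)
ds.__precompute__([2, 0, 3, 1])   # order fixed by, e.g., TimeSynchronizedBatchSampler
print([ds[i] for i in range(4)])  # -> [20, 0, 30, 10]; the DataLoader's idx is ignored
```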

Collaborator

@jobs-git, apologies, I still do not get it.

> Since we have cached the results of __itemtensor__() in advance, meaning __itemtensor__() has already pre-computed the data for each specific idx, we just retrieve them.

As far as I understand: you are not retrieving the expected result at idx. You are retrieving the results in sequence, irrespective of which idx gets queried. Therefore, this is not the correct logic for __getitem__.

Please let me know if you disagree.

@phoeenniixx, @PranavBhatP, @agobbifbk, @fnhirwa - can you perhaps also have a look? I also do not fully understand the explanation.

Author

@jobs-git jobs-git Jun 7, 2025

For us to be able to accelerate this (precompute=True), we need to query the idx values in advance (rather than waiting for the Dataloader to give them one by one) in order to perform the precomputation. This is what the PR does.

I am confused by what you mean by sequence: the idx values from TimeSynchronizedBatchSampler are still random when shuffle=True; we just saved the results sequentially, so retrieval is also sequential.

As mentioned previously, acceleration requires obtaining the idx values in advance; otherwise, we cannot perform the pre-computation.

Author

@jobs-git jobs-git Jun 7, 2025

@fkiraly I think what you would prefer implemented is obtaining the precomputed values from the real-time idx? That is what I would prefer too, until I saw that the algorithm seems to normalize based on a given idx.

That is when I realized we cannot accelerate this if we wait for the Dataloader, so we need to obtain the idx values in advance through a different mechanism. Hence this PR.

May I ask what the actual concern is if the accelerated path uses a different mechanism to obtain random idx values?

Of course, the exact training trajectory would likely be different, but a properly trained model should converge to the same or a similar outcome. If I presented prediction results, would that settle the concern about the accelerated path using a different random idx generator? Just let me know.

Collaborator

@fkiraly fkiraly left a comment

Thanks - can you explain the logic in the self.precompute branch of __getitem__? It does not look correct to me.


@fkiraly
Collaborator

fkiraly commented Jun 8, 2025

ok, I think I am starting to get it. @jobs-git, would it be possible to produce code for a basic usage example?

Then, for each call of __getitem__ and __itemtensor__, list which idx the different functions are called with, for both the precompute=False and precompute=True cases.

@jobs-git
Author

jobs-git commented Jun 8, 2025

> ok, I think I am starting to get it. @jobs-git, would it be possible to produce code for a basic usage example?

> Then, for each call of __getitem__ and __itemtensor__, list which idx the different functions are called with, for both the precompute=False and precompute=True cases.

Yes, of course, it is possible to add sample code. Regarding the term precompute, is that OK? I think it should have been preprocess, but I noticed preprocess is already in use elsewhere, so I did not use it, to avoid confusion.
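For illustration, here is a rough sketch of the kind of trace being asked for; the index values are made up and only show the shape of the answer:

```python
# toy 4-item dataset; all index sequences below are illustrative
sampler_order = [2, 0, 3, 1]     # e.g. order produced by TimeSynchronizedBatchSampler
dataloader_order = [0, 1, 2, 3]  # e.g. order produced by the DataLoader's own sampler

# precompute=False (slow path): __getitem__(idx) forwards each DataLoader idx to __item_tensor__(idx)
print("slow path:  __item_tensor__ called with", dataloader_order)

# precompute=True (accelerated path): __item_tensor__ is called once per sampler idx during
# __precompute__(); afterwards __getitem__ ignores the DataLoader idx and replays the cache
print("precompute: __item_tensor__ called with", sampler_order)
print("training:   __getitem__ receives", dataloader_order, "but serves cache order", sampler_order)
```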

Labels: enhancement (New feature or request)

Successfully merging this pull request may close these issues:

  • [BUG] Slow model training with GPU
  • [ENH] Improve performance by 2000% by directly batching without using dataloader
  • [BUG] Low GPU Utilization

4 participants