[ENH] Precompute data to massively accelerate training in GPU #1850
Conversation
The primary issue here is that `__getitem__()` performs pre-processing, which is typically conducted before actually training. As a result, the GPU becomes frequently idle, resulting in slower training completion. It is known that GPUs typically achieve higher throughput the more they can be kept busy, but the pre-processing done in `__getitem__()` each time an item is retrieved massively impacts this. This commit ensures that pre-processing is performed prior to training. This is achieved by the following:

1. calls to `to_dataloader()` will also call the `precompute()` function
2. the `precompute()` function handles the collection of pre-computed items from `__precompute__getitem__()` and stores them in a cache; this function relies on the sampler to be able to retrieve data indexes
3. the `__precompute__getitem__()` is the unmodified algorithm of the original `__getitem__()`, to ensure an equivalent outcome
4. the new `__getitem__()` retrieves items from the cache in order, because relying on `idx` may result in a different index sequence due to the first sampler call from `precompute()`
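To make the mechanism above concrete, here is a minimal, hedged sketch of the precompute-and-cache pattern being described. The class and method names (`PrecomputeSketchDataset`, `_item_tensor`, `run_precompute`) are illustrative stand-ins, not the PR's actual identifiers:

```python
# Minimal sketch of the precompute-and-cache pattern described above.
# All names here are simplified assumptions, not pytorch-forecasting's code.
import torch
from torch.utils.data import Dataset


class PrecomputeSketchDataset(Dataset):
    def __init__(self, raw_items, precompute=False):
        self.raw_items = raw_items
        self.precompute = precompute
        self.precompute_cache = []
        self._cursor = 0

    def _item_tensor(self, idx):
        # Stand-in for the original __getitem__: expensive per-item preprocessing.
        x = torch.as_tensor(self.raw_items[idx], dtype=torch.float32)
        return (x - x.mean()) / (x.std() + 1e-8)

    def run_precompute(self, index_order):
        # index_order plays the role of the idx sequence provided by the sampler.
        self.precompute_cache = [self._item_tensor(i) for i in index_order]
        self._cursor = 0

    def __len__(self):
        return len(self.raw_items)

    def __getitem__(self, idx):
        if self.precompute:
            # Fast path: replay cached items in the precomputed order,
            # ignoring the idx supplied by the DataLoader.
            item = self.precompute_cache[self._cursor]
            self._cursor = (self._cursor + 1) % len(self.precompute_cache)
            return item
        # Slow path: preprocess on every access (original behaviour).
        return self._item_tensor(idx)
```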
Nice!
FYI @phoeenniixx, @PranavBhatP, @xandie985 - is this something we need to take into account in v2? We should also think about adding profiling checks.
Code quality (linting) is failing; this can be fixed by using pre-commit or automated formatting. The pytorch-forecasting docs do not cover this yet (we should add it), but it is the same as in sktime:
https://www.sktime.net/en/stable/developer_guide/coding_standards.html
FYI @agobbifbk
Agree, most of the time you can fit your data in memory, and we should include the precomputation possibility in the d2 layer. We should already have the correct indexes computed; it is just a matter of creating the tensors according to those indexes.
This is not an issue anymore; I missed removing the fit function (used during my testing) that I had placed prior to the Tuner.
The fast path can be activated by enabling `precompute=True` in `to_dataloader`:

```python
.to_dataloader(..., precompute=True)
```
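For context, a hedged usage sketch: `training` and `validation` stand for `TimeSeriesDataSet` instances built in the usual way (they are assumptions here), and only the `precompute` flag is the new piece proposed by this PR:

```python
# Hedged usage sketch: `training` and `validation` are assumed, pre-built
# TimeSeriesDataSet objects; `precompute=True` is the flag proposed in this PR.
train_dataloader = training.to_dataloader(
    train=True,
    batch_size=128,
    num_workers=0,
    precompute=True,   # proposed fast path: preprocess everything up front
)
val_dataloader = validation.to_dataloader(
    train=False,
    batch_size=128,
    precompute=False,  # default: original slow path
)
```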
Hi @jobs-git, I see you are still facing some linting issues. May I suggest you try setting up pre-commit in your local repo? This will reduce your effort very much (it automatically solves some issues, not requiring you to make changes by hand), and you will not need to wait here and make changes based on the errors shown in the code-quality CI workflows :)
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
## main #1850 +/- ##
=======================================
Coverage ? 87.42%
=======================================
Files ? 68
Lines ? 6618
Branches ? 0
=======================================
Hits ? 5786
Misses ? 832
Partials ? 0
Is this ready for review?
Yes, please. Inputs are welcome.
Thanks - while we wait for reviews, can you kindly add tests? For `TimeSeriesDataSet` in isolation, and in integration with the networks?
    Returns:
        tuple[dict[str, torch.Tensor], torch.Tensor]: x and y for model
    """
    if self.precompute:
Can you explain why this is correct if `self.precompute`? The index `idx` is not used at all, so this looks wrong.
It is being used under `__item_tensor__`, which is just a rename of the original `__getitem__`; its input params are not changed, including it accepting the `idx`. But the mechanism by which it gets `idx` is different.
When `self.precompute=True`, `__item_tensor__` gets the `idx` first, so when this happens, the second time we receive `idx` in `__getitem__` it will now be a different set. This would change the actual behavior of the original `__getitem__`, so we don't use it; instead, we follow the order from `self.precompute_cache`, which contains the correct/original order of `idx` from the first call.

When `self.precompute=True`, the order in which the `idx` gets obtained by `__item_tensor__` is as follows:
- create a `Dataloader` by calling `to_dataloader`
- this calls `__precompute__`
- `__precompute__` follows `TimeSynchronizedBatchSampler` to get `idx`
- passes the `idx` sequences first to `__item_tensor__`
Caveat:

So when the Dataloader generates its own `idx` from its own `Sampler`, it is not used. If we used the Dataloader's `idx`, we would NOT get the benefit of this PR, as that `idx` is generated in real time, I guess. So, instead of retrieving from the `cache`, the effect is that we would be back to computing the original `__getitem__` on each call, which is the slow path all over again, starving the GPU; technically, this is the source of the slowness.

However, if a user wishes to use the `idx` of the `Dataloader`, the default will just skip precomputation and take the original slow path from `__getitem__` -> `__item_tensor__`, which is the same vanilla `__getitem__` anyway.
> When `self.precompute=True`, `__item_tensor__` gets the `idx` first, so when this happens, the second time we receive `idx` in `__getitem__` it will now be a different set.

Sorry, I still do not get it. Can you explain which dunders are called from external code, and which are only used internally? I thought that `__getitem__` gets called - so when it does not use `idx`, the information is not passed on.
TLDR

Does this PR use `idx` from the `Dataloader`? YES and NO. If using defaults (`precompute=False`), then YES. But if using the accelerated path (`precompute=True`), then NO.
EXPLANATIONS

1. Vanilla

   `Dataloader -> pass idx -> __getitem__() -> use idx`

2. PR - Revised vanilla - slow path (`precompute=False`)

   `Dataloader -> pass idx -> __getitem__() -> pass idx -> __itemtensor__() -> use idx`

   Notes:

   - `__getitem__()` is revised so it can switch between vanilla and accelerated based on the `precompute` value; in this case, it is using vanilla.
   - `__itemtensor__()` is just the renamed `__getitem__()`, so it is the usual calculation path when `precompute=False`.

   Verdict: `idx` from the `Dataloader` is used, as usual.

3. PR - Accelerated path (`precompute=True`)

   Two steps happen here:

   a. Precomputation routine

      `to_dataloader() -> __precompute__() -> generates idx from TSDS -> loop on every idx -> pass each idx -> __itemtensor__() -> use idx -> save data to cache -> move to the next idx -> repeat`

      This is essentially how the Dataloader uses `__getitem__()`: it calls `__getitem__()` and passes `idx` to it, repeating until all `idx` exhaust the data. But what I did in this PR is use a FOR LOOP to retrieve `idx` in advance and exhaust the data, so we can pre-calculate it without calling the `Dataloader`.

   b. `Dataloader` call

      Since we have cached the results of `__itemtensor__()` in advance, meaning `__itemtensor__()` has already pre-computed the data for each specific `idx`, we just retrieve it.

      `Dataloader -> pass idx -> __getitem__() -> idx ignored -> use cache sequence`

      Note:

      - notice, we do not need to call `__itemtensor__()` this time since it already did its job

      Verdict: `idx` from the `Dataloader` is not essential anymore, since another `idx` source has been used, specifically `TimeSynchronizedBatchSampler`.
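Continuing the simplified `PrecomputeSketchDataset` sketch from earlier in this thread (all names illustrative), this is roughly how path 3 plays out end to end: the index order is obtained in advance, the cache is filled, and the `DataLoader`'s own indices are then ignored:

```python
# Reuses the PrecomputeSketchDataset sketch from earlier in the thread.
from torch.utils.data import DataLoader

raw = [[float(i), float(i) + 1.0] for i in range(8)]
ds = PrecomputeSketchDataset(raw, precompute=True)

# Step (a): precompute with an index order obtained in advance
# (standing in for what TimeSynchronizedBatchSampler would yield).
presampled_order = [3, 1, 7, 0, 5, 2, 6, 4]
ds.run_precompute(presampled_order)

# Step (b): the DataLoader passes its own idx values, but __getitem__ ignores
# them and replays the cache, so batches follow presampled_order.
loader = DataLoader(ds, batch_size=4, shuffle=False)
for batch in loader:
    print(batch.shape)  # torch.Size([4, 2])
```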
@jobs-git, apologies, I still do not get it.

> Since we have cached the result of itemtensor in advance, meaning itemtensor has precomputed the data for each specific idx. We just retrieve it.

As far as I understand: you are not retrieving the expected result at `idx`. You are retrieving the results in sequence, irrespective of which `idx` gets queried. Therefore, this is not the correct logic for `__getitem__`.

Please let me know if you disagree.

@phoeenniixx, @PranavBhatP, @agobbifbk, @fnhirwa - can you perhaps also have a look? I also do not fully understand the explanation.
For us to be able to accelerate this (`precompute=True`), we need to query the `idx`'s in advance (not wait for the `Dataloader` to give them one by one) to perform precomputation. This is what the PR does.

I am confused by your meaning of sequence: since the `idx`'s from `TimeSynchronizedBatchSampler` are always random when `shuffle=True`, we just saved the results sequentially, so retrieval is also sequential.

As mentioned previously, acceleration requires obtaining the `idx`'s in advance; otherwise, we can't perform pre-computation.
@fkiraly I think what you would prefer implemented is obtaining the precomputed values from the real-time `idx`? This is what I prefer too, until I saw that the algorithm seems to be normalizing on a given `idx`.

That is when I realized we can't accelerate this if we wait for the `Dataloader`, so we need to obtain the `idx`'s in advance from a different mechanism. Thus this PR.

May I ask what the actual concern is if the accelerated path uses a different mechanism to obtain random `idx`?

Of course, the exact training trajectory would likely be different, but a properly trained model should resolve to the same/similar outcome. If I could present prediction results, would that settle the concern about the accelerated path using a different random `idx` generator? Just let me know.
Thanks - can you explain the logic in the `self.precompute` branch of `__getitem__`? It does not look correct to me.
Ok, I think I am starting to get it. @jobs-git, would it be possible to produce code for a basic usage example? Then, for each call of …
Yes, of course, it's possible to add sample code. Regarding the term …
Proof of concept
https://drive.google.com/file/d/1go7wEMqcejRl7YecHYhIFAhO6YF3T0Kd/view?usp=sharing
Results on Training
That's about a 490% performance improvement: far from 2000%, but with this update, developers can reduce cost and hardware requirements by about 4x!
Results on Hyperparameter Optimization
That's about a 510% performance improvement.
Description
Fixes: #1849 #1426 #1860
Partially Fixes: sktime/sktime#8278
Closes: #1846
Supersedes: #806
The primary issue here is that `__getitem__()` performs pre-processing, which is typically conducted before actually training. As a result, the GPU becomes frequently idle, resulting in slower training completion. It is known that GPUs typically achieve higher throughput the more they can be kept busy, but the pre-processing done in `__getitem__()` each time an item is retrieved massively impacts this.

This commit ensures that pre-processing is performed prior to training. This is achieved by the following:
1. `TimeSeriesDataset(..., precompute=True)` will activate the call to the `__precompute__()` function in `to_dataloader()`
2. the `__precompute__()` function handles the collection of pre-computed items from `__item_tensor__()` and stores them in a cache; this function relies on the sampler to be able to retrieve data indexes
3. `__item_tensor__()` is the unmodified algorithm of the original `__getitem__()`, to ensure an equivalent outcome (see the equivalence sketch after this list)
4. the new `__getitem__()` retrieves items from the cache in order, because relying on `idx` may result in a different index sequence due to the first sampler call from `__precompute__()`
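To make step 3's "equivalent outcome" claim concrete (and loosely in the spirit of the tests requested above), here is a hedged equivalence sketch using the simplified `PrecomputeSketchDataset` from earlier in this thread rather than the real `TimeSeriesDataSet`:

```python
# Hedged equivalence sketch using the simplified PrecomputeSketchDataset from
# earlier in the thread (not the real TimeSeriesDataSet).
import torch

raw = [[float(i), float(i) * 2.0 + 1.0] for i in range(6)]
slow = PrecomputeSketchDataset(raw, precompute=False)
fast = PrecomputeSketchDataset(raw, precompute=True)

index_order = [4, 0, 5, 2, 1, 3]     # stands in for the sampler-provided order
fast.run_precompute(index_order)

for idx in index_order:
    expected = slow[idx]             # computed on access (original path)
    cached = fast[0]                 # idx ignored; cache replayed in order
    assert torch.allclose(expected, cached)
```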
Note: `TimeSeriesDataset` defaults to the slow path, so the vanilla method can still be used.

### WIP help

~~@fkiraly~~

- [ ] Not sure where to get the batch indices if `sampler=None`, so I have to force implement `TimeSynchronizedBatchSampler`. Guidance on this will be appreciated.

Resolution: When `precompute=True` in `TimeSeriesDataset`, it will use `TimeSynchronizedBatchSampler` regardless of the sampler set in `to_dataloader`.

Caveat and Limitations
This feature stores precomputed data in RAM, thus enough RAM must be ensured, otherwise OOM will ensue. A form of super-batching (I don't know how that would be better) could be used if `precompute` is to be insisted on, however that would be beyond the scope of this PR. Recommendation: use `precompute=False` to get the original slow path.

In case of ultra-large-scale datasets, FSDP may be utilized for pytorch-lightning distributed computing, but that is also beyond the scope of this PR. Recommendation: use `precompute=False` to get the original slow path.

Using `precompute=True` ignores two settings, `batch_sampler` and `shuffle`, as it will set the sampler to `TimeSynchronizedBatchSampler` and `shuffle` to `False`. Recommendation: use `precompute=False` to get the original slow path in case you would like to set those two settings. Can this be implemented? Yes, of course, but that is beyond the scope of this PR. It would also increase the complexity of using `TimeSeriesDataset` when a user wishes to enable `precompute=True`, and increase memory requirements.
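As a rough aid for the RAM caveat, a hedged helper for estimating the cache footprint; the function name and the assumed cache layout (tensors, or tuples/dicts of tensors) are illustrative, not an official API:

```python
# Hedged helper to gauge whether the precomputed cache fits in RAM; the name
# and assumed cache layout (tensors, or tuples/dicts of tensors) are illustrative.
import torch


def estimate_cache_bytes(cache):
    def tensor_bytes(obj):
        if isinstance(obj, torch.Tensor):
            return obj.element_size() * obj.nelement()
        if isinstance(obj, dict):
            return sum(tensor_bytes(v) for v in obj.values())
        if isinstance(obj, (list, tuple)):
            return sum(tensor_bytes(v) for v in obj)
        return 0

    return sum(tensor_bytes(item) for item in cache)


# e.g. print(f"cache ~ {estimate_cache_bytes(ds.precompute_cache) / 1e9:.2f} GB")
```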
Checklist

To install hooks, run `pre-commit install`. To run hooks independent of commit, execute `pre-commit run --all-files`.