🐛 Bug
Moving data from device to host is really slow: at least 7-10x slower than JAX.
For a lot of workloads (e.g., inference) this latency is crucial, and such subpar performance makes torch:xla simply an infeasible option for production deployments.
To Reproduce
Steps to reproduce the behavior:
Runtime -> Disconnect and delete runtime to ensure no interference between frameworks.
Expected behavior
This is the performance offered by JAX:
Whereas Torch XLA:
Clearly, the Device-to-Host bandwidth is lacking compared to JAX, by 10x in the worst case and 3x in the best.
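For anyone wanting to time this independently, here is a minimal sketch of a framework-agnostic harness. It is not the script behind the numbers in this issue. The names `measure_bandwidth_gbps`, `payload`, `x_on_device` and `t_on_device` are hypothetical, and the stand-in transfer is a plain host-side copy so that the snippet runs without a TPU.

```python
import time


def measure_bandwidth_gbps(transfer_fn, nbytes, repeats=10, warmup=2):
    """Return the best observed bandwidth in GB/s for transfer_fn.

    transfer_fn must block until the copy has finished; otherwise only
    the dispatch time is measured, not the transfer itself.
    """
    for _ in range(warmup):
        transfer_fn()  # keep compilation and lazy initialisation out of the timing
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        transfer_fn()
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from a coarse clock.
    return nbytes / max(best, 1e-12) / 1e9


# Stand-in transfer (assumption): a host-side copy of 64 MiB.
payload = bytearray(64 * 1024 * 1024)
host_copy_gbps = measure_bandwidth_gbps(lambda: bytes(payload), len(payload))
print(f"host copy: {host_copy_gbps:.2f} GB/s")

# On a TPU runtime the callable would be swapped for a real D2H copy, e.g.
#   JAX:       lambda: jax.device_get(x_on_device)
#   torch_xla: lambda: t_on_device.cpu()
# with the array or tensor fully materialised on the device beforehand.
```

Taking the minimum over several repeats reduces noise from the host scheduler, and the warmup calls keep one-off compilation cost out of the result.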
Environment
2.5.1+libtpu
Additional context
Metrics analysis backs up this asymmetry:
These are anomalous numbers, considering the (relatively) small sizes of the tensors.
Additionally, I can confirm that this issue also arises with production-grade models, where such latencies are crippling for performance.
I would also be curious about the reason for the asymmetry between H2D and D2H performance. I know D2H would be blocking, but is this an XLA bottleneck wherein it's unable to efficiently stream and overlap tiles of computation, and just happens to be better optimized for H2D transfers?
Happy to provide more details upon request.