I'm comparing the speed of a simple dot product with a fixed matrix, either "naked" or inside a Flax layer. I noticed that with Flax the performance is massively worse: from approximately 350 microseconds to 5-10 ms. Here's a colab where I reproduce the issue. I compare a naked approach, Flax (both implicit and explicit), and a plain Python dataclass. Only with the Flax classes do I see the performance hit. Am I timing this wrong (I used .block_until_ready() everywhere), am I making an unfair comparison, or is it something in Flax?
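A minimal sketch of this kind of comparison (the linked colab isn't reproduced here, so the matrix shape, the `Dot` module, and the `timeit` helper below are illustrative assumptions):

```python
import time

import jax
import jax.numpy as jnp
import flax.linen as nn

# Hypothetical sizes; the colab may use different shapes.
key = jax.random.PRNGKey(0)
w = jax.random.normal(key, (256, 256))  # the fixed matrix
x = jax.random.normal(key, (256,))

def naked(x):
    # "Naked" dot product against the fixed matrix.
    return jnp.dot(w, x)

class Dot(nn.Module):
    """Flax module wrapping the same fixed matrix as a parameter."""

    @nn.compact
    def __call__(self, x):
        kernel = self.param("kernel", lambda rng: w)
        return jnp.dot(kernel, x)

model = Dot()
params = model.init(key, x)

def timeit(fn, *args, n=100):
    fn(*args).block_until_ready()  # warm-up (and compile, if jitted)
    t0 = time.perf_counter()
    for _ in range(n):
        fn(*args).block_until_ready()
    return (time.perf_counter() - t0) / n

print("naked:", timeit(naked, x))
print("flax :", timeit(model.apply, params, x))
```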
Unlike, for example, PyTorch, we haven't put much effort into reducing overhead. This has a few reasons: jax.jit on any of the examples you gave should lead to the same compiled program being generated, with exactly the same performance.

The reason the overhead is so high in this case is that, as in other frameworks, the compute is dispatched asynchronously from the Python thread. Normally the tensors are big enough to mask any overhead from the Python interpreter. Still …
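As a sketch of the jax.jit point, reusing the hypothetical names from the snippet above (naked, model, params, x, timeit), wrapping both variants in jax.jit should make their per-call timings essentially match, since both trace to the same compiled program:

```python
# Jitting either version compiles it once; after that, the per-call
# Python/Flax module-traversal overhead no longer dominates.
naked_jit = jax.jit(naked)
apply_jit = jax.jit(model.apply)

naked_jit(x).block_until_ready()           # trigger compilation once
apply_jit(params, x).block_until_ready()   # trigger compilation once

print("naked (jit):", timeit(naked_jit, x))
print("flax  (jit):", timeit(apply_jit, params, x))
```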