Replies: 1 comment · 2 replies

Just double checking: you're not changing the batch size within the same training loop, right?
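For context on why that matters: `jax.jit` specializes on input shapes, so every distinct batch size triggers a fresh trace and compile. A minimal sketch to illustrate (placeholder function, not code from this thread):

```python
import jax
import jax.numpy as jnp

trace_count = 0

@jax.jit
def f(x):
    global trace_count
    trace_count += 1  # Python side effects run only while JAX traces the function
    return (x ** 2).sum()

f(jnp.ones((2, 4)))   # batch size 2: traces and compiles
f(jnp.ones((2, 4)))   # same shape: served from the cache, no retrace
f(jnp.ones((64, 4)))  # new batch size: traces and compiles again
print(trace_count)    # prints 2
```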
Hi!
When working on a current project, I found that the compile time of my `train_step` function increased drastically when I increased the batch size: it took about 20 s with a batch size of 2, but went up to about 4000 s when I bumped the batch size to 64. Here is the `train_step` function I'm using (it may be a little confusing because I'm working on an IQA task with a custom model that has a state):

As a note, I have a different function to calculate the metrics during validation, and that function doesn't show the same behavior, so I thought the issue might be related to the gradient computation, but I don't really know if that makes sense.
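Schematically, the step does something along these lines; this is only a simplified placeholder sketch (tiny MLP, MSE loss, plain SGD), not the actual function, which uses a custom IQA model:

```python
import jax
import jax.numpy as jnp

def apply_model(params, state, x):
    # Stand-in for a custom stateful model: returns predictions and a new state.
    h = jnp.tanh(x @ params["w1"] + params["b1"])
    preds = h @ params["w2"] + params["b2"]
    return preds, state  # a real model would also update running statistics here

@jax.jit
def train_step(params, state, x, y, lr=1e-3):
    def loss_fn(p):
        preds, new_state = apply_model(p, state, x)
        return jnp.mean((preds - y) ** 2), new_state

    (loss, new_state), grads = jax.value_and_grad(loss_fn, has_aux=True)(params)
    # Plain SGD update just to keep the sketch self-contained.
    new_params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return new_params, new_state, loss
```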
I was under the assumption that changing the batch size shouldn't have this big an influence on compilation time, and since I couldn't narrow down the problem, I tried to replicate it in a very simple MNIST classifier example in Colab (here).
What I found was basically the same: the compilation time goes up with the batch size, as you can see in this quick wandb dashboard I set up for the experiment: https://wandb.ai/jorgvt/JaX_Compile?workspace=user-jorgvt
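For reference, one way to isolate compilation time per batch size is to time JAX's ahead-of-time `lower(...).compile()` on the jitted step. This is only a sketch, reusing the placeholder `train_step` from above and assuming a recent JAX version; the Colab may measure it differently:

```python
import time
import jax
import jax.numpy as jnp

def make_dummy_batch(batch_size):
    # Placeholder MNIST-like shapes: 784 input features, 10 outputs.
    key = jax.random.PRNGKey(0)
    x = jax.random.normal(key, (batch_size, 28 * 28))
    y = jax.random.normal(key, (batch_size, 10))
    return x, y

# Placeholder parameters matching the sketch above.
params = {
    "w1": jnp.zeros((28 * 28, 128)), "b1": jnp.zeros((128,)),
    "w2": jnp.zeros((128, 10)), "b2": jnp.zeros((10,)),
}
state = {}

for batch_size in (2, 8, 32, 64):
    x, y = make_dummy_batch(batch_size)
    start = time.perf_counter()
    # lower() traces for these concrete shapes; compile() runs XLA compilation.
    train_step.lower(params, state, x, y).compile()
    print(f"batch_size={batch_size}: compiled in {time.perf_counter() - start:.1f} s")
```

Because nothing is executed, the measured time covers tracing plus XLA compilation only.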
I'd be more than willing to share more information with anyone who can shed some light on this!