Failing upload for libtorch on windows (apparent timeout?) #1159

Open
@h-vetinari

Checklist

  • I added a descriptive title
  • I searched open reports and couldn't find a duplicate

What happened?

We're running into a troubling issue on the pytorch feedstock, which is holding up the rollout of (and migration for) pytorch 2.7. After merging conda-forge/pytorch-cpu-feedstock#383, the windows build failed to upload libtorch 9(!) times in a row due to

http.client.RemoteDisconnected: Remote end closed connection without response

after uploading 100% of the 474MB artefact each time.

Some comments from that PR

@danpetry: [...] I've raised an incident report to the anaconda.org team internally

Is it because the packages already exist? What happens if you delete existing packages?

2025-05-03T21:27:13.0788175Z Distribution C:\bld\win-64\pytorch-2.7.0-cuda126_mkl_py39_h7d88c2c_300.conda already exists on cf-staging. Waiting another 15 seconds to try uploading again.
2025-05-03T21:27:28.8755178Z Distribution C:\bld\win-64\pytorch-2.7.0-cuda126_mkl_py39_h7d88c2c_300.conda already exists on cf-staging. Waiting another 30 seconds to try uploading again.
2025-05-03T21:27:59.5863085Z Distribution C:\bld\win-64\pytorch-2.7.0-cuda126_mkl_py39_h7d88c2c_300.conda already exists on cf-staging. Waiting another 45 seconds to try uploading again.
2025-05-03T21:28:45.1693795Z Distribution C:\bld\win-64\pytorch-2.7.0-cuda126_mkl_py39_h7d88c2c_300.conda already exists on cf-staging. Waiting another 60 seconds to try uploading again.
2025-05-03T21:29:45.8196131Z Distribution C:\bld\win-64\pytorch-2.7.0-cuda126_mkl_py39_h7d88c2c_300.conda already exists on cf-staging. Waiting another 75 seconds to try uploading again.
2025-05-03T21:29:45.8197501Z WARNING: Distribution C:\bld\win-64\pytorch-2.7.0-cuda126_mkl_py39_h7d88c2c_300.conda already existed in cf-staging for a while. Deleting and re-uploading.
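
As a quick sanity check on the log above, here is a minimal sketch that compares the wait each line announces with the gap to the next line's timestamp. The timestamps and wait values are copied from the log; the helper names (`parse_ts`, `observed_gaps`) are made up for illustration and are not part of any conda-forge tooling.

```python
from datetime import datetime

# (timestamp, announced wait in seconds), copied from the log lines above
RETRIES = [
    ("2025-05-03T21:27:13.0788175Z", 15),
    ("2025-05-03T21:27:28.8755178Z", 30),
    ("2025-05-03T21:27:59.5863085Z", 45),
    ("2025-05-03T21:28:45.1693795Z", 60),
    ("2025-05-03T21:29:45.8196131Z", 75),
]


def parse_ts(stamp: str) -> datetime:
    """Parse a log timestamp; it carries 7 fractional digits, %f accepts at most 6."""
    head, frac = stamp.rstrip("Z").split(".")
    return datetime.strptime(f"{head}.{frac[:6]}", "%Y-%m-%dT%H:%M:%S.%f")


def observed_gaps(retries):
    """Seconds elapsed between each log line and the next one."""
    times = [parse_ts(ts) for ts, _ in retries]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    # The last entry has no following "Waiting" line, so zip() drops it.
    for (_, wait), gap in zip(RETRIES, observed_gaps(RETRIES)):
        print(f"announced {wait:>2}s, observed {gap:.1f}s")
```

Each observed gap is the announced wait plus well under a second, so the "already exists" check itself returns almost immediately and the waits grow linearly in 15-second steps until the delete-and-re-upload kicks in.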

@jezdez: This smells like a race condition during the upload or some other state issue, I'd also delete and try again uploading those Windows builds [...]

We (or at least I) don't have access to cf-staging to delete stuff, but this has never been an issue before. More importantly, the time elapsed between 0.0% (of the libtorch upload) and the error seems to be consistently just a few seconds above 5 minutes:

The error message makes me think it's perhaps somehow taking too long, and running into a timeout on the connection (on the server side). If you look at the error logs [below], the error is just a few seconds more than 5min after the start of the upload every time.
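
For reference, this is what the client side sees when a server accepts the whole request and then drops the connection without answering. It is only a local sketch using the standard library: the throwaway server, the `/upload` path and the 1 KiB payload are all made up, and it says nothing about what anaconda.org actually does or whether a 5-minute limit exists there.

```python
import http.client
import socket
import threading

BODY = b"x" * 1024  # stand-in payload, nothing like the real 474MB artefact


def hang_up_after_request(server_sock: socket.socket, body_len: int) -> None:
    """Accept one connection, read the full request, then close without replying."""
    conn, _ = server_sock.accept()
    data = b""
    while True:
        _, sep, body = data.partition(b"\r\n\r\n")
        if sep and len(body) >= body_len:
            break  # headers and the complete body have arrived
        chunk = conn.recv(65536)
        if not chunk:
            break
        data += chunk
    conn.close()  # no status line, no headers: the "upload" reached 100% but gets no response


def reproduce() -> str:
    """Return the client-side error message, or an empty string if no error occurred."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    worker = threading.Thread(
        target=hang_up_after_request, args=(server, len(BODY)), daemon=True
    )
    worker.start()
    client = http.client.HTTPConnection("127.0.0.1", port, timeout=10)
    try:
        client.request("POST", "/upload", body=BODY)
        client.getresponse()
        return ""
    except http.client.RemoteDisconnected as exc:
        return str(exc)
    finally:
        client.close()
        worker.join(timeout=5)
        server.close()


if __name__ == "__main__":
    print(reproduce())
```

The point is just that `RemoteDisconnected` is raised while waiting for the response, after the body has already been sent in full, which would fit a connection being cut on the server side once the upload itself is done.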

[moved logs to comment below due to length restriction]

Additional Context

No response

Labels: type::bug (describes erroneous operation, use severity::* to classify the type)
