torch_xla.tpu.version() gets stuck occasionally #9449

Open
@yaochengji

🐛 Bug

torch_xla.tpu.version() occasionally hangs and then fails with a timeout.

To Reproduce

torch_xla.tpu.version()

The call sometimes fails with an HTTP read timeout. Here is the call stack:

  File "/home/chengjiyao_google_com/vllm/vllm/v1/attention/backends/pallas.py", line 163, in __init__
    tpu_version = torch_xla.tpu.version()
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chengjiyao_google_com/miniconda3/envs/vllm/lib/python3.11/site-packages/torch_xla/_internal/tpu.py", line 187, in version
    env = get_tpu_env()
          ^^^^^^^^^^^^^
  File "/home/chengjiyao_google_com/miniconda3/envs/vllm/lib/python3.11/site-packages/torch_xla/_internal/tpu.py", line 181, in get_tpu_env
    metadata = _get_metadata('tpu-env')
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chengjiyao_google_com/miniconda3/envs/vllm/lib/python3.11/site-packages/torch_xla/_internal/tpu.py", line 89, in _get_metadata
    resp = requests.get(path, headers={'Metadata-Flavor': 'Google'})
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chengjiyao_google_com/miniconda3/envs/vllm/lib/python3.11/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chengjiyao_google_com/miniconda3/envs/vllm/lib/python3.11/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chengjiyao_google_com/miniconda3/envs/vllm/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chengjiyao_google_com/miniconda3/envs/vllm/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chengjiyao_google_com/miniconda3/envs/vllm/lib/python3.11/site-packages/requests/adapters.py", line 713, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='metadata.google.internal', port=80): Read timed out. (read timeout=None)
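
The traceback reports `read timeout=None`, which suggests the metadata request was issued without an explicit `timeout` argument. As a possible mitigation, here is a minimal sketch of a bounded-time lookup with retries. It is not the torch_xla implementation: `get_metadata_with_retry`, `_http_get`, the default values, and the full URL layout in `METADATA_URL` are assumptions for illustration (only the host, the `tpu-env` key, and the `Metadata-Flavor` header appear in the traceback above).

```python
import time
from typing import Callable, Optional

# Assumed URL layout of the GCE metadata server; only the host and the
# 'tpu-env' key are visible in the traceback.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/{key}"
)


def _http_get(url: str, timeout: float) -> str:
    """Fetch `url` with an explicit timeout so the call cannot block forever."""
    import requests  # imported lazily so the retry logic can be tested offline

    resp = requests.get(url, headers={"Metadata-Flavor": "Google"}, timeout=timeout)
    resp.raise_for_status()
    return resp.text


def get_metadata_with_retry(
    key: str,
    fetch: Callable[[str, float], str] = _http_get,
    timeout: float = 5.0,
    attempts: int = 3,
    backoff: float = 0.5,
) -> str:
    """Hypothetical helper: metadata lookup with a per-request timeout and retries."""
    url = METADATA_URL.format(key=key)
    last_error: Optional[BaseException] = None
    for attempt in range(attempts):
        try:
            return fetch(url, timeout)
        except OSError as err:
            # requests.exceptions.RequestException and TimeoutError both
            # derive from OSError, so timeouts and HTTP errors land here.
            last_error = err
            if attempt < attempts - 1:
                # Exponential backoff between attempts: 0.5 s, 1 s, 2 s, ...
                time.sleep(backoff * 2**attempt)
    raise RuntimeError(
        f"metadata lookup for {key!r} failed after {attempts} attempts"
    ) from last_error
```

A caller could use `get_metadata_with_retry("tpu-env")` in place of an unbounded request. The `fetch` parameter exists only so the retry path can be exercised without a metadata server.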

Environment

  • Reproducible on XLA backend: TPU

Additional context

Labels: bug, xla:tpu
