🐛 Bug
torch_xla.tpu.version() occasionally hangs and then fails with a timeout.
To Reproduce
torch_xla.tpu.version()
The call sometimes fails with an HTTP read timeout. Here is the call stack:
  File "/home/chengjiyao_google_com/vllm/vllm/v1/attention/backends/pallas.py", line 163, in __init__
    tpu_version = torch_xla.tpu.version()
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chengjiyao_google_com/miniconda3/envs/vllm/lib/python3.11/site-packages/torch_xla/_internal/tpu.py", line 187, in version
    env = get_tpu_env()
          ^^^^^^^^^^^^^
  File "/home/chengjiyao_google_com/miniconda3/envs/vllm/lib/python3.11/site-packages/torch_xla/_internal/tpu.py", line 181, in get_tpu_env
    metadata = _get_metadata('tpu-env')
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chengjiyao_google_com/miniconda3/envs/vllm/lib/python3.11/site-packages/torch_xla/_internal/tpu.py", line 89, in _get_metadata
    resp = requests.get(path, headers={'Metadata-Flavor': 'Google'})
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chengjiyao_google_com/miniconda3/envs/vllm/lib/python3.11/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chengjiyao_google_com/miniconda3/envs/vllm/lib/python3.11/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chengjiyao_google_com/miniconda3/envs/vllm/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chengjiyao_google_com/miniconda3/envs/vllm/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chengjiyao_google_com/miniconda3/envs/vllm/lib/python3.11/site-packages/requests/adapters.py", line 713, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='metadata.google.internal', port=80): Read timed out. (read timeout=None)
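The `read timeout=None` in the last line suggests that the metadata request is sent without an explicit timeout, so a stalled response from `metadata.google.internal` can block the caller for a long time. One possible mitigation is a bounded timeout plus a few retries with backoff. The sketch below is only an illustration, not the torch_xla implementation: it uses the standard library's `urllib` instead of `requests` so that it runs without extra dependencies, the helper names (`get_metadata_with_retry`, `_fetch`) are hypothetical, and the metadata URL is assumed to be the standard GCE instance-attributes path.

```python
import time
import urllib.request

# Assumption: standard GCE metadata path for instance attributes.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/"
)


def _fetch(url, timeout):
    """Fetch one metadata value, giving up after `timeout` seconds."""
    req = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def get_metadata_with_retry(key, timeout=5.0, retries=3, backoff=0.5,
                            fetch=_fetch, sleep=time.sleep):
    """Hypothetical helper: bounded timeout plus exponential backoff.

    `fetch` and `sleep` are injectable so the retry logic can be
    exercised without network access.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_exc = None
    for attempt in range(retries):
        try:
            return fetch(METADATA_URL + key, timeout)
        except OSError as exc:
            # URLError, socket.timeout and TimeoutError all derive from OSError.
            last_exc = exc
            if attempt < retries - 1:
                sleep(backoff * 2 ** attempt)  # 0.5s, 1.0s, ...
    raise last_exc
```

With `requests`, the comparable change would be passing a `timeout=` argument to `requests.get`, which otherwise waits without a client-side limit.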
Environment
- Reproducible on XLA backend: TPU