Issue Summary
TensorFlow Serving built with TPU support fails to run in Docker containers with the error:
tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:95] Opening library:
terminate called after throwing an instance of 'std::filesystem::filesystem_error'
what(): filesystem error: cannot make canonical path: Invalid argument []
The TPU library path is empty despite setting TPU_LIBRARY_PATH, LD_LIBRARY_PATH, and LD_PRELOAD environment variables correctly. This makes TPU-based TensorFlow Serving deployments impossible in containerized environments (Docker, Kubernetes).
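To rule out shell-level mistakes before blaming the initializer, the variables can be checked from inside the failing container. A minimal sketch (check_tpu_env is a hypothetical helper for this report, not part of TF Serving; the variable names and paths match the Dockerfiles below):

```shell
# check_tpu_env: hypothetical in-container sanity check. Prints the three
# relevant variables and verifies that TPU_LIBRARY_PATH names a real file.
check_tpu_env() {
  for v in TPU_LIBRARY_PATH LD_LIBRARY_PATH LD_PRELOAD; do
    # printenv exits non-zero for unset variables; show a placeholder instead
    printf '%s=%s\n' "$v" "$(printenv "$v" || echo '<unset>')"
  done
  if [ -n "${TPU_LIBRARY_PATH:-}" ] && [ -e "$TPU_LIBRARY_PATH" ]; then
    echo "libtpu present at $TPU_LIBRARY_PATH"
  else
    echo "libtpu MISSING"
  fi
}
```

In this setup the check reports "libtpu present", yet the server still crashes with an empty library path, which is what points at the initialization code rather than the container environment.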
Environment
- TensorFlow Serving version: master branch (commit: 5fb3b1fefda9320202da184752a3366fbeddfeac)
- Platform: Google Cloud TPU VM (v4)
- Base OS: Ubuntu 22.04
- Python: 3.10.18
- Bazel: 7.4.1
- libtpu.so: Installed via pip install libtpu (version from PyPI)
- Deployment: Docker container (intended for Kubernetes)
Expected Behavior
TensorFlow Serving should:
- Read TPU_LIBRARY_PATH environment variable to locate libtpu.so
- Successfully initialize TPU support in containerized environments
- Serve models using TPU hardware for inference
Actual Behavior
The code at tpu_api_dlsym_initializer.cc:95 attempts to open a library with an empty path string, causing an immediate crash. The TPU initialization code appears to bypass the environment variables entirely.
Reproduction Steps
- Build Custom TensorFlow Serving with TPU Support
Dockerfile.devel-tpu:
FROM ubuntu:22.04 as base_build
ENV DEBIAN_FRONTEND=noninteractive
# Install dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
automake build-essential ca-certificates curl git \
libcurl4-openssl-dev libfreetype6-dev libpng-dev libtool \
libzmq3-dev openjdk-11-jdk pkg-config python3.10 \
python3.10-dev python3-pip swig unzip wget zip zlib1g-dev \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
# Install Python dependencies
RUN pip3 install --no-cache-dir --upgrade pip setuptools && \
pip3 install --no-cache-dir future grpcio h5py keras_applications \
keras_preprocessing mock numpy portpicker requests
# Install Bazel 7.4.1
ENV BAZEL_VERSION=7.4.1
RUN mkdir /bazel && cd /bazel && \
curl -fSsL -O https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VERSION}/bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh && \
chmod +x bazel-*.sh && ./bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh && \
rm -rf /bazel
# Copy TensorFlow Serving source
WORKDIR /tensorflow-serving
COPY . .
# Patch .bazelrc to disable GCE-specific TPU flags
RUN sed -i 's/^build:tpu --copt=-DLIBTPU_ON_GCE/#build:tpu --copt=-DLIBTPU_ON_GCE/' .bazelrc && \
echo "build:tpu-docker --define=with_tpu_support=true" >> .bazelrc && \
echo "build:tpu-docker --define=framework_shared_object=false" >> .bazelrc
# Build with TPU support
RUN bazel build --config=release --config=tpu-docker \
--verbose_failures \
tensorflow_serving/model_servers:tensorflow_model_server && \
cp bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server /usr/bin/
CMD ["/bin/bash"]
- Create Runtime Image with libtpu.so
Dockerfile.tpu:
FROM tensorflow-serving-devel-tpu as build_image
FROM ubuntu:22.04
# Install runtime dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
ca-certificates libgomp1 python3.10 python3-pip \
&& rm -rf /var/lib/apt/lists/*
# Copy TensorFlow Serving binary
COPY --from=build_image /usr/bin/tensorflow_model_server /usr/bin/tensorflow_model_server
# Install libtpu via pip
RUN pip3 install --no-cache-dir libtpu
# Create symlinks to standard locations
RUN LIBTPU_PATH=$(find /usr/local/lib -name "libtpu.so" | head -1) && \
mkdir -p /lib/libtpu /usr/lib && \
ln -sf "$LIBTPU_PATH" /lib/libtpu/libtpu.so && \
ln -sf "$LIBTPU_PATH" /usr/lib/libtpu.so
# Set environment variables
ENV TPU_LIBRARY_PATH=/lib/libtpu/libtpu.so
ENV LD_LIBRARY_PATH=/usr/lib:/usr/local/lib:/lib/libtpu
ENV LD_PRELOAD=/lib/libtpu/libtpu.so
EXPOSE 8500 8501
CMD ["/usr/bin/tensorflow_model_server"]
- Run Container
docker build -t tensorflow-serving-devel-tpu -f Dockerfile.devel-tpu .
docker build -t tensorflow-serving-tpu -f Dockerfile.tpu .
docker run -p 8500:8500 -p 8501:8501 \
--mount type=bind,source=/path/to/model,target=/models/my_model \
-e MODEL_NAME=my_model \
tensorflow-serving-tpu
Result: Immediate crash with filesystem error.
What Has Been Tried
- Environment Variables: Set TPU_LIBRARY_PATH, LD_LIBRARY_PATH, LD_PRELOAD
- Multiple Library Locations: Symlinked libtpu.so to /lib/libtpu/, /usr/lib/, /usr/local/lib/
- Removed GCE-Specific Flags: Commented out -DLIBTPU_ON_GCE in .bazelrc
- Custom Build Configs: Tried --config=tpu-docker and direct --define=with_tpu_support=true
- Different libtpu Sources:
- Copied from TPU VM host (/usr/local/lib/python3.10/dist-packages/libtpu/libtpu.so)
- Installed via pip install libtpu
- Attempted torch-xla[tpu] installation
- Verified Binary: Confirmed TPU support is compiled in (checked with strings)
- Verified Library: Confirmed libtpu.so exists and is accessible (339MB, valid ELF)
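The two verification steps above can be reproduced with standard tools. A sketch (is_elf is a hypothetical helper that only checks the 4-byte ELF magic; the paths are the ones created by the Dockerfiles):

```shell
# is_elf: hypothetical helper; true iff the file starts with the ELF magic
# bytes 7f 45 4c 46 ("\177ELF").
is_elf() {
  [ "$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')" = "7f454c46" ]
}

# Checks run against the image (paths from the Dockerfiles above):
#   is_elf /lib/libtpu/libtpu.so && echo "libtpu.so is a valid ELF object"
#   strings /usr/bin/tensorflow_model_server | grep -c libtpu  # > 0 => TPU code linked in
#   ldd /lib/libtpu/libtpu.so | head                           # dependencies resolve
```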
Root Cause Analysis
The tpu_api_dlsym_initializer.cc code in TensorFlow core appears to have hardcoded logic that:
- Only works in Google Cloud Engine (GCE) VM environments
- Returns an empty string when not in GCE context
- Does NOT properly fall back to checking the TPU_LIBRARY_PATH environment variable
- Fails during static initialization, before environment variables can take effect
- The relevant code path (tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:95) logs:
Opening library: [empty string]
This suggests GetLibraryPath() returns "" instead of reading from environment variables.
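To make the expected behavior concrete, the lookup order this report is asking for can be sketched in shell. This is an illustration of the desired fallback, not the actual C++ in tpu_api_dlsym_initializer.cc; find_libtpu and the candidate path list are assumptions:

```shell
# find_libtpu: illustrative sketch of the fallback the report expects.
# 1. Honor TPU_LIBRARY_PATH if it names an existing file.
# 2. Otherwise probe the conventional locations used in the Dockerfiles.
find_libtpu() {
  if [ -n "${TPU_LIBRARY_PATH:-}" ] && [ -e "$TPU_LIBRARY_PATH" ]; then
    printf '%s\n' "$TPU_LIBRARY_PATH"
    return 0
  fi
  for p in /lib/libtpu/libtpu.so /usr/lib/libtpu.so /usr/local/lib/libtpu.so; do
    if [ -e "$p" ]; then
      printf '%s\n' "$p"
      return 0
    fi
  done
  return 1  # no candidate found; the observed crash corresponds to this case
}
```

The observed behavior is equivalent to skipping step 1 entirely and returning an empty string when the GCE-specific discovery fails.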
Workarounds Attempted (All Failed)
- Mounting host libtpu.so into container
- Using LD_PRELOAD to force library loading
- Multiple symlinks in every possible path
- Patching .bazelrc to remove GCE-specific compilation flags
Questions
- Is TPU support in Docker containers officially supported?
- If not, are there plans to add support?
- Can tpu_api_dlsym_initializer.cc be updated to properly check TPU_LIBRARY_PATH in non-GCE environments?
- Are there internal Google builds/configurations that work in containers?