
Unable to build Docker image for TFServing for TPU (filesystem error: cannot make canonical path with empty library path) #4123

@arpitagarwal-meesho


Issue Summary

TensorFlow Serving built with TPU support fails to run in Docker containers with the error:

tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:95] Opening library: 
terminate called after throwing an instance of 'std::filesystem::filesystem_error'
  what():  filesystem error: cannot make canonical path: Invalid argument []

The TPU library path is empty despite setting TPU_LIBRARY_PATH, LD_LIBRARY_PATH, and LD_PRELOAD environment variables correctly. This makes TPU-based TensorFlow Serving deployments impossible in containerized environments (Docker, Kubernetes).

Environment

  • TensorFlow Serving version: master branch (commit: 5fb3b1fefda9320202da184752a3366fbeddfeac)
  • Platform: Google Cloud TPU VM (v4)
  • Base OS: Ubuntu 22.04
  • Python: 3.10.18
  • Bazel: 7.4.1
  • libtpu.so: Installed via pip install libtpu (version from PyPI)
  • Deployment: Docker container (intended for Kubernetes)

Expected Behavior

TensorFlow Serving should:

  1. Read TPU_LIBRARY_PATH environment variable to locate libtpu.so
  2. Successfully initialize TPU support in containerized environments
  3. Serve models using TPU hardware for inference
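The lookup order expected in step 1 can be sketched in shell. This is only an illustration of the behavior the report expects, not TensorFlow's actual code: the function name `resolve_libtpu_path` is hypothetical, and the fallback to the bare name `libtpu.so` (left for the dynamic loader to find) is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: mirrors the lookup order this report expects.
# Assumption: when TPU_LIBRARY_PATH is unset or empty, fall back to the
# bare name "libtpu.so" so the dynamic loader can search LD_LIBRARY_PATH.
resolve_libtpu_path() {
    if [ -n "${TPU_LIBRARY_PATH:-}" ]; then
        # An explicit path from the environment takes priority.
        printf '%s\n' "$TPU_LIBRARY_PATH"
    else
        printf '%s\n' "libtpu.so"
    fi
}
```

Under this sketch, the resolved path is never an empty string, which is the condition the crash in this report appears to violate.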

Actual Behavior

The code at tpu_api_dlsym_initializer.cc:95 attempts to open a library with an empty path string, which causes an immediate crash. The TPU initialization code appears to bypass the environment variables entirely.

Reproduction Steps

  1. Build Custom TensorFlow Serving with TPU Support
    Dockerfile.devel-tpu:
FROM ubuntu:22.04 as base_build

ENV DEBIAN_FRONTEND=noninteractive

# Install dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    automake build-essential ca-certificates curl git \
    libcurl4-openssl-dev libfreetype6-dev libpng-dev libtool \
    libzmq3-dev openjdk-11-jdk pkg-config python3.10 \
    python3.10-dev python3-pip swig unzip wget zip zlib1g-dev \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
RUN pip3 install --no-cache-dir --upgrade pip setuptools && \
    pip3 install --no-cache-dir future grpcio h5py keras_applications \
    keras_preprocessing mock numpy portpicker requests

# Install Bazel 7.4.1
ENV BAZEL_VERSION=7.4.1
RUN mkdir /bazel && cd /bazel && \
    curl -fSsL -O https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VERSION}/bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh && \
    chmod +x bazel-*.sh && ./bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh && \
    rm -rf /bazel

# Copy TensorFlow Serving source
WORKDIR /tensorflow-serving
COPY . .

# Patch .bazelrc to disable GCE-specific TPU flags
RUN sed -i 's/^build:tpu --copt=-DLIBTPU_ON_GCE/#build:tpu --copt=-DLIBTPU_ON_GCE/' .bazelrc && \
    echo "build:tpu-docker --define=with_tpu_support=true" >> .bazelrc && \
    echo "build:tpu-docker --define=framework_shared_object=false" >> .bazelrc

# Build with TPU support
RUN bazel build --config=release --config=tpu-docker \
    --verbose_failures \
    tensorflow_serving/model_servers:tensorflow_model_server && \
    cp bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server /usr/bin/

CMD ["/bin/bash"]
  2. Create Runtime Image with libtpu.so
    Dockerfile.tpu:
FROM tensorflow-serving-devel-tpu as build_image
FROM ubuntu:22.04

# Install runtime dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates libgomp1 python3.10 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Copy TensorFlow Serving binary
COPY --from=build_image /usr/bin/tensorflow_model_server /usr/bin/tensorflow_model_server

# Install libtpu via pip
RUN pip3 install --no-cache-dir libtpu

# Create symlinks to standard locations
RUN LIBTPU_PATH=$(find /usr/local/lib -name "libtpu.so" | head -1) && \
    mkdir -p /lib/libtpu /usr/lib && \
    ln -sf "$LIBTPU_PATH" /lib/libtpu/libtpu.so && \
    ln -sf "$LIBTPU_PATH" /usr/lib/libtpu.so

# Set environment variables
ENV TPU_LIBRARY_PATH=/lib/libtpu/libtpu.so
ENV LD_LIBRARY_PATH=/usr/lib:/usr/local/lib:/lib/libtpu
ENV LD_PRELOAD=/lib/libtpu/libtpu.so

EXPOSE 8500 8501
CMD ["/usr/bin/tensorflow_model_server"]
  3. Run Container
docker build -t tensorflow-serving-devel-tpu -f Dockerfile.devel-tpu .
docker build -t tensorflow-serving-tpu -f Dockerfile.tpu .

docker run -p 8500:8500 -p 8501:8501 \
  --mount type=bind,source=/path/to/model,target=/models/my_model \
  -e MODEL_NAME=my_model \
  tensorflow-serving-tpu

Result: The container crashes immediately with the filesystem error shown in the Issue Summary.

What Has Been Tried

  1. Environment Variables: Set TPU_LIBRARY_PATH, LD_LIBRARY_PATH, LD_PRELOAD
  2. Multiple Library Locations: Symlinked libtpu.so to /lib/libtpu/, /usr/lib/, /usr/local/lib/
  3. Removed GCE-Specific Flags: Commented out -DLIBTPU_ON_GCE in .bazelrc
  4. Custom Build Configs: Tried --config=tpu-docker and direct --define=with_tpu_support=true
  5. Different libtpu Sources:
    • Copied from TPU VM host (/usr/local/lib/python3.10/dist-packages/libtpu/libtpu.so)
    • Installed via pip install libtpu
    • Attempted torch-xla[tpu] installation
  6. Verified Binary: Confirmed TPU support is compiled in (checked with strings)
  7. Verified Library: Confirmed libtpu.so exists and is accessible (339MB, valid ELF)
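The library check in the last item can be reproduced along these lines. This is a minimal sketch, not the exact commands used for this report: `check_libtpu` is a hypothetical helper that resolves symlinks and compares the first four bytes against the ELF magic number.

```shell
#!/bin/sh
# Hypothetical helper (not part of TensorFlow Serving).
# Succeeds and prints the resolved path only if the argument points,
# possibly through symlinks, at a readable file starting with the ELF magic.
check_libtpu() {
    path="$1"
    if [ -z "$path" ]; then
        echo "empty library path" >&2
        return 1
    fi
    # Follow symlinks such as /lib/libtpu/libtpu.so to the real file.
    real=$(readlink -f -- "$path") || return 1
    if [ ! -f "$real" ] || [ ! -r "$real" ]; then
        echo "not a readable file: $path" >&2
        return 1
    fi
    # An ELF file begins with the bytes 0x7f 'E' 'L' 'F'.
    magic=$(head -c 4 -- "$real" | od -An -tx1 | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo "not an ELF file: $real" >&2
        return 1
    fi
    printf '%s\n' "$real"
}
```

Inside the runtime container this would be called as `check_libtpu "$TPU_LIBRARY_PATH"`. A passing check is consistent with the observation that the library itself is fine and the empty path originates elsewhere.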

Root Cause Analysis

The tpu_api_dlsym_initializer.cc code in TensorFlow core appears to have hardcoded logic that:

  1. Only works in Google Compute Engine (GCE) VM environments
  2. Returns an empty string when not in GCE context
  3. Does NOT properly fall back to checking TPU_LIBRARY_PATH environment variable
  4. Fails during static initialization before environment variables can take effect

The relevant code path (tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:95) logs:

Opening library: [empty string]

This suggests GetLibraryPath() returns "" instead of reading from environment variables.

Workarounds Attempted (All Failed)

  1. Mounting host libtpu.so into container
  2. Using LD_PRELOAD to force library loading
  3. Multiple symlinks in every possible path
  4. Patching .bazelrc to remove GCE-specific compilation flags

Questions

  1. Is TPU support in Docker containers officially supported?
  2. If not, are there plans to add support?
  3. Can tpu_api_dlsym_initializer.cc be updated to properly check TPU_LIBRARY_PATH in non-GCE environments?
  4. Are there internal Google builds/configurations that work in containers?
