Headless rendering on an H100 #250

@dirkmcpherson

I've been trying to get "python -m sapien.example.offscreen" to work on an H100, but Vulkan is not recognizing the GPU: vulkaninfo reports llvmpipe (a CPU device) instead. Any advice or suggestions would be greatly appreciated.

/usr/share/vulkan/icd.d/nvidia_icd.json

{
    "file_format_version" : "1.0.0",
    "ICD": {
        "library_path": "libGLX_nvidia.so.0",
        "api_version" : "1.3.204"
    }
}

/usr/share/glvnd/egl_vendor.d/10_nvidia.json

{
    "file_format_version" : "1.0.0",
    "ICD" : {
        "library_path" : "libEGL_nvidia.so.0"
    }
}

/etc/vulkan/implicit_layer.d/nvidia_layers.json

{
    "file_format_version" : "1.0.0",
    "layer": {
        "name": "VK_LAYER_NV_optimus",
        "type": "INSTANCE",
        "library_path": "libGLX_nvidia.so.0",
        "api_version" : "1.3.204",
        "implementation_version" : "1",
        "description" : "NVIDIA Optimus layer",
        "functions": {
            "vkGetInstanceProcAddr": "vk_optimusGetInstanceProcAddr",
            "vkGetDeviceProcAddr": "vk_optimusGetDeviceProcAddr"
        },
        "enable_environment": {
            "__NV_PRIME_RENDER_OFFLOAD": "1"
        },
        "disable_environment": {
            "DISABLE_LAYER_NV_OPTIMUS_1": ""
        }
    }
}
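
To double-check which driver library each ICD manifest points at, a small script can parse the manifest text. This is only a sketch: `icd_library` is a hypothetical helper name, and it assumes the manifest has the same `ICD` / `library_path` layout as the files shown above (layer manifests such as nvidia_layers.json use a `layer` key instead and are not handled).

```python
import json


def icd_library(manifest_text):
    """Return (library_path, api_version) from a Vulkan ICD manifest string.

    api_version is None when the manifest does not declare one,
    as in the EGL vendor file above.
    """
    icd = json.loads(manifest_text)["ICD"]
    return icd["library_path"], icd.get("api_version")


# Example with the contents of nvidia_icd.json as reported above.
manifest = """
{
    "file_format_version" : "1.0.0",
    "ICD": {
        "library_path": "libGLX_nvidia.so.0",
        "api_version" : "1.3.204"
    }
}
"""
print(icd_library(manifest))  # ('libGLX_nvidia.so.0', '1.3.204')
```

A manifest that fails to parse raises `json.JSONDecodeError`, which makes a truncated or malformed file easy to spot.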

vulkaninfo

Vulkan Instance Version: 1.3.204
GPU0:
VkPhysicalDeviceProperties:
---------------------------
        apiVersion        = 4206847 (1.3.255)
        driverVersion     = 1 (0x0001)
        vendorID          = 0x10005
        deviceID          = 0x0000
        deviceType        = PHYSICAL_DEVICE_TYPE_CPU
        deviceName        = llvmpipe (LLVM 15.0.7, 256 bits)
        pipelineCacheUUID = 32332e32-2e31-2d31-7562-756e7475332e

nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.08             Driver Version: 570.148.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:18:00.0 Off |                    0 |
| N/A   30C    P0             70W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

python -m sapien.example.offscreen

DRIVER: Searching for driver manifest files
DRIVER:    In following folders:
DRIVER:       /root/.config/vulkan/icd.d
DRIVER:       /etc/xdg/vulkan/icd.d
DRIVER:       /etc/vulkan/icd.d
DRIVER:       /root/.local/share/vulkan/icd.d
DRIVER:       /usr/local/share/vulkan/icd.d
DRIVER:       /usr/share/vulkan/icd.d
DRIVER:    Found the following files:
DRIVER:       /usr/share/vulkan/icd.d/intel_icd.x86_64.json
DRIVER:       /usr/share/vulkan/icd.d/lvp_icd.x86_64.json
DRIVER:       /usr/share/vulkan/icd.d/virtio_icd.x86_64.json
DRIVER:       /usr/share/vulkan/icd.d/intel_hasvk_icd.x86_64.json
DRIVER:       /usr/share/vulkan/icd.d/radeon_icd.x86_64.json
DRIVER:       /usr/share/vulkan/icd.d/nvidia_icd.json
ERROR: loader_scanned_icd_add: Could not get 'vkCreateInstance' via 'vk_icdGetInstanceProcAddr' for ICD libGLX_nvidia.so.0
ERROR: loader_validate_device_extensions: Device extension VK_KHR_external_semaphore_fd not supported by selected physical device or enabled layers.
ERROR: vkCreateDevice: Failed to validate extensions in list
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/sapien/example/offscreen.py", line 37, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/sapien/example/offscreen.py", line 9, in main
    scene = sapien.Scene()
  File "/usr/local/lib/python3.10/dist-packages/sapien/wrapper/scene.py", line 25, in __init__
    [sapien.physx.PhysxCpuSystem(), sapien.render.RenderSystem()]
RuntimeError: vk::PhysicalDevice::createDeviceUnique: ErrorExtensionNotPresent
Segmentation fault (core dumped)
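
The first loader error above says `vkCreateInstance` could not be obtained through `vk_icdGetInstanceProcAddr` for libGLX_nvidia.so.0. One way to narrow this down is to check from Python whether the library loads at all and whether it exports that entry point. This is a minimal sketch using `ctypes`; `check_icd_entry_point` is a hypothetical helper name. It only separates "library cannot be loaded" from "symbol not exported" and does not explain why a loaded driver would fail to return `vkCreateInstance`.

```python
import ctypes


def check_icd_entry_point(library, symbol="vk_icdGetInstanceProcAddr"):
    """Classify a Vulkan ICD library as 'missing-library', 'missing-symbol', or 'ok'."""
    try:
        lib = ctypes.CDLL(library)  # uses the normal dynamic-linker search path
    except OSError:
        return "missing-library"
    try:
        getattr(lib, symbol)  # ctypes raises AttributeError if the symbol is not exported
    except AttributeError:
        return "missing-symbol"
    return "ok"


# Library names taken from the ldconfig output in this report.
for name in ("libGLX_nvidia.so.0", "libEGL_nvidia.so.0"):
    print(name, check_icd_entry_point(name))
```

If both libraries report "ok", the problem is more likely in how the driver initializes in this environment than in the manifest itself, though that is an assumption rather than something this check proves.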

I also tried replacing libGLX with libEGL in nvidia_icd.json, hoping it would work better with offscreen rendering, but no luck. Both libraries are definitely installed:

ldconfig -p | grep GLX

libGLX_nvidia.so.0 (libc6,x86-64) => /lib/x86_64-linux-gnu/libGLX_nvidia.so.0
libGLX_mesa.so.0 (libc6,x86-64) => /lib/x86_64-linux-gnu/libGLX_mesa.so.0
libGLX.so.0 (libc6,x86-64) => /lib/x86_64-linux-gnu/libGLX.so.0
libGLX.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libGLX.so

ldconfig -p | grep EGL

libEGL_nvidia.so.0 (libc6,x86-64) => /lib/x86_64-linux-gnu/libEGL_nvidia.so.0
libEGL_mesa.so.0 (libc6,x86-64) => /lib/x86_64-linux-gnu/libEGL_mesa.so.0
libEGL.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libEGL.so.1
libEGL.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libEGL.so

I've also tried the above setup with NVIDIA driver 550.
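
The loader log above shows six ICD manifests being scanned, including Mesa's lvp (llvmpipe) driver. To take the other drivers out of the picture, the Vulkan loader can be pointed at a single manifest with the `VK_ICD_FILENAMES` environment variable (newer loader versions also accept `VK_DRIVER_FILES`). The sketch below is a diagnostic step, not a confirmed fix; `nvidia_only_env` and `run_offscreen_example` are hypothetical helper names, and the manifest path is the one from this report.

```python
import os
import subprocess
import sys

# Manifest path as reported above; adjust if the file lives elsewhere.
NVIDIA_ICD = "/usr/share/vulkan/icd.d/nvidia_icd.json"


def nvidia_only_env(base_env, manifest=NVIDIA_ICD):
    """Return a copy of base_env that restricts the Vulkan loader to one ICD manifest."""
    env = dict(base_env)
    env["VK_ICD_FILENAMES"] = manifest
    return env


def run_offscreen_example():
    """Run the SAPIEN offscreen example with only the NVIDIA ICD visible to the loader."""
    cmd = [sys.executable, "-m", "sapien.example.offscreen"]
    return subprocess.run(cmd, env=nvidia_only_env(os.environ)).returncode
```

Calling `run_offscreen_example()` returns the example's exit code. With only the NVIDIA manifest visible, any remaining loader error should be attributable to that driver alone rather than to a fallback such as llvmpipe.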
