Vulkan related question: what's the difference between server and cli? #11099

Open
FNsi opened this issue Jan 6, 2025 · 9 comments

Comments

@FNsi
Contributor

FNsi commented Jan 6, 2025

Today I built llama.cpp and stable-diffusion.cpp with Vulkan (in Termux); all build steps passed.

However, only llama-cli works well, whatever the -ngl value; llama-server and stable-diffusion.cpp both end with
core dumped

Meanwhile I found that, during the compile process, I was missing glslangValidator.

Screenshotg.jpg

The server dumped core:
Screenshot_2025.jpg

The llama-cli works well.
Screenshot_2.jpg

stable-diffusion.cpp also dumped core:

Screenshot_202501e.jpg

So... does anyone have a clue?

@FNsi
Contributor Author

FNsi commented Jan 6, 2025

Meanwhile I found that, during the compile process, I was missing glslangValidator.

I built glslang from source and rebuilt llama.cpp, but the server still dumps core.

@FNsi
Contributor Author

FNsi commented Jan 6, 2025

Unrelated, but on Android, OpenBLAS is much faster than Vulkan.

@bandoti
Contributor

bandoti commented Jan 6, 2025

Regarding glslangValidator, I'm not sure if that is necessary. The only requirement as far as the CMake package is concerned is glslc:

find_package(Vulkan COMPONENTS glslc REQUIRED)
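
As a rough pre-configure check, here is a minimal Python sketch that reports which shader compilers are visible. It only searches PATH, whereas CMake's Vulkan find module may also consult other hints (such as the VULKAN_SDK environment variable), so a miss here is a hint rather than proof. The function name find_shader_compilers is made up for illustration.

```python
import shutil


def find_shader_compilers(candidates=("glslc", "glslangValidator")):
    """Map each candidate tool name to its path on PATH, or None if absent.

    Hypothetical helper: it approximates, but does not replicate, what
    find_package(Vulkan COMPONENTS glslc REQUIRED) does during configure.
    """
    return {name: shutil.which(name) for name in candidates}


if __name__ == "__main__":
    for tool, path in find_shader_compilers().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```

If glslc shows up as missing here, the CMake configure step quoted above would be expected to fail, regardless of whether glslangValidator is installed.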

I'm not familiar with Termux but the first thought coming to mind is that llama server (and possibly stable-diffusion) runs in the background whereas llama-cli runs in the foreground, so I'd wager there could be an issue with the OS yanking the GPU away and causing a segfault.

On Android, backend processes need an RPC mechanism, for example when switching between apps. Not sure if this helps, but here's some info on Intents and Services.

I would imagine that llama server would run as a service, and the front-end needs to hook up to it via the intent.

@FNsi
Contributor Author

FNsi commented Jan 6, 2025

I'm not familiar with Termux but the first thought coming to mind is that llama server (and possibly stable-diffusion) runs in the background whereas llama-cli runs in the foreground, so I'd wager there could be an issue with the OS yanking the GPU away and causing a segfault.

On Android, backend processes need an RPC mechanism, for example when switching between apps.

I would imagine that llama server would run as a service, and the front-end needs to hook up to it via the intent.

I totally agree with your ideas. I'm not familiar with Termux either; I just got an Android phone with 12 GB of RAM, which is not big but enough, so I decided to play with it!


As for stable-diffusion.cpp, I fixed it by removing the glfw package (and also glslang), and now it works.
It is still slower than OpenBLAS, though this time only by about 23%.


So I think things are complicated, especially since I cannot get the Android Vulkan ICD file without root...


In the end, I have to say OpenBLAS is the natural fit for weak ARM devices!

@bandoti
Contributor

bandoti commented Jan 6, 2025

Another possibility is the new OpenCL backend. I'm not sure if it currently supports this use-case, but OpenCL being older than Vulkan typically can have somewhat better interoperability.

@jeffbolznv
Collaborator

I don't know why server and cli would behave differently. DeviceLost errors usually mean either a timeout (hang, or just too much work) or GPU exception. Please share info about what GPU/driver you're using. Please also try running test-backend-ops to see if your setup passes directed tests.

@FNsi
Contributor Author

FNsi commented Jan 6, 2025

It's a Mali-G57.
I use the Termux pkg vulkan_android_loader to deal with it.

As for the new OpenCL backend, I haven't tried it yet, but OpenBLAS lets my phone do other things at the same time, since it only uses the CPU.

Also, Vulkan is only half the speed (his top-end device) of OpenBLAS (my 4x A76 + 4x A55).

~ $ vulkaninfo
'DISPLAY' environment variable not set... skipping surface info
==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.304


Instance Extensions: count = 12
===============================
        VK_EXT_debug_report                    : extension revision 10
        VK_EXT_swapchain_colorspace            : extension revision 4
        VK_GOOGLE_surfaceless_query            : extension revision 1
        VK_KHR_android_surface                 : extension revision 6
        VK_KHR_device_group_creation           : extension revision 1
        VK_KHR_external_fence_capabilities     : extension revision 1
        VK_KHR_external_memory_capabilities    : extension revision 1
        VK_KHR_external_semaphore_capabilities : extension revision 1
        VK_KHR_get_physical_device_properties2 : extension revision 2
        VK_KHR_get_surface_capabilities2       : extension revision 1
        VK_KHR_surface                         : extension revision 25
        VK_KHR_surface_protected_capabilities  : extension revision 1

Layers:
=======
Device Properties and Extensions:
=================================
GPU0:
VkPhysicalDeviceProperties:
---------------------------
        apiVersion        = 1.3.225 (4206817)
        driverVersion     = 40.0.0 (167772160)
        vendorID          = 0x13b5
        deviceID          = 0x90910010
        deviceType        = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
        deviceName        = Mali-G57
        pipelineCacheUUID = 87736cdd-0776-d79e-cb81-1eb8f30eeb4e

VkPhysicalDeviceLimits:
-----------------------
        maxImageDimension1D                             = 16384
        maxImageDimension2D                             = 16384
        maxImageDimension3D                             = 16384
        maxImageDimensionCube                           = 16384
        maxImageArrayLayers                             = 4096
        maxTexelBufferElements                          = 268435456
        maxUniformBufferRange                           = 65536
        maxStorageBufferRange                           = 268435456
        maxPushConstantsSize                            = 256
        maxMemoryAllocationCount                        = 16384
        maxSamplerAllocationCount                       = 4294967295
        bufferImageGranularity                          = 0x00000001
        sparseAddressSpaceSize                          = 0x00000000
        maxBoundDescriptorSets                          = 7
        maxPerStageDescriptorSamplers                   = 500000
        maxPerStageDescriptorUniformBuffers             = 36
        maxPerStageDescriptorStorageBuffers             = 500000
        maxPerStageDescriptorSampledImages              = 500000
        maxPerStageDescriptorStorageImages              = 500000
        maxPerStageDescriptorInputAttachments           = 9
        maxPerStageResources                            = 500000
        maxDescriptorSetSamplers                        = 500000
        maxDescriptorSetUniformBuffers                  = 216
        maxDescriptorSetUniformBuffersDynamic           = 32
        maxDescriptorSetStorageBuffers                  = 500000
        maxDescriptorSetStorageBuffersDynamic           = 32
        maxDescriptorSetSampledImages                   = 500000
        maxDescriptorSetStorageImages                   = 500000
        maxDescriptorSetInputAttachments                = 9
        maxVertexInputAttributes                        = 32
        maxVertexInputBindings                          = 32
        maxVertexInputAttributeOffset                   = 2047
        maxVertexInputBindingStride                     = 2048
        maxVertexOutputComponents                       = 128
        maxTessellationGenerationLevel                  = 64
        maxTessellationPatchSize                        = 32
        maxTessellationControlPerVertexInputComponents  = 128
        maxTessellationControlPerVertexOutputComponents = 128
        maxTessellationControlPerPatchOutputComponents  = 120
        maxTessellationControlTotalOutputComponents     = 4096
        maxTessellationEvaluationInputComponents        = 128
        maxTessellationEvaluationOutputComponents       = 128
        maxGeometryShaderInvocations                    = 32
        maxGeometryInputComponents                      = 128
        maxGeometryOutputComponents                     = 128
        maxGeometryOutputVertices                       = 256
        maxGeometryTotalOutputComponents                = 2048
        maxFragmentInputComponents                      = 128
        maxFragmentOutputAttachments                    = 8
        maxFragmentDualSrcAttachments                   = 0
        maxFragmentCombinedOutputResources              = 1000008
        maxComputeSharedMemorySize                      = 32768
        maxComputeWorkGroupCount: count = 3
                4294967295
                4294967295
                4294967295
        maxComputeWorkGroupInvocations                  = 512
        maxComputeWorkGroupSize: count = 3
                512
                512
                512
        subPixelPrecisionBits                           = 8
        subTexelPrecisionBits                           = 8
        mipmapPrecisionBits                             = 8
        maxDrawIndexedIndexValue                        = 4294967295
        maxDrawIndirectCount                            = 1
        maxSamplerLodBias                               = 255
        maxSamplerAnisotropy                            = 16
        maxViewports                                    = 1
        maxViewportDimensions: count = 2
                16384
                16384
        viewportBoundsRange: count = 2
                -32768
                32767
        viewportSubPixelBits                            = 0
        minMemoryMapAlignment                           = 64
        minTexelBufferOffsetAlignment                   = 0x00000040
        minUniformBufferOffsetAlignment                 = 0x00000010
        minStorageBufferOffsetAlignment                 = 0x00000040
        minTexelOffset                                  = -8
        maxTexelOffset                                  = 7
        minTexelGatherOffset                            = -8
        maxTexelGatherOffset                            = 7
        minInterpolationOffset                          = -0.5
        maxInterpolationOffset                          = 0.5
        subPixelInterpolationOffsetBits                 = 4
        maxFramebufferWidth                             = 16384
        maxFramebufferHeight                            = 16384
        maxFramebufferLayers                            = 256
        framebufferColorSampleCounts: count = 3
                SAMPLE_COUNT_1_BIT
                SAMPLE_COUNT_4_BIT
                SAMPLE_COUNT_8_BIT
        framebufferDepthSampleCounts: count = 3
                SAMPLE_COUNT_1_BIT
                SAMPLE_COUNT_4_BIT
                SAMPLE_COUNT_8_BIT
        framebufferStencilSampleCounts: count = 3
                SAMPLE_COUNT_1_BIT
                SAMPLE_COUNT_4_BIT
                SAMPLE_COUNT_8_BIT
        framebufferNoAttachmentsSampleCounts: count = 4
                SAMPLE_COUNT_1_BIT
                SAMPLE_COUNT_4_BIT
                SAMPLE_COUNT_8_BIT
                SAMPLE_COUNT_16_BIT
        maxColorAttachments                             = 8
        sampledImageColorSampleCounts: count = 3
                SAMPLE_COUNT_1_BIT
                SAMPLE_COUNT_4_BIT
                SAMPLE_COUNT_8_BIT
        sampledImageIntegerSampleCounts: count = 3
                SAMPLE_COUNT_1_BIT
                SAMPLE_COUNT_4_BIT
                SAMPLE_COUNT_8_BIT
        sampledImageDepthSampleCounts: count = 3
                SAMPLE_COUNT_1_BIT
                SAMPLE_COUNT_4_BIT
                SAMPLE_COUNT_8_BIT
        sampledImageStencilSampleCounts: count = 3
                SAMPLE_COUNT_1_BIT
                SAMPLE_COUNT_4_BIT
                SAMPLE_COUNT_8_BIT
        storageImageSampleCounts: count = 1
                SAMPLE_COUNT_1_BIT
        maxSampleMaskWords                              = 1
        timestampComputeAndGraphics                     = true
        timestampPeriod                                 = 38.4615
        maxClipDistances                                = 0
        maxCullDistances                                = 0
        maxCombinedClipAndCullDistances                 = 0
        discreteQueuePriorities                         = 2
        pointSizeRange: count = 2
                1
                1024
        lineWidthRange: count = 2
                1
                1
        pointSizeGranularity                            = 0.0625
        lineWidthGranularity                            = 0
        strictLines                                     = true
        standardSampleLocations                         = true
        optimalBufferCopyOffsetAlignment                = 0x00000040
        optimalBufferCopyRowPitchAlignment              = 0x00000040
        nonCoherentAtomSize                             = 0x00000040

VkPhysicalDeviceSparseProperties:
---------------------------------
        residencyStandard2DBlockShape            = false
        residencyStandard2DMultisampleBlockShape = false
        residencyStandard3DBlockShape            = false
        residencyAlignedMipSize                  = false
        residencyNonResidentStrict               = false

VkPhysicalDeviceCustomBorderColorPropertiesEXT:
-----------------------------------------------
        maxCustomBorderColorSamplers = 4294967295

VkPhysicalDeviceFragmentDensityMap2PropertiesEXT:
-------------------------------------------------
        subsampledLoads                           = false
        subsampledCoarseReconstructionEarlyAccess = true
        maxSubsampledArrayLayers                  = 4096
        maxDescriptorSetSubsampledSamplers        = 8

VkPhysicalDeviceFragmentDensityMapPropertiesEXT:
------------------------------------------------
        minFragmentDensityTexelSize:
                width  = 32
                height = 32
        maxFragmentDensityTexelSize:
                width  = 32
                height = 32
        fragmentDensityInvocations = true

VkPhysicalDeviceLineRasterizationPropertiesKHR:
-----------------------------------------------
        lineSubPixelPrecisionBits = 8

VkPhysicalDeviceProvokingVertexPropertiesEXT:
---------------------------------------------
        provokingVertexModePerPipeline                       = false
        transformFeedbackPreservesTriangleFanProvokingVertex = false

VkPhysicalDeviceTransformFeedbackPropertiesEXT:
-----------------------------------------------
        maxTransformFeedbackStreams                = 1
        maxTransformFeedbackBuffers                = 4
        maxTransformFeedbackBufferSize             = 0x10000000
        maxTransformFeedbackStreamDataSize         = 512
        maxTransformFeedbackBufferDataSize         = 512
        maxTransformFeedbackBufferDataStride       = 512
        transformFeedbackQueries                   = true
        transformFeedbackStreamsLinesTriangles     = false
        transformFeedbackRasterizationStreamSelect = false
        transformFeedbackDraw                      = false

VkPhysicalDeviceVulkan11Properties:
-----------------------------------
        deviceUUID                        = 10009190-0100-0000-0000-000000000000
        driverUUID                        = 7df854a3-b874-c935-96d8-1fd67bcd838b
        deviceNodeMask                    = 0
        deviceLUIDValid                   = false
        subgroupSize                      = 16
        subgroupSupportedStages: count = 8
                SHADER_STAGE_FRAGMENT_BIT
                SHADER_STAGE_COMPUTE_BIT
                SHADER_STAGE_RAYGEN_BIT_KHR
                SHADER_STAGE_ANY_HIT_BIT_KHR
                SHADER_STAGE_CLOSEST_HIT_BIT_KHR
                SHADER_STAGE_MISS_BIT_KHR
                SHADER_STAGE_INTERSECTION_BIT_KHR
                SHADER_STAGE_CALLABLE_BIT_KHR
        subgroupSupportedOperations: count = 8
                SUBGROUP_FEATURE_BASIC_BIT
                SUBGROUP_FEATURE_VOTE_BIT
                SUBGROUP_FEATURE_ARITHMETIC_BIT
                SUBGROUP_FEATURE_BALLOT_BIT
                SUBGROUP_FEATURE_SHUFFLE_BIT
                SUBGROUP_FEATURE_SHUFFLE_RELATIVE_BIT
                SUBGROUP_FEATURE_CLUSTERED_BIT
                SUBGROUP_FEATURE_QUAD_BIT
        subgroupQuadOperationsInAllStages = false
        pointClippingBehavior             = POINT_CLIPPING_BEHAVIOR_USER_CLIP_PLANES_ONLY
        maxMultiviewViewCount             = 8
        maxMultiviewInstanceIndex         = 4294967295
        protectedNoFault                  = false
        maxPerSetDescriptors              = 500000
        maxMemoryAllocationSize           = 0x2cdf58000

VkPhysicalDeviceVulkan12Properties:
-----------------------------------
        driverID                                             = DRIVER_ID_ARM_PROPRIETARY
        driverName                                           = Mali-G57
        driverInfo                                           = v1.r40p0-01eac0.b0251c048237dcd59e6be15fba11a31a
        conformanceVersion:
                major    = 1
                minor    = 3
                subminor = 3
                patch    = 0
        denormBehaviorIndependence                           = SHADER_FLOAT_CONTROLS_INDEPENDENCE_ALL
        roundingModeIndependence                             = SHADER_FLOAT_CONTROLS_INDEPENDENCE_ALL
        shaderSignedZeroInfNanPreserveFloat16                = true
        shaderSignedZeroInfNanPreserveFloat32                = true
        shaderSignedZeroInfNanPreserveFloat64                = false
        shaderDenormPreserveFloat16                          = true
        shaderDenormPreserveFloat32                          = true
        shaderDenormPreserveFloat64                          = false
        shaderDenormFlushToZeroFloat16                       = true
        shaderDenormFlushToZeroFloat32                       = true
        shaderDenormFlushToZeroFloat64                       = false
        shaderRoundingModeRTEFloat16                         = true
        shaderRoundingModeRTEFloat32                         = true
        shaderRoundingModeRTEFloat64                         = false
        shaderRoundingModeRTZFloat16                         = true
        shaderRoundingModeRTZFloat32                         = true
        shaderRoundingModeRTZFloat64                         = false
        maxUpdateAfterBindDescriptorsInAllPools              = 4294967295
        shaderUniformBufferArrayNonUniformIndexingNative     = false
        shaderSampledImageArrayNonUniformIndexingNative      = false
        shaderStorageBufferArrayNonUniformIndexingNative     = true
        shaderStorageImageArrayNonUniformIndexingNative      = false
        shaderInputAttachmentArrayNonUniformIndexingNative   = false
        robustBufferAccessUpdateAfterBind                    = true
        quadDivergentImplicitLod                             = false
        maxPerStageDescriptorUpdateAfterBindSamplers         = 500000
        maxPerStageDescriptorUpdateAfterBindUniformBuffers   = 36
        maxPerStageDescriptorUpdateAfterBindStorageBuffers   = 500000
        maxPerStageDescriptorUpdateAfterBindSampledImages    = 500000
        maxPerStageDescriptorUpdateAfterBindStorageImages    = 500000
        maxPerStageDescriptorUpdateAfterBindInputAttachments = 9
        maxPerStageUpdateAfterBindResources                  = 500000
        maxDescriptorSetUpdateAfterBindSamplers              = 500000
        maxDescriptorSetUpdateAfterBindUniformBuffers        = 216
        maxDescriptorSetUpdateAfterBindUniformBuffersDynamic = 32
        maxDescriptorSetUpdateAfterBindStorageBuffers        = 500000
        maxDescriptorSetUpdateAfterBindStorageBuffersDynamic = 32
        maxDescriptorSetUpdateAfterBindSampledImages         = 500000
        maxDescriptorSetUpdateAfterBindStorageImages         = 500000
        maxDescriptorSetUpdateAfterBindInputAttachments      = 9
        supportedDepthResolveModes: count = 1
                RESOLVE_MODE_SAMPLE_ZERO_BIT
        supportedStencilResolveModes: count = 1
                RESOLVE_MODE_SAMPLE_ZERO_BIT
        independentResolveNone                               = false
        independentResolve                                   = false
        filterMinmaxSingleComponentFormats                   = false
        filterMinmaxImageComponentMapping                    = false
        maxTimelineSemaphoreValueDifference                  = 18446744073709551615
        framebufferIntegerColorSampleCounts: count = 3
                SAMPLE_COUNT_1_BIT
                SAMPLE_COUNT_4_BIT
                SAMPLE_COUNT_8_BIT

VkPhysicalDeviceVulkan13Properties:
-----------------------------------
        minSubgroupSize                                                               = 16
        maxSubgroupSize                                                               = 16
        maxComputeWorkgroupSubgroups                                                  = 32
        requiredSubgroupSizeStages: count = 8
                SHADER_STAGE_FRAGMENT_BIT
                SHADER_STAGE_COMPUTE_BIT
                SHADER_STAGE_RAYGEN_BIT_KHR
                SHADER_STAGE_ANY_HIT_BIT_KHR
                SHADER_STAGE_CLOSEST_HIT_BIT_KHR
                SHADER_STAGE_MISS_BIT_KHR
                SHADER_STAGE_INTERSECTION_BIT_KHR
                SHADER_STAGE_CALLABLE_BIT_KHR
        maxInlineUniformBlockSize                                                     = 65536
        maxPerStageDescriptorInlineUniformBlocks                                      = 32
        maxPerStageDescriptorUpdateAfterBindInlineUniformBlocks                       = 32
        maxDescriptorSetInlineUniformBlocks                                           = 192
        maxDescriptorSetUpdateAfterBindInlineUniformBlocks                            = 192
        maxInlineUniformTotalSize                                                     = 12582912
        integerDotProduct8BitUnsignedAccelerated                                      = true
        integerDotProduct8BitSignedAccelerated                                        = true
        integerDotProduct8BitMixedSignednessAccelerated                               = false
        integerDotProduct4x8BitPackedUnsignedAccelerated                              = true
        integerDotProduct4x8BitPackedSignedAccelerated                                = true
        integerDotProduct4x8BitPackedMixedSignednessAccelerated                       = false
        integerDotProduct16BitUnsignedAccelerated                                     = false
        integerDotProduct16BitSignedAccelerated                                       = false
        integerDotProduct16BitMixedSignednessAccelerated                              = false
        integerDotProduct32BitUnsignedAccelerated                                     = false
        integerDotProduct32BitSignedAccelerated                                       = false
        integerDotProduct32BitMixedSignednessAccelerated                              = false
        integerDotProduct64BitUnsignedAccelerated                                     = false
        integerDotProduct64BitSignedAccelerated                                       = false
        integerDotProduct64BitMixedSignednessAccelerated                              = false
        integerDotProductAccumulatingSaturating8BitUnsignedAccelerated                = true
        integerDotProductAccumulatingSaturating8BitSignedAccelerated                  = true
        integerDotProductAccumulatingSaturating8BitMixedSignednessAccelerated         = false

@FNsi
Contributor Author

FNsi commented Jan 7, 2025

And here is the log for the test.
I see some ops failed:

ABS, NEG, STEP, etc.

The log is huge, so I uploaded it to filebin.

The link is below.

https://filebin.net/7w813ekkdbgeukd3
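
For a log this large, a small script can condense the failures into a per-op count. The sketch below is a guess at the log layout, not a documented format: it assumes each result line looks like "OP_NAME(params): ... OK" or "... FAIL", possibly wrapped in ANSI color codes. The function name failing_ops is hypothetical.

```python
import re
from collections import Counter

# Strips terminal color sequences such as "\x1b[1;31m".
ANSI = re.compile(r"\x1b\[[0-9;]*m")

# Assumed line shape: "  ABS(type=f32,ne=[128,2,2,2]): FAIL"
RESULT = re.compile(r"^\s*([A-Z0-9_]+)\(.*\):.*\b(OK|FAIL)\s*$")


def failing_ops(log_text):
    """Count FAIL result lines per op name (assumes the layout described above)."""
    fails = Counter()
    for raw in log_text.splitlines():
        match = RESULT.match(ANSI.sub("", raw))
        if match and match.group(2) == "FAIL":
            fails[match.group(1)] += 1
    return fails


if __name__ == "__main__":
    import sys

    # Usage (assumed): python failing_ops.py < test-backend-ops.log
    for op, count in failing_ops(sys.stdin.read()).most_common():
        print(f"{op}: {count} failing case(s)")
```

If the real log uses a different line shape, only the RESULT pattern should need adjusting.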

@FNsi
Contributor Author

FNsi commented Jan 10, 2025

Rebuilding with OpenBLAS shows the following:

llama.cpp/ggml/src/ggml-backend.cpp:746: pre-allocated tensor (k_cache_view-0 (copy of Kcur-0)) in a buffer (Vulkan0) that cannot run the operation (CPY)
