Replies: 7 comments
- Similar issue here.
- Drop all caches with `sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'` and see if the memory comes back (sketch below). You might also want to give this a read and disable the VPR carveout if it's enabled. The post is about a different board, but the VPR carveout is still a thing on Orin: https://forums.developer.nvidia.com/t/jp-5-0-2-missing-1gb-volatile-memory/229214
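A minimal sketch of those checks (standard Linux/L4T commands; the grep pattern is just one way to spot carveout reservations, not an official interface):

```sh
# Flush the page cache, dentries, and inodes so cached file data
# stops counting against available memory.
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

# Confirm how much memory is actually free afterwards.
free -h

# Look for carveout reservations (VPR among them) in the kernel log.
sudo dmesg | grep -iE 'carveout|vpr'
```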
- Looks like it's a bug that NVIDIA still hasn't fixed. The only reliable mitigation is to reflash your system with JetPack 6.2.1 and not upgrade any packages (see the pinning sketch below). Also, always use the NVIDIA forum for Jetson support; very few users here have access to Jetson systems. https://forums.developer.nvidia.com/t/unable-to-allocate-cuda0-buffer-after-updating-ubuntu-packages/347862
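If you go the reflash route, one way to keep `apt upgrade` from pulling the regressed packages back in is to hold everything from the BSP (a sketch, assuming the usual `nvidia-l4t-*` package naming on JetPack):

```sh
# Put every installed L4T package on hold so apt leaves it
# at the version that was flashed.
dpkg -l | awk '/^ii +nvidia-l4t-/ {print $2}' | xargs sudo apt-mark hold

# Verify the holds took effect.
apt-mark showhold
```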
- Still not fixed. NVIDIA offers no workaround and no patch. It's across the entire ecosystem: llama.cpp is broken on every app and platform that upgraded to JetPack r35.6.x and onward. I just bought, flashed, and built for two new Jetson Orin Nanos, and I'm not looking forward to downgrading and doing it all over twice. Lots of other users on the forums are stranded too. It's entirely an NVIDIA memory-allocator problem, but I guess once you reach a $5 trillion valuation you don't need to worry about things like users being able to use your products.
- The nature of the Jetson Linux 36.4.7 release is very dodgy. It was originally meant to be a patch release to fix some vulnerabilities, yet more than two months later they still have not released any BSP or sources. They might have some ridiculously long embargo period that they refuse to disclose; personally, I don't see this getting fixed before 2026 rolls around.
- Hello, AGX Orin user here on Tegra 36.4.7. For me it is working at ~37 tokens/s with Qwen3-VL-30B-A3B-Instruct. I think that's acceptable. Thank you so much, everybody.
- If models are running at half speed, it means the NVIDIA memory allocator failed and inference fell back to the CPU. If your build supports that fallback mode, it will hobble along with CPU-only inference instead of erroring out. You can confirm which path you're on with the sketch below.
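One way to check (a sketch: `tegrastats` ships with L4T, and the model path here is a placeholder):

```sh
# Terminal 1: watch GPU load. GR3D_FREQ stuck near 0% while tokens
# are generating means the GPU is idle and you're on the CPU path.
sudo tegrastats

# Terminal 2: request full offload, then check the startup log for
# how many layers actually landed on CUDA0.
./llama-cli -m model.gguf -ngl 99 -p "test" -n 64
```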
I am trying to launch a model with llama.cpp on a Jetson Orin Nano, but I get an OOM error every time I try to run the full model.
I used llama.cpp@03792ad, built roughly as follows (quoting the standard CUDA build, since my exact flags may have differed):
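```sh
# CUDA backend on; 87 targets Orin's SM 8.7.
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=87
cmake --build build --config Release -j
```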
Then I tested llama-cli with (approximately; the model path is a placeholder):
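```sh
# Placeholder GGUF path; -ngl 31 offloads all 31 layers to the GPU.
./build/bin/llama-cli -m models/gemma-3n.gguf -ngl 31 -p "Hello"
```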
Using the whole model (31 layers) gave me an out-of-memory error. As you can read in the log, the device has `device CUDA0 (Orin) (0000:00:00.0) - 6687 MiB free` and the model would only require `3573.76 MiB`. If someone has had the same problem, could you guide me through it? Also, I tried jetson-containers, but it seems outdated (I got a "no gemma-3n model available" error).
BTW, I also tried using `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 llama-cli` but got the same error. (Not sure if the unified memory is working; I have 128 GB of swap.)
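For completeness, the retry looked like this. `GGML_CUDA_ENABLE_UNIFIED_MEMORY` makes the llama.cpp CUDA backend allocate with cudaMallocManaged, which in principle lets buffers spill past physical VRAM:

```sh
# Same placeholder path as above; only the env var changes.
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./build/bin/llama-cli -m models/gemma-3n.gguf -ngl 31 -p "Hello"
```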