OOM on Jetson Orin Nano and gemma-3n-E2B-it #16706

Fhrozen · 2025-10-21T17:30:07Z

Fhrozen
Oct 21, 2025

I am trying to launch a model with llama.cpp on a Jetson Orin Nano device, but I am getting an OOM error each time that I try to run with the full model.

I used the llama.cpp@03792ad, built with:

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

Then, I tested llama-cli with:

llama-cli -hf unsloth/gemma-3n-E2B-it-GGUF:Q4_K_XL --n-gpu-layers 30

Using the whole model (31 layers) gave me this output:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Orin, compute capability 8.7, VMM: yes
*   Trying 3.164.110.114:443...
* Connected to huggingface.co (3.164.110.114) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=huggingface.co
*  start date: Apr 13 00:00:00 2025 GMT
*  expire date: May 12 23:59:59 2026 GMT
*  subjectAltName: host "huggingface.co" matched cert's "huggingface.co"
*  issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M02
*  SSL certificate verify ok.
* Using HTTP2, server supports multiplexing
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0xaaab01024d40)
> GET /v2/unsloth/gemma-3n-E2B-it-GGUF/manifests/Q4_K_XL HTTP/2
Host: huggingface.co
user-agent: llama-cpp
accept: application/json

* Connection state changed (MAX_CONCURRENT_STREAMS == 128)!
...
* Connection #0 to host huggingface.co left intact
common_download_file_single_online: using cached file: /home/nelson/.cache/llama.cpp/unsloth_gemma-3n-E2B-it-GGUF_gemma-3n-E2B-it-UD-Q4_K_XL.gguf
build: 6816 (03792ad9) with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for aarch64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (Orin) (0000:00:00.0) - 6687 MiB free
llama_model_loader: loaded meta data with 51 key-value pairs and 727 tensors from /home/nelson/.cache/llama.cpp/unsloth_gemma-3n-E2B-it-GGUF_gemma-3n-E2B-it-UD-Q4_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3n
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma-3N-E2B-It
llama_model_loader: - kv   3:                           general.finetune str              = 3n-E2B-it
llama_model_loader: - kv   4:                           general.basename str              = Gemma-3N-E2B-It
llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   6:                         general.size_label str              = 4.5B
llama_model_loader: - kv   7:                            general.license str              = gemma
llama_model_loader: - kv   8:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
llama_model_loader: - kv  10:                  general.base_model.0.name str              = Gemma 3n E2B It
llama_model_loader: - kv  11:          general.base_model.0.organization str              = Google
llama_model_loader: - kv  12:              general.base_model.0.repo_url str              = https://huggingface.co/google/gemma-3...
llama_model_loader: - kv  13:                               general.tags arr[str,6]       = ["automatic-speech-recognition", "uns...
llama_model_loader: - kv  14:                     gemma3n.context_length u32              = 32768
llama_model_loader: - kv  15:                   gemma3n.embedding_length u32              = 2048
llama_model_loader: - kv  16:                        gemma3n.block_count u32              = 30
llama_model_loader: - kv  17:                gemma3n.feed_forward_length arr[i32,30]      = [8192, 8192, 8192, 8192, 8192, 8192, ...
llama_model_loader: - kv  18:               gemma3n.attention.head_count u32              = 8
llama_model_loader: - kv  19:   gemma3n.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  20:               gemma3n.attention.key_length u32              = 256
llama_model_loader: - kv  21:             gemma3n.attention.value_length u32              = 256
llama_model_loader: - kv  22:                     gemma3n.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  23:           gemma3n.attention.sliding_window u32              = 512
llama_model_loader: - kv  24:            gemma3n.attention.head_count_kv u32              = 2
llama_model_loader: - kv  25:                   gemma3n.altup.active_idx u32              = 0
llama_model_loader: - kv  26:                   gemma3n.altup.num_inputs u32              = 4
llama_model_loader: - kv  27:   gemma3n.embedding_length_per_layer_input u32              = 256
llama_model_loader: - kv  28:         gemma3n.attention.shared_kv_layers u32              = 10
llama_model_loader: - kv  29:          gemma3n.activation_sparsity_scale arr[f32,30]      = [1.644854, 1.644854, 1.644854, 1.6448...
llama_model_loader: - kv  30:   gemma3n.attention.sliding_window_pattern arr[bool,30]     = [true, true, true, true, false, true,...
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  32:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  33:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  34:                      tokenizer.ggml.tokens arr[str,262144]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  35:                      tokenizer.ggml.scores arr[f32,262144]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  36:                  tokenizer.ggml.token_type arr[i32,262144]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  37:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 106
llama_model_loader: - kv  39:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  40:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  41:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  42:               tokenizer.ggml.add_sep_token bool             = false
llama_model_loader: - kv  43:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  44:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  45:               general.quantization_version u32              = 2
llama_model_loader: - kv  46:                          general.file_type u32              = 15
llama_model_loader: - kv  47:                      quantize.imatrix.file str              = gemma-3n-E2B-it-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv  48:                   quantize.imatrix.dataset str              = unsloth_calibration_gemma-3n-E2B-it.txt
llama_model_loader: - kv  49:             quantize.imatrix.entries_count u32              = 400
llama_model_loader: - kv  50:              quantize.imatrix.chunks_count u32              = 1326
llama_model_loader: - type  f32:  362 tensors
llama_model_loader: - type  f16:   93 tensors
llama_model_loader: - type q8_0:    1 tensors
llama_model_loader: - type q4_K:  156 tensors
llama_model_loader: - type q5_K:   56 tensors
llama_model_loader: - type q6_K:   43 tensors
llama_model_loader: - type iq4_xs:   16 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 3.49 GiB (6.73 BPW)
load: printing all EOG tokens:
load:   - 106 ('<end_of_turn>')
load: special tokens cache size = 6414
load: token to piece cache size = 1.9446 MB
print_info: arch             = gemma3n
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 2048
print_info: n_layer          = 30
print_info: n_head           = 8
print_info: n_head_kv        = 2
print_info: n_rot            = 256
print_info: n_swa            = 512
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 1.0e+00
print_info: n_ff             = 8192
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: model type       = E2B
print_info: model params     = 4.46 B
print_info: general.name     = Gemma-3N-E2B-It
print_info: vocab type       = SPM
print_info: n_vocab          = 262144
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 106 '<end_of_turn>'
print_info: EOT token        = 106 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 248 '<0x0A>'
print_info: EOG token        = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
NvMapMemAllocInternalTagged: 1075072515 error 12
NvMapMemHandleAlloc: error 0
NvMapMemAllocInternalTagged: 1075072515 error 12
NvMapMemHandleAlloc: error 0
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 3573.76 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 3747356416
llama_model_load: error loading model: unable to allocate CUDA0 buffer
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/home/nelson/.cache/llama.cpp/unsloth_gemma-3n-E2B-it-GGUF_gemma-3n-E2B-it-UD-Q4_K_XL.gguf', try reducing --n-gpu-layers if you're running out of VRAM
main: error: unable to load model

As you read, the device has device CUDA0 (Orin) (0000:00:00.0) - 6687 MiB free and the model would only require 3573.76 MiB.

If someone had the same problem, could you guide me through? Also, I tried jetson-containers, but it seems outdated (got no gemma-3n model available error).

BTW, I also tried using GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 llama-cli but got the same error. (Not sure if the unified memory is working, I have a swap memory of 128Gb)

Tennyleaz · 2025-11-17T09:25:05Z

Tennyleaz
Nov 17, 2025

Similar issue to me.
Running Qwen3-4B-Instruct-2507-Q3_K_M, with full layers usually lead to out of memory error. Only if I reduce -ngl size, could I run this model.

0 replies

TinyServal · 2025-11-18T01:02:29Z

TinyServal
Nov 18, 2025

Drop all caches with sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' and retry with mmap disabled. GGML_CUDA_ENABLE_UNIFIED_MEMORY is not relevant on this platform, "Unified Memory" in CUDA is not referring to the memory architecture on the Jetson, it's for managed memory allocations across VRAM and host DRAM on desktops.

You might also want to give this a read and disable VPR carveout if it's enabled. The post is about a different board but the VPR carveout is still a thing on Orin. https://forums.developer.nvidia.com/t/jp-5-0-2-missing-1gb-volatile-memory/229214

0 replies

TinyServal · 2025-11-18T01:13:03Z

TinyServal
Nov 18, 2025

Looks like it's a bug that NVIDIA still hasn't fixed. Only reliable mitigation is to reflash your system with Jetpack 6.2.1 and not upgrade any packages. Always use the NVIDIA forum for Jetson support, very few users here have access to Jetson systems. https://forums.developer.nvidia.com/t/unable-to-allocate-cuda0-buffer-after-updating-ubuntu-packages/347862

0 replies

neocoretechs · 2025-12-23T11:12:08Z

neocoretechs
Dec 23, 2025

Still not fixed. Nvidia offers no workaround, no patch. Its across the entire ecosystem, Llama.cpp is broken on all apps and platforms that upgraded to Jetpack r35.6.x and onward. I just bought and flashed and built for 2 new Jetson Orin Nanos and not looking forward to downgrade doing it all over twice. Lots of other users on forums also stranded. Completely an NVidia memory allocator problem but I guess once you reach 5 trillion valuation you dont need to worry about things like users being able to use your products.

0 replies

TinyServal · 2025-12-23T12:29:31Z

TinyServal
Dec 23, 2025

The nature of the Jetson Linux 36.4.7 release is very dodgy, it was originally meant to be a patch release to fix some vulnerabilities, but to this day they still have not released any BSP or sources after more than two months. They might have some ridiculously long embargo period that they refuse to disclose, personally I don't see this getting fixed before 2026 rolls around.

0 replies

elfarolab · 2025-12-23T18:30:17Z

elfarolab
Dec 23, 2025

hello,

AGX Orin user here, tegra 36.4.7, for me it is working at ~37 tokens/s, model Qwen3-VL-30B-A3B-Instruct. I think it is acceptable.
Other most common models are running at the half of the speed such as: Ministral, GLM-4.6, Qwen3-Nemotron-14B.

Thank you so much to everybody.

0 replies

neocoretechs · 2025-12-28T18:20:15Z

neocoretechs
Dec 28, 2025

If models are running at half the speed it means the Nvidia memory allocator failed and it fell back to CPU based inference. If your build supports that mode it will hobble along with CPU-only inference.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

OOM on Jetson Orin Nano and gemma-3n-E2B-it #16706

Uh oh!

{{title}}

Uh oh!

Replies: 7 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!