Name and Version
build: 5774 (27208bf) with gcc-15 (Homebrew GCC 15.1.0) 15.1.0 for x86_64-pc-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
Nvidia T4
Models
https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF/resolve/main/gemma-3n-E4B-it-UD-Q4_K_XL.gguf
Problem description & steps to reproduce
I tried to measure perplexity for Gemma 3n E4B with the following command:

llama-perplexity -m /root/gemma-3n-E4B-it-UD-Q4_K_XL.gguf --swa-full -f ../input/wikitext2-data/test.txt -ngl 99

The resulting perplexity is suspiciously high (higher than that of a 0.6B model)...
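For a quicker reproduction, the same run can be limited to a subset of the input with --chunks (an existing llama-perplexity option; the value 50 below is just an arbitrary choice for illustration, not part of the original run):

# same setup as the full run above, stopping after the first 50 chunks
llama-perplexity -m /root/gemma-3n-E4B-it-UD-Q4_K_XL.gguf --swa-full \
    -f ../input/wikitext2-data/test.txt -ngl 99 --chunks 50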
First Bad Commit
No response
Relevant log output
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla T4, compute capability 7.5, VMM: yes
Device 1: Tesla T4, compute capability 7.5, VMM: yes
build: 5774 (27208bf6) with gcc-15 (Homebrew GCC 15.1.0) 15.1.0 for x86_64-pc-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (Tesla T4) - 14992 MiB free
llama_model_load_from_file_impl: using device CUDA1 (Tesla T4) - 14992 MiB free
llama_model_loader: loaded meta data with 45 key-value pairs and 847 tensors from /root/gemma-3n-E4B-it-UD-Q4_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3n
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma-3N-E4B-It
llama_model_loader: - kv 3: general.finetune str = 3n-E4B-it
llama_model_loader: - kv 4: general.basename str = Gemma-3N-E4B-It
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 6.9B
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: gemma3n.context_length u32 = 32768
llama_model_loader: - kv 9: gemma3n.embedding_length u32 = 2048
llama_model_loader: - kv 10: gemma3n.block_count u32 = 35
llama_model_loader: - kv 11: gemma3n.feed_forward_length arr[i32,35] = [16384, 16384, 16384, 16384, 16384, 1...
llama_model_loader: - kv 12: gemma3n.attention.head_count u32 = 8
llama_model_loader: - kv 13: gemma3n.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: gemma3n.attention.key_length u32 = 256
llama_model_loader: - kv 15: gemma3n.attention.value_length u32 = 256
llama_model_loader: - kv 16: gemma3n.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 17: gemma3n.attention.sliding_window u32 = 512
llama_model_loader: - kv 18: gemma3n.attention.head_count_kv u32 = 2
llama_model_loader: - kv 19: gemma3n.altup.active_idx u32 = 0
llama_model_loader: - kv 20: gemma3n.altup.num_inputs u32 = 4
llama_model_loader: - kv 21: gemma3n.embedding_length_per_layer_input u32 = 256
llama_model_loader: - kv 22: gemma3n.attention.shared_kv_layers u32 = 15
llama_model_loader: - kv 23: gemma3n.activation_sparsity_scale arr[f32,35] = [1.644854, 1.644854, 1.644854, 1.6448...
llama_model_loader: - kv 24: gemma3n.attention.sliding_window_pattern arr[bool,35] = [true, true, true, true, false, true,...
llama_model_loader: - kv 25: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 26: tokenizer.ggml.model str = llama
llama_model_loader: - kv 27: tokenizer.ggml.pre str = default
llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,262144] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 29: tokenizer.ggml.scores arr[f32,262144] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,262144] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 106
llama_model_loader: - kv 33: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 36: tokenizer.ggml.add_sep_token bool = false
llama_model_loader: - kv 37: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 39: general.quantization_version u32 = 2
llama_model_loader: - kv 40: general.file_type u32 = 15
llama_model_loader: - kv 41: quantize.imatrix.file str = gemma-3n-E4B-it-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv 42: quantize.imatrix.dataset str = unsloth_calibration_gemma-3n-E4B-it.txt
llama_model_loader: - kv 43: quantize.imatrix.entries_count u32 = 459
llama_model_loader: - kv 44: quantize.imatrix.chunks_count u32 = 1326
llama_model_loader: - type f32: 422 tensors
llama_model_loader: - type f16: 108 tensors
llama_model_loader: - type q8_0: 1 tensors
llama_model_loader: - type q4_K: 189 tensors
llama_model_loader: - type q5_K: 61 tensors
llama_model_loader: - type q6_K: 50 tensors
llama_model_loader: - type iq4_xs: 16 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 5.01 GiB (6.27 BPW)
load: special tokens cache size = 6414
load: token to piece cache size = 1.9446 MB
print_info: arch = gemma3n
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 2048
print_info: n_layer = 35
print_info: n_head = 8
print_info: n_head_kv = 2
print_info: n_rot = 256
print_info: n_swa = 512
print_info: is_swa_any = 1
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 1.0e+00
print_info: n_ff = 16384
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = E4B
print_info: model params = 6.87 B
print_info: general.name = Gemma-3N-E4B-It
print_info: vocab type = SPM
print_info: n_vocab = 262144
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 106 '<end_of_turn>'
print_info: EOT token = 106 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 248 '<0x0A>'
print_info: EOG token = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 35 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 36/36 layers to GPU
load_tensors: CUDA0 model buffer size = 1193.31 MiB
load_tensors: CUDA1 model buffer size = 3936.01 MiB
load_tensors: CPU_Mapped model buffer size = 352.00 MiB
..............................................
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 512
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (512) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 4.00 MiB
llama_kv_cache_unified_iswa: using full-size SWA cache (ref: https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 2048 cells
llama_kv_cache_unified: CUDA0 KV buffer size = 12.00 MiB
llama_kv_cache_unified: CUDA1 KV buffer size = 4.00 MiB
llama_kv_cache_unified: size = 16.00 MiB ( 2048 cells, 4 layers, 4 seqs), K (f16): 8.00 MiB, V (f16): 8.00 MiB
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 2048 cells
llama_kv_cache_unified: CUDA0 KV buffer size = 60.00 MiB
llama_kv_cache_unified: CUDA1 KV buffer size = 4.00 MiB
llama_kv_cache_unified: size = 64.00 MiB ( 2048 cells, 16 layers, 4 seqs), K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context: CUDA0 compute buffer size = 340.04 MiB
llama_context: CUDA1 compute buffer size = 710.02 MiB
llama_context: CUDA_Host compute buffer size = 36.03 MiB
llama_context: graph nodes = 3266
llama_context: graph splits = 7
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
system_info: n_threads = 2 (n_threads_batch = 2) / 4 | CUDA : ARCHS = 750 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 985.429 ms
perplexity: calculating perplexity over 532 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 3.41 seconds per pass - ETA 7.55 minutes
[1]23.8297,[2]21.4697,[3]19.8675,[4]20.9398,[5]21.7471,[6]20.4047,[7]19.3114,[8]18.7489,[9]17.5857,[10]18.3422,[11]18.6660,[12]20.5078,[13]21.8797,[14]23.6463,[15]25.4186,[16]26.1627,[17]26.1332,[18]26.6967,[19]25.7511,[20]25.7979,[21]25.3409,[22]25.8540,[23]25.8605,[24]25.2923,[25]24.8518,[26]24.9886,[27]25.2800,[28]25.7704,[29]26.1470,[30]26.0155,[31]25.8971,[32]26.0488,[33]25.5285,[34]25.6075,[35]25.6738,[36]25.6182,[37]25.4378,[38]25.7397,[39]25.7518,[40]25.6764,[41]25.4600,[42]25.1981,[43]24.9798,[44]24.7480,[45]24.7220,[46]24.8355,[47]24.8683,[48]24.9370,[49]25.2017,[50]25.2371,[51]25.3599,[52]25.4731,[53]25.2945,[54]25.5673,[55]25.6332,[56]25.8897,[57]26.1280,[58]25.8219,[59]25.8321,[60]25.7373,[61]25.8483,[62]25.9158,[63]25.9633,[64]26.0994,[65]26.0959,[66]25.8278,[67]25.7536,[68]26.0287,[69]26.1151,[70]26.2757,[71]26.2774,[72]26.4018,[73]26.3057,[74]26.5324,[75]26.7150,[76]26.6143,[77]26.7592,[78]26.9052,[79]27.0864,[80]27.0957,[81]27.1244,[82]27.3395,[83]27.3444,[84]27.4042,[85]27.4717,[86]27.4674,[87]27.5009,[88]27.6106,[89]27.5143,[90]27.5731,[91]27.5038,[92]27.5119,[93]27.4446,[94]27.4645,[95]27.5988,[96]27.5769,[97]27.5820,[98]27.7249,[99]27.6260,[100]27.6273,[101]27.6263,[102]27.5036,[103]27.5403,[104]27.5866,[105]27.7558,[106]27.8412,[107]27.9647,[108]27.9683,[109]27.9077,[110]27.9572,[111]27.9905,[112]28.0027,[113]28.0377,[114]27.9575,[115]27.9762,[116]27.9798,[117]27.8716,[118]27.7931,[119]27.7752,[120]27.7087,[121]27.7459,[122]27.7443,[123]27.8200,[124]27.9456,[125]27.9557,[126]27.9473,[127]28.0533,[128]27.9292,[129]27.8342,[130]27.7237,[131]27.5833,[132]27.5382,[133]27.6435,[134]27.7647,[135]27.7138,[136]27.6233,[137]27.5075,[138]27.4916,[139]27.4595,[140]27.5551,[141]27.4921,[142]27.4826,[143]27.4461,[144]27.4003,[145]27.2686,[146]27.2675,[147]27.2412,[148]27.2477,[149]27.1412,[150]27.0003,[151]26.9522,[152]26.9684,[153]26.9275,[154]26.9112,[155]26.9750,[156]27.0029,[157]27.0352,[158]27.0390,[159]27.0937,[160]26.8343,[161]26.9225,[162]26.8017,[163]26.6215,[164]26.4313,[165]26.3793,[166]26.4348,[167]26.4706,[168]26.4991,[169]26.5127,[170]26.6560,[171]26.6228,[172]26.6241,[173]26.6411,[174]26.5995,[175]26.4470,[176]26.4427,[177]26.3525,[178]26.2595,[179]26.2565,[180]26.2296,[181]26.3100,[182]26.2687,[183]26.1074,[184]25.9805,[185]25.9332,[186]25.8729,[187]25.8048,[188]25.8179,[189]25.8426,[190]25.7795,[191]25.8436,[192]25.8863,[193]25.8545,[194]25.8786,[195]25.9061,[196]25.9054,[197]25.9396,[198]25.9889,[199]25.9713,[200]25.9217,[201]25.9262,[202]25.9201,[203]25.9558,[204]25.9969,[205]26.0124,[206]26.0439,[207]25.9325,[208]25.8800,[209]25.8808,[210]25.9261,[211]25.9196,[212]25.9495,[213]25.9605,[214]25.9565,[215]25.9415,[216]25.8854,[217]25.8848,[218]25.9209,[219]25.9732,[220]26.0084,[221]25.9523,[222]25.9377,[223]25.9203,[224]25.9376,[225]25.9292,[226]25.8885,[227]25.9160,[228]25.9463,[229]25.9676,[230]26.1019,[231]26.1137,[232]26.1139,[233]26.1700,[234]26.0752,[235]26.1799,[236]26.1678,[237]26.1288,[238]26.1370,[239]26.1215,[240]26.1561,[241]26.2353,[242]26.2645,[243]26.2893,[244]26.3168,[245]26.3145,[246]26.3359,[247]26.3600,[248]26.3314,[249]26.2218,[250]26.0706,[251]25.8787,[252]25.8685,[253]25.8457,[254]25.9134,[255]25.8489,[256]25.8562,[257]25.8208,[258]25.8898,[259]25.9274,[260]26.0060,[261]26.0470,[262]26.1211,[263]26.1937,[264]26.2456,[265]26.2707,[266]26.2576,[267]26.3469,[268]26.3961,[269]26.5014,[270]26.5304,[271]26.5131,[272]26.5463,[273]26.6238,[274]26.6725,[275]26.7412,[276]26.8189,[277]26.8357,[278]26.9189,[279]26.9522,[280]26.9903,[281]26.9915,[282]26.9971,[283]26.9208,[284]26.8762,[285]26.9048,[286]26.8862,[287]26.8703,[288]26.8434,[289]26.7763,[290]26.8140,[291]26.8192,[292]26.8244,[293]26.8624,[294]26.8272,[295]26.8488,[296]26.8648,[297]26.8560,[298]26.8689,[299]26.8825,[300]26.8355,[301]26.8516,[302]26.8123,[303]26.7612,[304]26.7808,[305]26.7577,[306]26.7796,[307]26.7596,[308]26.7784,[309]26.8219,[310]26.8545,[311]26.8838,[312]26.9088,[313]26.9006,[314]26.8698,[315]26.9429,[316]26.9790,[317]26.9951,[318]26.9599,[319]26.9630,[320]27.0035,[321]27.0170,[322]27.0361,[323]27.0361,[324]27.0132,[325]27.0399,[326]27.0052,[327]26.9591,[328]26.9027,[329]26.8787,[330]26.8067,[331]26.7859,[332]26.6854,[333]26.6639,[334]26.6614,[335]26.7275,[336]26.7922,[337]26.8346,[338]26.9071,[339]26.9655,[340]26.9734,[341]26.9175,[342]26.8995,[343]26.9298,[344]26.9893,[345]26.9540,[346]26.9255,[347]26.9037,[348]26.8847,[349]26.9223,[350]26.9584,[351]27.0009,[352]26.9464,[353]26.8788,[354]26.8299,[355]26.7940,[356]26.8060,[357]26.8025,[358]26.8143,[359]26.8260,[360]26.7987,[361]26.8218,[362]26.8274,[363]26.8331,[364]26.8544,[365]26.8872,[366]26.8925,[367]26.8687,[368]26.8398,[369]26.7781,[370]26.7687,[371]26.8239,[372]26.8401,[373]26.8716,[374]26.8576,[375]26.8738,[376]26.8856,[377]26.8866,[378]26.8773,[379]26.8775,[380]26.8859,[381]26.8796,[382]26.9003,[383]26.8903,[384]26.8802,[385]26.8459,[386]26.8440,[387]26.8710,[388]26.8773,[389]26.8797,[390]26.8814,[391]26.9455,[392]26.9871,[393]26.9954,[394]26.9939,[395]27.0530,[396]27.0318,[397]27.0317,[398]27.0786,[399]27.0593,[400]27.0882,[401]27.1107,[402]27.1271,[403]27.1425,[404]27.1423,[405]27.1388,[406]27.1679,[407]27.2116,[408]27.2181,[409]27.2615,[410]27.2543,[411]27.2732,[412]27.2989,[413]27.3537,[414]27.3887,[415]27.3879,[416]27.4353,[417]27.4472,[418]27.4668,[419]27.5068,[420]27.5152,[421]27.5175,[422]27.4868,[423]27.5219,[424]27.4988,[425]27.4322,[426]27.4610,[427]27.4069,[428]27.4414,[429]27.4388,[430]27.4644,[431]27.4619,[432]27.4512,[433]27.4116,[434]27.3659,[435]27.3502,[436]27.3367,[437]27.3485,[438]27.3250,[439]27.3275,[440]27.3153,[441]27.3112,[442]27.3381,[443]27.3393,[444]27.3627,[445]27.3775,[446]27.3604,[447]27.3644,[448]27.3219,[449]27.3312,[450]27.3573,[451]27.3342,[452]27.3214,[453]27.3026,[454]27.2898,[455]27.2787,[456]27.2584,[457]27.2627,[458]27.2394,[459]27.1890,[460]27.1999,[461]27.1951,[462]27.1812,[463]27.1936,[464]27.1915,[465]27.1769,[466]27.2099,[467]27.2091,[468]27.2009,[469]27.2157,[470]27.2299,[471]27.2268,[472]27.2423,[473]27.2762,[474]27.2783,[475]27.2911,[476]27.3177,[477]27.2916,[478]27.3253,[479]27.3506,[480]27.3878,[481]27.4509,[482]27.4447,[483]27.4801,[484]27.4877,[485]27.5037,[486]27.5085,[487]27.5522,[488]27.5603,[489]27.5835,[490]27.5795,[491]27.5662,[492]27.5285,[493]27.5426,[494]27.5332,[495]27.5383,[496]27.5407,[497]27.5609,[498]27.5783,[499]27.5981,[500]27.5826,[501]27.5816,[502]27.5563,[503]27.5201,[504]27.4751,[505]27.4470,[506]27.4330,[507]27.4339,[508]27.4051,[509]27.3876,[510]27.3684,[511]27.4028,[512]27.4015,[513]27.3931,[514]27.3810,[515]27.3919,[516]27.3152,[517]27.3478,[518]27.3631,[519]27.3946,[520]27.3775,[521]27.3571,[522]27.3861,[523]27.4071,[524]27.4243,[525]27.3913,[526]27.3369,[527]27.2871,[528]27.2738,[529]27.2761,[530]27.2884,[531]27.2961,[532]27.3055,
Final estimate: PPL = 27.3055 +/- 0.32184
llama_perf_context_print: load time = 2386.74 ms
llama_perf_context_print: prompt eval time = 321380.19 ms / 272384 tokens ( 1.18 ms per token, 847.54 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 413892.63 ms / 272385 tokens