Eval bug: Extreme perplexity for gemma 3n #14437

Open
@pt13762104

Name and Version

build: 5774 (27208bf) with gcc-15 (Homebrew GCC 15.1.0) 15.1.0 for x86_64-pc-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

Nvidia T4

Models

https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF/resolve/main/gemma-3n-E4B-it-UD-Q4_K_XL.gguf

Problem description & steps to reproduce

I tried to measure perplexity for Gemma 3N E4B using this command: llama-perplexity -m /root/gemma-3n-E4B-it-UD-Q4_K_XL.gguf --swa-full -f ../input/wikitext2-data/test.txt -ngl 99. The resulting perplexity is suspiciously high (higher than that of a 0.6B model).
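For reference, a minimal reproduction sketch (the wget step and target directory are my assumption; the perplexity command and paths are exactly the ones from my run):

  # fetch the quantized model from the Hugging Face repo linked above
  wget -P /root https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF/resolve/main/gemma-3n-E4B-it-UD-Q4_K_XL.gguf
  # run wikitext-2 perplexity with all layers offloaded and a full-size SWA cache
  llama-perplexity -m /root/gemma-3n-E4B-it-UD-Q4_K_XL.gguf --swa-full -f ../input/wikitext2-data/test.txt -ngl 99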

First Bad Commit

No response

Relevant log output

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
  Device 1: Tesla T4, compute capability 7.5, VMM: yes
build: 5774 (27208bf6) with gcc-15 (Homebrew GCC 15.1.0) 15.1.0 for x86_64-pc-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (Tesla T4) - 14992 MiB free
llama_model_load_from_file_impl: using device CUDA1 (Tesla T4) - 14992 MiB free
llama_model_loader: loaded meta data with 45 key-value pairs and 847 tensors from /root/gemma-3n-E4B-it-UD-Q4_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3n
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma-3N-E4B-It
llama_model_loader: - kv   3:                           general.finetune str              = 3n-E4B-it
llama_model_loader: - kv   4:                           general.basename str              = Gemma-3N-E4B-It
llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   6:                         general.size_label str              = 6.9B
llama_model_loader: - kv   7:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   8:                     gemma3n.context_length u32              = 32768
llama_model_loader: - kv   9:                   gemma3n.embedding_length u32              = 2048
llama_model_loader: - kv  10:                        gemma3n.block_count u32              = 35
llama_model_loader: - kv  11:                gemma3n.feed_forward_length arr[i32,35]      = [16384, 16384, 16384, 16384, 16384, 1...
llama_model_loader: - kv  12:               gemma3n.attention.head_count u32              = 8
llama_model_loader: - kv  13:   gemma3n.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:               gemma3n.attention.key_length u32              = 256
llama_model_loader: - kv  15:             gemma3n.attention.value_length u32              = 256
llama_model_loader: - kv  16:                     gemma3n.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  17:           gemma3n.attention.sliding_window u32              = 512
llama_model_loader: - kv  18:            gemma3n.attention.head_count_kv u32              = 2
llama_model_loader: - kv  19:                   gemma3n.altup.active_idx u32              = 0
llama_model_loader: - kv  20:                   gemma3n.altup.num_inputs u32              = 4
llama_model_loader: - kv  21:   gemma3n.embedding_length_per_layer_input u32              = 256
llama_model_loader: - kv  22:         gemma3n.attention.shared_kv_layers u32              = 15
llama_model_loader: - kv  23:          gemma3n.activation_sparsity_scale arr[f32,35]      = [1.644854, 1.644854, 1.644854, 1.6448...
llama_model_loader: - kv  24:   gemma3n.attention.sliding_window_pattern arr[bool,35]     = [true, true, true, true, false, true,...
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,262144]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  29:                      tokenizer.ggml.scores arr[f32,262144]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  30:                  tokenizer.ggml.token_type arr[i32,262144]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 106
llama_model_loader: - kv  33:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  34:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  35:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  36:               tokenizer.ggml.add_sep_token bool             = false
llama_model_loader: - kv  37:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  38:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  39:               general.quantization_version u32              = 2
llama_model_loader: - kv  40:                          general.file_type u32              = 15
llama_model_loader: - kv  41:                      quantize.imatrix.file str              = gemma-3n-E4B-it-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv  42:                   quantize.imatrix.dataset str              = unsloth_calibration_gemma-3n-E4B-it.txt
llama_model_loader: - kv  43:             quantize.imatrix.entries_count u32              = 459
llama_model_loader: - kv  44:              quantize.imatrix.chunks_count u32              = 1326
llama_model_loader: - type  f32:  422 tensors
llama_model_loader: - type  f16:  108 tensors
llama_model_loader: - type q8_0:    1 tensors
llama_model_loader: - type q4_K:  189 tensors
llama_model_loader: - type q5_K:   61 tensors
llama_model_loader: - type q6_K:   50 tensors
llama_model_loader: - type iq4_xs:   16 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 5.01 GiB (6.27 BPW) 
load: special tokens cache size = 6414
load: token to piece cache size = 1.9446 MB
print_info: arch             = gemma3n
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 2048
print_info: n_layer          = 35
print_info: n_head           = 8
print_info: n_head_kv        = 2
print_info: n_rot            = 256
print_info: n_swa            = 512
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 1.0e+00
print_info: n_ff             = 16384
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = E4B
print_info: model params     = 6.87 B
print_info: general.name     = Gemma-3N-E4B-It
print_info: vocab type       = SPM
print_info: n_vocab          = 262144
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 106 '<end_of_turn>'
print_info: EOT token        = 106 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 248 '<0x0A>'
print_info: EOG token        = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 35 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 36/36 layers to GPU
load_tensors:        CUDA0 model buffer size =  1193.31 MiB
load_tensors:        CUDA1 model buffer size =  3936.01 MiB
load_tensors:   CPU_Mapped model buffer size =   352.00 MiB
..............................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 2048
llama_context: n_ctx_per_seq = 512
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (512) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     4.00 MiB
llama_kv_cache_unified_iswa: using full-size SWA cache (ref: https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 2048 cells
llama_kv_cache_unified:      CUDA0 KV buffer size =    12.00 MiB
llama_kv_cache_unified:      CUDA1 KV buffer size =     4.00 MiB
llama_kv_cache_unified: size =   16.00 MiB (  2048 cells,   4 layers,  4 seqs), K (f16):    8.00 MiB, V (f16):    8.00 MiB
llama_kv_cache_unified_iswa: creating     SWA KV cache, size = 2048 cells
llama_kv_cache_unified:      CUDA0 KV buffer size =    60.00 MiB
llama_kv_cache_unified:      CUDA1 KV buffer size =     4.00 MiB
llama_kv_cache_unified: size =   64.00 MiB (  2048 cells,  16 layers,  4 seqs), K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context:      CUDA0 compute buffer size =   340.04 MiB
llama_context:      CUDA1 compute buffer size =   710.02 MiB
llama_context:  CUDA_Host compute buffer size =    36.03 MiB
llama_context: graph nodes  = 3266
llama_context: graph splits = 7
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 2 (n_threads_batch = 2) / 4 | CUDA : ARCHS = 750 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 985.429 ms
perplexity: calculating perplexity over 532 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 3.41 seconds per pass - ETA 7.55 minutes
[1]23.8297,[2]21.4697,[3]19.8675,[4]20.9398,[5]21.7471,[6]20.4047,[7]19.3114,[8]18.7489,[9]17.5857,[10]18.3422,[11]18.6660,[12]20.5078,[13]21.8797,[14]23.6463,[15]25.4186,[16]26.1627,[17]26.1332,[18]26.6967,[19]25.7511,[20]25.7979,[21]25.3409,[22]25.8540,[23]25.8605,[24]25.2923,[25]24.8518,[26]24.9886,[27]25.2800,[28]25.7704,[29]26.1470,[30]26.0155,[31]25.8971,[32]26.0488,[33]25.5285,[34]25.6075,[35]25.6738,[36]25.6182,[37]25.4378,[38]25.7397,[39]25.7518,[40]25.6764,[41]25.4600,[42]25.1981,[43]24.9798,[44]24.7480,[45]24.7220,[46]24.8355,[47]24.8683,[48]24.9370,[49]25.2017,[50]25.2371,[51]25.3599,[52]25.4731,[53]25.2945,[54]25.5673,[55]25.6332,[56]25.8897,[57]26.1280,[58]25.8219,[59]25.8321,[60]25.7373,[61]25.8483,[62]25.9158,[63]25.9633,[64]26.0994,[65]26.0959,[66]25.8278,[67]25.7536,[68]26.0287,[69]26.1151,[70]26.2757,[71]26.2774,[72]26.4018,[73]26.3057,[74]26.5324,[75]26.7150,[76]26.6143,[77]26.7592,[78]26.9052,[79]27.0864,[80]27.0957,[81]27.1244,[82]27.3395,[83]27.3444,[84]27.4042,[85]27.4717,[86]27.4674,[87]27.5009,[88]27.6106,[89]27.5143,[90]27.5731,[91]27.5038,[92]27.5119,[93]27.4446,[94]27.4645,[95]27.5988,[96]27.5769,[97]27.5820,[98]27.7249,[99]27.6260,[100]27.6273,[101]27.6263,[102]27.5036,[103]27.5403,[104]27.5866,[105]27.7558,[106]27.8412,[107]27.9647,[108]27.9683,[109]27.9077,[110]27.9572,[111]27.9905,[112]28.0027,[113]28.0377,[114]27.9575,[115]27.9762,[116]27.9798,[117]27.8716,[118]27.7931,[119]27.7752,[120]27.7087,[121]27.7459,[122]27.7443,[123]27.8200,[124]27.9456,[125]27.9557,[126]27.9473,[127]28.0533,[128]27.9292,[129]27.8342,[130]27.7237,[131]27.5833,[132]27.5382,[133]27.6435,[134]27.7647,[135]27.7138,[136]27.6233,[137]27.5075,[138]27.4916,[139]27.4595,[140]27.5551,[141]27.4921,[142]27.4826,[143]27.4461,[144]27.4003,[145]27.2686,[146]27.2675,[147]27.2412,[148]27.2477,[149]27.1412,[150]27.0003,[151]26.9522,[152]26.9684,[153]26.9275,[154]26.9112,[155]26.9750,[156]27.0029,[157]27.0352,[158]27.0390,[159]27.0937,[160]26.8343,[161]26.9225,[162]26.8017,[163]26.6215,[164]26.4313,[165]26.3793,[166]26.4348,[167]26.4706,[168]26.4991,[169]26.5127,[170]26.6560,[171]26.6228,[172]26.6241,[173]26.6411,[174]26.5995,[175]26.4470,[176]26.4427,[177]26.3525,[178]26.2595,[179]26.2565,[180]26.2296,[181]26.3100,[182]26.2687,[183]26.1074,[184]25.9805,[185]25.9332,[186]25.8729,[187]25.8048,[188]25.8179,[189]25.8426,[190]25.7795,[191]25.8436,[192]25.8863,[193]25.8545,[194]25.8786,[195]25.9061,[196]25.9054,[197]25.9396,[198]25.9889,[199]25.9713,[200]25.9217,[201]25.9262,[202]25.9201,[203]25.9558,[204]25.9969,[205]26.0124,[206]26.0439,[207]25.9325,[208]25.8800,[209]25.8808,[210]25.9261,[211]25.9196,[212]25.9495,[213]25.9605,[214]25.9565,[215]25.9415,[216]25.8854,[217]25.8848,[218]25.9209,[219]25.9732,[220]26.0084,[221]25.9523,[222]25.9377,[223]25.9203,[224]25.9376,[225]25.9292,[226]25.8885,[227]25.9160,[228]25.9463,[229]25.9676,[230]26.1019,[231]26.1137,[232]26.1139,[233]26.1700,[234]26.0752,[235]26.1799,[236]26.1678,[237]26.1288,[238]26.1370,[239]26.1215,[240]26.1561,[241]26.2353,[242]26.2645,[243]26.2893,[244]26.3168,[245]26.3145,[246]26.3359,[247]26.3600,[248]26.3314,[249]26.2218,[250]26.0706,[251]25.8787,[252]25.8685,[253]25.8457,[254]25.9134,[255]25.8489,[256]25.8562,[257]25.8208,[258]25.8898,[259]25.9274,[260]26.0060,[261]26.0470,[262]26.1211,[263]26.1937,[264]26.2456,[265]26.2707,[266]26.2576,[267]26.3469,[268]26.3961,[269]26.5014,[270]26.5304,[271]26.5131,[272]26.5463,[273]26.6238,[274]26.6725,[275]26.7412,[276]26.8189,[277]26.8357,[278]26.9189,[279]26.9522,[280]26.9903,[281]26.9915,[282]26.9971,
[283]26.9208,[284]26.8762,[285]26.9048,[286]26.8862,[287]26.8703,[288]26.8434,[289]26.7763,[290]26.8140,[291]26.8192,[292]26.8244,[293]26.8624,[294]26.8272,[295]26.8488,[296]26.8648,[297]26.8560,[298]26.8689,[299]26.8825,[300]26.8355,[301]26.8516,[302]26.8123,[303]26.7612,[304]26.7808,[305]26.7577,[306]26.7796,[307]26.7596,[308]26.7784,[309]26.8219,[310]26.8545,[311]26.8838,[312]26.9088,[313]26.9006,[314]26.8698,[315]26.9429,[316]26.9790,[317]26.9951,[318]26.9599,[319]26.9630,[320]27.0035,[321]27.0170,[322]27.0361,[323]27.0361,[324]27.0132,[325]27.0399,[326]27.0052,[327]26.9591,[328]26.9027,[329]26.8787,[330]26.8067,[331]26.7859,[332]26.6854,[333]26.6639,[334]26.6614,[335]26.7275,[336]26.7922,[337]26.8346,[338]26.9071,[339]26.9655,[340]26.9734,[341]26.9175,[342]26.8995,[343]26.9298,[344]26.9893,[345]26.9540,[346]26.9255,[347]26.9037,[348]26.8847,[349]26.9223,[350]26.9584,[351]27.0009,[352]26.9464,[353]26.8788,[354]26.8299,[355]26.7940,[356]26.8060,[357]26.8025,[358]26.8143,[359]26.8260,[360]26.7987,[361]26.8218,[362]26.8274,[363]26.8331,[364]26.8544,[365]26.8872,[366]26.8925,[367]26.8687,[368]26.8398,[369]26.7781,[370]26.7687,[371]26.8239,[372]26.8401,[373]26.8716,[374]26.8576,[375]26.8738,[376]26.8856,[377]26.8866,[378]26.8773,[379]26.8775,[380]26.8859,[381]26.8796,[382]26.9003,[383]26.8903,[384]26.8802,[385]26.8459,[386]26.8440,[387]26.8710,[388]26.8773,[389]26.8797,[390]26.8814,[391]26.9455,[392]26.9871,[393]26.9954,[394]26.9939,[395]27.0530,[396]27.0318,[397]27.0317,[398]27.0786,[399]27.0593,[400]27.0882,[401]27.1107,[402]27.1271,[403]27.1425,[404]27.1423,[405]27.1388,[406]27.1679,[407]27.2116,[408]27.2181,[409]27.2615,[410]27.2543,[411]27.2732,[412]27.2989,[413]27.3537,[414]27.3887,[415]27.3879,[416]27.4353,[417]27.4472,[418]27.4668,[419]27.5068,[420]27.5152,[421]27.5175,[422]27.4868,[423]27.5219,[424]27.4988,[425]27.4322,[426]27.4610,[427]27.4069,[428]27.4414,[429]27.4388,[430]27.4644,[431]27.4619,[432]27.4512,[433]27.4116,[434]27.3659,[435]27.3502,[436]27.3367,[437]27.3485,[438]27.3250,[439]27.3275,[440]27.3153,[441]27.3112,[442]27.3381,[443]27.3393,[444]27.3627,[445]27.3775,[446]27.3604,[447]27.3644,[448]27.3219,[449]27.3312,[450]27.3573,[451]27.3342,[452]27.3214,[453]27.3026,[454]27.2898,[455]27.2787,[456]27.2584,[457]27.2627,[458]27.2394,[459]27.1890,[460]27.1999,[461]27.1951,[462]27.1812,[463]27.1936,[464]27.1915,[465]27.1769,[466]27.2099,[467]27.2091,[468]27.2009,[469]27.2157,[470]27.2299,[471]27.2268,[472]27.2423,[473]27.2762,[474]27.2783,[475]27.2911,[476]27.3177,[477]27.2916,[478]27.3253,[479]27.3506,[480]27.3878,[481]27.4509,[482]27.4447,[483]27.4801,[484]27.4877,[485]27.5037,[486]27.5085,[487]27.5522,[488]27.5603,[489]27.5835,[490]27.5795,[491]27.5662,[492]27.5285,[493]27.5426,[494]27.5332,[495]27.5383,[496]27.5407,[497]27.5609,[498]27.5783,[499]27.5981,[500]27.5826,[501]27.5816,[502]27.5563,[503]27.5201,[504]27.4751,[505]27.4470,[506]27.4330,[507]27.4339,[508]27.4051,[509]27.3876,[510]27.3684,[511]27.4028,[512]27.4015,[513]27.3931,[514]27.3810,[515]27.3919,[516]27.3152,[517]27.3478,[518]27.3631,[519]27.3946,[520]27.3775,[521]27.3571,[522]27.3861,[523]27.4071,[524]27.4243,[525]27.3913,[526]27.3369,[527]27.2871,[528]27.2738,[529]27.2761,[530]27.2884,[531]27.2961,[532]27.3055,
Final estimate: PPL = 27.3055 +/- 0.32184

llama_perf_context_print:        load time =    2386.74 ms
llama_perf_context_print: prompt eval time =  321380.19 ms / 272384 tokens (    1.18 ms per token,   847.54 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  413892.63 ms / 272385 tokens
