Eval bug: Jinja not replacing date_string #12729

Open
LoSunny opened this issue Apr 3, 2025 · 6 comments · May be fixed by #12802
LoSunny commented Apr 3, 2025

Name and Version

$ ~/llama.cpp/build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA A800-SXM4-80GB MIG 7g.80gb, compute capability 8.0, VMM: yes
version: 5002 (2c3f8b85)
built with x86_64-conda-linux-gnu-cc (conda-forge gcc 11.4.0-13) 11.4.0 for x86_64-conda-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

AMD EPYC 7742 64-Core Processor + A800-SXM4-80GB

Models

No response

Problem description & steps to reproduce

Compile llama.cpp from source and run it with ~/llama.cpp/build/bin/llama-server -m /models/Llama-3.3-70B-Instruct-Q8_0.gguf --port 8000 -t 8 -ngl 81 -c 15360 --jinja

First Bad Commit

No response

Relevant log output

$ ~/llama.cpp/build/bin/llama-server -m /models/Llama-3.3-70B-Instruct-Q8_0.gguf --port 8000 -t 8 -ngl 81 -c 15360 --jinja
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA A800-SXM4-80GB MIG 7g.80gb, compute capability 8.0, VMM: yes
build: 5002 (2c3f8b85) with x86_64-conda-linux-gnu-cc (conda-forge gcc 11.4.0-13) 11.4.0 for x86_64-conda-linux-gnu
system info: n_threads = 8, n_threads_batch = 8, total_threads = 256

system_info: n_threads = 8 (n_threads_batch = 8) / 256 | CUDA : ARCHS = 500,610,700,750,800 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8000, http threads: 255
main: loading model
srv    load_model: loading model '/models/Llama-3.3-70B-Instruct-Q8_0.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA A800-SXM4-80GB MIG 7g.80gb) - 80839 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 724 tensors from /models/Llama-3.3-70B-Instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                         general.size_label str              = 71B
llama_model_loader: - kv   3:                            general.license str              = llama3.3
llama_model_loader: - kv   4:                   general.base_model.count u32              = 1
llama_model_loader: - kv   5:                  general.base_model.0.name str              = Llama 3.1 70B
llama_model_loader: - kv   6:          general.base_model.0.organization str              = Meta Llama
llama_model_loader: - kv   7:              general.base_model.0.repo_url str              = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv   8:                               general.tags arr[str,5]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   9:                          general.languages arr[str,8]       = ["en", "fr", "it", "pt", "hi", "es", ...
llama_model_loader: - kv  10:                          llama.block_count u32              = 80
llama_model_loader: - kv  11:                       llama.context_length u32              = 131072
llama_model_loader: - kv  12:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv  13:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv  14:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv  15:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  16:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  17:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  18:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  19:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  20:                          general.file_type u32              = 7
llama_model_loader: - kv  21:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  22:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 128004
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  162 tensors
llama_model_loader: - type q8_0:  562 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 69.82 GiB (8.50 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 8192
print_info: n_layer          = 80
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 28672
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 70B
print_info: model params     = 70.55 B
print_info: general.name     = n/a
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: PAD token        = 128004 '<|finetune_right_pad_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors:        CUDA0 model buffer size = 70429.66 MiB
load_tensors:   CPU_Mapped model buffer size =  1064.62 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 15360
llama_context: n_ctx_per_seq = 15360
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (15360) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.49 MiB
init: kv_size = 15360, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 80, can_shift = 1
init:      CUDA0 KV buffer size =  4800.00 MiB
llama_context: KV self size  = 4800.00 MiB, K (f16): 2400.00 MiB, V (f16): 2400.00 MiB
llama_context:      CUDA0 compute buffer size =  2014.00 MiB
llama_context:  CUDA_Host compute buffer size =    46.01 MiB
llama_context: graph nodes  = 2726
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 15360
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 15360
main: model loaded
main: chat template, chat_template: {{- bos_token }}
{%- if custom_tools is defined %}
    {%- set tools = custom_tools %}
{%- endif %}
{%- if not tools_in_user_message is defined %}
    {%- set tools_in_user_message = true %}
{%- endif %}
{%- if not date_string is defined %}
    {%- set date_string = "26 Jul 2024" %}
{%- endif %}
{%- if not tools is defined %}
    {%- set tools = none %}
{%- endif %}

{#- This block extracts the system message, so we can slot it into the right place. #}
{%- if messages[0]['role'] == 'system' %}
    {%- set system_message = messages[0]['content']|trim %}
    {%- set messages = messages[1:] %}
{%- else %}
    {%- set system_message = "" %}
{%- endif %}

{#- System message + builtin tools #}
{{- "<|start_header_id|>system<|end_header_id|>\n\n" }}
{%- if builtin_tools is defined or tools is not none %}
    {{- "Environment: ipython\n" }}
{%- endif %}
{%- if builtin_tools is defined %}
    {{- "Tools: " + builtin_tools | reject('equalto', 'code_interpreter') | join(", ") + "\n\n"}}
{%- endif %}
{{- "Cutting Knowledge Date: December 2023\n" }}
{{- "Today Date: " + date_string + "\n\n" }}
{%- if tools is not none and not tools_in_user_message %}
    {{- "You have access to the following functions. To call a function, please respond with JSON for a function call." }}
    {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
    {{- "Do not use variables.\n\n" }}
    {%- for t in tools %}
        {{- t | tojson(indent=4) }}
        {{- "\n\n" }}
    {%- endfor %}
{%- endif %}
{{- system_message }}
{{- "<|eot_id|>" }}

{#- Custom tools are passed in a user message with some extra guidance #}
{%- if tools_in_user_message and not tools is none %}
    {#- Extract the first user message so we can plug it in here #}
    {%- if messages | length != 0 %}
        {%- set first_user_message = messages[0]['content']|trim %}
        {%- set messages = messages[1:] %}
    {%- else %}
        {{- raise_exception("Cannot put tools in the first user message when there's no first user message!") }}
{%- endif %}
    {{- '<|start_header_id|>user<|end_header_id|>\n\n' -}}
    {{- "Given the following functions, please respond with a JSON for a function call " }}
    {{- "with its proper arguments that best answers the given prompt.\n\n" }}
    {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
    {{- "Do not use variables.\n\n" }}
    {%- for t in tools %}
        {{- t | tojson(indent=4) }}
        {{- "\n\n" }}
    {%- endfor %}
    {{- first_user_message + "<|eot_id|>"}}
{%- endif %}

{%- for message in messages %}
    {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}
        {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' }}
    {%- elif 'tool_calls' in message %}
        {%- if not message.tool_calls|length == 1 %}
            {{- raise_exception("This model only supports single tool-calls at once!") }}
        {%- endif %}
        {%- set tool_call = message.tool_calls[0].function %}
        {%- if builtin_tools is defined and tool_call.name in builtin_tools %}
            {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}
            {{- "<|python_tag|>" + tool_call.name + ".call(" }}
            {%- for arg_name, arg_val in tool_call.arguments | items %}
                {{- arg_name + '="' + arg_val + '"' }}
                {%- if not loop.last %}
                    {{- ", " }}
                {%- endif %}
                {%- endfor %}
            {{- ")" }}
        {%- else  %}
            {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}
            {{- '{"name": "' + tool_call.name + '", ' }}
            {{- '"parameters": ' }}
            {{- tool_call.arguments | tojson }}
            {{- "}" }}
        {%- endif %}
        {%- if builtin_tools is defined %}
            {#- This means we're in ipython mode #}
            {{- "<|eom_id|>" }}
        {%- else %}
            {{- "<|eot_id|>" }}
        {%- endif %}
    {%- elif message.role == "tool" or message.role == "ipython" %}
        {{- "<|start_header_id|>ipython<|end_header_id|>\n\n" }}
        {%- if message.content is mapping or message.content is iterable %}
            {{- message.content | tojson }}
        {%- else %}
            {{- message.content }}
        {%- endif %}
        {{- "<|eot_id|>" }}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
{%- endif %}
, example_format: '<|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>

How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

'
main: server is listening on http://127.0.0.1:8000 - starting the main loop
srv  update_slots: all slots are idle

CISC commented Apr 3, 2025

Well, technically it can't because there's no (simple) way of passing variables.

This variable in particular though should be replaced with the strftime_now function in the template.

If that hasn't already happened in the official Llama 3.3 repo, you can export the chat template (using scripts/get_chat_template.py, for example) and then load a modified copy using --chat-template-file. Alternatively, you can create a new GGUF with an updated chat template using gguf-py/gguf/scripts/gguf_new_metadata.py.
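
For example, after exporting the template you could swap the hardcoded fallback for a strftime_now call, roughly like this (just a sketch of the edit, not an official fix; the "%d %b %Y" format is only an assumption to match the "26 Jul 2024" style the template already uses):

{#- Before: {%- set date_string = "26 Jul 2024" %} #}
{%- if not date_string is defined %}
    {%- set date_string = strftime_now("%d %b %Y") %}
{%- endif %}

The edited file can then be passed to llama-server via --chat-template-file.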


CISC commented Apr 3, 2025

@ochafik WDYT, is this something that can/will be supported?

AFAIK the only other use cases for variables are RAG (the documents variable) and IBM Granite's thinking and controls.


ochafik commented Apr 4, 2025

@ochafik WDYT, is this something that can/will be supported?

@CISC Ah, absolutely, I'll look into it. I want the time to be overridable by flag (for tests) & synced w/ the strftime_now minja function.

AFAIK the only other use-cases for variables is RAG (documents variable)

Seems it could definitely be useful (mostly for Cohere & Granite models AFAICT, with the latter ignoring the doc title).

and IBM Granite's thinking and controls.

Which granite model are you looking at, btw? (ibm-granite/granite-3.1-8b-instruct has no thinking, and its controls don't seem very standard)

Allowing arbitrary values to be passed through to jinja may require more thinking; I reckon we'll want to allow only these specific ones.


CISC commented Apr 4, 2025

Which granite model are you looking at, btw? (ibm-granite/granite-3.1-8b-instruct has no thinking, and its controls don't seem very standard)

The ibm-granite/granite-3.2-8b-instruct model has these; there doesn't really need to be a standard, as transformers will pass through any unused parameters.
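
For example, any extra keyword argument given to apply_chat_template simply shows up as a template variable, so a template can test for it directly, roughly like this (a sketch for illustration, not the actual Granite template):

{%- if thinking is defined and thinking %}
    {{- "Reason step by step before giving your final answer.\n" }}
{%- endif %}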

Allowing arbitrary values to be passed through to jinja may require more thinking; I reckon we'll want to allow only these specific ones.

Makes sense.

@ochafik ochafik self-assigned this Apr 5, 2025
ochafik added a commit to ochafik/llama.cpp that referenced this issue Apr 5, 2025
ochafik added a commit to ochafik/llama.cpp that referenced this issue Apr 5, 2025
wqerrewetw commented

This variable in particular though should be replaced with the strftime_now function in the template.

But minja's strftime_now only works with %Y-%m-%d; it doesn't work with other format codes like %Y-%m-%d %H:%M:%S

https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes

ochafik added a commit to ochafik/llama.cpp that referenced this issue Apr 7, 2025

ochafik commented Apr 7, 2025

This variable in particular though should be replaced with the strftime_now function in the template.

But minja's strftime_now only works with %Y-%m-%d; it doesn't work with other format codes like %Y-%m-%d %H:%M:%S

@wqerrewetw it should work (it's defined in chat-template.hpp and relies on std::put_time, which largely overlaps with the Python format codes); if you face any issue please file a bug ✌️
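
For instance, a line like the following in a template should render a full timestamp under the --jinja path (a minimal sketch to illustrate, reusing the Today Date line from the Llama template):

{{- "Today Date: " + strftime_now("%Y-%m-%d %H:%M:%S") + "\n\n" }}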
