
feat: Add Gemma3 chat handler (#1976) #1989


Open
wants to merge 6 commits into main

Conversation

@kossum commented Mar 30, 2025

Added a Gemma3 chat handler and fixed the image embedding; multiple images are supported.

Included llama.cpp functions and structures:

  • clip_image_load_from_bytes
  • clip_image_batch_encode
  • clip_image_preprocess
  • clip_image_f32_batch_init
  • clip_image_f32_batch_free
  • clip_image_u8_init
  • clip_image_u8_free

Usage (Current version, after Apr 4 2025):

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Gemma3ChatHandler

chat_handler = Gemma3ChatHandler(clip_model_path="path/to/mmproj")
llama = Llama(
  model_path="path/to/model",
  chat_handler=chat_handler,
  n_ctx=1024,  # n_ctx should be increased to accommodate the image embedding
)

messages = [
  {
    'role': 'user',
    'content': [
      {'type': 'text', 'text': 'Please describe this image'},
      {'type': 'image_url', 'image_url': 'https://raw.githubusercontent.com/huggingface/transformers/refs/heads/main/tests/fixtures/tests_samples/COCO/000000039769.png'},
    ]
  }
]

output = llama.create_chat_completion(
  messages,
  stop=['<end_of_turn>', '<eos>'],
  max_tokens=200,
)

print(output['choices'][0]['message']['content'])
Message format change (the previous "image" entry was replaced by the OpenAI-style "image_url" entry):

- {'type': 'image', 'image': ...}
+ {'type': 'image_url', 'image_url': ...}

Test Results:

Compatibility:

  • Fully backward compatible with existing interfaces.
  • Maintains original APIs while adding new options and interfaces.

@kossum mentioned this pull request Apr 2, 2025
@RuurdBijlsma commented Apr 3, 2025

I've been using it a bit and it works nicely. I had to figure out the message structure, but maybe that's normal for different chat handlers; I'm not that familiar with llama-cpp.

"type": "image",
"image": {
    "url": "https://image.com/img.jpg",
}

I was used to "image_url" for both places where "image_url" is used now.

@dchatel commented Apr 4, 2025

How would that work with a local image?

@kossum (Author) commented Apr 4, 2025

Sorry, I originally didn't modify the gemma3 chat template and therefore used "type": "image". I have now changed the message format to be compatible with the OpenAI API, just like the other chat handlers.

Here is a full example:

from pathlib import Path
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Gemma3ChatHandler

def image_to_base64_uri(image: bytes | str):
  import base64
  import urllib.request as request

  if isinstance(image, bytes):
    data = base64.b64encode(image).decode('utf-8')
  else:
    with request.urlopen(image) as f:
      data = base64.b64encode(f.read()).decode('utf-8')
  return f'data:image/png;base64,{data}'

chat_handler = Gemma3ChatHandler(clip_model_path='path/to/mmproj')
llama = Llama(
    model_path='path/to/model',
    chat_handler=chat_handler,
    n_ctx=2048,  # n_ctx should be increased to accommodate the image embedding
)

messages = [
    {
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'please compare these pictures'},
            {'type': 'image_url', 'image_url': 'https://xxxx/img1.jpg'},
            {'type': 'image_url', 'image_url': {'url': 'https://xxxx/img2.png'}},
            {'type': 'image_url', 'image_url': image_to_base64_uri(Path('path/to/img3.jpg').read_bytes())},
            {'type': 'image_url', 'image_url': {'url': image_to_base64_uri(Path('path/to/img4.png').read_bytes())}},
            {'type': 'text', 'text': 'and then tell me which one looks the best'},
        ]
    }
]

output = llama.create_chat_completion(
    messages,
    stop=['<end_of_turn>', '<eos>'],
    max_tokens=500,
    stream=True,
)

for chunk in output:
  delta = chunk['choices'][0]['delta']
  if 'role' in delta:
    print(delta['role'], end=':\n')
  elif 'content' in delta:
    print(delta['content'], end='')

llama._sampler.close()
llama.close()

@joaojhgs

Bump on this, thanks for your work! Gemma3 is a great model to have support for; I'm waiting on it!

@joaojhgs

Hey @kossum, just wondering: does this handler support function calling? I ask because the handler for LLaVA 1.5 supports multimodal (vision) and tool calling at the same time. Since Gemma3 also has tool-calling capabilities, it would be great to have both in a single handler!

@kossum (Author) commented Apr 16, 2025

Hello @joaojhgs, gemma3 (especially the 12b and 27b versions) has strong instruction-following abilities and can generate structured function call outputs through well-designed prompts.

But unlike gpt4 or claude, gemma3 does not have builtin support for tool call tokens or json schema enforcement. That means:

  • No builtin tool use markers: gemma3 does not automatically identify or tag tool usage.
  • Requires explicit prompt design: you need to clearly define function names, parameters, and output format in the prompt.
  • Lacks standardized templates: currently, gemma3’s chat_template does not include tool use structures.

So to implement function calling with gemma3, you must rely on carefully designed prompts to guide the model in producing the correct format.

Simple example:

import json
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Gemma3ChatHandler

chat_handler = Gemma3ChatHandler(clip_model_path='path/to/mmproj')
llama = Llama(
    model_path='path/to/model',
    chat_handler=chat_handler,
    n_ctx=2048,  # n_ctx should be increased to accommodate the image embedding
)


def analyze_image(image_id: str, description: str):
  print('image_id:', image_cache.get(image_id))
  print('description:', description)
  ...


image_cache = {'img_01': 'https://xxxx/img_01.jpg'}
function_table = {'analyze_image': analyze_image}

# input: the image id to analyze
image_id = 'img_01'
# input: the user question referencing that id
question = f'Here is the image with ID `{image_id}`. Please analyze it.'

output = llama.create_chat_completion(
    [
        {
            'role': 'system',
            'content': '''You can call the following function:
- analyze_image(image_id: str, description: str)

You will be shown an image. First, analyze and describe its content in detail.
Then, return a function call with:
- the assigned image_id (provided in the input)
- a description of what the image shows (your own analysis)

Respond only with a JSON (without code blocks) function call like:
{
  "function": "analyze_image",
  "arguments": {
    "image_id": "<image id>",
    "description": "<description of the image>"
  }
}
'''
        },
        {
            'role': 'user',
            'content': [
                {'type': 'text', 'text': question},
                {'type': 'image_url', 'image_url': image_cache[image_id]},
            ]
        }
    ],
    stop=['<end_of_turn>', '<eos>'],
    max_tokens=500,
)

data = json.loads(output['choices'][0]['message']['content'])
result = function_table[data['function']](**data['arguments'])
...
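
Since gemma3 enforces no JSON schema, the json.loads call above can fail on an occasional malformed reply. A guarded version of the dispatch (just a sketch, not part of the handler) avoids crashing in that case:

raw = output['choices'][0]['message']['content']
try:
    data = json.loads(raw)
    result = function_table[data['function']](**data['arguments'])
except (json.JSONDecodeError, KeyError, TypeError) as e:
    # Fall back to treating the reply as plain text, or re-prompt the model.
    print(f'Could not parse function call ({e}); raw output:\n{raw}')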

Naturally, if multimodal capabilities aren’t needed, this chat handler can be omitted.
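
As a minimal sketch of that text-only case (the model path is a placeholder):

from llama_cpp import Llama

# Text-only usage: no mmproj file and no chat handler required.
llama = Llama(model_path='path/to/model', n_ctx=2048)

output = llama.create_chat_completion(
    [{'role': 'user', 'content': 'Summarize the rules of chess in two sentences.'}],
    stop=['<end_of_turn>', '<eos>'],
    max_tokens=200,
)
print(output['choices'][0]['message']['content'])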

@joaojhgs

Hello @joaojhgs, gemma3 (especially the 12b and 27b versions) has strong instruction-following abilities and can generate structured function call outputs through well-designed prompts.

But unlike gpt4 or claude, gemma3 does not have builtin support for tool call tokens or json schema enforcement.

Thanks, I didn't know about that!

@okaris commented May 23, 2025

@abetlen I've been waiting for this to be merged for some time. I'm curious are you still actively maintaining this repo? Thanks!

@Domino9752

Say Hey-
I added your code to my venv, and when running your example I received this error:

Traceback (most recent call last):
  File "d:\Foundary\Gemma-3_llama.py", line 57, in <module>
    output = llama.create_chat_completion(
  File "D:\Foundary\venv\Lib\site-packages\llama_cpp\llama.py", line 2001, in create_chat_completion
    return handler(
  File "D:\Foundary\venv\Lib\site-packages\llama_cpp\llama_chat_format.py", line 2838, in __call__
    self.eval_image(llama, value)
  File "D:\Foundary\venv\Lib\site-packages\llama_cpp\llama_chat_format.py", line 3480, in eval_image
    if not self._llava_cpp.clip_image_batch_encode(self.clip_ctx, llama.n_threads, img_f32_p, embed):
OSError: exception: access violation writing 0x0000015F26813000

So, seeing the "OSError" label, I wanted to ask: have you run your code under Windows 11?

@kossum (Author) commented May 31, 2025

Hello @Domino9752, thanks for testing! The issue happens because the bundled llama.cpp library was updated in the current 0.3.9 release, but the corresponding changes for the llava part haven't been made yet.
Could you please try rolling back to version 0.3.8 and see if it works? Alternatively, you can use my fork for now; I'll update it soon with the necessary fixes.

@Domino9752 commented May 31, 2025

kossum-

Thanks for your help. I changed over to your branch:

python -m pip install git+https://github.com/kossum/llama-cpp-python@gemma3-fix --no-cache-dir --force-reinstall --upgrade --config-settings="cmake.args=-DGGML_CUDA=on"

...and now it runs. I previously had the syntax shown in the April 4th message using "image_url".
Upon reading llama_chat_format.py, I realized the correct syntax for the messages (with @gemma3-fix) is:

path_to_image = r"D:\FLUX1-dev\inputs\Clipped\0515-2200-6-2_224.png"

messages = [
    {
        'role': 'user',
        'content': [
            {'type': 'text', 'text': "Please describe this image in great detail, OK?"},
            {'type': 'image', 'url': image_to_base64_uri(Path(path_to_image).read_bytes())},
        ]
    }
]

Intel i7 14700K / RTX3090
Llama.generate: 528 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print: load time = 46.10 ms
llama_perf_context_print: prompt eval time = 1241.25 ms / 530 tokens ( 2.34 ms per token, 426.99 tokens per second)
llama_perf_context_print: eval time = 25365.96 ms / 684 runs ( 37.08 ms per token, 26.97 tokens per second)
llama_perf_context_print: total time = 26286.61 ms / 1214 tokens

@xia0nan commented Jun 1, 2025

Hi @kossum, thank you for the work. I've tested it, and it works well. Here's the test code:

  1. Install from your branch: pip install git+https://github.com/kossum/llama-cpp-python.git@main
  2. Then run the test script:
import os
import requests
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Gemma3ChatHandler

# URLs for the model weights
MODEL_URL = "https://huggingface.co/vinimuchulski/gemma-3-4b-it-qat-q4_0-gguf/resolve/main/gemma-3-4b-it-q4_0.gguf?download=true"
MMPROJ_URL = "https://huggingface.co/vinimuchulski/gemma-3-4b-it-qat-q4_0-gguf/resolve/main/mmproj-model-f16-4B.gguf?download=true"

MODEL_FILE = "gemma-3-4b-it-q4_0.gguf"
MMPROJ_FILE = "mmproj-model-f16-4B.gguf"

def download_file(url, local_path):
    if not os.path.exists(local_path):
        print(f"Downloading {local_path} ...")
        with requests.get(url, stream=True) as r:
            r.raise_for_status()
            with open(local_path, 'wb') as f:
                for chunk in r.iter_content(chunk_size=8192):
                    f.write(chunk)
        print(f"Downloaded {local_path}")
    else:
        print(f"{local_path} already exists, skipping download.")

# Download the weights if they don't exist
download_file(MODEL_URL, MODEL_FILE)
download_file(MMPROJ_URL, MMPROJ_FILE)

# Initialize the multimodal chat handler
chat_handler = Gemma3ChatHandler(clip_model_path=MMPROJ_FILE)

# Load the Gemma 3 model with multimodal support
llm = Llama(
    model_path=MODEL_FILE,
    chat_handler=chat_handler,
    n_ctx=2048,  # You can increase or decrease as needed
)

# Sample inference: describe an image from a URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Please describe this image."},
            {"type": "image_url", "image_url": {"url": "https://raw.githubusercontent.com/huggingface/transformers/refs/heads/main/tests/fixtures/tests_samples/COCO/000000039769.png"}},
        ]
    }
]

output = llm.create_chat_completion(
    messages,
    stop=['<end_of_turn>', '<eos>'],
    max_tokens=200,
)

print("Model output:", output['choices'][0]['message']['content'])

There is one issue I want to ask for help with: the output speed is quite slow.
On CPU:

Llama.generate: 299 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print: load time = 458.17 ms
llama_perf_context_print: prompt eval time = 363380.98 ms / 301 tokens ( 1207.25 ms per token, 0.83 tokens per second)
llama_perf_context_print: eval time = 97226.48 ms / 199 runs ( 488.58 ms per token, 2.05 tokens per second)
llama_perf_context_print: total time = 98303.12 ms / 500 tokens

On GPU:

Llama.generate: 299 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print: load time = 401.93 ms
llama_perf_context_print: prompt eval time = 285504.85 ms / 301 tokens ( 948.52 ms per token, 1.05 tokens per second)
llama_perf_context_print: eval time = 80909.79 ms / 199 runs ( 406.58 ms per token, 2.46 tokens per second)
llama_perf_context_print: total time = 81813.53 ms / 500 tokens

When I test the same model in LM Studio, the speed is much faster: around 4-9 tokens/second on CPU, and with GPU it can go as high as 60 TPS.
Why is there such a difference between llama_cpp_python and LM Studio for the same model? Do you have any suggestions to make the output faster?

@kossum (Author) commented Jun 1, 2025

Hi @xia0nan, thanks for your feedback.

If you find that inference speed is slow, please note that you can use the n_gpu_layers parameter to specify how many transformer layers should be offloaded to the GPU. For example:

llm = Llama(
    model_path=MODEL_PATH,
    chat_handler=Gemma3ChatHandler(clip_model_path=MMPROJ_PATH),
    n_gpu_layers=48,  # number of layers to offload to the GPU; -1 offloads all layers
    n_ctx=1024,
)

However, in order to use the GPU, you need to build llama-cpp-python with the appropriate backend. Please refer to the Supported Backends section in the README for detailed instructions on enabling GPU support during installation.

Also, LM Studio may use GPU acceleration by default or include additional optimizations, which can explain the speed difference.

Please note that GPU support and related build issues are not directly within the scope of this chat handler. If you encounter any problems when compiling with GPU support, feel free to open a separate issue. For more general installation or backend-related questions, you may also want to refer to the main llama-cpp-python repository.

@joaojhgs commented Jun 6, 2025

Hey @kossum, just out of curiosity: is it possible to port this into my project without waiting for it to be merged or using your branch? I mean making a class override; I've done that before in cases where I couldn't wait for a third-party lib to merge. But this one has those ctypes functions that look like compiled stuff I'd need support from the lib for directly. Could I add those ctypes directly into my project as well?

Also, when I try to load the mmproj model, the loading function fails with complaints about the mmproj file missing some required keys, such as general.description, clip.has_text_encoder, etc. I'm using the unsloth 4b model and their F16 mmproj file; am I perhaps missing something?

@chriszs commented Jun 6, 2025

could I add those ctypes directly into my project as well?

You can. The trick is to duplicate the ctypes file into your project, override the __init__ method from Llava, and import from your local ctypes when instantiating.
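
For illustration, here is a rough sketch of that pattern. The module name my_llava_cpp is hypothetical (your local copy of the PR's patched llava_cpp.py bindings), and the Gemma3-specific template logic from this PR would still need to be copied into the subclass; only the binding redirection is shown:

from llama_cpp.llama_chat_format import Llava15ChatHandler

import my_llava_cpp  # hypothetical: local copy of the patched ctypes bindings


class LocalGemma3ChatHandler(Llava15ChatHandler):
    # Copy the Gemma3 chat template / image-eval methods from the PR into this
    # class as well; this sketch only shows redirecting the bindings.

    def __init__(self, clip_model_path: str, verbose: bool = True):
        super().__init__(clip_model_path=clip_model_path, verbose=verbose)
        # The base class assigns the bundled llama_cpp.llava_cpp module to
        # self._llava_cpp; point it at the local copy so the extra clip_*
        # functions used by the Gemma3 handler resolve from your project.
        self._llava_cpp = my_llava_cpp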
