
feat: Add Gemma3 chat handler (#1976) #1989


Open
wants to merge 6 commits into main

Conversation

@kossum commented Mar 30, 2025

Added a Gemma3 chat handler and fixed the image embedding; multiple images are supported.

Included llama.cpp functions and structures:

  • clip_image_load_from_bytes
  • clip_image_batch_encode
  • clip_image_preprocess
  • clip_image_f32_batch_init
  • clip_image_f32_batch_free
  • clip_image_u8_init
  • clip_image_u8_free

Usage (Current version, after Apr 4 2025):

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Gemma3ChatHandler

chat_handler = Gemma3ChatHandler(clip_model_path="path/to/mmproj")
llama = Llama(
  model_path="path/to/model",
  chat_handler=chat_handler,
  n_ctx=1024,  # n_ctx should be increased to accommodate the image embedding
)

messages = [
  {
    'role': 'user',
    'content': [
      {'type': 'text', 'text': 'Please describe this image'},
      {'type': 'image_url', 'image_url': 'https://raw.githubusercontent.com/huggingface/transformers/refs/heads/main/tests/fixtures/tests_samples/COCO/000000039769.png'},
    ]
  }
]

output = llama.create_chat_completion(
  messages,
  stop=['<end_of_turn>', '<eos>'],
  max_tokens=200,
)

print(output['choices'][0]['message']['content'])
Message format change (the previous "image" entry was replaced by the OpenAI-style "image_url" entry):

- {'type': 'image', 'image': ...}
+ {'type': 'image_url', 'image_url': ...}

Test Results:

Compatibility:

  • Fully backward compatible with existing interfaces.
  • Maintains original APIs while adding new options and interfaces.

@kossum mentioned this pull request Apr 2, 2025
@RuurdBijlsma commented Apr 3, 2025

I've been using it a bit and it works nicely. I had to figure out the message structure, but maybe that's normal for different chat handlers; I'm not that familiar with llama-cpp.

"type": "image",
"image": {
    "url": "https://image.com/img.jpg",
}

I was used to "image_url" for both places where "image_url" is used now.

@dchatel commented Apr 4, 2025

How would that work with a local image?

@kossum (Author) commented Apr 4, 2025

Sorry, I originally didn't modify the gemma3 chat template and therefore used "type": "image". I have now changed the message format to be compatible with the OpenAI API, just like the other chat handlers.

Here is a full example:

from pathlib import Path
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Gemma3ChatHandler

def image_to_base64_uri(image: bytes | str):
  import base64
  import urllib.request as request

  if isinstance(image, bytes):
    data = base64.b64encode(image).decode('utf-8')
  else:
    with request.urlopen(image) as f:
      data = base64.b64encode(f.read()).decode('utf-8')
  return f'data:image/png;base64,{data}'

chat_handler = Gemma3ChatHandler(clip_model_path='path/to/mmproj')
llama = Llama(
    model_path='path/to/model',
    chat_handler=chat_handler,
    n_ctx=2048,  # n_ctx should be increased to accommodate the image embedding
)

messages = [
    {
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'please compare these pictures'},
            {'type': 'image_url', 'image_url': 'https://xxxx/img1.jpg'},
            {'type': 'image_url', 'image_url': {'url': 'https://xxxx/img2.png'}},
            {'type': 'image_url', 'image_url': image_to_base64_uri(Path('path/to/img3.jpg').read_bytes())},
            {'type': 'image_url', 'image_url': {'url': image_to_base64_uri(Path('path/to/img4.png').read_bytes())}},
            {'type': 'text', 'text': 'and then tell me which one looks the best'},
        ]
    }
]

output = llama.create_chat_completion(
    messages,
    stop=['<end_of_turn>', '<eos>'],
    max_tokens=500,
    stream=True,
)

for chunk in output:
  delta = chunk['choices'][0]['delta']
  if 'role' in delta:
    print(delta['role'], end=':\n')
  elif 'content' in delta:
    print(delta['content'], end='')

llama._sampler.close()
llama.close()

@joaojhgs

Bump on this, thanks for your work! Gemma3 is a great model to have support for; I'm waiting on it!

@joaojhgs

Hey @kossum, just wondering: does this handler support function calling? I ask because the handler for LLaVA 1.5 supports multimodal (vision) and tool calling at the same time. Since Gemma3 also has tool-calling capabilities, it would be great to have both in a single handler!

@kossum (Author) commented Apr 16, 2025

Hello @joaojhgs, gemma3 (especially the 12b and 27b versions) has strong instruction-following abilities and can generate structured function call outputs through well-designed prompts.

But unlike gpt4 or claude, gemma3 does not have builtin support for tool call tokens or json schema enforcement. That means:

  • No builtin tool use markers: gemma3 does not automatically identify or tag tool usage.
  • Requires explicit prompt design: you need to clearly define function names, parameters, and output format in the prompt.
  • Lacks standardized templates: currently, gemma3’s chat_template does not include tool use structures.

So to implement function calling with gemma3, you must rely on carefully designed prompts to guide the model in producing the correct format.

Simple example:

import json
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Gemma3ChatHandler

chat_handler = Gemma3ChatHandler(clip_model_path='path/to/mmproj')
llama = Llama(
    model_path='path/to/model',
    chat_handler=chat_handler,
    n_ctx=2048,  # n_ctx should be increased to accommodate the image embedding
)


def analyze_image(image_id: str, description: str):
  print('image_id:', image_cache.get(image_id))
  print('description:', description)
  ...


image_cache = {'img_01': 'https://xxxx/img_01.jpg'}
function_table = {'analyze_image': analyze_image}

# input: the image id to analyze
image_id = 'img_01'
# input: the user question referencing that id
question = f'Here is the image with ID `{image_id}`. Please analyze it.'

output = llama.create_chat_completion(
    [
        {
            'role': 'system',
            'content': '''You can call the following function:
- analyze_image(image_id: str, description: str)

You will be shown an image. First, analyze and describe its content in detail.
Then, return a function call with:
- the assigned image_id (provided in the input)
- a description of what the image shows (your own analysis)

Respond only with a JSON (without code blocks) function call like:
{
  "function": "analyze_image",
  "arguments": {
    "image_id": "<image id>",
    "description": "<description of the image>"
  }
}
'''
        },
        {
            'role': 'user',
            'content': [
                {'type': 'text', 'text': question},
                {'type': 'image_url', 'image_url': image_cache[image_id]},
            ]
        }
    ],
    stop=['<end_of_turn>', '<eos>'],
    max_tokens=500,
)

data = json.loads(output['choices'][0]['message']['content'])
result = function_table[data['function']](**data['arguments'])
...
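
Since gemma3 enforces no JSON schema, the json.loads call above can fail on an occasional malformed reply. A guarded version of the dispatch (just a sketch, not part of the handler) avoids crashing in that case:

raw = output['choices'][0]['message']['content']
try:
    data = json.loads(raw)
    result = function_table[data['function']](**data['arguments'])
except (json.JSONDecodeError, KeyError, TypeError) as e:
    # Fall back to treating the reply as plain text, or re-prompt the model.
    print(f'Could not parse function call ({e}); raw output:\n{raw}')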

Naturally, if multimodal capabilities aren’t needed, this chat handler can be omitted.
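
As a minimal sketch of that text-only case (the model path is a placeholder):

from llama_cpp import Llama

# Text-only usage: no mmproj file and no chat handler required.
llama = Llama(model_path='path/to/model', n_ctx=2048)

output = llama.create_chat_completion(
    [{'role': 'user', 'content': 'Summarize the rules of chess in two sentences.'}],
    stop=['<end_of_turn>', '<eos>'],
    max_tokens=200,
)
print(output['choices'][0]['message']['content'])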

@joaojhgs

Hello @joaojhgs, gemma3 (especially the 12b and 27b versions) has strong instruction-following abilities and can generate structured function call outputs through well-designed prompts.

But unlike gpt4 or claude, gemma3 does not have builtin support for tool call tokens or json schema enforcement.

Thanks, I didn't know about that!

@okaris commented May 23, 2025

@abetlen I've been waiting for this to be merged for some time. I'm curious are you still actively maintaining this repo? Thanks!

@Domino9752

Say Hey-
I added your code to my venv, and when running your example I received this error:

Traceback (most recent call last):
  File "d:\Foundary\Gemma-3_llama.py", line 57, in <module>
    output = llama.create_chat_completion(
  File "D:\Foundary\venv\Lib\site-packages\llama_cpp\llama.py", line 2001, in create_chat_completion
    return handler(
  File "D:\Foundary\venv\Lib\site-packages\llama_cpp\llama_chat_format.py", line 2838, in __call__
    self.eval_image(llama, value)
  File "D:\Foundary\venv\Lib\site-packages\llama_cpp\llama_chat_format.py", line 3480, in eval_image
    if not self._llava_cpp.clip_image_batch_encode(self.clip_ctx, llama.n_threads, img_f32_p, embed):
OSError: exception: access violation writing 0x0000015F26813000

So, seeing the "OSError" label, I wanted to ask: have you run your code under Windows 11?

@kossum (Author) commented May 31, 2025

Hello @Domino9752, thanks for testing! The issue happens because the bundled llama.cpp library was updated in the current 0.3.9 release, but the corresponding changes for the llava part haven't been made yet.
Could you please try rolling back to version 0.3.8 and see if it works? Alternatively, you can use my fork for now; I'll update it soon with the necessary fixes.

@Domino9752 commented May 31, 2025

kossum-

Thanks for your help. I changed over to your branch:

python -m pip install git+https://github.com/kossum/llama-cpp-python@gemma3-fix --no-cache-dir --force-reinstall --upgrade --config-settings="cmake.args=-DGGML_CUDA=on"

...and now it runs. I previously had the syntax shown in the April 4th message using "image_url".
Upon reading llama_chat_format.py, I realized the correct syntax for the messages (with @gemma3-fix) is:

path_to_image = r"D:\FLUX1-dev\inputs\Clipped\0515-2200-6-2_224.png"

messages = [
    {
        'role': 'user',
        'content': [
            {'type': 'text', 'text': "Please describe this image in great detail, OK?"},
            {'type': 'image', 'url': image_to_base64_uri(Path(path_to_image).read_bytes())},
        ]
    }
]

Intel i7 14700K / RTX3090
Llama.generate: 528 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print: load time = 46.10 ms
llama_perf_context_print: prompt eval time = 1241.25 ms / 530 tokens ( 2.34 ms per token, 426.99 tokens per second)
llama_perf_context_print: eval time = 25365.96 ms / 684 runs ( 37.08 ms per token, 26.97 tokens per second)
llama_perf_context_print: total time = 26286.61 ms / 1214 tokens

@xia0nan commented Jun 1, 2025

Hi @kossum, thank you for the work. I've tested it, and it works well. Here's the test code:

  1. Install from your branch: pip install git+https://github.com/kossum/llama-cpp-python.git@main
  2. Then run the test script:
import os
import requests
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Gemma3ChatHandler

# URLs for the model weights
MODEL_URL = "https://huggingface.co/vinimuchulski/gemma-3-4b-it-qat-q4_0-gguf/resolve/main/gemma-3-4b-it-q4_0.gguf?download=true"
MMPROJ_URL = "https://huggingface.co/vinimuchulski/gemma-3-4b-it-qat-q4_0-gguf/resolve/main/mmproj-model-f16-4B.gguf?download=true"

MODEL_FILE = "gemma-3-4b-it-q4_0.gguf"
MMPROJ_FILE = "mmproj-model-f16-4B.gguf"

def download_file(url, local_path):
    if not os.path.exists(local_path):
        print(f"Downloading {local_path} ...")
        with requests.get(url, stream=True) as r:
            r.raise_for_status()
            with open(local_path, 'wb') as f:
                for chunk in r.iter_content(chunk_size=8192):
                    f.write(chunk)
        print(f"Downloaded {local_path}")
    else:
        print(f"{local_path} already exists, skipping download.")

# Download the weights if they don't exist
download_file(MODEL_URL, MODEL_FILE)
download_file(MMPROJ_URL, MMPROJ_FILE)

# Initialize the multimodal chat handler
chat_handler = Gemma3ChatHandler(clip_model_path=MMPROJ_FILE)

# Load the Gemma 3 model with multimodal support
llm = Llama(
    model_path=MODEL_FILE,
    chat_handler=chat_handler,
    n_ctx=2048,  # You can increase or decrease as needed
)

# Sample inference: describe an image from a URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Please describe this image."},
            {"type": "image_url", "image_url": {"url": "https://raw.githubusercontent.com/huggingface/transformers/refs/heads/main/tests/fixtures/tests_samples/COCO/000000039769.png"}},
        ]
    }
]

output = llm.create_chat_completion(
    messages,
    stop=['<end_of_turn>', '<eos>'],
    max_tokens=200,
)

print("Model output:", output['choices'][0]['message']['content'])

There is one issue I want to ask for help with: the output speed is quite slow.
On CPU:

Llama.generate: 299 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print: load time = 458.17 ms
llama_perf_context_print: prompt eval time = 363380.98 ms / 301 tokens ( 1207.25 ms per token, 0.83 tokens per second)
llama_perf_context_print: eval time = 97226.48 ms / 199 runs ( 488.58 ms per token, 2.05 tokens per second)
llama_perf_context_print: total time = 98303.12 ms / 500 tokens

On GPU:

Llama.generate: 299 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print: load time = 401.93 ms
llama_perf_context_print: prompt eval time = 285504.85 ms / 301 tokens ( 948.52 ms per token, 1.05 tokens per second)
llama_perf_context_print: eval time = 80909.79 ms / 199 runs ( 406.58 ms per token, 2.46 tokens per second)
llama_perf_context_print: total time = 81813.53 ms / 500 tokens

When I test the same model in LM Studio, the speed is much faster: around 4-9 tokens/second on CPU, and with GPU it can go as high as 60 TPS.
Why is there such a difference between llama_cpp_python and LM Studio for the same model? Do you have any suggestions to make the output faster?

@kossum (Author) commented Jun 1, 2025

Hi @xia0nan, thanks for your feedback.

If you find that inference speed is slow, please note that you can use the n_gpu_layers parameter to specify how many transformer layers should be offloaded to the GPU. For example:

llm = Llama(
    model_path=MODEL_PATH,
    chat_handler=Gemma3ChatHandler(clip_model_path=MMPROJ_PATH),
    n_gpu_layers=48,  # number of layers to offload to the GPU; -1 offloads all layers
    n_ctx=1024,
)

However, in order to use the GPU, you need to build llama-cpp-python with the appropriate backend. Please refer to the Supported Backends section in the README for detailed instructions on enabling GPU support during installation.

Also, LM Studio may use GPU acceleration by default or include additional optimizations, which can explain the speed difference.

Please note that GPU support and related build issues are not directly within the scope of this chat handler. If you encounter any problems when compiling with GPU support, feel free to open a separate issue. For more general installation or backend-related questions, you may also want to refer to the main llama-cpp-python repository.

@joaojhgs commented Jun 6, 2025

Hey @kossum, just out of curiosity: is it possible to port this into my project without waiting for it to be merged or using your branch? I mean making a class override; I've done that before in cases where I couldn't wait for a third-party lib to merge. But this one has those ctypes functions that look like compiled stuff I'd need support from the lib for directly. Could I add those ctypes directly into my project as well?

Also, when I try to load the mmproj model, the loading function fails with complaints about the mmproj file missing some required keys, such as general.description, clip.has_text_encoder, etc. I'm using the unsloth 4b model and their F16 mmproj file; am I perhaps missing something?

@chriszs commented Jun 6, 2025

could I add those ctypes directly into my project as well?

You can. The trick is to duplicate the ctypes file into your project, override the __init__ method from Llava, and import from your local ctypes when instantiating.
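
For illustration, here is a rough sketch of that pattern. The module name my_llava_cpp is hypothetical (your local copy of the PR's patched llava_cpp.py bindings), and the Gemma3-specific template logic from this PR would still need to be copied into the subclass; only the binding redirection is shown:

from llama_cpp.llama_chat_format import Llava15ChatHandler

import my_llava_cpp  # hypothetical: local copy of the patched ctypes bindings


class LocalGemma3ChatHandler(Llava15ChatHandler):
    # Copy the Gemma3 chat template / image-eval methods from the PR into this
    # class as well; this sketch only shows redirecting the bindings.

    def __init__(self, clip_model_path: str, verbose: bool = True):
        super().__init__(clip_model_path=clip_model_path, verbose=verbose)
        # The base class assigns the bundled llama_cpp.llava_cpp module to
        # self._llava_cpp; point it at the local copy so the extra clip_*
        # functions used by the Gemma3 handler resolve from your project.
        self._llava_cpp = my_llava_cpp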
