feat: Add Gemma3 chat handler (#1976) #1989
base: main
Conversation
…in Gemma3ChatHandler
I've been using it a bit and it works nicely. I had to figure out the message structure, but maybe that's normal for different chat handlers; I'm not that familiar with llama-cpp. I was used to this shape:

"type": "image",
"image": {
    "url": "https://image.com/img.jpg",
}

whereas here "image_url" is used in both places. |
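For reference, a small sketch of the content-part shapes this handler accepts; both forms appear in the examples later in the thread, and the URLs here are just placeholders:

# Both "image_url" forms are accepted (per the examples below); URLs are placeholders.
part_as_string = {'type': 'image_url', 'image_url': 'https://image.com/img.jpg'}
part_as_object = {'type': 'image_url', 'image_url': {'url': 'https://image.com/img.jpg'}}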
How would that work with a local image? |
Sorry, I didn't modify the original chat template of gemma3. For a local image you can pass a base64 data URI. Here is a full example:

from pathlib import Path

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Gemma3ChatHandler


def image_to_base64_uri(image: bytes | str):
    import base64
    import urllib.request as request

    if isinstance(image, bytes):
        data = base64.b64encode(image).decode('utf-8')
    else:
        with request.urlopen(image) as f:
            data = base64.b64encode(f.read()).decode('utf-8')
    return f'data:image/png;base64,{data}'


chat_handler = Gemma3ChatHandler(clip_model_path='path/to/mmproj')

llama = Llama(
    model_path='path/to/model',
    chat_handler=chat_handler,
    n_ctx=2048,  # n_ctx should be increased to accommodate the image embedding
)

messages = [
    {
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'please compare these pictures'},
            {'type': 'image_url', 'image_url': 'https://xxxx/img1.jpg'},
            {'type': 'image_url', 'image_url': {'url': 'https://xxxx/img2.png'}},
            {'type': 'image_url', 'image_url': image_to_base64_uri(Path('path/to/img3.jpg').read_bytes())},
            {'type': 'image_url', 'image_url': {'url': image_to_base64_uri(Path('path/to/img4.png').read_bytes())}},
            {'type': 'text', 'text': 'and then tell me which one looks the best'},
        ]
    }
]

output = llama.create_chat_completion(
    messages,
    stop=['<end_of_turn>', '<eos>'],
    max_tokens=500,
    stream=True,
)

for chunk in output:
    delta = chunk['choices'][0]['delta']
    if 'role' in delta:
        print(delta['role'], end=':\n')
    elif 'content' in delta:
        print(delta['content'], end='')

llama._sampler.close()
llama.close() |
Bump on this, thanks for your work! gemma3 is a great model to have support for; I'm waiting on it! |
Hey @kossum, just wondering, does this handler support function calling? I ask because the handler for llava1.5 supports multimodal (vision) and tool calling at the same time. Since Gemma3 also has tool-calling capabilities, it would be great to have both in a single handler! |
Hello @joaojhgs, gemma3 (especially the 12b and 27b versions) has strong instruction-following abilities and can generate structured function-call outputs through well-designed prompts. But unlike gpt4 or claude, gemma3 does not have built-in support for tool-call tokens or JSON schema enforcement. That means that to implement function calling with gemma3, you must rely on carefully designed prompts to guide the model in producing the correct format. Simple example:

import json

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Gemma3ChatHandler

chat_handler = Gemma3ChatHandler(clip_model_path='path/to/mmproj')

llama = Llama(
    model_path='path/to/model',
    chat_handler=chat_handler,
    n_ctx=2048,  # n_ctx should be increased to accommodate the image embedding
)


def analyze_image(image_id: str, description: str):
    print('image_id:', image_cache.get(image_id))
    print('description:', description)
    ...


image_cache = {'img_01': 'https://xxxx/img_01.jpg'}
function_table = {'analyze_image': analyze_image}

# input arg1
image_id = 'img_01'
# input arg2
question = f'Here is the image with ID `{image_id}`. Please analyze it.'

output = llama.create_chat_completion(
    [
        {
            'role': 'system',
            'content': '''You can call the following function:
- analyze_image(image_id: str, description: str)

You will be shown an image. First, analyze and describe its content in detail.
Then, return a function call with:
- the assigned image_id (provided in the input)
- a description of what the image shows (your own analysis)

Respond only with a JSON (without code blocks) function call like:
{
  "function": "analyze_image",
  "arguments": {
    "image_id": "<image id>",
    "description": "<description of the image>"
  }
}
'''
        },
        {
            'role': 'user',
            'content': [
                {'type': 'text', 'text': question},
                {'type': 'image_url', 'image_url': image_cache[image_id]},
            ]
        }
    ],
    stop=['<end_of_turn>', '<eos>'],
    max_tokens=500,
)

data = json.loads(output['choices'][0]['message']['content'])
result = function_table[data['function']](**data['arguments'])
...

Naturally, if multimodal capabilities aren't needed, this chat handler can be omitted. |
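To illustrate that last point, here is a minimal text-only sketch (paths are placeholders) that skips the chat handler entirely; in that case llama-cpp-python should fall back to the chat template stored in the GGUF metadata:

from llama_cpp import Llama

# No Gemma3ChatHandler (and no mmproj file) is needed for text-only chat.
llama = Llama(model_path='path/to/model', n_ctx=2048)

output = llama.create_chat_completion(
    [{'role': 'user', 'content': 'Explain in one sentence what function calling is.'}],
    stop=['<end_of_turn>', '<eos>'],
    max_tokens=200,
)
print(output['choices'][0]['message']['content'])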
Thanks, I didn't know about that! |
@abetlen I've been waiting for this to be merged for some time. I'm curious: are you still actively maintaining this repo? Thanks! |
Say Hey-

File "d:\Foundary\Gemma-3_llama.py", line 57, in

So, seeing the "OSError" label, I wanted to ask if you had run your code under Windows 11? |
Hello @Domino9752, thanks for testing! The issue happens because the author updated the llama.cpp library in the current 0.3.9 version, but the corresponding changes for the llava part haven't been made yet. |
kossum- Thanks for your help. I changed over to your branch:

python -m pip install git+https://github.com/kossum/llama-cpp-python@gemma3-fix --no-cache-dir --force-reinstall --upgrade --config-settings="cmake.args=-DGGML_CUDA=on"

...and now it runs. I previously had the syntax shown in the April 4th message using "image_url".

path_to_image = r"D:\FLUX1-dev\inputs\Clipped\0515-2200-6-2_224.png"
messages = [

Intel i7 14700K / RTX3090 |
Hi @kossum, thank you for the work. I've tested it, and it works well. Here's the test code:
import os

import requests
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Gemma3ChatHandler

# URLs for the model weights
MODEL_URL = "https://huggingface.co/vinimuchulski/gemma-3-4b-it-qat-q4_0-gguf/resolve/main/gemma-3-4b-it-q4_0.gguf?download=true"
MMPROJ_URL = "https://huggingface.co/vinimuchulski/gemma-3-4b-it-qat-q4_0-gguf/resolve/main/mmproj-model-f16-4B.gguf?download=true"

MODEL_FILE = "gemma-3-4b-it-q4_0.gguf"
MMPROJ_FILE = "mmproj-model-f16-4B.gguf"


def download_file(url, local_path):
    if not os.path.exists(local_path):
        print(f"Downloading {local_path} ...")
        with requests.get(url, stream=True) as r:
            r.raise_for_status()
            with open(local_path, 'wb') as f:
                for chunk in r.iter_content(chunk_size=8192):
                    f.write(chunk)
        print(f"Downloaded {local_path}")
    else:
        print(f"{local_path} already exists, skipping download.")


# Download the weights if they don't exist
download_file(MODEL_URL, MODEL_FILE)
download_file(MMPROJ_URL, MMPROJ_FILE)

# Initialize the multimodal chat handler
chat_handler = Gemma3ChatHandler(clip_model_path=MMPROJ_FILE)

# Load the Gemma 3 model with multimodal support
llm = Llama(
    model_path=MODEL_FILE,
    chat_handler=chat_handler,
    n_ctx=2048,  # You can increase or decrease as needed
)

# Sample inference: describe an image from a URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Please describe this image."},
            {"type": "image_url", "image_url": {"url": "https://raw.githubusercontent.com/huggingface/transformers/refs/heads/main/tests/fixtures/tests_samples/COCO/000000039769.png"}},
        ]
    }
]

output = llm.create_chat_completion(
    messages,
    stop=['<end_of_turn>', '<eos>'],
    max_tokens=200,
)

print("Model output:", output['choices'][0]['message']['content'])

There is one issue I want to ask for help with: the output speed is quite slow.
On GPU kernel:
When I test the same model in LM Studio, the speed is much faster: around 4-9 tokens/second with CPU, and with GPU it can go as high as 60 TPS. |
Hi @xia0nan, thanks for your feedback. If you find that inference speed is slow, please note that you can use the n_gpu_layers parameter to offload model layers to the GPU, for example:

llm = llama.Llama(
    model_path=MODEL_PATH,
    chat_handler=Gemma3ChatHandler(clip_model_path=MMPROJ_PATH),
    n_gpu_layers=48,
    n_ctx=1024,
)

However, in order to use the GPU, you need to build llama-cpp-python with the appropriate backend. Please refer to the Supported Backends section in the README for detailed instructions on enabling GPU support during installation. Also, LM Studio may use GPU acceleration by default or include additional optimizations, which can explain the speed difference. Please note that GPU support and related build issues are not directly within the scope of this chat handler. If you encounter any problems when compiling with GPU support, feel free to open a separate issue. For more general installation or backend-related questions, you may also want to refer to the main llama-cpp-python repository. |
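As a quick sanity check (assuming the low-level llama_supports_gpu_offload binding is available in your installed version), you can verify whether the wheel you installed was actually built with a GPU backend:

import llama_cpp

# Prints True when the underlying llama.cpp build was compiled with a GPU backend;
# if it prints False, n_gpu_layers will have no effect.
print('GPU offload supported:', llama_cpp.llama_supports_gpu_offload())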
hey @kossum, just out of curiosity, is it possible to port this into my project without waiting for it to be merged or using your branch? I mean making a class override; I've done that before in cases where I couldn't wait for a third-party lib merge, but this one has those ctypes functions that look like compiled stuff that I'd need support from the lib for directly. Could I add those ctypes directly into my project as well? Also, as I have been trying to load the mmproj model, the loading function fails with complaints about the mmproj file missing some required keys, such as general.description, clip.hast_text_encoder, etc. I'm using the unsloth 4b model and their F16 mmproj file; am I perhaps missing something? |
You can. The trick is that you have to duplicate the ctypes file into your project, override the init method from Llava, and import from your local ctypes when instantiating. |
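A rough sketch of that pattern, assuming you have copied the branch's ctypes bindings into a module of your own (called local_mtmd_cpp here; the module name and the attribute being swapped are illustrative assumptions, not part of this PR):

from llama_cpp.llama_chat_format import Llava15ChatHandler

import local_mtmd_cpp  # hypothetical local copy of the branch's ctypes bindings


class LocalGemma3ChatHandler(Llava15ChatHandler):
    # The Gemma3 chat template from the PR would go here (omitted for brevity).

    def __init__(self, clip_model_path: str, verbose: bool = True):
        # Deliberately not calling super().__init__(): re-implement the PR's
        # Gemma3ChatHandler initialisation here, resolving every ctypes call
        # (clip model loading, image embedding, ...) from the vendored
        # local_mtmd_cpp module instead of llama_cpp's bundled bindings.
        self.clip_model_path = clip_model_path
        self.verbose = verbose
        self._llava_cpp = local_mtmd_cpp  # attribute name is an assumption about handler internals
        ...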
Added gemma3 chat handler and fixed the image embedding; supports multiple images.
Included llama.cpp functions and structures:
Usage (current version, after Apr 4 2025):
Test Results:
- unsloth/gemma-3-4b-it-GGUF
- unsloth/gemma-3-12b-it-GGUF
- unsloth/gemma-3-27b-it-GGUF
- bartowski/google_gemma-3-12b-it-GGUF
Compatibility: