Feature Request: Support for C4AI Command R7B / Cohere2ForCausalLM #10816
Comments
There are, AFAIK, only two model families that bring citation features: LongCite (which already has a GGUF, but the model itself is fairly weak at reasoning) and Command-R, which now offers citations at 7B with decent reasoning ability.
This simple patch allows converting and running the model, and the output looks good so far in my early testing. I don't know what kind of support llama.cpp has for this model's interleaved sliding-window attention, though. I uploaded the weights to HF at https://huggingface.co/dranger003/c4ai-command-r7b-12-2024-GGUF, and I tested using:
diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index 9dc1673b..ddb0e3e8 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -3047,6 +3047,7 @@ class MambaModel(Model):
@Model.register("CohereForCausalLM")
+@Model.register("Cohere2ForCausalLM")
class CommandR2Model(Model):
    model_arch = gguf.MODEL_ARCH.COMMAND_R
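For context, the converter dispatches on the "architectures" field of the model's config.json, so the whole patch is just mapping the new Cohere2ForCausalLM name onto the existing Command-R converter class. A minimal, illustrative sketch of that registry pattern (simplified, not the actual convert_hf_to_gguf.py code):

# Illustrative sketch of the registry pattern (simplified, not the real
# convert_hf_to_gguf.py): Model.register maps HF architecture names from
# config.json to a converter class.
_model_classes = {}

class Model:
    @classmethod
    def register(cls, *names):
        def wrapper(model_cls):
            for name in names:
                _model_classes[name] = model_cls
            return model_cls
        return wrapper

    @classmethod
    def from_model_architecture(cls, arch):
        return _model_classes[arch]

@Model.register("CohereForCausalLM")
@Model.register("Cohere2ForCausalLM")  # the one-line patch adds this mapping
class CommandR2Model(Model):
    pass

# Both architecture strings now resolve to the same converter class:
assert Model.from_model_architecture("Cohere2ForCausalLM") is CommandR2Model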
@ExtReMLapin Can you show your full output and what is your command line? Also, what platform are you on and where did you get the converted weights? Below is my full output after converting the weights using the proposed change. Aside from the 8K context, the output works fine.
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "CohereForAI/c4ai-command-r7b-12-2024"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map = 'auto')
# Format message with the c4ai-command-r7b-12-2024 chat template
messages = [{"role": "user", "content": "Hello, how are you?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
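# Override the templated prompt with the raw prompt file that is also passed
# to llama.cpp below, so both runs use the same input.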
with open("cohere_prompt.txt", "r") as f:
prompt = f.read()
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
gen_tokens = model.generate(
input_ids,
max_new_tokens=500,
do_sample=True,
temperature=0.3,
)
gen_text = tokenizer.decode(gen_tokens[0], skip_special_tokens=True)
print(gen_text)

./llama-cli --model ./ggml-c4ai-command-r-7b-12-2024-q4_k.gguf --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --ctx-size 8192 --tensor-split 0.0,1.0,0.0 -sm none -mg 1 -ngl 99999 -f ./cohere_prompt.txt
transformers PyPI package answer:
Cited Documents: 0
Grounded answer: The text does not mention a specific name for a "grosse araignée" (large spider). It only describes an incident involving a large spider that was present in a corridor.

GGUF answer:
fqshdfdhf^ùpzd*^oedf"

Edit: yes, I know the prompts changed from Command-R to R7B, but it still works in HF.
It uses interleaved attention (3 SWA layers + 1 global-attention layer). So here is an implementation of interleaved SWA/global-attention layers.
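To illustrate the interleaving pattern itself (not the implementation referenced above), here is a minimal NumPy sketch, assuming every fourth layer is global and a 4096-token window as described in the model card:

# Minimal sketch of interleaved SWA / global attention masks (illustration only).
# Assumption: every 4th layer uses full causal (global) attention, the other
# three use a causal sliding window, as described for Command R7B.
import numpy as np

def causal_mask(n_tokens: int) -> np.ndarray:
    # True where query i may attend to key j (j <= i).
    i = np.arange(n_tokens)[:, None]
    j = np.arange(n_tokens)[None, :]
    return j <= i

def sliding_window_mask(n_tokens: int, window: int) -> np.ndarray:
    # Causal, but each token only sees the previous `window` tokens.
    i = np.arange(n_tokens)[:, None]
    j = np.arange(n_tokens)[None, :]
    return (j <= i) & (j > i - window)

def layer_mask(layer_idx: int, n_tokens: int, window: int = 4096) -> np.ndarray:
    # Layers 0,1,2 -> SWA; layer 3 -> global; then the pattern repeats.
    if (layer_idx + 1) % 4 == 0:
        return causal_mask(n_tokens)
    return sliding_window_mask(n_tokens, window)

# Example: with 8 tokens and a window of 4, layer 0 is windowed, layer 3 is global.
print(layer_mask(0, 8, window=4).astype(int))
print(layer_mask(3, 8, window=4).astype(int))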
Thanks for your work @dranger003. Did you have the opportunity to test the prompt I sent earlier? I'm out of the office currently, so I can't test your fork.
@ExtReMLapin I tested your prompt, and the output is identical to the one from the HF model (using temp 0). However, I think what you have uses the citation format from Command-R/R+, but this is Command R7B. Looking at the documentation from Cohere, the two appear to take a different approach to grounded RAG:
@ExtReMLapin Actually, I took another look at the RAG template, so I generated a template output using this Python code:
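For reference, a minimal sketch of dumping a grounded-RAG prompt to a file with transformers, assuming the documents= argument of apply_chat_template available in recent releases (the document contents below are made up for illustration):

# Sketch: render the Command R7B grounded-RAG chat template to a prompt file.
# Assumption: the installed transformers version accepts `documents=` in
# apply_chat_template and the tokenizer ships a RAG-aware template.
from transformers import AutoTokenizer

model_id = "CohereForAI/c4ai-command-r7b-12-2024"
tokenizer = AutoTokenizer.from_pretrained(model_id)

documents = [
    {"title": "incident report", "text": "A large spider was found in a corridor."},
]
messages = [{"role": "user", "content": "What does the report say about the spider?"}]

prompt = tokenizer.apply_chat_template(
    messages,
    documents=documents,
    tokenize=False,
    add_generation_prompt=True,
)

with open("cohere_prompt.txt", "w") as f:
    f.write(prompt)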
Then I ran llama.cpp with the generated prompt file.
Prerequisites
Feature Description
I would like to request support for C4AI Command R7B by Cohere.
Here is some relevant information:
Download link: https://huggingface.co/CohereForAI/c4ai-command-r7b-12-2024
Some specifications:
Blog post: https://cohere.com/blog/command-r7b
Motivation
I believe it will be a great addition to llama.cpp.
Possible Implementation
Model Architecture: This is an auto-regressive language model that uses an optimized transformer architecture. After pretraining, this model uses supervised fine-tuning (SFT) and preference training to align model behavior to human preferences for helpfulness and safety. The model features three layers with sliding window attention (window size 4096) and RoPE for efficient local context modeling and relative positional encoding. A fourth layer uses global attention without positional embeddings, enabling unrestricted token interactions across the entire sequence.
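As a rough illustration of the positional-encoding split described above (an assumption-based sketch, not Cohere's modelling code), the sliding-window layers apply RoPE while the global layer leaves queries and keys without positional encoding:

# Sketch: RoPE on sliding-window layers, no positional encoding on the global
# layer (illustration only; the exact layer indexing is an assumption).
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    # x: (seq_len, head_dim) with even head_dim; rotate feature pairs by
    # position-dependent angles (rotate-half formulation).
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)               # (half,)
    angles = np.arange(seq_len)[:, None] * freqs[None, :]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def position_encode(layer_idx: int, q: np.ndarray) -> np.ndarray:
    # Assumption: every 4th layer is the global one and skips RoPE entirely.
    is_global = (layer_idx + 1) % 4 == 0
    return q if is_global else rope(q)

q = np.random.randn(8, 64)
print(np.allclose(position_encode(3, q), q))   # True: global layer, unchanged
print(np.allclose(position_encode(0, q), q))   # False: SWA layer, rotated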