
Feature Request: Support for C4AI Command R7B / Cohere2ForCausalLM #10816

Open
4 tasks done
arch-btw opened this issue Dec 13, 2024 · 10 comments · May be fixed by #10900
Labels
enhancement New feature or request

Comments

@arch-btw
Contributor

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

I would like to request support for C4AI Command R7B by Cohere.

Here is some relevant information:

Download link: https://huggingface.co/CohereForAI/c4ai-command-r7b-12-2024

Some specifications:

  • A well-rounded model
  • Model Size: 7 billion parameters
  • Context length: 128K
  • Enhanced efficiency in math, code, and reasoning tasks
  • Multilingual, reasoning, tool use.
  • RAG capability

Blog post: https://cohere.com/blog/command-r7b

Motivation

I believe it would be a great addition to llama.cpp.

Possible Implementation

Model Architecture: This is an auto-regressive language model that uses an optimized transformer architecture. After pretraining, the model is aligned to human preferences for helpfulness and safety using supervised fine-tuning (SFT) and preference training. The layers follow a repeating pattern: three layers use sliding window attention (window size 4096) with RoPE for efficient local context modeling and relative positional encoding, and every fourth layer uses global attention without positional embeddings, enabling unrestricted token interactions across the entire sequence.
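As a rough illustration of that layer pattern, here is a minimal Python sketch (not llama.cpp code; the constant names and the assumption of a repeating 4-layer cycle are illustrative only):

SLIDING_WINDOW_PATTERN = 4   # assumed cycle: 3 SWA layers followed by 1 global layer
SLIDING_WINDOW_SIZE = 4096

def layer_attention_config(layer_idx: int) -> dict:
    """Return the attention settings this sketch assumes for a given layer index."""
    if (layer_idx + 1) % SLIDING_WINDOW_PATTERN == 0:
        # Global attention layer: full causal attention, no positional embeddings.
        return {"window": None, "use_rope": False}
    # Sliding-window layer: attend only to the last 4096 tokens, with RoPE.
    return {"window": SLIDING_WINDOW_SIZE, "use_rope": True}

# Layers 0-2 use SWA, layer 3 is global, layers 4-6 use SWA, layer 7 is global, ...
for i in range(8):
    print(i, layer_attention_config(i))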

arch-btw added the enhancement (New feature or request) label on Dec 13, 2024
@ExtReMLapin
Contributor

As far as I know, only two models offer citation features: LongCite (which already has a GGUF, but the model itself is fairly weak at reasoning) and Command-R. This release now brings citations in a 7B model with decent intelligence.

@dranger003
Contributor

This simple patch allows converting and running the model, and the output looks good so far in my early testing. I don't know what kind of support llama.cpp has for "position_embedding_type": "rope_gptj", and so the model shows 8K context rather than 128K.

I uploaded the weights to HF: https://huggingface.co/dranger003/c4ai-command-r7b-12-2024-GGUF.

And I tested using:

./build/bin/llama-cli -fa --no-display-prompt -c 0 -m ggml-c4ai-command-r-7b-12-2024-q8_0.gguf -p "<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>You are a helpful assistant.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Tell me all about yourself.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_RESPONSE|>"
diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index 9dc1673b..ddb0e3e8 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -3047,6 +3047,7 @@ class MambaModel(Model):


 @Model.register("CohereForCausalLM")
+@Model.register("Cohere2ForCausalLM")
 class CommandR2Model(Model):
     model_arch = gguf.MODEL_ARCH.COMMAND_R
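
For context, a minimal sketch (not the actual convert_hf_to_gguf.py code) of why one extra decorator is enough: Model.register maps the "architectures" string from the HF config.json to a converter class, so Cohere2ForCausalLM simply reuses the existing command-r conversion path. The names below are illustrative only.

_converters: dict[str, type] = {}

def register(*names):
    # Decorator that maps one or more HF architecture names to a converter class.
    def wrapper(cls):
        for name in names:
            _converters[name] = cls
        return cls
    return wrapper

@register("CohereForCausalLM", "Cohere2ForCausalLM")
class CommandR2Converter:
    model_arch = "command-r"

# Lookup by the architecture name found in the model's config.json:
print(_converters["Cohere2ForCausalLM"].model_arch)  # -> command-r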

@ExtReMLapin
Contributor

Thanks for your work @dranger003. We gave it a try on our end; so far, even with a max context of 8192, the results are not usable.

(screenshot of unusable output)

@dranger003
Contributor

@ExtReMLapin Can you show your full output and your command line? Also, what platform are you on, and where did you get the converted weights? Below is my full output after converting the weights using the proposed change. Aside from the 8K context, the output works fine.

build\bin\Release\llama-cli.exe -sp -fa -c 0 -ngl 33 -m ggml-c4ai-command-r7b-12-2024-q8_0.gguf -p "<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>You are a helpful assistant.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Tell me all about yourself.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_RESPONSE|>"
build: 4355 (152610ed) with MSVC 19.42.34435.0 for x64
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 258 tensors from ggml-c4ai-command-r7b-12-2024-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = command-r
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = C4AI Command R7B
llama_model_loader: - kv   3:                         general.size_label str              = 8.0B
llama_model_loader: - kv   4:                            general.license str              = cc-by-nc-4.0
llama_model_loader: - kv   5:                          general.languages arr[str,23]      = ["en", "fr", "de", "es", "it", "pt", ...
llama_model_loader: - kv   6:                      command-r.block_count u32              = 32
llama_model_loader: - kv   7:                   command-r.context_length u32              = 8192
llama_model_loader: - kv   8:                 command-r.embedding_length u32              = 4096
llama_model_loader: - kv   9:              command-r.feed_forward_length u32              = 14336
llama_model_loader: - kv  10:             command-r.attention.head_count u32              = 32
llama_model_loader: - kv  11:          command-r.attention.head_count_kv u32              = 8
llama_model_loader: - kv  12:                   command-r.rope.freq_base f32              = 50000.000000
llama_model_loader: - kv  13:     command-r.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  14:             command-r.attention.key_length u32              = 128
llama_model_loader: - kv  15:           command-r.attention.value_length u32              = 128
llama_model_loader: - kv  16:                          general.file_type u32              = 7
llama_model_loader: - kv  17:                      command-r.logit_scale f32              = 0.250000
llama_model_loader: - kv  18:                command-r.rope.scaling.type str              = none
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = command-r
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,256000]  = ["<PAD>", "<UNK>", "<CLS>", "<SEP>", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,253333]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 5
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 255001
llama_model_loader: - kv  26:            tokenizer.ggml.unknown_token_id u32              = 1
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  28:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  29:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  30:           tokenizer.chat_template.tool_use str              = {%- macro document_turn(documents) -%...
llama_model_loader: - kv  31:                tokenizer.chat_template.rag str              = {% set tools = [] %}\n{%- macro docume...
llama_model_loader: - kv  32:                   tokenizer.chat_templates arr[str,2]       = ["tool_use", "rag"]
llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {% if documents %}\n{% set tools = [] ...
llama_model_loader: - kv  34:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   33 tensors
llama_model_loader: - type q8_0:  225 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 41
llm_load_vocab: token to piece cache size = 1.8428 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = command-r
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 253333
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 2.5e-01
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = none
llm_load_print_meta: freq_base_train  = 50000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.94 GiB (8.50 BPW)
llm_load_print_meta: general.name     = C4AI Command R7B
llm_load_print_meta: BOS token        = 5 '<BOS_TOKEN>'
llm_load_print_meta: EOS token        = 255001 '<|END_OF_TURN_TOKEN|>'
llm_load_print_meta: UNK token        = 1 '<UNK>'
llm_load_print_meta: PAD token        = 0 '<PAD>'
llm_load_print_meta: LF token         = 136 'Ä'
llm_load_print_meta: FIM PAD token    = 0 '<PAD>'
llm_load_print_meta: EOG token        = 0 '<PAD>'
llm_load_print_meta: EOG token        = 255001 '<|END_OF_TURN_TOKEN|>'
llm_load_print_meta: max token length = 1024
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CUDA0 model buffer size =  8135.02 MiB
llm_load_tensors:   CPU_Mapped model buffer size =  1062.50 MiB
...............................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 8192
llama_new_context_with_model: n_ctx_per_seq = 8192
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 1
llama_new_context_with_model: freq_base     = 50000.0
llama_new_context_with_model: freq_scale    = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.98 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   516.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 841
llama_new_context_with_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

sampler seed: 1590990773
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 8192
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 1

<BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>You are a helpful assistant.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Tell me all about yourself.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_RESPONSE|>I am an AI-assistant chatbot trained to provide helpful and insightful information to users. I can assist with a wide range of tasks, including:

- Answering questions: I can provide detailed and accurate answers on various topics, from general knowledge to specialized fields.
- Generating text: I can create content on different subjects, such as stories, articles, or summaries.
- Offering assistance: I can help with tasks like providing information, generating ideas, or offering suggestions.
- Engaging in conversation: I can chat and interact naturally, allowing for dynamic and engaging conversations.

My responses are designed to be informative, coherent, and helpful. I can adapt to different tones and styles, ensuring that my output is tailored to the user's needs. I can also provide references and sources if requested.

Please note that I am an AI assistant and do not possess personal experiences or emotions. I am programmed to provide assistance and do not have personal beliefs or opinions.

Feel free to ask me anything, and I will do my best to assist you!<|END_RESPONSE|><|END_OF_TURN_TOKEN|> [end of text]


llama_perf_sampler_print:    sampling time =      30.25 ms /   242 runs   (    0.13 ms per token,  7999.47 tokens per second)
llama_perf_context_print:        load time =    3719.67 ms
llama_perf_context_print: prompt eval time =      17.56 ms /    22 tokens (    0.80 ms per token,  1252.56 tokens per second)
llama_perf_context_print:        eval time =    2477.70 ms /   219 runs   (   11.31 ms per token,    88.39 tokens per second)
llama_perf_context_print:       total time =    2559.39 ms /   241 tokens

@ExtReMLapin
Contributor

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CohereForAI/c4ai-command-r7b-12-2024"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Format a message with the c4ai-command-r7b-12-2024 chat template
messages = [{"role": "user", "content": "Hello, how are you?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")

# Override the chat-template input with the same prompt file passed to llama-cli below
with open("cohere_prompt.txt", "r") as f:
    prompt = f.read()
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

gen_tokens = model.generate(
    input_ids,
    max_new_tokens=500,
    do_sample=True,
    temperature=0.3,
)

gen_text = tokenizer.decode(gen_tokens[0], skip_special_tokens=True)
print(gen_text)
./llama-cli --model ./ggml-c4ai-command-r-7b-12-2024-q4_k.gguf --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --ctx-size 8192 --tensor-split 0.0,1.0,0.0 -sm none -mg 1 -ngl 99999 -f ./cohere_prompt.txt

cohere_prompt.txt

@ExtReMLapin
Contributor

ExtReMLapin commented Dec 18, 2024

Answer from the transformers PyPI package:

Cited Documents: 0
Grounded answer: The text does not mention a specific name for a "grosse araignée" (large spider). It only describes an incident involving a large spider that was present in a corridor.

GGUF answer: fqshdfdhf^ùpzd*^oedf"

Edit: yes, I know the prompt format changed from Command R to R7B, but it still works in HF.

@foldl
Contributor

foldl commented Dec 19, 2024

It uses a repeating pattern of 3 SWA layers + 1 global attention layer, so build_command_r needs to be updated, even though the result seems promising.

Here is an implementation of interleaved SWA/global-attention layers.

https://github.com/foldl/chatllm.cpp/blob/ff54a787948f02151b38231375be042b632a271e/models/cohere.cpp#L246C1-L258C1
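
For reference, here is a minimal PyTorch sketch (not taken from chatllm.cpp or llama.cpp) of the two causal masks such an interleaved implementation needs: SWA layers attend only to the previous window of tokens, while global layers attend to the whole prefix.

import torch

def causal_mask(n_tokens: int, window: int | None = None) -> torch.Tensor:
    """Boolean [n_tokens, n_tokens] mask; True means the key position may be attended to."""
    i = torch.arange(n_tokens).unsqueeze(1)  # query positions
    j = torch.arange(n_tokens).unsqueeze(0)  # key positions
    mask = j <= i                            # causal: never attend to future tokens
    if window is not None:
        mask &= (i - j) < window             # sliding window: keep only the last `window` keys
    return mask

full_mask = causal_mask(8)           # global-attention layers
swa_mask = causal_mask(8, window=4)  # SWA layers (window size 4096 in the real model)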

dranger003 linked a pull request on Dec 19, 2024 that will close this issue
@ExtReMLapin
Contributor

Thanks for your work @dranger003. Did you have the opportunity to test the prompt I sent earlier? I'm out of the office currently, so I can't test your fork.

@dranger003
Contributor

@ExtReMLapin I tested your prompt, and the output is identical to the one from the HF model (using temp 0). However, I think what you have uses the citation format from Command-R/R+, whereas this is Command R7B.

Looking at the documentation from Cohere, the two appear to take different approaches to grounded RAG:

  • Command-R/R+
  • Command R7B

@dranger003
Contributor

@ExtReMLapin Actually, I took another look at the RAG template from tokenizer_config.json and I noticed you are right: there is an enable_citations option. I tried it and it actually works quite well, too. I didn't see this in the documentation for some reason.

So I generated a template output using this Python code:

from transformers import AutoTokenizer

# Tokenizer for the same model id as above
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r7b-12-2024")

conversation = [
    {
        "role": "user",
        "content": "quelles sont les mentions d'une 'grosse araignée' ?",
    },
]

documents = [
    {
        "id": "0",
        "title": "Chunk #10:22535, rid: #10:22535",
        "snippet": "...",
    },
    {
        "id": "1",
        "title": "Chunk #13:4, rid: #13:4",
        "snippet": "...",
    },
]

text = tokenizer.apply_chat_template(
    conversation=conversation,
    documents=documents,
    add_generation_prompt=True,
    tokenize=False,
    enable_citations=True,  # to enable citations!
)

print(text)

Then I ran llama.cpp with ggml-c4ai-command-r7b-12-2024-q8_0.gguf from my HF repo and here is the output:

Dans le texte, une <co>grosse araignée</co: 0:[0]> est mentionnée <co>deux fois.</co: 0:[0,1]> La première fois, elle est décrite comme <co>immobile</co: 0:[0]> et <co>couverte de givre</co: 0:[0]> <co>au milieu d'une toile.</co: 0:[0]> Harry <co>utilise un sortilège pour la faire bouger</co: 0:[0]>, mais <co>elle ne diminue pas de taille.</co: 0:[0]> La deuxième fois, <co>Hermione</co: 0:[1]> mentionne une <co>grosse araignée répugnante</co: 0:[1]> qui a été <co>créée à partir de l'ours en peluche de Ron</co: 0:[1]> lorsqu'il <co>avait trois ans.</co: 0:[1]>
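
A small parsing sketch for this citation markup, assuming the tag structure is <co>text</co: N:[doc_ids]> exactly as it appears in the output above (inferred from the example, not from Cohere's documentation):

import re

CO_TAG = re.compile(r"<co>(.*?)</co: (\d+):\[([\d,]+)\]>")

def extract_citations(text: str) -> list[tuple[str, list[int]]]:
    """Return (cited span, document ids) pairs found in a grounded answer."""
    return [
        (span, [int(d) for d in doc_ids.split(",")])
        for span, _idx, doc_ids in CO_TAG.findall(text)
    ]

answer = "une <co>grosse araignée</co: 0:[0]> est mentionnée <co>deux fois.</co: 0:[0,1]>"
print(extract_citations(answer))
# [('grosse araignée', [0]), ('deux fois.', [0, 1])]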
