Configuring Custom Models
Warning
- We will now walk through the steps of finding, downloading, and configuring a custom model. All of these steps are required for it to (possibly) work.
- Models found on Hugging Face or anywhere else are "unsupported". You should follow this guide before asking for help.
Whether you "Sideload" or "Download" a custom model you must configure it to work properly.
- We will refer to a "Download" as being any model that you found using the "Add Models" feature.
- A custom model is one that is not provided in the default models list within GPT4All. Using the search bar in the "Explore Models" window will yield custom models that must be configured manually by the user.
- A "Sideload" is any model you get from somewhere else and then put in the models directory.
Open GPT4All and click on "Find models". In this example, we use the "Search bar" in the Explore Models window. Typing anything into the search bar will search HuggingFace and return a list of custom models. As an example, down below, we type "GPT4All-Community", which will find models from the GPT4All-Community repository.
It is strongly recommended to use custom models from the GPT4All-Community repository, which can be found using the search feature on the Explore Models page or, alternatively, can be sideloaded. Be aware that those also have to be configured manually.
Warning
- Do not click the "Download" button, as it is buggy at the present time (GPT4All version 3.2.1).
- The GGUF model down below (QuantFactory/Phi-3-mini-128k-instruct-GGUF) is an example of a bad model that is not compatible with GPT4All version 3.2.1 (something is wrong), so do not download it just yet! You can find a working version of this model at the GPT4All-Community repository.
Once you have found a model, click "More info can be found here.", which in this example brings you to Hugging Face.
Here, you find the information that you need to configure the model. (This model may be outdated, it may have been a failed experiment, it may not yet be compatible with GPT4All, it may be dangerous, it may also be GREAT!)
- You need to know the Prompt Template.
- You need to know the maximum context (128k).
- You need to know if there is a problem. See the community tab and look.
Maybe this won't affect you. Though it's a good place to find out.
So next, let's find that template... Hopefully the model authors were kind and included it.
This could be a good helpful template. Hopefully this works. Keep in mind:
- The model authors may not have tested their own model
- The model authors may not have bothered to change the model configuration files from finetuning to inferencing workflows.
- Even if they show you a template it may be wrong.
- Each model has its own tokens and its own syntax.
- The models are trained using these tokens, which is why you must use them for the model to work.
- The model uploader may not understand this either and can fail to provide a good model or a mismatching template.
Apart from the model card, there are three files that could hold relevant information for running the model.
For more information see Advanced Topics: Configuration files Explained
- config.json
- tokenizer_config.json
- generation_config.json
Check config.json to find the capabilities of the model (such as the maximum context length) and any bos and eos tokens that were used during model training or finetuning. Check generation_config.json for any further bos and eos tokens that were used during training or finetuning. Check tokenizer_config.json for the recommended chat template, which is necessary to craft our prompt template. This is especially useful if the model author failed to provide a template in the model card. (Is the template missing from the model training, or just from the model card?) Check all three files: you need to cross-check whether tokenizer_config.json contains the proper beginning of sequence (bos) and end of sequence (eos) tokens to be compatible with GPT4All.
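The lookup described above can be sketched in a few lines of Python, assuming the three files were downloaded into one local folder. The helper name `summarize_model_files` and the keys of the returned dictionary are hypothetical. The JSON field names are the ones that appear in the example files later in this guide, and other models may store their maximum context under a different key than `max_position_embeddings`.

```python
import json
from pathlib import Path


def summarize_model_files(model_dir):
    """Collect the fields this guide asks you to look up (hypothetical helper)."""
    folder = Path(model_dir)

    def load(name):
        # A missing file is treated as empty so the summary can still be built.
        path = folder / name
        return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}

    config = load("config.json")
    generation = load("generation_config.json")
    tokenizer = load("tokenizer_config.json")

    return {
        # Assumption: the maximum context is stored as max_position_embeddings.
        "max_context": config.get("max_position_embeddings"),
        "config_eos_token_id": config.get("eos_token_id"),
        "generation_eos_token_id": generation.get("eos_token_id"),
        "tokenizer_bos_token": tokenizer.get("bos_token"),
        "tokenizer_eos_token": tokenizer.get("eos_token"),
        "has_chat_template": "chat_template" in tokenizer,
    }
```

A value of `None` in the summary means the field (or the whole file) was not found, which is already a hint that the model card is your only source for that piece of information.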
Important
The chat templates must be followed on a per model basis. Every model is different.
You can imagine them to be like magic spells.
Your magic won't work if you say the wrong word. It won't work if you say it at the wrong place or time.
At this step, we need to combine the chat template that we found in the model card with a special syntax that is compatible with the GPT4All-Chat application (The format shown in the above screenshot is only an example). If you looked into the tokenizer_config.json, see Advanced Topics: Jinja2 Explained
Special tokens like <|user|> will say the user is about to talk. <|end|> will tell the LLM we are done with that, now continue on.
- We use %1 as a placeholder for the content of the user's prompt.
- We use %2 as a placeholder for the content of the model's response.
That example prompt should (in theory) be compatible with GPT4All. It will look like this for you...
<|system|>
You are a helpful AI assistant.<|end|>
<|user|>
%1<|end|>
<|assistant|>
%2<|end|>
You can see how the template will inject the messages you type where the %1 goes. You can add something fun in there if you want to... Now the chat knows my name!
<|user|>
3Simplex:%1<|end|>
<|assistant|>
%2<|end|>
This works because we use this template when we send the information to the LLM, so anything inside those tags will be sent.
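To see what the model actually receives, here is a minimal sketch of the placeholder substitution. It only illustrates the idea of %1 and %2; it is not how GPT4All implements it internally, and the function name `fill_prompt_template` is made up for this example.

```python
import re


def fill_prompt_template(template, user_message, assistant_reply=""):
    """Fill the %1 and %2 placeholders in a single pass (illustration only)."""
    values = {"%1": user_message, "%2": assistant_reply}
    # One pass, so a "%2" typed inside the user's message is left untouched.
    return re.sub(r"%[12]", lambda match: values[match.group(0)], template)


TEMPLATE = "<|user|>\n3Simplex:%1<|end|>\n<|assistant|>\n%2<|end|>\n"

print(fill_prompt_template(TEMPLATE, "What is a GGUF file?"))
```

The printed text shows the name sitting right in front of the typed message, which is why the model "knows" it.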
The system prompt will define the behavior of the model when you chat. You can say "Talk like a pirate, and be sure to keep your bird quiet!"
The prompt template will tell the model what is happening and when.
The default settings are a good, safe place to start and provide good output for most models. For instance, you can't blow up your RAM on only 2048 context, and you can always increase it to whatever the model supports.
This is the maximum context that you will use with the model. Context is roughly the sum of the tokens in the system prompt + chat template + user prompts + model responses + tokens that were added to the model's context via retrieval augmented generation (RAG), which would be the LocalDocs feature. You need to keep the context length within two safe margins.
- Your system can only use so much memory. Using more than you have will cause severe slowdowns or even crashes.
- Your model is only capable of what it was trained for. Using more than that will give trash answers and gibberish.
Since we are talking about computer terminology here, 1k = 1024, not 1000. So 128k, as advertised by the phi3 model, translates to 131072 tokens (1024 x 128 = 131072).
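The arithmetic and the two margins can be written down as a small sketch. Both function names are hypothetical, and `memory_budget` is a number of tokens you pick yourself based on what your machine handles comfortably; nothing here measures memory for you.

```python
def context_tokens(size_in_k):
    """Convert an advertised size such as 128 (for "128k") into tokens."""
    return size_in_k * 1024


def safe_context_length(requested, model_max, memory_budget):
    """Stay inside both margins: the model's training limit and your memory."""
    return min(requested, model_max, memory_budget)


phi3_max = context_tokens(128)
print(phi3_max)
print(safe_context_length(requested=16384, model_max=phi3_max, memory_budget=8192))
```

Whichever of the three numbers is smallest wins, so a generous model limit never overrides a tight memory budget.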
I will use 4096, which is 4k of response. I like allowing for a great response but want to stop the model at that point. (Maybe you want it longer? Try 8192)
This is one that you need to think about if you have a small GPU or a big model.
This will be set to load all layers on the GPU. You may need to use less to get the model to work for you.
These settings are model independent. They are only for the GPT4All environment. You can play with them all you like.
The rest of these are special settings that need more training and experience to learn. They don't need to be changed most of the time.
You should now have a fully configured model. I hope it works for you!
Read on for more advanced topics such as:
- Jinja2
- Explain Jinja2 templates and how to decode them for use in GPT4All.
- Explain how the tokens work in the templates.
- Configuration Files Explained
- Explain why the model is now configured but still doesn't work.
- Explain the .json files used to make the gguf.
- Explain how the tokens work.
I see you are looking at a Jinja2 template.
Breaking down a Jinja2 template is fairly straightforward if you can follow a few rules.
You must keep the tokens as written in the Jinja2 and strip out all of the other syntax. Also watch for mistakes here: sometimes model authors fail to provide a functional Jinja2 template. The Jinja2 must have the following tokens:
- role beginning identifier tag
- role ending identifier tag
- roles
Sometimes they are combined into one like this <|user|> which indicates both a role and a beginning tag.
Let's start at the beginning of this Jinja2.
> {% set loop_messages = messages %}
> {% for message in loop_messages %}
> {% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}
Most of this has to be removed because it's irrelevant to the LLM unless we get a Jinja2 parser from some nice contributor.
We keep this <|start_header_id|> as it states it is the starting header for the role.
We translate this + message['role'] + into the role to be used for the template.
You will have to figure out which role names this model uses, but these are the common ones.
Sometimes the roles will be shown in the Jinja2, sometimes they won't.
- system (if model supports a system prompt)
- look for something like "if role system"
- user or human (sometimes)
- assistant or model (sometimes)
We keep this <|end_header_id|>
We keep this \n\n which translates into one new line (press enter) for each \n you see. (two in this case)
Now we will translate message['content'] into the variable used by GPT4All.
- %1 for user messages
- %2 for assistant replies
We keep this <|eot_id|> which indicates the end of whatever the role was doing.
Now we have our "content" from this Jinja2 block. {% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}
and we removed all the extra stuff.
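If you want a quick list of the tokens to keep, a regular expression can pull them out of the Jinja2 line. This is only a sketch under the assumption that the model writes its special tokens in the `<|name|>` style; models that use other styles (for example square brackets) would need a different pattern. The function name `find_special_tokens` is made up for this example.

```python
import re

JINJA_LINE = (
    "{% set content = '<|start_header_id|>' + message['role'] + "
    "'<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' %}"
)


def find_special_tokens(jinja_source):
    """Return every <|...|> style token in order of appearance, no duplicates."""
    seen = []
    for token in re.findall(r"<\|[^|<>]+\|>", jinja_source):
        if token not in seen:
            seen.append(token)
    return seen


print(find_special_tokens(JINJA_LINE))
```

The list tells you which tokens exist, not where they go, so you still need to read the template to place them correctly.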
From what I can tell, GPT4All sends the BOS automatically and waits for the LLM to send the EOS in return.
- BOS will tell the LLM where it begins generating a new message from. You can skip the BOS token.
- "content" is also sent automatically by GPT4All. You can skip this content. (not to be confused with message['content'])
This whole section is not used by the GPT4All template.
{% if loop.index0 == 0 %}
{% set content = bos_token + content %}
{% endif %}
{{ content }}
{% endfor %}
Finally, we get to the part that shows a role defined for the "assistant". The way it is written implies the other one above is for either a system or user role. (Probably both because it would simply show "user" if it wasn't dual purpose.)
This is left open ended for the model to generate from this point forward. As we can see from its absence, the LLM is expected to provide an eos tag when it is done generating. Follow the same rules as we did above.
{% if add_generation_prompt %}
{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
{% endif %}
This also provides us with an implied confirmation of how it should all look when it's done.
We will break this into two parts for GPT4All.
A System Prompt: (There is no variable; you just write what you want in it.)
<|start_header_id|>system<|end_header_id|>
YOUR CUSTOM SYSTEM PROMPT TEXT HERE<|eot_id|>
A Chat Template:
<|start_header_id|>user<|end_header_id|>
%1<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
%2
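The same two fields can be assembled from the tokens we kept, which makes it easier to repeat the exercise for another model. This is a sketch, not a GPT4All feature: the function name and its parameters are hypothetical. It emits the two newlines from the Jinja2 source literally after each header and adds no line break between turns, so the result may show blank lines that are not visible in the listing above.

```python
def build_gpt4all_templates(start_header, end_header, end_of_turn, system_text):
    """Assemble the system prompt and chat template from the kept tokens."""

    def header(role):
        # Role header followed by the \n\n kept from the Jinja2 source.
        return f"{start_header}{role}{end_header}\n\n"

    system_prompt = f"{header('system')}{system_text}{end_of_turn}"
    # %2 is left open ended: the model is expected to send the eos itself.
    chat_template = f"{header('user')}%1{end_of_turn}{header('assistant')}%2"
    return system_prompt, chat_template


system_prompt, chat_template = build_gpt4all_templates(
    "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>",
    "YOUR CUSTOM SYSTEM PROMPT TEXT HERE",
)
print(system_prompt)
print(chat_template)
```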
Why didn't it work? It looks like it's all good!
Hint: You probably did it right and the model is not built properly. You can find out by following the next segment below.
So, the model you got from some stranger on the internet didn't work like you expected it to?
They probably didn't test it. They probably don't know it won't work for everyone else.
Some problems are caused by the settings provided in the config files used to make the gguf.
Perhaps llama.cpp doesn't support that model and GPT4All can't use it.
Sometimes the model is just bad. (maybe an experiment)
You will be lucky if they include the source files used for this exact gguf. (This person did not.)
The model used in the example above only links you to the source of their source. This means you can't tell what they did to it when they made the gguf using that source. After the gguf was made, someone may have changed anything on either side, Microsoft or QuantFactory.
In the following example, I will use a model with a known source. This source will have an error, and they can fix it, or you can, like we did. (Expert: Make your own gguf by converting and quantizing the source.)
The following relevant files were used in the making of the gguf.
- config.json (Look for "eos_token_id")
- tokenizer_config.json (Look for "eos_token" and "chat_template")
- generation_config.json (Look for "eos_token_id")
- special_tokens_map.json (Look for "eos_token" and "bos_token")
- tokenizer.json (Make sure the tokens in the files above match this one.)
We will begin with this tokenizer_config.json. It defines how the model's tokenizer should process input text.
{
"add_bos_token": false,
"add_eos_token": false,
"add_prefix_space": true,
"added_tokens_decoder": {
"0": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<|startoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"7": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"bos_token": "<|startoftext|>",
"chat_template": "{% if messages[0]['role'] == 'system' %}{% set system_message = messages[0]['content'] %}{% endif %}{% if system_message is defined %}{{ system_message }}{% endif %}{% for message in messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|im_start|>user\\n' + content + '<|im_end|>\\n<|im_start|>assistant\\n' }}{% elif message['role'] == 'assistant' %}{{ content + '<|im_end|>' + '\\n' }}{% endif %}{% endfor %}",
"clean_up_tokenization_spaces": false,
"eos_token": "<|im_end|>",
"legacy": true,
"model_max_length": 16384,
"pad_token": "<unk>",
"padding_side": "right",
"sp_model_kwargs": {},
"spaces_between_special_tokens": false,
"split_special_tokens": false,
"tokenizer_class": "LlamaTokenizer",
"unk_token": "<unk>",
"use_default_system_prompt": false
}
Here we want to make sure that the "chat_template" exists. (It exists, good.)
"chat_template": "{% if messages[0]['role'] == 'system' %}{% set system_message = messages[0]['content'] %}{% endif %}{% if system_message is defined %}{{ system_message }}{% endif %}{% for message in messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|im_start|>user\\n' + content + '<|im_end|>\\n<|im_start|>assistant\\n' }}{% elif message['role'] == 'assistant' %}{{ content + '<|im_end|>' + '\\n' }}{% endif %}{% endfor %}",
There is a BOS token and an EOS token. (They exist, excellent!)
"bos_token": "<|startoftext|>",
"eos_token": "<|im_end|>",
You can also see which id numbers to expect. Very nice.
"7": {
"content": "<|im_end|>",
Hopefully all of those tokens match in this file and in the other files as well. (let's see)
Open up the next important file, special_tokens_map.json. This file is special because when the model is built, the tokenizer will treat the tokens listed in this file differently from regular vocabulary tokens. For example:
- They may be exempt from subword tokenization, so they can never be broken apart.
  - For example, the word "unhappiness" might be tokenized into "un", "happi", and "ness".
  - However, special tokens like [EOS] and [BOS] are typically treated as single, indivisible units.
- They have specific positions in input sequences, like the bos and eos, and the model has also learned a special meaning for them.
  - BOS (Beginning of Sequence) token:
    - Often represented as "[BOS]" or "<s>".
    - Typically placed at the very start of an input sequence.
    - Signals to the model that a new sequence is beginning.
  - EOS (End of Sequence) token:
    - Often represented as "[EOS]" or "</s>".
    - Typically placed at the very end of an input sequence.
    - Signals to the model that the sequence has ended.
    - Crucial for tasks where the model needs to know when to stop generating output.
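The "indivisible unit" idea can be shown with a toy sketch. This is not a real tokenizer and not what llama.cpp or GPT4All run internally; it only demonstrates cutting the text at special tokens first, so that whatever splits the remaining text into subwords never gets a chance to break them. The function name is made up for this example.

```python
import re

# Special tokens taken from the example files in this guide.
SPECIAL_TOKENS = ["<|im_start|>", "<|im_end|>"]


def split_keeping_special_tokens(text, special_tokens=SPECIAL_TOKENS):
    """Cut text at special tokens so each one survives as a single piece."""
    pattern = "(" + "|".join(re.escape(token) for token in special_tokens) + ")"
    # The capturing group makes re.split keep the matched tokens in the result.
    return [piece for piece in re.split(pattern, text) if piece]


print(split_keeping_special_tokens("<|im_start|>user\nHello<|im_end|>"))
```

Only the pieces between the special tokens would then be handed to subword tokenization.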
Let's take a look at this special_tokens_map.json
{
"bos_token": {
"content": "<|startoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"unk_token": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}
As you can imagine, if we are missing any special tokens here that were in the tokenizer_config.json, you can end up with gibberish as the output. It just might break those tokens up and never know it was supposed to stop or start or whatever else may be important to the training of that model.
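A small sketch of that comparison, assuming both files have already been loaded into Python dictionaries. The function names and the list of keys are choices made for this example. Entries are handled both as plain strings and as objects with a "content" field, because both shapes appear in the files shown in this guide.

```python
TOKEN_KEYS = ("bos_token", "eos_token", "pad_token", "unk_token")


def token_text(entry):
    """Return the token string from either a plain string or a {"content": ...} object."""
    if isinstance(entry, dict):
        return entry.get("content")
    return entry


def missing_special_tokens(tokenizer_config, special_tokens_map):
    """List (key, expected, found) for tokens the special tokens map lacks or changes."""
    problems = []
    for key in TOKEN_KEYS:
        expected = token_text(tokenizer_config.get(key))
        found = token_text(special_tokens_map.get(key))
        if expected is not None and expected != found:
            problems.append((key, expected, found))
    return problems


tokenizer_config = {"bos_token": "<|startoftext|>", "eos_token": "<|im_end|>"}
special_tokens_map = {
    "bos_token": {"content": "<|startoftext|>"},
    "eos_token": {"content": "<|im_end|>"},
}
print(missing_special_tokens(tokenizer_config, special_tokens_map))
```

An empty list means the two files agree on these tokens, which is the case for the example model.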
Next, let's look at the tokenizer.json file. This file includes all the "vocabulary" of the model, meaning all the tokens the model will use and the "mapping" of them, and it needs to match the other files. For instance, we know the tokenizer_config.json believes a few things.
"7": {
"content": "<|im_end|>",
It must match the tokenizer.json to work. In this case take a close look at the first seven of the 64000 tokens.
"added_tokens": [
{
"id": 0,
"content": "<unk>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 1,
"content": "<|startoftext|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 2,
"content": "<|endoftext|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 6,
"content": "<|im_start|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 7,
"content": "<|im_end|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},..................................
"陨": 63999
},
The id number of the token is 7, and the token itself is <|im_end|>.
This must be true for the model to work. Everything in the files must match, so you need to cross-check each file for errors.
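Cross-checking ids by eye gets tedious, so here is a sketch that compares the id-to-token mapping of the two files, assuming both were loaded into dictionaries with the structure shown above ("added_tokens_decoder" in tokenizer_config.json, "added_tokens" in tokenizer.json). The function name is hypothetical.

```python
def mismatched_token_ids(tokenizer_config, tokenizer_json):
    """Return (id, text in tokenizer_config, text in tokenizer.json) for each disagreement."""
    vocab = {entry["id"]: entry["content"] for entry in tokenizer_json.get("added_tokens", [])}
    problems = []
    for raw_id, entry in tokenizer_config.get("added_tokens_decoder", {}).items():
        token_id = int(raw_id)  # JSON object keys are strings such as "7"
        if vocab.get(token_id) != entry["content"]:
            problems.append((token_id, entry["content"], vocab.get(token_id)))
    return problems


tokenizer_config = {"added_tokens_decoder": {"2": {"content": "<|endoftext|>"}, "7": {"content": "<|im_end|>"}}}
tokenizer_json = {"added_tokens": [{"id": 2, "content": "<|endoftext|>"}, {"id": 7, "content": "<|im_end|>"}]}
print(mismatched_token_ids(tokenizer_config, tokenizer_json))
```

For the example model the list is empty, so these two files agree with each other.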
Now we will see the generation_config.json.
{
"_from_model_config": true,
"bos_token_id": 1,
"eos_token_id": 2,
"transformers_version": "4.40.0"
}
If something is set here, it is enforced during generation. You may have missed it if you weren't paying attention: this doesn't match our other files!
The other files tell the model to use "eos_token": "<|im_end|>", but this one is watching for "eos_token_id": 2, and we know that in this model "id": 2 is "content": "<|endoftext|>". That isn't going to work. The gguf model you downloaded will have an endless generation loop unless this is corrected.
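The mismatch can be caught with a short check that translates the enforced id back into its token text. This is a sketch under two assumptions: "eos_token_id" is a single number (some models list several ids, which this does not handle), and "eos_token" in tokenizer_config.json is a plain string as in the example. The function name `check_eos` is made up.

```python
def check_eos(tokenizer_config, generation_config):
    """Compare the eos token the tokenizer names with the one generation enforces."""
    decoder = tokenizer_config.get("added_tokens_decoder", {})
    eos_id = generation_config.get("eos_token_id")
    # Look up which token text the enforced id stands for.
    enforced = decoder.get(str(eos_id), {}).get("content")
    expected = tokenizer_config.get("eos_token")
    return {"expected": expected, "enforced": enforced, "matches": expected == enforced}


tokenizer_config = {
    "added_tokens_decoder": {"2": {"content": "<|endoftext|>"}, "7": {"content": "<|im_end|>"}},
    "eos_token": "<|im_end|>",
}
generation_config = {"bos_token_id": 1, "eos_token_id": 2}
print(check_eos(tokenizer_config, generation_config))
```

With the values from this example the check reports that the files do not match, which is exactly the problem described above.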
Finally, let's look at the config.json file. When a model is loaded, this is what it will know about itself.
{
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 16384,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 48,
"num_key_value_heads": 4,
"pretraining_tp": 1,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 5000000,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.40.0",
"use_cache": false,
"vocab_size": 64000
}
Here are all the things the model believes to be true, and we can see it is also wrong. The model believes "eos_token_id": 2 will stop the generation, but it was trained to use "eos_token_id": 7, which the chat template is telling us to use. It is also found in the special_tokens_map.json, so it will be protected for this purpose.
Now you know why your model won't work. Hopefully you didn't download it yet!