Closed
Description
I decided to try the popular configuration of min_p = 0.1 with temp = 1.5 or higher, and I get the following result:
To demonstrate the incorrect behavior I used the example LLama.Examples/Examples/LLama3ChatSession.cs.
The only things I changed were `var chatHistory = new ChatHistory();` and:

```csharp
var inferenceParams = new InferenceParams
{
    SamplingPipeline = new DefaultSamplingPipeline
    {
        Temperature = 1.5f,
        MinP = 0.1f,
    },
    MaxTokens = 100, // keep generating tokens until the anti prompt is encountered
    AntiPrompts = [model.Tokens.EndOfTurnToken!] // model specific end of turn string
};
```
In my own project I use BatchedExecutor with the correct prompt template and anti-prompts, and I get exactly the same result. I also changed the sampling order in ProcessTokenDataArray, and that changed nothing. I tested on both the CUDA and Vulkan backends. I noticed a pattern: the first 20-30 tokens are correct, and then chaos begins.
In LM Studio and Kobold CPP I set the temperature even higher and min_p even lower, and everything worked fine there.
Reproduction Steps
- Use DefaultSamplingPipeline
- Set temperature higher than 1.2
- Set min_p = 0.1 or higher
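For context, min-p filtering keeps only the tokens whose probability is at least min_p times the probability of the most likely token, then renormalises the survivors. A minimal sketch in plain C# (illustrative only, not the LLamaSharp or llama.cpp implementation; the distribution values are made up):

```csharp
using System;
using System.Linq;

static class MinPSketch
{
    // Keep only tokens with probability >= minP * max(probability),
    // zero out the rest, and renormalise the survivors.
    public static double[] ApplyMinP(double[] probs, double minP)
    {
        double cutoff = probs.Max() * minP;
        double[] kept = probs.Select(p => p >= cutoff ? p : 0.0).ToArray();
        double sum = kept.Sum();
        return kept.Select(p => p / sum).ToArray();
    }

    static void Main()
    {
        // Hypothetical distribution over four tokens.
        double[] probs = { 0.6, 0.3, 0.08, 0.02 };

        // With min_p = 0.1 the cutoff is 0.6 * 0.1 = 0.06,
        // so the 0.02 token is removed before sampling.
        double[] filtered = ApplyMinP(probs, 0.1);
        Console.WriteLine(string.Join(", ", filtered.Select(p => p.ToString("0.###"))));
    }
}
```

Because temperature rescales the logits before the softmax, applying it before or after the min-p cutoff changes which tokens survive, which is why the sampling order is relevant when debugging this.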
Environment & Configuration
- Operating system: Win10
- .NET runtime version: 8.0.4
- LLamaSharp version: 0.16.0
- CUDA version (if you are using cuda backend): 12
- CPU & GPU device: RTX 3050 8 GB and i5-12400
- Model: L3-8B-Stheno-v3.2-Q6_K.gguf
Known Workarounds
No response
