
Eval bug: llama-simple-chat crashes with "failed to decode" after some requests #14487

Closed

@kayahr

Description

Name and Version

$ bin/llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4080 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
version: 5797 (de56944)
built with cc (Debian 14.2.0-19) 14.2.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

Vulkan

Hardware

NVIDIA RTX 4080

Models

ggml-org_gemma-3-1b-it-GGUF_gemma-3-1b-it-Q4_K_M.gguf

Problem description & steps to reproduce

I want to write a small application using llama.cpp, and I am basing it on the simple-chat example. But after some conversation with the model, the program crashes with the error message "failed to decode".

I can reproduce this with the unmodified simple-chat example, just by chatting casually with the model:

$ bin/llama-simple-chat -m ggml-org_gemma-3-1b-it-GGUF_gemma-3-1b-it-Q4_K_M.gguf 
.............................................
> Tell me a joke.
[some unfunny joke]
> Tell me another one.
[some other even more unfunny joke]
...repeatedly ask for jokes or any other kind of conversation...

After some questions and answers, it crashes:

simple-chat.cpp:124: failed to decode

This does not happen with llama-cli using the same model, but that program is MUCH too complex to use as a base for an application. So it would be nice to find out what causes the problem in the simple-chat program.
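
For reference, the line mentioned in the error message is the decode call in the example, which aborts as soon as llama_decode reports a failure. I am quoting this roughly from memory, so treat the exact code as approximate:

if (llama_decode(ctx, batch)) {
    GGML_ABORT("failed to decode\n"); // simple-chat.cpp:124
}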

I guess this has something to do with the context size? When I select a smaller one, the crash happens much sooner. Is this supposed to happen, or is there something wrong with the context handling in the example?
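
If it helps, this is roughly how I would expect the decode call to be guarded. This is only a sketch based on my reading of the llama.h header in this build; llama_kv_self_used_cells in particular is my assumption and may be named differently in other versions:

#include "llama.h"
#include <cstdio>

// Sketch only: guard a decode call against running out of context
// instead of aborting. Assumes the llama.cpp API of this build;
// llama_kv_self_used_cells may not exist under this name elsewhere.
static bool decode_checked(llama_context * ctx, llama_batch batch) {
    const int n_ctx      = llama_n_ctx(ctx);
    const int n_ctx_used = llama_kv_self_used_cells(ctx);

    // refuse to decode once the batch would no longer fit into the context
    if (n_ctx_used + batch.n_tokens > n_ctx) {
        fprintf(stderr, "context size exceeded (%d used + %d new > %d)\n",
                n_ctx_used, batch.n_tokens, n_ctx);
        return false; // the caller could trim the chat history here instead
    }

    if (llama_decode(ctx, batch) != 0) {
        fprintf(stderr, "llama_decode failed\n");
        return false;
    }
    return true;
}

With something like this the program could end the conversation, or drop old messages, gracefully instead of aborting. But I do not know whether the used-cells count is the right measure for a model like Gemma 3.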

First Bad Commit

No response

Relevant log output

simple-chat.cpp:124: failed to decode

[New LWP 67199]
[New LWP 67196]
[New LWP 67195]
[New LWP 67194]
[New LWP 67192]
[New LWP 67191]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f142d0a49ee in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x00007f142d0a49ee in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f142d099668 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f142d0996ad in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007f142d104787 in wait4 () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007f142d5f206b in ggml_print_backtrace () from llama.cpp/build/bin/libggml-base.so
#5  0x00007f142d5f2166 in ggml_abort () from llama.cpp/build/bin/libggml-base.so
#6  0x000055db7e3b2999 in main ()
[Inferior 1 (process 67190) detached]
Aborted
