Releases: LostRuins/koboldcpp
koboldcpp-1.100.1
I-can't-believe-it's-not-version-2.0-edition
- NEW: WAN Video Generation has been added to KoboldCpp! You can now generate short videos in KoboldCpp using the WAN model. Special thanks to @leejet for the sd.cpp implementation, and @wbruna for help merging and QoL fixes.
- Note: WAN requires a LOT of VRAM to run. If you run out of memory, try generating fewer frames and using a lower resolution. Especially on Vulkan, the VAE buffer size may be too large; use --sdvaecpu to run VAE on CPU instead. For comparison, 30 frames (2 seconds) of a 384x576 video will still require about 16GB VRAM even with VAE on CPU and CPU offloading enabled. You can also generate a single frame, in which case it will behave like a normal image generation model.
- Obtain the WAN2.2 14B rapid mega AIO model here. This is the most versatile option and can do both T2V and I2V. I do not recommend using the 1.3B WAN2.1 or the 5B WAN2.2, as they both produce rather poor results. If you really don't care about quality, you can use the small 1.3B from here.
- Next, you will need the correct VAE and UMT5-XXL; note that some WAN models use different ones, so if you're bringing your own, do check it. Reference links are here.
- Load them all via the GUI launcher or by using --sdvae, --sdmodel and --sdt5xxl (a minimal launch sketch follows at the end of this list).
- Launch KoboldCpp and open SDUI at http://localhost:5001/sdui. I recommend starting with something small like 15 frames of a 384x384 video with 20 steps. Be prepared to wait a few minutes. The video will be rendered and saved to SDUI when done!
- It's recommended to use --sdoffloadcpu and --sdvaecpu if you don't have enough VRAM. The VAE buffer can really be huge.
- Added additional toggle flags for image generation:
- --sdoffloadcpu - Allows image generation weights to be dynamically loaded/unloaded to RAM when not in use, e.g. during VAE decoding.
- --sdvaecpu - Performs VAE decoding on CPU using RAM instead.
- --sdclipgpu - Performs CLIP/T5 decoding on GPU instead (new default is CPU).
- Updated StableUI to support animations/videos. If you want to perform I2V (Image-To-Video), you can do so in the txt2img panel.
- Renamed --sdclipl to --sdclip1, and --sdclipg to --sdclip2. These flags are now used whenever there is a vision encoder to be used (e.g. WAN's clip_vision if applicable).
- Disable TAESD if not applicable.
- Moved all .embd resource files into a separate directory for improved organization. Also extracted out image generation vocabs into their own files.
- Moved the lowvram CUDA option into a new flag --lowvram (same as -nkvo), which can be used in both CUDA and Vulkan to avoid offloading the KV cache. Note: This is slow and not generally recommended.
- Fixed Kimi template, added Granite 4 template.
- Enabled building for CUDA 13 in CMake; however, it's untested and no binaries will be provided. Also fixed Vulkan noext compiles.
- Fixed q4_0 repacking incoherence on CPU only, which started in v1.98.
- Fixed FastForwarding issues due to misidentified hybrid/rnn models, which should not happen anymore.
- Added --sdgendefaults to allow setting some default image generation parameters.
- On admin config reload, reset nonexistent fields in config to default values instead of keeping the old value.
- Updated Kobold Lite, multiple fixes and improvements
- Set default filenames based on slot's name when downloading from saved slot.
- Added dry_penalty_last_n from @joybod, which decouples dry range from rep pen range.
- LaTeX rendering fixes, autoscroll fixes, various small tweaks
- Merged new model support including GLM4.6 and Granite 4, fixes and improvements from upstream
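For reference, here is a minimal sketch of launching KoboldCpp for WAN video generation using the flags described above. This is a sketch only: the model filenames are hypothetical placeholders, so substitute the files you actually downloaded.

```python
# Minimal sketch: launch KoboldCpp for WAN video generation (filenames are placeholders).
import subprocess

subprocess.run([
    "./koboldcpp-linux-x64",                          # or koboldcpp.exe on Windows
    "--sdmodel", "wan2.2-14b-rapid-aio.safetensors",  # placeholder WAN AIO model file
    "--sdvae", "wan_vae.safetensors",                 # placeholder VAE matching the model
    "--sdt5xxl", "umt5-xxl.safetensors",              # placeholder UMT5-XXL text encoder
    "--sdvaecpu",      # run VAE decoding on CPU to avoid a huge VRAM VAE buffer
    "--sdoffloadcpu",  # offload image-gen weights to RAM when not in use
])
# Then open http://localhost:5001/sdui and start small, e.g. 15 frames at 384x384.
```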
Hotfix 1.100.1 - Fixed a regression with flash attention on oldcpu builds, fixed kokoro regression.
Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try oldpc version instead (Cuda11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here if you are a Windows user or download our rolling ROCm binary here if you use Linux.
If you're on a modern MacOS (M-Series) you can use the koboldcpp-mac-arm64 MacOS binary.
Click here for .gguf conversion and quantization tools
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect at http://localhost:5001 (or use the full koboldai client).
For more information, be sure to run the program from the command line with the --help flag. You can also refer to the readme and the wiki.
koboldcpp-1.99.4
a darker shade of blue edition

- NEW: The bundled KoboldAI Lite UI has received a substantial design overhaul in an effort to make it look more modern and polished. The default color scheme has been changed, however the old color scheme is still available (set 'Nostalgia' color scheme in advanced settings). A few extra custom color schemes have also been added (Thanks Lakius, TwistedShadows, toastypigeon, @PeterPeet). Please report any UI bugs you encounter.
- QOL Change: Added aliases for llama.cpp command-line flags. To reduce the learning curve for llama.cpp users, the following llama.cpp compatibility flags have been added: -m, -t, --ctx-size, -c, --gpu-layers, --n-gpu-layers, -ngl, --tensor-split, -ts, --main-gpu, -mg, --batch-size, -b, --threads-batch, --no-context-shift, --mlock, -p, --no-mmproj-offload, --model-draft, -md, --draft-max, --draft-n, --gpu-layers-draft, --n-gpu-layers-draft, -ngld, --flash-attn, -fa, --n-cpu-moe, -ncmoe, --override-kv, --override-tensor, -ot, --no-mmap. They should behave as you'd expect from llama.cpp (a minimal launch sketch using these aliases follows at the end of this list).
- Renamed --promptlimit to --genlimit; it now applies to API requests as well and can be set in the UI launcher.
- Added a new parameter --ratelimit that applies per-IP rate limiting (to help prevent abuse of public instances).
- Fixed automatic VRAM detection for ROCm and Vulkan backends on AMD systems (thanks @lone-cloud)
- Hide API info display if running in CLI mode.
- Flash attention is now checked by default when using the GUI launcher. (Reverted in 1.99.1 by popular demand)
- Try to fix some embedding models using too much memory.
- Standardized model file download locations to the koboldcpp executable's directory. This should help solve issues with non-writable system paths when launching from a different working directory. If you prefer the old behavior, please send some feedback, but I think standardizing it is better than adding special exceptions for some directory paths. (Reverted in 1.99.2, with some exceptions)
- Added psutil to the conda environment. Please report if this breaks any setups.
- Added the /v1/audio/voices endpoint, fixed wrong Dia voice mapping
- Updated Kobold Lite, multiple fixes and improvements
- UI design rework, as mentioned above
- Fixes for markdown renderer
- Added a popup to allow enabling TTS or image generation if it's disabled but available.
- Added new scenario "Aletheia"
- Increased default context size and amount generated
- Fix for GPT-OSS instruct format.
- Smarter automatic detection for "Enter Sends" default based on platform. Toggle moved into advanced settings.
- Fix for Palemoon browser compatibility
- Reworked best practices recommendation for think tags - now provides Think/NoThink instruct tags for each instruct sequence. You are now recommended to explicitly select the correct Think/NoThink instruct tags instead of using the <think> forced/prevented prefill. This will provide better results for preventing reasoning than simply injecting a blank <think></think>, since some models require specialized reasoning trace formats.
- For example, to prevent thinking in GLM-Air, you're simply recommended to set the instruct tag to GLM-4.5 Non-Thinking and leave "Insert Thinking" as "Normal" instead of manually messing with the tag injections. This ensures the correct postfix tags for each format are used.
- By default, the KoboldCppAutomatic template permits thinking in models that use it.
- Merged new model support, fixes and improvements from upstream
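As a reference for the llama.cpp compatibility aliases above, here is a minimal launch sketch. The model path and values are placeholders; any of the listed aliases can be substituted.

```python
# Minimal sketch: launching KoboldCpp with llama.cpp-style alias flags (values are placeholders).
import subprocess

subprocess.run([
    "./koboldcpp-linux-x64",
    "-m", "model.gguf",   # model file alias, as in llama.cpp
    "-c", "8192",         # context size alias (--ctx-size)
    "-ngl", "99",         # GPU layers alias (--gpu-layers / --n-gpu-layers)
    "--flash-attn",       # flash attention alias (-fa)
    "--no-mmap",          # disable memory mapping, same as llama.cpp
])
```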
Hotfix 1.99.1 - Fix for chroma, revert FA default off, revert ggml-org#16056, fixed rocm compile issues.
Hotfix 1.99.2 - Reverted the download file path changes on request from @henk717 for most cases. Fixed rocm VRAM detection.
Hotfix 1.99.3 and Hotfix 1.99.4 - Fixed aria2 downloading and try to fix kokoro
Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try oldpc version instead (Cuda11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here if you are a Windows user or download our rolling ROCm binary here if you use Linux.
If you're on a modern MacOS (M-Series) you can use the koboldcpp-mac-arm64 MacOS binary.
Click here for .gguf conversion and quantization tools
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect at http://localhost:5001 (or use the full koboldai client).
For more information, be sure to run the program from the command line with the --help flag. You can also refer to the readme and the wiki.
koboldcpp-1.98.1
Kokobold edition
(Demo video: kobo.mp4)
- NEW: TTS.cpp model support has been integrated into KoboldCpp, providing access to new Text-To-Speech models. The TTS.cpp project (repo here) was developed by @mmwillet, and a modified version has now been added into KoboldCpp, bringing support for 3 new Text-To-Speech models: Kokoro, Parler and Dia.
- Of the above models, Kokoro is the most recommended for general use.
- Uses the GGML library in KoboldCpp, although the new ops are CPU only, so Kokoro provides the best speed taking size into consideration. You can expect speeds of 2x realtime for Kokoro (fastest), 0.5x realtime for Parler, and 0.1x realtime for Dia (slowest).
- To use, simply download the GGUF model and load it in the 'Audio' tab as a TTS model. Note: WavTokenizer is not required for these models. Please use the no_espeak versions; KoboldCpp has custom IPA mappings for English and espeak is not supported.
- KoboldAI Lite provides automatic mapping for the speaker voices. If you wish to use a custom voice for Kokoro, the supported voices are af_alloy, af_aoede, af_bella, af_heart, af_jessica, af_kore, af_nicole, af_nova, af_river, af_sarah, af_sky, am_adam, am_echo, am_eric, am_fenrir, am_liam, am_michael, am_onyx, am_puck, am_santa, bf_alice, bf_emma, bf_isabella, bf_lily, bm_daniel, bm_fable, bm_george, bm_lewis. Only English speech is properly supported. A minimal API request sketch follows at the end of this list.
- Thanks to @wbruna, image generation has been updated and received multiple improvements:
- Added separate flash attention and conv2d toggles for image generation: --sdflashattention and --sdconvdirect
- Added ability to use q8 for Image Generation model quantization, in addition to existing q4. --sdquant now accepts a parameter [0/1/2] that specifies the quantization level, similar to --quantkv
- Added the --overridenativecontext flag, which allows you to easily override the expected trained context of a model when determining automatic RoPE scaling. If you didn't get that, you don't need this feature.
- Seed-OSS support is merged, including instruct templates for thinking and non-thinking modes.
- Further improvements to tool calling and audio transcription handling
- Fixed Stable Diffusion 3.5 loading issue
- Embedding models now default to the lower of the current model max context and the trained context. Should help with Qwen3 embedding models. This can be adjusted with the --embeddingsmaxctx override.
- Improve server identifier header for better compatibility with some libraries
- The Termux android_install.sh script can now launch existing downloaded models
- Minor chat adapter fixes, including Kimi.
- Added alias for --tensorsplit
- Benchmark CSV formatting fix.
- Updated Kobold Lite, multiple fixes and improvements
- Scenario picker can now load any adventure or chat scenario in Instruct mode.
- Slightly increased default amount to generate.
- Improved file saving behavior, try to remember previously used filename.
- Improved KaTeX rendering and handling of additional cases
- Improved streaming UI for code block streaming at the start of any turn.
- Added setting to embed generated TTS audio into the context as part of the AI's turn.
- Minor formatting fixes
- Added Vision 👁️ and Auditory 🦻 support indicators for inline multimodal media content.
- Added Seed-OSS instruct templates. Note that Thinking regex must be set manually for this model by changing the think tag.
- Overhaul narration and media adding system, allow TTS to be manually added with Add File.
- Merged new model support, fixes and improvements from upstream
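If you want to drive the new TTS models over the API rather than through KoboldAI Lite, something along these lines should work. Note this is a sketch under assumptions: the /v1/audio/speech route and payload fields follow the OpenAI audio API shape and are not confirmed by these notes, so check --help and the API docs for the exact route your build exposes. The voice af_heart is one of the Kokoro voices listed above.

```python
# Sketch (assumption: an OpenAI-compatible /v1/audio/speech route is exposed; verify on your build).
import requests

resp = requests.post(
    "http://localhost:5001/v1/audio/speech",
    json={
        "model": "kokoro",                # assumed model identifier
        "input": "Hello from KoboldCpp.",
        "voice": "af_heart",              # one of the Kokoro voices listed above
    },
)
resp.raise_for_status()
with open("speech.wav", "wb") as f:
    f.write(resp.content)                 # response body is assumed to be raw audio bytes
```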
Hotfix 1.98.1 - Fix Kokoro for better accuracy and quality, added 4096 as a --blasbatchsize option, fix windows 7 functionality, fixed flash attention issues, synced some new updates from upstream.
Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try oldpc version instead (Cuda11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here if you are a Windows user or download our rolling ROCm binary here if you use Linux.
If you're on a modern MacOS (M-Series) you can use the koboldcpp-mac-arm64 MacOS binary.
Click here for .gguf conversion and quantization tools
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect at http://localhost:5001 (or use the full koboldai client).
For more information, be sure to run the program from the command line with the --help flag. You can also refer to the readme and the wiki.
koboldcpp-1.97.4

- Merged support for GLM4.5 family of models
- Merged support for GPT-OSS models (note that this model performs poorly if OpenAI instruct templates are not obeyed. To use it in raw story mode, append <|start|>assistant<|channel|>final<|message|> to memory)
- Merged support for Voxtral (Voxtral Small 24B is better than Voxtral Mini 3B, but both are not great. See ggml-org#14862 (comment))
- Added a /ping stub endpoint to permit usage on Runpod serverless (a minimal health-check sketch follows at the end of this list).
- Allow MoE layers to be easily kept on CPU with the --moecpu (layercount) flag. Using this flag without a number will keep all MoE layers on CPU.
- Clearer indication of support for each multimodal modality (Vision/Audio)
- Increased max length of terminal prints allowed in debugmode.
- Do not attempt context shifting for any mrope models.
- Adjusted some adapter instruct templates, tweaked mistral template.
- Handle empty objects returned by tool calls, also remove misinterpretation of the tools calls instruct tag within ChatML autoguess.
- Allow multiple tool calls to be chained, and allow them to be triggered by any role.
- WebSearch: fixed URL params parsing
- Increased regex stack size limit for MSVC builds (fix for mistral models).
- Updated Kobold Lite, multiple fixes and improvements
- Added 2 more save slots
- Added a (+/-) modifier field for Adventure mode rolls
- Fixed deleting wrong image if multiple selected images are identical.
- Button to insert textDB separator
- Improved mid-streaming rendering
- Slightly lowered default rep pen
- Simplified Mistral template, added GPT-OSS Harmony template
- Merged new model support, fixes and improvements from upstream
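The /ping stub mentioned above can double as a simple health check, for example when waiting for a serverless instance to come up. A minimal sketch:

```python
# Minimal sketch: poll the /ping stub endpoint until KoboldCpp responds.
import time
import requests

for _ in range(60):
    try:
        if requests.get("http://localhost:5001/ping", timeout=2).ok:
            print("server is up")
            break
    except requests.RequestException:
        pass  # not reachable yet
    time.sleep(1)
```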
Hotfix 1.97.1 - More template fixes, now shows generated token's ID in debugmode terminal log, fixed flux loading speed regression, Vulkan BSOD fixed.
Hotfix 1.97.2 - Fix CLBlast regression, limit vulkan bsod fix to nvidia only, updated lite, merged upstream fixes.
Hotfix 1.97.3 - Fix a regression with GPT-OSS that resulted in incoherence
Hotfix 1.97.4 - Fixed OldPC CUDA builds when flash attention was not used. This broke after 1.95 and is now fixed.
Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try oldpc version instead (Cuda11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here if you are a Windows user or download our rolling ROCm binary here if you use Linux.
If you're on a modern MacOS (M-Series) you can use the koboldcpp-mac-arm64 MacOS binary.
Click here for .gguf conversion and quantization tools
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect at http://localhost:5001 (or use the full koboldai client).
For more information, be sure to run the program from the command line with the --help flag. You can also refer to the readme and the wiki.
koboldcpp-1.96.2

- NEW: Now supports audio inputs for models (in addition to existing vision inputs). Specifically, support for Qwen 2.5 Omni 3B has been added (the 3B is better than the 7B which cannot understand music).
- Use it similar to existing vision models - you download the base model and then the mmproj and load both.
- You can then launch KoboldCpp, and upload your images/audio in the KoboldAI Lite UI, and ask the AI questions about them.
- Multiple images and audio files can be used together, though be aware that you will need a high context especially for large audio files.
- The 3B seems to perform better than the 7B. The 7B hallucinates on music very hard.
- Added miniaudio: .wav, .mp3 and .flac files are now supported on all audio endpoints (Whisper transcribe and multimodal audio)
- Fixes for gemma3n incoherence, should be working out of the box now.
- Fixes to allow the new Jamba 1.7 models to work. Note that context shift and fast forwarding cannot be used on Jamba.
- Allow automatically resuming incomplete model downloads if aria2c is used.
- Prints some system information on startup to terminal to aid future debugging
- Added emulation for the OpenAI /v1/images/generations endpoint for image generation (a minimal request sketch follows at the end of this list)
- Fixed noscript image generation
- Apply nsigma masking (thanks @Reithan)
- Allow flash attention to be used with image generation (thanks @wbruna)
- Backwards compatibility for the json_schema field improved.
- Ensured that finish_reason is always sent last with no additional text content on the same chunk.
- Important Change: Default context size is now 8k (up from 4k) to better represent modern models. This may affect your memory usage. Existing kcpps configs are unaffected.
- Important Change: The flag --usecublas has been renamed to --usecuda. Backwards compatibility for the old flag name is retained, but you're recommended to change to the new name.
- Added new AutoGuess templates for Kimi K2, Jamba and Dots. The Hunyuan A13B template is not included as the ideal template cannot be determined.
- Improved formatting of multimodal chunk handling
- Fixes for remotetunnel not starting on some linux systems.
- Updated Kobold Lite, multiple fixes and improvements
- Aesthetic UI has been completely refactored and slightly simplified for easier management. Most functionality should be unchanged.
- Allow connecting to OpenAI endpoints without a key.
- Added more experimental flags to control audio compression, autoguess tags and unsaved file warnings.
- Allow uploading audio files and embedding them into your saved stories, lamejs mp3 encoder added.
- Allow audio capture from microphone to embed into story
- Added shortcut for inserting instructions into memory
- Allow disabling default stop sequences.
- Breaking Change: Attached image and audio data is no longer stored inline in the story, but instead as metadata in the savefile
- Save files from past versions are 100% forwards compatible, but any new media files in future saves are only partially backwards compatible - all media saved in future versions will not be accessible when re-opened in past versions of the UI.
- This is required to handle the large size of audio files. All old savefiles will upgrade perfectly fine, but you can't add new media and then access it back in old versions again.
- Fixed a few html parsing bugs.
- Merged new model support, fixes and improvements from upstream
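For the OpenAI image endpoint emulation mentioned above, a request shaped like the standard OpenAI images API should work. This is a sketch: the exact set of supported fields beyond the prompt is an assumption, so adjust as needed.

```python
# Sketch: OpenAI-style image generation request against KoboldCpp's emulated endpoint.
# Field support beyond "prompt" is an assumption based on the OpenAI API shape.
import base64
import requests

resp = requests.post(
    "http://localhost:5001/v1/images/generations",
    json={
        "prompt": "a watercolor painting of a lighthouse at dusk",
        "n": 1,
        "size": "512x512",
        "response_format": "b64_json",
    },
)
resp.raise_for_status()
img_b64 = resp.json()["data"][0]["b64_json"]
with open("lighthouse.png", "wb") as f:
    f.write(base64.b64decode(img_b64))
```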
Hotfix 1.96.1 - Fixed a few UI issues, fixed loading large multipart models, adjusted autoguess templates by @kallewoof, merged exaone 4 support
Hotfix 1.96.2 - Splits a batch into smaller batches when processing if it fails, updated lite with a few minor fixes, increase max img2img size
Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try oldpc version instead (Cuda11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here if you are a Windows user or download our rolling ROCm binary here if you use Linux.
If you're on a modern MacOS (M-Series) you can use the koboldcpp-mac-arm64 MacOS binary.
Click here for .gguf conversion and quantization tools
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect at http://localhost:5001 (or use the full koboldai client).
For more information, be sure to run the program from the command line with the --help flag. You can also refer to the readme and the wiki.
koboldcpp-1.95.1
- NEW: Added support for Flux Kontext: This is a powerful image editing model based on Flux that can edit images using natural language. Easily replace backgrounds, edit text, or add extra items into your images. You can download a ready-to-use kcppt template here, simply load it into KoboldCpp and all necessary model files will be downloaded on launch. Then open StableUI at http://localhost:5001/sdui, add your prompt, reference images and generate. Thanks to @stduhpf for the sd.cpp implementation!
- Photomaker now supports uploading multiple reference images, same as Kontext. Up to 4 reference images are accepted.
- Merged upstream support and added AutoGuess template for Gemma3n (text only) and ERNIE.
- Further grammar sampling speedups from caching by @Reithan
- Fixed a bug when combining save states with draft models.
- Fixed an issue where prompt processing encountered errors after the KV refactor
- Fixed support for python 3.13 (thanks @tsite)
- Updated Kobold Lite, multiple fixes and improvements
- Fixed Push-to-Talk on mobile, added Toggle to Talk (voice input) option.
- Improved some error handling for aborted streaming
- Fixed some linebreaks in corpo chat mode
- Fixed a bug in thinking regex
- Merged new model support, fixes and improvements from upstream
Hotfix 1.95.1 - Fixed error when using swa together with flash attention.
Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try oldpc version instead (Cuda11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here if you are a Windows user or download our rolling ROCm binary here if you use Linux.
If you're on a modern MacOS (M-Series) you can use the koboldcpp-mac-arm64 MacOS binary.
Click here for .gguf conversion and quantization tools
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect at http://localhost:5001 (or use the full koboldai client).
For more information, be sure to run the program from the command line with the --help flag. You can also refer to the readme and the wiki.
koboldcpp-1.94.2
are we comfy yet?
- NEW: Added unpacked mini-launcher: Now when unpacking KoboldCpp to a directory, a 5MB mini pyinstaller launcher is also generated in that same directory, that allows you to easily start an unpacked KoboldCpp without needing to install python or other dependencies. You can copy the unpacked directory and use it anywhere (thanks @henk717)
- NEW: Chroma Image Generation Support: Merged support for the Chroma model, a new architecture based on Flux Schnell (thanks @stduhpf)
- NEW: Added PhotoMaker Face Cloning: Use --sdphotomaker to load PhotoMaker along with any SDXL based model. Then open KoboldCpp SDUI and upload any reference image in the PhotoMaker input to clone the face! Works in all modes (inpaint/img2img/text2img).
- Swapping .gguf models in admin mode now allows overriding the config with a different one as well (both are customizable).
- Improve GBNF grammar performance by attempting culled grammar search first (thanks @Reithan)
- Allow changing the main GPU with --maingpu when loading multi-GPU setups. The main GPU uses more VRAM and has a larger performance impact. By default it is the first GPU.
- Added configurable soft resolution limits and VAE tiling limits (thanks @wbruna), also fixed VAE tiling artifacts.
- Added --sdclampedsoft which provides "soft" total resolution clamping instead (e.g. 640 would allow 640x640, 512x768 and 768x512 images); it can be combined with --sdclamped which provides hard clamping (no dimension can exceed it). A minimal launch sketch follows at the end of this list.
- Added --sdtiledvae which replaces --sdnotile: allows specifying a size beyond which VAE tiling is applied.
- Added embedding model options:
- Use --embeddingsmaxctx to limit the max context length for embedding models (if you run out of memory, this will help)
- Added --embeddingsgpu to allow offloading embedding model layers to GPU. This is NOT recommended as it doesn't provide much speedup, since embedding models already use the GPU for processing even without dedicated offload.
- Display available RAM on startup, display version number in terminal window title
- ComfyUI emulation now covers the /upload/image endpoint, which allows Img2Img ComfyUI workflows. Files are stored temporarily in memory only.
- Added more performance stats for token speeds and timings.
- Updated Kobold Lite, multiple fixes and improvements
- Fixed Chub.ai importer again
- Added card importer for char-archive.evulid.cc
- Added option to import image from webcam
- Allow markdown when streaming current turn
- Improved CSS import sanitizer (thanks @PeterPeet)
- Word Frequency Search (inspired from @trincadev MyGhostWriter)
- Allow usermods and CSS to be loaded from file.
- Added WebSearch for corpo mode
- Added Img2Img support for ComfyUI backends
- Added ability to use custom OpenAI endpoint for TextDB embedding model
- Minor linting and splitter/merge tool by @ehoogeveen-medweb
- Fixed lookahead scanning for Author's note insertion point
- Merged new model support, fixes and improvements from upstream
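To illustrate the clamping and tiling flags described above, here is a minimal launch sketch. The model filename and limit values are placeholders.

```python
# Minimal sketch: image generation with soft resolution clamping and a VAE tiling threshold.
import subprocess

subprocess.run([
    "./koboldcpp-linux-x64",
    "--sdmodel", "sdxl-model.safetensors",  # placeholder image model
    "--sdclampedsoft", "640",  # soft total-resolution clamp: allows 640x640, 512x768, 768x512, etc.
    "--sdtiledvae", "768",     # apply VAE tiling for images beyond this size
])
```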
Hotfix 1.94.1 - Minor bugfixes, fixed ollama compatible vision, added avx/avx2 detection for backend auto-selection, cleaned up oldpc builds to only include oldpc files.
Hotfix 1.94.2 - Fixed an issue with SWA models when context is full, tried to fix a Vulkan OOM regression
Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try oldpc version instead (Cuda11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here if you are a Windows user or download our rolling ROCm binary here if you use Linux.
If you're on a modern MacOS (M-Series) you can use the koboldcpp-mac-arm64 MacOS binary.
Click here for .gguf conversion and quantization tools
Deprecation Reminder: Binary filenames have been renamed: The files named koboldcpp_cu12.exe, koboldcpp_oldcpu.exe, koboldcpp_nocuda.exe, koboldcpp-linux-x64-cuda1210, and koboldcpp-linux-x64-cuda1150 have been removed. Please switch to the new filenames.
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect at http://localhost:5001 (or use the full koboldai client).
For more information, be sure to run the program from the command line with the --help flag. You can also refer to the readme and the wiki.
koboldcpp-1.93.2
those left behind
- NEW: Added Windows Shell integration. You can now associate .gguf files to open automatically in KoboldCpp (e.g. double clicking a gguf). If another kcpp instance is already running locally on the same port, it will be replaced. The default handler can be installed/uninstalled from the 'Extras' tab (thanks @henk717)
- This is handled by the /api/extra/shutdown API, which can only be triggered from localhost (a minimal request sketch follows at the end of this list).
- Will not affect instances started without the --singleinstance flag. All this is automatic when you launch via Windows shell integration.
- NEW: Added an option to simply unload a model from the admin API, the server will free the memory but continue to run. You can then switch to a different model via the admin panel in Lite.
- NEW: Added Save and Load States (sessions). This allows you to take a Savestate Snapshot of the current context, and then reload it again later at any time. Available over the admin API, you can trigger it from the admin panel in Lite.
- Works similarly to 'session files' in llama.cpp, but the snapshot states are stored entirely in memory.
- Used correctly, it can allow you to swap between multiple different sessions/chats without any reprocessing at all.
- There are 3 available slots to use (total 4 including the current session).
- Fixed a regression with flash attention not working for some GPUs in the previous version.
- Added a text LoRA scale option. Removed text LoRA base as it was no longer used in modern ggufs. If provided it will be silently ignored.
- Function/Tool calling can now use higher temperatures (up to 1.0)
- Added more Ollama compatibility endpoints.
- Fixed a few clip skip issues in image generation.
- Added an adapter flag add_sd_step_limit to limit max image generation step counts.
- Fixed crash on thread count 0.
- Match a few common openai tts voice ids
- Fixed a ctx bug with embeddings (still does not work with qwen3 embed, but should work with most others)
- KoboldCpp Colab now uses KoboldCpp's internal downloader instead of downloading the models first externally.
- Updated Kobold Lite, multiple fixes and improvements
- Added support for embeddings models into KoboldAI Lite's TextDB (thanks @esolithe)
- Added support for saving and loading world info files independently (thanks @esolithe)
- NEW: Added new "Smart" Image Autogeneration mode. This allows the AI to decide when it should generate images, and create image prompt automatically.
- Added a new scenario: Replaced defunct aetherroom.club with prompts.forthisfeel.club
- Added support for importing cards from character-tavern.com
- Improved Tavern World Info support
- Added support for welcome messages in corpo mode.
- Fixed copy to clipboard not working for some browsers.
- Interactive Storywriter scenario fix: now no longer overwrites your regex settings. However, hiding input text is now off by default.
- Added a toggle to make a usermod permanent. Use with caution.
- Markdown fixes, also prevent your username from being overwritten when changing chat scenario.
- Merged fixes and improvements from upstream
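For the localhost-only shutdown API mentioned above, a request along these lines should ask a running instance to exit. This is a sketch: the HTTP method is an assumption (verify against the API docs), and it only works from the same machine.

```python
# Sketch: ask a locally running KoboldCpp instance to shut down (localhost-only API).
# The POST method is an assumption; check the API documentation for your version.
import requests

try:
    requests.post("http://localhost:5001/api/extra/shutdown", timeout=5)
except requests.RequestException:
    pass  # the server may close the connection while shutting down
```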
Hotfix 1.93.1 - Fixed a crash due to outdated VC runtime dlls, fixed a bad adapter, added base64 embeddings support, added webcam upload support for KoboldAI Lite Add Image, fixed chubai importer, added more options for idle response trigger times.
Hotfix 1.93.2 - Revert back to VS2019+cuda12.1 for windows build to solve reports of crashes. Fixed issues with embeddings endpoint. Added --embeddingsmaxctx option.
Important Breaking Changes (File Naming Change Notice):
- For improved clarity and ease of use, many binaries are being RENAMED.
- Please observe the new name changes for your automated scripts to avoid disruption:
- Linux:
- koboldcpp-linux-x64-cuda1210 is now koboldcpp-linux-x64 (Cuda12, AVX2, Newer PCs)
- koboldcpp-linux-x64-cuda1150 is now koboldcpp-linux-x64-oldpc (Cuda11, AVX1, Older PCs)
- koboldcpp-linux-x64-nocuda is still koboldcpp-linux-x64-nocuda (No CUDA)
- Windows:
- koboldcpp_cu12.exe is now koboldcpp.exe (Cuda12, AVX2, Newer PCs)
- koboldcpp_oldcpu.exe is now koboldcpp-oldpc.exe (Cuda11, AVX1, Older PCs)
- koboldcpp_nocuda.exe is now koboldcpp-nocuda.exe (No CUDA)
- If you are using our official URLs or docker images, this should be handled automatically, but ensure your docker image is up-to-date.
- If you are using platforms that do not support the main build, you can continue using the oldpc builds, which remain on cuda11 and avx1 and will continue to be maintained. The cuda12+ version on the main build may be subject to change in future.
- For now, both filenames are uploaded to avoid breaking existing scripts. The old filenames will be removed soon, so please update.
Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try oldpc version instead (Cuda11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here
If you're on a modern MacOS (M-Series) you can use the koboldcpp-mac-arm64 MacOS binary.
Deprecation Warning: The files named koboldcpp_cu12.exe, koboldcpp_oldcpu.exe, koboldcpp_nocuda.exe, koboldcpp-linux-x64-cuda1210, and koboldcpp-linux-x64-cuda1150 will be removed very soon. Please switch to the new filenames.
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect at http://localhost:5001 (or use the full koboldai client).
For more information, be sure to run the program from the command line with the --help flag. You can also refer to the readme and the wiki.
koboldcpp-1.92.1
early bug is for the birds edition
- Added support for SWA mode, which uses much less memory for the KV cache; use --useswa to enable (a minimal launch sketch follows at the end of this list).
- Note: SWA mode is not compatible with ContextShifting, and may result in degraded output when used with FastForwarding.
- Fixed an off-by-one error in some cases when Fast Forwarding that resulted in degraded output.
- Greatly improved tool calling by enforcing grammar on the output field names, and doing the automatic tool selection as a separate pass. Tool calling should be much more reliable now.
- Added model size information in the HF Huggingface Search and download menu
- CLI terminal output is now truncated in the middle of very long strings instead of at the end.
- Fixed unicode path handling for Image Generation models.
- Enabled threadpools, this should result in a speedup for Qwen3MoE.
- Merged Vision support for Llama4 models, simplified some vision preprocessing code.
- Fixes for prompt formatting for GLM4 models. GLM4 batch processing on Vulkan is fixed (thanks @0cc4m).
- Fixed incorrect AutoGuess adapter for some Mistral models. Also fixed some KoboldCppAutomatic placeholder tag replacements.
- AI Horde default advertised context now matches main max context by default. This can be changed.
- Disable --showgui if --skiplauncher is used
- StableUI now increments clip_skip and seed correctly when generating multiple images in a batch (thanks @wbruna)
- clip_skip is now stored inside image metadata, and random seed's actual number is also indicated.
- Added DDIM sampler for image generation.
- Added a simple optional python reqs install script in launch.cmd for launching when run from unpacked directories.
- Updated Kobold Lite, multiple fixes and improvements
- Integrated dPaste.org (open source pastebin) as a platform for quickly sharing Save Files. You can also use a self hosted instance by changing the endpoint URL. You can now share stories as a single URL with Save/Load > Share > Export Share as Web URL
- Added an option to allow Horizontal Stacking of multiple images in one row.
- Fixed importing of Chub.AI character cards as they changed their endpoint.
- Added support for RisuAI V3 character cards (.charx archive format), also fixed KAISTORY handling.
- SSE streaming is now the default for all cases. It can be disabled in Advanced Settings.
- Changed markdown renderer to render markdown separately for each instruct turn.
- Better passthrough for KoboldCppAutomatic instruct preset, especially with split tags.
- Added an option to use TTS from Pollinations API, which routes through OpenAI TTS models. Note that this TTS service has a server-side censorship via a content filter that I cannot control.
- Lite now sends stop sequences in OpenAI Chat Completions Endpoint mode (up to 4)
- Added ST based randomizer macros like {{roll:3d6}} (thanks @hu-yijie)
- Added new Immortal sampler preset by Jeb Carter
- In polled streaming mode, you can fetch last generated text if the request fails halfway.
- Added an exit button when editing raw text in corpo mode.
- Re-enabled a debug option for using raw placeholder tags on request. Not recommended.
- Added a debug option that allows changing the connected API at runtime.
- Merged fixes and improvements from upstream
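A minimal launch sketch with SWA mode enabled, per the note above. The model filename is a placeholder; remember that SWA is not compatible with ContextShifting.

```python
# Minimal sketch: enable SWA mode to reduce KV cache memory (model path is a placeholder).
import subprocess

subprocess.run([
    "./koboldcpp-linux-x64",
    "--model", "model.gguf",  # placeholder model file
    "--useswa",               # sliding window attention: smaller KV cache, no ContextShifting
])
```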
Hotfix 1.92.1 - Fixes for a GLM4 vulkan bug, allow extra EOG tokens to trigger a stop.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have an Nvidia GPU, but use an old CPU and koboldcpp.exe does not work, try koboldcpp_oldcpu.exe
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're on a modern MacOS (M1, M2, M3 etc) you can try the koboldcpp-mac-arm64 MacOS binary.
If you're using AMD, we recommend trying the Vulkan option (available in all releases) first, for best support.
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect at http://localhost:5001 (or use the full koboldai client).
For more information, be sure to run the program from the command line with the --help flag. You can also refer to the readme and the wiki.
kcpp_tools_rolling
This release contains the latest KoboldCpp tools used to convert and quantize models. Alternatively, you can also use the tools released by the llama.cpp project, they should be cross compatible. The binaries here will be periodically updated.