OpenVINO™ GenAI is a library of the most popular Generative AI model pipelines, optimized execution methods, and samples that run on top of the highly performant OpenVINO Runtime.
This library is friendly to PC and laptop execution, and optimized for resource consumption. It requires no external dependencies to run generative models as it already includes all the core functionality (e.g. tokenization via openvino-tokenizers).
Please follow these blogs to set up your first hands-on experience with the C++ and Python samples.
The OpenVINO™ GenAI library provides very lightweight C++ and Python APIs to run the following generative scenarios:
- Text generation using Large Language Models, for example, chat with a local LLaMa model
- Image generation using Diffuser models, for example, generation using Stable Diffusion models
- Speech recognition using Whisper family models
- Text generation using Large Visual Models, for instance, image analysis using the LLaVa or MiniCPM model families
The library efficiently supports LoRA adapters for text and image generation scenarios:
- Load multiple adapters per model
- Select active adapters for every generation
- Mix multiple adapters with coefficients via alpha blending
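As a rough illustration of what mixing adapters with coefficients means, the sketch below applies the usual LoRA formulation, in which each adapter contributes a low-rank update `B @ A` scaled by its coefficient `alpha`. It is a pure-Python toy on 2x2 matrices, not the library's implementation; the function names (`matmul`, `blend_adapters`) and the example values are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, sufficient for tiny illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def blend_adapters(base: Matrix,
                   adapters: Sequence[Tuple[Matrix, Matrix]],
                   alphas: Sequence[float]) -> Matrix:
    """Return base + sum(alpha_i * B_i @ A_i) over the active adapters.

    `adapters` holds (B, A) low-rank factor pairs; `alphas` holds one
    blending coefficient per adapter. Hypothetical helper for illustration.
    """
    result = [row[:] for row in base]  # copy so the base weights stay untouched
    for (b, a), alpha in zip(adapters, alphas):
        delta = matmul(b, a)  # low-rank update with the same shape as `base`
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                result[i][j] += alpha * value
    return result


# Two rank-1 adapters for a 2x2 weight matrix (made-up numbers).
base_weights = [[1.0, 0.0], [0.0, 1.0]]
adapter_one = ([[1.0], [0.0]], [[0.0, 2.0]])  # B is 2x1, A is 1x2
adapter_two = ([[0.0], [1.0]], [[4.0, 0.0]])

blended = blend_adapters(base_weights, [adapter_one, adapter_two], [0.5, 0.25])
print(blended)  # [[1.0, 1.0], [1.0, 1.0]]
```

Selecting different active adapters per generation then amounts to passing a different adapter list and coefficients while the base weights stay unchanged.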
All scenarios run on top of OpenVINO Runtime, which supports inference on CPU, GPU and NPU. See here for the platform support matrix.
The OpenVINO™ GenAI library provides a transparent way to use state-of-the-art generation optimizations:
- Speculative decoding, which employs two models of different sizes and uses the large model to periodically correct the results of the small model. See here for a more detailed overview
- KVCache token eviction algorithm, which reduces the size of the KVCache by pruning less impactful tokens.
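The speculative decoding idea from the list above can be sketched with two toy "models", each just a function that returns the next token for a sequence. This is a minimal greedy sketch, not the library's implementation: the names (`speculative_decode`, `target_model`, `draft_model`, `num_draft_tokens`) are hypothetical, and a real engine verifies all drafted positions in one forward pass of the large model rather than one call per token.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token sequence to the next token


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, num_draft_tokens: int = 4) -> List[int]:
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # Step 1: the small model cheaply proposes a short continuation.
        proposal: List[int] = []
        for _ in range(num_draft_tokens):
            proposal.append(draft(tokens + proposal))
        # Step 2: the large model checks the proposal position by position.
        for guess in proposal:
            if len(tokens) >= limit:
                break
            expected = target(tokens)
            tokens.append(expected)  # the large model's token is always kept
            if expected != guess:
                break  # first mismatch: discard the rest of the proposal
    return tokens


def target_model(tokens: List[int]) -> int:
    """Toy 'large' model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_model(tokens: List[int]) -> int:
    """Toy 'small' model: agrees with the target except right after token 4."""
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


generated = speculative_decode(target_model, draft_model, prompt=[0], max_new_tokens=8)
print(generated)  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

The output matches what the large model would produce on its own; the benefit in practice comes from accepting several drafted tokens per large-model pass.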
Additionally, the OpenVINO™ GenAI library implements a continuous batching approach for using OpenVINO within LLM serving. The continuous batching library can be used in LLM serving frameworks and supports the following features:
- Prefix caching, which internally caches fragments of previous generation requests and the corresponding KVCache entries and reuses them when a query is repeated. See here for a more detailed overview
Continuous batching functionality is used within OpenVINO Model Server (OVMS) to serve LLMs; see here for more details.
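Prefix caching, as described above, can be pictured as a lookup keyed by the token prefix. The sketch below is a simplified model under stated assumptions, not the library's data structure: the `PrefixCache` class, the `BLOCK_SIZE` value, and the string stand-ins for KVCache entries are all hypothetical.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # tokens per cached block; an arbitrary value for this sketch


class PrefixCache:
    """Toy prefix cache: reuses stand-in KV entries for repeated prefixes."""

    def __init__(self) -> None:
        # Key: the full token prefix up to a block boundary, because real KV
        # entries depend on every token that precedes them.
        self._blocks: Dict[Tuple[int, ...], List[str]] = {}
        self.computed_tokens = 0  # tokens that needed actual computation

    def _compute_kv(self, tokens: List[int]) -> List[str]:
        self.computed_tokens += len(tokens)
        return [f"kv({t})" for t in tokens]  # placeholder for real KV tensors

    def process(self, tokens: List[int]) -> List[str]:
        kv: List[str] = []
        pos = 0
        # Reuse whole blocks for as long as the prefix is already cached.
        while pos + BLOCK_SIZE <= len(tokens):
            key = tuple(tokens[:pos + BLOCK_SIZE])
            if key not in self._blocks:
                break
            kv.extend(self._blocks[key])
            pos += BLOCK_SIZE
        # Compute the remaining tokens and cache every complete block.
        while pos < len(tokens):
            chunk = tokens[pos:pos + BLOCK_SIZE]
            chunk_kv = self._compute_kv(chunk)
            if len(chunk) == BLOCK_SIZE:
                self._blocks[tuple(tokens[:pos + BLOCK_SIZE])] = chunk_kv
            kv.extend(chunk_kv)
            pos += len(chunk)
        return kv


cache = PrefixCache()
first = cache.process(list(range(1, 11)))               # 10 new tokens
after_first = cache.computed_tokens
second = cache.process([1, 2, 3, 4, 5, 6, 7, 8, 20, 21])  # shares 8 tokens
print(after_first, cache.computed_tokens - after_first)  # 10 2
```

The second request shares its first eight tokens with the first one, so only the two new tokens are computed.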
# Installing OpenVINO GenAI via pip
pip install openvino-genai
# Install optimum-intel to be able to download, convert and optimize LLMs from Hugging Face
# Optimum is not required to run models, only to convert and compress
pip install optimum-intel@git+https://github.com/huggingface/optimum-intel.git
# (Optional) Install (TBD) to be able to download models from Model Scope
For more examples check out our LLM Inference Guide
#(Basic) download and convert to OpenVINO TinyLlama-Chat-v1.0 model
optimum-cli export openvino --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --weight-format fp16 --trust-remote-code "TinyLlama-1.1B-Chat-v1.0"
#(Recommended) download, convert to OpenVINO and compress to int4 TinyLlama-Chat-v1.0 model
optimum-cli export openvino --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --weight-format int4 --trust-remote-code "TinyLlama-1.1B-Chat-v1.0"
import openvino_genai as ov_genai
# Runs the model on CPU; GPU and NPU are also possible options
pipe = ov_genai.LLMPipeline("./TinyLlama-1.1B-Chat-v1.0/", "CPU")
print(pipe.generate("The Sun is yellow because", max_new_tokens=100))
The code below requires installation of the C++ compatible package (see here for more details).
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
#include <string>
int main(int argc, char* argv[]) {
    // The first command-line argument is the path to the converted model directory
    std::string models_path = argv[1];
    ov::genai::LLMPipeline pipe(models_path, "CPU");
    std::cout << pipe.generate("The Sun is yellow because", ov::genai::max_new_tokens(100)) << '\n';
}
See here
For more examples check out our LLM Inference Guide
#(Basic) download and convert to OpenVINO MiniCPM-V-2_6 model
optimum-cli export openvino --model openbmb/MiniCPM-V-2_6 --trust-remote-code --weight-format fp16 MiniCPM-V-2_6
#(Recommended) Same as above but with compression: language model is compressed to int4, other model components are compressed to int8
optimum-cli export openvino --model openbmb/MiniCPM-V-2_6 --trust-remote-code --weight-format int4 MiniCPM-V-2_6
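import numpy as np
import openvino as ov
import openvino_genai as ov_genai
from PIL import Image
# Runs the model on CPU; GPU is also a possible option
pipe = ov_genai.VLMPipeline("./MiniCPM-V-2_6/", "CPU")
# Load the image as a uint8 RGB tensor with a leading batch dimension
rgb = ov.Tensor(np.array(Image.open("cat.jpg").convert("RGB"), dtype=np.uint8)[None])
prompt = "Can you describe the image?"  # example prompt; the original snippet left it undefined
print(pipe.generate(prompt, image=rgb, max_new_tokens=100))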
The code below requires installation of the C++ compatible package (see here for more details).
#include "load_image.hpp"
#include <openvino/genai/visual_language/pipeline.hpp>
#include <iostream>
#include <string>
int main(int argc, char* argv[]) {
    // argv[1]: path to the converted model directory, argv[2]: path to the image
    std::string models_path = argv[1];
    ov::genai::VLMPipeline pipe(models_path, "CPU");
    ov::Tensor rgb = utils::load_image(argv[2]);
    // Example prompt; the original snippet left it undefined
    std::string prompt = "Can you describe the image?";
    std::cout << pipe.generate(
        prompt,
        ov::genai::image(rgb),
        ov::genai::max_new_tokens(100)
    ) << '\n';
}
See here
For more examples check out our LLM Inference Guide
#Download and convert to OpenVINO dreamlike-anime-1.0 model
optimum-cli export openvino --model dreamlike-art/dreamlike-anime-1.0 --task stable-diffusion --weight-format fp16 dreamlike_anime_1_0_ov/FP16
#You can also use INT8 hybrid quantization to further optimize the model and reduce inference latency
optimum-cli export openvino --model dreamlike-art/dreamlike-anime-1.0 --task stable-diffusion --weight-format int8 --dataset conceptual_captions dreamlike_anime_1_0_ov/INT8
import argparse
from PIL import Image
import openvino_genai
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('model_dir')
    parser.add_argument('prompt')
    args = parser.parse_args()
    device = 'CPU'  # GPU, NPU can be used as well
    pipe = openvino_genai.Text2ImagePipeline(args.model_dir, device)
    image_tensor = pipe.generate(
        args.prompt,
        width=512,
        height=512,
        num_inference_steps=20
    )
    # The result is a batch of images; save the first one
    image = Image.fromarray(image_tensor.data[0])
    image.save("image.bmp")
if __name__ == '__main__':
    main()
The code below requires installation of the C++ compatible package (see here for additional setup details, or this blog for full instructions: How to Build OpenVINO™ GenAI APP in C++).
#include "openvino/genai/image_generation/text2image_pipeline.hpp"
#include "imwrite.hpp"
#include <string>
int main(int argc, char* argv[]) {
    // argv[1]: path to the converted model directory, argv[2]: text prompt
    const std::string models_path = argv[1], prompt = argv[2];
    const std::string device = "CPU";  // GPU, NPU can be used as well
    ov::genai::Text2ImagePipeline pipe(models_path, device);
    ov::Tensor image = pipe.generate(prompt,
        ov::genai::width(512),
        ov::genai::height(512),
        ov::genai::num_inference_steps(20));
    imwrite("image.bmp", image, true);
}
See here
For more examples check out our LLM Inference Guide
NOTE: Whisper Pipeline requires preprocessing of audio input (to adjust sampling rate and normalize)
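To make the preprocessing note concrete, here is a minimal pure-Python sketch of the two steps, assuming that "normalize" means scaling 16-bit PCM samples into the [-1.0, 1.0] float range and that the target rate is 16000 Hz (the value passed to librosa in the Python sample). The helper names are hypothetical, and the linear interpolation has no anti-aliasing filter, so a dedicated audio library is preferable for real input.

```python
from typing import List

TARGET_RATE = 16000  # sampling rate assumed for Whisper input


def pcm16_to_float(samples: List[int]) -> List[float]:
    """Scale signed 16-bit PCM samples into the [-1.0, 1.0] range."""
    return [s / 32768.0 for s in samples]


def resample_linear(samples: List[float], source_rate: int,
                    target_rate: int = TARGET_RATE) -> List[float]:
    """Resample by linear interpolation between neighbouring samples."""
    if source_rate == target_rate or len(samples) < 2:
        return list(samples)
    out_len = int(len(samples) * target_rate / source_rate)
    step = source_rate / target_rate  # distance between outputs in input samples
    resampled = []
    for i in range(out_len):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        resampled.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return resampled


# Eight made-up PCM samples recorded at 32 kHz.
pcm = [0, 8192, 16384, 8192, 0, -8192, -16384, -32768]
audio = resample_linear(pcm16_to_float(pcm), source_rate=32000)
print(audio)  # [0.0, 0.5, 0.0, -0.5]
```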
#Download and convert to OpenVINO whisper-base model
optimum-cli export openvino --trust-remote-code --model openai/whisper-base whisper-base
NOTE: This sample is a simplified version of the full sample that is available here
import openvino_genai
import librosa
def read_wav(filepath):
    # librosa resamples the audio to 16 kHz and returns float samples
    raw_speech, samplerate = librosa.load(filepath, sr=16000)
    return raw_speech.tolist()
device = "CPU"  # GPU can be used as well
pipe = openvino_genai.WhisperPipeline("whisper-base", device)
raw_speech = read_wav("sample.wav")
print(pipe.generate(raw_speech))
NOTE: This sample is a simplified version of the full sample that is available here
#include <filesystem>
#include <iostream>
#include <string>
#include "audio_utils.hpp"
#include "openvino/genai/whisper_pipeline.hpp"
int main(int argc, char* argv[]) {
    // argv[1]: path to the converted model directory, argv[2]: path to the WAV file
    std::filesystem::path models_path = argv[1];
    std::string wav_file_path = argv[2];
    std::string device = "CPU";  // GPU can be used as well
    ov::genai::WhisperPipeline pipeline(models_path, device);
    ov::genai::RawSpeechInput raw_speech = utils::audio::read_wav(wav_file_path);
    std::cout << pipeline.generate(raw_speech, ov::genai::max_new_tokens(100)) << '\n';
}
See here
- List of supported models (NOTE: models not on the list may still work, but have not been tested yet)
- OpenVINO LLM inference Guide
- Optimum-intel and OpenVINO
The OpenVINO™ GenAI repository is licensed under Apache License Version 2.0. By contributing to the project, you agree to the license and copyright terms therein and release your contribution under these terms.