Start streaming models by creating a `SafetensorsStreamer` object, which serves as a context manager. The `SafetensorsStreamer` object opens OS threads that read the tensors from the safetensors file into CPU memory; closing the object destroys the threads. You can control the number of concurrent threads with the environment variable `RUNAI_STREAMER_CONCURRENCY`.
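For example, you could raise the thread count before creating the streamer. The value 16 below is arbitrary, for illustration only:

```python
import os

# Illustrative value; tune for your storage and CPU.
# Must be set before the SafetensorsStreamer is created.
os.environ["RUNAI_STREAMER_CONCURRENCY"] = "16"
```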
If your safetensors file resides on a file system, run the following code to load the tensors into the CPU buffer and stream them to GPU memory:
```python
from runai_model_streamer import SafetensorsStreamer

file_path = "/path/to/file.safetensors"

with SafetensorsStreamer() as streamer:
    streamer.stream_file(file_path)
    for name, tensor in streamer.get_tensors():
        tensor.to('cuda:0')
```
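A common next step is to collect the streamed tensors into a state dict and load it into a model. The sketch below assumes a hypothetical `torch.nn.Module` named `model` whose parameter names match the tensor names in the file:

```python
from runai_model_streamer import SafetensorsStreamer

state_dict = {}
with SafetensorsStreamer() as streamer:
    streamer.stream_file("/path/to/file.safetensors")
    for name, tensor in streamer.get_tensors():
        # .to() copies the tensor to the GPU, so the copies remain valid
        # after the streamer is closed.
        state_dict[name] = tensor.to('cuda:0')

# `model` is hypothetical: a torch.nn.Module matching the file's tensors.
# model.load_state_dict(state_dict)
```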
Note: To keep the tensors available in CPU memory, clone each yielded tensor before consuming the next one from `streamer.get_tensors()`. Otherwise, tensors may be overwritten when using `RUNAI_STREAMER_MEMORY_LIMIT`, or destroyed entirely when the `SafetensorsStreamer` object is closed.
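For example, to keep the tensors on the CPU rather than move them to the GPU, clone each yielded tensor into storage you own. This is a minimal sketch; `cpu_tensors` is an illustrative name:

```python
from runai_model_streamer import SafetensorsStreamer

cpu_tensors = {}
with SafetensorsStreamer() as streamer:
    streamer.stream_file("/path/to/file.safetensors")
    for name, tensor in streamer.get_tensors():
        # clone() copies the data out of the streamer's buffer, so the
        # tensor survives buffer reuse and the streamer being closed.
        cpu_tensors[name] = tensor.clone()
```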
Note: Streaming models from S3 requires the installation of the streamer's S3 package.
To load tensors from object storage, replace the file path in the code above with your S3 path, e.g.:
```python
file_path = "s3://my-bucket/my/file/path.safetensors"
```
To load tensors from an S3-compatible object store, define the following environment variables:

```
RUNAI_STREAMER_S3_USE_VIRTUAL_ADDRESSING=0
RUNAI_STREAMER_S3_ENDPOINT="your_S3_endpoint"
AWS_EC2_METADATA_DISABLED=true
```

Setting the environment variable `AWS_EC2_METADATA_DISABLED` is needed to avoid a delay of a few seconds that occurs only when the AWS S3 SDK is used with S3-compatible storage.
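For example, for an S3-compatible store such as MinIO, you might set the variables from Python before creating the streamer; the endpoint URL below is illustrative:

```python
import os

# Illustrative endpoint; replace with your S3-compatible store's URL.
os.environ["RUNAI_STREAMER_S3_USE_VIRTUAL_ADDRESSING"] = "0"
os.environ["RUNAI_STREAMER_S3_ENDPOINT"] = "https://minio.example.com:9000"
os.environ["AWS_EC2_METADATA_DISABLED"] = "true"

from runai_model_streamer import SafetensorsStreamer

with SafetensorsStreamer() as streamer:
    streamer.stream_file("s3://my-bucket/my/file/path.safetensors")
    for name, tensor in streamer.get_tensors():
        tensor.to('cuda:0')
```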
The streamer allocates a buffer in CPU memory for storing the tensors before moving them to GPU memory. Control the size of the allocated buffer with the environment variable `RUNAI_STREAMER_MEMORY_LIMIT`:
- `RUNAI_STREAMER_MEMORY_LIMIT=-1`: The default value. The allocated CPU buffer is as large as the safetensors file (excluding the file header), and there is no memory reuse between multiple `get_tensors()` requests. Use this option for maximum performance and the fastest model streaming times.
- `RUNAI_STREAMER_MEMORY_LIMIT=0`: The allocated CPU buffer is minimal, equal to the size of the largest tensor in the file. The buffer is reused across all `get_tensors()` requests.
- `RUNAI_STREAMER_MEMORY_LIMIT=NUMBER`: Limits the buffer to a specific size in bytes. For example, limit the buffer to 4 GB by setting `RUNAI_STREAMER_MEMORY_LIMIT=4000000000`. In this case, the streamer reuses the buffer memory between multiple `get_tensors()` requests.
Warning: You cannot limit the buffer to a size smaller than the largest tensor in the file.
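For example, to cap the CPU buffer at roughly 4 GB while streaming (reusing the snippet from above; the limit value matches the byte count in the example):

```python
import os

# Cap the CPU buffer at ~4 GB. The streamer reuses this buffer across
# get_tensors() requests, so consume (or clone) each tensor promptly.
os.environ["RUNAI_STREAMER_MEMORY_LIMIT"] = "4000000000"

from runai_model_streamer import SafetensorsStreamer

with SafetensorsStreamer() as streamer:
    streamer.stream_file("/path/to/file.safetensors")
    for name, tensor in streamer.get_tensors():
        tensor.to('cuda:0')  # copy to GPU before the buffer is reused
```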