The goal is to use the RunPod platform to deploy a serverless endpoint that runs a machine learning model. The endpoint should be able to receive a request and generate a corresponding response.
RunPod is a powerful and cost-effective cloud platform designed for AI and machine learning workloads. It offers on-demand GPU instances, allowing users to access high-performance compute resources when needed. With its serverless computing capabilities, RunPod enables autoscaling API endpoints, making it easy to scale inference for AI models efficiently.
The platform provides AI endpoints tailored for inference workloads, ensuring seamless deployment and execution of machine learning models. Additionally, RunPod allows users to manage software on third-party compute resources while benefiting from dynamic scaling to meet computational demands effectively.
- Write a handler.py file that can handle serverless requests for the given model.
- Build a Docker image that includes your serverless handler and the model.
- Deploy a serverless endpoint on the RunPod platform.
- Test that the serverless endpoint is able to receive an input (via text) and return an appropriate image for display.
- For ease of sharing the code and integrating with the RunPod platform, I created this GitHub repository to store my Dockerfile and handler file.
- Created a Dockerfile that has `pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime` as the base image. This image provides a pre-built environment with PyTorch, CUDA, and cuDNN for running GPU-accelerated deep learning models such as text-to-image pipelines, which simplifies setup and keeps the library versions compatible. The Dockerfile then installs the Python packages needed to download the model and run the text-to-image pipeline. The Python package `huggingface_hub` is used to download the `stabilityai/stable-diffusion-2-1-base` model into a specified directory, `app/model`. The handler file is also copied to the workspace folder, `/app`, and is set as the entry point.
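The model download step described above could look roughly like the following Python script, run while the image is being built. This is a sketch, not the repository's actual code: it assumes `snapshot_download` from `huggingface_hub` is the call used, it takes `/app/model` as the target folder, and the `download_model` helper with its `downloader` parameter is a hypothetical name added for illustration.

```python
MODEL_ID = "stabilityai/stable-diffusion-2-1-base"
MODEL_DIR = "/app/model"  # assumed absolute path of the model folder in the image


def download_model(repo_id=MODEL_ID, local_dir=MODEL_DIR, downloader=None):
    """Download every file of the model repository into local_dir.

    `downloader` is a hypothetical hook so the function can be exercised
    without network access; by default the real Hugging Face client is used.
    """
    if downloader is None:
        # Imported lazily so this module can be imported without the package.
        from huggingface_hub import snapshot_download as downloader
    return downloader(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_model()` with no arguments during the build would place the model files in `/app/model`, so they are already inside the image when a worker starts.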
- The handler file picks up the downloaded model from the `/app/model` folder and uses it to prepare the pipeline using the `diffusers` package. The model is thereby loaded and ready to use for any future requests to the handler. The `event` function of the handler is invoked whenever an API request is made to the serverless function. It expects a request in the form `{ "input": { "prompt": "Earth as a 3D cartoon animation" } }`. The prompt is extracted by the `event` function and used as the input to the pipeline to create images. Only the first image is used; it is saved as a PNG file and returned to the caller as a base64 string.
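The handler flow described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the repository's actual `handler.py`: the function names (`load_pipeline`, `build_handler`, `start_worker`), the half-precision and CUDA settings, the `error` reply for a missing prompt, and the `body` key of the reply are all assumptions, the last one inferred from the `output.body` field mentioned later in this write-up.

```python
import base64
import io

MODEL_DIR = "/app/model"  # folder the model is downloaded into at build time


def load_pipeline(model_dir=MODEL_DIR):
    """Load the Stable Diffusion pipeline from disk (needs torch, diffusers, a GPU)."""
    import torch
    from diffusers import StableDiffusionPipeline

    # Assumption: half precision on a CUDA device; the original settings are not stated.
    pipe = StableDiffusionPipeline.from_pretrained(model_dir, torch_dtype=torch.float16)
    return pipe.to("cuda")


def build_handler(pipe):
    """Return a handler function bound to an already loaded pipeline."""

    def handler(event):
        # Requests arrive as {"input": {"prompt": "..."}}.
        prompt = (event.get("input") or {}).get("prompt")
        if not prompt:
            return {"error": "Missing 'prompt' in request input"}
        image = pipe(prompt).images[0]  # keep only the first generated image
        buffer = io.BytesIO()
        image.save(buffer, format="PNG")  # encode the image as PNG in memory
        return {"body": base64.b64encode(buffer.getvalue()).decode("utf-8")}

    return handler


def start_worker():
    """Load the model once, then hand control to the RunPod serverless loop."""
    import runpod

    runpod.serverless.start({"handler": build_handler(load_pipeline())})
```

In the container, `start_worker()` would be called at the bottom of the handler file. Passing the pipeline into `build_handler` keeps the request logic separate from model loading, so it can be checked with a stand-in pipeline on a machine without a GPU.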
- Deploying the serverless endpoint: To deploy, I used the handy option of connecting this GitHub repo as the source for the RunPod endpoint. Any update to the main branch triggers an automatic build from the Dockerfile, and the platform takes care of rolling out the update incrementally to all workers.
To quickly test whether the serverless endpoint is working, I use the Requests tab to trigger a request with a prompt and verify the result in the worker logs.

The Telemetry option is also quite handy for understanding how the GPU is used during requests.

A sample request and its response are shown below. You can see that `output.body` contains the base64 image data.

Finally, the Metrics tab shows an overview of the number of requests over various time frames, along with other useful metrics such as 'Execution Time'.

- Testing the Endpoint: Now that I know my serverless endpoint is up and running, I created a quick React app on GitHub that accepts a text prompt, calls the endpoint deployed on RunPod, and displays the image from the response.
-- The React App has a simple UI to accept a search prompt.
-- It uses the prompt to trigger the image generation process via a POST request to the Serverless Endpoint.
-- It receives a Job ID in the response, which it uses to poll the API until the image has been generated and is returned as a Base64 string.
-- Once the API returns the image string, it decodes it to generate the image and displays it in the UI for the user.
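The same submit, poll, and decode flow can be sketched in Python for testing outside the browser. This is an illustration, not the React app's code: it assumes RunPod's `/run` and `/status/{job_id}` routes under `https://api.runpod.ai/v2/{endpoint_id}`, bearer-token authentication, and a reply whose image sits in `output.body`. The helper names `http_json` and `generate_image`, and the polling interval and timeout defaults, are hypothetical.

```python
import base64
import json
import time
import urllib.request

API_BASE = "https://api.runpod.ai/v2"


def http_json(method, url, api_key, payload=None):
    """Send a JSON request with a bearer token and return the decoded JSON reply."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def generate_image(prompt, endpoint_id, api_key, request=http_json,
                   poll_interval=2.0, timeout=300.0):
    """Submit a prompt, poll until the job finishes, and return the PNG bytes."""
    # Step 1: queue the job; the reply carries the job ID.
    job = request("POST", f"{API_BASE}/{endpoint_id}/run", api_key,
                  {"input": {"prompt": prompt}})
    status_url = f"{API_BASE}/{endpoint_id}/status/{job['id']}"

    # Step 2: poll the status route until the job completes or fails.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = request("GET", status_url, api_key)
        status = result.get("status")
        if status == "COMPLETED":
            # Step 3: decode the base64 string back into raw image bytes.
            return base64.b64decode(result["output"]["body"])
        if status in ("FAILED", "CANCELLED", "TIMED_OUT"):
            raise RuntimeError(f"Job ended with status {status}")
        time.sleep(poll_interval)
    raise TimeoutError("Job did not complete before the timeout")
```

The returned bytes can be written to a `.png` file for inspection. The `request` parameter exists only so the polling logic can be exercised with a stub instead of real network calls.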
The React App is hosted at https://anamika8.github.io/runpod_frontend/. Below is a sample result:
