In this post, we’ll briefly introduce the LLaVA model, and walk you through a tutorial on how to quickly deploy it to a production environment behind a REST API using Google Colab and Modelbit.
Intro to LLaVA - a multimodal vision model
Meet LLaVA (Large Language-and-Vision Assistant), a computer vision model that’s quietly transforming the way machines interact with the world. LLaVA is an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. Imagine a tool that doesn’t just see images but understands them, reads the text embedded in them, and reasons about their context—all while conversing with you in a way that feels almost natural. LLaVA isn’t just another incremental step in AI; it’s a leap towards a future where machines not only process information but truly grasp it.
LLaVA stands out because it blurs the lines between visual and textual understanding. Most models are specialists—they either excel at processing images or at parsing text. LLaVA does both. It’s like having a polyglot who speaks fluent visual and textual languages, making it adept at tasks like explaining a complex chart, recognizing and reading text in photos, or identifying intricate details in high-resolution images.
But LLaVA isn’t a monolith; it comes in various forms tailored to different needs. There’s the base LLaVA, perfect for general applications where you need robust, versatile AI assistance. Then there’s LLaVA-Med, a specialized version that delves into the complexities of medical imaging, providing insights into CT scans and MRIs that would make a radiologist proud. Imagine a tool that can spot subtle anomalies in a lung scan and explain them in detail, all within a matter of seconds.
For those with heavy computational demands, LLaVA scales up with versions packing up to 34 billion parameters, offering greater depth in understanding and analysis. And for developers, the MoE-LLaVA variant leverages a Mixture of Experts approach, efficiently distributing the workload to handle complex multimodal data without breaking a sweat.
LLaVA isn’t just a model; it’s a peek into the future of how we’ll interact with machines. It’s not about replacing human expertise but amplifying it—whether in medicine, education, or any field where understanding visuals is as crucial as reading text. In a world overflowing with data, LLaVA offers a way to cut through the noise and truly comprehend what’s in front of us.
Tutorial - Deploying LLaVA to a REST API Endpoint
Now, let’s walk through the steps to deploy the LLaVA multimodal vision and language model to a REST endpoint. Here are the steps we’ll take:
- We’ll use Google Colab to set up our LLaVA model, install necessary packages, and create an inference function that we can use to call our model.
- We’ll use Modelbit to deploy our LLaVA model to a production environment with a REST API. If you don’t have a Modelbit account, you can create a free trial here.
Setting up our notebook
We recommend using a Google Colab notebook with a high-memory runtime and a T4 GPU for this example. Because finding A100s on Colab can be hit-or-miss, we'll start with a version of the model that fits on a T4. Scroll down for a guide to deploying a larger version.
First, install the "accelerate", "bitsandbytes", and "modelbit" packages:
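In a Colab cell, that install looks like:

```python
!pip install accelerate bitsandbytes modelbit
```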
Go ahead and log in to Modelbit:
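Logging in is a single call with the modelbit package:

```python
import modelbit

# Prints an authentication link; click it to connect this notebook to your Modelbit workspace
mb = modelbit.login()
```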
Import the rest of your dependencies:
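The exact set depends on your code; a set that covers the load and inference functions sketched in this tutorial would be:

```python
from functools import cache
from io import BytesIO

import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration
```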
Finally, download the LLaVA weights from Hugging Face:
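One way to do that is with huggingface_hub's snapshot_download, pulling a checkpoint into a local "llava-hf" folder. The repo id below (the 7B LLaVA 1.5 checkpoint) is an assumption; swap in the variant you want to deploy:

```python
from huggingface_hub import snapshot_download

# Download the LLaVA 1.5 7B weights into a local "llava-hf" directory so we can
# ship them with the deployment later via extra_files
snapshot_download(repo_id="llava-hf/llava-1.5-7b-hf", local_dir="llava-hf")
```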
Building the model and performing an inference
First we'll write a function that loads the model:
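Here's a minimal sketch of that loader, assuming the weights live in the local "llava-hf" directory from the previous step:

```python
@cache
def load_model():
    # load_in_8bit=True quantizes the weights so the model fits in a T4's 16 GB of VRAM
    model = LlavaForConditionalGeneration.from_pretrained("./llava-hf", load_in_8bit=True)
    processor = AutoProcessor.from_pretrained("./llava-hf")
    return model, processor
```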
Note l"oad_in_8bit=true", which quantizes the model to fit in VRAM in a T4 GPU.
The "@cache" decorator will cause this function to only load the model once. After that, it stays in memory. The same behavior will be preserved in production in Modelbit.
Next we'll write our function that prompts the model and returns the result:
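Here's a sketch of that function. The name llava_inference is just a placeholder, and the prompt template assumes LLaVA 1.5's "USER: <image> ... ASSISTANT:" chat format:

```python
def llava_inference(image_url: str, prompt: str) -> str:
    model, processor = load_model()

    # Fetch the image from the URL
    image = Image.open(BytesIO(requests.get(image_url).content))

    # Combine the image and the text prompt in LLaVA's expected chat format
    full_prompt = f"USER: <image>\n{prompt} ASSISTANT:"
    inputs = processor(text=full_prompt, images=image, return_tensors="pt").to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=200)
    decoded = processor.decode(output_ids[0], skip_special_tokens=True)

    # Return only the model's answer, not the echoed prompt
    return decoded.split("ASSISTANT:")[-1].strip()
```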
This function downloads the image from the URL, prompts LLaVA with the image and the text prompt, and returns just the model's response.
Deploying LLaVA to REST
From here, deployment to a REST API is just one line of code:
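A sketch of that call, assuming Modelbit's mb.deploy with the extra_files, python_packages, and require_gpu parameters (pin the package versions to whatever is installed in your notebook):

```python
mb.deploy(
    llava_inference,
    extra_files=["llava-hf"],  # ship the downloaded weight files
    python_packages=["bitsandbytes==0.42.0", "accelerate==0.25.0"],  # needed for 8-bit loading
    require_gpu=True,
)
```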
We want to make sure to bring along the weight files from the "llava-hf" directory using the extra_files parameter.
And we want to specify the "bitsandbytes" and "accelerate" dependencies because transformers needs them in this case but does not declare them as dependencies.
Finally, of course, this model requires a GPU.
Optional: Deploying a larger version of LLaVA
If you can get an A100 from Colab, you can build a larger (non-quantized) version of the model and deploy that to Modelbit!
To do so, simply remove "load_in_8bit=True" from your from_pretrained calls. Since quantized models automatically load onto CUDA but default models do not, you'll also need to add .to("cuda") to your model construction. Here's the new "load_model" definition:
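Following the same sketch as before:

```python
@cache
def load_model():
    # No 8-bit quantization here, so we move the full-precision model onto the GPU ourselves
    model = LlavaForConditionalGeneration.from_pretrained("./llava-hf").to("cuda")
    processor = AutoProcessor.from_pretrained("./llava-hf")
    return model, processor
```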
The inference function is unchanged. Finally, when deploying, make sure you specify a large enough GPU:
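For example, requesting an A100 in the deploy call (assuming Modelbit's require_gpu parameter accepts a GPU type):

```python
mb.deploy(
    llava_inference,
    extra_files=["llava-hf"],
    require_gpu="A100",  # the non-quantized model needs more VRAM than a T4 offers
)
```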
No need for "accelerate" or "bitsandbytes" since we're no longer quantizing.
You can now call your LLaVA model from its production REST endpoint!
Next Steps
Now that you’ve successfully deployed a LLaVA model to production, you can start to build an application around it, or integrate it into an existing app.
If you have questions or want to learn more about LLaVA or Modelbit, feel free to reach out to us!
Want more tutorials for deploying ML models to production?
- Tutorial for Deploying Segment Anything Model to Production
- Tutorial for Deploying OpenAI's Whisper Model to Production
- Tutorial for Deploying Llama-2 to a REST API Endpoint
- Tutorial for Deploying a BERT Model to Production
- Tutorial for Deploying ResNet-50 to a REST API
- Tutorial for Deploying OWL-ViT to Production
- Tutorial for Deploying a Grounding DINO Model to Production
- Tutorial for Deploying Depth Anything Model to Production