7 Frameworks for Serving LLMs

While browsing LinkedIn, I came across a comment that made me realize there was a need for a simple yet insightful article on the subject:

“Despite the hype, I couldn’t find a straightforward MLOps engineer who could explain how we can deploy these open-source models and the associated costs.” — Usman Afridi

This article aims to compare different open-source libraries for LLM inference and serving. We will explore their killer features and shortcomings with real-world deployment examples, looking at frameworks such as vLLM, Text Generation Inference, OpenLLM, Ray Serve, and others.

If this has caught your attention, I invite you to read on and explore each of them in more detail.

For the hardware setup, I used a single A100 GPU with 40 GB of memory. As the model, I used LLaMA-1 13B, since it is supported by all the libraries on the list.
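To make the setup concrete, here is a minimal sketch of the kind of single-GPU deployment the comparison is based on, using vLLM's offline inference API. The model identifier and sampling parameters are my own illustrative choices, not the article's exact benchmark code:

```python
# Minimal sketch: offline inference with vLLM on a single A100 40 GB.
# Assumes `pip install vllm` and a Hugging Face checkpoint for
# LLaMA-1 13B (e.g. "huggyllama/llama-13b" -- an illustrative choice).
from vllm import LLM, SamplingParams

# Load the 13B model onto the GPU; fp16 weights (~26 GB)
# fit within the A100's 40 GB of memory.
llm = LLM(model="huggyllama/llama-13b", dtype="float16")

sampling = SamplingParams(temperature=0.8, max_tokens=128)

outputs = llm.generate(["What frameworks can serve LLMs?"], sampling)
for out in outputs:
    print(out.outputs[0].text)
```

Each framework in the comparison wraps this same basic step, loading the weights onto the GPU and generating tokens, behind its own serving layer.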

The article will not cover traditional libraries for serving deep learning models like TorchServe, KServe, or Triton Inference Server. Although you can run LLM inference with these libraries, I have focused only on frameworks explicitly designed to work with LLMs.
