Making ML workload serving easy using BentoML

Introduction

In the fast-moving world of MLOps, one of the biggest challenges is deploying and serving models efficiently. BentoML stands out in this area because it provides a simplified way to serve ML models in production. This blog post dives deep into BentoML, shows how to integrate it with KubeRay, and compares it with Ray Serve, providing a practical guide for data scientists and ML engineers.

BentoML is an open-source framework that bridges the gap between data science and DevOps. It brings packaging, deploying, and managing ML models built with various frameworks under one umbrella. The central concept is the "Bento," a standardized format for packaging ML models.

What is a Bento?

A Bento is a self-contained unit that includes:

  • The ML model itself

  • All necessary dependencies

  • Inference logic

  • API definition

  • Configuration files

This packaging ensures that your model can be easily shared, deployed, and run in different environments without compatibility issues.

Key Features of BentoML

Let's explore the key features of BentoML in more detail:

  1. Model Packaging: BentoML supports a wide range of ML frameworks, including TensorFlow, PyTorch, scikit-learn, XGBoost, and more. It provides a consistent API to save models from these frameworks:
import bentoml

# For scikit-learn
bentoml.sklearn.save_model("my_model", sklearn_model)

# For PyTorch
bentoml.pytorch.save_model("my_model", pytorch_model)

# For TensorFlow
bentoml.tensorflow.save_model("my_model", tensorflow_model)
  2. API Server: BentoML generates a production-ready API server that supports both REST and gRPC protocols. This server includes features like request parsing, data validation, and error handling out of the box.

  3. Adaptive Batching (a minimal configuration sketch follows this feature list):

    Batching is a critical technique in machine learning and data processing, where multiple inputs are grouped into a single batch for processing. This approach significantly enhances efficiency and throughput compared to handling inputs individually. Effective batching can dramatically improve performance, especially when dealing with high-volume or real-time data.

    Key Concepts in Batching

    Batch Window: The maximum duration a service waits to accumulate inputs into a batch for processing. This ensures timely processing, especially in low-traffic conditions, by preventing long waits for small batch completion.

    Batch Size: The maximum number of inputs a batch can contain before it’s processed. This maximizes throughput by leveraging the full capacity of the system’s resources within the constraint of the batch window.

    [ Image Source: https://docs.bentoml.com/en/latest/guides/adaptive-batching.html ]

  4. Microservices Architecture: BentoML supports deploying models as microservices, which allows for:
  • Independent scaling of different models

  • Easier updates and rollbacks

  • Better resource utilization

  5. Monitoring and Observability: BentoML integrates with popular monitoring tools like Prometheus and Grafana. It provides built-in metrics for:
  • Request latency

  • Throughput

  • Error rates

  • System resource usage
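
To make the adaptive batching feature concrete, here is a minimal sketch of how it can be enabled with the class-based service API. It assumes a scikit-learn model has already been saved to the model store as "my_model" (as in the packaging example above); the tuning values are illustrative, not recommendations.

import bentoml
import numpy as np

@bentoml.service(name="batched_classifier")
class BatchedClassifier:
    def __init__(self):
        # Load the model saved earlier with bentoml.sklearn.save_model("my_model", ...)
        self.model = bentoml.sklearn.load_model("my_model:latest")

    # batchable=True enables adaptive batching: concurrent requests are grouped
    # into a single call. max_batch_size caps the batch size, and max_latency_ms
    # bounds the batch window described above.
    @bentoml.api(batchable=True, max_batch_size=32, max_latency_ms=10)
    def predict(self, inputs: np.ndarray) -> np.ndarray:
        return self.model.predict(inputs)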

2. Getting Started with BentoML: A Detailed Example

Let's walk through a more detailed example of using BentoML to serve a machine learning model.

In this example, we will use BentoML with Microsoft's Phi-2 model to evaluate whether a candidate is suitable for a given job role.

1. Setting Up the Environment

First, ensure you have the necessary dependencies installed:

Note: you can also install faiss-gpu instead of faiss-cpu, depending on your use case and hardware.

pip install bentoml langchain transformers torch faiss-cpu pydantic

2. Implementing the Custom Language Model

We'll start by implementing a custom language model using the Phi-2 model from Microsoft. This model will be used for generating responses based on the resume content and job description.

from typing import Any, List, Optional

import torch
from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.llms.base import LLM
from transformers import AutoTokenizer, AutoModelForCausalLM

class Phi2LLM(LLM):
    model_name: str = "microsoft/phi-2"
    tokenizer: Optional[Any] = None
    model: Optional[Any] = None

    def __init__(self):
        super().__init__()

    def load_model(self):
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name, trust_remote_code=True)
        self.model = AutoModelForCausalLM.from_pretrained(self.model_name, trust_remote_code=True)
        self.tokenizer.pad_token = self.tokenizer.eos_token

    def _call(self, prompt: str, stop: Optional[List[str]] = None, run_manager: Optional[CallbackManagerForLLMRun] = None) -> str:
        if self.model is None:
            self.load_model()

        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model = self.model.to(device)
        inputs = self.tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=2048).to(device)
        with torch.no_grad():
            outputs = self.model.generate(**inputs, max_new_tokens=512, pad_token_id=self.tokenizer.eos_token_id)
        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return response

    @property
    def _llm_type(self) -> str:
        return "custom"

3. Creating the BentoML Service

Now, let's create a BentoML service that uses our custom language model along with document processing techniques to analyze resumes.

import bentoml
from pydantic import BaseModel, Field
from typing import Dict
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

class JobApplication(BaseModel):
    job_description: str = Field(
        default="We want to hire an SRE for our Company One2N who must be skilled with Kubernetes (CKA/CKAD Level), DevOps Practices, IaC Tools and Linux"
    )
    resume: str

@bentoml.service(
    name="resume_analyzer",
    resources={
        "cpu": "1",
        "memory": "2Gi",
        "gpu": 1
    }
)
class ResumeAnalyzer:
    def __init__(self):
        self.model = Phi2LLM()
        self.embeddings = HuggingFaceEmbeddings()

    # With the class-based service API, input/output handling comes from the type hints
    @bentoml.api
    def analyze(self, job_application: JobApplication) -> Dict[str, str]:
        job_description = job_application.job_description
        resume_content = job_application.resume
        query = f"""
        Assume you are the HR representative responsible for looking over hiring operations at the company.
        The company is hiring for the role of a Site Reliability Engineer. 
        Given the Job Description for the role:
            {job_description}
        Please share your professional evaluation on whether the candidate's profile aligns with the role. 
        Highlight the strengths and weaknesses of the applicant in relation to the specified job requirements.
        """

        answer = self.process_resume(resume_content, query)

        return {"result": answer}

    def process_resume(self, resume_content: str, query: str) -> str:
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
        texts = text_splitter.split_text(resume_content)

        db = FAISS.from_texts(texts, self.embeddings)

        qa_chain = RetrievalQA.from_chain_type(
            llm=self.model,
            chain_type="stuff",
            retriever=db.as_retriever(search_kwargs={"k": 3}),
            return_source_documents=True
        )

        result = qa_chain({"query": query})

        return result['result']

4. Building and Serving the Model

To package and serve the model using BentoML, follow these steps:

  1. Package the service as a Bento. With the service API used above there is no separate model-saving call for the service itself; instead, create a bentofile.yaml in your project directory that tells BentoML how to bundle the service code and its Python dependencies:

service: "api:ResumeAnalyzer"
labels:
  owner: bentoml-team
  project: gallery
include:
  - "*.py"
python:
  packages:
    - torch
    - transformers
    - langchain
    - langchain_community
    - langchain_huggingface
    - faiss-gpu
    - pypdf
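
  2. Build and serve the model:

Assuming the service code above is saved as api.py (matching the service: "api:ResumeAnalyzer" entry in bentofile.yaml), the commands look roughly like this:

# Package the service, its code, and dependencies into a Bento
bentoml build

# Start the API server locally (defaults to port 3000)
bentoml serve api:ResumeAnalyzer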

5. Using the Served Model

Once the model is served, you can send requests to it using HTTP. Here's an example using Python's requests library:

import requests
import json

url = "<http://localhost:3000/analyze>"
headers = {"content-type": "application/json"}
data = {
    "job_description":  "We want to hire an SRE for our Company One2N who must be skilled with Kubernetes (CKA/CKAD Level), DevOps Practices, IaC Tools and Linux",
    "resume": "'"$(base64 -w 0 ./path/to/your/resume)"'"
}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json())
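
Alternatively, recent BentoML releases ship a Python client that exposes each service endpoint as a method. Here is a minimal sketch, assuming the server from the previous step is running on localhost:3000 and that passing the payload as a plain dict is acceptable for the pydantic-typed parameter:

import bentoml

# The client generates one method per endpoint (here: analyze)
with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    result = client.analyze(
        job_application={
            "job_description": "We want to hire an SRE for our Company One2N who must be skilled with Kubernetes (CKA/CKAD Level), DevOps Practices, IaC Tools and Linux",
            "resume": open("./path/to/your/resume").read(),
        }
    )
    print(result)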

This example demonstrates how BentoML can be used to serve a complex machine learning pipeline that includes custom language models, document processing, and retrieval techniques. The BentoML service encapsulates all the necessary components and provides a simple API for resume analysis.

Key benefits of using BentoML in this scenario include:

  1. Easy packaging of the entire ML pipeline, including the custom language model and supporting components.

  2. Automatic API generation for the service.

  3. Resource management for CPU, memory, and GPU allocation.

  4. Simplified deployment and scaling options.

3. Comparing BentoML to Ray Serve: In-Depth Analysis

While both BentoML and Ray Serve are designed for serving ML models, they have different strengths and use cases. Let's explore these differences in more detail:

BentoML:

  1. Ease of Use: BentoML provides a more straightforward, unified interface for packaging and serving models from various ML frameworks. It abstracts away many of the complexities of model serving, making it easier for data scientists to deploy their models without extensive DevOps knowledge.
@bentoml.service(name="simple_service")
class SimpleService:
    @bentoml.api(input=bentoml.io.JSON(), output=bentoml.io.JSON())
    def predict(self, input_data):
        # Your prediction logic here
        return {"result": result}
  2. Production Readiness: BentoML offers out-of-the-box support for production features like monitoring, logging, and adaptive batching. It also provides built-in support for model versioning and rollback.

  3. Framework Agnostic: BentoML works well with a wide range of ML frameworks and can easily switch between them. This is particularly useful in organizations that use multiple frameworks or are considering switching frameworks in the future.

  4. Microservices Architecture: BentoML is designed with microservices in mind, making it easier to deploy and scale individual models. This aligns well with modern cloud-native architectures.

When to Choose BentoML vs Ray Serve:

  • Choose BentoML if:

    • You need a simple, production-ready solution for model serving

    • You work with multiple ML frameworks and want a unified serving interface

    • You prefer a microservices architecture for your ML deployments

  • Choose Ray Serve if:

    • You need to perform complex, distributed computations as part of your serving logic

    • You're already using Ray for other parts of your ML pipeline

    • You require fine-grained control over scaling and deployment patterns

Conclusion

Both BentoML and Ray Serve offer powerful solutions for serving ML models, each with its own strengths. BentoML shines in its simplicity and production-readiness, making it an excellent choice for teams looking to quickly deploy and manage ML models at scale. Its integration capabilities with tools like KubeRay further enhance its flexibility in cloud-native environments.

On the other hand, Ray Serve excels in scenarios requiring complex, distributed computations and offers tighter integration with the broader Ray ecosystem. It provides more flexibility and control, which can be beneficial for advanced use cases.

Ultimately, the choice between BentoML and Ray Serve will depend on your specific requirements, existing infrastructure, and the complexity of your serving needs. By understanding the strengths and capabilities of each tool, you can make an informed decision that best suits your ML serving requirements.

Remember, the field of MLOps is rapidly evolving, and it's always worth keeping an eye on the latest developments in both BentoML and Ray Serve as they continue to enhance their features and capabilities.