Model Serving

Model serving means making a trained model available to users and other software components.

The goal of model serving is to make the model queryable: to send data inputs and obtain model outputs (predictions) through high-level interfaces for TensorFlow, XGBoost, scikit-learn, PyTorch and Hugging Face Transformer/LLM models, using standardized data plane protocols.

Models can be queried directly from applications, for example with TensorFlow's model.predict() function. The limitation of this approach is that every user or application needs access to the stored model (architecture + parameters), which does not scale for a system with multiple users and applications.

Another option is API Model Serving, which exposes a model with a REST API.

API Model Serving

The idea of API model serving is to expose a model via a REST API.

A model server with access to the model's parameters and architecture is created.
To query the model, only the server's IP and the model endpoint are needed, for example:
curl -v "http://MODEL_IP:PORT/v1/models/custom-model:predict" -d @./input.json
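
The contents of input.json depend on the model. For the v1 data plane protocol the inputs are wrapped in an instances list; a minimal sketch, assuming a model that takes a flat vector of four numeric features:

{
  "instances": [
    [6.8, 2.8, 4.8, 1.4]
  ]
}

The server replies with a matching predictions list.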

Autoscaling

Scaling is an important aspect of model serving. With Kubeflow, it is possible to configure a serverless infrastructure, so that the number of server instances (replicas) increases with the number of requests to the model's API.
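
For example, the predictor section of an InferenceService accepts replica bounds, and a Knative annotation can set the per-replica concurrency target. A minimal sketch (the name, target and storage location are placeholders, not taken from this page):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: autoscale-example                  # placeholder service name
  annotations:
    autoscaling.knative.dev/target: "10"   # target concurrent requests per replica
spec:
  predictor:
    minReplicas: 0                         # allow scale-to-zero when idle
    maxReplicas: 5                         # upper bound on the number of replicas
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://BUCKET/model"      # placeholder model location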

Workflow

Model serving is simplified with Kubeflow, requiring two main steps:

  • Store a trained model in a persistent storage location. Storage can be:
    • An S3 bucket, world accessible or with credentials - example.
    • A Kubernetes PVC, which is cluster-specific - example.
    • A URI, which could be a GitHub location - example.
  • Create an InferenceService CRD to deploy a model server. Define a YAML file with (see the minimal example after this list):
    • The name of the service.
    • The location from which to obtain the model for serving (from the previous step).
    • Resources, for example GPUs - example.
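
A minimal sketch of such a definition, assuming a scikit-learn model stored in the public KServe sample bucket (replace the name and storageUri with your own):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris                # name of the service
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn               # framework of the stored model
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"   # location from the previous step
      resources:
        limits:
          cpu: "1"
          memory: 2Gi

Apply it with kubectl apply -f and check its status with kubectl get inferenceservice.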

GPU Drivers

If you rely on GPUs for your inference servers, it is likely that the required drivers are not available inside the image. You can instruct the system to make them available with a simple label:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "tensorflow-gpu"
  labels:
    nvidia-drivers: "true"
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  predictor:
    tensorflow:
      storageUri: "s3://BUCKET"
      runtimeVersion: "2.6.2-gpu"
      resources:
        limits:
          nvidia.com/gpu: 1

Frameworks

Model serving is supported for major ML frameworks such as TensorFlow, PyTorch and scikit-learn.
Support is also provided for XGBoost, LightGBM, PMML and ONNX Runtime, among others.

More information is provided on the KFServing samples documentation page.

Examples

The TensorFlow and PyTorch examples use already trained models from the KFServing GitHub repository.
For training and serving a custom model, please refer to the custom model example.

Single Model

Multi Model Serving

Custom

A few steps are needed to serve a custom model:

  • Create a bucket at s3.cern.ch using the documentation
    • Make sure the credentials are properly obtained
  • Connect to a running Notebook server
  • In the Notebook terminal, clone the git repository with examples
    git clone https://gitlab.cern.ch/mlops/platform/kubeflow/kubeflow-examples.git
    
  • Navigate to examples/serving/custom in a Notebook server
  • Fill in the credentials file with your bucket credentials (the expected file format is sketched at the end of these steps)
  • Create ~/.aws directory
    mkdir ~/.aws
    
  • Copy credentials file to ~/.aws directory
    cp credentials ~/.aws
    
  • Train the model using the model_training.ipynb notebook

    • Make sure the bucket variable corresponds to the bucket created in the first step
    • After running the notebook, the model should be available in the bucket
  • Fill in the storage URI and bucket credentials in the InferenceService definition files (custom_model_cpu.yaml and custom_model_gpu.yaml)

  • Create an InferenceService using CPU or GPU:

    • CPU: kubectl apply -f custom_model_cpu.yaml
    • GPU: kubectl apply -f custom_model_gpu.yaml
  • Check the status of InferenceService with: kubectl get inferenceservice

  • Obtain an Authentication Session Cookie from Chrome

    • Click View -> Developer -> Developer Tools -> Network
    • Navigate to ml.cern.ch
    • Check the Request Headers
      • Copy the value of the authservice_session cookie
  • Run Inference

    • AUTH_SESSION is the authentication session cookie obtained in the previous step
    • NAMESPACE is a personal Kubeflow namespace, which can be seen in the top left corner of the UI
    • curl -H 'Cookie: authservice_session=AUTH_SESSION' -H 'Host: custom-model-cpu-predictor.NAMESPACE.ml.cern.ch' https://ml.cern.ch/v1/models/custom-model-cpu:predict -d @./input.json
  • Expected output

    {
        "predictions": [[0.779803514]
        ]
    }
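
The credentials file referenced in the steps above follows the standard AWS credentials layout; a sketch with placeholder values (the actual keys come from the bucket created in the first step):

[default]
aws_access_key_id = <ACCESS_KEY>
aws_secret_access_key = <SECRET_KEY>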
    

Running HuggingFace Models

The Hugging Face serving runtime implements two backends, Hugging Face and vLLM, that can serve Hugging Face models out of the box. The preprocess and post-process handlers are already implemented for the different ML tasks, for example text classification, token classification, text generation, text2text generation and fill-mask. For more information, check out the official KServe documentation.

Note: Some models on Hugging Face are gated and you must be authenticated to access them; to authenticate, create a token here.
Then specify the HF_TOKEN environment variable in your InferenceService manifest or mount it from a Kubernetes secret in your namespace.
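
A sketch of pulling the token from a secret instead of hard-coding it, assuming a secret named hf-secret with a key HF_TOKEN already exists in your namespace; it replaces the env section used in the manifests below:

env:
- name: HF_TOKEN
  valueFrom:
    secretKeyRef:
      name: hf-secret   # hypothetical secret holding the token
      key: HF_TOKEN     # key inside the secret
      optional: false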

  1. Serve the BERT model using the KServe Python Hugging Face runtime for both the preprocess (tokenization) / postprocess steps and inference.

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: huggingface-bert
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      predictor:
        model:
          modelFormat:
            name: huggingface
          args:
          - --model_name=bert
          - --model_id=bert-base-uncased
          - --tensor_input_names=input_ids
          env:
          - name: HF_TOKEN
            value: <YOUR_HF_TOKEN>
          resources:
            limits:
              cpu: "1"
              memory: 2Gi
              nvidia.com/gpu: "1"
            requests:
              cpu: 100m
              memory: 2Gi
    

  2. Serve the BERT model using the Triton inference runtime, with a KServe transformer running the Hugging Face runtime for the preprocess (tokenization) and postprocess steps.

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: huggingface-triton
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      predictor:
        model:
          args:
          - --log-verbose=1
          modelFormat:
            name: triton
          protocolVersion: v2
          resources:
            limits:
              cpu: "1"
              memory: 8Gi
              nvidia.com/gpu: "1"
            requests:
              cpu: "1"
              memory: 8Gi
          runtimeVersion: 23.10-py3
          storageUri: gs://kfserving-examples/models/triton/huggingface/model_repository
      transformer:
        containers:
        - args:
          - --model_name=bert
          - --model_id=bert-base-uncased
          - --predictor_protocol=v2
          - --tensor_input_names=input_ids
          env:
          - name: HF_TOKEN
            value: <YOUR_HF_TOKEN>
          image: kserve/huggingfaceserver:latest
          name: kserve-container
          resources:
            limits:
              cpu: "1"
              memory: 2Gi
            requests:
              cpu: 100m
              memory: 2Gi
    

  3. Serve the llama2 model using the KServe Hugging Face vLLM runtime. When vLLM supports a model, it is used as the default backend; otherwise the KServe Hugging Face runtime is used as a fallback. You can find the models supported by vLLM here.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama2
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
      - --model_name=llama2
      - --model_id=meta-llama/Llama-2-7b-chat-hf
      env:
      - name: HF_TOKEN
        value: <YOUR_HF_TOKEN>
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"

If vLLM needs to be disabled, include the flag --backend=huggingface in the container args; in this case the Hugging Face backend is used:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-t5
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=t5
        - --model_id=google-t5/t5-small
        - --backend=huggingface
      env:
      - name: HF_TOKEN
        value: <YOUR_HF_TOKEN>
      resources:
        limits:
          cpu: "1"
          memory: 4Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"

Perform the inference:

Apart from the usual inference endpoints (http://${INGRESS_HOST}/v1/models/${MODEL_NAME}:predict for the v1 protocol and http://${INGRESS_HOST}/v2/models/${MODEL_NAME}/infer for the v2 protocol), KServe Hugging Face runtime deployments also support the OpenAI v1/completions and v1/chat/completions endpoints for inference.
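
The examples below assume these variables are set. One way to do so (a sketch, using the t5 service from the previous example; on this cluster the ingress is ml.cern.ch and requests also need the authservice_session cookie described elsewhere on this page):

# Derive the service hostname from the InferenceService status URL
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-t5 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
# Cluster ingress host
INGRESS_HOST=ml.cern.ch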

Sample OpenAI Completions request:

curl -H "content-type:application/json" \
-H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}/openai/v1/completions \
-d '{"model": "t5", "prompt": "translate English to German: The house is wonderful.", "stream":false, "max_tokens": 30 }'
Sample Expected Reply:

{
  "id": "de53f527-9cb9-47a5-9673-43d180b704f2",
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "text": "Das Haus ist wunderbar."
    }
  ],
  "created": 1717998661,
  "model": "t5",
  "system_fingerprint": null,
  "object": "text_completion",
  "usage": {
    "completion_tokens": 7,
    "prompt_tokens": 11,
    "total_tokens": 18
  }
}

Sample OpenAI Chat request:

curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-v http://${INGRESS_HOST}/openai/v1/chat/completions \
-d '{"model":"llama3","messages":[{"role":"system","content":"You are an assistant that speaks like Shakespeare."},{"role":"user","content":"Write a poem about colors"}],"max_tokens":30,"stream":false}'

Sample Expected Reply:

{
   "id": "cmpl-9aad539128294069bf1e406a5cba03d3",
   "choices": [
     {
       "finish_reason": "length",
       "index": 0,
       "message": {
         "content": "  O, fair and vibrant colors, how ye doth delight\nIn the world around us, with thy hues so bright!\n",
         "tool_calls": null,
         "role": "assistant",
         "function_call": null
       },
       "logprobs": null
     }
   ],
   "created": 1718638005,
   "model": "llama3",
   "system_fingerprint": null,
   "object": "chat.completion",
   "usage": {
     "completion_tokens": 30,
     "prompt_tokens": 37,
     "total_tokens": 67
   }
 } 

Additional

Additional examples are provided at the kserve samples page.

Running Inference from outside the Service

  • Obtain an Authentication Session Cookie from Chrome

    • Click View -> Developer -> Developer Tools -> Network
    • Navigate to ml.cern.ch
    • Check the Request Headers
      • Copy the value of the authservice_session cookie
  • Run Inference

    • AUTH_SESSION is the authentication session cookie obtained in the previous step
    • INFERENCESERVICE_NAME is the name of your service e.g. sklearn-iris
    • MODEL_NAME is the name of the model in your inference service e.g. iris
    • NAMESPACE is a personal Kubeflow namespace, which can be seen in the top left corner of the UI
    • curl -H 'Cookie: authservice_session=AUTH_SESSION' -H 'Host: INFERENCESERVICE_NAME.NAMESPACE.ml.cern.ch' https://ml.cern.ch/v1/models/MODEL_NAME:predict -d @./input.json

Note: Session cookies are short-lived, expiring in approximately 30 minutes, so they must be renewed in order to authenticate successfully against the inference endpoint.

Service Account Token Method

  • To be announced