Model Serving

Model serving means making a trained model available to users and other software components.

The goal of model serving is to make the model queryable: to send data inputs and obtain model outputs (predictions) through high-level interfaces for TensorFlow, XGBoost, scikit-learn, PyTorch and Hugging Face Transformer/LLM models, using standardized data plane protocols.

Models can be queried directly from applications, for example with TensorFlow's model.predict() function. The limitation of this approach is that every user or application needs access to the stored model (architecture + parameters), which does not scale for a system with multiple users and applications.

Another option is API Model Serving, which exposes a model with a REST API.

API Model Serving

The idea of API model serving is to expose a model via a REST API.

A model server with access to the model's parameters and architecture is created.
To query the model, only the server's IP and the model endpoint are needed, for example:
curl -v "http://MODEL_IP:PORT/v1/models/custom-model:predict" -d @./input.json
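
The contents of input.json depend on the model. For the v1 data plane protocol the inputs are wrapped in an instances list; a minimal sketch, assuming a model that takes a flat vector of four numeric features:

{
  "instances": [
    [6.8, 2.8, 4.8, 1.4]
  ]
}

The server replies with a matching predictions list.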

Autoscaling

Scaling is an important aspect of model serving. With Kubeflow, it is possible to configure a serverless infrastructure, so that the number of server instances (replicas) increases with the number of requests to the model's API.
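
For example, the predictor section of an InferenceService accepts replica bounds, and a Knative annotation can set the per-replica concurrency target. A minimal sketch (the name, target and storage location are placeholders, not taken from this page):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: autoscale-example                  # placeholder service name
  annotations:
    autoscaling.knative.dev/target: "10"   # target concurrent requests per replica
spec:
  predictor:
    minReplicas: 0                         # allow scale-to-zero when idle
    maxReplicas: 5                         # upper bound on the number of replicas
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://BUCKET/model"      # placeholder model location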

Workflow

Model serving is simplified with Kubeflow, requiring two main steps:

  • Store a trained model in a persistent storage location. Storage can be:
    • An S3 bucket, world accessible or with credentials - example.
    • A Kubernetes PVC, which is cluster-specific - example.
    • A URI, which could be a GitHub location - example.
  • Create an InferenceService CRD to deploy a model server. Define a YAML file with (see the minimal example after this list):
    • The name of the service.
    • The location from which to obtain the model for serving (from the previous step).
    • Resources, for example GPUs - example.
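
A minimal sketch of such a definition, assuming a scikit-learn model stored in the public KServe sample bucket (replace the name and storageUri with your own):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris                # name of the service
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn               # framework of the stored model
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"   # location from the previous step
      resources:
        limits:
          cpu: "1"
          memory: 2Gi

Apply it with kubectl apply -f and check its status with kubectl get inferenceservice.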

GPU Drivers

If you rely on GPUs for your inference servers, it is likely that the required drivers are not available inside the image. You can instruct the system to make them available with a simple label:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "tensorflow-gpu"
  labels:
    nvidia-drivers: "true"
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  predictor:
    tensorflow:
      storageUri: "s3://BUCKET"
      runtimeVersion: "2.6.2-gpu"
      resources:
        limits:
          nvidia.com/gpu: 1

Frameworks

Model serving is supported for major ML frameworks such as TensorFlow, PyTorch and scikit-learn.
Support is also provided for XGBoost, LightGBM, PMML and ONNX Runtime, among others.

More information is provided on the KFServing samples documentation page.

Examples

The TensorFlow and PyTorch examples use already trained models from the KFServing GitHub repository.
For training and serving a custom model, please refer to the custom model example.

Single Model

Multi Model Serving

Custom

A few steps are needed to serve a custom model:

  • Create a bucket at s3.cern.ch using the documentation
    • Make sure the credentials are properly obtained
  • Connect to a running Notebook server
  • In the Notebook terminal, clone the git repository with examples
    git clone https://gitlab.cern.ch/mlops/platform/kubeflow/kubeflow-examples.git
    
  • Navigate to examples/serving/custom in a Notebook server
  • Fill in the credentials file with your bucket credentials (the expected file format is sketched at the end of these steps)
  • Create ~/.aws directory
    mkdir ~/.aws
    
  • Copy credentials file to ~/.aws directory
    cp credentials ~/.aws
    
  • Train the model using the model_training.ipynb notebook

    • Make sure the bucket variable corresponds to the bucket created in the first step
    • After running the notebook, the model should be available in the bucket
  • Fill in the storage URI and bucket credentials in the InferenceService definition files (custom_model_cpu.yaml and custom_model_gpu.yaml)

  • Create an InferenceService using CPU or GPU:

    • CPU: kubectl apply -f custom_model_cpu.yaml
    • GPU: kubectl apply -f custom_model_gpu.yaml
  • Check the status of InferenceService with: kubectl get inferenceservice

  • Obtain an Authentication Session Cookie from Chrome

    • Click View -> Developer -> Developer Tools -> Network
    • Navigate to ml.cern.ch
    • Check the Request Headers
      • Copy the value of the authservice_session cookie
  • Run Inference

    • AUTH_SESSION is the authentication session cookie obtained in the previous step
    • NAMESPACE is a personal Kubeflow namespace, which can be seen in the top left corner of the UI
    • curl -H 'Cookie: authservice_session=AUTH_SESSION' -H 'Host: custom-model-cpu-predictor.NAMESPACE.ml.cern.ch' https://ml.cern.ch/v1/models/custom-model-cpu:predict -d @./input.json
  • Expected output

    {
        "predictions": [[0.779803514]
        ]
    }
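
The credentials file referenced in the steps above follows the standard AWS credentials layout; a sketch with placeholder values (the actual keys come from the bucket created in the first step):

[default]
aws_access_key_id = <ACCESS_KEY>
aws_secret_access_key = <SECRET_KEY>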
    

Running HuggingFace Models

The Hugging Face serving runtime implements two backends, Hugging Face and vLLM, that can serve Hugging Face models out of the box. The preprocess and post-process handlers are already implemented for the different ML tasks, for example text classification, token classification, text generation, text2text generation and fill-mask. For more information, check out the official KServe documentation.

Note: Some models on Hugging Face are gated and you must be authenticated to access them; to authenticate, create a token here.
Then specify the HF_TOKEN environment variable in your InferenceService manifest or mount it from a Kubernetes secret in your namespace.
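
A sketch of pulling the token from a secret instead of hard-coding it, assuming a secret named hf-secret with a key HF_TOKEN already exists in your namespace; it replaces the env section used in the manifests below:

env:
- name: HF_TOKEN
  valueFrom:
    secretKeyRef:
      name: hf-secret   # hypothetical secret holding the token
      key: HF_TOKEN     # key inside the secret
      optional: false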

  1. Serve the BERT model using the KServe Python Hugging Face runtime for both the preprocess (tokenization) / postprocess steps and inference.

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: huggingface-bert
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      predictor:
        model:
          modelFormat:
            name: huggingface
          args:
          - --model_name=bert
          - --model_id=bert-base-uncased
          - --tensor_input_names=input_ids
          env:
          - name: HF_TOKEN
            value: <YOUR_HF_TOKEN>
          resources:
            limits:
              cpu: "1"
              memory: 2Gi
              nvidia.com/gpu: "1"
            requests:
              cpu: 100m
              memory: 2Gi
    

  2. Serve the BERT model using the Triton inference runtime, with a KServe transformer running the Hugging Face runtime for the preprocess (tokenization) and postprocess steps.

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: huggingface-triton
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      predictor:
        model:
          args:
          - --log-verbose=1
          modelFormat:
            name: triton
          protocolVersion: v2
          resources:
            limits:
              cpu: "1"
              memory: 8Gi
              nvidia.com/gpu: "1"
            requests:
              cpu: "1"
              memory: 8Gi
          runtimeVersion: 23.10-py3
          storageUri: gs://kfserving-examples/models/triton/huggingface/model_repository
      transformer:
        containers:
        - args:
          - --model_name=bert
          - --model_id=bert-base-uncased
          - --predictor_protocol=v2
          - --tensor_input_names=input_ids
          env:
          - name: HF_TOKEN
            value: <YOUR_HF_TOKEN>
          image: kserve/huggingfaceserver:latest
          name: kserve-container
          resources:
            limits:
              cpu: "1"
              memory: 2Gi
            requests:
              cpu: 100m
              memory: 2Gi
    

  3. Serve the llama2 model using the KServe Hugging Face vLLM runtime. When vLLM supports a model, it is used as the default backend; otherwise the KServe Hugging Face runtime is used as a fallback. You can find the models supported by vLLM here.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama2
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
      - --model_name=llama2
      - --model_id=meta-llama/Llama-2-7b-chat-hf
      env:
      - name: HF_TOKEN
        value: <YOUR_HF_TOKEN>
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"

If vLLM needs to be disabled, include the flag --backend=huggingface in the container args; in this case the Hugging Face backend is used:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-t5
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=t5
        - --model_id=google-t5/t5-small
        - --backend=huggingface
      env:
      - name: HF_TOKEN
        value: <YOUR_HF_TOKEN>
      resources:
        limits:
          cpu: "1"
          memory: 4Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"

Perform the inference:

Apart from the usual inference endpoints (http://${INGRESS_HOST}/v1/models/${MODEL_NAME}:predict for the v1 protocol and http://${INGRESS_HOST}/v2/models/${MODEL_NAME}/infer for the v2 protocol), KServe Hugging Face runtime deployments also support the OpenAI v1/completions and v1/chat/completions endpoints for inference.
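
The examples below assume these variables are set. One way to do so (a sketch, using the t5 service from the previous example; on this cluster the ingress is ml.cern.ch and requests also need the authservice_session cookie described elsewhere on this page):

# Derive the service hostname from the InferenceService status URL
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-t5 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
# Cluster ingress host
INGRESS_HOST=ml.cern.ch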

Sample OpenAI Completions request:

curl -H "content-type:application/json" \
-H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}/openai/v1/completions \
-d '{"model": "t5", "prompt": "translate English to German: The house is wonderful.", "stream":false, "max_tokens": 30 }'
Sample Expected Reply:

{
  "id": "de53f527-9cb9-47a5-9673-43d180b704f2",
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "text": "Das Haus ist wunderbar."
    }
  ],
  "created": 1717998661,
  "model": "t5",
  "system_fingerprint": null,
  "object": "text_completion",
  "usage": {
    "completion_tokens": 7,
    "prompt_tokens": 11,
    "total_tokens": 18
  }
}

Sample OpenAI Chat request:

curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-v http://${INGRESS_HOST}/openai/v1/chat/completions \
-d '{"model":"llama3","messages":[{"role":"system","content":"You are an assistant that speaks like Shakespeare."},{"role":"user","content":"Write a poem about colors"}],"max_tokens":30,"stream":false}'

Sample Expected Reply:

{
   "id": "cmpl-9aad539128294069bf1e406a5cba03d3",
   "choices": [
     {
       "finish_reason": "length",
       "index": 0,
       "message": {
         "content": "  O, fair and vibrant colors, how ye doth delight\nIn the world around us, with thy hues so bright!\n",
         "tool_calls": null,
         "role": "assistant",
         "function_call": null
       },
       "logprobs": null
     }
   ],
   "created": 1718638005,
   "model": "llama3",
   "system_fingerprint": null,
   "object": "chat.completion",
   "usage": {
     "completion_tokens": 30,
     "prompt_tokens": 37,
     "total_tokens": 67
   }
 } 

Additional

Additional examples are provided at the kserve samples page.

Running Inference from outside the Service

  • Obtain an Authentication Session Cookie from Chrome

    • Click View -> Developer -> Developer Tools -> Network
    • Navigate to ml.cern.ch
    • Check the Request Headers
      • Copy the value of the authservice_session cookie
  • Run Inference

    • AUTH_SESSION is the authentication session cookie obtained in the previous step
    • INFERENCESERVICE_NAME is the name of your service e.g. sklearn-iris
    • MODEL_NAME is the name of the model in your inference service e.g. iris
    • NAMESPACE is a personal Kubeflow namespace, which can be seen in the top left corner of the UI
    • curl -H 'Cookie: authservice_session=AUTH_SESSION' -H 'Host: INFERENCESERVICE_NAME.NAMESPACE.ml.cern.ch' https://ml.cern.ch/v1/models/MODEL_NAME:predict -d @./input.json

Note: Session cookies are short-lived, expiring in approximately 30 minutes, so they must be renewed in order to authenticate successfully against the inference endpoint.

Service Account Token Method

  • To be announced