Model Serving
Model serving means making a trained model available to users and to other software components.
The goal of model serving is to be able to query the model, i.e. to send input data and obtain model outputs (predictions), through high-level interfaces for TensorFlow, XGBoost, scikit-learn, PyTorch and Hugging Face Transformer/LLM models, using standardized data plane protocols.
Models can be queried directly from applications, for example with the TensorFlow `model.predict()` function. The limitation of this approach is that every user or application needs access to the stored model (architecture + parameters), which does not scale to a system with multiple users and applications.
Another option is API model serving, which exposes the model through a REST API.
API Model Serving
The idea of API model serving is to expose the model via a REST API.
A model server with access to the model's parameters and architecture is created.
To query the model, only the server's IP address and the model endpoint are needed, for example:
curl -v "http://MODEL_IP:PORT/v1/models/custom-model:predict" -d @./input.json
Autoscaling
Scaling is an important aspect of model serving. With Kubeflow, it is possible to configure a serverless infrastructure, so that the number of server instances (replicas) increases with the number of requests to the model's API.
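As a sketch of how this can be configured (field names follow the KServe v1beta1 spec; the values, model framework and storage URI below are placeholders), replica bounds and a scaling target can be set on the predictor:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: scaled-model
spec:
  predictor:
    minReplicas: 0        # allow scale-to-zero when there is no traffic
    maxReplicas: 5        # upper bound on the number of replicas
    scaleTarget: 10       # target value of the scaling metric per replica
    scaleMetric: concurrency
    sklearn:
      storageUri: "s3://BUCKET/model"
```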
Workflow
Model serving is simplified with Kubeflow, requiring two main steps:
- Store a trained model in a persistent storage location, for example an S3 bucket (such as one created at s3.cern.ch).
- Create an InferenceService CRD to deploy a model server. Define a yaml file with:
  - The name of the service.
  - The location from which to obtain the model for serving (from the previous step).
  - The resources to use, for example GPUs (see the sketch below).
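A minimal sketch of such a definition, assuming a TensorFlow model stored in an S3 bucket (the service name and storage URI are placeholders):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model                       # name of the service
spec:
  predictor:
    tensorflow:
      storageUri: "s3://BUCKET/model"  # location of the stored model
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
```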
GPU Drivers
If you rely on GPUs for your inference servers, it is likely the required drivers are not available inside the image. You can instruct the system to make them available with a simple label:
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "tensorflow-gpu"
labels:
nvidia-drivers: "true"
annotations:
sidecar.istio.io/inject: "false"
spec:
default:
predictor:
tensorflow:
storageUri: "s3://BUCKET"
runtimeVersion: "2.6.2-gpu"
resources:
limits:
nvidia.com/gpu: 1
Frameworks
Model serving is supported for major ML frameworks such as TensorFlow, PyTorch and scikit-learn.
Support is also provided for XGBoost, LightGBM, PMML and ONNX Runtime, among others.
More information is provided on the KFServing samples documentation page.
Examples
The TensorFlow and PyTorch examples use pre-trained models from the KFServing GitHub repository.
For training and serving a custom model, please refer to the custom model example.
Single Model
- Connect to a running Notebook server
- If needed, follow the guide for setting up Jupyter notebooks in Kubeflow
- In the Notebook terminal, clone the git repository with examples
- Follow the instructions from here (a minimal single-model definition is sketched below for orientation)
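For orientation, a single-model InferenceService typically looks like the following sketch (based on the classic scikit-learn iris example from the KServe documentation; the storage URI is illustrative):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  predictor:
    sklearn:
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
```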
Multi Model Serving
- Connect to a running Notebook server
- If needed, follow the guide for setting up Jupyter notebooks in Kubeflow
- In the Notebook terminal, clone the git repository with examples
- Follow the instructions from here (a sketch of a TrainedModel definition follows below)
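Multi-model serving registers several models against a single InferenceService, each described by a TrainedModel resource. A rough sketch, with placeholder names and storage URI:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: TrainedModel
metadata:
  name: example-model
spec:
  inferenceService: multi-model-sklearn          # name of the hosting InferenceService
  model:
    framework: sklearn
    storageUri: "s3://BUCKET/models/example-model"
    memory: 256Mi
```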
Custom
A few steps are needed to serve a custom model:
- Create a bucket at s3.cern.ch following the documentation
- Make sure the credentials are properly obtained
- Connect to a running Notebook server
  - If needed, follow the guide for setting up Jupyter notebooks in Kubeflow
- In the Notebook terminal, clone the git repository with examples
- Navigate to `examples/serving/custom` in the Notebook server
- Fill in the `credentials` file with your bucket credentials (the file format is sketched after this list)
  - Create the `~/.aws` directory
  - Copy the credentials file to the `~/.aws` directory
- Train the model using the `model_training.ipynb` notebook
  - Make sure the `bucket` variable corresponds to the bucket created in the first step
  - After running the notebook, the model should be available in the bucket
- Fill in the storage URI and bucket credentials in the InferenceService definition files
- Create an InferenceService using CPU or GPU:
  - CPU: `kubectl apply -f custom_model_cpu.yaml`
  - GPU: `kubectl apply -f custom_model_gpu.yaml`
- Check the status of the InferenceService with `kubectl get inferenceservice`
- Obtain an Authentication Session Cookie from Chrome
  - Click `View -> Developer -> Developer Tools -> Network`
  - Navigate to ml.cern.ch
  - Check Request Headers
  - Copy the `authservice_session` section
- Run Inference
  - `AUTH_SESSION` is the authentication session cookie obtained in the previous step
  - `NAMESPACE` is a personal Kubeflow namespace, which can be seen in the top left corner of the UI

  ```bash
  curl -H 'Cookie: authservice_session=AUTH_SESSION' \
       -H 'Host: custom-model-cpu-predictor.NAMESPACE.ml.cern.ch' \
       https://ml.cern.ch/v1/models/custom-model-cpu:predict -d @./input.json
  ```

- Expected output
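For reference, the `credentials` file mentioned above follows the standard AWS credentials format; the key values below are placeholders:

```ini
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
```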
Running HuggingFace Models
The Hugging Face serving runtime implements two backends, Hugging Face and vLLM, that can serve Hugging Face models out of the box. The preprocess and post-process handlers are already implemented for the different ML tasks, for example text classification, token classification, text generation, text2text generation and fill-mask. For more information, check out the official KServe documentation.
Note: Some models on Hugging Face are gated and you must be authenticated to access them. To authenticate, create a token here.
Then specify the `HF_TOKEN` environment variable in your InferenceService manifest, or mount it from a Kubernetes secret in your namespace.
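As an illustration (the secret name is a placeholder), the token can be stored in a Kubernetes secret and referenced from the predictor's environment instead of being written in plain text:

```bash
kubectl create secret generic hf-secret --from-literal=HF_TOKEN=<YOUR_HF_TOKEN> -n <NAMESPACE>
```

```yaml
env:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-secret
        key: HF_TOKEN
```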
-
Serve the BERT model using the KServe Python Hugging Face runtime for both preprocess (tokenization)/postprocess and inference.
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-bert
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=bert
        - --model_id=bert-base-uncased
        - --tensor_input_names=input_ids
      env:
        - name: HF_TOKEN
          value: <YOUR_HF_TOKEN>
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: 100m
          memory: 2Gi
```
-
Serve the BERT model using the Triton inference runtime, with a KServe transformer running the Hugging Face runtime for the preprocess (tokenization) and postprocess steps.
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-triton
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  predictor:
    model:
      args:
        - --log-verbose=1
      modelFormat:
        name: triton
      protocolVersion: v2
      resources:
        limits:
          cpu: "1"
          memory: 8Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 8Gi
      runtimeVersion: 23.10-py3
      storageUri: gs://kfserving-examples/models/triton/huggingface/model_repository
  transformer:
    containers:
      - args:
          - --model_name=bert
          - --model_id=bert-base-uncased
          - --predictor_protocol=v2
          - --tensor_input_names=input_ids
        env:
          - name: HF_TOKEN
            value: <YOUR_HF_TOKEN>
        image: kserve/huggingfaceserver:latest
        name: kserve-container
        resources:
          limits:
            cpu: "1"
            memory: 2Gi
          requests:
            cpu: 100m
            memory: 2Gi
```
-
Serve the llama2 model using the KServe Hugging Face vLLM runtime. If vLLM is available for a model, it is used as the default backend; otherwise the KServe Hugging Face runtime is used as a fallback. You can find the vLLM-supported models here.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: huggingface-llama2
annotations:
sidecar.istio.io/inject: "false"
spec:
predictor:
model:
modelFormat:
name: huggingface
args:
- --model_name=llama2
- --model_id=meta-llama/Llama-2-7b-chat-hf
env:
- name: HF_TOKEN
value: <YOUR_HF_TOKEN>
resources:
limits:
cpu: "6"
memory: 24Gi
nvidia.com/gpu: "1"
requests:
cpu: "6"
memory: 24Gi
nvidia.com/gpu: "1"
If vLLM needs to be disabled, include the flag `--backend=huggingface` in the container args; the Hugging Face backend is then used instead.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: huggingface-t5
annotations:
sidecar.istio.io/inject: "false"
spec:
predictor:
model:
modelFormat:
name: huggingface
args:
- --model_name=t5
- --model_id=google-t5/t5-small
- --backend=huggingface
env:
- name: HF_TOKEN
value: <YOUR_HF_TOKEN>
resources:
limits:
cpu: "1"
memory: 4Gi
nvidia.com/gpu: "1"
requests:
cpu: "1"
memory: 2Gi
nvidia.com/gpu: "1"
Perform the inference:
Apart from the usual inference endpoint `http://${INGRESS_HOST}/v2/models/${MODEL_NAME}:predict`, KServe Hugging Face runtime deployments also support the OpenAI `v1/completions` and `v1/chat/completions` endpoints for inference.
Sample OpenAI Completions request:
curl -H "content-type:application/json" \
-H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}/openai/v1/completions \
-d '{"model": "t5", "prompt": "translate English to German: The house is wonderful.", "stream":false, "max_tokens": 30 }'
{
"id": "de53f527-9cb9-47a5-9673-43d180b704f2",
"choices": [
{
"finish_reason": "length",
"index": 0,
"logprobs": null,
"text": "Das Haus ist wunderbar."
}
],
"created": 1717998661,
"model": "t5",
"system_fingerprint": null,
"object": "text_completion",
"usage": {
"completion_tokens": 7,
"prompt_tokens": 11,
"total_tokens": 18
}
}
Sample OpenAI Chat request:
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-v http://${INGRESS_HOST}/openai/v1/chat/completions \
-d '{"model":"llama3","messages":[{"role":"system","content":"You are an assistant that speaks like Shakespeare."},{"role":"user","content":"Write a poem about colors"}],"max_tokens":30,"stream":false}'
Sample Expected Reply:
{
"id": "cmpl-9aad539128294069bf1e406a5cba03d3",
"choices": [
{
"finish_reason": "length",
"index": 0,
"message": {
"content": " O, fair and vibrant colors, how ye doth delight\nIn the world around us, with thy hues so bright!\n",
"tool_calls": null,
"role": "assistant",
"function_call": null
},
"logprobs": null
}
],
"created": 1718638005,
"model": "llama3",
"system_fingerprint": null,
"object": "chat.completion",
"usage": {
"completion_tokens": 30,
"prompt_tokens": 37,
"total_tokens": 67
}
}
Additional
Additional examples are provided at the KServe samples page.
Running Inference from outside the Service
Cookie Method
- Obtain an Authentication Session Cookie from Chrome
  - Click `View -> Developer -> Developer Tools -> Network`
  - Navigate to ml.cern.ch
  - Check Request Headers
  - Copy the `authservice_session` section
- Run Inference
  - `AUTH_SESSION` is the authentication session cookie obtained in the previous step
  - `INFERENCESERVICE_NAME` is the name of your service, e.g. `sklearn-iris`
  - `MODEL_NAME` is the name of the model in your inference service, e.g. `iris`
  - `NAMESPACE` is a personal Kubeflow namespace, which can be seen in the top left corner of the UI

  ```bash
  curl -H 'Cookie: authservice_session=AUTH_SESSION' \
       -H 'Host: INFERENCESERVICE_NAME.NAMESPACE.ml.cern.ch' \
       https://ml.cern.ch/v1/models/MODEL_NAME:predict -d @./input.json
  ```
Note: Session cookies are short-lived, expiring after approximately 30 minutes, so they must be renewed in order to authenticate successfully against the inference endpoint.
Service Account Token Method
- To be announced