
Hyperparameter Optimization

Hyperparameter optimization is the task of finding the set of non-trainable parameters for which the model reaches its best performance. Hyperparameters include the learning rate, the number of layers, the number of nodes in each layer, the size of convolutional layers, the choice of activation function, and many other options.

We offer two methods to run automated hyperparameter tuning. The first is Katib, a KubeFlow component. It requires no third-party packages or code changes; only the reporting of the evaluation metrics needs to be adjusted. The second is Ray Tune, a component of the Ray framework for running hyperparameter optimization programmatically. It also integrates with the Ray Train framework for distributed training (see Model Training).

Method                     How do you interact with the service?
Applying Katib Manifest    Submission via kubectl
Using Katib SDK            Programmatically, can be controlled from within a notebook, VS Code, pipeline
KubeFlow UI                Via UI
Applying RayJob Manifest   Submission via kubectl

Katib

Katib is a Kubernetes-native project for automated machine learning (AutoML). Katib supports hyperparameter tuning, early stopping and neural architecture search (NAS). Katib supports many AutoML algorithms, such as Bayesian optimization, Tree of Parzen Estimators, Random Search, Covariance Matrix Adaptation Evolution Strategy, Hyperband, Efficient Neural Architecture Search, Differentiable Architecture Search and many more. Additional algorithm support is coming soon.

Experiment

An experiment is a single tuning run, also called an optimization run.
The configuration settings define the experiment. The main configurations are:

  • Objective - the metric that needs to be optimized, such as accuracy or loss. You can specify whether the metric should be maximized or minimized.
  • Search space - the set of all possible hyperparameter values that the hyperparameter tuning job should consider for optimization, and the constraints for each hyperparameter.
  • Search algorithm - The algorithm to use when searching for the optimal hyperparameter values.
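
For orientation, this is a minimal sketch of how these three pieces map onto the Katib Python SDK (the equivalent manifest-based configuration is shown further down); my_objective is a hypothetical placeholder for your own training function:

import kubeflow.katib as katib

katib_client = katib.KatibClient(namespace="REPLACE_WITH_YOUR_USERNAME")
katib_client.tune(
    name="example",
    objective=my_objective,                 # your code, which reports the metric below
    objective_metric_name="loss",           # objective
    objective_type="minimize",
    parameters={"lr": katib.search.double(min=1e-4, max=1e-1)},  # search space
    algorithm_name="random",                # search algorithm
    max_trial_count=10,
)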

Trial

A trial is one iteration of the hyperparameter tuning process.
Each experiment runs several trials. The experiment runs the trials until it reaches either the objective or the configured maximum number of trials.

Neural Network Architecture Search with Katib

Neural Architecture Search (NAS) automates the process of architecture design of neural networks. NAS approaches optimize the topology of the networks, including how to connect nodes and which operators to choose.

User-defined optimization metrics can thereby include accuracy, model size or inference time to arrive at an optimal architecture for specific applications. Due to the extremely large search space, traditional evolution or reinforcement learning-based AutoML algorithms tend to be computationally expensive.

Configuring and Running NAS Examples

The list below describes the NAS-specific parameters in the YAML file for an Experiment.

  • nasConfig: The configuration for NAS. You can specify the configurations of the neural network design that you want to optimize, including the number of layers in the network, the types of operations, and more.

    • graphConfig: The graph config that defines the structure of the directed acyclic graph for the neural network. You can specify the number of layers, input_sizes for the input layer and output_sizes for the output layer.

    • operations: The range of operations that you want to tune for your ML model. For each neural network layer, the NAS algorithm selects one of the operations to build the neural network. Each operation contains sets of parameters, similar to an HP tuning Experiment.

Katib Metric Collection

Running hyperparameter optimization with Katib requires you to adjust the reporting of the evaluation metrics. You can choose to either print the metrics to stdout, log them to a file, or report them via the Katib SDK. By default, Katib scrapes the stdout of the trial process and looks for a specific regex pattern such as:

epoch 1: accuracy=0.9 loss=0.5
See Katib documentation for more information. It is also possible to override the regex or to read metrics from a file.
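
A minimal sketch of the stdout-based option, matching the pattern above (the function and variable names are illustrative, not part of any API):

def report_metrics(epoch, accuracy, loss):
    # The default metrics collector scrapes the trial's stdout for
    # "<metric-name>=<metric-value>" pairs, so a line like
    # "epoch 1: accuracy=0.9 loss=0.5" is picked up automatically.
    print(f"epoch {epoch}: accuracy={accuracy} loss={loss}")

report_metrics(epoch=1, accuracy=0.9, loss=0.5)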

Katib Manifest

  1. Prepare your code to accept hyperparameters. This can be done via command-line arguments, as below, or via environment variables (see the sketch after this list).
    import argparse
    
    def train(config):
        number_layers = config["number_layers"]
        model = Model(number_layers=number_layers)
    
    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument('--number-layers', type=int)
        args = parser.parse_args()
        train({"number_layers": args.number_layers})
    
  2. Build and push your image. Replace the placeholder with your username/project and image name.
    docker login registry.cern.ch
    docker build -t registry.cern.ch/REPLACE_WITH_YOUR_USERNAME/REPLACE_WITH_YOUR_IMAGE .
    docker push registry.cern.ch/REPLACE_WITH_YOUR_USERNAME/REPLACE_WITH_YOUR_IMAGE
    
  3. Adjust the manifest to your needs. Parameters are passed to your code via ${trialParameters.REPLACE_WITH_HYPERPARAMETER_NAME}, either as environment variables or as command-line arguments.
    apiVersion: kubeflow.org/v1beta1
    kind: Experiment
    metadata:
      name: hpo
    spec:
      parallelTrialCount: 3
      maxTrialCount: 10
      maxFailedTrialCount: 1
      objective:
        type: minimize
        goal: 0.001
        objectiveMetricName: loss
      algorithm:
        algorithmName: random # Bayesian Optimization, Hyperband, ...
      parameters:
        - name: REPLACE_WITH_HYPERPARAMETER_NAME
          parameterType: int
          feasibleSpace:
            min: "REPLACE_WITH_HYPERPARAMETER_MIN_VALUE"
            max: "REPLACE_WITH_HYPERPARAMETER_MAX_VALUE"
      trialTemplate:
        primaryContainerName: training
        trialParameters:
          - name: REPLACE_WITH_HYPERPARAMETER_NAME
            description: lorem ipsum
            reference: REPLACE_WITH_HYPERPARAMETER_NAME
        trialSpec:
          apiVersion: batch/v1
          kind: Job
          spec:
            template:
              metadata:
                labels:
                  mount-eos: "true"
                  inject-oauth2-token-pipeline: "true"
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: training
                    image: REPLACE_WITH_YOUR_IMAGE
                    command:
                      - "python"
                      - "train.py"
                      - "--batch-size=${trialParameters.REPLACE_WITH_HYPERPARAMETER_NAME}"
                restartPolicy: Never
    
  4. Run kubectl apply -f experiment.yaml to start the optimization
  5. Run kubectl get experiments
    NAME   TYPE      STATUS   AGE
    hpo    Created   True     9s
    
  6. Results can be monitored and analysed in the UI
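
As mentioned in step 1, the hyperparameters can also be passed as environment variables instead of command-line arguments. A minimal sketch, assuming the trial container sets an environment variable NUMBER_LAYERS to ${trialParameters.REPLACE_WITH_HYPERPARAMETER_NAME} in its env section (the variable name is an assumption):

import os

def train(config):
    number_layers = config["number_layers"]
    model = Model(number_layers=number_layers)

if __name__ == "__main__":
    # NUMBER_LAYERS is set by the trial spec instead of being passed as an argument.
    train({"number_layers": int(os.environ["NUMBER_LAYERS"])})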

Katib UI

Katib experiments can also be started from the KubeFlow dashboard:

  1. Navigate to https://ml.cern.ch/_/katib/
  2. Click NEW EXPERIMENT
  3. At the bottom of the page click Edit and submit YAML
  4. Copy the example manifest from above into the text editor
  5. Click CREATE
  6. The progress can be monitored in the UI

Katib SDK

Katib experiments can be managed through the Python SDK. You can find a full notebook here. This can be useful if you want to quickly iterate over your code but can also be used from within KubeFlow Pipelines.

import kubeflow.katib as katib

def objective(parameters):
    # train() is a placeholder for your own training routine.
    loss = train(parameters["batch_size"])
    # Katib parses metrics in this format: <metric-name>=<metric-value>.
    print(f"loss={loss}")

katib_client = katib.KatibClient(namespace="REPLACE_WITH_YOUR_USERNAME")
parameters = {
    "batch_size": katib.search.int(min=4, max=64),
}
name = "tune-experiment"
katib_client.tune(
    name=name,
    objective=objective,
    parameters=parameters,
    objective_metric_name="loss",
    max_trial_count=12,
    resources_per_trial={"cpu": "2"},
)

# Wait for the trials to finish
katib_client.wait_for_experiment_condition(name=name)
print(katib_client.get_optimal_hyperparameters(name))

Inside a Notebook

  1. Adjust the code to your needs
  2. Create a notebook
  3. Paste the code in a cell
  4. Run the code

Inside a pipeline

  1. Adjust the code to your needs
  2. Create the pipeline definition
    from kfp import dsl, compiler

    @dsl.component(packages_to_install=["kubeflow-katib"])
    def launch_hyperparameter_tuning():
        # PASTE the code from above:
        # import kubeflow.katib as katib
        # katib_client = katib.KatibClient(namespace=...)
        # ...
        pass

    @dsl.pipeline
    def pipeline():
        launch_hyperparameter_tuning()

    compiler.Compiler().compile(pipeline, package_path='pipeline.yaml')
    
  3. Compile the pipeline by running the definition file (assuming it was saved as pipeline.py); this produces pipeline.yaml
    python pipeline.py
    
  4. Create and run the pipeline according to the pipeline documentation
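
Alternatively, once compiled, a run can be created programmatically; a minimal sketch, assuming the in-cluster Pipelines API is reachable and authentication is already handled (for example from within a KubeFlow notebook):

from kfp.client import Client

# Assumes in-cluster access to the KubeFlow Pipelines API.
client = Client()
run = client.create_run_from_pipeline_package(
    "pipeline.yaml",
    arguments={},
    run_name="katib-hpo",
)
print(run.run_id)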

Ray

Ray Tune can be used to configure and run automated hyperparameter tuning within a RayCluster. You can either create a RayCluster beforehand and interact with it dynamically via the Ray SDK, or submit a RayJob, which first creates the cluster according to the spec, waits for it to be ready, and then submits the job to it.

from ray import train, tune

def train_wrapper(config):
    # Model is a placeholder for your own training code.
    batch_size = config["batch_size"]
    model = Model(batch_size=batch_size)
    metrics = model.train()
    # Returning a dict of metrics reports it as the final result of the trial.
    return metrics

def run_tuning():
    tuner = tune.Tuner(
        train_wrapper,
        tune_config=tune.TuneConfig(num_samples=2, metric="mean_loss_per_epoch", mode="min"),
        run_config=train.RunConfig(
            verbose=3,
            name="HPO Experiment",
        ),
        param_space={
            "batch_size": tune.randint(10, 100),
        },
    )
    results = tuner.fit()
    return results

if __name__ == "__main__":
    run_tuning()
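
To run this against an existing RayCluster from a notebook or VS Code server, connect through the Ray client before calling run_tuning(); a minimal sketch, where the head service name is a placeholder you need to replace and 10001 is the default Ray client port (when the script is submitted as a RayJob entrypoint, no explicit connection is needed):

import ray

# Connect to the head node of an existing RayCluster via the Ray client.
# REPLACE_WITH_RAY_HEAD_SERVICE is the Kubernetes service of the cluster's head pod.
ray.init("ray://REPLACE_WITH_RAY_HEAD_SERVICE:10001")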

Ray Images

Ray provides base images for many Python versions. Remember to use the registry proxy cache to prevent rate limiting. For example, use registry.cern.ch/docker.io/rayproject/ray:2.43.0-py39-gpu to get Ray 2.43 and Python 3.9.

RayJob Manifest

A RayJob automatically creates a RayCluster and submits a job when the cluster is ready. You can also configure RayJob to automatically delete the RayCluster once the Ray job finishes.

  1. Prepare your image
    FROM registry.cern.ch/docker.io/rayproject/ray:2.43.0-py312-cpu
    USER root
    RUN pip install torch torchvision
    USER ray
    
  2. Build and push your image. Replace the placeholder with your username/project and image name.
    docker login registry.cern.ch
    docker build -t registry.cern.ch/REPLACE_WITH_YOUR_USERNAME/REPLACE_WITH_YOUR_IMAGE .
    docker push registry.cern.ch/REPLACE_WITH_YOUR_USERNAME/REPLACE_WITH_YOUR_IMAGE
    
  3. Adjust the manifest to your needs
    apiVersion: ray.io/v1
    kind: RayJob
    metadata:
      name: hpo
    spec:
      entrypoint: REPLACE_WITH_ENTRYPOINT # e.g. python3 main.py --hpo
      rayClusterSpec:
        rayVersion: 'REPLACE_WITH_RAY_VERSION' # e.g. '2.43.0' should match the Ray version in the image of the containers
        headGroupSpec:
          rayStartParams:
            dashboard-host: '0.0.0.0'
          template:
            metadata:
              annotations:
                sidecar.istio.io/inject: "false"
              labels:
                mount-eos: "true"
                inject-oauth2-token: "true"
            spec:
              containers:
              - name: ray-head
                image: REPLACE_WITH_YOUR_IMAGE
                ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                resources:
                  limits:
                    cpu: "1"
                  requests:
                    cpu: "200m"
        workerGroupSpecs:
            - replicas: REPLACE_WITH_NUMBER_OF_WORKERS # e.g 1
              minReplicas: 1
              maxReplicas: 2
              groupName: small-group
              rayStartParams: {}
              template:
                metadata:
                  annotations:
                    sidecar.istio.io/inject: "false"
                  labels:
                    mount-eos: "true"
                    inject-oauth2-token: "true"
                spec:
                  containers:
                  - name: worker
                    image: REPLACE_WITH_YOUR_IMAGE
                    lifecycle:
                      preStop:
                        exec:
                          command: [ "/bin/sh","-c","ray stop" ]
                    resources:
                      limits:
                        cpu: "1"
                        nvidia.com/gpu: REPLACE_WITH_NUMBER_OF_GPUS_PER_WORKER # e.g. 1
                      requests:
                        cpu: "200m"    
      submitterPodTemplate:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          restartPolicy: Never
          containers:
          - name: rayjob-submitter
            image: registry.cern.ch/docker.io/rayproject/ray:2.43.0 # should match the Ray version of the cluster
    
    
  4. Run kubectl apply -f job.yaml in a terminal, either from a Jupyter notebook or a VS Code server
  5. Run kubectl get rayjobs to see the status of your job
    NAME                  JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME               AGE
    job                                Initializing        2025-03-03T11:35:32Z                          11s
    
  6. Run kubectl get pods to see the head and worker pods. After a while, the submitter pod will be created
    NAME                                                      READY   STATUS         RESTARTS   AGE
    job-raycluster-7gx7p-head-hrsjk                           2/2     Running        0          104s
    job-raycluster-7gx7p-worker-small-group-dgzhp             2/2     Running        0          104s
    job-m44ns                                                 1/1     Running        0          4m17s
    
  7. Run kubectl logs -f REPLACE_WITH_SUBMITTER_POD_NAME, e.g. kubectl logs -f job-m44ns, to follow the logs of your job
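
As an alternative to kubectl logs, the Ray dashboard port (8265) exposed on the head pod can be used to inspect the job with the Ray job submission SDK; a minimal sketch, assuming you port-forward the head service to localhost first (the service name is a placeholder):

from ray.job_submission import JobSubmissionClient

# Assumes e.g. `kubectl port-forward svc/REPLACE_WITH_HEAD_SERVICE 8265:8265` is running.
client = JobSubmissionClient("http://localhost:8265")
for job in client.list_jobs():
    print(job.submission_id, job.status)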