
Hyperparameter Optimization

Hyperparameter optimization is the task of finding the set of non-trainable parameters for which the model reaches its best performance. Hyperparameters include the learning rate, the number of layers, the number of nodes in each layer, the size of convolutional layers, the choice of activation function, and many other options.

We offer two methods to run automated hyperparameter tuning. The first is Katib, a KubeFlow component. It requires no third-party packages or code changes; only the reporting of the evaluation metrics needs to be adjusted. The second is Ray Tune, a component of the Ray framework for running hyperparameter optimization programmatically. It also integrates with the Ray Train framework for distributed training (see Model Training).

Method                     How do you interact with the service?
Applying Katib Manifest    Submission via kubectl
Using Katib SDK            Programmatically, can be controlled from within a notebook, VS Code, pipeline
KubeFlow UI                Via UI
Applying RayJob Manifest   Submission via kubectl

Katib

Katib is a Kubernetes-native project for automated machine learning (AutoML). Katib supports hyperparameter tuning, early stopping and neural architecture search (NAS). Katib supports many AutoML algorithms, such as Bayesian optimization, Tree of Parzen Estimators, Random Search, Covariance Matrix Adaptation Evolution Strategy, Hyperband, Efficient Neural Architecture Search, Differentiable Architecture Search and many more. Additional algorithm support is coming soon.

Experiment

An experiment is a single tuning run, also called an optimization run.
The configuration settings define the experiment. The main configurations are:

  • Objective - the metric that needs to be optimized, such as accuracy or loss. You can specify whether the metric should be maximized or minimized.
  • Search space - the set of all possible hyperparameter values that the hyperparameter tuning job should consider for optimization, and the constraints for each hyperparameter.
  • Search algorithm - The algorithm to use when searching for the optimal hyperparameter values.
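
For orientation, this is a minimal sketch of how these three pieces map onto the Katib Python SDK (the equivalent manifest-based configuration is shown further down); my_objective is a hypothetical placeholder for your own training function:

import kubeflow.katib as katib

katib_client = katib.KatibClient(namespace="REPLACE_WITH_YOUR_USERNAME")
katib_client.tune(
    name="example",
    objective=my_objective,                 # your code, which reports the metric below
    objective_metric_name="loss",           # objective
    objective_type="minimize",
    parameters={"lr": katib.search.double(min=1e-4, max=1e-1)},  # search space
    algorithm_name="random",                # search algorithm
    max_trial_count=10,
)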

Trial

A trial is one iteration of the hyperparameter tuning process.
Each experiment runs several trials. The experiment runs the trials until it reaches either the objective or the configured maximum number of trials.

Neural Network Architecture Search with Katib

Neural Architecture Search (NAS) automates the process of architecture design of neural networks. NAS approaches optimize the topology of the networks, including how to connect nodes and which operators to choose.

User-defined optimization metrics can thereby include accuracy, model size or inference time to arrive at an optimal architecture for specific applications. Due to the extremely large search space, traditional evolution or reinforcement learning-based AutoML algorithms tend to be computationally expensive.

Configuring and Running NAS Examples

The list below describes the NAS-specific parameters in the YAML file for an Experiment.

  • nasConfig: The configuration for NAS. You can specify the configurations of the neural network design that you want to optimize, including the number of layers in the network, the types of operations, and more.

    • graphConfig: The graph config that defines the structure of the directed acyclic graph for the neural network. You can specify the number of layers, input_sizes for the input layer and output_sizes for the output layer.

    • operations: The range of operations that you want to tune for your ML model. For each neural network layer, the NAS algorithm selects one of the operations to build the neural network. Each operation contains sets of parameters, similar to an HP tuning Experiment.

Katib Metric Collection

Running hyperparameter optimization with Katib requires you to adjust the reporting of the evaluation metrics. You can choose to either print the metrics to stdout, log them to a file, or report them via the Katib SDK. By default, Katib scrapes the stdout of the trial process and looks for a specific regex pattern such as:

epoch 1: accuracy=0.9 loss=0.5
See Katib documentation for more information. It is also possible to override the regex or to read metrics from a file.
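
A minimal sketch of the stdout-based option, matching the pattern above (the function and variable names are illustrative, not part of any API):

def report_metrics(epoch, accuracy, loss):
    # The default metrics collector scrapes the trial's stdout for
    # "<metric-name>=<metric-value>" pairs, so a line like
    # "epoch 1: accuracy=0.9 loss=0.5" is picked up automatically.
    print(f"epoch {epoch}: accuracy={accuracy} loss={loss}")

report_metrics(epoch=1, accuracy=0.9, loss=0.5)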

Katib Manifest

  1. Prepare your code to accept hyperparameters. This can be done via command-line arguments, as below, or via environment variables (see the sketch after this list).
    import argparse
    
    def train(config):
        number_layers = config["number_layers"]
        model = Model(number_layers=number_layers)
    
    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument('--number-layers', type=int)
        args = parser.parse_args()
        train({"number_layers": args.number_layers})
    
  2. Build and push your image. Replace the placeholder with your username/project and image name.
    docker login registry.cern.ch
    docker build -t registry.cern.ch/REPLACE_WITH_YOUR_USERNAME/REPLACE_WITH_YOUR_IMAGE .
    docker push registry.cern.ch/REPLACE_WITH_YOUR_USERNAME/REPLACE_WITH_YOUR_IMAGE
    
  3. Adjust the manifest to your needs. Parameters are passed to your code via ${trialParameters.REPLACE_WITH_HYPERPARAMETER_NAME}, either as environment variables or as command-line arguments.
    apiVersion: kubeflow.org/v1beta1
    kind: Experiment
    metadata:
      name: hpo
    spec:
      parallelTrialCount: 3
      maxTrialCount: 10
      maxFailedTrialCount: 1
      objective:
        type: minimize
        goal: 0.001
        objectiveMetricName: loss
      algorithm:
        algorithmName: random # Bayesian Optimization, Hyperband, ...
      parameters:
        - name: REPLACE_WITH_HYPERPARAMETER_NAME
          parameterType: int
          feasibleSpace:
            min: "REPLACE_WITH_HYPERPARAMETER_MIN_VALUE"
            max: "REPLACE_WITH_HYPERPARAMETER_MAX_VALUE"
      trialTemplate:
        primaryContainerName: training
        trialParameters:
          - name: REPLACE_WITH_HYPERPARAMETER_NAME
            description: lorem ipsum
            reference: REPLACE_WITH_HYPERPARAMETER_NAME
        trialSpec:
          apiVersion: batch/v1
          kind: Job
          spec:
            template:
              metadata:
                labels:
                  mount-eos: "true"
                  inject-oauth2-token-pipeline: "true"
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: training
                    image: REPLACE_WITH_YOUR_IMAGE
                    command:
                      - "python"
                      - "train.py"
                      - "--batch-size=${trialParameters.REPLACE_WITH_HYPERPARAMETER_NAME}"
                restartPolicy: Never
    
  4. Run kubectl apply -f experiment.yaml to start the optimization
  5. Run kubectl get experiments
    NAME   TYPE      STATUS   AGE
    hpo    Created   True     9s
    
  6. Results can be monitored and analysed in the UI
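
As mentioned in step 1, the hyperparameters can also be passed as environment variables instead of command-line arguments. A minimal sketch, assuming the trial container sets an environment variable NUMBER_LAYERS to ${trialParameters.REPLACE_WITH_HYPERPARAMETER_NAME} in its env section (the variable name is an assumption):

import os

def train(config):
    number_layers = config["number_layers"]
    model = Model(number_layers=number_layers)

if __name__ == "__main__":
    # NUMBER_LAYERS is set by the trial spec instead of being passed as an argument.
    train({"number_layers": int(os.environ["NUMBER_LAYERS"])})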

Katib UI

Katib experiments can also be started from the KubeFlow dashboard:

  1. Navigate to https://ml.cern.ch/_/katib/
  2. Click NEW EXPERIMENT
  3. At the bottom of the page click Edit and submit YAML
  4. Copy the example manifest from above into the text editor
  5. Click CREATE
  6. The progress can be monitored in the UI

Katib SDK

Katib experiments can be managed through the Python SDK. You can find a full notebook here. This can be useful if you want to quickly iterate over your code but can also be used from within KubeFlow Pipelines.

import kubeflow.katib as katib

def objective(parameters):
    # train() is a placeholder for your own training routine.
    loss = train(parameters["batch_size"])
    # Katib parses metrics in this format: <metric-name>=<metric-value>.
    print(f"loss={loss}")

katib_client = katib.KatibClient(namespace="REPLACE_WITH_YOUR_USERNAME")
parameters = {
    "batch_size": katib.search.int(min=4, max=64),
}
name = "tune-experiment"
katib_client.tune(
    name=name,
    objective=objective,
    parameters=parameters,
    objective_metric_name="loss",
    max_trial_count=12,
    resources_per_trial={"cpu": "2"},
)

# Wait for the trials to finish
katib_client.wait_for_experiment_condition(name=name)
print(katib_client.get_optimal_hyperparameters(name))

Inside a Notebook

  1. Adjust the code to your needs
  2. Create a notebook
  3. Paste the code in a cell
  4. Run the code

Inside a pipeline

  1. Adjust the code to your needs
  2. Create the pipeline definition
    from kfp import dsl, compiler

    @dsl.component(packages_to_install=["kubeflow-katib"])
    def launch_hyperparameter_tuning():
        # PASTE the code from above:
        # import kubeflow.katib as katib
        # katib_client = katib.KatibClient(namespace=...)
        # ...
        pass

    @dsl.pipeline
    def pipeline():
        launch_hyperparameter_tuning()

    compiler.Compiler().compile(pipeline, package_path='pipeline.yaml')
    
  3. Compile the pipeline by running the definition file (assuming it was saved as pipeline.py); this produces pipeline.yaml
    python pipeline.py
    
  4. Create and run the pipeline according to the pipeline documentation
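
Alternatively, once compiled, a run can be created programmatically; a minimal sketch, assuming the in-cluster Pipelines API is reachable and authentication is already handled (for example from within a KubeFlow notebook):

from kfp.client import Client

# Assumes in-cluster access to the KubeFlow Pipelines API.
client = Client()
run = client.create_run_from_pipeline_package(
    "pipeline.yaml",
    arguments={},
    run_name="katib-hpo",
)
print(run.run_id)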

Ray

Ray Tune can be used to configure and run automated hyperparameter tuning within a RayCluster. You can either create a RayCluster beforehand and interact with it dynamically via the Ray SDK, or submit a RayJob, which first creates the cluster according to the spec, waits for it to be ready, and then submits the job to it.

from ray import train, tune

def train_wrapper(config):
    # Model is a placeholder for your own training code.
    batch_size = config["batch_size"]
    model = Model(batch_size=batch_size)
    metrics = model.train()
    # Returning a dict of metrics reports it as the final result of the trial.
    return metrics

def run_tuning():
    tuner = tune.Tuner(
        train_wrapper,
        tune_config=tune.TuneConfig(num_samples=2, metric="mean_loss_per_epoch", mode="min"),
        run_config=train.RunConfig(
            verbose=3,
            name="HPO Experiment",
        ),
        param_space={
            "batch_size": tune.randint(10, 100),
        },
    )
    results = tuner.fit()
    return results

if __name__ == "__main__":
    run_tuning()
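
To run this against an existing RayCluster from a notebook or VS Code server, connect through the Ray client before calling run_tuning(); a minimal sketch, where the head service name is a placeholder you need to replace and 10001 is the default Ray client port (when the script is submitted as a RayJob entrypoint, no explicit connection is needed):

import ray

# Connect to the head node of an existing RayCluster via the Ray client.
# REPLACE_WITH_RAY_HEAD_SERVICE is the Kubernetes service of the cluster's head pod.
ray.init("ray://REPLACE_WITH_RAY_HEAD_SERVICE:10001")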

Ray Images

Ray provides base images for many Python versions. Remember to use the registry proxy cache to prevent rate limiting. For example, use registry.cern.ch/docker.io/rayproject/ray:2.43.0-py39-gpu to get Ray 2.43 and Python 3.9.

RayJob Manifest

A RayJob automatically creates a RayCluster and submits a job when the cluster is ready. You can also configure RayJob to automatically delete the RayCluster once the Ray job finishes.

  1. Prepare your image
    FROM registry.cern.ch/docker.io/rayproject/ray:2.43.0-py312-cpu
    USER root
    RUN pip install torch torchvision
    USER ray
    
  2. Build and push your image. Replace the placeholder with your username/project and image name.
    docker login registry.cern.ch
    docker build -t registry.cern.ch/REPLACE_WITH_YOUR_USERNAME/REPLACE_WITH_YOUR_IMAGE .
    docker push registry.cern.ch/REPLACE_WITH_YOUR_USERNAME/REPLACE_WITH_YOUR_IMAGE
    
  3. Adjust the manifest to your needs
    apiVersion: ray.io/v1
    kind: RayJob
    metadata:
      name: hpo
    spec:
      entrypoint: REPLACE_WITH_ENTRYPOINT # e.g. python3 main.py --hpo
      rayClusterSpec:
        rayVersion: 'REPLACE_WITH_RAY_VERSION' # e.g. '2.43.0' should match the Ray version in the image of the containers
        headGroupSpec:
          rayStartParams:
            dashboard-host: '0.0.0.0'
          template:
            metadata:
              annotations:
                sidecar.istio.io/inject: "false"
              labels:
                mount-eos: "true"
                inject-oauth2-token: "true"
            spec:
              containers:
              - name: ray-head
                image: REPLACE_WITH_YOUR_IMAGE
                ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                resources:
                  limits:
                    cpu: "1"
                  requests:
                    cpu: "200m"
        workerGroupSpecs:
            - replicas: REPLACE_WITH_NUMBER_OF_WORKERS # e.g 1
              minReplicas: 1
              maxReplicas: 2
              groupName: small-group
              rayStartParams: {}
              template:
                metadata:
                  annotations:
                    sidecar.istio.io/inject: "false"
                  labels:
                    mount-eos: "true"
                    inject-oauth2-token: "true"
                spec:
                  containers:
                  - name: worker
                    image: REPLACE_WITH_YOUR_IMAGE
                    lifecycle:
                      preStop:
                        exec:
                          command: [ "/bin/sh","-c","ray stop" ]
                    resources:
                      limits:
                        cpu: "1"
                        nvidia.com/gpu: REPLACE_WITH_NUMBER_OF_GPUS_PER_WORKER # e.g. 1
                      requests:
                        cpu: "200m"    
      submitterPodTemplate:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          restartPolicy: Never
          containers:
          - name: rayjob-submitter
            image: registry.cern.ch/docker.io/rayproject/ray:2.43.0 # should match the Ray version of the cluster
    
    
  4. Run kubectl apply -f job.yaml in a terminal, either from a Jupyter notebook or a VS Code server
  5. Run kubectl get rayjobs to see the status of your job
    NAME                  JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME               AGE
    job                                Initializing        2025-03-03T11:35:32Z                          11s
    
  6. Run kubectl get pods to see the head and worker pods. After a while, the submitter pod will be created
    NAME                                                      READY   STATUS         RESTARTS   AGE
    job-raycluster-7gx7p-head-hrsjk                           2/2     Running        0          104s
    job-raycluster-7gx7p-worker-small-group-dgzhp             2/2     Running        0          104s
    job-m44ns                                                 1/1     Running        0          4m17s
    
  7. Run kubectl logs -f REPLACE_WITH_SUBMITTER_POD_NAME, e.g. kubectl logs -f job-m44ns, to follow the logs of your job
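
As an alternative to kubectl logs, the Ray dashboard port (8265) exposed on the head pod can be used to inspect the job with the Ray job submission SDK; a minimal sketch, assuming you port-forward the head service to localhost first (the service name is a placeholder):

from ray.job_submission import JobSubmissionClient

# Assumes e.g. `kubectl port-forward svc/REPLACE_WITH_HEAD_SERVICE 8265:8265` is running.
client = JobSubmissionClient("http://localhost:8265")
for job in client.list_jobs():
    print(job.submission_id, job.status)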