Hyperparameter Optimization
Hyperparameter optimization is the task of finding the set of non-trainable parameters for which a model reaches its best performance. Hyperparameters include the learning rate, the number of layers, the number of nodes per layer, the size of convolutional layers, the choice of activation function, and many other options.
We offer two methods to run automated hyperparameter tuning. The first uses Katib, a KubeFlow component. It requires no third-party packages or code; only the reporting of the evaluation metrics needs to be changed. The second uses Ray Tune, a component of the Ray framework, to run hyperparameter optimization programmatically. It also integrates with the Ray Train framework for distributed training (see Model Training).
| Method | How do you interact with the service? |
| --- | --- |
| Applying Katib Manifest | Submission via kubectl |
| Using Katib SDK | Programmatically; can be controlled from within a notebook, VS Code, or a pipeline |
| KubeFlow UI | Via the UI |
| Applying RayJob Manifest | Submission via kubectl |
Katib
Katib is a Kubernetes-native project for automated machine learning (AutoML). Katib supports hyperparameter tuning, early stopping and neural architecture search (NAS). Katib supports many AutoML algorithms, such as Bayesian optimization, Tree of Parzen Estimators, Random Search, Covariance Matrix Adaptation Evolution Strategy, Hyperband, Efficient Neural Architecture Search, Differentiable Architecture Search and many more. Additional algorithm support is coming soon.
Experiment
An experiment is a single tuning run, also called an optimization run.
Configuration settings define the experiment. The main configurations are:
- Objective - the metric that needs to be optimized, such as accuracy or loss, and whether it should be maximized or minimized.
- Search space - the set of all possible hyperparameter values that the hyperparameter tuning job should consider for optimization, and the constraints for each hyperparameter.
- Search algorithm - the algorithm to use when searching for the optimal hyperparameter values.
Trial
A trial is one iteration of the hyperparameter tuning process.
Each experiment runs several trials. The experiment runs the trials until it reaches either the objective or the configured maximum number of trials.
Neural Network Architecture Search with Katib
Neural Architecture Search (NAS) automates the process of architecture design of neural networks. NAS approaches optimize the topology of the networks, including how to connect nodes and which operators to choose.
User-defined optimization metrics can thereby include accuracy, model size or inference time to arrive at an optimal architecture for specific applications. Due to the extremely large search space, traditional evolution or reinforcement learning-based AutoML algorithms tend to be computationally expensive.
Configuring and Running NAS Examples
The list below describes the NAS-specific parameters in the YAML file for an Experiment.
- nasConfig: The configuration for NAS. You can specify the configurations of the neural network design that you want to optimize, including the number of layers in the network, the types of operations, and more.
- graphConfig: The graph config that defines the structure for a directed acyclic graph of the neural network. You can specify the number of layers, input_sizes for the input layer and output_sizes for the output layer.
- operations: The range of operations that you want to tune for your ML model. For each neural network layer the NAS algorithm selects one of the operations to build the neural network. Each operation contains sets of parameters, similar to an HP tuning Experiment.
Katib Metric Collection
Running hyperparameter optimization with Katib requires you to adjust the reporting of the evaluation metrics. You can choose to either print the metrics to stdout, log them to a file, or report them via the Katib SDK.
By default, Katib scrapes the metrics from the stdout of the trial process and looks for lines matching the pattern `<metric-name>=<metric-value>`, for example `loss=0.0123`.
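In practice, the simplest option is to print the objective metric to stdout in exactly that format after each evaluation. A minimal sketch, where `evaluate` is a placeholder for your own validation routine:

```python
# `evaluate` is a placeholder for your own validation logic.
loss, accuracy = evaluate()

# Katib's default metrics collector scrapes stdout for <name>=<value> lines.
print(f"loss={loss}")
print(f"accuracy={accuracy}")
```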
Katib Manifest
- Prepare your code to accept hyperparameters. This can be done via environment variables (see the sketch after this list) or via command-line arguments:
```python
import argparse

def train(config):
    number_layers = config["number_layers"]
    model = Model(number_layers=number_layers)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--number-layers', type=int)
    args = parser.parse_args()
    train({"number_layers": args.number_layers})
```
- Build and push your image. Replace the placeholder with your username/project and image name.
- Adjust the manifest to your needs. Parameters can be passed to your code via `${trialParameters.REPLACE_WITH_HYPERPARAMETER_NAME}`, either as environment variables or as arguments.

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: hpo
spec:
  parallelTrialCount: 3
  maxTrialCount: 10
  maxFailedTrialCount: 1
  objective:
    type: minimize
    goal: 0.001
    objectiveMetricName: loss
  algorithm:
    algorithmName: random # Bayesian Optimization, Hyperband, ...
  parameters:
    - name: REPLACE_WITH_HYPERPARAMETER_NAME
      parameterType: int
      feasibleSpace:
        min: "REPLACE_WITH_HYPERPARAMETER_MIN_VALUE"
        max: "REPLACE_WITH_HYPERPARAMETER_MAX_VALUE"
  trialTemplate:
    primaryContainerName: training
    trialParameters:
      - name: REPLACE_WITH_HYPERPARAMETER_NAME
        description: lorem ipsum
        reference: REPLACE_WITH_HYPERPARAMETER_NAME
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          metadata:
            labels:
              mount-eos: "true"
              inject-oauth2-token-pipeline: "true"
            annotations:
              sidecar.istio.io/inject: "false"
          spec:
            containers:
              - name: training
                image: REPLACE_WITH_YOUR_IMAGE
                command:
                  - python
                  - train.py
                  - --batch-size=${trialParameters.REPLACE_WITH_HYPERPARAMETER_NAME}
            restartPolicy: Never
```
- Run `kubectl apply -f experiment.yaml` to start the optimization
- Run `kubectl get experiments` to see the status of your experiment
- Results can be monitored and analysed in the UI
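As an alternative to the command-line arguments used in the first step, hyperparameters can be read from environment variables. A minimal sketch, assuming the trial container sets a hypothetical `NUMBER_LAYERS` variable from `${trialParameters.REPLACE_WITH_HYPERPARAMETER_NAME}` in its `env` section:

```python
import os

# NUMBER_LAYERS is a hypothetical variable; set it in the trial container's
# `env` section instead of passing a command-line argument.
number_layers = int(os.environ["NUMBER_LAYERS"])
train({"number_layers": number_layers})
```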
Katib UI
Katib experiments can also be started from the KubeFlow dashboard. This requires the following steps:
- Navigate to https://ml.cern.ch/_/katib/
- Click NEW EXPERIMENT
- At the bottom of the page click Edit and submit YAML
- Copy the example manifest from above into the text editor
- Click CREATE
- The progress can be monitored in the UI
Katib SDK
Katib experiments can be managed through the Python SDK. You can find a full notebook here. This can be useful if you want to quickly iterate over your code but can also be used from within KubeFlow Pipelines.
```python
import kubeflow.katib as katib

def objective(parameters):
    # `train` is a placeholder for your training routine. Note that the
    # objective function must be self-contained (imports and helpers defined
    # inside), since Katib ships it to the trial containers.
    loss = train(parameters["batch_size"])
    # Katib parses metrics in this format: <metric-name>=<metric-value>.
    print(f"loss={loss}")

katib_client = katib.KatibClient(namespace="REPLACE_WITH_YOUR_USERNAME")

parameters = {
    "batch_size": katib.search.int(min=4, max=64),
}

name = "tune-experiment"
katib_client.tune(
    name=name,
    objective=objective,
    parameters=parameters,
    objective_metric_name="loss",
    max_trial_count=12,
    resources_per_trial={"cpu": "2"},
)

# Wait for the trials to finish
katib_client.wait_for_experiment_condition(name=name)

print(katib_client.get_optimal_hyperparameters(name))
```
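Once the results have been collected, the experiment and its trial resources can be cleaned up through the same client:

```python
# Remove the finished experiment and its trials from the cluster.
katib_client.delete_experiment(name=name)
```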
Inside a Notebook
- Adjust the code to your needs
- Create a notebook via the KubeFlow dashboard
- Paste the code in a cell
- Run the code
Inside a pipeline
- Adjust the code to your needs
- Create the pipeline definition
```python
from kfp import dsl, compiler

@dsl.component
def launch_hyperparameter_tuning():
    # PASTE the code from above:
    # katib_client = katib.KatibClient(namespace=ns)...
    pass

@dsl.pipeline
def pipeline():
    launch_hyperparameter_tuning()

compiler.Compiler().compile(pipeline, package_path='pipeline.yaml')
```
- Compile the pipeline
- Create and run the pipeline according to the pipeline documentation
Ray
Ray Tune can be used to configure and run automated hyperparameter tuning within a RayCluster. You can either create a RayCluster beforehand and interact dynamically with it via the Ray SDK, or submit a RayJob, which first creates the cluster according to its spec, waits for it to become ready, and then submits the job to it.
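For the first option, connecting to an existing cluster from a notebook goes through the Ray client; a minimal sketch, where the head service name is a placeholder you need to substitute with your cluster's actual service:

```python
import ray

# Connect to a pre-existing RayCluster over the Ray client port (10001).
# "REPLACE_WITH_HEAD_SVC" is a placeholder for your cluster's head service.
ray.init("ray://REPLACE_WITH_HEAD_SVC:10001")
```

In both cases the tuning script itself looks the same, for example: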
```python
from ray import train, tune

def train_wrapper(config):
    batch_size = config["batch_size"]
    # Model is a placeholder for your own model class.
    model = Model(batch_size=batch_size)
    metrics = model.train()
    # The returned dict must contain the metric configured below.
    return metrics

def run_tuning():
    # Named run_tuning to avoid shadowing the imported `tune` module.
    tuner = tune.Tuner(
        train_wrapper,
        tune_config=tune.TuneConfig(num_samples=2, metric="mean_loss_per_epoch", mode="min"),
        run_config=train.RunConfig(
            verbose=3,
            name="HPO Experiment",
        ),
        param_space={
            "batch_size": tune.randint(10, 100),
        },
    )
    results = tuner.fit()
    return results

if __name__ == "__main__":
    run_tuning()
```
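After `tuner.fit()` returns, the best trial can be inspected on the returned `ResultGrid`; a short sketch reusing the metric name from the example above:

```python
results = run_tuning()

# Select the trial with the lowest reported mean_loss_per_epoch.
best = results.get_best_result(metric="mean_loss_per_epoch", mode="min")
print(best.config)   # the winning hyperparameters
print(best.metrics)  # the final metrics reported by that trial
```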
Ray Images
Ray provides base images for many Python versions. Remember to use the registry proxy cache to prevent rate limiting. For example, use `registry.cern.ch/docker.io/rayproject/ray:2.43.0-py39-gpu` to get Ray 2.43 and Python 3.9.
RayJob Manifest
A RayJob automatically creates a RayCluster and submits a job when the cluster is ready. You can also configure RayJob to automatically delete the RayCluster once the Ray job finishes.
- Prepare your image
- Build and push your image. Replace the placeholder with your username/project and image name.
- Adjust the manifest to your needs
```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: hpo
spec:
  entrypoint: REPLACE_WITH_ENTRYPOINT # e.g. python3 main.py --hpo
  rayClusterSpec:
    rayVersion: 'REPLACE_WITH_RAY_VERSION' # e.g. '2.43.0', should match the Ray version in the image of the containers
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
          labels:
            mount-eos: "true"
            inject-oauth2-token: "true"
        spec:
          containers:
            - name: ray-head
              image: REPLACE_WITH_YOUR_IMAGE
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "200m"
    workerGroupSpecs:
      - replicas: REPLACE_WITH_NUMBER_OF_WORKERS # e.g. 1
        minReplicas: 1
        maxReplicas: 2
        groupName: small-group
        rayStartParams: {}
        template:
          metadata:
            annotations:
              sidecar.istio.io/inject: "false"
            labels:
              mount-eos: "true"
              inject-oauth2-token: "true"
          spec:
            containers:
              - name: worker
                image: REPLACE_WITH_YOUR_IMAGE
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  limits:
                    cpu: "1"
                    nvidia.com/gpu: REPLACE_WITH_NUMBER_OF_GPUS_PER_WORKER # e.g. 1
                  requests:
                    cpu: "200m"
  submitterPodTemplate:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      restartPolicy: Never
      containers:
        - name: rayjob-submitter
          image: rayproject/ray:2.9.0
```
- Run `kubectl apply -f job.yaml` in a terminal (either in a Jupyter notebook or a VS Code server) to start the job
- Run `kubectl get rayjobs` to see the status of your job
- Run `kubectl get pods` to see the worker pods. After a while, the submitter pod will be created
- Run `kubectl logs -f REPLACE_WITH_SUBMITTER_POD_NAME` (e.g. `kubectl logs job-m44ns`) to follow the logs of your job