Best Practices
Keep the code and data on a persistent storage system
Containers are ephemeral, and notebook servers can disappear for various reasons.
It's best to keep code and data on persistent storage that can be easily accessed.
- Code - GitHub or GitLab
- Data - EOS or S3 storage
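
Data on S3 can be read and written programmatically from a notebook or training job. Below is a minimal sketch using boto3; the endpoint URL, bucket name, object keys, and credential environment variables are placeholders (assumptions, not part of this documentation) and would normally come from your own configuration or a mounted secret.

```python
# Minimal sketch of persisting artefacts to S3-compatible storage with boto3.
# Assumptions: endpoint, bucket, keys and credential variables are placeholders.
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ.get("S3_ENDPOINT", "https://s3.cern.ch"),  # assumed endpoint
    aws_access_key_id=os.environ["S3_ACCESS_KEY"],
    aws_secret_access_key=os.environ["S3_SECRET_KEY"],
)

# Upload a trained model checkpoint so it survives notebook/container restarts.
s3.upload_file("model.ckpt", "my-ml-bucket", "experiments/run-001/model.ckpt")

# Download it back later, e.g. in a freshly started notebook server.
s3.download_file("my-ml-bucket", "experiments/run-001/model.ckpt", "model.ckpt")
```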
Develop machine learning training to be scalable
When developing an ML script using TensorFlow or PyTorch, bear in mind potential distributed training.
If a model has many parameters and takes a long time to train on a single GPU, this is a use case for distributed training.
The best practice is to implement the model training so that it supports distribution, then prototype with 1 GPU.
With Kubeflow, the model training can then be horizontally scaled to multiple GPUs.
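
As a rough illustration, a PyTorch training loop can be written so that the same script runs unchanged on a single GPU and as a distributed job. The sketch below uses placeholder model, data, and hyperparameters (assumptions, not a prescribed template): it only enables DistributedDataParallel when the launcher provides a world size greater than one.

```python
# Sketch of a training loop that supports both single-GPU and distributed runs.
# Assumptions: placeholder model/data; WORLD_SIZE, RANK, LOCAL_RANK, MASTER_ADDR
# are expected to be set by the launcher (e.g. a PyTorchJob or torchrun).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    distributed = int(os.environ.get("WORLD_SIZE", "1")) > 1
    if distributed:
        dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    # Placeholder model and data; replace with the real ones.
    model = torch.nn.Linear(16, 1).to(device)
    if distributed:
        model = DDP(model, device_ids=[local_rank] if torch.cuda.is_available() else None)
    dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset) if distributed else None
    loader = DataLoader(dataset, batch_size=64, sampler=sampler,
                        shuffle=(sampler is None))

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()
    for epoch in range(5):
        if sampler is not None:
            sampler.set_epoch(epoch)  # reshuffle consistently across workers
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()

    if distributed:
        dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Run directly, this behaves as a single-GPU prototype; launched with multiple workers, the same code trains in parallel without changes.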
Containerised workloads
For best integration with Kubeflow components (pipelines, distributed training, hyperparameter optimization) it is useful to build Docker images around your ML workloads.
Additional benefits include:
- Reproducibility - easy to reproduce results
- Mobility - can run on any machine
- Fast deployment - run prebuilt environments within seconds
How to build Docker images?
Set up automated builds with GitLab CI so that a Docker image build is triggered on every push to the repository. The images are then stored in the CERN registries.
Examples:
- https://gitlab.cern.ch/ci-tools/docker-image-builder
- https://clouddocs.web.cern.ch/containers/registry/gitlab.html
Hyperparameter jobs
- Make sure the script runs on a single GPU in a notebook server
  - Make sure it uses a GPU and that it completes
- Make sure the script runs on a single GPU as a Katib job with 1 trial
- Then carefully expand to 2, 4, 10, 20 trials, and monitor closely
- Once sure it works, run a complete search
- Be aware that some combinations of hyperparameters might crash the script
  - Prepare the script for these edge cases (exception handling, etc.); see the sketch after this list
- Store metrics in a preferred format in two places
  - In the container storage, to be accessed by the UI
  - On EOS or S3, for persistent storage
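
Putting these points together, a Katib trial script might look like the following sketch. The hyperparameter names, the `val-loss` metric, and the output path are placeholders (assumptions); the metric name must match the one declared in your Katib Experiment, and the default stdout metrics collector typically parses lines of the form `name=value`.

```python
# Sketch of a Katib-friendly trial script with exception handling and
# metric persistence. Assumptions: hyperparameter names, metric name,
# and the /eos output path are placeholders.
import argparse
import json
import sys

def train(lr, batch_size):
    # Placeholder for the real training loop; returns a validation metric.
    return 1.0 / (lr * batch_size)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, required=True)
    parser.add_argument("--batch-size", type=int, required=True)
    args = parser.parse_args()

    try:
        val_loss = train(args.lr, args.batch_size)
    except (RuntimeError, ValueError) as err:
        # Some hyperparameter combinations (e.g. a batch size that no longer
        # fits in GPU memory) can crash the trial; fail gracefully instead.
        print(f"trial failed: {err}", file=sys.stderr)
        sys.exit(1)

    # 1) Print the metric so the stdout metrics collector can pick it up.
    print(f"val-loss={val_loss}")

    # 2) Also persist it to EOS (or S3) so it survives after the trial pod is gone.
    with open("/eos/user/j/jdoe/katib/metrics.json", "a") as f:  # placeholder path
        json.dump({"lr": args.lr, "batch_size": args.batch_size,
                   "val-loss": val_loss}, f)
        f.write("\n")

if __name__ == "__main__":
    main()
```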