Best Practices
Keep the code and data on a persistent storage system
Containers are ephemeral, and notebook servers can disappear for various reasons.
It's best to keep code and data on persistent storage that can be easily accessed.
- Code - GitHub or GitLab
- Data - EOS or S3 storage
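
Data on S3 can be read and written programmatically from a notebook or training job. Below is a minimal sketch using boto3; the endpoint URL, bucket name, object keys, and credential environment variables are placeholders (assumptions, not part of this documentation) and would normally come from your own configuration or a mounted secret.

```python
# Minimal sketch of persisting artefacts to S3-compatible storage with boto3.
# Assumptions: endpoint, bucket, keys and credential variables are placeholders.
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ.get("S3_ENDPOINT", "https://s3.cern.ch"),  # assumed endpoint
    aws_access_key_id=os.environ["S3_ACCESS_KEY"],
    aws_secret_access_key=os.environ["S3_SECRET_KEY"],
)

# Upload a trained model checkpoint so it survives notebook/container restarts.
s3.upload_file("model.ckpt", "my-ml-bucket", "experiments/run-001/model.ckpt")

# Download it back later, e.g. in a freshly started notebook server.
s3.download_file("my-ml-bucket", "experiments/run-001/model.ckpt", "model.ckpt")
```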
Develop machine learning training to be scalable
When developing an ML script using TensorFlow or PyTorch, bear in mind potential distributed training.
If a model has many parameters and takes a long time to train on a single GPU, this is a use case for distributed training.
The best practice is to implement the model training so that it supports distribution, then prototype with 1 GPU.
With Kubeflow, the model training can then be horizontally scaled to multiple GPUs.
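
As a rough illustration, a PyTorch training loop can be written so that the same script runs unchanged on a single GPU and as a distributed job. The sketch below uses placeholder model, data, and hyperparameters (assumptions, not a prescribed template): it only enables DistributedDataParallel when the launcher provides a world size greater than one.

```python
# Sketch of a training loop that supports both single-GPU and distributed runs.
# Assumptions: placeholder model/data; WORLD_SIZE, RANK, LOCAL_RANK, MASTER_ADDR
# are expected to be set by the launcher (e.g. a PyTorchJob or torchrun).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    distributed = int(os.environ.get("WORLD_SIZE", "1")) > 1
    if distributed:
        dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    # Placeholder model and data; replace with the real ones.
    model = torch.nn.Linear(16, 1).to(device)
    if distributed:
        model = DDP(model, device_ids=[local_rank] if torch.cuda.is_available() else None)
    dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset) if distributed else None
    loader = DataLoader(dataset, batch_size=64, sampler=sampler,
                        shuffle=(sampler is None))

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()
    for epoch in range(5):
        if sampler is not None:
            sampler.set_epoch(epoch)  # reshuffle consistently across workers
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()

    if distributed:
        dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Run directly, this behaves as a single-GPU prototype; launched with multiple workers, the same code trains in parallel without changes.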
Containerised workloads
For best integration with Kubeflow components (pipelines, distributed training, hyperparameter optimization) it is useful to build Docker images around your ML workloads.
Additional benefits include:
- Reproducibility - easy to reproduce results
- Mobility - can run on any machine
- Fast deployment - run prebuilt environments within seconds
How to build Docker images?
Set up automated builds with GitLab CI so that a Docker image build is triggered on every push to the repository. The images are then stored in the CERN registries.
Examples:
- https://gitlab.cern.ch/ci-tools/docker-image-builder
- https://clouddocs.web.cern.ch/containers/registry/gitlab.html
Hyperparameter jobs
- Make sure the script runs on a single GPU in a notebook server
  - Make sure it uses a GPU and that it completes
- Make sure the script runs on a single GPU as a Katib job with 1 trial
- Then carefully expand to 2, 4, 10, 20 trials, and monitor closely
- Once sure it works, run a complete search
- Be aware that some combinations of hyperparameters might crash the script
  - Prepare the script for these edge cases (exception handling, etc.); see the sketch after this list
- Store metrics in a preferred format in two places
  - In the container storage, to be accessed by the UI
  - On EOS or S3, for persistent storage
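
Putting these points together, a Katib trial script might look like the following sketch. The hyperparameter names, the `val-loss` metric, and the output path are placeholders (assumptions); the metric name must match the one declared in your Katib Experiment, and the default stdout metrics collector typically parses lines of the form `name=value`.

```python
# Sketch of a Katib-friendly trial script with exception handling and
# metric persistence. Assumptions: hyperparameter names, metric name,
# and the /eos output path are placeholders.
import argparse
import json
import sys

def train(lr, batch_size):
    # Placeholder for the real training loop; returns a validation metric.
    return 1.0 / (lr * batch_size)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, required=True)
    parser.add_argument("--batch-size", type=int, required=True)
    args = parser.parse_args()

    try:
        val_loss = train(args.lr, args.batch_size)
    except (RuntimeError, ValueError) as err:
        # Some hyperparameter combinations (e.g. a batch size that no longer
        # fits in GPU memory) can crash the trial; fail gracefully instead.
        print(f"trial failed: {err}", file=sys.stderr)
        sys.exit(1)

    # 1) Print the metric so the stdout metrics collector can pick it up.
    print(f"val-loss={val_loss}")

    # 2) Also persist it to EOS (or S3) so it survives after the trial pod is gone.
    with open("/eos/user/j/jdoe/katib/metrics.json", "a") as f:  # placeholder path
        json.dump({"lr": args.lr, "batch_size": args.batch_size,
                   "val-loss": val_loss}, f)
        f.write("\n")

if __name__ == "__main__":
    main()
```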