Storage
Several backends can be used for storing and accessing data. They can be used to access training and test data and to store metadata like parameters and metrics, as well as artifacts like models, plots and logs. The provided backends include EOS, object storage and shared filesystem storage. See below for information about the access methods and their pros and cons.
EOS
EOS provides a service for storing large amounts of physics data and user files.
By default, EOS is mounted as the user home directory in notebooks for personal profiles, with authentication handled automatically. Team profiles can make use of EOS projects for shared storage across users. EOS can also be accessed from within pipelines and training jobs via the /eos directory.
In order to access EOS from pipelines and training jobs, the label inject-oauth2-token-pipeline: "true" has to be added to the respective manifest. See the example below for a TensorFlow training job:
apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: tfjob
  labels:
    inject-oauth2-token-pipeline: "true"
spec:
  ...
Container User
The user in your workload container has to be root in order to automatically authenticate to EOS. If you do not specify the user, e.g. via USER user in your Dockerfile, root will be used by default.
Pro
EOS provides reliable storage and should be used when long term storage is needed, e.g. for user code or large datasets.
Object Storage
Object storage is a data storage architecture that manages and stores data as objects, rather than as files in a hierarchical file system or as blocks in a block storage system. Each object typically consists of the data itself, metadata and a unique identifier. Every object belongs to a bucket, which forms the basic container for data. Buckets can be used to organize data and control access, but cannot be nested. Implementations are usually compatible with the Amazon S3 protocol. Python code to upload a file to an existing bucket is provided below:
import boto3

# Credentials are read automatically, e.g. from ~/.aws/credentials (see below)
client = boto3.client('s3', endpoint_url='https://s3.cern.ch')
# Upload a local file to an existing bucket under the given object key
client.upload_file(LOCAL_FILEPATH, BUCKET_NAME, KEY)
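Downloading or listing objects works analogously through the same client; a short sketch, where the placeholder names follow the upload example above:
# Download the object back to a local path
client.download_file(BUCKET_NAME, KEY, LOCAL_TARGET_PATH)

# List the objects currently stored in the bucket
response = client.list_objects_v2(Bucket=BUCKET_NAME)
for obj in response.get('Contents', []):
    print(obj['Key'])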
Cluster Internal
KubeFlow provides an internal object storage system based on Minio, a Kubernetes-native object store commonly used in cloud computing. Its Python API reference can be found here.
It is suggested to use the cluster-internal object storage as a starting point before moving to more persistent solutions, such as CERN's managed object storage service described below.
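As a sketch of how the Minio Python client can be used against the in-cluster service: the endpoint, access key, secret key and bucket name below are placeholders, not the actual in-cluster values.
from minio import Minio

# Placeholder endpoint and credentials for the in-cluster Minio service
client = Minio("MINIO_ENDPOINT:9000",
               access_key="ACCESS_KEY",
               secret_key="SECRET_KEY",
               secure=False)

# Create the bucket if it does not exist yet, then upload a local file
if not client.bucket_exists("my-bucket"):
    client.make_bucket("my-bucket")
client.fput_object("my-bucket", "object-name", "/path/to/local/file")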
Pro
Easy to setup and available for personal profiles.
Con
Due to the ephemeral nature of containerized systems, data stored in Minio buckets can be lost. In addition, the in-cluster Minio buckets are accessible to all users.
Managed Service by CERN
s3.cern.ch offers a managed object storage system. Buckets created on s3.cern.ch can be accessed from anywhere, provided the credentials are in place. A guide on how to get started with s3.cern.ch and how to generate the credentials is provided here.
Credentials
Never hard-code credentials in your code because you might expose them to the public. For S3, you can put the credentials in the file ~/.aws/credentials; the S3 client will use them automatically.
Example content (the key values below are placeholders for your generated credentials):
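[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY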
Pro
Isolated storage with access control.
Con
CERN makes no additional backups and there is no provision within the s3.cern.ch service for disaster recovery. Users are therefore responsible for maintaining independent backups of their objects where they judge it important. CERN's object storage requires an explicit quota request and is only available for shared projects in OpenStack.
Shared Filesystem Storage
Shared filesystem storage is provided by CephFS and can be provisioned by creating a Kubernetes Persistent Volume in the KubeFlow dashboard. The volume size needs to be defined in GB and an appropriate storage class needs to be specified. We recommend using the storage class manila-meyrin-cephfs. Personal profiles have a quota of 30 GB for volumes.
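If you provision the volume via a manifest rather than the dashboard form, a PersistentVolumeClaim along the following lines could be used; the name, namespace and size are placeholders, only the storage class comes from the recommendation above.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-volume
  namespace: my-namespace
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: manila-meyrin-cephfs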
Pro
Kubernetes volumes can be integrated into Kubernetes-native workloads by referencing the volume in the manifest, as shown in the sketch below. This can be helpful when the data needs to be shared with other workloads, e.g. when running the KubeFlow inference service. See the model storage documentation for detailed information.
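As an illustration of what referencing the volume looks like, a pod template could mount the claim as follows; the claim name, container image and mount path are placeholders.
spec:
  containers:
    - name: training
      image: my-image
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: my-volume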
Con
No additional backups are done for volumes and there is no provision for disaster recovery at the moment.