Pipelines
Machine Learning Pipelines define workflows as a directed graph in which each step runs independently and data is passed between the steps.
Pipelines provide more flexibility for defining dependencies between components than classic scripts or notebooks.
For example, a single workflow could create and prepare the dataset, run hyperparameter tuning, and perform distributed training.
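As a rough illustration of how such a graph is expressed, the sketch below (hypothetical component names, KubeFlow Pipelines v2 syntax assumed) defines two components whose data dependency determines the execution order:

from kfp import dsl

@dsl.component
def prepare_dataset() -> str:
    # A real component would return, e.g., a path to the prepared data
    return "/data/prepared"

@dsl.component
def train_model(dataset_path: str):
    print(f"Training on {dataset_path}")

@dsl.pipeline(name="example-dag")
def example_pipeline():
    prepared = prepare_dataset()
    # Consuming the output of prepare_dataset makes train_model depend on it,
    # which defines the edge in the directed graph
    train_model(dataset_path=prepared.output)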
If you need access to data, see the storage documentation on how to access different storage backends.
Recurrent Execution
Sometimes a machine learning job needs to run on a recurring schedule, for example training a model on new data every day, periodically checking the availability of services, or monitoring the usage of a specific system. To do so, a recurrent pipeline can be submitted, specifying various pipeline triggering options such as the period between two pipeline runs, start and end date/time, or a cron expression.
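A recurring run can also be created from the SDK. The sketch below is a minimal example, assuming a compiled pipeline.yaml and the ID of an existing Experiment; the cron expression and names are placeholders:

from kfp.client import Client

client = Client()

# Trigger the pipeline every day at 03:00 (Kubeflow cron expressions include a
# seconds field); interval_second, start_time and end_time can be used instead
client.create_recurring_run(
    experiment_id="<experiment-id>",
    job_name="daily-training",
    cron_expression="0 0 3 * * *",
    pipeline_package_path="pipeline.yaml",
)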
Pipeline Parameters
Every pipeline can have input values that can change each time the pipeline is triggered.
For example, input data paths or different hyperparameter options can be supplied as pipeline parameters rather than hard-coded.
By default, pipelines are run without any parameters.
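A pipeline declares its parameters as arguments of the pipeline function, with defaults used when no value is supplied. The sketch below is illustrative only; the parameter names mirror the arguments used in the SDK example further down:

from kfp import dsl

@dsl.component
def train(neighbors: list, standard_scaler: bool, min_max_scaler: bool):
    print(neighbors, standard_scaler, min_max_scaler)

@dsl.pipeline(name="parameterized-pipeline")
def parameterized_pipeline(
    min_max_scaler: bool = True,
    neighbors: list = [3, 5, 9],
    standard_scaler: bool = False,
):
    # Pipeline inputs are forwarded to the component and can be overridden for
    # every run
    train(
        neighbors=neighbors,
        standard_scaler=standard_scaler,
        min_max_scaler=min_max_scaler,
    )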
Pipeline Creation
A pipeline is defined as Python code and can be executed in two ways.
First, it can be compiled to a YAML file, which is then submitted to the platform for execution (see the example below). Second, it can be submitted via the Python SDK (see the example below), which can be used from an interactive session such as a Jupyter notebook or even from inside another pipeline.
This example shows how to run a simple classification pipeline on the Iris dataset. The pipeline consists of three steps: creating the dataset, normalizing it, and training the model. See the KubeFlow Documentation for more information.
A full example including data loading, training and serving a model can be found here.
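For illustration, a minimal version of such an Iris pipeline could look like the sketch below; the component bodies, installed packages and parameter names are assumptions rather than the exact linked example:

from kfp import dsl

@dsl.component(packages_to_install=["scikit-learn", "pandas"])
def create_dataset(iris_csv: dsl.Output[dsl.Dataset]):
    from sklearn.datasets import load_iris
    load_iris(as_frame=True).frame.to_csv(iris_csv.path, index=False)

@dsl.component(packages_to_install=["scikit-learn", "pandas"])
def normalize_dataset(iris_csv: dsl.Input[dsl.Dataset], normalized_csv: dsl.Output[dsl.Dataset]):
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    df = pd.read_csv(iris_csv.path)
    features = df.drop(columns=["target"])
    df[features.columns] = StandardScaler().fit_transform(features)
    df.to_csv(normalized_csv.path, index=False)

@dsl.component(packages_to_install=["scikit-learn", "pandas"])
def train_model(normalized_csv: dsl.Input[dsl.Dataset], neighbors: int):
    import pandas as pd
    from sklearn.neighbors import KNeighborsClassifier
    df = pd.read_csv(normalized_csv.path)
    KNeighborsClassifier(n_neighbors=neighbors).fit(df.drop(columns=["target"]), df["target"])

@dsl.pipeline(name="iris-pipeline")
def iris_pipeline(neighbors: int = 3):
    dataset = create_dataset()
    normalized = normalize_dataset(iris_csv=dataset.outputs["iris_csv"])
    train_model(normalized_csv=normalized.outputs["normalized_csv"], neighbors=neighbors)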
Using the Pipeline SDK
You can use the KubeFlow SDK to define and manage pipeline runs. With it, you can start and stop pipeline runs from within a notebook, VS Code, or even another pipeline.
from kfp.client import Client

client = Client()

# Read pipeline definition from file
run = client.create_run_from_pipeline_package(
    'pipeline.yaml',
    arguments={
        'min_max_scaler': True,
        'neighbors': [3, 5, 9],
        'standard_scaler': False,
    },
)
from kfp import dsl
from kfp.client import Client

@dsl.component
def prepare_dataset():
    pass

@dsl.component
def train_model():
    pass

@dsl.pipeline(
    name="End to End Pipeline",
    description="An end to end mnist example including hyperparameter tuning, train and inference"
)
def mnist_pipeline():
    prepare_dataset()
    train_model()

client = Client()
namespace = "<your-namespace>"  # your Kubeflow namespace
run_id = client.create_run_from_pipeline_func(mnist_pipeline, namespace=namespace, arguments={}).run_id
print("Run ID:", run_id)
Using the KubeFlow Dashboard
- Compile your pipeline by adding the compilation snippet (a minimal sketch is shown after this list) and running
python pipeline.py
- Navigate to https://ml.cern.ch/_/pipeline/
- Click Upload Pipeline
- Add a Pipeline name
- Upload the compiled YAML file
- Click Create
- Click Create Run
- Select a Run name
- Select the Experiment in which to run the pipeline. An experiment holds multiple runs of the same pipeline and makes it easier to track them. If needed, create an Experiment first.
- Select whether the pipeline runs once or is recurrent
- If recurrent, select the pipeline trigger options (period, start and end date/time)
- Select the Pipeline Parameters
- Click Start
- Click on the name of the running pipeline (the Run name)
- Track the progress of the pipeline
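A minimal compilation sketch, assuming the pipeline function from the SDK example above (mnist_pipeline) is defined in pipeline.py; the output file name is just a placeholder:

from kfp import compiler

# Writes a YAML pipeline definition that can be uploaded via the dashboard
compiler.Compiler().compile(
    pipeline_func=mnist_pipeline,
    package_path="pipeline.yaml",
)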