# Instructions for creating a reusable AML pipeline using `shrike.pipeline`
To enjoy this doc, you need to:
- have already set up your Python environment with the AML SDK following these instructions, and have cloned the accelerator repository as described in the "Set up" section here;
- have access to an AML workspace.
## Motivation
The Azure ML pipeline helper class `AMLPipelineHelper` in the `shrike` package was developed to help data scientists more easily create reusable pipelines. These instructions explain how to use it.
## 1. Review an existing Azure ML pipeline created using the Azure ML pipeline helper class
The accelerator repository already has examples of pipelines created using the pipeline helper class. We will now get an overview of the structure of the two most important directories (`components` and `pipelines`, under `aml-ds/recipes/compliant-experimentation`) and go over the key files defining these pipelines.
1.1 "components" directory
This is where the components are defined, one folder per component. Each folder contains the following files:

- `component_spec.yaml`: this is where the component's inputs, outputs, and parameters are defined (see the sketch below). This is the Azure ML equivalent of the component manifest in Æther.
- `component_env.yaml`: this is where the component dependencies are listed (not required for HDI components).
- `run.py`: this is the Python file actually run in Azure ML; in most cases, it just imports a Python file from elsewhere in the repo.
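To make the spec format concrete, here is a minimal, hypothetical `component_spec.yaml` for a component like `probe`. All names and values below are illustrative assumptions (the layout follows the `CommandComponent` format used by the Azure ML component SDK); refer to the actual specs in the accelerator repository for authoritative contents.

```yaml
# Hypothetical component spec sketch; every name and value here is illustrative.
$schema: http://azureml/sdk-2-0/CommandComponent.json
name: probe
version: 0.0.1
type: CommandComponent
description: Demo component that probes its input dataset.
inputs:
  input_data:
    type: path
    description: dataset to probe
outputs:
  results:
    type: path
    description: where the probe writes its results
command: >-
  python run.py --input_data {inputs.input_data} --results {outputs.results}
environment:
  conda:
    conda_dependencies_file: component_env.yaml  # the dependency file described above
  docker:
    image: <base-image-reachable-from-your-workspace>
```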
Further reading on components is available here.
1.2 "pipelines" directory
This is where the graphs, a.k.a. pipelines, are defined. Here is what you will find in its subdirectories:

- The `config` directory contains the config files holding the parameter values, organized in four sub-folders: `experiments`, which contains the overall graph configs; `aml` and `compute`, which contain auxiliary config files referred to in the graph configs; and `modules`, which hosts the file where the components are defined (by their key, name, default version, and location of the component specification file). Once you have created new components, you will need to add them to that file.
- The `subgraphs` directory contains Python files that define graphs that are not meant to be used on their own, but as parts of larger graphs. There is a demo subgraph available there, which consists of 2 `probe` components chained one after the other.
- The `experiments` directory contains the Python files which actually define the graphs.
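Putting the above together, the directory layout looks roughly like this (only files mentioned in this doc are shown):

```text
pipelines/
├── config/
│   ├── aml/          # auxiliary configs (e.g. eyesoff.yaml, eyeson.yaml)
│   ├── compute/      # auxiliary configs (e.g. eyesoff.yaml, eyeson.yaml)
│   ├── experiments/  # overall graph configs (e.g. demograph_eyesoff.yaml)
│   └── modules/      # file defining the components (key, name, version, spec path)
├── experiments/      # python files that actually define the graphs
│   └── demograph_eyesoff.py
└── subgraphs/        # graphs meant to be used as parts of larger graphs
```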
Now let's take a closer look at the definition of a graph in Python. We will stick with the demo graph for eyes-off and open the `demograph_eyesoff.py` file in the `experiments` folder. The key parts are listed below.
- The `required_subgraphs()` function (line 37, also shown below) defines the subgraphs that are used in the graph.
```python
# line 37
@classmethod
def required_subgraphs(cls):
    """Declare dependencies on other subgraphs to allow AMLPipelineHelper to build them for you.

    This method should return a dictionary:
    - Keys will be used in self.subgraph_load(key) to build each required subgraph.
    - Values are classes inheriting from AMLPipelineHelper.

    Returns:
        dict[str->AMLPipelineHelper]: dictionary of subgraphs used for building this one.
    """
    return {"DemoSubgraph": DemoSubgraph}
```
- The `build()` function, well, builds the graph.
    - First, the required subgraph is loaded in line 62: `probe_subgraph = self.subgraph_load("DemoSubgraph")`.
    - Then we define a pipeline function for the graph starting at line 70. This is where all the components and subgraphs are given their parameters and inputs. Note how the parameter values are read from the config files. To see how the outputs of some components can be used as inputs of the following components, see here in the subgraph Python file.
    - `subgraph_load()` can take an additional `custom_config` (`DictConfig`) argument. All params in this argument will be added to the pipeline config, overwriting existing values. This is particularly useful when one wants to manipulate different instances of the subgraph with conditionals and other variables that need to be evaluated at build time.
    - For the time being, we have to manually apply run settings to every component; in the future, this will no longer be necessary. For the current example, it is also done in the subgraph Python file, by calling the `apply_recommended_runsettings()` function.
- The `pipeline_instance()` function creates a runnable instance of the pipeline.
    - The input dataset is defined in lines 104-107, by calling the `dataset_load()` function with the name and version values provided in the config file.
    - The pipeline function is then called with the input data as argument.

A hypothetical sketch assembling these methods is shown after this list.
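The sketch below shows how `required_subgraphs()`, `build()`, and `pipeline_instance()` fit together in one graph class. The class name, import paths, pipeline name, and config keys (`config.demograph.input.*`) are assumptions for illustration, not the actual contents of `demograph_eyesoff.py`; open the real file for the authoritative code.

```python
# Hypothetical sketch of a graph class built on shrike's AMLPipelineHelper;
# names, import paths, and config keys are illustrative.
from azure.ml.component import dsl
from shrike.pipeline import AMLPipelineHelper

from subgraphs.demo_subgraph import DemoSubgraph  # hypothetical import path


class DemoGraphEyesOff(AMLPipelineHelper):
    @classmethod
    def required_subgraphs(cls):
        # Keys are the names passed to self.subgraph_load(key);
        # values are classes inheriting from AMLPipelineHelper.
        return {"DemoSubgraph": DemoSubgraph}

    def build(self, config):
        # Load the subgraph declared in required_subgraphs() (line 62 in the real file).
        probe_subgraph = self.subgraph_load("DemoSubgraph")

        # The pipeline function (line 70 in the real file) wires components and
        # subgraphs together; parameter values are read from the config files.
        @dsl.pipeline(name="demo_graph_eyesoff")  # hypothetical pipeline name
        def demo_pipeline(input_data):
            subgraph_step = probe_subgraph(input_data=input_data)
            return subgraph_step.outputs

        return demo_pipeline

    def pipeline_instance(self, pipeline_function, config):
        # Resolve the input dataset (lines 104-107 in the real file) from the
        # name/version values provided in the config file.
        input_data = self.dataset_load(
            name=config.demograph.input.name,       # hypothetical config keys
            version=config.demograph.input.version,
        )
        # Call the pipeline function with the input data as argument.
        return pipeline_function(input_data=input_data)


if __name__ == "__main__":
    # Standard shrike entry point: parses the Hydra config and, if
    # run.submit=True, submits the pipeline to the workspace.
    DemoGraphEyesOff.main()
```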
Next, let's open the `demograph_eyesoff.yaml` config file under the `pipelines/config/experiments` directory, and note how the other config files are referenced, and how the parameters are organized in sections. We also explain config files in more detail on this page: Configure your pipeline.
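As an illustration, an experiment config of this kind typically has a skeleton like the one below. The `defaults` list is standard Hydra syntax for pulling in the auxiliary config files; the section names and values here are assumptions, so treat the real `demograph_eyesoff.yaml` as the reference.

```yaml
# Illustrative skeleton only; not the actual contents of demograph_eyesoff.yaml.
defaults:
  - aml: eyesoff        # -> pipelines/config/aml/eyesoff.yaml
  - compute: eyesoff    # -> pipelines/config/compute/eyesoff.yaml

run:
  submit: false         # overridden with run.submit=True on the command line

# One section per component/subgraph, holding its parameter values.
demograph:              # hypothetical section name
  input:
    name: <registered-dataset-name>
    version: latest
```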
Finally, below is the command to run this existing pipeline (a very basic demo pipeline):

```bash
python pipelines/experiments/demograph_eyesoff.py --config-dir pipelines/config --config-name experiments/demograph_eyesoff run.submit=True
```
## 2. Create your own simple Azure ML pipeline using the pipeline helper class and an already existing component
In this section, we will create a pipeline graph consisting of a single component called `probe`, which is readily available in the accelerator repository. We will pass the parameters through a config file.
Procedure:

1. For creating your own pipeline, we invite you to start from an already existing pipeline definition such as `demograph_eyeson.py` and build from there. Just copy `demograph_eyeson.py`, rename it as `demograph_workshop.py`, update the contents accordingly, and put it under the same directory (i.e., `pipelines/experiments`). The important parts to modify in this file are those listed in the section on key files above: `build()` and `pipeline_instance()` (since we won't be using a subgraph, we don't need to worry about the `required_subgraphs` part).
2. To prepare the YAML config file, start from an existing example, such as `demograph_eyeson.yaml`. Just copy `demograph_eyeson.yaml`, rename it as `demograph_eyeson_workshop.yaml`, update the contents accordingly, and put it under the same directory (i.e., `pipelines/config/experiments`). The important parts are defining the component parameter values, and declaring that we want to use the local version of the `probe` component (argument `use_local`); see the config fragment below.

   > Note: you will also need to update two auxiliary config files (the `eyesoff.yaml`/`eyeson.yaml` file under directory `pipelines/config/aml`, and `eyesoff.yaml`/`eyeson.yaml` under directory `pipelines/config/compute`), referenced by this main config file `demograph_eyeson.yaml`, to point to the Azure ML workspace and compute targets to which you have access.
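For the `use_local` part, the fragment below shows the general idea. The `module_loader.use_local` key follows shrike's config conventions as best understood here, and the `probe` parameter section is a placeholder; check the existing config files in the repo for the exact layout.

```yaml
# Illustrative fragment of demograph_eyeson_workshop.yaml.
module_loader:
  use_local: "probe"   # resolve probe from its local spec instead of a registered version

probe:                 # hypothetical section holding the component's parameter values
  verbose: true
```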
And now you should be able to run your pipeline using the following command (note that the file and config names point to the workshop copies created above):

```bash
python pipelines/experiments/demograph_workshop.py --config-dir pipelines/config --config-name experiments/demograph_eyeson_workshop run.submit=True
```
If you are using an eyes-on workspace, you will also need to update the base image info in `component_spec.yaml`, since only eyes-off workspaces can connect to the polymer prod ACR which hosts the base image.
When a parameter is not specified in the config file, you need to prefix it with `+` when overriding it directly from the command line; otherwise, Hydra raises an error. For example, if `run.submit` is not in the config file, you need to use:

```bash
python pipelines/experiments/demograph_eyesoff.py --config-dir pipelines/config --config-name experiments/demograph_eyesoff +run.submit=True
```
Please refer to Hydra override syntax for more info.