
Example 2 - Consume a dataset

Problem statement

Submit a single-component pipeline which consumes a dataset and counts the number of records. Here, we also use the compliant logger to log various properties of the dataset consumed by the component (such as the number of records or the average of a numerical field).

Motivation

The goal of this problem is to get you familiar with reading datasets as component inputs, and with safely logging non-sensitive messages using shrike.compliant_logging. The latter is especially helpful in eyes-off environments.
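
To make this concrete, here is a minimal sketch of what such a component script could look like. The script structure, the --input_data argument name, and the CSV-reading logic are illustrative assumptions rather than the actual example code; the compliant-logging calls follow the pattern documented for shrike.compliant_logging, and the logger name matches the prefix you will look for in the driver log later.

```python
# Minimal sketch of a component script that counts rows and logs compliantly.
# Argument names and file-reading details are illustrative, not the real example code.
import argparse
import glob
import logging
import os

import pandas as pd
from shrike.compliant_logging import DataCategory, enable_compliant_logging


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--input_data", type=str, required=True,
        help="Path to the mounted FileDataset folder",
    )
    args = parser.parse_args()

    # Route all log records through the compliant logging handler.
    enable_compliant_logging()
    log = logging.getLogger("contoso.count_rows_and_log_script")

    # A FileDataset input is mounted as a folder; read every CSV inside it.
    csv_files = glob.glob(os.path.join(args.input_data, "**", "*.csv"), recursive=True)
    total_rows = 0
    for path in csv_files:
        total_rows += len(pd.read_csv(path))

    # Only aggregate, non-sensitive values are logged, and they are explicitly
    # marked as PUBLIC so they appear in the eyes-off driver log.
    log.info(f"Number of records: {total_rows}", category=DataCategory.PUBLIC)


if __name__ == "__main__":
    main()
```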

Configure the example

Create a dataset of type FileDataset (as opposed to TabularDataset). This can be done by following these instructions. You can create the dataset from a local csv file on your machine, this iris.csv file for instance (download it first from the link above, then upload it to your workspace as you create the dataset).
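
If you prefer the Python SDK over the studio UI, the sketch below shows one way this could be done with azureml-core (v1). The datastore target path and the dataset name iris_file_dataset are placeholders I chose for illustration; substitute whatever name you then reference in the experiment config.

```python
# Sketch: register a FileDataset from a local iris.csv with the azureml-core (v1) SDK.
# The target path and dataset name are placeholders; adjust them to your setup.
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()  # expects a local config.json for your workspace
datastore = ws.get_default_datastore()

# Upload the local file to the workspace's default datastore.
datastore.upload_files(
    files=["./iris.csv"],
    target_path="iris_example/",
    overwrite=True,
)

# Create a FileDataset pointing at the uploaded file and register it by name.
iris_dataset = Dataset.File.from_files(path=(datastore, "iris_example/iris.csv"))
iris_dataset.register(workspace=ws, name="iris_file_dataset")
```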

Then, update the value of the input_data parameter (i.e., the name of your dataset) in the experiment configuration file ./examples/pipelines/config/experiments/demo_count_rows_and_log.yaml.
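
For orientation only, the relevant part of that config might look roughly like the hypothetical excerpt below; the surrounding keys depend on how the real demo_count_rows_and_log.yaml is laid out, so only change the input_data value in the actual file.

```yaml
# Hypothetical excerpt -- the structure of the real config file may differ.
demo_count_rows_and_log:
  input_data: "iris_file_dataset"  # replace with the name of your registered dataset
```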

Submit the example

To submit your experiment with the parameter value defined in the config file, just run the command shown at the top of the experiment Python file.

```
python pipelines/experiments/demo_count_rows_and_log.py \
  --config-dir pipelines/config \
  --config-name experiments/demo_count_rows_and_log \
  run.submit=True
```

Check the logs

Once your experiment has executed successfully, click on the component, then on "Outputs + logs". In the driver log (usually called "70_driver_log.txt"), you should see your log lines prefixed with INFO:contoso.count_rows_and_log_script:.

A successful run of the experiment can be found here. (This is mostly for internal use, as you likely will not have access to that workspace.)