Data
Concepts#
AzureML provides two basic assets for working with data:
- Datastore
- Dataset
Datastore#
Provides an interface for numerous Azure Machine Learning storage accounts.
Each Azure ML workspace comes with a default datastore:
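For example, a minimal sketch assuming a workspace config file is available locally:

```python
from azureml.core import Workspace

ws = Workspace.from_config()                 # load workspace from config.json
datastore = ws.get_default_datastore()      # the workspace's default datastore
```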
This default datastore can also be accessed directly from the Azure Portal (under the same resource group as your Azure ML Workspace).
Datastores are attached to workspaces and store the connection information for Azure storage services, so you can refer to them by name rather than remembering the connection details and secrets used to connect.
Use the Datastore class to perform management operations, including registering, listing, getting, and removing datastores.
Dataset#
A dataset is a reference to data - either in a datastore or behind a public URL.
Datasets provide enhanced capabilities including data lineage (with the notion of versioned datasets).
Get Datastore#
Default datastore#
Each workspace comes with a default datastore.
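Assuming ws is your Workspace object:

```python
datastore = ws.get_default_datastore()
```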
Register datastore#
Connect to, or create, a datastore backed by one of the multiple data-storage options that Azure provides. For example:
- Azure Blob Container
- Azure Data Lake (Gen1 or Gen2)
- Azure File Share
- Azure MySQL
- Azure PostgreSQL
- Azure SQL
- Azure Databricks File System
See the SDK for a comprehensive list of datastore types and authentication options: Datastores (SDK).
Register a new datastore#
To register a store via an account key:
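A sketch using Datastore.register_azure_blob_container with a blob container as the backing store (all names and the key are placeholders):

```python
from azureml.core import Datastore

datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name='<datastore-name>',
    container_name='<container-name>',
    account_name='<account-name>',
    account_key='<account-key>',
)
```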
To register a store via a SAS token:
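Again assuming a blob container as the backing store, with a SAS token in place of the account key:

```python
datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name='<datastore-name>',
    container_name='<container-name>',
    account_name='<account-name>',
    sas_token='<sas-token>',
)
```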
Connect to datastore#
The workspace object ws has access to its datastores via ws.datastores, a dictionary mapping datastore names to Datastore objects.
Any datastore that is registered to the workspace can thus be accessed by name:
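```python
datastore = ws.datastores['<name-of-datastore>']
```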
Link datastore to Azure Storage Explorer#
The workspace object ws is a very powerful handle when it comes to managing assets the
workspace has access to. For example, we can use the workspace to connect to a datastore
in Azure Storage Explorer.
For a datastore that was created using an account key we can use:
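A sketch, assuming a blob datastore that exposes its connection details as attributes:

```python
datastore = ws.datastores['<name-of-datastore>']
account_name, account_key = datastore.account_name, datastore.account_key
```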
For a datastore that was created using a SAS token we can use:
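Similarly, assuming the SAS token is exposed on the datastore object:

```python
sas_token = datastore.sas_token
```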
The account_name and account_key can then be used directly in Azure Storage Explorer to connect to the Datastore.
Blob Datastore#
Move data to and from your AzureBlobDatastore object datastore.
Upload to Blob Datastore#
The AzureBlobDatastore provides APIs for data upload:
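For example, to upload a local directory (paths are placeholders):

```python
datastore.upload(
    src_dir='./data',                    # local directory to upload
    target_path='<path/on/datastore>',   # destination path on the datastore
    overwrite=True,
)
```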
Alternatively, if you are working with multiple files in different locations, you can use upload_files:
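A sketch (the file paths are placeholders):

```python
datastore.upload_files(
    files=['./data/a.txt', './other/b.txt'],  # explicit list of local files
    target_path='<path/on/datastore>',
    overwrite=True,
)
```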
Download from Blob Datastore#
Download the data from the blob container to the local file system.
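For example (paths are placeholders):

```python
datastore.download(
    target_path='./data',                # local directory to download to
    prefix='<path/on/datastore>',        # only download blobs under this prefix
    overwrite=True,
)
```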
Via Storage Explorer#
Azure Storage Explorer is a free tool to easily manage your Azure cloud storage resources from Windows, macOS, or Linux. Download it from here.
Azure Storage Explorer gives you a (graphical) file explorer, so you can drag and drop files into and out of your datastores.
See "Link datastore to Azure Storage Explorer" above for more details.
Read from Datastore#
Reference data in a Datastore in your code, for example to use in a remote setting.
DataReference#
First, connect to your basic assets: Workspace, ComputeTarget and Datastore.
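For example (the compute target and datastore names are placeholders):

```python
from azureml.core import Workspace

ws = Workspace.from_config()
compute_target = ws.compute_targets['<compute-target-name>']
datastore = ws.datastores['<datastore-name>']
```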
Create a DataReference, either as mount:
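For example (the path is a placeholder):

```python
data_ref = datastore.path('<path/on/datastore>').as_mount()
```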
or as download:
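A sketch with the same placeholder path:

```python
data_ref = datastore.path('<path/on/datastore>').as_download()
```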
info
To mount a datastore, the workspace needs to have read and write access to the underlying storage. For a read-only datastore, as_download is the only option.
Consume DataReference in ScriptRunConfig#
Add this DataReference to a ScriptRunConfig as follows.
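A sketch (the script name and source directory are placeholders):

```python
from azureml.core import ScriptRunConfig

config = ScriptRunConfig(
    source_directory='.',                # directory containing your script
    script='script.py',
    arguments=[str(data_ref)],           # expands to $AZUREML_DATAREFERENCE_<name> at runtime
    compute_target=compute_target,
)
config.run_config.data_references[data_ref.data_reference_name] = data_ref.to_config()
```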
The command-line argument str(data_ref) returns the environment variable $AZUREML_DATAREFERENCE_example_data (where example_data is the name of the data reference).
Finally, data_ref.to_config() instructs the run to mount the data to the compute target and to assign the
above environment variable appropriately.
Without specifying an argument#
Specify a path_on_compute to reference your data without the need for command-line arguments.
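A sketch, reusing the placeholders from above; /tmp/data is an assumed mount point:

```python
data_ref = datastore.path('<path/on/datastore>').as_mount()
data_ref.path_on_compute = '/tmp/data'       # data will be made available at this path

config = ScriptRunConfig(
    source_directory='.',
    script='script.py',
    compute_target=compute_target,
)
config.run_config.data_references[data_ref.data_reference_name] = data_ref.to_config()
```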
Create Dataset#
From local data#
You can create and register a dataset directly from a folder on your local machine. Note that src_dir must point to a folder, not a file.
⚠️ Method upload_directory: This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
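A sketch using upload_directory (paths and the dataset name are placeholders):

```python
from azureml.core import Dataset

datastore = ws.get_default_datastore()
dataset = Dataset.File.upload_directory(
    src_dir='<path/on/local/machine>',       # local folder to upload
    target=(datastore, '<path/on/datastore>'),
    show_progress=True,
)
dataset = dataset.register(ws, name='<dataset-name>')
```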
From a datastore#
The code snippet below shows how to create a Dataset given a relative path on the datastore. Note that the path can point either to a folder (e.g. local/test/) or to a single file (e.g. local/test/data.tsv).
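```python
from azureml.core import Dataset

datastore = ws.get_default_datastore()
dataset = Dataset.File.from_files(path=(datastore, '<path/on/datastore>'))
```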
From outputs using OutputFileDatasetConfig#
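A sketch: write run outputs to a datastore and register them as a dataset when the run completes (names and paths are placeholders):

```python
from azureml.core import ScriptRunConfig
from azureml.data import OutputFileDatasetConfig

output = OutputFileDatasetConfig(
    destination=(datastore, '<path/on/datastore>'),
).register_on_complete(name='<dataset-name>')

config = ScriptRunConfig(
    source_directory='.',
    script='script.py',
    arguments=[output],                  # the script receives the output location as an argument
    compute_target=compute_target,
)
```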
Upload to datastore#
To upload a local directory ./data/:
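```python
datastore = ws.get_default_datastore()
datastore.upload(
    src_dir='./data',                    # local directory to upload
    target_path='<path/on/datastore>',
    overwrite=True,
)
```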
This uploads the entire local directory ./data to the default datastore associated with your workspace ws.
Create dataset from files in datastore#
To create a dataset from a directory on a datastore at <path/on/datastore>:
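```python
from azureml.core import Dataset

dataset = Dataset.File.from_files(path=(datastore, '<path/on/datastore>'))
```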
Use Dataset#
ScriptRunConfig#
To reference data from a dataset in a ScriptRunConfig you can either mount or download the dataset using:
- dataset.as_mount(path_on_compute): mount the dataset to a remote run
- dataset.as_download(path_on_compute): download the dataset to a remote run
Path on compute: Both as_mount and as_download accept an (optional) parameter path_on_compute.
This defines the path on the compute target where the data is made available.
- If None, the data will be downloaded into a temporary directory.
- If path_on_compute starts with a /, it will be treated as an absolute path. (If you have specified an absolute path, please make sure that the job has permission to write to that directory.)
- Otherwise it will be treated as relative to the working directory.
Reference this data in a remote run, for example in mount-mode:
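A sketch, reusing the dataset and compute_target from earlier (train.py is a placeholder script name):

```python
from azureml.core import ScriptRunConfig

config = ScriptRunConfig(
    source_directory='.',
    script='train.py',
    arguments=[dataset.as_mount()],      # mount point is passed as the first script argument
    compute_target=compute_target,
)
```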
and consumed in train.py:
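A minimal sketch of the consuming script, assuming the mounted path arrives as the first command-line argument:

```python
# train.py
import os
import sys

data_dir = sys.argv[1]                   # path where the dataset is mounted
print("data path:", data_dir)
print("files:", os.listdir(data_dir))
```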
For more details: ScriptRunConfig