# Data

## Concepts

AzureML provides two basic assets for working with data:
- Datastore
- Dataset
### Datastore

Provides an interface for numerous Azure Machine Learning storage accounts.
Each Azure ML workspace comes with a default datastore, which can also be accessed directly from the Azure Portal (under the same resource group as your Azure ML Workspace).
Datastores are attached to workspaces and store the connection information for Azure storage services, so you can refer to them by name rather than remembering the connection details and secrets used to reach the storage services.
Use the `Datastore` class to perform management operations, including registering, listing, getting, and removing datastores.
### Dataset

A dataset is a reference to data - either in a datastore or behind a public URL.
Datasets provide enhanced capabilities including data lineage (with the notion of versioned datasets).
## Get Datastore

### Default datastore

Each workspace comes with a default datastore.
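A minimal sketch using the Python SDK (assuming a workspace config file is available locally):

```python
from azureml.core import Workspace

ws = Workspace.from_config()            # connect to the workspace
datastore = ws.get_default_datastore()  # the workspace's default datastore
```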
### Register datastore

Connect to, or create, a datastore backed by one of the multiple data-storage options that Azure provides. For example:
- Azure Blob Container
- Azure Data Lake (Gen1 or Gen2)
- Azure File Share
- Azure MySQL
- Azure PostgreSQL
- Azure SQL
- Azure Databricks File System
See the SDK for a comprehensive list of datastore types and authentication options: Datastores (SDK).
#### Register a new datastore

To register a store via an account key:
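```python
# Sketch: register a blob container as a datastore using an account key.
# Values in <angle brackets> are placeholders.
from azureml.core import Datastore

datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name='<datastore-name>',
    container_name='<container-name>',
    account_name='<account-name>',
    account_key='<account-key>',
)
```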
To register a store via a SAS token:
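```python
# Sketch: the same registration call, authenticating with a SAS token instead.
datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name='<datastore-name>',
    container_name='<container-name>',
    account_name='<account-name>',
    sas_token='<sas-token>',
)
```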
#### Connect to datastore

The workspace object `ws` has access to its datastores via `ws.datastores`, a dictionary mapping datastore names to `Datastore` objects.
Any datastore that is registered to the workspace can thus be accessed by name.
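For example, with a placeholder datastore name:

```python
datastore = ws.datastores['<datastore-name>']
```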
#### Link datastore to Azure Storage Explorer

The workspace object `ws` is a very powerful handle when it comes to managing assets the workspace has access to. For example, we can use the workspace to connect to a datastore in Azure Storage Explorer.
For a datastore that was created using an account key we can use:
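```python
# Sketch: read the connection details stored on the registered datastore
# ('<datastore-name>' is a placeholder).
datastore = ws.datastores['<datastore-name>']
account_name = datastore.account_name
account_key = datastore.account_key
```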
For a datastore that was created using a SAS token we can use:
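```python
# Sketch: for a SAS-token-backed datastore, read the token instead.
datastore = ws.datastores['<datastore-name>']
sas_token = datastore.sas_token
```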
The `account_name` and `account_key` can then be used directly in Azure Storage Explorer to connect to the Datastore.
## Blob Datastore

Move data to and from your `AzureBlobDatastore` object `datastore`.
### Upload to Blob Datastore

The `AzureBlobDatastore` provides APIs for data upload:
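```python
# Sketch: upload the contents of a local folder to a path on the datastore
# ('<path/on/datastore>' is a placeholder).
datastore.upload(
    src_dir='./data',
    target_path='<path/on/datastore>',
    overwrite=True,
)
```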
Alternatively, if you are working with multiple files in different locations, you can use `upload_files`:
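```python
# Sketch: upload an explicit list of files (paths are placeholders).
datastore.upload_files(
    files=['./data/train.csv', './other/eval.csv'],
    target_path='<path/on/datastore>',
    overwrite=True,
)
```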
### Download from Blob Datastore

Download the data from the blob container to the local file system:
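```python
# Sketch: download everything under a datastore path to a local folder.
datastore.download(
    target_path='./data',           # local destination
    prefix='<path/on/datastore>',   # placeholder path on the datastore
    overwrite=True,
)
```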
### Via Storage Explorer

Azure Storage Explorer is a free tool to easily manage your Azure cloud storage resources from Windows, macOS, or Linux. Download it from here.
Azure Storage Explorer gives you a (graphical) file explorer, so you can literally drag-and-drop files into and out of your datastores.
See "Link datastore to Azure Storage Explorer" above for more details.
## Read from Datastore

Reference data in a `Datastore` in your code, for example to use in a remote setting.
### DataReference

First, connect to your basic assets: `Workspace`, `ComputeTarget` and `Datastore`.
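A sketch with placeholder names for the compute target and datastore:

```python
from azureml.core import Workspace

ws = Workspace.from_config()
compute_target = ws.compute_targets['<compute-target-name>']
datastore = ws.datastores['<datastore-name>']
```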
Create a `DataReference`, either as mount:
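```python
# Sketch: datastore.path(...) returns a DataReference ('<path/on/datastore>' is a placeholder).
data_ref = datastore.path('<path/on/datastore>').as_mount()
```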
or as download:
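```python
data_ref = datastore.path('<path/on/datastore>').as_download()
```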
ℹ️ To mount a datastore the workspace needs to have read and write access to the underlying storage. For a read-only datastore, `as_download` is the only option.
### Consume DataReference in ScriptRunConfig

Add this DataReference to a ScriptRunConfig as follows:
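```python
# Sketch: the script name is a placeholder.
from azureml.core import ScriptRunConfig

config = ScriptRunConfig(
    source_directory='.',
    script='script.py',
    compute_target=compute_target,
    arguments=[str(data_ref)],   # resolves to an environment variable on the compute target
)
# Attach the data reference to the run configuration.
config.run_config.data_references[data_ref.data_reference_name] = data_ref.to_config()
```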
The command-line argument `str(data_ref)` returns the environment variable `$AZUREML_DATAREFERENCE_example_data` (the suffix is the data reference name).
Finally, `data_ref.to_config()` instructs the run to mount the data to the compute target and to assign the above environment variable appropriately.
#### Without specifying argument

Specify a `path_on_compute` to reference your data without the need for command-line arguments.
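A sketch; the mount location `/tmp/data` is just an example:

```python
data_ref = datastore.path('<path/on/datastore>').as_mount()
data_ref.path_on_compute = '/tmp/data'   # where the data appears on the compute target

config = ScriptRunConfig(source_directory='.', script='script.py', compute_target=compute_target)
config.run_config.data_references[data_ref.data_reference_name] = data_ref.to_config()
```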
## Create Dataset

### From local data

You can create and register a dataset directly from a folder on your local machine. Note that `src_dir` must point to a folder, not a file:
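```python
# Sketch: upload a local folder to the default datastore and register the result
# as a FileDataset (paths and names in <angle brackets> are placeholders).
from azureml.core import Dataset

datastore = ws.get_default_datastore()
dataset = Dataset.File.upload_directory(
    src_dir='<path/on/local/machine>',          # must be a folder, not a single file
    target=(datastore, '<path/on/datastore>'),
)
dataset = dataset.register(workspace=ws, name='<dataset-name>')
```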
⚠️ Method `upload_directory`: This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
### From a datastore

The code snippet below shows how to create a `Dataset` given a relative path on `datastore`. Note that the path could either point to a folder (e.g. `local/test/`) or a single file (e.g. `local/test/data.tsv`).
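A minimal sketch with a placeholder path:

```python
from azureml.core import Dataset

datastore = ws.get_default_datastore()
dataset = Dataset.File.from_files(path=(datastore, '<path/on/datastore>'))
```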
### From outputs using `OutputFileDatasetConfig`
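A sketch of registering a run's outputs as a dataset; the script, experiment, and dataset names are placeholders, and `ws`, `datastore` and `compute_target` are assumed from the earlier snippets:

```python
from azureml.core import Experiment, ScriptRunConfig
from azureml.data import OutputFileDatasetConfig

# Outputs written to this location are registered as a dataset when the run completes.
output = OutputFileDatasetConfig(
    destination=(datastore, '<path/on/datastore>'),
).register_on_complete(name='<dataset-name>')

config = ScriptRunConfig(
    source_directory='.',
    script='script.py',                  # the script writes its results to the output path
    compute_target=compute_target,
    arguments=['--output-dir', output],  # resolved to a writable path on the compute target
)
run = Experiment(ws, '<experiment-name>').submit(config)
```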
### Upload to datastore

To upload a local directory `./data/`:
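```python
# Sketch: upload ./data to the default datastore ('<path/on/datastore>' is a placeholder).
datastore = ws.get_default_datastore()
datastore.upload(src_dir='./data', target_path='<path/on/datastore>', overwrite=True)
```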
This will upload the entire directory `./data` from local to the default datastore associated with your workspace `ws`.
### Create dataset from files in datastore

To create a dataset from a directory on a datastore at `<path/on/datastore>`:
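```python
# Sketch: build a FileDataset from files under a path on the datastore.
from azureml.core import Dataset

datastore = ws.get_default_datastore()
dataset = Dataset.File.from_files(path=(datastore, '<path/on/datastore>'))
```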
## Use Dataset

### ScriptRunConfig

To reference data from a dataset in a ScriptRunConfig you can either mount or download the dataset using:

- `dataset.as_mount(path_on_compute)`: mount dataset to a remote run
- `dataset.as_download(path_on_compute)`: download the dataset to a remote run
ℹ️ **Path on compute**: Both `as_mount` and `as_download` accept an (optional) parameter `path_on_compute`.
This defines the path on the compute target where the data is made available.

- If `None`, the data will be downloaded into a temporary directory.
- If `path_on_compute` starts with a `/` it will be treated as an absolute path. (If you have specified an absolute path, please make sure that the job has permission to write to that directory.)
- Otherwise it will be treated as relative to the working directory.
Reference this data in a remote run, for example in mount-mode:
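```python
# Sketch: pass the mounted dataset path to the training script
# (script and argument names are placeholders).
from azureml.core import ScriptRunConfig

config = ScriptRunConfig(
    source_directory='.',
    script='train.py',
    compute_target=compute_target,
    arguments=['--data-path', dataset.as_mount()],
)
```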
and consumed in `train.py`:
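```python
# train.py (sketch): read the mount path passed on the command line.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--data-path', type=str)
args = parser.parse_args()

print('Data is available at:', args.data_path)
```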
For more details: ScriptRunConfig