
Data

Concepts#

AzureML provides two basic assets for working with data:

  • Datastore
  • Dataset

Datastore#

Provides an interface to numerous Azure storage services (Blob, Data Lake, File Share, SQL, and more).

Each Azure ML workspace comes with a default datastore:

from azureml.core import Workspace
ws = Workspace.from_config()
datastore = ws.get_default_datastore()

which can also be accessed directly from the Azure Portal (under the same resource group as your Azure ML Workspace).

Datastores are attached to workspaces and store the connection information to Azure storage services, so you can refer to them by name without having to remember the connection details and secrets used to connect.

Use the Datastore class to perform management operations: register, list, get, and remove datastores.
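
A minimal sketch of the list, get, and remove operations, assuming a workspace ws and a datastore already registered under the placeholder name '<datastore-name>' (registration is shown further below):

from azureml.core import Datastore

# list: all datastores registered to the workspace
for name, ds in ws.datastores.items():
    print(name, ds.datastore_type)

# get: retrieve a registered datastore by name
ds = Datastore.get(ws, '<datastore-name>')

# remove: unregister the datastore (the underlying storage service is not deleted)
ds.unregister()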

Dataset#

A dataset is a reference to data - either in a datastore or behind a public URL.
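
For example, a FileDataset can be created directly from files behind a public URL (the URL below is a placeholder):

from azureml.core import Dataset

# dataset referencing a file behind a public URL
web_dataset = Dataset.File.from_files(path='https://<public-url>/data.csv')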

Datasets provide enhanced capabilities, including data lineage (with the notion of versioned datasets).
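
A minimal sketch of dataset versioning, assuming a workspace ws and an existing FileDataset dataset (the name '<dataset_name>' is a placeholder):

from azureml.core import Dataset

# register a new version of the dataset under an existing name
dataset = dataset.register(workspace=ws, name='<dataset_name>', create_new_version=True)

# retrieve a specific version, or the latest, later on
dataset_v1 = Dataset.get_by_name(ws, name='<dataset_name>', version=1)
dataset_latest = Dataset.get_by_name(ws, name='<dataset_name>')  # latest version by default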

Get Datastore#

Default datastore#

Each workspace comes with a default datastore.

datastore = ws.get_default_datastore()

Register datastore#

Connect to, or create, a datastore backed by one of the multiple data-storage options that Azure provides. For example:

  • Azure Blob Container
  • Azure Data Lake (Gen1 or Gen2)
  • Azure File Share
  • Azure MySQL
  • Azure PostgreSQL
  • Azure SQL
  • Azure Databricks File System

See the SDK for a comprehensive list of datastore types and authentication options: Datastores (SDK).

Register a new datastore#

  • To register a store via an account key:

    from azureml.core import Datastore

    datastore = Datastore.register_azure_blob_container(
        workspace=ws,
        datastore_name='<datastore-name>',
        container_name='<container-name>',
        account_name='<account-name>',
        account_key='<account-key>',
    )
  • To register a store via a SAS token:

    datastore = Datastore.register_azure_blob_container(
        workspace=ws,
        datastore_name='<datastore-name>',
        container_name='<container-name>',
        account_name='<account-name>',
        sas_token='<sas-token>',
    )

Connect to datastore#

The workspace object ws has access to its datastores via

ws.datastores: Dict[str, Datastore]

Any datastore that is registered to the workspace can thus be accessed by name.

datastore = ws.datastores['<name-of-registered-datastore>']
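
Equivalently, a registered datastore can be fetched by name via the Datastore class:

from azureml.core import Datastore

datastore = Datastore.get(ws, '<name-of-registered-datastore>')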

Link datastore to Azure Storage Explorer#

The workspace object ws is a powerful handle for managing the assets the workspace has access to. For example, we can use the workspace to connect to a datastore from Azure Storage Explorer.

from azureml.core import Workspace
ws = Workspace.from_config()
datastore = ws.datastores['<name-of-datastore>']
  • For a datastore that was created using an account key we can use:

    account_name, account_key = datastore.account_name, datastore.account_key
  • For a datastore that was created using a SAS token we can use:

    sas_token = datastore.sas_token

The account_name and account_key (or the sas_token) can then be used directly in Azure Storage Explorer to connect to the datastore.

Blob Datastore#

Move data to and from your AzureBlobDatastore object datastore.

Upload to Blob Datastore#

The AzureBlobDatastore provides APIs for data upload:

datastore.upload(
    src_dir='./data',
    target_path='<path/on/datastore>',
    overwrite=True,
)

Alternatively, if you are working with multiple files in different locations, you can use

datastore.upload_files(
    files,  # List[str] of absolute paths of files to upload
    target_path='<path/on/datastore>',
    overwrite=False,
)
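
For example, a sketch that collects all CSV files under a local folder and uploads them (the ./data folder and the *.csv pattern are assumptions for illustration):

import glob
import os

# absolute paths of the files to upload
files = [os.path.abspath(p) for p in glob.glob('./data/**/*.csv', recursive=True)]

datastore.upload_files(
    files,
    target_path='<path/on/datastore>',
    overwrite=False,
)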

Download from Blob Datastore#

Download the data from the blob container to the local file system.

datastore.download(
    target_path,  # str: local directory to download to
    prefix='<path/on/datastore>',
    overwrite=False,
)

Via Storage Explorer#

Azure Storage Explorer is a free tool to easily manage your Azure cloud storage resources from Windows, macOS, or Linux. Download it from here.

Azure Storage Explorer gives you a (graphical) file explorer, so you can drag and drop files into and out of your datastores.

See "Link datastore to Azure Storage Explorer" above for more details.

Read from Datastore#

Reference data in a Datastore from your code, for example to use it in a remote run.

DataReference#

First, connect to your basic assets: Workspace, ComputeTarget and Datastore.

from azureml.core import ComputeTarget, Datastore, Workspace

ws: Workspace = Workspace.from_config()
compute_target: ComputeTarget = ws.compute_targets['<compute-target-name>']
ds: Datastore = ws.get_default_datastore()

Create a DataReference, either as mount:

data_ref = ds.path('<path/on/datastore>').as_mount()

or as download:

data_ref = ds.path('<path/on/datastore>').as_download()
info

To mount a datastore, the workspace needs read and write access to the underlying storage. For a read-only datastore, as_download is the only option.

Consume DataReference in ScriptRunConfig#

Add this DataReference to a ScriptRunConfig as follows.

config = ScriptRunConfig(
    source_directory='.',
    script='script.py',
    arguments=[str(data_ref)],  # returns environment variable $AZUREML_DATAREFERENCE_example_data
    compute_target=compute_target,
)
config.run_config.data_references[data_ref.data_reference_name] = data_ref.to_config()

The command-line argument str(data_ref) returns the environment variable $AZUREML_DATAREFERENCE_example_data. Finally, data_ref.to_config() instructs the run to mount the data to the compute target and to assign the above environment variable appropriately.
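
Inside script.py the argument resolves to a local path on the compute target; a minimal sketch of consuming it (argument position as in the example above):

# script.py
import os
import sys

data_path = sys.argv[1]  # resolved path of the mounted data reference
print(os.listdir(data_path))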

Without specifying an argument#

Specify a path_on_compute to reference your data without the need for command-line arguments.

data_ref = ds.path('<path/on/datastore>').as_mount()
data_ref.path_on_compute = '/tmp/data'
config = ScriptRunConfig(
    source_directory='.',
    script='script.py',
    compute_target=compute_target,
)
config.run_config.data_references[data_ref.data_reference_name] = data_ref.to_config()
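
With path_on_compute set, script.py can read from the fixed location directly; a minimal sketch (assuming the /tmp/data path from above):

# script.py
import os

data_dir = '/tmp/data'  # path_on_compute set on the DataReference
print(os.listdir(data_dir))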

Create Dataset#

From local data#

You can create and register a dataset directly from a folder on your local machine. Note that src_dir must point to a folder, not a file.

⚠️ Method upload_directory: This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

from azureml.core import Dataset

# upload the data to datastore and create a FileDataset from it
folder_data = Dataset.File.upload_directory(
    src_dir="path/to/folder",
    target=(datastore, "self-defined/path/on/datastore"),
)
dataset = folder_data.register(workspace=ws, name="<dataset_name>")

From a datastore#

The code snippet below shows how to create a Dataset given a relative path on datastore. Note that the path could either point to a folder (e.g. local/test/) or a single file (e.g. local/test/data.tsv).

from azureml.core import Dataset
# create input dataset
data = Dataset.File.from_files(path=(datastore, "path/on/datastore"))
dataset = data.register(workspace=ws, name="<dataset_name>")

From outputs using OutputFileDatasetConfig#

from azureml.core import ScriptRunConfig
from azureml.data import OutputFileDatasetConfig

output_data = OutputFileDatasetConfig(
    destination=(datastore, "path/on/datastore"),
    name="<output_name>",
)
config = ScriptRunConfig(
    source_directory=".",
    script="run.py",
    arguments=["--output_dir", output_data.as_mount()],
)

# register your OutputFileDatasetConfig as a dataset
output_data_dataset = output_data.register_on_complete(
    name="<dataset_name>",
    description="<dataset_description>",
)
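
A minimal sketch of the corresponding run.py, which parses --output_dir and writes into it; anything written there ends up in the registered dataset (the file name output.csv is an assumption for illustration):

# run.py
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--output_dir', type=str)
args = parser.parse_args()

os.makedirs(args.output_dir, exist_ok=True)
with open(os.path.join(args.output_dir, 'output.csv'), 'w') as f:  # hypothetical output file
    f.write('col1,col2\n1,2\n')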

Upload to datastore#

To upload a local directory ./data/:

datastore = ws.get_default_datastore()
datastore.upload(src_dir='./data', target_path='<path/on/datastore>', overwrite=True)

This will upload the entire local directory ./data to the default datastore associated with your workspace ws.

Create dataset from files in datastore#

To create a dataset from a directory on a datastore at <path/on/datastore>:

datastore = ws.get_default_datastore()
dataset = Dataset.File.from_files(path=(datastore, '<path/on/datastore>'))

Use Dataset#

ScriptRunConfig#

To reference data from a dataset in a ScriptRunConfig you can either mount or download the dataset using:

  • dataset.as_mount(path_on_compute) : mount dataset to a remote run
  • dataset.as_download(path_on_compute) : download the dataset to a remote run

Path on compute: Both as_mount and as_download accept an (optional) parameter path_on_compute, which defines the path on the compute target where the data is made available (see the sketch after this list).

  • If None, the data will be downloaded into a temporary directory.
  • If path_on_compute starts with a / it will be treated as an absolute path. (If you have specified an absolute path, please make sure that the job has permission to write to that directory.)
  • Otherwise it will be treated as relative to the working directory.
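
A minimal sketch of pinning the mount location via path_on_compute (the /tmp/data path is an assumption for illustration):

from azureml.core import ScriptRunConfig

# mount the dataset at a fixed path on the compute target
arguments = [dataset.as_mount(path_on_compute='/tmp/data')]
config = ScriptRunConfig(source_directory='.', script='train.py', arguments=arguments)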

Reference this data in a remote run, for example in mount-mode:

run.py
arguments=[dataset.as_mount()]
config = ScriptRunConfig(source_directory='.', script='train.py', arguments=arguments)
experiment.submit(config)

and consumed in train.py:

train.py
import os
import sys

data_dir = sys.argv[1]
print("===== DATA =====")
print("DATA PATH: " + data_dir)
print("LIST FILES IN DATA DIR...")
print(os.listdir(data_dir))
print("================")

For more details: ScriptRunConfig