User Guide
==========

.. _getting-started:

Getting Started
---------------

Prerequisites
~~~~~~~~~~~~~

* Python 3.9 or later installed
* An `Azure subscription`_ and an `Azure storage account`_

Installation
~~~~~~~~~~~~

Install the Azure Storage Connector for PyTorch (``azstoragetorch``) with `pip`_:

.. code-block:: shell

    pip install azstoragetorch

Configuration
~~~~~~~~~~~~~

``azstoragetorch`` should work without any explicit credential configuration.
``azstoragetorch`` interfaces default to :py:class:`~azure.identity.DefaultAzureCredential`
for credentials. ``DefaultAzureCredential`` automatically retrieves `Microsoft Entra ID tokens`_
based on your current environment. For more information on ``DefaultAzureCredential``,
see the `DefaultAzureCredential guide`_.

To override the default credentials, ``azstoragetorch`` interfaces accept a ``credential``
keyword argument and also accept `SAS`_ tokens in the query string of provided Azure
Storage URLs. See the :doc:`API Reference <api>` for more details.
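For illustration, below is a minimal sketch of both override styles using
:py:class:`azstoragetorch.io.BlobIO` (the account, container, blob, and SAS token
values are hypothetical placeholders):

.. code-block:: python

    from azure.identity import ManagedIdentityCredential
    from azstoragetorch.io import BlobIO

    BLOB_URL = "https://<storage-account-name>.blob.core.windows.net/<container-name>/<blob-name>"

    # Pass an explicit azure.identity credential instead of relying on the
    # default DefaultAzureCredential.
    blob = BlobIO(BLOB_URL, "rb", credential=ManagedIdentityCredential())

    # Alternatively, append a SAS token to the blob URL. The SAS token is then
    # used for authorization instead of Microsoft Entra ID tokens.
    blob_with_sas = BlobIO(BLOB_URL + "?<sas-token>", "rb")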
.. _checkpoint-guide:

Saving and Loading PyTorch Models (Checkpointing)
-------------------------------------------------

PyTorch `supports saving and loading trained models <PyTorch checkpoint tutorial_>`_
(i.e., checkpointing). The core PyTorch interfaces for saving and loading models are
:py:func:`torch.save` and :py:func:`torch.load` respectively. Both of these functions
accept a file-like object to be written to or read from.

``azstoragetorch`` offers the :py:class:`azstoragetorch.io.BlobIO` file-like object
class to save and load models directly to and from Azure Blob Storage when using
:py:func:`torch.save` and :py:func:`torch.load`.

Saving a Model
~~~~~~~~~~~~~~

To save a model to Azure Blob Storage, pass a :py:class:`azstoragetorch.io.BlobIO`
directly to :py:func:`torch.save`. When creating the :py:class:`~azstoragetorch.io.BlobIO`,
specify the URL to the blob you'd like to save the model to and use write mode
(i.e., ``wb``):

.. literalinclude:: ../../samples/save_model.py
   :lines: 9-

Loading a Model
~~~~~~~~~~~~~~~

To load a model from Azure Blob Storage, pass a :py:class:`azstoragetorch.io.BlobIO`
directly to :py:func:`torch.load`. When creating the :py:class:`~azstoragetorch.io.BlobIO`,
specify the URL to the blob storing the model weights and use read mode (i.e., ``rb``):

.. literalinclude:: ../../samples/load_model.py
   :lines: 9-

.. _datasets-guide:

PyTorch Datasets
----------------

PyTorch offers the `Dataset and DataLoader primitives <PyTorch dataset tutorial_>`_
for loading data samples. ``azstoragetorch`` provides implementations for both types
of PyTorch datasets, `map-style and iterable-style datasets <PyTorch dataset types_>`_,
to load data samples from Azure Blob Storage:

* :py:class:`azstoragetorch.datasets.BlobDataset` - `Map-style dataset <PyTorch dataset map-style_>`_.
  Use this class for random access to data samples. The class eagerly lists samples
  in the dataset on instantiation.

* :py:class:`azstoragetorch.datasets.IterableBlobDataset` - `Iterable-style dataset <PyTorch dataset iterable-style_>`_.
  Use this class when working with large datasets that may not fit in memory. The
  class lazily lists samples as the dataset is iterated over.

Data samples returned from both datasets map one-to-one to blobs in Azure Blob
Storage. Both classes can be provided directly to a PyTorch
:py:class:`~torch.utils.data.DataLoader` (read more
:ref:`here <datasets-guide-with-dataloader>`).

When instantiating these dataset classes, use one of their class methods:

* ``from_container_url()`` - Instantiate dataset by listing blobs from an Azure Storage container.
* ``from_blob_urls()`` - Instantiate dataset from provided blob URLs.

Instantiation directly using ``__init__()`` is **not** supported. Read the sections
below on how to use these class methods to create datasets.

Create Dataset from Azure Storage Container
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To create an ``azstoragetorch`` dataset by listing blobs in a single Azure Storage
container, use the dataset class's corresponding ``from_container_url()`` method:

* :py:meth:`azstoragetorch.datasets.BlobDataset.from_container_url` for map-style dataset
* :py:meth:`azstoragetorch.datasets.IterableBlobDataset.from_container_url` for iterable-style dataset

The methods accept the URL to the Azure Storage container to list blobs from.
Listing is performed using the `List Blobs API`_. For example:

.. tab-set::

   .. tab-item:: ``BlobDataset``

      .. literalinclude:: ../../samples/map_dataset/dataset_from_container_url.py
         :lines: 9-

   .. tab-item:: ``IterableBlobDataset``

      .. literalinclude:: ../../samples/iterable_dataset/dataset_from_container_url.py
         :lines: 9-

The above examples list all blobs in the container. To only include blobs whose
names start with a specific prefix, provide the ``prefix`` keyword argument:

.. tab-set::

   .. tab-item:: ``BlobDataset``

      .. literalinclude:: ../../samples/map_dataset/dataset_using_prefix.py
         :lines: 9-

   .. tab-item:: ``IterableBlobDataset``

      .. literalinclude:: ../../samples/iterable_dataset/dataset_using_prefix.py
         :lines: 9-

Create Dataset from List of Blobs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To create an ``azstoragetorch`` dataset from a pre-defined list of blobs, use the
dataset class's corresponding ``from_blob_urls()`` method:

* :py:meth:`azstoragetorch.datasets.BlobDataset.from_blob_urls` for map-style dataset
* :py:meth:`azstoragetorch.datasets.IterableBlobDataset.from_blob_urls` for iterable-style dataset

The methods accept a list of blob URLs to create the dataset from. For example:

.. tab-set::

   .. tab-item:: ``BlobDataset``

      .. literalinclude:: ../../samples/map_dataset/dataset_from_blob_list.py
         :lines: 9-

   .. tab-item:: ``IterableBlobDataset``

      .. literalinclude:: ../../samples/iterable_dataset/dataset_from_blob_list.py
         :lines: 9-

Transforming Dataset Output
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The default output format of a dataset sample is a dictionary representing a blob
in the dataset. Each dictionary has the keys:

* ``url``: The full endpoint URL of the blob.
* ``data``: The content of the blob as :py:class:`bytes`.

For example, when accessing a dataset sample::

    print(map_dataset[0])

It will have the following return format::

    {
        "url": "https://<storage-account-name>.blob.core.windows.net/<container-name>/<blob-name>",
        "data": b"<blob-content>"
    }

To override the output format, provide a ``transform`` callable to either
``from_blob_urls`` or ``from_container_url`` when creating the dataset. The
``transform`` callable accepts a single positional argument of type
:py:class:`azstoragetorch.datasets.Blob` representing a blob in the dataset. This
:py:class:`~azstoragetorch.datasets.Blob` object can be used to retrieve the
properties and content of the blob as part of the ``transform`` callable.

Emulating the `PyTorch transform tutorial`_, the example below shows how to
transform a :py:class:`~azstoragetorch.datasets.Blob` object to a
:py:class:`torch.Tensor` of a :py:mod:`PIL.Image`:

.. literalinclude:: ../../samples/map_dataset/transforming_dataset_output.py
   :lines: 9-

The output should include the blob name and a :py:class:`~torch.Tensor` of the image::

    ("<blob-name>", tensor([...]))
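As a simpler illustration of the ``transform`` interface, below is a minimal sketch
of a transform that decodes each blob's content as UTF-8 text. It assumes a
hypothetical container URL holding text blobs, and that
:py:class:`~azstoragetorch.datasets.Blob` exposes the blob's name via a
``blob_name`` property and its content via a file-like ``reader()`` method:

.. code-block:: python

    from azstoragetorch.datasets import Blob, BlobDataset

    CONTAINER_URL = "https://<storage-account-name>.blob.core.windows.net/<container-name>"

    def to_text(blob: Blob) -> tuple[str, str]:
        # Read the blob's raw bytes and decode them as UTF-8 text.
        with blob.reader() as f:
            return blob.blob_name, f.read().decode("utf-8")

    map_dataset = BlobDataset.from_container_url(CONTAINER_URL, transform=to_text)
    print(map_dataset[0])  # e.g., ("<blob-name>", "<blob-content-as-text>")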
.. _datasets-guide-with-dataloader:

Using Dataset with PyTorch DataLoader
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once instantiated, ``azstoragetorch`` datasets can be provided directly to a
PyTorch :py:class:`~torch.utils.data.DataLoader` for loading samples:

.. literalinclude:: ../../samples/map_dataset/dataset_with_pytorch_dataloader.py
   :lines: 9-

Iterable-style Datasets with Multiple Workers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When using an :py:class:`~azstoragetorch.datasets.IterableBlobDataset` and a
:py:class:`~torch.utils.data.DataLoader` with multiple workers (i.e.,
``num_workers > 1``), the :py:class:`~azstoragetorch.datasets.IterableBlobDataset`
automatically shards the data samples across workers. This prevents the
:py:class:`~torch.utils.data.DataLoader` from returning duplicate samples from
its workers:

.. literalinclude:: ../../samples/iterable_dataset/multiple_workers.py
   :lines: 9-

.. _Azure subscription: https://azure.microsoft.com/free/
.. _Azure storage account: https://learn.microsoft.com/azure/storage/common/storage-account-overview
.. _pip: https://pypi.org/project/pip/
.. _Microsoft Entra ID tokens: https://learn.microsoft.com/azure/storage/blobs/authorize-access-azure-active-directory
.. _DefaultAzureCredential guide: https://learn.microsoft.com/azure/developer/python/sdk/authentication/credential-chains?tabs=dac#defaultazurecredential-overview
.. _SAS: https://learn.microsoft.com/azure/storage/common/storage-sas-overview
.. _PyTorch checkpoint tutorial: https://pytorch.org/tutorials/beginner/saving_loading_models.html
.. _PyTorch dataset tutorial: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#datasets-dataloaders
.. _PyTorch dataset types: https://pytorch.org/docs/stable/data.html#dataset-types
.. _PyTorch dataset map-style: https://pytorch.org/docs/stable/data.html#map-style-datasets
.. _PyTorch dataset iterable-style: https://pytorch.org/docs/stable/data.html#iterable-style-datasets
.. _List Blobs API: https://learn.microsoft.com/rest/api/storageservices/list-blobs?tabs=microsoft-entra-id
.. _PyTorch transform tutorial: https://pytorch.org/tutorials/beginner/basics/transforms_tutorial.html
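To tie these pieces together, below is a minimal end-to-end sketch that streams
blobs from a container through a multi-worker
:py:class:`~torch.utils.data.DataLoader` (the container URL is a hypothetical
placeholder):

.. code-block:: python

    from torch.utils.data import DataLoader
    from azstoragetorch.datasets import IterableBlobDataset

    CONTAINER_URL = "https://<storage-account-name>.blob.core.windows.net/<container-name>"

    # Guard DataLoader usage in __main__ so worker processes can be spawned safely.
    if __name__ == "__main__":
        dataset = IterableBlobDataset.from_container_url(CONTAINER_URL)

        # Samples are automatically sharded across the four workers, so each
        # blob is returned only once per pass over the dataset.
        loader = DataLoader(dataset, batch_size=32, num_workers=4)

        for batch in loader:
            # With the default sample format, each batch is a dictionary whose
            # "url" key maps to a list of blob URLs and whose "data" key maps
            # to a list of blob contents as bytes.
            print(batch["url"])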