API Reference

BlobIO

class azstoragetorch.io.BlobIO(blob_url: str, mode: Literal['rb', 'wb'], *, credential: AzureSasCredential | TokenCredential | None | Literal[False] = None, **_internal_only_kwargs)

Bases: IOBase

File-like object for reading and writing blobs in Azure Blob Storage.

Use this class for PyTorch checkpointing by passing it directly to torch.save() or torch.load():

import torch
import torchvision.models
from azstoragetorch.io import BlobIO

CONTAINER_URL = "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>"

# Sample model to save and load. Replace with your own model.
model = torchvision.models.resnet18(weights="DEFAULT")

# Save trained model to Azure Blob Storage. This saves the model weights
# to a blob named "model_weights.pth" in the container specified by CONTAINER_URL.
with BlobIO(f"{CONTAINER_URL}/model_weights.pth", "wb") as f:
    torch.save(model.state_dict(), f)

# Load trained model from Azure Blob Storage. This loads the model weights
# from the blob named "model_weights.pth" in the container specified by CONTAINER_URL.
with BlobIO(f"{CONTAINER_URL}/model_weights.pth", "rb") as f:
    model.load_state_dict(torch.load(f))
Parameters:
  • blob_url (str) -- The full endpoint URL to the blob. The URL respects SAS tokens, snapshots, and version IDs in its query string.

  • mode (Literal['rb', 'wb']) --

    The mode in which to open the blob. Supported modes are:

    • rb - Opens blob for reading

    • wb - Opens blob for writing

  • credential (AzureSasCredential | TokenCredential | None | Literal[False]) -- The credential to use for authentication. If not specified, azure.identity.DefaultAzureCredential will be used. When set to False, anonymous requests will be made. If the blob_url contains a SAS token, this parameter is ignored.
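
For example, a minimal sketch passing an explicit credential (requires the separate azure-identity package; the URL is a placeholder):

from azure.identity import DefaultAzureCredential
from azstoragetorch.io import BlobIO

# Equivalent to the default behavior, but with the credential passed explicitly.
with BlobIO(
    "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>/<blob-name>",
    "rb",
    credential=DefaultAzureCredential(),
) as f:
    data = f.read()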

close() → None

Close the file-like object.

In write mode, this will flush() and commit the blob.

Raises:

FatalBlobIOWriteError -- if a fatal error occurs when writing to the blob. If raised, no data written or uploaded using this BlobIO instance will be committed to the blob. It is recommended to create a new BlobIO instance and retry all writes.

property closed: bool

Whether the file-like object is closed.

True if the file-like object is closed, False otherwise.

flush() → None

Flush all written data to the blob.

When called, any unstaged data will be uploaded and the method will block until all uploads complete. In read mode, this method has no effect.

Raises:

FatalBlobIOWriteError -- if a fatal error occurs when writing to the blob. If raised, no data written or uploaded using this BlobIO instance will be committed to the blob. It is recommended to create a new BlobIO instance and retry all writes.

read(size: int | None = -1, /) → bytes

Read bytes from the blob.

Parameters:

size (int | None) -- The maximum number of bytes to read. If not specified, all bytes will be read.

Returns:

The bytes read from the blob.

Return type:

bytes

readable() → bool

Return whether the file-like object is readable.

Returns:

True if opened in read mode, False otherwise.

Return type:

bool

readline(size: int | None = -1, /) → bytes

Read and return a line from the file-like object.

The line terminator is always b'\n'.

Parameters:

size (int | None) -- The maximum number of bytes to read. If not specified, bytes will be read until the next line terminator or the end of the blob.

Returns:

The bytes read from the blob.

Return type:

bytes
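
Because BlobIO subclasses IOBase, it also supports line iteration, which uses readline() under the hood. A minimal sketch for a newline-delimited blob (the URL and blob name are placeholders):

from azstoragetorch.io import BlobIO

# Each iteration yields one line as bytes, including its trailing b"\n"
# (the final line may lack the terminator).
with BlobIO("https://<account-name>.blob.core.windows.net/<container-name>/labels.txt", "rb") as f:
    for line in f:
        print(line)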

seek(offset: int, whence: int = 0, /) → int

Change the position of the file-like object to the given byte offset.

Parameters:
  • offset (int) -- The offset to seek to

  • whence (int) --

    The reference point for the offset. Accepted values are:

    • os.SEEK_SET - The start of the file-like object (the default)

    • os.SEEK_CUR - The current position in the file-like object

    • os.SEEK_END - The end of the file-like object

Returns:

The new absolute position in the file-like object.

Return type:

int
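
For example, seeking relative to the end of the blob allows reading a fixed-size trailer without downloading the whole blob. A sketch assuming standard file semantics for negative offsets (the URL is a placeholder):

import os

from azstoragetorch.io import BlobIO

with BlobIO("https://<account-name>.blob.core.windows.net/<container-name>/<blob-name>", "rb") as f:
    f.seek(-16, os.SEEK_END)  # position 16 bytes before the end
    print(f.tell())           # absolute position: blob size minus 16
    trailer = f.read()        # read the final 16 bytes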

seekable() → bool

Return whether the file-like object supports random access.

Returns:

True if the file-like object supports seeking, False otherwise. Seeking is only supported in read mode.

Return type:

bool

tell() → int

Return the current position in the file-like object.

Returns:

The current position in the file-like object.

Return type:

int

writable() → bool

Return whether the file-like object is writable.

Returns:

True if opened in write mode, False otherwise.

Return type:

bool

write(b: bytes | bytearray | memoryview, /) → int

Write a bytes-like object to the blob.

Data written may not be immediately uploaded. Instead, data may be uploaded via threads after write() has returned or may be uploaded as part of subsequent calls to BlobIO methods. A successful return from write() therefore does not guarantee the data was uploaded to the blob. Calls to flush() or close() will upload all pending data, block until all data is uploaded, and propagate any errors.
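
A minimal sketch of this behavior (the URL is a placeholder); errors deferred from write() surface on flush() or close():

from azstoragetorch.io import BlobIO

with BlobIO("https://<account-name>.blob.core.windows.net/<container-name>/<blob-name>", "wb") as f:
    f.write(b"chunk-1")  # may return before the data is uploaded
    f.write(b"chunk-2")
    f.flush()  # blocks until both chunks upload; raises FatalBlobIOWriteError on fatal errors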

Parameters:

b (bytes | bytearray | memoryview) -- The bytes-like object to write to the blob.

Returns:

The number of bytes written.

Raises:

FatalBlobIOWriteError -- if a fatal error occurs when writing to the blob. If raised, no data written or uploaded using this BlobIO instance will be committed to the blob. It is recommended to create a new BlobIO instance and retry all writes.

Return type:

int

Datasets

class azstoragetorch.datasets.BlobDataset(blobs: Iterable[Blob], transform: Callable[[Blob], _TransformOutputType_co] | None = None)

Bases: Dataset[_TransformOutputType_co]

Map-style dataset for blobs in Azure Blob Storage.

Data samples returned from the dataset map one-to-one to blobs in Azure Blob Storage. Use from_blob_urls() or from_container_url() to create an instance of this dataset. For example:

from azstoragetorch.datasets import BlobDataset

dataset = BlobDataset.from_container_url(
    "https://<storage-account-name>.blob.core.windows.net/<container-name>"
)
print(dataset[0])  # Print first blob in the dataset

Instantiating the dataset class directly using __init__() is not supported.

Usage with PyTorch DataLoader

The dataset can be provided directly to a PyTorch DataLoader:

import torch.utils.data

loader = torch.utils.data.DataLoader(dataset)
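
With the default output format (described under Dataset output below), PyTorch's default collate function typically batches the url and data fields into lists. A sketch continuing the snippet above (batch size illustrative):

loader = torch.utils.data.DataLoader(dataset, batch_size=4)
for batch in loader:
    urls = batch["url"]    # list of up to 4 blob URLs
    blobs = batch["data"]  # list of up to 4 bytes objects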

Dataset output

The default output format of the dataset is a dictionary with the keys:

  • url: The full endpoint URL of the blob.

  • data: The content of the blob as bytes.

For example:

{
    "url": "https://<account-name>.blob.core.windows.net/<container-name>/<blob-name>",
    "data": b"<blob-content>"
}

To override the output format, provide a transform callable to either from_blob_urls() or from_container_url() when creating the dataset.
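
For example, a sketch of a hypothetical transform that returns only each blob's name and size (the Blob class is documented below; names are placeholders):

from azstoragetorch.datasets import Blob, BlobDataset

def name_and_size(blob: Blob) -> dict:
    # Read the blob's content via its file-like reader.
    with blob.reader() as f:
        content = f.read()
    return {"name": blob.blob_name, "num_bytes": len(content)}

dataset = BlobDataset.from_container_url(
    "https://<storage-account-name>.blob.core.windows.net/<container-name>",
    transform=name_and_size,
)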

classmethod from_blob_urls(blob_urls: str | Iterable[str], *, credential: AzureSasCredential | TokenCredential | None | Literal[False] = None, transform: Callable[[Blob], _TransformOutputType_co] | None = None) → Self

Instantiate a dataset from the provided blob URLs.

Sample usage:

container_url = "https://<storage-account-name>.blob.core.windows.net/<container-name>"
dataset = BlobDataset.from_blob_urls([
    f"{container_url}/<blob-name-1>",
    f"{container_url}/<blob-name-2>",
    f"{container_url}/<blob-name-3>",
])
Parameters:
  • blob_urls (str | Iterable[str]) -- The full endpoint URLs to the blobs to be used for the dataset. Can be a single URL or an iterable of URLs. URLs respect SAS tokens, snapshots, and version IDs in their query strings.

  • credential (AzureSasCredential | TokenCredential | None | Literal[False]) -- The credential to use for authentication. If not specified, azure.identity.DefaultAzureCredential will be used. When set to False, anonymous requests will be made. If a URL contains a SAS token, this parameter is ignored for that URL.

  • transform (Callable[[Blob], _TransformOutputType_co] | None) -- A callable that accepts a Blob object representing a blob in the dataset and returns the transformed output to use as the dataset's output. See the Blob class for more information on writing a transform callable to override the default dataset output format.

Returns:

Dataset formed from the provided blob URLs.

Return type:

Self

classmethod from_container_url(container_url: str, *, prefix: str | None = None, credential: AzureSasCredential | TokenCredential | None | Literal[False] = None, transform: Callable[[Blob], _TransformOutputType_co] | None = None) → Self

Instantiate a dataset by listing blobs from the provided container URL.

Sample usage:

dataset = BlobDataset.from_container_url(
    "https://<storage-account-name>.blob.core.windows.net/<container-name>",
)
Parameters:
  • container_url (str) -- The full endpoint URL to the container to be used for the dataset. The URL respects SAS tokens in its query string.

  • prefix (str | None) -- The prefix to filter blobs by. Only blobs whose names begin with prefix will be included in the dataset. If not specified, all blobs in the container will be included in the dataset (see the prefix example following this method).

  • credential (AzureSasCredential | TokenCredential | None | Literal[False]) -- The credential to use for authentication. If not specified, azure.identity.DefaultAzureCredential will be used. When set to False, anonymous requests will be made. If the container URL contains a SAS token, this parameter is ignored.

  • transform (Callable[[Blob], _TransformOutputType_co] | None) -- A callable that accepts a Blob object representing a blob in the dataset and returns the transformed output to use as the dataset's output. See the Blob class for more information on writing a transform callable to override the default dataset output format.

Returns:

Dataset formed from the blobs in the provided container URL.

Return type:

Self
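
For example, to limit the dataset to blobs whose names begin with a prefix (the prefix value below is hypothetical):

dataset = BlobDataset.from_container_url(
    "https://<storage-account-name>.blob.core.windows.net/<container-name>",
    prefix="train/images/",
)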

__getitem__(index: int) → _TransformOutputType_co

Retrieve the blob at the specified index in the dataset.

Parameters:

index (int) -- The index of the blob to retrieve.

Returns:

The blob at the specified index, with the transform applied.

Return type:

_TransformOutputType_co

__len__() → int

Return the number of blobs in the dataset.

Returns:

The number of blobs in the dataset.

Return type:

int

class azstoragetorch.datasets.IterableBlobDataset(blobs: Iterable[Blob], transform: Callable[[Blob], _TransformOutputType_co] | None = None)

Bases: IterableDataset[_TransformOutputType_co]

Iterable-style dataset for blobs in Azure Blob Storage.

Data samples returned from the dataset map one-to-one to blobs in Azure Blob Storage. Use from_blob_urls() or from_container_url() to create an instance of this dataset. For example:

from azstoragetorch.datasets import IterableBlobDataset

dataset = IterableBlobDataset.from_container_url(
    "https://<storage-account-name>.blob.core.windows.net/<container-name>"
)
print(next(iter(dataset)))  # Print first blob in the dataset

Instantiating the dataset class directly using __init__() is not supported.

Usage with PyTorch DataLoader

The dataset can be provided directly to a PyTorch DataLoader:

import torch.utils.data

loader = torch.utils.data.DataLoader(dataset)

When num_workers is set for the DataLoader, the dataset automatically shards its data samples across workers so that the DataLoader does not return duplicate samples from its workers.
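
For example (worker count illustrative):

# Each blob is yielded by exactly one of the four workers.
loader = torch.utils.data.DataLoader(dataset, num_workers=4)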

Dataset output

The default output format of the dataset is a dictionary with the keys:

  • url: The full endpoint URL of the blob.

  • data: The content of the blob as bytes.

For example:

{
    "url": "https://<account-name>.blob.core.windows.net/<container-name>/<blob-name>",
    "data": b"<blob-content>"
}

To override the output format, provide a transform callable to either from_blob_urls() or from_container_url() when creating the dataset.

classmethod from_blob_urls(blob_urls: str | Iterable[str], *, credential: AzureSasCredential | TokenCredential | None | Literal[False] = None, transform: Callable[[Blob], _TransformOutputType_co] | None = None) → Self

Instantiate a dataset from the provided blob URLs.

Sample usage:

container_url = "https://<storage-account-name>.blob.core.windows.net/<container-name>"
dataset = IterableBlobDataset.from_blob_urls([
    f"{container_url}/<blob-name-1>",
    f"{container_url}/<blob-name-2>",
    f"{container_url}/<blob-name-3>",
])
Parameters:
  • blob_urls (str | Iterable[str]) -- The full endpoint URLs to the blobs to be used for the dataset. Can be a single URL or an iterable of URLs. URLs respect SAS tokens, snapshots, and version IDs in their query strings.

  • credential (AzureSasCredential | TokenCredential | None | Literal[False]) -- The credential to use for authentication. If not specified, azure.identity.DefaultAzureCredential will be used. When set to False, anonymous requests will be made. If a URL contains a SAS token, this parameter is ignored for that URL.

  • transform (Callable[[Blob], _TransformOutputType_co] | None) -- A callable that accepts a Blob object representing a blob in the dataset and returns the transformed output to use as the dataset's output. See the Blob class for more information on writing a transform callable to override the default dataset output format.

Returns:

Dataset formed from the provided blob URLs.

Return type:

Self

classmethod from_container_url(container_url: str, *, prefix: str | None = None, credential: AzureSasCredential | TokenCredential | None | Literal[False] = None, transform: Callable[[Blob], _TransformOutputType_co] | None = None) → Self

Instantiate a dataset by listing blobs from the provided container URL.

Sample usage:

dataset = IterableBlobDataset.from_container_url(
    "https://<storage-account-name>.blob.core.windows.net/<container-name>",
)
Parameters:
  • container_url (str) -- The full endpoint URL to the container to be used for the dataset. The URL respects SAS tokens in its query string.

  • prefix (str | None) -- The prefix to filter blobs by. Only blobs whose names begin with prefix will be included in the dataset. If not specified, all blobs in the container will be included in the dataset.

  • credential (AzureSasCredential | TokenCredential | None | Literal[False]) -- The credential to use for authentication. If not specified, azure.identity.DefaultAzureCredential will be used. When set to False, anonymous requests will be made. If the container URL contains a SAS token, this parameter is ignored.

  • transform (Callable[[Blob], _TransformOutputType_co] | None) -- A callable that accepts a Blob object representing a blob in the dataset and returns the transformed output to use as the dataset's output. See the Blob class for more information on writing a transform callable to override the default dataset output format.

Returns:

Dataset formed from the blobs in the provided container URL.

Return type:

Self

__iter__() → Iterator[_TransformOutputType_co]

Iterate over the blobs in the dataset.

Returns:

An iterator over the blobs in the dataset, with the transform applied lazily to each blob as it is yielded.

Return type:

Iterator[_TransformOutputType_co]

class azstoragetorch.datasets.Blob(blob_client: AzStorageTorchBlobClient)

Object representing a single blob in a dataset.

Datasets instantiate Blob objects and pass them directly to the dataset's transform callable. Within the transform callable, use this class's properties and methods to access the blob's metadata and content. For example:

from azstoragetorch.datasets import Blob, BlobDataset

def to_bytes(blob: Blob) -> bytes:
    with blob.reader() as f:
        return f.read()

dataset = BlobDataset.from_blob_urls(
    "https://<storage-account-name>.blob.core.windows.net/<container-name>/<blob-name>",
    transform=to_bytes
)
print(type(dataset[0]))  # Type should be: <class 'bytes'>

Instantiating the class directly using __init__() is not supported.

property url: str

The full endpoint URL of the blob.

The query string is not included in the returned URL.

property blob_name: str

The name of the blob.

property container_name: str

The name of the blob's container.

reader() → BlobIO

Open a file-like object for reading the blob's content.

Returns:

A file-like object for reading the blob's content.

Return type:

BlobIO

Exceptions

exception azstoragetorch.exceptions.AZStorageTorchError

Bases: Exception

Base class for exceptions raised by azstoragetorch.

exception azstoragetorch.exceptions.ClientRequestIdMismatchError(request_client_id: str, response_client_id: str, service_request_id: str)

Bases: AZStorageTorchError

Raised when a client request ID in a response does not match the ID in its originating request.

If you receive this error while using an azstoragetorch dataset with a PyTorch DataLoader, it may be because the dataset is being accessed in both the main process and a DataLoader worker process, which can cause unintentional sharing of resources. To fix this error, avoid accessing the dataset's samples in the main process, or do not use workers with the DataLoader.

exception azstoragetorch.exceptions.FatalBlobIOWriteError(underlying_exception: BaseException)

Bases: AZStorageTorchError

Raised when a fatal error occurs during BlobIO write operations.

When this exception is raised, no further writes can be performed on the BlobIO object, and no blocks staged as part of this BlobIO will be committed. It is recommended to create a new BlobIO object and retry all writes.
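
A minimal sketch of that retry pattern (the helper below is illustrative, not part of the library):

from azstoragetorch.exceptions import FatalBlobIOWriteError
from azstoragetorch.io import BlobIO

def write_with_retry(blob_url: str, data: bytes, attempts: int = 3) -> None:
    # On a fatal write error nothing from the failed BlobIO is committed,
    # so retry with a brand-new BlobIO instance and rewrite all data.
    for attempt in range(attempts):
        try:
            with BlobIO(blob_url, "wb") as f:
                f.write(data)
            return
        except FatalBlobIOWriteError:
            if attempt == attempts - 1:
                raise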