API Reference¶
BlobIO¶
- class azstoragetorch.io.BlobIO(blob_url: str, mode: Literal['rb', 'wb'], *, credential: AzureSasCredential | TokenCredential | None | Literal[False] = None, **_internal_only_kwargs)¶
Bases: IOBase
File-like object for reading and writing blobs in Azure Blob Storage.
Use this class for PyTorch checkpointing by passing it directly to torch.save() or torch.load():

    import torch
    import torchvision.models

    from azstoragetorch.io import BlobIO

    CONTAINER_URL = "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>"

    # Sample model to save and load. Replace with your own model.
    model = torchvision.models.resnet18(weights="DEFAULT")

    # Save trained model to Azure Blob Storage. This saves the model weights
    # to a blob named "model_weights.pth" in the container specified by CONTAINER_URL.
    with BlobIO(f"{CONTAINER_URL}/model_weights.pth", "wb") as f:
        torch.save(model.state_dict(), f)

    # Load trained model from Azure Blob Storage. This loads the model weights
    # from the blob named "model_weights.pth" in the container specified by CONTAINER_URL.
    with BlobIO(f"{CONTAINER_URL}/model_weights.pth", "rb") as f:
        model.load_state_dict(torch.load(f))
- Parameters:
blob_url (str) -- The full endpoint URL to the blob. The URL respects SAS tokens, snapshots, and version IDs in its query string.
mode (Literal['rb', 'wb']) --
The mode in which to open the blob. Supported modes are:
- rb - Opens blob for reading
- wb - Opens blob for writing
credential (AzureSasCredential | TokenCredential | None | Literal[False]) -- The credential to use for authentication. If not specified, azure.identity.DefaultAzureCredential will be used. When set to False, anonymous requests will be made. If the blob_url contains a SAS token, this parameter is ignored.
- close() None ¶
Close the file-like object.
In write mode, this will flush() all written data and commit the blob.
- Raises:
FatalBlobIOWriteError -- if a fatal error occurs when writing to the blob. If raised, none of the data written or uploaded using this BlobIO instance will be committed to the blob. When retrying, it is recommended to create a new BlobIO instance and retry all writes.
- property closed: bool¶
Whether the file-like object is closed.
Is True if the file-like object is closed, False otherwise.
- flush() None ¶
Flush all written data to the blob.
When called, any unstaged data will be uploaded, and the method will block until all uploads complete. In read mode, this method has no effect.
- Raises:
FatalBlobIOWriteError -- if a fatal error occurs when writing to the blob. If raised, none of the data written or uploaded using this BlobIO instance will be committed to the blob. When retrying, it is recommended to create a new BlobIO instance and retry all writes.
- readable() bool ¶
Return whether the file-like object is readable.
- Returns:
True if opened in read mode, False otherwise.
- Return type:
bool
- readline(size: int | None = -1, /) bytes ¶
Read and return a line from the file-like object.
The line terminator is always b'\n'.
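For example, a newline-delimited blob can be consumed line by line. The sketch below assumes a hypothetical text blob at a placeholder URL; line iteration comes from the inherited IOBase interface:

    from azstoragetorch.io import BlobIO

    # Placeholder URL; replace with a real newline-delimited text blob.
    with BlobIO("https://<account-name>.blob.core.windows.net/<container-name>/labels.txt", "rb") as f:
        header = f.readline()  # First line, including the trailing b"\n"
        for line in f:         # Iteration repeatedly calls readline()
            print(line.decode("utf-8").rstrip("\n"))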
- seek(offset: int, whence: int = 0, /) int ¶
Change the file-like position to a given byte offset.
- Parameters:
offset (int) -- The offset to seek to.
whence (int) --
The reference point for the offset. Accepted values are:
- os.SEEK_SET - The start of the file-like object (the default)
- os.SEEK_CUR - The current position in the file-like object
- os.SEEK_END - The end of the file-like object
- Returns:
The new absolute position in the file-like object.
- Return type:
int
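Because seeking is supported in read mode, specific byte ranges of a blob can be read without downloading it sequentially. A minimal sketch, with a placeholder URL and arbitrary offsets:

    import os

    from azstoragetorch.io import BlobIO

    # Placeholder URL; offsets are arbitrary for illustration.
    with BlobIO("https://<account-name>.blob.core.windows.net/<container-name>/data.bin", "rb") as f:
        f.seek(128)              # 128 bytes from the start (os.SEEK_SET is the default)
        chunk = f.read(16)       # Read 16 bytes at that position
        f.seek(-8, os.SEEK_END)  # 8 bytes before the end of the blob
        tail = f.read()
        print(f.tell())          # New absolute position: the blob's size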
- seekable() bool ¶
Return whether the file-like object supports random access.
- Returns:
True if the file-like object can seek, False otherwise. Seeking is only supported in read mode.
- Return type:
bool
- tell() int ¶
Return the current position in the file-like object.
- Returns:
The current position in the file-like object.
- Return type:
int
- writable() bool ¶
Return whether the file-like object is writable.
- Returns:
True if opened in write mode, False otherwise.
- Return type:
bool
- write(b: bytes | bytearray | memoryview, /) int ¶
Write a bytes-like object to the blob.
Data written may not be uploaded immediately. Instead, data may be uploaded via threads after write() has returned, or as part of subsequent calls to BlobIO methods. This means that write() returning without an error does not mean the data was successfully uploaded to the blob. Calls to flush() or close() will upload all pending data, block until all data is uploaded, and propagate any errors.
- Parameters:
b (bytes | bytearray | memoryview) -- The bytes-like object to write to the blob.
- Returns:
The number of bytes written.
- Raises:
FatalBlobIOWriteError -- if a fatal error occurs when writing to the blob. If raised, none of the data written or uploaded using this BlobIO instance will be committed to the blob. When retrying, it is recommended to create a new BlobIO instance and retry all writes.
- Return type:
int
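Since a FatalBlobIOWriteError discards everything staged through the failing instance, the documented retry approach is to start over with a fresh BlobIO. A minimal sketch, with placeholder URL and payload:

    from azstoragetorch.exceptions import FatalBlobIOWriteError
    from azstoragetorch.io import BlobIO

    BLOB_URL = "https://<account-name>.blob.core.windows.net/<container-name>/<blob-name>"  # placeholder
    payload = b"<data-to-upload>"  # placeholder

    for attempt in range(3):
        try:
            # Create a new BlobIO per attempt; nothing staged by a failed instance is committed.
            with BlobIO(BLOB_URL, "wb") as f:
                f.write(payload)
            break  # close() flushed and committed the blob
        except FatalBlobIOWriteError:
            if attempt == 2:
                raise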
Datasets¶
- class azstoragetorch.datasets.BlobDataset(blobs: Iterable[Blob], transform: Callable[[Blob], _TransformOutputType_co] | None = None)¶
Bases: Dataset[_TransformOutputType_co]

Map-style dataset for blobs in Azure Blob Storage.

Data samples returned from the dataset map one-to-one to blobs in Azure Blob Storage. Use from_blob_urls() or from_container_url() to create an instance of this dataset. For example:

    from azstoragetorch.datasets import BlobDataset

    dataset = BlobDataset.from_container_url(
        "https://<storage-account-name>.blob.core.windows.net/<container-name>"
    )
    print(dataset[0])  # Print first blob in the dataset
Instantiating the dataset class directly using __init__() is not supported.

Usage with PyTorch DataLoader

The dataset can be provided directly to a PyTorch DataLoader:

    import torch.utils.data

    loader = torch.utils.data.DataLoader(dataset)
Dataset output
The default output format of the dataset is a dictionary with the keys:
- url: The full endpoint URL of the blob.
- data: The content of the blob as bytes.
For example:
{ "url": "https://<account-name>.blob.core.windows.net/<container-name>/<blob-name>", "data": b"<blob-content>" }
To override the output format, provide a transform callable to either from_blob_urls() or from_container_url() when creating the dataset.

- classmethod from_blob_urls(blob_urls: str | Iterable[str], *, credential: AzureSasCredential | TokenCredential | None | Literal[False] = None, transform: Callable[[Blob], _TransformOutputType_co] | None = None) Self ¶
Instantiate dataset from provided blob URLs.
Sample usage:
    container_url = "https://<storage-account-name>.blob.core.windows.net/<container-name>"

    dataset = BlobDataset.from_blob_urls([
        f"{container_url}/<blob-name-1>",
        f"{container_url}/<blob-name-2>",
        f"{container_url}/<blob-name-3>",
    ])
- Parameters:
blob_urls (str | Iterable[str]) -- The full endpoint URLs to the blobs to be used for dataset. Can be a single URL or an iterable of URLs. URLs respect SAS tokens, snapshots, and version IDs in their query strings.
credential (AzureSasCredential | TokenCredential | None | Literal[False]) -- The credential to use for authentication. If not specified, azure.identity.DefaultAzureCredential will be used. When set to False, anonymous requests will be made. If a URL contains a SAS token, this parameter is ignored for that URL.
transform (Callable[[Blob], _TransformOutputType_co] | None) -- A callable that accepts a Blob object representing a blob in the dataset and returns a transformed output to be used as output from the dataset. See the Blob class for more information on writing a transform callable to override the default dataset output format, and the sketch after this method.
- Returns:
Dataset formed from the provided blob URLs.
- Return type:
Self
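As an illustration, the hypothetical transform below decodes each blob's content to text using Blob.reader() (shown in the Blob class example); the URL is a placeholder:

    from azstoragetorch.datasets import Blob, BlobDataset

    # Hypothetical transform: decode each blob's content as UTF-8 text.
    def to_text(blob: Blob) -> str:
        with blob.reader() as f:
            return f.read().decode("utf-8")

    dataset = BlobDataset.from_blob_urls(
        "https://<storage-account-name>.blob.core.windows.net/<container-name>/<blob-name>",
        transform=to_text,
    )
    print(dataset[0])  # The blob's content decoded as text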
- classmethod from_container_url(container_url: str, *, prefix: str | None = None, credential: AzureSasCredential | TokenCredential | None | Literal[False] = None, transform: Callable[[Blob], _TransformOutputType_co] | None = None) Self ¶
Instantiate dataset by listing blobs from provided container URL.
Sample usage:
    dataset = BlobDataset.from_container_url(
        "https://<storage-account-name>.blob.core.windows.net/<container-name>",
    )
- Parameters:
container_url (str) -- The full endpoint URL to the container to be used for dataset. The URL respects SAS tokens in its query string.
prefix (str | None) -- The prefix to filter blobs by. Only blobs whose names begin with prefix will be included in the dataset. If not specified, all blobs in the container will be included in the dataset. See the sketch after this method for an example.
credential (AzureSasCredential | TokenCredential | None | Literal[False]) -- The credential to use for authentication. If not specified, azure.identity.DefaultAzureCredential will be used. When set to False, anonymous requests will be made. If a URL contains a SAS token, this parameter is ignored for that URL.
transform (Callable[[Blob], _TransformOutputType_co] | None) -- A callable that accepts a Blob object representing a blob in the dataset and returns a transformed output to be used as output from the dataset. See the Blob class for more information on writing a transform callable to override the default dataset output format.
- Returns:
Dataset formed from the blobs in the provided container URL.
- Return type:
Self
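For example, prefix can limit the dataset to a directory-like slice of the container. A minimal sketch, with placeholder account, container, and prefix names:

    from azstoragetorch.datasets import BlobDataset

    # Only blobs whose names begin with "train/images/" are included; names are placeholders.
    dataset = BlobDataset.from_container_url(
        "https://<storage-account-name>.blob.core.windows.net/<container-name>",
        prefix="train/images/",
    )
    print(dataset[0]["url"])  # First blob under the prefix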
- class azstoragetorch.datasets.IterableBlobDataset(blobs: Iterable[Blob], transform: Callable[[Blob], _TransformOutputType_co] | None = None)¶
Bases: IterableDataset[_TransformOutputType_co]

Iterable-style dataset for blobs in Azure Blob Storage.

Data samples returned from the dataset map one-to-one to blobs in Azure Blob Storage. Use from_blob_urls() or from_container_url() to create an instance of this dataset. For example:

    from azstoragetorch.datasets import IterableBlobDataset

    dataset = IterableBlobDataset.from_container_url(
        "https://<storage-account-name>.blob.core.windows.net/<container-name>"
    )
    print(next(iter(dataset)))  # Print first blob in the dataset
Instantiating the dataset class directly using __init__() is not supported.

Usage with PyTorch DataLoader

The dataset can be provided directly to a PyTorch DataLoader:

    import torch.utils.data

    loader = torch.utils.data.DataLoader(dataset)
When setting num_workers for the DataLoader, the dataset automatically shards the data samples it returns across workers, so the DataLoader does not return duplicate samples from its workers, as the sketch below shows.
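A minimal sketch, continuing from the dataset above; batch_size=None disables automatic batching so each yielded sample is the dataset's default dictionary output:

    import torch.utils.data

    # Each worker iterates a disjoint shard of the blobs, so no sample is duplicated.
    loader = torch.utils.data.DataLoader(dataset, batch_size=None, num_workers=4)
    for sample in loader:
        print(sample["url"])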
Dataset output

The default output format of the dataset is a dictionary with the keys:
- url: The full endpoint URL of the blob.
- data: The content of the blob as bytes.
For example:
{ "url": "https://<account-name>.blob.core.windows.net/<container-name>/<blob-name>", "data": b"<blob-content>" }
To override the output format, provide a transform callable to either from_blob_urls() or from_container_url() when creating the dataset.

- classmethod from_blob_urls(blob_urls: str | Iterable[str], *, credential: AzureSasCredential | TokenCredential | None | Literal[False] = None, transform: Callable[[Blob], _TransformOutputType_co] | None = None) Self ¶
Instantiate dataset from provided blob URLs.
Sample usage:
    container_url = "https://<storage-account-name>.blob.core.windows.net/<container-name>"

    dataset = IterableBlobDataset.from_blob_urls([
        f"{container_url}/<blob-name-1>",
        f"{container_url}/<blob-name-2>",
        f"{container_url}/<blob-name-3>",
    ])
- Parameters:
blob_urls (str | Iterable[str]) -- The full endpoint URLs to the blobs to be used for dataset. Can be a single URL or an iterable of URLs. URLs respect SAS tokens, snapshots, and version IDs in their query strings.
credential (AzureSasCredential | TokenCredential | None | Literal[False]) -- The credential to use for authentication. If not specified, azure.identity.DefaultAzureCredential will be used. When set to False, anonymous requests will be made. If a URL contains a SAS token, this parameter is ignored for that URL.
transform (Callable[[Blob], _TransformOutputType_co] | None) -- A callable that accepts a Blob object representing a blob in the dataset and returns a transformed output to be used as output from the dataset. See the Blob class for more information on writing a transform callable to override the default dataset output format.
- Returns:
Dataset formed from the provided blob URLs.
- Return type:
Self
- classmethod from_container_url(container_url: str, *, prefix: str | None = None, credential: AzureSasCredential | TokenCredential | None | Literal[False] = None, transform: Callable[[Blob], _TransformOutputType_co] | None = None) Self ¶
Instantiate dataset by listing blobs from provided container URL.
Sample usage:
    dataset = IterableBlobDataset.from_container_url(
        "https://<storage-account-name>.blob.core.windows.net/<container-name>",
    )
- Parameters:
container_url (str) -- The full endpoint URL to the container to be used for dataset. The URL respects SAS tokens in its query string.
prefix (str | None) -- The prefix to filter blobs by. Only blobs whose names begin with prefix will be included in the dataset. If not specified, all blobs in the container will be included in the dataset.
credential (AzureSasCredential | TokenCredential | None | Literal[False]) -- The credential to use for authentication. If not specified, azure.identity.DefaultAzureCredential will be used. When set to False, anonymous requests will be made. If a URL contains a SAS token, this parameter is ignored for that URL.
transform (Callable[[Blob], _TransformOutputType_co] | None) -- A callable that accepts a Blob object representing a blob in the dataset and returns a transformed output to be used as output from the dataset. See the Blob class for more information on writing a transform callable to override the default dataset output format.
- Returns:
Dataset formed from the blobs in the provided container URL.
- Return type:
Self
- class azstoragetorch.datasets.Blob(blob_client: AzStorageTorchBlobClient)¶
Object representing a single blob in a dataset.
Datasets instantiate Blob objects and pass them directly to a dataset's transform callable. Within the transform callable, use properties and methods to access a blob's properties and content. For example:

    from azstoragetorch.datasets import Blob, BlobDataset

    def to_bytes(blob: Blob) -> bytes:
        with blob.reader() as f:
            return f.read()

    dataset = BlobDataset.from_blob_urls(
        "https://<storage-account-name>.blob.core.windows.net/<container-name>/<blob-name>",
        transform=to_bytes
    )
    print(type(dataset[0]))  # Type should be: <class 'bytes'>
Instantiating the class directly using __init__() is not supported.
Exceptions¶
- exception azstoragetorch.exceptions.AZStorageTorchError¶
Bases: Exception

Base class for exceptions raised by azstoragetorch.
- exception azstoragetorch.exceptions.ClientRequestIdMismatchError(request_client_id: str, response_client_id: str, service_request_id: str)¶
Bases: AZStorageTorchError

Raised when the client request ID in a response does not match the ID in its originating request.
If this error occurs while using both an azstoragetorch dataset and a PyTorch DataLoader, it may be because the dataset is being accessed in both the main process and a DataLoader worker process, which can cause unintentional sharing of resources. To fix it, avoid accessing the dataset's samples in the main process, or do not use workers with the DataLoader. The sketch below shows the pattern to avoid.
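As an illustration of the pattern to avoid (not an example from the library's documentation; the URL is a placeholder):

    import torch.utils.data

    from azstoragetorch.datasets import IterableBlobDataset

    dataset = IterableBlobDataset.from_container_url(
        "https://<storage-account-name>.blob.core.windows.net/<container-name>"
    )

    # Problematic: a sample is consumed in the main process...
    first = next(iter(dataset))

    # ...and the same dataset is then iterated from worker processes, which can
    # unintentionally share resources across processes and raise this error.
    loader = torch.utils.data.DataLoader(dataset, batch_size=None, num_workers=2)
    for sample in loader:
        pass

    # Fix: iterate only through the DataLoader, or set num_workers=0.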
- exception azstoragetorch.exceptions.FatalBlobIOWriteError(underlying_exception: BaseException)¶
Bases: AZStorageTorchError

Raised when a fatal error occurs during BlobIO write operations.

When this exception is raised, it indicates that no more writing can be performed on the BlobIO object and no blocks staged as part of this BlobIO will be committed. When retrying, it is recommended to create a new BlobIO object and retry all writes.