Azure IoT C SDK
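This document describes the reliability aspects of the Azure IoT C SDK design: authentication, connection establishment and detection of connection issues, connection retry logic and policies, timeout controls, connection status callbacks and configuration options.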
This is a brief note to clarify how authentication is done in the IoTHub Device/Module clients.
Authentication of a client in the SDK can be done using SAS tokens, X.509 certificates, or the Device Provisioning Service (DPS).
This section does not describe the details of the Device Provisioning Service (DPS); please refer to the DPS documentation for details.
When using SAS tokens, authentication can be done by:
As mentioned in the articles above, SAS tokens have an expiration time.
The Azure IoT SDK connection then generates and sends new SAS tokens periodically to the IoT Hub to keep the connection authenticated.
The internal behaviour is different depending on the transport protocol used:
Transport | Behaviour |
---|---|
MQTT | SAS tokens are valid for 1 hour, and a new one is sent every 48 minutes. Every time a new SAS token needs to be sent, the client will disconnect from the Azure IoT Hub and reconnect. |
AMQP | SAS tokens are valid for 1 hour, and a new one is sent every 48 minutes. Client is not disconnected when a new SAS token is sent. |
HTTP | There is no persistent connection; a new SAS token (valid for 1 hour) is created and sent with each request to the Azure IoT Hub. |
Both the SAS token lifetime and refresh rate are configurable on AMQP transport (see AMQP transport section in SDK options documentation).
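The timing in the table can be illustrated with a small sketch. The helper names below are hypothetical and not part of the SDK; the constants are the MQTT and AMQP defaults quoted above, and timestamps are plain seconds supplied by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Defaults quoted in the table above (MQTT and AMQP). */
#define SAS_TOKEN_LIFETIME_SECS (60u * 60u) /* 1 hour */
#define SAS_TOKEN_REFRESH_SECS  (48u * 60u) /* 48 minutes */

/* Hypothetical helper, not an SDK function: returns true once the token
   created at created_secs is old enough that a new one should be sent. */
bool sas_token_needs_refresh(uint64_t created_secs, uint64_t now_secs)
{
    assert(now_secs >= created_secs);
    return (now_secs - created_secs) >= SAS_TOKEN_REFRESH_SECS;
}

/* Seconds of validity left on the old token at the moment it is replaced. */
uint64_t sas_token_refresh_margin_secs(void)
{
    return SAS_TOKEN_LIFETIME_SECS - SAS_TOKEN_REFRESH_SECS;
}
```

With these defaults a replacement token is sent while the previous one still has 12 minutes of validity left, which leaves room for a slow or retried refresh.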
The design of the Azure IoT C SDK is composed of layers, each of them assigned specific responsibilities:
Layer | C module | Purpose |
---|---|---|
Azure IoT C SDK | iothub_client | Multi-threaded layer over iothub_client_ll (automatically performs invocation of the IoTHubClient_LL_DoWork function and internal callback handling) |
Azure IoT C Low Level SDK | iothub_client_ll | Main surface of the Azure IoT device client API (single-threaded) |
Protocol Transport | iothubtransport* | Provides an interface between the specific protocol API (e.g., uamqp, umqtt) and the upper client SDK. It is responsible for part of the business logic, message queuing, timeout control and options handling. |
Protocol API | uamqp, umqtt or native HTTP API | Implements the specific application protocol (either AMQP, MQTT or HTTP, respectively) |
TLS | tlsio_* | Provides a wrapper over the specific TLS API (Schannel, openssl, wolfssl, mbedtls), using the xio interface that the device client SDK uses |
Socket | socketio_* | Provides a wrapper over the specific socket API (win32, berkeley), using the xio interface that the device client SDK uses |
When an Azure IoT device client instance is created, this is the typical sequence within the SDK:
Each of these layers provides status and error events to the layer above through function returns and callbacks. Connection issues are detected in three different ways across the SDK:
Which can be, for example:
Transport protocol detecting timeouts waiting for:
Through failures reported by the socket APIs;
Through graceful disconnection notifications from the Azure IoT Hub, which can happen, for example, when a hub is preparing for a system update (upon reconnection the device client is automatically routed to the next available hub).
Some aspects of the detection of connection issues are specific to the transport protocol used, as shown in the table:
Protocol | Connection Issue Detection |
---|---|
AMQP | Besides regular detection through callbacks from uAMQP, the AMQP protocol transport will mark a connection to the Azure IoT Hub as faulty if 5 (five) or more consecutive failures occur on any of these: A) attempting to subscribe for Commands, Device Methods or Twin Desired Properties, B) sending Telemetry messages (either by timeouts or active failures returned by the uAMQP API), C) responding to Device Method invocations, D) refreshing CBS authentication tokens. |
MQTT | Besides regular detection through callbacks from uMQTT, the MQTT protocol transport will attempt to publish messages up to two times (waiting 60 seconds between attempts) before raising a failure. |
HTTP | HTTP connections to the Azure IoT Hub are not persistent. Each outgoing message to the hub results in a new connection, which is closed as soon as the I/O is completed. Incoming messages from the Hub are received by the device client through polling mechanisms, where the HTTP connection follows the same lifecycle as above. If connection failures occur, the protocol transport simply keeps retrying the operation until it succeeds. |
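The AMQP rule in the table (a connection is marked faulty after five consecutive failures) can be sketched as a small counter. The type and function names are hypothetical and are not taken from the SDK sources; the sketch assumes that any success resets the count, which is what "consecutive" implies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Threshold quoted in the table above for the AMQP transport. */
#define MAX_CONSECUTIVE_FAILURES 5

/* Hypothetical tracker, not an SDK type. */
typedef struct {
    size_t consecutive_failures;
} failure_tracker;

/* Records the outcome of one operation (for example a telemetry send) and
   returns true when the connection should be treated as faulty. A success
   resets the count, so only an unbroken run of failures trips the threshold. */
bool failure_tracker_report(failure_tracker *tracker, bool succeeded)
{
    assert(tracker != NULL);
    if (succeeded) {
        tracker->consecutive_failures = 0;
    } else {
        tracker->consecutive_failures++;
    }
    return tracker->consecutive_failures >= MAX_CONSECUTIVE_FAILURES;
}
```

A transport following this pattern would start its connection retry logic as soon as the function returns true.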
Once a connection issue is detected, the transport protocol will initiate its connection retry logic. The process is as follows:
a. Pending outgoing messages are properly handled;
b. Connection components are destroyed;
c. Connection Status Callback is invoked (if subscribed);
More details are explained in the "Connection Retry Policies" sub-section below.
If the Retry Policy requires a wait before an attempt can be made, the protocol transport delays the re-connection, starting again from step 3 afterwards.
The Azure IoT Device Client C SDK implements asynchronous operations through the *_DoWork() model it uses.
All API functions that result in I/O (like IoTHubClient_LL_SendEventAsync or IoTHubClient_LL_SetMessageCallback) are actually queued or stored, taking effect only when IoTHubClient_LL_DoWork is invoked and the connection with the Azure IoT Hub is established.
For example, while the device client is in re-connection mode,
For clarity it is worth mentioning that while the Azure IoT Device Client is re-connecting, it has no means to receive any messages from the Azure IoT Hub. During that time any attempts to send Commands or invoke Device Methods to the given device client will result in failure returned by the Azure IoT Hub to the source of those requests.
Besides checking for error returns and responding to callbacks from its lower layers, the Azure IoT Device Client C SDK also implements extra logic to detect failures by tracking timeouts. They apply to different functionalities within the SDK, each with a specific course of action in case a timeout occurs.
Some of the time-outs can be fine-tuned by the user (through IoTHubClient_LL_SetOption). Some are dependent on the application protocol selected (AMQP, MQTT or HTTP).
Details about the specific timeout control values that can be set by the user can be found in the "Current Configuration Options" sub-section below.
Currently the Azure IoT Device Client C SDK implements a queue for pending outgoing Telemetry messages that is owned by the iothub_client_ll layer.
Any new Telemetry messages passed to the SDK by the user (through ) are copied and immediately stored in that queue.
That same queue is shared with the protocol transport, which treats it as a "waiting to send" list. The protocol transport also has its own queue for Telemetry messages, but this one is only for messages that are already being sent (i.e., have been processed and passed down to the protocol API module).
The protocol transport at some point (namely, right when IoTHubClient_LL_DoWork is invoked) removes messages in the "waiting to send" list (in the order they were added), converts them to the format understood by the protocol API layer, then adds the pair to its "in progress" list.
Once the lower layer calls back with the send completion notification the protocol transport removes the specific message from its "in progress" list and bubbles it (along with the send result) to the user (through the callback provided).
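The flow described above (store on send, move to "in progress" on do-work, remove on completion) can be sketched with two fixed-size lists. All types and functions here are hypothetical stand-ins written for illustration; they are not the SDK's data structures, and the real SDK also converts each message to the protocol format and reports the send result to the user callback.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MESSAGES 8

/* Hypothetical message and list types, not SDK definitions. */
typedef struct { int id; } message;
typedef struct { message items[MAX_MESSAGES]; size_t count; } message_list;

typedef struct {
    message_list waiting_to_send; /* owned by the client layer */
    message_list in_progress;     /* owned by the protocol transport */
} telemetry_pipeline;

/* Appends a message; returns 0 on success, -1 if the list is full. */
int list_push(message_list *list, message msg)
{
    assert(list != NULL);
    if (list->count == MAX_MESSAGES) {
        return -1;
    }
    list->items[list->count++] = msg;
    return 0;
}

/* Stand-in for the send call: the message is only stored, nothing is sent. */
int pipeline_send_async(telemetry_pipeline *p, int id)
{
    message msg = { id };
    return list_push(&p->waiting_to_send, msg);
}

/* Stand-in for one do-work pass: moves waiting messages, oldest first, to
   the in-progress list. Returns the number of messages moved. */
size_t pipeline_do_work(telemetry_pipeline *p)
{
    size_t moved = 0;
    while (moved < p->waiting_to_send.count &&
           list_push(&p->in_progress, p->waiting_to_send.items[moved]) == 0) {
        moved++;
    }
    /* Shift any messages that did not fit to the front of the waiting list. */
    size_t remaining = p->waiting_to_send.count - moved;
    for (size_t i = 0; i < remaining; i++) {
        p->waiting_to_send.items[i] = p->waiting_to_send.items[moved + i];
    }
    p->waiting_to_send.count = remaining;
    return moved;
}

/* Stand-in for the send-completion notification from the lower layer:
   removes the message from the in-progress list. Returns 0 if found. */
int pipeline_on_send_complete(telemetry_pipeline *p, int id)
{
    for (size_t i = 0; i < p->in_progress.count; i++) {
        if (p->in_progress.items[i].id == id) {
            for (size_t j = i + 1; j < p->in_progress.count; j++) {
                p->in_progress.items[j - 1] = p->in_progress.items[j];
            }
            p->in_progress.count--;
            return 0;
        }
    }
    return -1;
}
```

Because every do-work pass drains the waiting list, a message normally spends very little time there, which matters for the timeout controls on each list.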
There are two timeout controls in this system. An original one in the iothub_client_ll layer - which controls the "waiting to send" queue - and a modern one in the protocol transport layer - that applies to the "in progress" list. However, since IoTHubClient_LL_DoWork causes the Telemetry messages to be immediately processed, sent and moved to the "in progress" list, the first timeout control is virtually non-applicable.
Both can be fine-tuned by users through IoTHubClient_LL_SetOption, and because of that removing the original control could cause a break for existing customers. For that reason it has been kept as is, but it will be re-designed when we move to the next major version of the product.
For now, new customers should set both options if using the iothub_client_ll module, or just the protocol transport one if using the iothub_client module.
Sometimes connection issues, although transient, can last longer than expected, or network availability can bounce on and off for a while after it first returns.
Attempting to reconnect to the Azure IoT Hub immediately in a loop is not the most efficient way for the device client SDK to operate, since some of the initial attempts might fail because of the reasons above.
As shown above on "The Connection Retry Logic" (step 3), the protocol transport checks if the current retry policy allows it to attempt to re-connect to the Azure IoT Hub.
The Retry Policy feature exposes a way to control how immediately and frequently the Azure IoT Device Client C SDK will attempt to re-connect to the Azure IoT Hub in case a connection issue occurs.
The logic is as follows:
In such case the protocol transport will attempt reconnecting.
If it succeeds, it will instruct the Retry Control to reset its counters, and the next wait times it calculates will start fresh from the initial base time.
If the re-connection attempt fails the protocol transport will continue checking with the Retry Control if it can try again. Depending on the Retry Policy currently set this next wait time calculated by the Retry Control can be longer (and similarly along with the next wait times calculated), aiming at easing out the frequency with which the device client attempts to reconnect. That gives a chance for the network (at the Operating System level) to get back to its normal availability.
The transport protocol will continue to check with the Retry Control if it is time to attempt re-connecting (until it is finally allowed or denied).
The Retry Policies are composed of an algorithm to calculate the wait times in between reconnection attempts, as well as a maximum time for the total amount of consecutive tries (see argument "retryTimeoutLimitInSeconds" on the functions below).
If this timeout is reached, the Retry Control will instruct the transport protocol to not re-connect anymore. In this scenario a Connection Status notification is raised and the user application is made aware of that. Some customers use this aspect of the feature to perform manual intervention in case re-connections have been failing consecutively for long periods (e.g., 24 hours or more), possibly indicating more serious network issues.
One important detail is that the protocol transport never ceases checking with the Retry Control if it can attempt to re-connect again, even when it is told not to try anymore. In that case, if during execution time the user changes the current Retry Policy (see below for details) the Retry Control gets reset, giving the opportunity for the device client to start attempting to re-connect again.
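The interaction between the protocol transport and the Retry Control can be sketched as a single decision function. This is a simplified model under stated assumptions, not the SDK's retry control: the names are hypothetical, time is passed in as plain seconds, and the total timeout is assumed to be measured from the first attempt of the current run of failures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decision values, not SDK enumerators. */
typedef enum { RETRY_NOW, RETRY_LATER, RETRY_STOP } retry_decision;

typedef struct {
    uint64_t first_attempt_secs; /* start of the current run of failures */
    uint64_t last_attempt_secs;  /* time of the most recent attempt */
    uint64_t wait_secs;          /* wait required by the current policy */
    uint64_t timeout_limit_secs; /* total limit; 0 disables it */
} retry_state;

/* Answers the question the transport keeps asking: may I reconnect now? */
retry_decision retry_should_retry(const retry_state *s, uint64_t now_secs)
{
    assert(s != NULL);
    assert(now_secs >= s->last_attempt_secs);
    assert(s->last_attempt_secs >= s->first_attempt_secs);

    if (s->timeout_limit_secs != 0 &&
        now_secs - s->first_attempt_secs >= s->timeout_limit_secs) {
        return RETRY_STOP;
    }
    if (now_secs - s->last_attempt_secs >= s->wait_secs) {
        return RETRY_NOW;
    }
    return RETRY_LATER;
}

/* Models a reset, as after a successful reconnection or a policy change. */
void retry_reset(retry_state *s, uint64_t now_secs, uint64_t timeout_limit_secs)
{
    assert(s != NULL);
    s->first_attempt_secs = now_secs;
    s->last_attempt_secs = now_secs;
    s->wait_secs = 0;
    s->timeout_limit_secs = timeout_limit_secs;
}
```

Because the caller keeps polling even after a stop decision, resetting the state (for instance when the policy is changed) is enough for the next poll to allow a new attempt.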
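Currently the default Retry Policy in the Azure IoT Device Client C SDK is IOTHUB_CLIENT_RETRY_EXPONENTIAL_BACKOFF_WITH_JITTER (with no timeout), but it can be changed by calling IoTHubClient_SetRetryPolicy, or IoTHubClient_LL_SetRetryPolicy if using the iothub_client_ll module. Both take the client handle followed by the retryPolicy and retryTimeoutLimitInSeconds arguments.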
If retryTimeoutLimitInSeconds is set as 0 (zero) the timeout for retry policies is disabled.
The current retry policies available for the argument retryPolicy are:
Policy | Description | Example |
---|---|---|
IOTHUB_CLIENT_RETRY_NONE | No re-connections are ever attempted. Usually this option is used along with Connection Status callbacks by users that want to implement their own connection retry logic (at the application layer). | Device client detects a connection issue, but it never attempts to reconnect. |
IOTHUB_CLIENT_RETRY_IMMEDIATE | Re-connections are tried immediately, with no wait time in between attempts. | Device client detects a connection issue. The re-connection attempts happen immediately in a loop with no wait time until one succeeds. |
IOTHUB_CLIENT_RETRY_INTERVAL | First attempt is done immediately. Until the re-connection succeeds, each subsequent attempt is subject to a fixed-interval wait time (5 seconds by default). | Device client detects a connection issue. The first re-connection attempt happens immediately, then again every 5 seconds until it succeeds. |
IOTHUB_CLIENT_RETRY_LINEAR_BACKOFF | First attempt is done immediately. Until the re-connection succeeds, each subsequent attempt is subject to a wait time that grows linearly. Default behavior: starts from 5 seconds and grows by increments of 5 seconds each time. | Device client detects a connection issue. The first re-connection attempt happens immediately, then again in 5 seconds, then again in 10 seconds, 15 seconds, 20, 25, ... until it succeeds. |
IOTHUB_CLIENT_RETRY_EXPONENTIAL_BACKOFF | First attempt is done immediately. Until the re-connection succeeds, each subsequent attempt is subject to a wait time that grows exponentially. Default behavior: starts from 1 second and doubles each time. | Device client detects a connection issue. The first re-connection attempt happens immediately, then again in 1 second, then again in 2 seconds, 4 seconds, 8 seconds, 16, 32, 64, ... until it succeeds. |
IOTHUB_CLIENT_RETRY_EXPONENTIAL_BACKOFF_WITH_JITTER | First attempt is done immediately. Until the re-connection succeeds, each subsequent attempt is subject to a wait time that grows exponentially but with a random jitter deduction. Default behavior: starts from 1 second and doubles each time minus a random jitter of zero to one-hundred percent. | Device client detects a connection issue. The first re-connection attempt happens immediately, then again in 1 second, then again in 1 second (-100% jitter), 2 seconds (0% jitter), 3 seconds (-50% jitter), 6 (0% jitter), 10 (-67% jitter), 19 (-10% jitter), ... until it succeeds. |
IOTHUB_CLIENT_RETRY_RANDOM | First attempt is done immediately. Until the re-connection succeeds, each subsequent attempt is subject to a random wait time. Default behavior: the random wait time range is from 0 to 5 seconds. | Device client detects a connection issue. The first re-connection attempt happens immediately, then again in 5 seconds (random multiplier of 100%), then again in 2 seconds (random multiplier of 40%), 4 seconds (random multiplier of 80%), 0 seconds (random multiplier of 0%), 3 (60%), ... until it succeeds. |
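The wait-time progressions in the table can be approximated with one function that derives the next wait from the previous one. This is a sketch, not the SDK's algorithm: the enum and function are hypothetical, the random percentage is passed in by the caller so the result is deterministic, and the jitter case assumes the deduction applies to the increment, which is one reading of the example column. IOTHUB_CLIENT_RETRY_NONE is left out because it never waits or retries.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the policies in the table (not the SDK enum). */
typedef enum {
    SKETCH_RETRY_IMMEDIATE,
    SKETCH_RETRY_INTERVAL,
    SKETCH_RETRY_LINEAR_BACKOFF,
    SKETCH_RETRY_EXPONENTIAL_BACKOFF,
    SKETCH_RETRY_EXPONENTIAL_BACKOFF_WITH_JITTER,
    SKETCH_RETRY_RANDOM
} sketch_retry_policy;

/* Returns the next wait in seconds. prev_wait_secs is 0 before the first
   wait; percent (0..100) is the jitter or random multiplier, which a real
   caller would draw from a random source. */
uint32_t next_wait_secs(sketch_retry_policy policy,
                        uint32_t prev_wait_secs,
                        uint32_t percent)
{
    assert(percent <= 100);
    switch (policy) {
    case SKETCH_RETRY_INTERVAL:
        return 5;
    case SKETCH_RETRY_LINEAR_BACKOFF:
        return prev_wait_secs + 5;
    case SKETCH_RETRY_EXPONENTIAL_BACKOFF:
        return prev_wait_secs == 0 ? 1 : prev_wait_secs * 2;
    case SKETCH_RETRY_EXPONENTIAL_BACKOFF_WITH_JITTER:
        if (prev_wait_secs == 0) {
            return 1;
        }
        /* Doubling adds prev_wait_secs; the jitter removes part of it. */
        return prev_wait_secs + prev_wait_secs * (100 - percent) / 100;
    case SKETCH_RETRY_RANDOM:
        return 5 * percent / 100;
    case SKETCH_RETRY_IMMEDIATE:
    default:
        return 0;
    }
}
```

Feeding each result back in as prev_wait_secs yields the sequence for a policy, for example 5, 10, 15 for linear backoff and 1, 2, 4, 8 for exponential backoff.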
The Azure IoT Device Client C SDK provides a callback option to notify the upper application layer if it is connected to the Azure IoT Hub or not, followed by a standardized reason.
To access it the user can invoke one of the functions below, passing a callback function.
On iothub_client module: IoTHubClient_SetConnectionStatusCallback
On iothub_client_ll module: IoTHubClient_LL_SetConnectionStatusCallback
This callback will be invoked in these specific situations:
These include network availability issues, IoT Hub connectivity issues and authentication failures.
The type IOTHUB_CLIENT_CONNECTION_STATUS_CALLBACK is defined as:
The user application must provide a function that matches the function pointer definition above to provide it as an argument to any of the *_SetConnectionStatusCallback functions above.
For reference, the connection status and reason enumerations (IOTHUB_CLIENT_CONNECTION_STATUS and IOTHUB_CLIENT_CONNECTION_STATUS_REASON, respectively) are defined in iothub_client_core_common.h.
Indicates whether or not the device client is connected to the Azure IoT Hub.
Value | Description |
---|---|
IOTHUB_CLIENT_CONNECTION_UNAUTHENTICATED | Effectively means the device client is not ready to communicate with the Azure IoT Hub. The device client could be in any state from completely disconnected to not yet authenticated, including when brief disconnections occur for SAS token refreshes. See the list of IOTHUB_CLIENT_CONNECTION_STATUS_REASON status below for further details. |
IOTHUB_CLIENT_CONNECTION_AUTHENTICATED | The device client is ready to communicate with the Azure IoT Hub, being both connected and authenticated. |
This enumeration provides a more specific reason for the current connection status of the device client. Its values depend on the transport protocol chosen by the user application for the Azure IoT C SDK client (AMQP, MQTT or HTTP) and on error granularity provided by the Azure IoT Hub.
An IOTHUB_CLIENT_CONNECTION_OK is applicable to IOTHUB_CLIENT_CONNECTION_AUTHENTICATED only.
All the other values of IOTHUB_CLIENT_CONNECTION_STATUS_REASON are applicable to IOTHUB_CLIENT_CONNECTION_UNAUTHENTICATED only.
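How an application might react to the status and reason pair can be sketched as follows. To keep the example compilable on its own, the enums below are local stand-ins with shortened names; real code would use IOTHUB_CLIENT_CONNECTION_STATUS and IOTHUB_CLIENT_CONNECTION_STATUS_REASON from the SDK headers and must not redefine them. The three-parameter handler shape (status, reason, user context) and the app_state structure are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for the SDK enumerations (shortened, hypothetical names). */
typedef enum { STATUS_AUTHENTICATED, STATUS_UNAUTHENTICATED } sketch_status;
typedef enum {
    REASON_OK,
    REASON_EXPIRED_SAS_TOKEN,
    REASON_NO_NETWORK,
    REASON_RETRY_EXPIRED
} sketch_reason;

/* Hypothetical application state passed in as the user context. */
typedef struct {
    bool ready;              /* safe to expect traffic to flow */
    bool needs_intervention; /* the client has stopped retrying */
} app_state;

void on_connection_status(sketch_status status, sketch_reason reason,
                          void *user_context)
{
    app_state *app = (app_state *)user_context;
    assert(app != NULL);

    if (status == STATUS_AUTHENTICATED) {
        /* The OK reason is the only one paired with this status. */
        assert(reason == REASON_OK);
        app->ready = true;
        app->needs_intervention = false;
        return;
    }

    app->ready = false;
    if (reason == REASON_RETRY_EXPIRED) {
        /* The retry timeout was reached; no further attempts will be made
           unless the application acts (for example by changing the policy). */
        app->needs_intervention = true;
    }
    /* Other reasons, such as an expired SAS token on MQTT, are treated as
       transient here: the client is expected to reconnect on its own. */
}
```

Treating only the retry-expired reason as fatal keeps the application from reacting to the brief disconnections that occur during SAS token refreshes.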
Please see a description of the values according to each transport protocol supported by the Azure IoT C SDK:
Value | MQTT | AMQP | HTTP |
---|---|---|---|
IOTHUB_CLIENT_CONNECTION_OK | The Azure IoT C SDK client is connected and ready to communicate with the Azure IoT Hub. | Same | Same |
IOTHUB_CLIENT_CONNECTION_COMMUNICATION_ERROR | If a telemetry message times out receiving a PUBACK from the Azure IoT Hub, or if there is an error sending a PUBACK or PUBREC to Azure IoT Hub or if an I/O error occurs when using WebSockets. | If the AMQP transport encounters an authentication timeout, unexpected link DETACH from Azure IoT Hub, or link ATTACH timeouts. | Not applicable. |
IOTHUB_CLIENT_CONNECTION_NO_NETWORK | If an MQTT CONNECT packet fails to be sent to the Azure IoT Hub for any reason. | If the AMQP transport detects a network connection issue, which includes socket errors, failures on AMQP ATTACH to CBS link (for authentication). | Not applicable. |
IOTHUB_CLIENT_CONNECTION_BAD_CREDENTIAL | Not applicable. The MQTT transport does map some MQTT CONNECT return code values to this status code, but these MQTT CONNECT return codes are not supported by the Azure IoT Hub. See note below. | If a SAS-based authentication request fails. See Azure IoT Hub documentation on device authentication for more details. | Not applicable. |
IOTHUB_CLIENT_CONNECTION_DEVICE_DISABLED | Raised by the MQTT transport if an MQTT CONNECT to the Azure IoT Hub is rejected. See note below. | Not applicable. | Not applicable. |
IOTHUB_CLIENT_CONNECTION_RETRY_EXPIRED | The MQTT transport has made its maximum number of attempts to reconnect to the Azure IoT Hub and it will no longer try. | The AMQP transport has made its maximum number of attempts to reconnect to the Azure IoT Hub and it will no longer try. | Not applicable. Each new HTTP request sent to the Azure IoT Hub is done over a new HTTP connection. |
IOTHUB_CLIENT_CONNECTION_EXPIRED_SAS_TOKEN | The SAS token used in the current MQTT connection is expired and the client must reconnect with a new SAS token. This is an implicit dependency on MQTT v3.1.1, which is not capable of refreshing authentication information in the same connection. | Not applicable. The AMQP protocol is capable of refreshing authentication within the same connection. | Not applicable. A new SAS token is generated for each HTTP request sent to the Azure IoT Hub. |
IOTHUB_CLIENT_CONNECTION_NO_PING_RESPONSE | The MQTT transport timed out waiting for a ping response from the Azure IoT Hub. | Not applicable. | Not applicable. |
IOTHUB_CLIENT_CONNECTION_QUOTA_EXCEEDED | Not applicable. | The Azure IoT Hub rejected a telemetry message because the maximum daily quota of telemetry messages has been reached. | Not applicable. |
The Azure IoT Hub does not support all the MQTT CONNECT return code values defined in the MQTT v3.1.1 specification, always returning Not Authorized (MQTT CONNECT Return Code 5) on MQTT CONNECT failure.
Most of the options exposed by the public API of the Azure IoT Device Client C SDK are listed in the header file iothub_client_options.h.
They can be set by one of the _SetOption functions depending on the module used:
On iothub_client: IoTHubClient_SetOption
On iothub_client_ll: IoTHubClient_LL_SetOption
Here is a list of the specific options that apply to connection and messaging reliability:
Option | Value Type | Applicable To | Description |
---|---|---|---|
OPTION_MESSAGE_TIMEOUT | const tickcounter_ms_t* | Timeout for iothub client messages waiting to be sent to the IoTHub | See description above for details. The default value is zero (disabled). |
"event_send_timeout_secs" | size_t* | AMQP and AMQP over WebSockets transports | Maximum amount of time, in seconds, the AMQP protocol transport will wait for a Telemetry message to complete sending. If reached, the callback function passed to IoTHubDeviceClient_LL_SendEventAsync or IoTHubDeviceClient_SendEventAsync is invoked with result IOTHUB_CLIENT_CONFIRMATION_MESSAGE_TIMEOUT. The default value is 5 minutes. |
OPTION_SERVICE_SIDE_KEEP_ALIVE_FREQ_SECS | size_t* | AMQP and AMQP over WebSockets transports | See code comments. |
OPTION_REMOTE_IDLE_TIMEOUT_RATIO | double* | AMQP and AMQP over WebSockets transports | See code comments. |
OPTION_KEEP_ALIVE | int* | MQTT and MQTT over WebSockets protocol transports | Frequency, in seconds, at which the transport protocol sends MQTT pings to the Azure IoT Hub. The lower this number, the more responsive the device client (when using MQTT) will be to connection issues, at the cost of slightly more data traffic. The default value is 4 minutes. |
OPTION_CONNECTION_TIMEOUT | int* | MQTT and MQTT over WebSockets protocol transports | While connecting, it is the maximum number of seconds the device client (when using MQTT) will wait for the connection to complete (CONNACK). The default value is 30 seconds. |
The following controls are not currently configurable, but they are important for understanding the directives that impact connection, re-connection and messaging within the Azure IoT Device Client C SDK.
Control | Description |
---|---|
Number of cumulative failures the AMQP protocol transport will wait for to mark a connection as faulty | Currently that number is five. |
Number of cumulative send failures the MQTT protocol transport will wait for to mark a connection as faulty | Currently that number is two. |
Maximum time the AMQP protocol transport will wait for the AMQP negotiation to complete (including authentication) when a device client is connecting to the Azure IoT Hub | The default value is 60 seconds. If the whole AMQP negotiation does not complete within that time, the connection is deemed faulty and re-connection kicks in. That can be triggered by very slow connections. Relaxing this timeout hasn't shown practical improvements overall. |