005. Observability

Epic: #421
Authors: Patrick Schuler, Bastian Burger, Eugene Fedorenko
Status: Accepted

Context

The goal of observability for the LoRaWAN IoT Edge starter kit is to:

- Monitor whether the LoRaWAN Starter Kit solution meets user expectations with regard to the following factors:
  - Coverage. Data is arriving from the majority of observed IoT assets.
  - Freshness. The data coming from the assets is fresh and relevant.
  - Throughput. The data is delivered from the assets without significant delays.
  - Correctness. The ratio of errors and lost messages from the assets is small.
- Provide monitoring instruments to detect possible failures or violations in each factor.
- Provide instruments to identify and diagnose failures, so that the problem can be located quickly.

The decisions below apply to our LoRaWAN Network Server (LNS) implementation.

Decisions

We will support Azure Monitor as a first-class monitoring solution for our starter kit. A user can opt in to using Application Insights with the starter kit, in which case we will support a rich set of observability features. If the user decides not to use Application Insights, we will still support essential monitoring capabilities. This means that we will:

- Track LNS logs in Application Insights (when opted in). We will adhere to the IoT Edge recommended format for the structure of the console log output. Exporting logs to anything other than Application Insights requires a custom solution by the user and is not supported by the starter kit.
- Always expose metrics using prometheus-net.
  - Additionally, we track LNS metrics using the ASP.NET Core Application Insights SDK (when opted in).
- Track traces using the Application Insights SDK (when opted in).
- Support alerts when using Application Insights and/or Log Analytics (with Prometheus format and the metrics collector module).
- For now, we will not support complete distributed tracing in the LoRaWAN starter kit, other than what Application Insights tracing gives us out of the box. We will evaluate this with #695.

A more thorough description of each bullet point follows below.

Logs

Using ILogger as the core logging abstraction in all parts of the application gives us a single logging framework to which sinks can be added or removed as required.

Each log sink is implemented as an ILoggerProvider. We will start with three:

- Console
- IoT Hub
- TCP

The standard logger for Application Insights is added on an opt-in basis. The LNS console logger will adhere to the recommended IoT Edge log format, which also simplifies log scraping. We will not support a full log delivery solution, such as ELMS, since it would introduce too many components and too much complexity to the starter kit. This means that we will not support cloud delivery of edgeAgent and edgeHub logs other than what is documented in Retrieve IoT Edge logs - Azure IoT Edge | Microsoft Docs.

If a user of the starter kit wants to scrape logs from modules other than LNS, or use a service other than the Application Insights SDK, the user will have to implement a custom solution.
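To make the console format more concrete, here is a minimal sketch of a formatter that emits lines shaped like `<{Log Level}> {Timestamp} {Message Text}`, which is our reading of the IoT Edge recommendation (syslog severity code, then a timestamp with milliseconds and UTC offset). The starter kit itself uses ILogger and ILoggerProvider in .NET; the sketch uses Python's standard `logging` module purely as an illustration. The names `EdgeConsoleFormatter`, `build_logger`, and the level mapping are assumptions, not part of the starter kit.

```python
import io
import logging
from datetime import datetime, timezone

# Assumed mapping from Python log levels to syslog severity codes.
SYSLOG_SEVERITY = {
    logging.CRITICAL: 2,
    logging.ERROR: 3,
    logging.WARNING: 4,
    logging.INFO: 6,
    logging.DEBUG: 7,
}


class EdgeConsoleFormatter(logging.Formatter):
    """Hypothetical formatter emitting '<severity> timestamp message'."""

    def format(self, record: logging.LogRecord) -> str:
        severity = SYSLOG_SEVERITY.get(record.levelno, 6)
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        # Millisecond precision followed by a '+00:00' style UTC offset.
        millis = f"{ts.microsecond // 1000:03d}"
        stamp = ts.strftime("%Y-%m-%d %H:%M:%S.") + millis + " +00:00"
        return f"<{severity}> {stamp} {record.getMessage()}"


def build_logger(stream) -> logging.Logger:
    """One logger with a console-like sink; further sinks would be extra handlers."""
    logger = logging.getLogger("lns-sketch")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(EdgeConsoleFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.warning("join request rejected for device %s", "device-01")
print(buffer.getvalue(), end="")
```

The application code only talks to the logger; adding an IoT Hub or TCP sink would mean registering another handler (another ILoggerProvider in the .NET implementation) without touching the call sites.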

Traces

We use the built-in tracing of Azure Application Insights (on an opt-in basis). This works well for function calls and for correlation with other services, such as Key Vault. We will not include end-to-end tracing of the message flow for now, but will reevaluate this with #695.

Metrics

The core modules edgeHub and edgeAgent support emitting metrics through a Prometheus endpoint, using the strategy described in Access built-in metrics - Azure IoT Edge | Microsoft Docs. To collect these metrics and integrate them with Azure Monitor, we use the metrics collector module (preview), as suggested in Collect and transport metrics - Azure IoT Edge | Microsoft Docs, to export the metrics to a Log Analytics workspace.

We will always expose LNS custom metrics in Prometheus format using prometheus-net/prometheus-net, so that they can be consumed by any scraper that supports the Prometheus format. This gives us the following benefits:

- Unified metrics format across all modules on the Edge device.
  - The Prometheus format is an industry standard understood by various consumers.
- Decouples metrics exposure from the delivery-to-cloud approach. If at some point we decide to change how we scrape the metrics, or how and where we deliver them to the observer, we can do that without changing the modules.
- Eliminates any dependency on Azure Monitor services (Log Analytics / Application Insights) for essential monitoring.
- Potentially gives the ability to work offline, if metrics are sent by the collector module through the Edge Hub using the device-to-cloud channel.
- It is up to the customer to configure how, where, and which metrics to deliver from any module on an edge device.
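As an illustration of what a scraper would see, the following sketch renders a counter in the Prometheus text exposition format using only the Python standard library. In the starter kit this rendering is handled by prometheus-net; the metric name `lns_join_requests_total`, the label `gateway_id`, and the sample values are hypothetical and chosen only to mirror the "Gateway Id" dimension.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter family.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_text = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample data: join requests per gateway.
exposition = render_counter(
    "lns_join_requests_total",
    "Number of join requests",
    {
        (("gateway_id", "gateway-a"),): 42,
        (("gateway_id", "gateway-b"),): 7,
    },
)
print(exposition, end="")
```

Because the output is plain text in a widely supported format, the same endpoint can be scraped by the metrics collector module or by any other Prometheus-compatible consumer.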

In addition to this, we will support Application Insights metrics on an opt-in basis. When enabled, we will deliver most LNS metrics (custom and default) to Application Insights. The exception is the edgeAgent and edgeHub metrics, which can only be delivered to Log Analytics. This gives us many Application Insights features out of the box (Live Metrics, integration with alerts and workbooks), while keeping the flexibility of consuming the metrics in Prometheus format, with all the advantages that come with it. The cost is increased implementation complexity.
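The "always Prometheus, optionally Application Insights" arrangement can be pictured as a simple fan-out: every measurement goes to the always-on sink, and to the second sink only when the user has opted in. The sketch below is one possible shape of that idea, not the starter kit's actual .NET implementation; `MetricFanOut`, `build_metrics`, and the sink signature are hypothetical, and the sinks are in-memory stand-ins for real exporters.

```python
from typing import Callable, Dict, List

# Hypothetical sink signature: (metric name, value, dimensions).
MetricSink = Callable[[str, float, Dict[str, str]], None]


class MetricFanOut:
    """Forwards each measurement to every registered sink."""

    def __init__(self) -> None:
        self._sinks: List[MetricSink] = []

    def add_sink(self, sink: MetricSink) -> None:
        self._sinks.append(sink)

    def record(self, name: str, value: float, dimensions: Dict[str, str]) -> None:
        for sink in self._sinks:
            sink(name, value, dimensions)


def build_metrics(
    app_insights_enabled: bool,
    prometheus_sink: MetricSink,
    app_insights_sink: MetricSink,
) -> MetricFanOut:
    fan_out = MetricFanOut()
    fan_out.add_sink(prometheus_sink)  # always registered
    if app_insights_enabled:
        fan_out.add_sink(app_insights_sink)  # registered only on opt-in
    return fan_out


# In-memory stand-ins for the real exporters.
prometheus_seen: list = []
app_insights_seen: list = []

metrics = build_metrics(
    app_insights_enabled=False,
    prometheus_sink=lambda n, v, d: prometheus_seen.append((n, v, d)),
    app_insights_sink=lambda n, v, d: app_insights_seen.append((n, v, d)),
)
metrics.record("JoinRequests", 1, {"GatewayId": "gateway-a"})
```

With this shape, the code that records a metric does not need to know which backends are active, which is where the extra implementation complexity mentioned above is concentrated.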

Note: IoT Hub comes with curated workbooks and predefined alert queries based on the built-in Prometheus-format metrics that the metrics collector module delivers to Log Analytics.

Custom metrics/events

| Name | Description | Source | Namespace | Dimensions |
| --- | --- | --- | --- | --- |
| ReceiveWindowHits | Number of times we hit the different receive windows | LNS | LoRaWan | Gateway Id, (estimated) Receive Window |
| ReceiveWindowMisses | Number of missed downstream receive windows | LNS | LoRaWan | Gateway Id |
| DeviceCacheHit | Number of device cache hits | LNS | LoRaWan | Gateway Id |
| DeviceLoadRequests | Number of device load requests | LNS | LoRaWan | Gateway Id |
| JoinRequests | Number of join requests | LNS | LoRaWan | Gateway Id |
| StationConnectivityLost | Connection to LBS lost | LNS | LoRaWan | Gateway Id |
| ActiveStationConnections | Active connections to stations | LNS | LoRaWan | Gateway Id |
| UnhandledExceptions | Number of unhandled exceptions in LNS processing | LNS | LoRaWan | |
| D2CMessagesReceived | Number of messages received from devices | LNS | LoRaWan | Gateway Id |
| D2CMessageDeliveryLatency | Time from when we dispatched the message sent from the concentrator until we are done processing it | LNS | LoRaWan | Gateway Id |
| D2CMessageSize | Size in bytes of messages received from devices | LNS | LoRaWan | Gateway Id |
| C2DMessageTooLong | Number of C2D messages that were too long to be sent downstream | LNS | LoRaWan | Gateway Id |

Alerts

We support the following alerts when the user opts in to using Application Insights.

| Name | Description | Source | Condition |
| --- | --- | --- | --- |
| HighUpstreamMessageLatency | High device message processing time (throughput) | D2CMessageDeliveryLatency | Dynamic |
| HighErrorCount | High error count (correctness) | UnhandledExceptions | Dynamic |
| HighReceiveWindowMisses | High device message processing time (throughput) | ReceiveWindowMisses | Dynamic |
| HighDownstreamMessagesLostRatio | High device messages lost ratio (correctness, throughput) | Abandoned messages (IoT Hub metric) | Dynamic |

Alternatives considered

As a generic alternative to the Application Insights SDK, we considered the OpenTelemetry .NET SDK. This would allow us to abstract the emission of telemetry for different backend systems. However, the project (open-telemetry/opentelemetry-dotnet: The OpenTelemetry .NET Client (github.com)) is not yet mature enough to be added to the starter kit. In particular, the Prometheus exporter (alpha) and metrics support in general (experimental) would not help us improve our solution at the moment.

Last update: 2022-01-07
Created: 2021-11-10