Observability

Logging
Runtime diagnostics output
Available metrics
IoT Edge Metrics collector
OpenTelemetry
Prometheus

Logging

Logging can be configured using the --ll command line argument. It allows to set a desired log level. More fine grained logging can be configured using an appsettings.json file as per .net documentation.

It is also possible to enable more detailed logging of the OPC UA stack using the --sl option.

To get the logs the iotedge command line utility can be used. Run the following from the local command line:

iotedge logs -f <container name>

These logs contain the current state of operation and can be used to narrow down an issue.

Additionally, logs from edge modules can be fetched via direct methods describes how to fetch and upload logs from Edge modules.

Runtime diagnostics output

The OPC Publisher emits metrics through its Prometheus endpoint (/metrics). To learn more about how to create a local metrics dashboard for OPC Publisher V2.7, refer to the tutorial here.

To measure the performance of OPC Publisher, the di parameter can be used to print metrics to the log in the interval specified (in seconds). Using --pd=Events the diagnostics information can instead be emitted to a topic or to IoT Hub. The routing or topic can be configured using the --dtt command line option.

The following table describes the instruments that are collected per writer group and subsequently emitted:

Log line item name	Diagnostic info property name	Description
# OPC Publisher Version (Runtime)	n/a	The full version and runtime used by the publisher
# Time	timestamp	The timestamp of the diagnostics information
# Ingestion duration	ingestionDuration	How long the data flow inside the publisher has been executing after it was created (either from file or API)
# Opc endpoint connected?	opcEndpointConnected	Whether the pipeline is currently connected to the OPC UA server endpoint or in a reconnect attempt.
# Connection retries	connectionRetries	How many times connections to the OPC UA server broke and needed to be reconnected as it pertains to the data flow.
# Monitored Opc nodes succeeded count	monitoredOpcNodesSucceededCount	How many of the configured monitored items have been established successfully inside the data flow’s OPC UA subscription and should be producing data.
# Monitored Opc nodes failed count	monitoredOpcNodesFailedCount	How many of the configured monitored items inside the data flow failed to be created in the subscription (the logs will provide more information).
# Subscriptions count	subscriptionsCount (*)	How many subscriptions were created that contain above monitored items.
# Queued/Minimum request count	publishRequestsRatio (*)	The ratio of currently queued requests to the server as a percentage of the subscription count measured here vs. the overall number of subscriptions in the underlying session (e.g., if the session is shared).
	minPublishRequestsRatio (*)	The ratio of minimum number of publish requests that should always be queued to the server.
# Good/Bad Publish request count	goodPublishRequestsRatio (*)	The ratio of currently queued publish requests that are in progress in the server and awaiting a response.
	badPublishRequestsRatio (*)	The ratio of defunct publish requests which have not been resulting in a publish response from the server.
# Ingress value changes	ingressValueChanges	The number of value changes inside the OPC UA subscription notifications processed by the data flow.
# Ingress Events	ingressEvents	The number of events that were part of these OPC UA subscription notifications that were so far processed by the data flow.
# Received Data Change Notifications	ingressDataChanges	The number of OPC UA subscription notification messages with data value changes that have been received by publisher inside this data flow
# Received Event Notifications	ingressEventNotifications	The number of OPC UA subscription notification messages with events that have been received by publisher so far inside this data flow
# Received Keep Alive Notifications	ingressEventNotifications	The number of received OPC UA subscription notification messages that were keep alive messages
# Generated Cyclic read Notifications	ingressCyclicReads	The number of cyclic read notifications generated from sampling nodes on the client side. Each notification contains the changed value.
# Generated Heartbeats Notifications	ingressHeartbeats	The number of notifications that contain heartbeats. Each notification contains the heartbeat value.
# Notification batch buffer size	ingressBatchBlockBufferSize	The number of messages awaiting encoding and sending tot he telemetry message destination inside the data flow pipeline.
# Encoder input / output size	encodingBlockInputSize	The number of messages awaiting encoding into the output format.
	encodingBlockOutputSize	The number of messages already encoded and waiting to be sent to the telemetry message destination.
# Encoder Notif. processed/dropped	encoderNotificationsProcessed	The total number of subscription notifications processed by the encoder stage of the data flow pipeline since the pipeline started.
	encoderNotificationsDropped	The total number of subscription notifications that were dropped because they could not be encoded, e.g., due to their size being to large to fit into the message.
# Encoder Network Messages produced	encoderIoTMessagesProcessed	The total number of encoded messages produced by the encoder since the start of the pipeline.
# Encoder avg Notifications/Message	encoderAvgNotificationsMessage	The average number of subscription notifications that were pressed into a message.
# Encoder avg Message body size	encoderAvgIoTMessageBodySize	The average size of the message body produced over the course of the pipeline run.
# Encoder avg Chunk (4 Kb) usage	encoderAvgIoTChunkUsage	The average use of IoT Hub chunks (4k).
# Estimated Chunks (4 KB) per day	estimatedIoTChunksPerDay	An estimate of how many chunks are used per day by publisher which enables correct sizing of the IoT Hub to avoid data loss due to throttling.
# Egress Messages queued/dropped	outgressInputBufferCount	The aggregated number of messages waiting in the input buffer of the configured telemetry message destination sinks.
	outgressInputBufferDropped	The aggregated number of messages that were dropped in any of the configured telemetry message destination sinks.
# Egress Messages successfully sent	outgressIoTMessageCount	The aggregated number of messages that were sent by all configured telemetry message destination sinks.
	sentMessagesPerSec	Publisher throughput meaning the number of messages sent to the telemetry message destination (e.g., IoT Hub / Edge Hub) per second

(*) Not exposed through the API

Available metrics

By convention, all instrument names start with “iiot”, e.g., iiot_<component>_<metric>. Metrics are collected through the .net Observability infrastructure. For backwards compatibility and to support IoT Edge metrics collector, metrics from OPC Publisher are exposed in Prometheus format on path /metrics on the default HTTP server port (see -p command line argument. When binding the HTTPS port to 9072 on the host machine, the URL becomes https://localhost:9072/metrics.

One can combine information from multiple metrics to understand and paint a bigger picture of the state of the publisher. The following table describes the individual default metrics which are in Prometheus format.

Instrument Name	Dimensions	Description
iiot_edge_module_start	Edge modules start timestamp with version. Used to calculate uptime, device restarts.	gauge
iiot_edge_publisher_monitored_items	Total monitored items per subscription	gauge
iiot_edge_device_reconnected	OPC UA server reconnection count	gauge
iiot_edge_device_disconnected	OPC UA server disconnection count	gauge
iiot_edge_reconnected	Edge device reconnection count	gauge
iiot_edge_disconnected	Edge device disconnection count	gauge
iiot_edge_publisher_messages_duration	Time taken to send messages from publisher. Used to calculate P50, P90, P95, P99, P99.9	histogram
iiot_edge_publisher_value_changes	OPC UA server ValuesChanges delivered for processing	counter
iiot_edge_publisher_value_changes_per_second	OPC UA server ValuesChanges delivered for processing per second	gauge
iiot_edge_publisher_heartbeats	OPC Publisher heartbeats delivered for processing and included in the value changes.	counter
iiot_edge_publisher_cyclicreads	OPC Publisher cyclic reads delivered for processing and included in the value changes.	counter
iiot_edge_publisher_data_changes	OPC UA server DataChanges delivered for processing	counter
iiot_edge_publisher_data_changes_per_second	OPC UA server DataChanges delivered for processing	gauge
iiot_edge_publisher_iothub_queue_size	Queued messages to IoTHub	gauge
iiot_edge_publisher_iothub_queue_dropped_count	IoTHub dropped messages count	gauge
iiot_edge_publisher_messages	Messages sent to IoTHub	counter
iiot_edge_publisher_messages_per_second	Messages sent to IoTHub per second	gauge
iiot_edge_publisher_connection_retries	Connection retries to OPC UA server	gauge
iiot_edge_publisher_encoded_notifications	Encoded OPC UA server notifications count	counter
iiot_edge_publisher_dropped_notifications	Dropped OPC UA server notifications count	counter
iiot_edge_publisher_processed_messages	Processed IoT messages count	counter
iiot_edge_publisher_notifications_per_message_average	OPC UA sever notifications per IoT message average	gauge
iiot_edge_publisher_encoded_message_size_average	Encoded IoT message body size average	gauge
iiot_edge_publisher_chunk_size_average	IoT Hub chunk size average	gauge
iiot_edge_publisher_estimated_message_chunks_per_day	Estimated IoT Hub chunks charged per day	gauge
iiot_edge_publisher_is_connection_ok	Is the endpoint connection ok?	gauge
iiot_edge_publisher_good_nodes	How many nodes are receiving data for this endpoint?	gauge
iiot_edge_publisher_bad_nodes	How many nodes are misconfigured for this endpoint?	gauge

You can use the IoT Edge Metrics Collector module to collect the metrics.

Edge modules would be instrumented with Prometheus metrics. Each module would expose the metrics on a pre-defined port. The metricscollector module would use the configuration settings to scrape metrics in a defined interval. It would then push the scraped metrics to Log Analytics Workspace. Using Azure Data Explorer (ADX), we can then query our metrics from workspace. We are creating and deploying an Azure Workbook which would provide insights into the edge modules. This would act as our primary monitoring system for edge modules.

IoT Edge Metrics collector

The metrics collector module pulls metrics exposed by OPC Publisher and pushes them to the Log Analytics workspace.

metrics

The default configuration to enable scraping metrics is:

{
    "schemaVersion": "1.0",
    "scrapeFrequencySecs": 120,
    "metricsFormat": "Json",
    "syncTarget": "AzureLogAnalytics",
    "endpoints": {
        "opcpublisher": "http://opcpublisher/metrics"
    }
}

This assumes the http server in OPC Publisher has not been disabled. Disabling the internal HTTP server also disables the Prometheus endpoint.

OpenTelemetry

It is possible to export all metrics, traces and logs to an OpenTelemetry (OTEL) collector. To specify the OTLP GRPC endpoint to export to, use the --oc command line argument.

An example setup using the OpenTelemetry (OTEL) Contrib Collector and Grafana stack can be found in the /deploy/docker folder.

Prometheus

OPC Publisher module also exposes all metrics to a Prometheus scraper (at the standard /metrics path). For troubleshooting and debugging, a local Grafana dashboard can be configured on Azure IoT Edge.

It is recommended to use the OpenTelemetry collector and expose metrics from the collector to Prometheus.

This section shows you how to setup and view OPC Publisher and EdgeHub metrics as bird’s eye view and how to quickly drill down into the issues in case of failures.

In this tutorial two pre-configured docker images (for Prometheus and Grafana) must be created and deployed as Azure IoT Edge Modules. The Prometheus Module scrapes the metrics from OPC Publisher and EdgeHub, while Grafana is used for visualization of metrics.

Setup

Create and configure Azure Container Registry to have an admin account enabled. Registry name and password are needed for pushing the docker images of Prometheus and Grafana.
Create and push images to this ACR:
- Download the zip (metricsdashboard.zip) located in this folder and extract it.
- Navigate to prometheus folder as shown:
- Edit prometheus.yml if needed. It defaults to scraping EdgeHub metrics at 9600 and OPC Publisher metrics at 9702. If only OPC Publisher metrics are needed, then remove the scraping config of EdgeHub.
- Run buildimage.bat and enter the registryname, password and tagname when prompted. It will push the edgeprometheus image to container registry.
- Navigate back to the grafana folder and run buildimage.bat located in this folder. Enter the same information to push the edgegrafana image to ACR.
Now go to the portal and verify that the images are present, as shown in the image below.
IoT Edge should now be configured to expose metrics. Navigate to the IoT Edge device to be used and perform the following steps:
- Select Set Modules to update the configuration
- Enter the registry information in the Container Registry Credentials so that it can authenticate with the Azure Container Registry.
- Optional: Enable metrics of edgeHub (Note: As of release 1.0.10, metrics are automatically exposed by default on http port 9600 of the EdgeHub)
  - Select the Runtime Settings option and for the EdgeHub and Agent, make sure that the stable version is selected with version 1.0.9.4+
  - Adjust the EdgeHub Create Options to include the Exposed Ports directive as shown above.
  - Add the following environment variables to the EdgeHub. (Not needed for version 1.0.10+)
    
    ExperimentalFeatures__Enabled : true
    
    ExperimentalFeaturesEnable__Metrics : true
  - Save the dialog to complete the runtime adjustments.
- Expose port on publisher module:
  - Navigate to the publisher and expose port 9702 to allow the metrics to be scraped, by adding the exposed ports option in the Container Create Options in the Azure portal.
  - Select Update to complete the changes made to the OPC publisher module.
Now add new modules based on the Prometheus and Grafana images created previously.
- Prometheus:
  - Add a new “IoT Edge Module” from the “+ Add” drop down list.
  - Add a module named prometheus with the image uploaded previously to the Azure Container Registry.
  - In the Container Create Options expose and bind the Prometheus port 9090.
    { "Hostname": "prometheus", "ExposedPorts": { "9090/tcp": {} }, "HostConfig": { "PortBindings": { "9090/tcp": [ { "HostPort": 9090 } ] } } }
  - Click Update to complete the changes to the Prometheus module.
  - Select Review and Create and complete the deployment by choosing Create.
- Grafana:
  - Add a new “IoT Edge Module” from the “+ Add” drop down list.
  - Add a module named grafana with the image uploaded previously to the Azure Container Registry.
  - Update the environment variables to control access to the Grafana dashboard.
    - GF_SECURITY_ADMIN_USER - the user name you want to use
    - GF_SECURITY_ADMIN_PASSWORD - a password of your choice
    - GF_SERVER_ROOT_URL - http://localhost:3000
  - In the Container Create Options expose and bind the Grafana port 3000.
    { "Hostname": "grafana", "ExposedPorts": { "3000/tcp": {} }, "HostConfig": { "PortBindings": { "3000/tcp": [ { "HostPort": 3000 } ] } } }
  - Click Update to complete the changes to the Grafana module.
  - Select Review and Create and complete the deployment by choosing Create.
- Verify all the modules are running successfully:

View Prometheus dashboard

Navigate to the Prometheus dashboard through a web browser on the IoT Edge’s host on port 9090.
- http://{edge host IP or name}:9090/graph
- Note: When using a VM, make sure to add an inbound rule for port 9090
Check the target state by navigating to /targets
Metrics can be viewed as shown:

View Grafana dashboard

Navigate to the Grafana dashboard through a web browser against the edge’s host on port 3000.
- http://{edge host IP or name}:3000
- When prompted for a user name and password enter the values entered in the environment variables
- Note: When using a VM, make sure to add an inbound rule for port 3000
Prometheus has already been configured as a data source and can now be directly accessed. Prometheus is scraping EdgeHub and OPC Publisher metrics.
Select the dashboards option to view the available dashboards and select “Publisher” to view the pre-configured dashboard as shown below.
One could easily add/remove more panels to this dashboard as per the needs.