Diagnosing problems
Common mistakes
ASO controller pod in Status CreateContainerConfigError
$ kubectl get pods -n azureserviceoperator-system
NAME READY STATUS RESTARTS AGE
azureserviceoperator-controller-manager-69cbccd645-4s5wz 1/2 CreateContainerConfigError 0 7m49s
Very likely that you forgot to create the aso-controller-settings
secret. This secret must be in
the same namespace as the ASO controller pod.
You can confirm this with kubectl describe pod -n azureserviceoperator-system --selector control-plane=controller-manager.
Look for the 'Error: secret "aso-controller-settings" not found' event in the describe
output:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
...
Warning Failed 36s (x7 over 99s) kubelet Error: secret "aso-controller-settings" not found
...
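If the secret is missing, creating it resolves the error; the kubelet will retry starting the container once the secret exists. Here's a minimal sketch, assuming service principal authentication - every value below is a placeholder to replace with your own subscription, tenant, and service principal details:
apiVersion: v1
kind: Secret
metadata:
  name: aso-controller-settings
  namespace: azureserviceoperator-system
stringData:
  # Placeholder values - substitute your own credentials
  AZURE_SUBSCRIPTION_ID: "00000000-0000-0000-0000-000000000000"
  AZURE_TENANT_ID: "00000000-0000-0000-0000-000000000000"
  AZURE_CLIENT_ID: "00000000-0000-0000-0000-000000000000"
  AZURE_CLIENT_SECRET: "placeholder-client-secret"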
Helm installation via Argo missing ClusterRole and other resources
See reference issue #4184.
When using Argo to install the ASO Helm chart, make sure to specify the raw
GitHub URL to the chart, rather than
the URL to the ASO git repo. The chart .tgz
file contains everything you need to install ASO; not everything is in the
v2/charts/azure-service-operator
folder. In particular, some files are autogenerated as part of the build and aren't checked in, so attempting to install
from the checked-in templates will miss these autogenerated files.
Correct:
apiVersion: argoproj.io/v1alpha1
kind: Application
spec:
  source:
    repoURL: https://raw.githubusercontent.com/Azure/azure-service-operator/main/v2/charts
    targetRevision: v2.8.0
    chart: azure-service-operator
Incorrect:
apiVersion: argoproj.io/v1alpha1
kind: Application
spec:
  source:
    repoURL: https://github.com/Azure/azure-service-operator.git
    targetRevision: v2.8.0
    path: v2/charts/azure-service-operator
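Outside of Argo, the same raw URL is the one you would add as a Helm repository. As a quick sanity check that the chart resolves from that URL (the aso2 repo alias is just an example):
helm repo add aso2 https://raw.githubusercontent.com/Azure/azure-service-operator/main/v2/charts
helm repo update
helm search repo aso2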
Problems with resources
Resource with no Ready condition set
When resources are first created, a Ready condition should appear quickly, indicating that the resource is being reconciled:
$ kubectl get resourcegroups.resources.azure.com
NAME READY SEVERITY REASON MESSAGE
aso-sample-rg False Info Reconciling The resource is in the process of being reconciled by the operator
If this isn't happening, check the controller logs.
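To see the full detail behind the READY column for a single resource, you can inspect .status.conditions directly; the resource kind and name below match the example above:
kubectl get resourcegroups.resources.azure.com aso-sample-rg -o jsonpath='{.status.conditions}'
or, for a more readable view:
kubectl describe resourcegroups.resources.azure.com aso-sample-rg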
Resource stuck deleting
This presents slightly differently for different resources. For example, you might see something like this:
deleting resource "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/dev-rg/providers/Microsoft.KeyVault/vaults/kvname/providers/Microsoft.Authorization/roleAssignments/kv-role-assignement3": DELETE https://management.azure.com/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/dev-rg/providers/Microsoft.KeyVault/vaults/kvname/providers/Microsoft.Authorization/roleAssignments/kv-role-assignement3
--------------------------------------------------------------------------------
RESPONSE 400: 400 Bad Request
ERROR CODE: InvalidRoleAssignmentId
--------------------------------------------------------------------------------
{
  "error": {
    "code": "InvalidRoleAssignmentId",
    "message": "The role assignment ID 'kv-role-assignement3' is not valid. The role assignment ID must be a GUID."
  }
}
--------------------------------------------------------------------------------
This can happen when the resource was created with an invalid name: when ASO tries to delete it, Azure rejects that same invalid name and the delete fails.
Usually ASO prevents this situation by rejecting the original apply that attempts to create the resource, but from time to time that protection may be imperfect.
If you see this problem, the resource wasn’t ever created successfully in Azure and so it is safe to instruct ASO to
skip deletion of the Azure resource. This can be done by adding the serviceoperator.azure.com/reconcile-policy: skip
annotation to the resource in your cluster.
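A minimal sketch of applying that annotation with kubectl, using the stuck role assignment from the example above (substitute the kind and name of whatever resource is stuck in your cluster):
# Tell ASO not to reconcile (or delete) the Azure resource
kubectl annotate roleassignments.authorization.azure.com kv-role-assignement3 serviceoperator.azure.com/reconcile-policy=skip
# Then delete the Kubernetes resource; with the skip policy ASO should remove its finalizer without calling Azure
kubectl delete roleassignments.authorization.azure.com kv-role-assignement3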
Resource reports webhook error when applied
The error may look like this:
"Error from server (InternalError): error when creating "/tmp/asd": Internal error occurred: failed calling webhook"
This may be caused by ASO pod restarts, which you can check via kubectl get pods -n azureserviceoperator-system. If
you're seeing the ASO pod restart periodically, check its logs to see if something is causing it to exit. A common cause
of this is installing too many CRDs on a free tier AKS cluster, overloading the API Server. See
CRD management for more details.
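If the pod has restarted, the logs from the previous container instance usually show why it exited. A quick way to fetch them (use the pod name reported by kubectl get pods; the name below is from the example earlier on this page):
kubectl get pods -n azureserviceoperator-system
kubectl logs -n azureserviceoperator-system azureserviceoperator-controller-manager-69cbccd645-4s5wz --container manager --previous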
Getting ASO controller pod logs
The last stop when investigating most issues is to look at the ASO pod logs. We expect that most resource issues can be resolved using the resource's .status.conditions without resorting to pod logs. Setup issues, on the other hand, may require digging into the ASO controller pod logs.
Assuming that ASO is installed into the default azureserviceoperator-system namespace, you can look at the logs for the controller
with the following command:
kubectl logs -n azureserviceoperator-system --selector control-plane=controller-manager --container manager
For example, here’s the log from an ASO controller that was launched with some required CRDs missing:
E0302 21:54:54.260693 1 deleg.go:144] setup "msg"="problem running manager" "error"="failed to wait for registry caches to sync: no matches for kind \"Registry\" in version \"containerregistry.azure.com/v1alpha1api20210901storage\""
Log levels
To configure the log level for ASO, edit the azureserviceoperator-controller-manager deployment and
set the --v=2 parameter to a different log level.
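A sketch of what that edit looks like in practice (the deployment and namespace names are the defaults used elsewhere on this page):
kubectl edit deployment -n azureserviceoperator-system azureserviceoperator-controller-manager
# then, in the manager container's args, change for example:
#   - --v=2
# to:
#   - --v=4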
See levels.go for a list of levels that we log at. Other components used by ASO (such as controller-runtime) may log at different levels, so there may be value in raising the log level above the highest level we use. Be aware that this might produce a lot of logs.
Logging aggregator
You may want to consider using a cluster-wide logging aggregator such as fluentd (there are many other options). Without one, it may be difficult to diagnose past failures.
Fetching controller runtime and ASO metrics
Follow the metrics documentation for more information on how to fetch, configure, and understand ASO and controller-runtime metrics.
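As a rough sketch only - whether metrics are enabled and which port they are served on depend on how ASO was installed, so treat the port below as an assumption and confirm it against the metrics documentation - you can port-forward to the controller and scrape the endpoint manually:
# Port 8080 is an assumption; check the metrics documentation for your install's configuration
kubectl port-forward -n azureserviceoperator-system deploy/azureserviceoperator-controller-manager 8080:8080
curl -s http://localhost:8080/metrics | head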