# Troubleshooting
This page covers common issues, debugging tools, and how to inspect logs in GPT-RAG.
## Showing Response Time Statistics in the Chat UI
The GPT-RAG UI includes a built-in option to display the response time after each agent answer. To enable it, set the `SHOW_STATISTICS` application setting to `true` in your Container App (or App Configuration). Once enabled, each response in the chat shows timing information, helping you identify slow responses and compare performance across queries or configurations.
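Server-side timings are also available in Application Insights. Assuming the components emit standard request telemetry to the `requests` table, a query along these lines summarizes average and 95th-percentile duration per operation (a sketch; adjust the table and column names to match your telemetry):

```kusto
requests
| where timestamp > ago(24h)
| summarize avg(duration), percentile(duration, 95) by name
| order by avg_duration desc
```

Comparing these server-side numbers with the timings shown in the chat UI helps separate backend latency from network or rendering overhead.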
## Enabling Debug Logging
To increase log verbosity for any GPT-RAG component (orchestrator, ingestion, or UI), set the `LOG_LEVEL` environment variable to `DEBUG` in the corresponding Container App. For example, in the Azure Portal go to your Container App → Environment variables and set `LOG_LEVEL` = `DEBUG`. After the container restarts, the application emits detailed logs, including internal function calls, SDK diagnostics, and step-by-step execution traces. Remember to revert to `INFO` or `WARNING` after troubleshooting to avoid excessive log volume and cost.
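To confirm that debug-level entries are actually flowing, you can query Application Insights (covered in the next section). Debug logs typically arrive with `severityLevel` 0 (Verbose), so a quick check is:

```kusto
traces
| where timestamp > ago(15m)
| where severityLevel == 0
| project timestamp, message, cloud_RoleName
| take 20
```

If this returns no rows a few minutes after restarting, verify that the variable was applied to the running revision of the Container App.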
## Viewing Logs in Application Insights
All GPT-RAG components send telemetry to Application Insights. Open your Application Insights resource in the Azure Portal and go to Logs to run KQL queries.
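Component names in the queries below come from the `cloud_RoleName` column. If you are unsure which role names your deployment uses, list them first:

```kusto
traces
| where timestamp > ago(1h)
| summarize count() by cloud_RoleName
```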
To find recent errors across all components:
```kusto
traces
| where timestamp > ago(1h)
| where severityLevel >= 3
| project timestamp, message, cloud_RoleName, severityLevel
| order by timestamp desc
| take 50
```
To see errors specifically in the orchestrator:
```kusto
traces
| where timestamp > ago(24h)
| where cloud_RoleName contains "orchestrator"
| where severityLevel >= 3
| project timestamp, message, operation_Id
| order by timestamp desc
```
To trace a single request end-to-end using its operation ID (you can get this from a previous query or from the UI response headers):
```kusto
traces
| where operation_Id == "YOUR_OPERATION_ID"
| order by timestamp asc
| project timestamp, message, severityLevel, cloud_RoleName
```
To check for rate-limit (429) or throttling issues in ingestion:
```kusto
traces
| where timestamp > ago(24h)
| where cloud_RoleName contains "ingest"
| where message contains "429" or message contains "throttl" or message contains "rate limit"
| project timestamp, message
| order by timestamp desc
```
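To gauge how frequently throttling occurs, and whether it clusters around particular ingestion runs, the same filter can be bucketed by hour:

```kusto
traces
| where timestamp > ago(24h)
| where cloud_RoleName contains "ingest"
| where message contains "429" or message contains "throttl" or message contains "rate limit"
| summarize throttleEvents = count() by bin(timestamp, 1h)
| order by timestamp asc
```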
To view exceptions with stack traces:
```kusto
exceptions
| where timestamp > ago(24h)
| project timestamp, problemId, outerMessage, details, cloud_RoleName
| order by timestamp desc
| take 20
```
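To triage recurring exceptions rather than scrolling through individual occurrences, group them by `problemId`:

```kusto
exceptions
| where timestamp > ago(24h)
| summarize occurrences = count(), lastSeen = max(timestamp) by problemId, cloud_RoleName
| order by occurrences desc
```

This surfaces the noisiest exception types first, which is usually a better starting point than the most recent ones.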
## Known Issues and Fixes
Below is a list of commonly reported issues that have been resolved. If you encounter one of these, make sure you are running the version that includes the fix.
- **OOM container restarts during parallel ingestion** — Ingestion containers could run out of memory when processing multiple large files concurrently. Fixed by adding memory guards, temp-file downloads for large PDFs, and lowering default concurrency. See #438.
- **Re-indexing caused by embedding retry issues** — Transient embedding failures could cause documents to be unnecessarily re-indexed. Fixed in ingestion v2.2.5. See #437.
- **All documents re-indexed when `permissionFilterOption` is enabled** — Enabling document-level security caused a full re-index instead of incremental updates. Fixed in ingestion v2.2.5. See #436.