Overview
The GPT-RAG Data Ingestion service automates the processing of diverse document types, such as PDFs, images, spreadsheets, transcripts, and SharePoint files, preparing them for indexing in Azure AI Search. It uses intelligent chunking strategies tailored to each format, generates text and image embeddings, and enables rich, multimodal retrieval experiences for agent-based RAG applications.
Key Features
- Multi-Format Processing: Handles PDFs, images, spreadsheets, transcripts, and SharePoint content
- Intelligent Chunking: Format-specific chunking strategies for optimal retrieval
- Multimodal Embeddings: Generates both text and image embeddings
- Automated Workflows: Scans sources, processes content, and indexes documents automatically
- Scheduled Execution: CRON-based scheduler for continuous data ingestion
- Multiple Data Sources: Supports Blob Storage, SharePoint, and NL2SQL metadata
Data sources
How to deploy the data ingestion service
Prerequisites
Provision the infrastructure first by following the Deployment Guide. This ensures all required Azure resources (e.g., Container App, Storage, AI Search) are in place before deploying the data ingestion service.
Required Tools:
- Azure CLI
- Azure Developer CLI (optional, if using azd)
- Git
- Python 3.12
- Docker CLI
- VS Code (recommended)
Required Permissions (for customization):
| Resource | Role | Description |
|---|---|---|
| App Configuration Store | App Configuration Data Owner | Full control over configuration settings |
| Container Registry | AcrPush | Push and pull container images |
| AI Search Service | Search Index Data Contributor | Read and write index data |
| Storage Account | Storage Blob Data Contributor | Read and write blob data |
| Cosmos DB | Cosmos DB Built-in Data Contributor | Read and write documents in Cosmos DB |
Required Permissions (for deployment):
| Resource | Role | Description |
|---|---|---|
| App Configuration | App Configuration Data Reader | Read config |
| Container Registry | AcrPush | Push images |
| Container App | Azure Container Apps Contributor | Manage Container Apps |
Deployment steps
Make sure you're logged in to Azure before anything else:
az login
Clone this repository.
If you used azd provision
Just run:
azd env refresh
azd deploy
Make sure you use the same subscription, resource group, environment name, and location from
azd provision.
If you did not use azd provision
You need to set the App Configuration endpoint and run the deploy script.
Bash (Linux/macOS):
export APP_CONFIG_ENDPOINT="https://<your-app-config-name>.azconfig.io"
./scripts/deploy.sh
PowerShell (Windows):
$env:APP_CONFIG_ENDPOINT = "https://<your-app-config-name>.azconfig.io"
.\scripts\deploy.ps1
Observability
Monitor ingestion job execution and performance using Application Insights. The following query retrieves detailed metrics for completed ingestion runs, including indexing and purging operations.
Application Insights Query
Navigate to your Application Insights resource in the Azure Portal, go to Logs, and run the following query:
let Logs = union isfuzzy=true traces, AppTraces;
Logs
| where message contains "RUN-COMPLETE"
| extend payload = parse_json(extract('\\{.*', 0, message))
| where tostring(payload.event) == "RUN-COMPLETE"
| extend indexerType = extract('\\[([^\\]]+)\\]', 1, message)
| project timestamp,
indexerType,
runId = tostring(payload.runId),
status = tostring(payload.status),
collectionsSeen = toint(payload.collectionsSeen),
// Indexer columns (work on items)
itemsDiscovered = toint(payload.itemsDiscovered),
itemsIndexed = toint(payload.itemsIndexed),
itemsFailed = toint(payload.itemsFailed),
// Purger columns (work on chunks)
chunksChecked = toint(payload.chunksChecked),
chunksDeleted = toint(payload.chunksDeleted),
chunksFailedDelete = toint(payload.chunksFailedDelete),
// Common
durationSeconds = todouble(payload.durationSeconds)
| order by timestamp desc
Query Fields
This query returns the following metrics for each ingestion run:
| Column | Description |
|---|---|
timestamp |
When the job completed |
indexerType |
Type of indexer (e.g., Blob, SharePoint, NL2SQL) |
runId |
Unique identifier for the run |
status |
Job completion status |
collectionsSeen |
Number of collections processed |
itemsDiscovered |
Total items found during scan |
itemsIndexed |
Items successfully indexed |
itemsFailed |
Items that failed to index |
chunksChecked |
Chunks verified during purge |
chunksScanned |
Total chunks scanned |
chunksDeleted |
Chunks removed from index |
chunksFailedDelete |
Chunks that failed deletion |
searchPages |
Number of search result pages processed |
durationSeconds |
Total execution time in seconds |