SharePoint Data Source
The SharePoint connector keeps Azure AI Search synchronized with both structured list data and rich documents stored in SharePoint Online. It is designed for production-scale ingestion jobs where resiliency, incremental freshness, and search-ready chunking matter.
How it Works
The SharePoint connector ingests both generic lists (structured metadata) and document libraries (files like PDFs, Word, PowerPoint) into Azure AI Search. It uses smart freshness detection by comparing SharePoint's last modified timestamp with the indexed version, skipping unchanged items to save processing time. The connector processes all collections in parallel but controls worker concurrency and OpenAI API calls to avoid rate limits.
Generic lists are indexed by reading item fields from the Microsoft Graph API. You can control which fields get embedded using the optional includeFields configuration. Lookup columns are automatically resolved and cached per list, except for hidden system lists like AppPrincipals, UserInfo, or taxonomy stores. List item attachments are currently not downloaded.
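As an illustration of how an includeFields allow-list could shape the text that gets embedded, here is a minimal sketch. The function name, the "name: value" line format, and the rule that a missing allow-list embeds every field are assumptions of this example, not the connector's confirmed behavior.

```python
from typing import Any, Iterable, Mapping, Optional


def build_embedding_text(
    fields: Mapping[str, Any],
    include_fields: Optional[Iterable[str]] = None,
) -> str:
    """Flatten list item fields into 'name: value' lines for embedding."""
    # Assumption: with no allow-list, every field is embedded.
    names = list(include_fields) if include_fields is not None else list(fields)
    lines = []
    for name in names:
        value = fields.get(name)
        if value is None or value == "":
            continue  # skip missing or empty fields
        lines.append(f"{name}: {value}")
    return "\n".join(lines)


# Hypothetical list item: only Title and Owner are embedded.
item_fields = {"Title": "Q3 plan", "Owner": "Ana", "InternalNote": "draft"}
text = build_embedding_text(item_fields, include_fields=["Title", "Owner"])
```

Keeping the allow-list order (rather than the field order returned by Graph) makes the embedded text stable across runs, which helps when comparing results.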
For document libraries, the connector downloads files (default: pdf, docx, pptx) and chunks them using Azure Document Intelligence. Each chunk gets a zero-based ID (c00000, c00001, etc.), which enables accurate freshness checks: if SharePoint's lastModifiedDateTime hasn't changed since the last index run, the file is skipped without reprocessing.
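The chunk ID scheme can be sketched as a zero-padded, five-digit counter. The helper name is hypothetical, and the width of five digits is inferred from the c00000 and c00001 examples.

```python
def chunk_id(index: int) -> str:
    """Return a zero-based chunk ID such as c00000 or c00001."""
    if index < 0:
        raise ValueError("chunk index must be non-negative")
    return f"c{index:05d}"


# IDs for the first three chunks of a file.
first_ids = [chunk_id(i) for i in range(3)]
```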
Permissions are captured by calling get_item_permission_object_ids, which uses the Graph beta /permissions endpoint to collect the explicit Entra user/group IDs for each item. Only GUID-backed identities (users, groups, app registrations, devices) are stored.
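The GUID filter described above could look roughly like the sketch below. It assumes each permission entry follows the Microsoft Graph permission resource shape (grantedToV2 and grantedToIdentitiesV2 identity sets); the function name and the lowercase normalization are hypothetical and do not describe the internals of get_item_permission_object_ids.

```python
import re
from typing import Any, Iterable, Mapping, Set

# Identity kinds that are expected to carry an Entra object ID.
IDENTITY_KINDS = ("user", "group", "application", "device")
GUID_RE = re.compile(r"^[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}$")


def guid_identity_ids(permissions: Iterable[Mapping[str, Any]]) -> Set[str]:
    """Collect GUID-shaped identity IDs from Graph-style permission entries."""
    ids: Set[str] = set()
    for perm in permissions:
        identity_sets = []
        if perm.get("grantedToV2"):
            identity_sets.append(perm["grantedToV2"])
        identity_sets.extend(perm.get("grantedToIdentitiesV2") or [])
        for identity_set in identity_sets:
            for kind in IDENTITY_KINDS:
                candidate = (identity_set.get(kind) or {}).get("id")
                if isinstance(candidate, str) and GUID_RE.match(candidate):
                    ids.add(candidate.lower())  # normalize case (assumption)
    return ids


# Hypothetical payload: the SharePoint site group has a non-GUID ID and is dropped.
sample = [
    {"grantedToV2": {"user": {"id": "11111111-2222-3333-4444-555555555555"}}},
    {"grantedToV2": {"siteGroup": {"id": "5"}}},
    {"grantedToIdentitiesV2": [{"group": {"id": "AAAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE"}}]},
]
ids = guid_identity_ids(sample)
```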
For detailed setup instructions, including app registration, permissions, and data source configuration, see the SharePoint Connector Setup Guide.
Ingestion Flow
┌────────────────────────────────────────────────────────────────────────────────┐
│ SHAREPOINT INGESTION FLOW │
│ │
│ ┌────────────────────┐ ┌─────────────────────────────────────────┐ │
│ │ Microsoft Graph │ │ Cosmos DB │ │
│ │ API Client │ │ • Site Configurations │ │
│ │ • Site Discovery │<──────────│ • List/Library Specs │ │
│ │ • List/Library │ │ • Field Mappings │ │
│ │ • Permissions │ │ • Category Metadata │ │
│ └─────────┬──────────┘ └─────────────────────────────────────────┘ │
│ │ │
│ │ Pull Items + Metadata │
│ v │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ SharePoint Collections │ │
│ │ │ │
│ │ ┌──────────────────┐ ┌──────────────────────────────┐ │ │
│ │ │ Generic Lists │ │ Document Libraries │ │ │
│ │ │ • Custom Fields │ │ • Office Files (docx/pptx) │ │ │
│ │ │ • Lookup Fields │ │ • PDFs │ │ │
│ │ │ • List Items │ │ • Binary Content │ │ │
│ │ └─────────┬────────┘ └───────────┬──────────────────┘ │ │
│ │ │ │ │ │
│ └────────────┼───────────────────────────────────┼────────────────────────┘ │
│ │ │ │
└───────────────┼───────────────────────────────────┼────────────────────────────┘
│ │
v v
┌────────────────────────────────────────────────────────────────────────────────┐
│ PROCESSING PIPELINE │
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ sharepoint_indexer.py │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Freshness │ │ Security │ │ Lookup │ │ Content │ │ │
│ │ │ Check │─>│ Permissions │─>│ Field │─>│ Extraction │ │ │
│ │ │ (Last Mod) │ │ Resolution │ │ Resolution │ │ + Chunking │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ └──────┬───────┘ │ │
│ │ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │ │
│ │ │ Azure │ │ Document │<──────────────────────────┘ │ │
│ │ │ OpenAI │<─│ Chunker │ (For attachments) │ │
│ │ │ Embeddings │ │ (PDFs/Docs) │ │ │
│ │ └──────┬───────┘ └──────────────┘ │ │
│ │ │ │ │
│ └─────────┼────────────────────────────────────────────────────────────────┘ │
│ │ │
└────────────┼───────────────────────────────────────────────────────────────────┘
│
v
┌────────────────────────────────────────────────────────────────────────────────┐
│ OUTPUT & STORAGE │
│ │
│ ┌──────────────────────┐ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ Azure AI Search │ │ Azure Blob Storage │ │ Telemetry │ │
│ │ • Indexed Chunks │ │ • Run Summaries │ │ • App Insights │ │
│ │ • Vector Embeddings │ │ • Item Logs │ │ • Structured Logs │ │
│ │ • Security IDs │ │ • Processing State │ │ • Performance │ │
│ │ • Metadata │ │ │ │ • Error Tracking │ │
│ │ source=sharepoint │ │ │ │ │ │
│ └──────────────────────┘ └──────────────────────┘ └──────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────────────┘
Processing Pipeline
The indexer uses three tiers of parallelism to balance speed and service limits:
1. Collection Discovery (All Lists in Parallel)
┌───────────────────────────────────────────────────────────────┐
│ Cosmos datasources (type: sharepoint_site) │
└───────────┬────────────────────────┬───────────────────┬──────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ List A │ │ List B │ ... │ List N │ ← All lists start simultaneously
└─────┬────┘ └─────┬────┘ └─────┬────┘
│ │ │
└────────────────────────┴────────────────────┘
│
▼ (fetch items via Graph API)
2. Item Processing (Controlled: ≤ 4 Workers)
┌──────────────────────────────────────────────────────────────────┐
│ Global Worker Pool (INDEXER_MAX_CONCURRENCY = 4) │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Worker 1 │ │Worker 2 │ │Worker 3 │ │ ...4 │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │
└───────┼───────────┼───────────┼───────────┼──────────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────┐
│ Per-Item: freshness → process → upload │
│ • Body (generic lists): list fields → text → embedding │
│ • Files (document libraries): download → chunk → embed │
└─────────────────────────────────────────────────────────┘
3. Embedding Generation (Throttled: ≤ 2 Concurrent)
┌────────────────────────────────────────────────────────┐
│ AOAI Embedding Gate (AOAI_MAX_CONCURRENCY = 2) │
│ ┌──────────┐ ┌──────────┐ │
│ │ Slot 1 │ │ Slot 2 │ ← Only 2 embeddings run │
│ └──────────┘ └──────────┘ at the same time │
└────────────────────────────────────────────────────────┘
▲ ▲
│ │
Workers queue here when embedding is needed
(workers skip this if freshness = unchanged)
Parallelism At a Glance
| Layer | Control | Default | Notes |
|---|---|---|---|
| Collection enumeration | asyncio.gather | Unlimited | Each list runs independently. |
| Item workers | Semaphore | INDEXER_MAX_CONCURRENCY = 4 | Covers body + file work. |
| Embeddings | Semaphore | AOAI_MAX_CONCURRENCY = 2 | Applies to both bodies and files. |
| Item timeout | asyncio.wait_for | 600 s | Cancels sluggish items. |
| Collection timeout | asyncio.wait_for | 7200 s | Cancels stuck lists. |
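The three tiers can be sketched with asyncio primitives as follows. This is a simplified illustration under stated assumptions, not the code in sharepoint_indexer.py: the function names are hypothetical, the embedding call is replaced by a placeholder, and the item timeout is assumed to start once a worker slot has been acquired.

```python
import asyncio
from typing import List

INDEXER_MAX_CONCURRENCY = 4
AOAI_MAX_CONCURRENCY = 2
ITEM_TIMEOUT_SECONDS = 600
LIST_TIMEOUT_SECONDS = 7200


async def embed(text: str, aoai_gate: asyncio.Semaphore) -> List[float]:
    async with aoai_gate:  # tier 3: at most 2 embeddings at once
        await asyncio.sleep(0)  # placeholder for the real embedding call
        return [float(len(text))]


async def process_item(item: str, worker_gate: asyncio.Semaphore,
                       aoai_gate: asyncio.Semaphore) -> str:
    async with worker_gate:  # tier 2: at most 4 items in flight globally
        await asyncio.wait_for(embed(item, aoai_gate), ITEM_TIMEOUT_SECONDS)
        return f"indexed:{item}"


async def process_list(items: List[str], worker_gate: asyncio.Semaphore,
                       aoai_gate: asyncio.Semaphore) -> List[str]:
    tasks = [process_item(i, worker_gate, aoai_gate) for i in items]
    # Per-list budget: cancel the whole list if it exceeds the timeout.
    return await asyncio.wait_for(asyncio.gather(*tasks), LIST_TIMEOUT_SECONDS)


async def run(collections: List[List[str]]) -> List[str]:
    worker_gate = asyncio.Semaphore(INDEXER_MAX_CONCURRENCY)
    aoai_gate = asyncio.Semaphore(AOAI_MAX_CONCURRENCY)
    # Tier 1: every list starts at once; the shared gates do the throttling.
    per_list = await asyncio.gather(
        *(process_list(c, worker_gate, aoai_gate) for c in collections)
    )
    return [result for results in per_list for result in results]


results = asyncio.run(run([["a", "b"], ["c"]]))
```

Sharing one worker semaphore across all lists is what makes the limit global: starting more lists adds queued work, not more concurrent Graph or embedding calls.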
Freshness & Deduplication
- Body documents (generic lists): Fetches chunk c00000 from the index. If SharePoint's Modified timestamp isn't newer, the item is skipped (skippedNoChange) without reprocessing.
- Document library files: Each file gets a parent key with the file name; chunk 0 stores the file's last modified time. Unchanged files increment documentLibraryStats.skippedNotNewer.
- 1-second tolerance: An item is reindexed only if SharePoint's timestamp is more than 1 second newer than the indexed timestamp. This prevents unnecessary work when clocks differ slightly between SharePoint and Azure AI Search.
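The freshness rule with its 1-second tolerance can be expressed as a small comparison. The sketch below assumes ISO 8601 UTC timestamps with a trailing Z, as in the log examples, and treats an item with no indexed timestamp as new; the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

TOLERANCE = timedelta(seconds=1)


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2025-11-21T14:30:22Z."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def needs_reindex(incoming: str, existing: Optional[str]) -> bool:
    """Return True only if the source is more than 1 second newer than the index."""
    if existing is None:
        return True  # nothing indexed yet (assumption: treat as new)
    return parse_utc(incoming) - parse_utc(existing) > TOLERANCE


changed = needs_reindex("2025-11-21T14:30:22Z", "2025-11-20T10:15:00Z")
within_tolerance = needs_reindex("2025-11-21T14:30:22Z", "2025-11-21T14:30:21Z")
```

Here changed is True, while within_tolerance is False because a difference of exactly one second does not exceed the tolerance.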
Settings Cheat Sheet
| Category | Setting | Default | Purpose |
|---|---|---|---|
| Concurrency | INDEXER_MAX_CONCURRENCY | 4 | Item workers across all lists. |
| Concurrency | AOAI_MAX_CONCURRENCY | 2 | Embedding throttle. |
| Concurrency | INDEXER_BATCH_SIZE | 500 | Upload/delete batch size in AI Search. |
| Timeouts | INDEXER_ITEM_TIMEOUT_SECONDS | 600 | Per-item budget. Cancels stuck workers. |
| Timeouts | LIST_GATHER_TIMEOUT_SECONDS | 7200 | Per-list budget. Aborts entire list if exceeded. |
| Timeouts | HTTP_TOTAL_TIMEOUT_SECONDS | 120 | Graph API calls timeout. |
| Timeouts | BLOB_OP_TIMEOUT_SECONDS | 20 | Blob storage writes timeout. |
| Retries | AOAI_BACKOFF_MAX_SECONDS | 60 | Max wait between AOAI retries (exponential backoff + jitter). |
| Retries | AOAI_MAX_RATE_LIMIT_ATTEMPTS | 8 | Rate limit (429) retries for embeddings. Respects Retry-After headers. |
| Retries | AOAI_MAX_TRANSIENT_ATTEMPTS | 8 | Network/timeout retries for embeddings. Fatal errors bubble immediately. |
| Retries | GRAPH_RETRY_ATTEMPTS | 6 | Microsoft Graph GET retries for throttling/transient failures. Max 30s backoff. |
| Retries | SEARCH_RETRY_ATTEMPTS | 8 | Azure AI Search upload/delete retries (1s → 30s backoff). Honors Retry-After. |
| Documents | SHAREPOINT_FILES_FORMAT | pdf,docx,pptx | Allowed file extensions for document libraries. |
| Logging | JOBS_LOG_CONTAINER | jobs | Blob container for logs. |
| Logging | DISABLE_STORAGE_LOGS | unset | Set to true/1 to skip blob logging. |
Tuning notes: Increase AOAI_MAX_CONCURRENCY only if you have confirmed higher TPM quotas. If Graph throttles (429), reduce INDEXER_MAX_CONCURRENCY. The Document Intelligence chunker performs best-effort retries internally; failed items (e.g., 503 errors) can be retried on the next run.
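To make the retry settings concrete, the following sketch computes the wait before a retry using capped exponential backoff with jitter. It is an assumption-laden illustration: the function name, the 1-second base, the "full jitter" variant, and the choice to use a Retry-After value as-is are not confirmed by the settings table, which only states the cap and that Retry-After is respected.

```python
import random
from typing import Callable, Optional


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,  # mirrors AOAI_BACKOFF_MAX_SECONDS
    retry_after: Optional[float] = None,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # Assumption: a server-provided Retry-After wins over local backoff.
        return max(retry_after, 0.0)
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick a random point between 0 and the capped ceiling.
    return rng() * ceiling


# With the random factor pinned to 1.0 the ceilings are visible directly.
ceilings = [backoff_delay(n, rng=lambda: 1.0) for n in range(8)]
```

Pinning rng in the example shows the ceilings 1, 2, 4, 8, 16, 32, 60, 60; in real use the default random source spreads retries so that concurrent workers do not retry in lockstep.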
Observability
The indexer writes logs to two destinations: Application Insights (always active) and Azure Blob Storage (optional).
Application Insights
All indexer activity flows to Application Insights automatically. Below are the four most requested queries:
1. Latest indexer runs
let Logs = union isfuzzy=true traces, AppTraces;
Logs
| where message contains "RUN-COMPLETE" and message contains "sharepoint-indexer"
| extend payload = parse_json(extract('\\{.*', 0, message))
| where tostring(payload.event) == "RUN-COMPLETE"
| extend indexerType = extract('\\[([^\\]]+)\\]', 1, message)
| where indexerType endswith "-indexer" // Keep only indexers
| project timestamp,
indexerType,
runId = tostring(payload.runId),
status = tostring(payload.status),
itemsDiscovered = toint(payload.itemsDiscovered),
itemsIndexed = toint(payload.itemsIndexed),
itemsFailed = toint(payload.itemsFailed),
durationSeconds = todouble(payload.durationSeconds)
| order by timestamp desc
2. All items indexed in a specific run
let TargetRunId = '20251121T212623Z';
let Logs = union isfuzzy=true traces, AppTraces;
Logs
| where message contains "ITEM-COMPLETE"
| extend payload = parse_json(extract('\\{.*', 0, message))
| where tostring(payload.event) == "ITEM-COMPLETE" and tostring(payload.runId) == TargetRunId
| project timestamp,
collection = tostring(payload.collection),
itemId = tostring(payload.itemId),
parentId = tostring(payload.parentId),
status = tostring(payload.status),
attachmentChunks = toint(payload.attachmentChunks),
totalChunks = toint(payload.totalChunks),
webUrl = tostring(payload.webUrl)
| order by timestamp desc
3. Indexing history for a specific item with details
let TargetParent = '/m365x03100047.sharepoint.com/SalesAndMarketing/1be0da74-2b71-45e0-a9d3-1ffafa7d0ba7/15';
let Logs = union isfuzzy=true traces, AppTraces;
Logs
| where message contains "ITEM-COMPLETE"
| extend payload = parse_json(extract('\\{.*', 0, message))
| where tostring(payload.event) == "ITEM-COMPLETE" and tostring(payload.parentId) == TargetParent
| project timestamp,
runId = tostring(payload.runId),
collection = tostring(payload.collection),
status = tostring(payload.status),
attachmentChunks = toint(payload.attachmentChunks),
totalChunks = toint(payload.totalChunks),
webUrl = tostring(payload.webUrl)
| order by timestamp desc
4. Recent errors (all error events)
let Logs = union isfuzzy=true traces, AppTraces;
Logs
| where severityLevel >= 3 // 3=Error, 4=Critical
| where message contains "sharepoint-indexer"
| extend payload = parse_json(extract('\\{.*', 0, message))
| where isnotempty(tostring(payload.event))
| project timestamp,
severityLevel,
event = tostring(payload.event),
runId = tostring(payload.runId),
collection = tostring(payload.collection),
itemId = tostring(payload.itemId),
parentId = tostring(payload.parentId),
error = tostring(payload.error),
message
| order by timestamp desc
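For offline analysis of exported traces, the extraction performed by these queries (an indexer tag in square brackets followed by a JSON payload) can be mirrored in Python. The sample message below is hypothetical and only follows the shape the queries imply; the function name is also an assumption.

```python
import json
import re
from typing import Any, Dict, Optional, Tuple

TAG_RE = re.compile(r"\[([^\]]+)\]")


def parse_trace(message: str) -> Tuple[Optional[str], Optional[Dict[str, Any]]]:
    """Return (indexer tag, JSON payload) from a trace message, or None parts."""
    tag_match = TAG_RE.search(message)
    indexer_type = tag_match.group(1) if tag_match else None
    start = message.find("{")  # payload begins at the first opening brace
    if start == -1:
        return indexer_type, None
    try:
        payload = json.loads(message[start:])
    except json.JSONDecodeError:
        return indexer_type, None
    if not isinstance(payload, dict):
        return indexer_type, None
    return indexer_type, payload


# Hypothetical message shaped like the ones the queries filter on.
sample = (
    '[sharepoint-indexer] RUN-COMPLETE '
    '{"event": "RUN-COMPLETE", "runId": "20251121T143022Z", "status": "finished"}'
)
indexer_type, payload = parse_trace(sample)
```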
Blob Storage Logs (Optional)
Blob logging is enabled by default but gracefully degrades if unavailable. To disable, set the app setting (not environment variable):
DISABLE_STORAGE_LOGS = true
Note: Azure Functions/Container Apps use Application Settings, not shell environment variables. Set this in the Azure Portal under Configuration → Application Settings.
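A minimal sketch of how such a flag might be read at startup. It assumes the platform surfaces application settings to the running process as environment variables (the usual behavior for Azure Functions) and that only true and 1 disable logging; the function name is hypothetical.

```python
import os
from typing import Mapping


def storage_logs_disabled(settings: Mapping[str, str] = os.environ) -> bool:
    """Return True when DISABLE_STORAGE_LOGS is set to true or 1."""
    raw = settings.get("DISABLE_STORAGE_LOGS", "")
    return raw.strip().lower() in {"true", "1"}


# Passing a plain dict makes the behavior easy to check without touching the process.
disabled = storage_logs_disabled({"DISABLE_STORAGE_LOGS": "True"})
enabled_by_default = not storage_logs_disabled({})
```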
When enabled, logs are written to the blob container specified by the JOBS_LOG_CONTAINER app setting (default: jobs):
Per-Item Logs: jobs/sharepoint-indexer/files/{sanitized_parent_id}.json
Each processed item generates a JSON log with:
- Status: success, skipped-no-change, or error
- Freshness details: incomingLastMod, existingLastMod, freshnessReason
- Document library metadata: documentLibraryFileName, documentLibraryUrl (if applicable)
- Chunks processed: Count of chunks uploaded for this item
- Errors: Full exception details if the item failed
Example:
{
"indexerType": "sharepoint-indexer",
"collection": "contoso.sharepoint.com/sites/engineering/Documents",
"itemId": "42",
"parent_id": "contoso_engineering_abc123_42",
"runId": "20251121T143022Z",
"status": "success",
"incomingLastMod": "2025-11-21T14:30:22Z",
"existingLastMod": "2025-11-20T10:15:00Z",
"freshnessReason": "newer-by-ms=101722000",
"chunks": 3
}
Run Summaries: jobs/sharepoint-indexer/runs/{runId}.{status}.json
Each job execution creates stage-specific snapshots:
- {runId}.started.json: Job initialization (collections count, start time)
- {runId}.finishing.json: Mid-execution snapshot with partial stats
- {runId}.finished.json: Final authoritative summary (or .failed.json/.cancelled.json)
- latest.json: Pointer to the most recent run (best-effort; may lag on immutable containers)
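The naming scheme for run snapshots can be sketched as below. The runId format is inferred from the examples in this section (20251121T143022Z for a run started at 2025-11-21T14:30:22Z); the helper names are hypothetical, and the returned name is relative to the log container.

```python
from datetime import datetime, timezone

RUNS_PREFIX = "sharepoint-indexer/runs"


def make_run_id(started_at: datetime) -> str:
    """Format a timezone-aware start time as a compact UTC run ID."""
    return started_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def summary_blob_name(run_id: str, status: str) -> str:
    """Blob name (inside the log container) for a run snapshot."""
    return f"{RUNS_PREFIX}/{run_id}.{status}.json"


run_id = make_run_id(datetime(2025, 11, 21, 14, 30, 22, tzinfo=timezone.utc))
final_blob = summary_blob_name(run_id, "finished")
```

Because the run ID sorts lexicographically in time order, listing the runs prefix returns snapshots chronologically even when latest.json lags.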
Example final summary:
{
"indexerType": "sharepoint-indexer",
"runId": "20251121T143022Z",
"runStartedAt": "2025-11-21T14:30:22Z",
"runFinishedAt": "2025-11-21T14:35:18Z",
"status": "finished",
"collections": 3,
"itemsDiscovered": 84,
"candidateItems": 12,
"indexedItems": 12,
"skippedNoChange": 72,
"failed": 0,
"documentLibraryStats": {
"candidates": 9,
"skippedNotNewer": 6,
"skippedExtNotAllowed": 3,
"uploadedChunks": 18
}
}
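A summary like this lends itself to simple derived metrics, for example the share of discovered items that actually changed. The sketch below parses a trimmed copy of the example and computes that ratio; the function name and the definition of the ratio are assumptions for illustration.

```python
import json
from typing import Any, Mapping


def change_ratio(summary: Mapping[str, Any]) -> float:
    """Fraction of discovered items that were candidates for reindexing."""
    discovered = summary.get("itemsDiscovered", 0)
    if not discovered:
        return 0.0  # avoid dividing by zero on empty runs
    return summary.get("candidateItems", 0) / discovered


# Trimmed copy of the example summary above.
summary = json.loads(
    '{"itemsDiscovered": 84, "candidateItems": 12, '
    '"indexedItems": 12, "skippedNoChange": 72, "failed": 0}'
)
ratio = change_ratio(summary)
```

For the example run, 12 of 84 discovered items were candidates, so roughly one item in seven needed work.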
Metrics Reference
| Counter | Meaning | Source |
|---|---|---|
| items_discovered | Items enumerated from SharePoint | Run summary + App Insights |
| items_candidates | Items deemed newer than index | Run summary + App Insights |
| items_indexed | Body documents uploaded | Run summary + App Insights |
| items_skipped_nochange | Bodies skipped by freshness | Run summary + App Insights |
| items_failed | Errors/timeouts | Run summary + App Insights |
| body_docs_uploaded | Count of body documents uploaded (≤ items_indexed) | Run summary |
| att_candidates | Document-library files considered | documentLibraryStats.candidates |
| att_skipped_not_newer | Files skipped (index already has newer/equal version) | documentLibraryStats.skippedNotNewer |
| att_skipped_ext_not_allowed | Files ignored due to extension filter | documentLibraryStats.skippedExtNotAllowed |
| att_uploaded_chunks | Total chunks pushed for document libraries | documentLibraryStats.uploadedChunks |
Where to find them:
- Blob storage: jobs/sharepoint-indexer/runs/latest.json for the most recent run.
- Application Insights: Query traces (run-level) or customMetrics (time-series) for historical analysis.
- Dashboards: Combine items_discovered, items_indexed, and documentLibraryStats to visualize workload vs. actual changes each run.