SharePoint Data Source

The SharePoint connector keeps Azure AI Search synchronized with both structured list data and rich documents stored in SharePoint Online. It is designed for production-scale ingestion jobs where resiliency, incremental freshness, and search-ready chunking matter.

How it Works

The SharePoint connector ingests both generic lists (structured metadata) and document libraries (files such as PDF, Word, and PowerPoint documents) into Azure AI Search. It detects changes by comparing SharePoint's last modified timestamp with the timestamp stored in the index, and skips unchanged items to save processing time. The connector processes all collections in parallel but caps worker concurrency and Azure OpenAI calls to avoid rate limits.

Generic lists are indexed by reading item fields through the Microsoft Graph API. You can control which fields are embedded with the optional includeFields configuration. Lookup columns are automatically resolved and cached per list, except for hidden system lists such as AppPrincipals, UserInfo, or taxonomy stores. List item attachments are currently not downloaded.
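As a rough illustration of how includeFields could narrow what gets embedded, the sketch below flattens a list item's fields into text. The helper name build_embedding_text and the "name: value" layout are assumptions made for this example, not the connector's actual code.

```python
from typing import Any, Mapping, Optional, Sequence


def build_embedding_text(
    fields: Mapping[str, Any],
    include_fields: Optional[Sequence[str]] = None,
) -> str:
    """Flatten list item fields into 'name: value' lines for embedding.

    Hypothetical helper: when include_fields is given, only those fields
    are kept (in the configured order); otherwise every field is used.
    """
    names = list(include_fields) if include_fields else list(fields)
    lines = []
    for name in names:
        value = fields.get(name)
        if value is None or value == "":
            continue  # skip fields that are missing or empty on this item
        lines.append(f"{name}: {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    item_fields = {"Title": "Q3 launch plan", "Owner": "Megan", "Notes": ""}
    print(build_embedding_text(item_fields, include_fields=["Title", "Owner"]))
```

Keeping the configured order means the embedded text stays stable between runs even if Graph returns the fields in a different order.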

For document libraries, the connector downloads files (default: pdf, docx, pptx) and chunks them using Azure Document Intelligence. Each chunk gets a zero-based ID (c00000, c00001, etc.). The first chunk records the file's last modified time, which keeps the freshness check cheap: if SharePoint's lastModifiedDateTime hasn't changed since the last index run, the file is skipped without reprocessing.
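The chunk ID scheme and the extension filter can be sketched in a few lines. The function names below are hypothetical, and the case-insensitive extension match is an assumption; only the c00000 pattern and the pdf,docx,pptx default come from this page.

```python
from pathlib import PurePosixPath

DEFAULT_FORMATS = "pdf,docx,pptx"  # mirrors the documented default


def chunk_id(index: int) -> str:
    """Return a zero-based, zero-padded chunk ID such as c00000."""
    if index < 0:
        raise ValueError("chunk index must be non-negative")
    return f"c{index:05d}"


def is_allowed_file(file_name: str, formats: str = DEFAULT_FORMATS) -> bool:
    """Check a file name against a comma-separated extension list.

    Assumption: matching ignores case and any leading dot in the setting.
    """
    allowed = {
        ext.strip().lower().lstrip(".")
        for ext in formats.split(",")
        if ext.strip()
    }
    suffix = PurePosixPath(file_name).suffix.lower().lstrip(".")
    return suffix in allowed


if __name__ == "__main__":
    print([chunk_id(i) for i in range(3)])
    print(is_allowed_file("Roadmap.PDF"), is_allowed_file("notes.txt"))
```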

Permissions are resolved by get_item_permission_object_ids, which calls the Microsoft Graph beta /permissions endpoint to capture the explicit Entra user and group IDs for each item. Only GUID-backed identities (users, groups, app registrations, devices) are stored.
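The "GUID-backed identities only" rule can be illustrated with a small filter. This is a sketch under assumptions: keep_guid_ids is a hypothetical helper, not the body of get_item_permission_object_ids, and the de-duplication and lowercase normalization are choices made for the example.

```python
import uuid
from typing import Iterable, List


def keep_guid_ids(candidate_ids: Iterable[str]) -> List[str]:
    """Keep only values that parse as GUIDs, de-duplicated, order preserved."""
    seen = set()
    result = []
    for raw in candidate_ids:
        try:
            # uuid.UUID raises ValueError for anything that is not a GUID.
            normalized = str(uuid.UUID(raw.strip()))
        except (ValueError, AttributeError):
            continue  # display names, empty strings, or non-string values
        if normalized not in seen:
            seen.add(normalized)
            result.append(normalized)
    return result


if __name__ == "__main__":
    ids = [
        "3F2504E0-4F89-11D3-9A0C-0305E82C3301",
        "Everyone except external users",
        "3f2504e0-4f89-11d3-9a0c-0305e82c3301",
    ]
    print(keep_guid_ids(ids))
```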

For detailed setup instructions, including app registration, permissions, and data source configuration, see the SharePoint Connector Setup Guide.

Ingestion Flow

┌────────────────────────────────────────────────────────────────────────────────┐
│                         SHAREPOINT INGESTION FLOW                              │
│                                                                                │
│  ┌────────────────────┐           ┌─────────────────────────────────────────┐  │
│  │  Microsoft Graph   │           │         Cosmos DB                       │  │
│  │  API Client        │           │  • Site Configurations                  │  │
│  │  • Site Discovery  │<──────────│  • List/Library Specs                   │  │
│  │  • List/Library    │           │  • Field Mappings                       │  │
│  │  • Permissions     │           │  • Category Metadata                    │  │
│  └─────────┬──────────┘           └─────────────────────────────────────────┘  │
│            │                                                                   │
│            │ Pull Items + Metadata                                             │
│            v                                                                   │
│  ┌─────────────────────────────────────────────────────────────────────────┐   │
│  │                    SharePoint Collections                               │   │
│  │                                                                         │   │
│  │  ┌──────────────────┐              ┌──────────────────────────────┐     │   │
│  │  │  Generic Lists   │              │  Document Libraries          │     │   │
│  │  │  • Custom Fields │              │  • Office Files (docx/pptx)  │     │   │
│  │  │  • Lookup Fields │              │  • PDFs                      │     │   │
│  │  │  • List Items    │              │  • Binary Content            │     │   │
│  │  └─────────┬────────┘              └───────────┬──────────────────┘     │   │
│  │            │                                   │                        │   │
│  └────────────┼───────────────────────────────────┼────────────────────────┘   │
│               │                                   │                            │
└───────────────┼───────────────────────────────────┼────────────────────────────┘
                │                                   │
                v                                   v
┌────────────────────────────────────────────────────────────────────────────────┐
│                          PROCESSING PIPELINE                                   │
│                                                                                │
│  ┌──────────────────────────────────────────────────────────────────────────┐  │
│  │                    sharepoint_indexer.py                                 │  │
│  │                                                                          │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │  │
│  │  │  Freshness   │  │  Security    │  │  Lookup      │  │  Content     │  │  │
│  │  │  Check       │─>│  Permissions │─>│  Field       │─>│  Extraction  │  │  │
│  │  │  (Last Mod)  │  │  Resolution  │  │  Resolution  │  │  + Chunking  │  │  │
│  │  └──────────────┘  └──────────────┘  └──────────────┘  └──────┬───────┘  │  │
│  │                                                               │          │  │
│  │  ┌──────────────┐  ┌──────────────┐                           │          │  │
│  │  │  Azure       │  │  Document    │<──────────────────────────┘          │  │
│  │  │  OpenAI      │<─│  Chunker     │  (For attachments)                   │  │
│  │  │  Embeddings  │  │  (PDFs/Docs) │                                      │  │
│  │  └──────┬───────┘  └──────────────┘                                      │  │
│  │         │                                                                │  │
│  └─────────┼────────────────────────────────────────────────────────────────┘  │
│            │                                                                   │
└────────────┼───────────────────────────────────────────────────────────────────┘
             │
             v
┌────────────────────────────────────────────────────────────────────────────────┐
│                            OUTPUT & STORAGE                                    │
│                                                                                │
│  ┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────────┐  │
│  │  Azure AI Search     │  │  Azure Blob Storage  │  │    Telemetry         │  │
│  │  • Indexed Chunks    │  │  • Run Summaries     │  │  • App Insights      │  │
│  │  • Vector Embeddings │  │  • Item Logs         │  │  • Structured Logs   │  │
│  │  • Security IDs      │  │  • Processing State  │  │  • Performance       │  │
│  │  • Metadata          │  │                      │  │  • Error Tracking    │  │
│  │  source=sharepoint   │  │                      │  │                      │  │
│  └──────────────────────┘  └──────────────────────┘  └──────────────────────┘  │
│                                                                                │
└────────────────────────────────────────────────────────────────────────────────┘

Processing Pipeline

The indexer uses three tiers of parallelism to balance speed and service limits:

1. Collection Discovery (All Lists in Parallel)

┌───────────────────────────────────────────────────────────────┐
│  Cosmos datasources (type: sharepoint_site)                   │
└───────────┬────────────────────────┬───────────────────┬──────┘
            │                        │                   │
            ▼                        ▼                   ▼
      ┌──────────┐             ┌──────────┐         ┌──────────┐
      │  List A  │             │  List B  │   ...   │  List N  │   ← All lists start simultaneously
      └─────┬────┘             └─────┬────┘         └─────┬────┘
            │                        │                    │
            └────────────────────────┴────────────────────┘
                             │
                             ▼ (fetch items via Graph API)

2. Item Processing (Controlled: ≤ 4 Workers)

┌──────────────────────────────────────────────────────────────────┐
│  Global Worker Pool (INDEXER_MAX_CONCURRENCY = 4)                │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐                 │
│  │Worker 1 │ │Worker 2 │ │Worker 3 │ │  ...4   │                 │
│  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘                 │
└───────┼───────────┼───────────┼───────────┼──────────────────────┘
        │           │           │           │
        ▼           ▼           ▼           ▼
   ┌─────────────────────────────────────────────────────────┐
   │ Per-Item: freshness → process → upload                  │
   │ • Body (generic lists): list fields → text → embedding  │
   │ • Files (document libraries): download → chunk → embed  │
   └─────────────────────────────────────────────────────────┘

3. Embedding Generation (Throttled: ≤ 2 Concurrent)

┌────────────────────────────────────────────────────────┐
│  AOAI Embedding Gate (AOAI_MAX_CONCURRENCY = 2)        │
│  ┌──────────┐ ┌──────────┐                             │
│  │  Slot 1  │ │  Slot 2  │  ← Only 2 embeddings run    │
│  └──────────┘ └──────────┘     at the same time        │
└────────────────────────────────────────────────────────┘
         ▲              ▲
         │              │
    Workers queue here when embedding is needed
    (workers skip this if freshness = unchanged)

Parallelism At a Glance

Layer	Control	Default	Notes
Collection enumeration	`asyncio.gather`	Unlimited	Each list runs independently.
Item workers	Semaphore	`INDEXER_MAX_CONCURRENCY = 4`	Covers body + file work.
Embeddings	Semaphore	`AOAI_MAX_CONCURRENCY = 2`	Applies to both bodies and files.
Item timeout	`asyncio.wait_for`	600 s	Cancels sluggish items.
Collection timeout	`asyncio.wait_for`	7200 s	Cancels stuck lists.
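
The three tiers and two timeouts in the table can be combined roughly as follows. This is a minimal sketch, not the code in sharepoint_indexer.py: run_collections, the embed callable, and the "timeout" and empty-list return values are all assumptions for illustration. Only the limits and the use of asyncio.gather, semaphores, and asyncio.wait_for come from this page.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

INDEXER_MAX_CONCURRENCY = 4   # documented default for item workers
AOAI_MAX_CONCURRENCY = 2      # documented default for embedding calls
ITEM_TIMEOUT_SECONDS = 600    # per-item budget
LIST_TIMEOUT_SECONDS = 7200   # per-list budget

Embed = Callable[[str], Awaitable[str]]


async def run_collections(
    collections: Dict[str, List[str]], embed: Embed
) -> Dict[str, List[str]]:
    """Run every collection in parallel while capping item and embedding work."""
    item_gate = asyncio.Semaphore(INDEXER_MAX_CONCURRENCY)
    embed_gate = asyncio.Semaphore(AOAI_MAX_CONCURRENCY)

    async def embed_one(text: str) -> str:
        async with embed_gate:  # tier 3: workers queue here for a slot
            return await embed(text)

    async def process_item(text: str) -> str:
        async with item_gate:  # tier 2: one worker pool shared by all lists
            try:
                return await asyncio.wait_for(embed_one(text), ITEM_TIMEOUT_SECONDS)
            except asyncio.TimeoutError:
                return "timeout"  # hypothetical marker for a cancelled item

    async def process_collection(items: List[str]) -> List[str]:
        work = asyncio.gather(*(process_item(text) for text in items))
        try:
            return await asyncio.wait_for(work, LIST_TIMEOUT_SECONDS)
        except asyncio.TimeoutError:
            return []  # hypothetical: an aborted list yields no results

    # Tier 1: every collection starts at once.
    names = list(collections)
    results = await asyncio.gather(
        *(process_collection(collections[name]) for name in names)
    )
    return dict(zip(names, results))


if __name__ == "__main__":
    async def fake_embed(text: str) -> str:  # stand-in for an embedding call
        await asyncio.sleep(0.01)
        return text.upper()

    print(asyncio.run(run_collections({"List A": ["a1", "a2"]}, fake_embed)))
```

The item timeout starts only after a worker slot is acquired, so time spent waiting in the queue does not count against the per-item budget. That is a design choice of this sketch; the real indexer may measure it differently.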

Freshness & Deduplication

Body documents (generic lists): The indexer fetches chunk c00000 from the index. If SharePoint's Modified timestamp isn't newer than the indexed one, the item is skipped (skippedNoChange) without reprocessing.
Document library files: Each file gets a parent key that includes the file name, and chunk c00000 stores the file's last modified time. Unchanged files increment documentLibraryStats.skippedNotNewer.
1-second tolerance: An item is reindexed only if SharePoint's timestamp is more than 1 second newer than the timestamp stored in the index. This prevents unnecessary work when clocks differ slightly between SharePoint and Azure AI Search.
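
The freshness rule reduces to a single comparison. The sketch below assumes ISO 8601 timestamps with whole seconds, as in the log examples on this page. The function names and the "no indexed timestamp means reindex" behavior are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

TOLERANCE = timedelta(seconds=1)  # documented clock-skew tolerance


def parse_utc(value: str) -> datetime:
    """Parse a timestamp such as 2025-11-21T14:30:22Z as an aware UTC datetime."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assume UTC when unmarked
    return parsed.astimezone(timezone.utc)


def needs_reindex(incoming_last_mod: str, existing_last_mod: Optional[str]) -> bool:
    """Return True only if SharePoint is more than 1 second newer than the index."""
    if not existing_last_mod:
        return True  # assumption: nothing indexed yet, so index the item
    delta = parse_utc(incoming_last_mod) - parse_utc(existing_last_mod)
    return delta > TOLERANCE


if __name__ == "__main__":
    print(needs_reindex("2025-11-21T14:30:22Z", "2025-11-20T10:15:00Z"))
    print(needs_reindex("2025-11-21T14:30:23Z", "2025-11-21T14:30:22Z"))
```

Because the comparison is strict, a difference of exactly one second still counts as unchanged.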

Settings Cheat Sheet

Category	Setting	Default	Purpose
Concurrency	`INDEXER_MAX_CONCURRENCY`	4	Item workers across all lists.
	`AOAI_MAX_CONCURRENCY`	2	Embedding throttle.
	`INDEXER_BATCH_SIZE`	500	Upload/delete batch size in AI Search.
Timeouts	`INDEXER_ITEM_TIMEOUT_SECONDS`	600	Per-item budget. Cancels stuck workers.
	`LIST_GATHER_TIMEOUT_SECONDS`	7200	Per-list budget. Aborts entire list if exceeded.
	`HTTP_TOTAL_TIMEOUT_SECONDS`	120	Graph API calls timeout.
	`BLOB_OP_TIMEOUT_SECONDS`	20	Blob storage writes timeout.
Retries	`AOAI_BACKOFF_MAX_SECONDS`	60	Max wait between AOAI retries (exponential backoff + jitter).
	`AOAI_MAX_RATE_LIMIT_ATTEMPTS`	8	Rate limit (429) retries for embeddings. Respects `Retry-After` headers.
	`AOAI_MAX_TRANSIENT_ATTEMPTS`	8	Network/timeout retries for embeddings. Fatal errors bubble immediately.
	`GRAPH_RETRY_ATTEMPTS`	6	Microsoft Graph GET retries for throttling/transient failures. Max 30s backoff.
	`SEARCH_RETRY_ATTEMPTS`	8	Azure AI Search upload/delete retries (1s → 30s backoff). Honors `Retry-After`.
Documents	`SHAREPOINT_FILES_FORMAT`	`pdf,docx,pptx`	Allowed file extensions for document libraries.
Logging	`JOBS_LOG_CONTAINER`	`jobs`	Blob container for logs.
	`DISABLE_STORAGE_LOGS`	unset	Set to `true` or `1` to skip blob logging.
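
App settings reach the running process as environment variables, so a loader for the values above might look like the sketch below. The helper names and the fallback to the default on a missing or invalid value are assumptions; the setting names and defaults are taken from the table.

```python
import os
from typing import Mapping

# Defaults copied from the table above.
INT_DEFAULTS = {
    "INDEXER_MAX_CONCURRENCY": 4,
    "AOAI_MAX_CONCURRENCY": 2,
    "INDEXER_BATCH_SIZE": 500,
    "INDEXER_ITEM_TIMEOUT_SECONDS": 600,
    "LIST_GATHER_TIMEOUT_SECONDS": 7200,
}


def int_setting(name: str, settings: Mapping[str, str] = os.environ) -> int:
    """Read a positive integer setting, falling back to its documented default."""
    default = INT_DEFAULTS[name]
    try:
        value = int(settings.get(name, default))
    except (TypeError, ValueError):
        return default  # assumption: unparsable values are ignored
    return value if value > 0 else default


def storage_logs_disabled(settings: Mapping[str, str] = os.environ) -> bool:
    """Return True when DISABLE_STORAGE_LOGS is set to true or 1."""
    return settings.get("DISABLE_STORAGE_LOGS", "").strip().lower() in {"true", "1"}


if __name__ == "__main__":
    print(int_setting("INDEXER_MAX_CONCURRENCY"), storage_logs_disabled())
```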

Tuning notes: Increase AOAI_MAX_CONCURRENCY only after confirming that your Azure OpenAI deployment has a higher tokens-per-minute (TPM) quota. If Graph throttles requests (HTTP 429), reduce INDEXER_MAX_CONCURRENCY. The Document Intelligence chunker performs best-effort retries internally; items that still fail (e.g., with 503 errors) can be retried on the next run.
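
To make "exponential backoff + jitter" concrete, here is one common way to compute the wait before each retry. It is a sketch, not the indexer's retry code: the 1-second base, the 50% to 100% jitter range, and capping a server-provided Retry-After at the maximum are assumptions. Only the 60-second ceiling comes from AOAI_BACKOFF_MAX_SECONDS.

```python
import random
from typing import Callable, Optional

AOAI_BACKOFF_MAX_SECONDS = 60.0  # documented ceiling between retries


def backoff_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 1.0,
    cap: float = AOAI_BACKOFF_MAX_SECONDS,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    if retry_after is not None and retry_after > 0:
        # Prefer the server's hint; capping it is an assumption of this sketch.
        return min(float(retry_after), cap)
    ceiling = min(cap, base * (2 ** (attempt - 1)))  # 1, 2, 4, ... up to cap
    return ceiling * (0.5 + 0.5 * rng())  # jitter between 50% and 100%


if __name__ == "__main__":
    for attempt in range(1, 9):
        print(attempt, round(backoff_delay(attempt), 2))
```

Jitter spreads retries from parallel workers so they do not all hit the service again at the same instant.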

Observability

The indexer writes logs to two destinations: Application Insights (always active) and Azure Blob Storage (optional).

Application Insights

All indexer activity flows to Application Insights automatically. Below are the four most requested queries:

1. Latest indexer runs

let Logs = union isfuzzy=true traces, AppTraces;
Logs
| where message contains "RUN-COMPLETE" and message contains "sharepoint-indexer"
| extend payload = parse_json(extract('\\{.*', 0, message))
| where tostring(payload.event) == "RUN-COMPLETE"
| extend indexerType = extract('\\[([^\\]]+)\\]', 1, message)
| where indexerType endswith "-indexer"  // Keep only indexers
| project timestamp,
          indexerType,
          runId = tostring(payload.runId),
          status = tostring(payload.status),
          itemsDiscovered = toint(payload.itemsDiscovered),
          itemsIndexed = toint(payload.itemsIndexed),
          itemsFailed = toint(payload.itemsFailed),
          durationSeconds = todouble(payload.durationSeconds)
| order by timestamp desc

2. All items indexed in a specific run

let TargetRunId = '20251121T212623Z';
let Logs = union isfuzzy=true traces, AppTraces;
Logs
| where message contains "ITEM-COMPLETE"
| extend payload = parse_json(extract('\\{.*', 0, message))
| where tostring(payload.event) == "ITEM-COMPLETE" and tostring(payload.runId) == TargetRunId
| project timestamp,
          collection = tostring(payload.collection),
          itemId = tostring(payload.itemId),
          parentId = tostring(payload.parentId),
          status = tostring(payload.status),
          attachmentChunks = toint(payload.attachmentChunks),
          totalChunks = toint(payload.totalChunks),
          webUrl = tostring(payload.webUrl)
| order by timestamp desc

3. Indexing history for a specific item with details

let TargetParent = '/m365x03100047.sharepoint.com/SalesAndMarketing/1be0da74-2b71-45e0-a9d3-1ffafa7d0ba7/15';
let Logs = union isfuzzy=true traces, AppTraces;
Logs
| where message contains "ITEM-COMPLETE"
| extend payload = parse_json(extract('\\{.*', 0, message))
| where tostring(payload.event) == "ITEM-COMPLETE" and tostring(payload.parentId) == TargetParent
| project timestamp,
          runId = tostring(payload.runId),
          collection = tostring(payload.collection),
          status = tostring(payload.status),
          attachmentChunks = toint(payload.attachmentChunks),
          totalChunks = toint(payload.totalChunks),
          webUrl = tostring(payload.webUrl)
| order by timestamp desc

4. Recent errors (all error events)

let Logs = union isfuzzy=true traces, AppTraces;
Logs
| where severityLevel >= 3  // 3=Error, 4=Critical
| where message contains "sharepoint-indexer"
| extend payload = parse_json(extract('\\{.*', 0, message))
| where isnotempty(tostring(payload.event))
| project timestamp,
          severityLevel,
          event = tostring(payload.event),
          runId = tostring(payload.runId),
          collection = tostring(payload.collection),
          itemId = tostring(payload.itemId),
          parentId = tostring(payload.parentId),
          error = tostring(payload.error),
          message
| order by timestamp desc
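
The queries above all extract a bracketed indexer tag and a JSON payload from each trace message. When working with exported logs outside Application Insights, the same parsing can be approximated in Python. The sample message layout is inferred from the regular expressions in the queries and is an assumption, as is the helper name parse_log_message.

```python
import json
import re
from typing import Any, Dict, Optional, Tuple

TAG_RE = re.compile(r"\[([^\]]+)\]")  # same pattern as the indexerType extract
PAYLOAD_RE = re.compile(r"\{.*")      # same pattern as the payload extract


def parse_log_message(message: str) -> Tuple[Optional[str], Dict[str, Any]]:
    """Return (indexer tag, JSON payload) from a trace message."""
    tag_match = TAG_RE.search(message)
    payload_match = PAYLOAD_RE.search(message)
    payload: Dict[str, Any] = {}
    if payload_match:
        try:
            loaded = json.loads(payload_match.group(0))
            if isinstance(loaded, dict):
                payload = loaded
        except json.JSONDecodeError:
            payload = {}  # message had a brace but no valid JSON object
    return (tag_match.group(1) if tag_match else None), payload


if __name__ == "__main__":
    sample = (
        '[sharepoint-indexer] RUN-COMPLETE '
        '{"event": "RUN-COMPLETE", "runId": "20251121T143022Z", "itemsIndexed": 12}'
    )
    print(parse_log_message(sample))
```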

Blob Storage Logs (Optional)

Blob logging is enabled by default; if the storage account is unavailable, the indexer keeps running without it. To disable blob logging, set this app setting (not a shell environment variable):

DISABLE_STORAGE_LOGS = true

Note: Azure Functions/Container Apps use Application Settings, not shell environment variables. Set this in the Azure Portal under Configuration → Application Settings.

When enabled, logs are written to the blob container specified by the JOBS_LOG_CONTAINER app setting (default: jobs):

Per-Item Logs: jobs/sharepoint-indexer/files/{sanitized_parent_id}.json

Each processed item generates a JSON log with:

Status: success, skipped-no-change, or error
Freshness details: incomingLastMod, existingLastMod, freshnessReason
Document library metadata: documentLibraryFileName, documentLibraryUrl (if applicable)
Chunks processed: Count of chunks uploaded for this item
Errors: Full exception details if the item failed

Example:

{
  "indexerType": "sharepoint-indexer",
  "collection": "contoso.sharepoint.com/sites/engineering/Documents",
  "itemId": "42",
  "parent_id": "contoso_engineering_abc123_42",
  "runId": "20251121T143022Z",
  "status": "success",
  "incomingLastMod": "2025-11-21T14:30:22Z",
  "existingLastMod": "2025-11-20T10:15:00Z",
  "freshnessReason": "newer-by-ms=101722000",
  "chunks": 3
}

Run Summaries: jobs/sharepoint-indexer/runs/{runId}.{status}.json

Each job execution creates stage-specific snapshots:

{runId}.started.json: Job initialization (collections count, start time)
{runId}.finishing.json: Mid-execution snapshot with partial stats
{runId}.finished.json: Final authoritative summary (or .failed.json/.cancelled.json)
latest.json: Pointer to the most recent run (best-effort; may lag on immutable containers)

Example final summary:

{
  "indexerType": "sharepoint-indexer",
  "runId": "20251121T143022Z",
  "runStartedAt": "2025-11-21T14:30:22Z",
  "runFinishedAt": "2025-11-21T14:35:18Z",
  "status": "finished",
  "collections": 3,
  "itemsDiscovered": 84,
  "candidateItems": 12,
  "indexedItems": 12,
  "skippedNoChange": 72,
  "failed": 0,
  "documentLibraryStats": {
    "candidates": 9,
    "skippedNotNewer": 6,
    "skippedExtNotAllowed": 3,
    "uploadedChunks": 18
  }
}

Metrics Reference

Counter	Meaning	Source
`items_discovered`	Items enumerated from SharePoint	Run summary + App Insights
`items_candidates`	Items deemed newer than index	Run summary + App Insights
`items_indexed`	Body documents uploaded	Run summary + App Insights
`items_skipped_nochange`	Bodies skipped by freshness	Run summary + App Insights
`items_failed`	Errors/timeouts	Run summary + App Insights
`body_docs_uploaded`	Count of body documents uploaded (≤ items_indexed)	Run summary
`att_candidates`	Document-library files considered	`documentLibraryStats.candidates`
`att_skipped_not_newer`	Files skipped (index already has newer/equal version)	`documentLibraryStats.skippedNotNewer`
`att_skipped_ext_not_allowed`	Files ignored due to extension filter	`documentLibraryStats.skippedExtNotAllowed`
`att_uploaded_chunks`	Total chunks pushed for document libraries	`documentLibraryStats.uploadedChunks`

Where to find them:

Blob storage: jobs/sharepoint-indexer/runs/latest.json for the most recent run.
Application Insights: Query traces (run-level) or customMetrics (time-series) for historical analysis.
Dashboards: Combine items_discovered, items_indexed, and documentLibraryStats to visualize workload vs. actual changes each run.
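
As a starting point for such a dashboard, a run summary can be reduced to a few headline numbers. The sketch below reads the field names from the example final summary on this page. The summarize_run helper and the changeRatio metric (indexed items divided by discovered items) are assumptions for illustration.

```python
from typing import Any, Dict, Mapping


def summarize_run(summary: Mapping[str, Any]) -> Dict[str, Any]:
    """Reduce a run summary to workload vs. actual-change figures."""
    discovered = int(summary.get("itemsDiscovered", 0))
    indexed = int(summary.get("indexedItems", 0))
    library = summary.get("documentLibraryStats") or {}
    return {
        "runId": summary.get("runId"),
        "discovered": discovered,
        "indexed": indexed,
        "failed": int(summary.get("failed", 0)),
        # Share of discovered items that actually changed in this run.
        "changeRatio": round(indexed / discovered, 4) if discovered else 0.0,
        "uploadedChunks": int(library.get("uploadedChunks", 0)),
    }


if __name__ == "__main__":
    example = {
        "runId": "20251121T143022Z",
        "itemsDiscovered": 84,
        "indexedItems": 12,
        "failed": 0,
        "documentLibraryStats": {"uploadedChunks": 18},
    }
    print(summarize_run(example))
```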