🔍 Design Options for Streaming + Pre-Validation in RAG Systems
Approach | Description | Streaming Impact | Pros | Cons | Examples/Refs |
---|---|---|---|---|---|
Two-Pass (Generate-Then-Check) | LLM generates the full answer, then a separate evaluator model/algorithm checks groundedness and relevance before releasing it. | User sees the answer stream only after verification (one-time delay). | High confidence in the final answer; simple control flow; easy to implement with frameworks (sequential calls) | Added latency = generation + evaluation time; not “true” streaming; 2x LLM call cost | LlamaIndex faithfulness check; Medium pipeline with GPT-4 self-eval |
On-the-Fly Chunk Validation | Checks each chunk or sentence of the streaming output against the sources/query; can halt or edit mid-stream if issues are found. | Streams in real time (possibly halting or altering output if a chunk fails). | Minimal user waiting time; catches problems early; no need to recompute the entire answer | Technically complex; may catch issues too late; chunk-size tuning and false alarms | Azure real-time correction (preview) |
Multi-Agent (Critique & Refine) | Two or more LLM agents in a loop: one produces the answer, another verifies it and provides feedback, then the answer is refined. | Typically streams only after the agents finish (the final output can stream). | Very high answer quality; flexible (e.g., add web search or tools) | Highest latency and compute cost; complex orchestration; user not involved during the inner loop | RARR pipeline; multi-agent review |
Hybrid Strategies | Lightweight self-check first; heavy check only if needed. Or a fast initial model plus a slower verifier. | Tunable balance between latency and rigor (e.g., early streaming with a final validation gate). | Flexible trade-off; expensive checks triggered only when needed | Still evolving; complex system logic (thresholds, model routing) | GPT-4 eval on low-confidence outputs |
Table: Design options for combining streaming response generation with pre-validation in RAG systems. Minimal code sketches of each approach follow below.
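
To make the two-pass row concrete, here is a minimal sketch of the generate-then-check flow. The callables `generate_answer` and `check_groundedness`, and the 0.8 threshold, are illustrative stand-ins for your own generation call and evaluator (e.g., a LlamaIndex-style faithfulness check), not any specific library API:

```python
from typing import Callable, Iterator

def two_pass_answer(
    query: str,
    sources: list[str],
    generate_answer: Callable[[str, list[str]], str],
    check_groundedness: Callable[[str, str, list[str]], float],
    threshold: float = 0.8,
) -> Iterator[str]:
    """Generate the full answer, verify it, then release it as a stream."""
    answer = generate_answer(query, sources)            # pass 1: generation
    score = check_groundedness(query, answer, sources)  # pass 2: evaluation
    if score < threshold:
        yield "Sorry, I could not verify an answer against the sources."
        return
    # The user sees nothing until this point; the verified answer is then
    # "replayed" as a stream, which is the one-time delay noted in the table.
    for token in answer.split(" "):
        yield token + " "
```

Note the cost structure: every request pays for both the generation call and the evaluation call, which is where the table's "2x LLM call cost" comes from.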
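The on-the-fly row can be sketched as a thin wrapper around a token stream. Here `check_chunk` is a hypothetical fast validator (for example, an NLI entailment model over the retrieved sources), and `chunk_size` is the tuning knob the Cons column refers to; both names are assumptions for illustration:

```python
from typing import Callable, Iterator

def validated_stream(
    token_stream: Iterator[str],
    sources: list[str],
    check_chunk: Callable[[str, list[str]], bool],
    chunk_size: int = 40,  # smaller chunks catch issues earlier but raise false-alarm risk
) -> Iterator[str]:
    """Buffer tokens into chunks and validate each chunk before releasing it."""
    buffer: list[str] = []
    for token in token_stream:
        buffer.append(token)
        if len(buffer) >= chunk_size:
            chunk = "".join(buffer)
            buffer.clear()
            if not check_chunk(chunk, sources):
                # Halt mid-stream rather than emit an unsupported claim.
                yield "\n[response stopped: could not ground the last passage]"
                return
            yield chunk
    tail = "".join(buffer)
    if tail and check_chunk(tail, sources):
        yield tail
```

The trade-off is visible in the buffer: by the time a chunk fails validation, earlier chunks have already reached the user, which is why the table flags that this approach "may catch issues too late."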
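The multi-agent row reduces to a critique-and-refine loop. The agent calls `draft_fn`, `critique_fn`, and `refine_fn` below are hypothetical; a RARR-style pipeline would additionally run retrieval inside the critic, whereas this sketch simply has the critic return `None` when it is satisfied:

```python
from typing import Callable, Optional

def critique_and_refine(
    query: str,
    sources: list[str],
    draft_fn: Callable[[str, list[str]], str],
    critique_fn: Callable[[str, str, list[str]], Optional[str]],
    refine_fn: Callable[[str, str, str, list[str]], str],
    max_rounds: int = 3,  # bound the loop: each round is another LLM call
) -> str:
    """Draft, critique, refine; return the final answer for streaming."""
    answer = draft_fn(query, sources)
    for _ in range(max_rounds):
        feedback = critique_fn(query, answer, sources)
        if feedback is None:  # the critic found no remaining issues
            break
        answer = refine_fn(query, answer, feedback, sources)
    # Only this final answer streams to the user; the inner loop is invisible,
    # which is the latency/opacity trade-off the table notes.
    return answer
```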
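Finally, a sketch of the hybrid gate: run a cheap check on every answer and escalate to an expensive verifier (e.g., a GPT-4 eval call) only below a confidence floor. `cheap_check`, `expensive_check`, and the 0.9 floor are illustrative assumptions, not a prescribed configuration:

```python
from typing import Callable

def hybrid_gate(
    query: str,
    answer: str,
    sources: list[str],
    cheap_check: Callable[[str, str, list[str]], float],
    expensive_check: Callable[[str, str, list[str]], bool],
    confidence_floor: float = 0.9,
) -> bool:
    """Return True if the answer may be released to the user."""
    score = cheap_check(query, answer, sources)  # fast path, runs on everything
    if score >= confidence_floor:
        return True
    # Only low-confidence answers pay for the heavy verifier, keeping the
    # average per-request cost close to the cheap check alone.
    return expensive_check(query, answer, sources)
```

The routing threshold is the "complex system logic" the table warns about: set the floor too high and most traffic hits the expensive verifier; set it too low and ungrounded answers slip through on the cheap path.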