ADR-019: Cascade Monitor Pattern
Status
Accepted - 2025-10-01
Context
The cascade workflow needs to be triggered when upstream changes are merged into the fork_upstream branch. However, experience has shown that automatic triggering creates reliability and usability issues:
- Event Trigger Limitations:
pull_request_targetevents require workflows to exist on the target branch (fork_upstream), but this branch is a pure mirror without workflow files - Human Control: Teams want explicit control over when integration happens, not automatic triggering
- Timing Control: Humans may want to batch multiple changes or time integrations appropriately
- Visibility Requirements: Clear audit trails and progress tracking are needed throughout the cascade lifecycle
- Error Recovery: Failed or missed triggers need reliable detection and recovery mechanisms
The original approach assumed automatic triggering was preferred, but this created:
- Unreliable triggering due to workflow file availability issues
- Lack of human control over integration timing
- Poor visibility into cascade progress and state
- Complex error handling for edge cases
- No comprehensive tracking of issue lifecycle
Decision
Implement a Human-Centric Cascade Pattern with monitor-based safety net:
- Primary Path - Manual Triggering: Humans manually trigger cascade integration after reviewing and merging sync PRs
- Safety Net - Monitor Detection:
cascade-monitor.ymldetects missed triggers and automatically initiates cascades as fallback - Issue Lifecycle Tracking: Comprehensive tracking of cascade state through GitHub issues with label management
- cascade.yml: Enhanced with issue tracking and progress updates throughout the cascade process
- Human-in-the-Loop: Explicit human control points with clear instructions and visibility
Rationale
Human Control and Visibility
- Explicit Human Decisions: Humans control when integration happens after reviewing changes
- Clear Instructions: sync.yml provides explicit steps for manual cascade triggering
- Issue Lifecycle Tracking: Complete audit trail from sync detection to production deployment
- Progress Visibility: Real-time updates on cascade state through issue comments
Reliable Safety Net
- Monitor as Backup: Detects when humans forget to trigger cascades
- Automatic Recovery: Safety net triggers missed cascades with clear documentation
- Git-based Detection: Uses reliable branch comparison to detect pending changes
- No Event Dependencies: Not dependent on GitHub event triggering limitations
Comprehensive State Management
- Label-based Tracking: Issue labels track cascade state progression
- Comment-based Updates: Detailed progress updates in tracking issues
- Error State Handling: Automated failure detection and recovery workflows
- Completion Tracking: Issues closed when changes reach production
- Self-Healing Recovery: Automatic retry system based on human intervention signals
Improved Team Experience
- Predictable Process: Teams know exactly when and how to trigger cascades
- Better Timing Control: Can batch changes or time integrations appropriately
- Clear Error Recovery: Obvious next steps when things go wrong
- Reduced Surprises: No unexpected automatic triggers
Alternatives Considered
1. Direct Push Triggers
Pros: Simple, immediate triggering Cons: Fires on all pushes, not just sync merges; no way to filter by intent Decision: Rejected due to unwanted triggers2. Combined PR and Push Triggers
# In cascade.yml (original approach)
on:
push:
branches: [fork_upstream, fork_integration]
pull_request:
types: [closed]
branches: [fork_upstream, fork_integration]
Pros: Handles various trigger scenarios Cons: Complex conditional logic; hard to debug; no error handling Decision: Rejected due to complexity and reliability issues
3. External Webhook System
Pros: Maximum flexibility, external control Cons: Additional infrastructure; more complex setup; maintenance overhead Decision: Rejected due to complexity for minimal benefit
4. Scheduled Polling
Pros: Guaranteed to catch changes eventually Cons: Up to 5-minute delay; inefficient; doesn't scale well Decision: Rejected as primary approach (kept as backup in monitor)
Implementation Details
Sync Workflow Instructions
# In sync.yml notification
**Next Steps:**
1. 🔍 **Review the sync PR** for any breaking changes or conflicts
2. ✅ **Merge the PR** when satisfied with the changes
3. 🚀 **Manually trigger 'Cascade Integration' workflow** to integrate changes
4. 📊 **Monitor cascade progress** in Actions tab
Monitor Safety Net Structure
name: Cascade Monitor
on:
schedule:
- cron: '0 */6 * * *' # Safety net detection every 6 hours
workflow_dispatch: # Manual health checking
jobs:
detect-missed-cascade:
steps:
- name: Check for missed cascade triggers
run: |
# Check if fork_upstream has commits that fork_integration doesn't
UPSTREAM_COMMITS=$(git rev-list --count origin/fork_integration..origin/fork_upstream)
if [ "$UPSTREAM_COMMITS" -gt 0 ]; then
# Find tracking issue using improved label search
ISSUE_NUMBER=$(gh issue list \
--label "upstream-sync" \
--state open \
--limit 1 \
--json number \
--jq '.[0].number // empty')
if [ -n "$ISSUE_NUMBER" ]; then
# Comment on issue and auto-trigger cascade as safety net
gh workflow run "Cascade Integration" \
--repo ${{ github.repository }} \
-f issue_number="$ISSUE_NUMBER"
fi
fi
Issue Lifecycle Tracking
# Cascade workflow uses provided issue number for tracking
# Issue number passed as workflow input
ISSUE_NUMBER="${{ github.event.inputs.issue_number }}"
# When cascade starts
gh issue edit "$ISSUE_NUMBER" \
--remove-label "human-required" \
--add-label "cascade-active"
gh issue comment "$ISSUE_NUMBER" --body "🚀 **Cascade Integration Started** - $(date -u +%Y-%m-%dT%H:%M:%SZ)
Integration workflow has been triggered and is now processing upstream changes."
# When conflicts detected
gh issue edit "$ISSUE_NUMBER" \
--remove-label "cascade-active" \
--add-label "cascade-blocked"
# When production PR created
gh issue edit "$ISSUE_NUMBER" \
--remove-label "cascade-active" \
--add-label "production-ready"
gh issue comment "$ISSUE_NUMBER" --body "🎯 **Production PR Created** - $(date -u +%Y-%m-%dT%H:%M:%SZ)
Integration completed successfully! Production PR has been created and is ready for final review."
Health Monitoring Integration
The monitor also includes periodic health checks:
- Stale Conflict Detection: Find conflicts older than 48 hours
- Pipeline Health Reports: Overall cascade pipeline status
- Escalation Management: Automatic escalation of long-running issues
Consequences
Positive
- Human Control: Teams have explicit control over integration timing
- Reliability: No dependency on GitHub event triggering edge cases
- Visibility: Complete audit trail through issue lifecycle tracking
- Error Recovery: Clear path to resolution when things go wrong
- Predictability: Consistent, documented process for all team members
- Safety Net: Automatic detection and recovery of missed triggers
- Flexibility: Can batch changes or time integrations appropriately
Negative
- Manual Step Required: Humans must remember to trigger cascades
- Potential Delays: Up to 6 hours delay if manual trigger is forgotten
- Additional Complexity: Issue lifecycle tracking adds workflow complexity
- Learning Curve: Team needs to understand manual trigger process
- Monitor Dependency: Safety net relies on monitor workflow functioning
Neutral
- File Count: Adds one additional workflow file
- Maintenance: Two simpler workflows vs. one complex workflow
- Testing: Need to test both trigger detection and cascade execution
Integration Points
With Sync Workflow
- Sync workflow creates PRs with
upstream-synclabel - Sync workflow creates tracking issues with explicit manual trigger instructions
- Humans review, merge PR, and manually trigger cascade
With Cascade Workflow
- Cascade runs on
workflow_dispatch(manual or monitor-triggered) - Cascade updates tracking issue labels and comments throughout process
- Cascade handles conflicts, integration, and production PR creation
- Error states tracked through issue labels and comments
With Label Management (ADR-008)
- Uses predefined labels:
upstream-sync,cascade-trigger-failed,human-required - Leverages existing label-based notification system
- Maintains consistency with other workflow patterns
Monitoring and Alerting
Success Metrics
- Manual Trigger Adoption: % of sync merges followed by manual cascade triggers
- Safety Net Effectiveness: % of missed triggers caught by monitor
- Issue Lifecycle Completeness: % of cascades with complete issue tracking
- Human Response Time: Time between sync completion and manual trigger
- Error Recovery: Time to resolve cascade conflicts and issues
Failure Modes
- Forgotten Manual Trigger: Monitor safety net detects and auto-triggers
- Monitor Workflow Failure: Manual cascade trigger still available
- Issue Tracking Failure: Cascade proceeds but with reduced visibility
- Cascade Integration Conflicts: Clear conflict resolution workflow with SLA
Health Checks
- Daily Pipeline Status: Monitor generates health reports
- Stale Issue Detection: Automatically escalates old problems
- Cascade Pipeline Monitoring: Overall system health visibility
Future Enhancements
Planned Improvements
- Batch Triggering: Group multiple rapid changes into single cascade
- Priority Queuing: Handle urgent vs. routine upstream changes differently
- Smart Scheduling: Avoid triggers during maintenance windows
- Cross-Repository Coordination: Coordinate cascades across multiple forks
Extensibility Points
- Custom Trigger Logic: Easy to add new trigger conditions
- External Integrations: Webhook support for external systems
- Advanced Error Handling: Sophisticated retry and recovery strategies
- Metrics Collection: Detailed analytics on trigger patterns
Automated Failure Recovery Pattern
The monitor implements a sophisticated failure recovery system that enables self-healing cascade workflows:
Failure State Management
# Normal cascade state progression
upstream-sync → cascade-active → validated
# Failure state progression
upstream-sync → cascade-active → cascade-failed + human-required
Recovery Detection Logic
detect-recovery-ready:
name: "🔄 Automatic Recovery - Detect Ready Retries"
steps:
- name: Check for recovery-ready issues
run: |
# Find issues with cascade-failed but NOT human-required
RECOVERY_ISSUES=$(gh issue list \
--label "cascade-failed" \
--state open \
--jq '.[] | select(.labels | contains(["cascade-failed"]) and (contains(["human-required"]) | not))')
# For each recovery-ready issue
echo "$RECOVERY_ISSUES" | jq -r '.number' | while read ISSUE_NUMBER; do
# Update labels: cascade-failed → cascade-active
gh issue edit "$ISSUE_NUMBER" \
--remove-label "cascade-failed" \
--add-label "cascade-active"
# Trigger cascade retry
gh workflow run "Cascade Integration" \
--repo ${{ github.repository }} \
-f issue_number="$ISSUE_NUMBER"
done
Human Recovery Workflow
- Failure Occurs: Cascade fails, tracking issue gets
cascade-failed + human-required - Failure Issue Created: Technical details in separate high-priority issue
- Human Investigation: Developer reviews failure issue and makes fixes
- Signal Resolution: Human removes
human-requiredlabel from tracking issue - Automatic Retry: Monitor detects label removal and retries cascade
- Success/Failure: Either completes successfully or creates new failure issue
Benefits of Label-Based Recovery
- Self-Healing: No manual workflow triggering required
- Clear Handoff: Labels signal automation ↔ human transitions
- Audit Trail: Complete failure/recovery history in tracking issues
- Robust Error Handling: Multiple failure attempts tracked separately
- Predictable Process: Developers know exactly how to signal resolution
Related Decisions
- ADR-001: Three-Branch Fork Management Strategy - Defines the cascade target branches
- ADR-005: Automated Conflict Management Strategy - Conflict handling within cascades
- ADR-008: Centralized Label Management Strategy - Label-based state management
- ADR-009: Asymmetric Cascade Review Strategy - Review requirements for cascades
Success Criteria
- 90%+ of sync merges followed by manual cascade triggers within 2 hours
- 100% of missed manual triggers detected by safety net within 6 hours
- Complete issue lifecycle tracking for 95%+ of cascades
- Conflict resolution SLA: 48 hours with automatic escalation
- Zero unexpected cascade triggers (only manual or safety net)
- Clear audit trail for all cascade decisions through issue tracking
- Team adoption: 100% of team members comfortable with manual trigger process