Incorrect Very High Risk Evaluations
Date: 2025-07-24
Duration: 48 minutes
Severity: High
Impact: Multiple B2B clients affected
Executive Summary
A code deployment introduced a race condition in our risk evaluation query logic, causing legitimate sessions to be incorrectly flagged as "very_high" risk. The incident affected multiple clients for 48 minutes from deployment to full restoration; detection came via customer report 17 minutes in.
Incident Details
What Happened
Sessions across multiple Sardine clients received elevated risk scores ("very_high") that did not align with their actual risk profile as determined by our checkpoint-specific rules and ML models. Root cause: a race condition where risk queries executed before the risk engine completed session evaluation, triggering fail-safe logic inappropriately.
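The failure mode described above can be illustrated with a minimal, single-process sketch. All names (`evaluations`, `risk_engine_evaluate`, `get_session_risk`) are hypothetical stand-ins, not our actual service code; the real system is a distributed service, but the ordering bug is the same.

```python
import threading
import time

# In-memory stand-in for the risk score store.
evaluations = {}

def risk_engine_evaluate(session_id: str) -> None:
    """Simulates the risk engine: evaluation takes nonzero time to complete."""
    time.sleep(0.05)  # model evaluation latency
    evaluations[session_id] = "low"

def get_session_risk(session_id: str) -> str:
    """The buggy query path: if no evaluation exists yet, the fail-safe
    'default to block' logic fires immediately and returns very_high."""
    return evaluations.get(session_id, "very_high")

# The race: the query arrives before the engine finishes.
engine = threading.Thread(target=risk_engine_evaluate, args=("s1",))
engine.start()
too_early = get_session_risk("s1")   # fail-safe fires: "very_high"
engine.join()
after = get_session_risk("s1")       # real score: "low"
print(too_early, after)
```

The fail-safe itself is reasonable; the bug is that "no score yet" and "evaluation still in flight" were indistinguishable to the query path.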
Timeline (UTC)
| Time | Event | Actor |
| --- | --- | --- |
| 20:02 | Canary (phased) deployment initiated with the affecting change | Engineering |
| 20:17 | Automated promotion to full deployment | CI/CD Pipeline |
| 20:19 | First client escalation: elevated risk levels reported | Client |
| 20:29 | Issue escalated to engineering | Support Team |
| 20:38 | Root cause confirmed, rollback initiated | Engineering |
| 20:50 | Service fully restored | Engineering |
Key Metrics:
- Time to Detection: 17 minutes (customer-reported)
- Time to Resolution: 31 minutes from first client escalation
- Total Impact Duration: 48 minutes
Five Whys Analysis
1. Why were sessions incorrectly marked as very_high risk?
   - The risk query executed before the risk engine completed its evaluation, triggering the fail-safe "default to block" logic.
2. Why did the risk query execute before the risk engine completed?
   - The deployment introduced a race condition that broke the timing dependency between risk evaluation and query execution.
3. Why wasn't this race condition detected in testing?
   - The QA environment lacked sufficient session volume to reproduce the timing conditions that trigger the race.
4. Why didn't our monitoring detect this anomaly?
   - No alerting thresholds existed for sudden increases in the per-client percentage of very_high risk sessions.
5. Why didn't the phased rollout catch this issue?
   - Only a subset of clients was affected due to configuration differences; clients with custom risk logic bypassed the affected code path.
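The testing gap does not have to be closed with production-scale volume alone: a race condition can be reproduced deterministically by holding the engine at an explicit barrier so the query provably arrives first. A minimal sketch (all names hypothetical, and `query` here shows the desired wait-then-fail-safe behavior under test):

```python
import threading

# Barriers let the test force the problematic interleaving on demand.
engine_may_run = threading.Event()
evaluation_done = threading.Event()
store = {}

def engine(session_id: str) -> None:
    engine_may_run.wait()               # held back until the test releases it
    store[session_id] = "low"
    evaluation_done.set()

def query(session_id: str, timeout: float = 2.0) -> str:
    # Desired behavior: wait briefly for the evaluation, and fall back
    # to "very_high" only if the timeout expires.
    if evaluation_done.wait(timeout):
        return store.get(session_id, "very_high")
    return "very_high"

results = []
t_engine = threading.Thread(target=engine, args=("s1",))
t_query = threading.Thread(target=lambda: results.append(query("s1")))
t_engine.start()
t_query.start()                         # query issued before evaluation runs
engine_may_run.set()                    # now let the engine finish
t_query.join()
t_engine.join()
assert results == ["low"], results      # the race no longer defaults to block
```

A test like this would have failed against the deployed query logic regardless of QA session volume.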
Impact Analysis
What Went Wrong
- Detection Failure: No proactive monitoring detected the anomaly; detection relied entirely on customer escalation
- Testing Gap: QA missed race condition scenarios due to insufficient test data volume
- Monitoring Gap: No alerting on risk level distribution changes
- Phased Launch Limitation: Configuration variance masked issue during gradual rollout
What Went Well
- Rapid Response: Support team escalated within 10 minutes of customer report
- Quick Resolution: Engineering confirmed and initiated rollback within 19 minutes
- Effective Rollback: Clean restoration with no data corruption
Action Items
Immediate (Week 1)
- [ ] Owner: Engineering Lead - Implement alerting for anomalous (zscore) increase in very_high risk sessions per client (Target: 3 days)
- [ ] Owner: QA Lead - Add race condition test scenarios with production-like session volumes (Target: 5 days)
- [ ] Prepare per-client incident reports and queries for assessing impacted sessions
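The z-score alerting item could be prototyped along these lines. This is a sketch only: the window size, minimum baseline, and threshold of 3.0 are illustrative assumptions, and `VeryHighRateAlert` is a hypothetical name.

```python
from collections import deque
from statistics import mean, stdev

class VeryHighRateAlert:
    """Tracks the per-client share of very_high sessions over a rolling
    window and alerts when the latest rate is a z-score outlier."""

    def __init__(self, window: int = 60, z_threshold: float = 3.0):
        self.window = window
        self.z_threshold = z_threshold
        self.history = {}  # client_id -> deque of recent very_high rates

    def observe(self, client_id: str, very_high: int, total: int) -> bool:
        """Record one interval's rate; return True if it should alert."""
        rate = very_high / total if total else 0.0
        hist = self.history.setdefault(client_id, deque(maxlen=self.window))
        alert = False
        if len(hist) >= 10:  # require a baseline before alerting
            mu, sigma = mean(hist), stdev(hist)
            if sigma == 0:
                alert = rate > mu  # any jump off a perfectly flat baseline
            else:
                alert = (rate - mu) / sigma > self.z_threshold
        hist.append(rate)
        return alert

detector = VeryHighRateAlert()
# ~2% very_high is this client's normal baseline; no alerts fire.
baseline_alerts = [detector.observe("client_a", 2, 100) for _ in range(30)]
# A sudden jump to 40% very_high triggers the alert.
spiked = detector.observe("client_a", 40, 100)
print(any(baseline_alerts), spiked)
```

Tracking the distribution per client matters here because, as the Five Whys showed, only a subset of clients was affected; a global rate could have averaged the anomaly away.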
Short-term (Month 1)
- [ ] Owner: Platform Team - Review and refactor fail-safe logic to handle timing dependencies (Target: 2 weeks)
- [ ] Owner: Engineering - Enhance unit and E2E test coverage for risk evaluation edge cases (Target: 3 weeks)
- [ ] Owner: QA Team - Implement exploratory testing protocols for risk evaluation scenarios (Target: 4 weeks)
Medium-Term
- In progress; ownership: Engineering
Prevention Measures
- Enhanced Monitoring: Real-time risk distribution tracking with client-level granularity
- Improved Testing: Production-volume simulation in staging environment
- Robust Fail-safes: Timeout handling and graceful degradation for race conditions
- Phased Deployment: Configuration-aware rollout strategy to ensure comprehensive coverage
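The "Robust Fail-safes" measure amounts to making the query path timing-aware: wait briefly for the engine instead of defaulting to block the instant no score exists, and tag fail-safe results so downstream consumers can tell them apart from real scores. A minimal sketch under those assumptions (`RiskStore`, `publish`, and the 0.2s timeout are all hypothetical):

```python
import threading
from dataclasses import dataclass

@dataclass
class RiskResult:
    level: str
    from_failsafe: bool  # lets downstream distinguish a real score

class RiskStore:
    """Timing-aware fail-safe: query waits up to a deadline for the
    engine, and degrades to 'very_high' only after the timeout."""

    def __init__(self):
        self._scores = {}
        self._ready = {}  # session_id -> threading.Event

    def _event(self, session_id: str) -> threading.Event:
        return self._ready.setdefault(session_id, threading.Event())

    def publish(self, session_id: str, level: str) -> None:
        """Called by the risk engine once evaluation completes."""
        self._scores[session_id] = level
        self._event(session_id).set()

    def query(self, session_id: str, timeout: float = 0.2) -> RiskResult:
        if self._event(session_id).wait(timeout):
            return RiskResult(self._scores[session_id], from_failsafe=False)
        # Graceful degradation: block only after the deadline passes.
        return RiskResult("very_high", from_failsafe=True)

risk_store = RiskStore()
risk_store.publish("s1", "low")
real = risk_store.query("s1")                       # completed evaluation
degraded = risk_store.query("missing", timeout=0.05)  # timeout path
print(real.level, degraded.level, degraded.from_failsafe)
```

The `from_failsafe` flag also gives the new monitoring a clean signal: a spike in fail-safe responses is itself alertable, independent of risk-level distributions.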
Lessons Learned
- Customer-reported incidents indicate monitoring blind spots requiring immediate attention
- Race conditions in distributed systems require specific testing strategies beyond happy/sad path scenarios
- Fail-safe mechanisms must account for timing dependencies in microservices architectures
Document Owner: Engineering Team
Review Date: 2025-08-24
Distribution: Executive Team, Engineering, Support, QA