Incorrect Very High Risk Evaluations
Date: 2025-07-24
Duration: 48 minutes
Severity: High
Impact: Multiple B2B clients affected
Executive Summary
A code deployment introduced a race condition in our risk evaluation query logic, causing legitimate sessions to be incorrectly flagged as "very_high" risk. The incident affected multiple clients for 48 minutes from deployment to full restoration; detection came via customer report 17 minutes in.
Incident Details
What Happened
Sessions across multiple Sardine clients received elevated risk scores ("very_high") that did not align with their actual risk profile as determined by our checkpoint-specific rules and ML models. Root cause: a race condition where risk queries executed before the risk engine completed session evaluation, triggering fail-safe logic inappropriately.
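The failure mode described above can be illustrated with a minimal, single-process sketch. All names (`evaluations`, `risk_engine_evaluate`, `get_session_risk`) are hypothetical stand-ins, not our actual service code; the real system is a distributed service, but the ordering bug is the same.

```python
import threading
import time

# In-memory stand-in for the risk score store.
evaluations = {}

def risk_engine_evaluate(session_id: str) -> None:
    """Simulates the risk engine: evaluation takes nonzero time to complete."""
    time.sleep(0.05)  # model evaluation latency
    evaluations[session_id] = "low"

def get_session_risk(session_id: str) -> str:
    """The buggy query path: if no evaluation exists yet, the fail-safe
    'default to block' logic fires immediately and returns very_high."""
    return evaluations.get(session_id, "very_high")

# The race: the query arrives before the engine finishes.
engine = threading.Thread(target=risk_engine_evaluate, args=("s1",))
engine.start()
too_early = get_session_risk("s1")   # fail-safe fires: "very_high"
engine.join()
after = get_session_risk("s1")       # real score: "low"
print(too_early, after)
```

The fail-safe itself is reasonable; the bug is that "no score yet" and "evaluation still in flight" were indistinguishable to the query path.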
Timeline (UTC)
| Time | Event | Actor |
| --- | --- | --- |
| 20:02 | Canary (phased) deployment initiated with the affecting change | Engineering |
| 20:17 | Automated promotion to full deployment | CI/CD Pipeline |
| 20:19 | First client escalation: elevated risk levels reported | Client |
| 20:29 | Issue escalated to engineering | Support Team |
| 20:38 | Root cause confirmed, rollback initiated | Engineering |
| 20:50 | Service fully restored | Engineering |
Key Metrics:
- Time to Detection: 17 minutes (customer-reported)
- Time to Resolution: 31 minutes from first client escalation
- Total Impact Duration: 48 minutes
Five Whys Analysis
1. Why were sessions incorrectly marked as very_high risk?
   - The risk query executed before the risk engine completed its evaluation, triggering the fail-safe "default to block" logic.
2. Why did the risk query execute before the risk engine completed?
   - The deployment introduced a race condition that broke the timing dependency between risk evaluation and query execution.
3. Why wasn't this race condition detected in testing?
   - The QA environment lacked sufficient session volume to reproduce the timing conditions that trigger the race.
4. Why didn't our monitoring detect this anomaly?
   - No alerting thresholds existed for sudden increases in the per-client percentage of very_high risk sessions.
5. Why didn't the phased rollout catch this issue?
   - Only a subset of clients was affected due to configuration differences; clients with custom risk logic bypassed the affected code path.
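The testing gap does not have to be closed with production-scale volume alone: a race condition can be reproduced deterministically by holding the engine at an explicit barrier so the query provably arrives first. A minimal sketch (all names hypothetical, and `query` here shows the desired wait-then-fail-safe behavior under test):

```python
import threading

# Barriers let the test force the problematic interleaving on demand.
engine_may_run = threading.Event()
evaluation_done = threading.Event()
store = {}

def engine(session_id: str) -> None:
    engine_may_run.wait()               # held back until the test releases it
    store[session_id] = "low"
    evaluation_done.set()

def query(session_id: str, timeout: float = 2.0) -> str:
    # Desired behavior: wait briefly for the evaluation, and fall back
    # to "very_high" only if the timeout expires.
    if evaluation_done.wait(timeout):
        return store.get(session_id, "very_high")
    return "very_high"

results = []
t_engine = threading.Thread(target=engine, args=("s1",))
t_query = threading.Thread(target=lambda: results.append(query("s1")))
t_engine.start()
t_query.start()                         # query issued before evaluation runs
engine_may_run.set()                    # now let the engine finish
t_query.join()
t_engine.join()
assert results == ["low"], results      # the race no longer defaults to block
```

A test like this would have failed against the deployed query logic regardless of QA session volume.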
Impact Analysis
What Went Wrong
- Detection Failure: No proactive monitoring detected the anomaly; detection relied entirely on customer escalation
- Testing Gap: QA missed race condition scenarios due to insufficient test data volume
- Monitoring Gap: No alerting on risk level distribution changes
- Phased Launch Limitation: Configuration variance masked issue during gradual rollout
What Went Well
- Rapid Response: Support team escalated within 10 minutes of customer report
- Quick Resolution: Engineering confirmed and initiated rollback within 19 minutes
- Effective Rollback: Clean restoration with no data corruption
Action Items
Immediate (Week 1)
- [ ] Owner: Engineering Lead - Implement alerting for anomalous (zscore) increase in very_high risk sessions per client (Target: 3 days)
- [ ] Owner: QA Lead - Add race condition test scenarios with production-like session volumes (Target: 5 days)
- [ ] Prepare per-client incident reports and queries for assessing impacted sessions
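The z-score alerting item could be prototyped along these lines. This is a sketch only: the window size, minimum baseline, and threshold of 3.0 are illustrative assumptions, and `VeryHighRateAlert` is a hypothetical name.

```python
from collections import deque
from statistics import mean, stdev

class VeryHighRateAlert:
    """Tracks the per-client share of very_high sessions over a rolling
    window and alerts when the latest rate is a z-score outlier."""

    def __init__(self, window: int = 60, z_threshold: float = 3.0):
        self.window = window
        self.z_threshold = z_threshold
        self.history = {}  # client_id -> deque of recent very_high rates

    def observe(self, client_id: str, very_high: int, total: int) -> bool:
        """Record one interval's rate; return True if it should alert."""
        rate = very_high / total if total else 0.0
        hist = self.history.setdefault(client_id, deque(maxlen=self.window))
        alert = False
        if len(hist) >= 10:  # require a baseline before alerting
            mu, sigma = mean(hist), stdev(hist)
            if sigma == 0:
                alert = rate > mu  # any jump off a perfectly flat baseline
            else:
                alert = (rate - mu) / sigma > self.z_threshold
        hist.append(rate)
        return alert

detector = VeryHighRateAlert()
# ~2% very_high is this client's normal baseline; no alerts fire.
baseline_alerts = [detector.observe("client_a", 2, 100) for _ in range(30)]
# A sudden jump to 40% very_high triggers the alert.
spiked = detector.observe("client_a", 40, 100)
print(any(baseline_alerts), spiked)
```

Tracking the distribution per client matters here because, as the Five Whys showed, only a subset of clients was affected; a global rate could have averaged the anomaly away.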
Short-term (Month 1)
- [ ] Owner: Platform Team - Review and refactor fail-safe logic to handle timing dependencies (Target: 2 weeks)
- [ ] Owner: Engineering - Enhance unit and E2E test coverage for risk evaluation edge cases (Target: 3 weeks)
- [ ] Owner: QA Team - Implement exploratory testing protocols for risk evaluation scenarios (Target: 4 weeks)
Medium-Term
- In progress; ownership: Engineering
Prevention Measures
- Enhanced Monitoring: Real-time risk distribution tracking with client-level granularity
- Improved Testing: Production-volume simulation in staging environment
- Robust Fail-safes: Timeout handling and graceful degradation for race conditions
- Phased Deployment: Configuration-aware rollout strategy to ensure comprehensive coverage
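The "Robust Fail-safes" measure amounts to making the query path timing-aware: wait briefly for the engine instead of defaulting to block the instant no score exists, and tag fail-safe results so downstream consumers can tell them apart from real scores. A minimal sketch under those assumptions (`RiskStore`, `publish`, and the 0.2s timeout are all hypothetical):

```python
import threading
from dataclasses import dataclass

@dataclass
class RiskResult:
    level: str
    from_failsafe: bool  # lets downstream distinguish a real score

class RiskStore:
    """Timing-aware fail-safe: query waits up to a deadline for the
    engine, and degrades to 'very_high' only after the timeout."""

    def __init__(self):
        self._scores = {}
        self._ready = {}  # session_id -> threading.Event

    def _event(self, session_id: str) -> threading.Event:
        return self._ready.setdefault(session_id, threading.Event())

    def publish(self, session_id: str, level: str) -> None:
        """Called by the risk engine once evaluation completes."""
        self._scores[session_id] = level
        self._event(session_id).set()

    def query(self, session_id: str, timeout: float = 0.2) -> RiskResult:
        if self._event(session_id).wait(timeout):
            return RiskResult(self._scores[session_id], from_failsafe=False)
        # Graceful degradation: block only after the deadline passes.
        return RiskResult("very_high", from_failsafe=True)

risk_store = RiskStore()
risk_store.publish("s1", "low")
real = risk_store.query("s1")                       # completed evaluation
degraded = risk_store.query("missing", timeout=0.05)  # timeout path
print(real.level, degraded.level, degraded.from_failsafe)
```

The `from_failsafe` flag also gives the new monitoring a clean signal: a spike in fail-safe responses is itself alertable, independent of risk-level distributions.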
Lessons Learned
- Customer-reported incidents indicate monitoring blind spots requiring immediate attention
- Race conditions in distributed systems require specific testing strategies beyond happy/sad path scenarios
- Fail-safe mechanisms must account for timing dependencies in microservices architectures
Document Owner: Engineering Team
Review Date: 2025-08-24
Distribution: Executive Team, Engineering, Support, QA