Risk evaluations were incorrectly returning Very High risk with blocklisted items when this was not actually the case

Incident Report for Sardine AI

Postmortem

Incorrect Very High Risk Evaluations

Date: 2025-07-24

Duration: 48 minutes

Severity: High

Impact: Multiple B2B clients affected

Executive Summary

A code deployment introduced a race condition in our risk evaluation query logic, causing legitimate sessions to be incorrectly flagged as "very_high" risk. The incident affected multiple clients for 48 minutes before detection and resolution.

Incident Details

What Happened

Sessions across multiple Sardine clients received elevated risk scores ("very_high") that did not align with their actual risk profile as determined by our checkpoint-specific rules and ML models. Root cause: a race condition where risk queries executed before the risk engine completed session evaluation, triggering fail-safe logic inappropriately.
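For illustration only, below is a minimal Go sketch of the failure mode; it is not Sardine's actual service code, and riskStore, EvaluateSession, and GetRiskLevel are hypothetical names. The key point is that the query path treats "no result yet" the same as an engine failure and falls back to the blocking default.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// riskStore stands in for the shared store the risk engine writes to and the
// query path reads from. All names here are illustrative, not Sardine's API.
type riskStore struct {
	mu     sync.RWMutex
	levels map[string]string // sessionID -> evaluated risk level
}

// EvaluateSession simulates the risk engine finishing its evaluation some
// time after the session is created.
func (s *riskStore) EvaluateSession(sessionID string) {
	time.Sleep(50 * time.Millisecond) // engine latency
	s.mu.Lock()
	s.levels[sessionID] = "low" // the session's true risk level
	s.mu.Unlock()
}

// GetRiskLevel is the query path. The fail-safe "default to block" branch is
// what fired incorrectly when a query raced ahead of the engine.
func (s *riskStore) GetRiskLevel(sessionID string) string {
	s.mu.RLock()
	level, ok := s.levels[sessionID]
	s.mu.RUnlock()
	if !ok {
		// No result yet is treated the same as an engine failure, so the
		// session is reported as very_high even though evaluation simply
		// has not completed.
		return "very_high"
	}
	return level
}

func main() {
	store := &riskStore{levels: map[string]string{}}
	go store.EvaluateSession("sess-123")        // engine still running...
	fmt.Println(store.GetRiskLevel("sess-123")) // ...query wins the race: very_high
	time.Sleep(100 * time.Millisecond)
	fmt.Println(store.GetRiskLevel("sess-123")) // after evaluation completes: low
}
```

Any query that wins the race therefore returns very_high, even though the session would have evaluated to its true risk level once the engine finished.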

Timeline (UTC)

Time  | Event                                                           | Actor
20:02 | Canary (phased) deployment of the affecting change initiated   | Engineering
20:17 | Automated promotion to full deployment                         | CI/CD Pipeline
20:19 | First client escalation - elevated risk levels reported        | Client
20:29 | Issue escalated to engineering                                 | Support Team
20:38 | Root cause confirmed, rollback initiated                       | Engineering
20:50 | Service fully restored                                         | Engineering

Key Metrics:

  • Time to Detection: 17 minutes (customer-reported)
  • Time to Resolution: 31 minutes from escalation
  • Total Impact Duration: 48 minutes

Five Whys Analysis

  1. Why were sessions incorrectly marked as very_high risk?
  • The risk query executed before the risk engine completed its evaluation, triggering the fail-safe "default to block" logic
  2. Why did the risk query execute before the risk engine completed?
  • A race condition introduced in the deployment broke the timing dependency between risk evaluation and query execution
  3. Why wasn't this race condition detected in testing?
  • The QA environment lacked sufficient session volume to reproduce the timing conditions that trigger the race condition (see the test sketch after this list)
  4. Why didn't our monitoring detect this anomaly?
  • Missing alerting thresholds for sudden increases in very_high risk session percentages per client
  5. Why didn't phased rollout catch this issue?
  • Only a subset of clients was affected due to configuration differences; clients with custom risk logic bypassed the affected code path
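The missing coverage from why #3 can be illustrated with a volume-driven concurrency test. This is a sketch that reuses the hypothetical riskStore from the example above, not a drop-in test for the real service:

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

// TestQueryRacingEvaluation reproduces the race at production-like volume:
// many sessions are evaluated and queried concurrently, and the test fails
// whenever a session is reported as very_high only because the query beat
// the engine. It builds on the hypothetical riskStore sketched earlier.
func TestQueryRacingEvaluation(t *testing.T) {
	store := &riskStore{levels: map[string]string{}}

	const sessions = 10000 // far more concurrency than the original QA runs
	var wg sync.WaitGroup
	for i := 0; i < sessions; i++ {
		sessionID := fmt.Sprintf("sess-%d", i)
		wg.Add(2)
		go func(id string) { // risk engine evaluating the session
			defer wg.Done()
			store.EvaluateSession(id)
		}(sessionID)
		go func(id string) { // client query racing the engine
			defer wg.Done()
			if level := store.GetRiskLevel(id); level == "very_high" {
				t.Errorf("session %s blocked before evaluation finished", id)
			}
		}(sessionID)
	}
	wg.Wait()
}
```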

Impact Analysis

What Went Wrong

  1. Detection Failure: No proactive monitoring detected the anomaly; detection relied on customer escalation
  2. Testing Gap: QA missed race condition scenarios due to insufficient test data volume
  3. Monitoring Gap: No alerting on risk level distribution changes
  4. Phased Launch Limitation: Configuration variance masked issue during gradual rollout

What Went Well

  1. Rapid Response: Support team escalated within 10 minutes of customer report
  2. Quick Resolution: Engineering confirmed and initiated rollback within 19 minutes
  3. Effective Rollback: Clean restoration with no data corruption

Action Items

Immediate (Week 1)

  • [ ] Owner: Engineering Lead - Implement alerting for anomalous (z-score) increases in very_high risk sessions per client (Target: 3 days); see the alerting sketch after this list
  • [ ] Owner: QA Lead - Add race condition test scenarios with production-like session volumes, along the lines of the test sketch above (Target: 5 days)
  • [ ] Provide per-client incident reports and queries for assessing impacted sessions
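As a rough sketch of the proposed z-score alert from the first action item (the function name, window handling, and threshold are placeholder assumptions, not a finalized design), the idea is to compare the current window's share of very_high sessions for a client against that client's recent baseline windows:

```go
package alerting

import "math"

// VeryHighShareAlert sketches the proposed per-client z-score check: compare
// the current window's share of very_high sessions against that client's
// recent baseline windows. Name, windowing, and threshold are placeholders.
func VeryHighShareAlert(baseline []float64, current, zThreshold float64) bool {
	if len(baseline) < 2 {
		return false // not enough history to judge an anomaly
	}

	// Mean very_high share over the baseline windows.
	var sum float64
	for _, v := range baseline {
		sum += v
	}
	mean := sum / float64(len(baseline))

	// Sample standard deviation of the baseline.
	var sq float64
	for _, v := range baseline {
		sq += (v - mean) * (v - mean)
	}
	std := math.Sqrt(sq / float64(len(baseline)-1))
	if std == 0 {
		return current > mean // flat baseline: treat any increase as suspicious
	}

	return (current-mean)/std >= zThreshold // e.g. alert at z >= 3 per client
}
```

In practice this would run per client on a short rolling window, so that a sudden jump in the very_high share pages the on-call rather than waiting for a customer report.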

Short-term (Month 1)

  • [ ] Owner: Platform Team - Review and refactor fail-safe logic to handle timing dependencies (Target: 2 weeks)
  • [ ] Owner: Engineering - Enhance unit and E2E test coverage for risk evaluation edge cases (Target: 3 weeks)
  • [ ] Owner: QA Team - Implement exploratory testing protocols for risk evaluation scenarios (Target: 4 weeks)

Medium-Term

In progress; owned by Engineering.

Prevention Measures

  1. Enhanced Monitoring: Real-time risk distribution tracking with client-level granularity
  2. Improved Testing: Production-volume simulation in staging environment
  3. Robust Fail-safes: Timeout handling and graceful degradation for race conditions (see the sketch after this list)
  4. Phased Deployment: Configuration-aware rollout strategy to ensure comprehensive coverage
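One possible shape for the fail-safe refactor in item 3 and the short-term action items, sketched under the assumption of a hypothetical RiskReader interface with placeholder grace period and retry interval: the query path retries briefly while evaluation is still in flight, and only falls back to "very_high" on a real timeout or engine error.

```go
package risk

import (
	"context"
	"errors"
	"time"
)

// ErrNotEvaluated distinguishes "the engine has not finished yet" from a
// genuine engine failure; both names here are illustrative.
var ErrNotEvaluated = errors.New("risk evaluation not yet complete")

// RiskReader stands in for whatever the query path reads results from.
type RiskReader interface {
	RiskLevel(ctx context.Context, sessionID string) (string, error)
}

// riskLevelWithGrace sketches the proposed fail-safe refactor: retry briefly
// while evaluation is still in flight, and only fall back to the blocking
// default once the grace period is exhausted or the engine actually fails.
func riskLevelWithGrace(ctx context.Context, r RiskReader, sessionID string) string {
	gctx, cancel := context.WithTimeout(ctx, 500*time.Millisecond) // grace period (placeholder)
	defer cancel()

	ticker := time.NewTicker(50 * time.Millisecond) // retry interval (placeholder)
	defer ticker.Stop()

	for {
		level, err := r.RiskLevel(gctx, sessionID)
		switch {
		case err == nil:
			return level // evaluation finished: use the real result
		case errors.Is(err, ErrNotEvaluated):
			select {
			case <-ticker.C:
				continue // still in flight: wait and retry instead of blocking
			case <-gctx.Done():
				return "very_high" // grace period exhausted: keep the safe default
			}
		default:
			return "very_high" // genuine engine failure: fail closed as before
		}
	}
}
```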

Lessons Learned

  • Customer-reported incidents indicate monitoring blind spots requiring immediate attention
  • Race conditions in distributed systems require specific testing strategies beyond happy/sad path scenarios
  • Fail-safe mechanisms must account for timing dependencies in microservices architectures

Document Owner: Engineering Team

Review Date: 2025-08-24

Distribution: Executive Team, Engineering, Support, QA

Posted Jul 25, 2025 - 17:45 UTC

Resolved

This incident has been resolved.
Posted Jul 24, 2025 - 20:53 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jul 24, 2025 - 20:50 UTC

Update

The team has identified the root cause and is working on a fix.
Posted Jul 24, 2025 - 20:43 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Jul 24, 2025 - 20:41 UTC
This incident affected: Customer APIs.