Latency Spike

Incident Report for Sardine AI

Postmortem

Incident Overview

  • Duration: 2024-06-24 1:10AM to 11:29PM UTC
  • Services Affected: Customers API
  • Impact: elevated latency

Root Cause Analysis

  • Primary Issue - a single account consumed a disproportionate share of the shared client processing resources for its real-time feature computations
  • Second Factor - an increase in volume from this client over the preceding few days drove resource consumption even higher
  • Third Factor - a code release that moved those feature aggregations onto parallel threads further exacerbated the issue
  • Technical Impact - elevated API latency for all Sardine clients
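
To illustrate the failure mode described above, here is a minimal sketch of one common mitigation: capping how many shared worker threads a single account may occupy at once. It is not Sardine's implementation. The names PerAccountLimiter and submit_aggregation, the account IDs, and the limit values are all hypothetical and chosen only for illustration.

```python
import threading
from collections import defaultdict
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional


class PerAccountLimiter:
    """Hypothetical helper: caps in-flight tasks per account on a shared pool."""

    def __init__(self, max_per_account: int) -> None:
        self._max = max_per_account
        self._lock = threading.Lock()
        self._in_flight = defaultdict(int)  # account_id -> running task count

    def try_acquire(self, account_id: str) -> bool:
        """Reserve one slot for the account; return False if it is at its cap."""
        with self._lock:
            if self._in_flight[account_id] >= self._max:
                return False
            self._in_flight[account_id] += 1
            return True

    def release(self, account_id: str) -> None:
        """Give back one slot previously reserved with try_acquire."""
        with self._lock:
            self._in_flight[account_id] -= 1
            if self._in_flight[account_id] <= 0:
                del self._in_flight[account_id]


def submit_aggregation(
    pool: ThreadPoolExecutor,
    limiter: PerAccountLimiter,
    account_id: str,
    task: Callable[..., object],
    *args: object,
) -> Optional[Future]:
    """Submit a task to the shared pool unless the account is over its cap.

    Returns None when the account has no free slot, so the caller can
    queue the work, run it serially, or shed it.
    """
    if not limiter.try_acquire(account_id):
        return None
    try:
        future = pool.submit(task, *args)
    except BaseException:
        limiter.release(account_id)  # do not leak the slot if submit fails
        raise
    # Free the slot when the task finishes, whether it succeeded or raised.
    future.add_done_callback(lambda _f: limiter.release(account_id))
    return future
```

With a cap like this, a volume increase from one account fills only that account's slots, and parallelizing its aggregations cannot exceed the cap, so other clients keep access to the remaining threads.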

Action Items

  • Review feature processing logic to optimize resource consumption
  • Update internal latency alerts to ensure earlier notification, investigation, and resolution
  • Review client incident notification process to ensure correct communication of latency incidents
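
As a rough illustration of the latency-alert action item, the sketch below evaluates a rolling percentile of request latencies against a threshold. It does not describe Sardine's alerting stack. The LatencyAlert class, the 95th-percentile choice, the window size, and the threshold are all assumptions made for the example.

```python
import math
from collections import deque
from typing import Iterable


def percentile(samples: Iterable[float], pct: float) -> float:
    """Nearest-rank percentile of the samples (pct in the range 0-100)."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("percentile() needs at least one sample")
    # Nearest-rank method: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


class LatencyAlert:
    """Hypothetical rolling-window check: fires when a latency percentile exceeds a threshold."""

    def __init__(
        self,
        threshold_ms: float,
        window_size: int,
        pct: float = 95.0,
        min_samples: int = 20,
    ) -> None:
        self._threshold_ms = threshold_ms
        self._pct = pct
        self._min_samples = min_samples
        self._window = deque(maxlen=window_size)  # oldest samples drop off automatically

    def observe(self, latency_ms: float) -> bool:
        """Record one request latency; return True if the alert should fire."""
        self._window.append(latency_ms)
        if len(self._window) < self._min_samples:
            return False  # too little data to judge, avoids noisy early alerts
        return percentile(self._window, self._pct) > self._threshold_ms
```

A shorter window or a lower threshold produces earlier notification at the cost of more false alarms. The right trade-off depends on the service's normal latency profile, which this report does not specify.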
Posted Jun 28, 2024 - 15:06 UTC

Resolved

Incident Overview

- Duration: 2024-06-24 1:10AM to 11:29PM UTC
- Services Affected: Customers API
- Impact: elevated latency

Please see the Postmortem for details.
Posted Jun 24, 2024 - 04:00 UTC