Increased latency

Incident Report for Sardine AI

Postmortem

Overview

On Mar 3rd, Sardine experienced increased latency due to elevated database load. The /customers endpoint saw latency spikes of roughly 15 minutes around the start of each hour between 8:00 and 11:00 Pacific time, and the /issuing/risks endpoint experienced intermittent higher latency between 08:00 and 10:54 Pacific time.

*/customers* latency spikes:

  • 08:00 - 08:15
  • 09:00 - 09:15
  • 09:55 - 10:15
  • 11:00 - 11:15

*/issuing/risks* endpoint latency spikes:

  • 08:02 - 08:44
  • 09:42 - 10:12
  • 10:30 - 10:54

What happened

Sardine encountered a surge in traffic originating from a client. Because the spikes were intermittent and unpredictable, they were difficult to detect and assess in real time, which delayed mitigation.

Impact

The */customers* and */issuing/risks* APIs experienced intermittent higher latency between 08:11 and 10:06 Pacific time. Clients using the advanced aggregation feature were more heavily impacted.

Timeline (all Pacific time)

  • 08:02: Incident starts; latencies for both endpoints begin to rise.
  • 08:10: Oncall engineer is paged due to the increased latency.
  • 08:23: The alert auto-resolves. The oncall engineer starts digging into the root cause; latency is not yet back to normal but is at a more manageable level.
  • 09:00 - 09:35: Oncall handoff meeting. The latency issue is mentioned, but no root cause has been identified yet and latency appears to be under control.
  • 09:42: A customer notices the latency issue and reports it to the Sardine team.
  • 10:00: p95 latency becomes a sustained issue; the new oncall engineer starts investigating.
  • 10:20: Oncall identifies the queries creating the bottleneck (see the sketch after this timeline); engineers start checking whether a workaround or quick optimization could remove it, or whether scaling the database is the only option.
  • 10:40: The database is scaled up; the incident ends within a few minutes.
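
As a rough illustration of how the bottleneck queries could be surfaced (the report does not state which database or tooling Sardine uses), the sketch below assumes a PostgreSQL database with the pg_stat_statements extension enabled and a Go service; it lists the statements with the highest cumulative execution time.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // hypothetical driver choice; the report does not state Sardine's stack
)

// topQueries prints the statements with the highest cumulative execution time,
// using the pg_stat_statements extension (PostgreSQL 13+ column names).
func topQueries(db *sql.DB, limit int) error {
	rows, err := db.Query(`
		SELECT query, calls, mean_exec_time
		FROM pg_stat_statements
		ORDER BY total_exec_time DESC
		LIMIT $1`, limit)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var query string
		var calls int64
		var meanMs float64
		if err := rows.Scan(&query, &calls, &meanMs); err != nil {
			return err
		}
		fmt.Printf("%6d calls  %8.1f ms avg  %s\n", calls, meanMs, query)
	}
	return rows.Err()
}

func main() {
	// Placeholder DSN for illustration.
	db, err := sql.Open("postgres", "postgres://localhost/sardine?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := topQueries(db, 10); err != nil {
		log.Fatal(err)
	}
}
```

Sorting by total execution time quickly points at the handful of statements responsible for most of the database load, which is the kind of signal that narrowed the investigation at 10:20.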

What went wrong

  • We were slow to detect the request spikes and the queries causing the latency issues.
  • The oncall handoff happened during the incident, and the issue wasn't properly handed off:

    • The outgoing oncall engineer thought the issue was a one-off latency increase due to the spikes.
    • The incoming oncall engineer wasn't tagged on the existing threads about the issue.

Action items

  • Enforce stricter query timeouts for the issuing API (a sketch follows this list).
  • Continue query optimizations on the advanced aggregations feature to reduce the risk of this happening again.
  • Enhance internal processes around alerts and escalation.
  • Update the oncall handoff and incident handling process: if an incident occurs around an oncall handoff, both the outgoing and incoming oncall engineers will use the handoff meeting as a working session.
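
A minimal sketch of the query timeout action item, assuming a Go service using database/sql against PostgreSQL; the driver, table, column names, and the 2-second budget are illustrative, not Sardine's actual values:

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // hypothetical driver choice; the report does not state Sardine's stack
)

// aggregateWithTimeout runs an aggregation query under a hard deadline so a
// single slow query cannot occupy a connection for the length of a traffic spike.
func aggregateWithTimeout(db *sql.DB, customerID string) (int64, error) {
	// 2s is an illustrative budget, not the value chosen by Sardine.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	var count int64
	err := db.QueryRowContext(ctx,
		// Hypothetical table and column names, for illustration only.
		"SELECT count(*) FROM issuing_risks WHERE customer_id = $1",
		customerID,
	).Scan(&count)
	if err != nil {
		return 0, fmt.Errorf("aggregation query failed or timed out: %w", err)
	}
	return count, nil
}

func main() {
	// Placeholder DSN for illustration.
	db, err := sql.Open("postgres", "postgres://localhost/sardine?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if _, err := aggregateWithTimeout(db, "cust_123"); err != nil {
		log.Println(err)
	}
}
```

With a per-query deadline in place, a single expensive aggregation can no longer hold a connection for the duration of a traffic spike; the request fails fast and the database stays responsive for other traffic.
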
Posted Mar 07, 2025 - 17:23 UTC

Resolved

We were experiencing elevated latency for the customers and issuing APIs between 8:00am PT and 10:00am PT due to unusual traffic volume. The issue has been resolved.
Posted Mar 03, 2025 - 06:00 UTC