Overview
On Mar 3rd, Sardine experienced increased latency due to elevated database usage. The */customers* endpoint saw latency spikes around the start of each hour, lasting roughly 15 minutes, between 08:00 and 11:15 Pacific time. The */issuing/risks* endpoint experienced intermittent higher latency between 08:00 and 10:54 Pacific time.
*/customers* latency spikes:
- 08:00 - 08:15
- 09:00 - 09:15
- 09:55 - 10:15
- 11:00 - 11:15
*/issuing/risks* latency spikes:
- 08:02 - 08:44
- 09:42 - 10:12
- 10:30 - 10:54
What happened
Sardine encountered a surge in traffic originating from a client. The spikes were intermittent and unpredictable, which made them difficult to detect and assess in real time and delayed mitigation.
Impact
*The /customers and /issuing/risks APIs experienced intermittent higher latency during 08:11 - 10:06 Pacific time. Clients using the advanced aggregation feature were more heavily impacted.*
Timeline (all Pacific time)
- 08:02: Incident started; latency for both endpoints began to climb.
- 08:10: On-call engineer was paged due to the increased latency.
- 08:23: The alert auto-resolved. The on-call engineer began digging into the root cause; latency was not yet back to normal but was more manageable.
- 09:00 - 09:35: On-call handoff meeting. The latency issue was mentioned; no root cause had been found yet, but latency appeared to be under control.
- 09:42: A customer noticed the latency issue and reported it to the Sardine team.
- 10:00: p95 latency became a sustained issue; the new on-call engineer started investigating.
- 10:20: The on-call engineer identified the queries creating the bottleneck. Engineers began checking whether a bypass or quick optimization could remove the bottleneck, or whether scaling the database was the only option.
- 10:40: The database was scaled up; the incident ended within a few minutes.
What went wrong
The traffic surge drove expensive advanced-aggregation queries that bottlenecked the database, raising latency on the */customers* and */issuing/risks* endpoints. The initial alert auto-resolved before a root cause was found, and the on-call handoff further delayed the investigation, so the database was not scaled up until 10:40.
Action items
- Enforce stricter query timeouts for the issuing API (see the sketch after this list)
- Continue query optimizations on the advanced aggregations feature to reduce the risk of this happening again
- Enhance internal processes around alerting and escalation
- Update the on-call handoff and incident-handling process: if an incident occurs around an on-call handoff, both the outgoing and incoming on-call engineers will use the handoff meeting as a working session
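A minimal sketch of the stricter query-timeout idea, assuming a Go service querying Postgres through database/sql; the function name, table, column, and 2-second budget are illustrative placeholders, not Sardine's actual code. With a context deadline, a driver such as lib/pq cancels the in-flight query server-side when the timeout expires, so a runaway aggregation releases its connection instead of compounding the backlog.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/lib/pq" // assumed Postgres driver
)

// queryIssuingRisk runs a hypothetical issuing-risk lookup with a hard
// per-query timeout so one expensive query cannot hold a DB connection
// indefinitely. The 2-second budget is an illustrative value.
func queryIssuingRisk(ctx context.Context, db *sql.DB, customerID string) (float64, error) {
	// Enforce a per-query deadline; the driver cancels the query when the
	// context expires, returning context.DeadlineExceeded to the caller.
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	var score float64
	err := db.QueryRowContext(ctx,
		`SELECT risk_score FROM issuing_risk WHERE customer_id = $1`,
		customerID,
	).Scan(&score)
	if err != nil {
		return 0, fmt.Errorf("issuing risk query: %w", err)
	}
	return score, nil
}
```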