Overview
On Mar 3rd, Sardine experienced increased latency due to elevated database usage. The */customers* endpoint saw latency spikes around the start of each hour, lasting roughly 15 minutes, between 08:00 and 11:15 Pacific time. The */issuing/risks* endpoint experienced intermittent higher latency between 08:00 and 10:54 Pacific time.
*/customers* latency spikes:
- 08:00 - 08:15
- 09:00 - 09:15
- 09:55 - 10:15
- 11:00 - 11:15
*/issuing/risks* latency spikes:
- 08:02 - 08:44
- 09:42 - 10:12
- 10:30 - 10:54
What happened
Sardine encountered a surge in traffic originating from a client. The spikes were intermittent and unpredictable, which made them difficult to detect and assess in real time and delayed mitigation.
Impact
*The /customers and /issuing/risks APIs experienced intermittent higher latency during 08:11 - 10:06 Pacific time. Clients using the advanced aggregation feature were more heavily impacted.*
Timeline (all Pacific time)
- 08:02: Incident started; latency for both endpoints began to climb.
- 08:10: On-call engineer was paged due to the increased latency.
- 08:23: The alert auto-resolved. The on-call engineer began digging into the root cause; latency was not yet back to normal but was more manageable.
- 09:00 - 09:35: On-call handoff meeting. The latency issue was mentioned; no root cause had been found yet, but latency appeared to be under control.
- 09:42: A customer noticed the latency issue and reported it to the Sardine team.
- 10:00: p95 latency became a sustained issue; the new on-call engineer started investigating.
- 10:20: The on-call engineer identified the queries creating the bottleneck. Engineers began checking whether a bypass or quick optimization could remove the bottleneck, or whether scaling the database was the only option.
- 10:40: The database was scaled up; the incident ended within a few minutes.
What went wrong
The traffic surge drove expensive advanced-aggregation queries that bottlenecked the database, raising latency on the */customers* and */issuing/risks* endpoints. The initial alert auto-resolved before a root cause was found, and the on-call handoff further delayed the investigation, so the database was not scaled up until 10:40.
Action items
- Enforce stricter query timeouts for the issuing API (see the sketch after this list)
- Continue query optimizations on the advanced aggregations feature to reduce the risk of this happening again
- Enhance internal processes around alerting and escalation
- Update the on-call handoff and incident-handling process: if an incident occurs around an on-call handoff, both the outgoing and incoming on-call engineers will use the handoff meeting as a working session
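A minimal sketch of the stricter query-timeout idea, assuming a Go service querying Postgres through database/sql; the function name, table, column, and 2-second budget are illustrative placeholders, not Sardine's actual code. With a context deadline, a driver such as lib/pq cancels the in-flight query server-side when the timeout expires, so a runaway aggregation releases its connection instead of compounding the backlog.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/lib/pq" // assumed Postgres driver
)

// queryIssuingRisk runs a hypothetical issuing-risk lookup with a hard
// per-query timeout so one expensive query cannot hold a DB connection
// indefinitely. The 2-second budget is an illustrative value.
func queryIssuingRisk(ctx context.Context, db *sql.DB, customerID string) (float64, error) {
	// Enforce a per-query deadline; the driver cancels the query when the
	// context expires, returning context.DeadlineExceeded to the caller.
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	var score float64
	err := db.QueryRowContext(ctx,
		`SELECT risk_score FROM issuing_risk WHERE customer_id = $1`,
		customerID,
	).Scan(&score)
	if err != nil {
		return 0, fmt.Errorf("issuing risk query: %w", err)
	}
	return score, nil
}
```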