Overview
On Jan 31, Sardine experienced an increase in latency due to increased database usage. The /customers, /issuing/risks, and /devices APIs experienced intermittent elevated latency from 8:03 to 13:16 Pacific time.
What happened
Sardine experienced an increase in traffic due to overall organic traffic growth. Additionally, some of our clients sent us batch-based traffic that was bursty in nature, and some of our databases didn't scale well to handle the bursty traffic.
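For illustration only, here is a minimal sketch, assuming a Go batch job, of how bursty batch submissions can be smoothed on the sender side by pacing requests at a steady rate instead of sending an entire batch at once. The sendRecord function and the numbers are hypothetical, not Sardine's or any client's actual code.

```go
package main

import (
	"log"
	"time"
)

func sendRecord(id int) error {
	// Placeholder for the real API call, e.g. a POST to /customers.
	return nil
}

func main() {
	records := make([]int, 1000) // stand-in for a 1,000-record batch

	// One request every 40ms, roughly 25 requests/second, instead of
	// firing the whole batch back-to-back.
	tick := time.NewTicker(40 * time.Millisecond)
	defer tick.Stop()

	for i := range records {
		<-tick.C // wait for the next slot before sending
		if err := sendRecord(i); err != nil {
			log.Printf("record %d failed: %v", i, err)
		}
	}
}
```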
Impact
The /customers, /issuing/risks, and /devices APIs experienced intermittent elevated latency from 8:03 to 13:16 Pacific time.
Timeline (all Pacific time)
- 8:03: Incident starts
- 8:04: Oncall engineer paged due to the increased latency
- 8:21: Alert auto-resolved; the oncall engineer concluded it was a one-off latency spike
- Around 9:00: A client reached out to Sardine with latency concerns
- 9:00 - 13:00: Multiple alerts were triggered; the oncall engineer was investigating
- 13:33: Root cause identified
- 13:40: Scaling and config changes were performed (incident ends)
What went wrong
- While an internal alert notified the oncall engineer, it wasn't escalated soon enough
- Inefficient internal logic caused a performance bottleneck
- Rate limiting didn't provide sufficient control over the bursty per-client traffic (see the sketch after this list)
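As a sketch of what tighter control could look like, the example below shows per-client token-bucket limiting, assuming a Go service and the golang.org/x/time/rate package. The X-Client-Id header, the limits, and the middleware wiring are illustrative assumptions, not Sardine's actual rate-limit implementation.

```go
package main

import (
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// clientLimiters keeps one token bucket per API client so a single client's
// burst cannot consume shared database capacity.
type clientLimiters struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
}

func (c *clientLimiters) get(clientID string) *rate.Limiter {
	c.mu.Lock()
	defer c.mu.Unlock()
	l, ok := c.limiters[clientID]
	if !ok {
		// Example numbers only: sustained 50 req/s with a burst allowance of 20.
		l = rate.NewLimiter(rate.Limit(50), 20)
		c.limiters[clientID] = l
	}
	return l
}

// middleware rejects requests that exceed the caller's bucket with HTTP 429.
func (c *clientLimiters) middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		clientID := r.Header.Get("X-Client-Id") // hypothetical client identifier
		if !c.get(clientID).Allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	limits := &clientLimiters{limiters: map[string]*rate.Limiter{}}
	mux := http.NewServeMux()
	mux.HandleFunc("/devices", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", limits.middleware(mux))
}
```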
Action items
- Enhance the internal process around alerts and escalation
- Fix the inefficient old logic to avoid similar issues in the future