Elevated API latency

Incident Report for Sardine AI

Postmortem

Overview

On Jan 31, Sardine experienced elevated API latency due to increased database load. The /customers, /issuing/risks, and /devices APIs experienced intermittent higher latency between 8:03 and 13:16 Pacific time.

What happened

Sardine experienced an increase in traffic due to overall organic growth. Additionally, some of our clients sent us batch-based traffic that was bursty in nature. Some of our databases did not scale well enough to handle this bursty traffic.

Impact

The /customers, /issuing/risks, and /devices APIs experienced intermittent higher latency between 8:03 and 13:16 Pacific time.

Timeline (all Pacific time)

  • 8:03 AM: Incident starts
  • 8:04 AM: On-call engineer paged due to increased latency
  • 8:21 AM: Alert auto-resolved; on-call engineer concluded it was a one-off latency spike
  • Around 9:00 AM: A client reached out to Sardine with latency concerns
  • 9:00 AM - 13:00: Multiple alerts were triggered; the on-call engineer was investigating
  • 13:33: Root cause identified
  • 13:40: Scaling and configuration changes were applied (incident ends)

What went wrong

  • While an internal alert notified the on-call engineer, the issue was not escalated soon enough
  • Inefficient internal logic caused a performance bottleneck
  • Rate limiting did not provide sufficient control over the bursty traffic (an illustrative sketch follows this list)
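
To illustrate the kind of burst control the last point refers to, here is a minimal sketch of a token-bucket rate limiter using Go's golang.org/x/time/rate package. The per-client rate, the burst size, and the HTTP 429 behavior are assumptions chosen for the example, not Sardine's actual configuration.

    package main

    import (
        "fmt"

        "golang.org/x/time/rate"
    )

    func main() {
        // Hypothetical per-client limiter: sustained 50 requests/second
        // with a burst allowance of 100. These numbers are illustrative,
        // not Sardine's production settings.
        limiter := rate.NewLimiter(rate.Limit(50), 100)

        accepted, rejected := 0, 0
        // Simulate a batch client firing 500 requests at nearly the same time.
        for i := 0; i < 500; i++ {
            if limiter.Allow() {
                accepted++ // request proceeds to the backend
            } else {
                rejected++ // request would receive HTTP 429 instead of reaching the database
            }
        }
        fmt.Printf("accepted=%d rejected=%d\n", accepted, rejected)
    }

With a cap like this, a batch of 500 near-simultaneous requests would see roughly the burst allowance accepted immediately and the remainder rejected (to be retried later by the client), smoothing the load that reaches the database.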

Action items

  • Enhance internal processes around alerts and escalation
  • Fix the inefficient legacy logic to avoid similar issues in the future
Posted Jan 31, 2025 - 22:44 UTC

Resolved

This incident has been resolved.

We'll be adding a postmortem with the full description and action items here as soon as we have it.
Posted Jan 31, 2025 - 22:10 UTC

Investigating

We are currently experiencing elevated latency and are actively working to resolve the issue.
Posted Jan 31, 2025 - 21:48 UTC
This incident affected: Device APIs and Customer APIs.