Partial downtime

Incident Report for Sardine AI

Postmortem

- ALL TIMES ARE PST

Overview

On Apr 19 and Apr 23, Sardine experienced increased latency caused by a large spike in traffic. The /customers, /issuing/risks, /feedbacks and /devices APIs were affected during the following windows (Pacific Time):

08:34-08:46, 08:51-08:53 Apr 19

08:13-08:16, 08:37-08:42, 08:54-08:58, 09:11-09:16 Apr 23

What happened

Sardine experienced a large spike in traffic. Although rate limiting and autoscaling were in place, the system was overloaded, which degraded performance.

Impact

The /customers, /issuing/risks, /feedbacks and /devices APIs experienced increased latency.

Timeline

| Date | Status |
| --- | --- |
| April 19 2025, 0834hrs - 0846hrs | Risk and Device APIs experienced increased latency. We started manually scaling Nginx horizontally and vertically. (The autoscaler was in place, but we scaled manually to speed up the response.) |
| April 19 2025, 0846hrs - 0851hrs | All APIs were back up. |
| April 19 2025, 0851hrs - 0853hrs | Risk and Device APIs experienced increased latency. |
| April 19 2025, 0853hrs onwards | All APIs were back up. No further issues. |
| April 23 2025, 0813hrs - 0816hrs | Risk and Device APIs experienced increased latency. Sardine engineers analyzed the traffic and started enabling some rate limit rules. |
| April 23 2025, 0837hrs - 0842hrs | Risk and Device APIs experienced increased latency. Sardine engineers analyzed the traffic and hardened the rate limit rules. Sardine engineers also started to scale up our Nginx servers vertically. |
| April 23 2025, 0854hrs - 0858hrs | Risk and Device APIs experienced increased latency. Sardine engineers analyzed the traffic and hardened the rate limit rules. A ban rate limit rule was also set (it took effect at 0911hrs). |
| April 23 2025, 0911hrs - 0916hrs | Risk and Device APIs were severely rate limited due to a misconfiguration. |
| April 23 2025, 0916hrs onwards | The misconfiguration was removed and systems came back online. |

What we’re doing to prevent future issues

  • We have enhanced our rate limiting system and updated our automated mitigation setup so that similar traffic will be blocked automatically in the future.
  • We have also created a new web application framework configuration to ensure that such traffic spikes are handled properly.
  • We are also creating dedicated instances to better handle such spikes.
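
For readers unfamiliar with the terms used above, the following is a minimal, generic sketch of a per-client rate limit combined with a temporary ban rule. It is not Sardine's implementation, which this report does not describe. The class names, the token bucket algorithm, and all parameter values (`capacity`, `rate`, `ban_after`, `ban_seconds`) are assumptions chosen for illustration only.

```python
class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float, now: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity  # start with a full bucket
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RateLimiter:
    """Per-client token buckets plus a temporary ban after repeated rejections.

    All thresholds here are hypothetical, not values from the incident.
    """

    def __init__(self, capacity: float, rate: float, ban_after: int, ban_seconds: float):
        self.capacity = capacity
        self.rate = rate
        self.ban_after = ban_after      # consecutive rejections that trigger a ban
        self.ban_seconds = ban_seconds  # how long a ban lasts
        self.buckets = {}
        self.rejections = {}
        self.banned_until = {}

    def allow(self, client: str, now: float) -> bool:
        # A banned client is rejected outright until the ban expires.
        if now < self.banned_until.get(client, 0.0):
            return False
        if client not in self.buckets:
            self.buckets[client] = TokenBucket(self.capacity, self.rate, now)
        if self.buckets[client].allow(now):
            self.rejections[client] = 0
            return True
        # Count consecutive rejections and ban the client once the limit is hit.
        count = self.rejections.get(client, 0) + 1
        self.rejections[client] = count
        if count >= self.ban_after:
            self.banned_until[client] = now + self.ban_seconds
            self.rejections[client] = 0
        return False
```

In a real service the `now` argument would typically come from a monotonic clock such as `time.monotonic()`. A ban threshold set too aggressively rejects legitimate traffic as well, so thresholds like these are usually checked against recent traffic before they are enforced.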
Posted Apr 24, 2025 - 13:41 UTC

Resolved

Partial downtime on API and dashboard from 15:40 UTC to 15:56 UTC
Posted Apr 19, 2025 - 03:30 UTC