Degradation of API services
Incident Report for Sardine AI
Postmortem

What happened

One of our DB instances was down for about 3 minutes. This affected our API's ability to respond properly to requests.

What went wrong

A code change introduced a feature whose data wasn't cached, which heavily increased the number of requests to the DB.

What we are doing about this

  • Upgrade the database to reduce the downtime, should it ever happen again, from 140 seconds to 60 seconds at most.
  • Configure a proper maintenance schedule so that we have better control over automated maintenance by our cloud provider.
  • Improve caching so that we can still serve traffic even if the database is down.
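
The caching item above can be illustrated with a minimal sketch of a read-through cache that falls back to stale entries when the database cannot be reached. This is not our production implementation: the class name StaleOnErrorCache, the loader callback, and the TTL value are all hypothetical and chosen only for illustration.

```python
import time
from typing import Any, Callable, Dict, Tuple


class StaleOnErrorCache:
    """Read-through cache that serves stale entries when the loader fails.

    Hypothetical sketch: `loader` stands in for whatever function queries
    the database for a ruleset or configuration value.
    """

    def __init__(
        self,
        loader: Callable[[str], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (time the value was stored, value)
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)

        # Fresh hit: answer from memory without touching the database.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]

        try:
            value = self._loader(key)
        except Exception:
            # Database unavailable: prefer a stale answer over an error.
            if entry is not None:
                return entry[1]
            # Nothing cached for this key, so the failure must surface.
            raise

        self._entries[key] = (now, value)
        return value
```

With a cache of this shape, repeated reads of the same key within the TTL produce a single database request, and keys that were loaded before an outage keep resolving during it. Keys that were never loaded still fail, which matches the partial availability described in the Resolved update.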
Posted Jul 26, 2024 - 16:11 UTC

Resolved
The customers, devices, and issuing/risks APIs were down for the majority of traffic from 04:31:04 to 04:34:00 UTC. For some traffic, we were still able to respond based on cached rulesets and configuration.
Posted Jul 26, 2024 - 04:30 UTC