checkout outage
Incident Report for Sardine AI

What happened? Sardine experienced a major outage from around 7:12 PM PST to 8:02 PM PST on Nov 17, 2022. All API endpoints were down during this period.

Why did it happen? What went wrong?

We applied a DB migration to production that dropped a table we no longer intended to use. No application logic relied on the table; however, our ORM library (Sequel) queries the table's schema information when the app boots.
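Our backend is not Python, but the failure mode can be sketched with a generic, hypothetical example (using sqlite3 as a stand-in database and a made-up `legacy_table`): the ORM introspects each model's table at boot, so a dropped table breaks startup even though no request path ever touches it.

```python
import sqlite3

def boot_app(conn):
    # On boot, the ORM introspects the schema of each model's table.
    # A missing table therefore fails the boot, even though no live
    # request path ever queries it.
    cols = conn.execute("PRAGMA table_info(legacy_table)").fetchall()
    if not cols:
        raise RuntimeError("boot failed: table 'legacy_table' does not exist")
    return cols

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE legacy_table (id INTEGER PRIMARY KEY)")
boot_app(conn)  # the already-running process booted before the migration

conn.execute("DROP TABLE legacy_table")  # the migration
try:
    boot_app(conn)  # the next restart crash-loops
except RuntimeError as exc:
    print(exc)
```

This is why the already-running servers stayed healthy: the schema check only runs at boot, so the damage stayed latent until the next restart.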

Note that this query happens only on boot, which complicated matters further: the migration was run without a server restart (i.e., without a deploy), so it did not cause immediate downtime. Later, the server encountered a segfault, restarted, failed to boot because of the missing table, and landed in a crash loop.

Note that this issue was not caught in our dev or sandbox environments, because it only occurs when the migration is run against an older version of the code (the version that was live in production).

Since the table was not referenced by any live traffic, our servers remained healthy after the migration ran. A few hours later, however, our pods were restarted due to an unrelated segmentation fault (which appears to be a bug in one of our dependencies) and got stuck in a crash loop. We were running two pods in production at the time of the incident, and the outage began once both pods became unhealthy.

Sardine has an uptime monitoring system, but it was misconfigured and did not monitor these backend servers.
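For context on what that monitor should have been doing: an uptime check is just a periodic probe of a health endpoint that alerts when the service stops answering. A minimal sketch, assuming a hypothetical `/health` path and a local stand-in server (not our actual setup):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def endpoint_is_up(url, timeout=5):
    """Return True if the endpoint answers with a 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # connection refused, timeout, HTTP error, ...
        return False

# Hypothetical local stand-in for a backend /health endpoint.
class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")
    def log_message(self, *args):  # silence request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Health)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/health"

print(endpoint_is_up(url))  # True while the server is healthy
server.shutdown()
server.server_close()
print(endpoint_is_up(url))  # False once it is down
```

Had a probe like this been pointed at the affected backend, the 3:57 PM single-pod failure, or at latest the 7:12 PM full outage, would have paged us directly instead of us hearing about it from a customer 30 minutes in.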

Timeline (all pacific time)

  • Nov 14 12:00 PM: pull request to drop the unused table was merged in our internal GitHub repo
  • Nov 17 1:12 PM: the migration above was applied to prod
  • 3:57 PM: one of the pods hit a segmentation fault, restarted, and got stuck in a crash loop; the load balancer stopped sending traffic to it
  • 7:12 PM: the second pod hit a segmentation fault and got stuck in a crash loop; full outage began
  • 7:42 PM: one of our customers reported the issue
  • 7:46 PM: the issue was confirmed and escalated to the engineering team
  • 8:00 PM: a migration was applied to prod to roll back the change
  • 8:02 PM: server pods became healthy

What we’re doing to prevent this from happening again

  • establish stronger guidelines and processes around database schema migrations (e.g., drop a table only after the code that references it has been fully removed and deployed)
  • enhance alerting

    • Uptime monitoring for backend servers
    • Alerts on boot failures
    • Alerts on sudden log spikes
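As one concrete illustration of the crash-loop alerting above: for pods running on Kubernetes with Prometheus and kube-state-metrics (an assumption for illustration; the group name, labels, and thresholds are placeholders, not our actual configuration), a restart-rate rule along these lines would have fired hours before the full outage:

```yaml
# Hypothetical Prometheus alerting rule; the metric comes from kube-state-metrics.
groups:
  - name: backend-availability
    rules:
      - alert: PodCrashLooping
        # Fires when a container restarts more than 3 times in 15 minutes.
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
```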
Posted Nov 18, 2022 - 21:57 UTC

Resolved: Sardine experienced an outage from around 9:12 PM CST to 10:02 PM CST. All API endpoints were down during this period. We'll follow up with a postmortem.
Posted Nov 18, 2022 - 04:10 UTC