crypto.sardine.ai experienced a major outage from around 7:12 PM PST to 8:02 PM PST on Nov 17, 2022. All API endpoints were down during this period.
We applied a DB migration to production that dropped a table we no longer intended to use. No application logic relied on the table, but our ORM library (Sequel) queries the table's schema information when the app boots.
Because this lookup happens only on boot, the problem was harder to spot: the migration was run without a server restart (e.g. without a deploy), so it didn’t immediately cause downtime. The outage began later, when the server encountered a segfault, restarted, failed to boot due to the missing table, and landed in a crash loop.
This issue was not caught in our dev or sandbox environments, because it only happens when a migration is run against an old version of the code (the version still running in production).
Since no live traffic referenced this table, our servers remained healthy after the migration ran. A few hours later, however, our pods were restarted due to an unrelated segmentation fault (apparently a bug in one of our dependencies) and got stuck in a crash loop. We had two pods in production at the time of the incident, and the outage started once both pods became unhealthy.
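The sequence of events can be reproduced with a small toy model. The sketch below does not use Sequel or a real database: ToyDatabase, ToyApp, and the table names users and legacy_events are hypothetical stand-ins, chosen only to show why a running process survives the drop while a restarted one cannot boot.

```ruby
# Toy stand-in for a database: just a list of table names (hypothetical).
class ToyDatabase
  def initialize(tables)
    @tables = tables.dup
  end

  def drop_table(name)
    @tables.delete(name)
  end

  # Looking up the schema of a missing table fails.
  def schema(name)
    raise KeyError, "table #{name} does not exist" unless @tables.include?(name)
    "schema of #{name}"
  end
end

# Toy stand-in for the app: at boot it reads the schema of every table the
# deployed code declares a model for, then keeps the result in memory.
class ToyApp
  def initialize(db, model_tables)
    @db = db
    @model_tables = model_tables
    @schemas = nil
  end

  def boot
    @schemas = @model_tables.to_h { |table| [table, @db.schema(table)] }
    :booted
  end

  def running?
    !@schemas.nil?
  end
end

def simulate_incident
  db = ToyDatabase.new([:users, :legacy_events])
  old_code = [:users, :legacy_events] # production code still declares both models

  app = ToyApp.new(db, old_code)
  app.boot

  db.drop_table(:legacy_events) # migration applied without a deploy
  up_after_migration = app.running? # the already-booted process is unaffected

  restart_error =
    begin
      ToyApp.new(db, old_code).boot # a restart re-runs the boot-time schema lookup
      nil
    rescue KeyError => e
      e.message
    end

  { up_after_migration: up_after_migration, restart_error: restart_error }
end
```

Calling simulate_incident reports that the app is still up right after the migration, and that a fresh boot of the same code fails on the missing table. In this model, every restart after the migration fails the same way until either the table or the code changes.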
Sardine has an uptime monitoring system, but it was misconfigured and didn’t monitor these backend servers.
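As an illustration of the kind of check that would have covered these servers, here is a minimal probe using Ruby's standard Net::HTTP. It is only a sketch: the method name healthy? and the /health path in the usage note are assumptions, not a description of Sardine's actual monitoring setup.

```ruby
require "net/http"
require "uri"

# Returns true only if the URL answers with a 2xx status within the timeout.
# Any connection, TLS, DNS, or timeout error counts as unhealthy.
def healthy?(url, timeout: 5)
  uri = URI(url)
  response = Net::HTTP.start(
    uri.host, uri.port,
    use_ssl: uri.scheme == "https",
    open_timeout: timeout,
    read_timeout: timeout
  ) { |http| http.get(uri.request_uri) }
  response.is_a?(Net::HTTPSuccess)
rescue StandardError
  false
end
```

A monitor could call healthy?("https://crypto.sardine.ai/health") on a schedule (the /health path is hypothetical) and alert after a few consecutive failures. The key design point is that the probe must target the backend itself, so that a crash loop on all pods shows up as a failed check.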
enhance alerting