On Oct 29, Sardine experienced an outage in our rules-engine service that impacted our /v1/customers endpoint, causing some requests to fail. The error rate spiked sharply when the failure began and then dropped quickly.
Sardine deploys its backend services every Wednesday. During one of these regular deployments, we noticed an error while the rules-engine service was being deployed in canary mode (25% of traffic is routed to instances running the new version).
After noticing the error, our team immediately started a rollback and began investigating the root cause, which we identified a few minutes later. We resumed the deployment the next day, and this time it completed successfully.
The root cause was a database migration that caused older versions of the application to lose their reference to the schema they used, which produced errors and triggered 5xx responses in the customers API.
The /v1/customers API endpoint responded with HTTP 500 errors for around 30 minutes, with 90% of the errors concentrated in the first few minutes.