Customers API issue

Incident Report for Sardine AI

Postmortem

Rule engine outage on deploy

Overview

On Oct 29, Sardine experienced an outage in our rules-engine service that impacted our /v1/customers endpoint, causing some requests to fail. The error rate spiked sharply when the failure began and then decreased quickly.

What happened

Sardine deploys its backend services every Wednesday. During one of these regular deployments, we noticed an error while the rules-engine service was being deployed in canary mode (25% of traffic is routed to instances running the new version).

After noticing the error, our team immediately started a rollback and began looking for the root cause, which we found a few minutes later. We resumed the deployment the next day, and it completed successfully.

The root cause was a database migration that caused the old versions of the application to lose their reference to the schema they were using. This produced errors and triggered 5xx responses in the customers API.
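To illustrate this class of failure, here is a minimal toy sketch in Python. It is not Sardine's code: this report does not name the database or driver involved, so `FakeDatabase`, `CachingClient`, `StaleStatementError`, and the `retry_on_stale` flag are all hypothetical names used only for illustration. The sketch assumes that a prepared statement is tied to the schema version it was prepared against and that a migration invalidates it. It also shows one possible mitigation (re-prepare and retry once), which is an assumption and not necessarily the fix that was applied.

```python
class StaleStatementError(Exception):
    """Raised when a cached statement was prepared against an older schema."""


class FakeDatabase:
    """Hypothetical in-memory stand-in for a database that can be migrated."""

    def __init__(self):
        self.schema_version = 1

    def migrate(self):
        # A migration changes the schema that existing statements were prepared against.
        self.schema_version += 1

    def prepare(self, sql):
        # A prepared statement remembers the schema version it was planned for.
        return {"sql": sql, "schema_version": self.schema_version}

    def execute(self, statement):
        if statement["schema_version"] != self.schema_version:
            raise StaleStatementError(statement["sql"])
        return "ok"


class CachingClient:
    """Hypothetical old-version instance that caches prepared statements."""

    def __init__(self, db, retry_on_stale=False):
        self.db = db
        self.retry_on_stale = retry_on_stale
        self.cache = {}

    def query(self, sql):
        if sql not in self.cache:
            self.cache[sql] = self.db.prepare(sql)
        try:
            return self.db.execute(self.cache[sql])
        except StaleStatementError:
            if not self.retry_on_stale:
                raise
            # Drop the stale entry, prepare again against the new schema, retry once.
            self.cache[sql] = self.db.prepare(sql)
            return self.db.execute(self.cache[sql])


# Usage: an old instance warms its cache, then the canary's migration runs.
db = FakeDatabase()
old_instance = CachingClient(db)
old_instance.query("SELECT id FROM customers")
db.migrate()
# From here on, old_instance.query(...) raises StaleStatementError,
# while a client created with retry_on_stale=True would recover.
```

In this toy model, the retry only helps when the old version's queries are still valid against the migrated schema. If the migration removes something the old code depends on, re-preparing would fail as well.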

Impact

The /v1/customers API endpoint responded with HTTP 500 errors for around 30 minutes, with 90% of the errors concentrated in the first minutes.

Timeline (UTC)

  • 18:44 Deployment ticket is approved
  • 18:55 Release engineering starts the deployment
  • 19:05 Release engineer notices that the rules-engine service has an elevated number of errors after deployment
  • 19:07 Release engineer starts the rollback
  • 19:10 Incident is initiated internally and externally
  • 19:44 All services have finished rolling back to the previous version
  • 19:54 Incident is considered resolved

Action items

  • Fix the root cause: migrations run during a canary deployment caused old versions to receive an error in the prepared statement cache
  • Enhance internal deployment tooling to make deployments and rollbacks faster
Posted Oct 31, 2025 - 18:48 UTC

Resolved

We had elevated 5xx errors on the customers API from 19:04 UTC to 19:43 UTC.
The incident is now resolved, and we will follow up with a postmortem as soon as possible.
Posted Oct 29, 2025 - 19:49 UTC

Investigating

The team is investigating. We'll update as soon as possible.

Expect errors from our API.
Posted Oct 29, 2025 - 19:28 UTC
This incident affected: Customer APIs.