We deployed code with a bug in our authentication layer, which caused null pointer exceptions on all API calls. The code change was related to non-production environments, so the issue only happened in production. VMs still passed the health check (which is an unauthenticated API call) and were therefore treated as healthy, so the check missed calls that were impactful to the service.
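The health-check gap described above, as an added example (constructed for illustration, not the team's code): a toy service whose unauthenticated `/health` probe stays green while the auth layer throws, next to a deep probe that sends a synthetic credential through the same auth path. The class names, the `/health/deep` route and the probe token are assumptions.

```python
PROBE_TOKEN = "synthetic-probe-token"  # constructed; a real probe identity should hold no permissions


class AuthLayer:
    """Toy authentication layer; broken=True simulates the faulty deployment."""

    def __init__(self, broken=False):
        self.config = None if broken else {"tokens": {PROBE_TOKEN: "health-probe"}}

    def authenticate(self, token):
        # With config unset this raises TypeError, the Python analogue of the NPE.
        return self.config["tokens"].get(token)


class Service:
    def __init__(self, auth):
        self.auth = auth

    def handle(self, path, token=None):
        if path == "/health":        # shallow check: never reaches the auth layer
            return 200
        if path == "/health/deep":   # deep check: runs the path real calls take
            return self._deep_health()
        try:
            principal = self.auth.authenticate(token)
        except Exception:
            return 500               # what API callers saw during the incident
        return 200 if principal else 401

    def _deep_health(self):
        try:
            accepts_probe = self.auth.authenticate(PROBE_TOKEN) == "health-probe"
            rejects_bad = self.auth.authenticate("not-a-valid-token") is None
        except Exception:
            return 503               # auth layer is throwing: take the VM out of rotation
        # Also fail if auth stopped rejecting bad tokens (fail-open).
        return 200 if accepts_probe and rejects_bad else 503


def in_rotation(service, probe_path):
    """Load balancer decision: keep the VM only if the probe returns 200."""
    return service.handle(probe_path) == 200


if __name__ == "__main__":
    bad = Service(AuthLayer(broken=True))
    print("API call:", bad.handle("/customers", PROBE_TOKEN))
    print("shallow probe keeps VM:", in_rotation(bad, "/health"))
    print("deep probe keeps VM:", in_rotation(bad, "/health/deep"))
```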
2:15 PM: deployment initiated
2:23 PM: deployment complete
2:24 PM: first error observed (two services were down)
2:31 PM: oncall got paged
2:49 PM: rollback initiated
2:57 PM: rollback completed, error is resolved
All risk APIs (customers, devices, feedbacks, banks/transactions, issuing/risks, identity-documents) were down from 2:24 PM to 2:57 PM.
Make rollback process more robust