Service Outage

Incident Report for Sardine AI

Postmortem

Risk service down (Jan 5 2023)

Overview

We deployed code with a bug in our authentication layer that caused null pointer exceptions on all API calls. The code change was related to non-production environments, so the issue appeared only in production. The health check is an unauthenticated API call, so VMs still passed it and were treated as healthy; it did not exercise the authenticated calls that were failing.
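One way to close the gap described above is a health check that goes through the same authentication path as real traffic. The Python sketch below is illustrative only: `deep_health_check`, `authenticate`, and the probe credential are hypothetical names, not the actual service code.

```python
from typing import Callable, Optional, Tuple


def deep_health_check(
    authenticate: Callable[[str], Optional[str]],
    probe_api_key: str,
) -> Tuple[int, str]:
    """Return (HTTP status, message) for a health probe that exercises auth.

    `authenticate` is a hypothetical stand-in for the auth layer: it takes
    an API key and returns a client id, or None if the key is unknown.
    """
    try:
        client_id = authenticate(probe_api_key)
    except Exception as exc:  # e.g. a null dereference inside the auth layer
        return 503, f"auth layer raised {type(exc).__name__}"
    if client_id is None:
        return 503, "probe credential rejected"
    return 200, "ok"


def broken_auth(api_key: str) -> Optional[str]:
    # Simulates the failure mode: a missing config object is dereferenced.
    config = None
    return config["clients"][api_key]  # raises TypeError


def working_auth(api_key: str) -> Optional[str]:
    # Assumed lookup table with a dedicated, low-privilege probe credential.
    return {"probe-key": "health-probe"}.get(api_key)
```

With a check like this, a VM whose auth layer throws on every call reports 503 and is taken out of rotation instead of being treated as healthy.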

Timeline (Pacific Time)

2:15 PM: deployment initiated

2:23 PM: deployment complete

2:24 PM: first error observed (two services were down)

2:31 PM: on-call engineer paged

2:49 PM: rollback initiated

2:57 PM: rollback completed, error resolved
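The intervals implied by this timeline can be computed directly. The short Python sketch below uses only the timestamps listed above; the variable and key names are ours.

```python
from datetime import datetime

# Timestamps copied from the timeline (Pacific Time, Jan 5 2023).
deploy_started = datetime(2023, 1, 5, 14, 15)
deploy_completed = datetime(2023, 1, 5, 14, 23)
first_error = datetime(2023, 1, 5, 14, 24)
oncall_paged = datetime(2023, 1, 5, 14, 31)
rollback_started = datetime(2023, 1, 5, 14, 49)
rollback_completed = datetime(2023, 1, 5, 14, 57)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timeline events."""
    return int((end - start).total_seconds() // 60)


intervals = {
    "deployment duration": minutes_between(deploy_started, deploy_completed),
    "first error to page": minutes_between(first_error, oncall_paged),
    "page to rollback start": minutes_between(oncall_paged, rollback_started),
    "rollback duration": minutes_between(rollback_started, rollback_completed),
    "total error window": minutes_between(first_error, rollback_completed),
}

for name, minutes in intervals.items():
    print(f"{name}: {minutes} min")
```

The largest single interval is the 18 minutes between the page and the start of the rollback.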

Impact

All risk APIs (customers, devices, feedbacks, banks/transactions, issuing/risks, identity-documents) were down from 2:24 PM to 2:57 PM.

What went wrong

Due to the nature of the change, the issue occurred only in production.
The instance template from the previous deploy had been deleted, so the rollback took longer.
We had a canary deploy setup in place, but we performed a full deployment instead.
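To illustrate what a canary step could have caught here, the Python sketch below shows a simple promotion gate. The function name and both thresholds are hypothetical, not our actual deploy tooling, and the request counts are assumed to come from authenticated production traffic routed to the canary.

```python
def canary_decision(
    total_requests: int,
    failed_requests: int,
    min_requests: int = 100,       # assumed minimum sample size
    max_error_rate: float = 0.01,  # assumed acceptable error rate (1%)
) -> str:
    """Return "wait", "promote", or "rollback" for a canary instance group."""
    if total_requests < min_requests:
        # Not enough real traffic yet to judge the new version.
        return "wait"
    error_rate = failed_requests / total_requests
    return "promote" if error_rate <= max_error_rate else "rollback"
```

In an incident like this one, where every authenticated call fails, the canary's error rate would be 100% and the gate would return "rollback" before the full fleet was replaced.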

What went well

The monitoring system detected the issue as soon as the services were deployed.
The on-call engineer escalated to DevOps and the Team Lead promptly.

Action items

Enforce canary deploys as standard practice
Make the rollback process more robust
- Keep instance templates from the previous deploy.
- Add automation to recreate an instance template if needed.
- Improve visibility into past deployment versions so they can be found quickly for a rollback.
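The template-retention item above could be enforced by a cleanup rule that never removes the most recent templates. The Python sketch below is a minimal illustration, assuming templates are represented as (name, creation time) pairs; the function name, `keep_last` default, and template names are hypothetical, and no cloud provider API is called.

```python
from datetime import datetime
from typing import Iterable, List, Set, Tuple


def templates_safe_to_delete(
    templates: Iterable[Tuple[str, datetime]],
    in_use: Set[str],
    keep_last: int = 3,  # assumed retention count
) -> List[str]:
    """Return template names that cleanup may delete, newest first.

    The `keep_last` most recent templates and any template currently
    in use are always protected, so a rollback target stays available.
    """
    newest_first = sorted(templates, key=lambda t: t[1], reverse=True)
    protected = {name for name, _ in newest_first[:keep_last]} | set(in_use)
    return [name for name, _ in newest_first if name not in protected]


# Example with hypothetical template names.
example = [(f"risk-api-v{i}", datetime(2023, 1, i)) for i in range(1, 6)]
deletable = templates_safe_to_delete(example, in_use={"risk-api-v1"}, keep_last=2)
print(deletable)
```

Listing templates sorted by creation time in this way also addresses the visibility item, since the previous version is always the second entry.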

Posted Jan 05, 2023 - 23:33 UTC

Resolved

From 2:24 to 2:56 PM PST, our Risk API was unavailable. This incident has been resolved.

Posted Jan 05, 2023 - 23:00 UTC

Investigating

We are aware of a service outage and are currently investigating.

Posted Jan 05, 2023 - 22:58 UTC

This incident affected: Device APIs and Customer APIs.