Service Outage
Incident Report for Sardine AI
Postmortem

Risk service down (Jan 5 2023)

Overview

We deployed code with a bug in our authentication layer, which caused null pointer exceptions on all authenticated API calls. The change was related to non-production environments, and the bug surfaced only in production. The VMs still passed the health check, which is an unauthenticated API call, so they were treated as healthy even though the calls that matter to the service were failing.
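
The gap described above is that the health check never touched the authentication layer. A minimal sketch of a health check that does exercise it is shown below. The names deep_health_check, authenticate, and probe_token are hypothetical stand-ins for illustration, not our actual code, and the sketch assumes a dedicated probe credential is available.

```python
from typing import Callable, Optional, Tuple


def deep_health_check(
    authenticate: Callable[[str], Optional[str]],
    probe_token: str,
) -> Tuple[int, str]:
    """Return (status_code, message) for a probe that runs the auth path.

    `authenticate` stands in for the service's real authentication entry
    point and `probe_token` for a health-check credential (both assumed).
    """
    try:
        principal = authenticate(probe_token)
    except Exception as exc:
        # Any crash inside the auth layer marks the VM as unhealthy.
        return 503, f"auth layer failed: {type(exc).__name__}"
    if principal is None:
        return 503, "auth layer rejected the probe credential"
    return 200, "ok"


def broken_authenticate(token: str) -> Optional[str]:
    """Simulates an auth layer that crashes on every call."""
    raise AttributeError("simulated null reference in auth layer")


def working_authenticate(token: str) -> Optional[str]:
    """Simulates an auth layer that accepts the probe credential."""
    return "health-probe"
```

With a check of this shape, deep_health_check(broken_authenticate, "probe") reports 503, so a load balancer would stop routing to the VM instead of treating it as healthy.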

Timeline (Pacific)

2:15 PM: deployment initiated

2:23 PM: deployment complete

2:24 PM: first error observed (two services were down)

2:31 PM: on-call engineer paged

2:49 PM: rollback initiated

2:57 PM: rollback completed, errors resolved

Impact

All risk APIs (customers, devices, feedbacks, banks/transactions, issuing/risks, identity-documents) were down from 2:24 PM to 2:57 PM.

What went wrong

  • Because of the nature of the change, the issue occurred only in production
  • The instance template from the previous deploy had been deleted, so the rollback took longer
  • We had a canary deploy setup, but we performed a full deployment instead
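
To illustrate what a canary stage could have caught here, below is a minimal sketch of a promotion gate. The function name should_promote and the thresholds max_error_rate and min_requests are assumptions chosen for illustration, not values from our deploy tooling.

```python
def should_promote(
    canary_requests: int,
    canary_errors: int,
    max_error_rate: float = 0.01,
    min_requests: int = 100,
) -> bool:
    """Decide whether a canary may be promoted to a full deployment.

    Thresholds are illustrative assumptions. The canary must serve at
    least `min_requests` real calls before its error rate is trusted.
    """
    if canary_requests < min_requests:
        # Not enough traffic yet to judge the new version.
        return False
    return canary_errors / canary_requests <= max_error_rate
```

In an incident like this one, where every authenticated call fails, a gate of this kind would refuse to promote: should_promote(500, 500) is False.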

What went well

  • The monitoring system detected the issue as soon as the services were deployed
  • The on-call engineer escalated to DevOps and the Team Lead right away

Action items

  • Enforce canary deploys as standard practice
  • Make the rollback process more robust

    • Keep instance templates from the previous deploy
    • Build automation that can create an instance template on demand, in case one is missing
    • Provide clearer visibility into past deployment versions so they can be found quickly for a rollback
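
The "keep instance templates" item above could be enforced by a retention rule in the cleanup step. The sketch below is one possible shape, assuming templates are tracked by name and creation time. InstanceTemplate, templates_to_delete, and keep_last are hypothetical names, and no cloud provider API is called.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class InstanceTemplate:
    name: str
    created_at: datetime


def templates_to_delete(
    templates: List[InstanceTemplate],
    keep_last: int = 3,
) -> List[str]:
    """Return names of templates that are safe to delete.

    Always retains the `keep_last` newest templates, so the template
    from the previous deploy stays available for a rollback.
    """
    if keep_last < 2:
        # Keeping fewer than two would drop the previous deploy's template.
        raise ValueError("keep_last must be at least 2")
    newest_first = sorted(templates, key=lambda t: t.created_at, reverse=True)
    return [t.name for t in newest_first[keep_last:]]
```

Returning names rather than deleting directly keeps the rule easy to review, and the same sorted list doubles as a view of past deployment versions.
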
Posted Jan 05, 2023 - 23:33 UTC

Resolved
From 2:24 PM to 2:56 PM PST, our Risk API was unavailable. This incident has been resolved.
Posted Jan 05, 2023 - 23:00 UTC
Investigating
We are aware of a service outage and are currently investigating.
Posted Jan 05, 2023 - 22:58 UTC
This incident affected: Device APIs and Customer APIs.