Dashboard instability while loading certain entities might occur

Incident Report for Sardine AI

Postmortem

Impact:

During the incident window:

  • Customer Intelligence Search latency was degraded for queries spanning >30 days of data.
  • Session Details and Customer Details pages loaded slowly.
  • Connections Graph and Timeline features were also impacted.

Executive Summary

As part of infrastructure optimization, our development team performed multiple operations on our search databases to optimize index structure and data storage. These operations left our warm data cluster under-provisioned, which degraded performance.

The team ultimately resolved the incident by updating the data cluster configuration. Because of the volume of data, a simple rollback was not possible, which prolonged the incident.

Incident Details

What Happened

Our development team performed multiple operations on our search databases to optimize index structure and data storage. Due to a bug in the migration script, we migrated more data than initially anticipated, and the destination cluster did not have sufficient storage and compute resources assigned.
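As an illustration only, a pre-flight check along the following lines could flag both conditions described above before any data moves: a script selecting more data than intended, and a destination without enough free storage. This is a minimal sketch, not the migration tooling used during the incident. The names `MigrationPlan` and `preflight_check`, and the `tolerance` and `headroom` defaults, are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MigrationPlan:
    """Hypothetical summary of a planned migration (not from the incident tooling)."""
    expected_bytes: int  # volume the team intended to move
    selected_bytes: int  # volume the migration script actually selected


def preflight_check(
    plan: MigrationPlan,
    destination_free_bytes: int,
    tolerance: float = 0.10,  # assumed: allowed overshoot versus the expected volume
    headroom: float = 0.20,   # assumed: share of free space to keep unused
) -> List[str]:
    """Return a list of problems; an empty list means the migration may proceed."""
    problems: List[str] = []

    # Catch a script that selects more data than was intended.
    if plan.selected_bytes > plan.expected_bytes * (1 + tolerance):
        problems.append("selected data exceeds the expected volume")

    # Catch a destination that cannot hold the data with headroom to spare.
    usable_bytes = destination_free_bytes * (1 - headroom)
    if plan.selected_bytes > usable_bytes:
        problems.append("destination lacks sufficient free storage")

    return problems
```

A migration wrapper could refuse to start whenever the returned list is non-empty. The check covers storage only; an equivalent guard for compute resources would need its own measurements.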

Latency began rising slowly as more data was migrated. We initially dismissed this as expected, because we were moving older data to separate clusters that are slower by design but should remain within acceptable bounds. Two days later, on May 12, as the migration completed and the warm indices filled up, users began reporting that dashboard search was very slow.

We then attempted to upsize the cluster, but the resize could not complete because of high traffic and the large amount of data. The incident was resolved when our team manually reverted part of the operation.

Timeline

All times are PT; entries without a date are May 12.

Time              Event
May 10, 11:38 PM  Automated data migration operation initiated; the team was monitoring and did not report any issues.
May 11, 12:00 AM  Latency starts climbing. Alerts were triggered but assumed to be expected.
May 12, 6:02 AM   Support reports dashboard slowness; on-call begins investigation.
9:04 AM           Incident formally created.
10:56 AM          First code fix deployed for Customer Details and Session Details.
11:18 AM          Deploy complete; pages still slow.
12:25 PM          Removed the search dependency from Customer Profile and Session Details. Page loads improved; Network Graph and Customer search still slow.
1:09 PM           Root cause identified: indices incorrectly in warm tier; direct hot-tier migration initiated (~10h estimated).
3:05 PM           Warm tier upsized aggressively; migration still not converging.
7:00–7:08 PM      Search cluster repeatedly auto-cancels in-flight shard recovery; direct migration abandoned.
7:19 PM           Switched to an alternative approach: spinning up a new cluster.
7:41 PM           April indices restored from snapshot; last-30d queries drop to ~15ms.
9:03 PM           February and March index restores complete.
10:14 PM          Replicas added to hot copies; search queue drops to 0. Incident resolved.

Action Items

Immediate:

  • Manually roll back the problematic resource allocation
  • Ensure all node pools have enough resources

Medium Term Process Improvements:

  • Runbook and migration process for search database upgrade operations
    • Better review process for infrastructure changes
    • Runbook for monitoring upgrades and performing an immediate rollback
  • Observability to determine whether observed latency is expected
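
One way to make "is this latency expected?" an explicit decision rather than a judgment call is to record per-tier bounds ahead of a migration and classify observed latency against them. The sketch below is illustrative only: the tier names follow this report, but the threshold values, the `Verdict` enum, and `classify_latency` are hypothetical, and real bounds would need to come from measured baselines.

```python
from enum import Enum


class Verdict(Enum):
    EXPECTED = "expected"        # within the agreed bound for this tier
    INVESTIGATE = "investigate"  # slower than expected, below the hard ceiling
    ABORT = "abort"              # beyond the ceiling: pause or roll back


# Hypothetical per-tier bounds in milliseconds: (expected p95, hard ceiling).
# These numbers are placeholders, not values taken from the incident.
LATENCY_BOUNDS_MS = {
    "hot": (50.0, 200.0),
    "warm": (500.0, 2000.0),
}


def classify_latency(tier: str, p95_ms: float) -> Verdict:
    """Compare an observed p95 latency with the bounds recorded for its tier."""
    expected_ms, ceiling_ms = LATENCY_BOUNDS_MS[tier]
    if p95_ms <= expected_ms:
        return Verdict.EXPECTED
    if p95_ms <= ceiling_ms:
        return Verdict.INVESTIGATE
    return Verdict.ABORT
```

With bounds written down in advance, an alert that fires during a migration can be checked against an agreed ceiling instead of being assumed to be expected.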

Posted May 15, 2026 - 12:52 UTC

Resolved

This incident has been resolved.
Posted May 13, 2026 - 12:16 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted May 12, 2026 - 23:16 UTC
This incident affected: Dashboard.