Impact:
During incident window
As part of infrastructure optimization, our development team performed multiple operations to our search databases to optimize index structure and data storage. This resulted in inefficient provision of our warm data cluster, and resulted in degraded performance.
The team ultimately resolved the incident by updating data cluster configuration. Due to the volume of data, simple rollback was not possible, resulting in the long incident.
Our development team performed multiple operations to our search databases to optimize index structure and data storage. Due to bug in migration script, we migrated more data than initially anticipated. The destination cluster didn’t have sufficient storage and computing resources assigned.
Latency started rising slowly as more data was migrated. This was initially dismissed as expected as we’re moving older data to separate clusters that are indeed slower but should remain within acceptable bounds. Two days later, on May 12, as the warm indices filled up as the migration completed, users began reporting that dashboard search was very slow.
We then attempted upsizing the cluster but it was not able to upsize due to high traffic and large amount of data. Incident was resolved by our team manually reverted some of the operation.
| Time (PT, May 12) | Event |
|---|---|
| May 10, 23:38 | Automated operation around data migration was initiated, team was monitoring and didn’t report any issue |
| May 11, 00:00 | Latency starts climbing. Alerts were triggered but assumed as expected. |
| May 12, 6:02 AM | Support reports dashboard slowness; on-call begins investigation |
| 9:04 AM | Incident formally created |
| 10:56 AM | First code fix deployed for customer details + session details |
| 11:18 AM | Deploy complete, pages still slow |
| 12:25 PM | Removed search dependency on Customer Profile + Session Details. Page Loads improved, Network Graph + Customer search still slow. |
| 1:09 PM | Root cause identified: indices incorrectly in warm tier; direct hot-tier migration initiated (~10h estimated) |
| 3:05 PM | Warm tier upsized aggressively migration still not converging |
| 7:00–7:08 PM | search cluster repeatedly auto-cancels in-flight shard recovery; direct migration abandoned |
| 7:19 PM | Switched to another approach of spinnig up new cluster |
| 7:41 PM | April indicies restored from snapshot; last-30d queries drop to ~15ms |
| 9:03 PM | February + March indicies restores complete |
| 10:14 PM | Replicas added to hot copies; search queue drops to 0. Incident resolved. |
Immediate:
Medium Term Process Improvements:
Runbook and Migration process for search database upgrade operation
Observability in order to know if latency is expected