Scenario
- Aurora spends 75% of its wait time on cross-region binlog replication and cannot satisfy the TPS requirement even after upgrading to the highest hardware spec. This bottleneck is limiting business growth
Timeline
- Nov - PoC with load testing
- Dec - Migration plan discussions; got buy-in from stakeholders
- Jan - Develop verification tools and processes
- Set up clusters and run operational drills
- Feb - Run the verification schemes on prod
- 30+ DR drill cases, from staging to prod
Major decision dilemmas/Known unknowns and how I solved them
- TiDB vs DynamoDB
- Both sides ran PoCs, then we held a 4-hour debate session where each side presented and argued for its solution
- Neither side could show the other's approach was infeasible, so the tie was broken by the lowest common ancestor (LCA) in the reporting line
- Hosting solution: EKS + TiDB Operator vs EC2 + Ansible
- Both sides had data to support their claims, and neither could prove the other's approach was infeasible
- Again, the tie was broken by the LCA in the reporting line, who, in retrospect, made the correct decision
- Migration approach: one-shot vs incremental
- Intuition says we should migrate data incrementally, but I lobbied for the one-shot approach after doing my own research
- I researched 5 cases with clients in similar industries and talked to 2 of them directly to understand why they did not choose incremental
- Got confirmation from PingCAP's solution architects on not choosing incremental
- The proposal caused stress among higher-ups; as a remedy, every migration runbook includes near-real-time verification and a rollback plan
How I did verification
- Query replay - to verify that server-side behavior matches expectations. Note: since this is Aurora, we can't put a proxy in front to capture traffic and relay it to a sidecar without taking prod downtime (a minimal replay sketch follows this list)
- Binlog + EMR job to check data consistency across domains, i.e. to make sure binlog replication is behaving as expected
- Standard sync-diff verification applies too (a chunk-checksum sketch in the same spirit follows this list)
- Traffic replay - to verify that client-side libraries (JDBC driver, connection pool, etc.) keep the same behavior as they had against MySQL (a client-stack probe sketch follows this list)
- Detailed runbook with a DRI and a reviewer at each step
- I act as the coordinator to measure the process and overall correctness
- Each step is timed and publicly announced by me and the action taker
- The drill is run 4 times before the actual migration; the last run is a complete drill with all stakeholders in attendance
- The last 2 drills completed without any errors
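
Below is a minimal sketch of the query-replay comparison, assuming captured read-only statements were exported from Aurora's query logs into a file. The file name, endpoints, credentials, and class name are illustrative placeholders, not the actual tooling.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;
import java.util.zip.CRC32;

// Replays captured read-only statements against both endpoints and compares
// a checksum of each result set. Endpoints, credentials, and the capture file
// are placeholders.
public class QueryReplay {
    public static void main(String[] args) throws Exception {
        List<String> queries = Files.readAllLines(Path.of("captured_queries.sql"));
        try (Connection aurora = DriverManager.getConnection(
                 "jdbc:mysql://aurora-endpoint:3306/app", "replay", "secret");
             Connection tidb = DriverManager.getConnection(
                 "jdbc:mysql://tidb-endpoint:4000/app", "replay", "secret")) {
            for (String sql : queries) {
                if (sql.isBlank()) continue;
                long a = resultChecksum(aurora, sql);
                long b = resultChecksum(tidb, sql);
                if (a != b) {
                    System.out.println("MISMATCH: " + sql);
                }
            }
        }
    }

    // Hashes every column of every row so any value or row-count difference
    // surfaces as a mismatch to triage manually (row order must be made
    // deterministic, e.g. with ORDER BY, for the comparison to be meaningful).
    static long resultChecksum(Connection conn, String sql) throws SQLException {
        CRC32 crc = new CRC32();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            int cols = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                for (int i = 1; i <= cols; i++) {
                    String v = rs.getString(i);
                    crc.update((v == null ? "NULL" : v).getBytes());
                }
            }
        }
        return crc.getValue();
    }
}
```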
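The actual consistency check ran as a binlog-driven EMR job plus sync-diff, but the core comparison primitive looks roughly like the chunked checksum below: walk the key space in ranges, XOR per-row CRCs on both sides, and flag any chunk that differs. The table name `orders`, its columns, the chunk size, and the endpoints are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Chunk-level checksum comparison in the spirit of sync-diff-inspector:
// for each fixed-size primary-key range, compare the row count and an XOR
// of per-row CRCs on both sides. NULL handling via CONCAT_WS is simplified.
public class ChunkChecksum {
    static final String DIGEST_SQL =
        "SELECT COUNT(*), COALESCE(BIT_XOR(CRC32(CONCAT_WS('#', id, user_id, amount, updated_at))), 0) "
        + "FROM orders WHERE id >= ? AND id < ?";

    public static void main(String[] args) throws Exception {
        long chunkSize = 100_000L;   // primary-key range covered per chunk
        long maxId = 10_000_000L;    // upper bound of the key space to verify
        try (Connection aurora = DriverManager.getConnection(
                 "jdbc:mysql://aurora-endpoint:3306/app", "verify", "secret");
             Connection tidb = DriverManager.getConnection(
                 "jdbc:mysql://tidb-endpoint:4000/app", "verify", "secret")) {
            for (long lo = 0; lo < maxId; lo += chunkSize) {
                long[] a = chunkDigest(aurora, lo, lo + chunkSize);
                long[] b = chunkDigest(tidb, lo, lo + chunkSize);
                if (a[0] != b[0] || a[1] != b[1]) {
                    System.out.printf("MISMATCH in [%d, %d): count %d vs %d, crc %d vs %d%n",
                        lo, lo + chunkSize, a[0], b[0], a[1], b[1]);
                }
            }
        }
    }

    // Returns {row count, XOR of row CRCs} for one primary-key range.
    static long[] chunkDigest(Connection conn, long lo, long hi) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(DIGEST_SQL)) {
            ps.setLong(1, lo);
            ps.setLong(2, hi);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return new long[] { rs.getLong(1), rs.getLong(2) };
            }
        }
    }
}
```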
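For the client-side check, one way to approximate it is to stand up the same JDBC driver and connection-pool stack the services use and diff the driver-visible settings against each endpoint. HikariCP is assumed here purely as an example pool; endpoints and credentials are placeholders.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Probes each endpoint through the same pool + driver stack the application
// uses and prints the settings client code tends to depend on, so the two
// outputs can be compared side by side.
public class ClientStackProbe {
    public static void main(String[] args) throws SQLException {
        probe("jdbc:mysql://aurora-endpoint:3306/app");
        probe("jdbc:mysql://tidb-endpoint:4000/app");
    }

    static void probe(String jdbcUrl) throws SQLException {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl(jdbcUrl);
        cfg.setUsername("probe");
        cfg.setPassword("secret");
        cfg.setMaximumPoolSize(4);
        try (HikariDataSource ds = new HikariDataSource(cfg);
             Connection c = ds.getConnection()) {
            DatabaseMetaData md = c.getMetaData();
            System.out.println(jdbcUrl);
            System.out.println("  product/version  : " + md.getDatabaseProductName()
                + " " + md.getDatabaseProductVersion());
            System.out.println("  default isolation: " + c.getTransactionIsolation());
            System.out.println("  autocommit       : " + c.getAutoCommit());
            try (Statement st = c.createStatement();
                 ResultSet rs = st.executeQuery("SELECT @@version, @@sql_mode")) {
                rs.next();
                System.out.println("  @@version        : " + rs.getString(1));
                System.out.println("  @@sql_mode       : " + rs.getString(2));
            }
        }
    }
}
```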
Setbacks/Unknown unknowns we ran into
- Two weeks before go-live: query replay verification failed on prod
- 36 hours before go-live: ran out of snowflake IDs on prod
- Kept seeing concurrent read-write errors even though the code shows no concurrency during writes