Scenario
- Aurora spends 75% of its wait time on cross-region binlog replication and cannot satisfy the TPS requirement even after upgrading to the highest hardware spec. This bottleneck is limiting business growth
Timeline
- Nov - PoC with load testing
- Dec - Migration plan discussions; got buy-in from stakeholders
- Jan - Develop verification tools and processes
- Set up clusters and run operational drills
- Feb - Run the verification schemes on prod
- 30+ DR drill cases, from staging to prod
Major decision dilemmas/Known unknowns and how I solved them
- TiDB vs DynamoDB
- Both sides ran PoCs, then we held a 4-hour debate session where each side presented and argued for its solution
- Neither side could show the other's approach was infeasible, so the tie was broken by the lowest common ancestor (LCA) in the reporting line
- Hosting solution: EKS + TiDB Operator vs EC2 + Ansible
- Both sides had data to support their claims, and neither could prove the other's approach was infeasible
- Again, the tie was broken by the LCA in the reporting line, who, in retrospect, made the correct decision
- Migration approach: one-shot vs incremental
- Intuition says we should migrate data incrementally, but I lobbied for the one-shot approach after doing my own research
- I researched 5 cases with clients in similar industries and talked to 2 of them directly to understand why they did not choose incremental
- Got confirmation from PingCAP's solution architects on not choosing incremental
- The proposal caused stress among higher-ups; as a remedy, every migration runbook includes near-real-time verification and a rollback plan
How I did verification
- Query replay - to verify that server-side behavior matches expectations. Note: since this is Aurora, we can't put a proxy in front to capture traffic and relay it to a sidecar without taking prod downtime (a minimal replay sketch follows this list)
- Binlog + EMR job to check data consistency across domains, i.e. to make sure binlog replication is behaving as expected
- Standard sync-diff verification applies too (a chunk-checksum sketch in the same spirit follows this list)
- Traffic replay - to verify that client-side libraries (JDBC driver, connection pool, etc.) keep the same behavior as they had against MySQL (a client-stack probe sketch follows this list)
- Detailed runbook with a DRI and a reviewer at each step
- I act as the coordinator to measure the process and overall correctness
- Each step is timed and publicly announced by me and the action taker
- The drill is run 4 times before the actual migration; the last run is a complete drill with all stakeholders in attendance
- The last 2 drills completed without any errors
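
Below is a minimal sketch of the query-replay comparison, assuming captured read-only statements were exported from Aurora's query logs into a file. The file name, endpoints, credentials, and class name are illustrative placeholders, not the actual tooling.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;
import java.util.zip.CRC32;

// Replays captured read-only statements against both endpoints and compares
// a checksum of each result set. Endpoints, credentials, and the capture file
// are placeholders.
public class QueryReplay {
    public static void main(String[] args) throws Exception {
        List<String> queries = Files.readAllLines(Path.of("captured_queries.sql"));
        try (Connection aurora = DriverManager.getConnection(
                 "jdbc:mysql://aurora-endpoint:3306/app", "replay", "secret");
             Connection tidb = DriverManager.getConnection(
                 "jdbc:mysql://tidb-endpoint:4000/app", "replay", "secret")) {
            for (String sql : queries) {
                if (sql.isBlank()) continue;
                long a = resultChecksum(aurora, sql);
                long b = resultChecksum(tidb, sql);
                if (a != b) {
                    System.out.println("MISMATCH: " + sql);
                }
            }
        }
    }

    // Hashes every column of every row so any value or row-count difference
    // surfaces as a mismatch to triage manually (row order must be made
    // deterministic, e.g. with ORDER BY, for the comparison to be meaningful).
    static long resultChecksum(Connection conn, String sql) throws SQLException {
        CRC32 crc = new CRC32();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            int cols = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                for (int i = 1; i <= cols; i++) {
                    String v = rs.getString(i);
                    crc.update((v == null ? "NULL" : v).getBytes());
                }
            }
        }
        return crc.getValue();
    }
}
```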
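The actual consistency check ran as a binlog-driven EMR job plus sync-diff, but the core comparison primitive looks roughly like the chunked checksum below: walk the key space in ranges, XOR per-row CRCs on both sides, and flag any chunk that differs. The table name `orders`, its columns, the chunk size, and the endpoints are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Chunk-level checksum comparison in the spirit of sync-diff-inspector:
// for each fixed-size primary-key range, compare the row count and an XOR
// of per-row CRCs on both sides. NULL handling via CONCAT_WS is simplified.
public class ChunkChecksum {
    static final String DIGEST_SQL =
        "SELECT COUNT(*), COALESCE(BIT_XOR(CRC32(CONCAT_WS('#', id, user_id, amount, updated_at))), 0) "
        + "FROM orders WHERE id >= ? AND id < ?";

    public static void main(String[] args) throws Exception {
        long chunkSize = 100_000L;   // primary-key range covered per chunk
        long maxId = 10_000_000L;    // upper bound of the key space to verify
        try (Connection aurora = DriverManager.getConnection(
                 "jdbc:mysql://aurora-endpoint:3306/app", "verify", "secret");
             Connection tidb = DriverManager.getConnection(
                 "jdbc:mysql://tidb-endpoint:4000/app", "verify", "secret")) {
            for (long lo = 0; lo < maxId; lo += chunkSize) {
                long[] a = chunkDigest(aurora, lo, lo + chunkSize);
                long[] b = chunkDigest(tidb, lo, lo + chunkSize);
                if (a[0] != b[0] || a[1] != b[1]) {
                    System.out.printf("MISMATCH in [%d, %d): count %d vs %d, crc %d vs %d%n",
                        lo, lo + chunkSize, a[0], b[0], a[1], b[1]);
                }
            }
        }
    }

    // Returns {row count, XOR of row CRCs} for one primary-key range.
    static long[] chunkDigest(Connection conn, long lo, long hi) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(DIGEST_SQL)) {
            ps.setLong(1, lo);
            ps.setLong(2, hi);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return new long[] { rs.getLong(1), rs.getLong(2) };
            }
        }
    }
}
```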
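For the client-side check, one way to approximate it is to stand up the same JDBC driver and connection-pool stack the services use and diff the driver-visible settings against each endpoint. HikariCP is assumed here purely as an example pool; endpoints and credentials are placeholders.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Probes each endpoint through the same pool + driver stack the application
// uses and prints the settings client code tends to depend on, so the two
// outputs can be compared side by side.
public class ClientStackProbe {
    public static void main(String[] args) throws SQLException {
        probe("jdbc:mysql://aurora-endpoint:3306/app");
        probe("jdbc:mysql://tidb-endpoint:4000/app");
    }

    static void probe(String jdbcUrl) throws SQLException {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl(jdbcUrl);
        cfg.setUsername("probe");
        cfg.setPassword("secret");
        cfg.setMaximumPoolSize(4);
        try (HikariDataSource ds = new HikariDataSource(cfg);
             Connection c = ds.getConnection()) {
            DatabaseMetaData md = c.getMetaData();
            System.out.println(jdbcUrl);
            System.out.println("  product/version  : " + md.getDatabaseProductName()
                + " " + md.getDatabaseProductVersion());
            System.out.println("  default isolation: " + c.getTransactionIsolation());
            System.out.println("  autocommit       : " + c.getAutoCommit());
            try (Statement st = c.createStatement();
                 ResultSet rs = st.executeQuery("SELECT @@version, @@sql_mode")) {
                rs.next();
                System.out.println("  @@version        : " + rs.getString(1));
                System.out.println("  @@sql_mode       : " + rs.getString(2));
            }
        }
    }
}
```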
Setbacks/Unknown unknowns we ran into
- Two weeks before go-live: query replay verification failed on prod
- 36 hours before go-live: ran out of snowflake IDs on prod
- Kept seeing concurrent read-write errors even though the code shows no concurrency during writes