Scenario

  • Aurora spends 75% of its wait time on cross-region binlog replication and cannot satisfy the TPS requirement even after upgrading to the highest hardware spec. This bottleneck is limiting business growth

Timeline

  • Nov - PoC with load testing
  • Dec - Discussed the migration plan and got buy-in
  • Jan - Developed verification tools and processes
    • Set up clusters and ran operational drills
  • Feb - Ran the verification schemes on prod
    • 30+ DR drill cases from staging to prod

Major decision dilemmas/Known unknowns and how I solved them

  • TiDB vs Dynamo
    • Both sides ran a PoC, then we held a 4-hour debate session where each side presented and argued for its solution
    • Neither side could show that the other's option was infeasible, so the tie was broken by the LCA (lowest common ancestor) in the reporting line
  • Hosting solution: EKS + operator vs EC2 + Ansible
    • Both sides had data to support their claims, and neither could prove the other's option was infeasible
    • Again, the tie was broken by the LCA in the reporting line, who in retrospect made the correct decision
  • Migration approach: one-shot vs incremental
    • Intuition says to migrate the data incrementally, but I lobbied for the one-shot approach after doing my research
      • I researched 5 cases with clients in similar industries and talked to 2 of them directly to understand why they did not choose incremental
      • Got confirmation from PingCAP's solution architects on not choosing incremental
    • The proposal caused stress among higher-ups. As a remedy, every migration runbook includes near-real-time verification and a rollback plan

How I did verification

  • Query replay - to verify server-side behavior is as expected. Note that since it is Aurora, we can't use a proxy service to capture traffic and relay it to a sidecar without taking prod downtime (a hedged replay sketch follows this list)
  • Binlog + EMR job to check data consistency across domains
    • To make sure binlog replication is behaving as expected
    • Standard sync-diff verification applies too (a chunk-checksum sketch also follows this list)
  • Traffic replay - to verify that client-side libraries (JDBC, connection pool, etc.) maintain the same behavior as with MySQL
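
Below is a minimal sketch of the query-replay check, assuming captured read-only statements are available in a flat file and both sides are reachable over the MySQL protocol; the hostnames, credentials, file name, and capture format are illustrative, not the actual tooling.

```python
# Hypothetical query-replay checker: replay captured read-only statements against both
# the Aurora source and the TiDB target and compare a checksum of each result set.
import hashlib
import pymysql

AURORA = dict(host="aurora.example.internal", user="verify", password="...", database="app")
TIDB   = dict(host="tidb.example.internal",   user="verify", password="...", database="app")

def result_digest(conn, sql):
    """Run one statement and hash the result set (rows sorted, so ordering differences don't matter)."""
    with conn.cursor() as cur:
        cur.execute(sql)
        rows = cur.fetchall()
    h = hashlib.sha256()
    for row in sorted(rows, key=repr):
        h.update(repr(row).encode("utf-8"))
    return len(rows), h.hexdigest()

def replay(captured_sql_path):
    src, dst = pymysql.connect(**AURORA), pymysql.connect(**TIDB)
    mismatches = 0
    with open(captured_sql_path) as f:
        for line_no, sql in enumerate(f, 1):
            sql = sql.strip()
            if not sql.lower().startswith("select"):
                continue  # replay read-only statements only
            if result_digest(src, sql) != result_digest(dst, sql):
                mismatches += 1
                print(f"[MISMATCH] line {line_no}: {sql[:120]}")
    src.close()
    dst.close()
    print(f"done, {mismatches} mismatching statements")

if __name__ == "__main__":
    replay("captured_queries.sql")
```

For the sync-diff point, here is a chunk-checksum comparison in the same spirit (primary-key ranges, one aggregate CRC per chunk). The table, columns, and chunk size are assumptions for illustration; this does not reproduce the actual binlog + EMR pipeline.

```python
# Sync-diff-style consistency check: compare an aggregate checksum per primary-key chunk.
import pymysql

CHECKSUM_SQL = """
SELECT COUNT(*)                                                       AS row_count,
       COALESCE(BIT_XOR(CRC32(CONCAT_WS('#', id, col_a, col_b))), 0)  AS chunk_crc
FROM orders
WHERE id >= %s AND id < %s
"""

def chunk_checksum(conn, lo, hi):
    with conn.cursor() as cur:
        cur.execute(CHECKSUM_SQL, (lo, hi))
        return cur.fetchone()  # (row_count, chunk_crc)

def compare_table(src_cfg, dst_cfg, max_id, chunk_size=100_000):
    src, dst = pymysql.connect(**src_cfg), pymysql.connect(**dst_cfg)
    for lo in range(0, max_id, chunk_size):
        hi = lo + chunk_size
        if chunk_checksum(src, lo, hi) != chunk_checksum(dst, lo, hi):
            print(f"chunk [{lo}, {hi}) differs - fall back to row-by-row comparison")
    src.close()
    dst.close()
```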

How we performed the migration drill

  • Detailed runbook with a DRI and a reviewer at each step (a minimal step-tracking sketch follows this list)
  • I acted as the coordinator, measuring the process and overall correctness
  • Each step was timed and publicly announced by me and the action taker
  • The drill was run 4 times before the actual migration; the last run was a complete drill with all stakeholders in attendance
  • The last 2 drills completed without any errors
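
As an illustration of the timed, announced steps above, here is a minimal sketch of a runbook step tracker; the field names and timing logic are my own, not the actual drill tooling.

```python
# Hypothetical runbook step tracker: every step has a DRI and a reviewer, and its
# wall-clock duration is measured and announced, mirroring the drill process above.
import time
from dataclasses import dataclass

@dataclass
class RunbookStep:
    name: str
    dri: str       # directly responsible individual executing the step
    reviewer: str  # second pair of eyes signing off on the result

    def execute(self, action):
        print(f"[START] {self.name} (DRI: {self.dri}, reviewer: {self.reviewer})")
        started = time.time()
        action()                       # the actual operation, e.g. "pause writes"
        elapsed = time.time() - started
        print(f"[DONE]  {self.name} took {elapsed:.1f}s")
        return elapsed

# Example usage during a drill:
# RunbookStep("freeze writes", dri="alice", reviewer="bob").execute(lambda: time.sleep(1))
```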

Setbacks/Unknown unknowns we ran into

  • Two weeks before going live: query replay verification failed on prod
  • 36 hours before going live: ran out of snowflake IDs on prod (capacity arithmetic is sketched after this list)
  • Kept seeing concurrent read-write errors even though the code shows no concurrency during writes
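
The notes do not state the exact ID layout, so as a hedged illustration only: assuming a common Twitter-style 64-bit snowflake split, the arithmetic below shows where each field's headroom ends, which is the kind of budget that gets exhausted under high TPS or when a field is narrowed.

```python
# Assumed Twitter-style 64-bit snowflake layout, purely to illustrate the headroom.
TIMESTAMP_BITS = 41   # milliseconds since a custom epoch
WORKER_BITS    = 10   # generator/worker id
SEQUENCE_BITS  = 12   # per-millisecond counter per worker

ids_per_ms_per_worker  = 2 ** SEQUENCE_BITS                # 4096
ids_per_sec_per_worker = ids_per_ms_per_worker * 1000      # ~4.1M
epoch_lifetime_years   = (2 ** TIMESTAMP_BITS) / (1000 * 60 * 60 * 24 * 365.25)

print(f"max IDs per worker per ms : {ids_per_ms_per_worker}")
print(f"max IDs per worker per sec: {ids_per_sec_per_worker}")
print(f"timestamp field lasts ~{epoch_lifetime_years:.0f} years after the custom epoch")
# If any field is narrowed (fewer sequence bits, or an epoch chosen years in the past),
# bursty prod traffic can exhaust the remaining space much sooner than expected.
```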