Scenario and goals

  • Major refactoring: centralize multiple payment history tables into one, to act as the OLTP and small-scale OLAP bedrock for front-line services
  • 2k writes per sec, 1k reads per sec, with p99 < 2 sec
  • Gain confidence for the critical-path migration later on

Timeline

  • June - PoC with load testing
  • July - Dev starts
  • Aug - Load testing with problems. DR drills
  • Sep - Turn on incremental traffic switch
  • Oct - 100% traffic switch

Major decision dilemmas/Known unknowns and how I solved them

  • DB selection
    • Ran PoCs of Vitess, TiDB, and CockroachDB
    • Vitess had huge operational problems even during the PoC
    • CockroachDB was feasible, but we are MySQL-based, so we favored TiDB (which is MySQL-compatible)
  • Hosting solution: EKS + operator or EC2 + Ansible
    • Ansible proved too much effort
    • Because we could switch traffic incrementally, it was acceptable to take the risk with the k8s operator, which is also the direction of the future
  • Data transfer approach: directly between DBs or through Kafka
    • The Kafka approach means we don’t need coordination logic
    • But our Kafka pipeline implementation uses single-row actions, which reduced speed by a large margin
    • My personal preference was direct transfer between DBs, but the implementation team preferred the Kafka approach. In the end I approved the team’s decision
      • I had no way to prove the Kafka approach wouldn’t perform as well.
      • I didn’t force a PoC of the direct approach, because that would have meant implementing the coordination logic regardless, which we all knew was not trivial
      • I trusted the team, which I hadn’t worked with much before
      • In retrospect, I should have forced a PoC of the team’s preferred approach, even at the cost of the timeline.
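
To illustrate why single-row actions in the Kafka pipeline cost so much speed, here is a minimal sketch of the alternative: grouping consumed records into batches so that each database round trip carries many rows. This is not the team's implementation. The table name, the columns (payment_id, status, amount_cents), and the helper names are all hypothetical, and the statement uses MySQL-style multi-row upsert syntax only as an example.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# Hypothetical row shape: (payment_id, status, amount_cents).
Row = Tuple[int, str, int]


def batched(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield lists of up to `size` rows, preserving input order."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_multi_row_upsert(table: str, batch: List[Row]) -> Tuple[str, list]:
    """Build one parameterized MySQL-style statement covering the whole batch.

    The table and column names are assumptions made for this sketch.
    """
    placeholders = ", ".join(["(%s, %s, %s)"] * len(batch))
    sql = (
        f"INSERT INTO {table} (payment_id, status, amount_cents) "
        f"VALUES {placeholders} "
        "ON DUPLICATE KEY UPDATE status = VALUES(status), "
        "amount_cents = VALUES(amount_cents)"
    )
    # Flatten the rows into one parameter list matching the placeholders.
    params = [value for row in batch for value in row]
    return sql, params


def count_round_trips(n_rows: int, batch_size: int) -> int:
    """Count how many statements would be sent for n_rows at a given batch size."""
    rows = ((i, "settled", 100) for i in range(n_rows))
    return sum(1 for _ in batched(rows, batch_size))
```

With a batch size of 1 the pipeline sends one statement per row; with a larger batch size the number of round trips drops proportionally. The actual throughput gain depends on the database and the network, so it would have to be measured in a PoC.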

How did we migrate

  • Based on BigQuery, we knew that data in effect becomes immutable after 3 months. So we migrated data up to T - 3 months first and took a backup of it. This means our Kafka pipeline needs to process at most 3 months of data (in effect, a little over one month) every time we need to re-run
  • A recon job that calculates checksums across the old and new architectures based on domain logic
  • A new service powered by TiDB running in parallel with the old one, with OpenResty in front of both to do the traffic switch
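
The recon job idea can be sketched as follows. This is only an illustration under stated assumptions: the notes do not specify the domain logic, so the record fields (payment_id, amount_cents, status), the per-day bucketing, and all function names here are hypothetical. The sketch computes an order-independent checksum per bucket on each side and reports the buckets that differ.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, object]


def canonical(record: Record, fields: Tuple[str, ...]) -> str:
    """Serialize only the fields that matter for the comparison.

    Joining with "|" is a simplification; values containing "|" would
    need escaping in a real job.
    """
    return "|".join(str(record[f]) for f in fields)


def bucket_checksums(
    records: Iterable[Record], key_field: str, fields: Tuple[str, ...]
) -> Dict[str, str]:
    """Group records by key_field and hash each group independent of row order."""
    per_bucket: Dict[str, List[str]] = {}
    for record in records:
        per_bucket.setdefault(str(record[key_field]), []).append(
            canonical(record, fields)
        )
    checksums: Dict[str, str] = {}
    for key, lines in per_bucket.items():
        digest = hashlib.sha256()
        # Sorting makes the checksum independent of read order on either side.
        for line in sorted(lines):
            digest.update(line.encode("utf-8"))
            digest.update(b"\n")
        checksums[key] = digest.hexdigest()
    return checksums


def mismatched_buckets(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Return bucket keys whose checksum differs or that exist on one side only."""
    keys = set(old) | set(new)
    return sorted(k for k in keys if old.get(k) != new.get(k))
```

Comparing per bucket rather than over the whole table narrows a mismatch down to a small range that can be re-migrated or inspected row by row.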

Setbacks/Unknown unknowns we ran into

  • By late July, we found that Kafka pipeline performance was much lower than expected: only 2k writes per sec
  • By late August, we found that we had to scope down the project: one low-traffic, high-effort component was dropped
  • The schema changed twice AFTER the staging migration was completed. Each change added an extra week of delay
  • In the first week in production, the cluster disappeared from all monitors for 10 minutes

(TODO: add how we solve such problems)