Retro: TiDB migration for the txn history re-architecture
Scenario and goals
- Major refactoring: centralize multiple payment history tables into one, to act as the OLTP and light-OLAP bedrock for front-line services
- 2k writes/sec; 1k reads/sec with p99 < 2 sec
- Gain confidence for the critical-path migration later on
Timeline
- June - PoC with load testing
- July - Dev starts
- Aug - Load testing (which surfaced problems); DR drills
- Sep - Turn on incremental traffic switch
- Oct - 100% traffic switch
Major decision dilemmas / known unknowns, and how I resolved them
- DB selection
  - PoCed Vitess, TiDB, and CockroachDB
  - Vitess showed huge operational problems even during the PoC
  - CockroachDB was feasible, but our stack is MySQL-based, so we favored TiDB
- Hosting solution: EKS + the TiDB operator, or EC2 + Ansible
  - The Ansible route proved to be too much effort
  - Because we could switch traffic incrementally, taking a risk on the k8s operator was acceptable, and it is the direction we are heading anyway
- Data transfer approach: direct between DBs, or via Kafka
  - The Kafka approach means we need no coordination logic
  - But our Kafka pipeline implementation uses single-row operations, which cut throughput by a large margin (see the sketch after this list)
  - My personal preference was direct between DBs, but the implementation team preferred the Kafka approach; in the end I approved the team's decision
    - I had no way to prove the Kafka approach would perform worse.
    - I didn't force a PoC, because that would have meant implementing the coordination logic regardless, which we all knew was non-trivial
    - I trusted the team, which I had not worked with much before
  - In retro, I should have forced a PoC of the team's preferred approach, even at the cost of the timeline.
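For illustration, here is a minimal sketch of the throughput gap, assuming a Python consumer built on kafka-python and pymysql (TiDB speaks the MySQL protocol); the topic, table, and column names are hypothetical. The point is that a single-row write pays a full client round trip plus a commit per message, while a multi-row insert amortizes both over the whole batch.

```python
# Minimal sketch: why single-row writes capped the Kafka pipeline's throughput.
# Hypothetical topic/table/column names; assumes kafka-python and pymysql.
import json
import pymysql
from kafka import KafkaConsumer

conn = pymysql.connect(host="tidb.internal", user="writer",
                       password="...", database="txn_history",
                       autocommit=False)

consumer = KafkaConsumer("txn-history-backfill",
                         bootstrap_servers="kafka.internal:9092",
                         enable_auto_commit=False)

SQL = ("INSERT INTO txn_history (txn_id, account_id, amount, created_at) "
       "VALUES (%s, %s, %s, %s)")

def slow_single_row(msg):
    # What our pipeline effectively did: one INSERT + one commit per message.
    # Every write pays its own round trip and its own commit.
    row = json.loads(msg.value)
    with conn.cursor() as cur:
        cur.execute(SQL, (row["txn_id"], row["account_id"],
                          row["amount"], row["created_at"]))
    conn.commit()

def fast_batched(batch):
    # Multi-row insert amortizes the round trip and the commit over ~500 rows.
    rows = [json.loads(m.value) for m in batch]
    with conn.cursor() as cur:
        cur.executemany(SQL, [(r["txn_id"], r["account_id"],
                               r["amount"], r["created_at"]) for r in rows])
    conn.commit()

batch = []
for msg in consumer:
    batch.append(msg)
    if len(batch) >= 500:
        fast_batched(batch)
        consumer.commit()  # only advance offsets after the DB commit succeeds
        batch = []
```

The direct DB-to-DB path gets this batching essentially for free, which is one way to read why the single-row Kafka implementation fell so far behind.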
How we migrated
- From BigQuery analysis, we knew that data effectively becomes immutable after 3 months. So we first migrated data up to T - 3 months and took a backup of it. This means the Kafka pipeline has to replay at most 3 months of data (in practice, a little over one month) every time we need to re-run (see the first sketch below)
- A recon job that calculates checksums between the old and new architectures based on domain logic (see the second sketch below)
- A new service, powered by TiDB, running in parallel with the old one, with OpenResty in front of both to do the traffic switch (see the third sketch below)
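A sketch of the two-phase backfill split, with hypothetical table and column names: the immutable range (older than the BigQuery-derived 3-month cutoff) is copied once and backed up, and only the mutable tail is ever replayed through the Kafka pipeline.

```python
# Two-phase backfill plan; table/column names are hypothetical.
from datetime import date, timedelta

IMMUTABLE_AFTER = timedelta(days=90)  # from the BigQuery analysis: older rows no longer change

def plan_backfill(today: date) -> tuple[str, str]:
    cutoff = today - IMMUTABLE_AFTER
    # Phase 1: one-off bulk copy of the immutable range, then snapshot it so
    # re-runs can restore from backup instead of copying again.
    bulk_copy = f"SELECT * FROM legacy_txn_history WHERE created_at < '{cutoff}'"
    # Phase 2: every (re-)run of the Kafka pipeline replays only the mutable
    # tail, so a failed run costs at most ~3 months of replay, not years.
    replay = f"SELECT * FROM legacy_txn_history WHERE created_at >= '{cutoff}'"
    return bulk_copy, replay
```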
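A sketch of the recon job's core idea, with hypothetical column names and a made-up normalization: rows from each side are reduced to a canonical domain-level tuple before hashing, so purely structural differences between the old and new schemas don't register as mismatches, and the per-day digest is order-independent so both sides can scan in any order.

```python
# Recon sketch: compare per-day checksums between old and new schemas.
import hashlib

def canonical(row: dict) -> str:
    # Domain-level normalization (hypothetical): amounts in minor units,
    # statuses upper-cased, so old- and new-schema rows hash identically.
    return f'{row["txn_id"]}|{int(row["amount_cents"])}|{row["status"].upper()}'

def day_checksum(rows) -> int:
    # XOR of per-row digests: order-independent, so each side can scan
    # its table in whatever order is cheapest.
    acc = 0
    for row in rows:
        digest = hashlib.sha256(canonical(row).encode()).digest()
        acc ^= int.from_bytes(digest[:8], "big")
    return acc

def recon_day(day, fetch_old, fetch_new):
    # fetch_old/fetch_new are injected callables returning that day's rows
    # from the legacy and TiDB sides respectively.
    if day_checksum(fetch_old(day)) != day_checksum(fetch_new(day)):
        print(f"MISMATCH on {day}: drill down to a row-level diff")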
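And a sketch of the routing decision behind the incremental traffic switch. In production this lived in the OpenResty layer; it is shown here in Python only for consistency with the other sketches, and the percentage steps and backend names are hypothetical.

```python
# Incremental traffic switch: stable, percentage-based routing.
import zlib

NEW_PERCENT = 10  # hypothetical dial-up steps: 1 -> 10 -> 50 -> 100

def route(account_id: str) -> str:
    # Stable hash: a given account sticks to one backend at a given
    # percentage, which keeps recon and debugging sane mid-switch.
    bucket = zlib.crc32(account_id.encode()) % 100
    return "new-tidb-service" if bucket < NEW_PERCENT else "legacy-service"
```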
Setbacks / unknown unknowns we ran into
- By late July, we found that the Kafka pipeline's throughput was much lower than expected: only 2k writes/sec
- By late August, we found that we had to scope down the project: one low-traffic, high-effort component was dropped
- The schema changed twice AFTER the staging migration was completed; each change added an extra week of delay
- In the first week in production, the cluster disappeared from all monitors for 10 minutes
(TODO: add how we solved these problems)