Retro: TiDB migration for the txn history re-architecture
Scenario and goals
- Major refactoring: centralize multiple payment history tables into one, to act as the OLTP and light-OLAP bedrock for front-line services
- 2k writes/sec; 1k reads/sec with p99 < 2 sec
- Gain confidence for the critical-path migration later on
Timeline
- June - PoC with load testing
- July - Dev starts
- Aug - Load testing (which surfaced problems); DR drills
- Sep - Turn on incremental traffic switch
- Oct - 100% traffic switch
Major decision dilemmas / known unknowns, and how I resolved them
- DB selection
  - PoCed Vitess, TiDB, and CockroachDB
  - Vitess showed huge operational problems even during the PoC
  - CockroachDB was feasible, but our stack is MySQL-based, so we favored TiDB
- Hosting solution: EKS + the TiDB operator, or EC2 + Ansible
  - The Ansible route proved to be too much effort
  - Because we could switch traffic incrementally, taking a risk on the k8s operator was acceptable, and it is the direction we are heading anyway
- Data transfer approach: direct between DBs, or via Kafka
  - The Kafka approach means we need no coordination logic
  - But our Kafka pipeline implementation uses single-row operations, which cut throughput by a large margin (see the sketch after this list)
  - My personal preference was direct between DBs, but the implementation team preferred the Kafka approach; in the end I approved the team's decision
    - I had no way to prove the Kafka approach would perform worse.
    - I didn't force a PoC, because that would have meant implementing the coordination logic regardless, which we all knew was non-trivial
    - I trusted the team, which I had not worked with much before
  - In retro, I should have forced a PoC of the team's preferred approach, even at the cost of the timeline.
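For illustration, here is a minimal sketch of the throughput gap, assuming a Python consumer built on kafka-python and pymysql (TiDB speaks the MySQL protocol); the topic, table, and column names are hypothetical. The point is that a single-row write pays a full client round trip plus a commit per message, while a multi-row insert amortizes both over the whole batch.

```python
# Minimal sketch: why single-row writes capped the Kafka pipeline's throughput.
# Hypothetical topic/table/column names; assumes kafka-python and pymysql.
import json
import pymysql
from kafka import KafkaConsumer

conn = pymysql.connect(host="tidb.internal", user="writer",
                       password="...", database="txn_history",
                       autocommit=False)

consumer = KafkaConsumer("txn-history-backfill",
                         bootstrap_servers="kafka.internal:9092",
                         enable_auto_commit=False)

SQL = ("INSERT INTO txn_history (txn_id, account_id, amount, created_at) "
       "VALUES (%s, %s, %s, %s)")

def slow_single_row(msg):
    # What our pipeline effectively did: one INSERT + one commit per message.
    # Every write pays its own round trip and its own commit.
    row = json.loads(msg.value)
    with conn.cursor() as cur:
        cur.execute(SQL, (row["txn_id"], row["account_id"],
                          row["amount"], row["created_at"]))
    conn.commit()

def fast_batched(batch):
    # Multi-row insert amortizes the round trip and the commit over ~500 rows.
    rows = [json.loads(m.value) for m in batch]
    with conn.cursor() as cur:
        cur.executemany(SQL, [(r["txn_id"], r["account_id"],
                               r["amount"], r["created_at"]) for r in rows])
    conn.commit()

batch = []
for msg in consumer:
    batch.append(msg)
    if len(batch) >= 500:
        fast_batched(batch)
        consumer.commit()  # only advance offsets after the DB commit succeeds
        batch = []
```

The direct DB-to-DB path gets this batching essentially for free, which is one way to read why the single-row Kafka implementation fell so far behind.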
How we migrated
- From BigQuery analysis, we knew that data effectively becomes immutable after 3 months. So we first migrated data up to T - 3 months and took a backup of it. This means the Kafka pipeline has to replay at most 3 months of data (in practice, a little over one month) every time we need to re-run (see the first sketch below)
- A recon job that calculates checksums between the old and new architectures based on domain logic (see the second sketch below)
- A new service, powered by TiDB, running in parallel with the old one, with OpenResty in front of both to do the traffic switch (see the third sketch below)
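A sketch of the two-phase backfill split, with hypothetical table and column names: the immutable range (older than the BigQuery-derived 3-month cutoff) is copied once and backed up, and only the mutable tail is ever replayed through the Kafka pipeline.

```python
# Two-phase backfill plan; table/column names are hypothetical.
from datetime import date, timedelta

IMMUTABLE_AFTER = timedelta(days=90)  # from the BigQuery analysis: older rows no longer change

def plan_backfill(today: date) -> tuple[str, str]:
    cutoff = today - IMMUTABLE_AFTER
    # Phase 1: one-off bulk copy of the immutable range, then snapshot it so
    # re-runs can restore from backup instead of copying again.
    bulk_copy = f"SELECT * FROM legacy_txn_history WHERE created_at < '{cutoff}'"
    # Phase 2: every (re-)run of the Kafka pipeline replays only the mutable
    # tail, so a failed run costs at most ~3 months of replay, not years.
    replay = f"SELECT * FROM legacy_txn_history WHERE created_at >= '{cutoff}'"
    return bulk_copy, replay
```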
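A sketch of the recon job's core idea, with hypothetical column names and a made-up normalization: rows from each side are reduced to a canonical domain-level tuple before hashing, so purely structural differences between the old and new schemas don't register as mismatches, and the per-day digest is order-independent so both sides can scan in any order.

```python
# Recon sketch: compare per-day checksums between old and new schemas.
import hashlib

def canonical(row: dict) -> str:
    # Domain-level normalization (hypothetical): amounts in minor units,
    # statuses upper-cased, so old- and new-schema rows hash identically.
    return f'{row["txn_id"]}|{int(row["amount_cents"])}|{row["status"].upper()}'

def day_checksum(rows) -> int:
    # XOR of per-row digests: order-independent, so each side can scan
    # its table in whatever order is cheapest.
    acc = 0
    for row in rows:
        digest = hashlib.sha256(canonical(row).encode()).digest()
        acc ^= int.from_bytes(digest[:8], "big")
    return acc

def recon_day(day, fetch_old, fetch_new):
    # fetch_old/fetch_new are injected callables returning that day's rows
    # from the legacy and TiDB sides respectively.
    if day_checksum(fetch_old(day)) != day_checksum(fetch_new(day)):
        print(f"MISMATCH on {day}: drill down to a row-level diff")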
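And a sketch of the routing decision behind the incremental traffic switch. In production this lived in the OpenResty layer; it is shown here in Python only for consistency with the other sketches, and the percentage steps and backend names are hypothetical.

```python
# Incremental traffic switch: stable, percentage-based routing.
import zlib

NEW_PERCENT = 10  # hypothetical dial-up steps: 1 -> 10 -> 50 -> 100

def route(account_id: str) -> str:
    # Stable hash: a given account sticks to one backend at a given
    # percentage, which keeps recon and debugging sane mid-switch.
    bucket = zlib.crc32(account_id.encode()) % 100
    return "new-tidb-service" if bucket < NEW_PERCENT else "legacy-service"
```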
Setbacks / unknown unknowns we ran into
- By late July, we found that the Kafka pipeline's throughput was much lower than expected: only 2k writes/sec
- By late August, we found that we had to scope down the project: one low-traffic, high-effort component was dropped
- The schema changed twice AFTER the staging migration was completed; each change added an extra week of delay
- In the first week in production, the cluster disappeared from all monitors for 10 minutes
(TODO: add how we solved these problems)