When will tidb binlog replication lose data
How tidb writes the binlog
- The pump we deployed is actually the pump server(PS) process. Tidb process has a pump client (PC) that writes to PS during a txn
- PC uses PS’s interface
WriteBinlog(context.Context, *WriteBinlogReq) (*WriteBinlogResp, error)
- PC implmentation:
func (c *PumpsClient) WriteBinlog(binlog *pb.Binlog) error
- PC uses PS’s interface
- PC maintains a list of available PS and a list of unavailable PS
- Subscribe to PD and listen for its updates on PS status
- Reguarly heartbeat to PS on the unavailable list, and try to add it to back to the available list
- Keep an
ErrNum
for each PS. Increment it on failed binlog writes, and reset it on success. If a PS’sErrNum
is too high (default 10), add it to the unavailable list
- If PC failed to write the prewrite binlog:
- If err is
ResourceExhausted
error from grpc, stop retrying - Otherwise, try every single available pump once
- If still no luck, try every single unavailable pump once
- If still no luck, declare that we failed to write to binlog
- If err is
- If PC failed to write the commit binlog, PC will not retry, because PS can recover it from Tikv after 10 mins (default max txn timeout)
- Note this is a common source of replication lag
- When pump gracefully shutdown, e.g., via pdctl, it will endter the readonly mode first, wait until all drainers have read the binlog, and then shutdown
When do we lose binlog
Assuming we turn on ignore-error
- binlog stored on PS is lost forever
- All PSs are down
- PC encounters grpc’s
ResourceExhausted
error. Note this case very likely we don’t lose any binlog, because most likely the main txn commit will fail too
Default settings
// RetryInterval is the interval of retrying to write binlog.
RetryInterval = 100 * time.Millisecond
// DefaultBinlogWriteTimeout is the default max time binlog can use to write to pump.
DefaultBinlogWriteTimeout = 15 * time.Second