When will tidb binlog replication lose data

How tidb writes the binlog

The pump we deployed is actually the pump server(PS) process. Tidb process has a pump client (PC) that writes to PS during a txn
- PC uses PS’s interface WriteBinlog(context.Context, *WriteBinlogReq) (*WriteBinlogResp, error)
- PC implmentation: func (c *PumpsClient) WriteBinlog(binlog *pb.Binlog) error
PC maintains a list of available PS and a list of unavailable PS
- Subscribe to PD and listen for its updates on PS status
- Reguarly heartbeat to PS on the unavailable list, and try to add it to back to the available list
- Keep an ErrNum for each PS. Increment it on failed binlog writes, and reset it on success. If a PS’s ErrNum is too high (default 10), add it to the unavailable list
If PC failed to write the prewrite binlog:
- If err is ResourceExhausted error from grpc, stop retrying
- Otherwise, try every single available pump once
- If still no luck, try every single unavailable pump once
- If still no luck, declare that we failed to write to binlog
If PC failed to write the commit binlog, PC will not retry, because PS can recover it from Tikv after 10 mins (default max txn timeout)
- Note this is a common source of replication lag
When pump gracefully shutdown, e.g., via pdctl, it will endter the readonly mode first, wait until all drainers have read the binlog, and then shutdown

When do we lose binlog

Assuming we turn on ignore-error

binlog stored on PS is lost forever
All PSs are down
PC encounters grpc’s ResourceExhausted error. Note this case very likely we don’t lose any binlog, because most likely the main txn commit will fail too

Default settings

	// RetryInterval is the interval of retrying to write binlog.
	RetryInterval = 100 * time.Millisecond

// DefaultBinlogWriteTimeout is the default max time binlog can use to write to pump.
	DefaultBinlogWriteTimeout = 15 * time.Second