Source code

How tidb writes the binlog

  • The pump we deployed is actually the pump server(PS) process. Tidb process has a pump client (PC) that writes to PS during a txn
    • PC uses PS’s interface WriteBinlog(context.Context, *WriteBinlogReq) (*WriteBinlogResp, error)
    • PC implmentation: func (c *PumpsClient) WriteBinlog(binlog *pb.Binlog) error
  • PC maintains a list of available PS and a list of unavailable PS
    • Subscribe to PD and listen for its updates on PS status
    • Reguarly heartbeat to PS on the unavailable list, and try to add it to back to the available list
    • Keep an ErrNum for each PS. Increment it on failed binlog writes, and reset it on success. If a PS’s ErrNum is too high (default 10), add it to the unavailable list
  • If PC failed to write the prewrite binlog:
    • If err is ResourceExhausted error from grpc, stop retrying
    • Otherwise, try every single available pump once
    • If still no luck, try every single unavailable pump once
    • If still no luck, declare that we failed to write to binlog
  • If PC failed to write the commit binlog, PC will not retry, because PS can recover it from Tikv after 10 mins (default max txn timeout)
    • Note this is a common source of replication lag
  • When pump gracefully shutdown, e.g., via pdctl, it will endter the readonly mode first, wait until all drainers have read the binlog, and then shutdown

When do we lose binlog

Assuming we turn on ignore-error

  • binlog stored on PS is lost forever
  • All PSs are down
  • PC encounters grpc’s ResourceExhausted error. Note this case very likely we don’t lose any binlog, because most likely the main txn commit will fail too

Default settings

	// RetryInterval is the interval of retrying to write binlog.
	RetryInterval = 100 * time.Millisecond

// DefaultBinlogWriteTimeout is the default max time binlog can use to write to pump.
	DefaultBinlogWriteTimeout = 15 * time.Second