Distributed Systems Testing

Profile everything, even in production, to catch once-in-a-lifetime bugs
Tests may make your code less beautiful (e.g., you may need to add members to structs just for testing), but we still need to design for tests
Fault injection: to test the network failure case, automate it so it needs no human intervention; otherwise it is inefficient
Disk failure rate exceeds 8% after 3 years
Important to monitor NTP and detect the clock jumping back => normally bad!!
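
One way to detect a clock jump is to compare the wall clock against the monotonic clock between samples. A minimal sketch, assuming a polling monitor; `ClockJumpDetector`, `tolerance`, and the injectable `wall`/`mono` callables are hypothetical names for illustration, not part of any NTP tooling:

```python
import time


class ClockJumpDetector:
    """Flags wall-clock steps by comparing against the monotonic clock."""

    def __init__(self, tolerance=0.5, wall=time.time, mono=time.monotonic):
        self.tolerance = tolerance  # seconds of skew tolerated between samples
        self._wall = wall           # injectable so tests can fake the clocks
        self._mono = mono
        self._last_wall = wall()
        self._last_mono = mono()

    def check(self):
        """Return 'backward', 'forward', or 'ok' since the previous sample."""
        wall_now, mono_now = self._wall(), self._mono()
        # How far the wall clock moved minus how far monotonic time moved.
        skew = (wall_now - self._last_wall) - (mono_now - self._last_mono)
        self._last_wall, self._last_mono = wall_now, mono_now
        if skew < -self.tolerance:
            return "backward"   # wall clock stepped back (or fell behind)
        if skew > self.tolerance:
            return "forward"    # wall clock stepped ahead
        return "ok"
```

Taking the clocks as parameters also follows the design-for-tests note above: a test can feed fake time values instead of waiting for a real NTP step.
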
Reading data from disk without a checksum => no protection against potential data corruption
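
A minimal sketch of checksummed blocks, assuming a simple layout of a 4-byte little-endian CRC32 followed by the payload. The layout and the `encode_block`/`decode_block` names are illustrative assumptions, not any particular storage engine's format:

```python
import struct
import zlib


def encode_block(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 so corruption is detectable on read."""
    return struct.pack("<I", zlib.crc32(payload)) + payload


def decode_block(block: bytes) -> bytes:
    """Verify the stored CRC32 and return the payload, or raise on mismatch."""
    if len(block) < 4:
        raise ValueError("block too short to contain a checksum")
    (stored,) = struct.unpack("<I", block[:4])
    payload = block[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch: data is corrupted")
    return payload
```

CRC32 catches accidental corruption such as bit flips; it is not a defense against deliberate tampering.
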
Fault injection: disk error, network card, CPU, clock, file system, network & protocol => need to simulate everything so that you can inject errors
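
A rough sketch of what an injectable simulation layer could look like: named fault points that tests arm and disarm, here wrapped around an in-memory disk. `FaultInjector`, `SimulatedDisk`, and the fault-point names are hypothetical; this is loosely in the spirit of fail-point tools such as libfiu, not their actual API:

```python
import random


class FaultInjector:
    """Registry of named fault points that tests can arm (hypothetical API)."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)  # seeded so runs are reproducible
        self._faults = {}                # name -> (probability, exception)

    def enable(self, name, exc, probability=1.0):
        self._faults[name] = (probability, exc)

    def disable(self, name):
        self._faults.pop(name, None)

    def hit(self, name):
        """Call at a fault point; raises the armed error with its probability."""
        fault = self._faults.get(name)
        if fault is not None:
            probability, exc = fault
            if self._rng.random() < probability:
                raise exc


class SimulatedDisk:
    """In-memory disk whose reads and writes pass through fault points."""

    def __init__(self, injector):
        self._injector = injector
        self._blocks = {}

    def write(self, key, data):
        self._injector.hit("disk.write")
        self._blocks[key] = data

    def read(self, key):
        self._injector.hit("disk.read")
        return self._blocks[key]
```

Usage: `injector.enable("disk.read", OSError("injected I/O error"))` makes the next read fail, and `injector.disable("disk.read")` restores normal behavior, so a test can exercise the error path with no human in the loop.
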
Common tools: libfiu, OpenStack fault injection factory, Jepsen (the most famous)
FoundationDB limitation: its fake multi-process simulation does not work well with languages where a single thread is in effect multi-threaded (e.g., channels)
TiKV uses Namazu. Planning to introduce OpenTracing (in Go) to fill a role similar to Google Dapper
Don't test failure cases by triggering failures automatically; use your simulation layer
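
A minimal sketch of testing a failure case through a simulation layer rather than a real network: an in-process message bus drops messages using a seeded RNG, so a failing run can be replayed exactly from its seed. `SimulatedNetwork`, `send_with_retry`, and `run_scenario` are hypothetical names for illustration:

```python
import random


class SimulatedNetwork:
    """In-process message bus that drops messages using a seeded RNG."""

    def __init__(self, seed, drop_rate=0.0):
        self._rng = random.Random(seed)
        self.drop_rate = drop_rate
        self.delivered = []  # (src, dst, msg) tuples that got through
        self.dropped = []    # (src, dst, msg) tuples the link dropped

    def send(self, src, dst, msg):
        """Return True if delivered, False if the simulated link dropped it."""
        if self._rng.random() < self.drop_rate:
            self.dropped.append((src, dst, msg))
            return False
        self.delivered.append((src, dst, msg))
        return True


def send_with_retry(net, src, dst, msg, max_attempts=5):
    """Code under test: retry until the send succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        if net.send(src, dst, msg):
            return attempt  # number of attempts it took
    return None             # gave up


def run_scenario(seed):
    """Run 20 sends over a lossy link; the same seed gives the same result."""
    net = SimulatedNetwork(seed=seed, drop_rate=0.5)
    return [send_with_retry(net, "a", "b", i) for i in range(20)]
```

Because the only source of randomness is the seed, `run_scenario(42)` returns the same list on every run, which is what makes a once-in-a-lifetime failure debuggable.
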