Distributed Systems Testing - from TiDB talk
-
Profile everything, even in production - to catch once-in-a-lifetime bugs
-
Tests may make your code less beautiful - you may need to add members to structs just for testing, but we still need to design for tests
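A minimal sketch of a test-only struct member, not taken from the talk. The `Lease` type, its `now` field, and the helpers `newLease` and `at` are all hypothetical names. The point is that one extra field, unused in production, lets a test control time.

```go
package main

import (
	"fmt"
	"time"
)

// epoch is an arbitrary fixed instant used by the helpers below.
var epoch = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)

// Lease is a hypothetical type used only for illustration.
type Lease struct {
	expiresAt time.Time

	// now exists only so tests can control time; production code leaves it nil.
	now func() time.Time
}

// currentTime returns the injected clock if set, otherwise the real clock.
func (l *Lease) currentTime() time.Time {
	if l.now != nil {
		return l.now()
	}
	return time.Now()
}

// Expired reports whether the lease has reached its expiry time.
func (l *Lease) Expired() bool {
	return !l.currentTime().Before(l.expiresAt)
}

// newLease returns a lease that expires ttlSec seconds after epoch.
func newLease(ttlSec int) *Lease {
	return &Lease{expiresAt: epoch.Add(time.Duration(ttlSec) * time.Second)}
}

// at returns a fake clock frozen sec seconds after epoch.
func at(sec int) func() time.Time {
	return func() time.Time { return epoch.Add(time.Duration(sec) * time.Second) }
}

func main() {
	l := newLease(10)
	l.now = at(5)
	fmt.Println("expired at t=5s:", l.Expired())
	l.now = at(11)
	fmt.Println("expired at t=11s:", l.Expired())
}
```

The `now` field is the "less beautiful" part: it is dead weight in production, but without it the expiry path can only be tested by sleeping.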
-
Fault injection: to test network failure cases, try to automate it without human intervention - otherwise it is inefficient
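One way to automate a network failure without anyone pulling a cable, sketched under assumptions: the connection is wrapped in a type that starts failing after a scripted number of writes. `FlakyConn`, `ErrInjected`, `sendAll` and `runScenario` are hypothetical names, and an in-memory buffer stands in for a real socket.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// ErrInjected marks failures produced by the wrapper, not by a real link.
var ErrInjected = errors.New("injected network failure")

// FlakyConn wraps any io.ReadWriter and fails every write once FailAfter
// writes have succeeded.
type FlakyConn struct {
	Inner     io.ReadWriter
	FailAfter int // number of writes allowed before the "link" drops
	writes    int
}

func (c *FlakyConn) Write(p []byte) (int, error) {
	if c.writes >= c.FailAfter {
		return 0, ErrInjected
	}
	c.writes++
	return c.Inner.Write(p)
}

func (c *FlakyConn) Read(p []byte) (int, error) { return c.Inner.Read(p) }

// sendAll is the code under test: it writes each message in order and
// returns how many were sent before the first error.
func sendAll(w io.Writer, msgs []string) (int, error) {
	for i, m := range msgs {
		if _, err := w.Write([]byte(m)); err != nil {
			return i, err
		}
	}
	return len(msgs), nil
}

// runScenario drives sendAll through a FlakyConn backed by a buffer.
func runScenario(failAfter int, msgs []string) (int, error) {
	conn := &FlakyConn{Inner: &bytes.Buffer{}, FailAfter: failAfter}
	return sendAll(conn, msgs)
}

func main() {
	n, err := runScenario(2, []string{"a", "b", "c"})
	fmt.Println(n, err)
}
```

Because the failure point is a plain integer, a test can loop over every possible value of `FailAfter` and exercise each failure position with no human in the loop.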
-
Disks fail: > 8% failure rate after 3 years
-
Important to monitor NTP and detect the clock jumping back => normally bad!!
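A rough sketch of detecting a backward jump, assuming the monitor samples the wall clock periodically. `JumpDetector` and `Observe` are hypothetical names. In Go, subtracting two `time.Now()` values uses the monotonic reading and would hide a wall-clock step, so the detector takes `UnixNano()` readings instead.

```go
package main

import (
	"fmt"
	"time"
)

// JumpDetector remembers the last wall-clock reading and flags regressions.
type JumpDetector struct {
	last    int64 // last wall-clock reading, in Unix nanoseconds
	hasLast bool
}

// Observe records a wall-clock reading (Unix nanoseconds) and returns how far
// the clock moved backwards since the previous reading, or 0 if it did not.
func (d *JumpDetector) Observe(wallNanos int64) time.Duration {
	var back time.Duration
	if d.hasLast && wallNanos < d.last {
		back = time.Duration(d.last - wallNanos)
	}
	d.last = wallNanos
	d.hasLast = true
	return back
}

func main() {
	var d JumpDetector
	// Simulated readings; a real monitor would feed time.Now().UnixNano()
	// from a ticker instead.
	for _, sample := range []int64{1000, 2000, 1500} {
		if back := d.Observe(sample); back > 0 {
			fmt.Printf("wall clock jumped back by %v\n", back)
		}
	}
}
```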
-
Reading data from disk without checksum => no protection against potential data corruption
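A small sketch of the protection being described, assuming a simple block layout of a 4-byte CRC32 followed by the payload. The layout and the names `encodeBlock`, `decodeBlock` and `ErrCorrupt` are illustrative only, not how TiKV stores data.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// ErrCorrupt is returned when the stored checksum does not match the payload.
var ErrCorrupt = errors.New("checksum mismatch: data corrupted")

// encodeBlock prepends a 4-byte big-endian CRC32 (IEEE) of the payload.
func encodeBlock(payload []byte) []byte {
	out := make([]byte, 4+len(payload))
	binary.BigEndian.PutUint32(out[:4], crc32.ChecksumIEEE(payload))
	copy(out[4:], payload)
	return out
}

// decodeBlock verifies the stored checksum before returning the payload.
func decodeBlock(block []byte) ([]byte, error) {
	if len(block) < 4 {
		return nil, ErrCorrupt
	}
	want := binary.BigEndian.Uint32(block[:4])
	payload := block[4:]
	if crc32.ChecksumIEEE(payload) != want {
		return nil, ErrCorrupt
	}
	return payload, nil
}

func main() {
	block := encodeBlock([]byte("hello"))
	block[6] ^= 0x01 // flip one payload bit, as a failing disk might
	_, err := decodeBlock(block)
	fmt.Println(err)
}
```

Without the check in `decodeBlock`, the flipped bit would be handed to the caller as valid data.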
-
Fault injection: disk error, network card, CPU, clock, file system, network & protocol => need to simulate everything so that you can inject errors
-
Common tools: libfiu, OpenStack fault injection factory, Jepsen (the most famous)
-
FoundationDB limitation: fake multi-process does not work well with languages where a single thread is in effect multi-threaded (e.g., channels)
-
TiKV uses Namazu. Planning to introduce OpenTracing (in Go) to fill a role similar to Google Dapper
-
Don't test failure cases by triggering failures automatically; use your simulation layer
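What a simulation layer could look like, as a sketch under assumptions: production code depends on an interface rather than the OS, and tests swap in an in-memory implementation whose failures are scripted. `Storage`, `SimStorage`, `ErrDiskFull` and `save` are hypothetical names, not from TiDB or TiKV.

```go
package main

import (
	"errors"
	"fmt"
)

// Storage is the seam: code under test talks to this, never to the OS directly.
type Storage interface {
	Write(key string, value []byte) error
	Read(key string) ([]byte, error)
}

// ErrDiskFull is a scripted failure the simulation can return.
var ErrDiskFull = errors.New("simulated: no space left on device")

// ErrNotFound is returned by Read for a key that was never written.
var ErrNotFound = errors.New("simulated: key not found")

// SimStorage is an in-memory stand-in whose failures are set by the test.
type SimStorage struct {
	data     map[string][]byte
	WriteErr error // when non-nil, every Write fails with this error
}

func NewSimStorage() *SimStorage {
	return &SimStorage{data: map[string][]byte{}}
}

func (s *SimStorage) Write(key string, value []byte) error {
	if s.WriteErr != nil {
		return s.WriteErr
	}
	s.data[key] = append([]byte(nil), value...) // copy so callers cannot mutate it
	return nil
}

func (s *SimStorage) Read(key string) ([]byte, error) {
	v, ok := s.data[key]
	if !ok {
		return nil, ErrNotFound
	}
	return v, nil
}

// save is the code under test; it only sees the Storage interface.
func save(st Storage, key string, value []byte) error {
	return st.Write(key, value)
}

func main() {
	s := NewSimStorage()
	fmt.Println(save(s, "a", []byte("1")))
	s.WriteErr = ErrDiskFull // inject the fault; no real disk is harmed
	fmt.Println(save(s, "b", []byte("2")))
}
```

The fault is a field assignment, so it is deterministic and repeatable, which filling a real disk is not.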