Original in Chinese

  1. Profile everything, even in production - to catch once-in-a-lifetime bugs

  2. Tests may make your code less beautiful - you may need to add members to structs just for testing, but you still need to design for testability

  3. Fault injection: to test network failure cases, automate it so that no human intervention is needed - otherwise it is inefficient

  4. Disks fail: failure rate is > 8% after 3 years

  5. Important to monitor NTP and detect the clock jumping back => normally bad!!

  6. Reading data from disk without a checksum => no protection against potential data corruption

  7. Fault injection: disk errors, network card, CPU, clock, file system, network & protocol => need to simulate everything so that you can inject errors

  8. Common tools: libfiu, OpenStack fault injection factory, Jepsen (the most famous)

  9. FoundationDB limitation: its fake multi-process approach does not work well with languages where a single thread is in effect multi-threaded (e.g., channels)

  10. TiKV uses Namazu. Planning to introduce OpenTracing (in Go) to fill a role similar to Google Dapper

  11. Don't test failure cases by triggering failures automatically; use your simulation layer
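
As a rough illustration of item 6, the sketch below prefixes each block with a CRC32 and verifies it on read. The 4-byte header layout and the names `encode_block`, `decode_block`, and `ChecksumError` are assumptions made for this example, not something described in the talk.

```python
import struct
import zlib

# Hypothetical on-disk layout: 4-byte little-endian CRC32, then the payload.
HEADER = struct.Struct("<I")


class ChecksumError(Exception):
    """Raised when stored bytes no longer match their recorded checksum."""


def encode_block(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 so corruption can be detected on read."""
    return HEADER.pack(zlib.crc32(payload)) + payload


def decode_block(block: bytes) -> bytes:
    """Verify the CRC32 prefix and return the payload, or raise ChecksumError."""
    if len(block) < HEADER.size:
        raise ChecksumError("block shorter than checksum header")
    (expected,) = HEADER.unpack_from(block)
    payload = block[HEADER.size:]
    if zlib.crc32(payload) != expected:
        raise ChecksumError("checksum mismatch")
    return payload
```

A reader that calls `decode_block` on every block it loads turns silent corruption into an explicit error that can be retried against another replica.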
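
For item 5, one possible way to detect the clock jumping back is to compare each wall-clock reading with the previous one. This is only a sketch: `ClockJumpDetector` is a hypothetical name, and the clock source is passed in as a parameter so a simulated clock can be injected in tests.

```python
import time
from typing import Callable, Optional


class ClockJumpDetector:
    """Tracks successive wall-clock readings and reports backward jumps."""

    def __init__(self, now: Callable[[], float] = time.time) -> None:
        self._now = now  # injectable clock source (assumption for testability)
        self._last: Optional[float] = None

    def check(self) -> float:
        """Return how many seconds the clock moved backwards since the
        previous call, or 0.0 if it did not move backwards."""
        current = self._now()
        jump = 0.0
        if self._last is not None and current < self._last:
            jump = self._last - current
        self._last = current
        return jump
```

A monitoring loop could call `check()` periodically and raise an alert whenever the result is non-zero.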
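
Items 7 and 11 argue for a simulation layer that faults can be injected into. The following is a minimal sketch of that idea for the network only, assuming an in-memory transport; `FaultyNetwork` and its methods are hypothetical names and are not taken from libfiu, Jepsen, or Namazu.

```python
import random
from typing import Dict, List, Set, Tuple


class FaultyNetwork:
    """In-memory stand-in for a network with scriptable faults."""

    def __init__(self, seed: int = 0, drop_rate: float = 0.0) -> None:
        self._rng = random.Random(seed)  # seeded so failures are reproducible
        self.drop_rate = drop_rate       # probability of dropping a message
        self._partitioned: Set[frozenset] = set()
        self.inboxes: Dict[str, List[Tuple[str, bytes]]] = {}

    def partition(self, a: str, b: str) -> None:
        """Block all traffic between nodes a and b, in both directions."""
        self._partitioned.add(frozenset((a, b)))

    def heal(self) -> None:
        """Remove every partition."""
        self._partitioned.clear()

    def send(self, src: str, dst: str, msg: bytes) -> bool:
        """Deliver msg to dst's inbox; return False if a fault swallowed it."""
        if frozenset((src, dst)) in self._partitioned:
            return False
        if self._rng.random() < self.drop_rate:
            return False
        self.inboxes.setdefault(dst, []).append((src, msg))
        return True
```

Because the faults are driven by code and a fixed seed, a test can script a partition, assert on the system's behavior, and replay the same failure without anyone unplugging a cable.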