the morning paper

  • ELB default health check timeout is 10 sec, which is overly generious for the same region case. Normally 2 second timeout with 3 attempts to declare failure is enough
  • Otherwise, by default, the instance will be taken offline at least 30 seconds after the failure happens, with traffic still directed to the faulty instance.
  • Problem with time-based availability metrics: MTTF / (MTTF + MTTR)
    • In distributed systems, common to have part of it failed somewhere, how do you define “down”?
    • How do you differentitate the impact of down during off hour and peak hour
  • Problem with count-based availability metrics: % of successful requests
    • High volume user has higher impact on this metric
    • Less traffic will come when user preceives the system is down, which makes it look better than it actually is
    • Not showing how long a system is down
  • Both metrics above do not capture the down time pattern and different durations of outages
  • The longer the time window, the “better” availability metric appears