the morning paper
- The ELB default health check timeout is 10 seconds, which is overly generous when the load balancer and instances are in the same region. Normally a 2-second timeout with 3 failed attempts before declaring failure is enough
- Otherwise, with the defaults, a faulty instance is taken offline no sooner than 30 seconds after the failure happens, and traffic is still directed to it in the meantime.
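
As a rough illustration of the health check arithmetic above, the sketch below estimates the worst-case delay before a failed instance is pulled from rotation. The `HealthCheck` class, its field names, and the probe-timing model are assumptions made for this example, not an AWS API. The probe intervals (5 and 30 seconds) and the lenient threshold of 2 are also illustrative, since the notes only give the timeouts and the 3 attempts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheck:
    """Hypothetical health check settings (names are illustrative)."""

    interval_s: float          # time between the starts of consecutive probes
    timeout_s: float           # how long one probe waits before counting as failed
    unhealthy_threshold: int   # consecutive failed probes before removal

    def worst_case_detection_s(self) -> float:
        """Upper bound on the delay between a failure and removal.

        Model (an assumption): the instance fails just after a probe
        succeeded, so a full interval passes before the first failing
        probe starts, and the last failing probe runs until it times out.
        Assumes timeout_s <= interval_s.
        """
        return self.interval_s * self.unhealthy_threshold + self.timeout_s


# Tight settings in the spirit of the notes: 2 s timeout, 3 attempts.
# The 5 s interval is an assumed value.
tight = HealthCheck(interval_s=5, timeout_s=2, unhealthy_threshold=3)

# Lenient settings: 10 s timeout as in the notes; the 30 s interval and
# threshold of 2 are assumed values for comparison only.
lenient = HealthCheck(interval_s=30, timeout_s=10, unhealthy_threshold=2)

print("tight:  ", tight.worst_case_detection_s(), "seconds")
print("lenient:", lenient.worst_case_detection_s(), "seconds")
```

Under this model the interval dominates the detection delay, so shortening the timeout alone helps little unless the interval is reduced as well.
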
- Problems with time-based availability metrics, MTTF / (MTTF + MTTR):
  - In a distributed system, some part is commonly failing somewhere, so how do you define “down”?
  - How do you differentiate the impact of downtime during off hours from downtime during peak hours?
- Problems with count-based availability metrics, the % of successful requests:
  - High-volume users have a disproportionate impact on this metric
  - Less traffic arrives once users perceive the system as down, which makes the metric look better than it actually is
  - It does not show how long the system was down
- Neither metric above captures the pattern of downtime or the differing durations of outages
- The longer the time window, the “better” the availability metric appears
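
The weaknesses listed above can be checked with a few lines of arithmetic. This is a minimal sketch: the function names, the 30-day month, and the request rates are all hypothetical values chosen for illustration, not measurements.

```python
def time_based_availability(mttf: float, mttr: float) -> float:
    """MTTF / (MTTF + MTTR), with both values in the same time unit."""
    return mttf / (mttf + mttr)


def count_based_availability(successful: int, total: int) -> float:
    """Fraction of requests that succeeded."""
    return successful / total


def uptime_fraction(outage_minutes: list[float], window_minutes: float) -> float:
    """Share of the window during which the system was up."""
    return 1 - sum(outage_minutes) / window_minutes


DAY = 24 * 60       # minutes
MONTH = 30 * DAY    # minutes, assuming a 30-day month

# Same total downtime, very different outage patterns, identical score.
one_long = uptime_fraction([60], MONTH)
many_short = uptime_fraction([1] * 60, MONTH)

# Same 60-minute outage, longer window, "better" number.
over_day = uptime_fraction([60], DAY)
over_month = uptime_fraction([60], MONTH)

# Count-based: users back off during the outage, so fewer failures are
# recorded and the success ratio looks better than the outage deserves.
NORMAL_RPM = 1000   # hypothetical requests per minute in normal operation
BACKOFF_RPM = 100   # hypothetical rate while users perceive the system as down
up_minutes, down_minutes = DAY - 60, 60
ok = up_minutes * NORMAL_RPM
steady = count_based_availability(ok, ok + down_minutes * NORMAL_RPM)
backed_off = count_based_availability(ok, ok + down_minutes * BACKOFF_RPM)

print(f"one long outage vs sixty short ones: {one_long:.6f} vs {many_short:.6f}")
print(f"60 min outage over a day vs a month: {over_day:.6f} vs {over_month:.6f}")
print(f"count-based, steady vs reduced traffic: {steady:.6f} vs {backed_off:.6f}")
```

In this sketch a single hour-long outage and sixty one-minute blips score the same, and the same outage looks milder simply because the reporting window is longer or because users stopped sending requests.
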