Health check and availability

ELB default health check timeout is 10 sec, which is overly generious for the same region case. Normally 2 second timeout with 3 attempts to declare failure is enough
Otherwise, by default, the instance will be taken offline at least 30 seconds after the failure happens, with traffic still directed to the faulty instance.
Problem with time-based availability metrics: MTTF / (MTTF + MTTR)
- In distributed systems, common to have part of it failed somewhere, how do you define “down”?
- How do you differentitate the impact of down during off hour and peak hour
Problem with count-based availability metrics: % of successful requests
- High volume user has higher impact on this metric
- Less traffic will come when user preceives the system is down, which makes it look better than it actually is
- Not showing how long a system is down
Both metrics above do not capture the down time pattern and different durations of outages
The longer the time window, the “better” availability metric appears