On measuring performance
Getting performance data
- A better question to ask during performance testing is: while maintaining my SLO, what is the max throughput I can achieve?
- Develop the test harness early, so that you have enough time to refine the experiments and the metrics to collect - this is a time-consuming process!
- One script to automate running the experiments, and another to automate generating the report from the collected data
- Add additional tests to ensure a config change is indeed deployed
- Defend against fast but incorrect implementations. Include a correctness check as a non-timed part of the experiment, e.g., check data structure invariants after the timed portion
- Micro-benchmarks exercise certain aspects of the system. They are never a good representation of overall system health. Use macro-benchmarks that represent real-life workloads to show the overall picture
- What you measure is often response time instead of service time. Due to queueing and buffering, customers may experience latency that you do not see on your monitor.
- Introduce a baseline across the benchmarks and compare against the baseline instead of comparing the benchmarks with each other.
- The baseline is often the optimal or state-of-the-art solution
- Do the same run twice in a row to check for unintended caching effects, and do it twice AFTER some other runs to see both the cold and hot cache cases
- Use neighbors of the regular/power-of-two strides as well as random points, so that it is clear whether you ran into corner cases
- Run untimed warm-ups
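A minimal harness sketch tying several of the points above together: untimed warm-ups, the same run repeated back to back to expose caching effects, and a correctness check kept outside the timed region so a fast-but-wrong implementation cannot post a good number. The sorting workload is a stand-in assumption; substitute your own.

```python
import statistics
import time

def run_benchmark(workload, check, *, warmups=3, runs=10):
    """Untimed warm-ups, then repeated timed runs with an untimed check."""
    for _ in range(warmups):            # warm-ups: fill caches, never timed
        workload()
    timings = []
    for _ in range(runs):               # runs back to back: a large drop after
        start = time.perf_counter()     # the first timed run hints at an
        result = workload()             # unintended caching effect
        timings.append(time.perf_counter() - start)
        check(result)                   # correctness check, outside the timer
    return timings

def check_sorted(result):
    # A fast but wrong implementation must fail here, not post a good time.
    assert all(a <= b for a, b in zip(result, result[1:])), "invariant violated"

data = list(range(50_000, 0, -1))
timings = run_benchmark(lambda: sorted(data), check_sorted)
print(f"median {statistics.median(timings) * 1e3:.2f} ms over {len(timings)} runs")
```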
Interpreting performance data
- Many published performance numbers measure what was optimized, or the underlying system optimizes what is measured
- Separate the calibration workload from the evaluation workload. Otherwise, there is no way to show the predictive power of the model, which is the most important part of the model
- Clearly show the trade-off areas and the regression numbers, i.e., which numbers get better, which stay the same, and which get worse
- If only a subset of the tests is used, explain why the rest can be omitted. Note that an omitted case may display an entirely different trend
- Throughput and load analysis should come together
- Percentile and average metrics are very misleading, although they are good ways to tell stories. Rely on histograms instead if possible.
- A double-spike distribution is often a sign of multiple behavioral groups
- The “shoulder” of a gamma-like distribution is often a sign of a small but separate behavioral group
- Due to the number of requests needed to serve a single user, even a good 99.9th percentile means your user is still quite likely to hit a bad latency. Max latency carries a lot of weight in revealing problems.
- Performance degradation easily gets muffled in average metrics. Basically, averages are a good way to fool clients into feeling safe.
- A raw average must be accompanied by its standard deviation. A Student's t-test may also be needed for significance
- Don't take the arithmetic mean over a set of different benchmarks, and even less so if they are normalized. Use the geometric mean instead
- Use the geometric mean for multiplicative quantities, and the harmonic mean for rates, e.g., the average speed over two legs of equal distance traveled at speeds s1 and s2
- Give complete results instead of ratios
- Compare the exact data points instead of graphs over the same interval
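To make the 99.9th-percentile point concrete: if serving one user action fans out to n backend requests, each independently avoiding the slow tail with probability 0.999, the chance the user hits at least one tail latency is 1 - 0.999^n. A quick sketch (independence is an assumption; real fan-outs are often correlated):

```python
# Chance that a user action touching n backend requests hits the slow
# 0.1% tail at least once, assuming independent requests.
for n in (1, 10, 100, 1000):
    p_bad = 1 - 0.999 ** n
    print(f"{n:>5} requests -> {p_bad:6.1%}")
```

At 100 requests per action the user already has roughly a one-in-ten chance of seeing tail latency, which is why the max carries so much weight.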
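For the mean-and-standard-deviation point, a stdlib-only sketch with made-up sample latencies; the t statistic here is Welch's, computed by hand (a library routine such as scipy.stats.ttest_ind would also give a proper p-value):

```python
import math
import statistics

# Illustrative latency samples in ms (assumed numbers, not real data).
baseline = [10.2, 10.4, 9.9, 10.1, 10.3, 10.0, 10.2, 10.1]
candidate = [9.6, 9.8, 9.5, 9.9, 9.7, 9.6, 9.8, 9.7]

for name, xs in (("baseline", baseline), ("candidate", candidate)):
    print(f"{name}: mean {statistics.mean(xs):.2f} ± {statistics.stdev(xs):.2f} ms")

# Welch's t statistic; |t| well above ~2 suggests the difference is
# unlikely to be noise.
m1, m2 = statistics.mean(baseline), statistics.mean(candidate)
v1, v2 = statistics.variance(baseline), statistics.variance(candidate)
t = (m1 - m2) / math.sqrt(v1 / len(baseline) + v2 / len(candidate))
print(f"t = {t:.1f}")
```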
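The last two mean-related points in one sketch: normalized per-benchmark speedups summarized with the arithmetic vs the geometric mean, and the classic equal-distance trip where the average speed is the harmonic mean (the numbers are illustrative):

```python
import statistics

# Normalized benchmark scores (speedups vs a baseline): the geometric
# mean is the right summary; the arithmetic mean overstates the gain.
speedups = [4.0, 1.0, 0.25]
print(statistics.mean(speedups))            # 1.75 — looks like a win
print(statistics.geometric_mean(speedups))  # ~1.0 — actually a wash

# Average speed over two legs of equal distance: harmonic mean, not
# arithmetic. E.g. 60 km/h out and 30 km/h back is 40 km/h overall.
print(statistics.harmonic_mean([60, 30]))   # 40.0
```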
On percentiles
The most important thing about a metric is its distribution, and P99 says nothing about the remaining 1% or the preceding 98%. Therefore we can construct many counter-intuitive cases, and normally cannot apply transitivity and composition directly
Suppose P(99,X) = 100ms, where X consists of two sequential steps A and B, and P(99,A) = 60ms
- Can P(99,B) > 100ms? No: latencies are non-negative, so X >= B on every request, and each percentile of X bounds the corresponding percentile of B
- If the slowest 1% of A is always matched with the slowest 1% of B, then percentiles add and we can conclude P(99,B) = 40ms
- Can P(99,B) be in (40, 100]ms? Yes
- Suppose P(98,A) = 1ms, P(100,A) = 60ms, P(100,B) = P(98,B) = 99ms, P(1,B) = 1ms
- Then we can construct the distribution as P(1,X) = 61ms, P(99,X) = 100ms, P(100,X) = 159ms
- Can P(99,B) be in (0, 40)ms?
- Yes, suppose P(99,B) = 1ms, P(100,B) = 40ms, P(99,A) = P(1,A) = 60ms, P(100,A) = 99ms
- Then we can construct P(99,X) = P(100,X) = 100ms, P(98,X) = 61ms
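The second construction can be checked numerically with 100 paired samples per step, using a simple nearest-rank percentile (an assumed definition; other percentile definitions shift details but not the conclusion):

```python
def pct(samples, p):
    # Nearest-rank percentile: value at rank ceil(p/100 * n).
    s = sorted(samples)
    return s[-(-len(s) * p // 100) - 1]

A = [60] * 99 + [99]               # P(99,A) = P(1,A) = 60, P(100,A) = 99
B = [1, 40] + [1] * 98             # P(99,B) = 1, P(100,B) = 40
# zip pairs each step's slow tail with the other step's fast path:
# A's 99ms sample meets a 1ms B, and B's 40ms sample meets a 60ms A.
X = [a + b for a, b in zip(A, B)]

assert pct(A, 99) == 60 and pct(B, 99) == 1
assert pct(X, 99) == 100 and pct(X, 100) == 100 and pct(X, 98) == 61
print("construction checks out")
```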
Now suppose X consists of two sequential steps A and B, with P(99,A) = 60ms and P(99,B) = 40ms
- Can P(99,X) > 100ms?
- Can P(99,X) be less than 95ms?
- Yes to both: just reuse the two constructions from the previous section, aligning or misaligning the 1% tails as needed
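Both answers can be checked the same way: first tails that hit different requests push P(99,X) above 100ms, then both P99s set by 1% outliers while the truly slow requests coincide pull it below 95ms. Sample values are illustrative, percentile is nearest-rank:

```python
def pct(samples, p):
    # Nearest-rank percentile: value at rank ceil(p/100 * n).
    s = sorted(samples)
    return s[-(-len(s) * p // 100) - 1]

# P(99,X) > 100ms: the slow 1% of A and the slow 1% of B hit different requests.
A = [60] * 99 + [99]                  # P(99,A) = 60
B = [80] + [40] * 99                  # P(99,B) = 40
X = [a + b for a, b in zip(A, B)]
assert pct(A, 99) == 60 and pct(B, 99) == 40
assert pct(X, 99) == 139              # 99 + 40 > 100

# P(99,X) < 95ms: each P99 is set by a 1% outlier, while the really slow
# requests (300ms, beyond both P99s) hit A and B on the same requests.
A2 = [60] + [1] * 98 + [300]          # P(99,A2) = 60
B2 = [1, 40] + [1] * 97 + [300]       # P(99,B2) = 40
X2 = [a + b for a, b in zip(A2, B2)]
assert pct(A2, 99) == 60 and pct(B2, 99) == 40
assert pct(X2, 99) == 61              # well under 95
print("both constructions check out")
```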
X sends requests to M, which batches them into a single DB request
- Can P(99,X) > P(99,M)? Yes, due to time spent waiting for the batch to fill
- Can P(99,M) > P(99,X)? No. X's latency includes M's on every request, so the only thing we can confirm is that each percentile of the composition is >= the corresponding percentile of its parts
In Prometheus
- Prometheus histograms record only a cumulative count per bucket plus an overall sum and count, never the raw samples.
- If the max bucket bound is too small, all slow samples land in the overflow bucket and the reported quantile clamps to the largest finite bucket bound. The end effect is a straight line
- If the bucket ranges are too big, too much data falls into the same bucket, and the quantile is then interpolated under the assumption that samples are evenly distributed within that bucket
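A minimal re-implementation of the interpolation that PromQL's histogram_quantile performs, to make both failure modes concrete; the bucket layout and counts below are made up for illustration:

```python
def histogram_quantile(q, buckets):
    """Quantile from cumulative buckets, PromQL-style.

    buckets: list of (upper_bound, cumulative_count); the last bound is +Inf.
    """
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            if bound == float("inf"):
                # Samples beyond the last finite bucket collapse to its
                # upper bound: the reported quantile flat-lines there.
                return prev_bound
            # Assume samples are spread evenly inside the bucket.
            frac = (target - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound

# Max bucket too small: 50 slow requests (say ~1s each) all land in +Inf,
# yet P99 reports a flat 100ms no matter how bad they really are.
buckets = [(10.0, 800), (50.0, 900), (100.0, 950), (float("inf"), 1000)]
print(histogram_quantile(0.99, buckets))  # -> 100.0

# Buckets too wide: the median is interpolated inside the 0-10ms bucket
# assuming an even spread, which the real data may not have.
print(histogram_quantile(0.5, buckets))   # -> 6.25
```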