On measuring performance
Getting performance data
- A better question to ask during performance testing is: while maintaining my SLO, what is the max throughput I can achieve?
- Develop the test harness early, so that you have enough time to refine the experiments and the metrics to collect - this is a time-consuming process!
- One script to automate running the experiments, and another to automate generating the report from the collected data
- Add additional tests to ensure a config change is indeed deployed
- Defend against fast but incorrect implementations. Include a correctness check as a non-timed part of the experiment, e.g., check data structure invariants after the timed portion
- Micro-benchmarks exercise certain aspects of the system. They are never a good representation of overall system health. Use macro-benchmarks that represent real-life workloads to show the overall picture
- What you measure is often response time instead of service time. Due to queueing and buffering, customers may experience latency that you do not see on your monitor.
- Introduce a baseline across the benchmarks and compare against the baseline instead of comparing the benchmarks with each other.
- The baseline is often the optimal or state-of-the-art solution
- Do the same run twice in a row to check for unintended caching effects, and do it twice AFTER some other runs to see both the cold and hot cache cases
- Use neighbors of the regular/power-of-two strides as well as random points, so that it is clear whether you ran into corner cases
- Run untimed warm-ups
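A minimal harness sketch tying several of the points above together: untimed warm-ups, the same run repeated back to back to expose caching effects, and a correctness check kept outside the timed region so a fast-but-wrong implementation cannot post a good number. The sorting workload is a stand-in assumption; substitute your own.

```python
import statistics
import time

def run_benchmark(workload, check, *, warmups=3, runs=10):
    """Untimed warm-ups, then repeated timed runs with an untimed check."""
    for _ in range(warmups):            # warm-ups: fill caches, never timed
        workload()
    timings = []
    for _ in range(runs):               # runs back to back: a large drop after
        start = time.perf_counter()     # the first timed run hints at an
        result = workload()             # unintended caching effect
        timings.append(time.perf_counter() - start)
        check(result)                   # correctness check, outside the timer
    return timings

def check_sorted(result):
    # A fast but wrong implementation must fail here, not post a good time.
    assert all(a <= b for a, b in zip(result, result[1:])), "invariant violated"

data = list(range(50_000, 0, -1))
timings = run_benchmark(lambda: sorted(data), check_sorted)
print(f"median {statistics.median(timings) * 1e3:.2f} ms over {len(timings)} runs")
```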
Interpreting performance data
- Many published performance numbers measure what was optimized, or the underlying system optimizes what is measured
- Separate the calibration workload from the evaluation workload. Otherwise, there is no way to show the predictive power of the model, which is the most important part of the model
- Clearly show the trade-off areas and the regression numbers, i.e., which numbers get better, which stay the same, and which get worse
- If only a subset of the tests is used, explain why the rest can be omitted. Note that an omitted case may display an entirely different trend
- Throughput and load analysis should come together
- Percentile and average metrics are very misleading, although they are good ways to tell stories. Rely on histograms instead if possible.
- A double-spike distribution is often a sign of multiple behavioral groups
- The “shoulder” of a gamma-like distribution is often a sign of a small but separate behavioral group
- Due to the number of requests needed to serve a single user, even a good 99.9th percentile means your user is still quite likely to hit a bad latency. Max latency carries a lot of weight in revealing problems.
- Performance degradation easily gets muffled in average metrics. Basically, averages are a good way to fool clients into feeling safe.
- A raw average must be accompanied by its standard deviation. A Student's t-test may also be needed for significance
- Don't take the arithmetic mean over a set of different benchmarks, and even less so if they are normalized. Use the geometric mean instead
- Use the geometric mean for multiplicative quantities, and the harmonic mean for rates, e.g., the average speed over two legs of equal distance traveled at speeds s1 and s2
- Give complete results instead of ratios
- Compare the exact data points instead of graphs over the same interval
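To make the 99.9th-percentile point concrete: if serving one user action fans out to n backend requests, each independently avoiding the slow tail with probability 0.999, the chance the user hits at least one tail latency is 1 - 0.999^n. A quick sketch (independence is an assumption; real fan-outs are often correlated):

```python
# Chance that a user action touching n backend requests hits the slow
# 0.1% tail at least once, assuming independent requests.
for n in (1, 10, 100, 1000):
    p_bad = 1 - 0.999 ** n
    print(f"{n:>5} requests -> {p_bad:6.1%}")
```

At 100 requests per action the user already has roughly a one-in-ten chance of seeing tail latency, which is why the max carries so much weight.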
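For the mean-and-standard-deviation point, a stdlib-only sketch with made-up sample latencies; the t statistic here is Welch's, computed by hand (a library routine such as scipy.stats.ttest_ind would also give a proper p-value):

```python
import math
import statistics

# Illustrative latency samples in ms (assumed numbers, not real data).
baseline = [10.2, 10.4, 9.9, 10.1, 10.3, 10.0, 10.2, 10.1]
candidate = [9.6, 9.8, 9.5, 9.9, 9.7, 9.6, 9.8, 9.7]

for name, xs in (("baseline", baseline), ("candidate", candidate)):
    print(f"{name}: mean {statistics.mean(xs):.2f} ± {statistics.stdev(xs):.2f} ms")

# Welch's t statistic; |t| well above ~2 suggests the difference is
# unlikely to be noise.
m1, m2 = statistics.mean(baseline), statistics.mean(candidate)
v1, v2 = statistics.variance(baseline), statistics.variance(candidate)
t = (m1 - m2) / math.sqrt(v1 / len(baseline) + v2 / len(candidate))
print(f"t = {t:.1f}")
```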
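The last two mean-related points in one sketch: normalized per-benchmark speedups summarized with the arithmetic vs the geometric mean, and the classic equal-distance trip where the average speed is the harmonic mean (the numbers are illustrative):

```python
import statistics

# Normalized benchmark scores (speedups vs a baseline): the geometric
# mean is the right summary; the arithmetic mean overstates the gain.
speedups = [4.0, 1.0, 0.25]
print(statistics.mean(speedups))            # 1.75 — looks like a win
print(statistics.geometric_mean(speedups))  # ~1.0 — actually a wash

# Average speed over two legs of equal distance: harmonic mean, not
# arithmetic. E.g. 60 km/h out and 30 km/h back is 40 km/h overall.
print(statistics.harmonic_mean([60, 30]))   # 40.0
```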
On percentiles
The most important thing about a metric is its distribution, and P99 says nothing about the remaining 1% or the preceding 98%. Therefore we can construct many counter-intuitive cases, and normally cannot apply transitivity and composition directly
Suppose P(99,X) = 100ms, where X consists of two sequential steps A and B, and P(99,A) = 60ms
- Can P(99,B) > 100ms? No: latencies are non-negative, so X >= B on every request, and each percentile of X bounds the corresponding percentile of B
- If the slowest 1% of A is always matched with the slowest 1% of B, then percentiles add and we can conclude P(99,B) = 40ms
- Can P(99,B) be in (40, 100]ms? Yes
- Suppose P(98,A) = 1ms, P(100,A) = 60ms, P(100,B) = P(98,B) = 99ms, P(1,B) = 1ms
- Then we can construct the distribution as P(1,X) = 61ms, P(99,X) = 100ms, P(100,X) = 159ms
- Can P(99,B) be in (0, 40)ms?
- Yes, suppose P(99,B) = 1ms, P(100,B) = 40ms, P(99,A) = P(1,A) = 60ms, P(100,A) = 99ms
- Then we can construct P(99,X) = P(100,X) = 100ms, P(98,X) = 61ms
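The second construction can be checked numerically with 100 paired samples per step, using a simple nearest-rank percentile (an assumed definition; other percentile definitions shift details but not the conclusion):

```python
def pct(samples, p):
    # Nearest-rank percentile: value at rank ceil(p/100 * n).
    s = sorted(samples)
    return s[-(-len(s) * p // 100) - 1]

A = [60] * 99 + [99]               # P(99,A) = P(1,A) = 60, P(100,A) = 99
B = [1, 40] + [1] * 98             # P(99,B) = 1, P(100,B) = 40
# zip pairs each step's slow tail with the other step's fast path:
# A's 99ms sample meets a 1ms B, and B's 40ms sample meets a 60ms A.
X = [a + b for a, b in zip(A, B)]

assert pct(A, 99) == 60 and pct(B, 99) == 1
assert pct(X, 99) == 100 and pct(X, 100) == 100 and pct(X, 98) == 61
print("construction checks out")
```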
Now suppose X consists of two sequential steps A and B, with P(99,A) = 60ms and P(99,B) = 40ms
- Can P(99,X) > 100ms?
- Can P(99,X) be less than 95ms?
- Yes to both: just reuse the two constructions from the previous section, aligning or misaligning the 1% tails as needed
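Both answers can be checked the same way: first tails that hit different requests push P(99,X) above 100ms, then both P99s set by 1% outliers while the truly slow requests coincide pull it below 95ms. Sample values are illustrative, percentile is nearest-rank:

```python
def pct(samples, p):
    # Nearest-rank percentile: value at rank ceil(p/100 * n).
    s = sorted(samples)
    return s[-(-len(s) * p // 100) - 1]

# P(99,X) > 100ms: the slow 1% of A and the slow 1% of B hit different requests.
A = [60] * 99 + [99]                  # P(99,A) = 60
B = [80] + [40] * 99                  # P(99,B) = 40
X = [a + b for a, b in zip(A, B)]
assert pct(A, 99) == 60 and pct(B, 99) == 40
assert pct(X, 99) == 139              # 99 + 40 > 100

# P(99,X) < 95ms: each P99 is set by a 1% outlier, while the really slow
# requests (300ms, beyond both P99s) hit A and B on the same requests.
A2 = [60] + [1] * 98 + [300]          # P(99,A2) = 60
B2 = [1, 40] + [1] * 97 + [300]       # P(99,B2) = 40
X2 = [a + b for a, b in zip(A2, B2)]
assert pct(A2, 99) == 60 and pct(B2, 99) == 40
assert pct(X2, 99) == 61              # well under 95
print("both constructions check out")
```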
X sends requests to M, which batches them into a single DB request
- Can P(99,X) > P(99,M)? Yes, due to time spent waiting for the batch to fill
- Can P(99,M) > P(99,X)? No. X's latency includes M's on every request, so the only thing we can confirm is that each percentile of the composition is >= the corresponding percentile of its parts
In Prometheus
- Prometheus histograms record only a cumulative count per bucket plus an overall sum and count, never the raw samples.
- If the max bucket bound is too small, all slow samples land in the overflow bucket and the reported quantile clamps to the largest finite bucket bound. The end effect is a straight line
- If the bucket ranges are too big, too much data falls into the same bucket, and the quantile is then interpolated under the assumption that samples are evenly distributed within that bucket
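A minimal re-implementation of the interpolation that PromQL's histogram_quantile performs, to make both failure modes concrete; the bucket layout and counts below are made up for illustration:

```python
def histogram_quantile(q, buckets):
    """Quantile from cumulative buckets, PromQL-style.

    buckets: list of (upper_bound, cumulative_count); the last bound is +Inf.
    """
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            if bound == float("inf"):
                # Samples beyond the last finite bucket collapse to its
                # upper bound: the reported quantile flat-lines there.
                return prev_bound
            # Assume samples are spread evenly inside the bucket.
            frac = (target - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound

# Max bucket too small: 50 slow requests (say ~1s each) all land in +Inf,
# yet P99 reports a flat 100ms no matter how bad they really are.
buckets = [(10.0, 800), (50.0, 900), (100.0, 950), (float("inf"), 1000)]
print(histogram_quantile(0.99, buckets))  # -> 100.0

# Buckets too wide: the median is interpolated inside the 0-10ms bucket
# assuming an even spread, which the real data may not have.
print(histogram_quantile(0.5, buckets))   # -> 6.25
```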