Getting performance data

  • A better question to ask during performance testing is: while maintaining my SLO, what is the maximum throughput I can achieve?
  • Develop the test harness early so that you have enough time to refine the experiments and the metrics to collect. This is a time-consuming process!
  • Use one script to automate running the experiments, and another to automate generating the report from the collected data
  • Add tests that verify each config change is actually deployed
  • Defend against fast but incorrect implementations. Include a correctness check as a non-timed part of the experiment, e.g., check data structure invariants after the timed portion
  • Micro-benchmarks cover specific aspects of the system and are never a good representation of overall system health. Use macro-benchmarks that represent the real-life workload to show the overall picture
  • What you measure is often service time rather than response time. Because of queueing and buffering, customers may experience latency that never shows up on your monitor.
  • Introduce a baseline across the benchmarks and compare each result with the baseline instead of comparing results with each other.
  • The baseline is often the optimal or state-of-the-art solution
  • Do the same run twice in a row to detect unintended caching effects, and do it twice again AFTER some other runs to see both the cold-cache and hot-cache cases
  • Test the neighbors of regular or power-of-two strides as well as random points, so that you know clearly whether you ran into corner cases
  • Run untimed warm-ups
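
Several of the points above (untimed warm-ups, a non-timed correctness check, sizes next to power-of-two strides plus random points) can be combined in a small harness. The following is only a minimal sketch: the function names `sizes_to_test`, `run_experiment`, and `is_sorted_permutation` are hypothetical, and sorting stands in for whatever workload is actually under test.

```python
import random
import time
from collections import Counter


def sizes_to_test(max_power=12, extra_random=3, seed=0):
    """Powers of two, their immediate neighbors, and a few random sizes."""
    rng = random.Random(seed)
    sizes = set()
    for p in range(1, max_power + 1):
        n = 2 ** p
        sizes.update({n - 1, n, n + 1})
    sizes.update(rng.randint(1, 2 ** max_power) for _ in range(extra_random))
    return sorted(sizes)


def run_experiment(workload, check, size, warmups=2, repeats=5):
    """Time workload(data) on fresh input; verify each result outside the timed region."""
    rng = random.Random(size)
    timings = []
    for i in range(warmups + repeats):
        data = [rng.random() for _ in range(size)]
        start = time.perf_counter()
        result = workload(data)
        elapsed = time.perf_counter() - start
        # The correctness check is deliberately not part of the measurement.
        if not check(data, result):
            raise AssertionError(f"incorrect result for size {size}")
        if i >= warmups:  # warm-up iterations are run but not recorded
            timings.append(elapsed)
    return timings


def is_sorted_permutation(data, result):
    """Invariant for a sort: output is ordered and holds the same elements."""
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    return in_order and Counter(result) == Counter(data)


if __name__ == "__main__":
    for n in sizes_to_test(max_power=10):
        t = run_experiment(sorted, is_sorted_permutation, n)
        print(n, min(t), max(t))
```

A "fast" workload that returns its input unsorted would be rejected by the check instead of producing an attractive but meaningless timing.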

Interpreting performance data

  • Many published performance numbers measure exactly what was optimized, or the underlying system was optimized for exactly what is measured
    • Separate the calibration workload from the evaluation workload(?). Otherwise there is no way to show the predictive power of the model, which is the most important part of the model
  • Show clearly the trade-off areas and the regressions in the performance numbers, i.e., which numbers got better, which stayed the same, and which got worse?
  • If only a subset of the tests is used, explain why the others can be omitted. Note that the omitted cases may display an entirely different trend
  • Throughput and load analysis should come together
  • Percentile and average metrics can be very misleading, although they are good ways to tell stories. Rely on histograms instead if possible.
    • A distribution with two spikes is often a sign of multiple behavioral groups
    • A “shoulder” on the gamma distribution is often a sign of a small but separate behavioral group
  • Because serving a single user takes many requests, even a good 99.9th percentile means your user is quite likely to hit a bad latency. Max latency carries a lot of weight in showing problems.
  • Performance degradation is easily muffled in average metrics. Averages are basically a good way to fool clients into feeling safe.
    • A raw average must be accompanied by its standard deviation. You may also need Student’s t-test for significance
    • Don’t use the arithmetic mean over a set of different benchmarks, and even less so if they are normalized. Use the geometric mean instead
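    • Use the geometric mean for multiplicative quantities (ratios, speedups), and the harmonic mean for speeds measured over equal amounts of work. If speed s1 holds for time t1 and speed s2 for time t2, the overall speed is the time-weighted average (s1*t1 + s2*t2) / (t1 + t2).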
  • Give complete results instead of ratios
  • Compare the exact data points instead of graphs over the same interval
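
Two of the claims above are easy to check numerically: how likely a user is to hit a slow request despite a good high percentile, and how the different means disagree. The sketch below assumes requests are independent (real traffic often is not), and the speedup figures and the helper name `p_user_sees_slow_request` are made up for illustration.

```python
import statistics


def p_user_sees_slow_request(percentile, requests_per_user):
    """Chance that at least one of N independent requests is slower than the given percentile."""
    p_fast = percentile / 100.0
    return 1.0 - p_fast ** requests_per_user


# A page that needs 100 or 1000 backend requests, each with a "good" P99.9.
print(p_user_sees_slow_request(99.9, 100))   # roughly 0.095
print(p_user_sees_slow_request(99.9, 1000))  # roughly 0.632

# Hypothetical speedups over a baseline on three different benchmarks.
speedups = [0.5, 2.0, 4.0]
print(statistics.mean(speedups))            # arithmetic mean, pulled up by the 4.0
print(statistics.geometric_mean(speedups))  # average of multiplicative factors

# Two legs of equal length driven at 30 and 60 (same unit): overall speed is 40.
print(statistics.harmonic_mean([30, 60]))
```

With 1000 requests per user, nearly two out of three users see at least one request beyond P99.9, which is why the tail and the max deserve attention.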

On percentile

The most important thing about a metric is its distribution, and P99 says nothing about the remaining 1% or the preceding 98%. Therefore we can construct many counter-intuitive cases, and normally cannot apply transitivity or composition directly.

Suppose P(99,X) = 100ms, where X consists of two sequential steps A and B, and P(99,A) = 60ms

  • Can P(99,B) > 100ms? No. A takes non-negative time, so every request has X >= B and P(99,X) >= P(99,B)
  • If the slowest 1% of A is matched with the slowest 1% of B, then we can conclude P(99,B) = 40ms
  • Can P(99,B) be in (40, 100] ms? Yes
    • Suppose P(98,A) = 1ms, P(100,A) = 60ms, P(100,B) = P(98,B) = 99ms, P(1,B) = 1ms
    • Then we can construct the distribution as P(1,X) = 61ms, P(99,X) = 100ms, P(100,X) = 159ms
  • Can P(99,B) be in (0, 40) ms?
    • Yes, suppose P(99,B) = 1ms, P(100,B) = 40ms, P(99,A) = P(1,A) = 60ms, P(100,A) = 99ms
    • Then we can construct P(99,X) = P(100,X) = 100ms, P(98,X) = 61ms
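
The two constructions above can be checked with 100 concrete requests each. This is a sketch that assumes the nearest-rank definition of a percentile (the helper `pct` is hypothetical, and other percentile definitions interpolate differently); the pairing of A and B values per request is chosen by hand to match the bullets.

```python
import math


def pct(p, samples):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Case 1: P(99,B) = 99ms, inside (40, 100].
# One request pairs a slow A with the fast B, one pairs a slow A with a slow B.
a1 = [60, 60] + [1] * 98
b1 = [1, 99] + [99] * 98
x1 = [a + b for a, b in zip(a1, b1)]

# Case 2: P(99,B) = 1ms, inside (0, 40).
# The slowest A and the slowest B land on different requests.
a2 = [99, 60] + [60] * 98
b2 = [1, 40] + [1] * 98
x2 = [a + b for a, b in zip(a2, b2)]

for name, a, b, x in (("case 1", a1, b1, x1), ("case 2", a2, b2, x2)):
    print(name, "P99(A) =", pct(99, a), "P99(B) =", pct(99, b), "P99(X) =", pct(99, x))
```

Both cases keep P(99,A) = 60ms and P(99,X) = 100ms while P(99,B) comes out as 99ms and 1ms respectively.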

X consists of two sequential steps A and B, with P(99,A) = 60ms and P(99,B) = 40ms

  • Can P(99,X) > 100ms?
  • Can P(99,X) be less than 95ms?
  • Yes to both cases. Constructions in the same style as the two in the previous section work: keep the slow tails of A and B on different requests
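
One possible pair of constructions for these two questions, again as a sketch with 100 requests and an assumed nearest-rank percentile (`pct` and the specific millisecond values are illustrative choices, not taken from the notes above):

```python
import math


def pct(p, samples):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# P(99,X) > 100ms: each step has one extreme outlier beyond its own P99,
# and the two outliers hit different requests, so 2% of X is very slow.
a_hi = [1000, 1, 60] + [1] * 97
b_hi = [1, 1000, 40] + [1] * 97
x_hi = [a + b for a, b in zip(a_hi, b_hi)]

# P(99,X) < 95ms: the slow 2% of A never coincides with the slow 2% of B.
a_lo = [60, 60, 1, 1] + [1] * 96
b_lo = [1, 1, 40, 40] + [1] * 96
x_lo = [a + b for a, b in zip(a_lo, b_lo)]

for name, a, b, x in (("high", a_hi, b_hi, x_hi), ("low", a_lo, b_lo, x_lo)):
    print(name, "P99(A) =", pct(99, a), "P99(B) =", pct(99, b), "P99(X) =", pct(99, x))
```

In both data sets P(99,A) = 60ms and P(99,B) = 40ms, yet P(99,X) is 1001ms in the first and 61ms in the second.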

X sends requests to M, which batches them into a single DB request

  • Can P(99,X) > P(99,M)? Yes, due to the time spent waiting for the batch
  • Can P(99,M) > P(99,X)? No. The only thing we can confirm is that the P99 of a composition is >= the P99 of its parts

In Prometheus

  • A Prometheus histogram only records a count per bucket, plus the overall sum and count of observations.
  • If the max bucket is too small, the quantile only reports the max bucket bound. The end effect is a straight line on the graph
  • If a bucket's range is too wide, too many data points fall into the same bucket, and the quantile is estimated under the assumption that they are evenly distributed within that bucket
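
Both bucket effects can be reproduced with a simplified re-implementation of a bucket-based quantile estimate. This is not Prometheus code: `estimate_quantile` is a hypothetical helper that assumes positive bucket bounds, cumulative counts, and a final +Inf bucket, and it only mirrors the idea of linear interpolation inside the bucket that contains the target rank.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Target rank lies beyond the largest finite bound: the estimate saturates.
                return buckets[-2][0]
            # Assume observations are spread evenly inside this bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-2][0]


# Max bucket too small: 5% of observations exceed 1.0s, so P99 is pinned at 1.0.
narrow = [(0.1, 50), (0.5, 90), (1.0, 95), (math.inf, 100)]
print(estimate_quantile(0.99, narrow))

# Bucket too wide: every observation was actually 0.9s, but the median estimate is 0.5s.
wide = [(1.0, 100), (math.inf, 100)]
print(estimate_quantile(0.5, wide))
```

However slow the tail really is, the first estimate never rises above 1.0, which is what produces the flat line on a dashboard.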

References