Design a Dapper-like system for tracing and monitoring

Requirements: drill down into a given request and see its sequence of actions

  • Embed the following in the request header:
    • Trace id: id of the root request
    • Parent session id: the session that called this request
    • Current session id
    • Seq number to mark the layer and to de-dup
  • Sample traces so that storage sees no more than 1k writes per sec
  • Store the sessions in a table keyed by (trace id, session id)
  • On viewing, load all sessions of a trace on the fly and compute the tree (< 100 sessions per trace, so this is cheap)
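
As a rough illustration of the view-time step, the sketch below rebuilds the call tree from the stored session rows of one trace. The `Session` record, `build_tree`, and `render` are hypothetical names, not part of any real tracing library. It assumes a duplicate write repeats the same session id, and that the seq number also orders calls made by the same parent, which is only one reading of the notes above.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Session(NamedTuple):
    trace_id: str
    session_id: str
    parent_id: Optional[str]  # None marks the root request
    seq: int                  # assumption: orders calls under one parent


def build_tree(rows):
    """Return (root_session_id, children) for the sessions of one trace."""
    # De-dup: a retried write may store the same session twice; keep the first.
    unique = {}
    for row in rows:
        unique.setdefault(row.session_id, row)

    children = defaultdict(list)
    root = None
    for s in unique.values():
        if s.parent_id is None:
            root = s.session_id
        else:
            children[s.parent_id].append(s)

    # Order each parent's calls by sequence number.
    ordered = {
        parent: [c.session_id for c in sorted(kids, key=lambda c: c.seq)]
        for parent, kids in children.items()
    }
    return root, ordered


def render(root, children, depth=0):
    """Flatten the tree into indented lines, depth-first."""
    lines = ["  " * depth + root]
    for child in children.get(root, []):
        lines.extend(render(child, children, depth + 1))
    return lines


rows = [
    Session("t1", "db", "api", 2),
    Session("t1", "api", "gateway", 1),
    Session("t1", "gateway", None, 0),
    Session("t1", "auth", "api", 1),
    Session("t1", "db", "api", 2),  # duplicate write, dropped by de-dup
]
root, children = build_tree(rows)
print("\n".join(render(root, children)))
```

With fewer than 100 sessions per trace, a single pass like this should be cheap enough to run on every page view, so no precomputed tree needs to be stored.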

Requirements: calculate avg latency by node id and time

  • Time-series DB, similar in design to Prometheus
  • For each metric, we maintain (timestamp, machine id, sequence) -> (labels, value)
  • Suppose 100 metrics are sent per sec. Taking 3 months as roughly 100 days, that is 100 * 86,400 * 100 = 864M, i.e. about 1B data points, so plotting the data over 3 months scans at most about 1 billion rows.
    • The scan can be parallelized per point on the graph, since each plotted point aggregates its own time bucket independently
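
A minimal sketch of the aggregation behind one latency graph, assuming each stored point can be read as a (timestamp, machine id, latency) tuple. The function name `avg_latency` and the tuple layout are illustrative assumptions, not a Prometheus API. Each (machine id, bucket) key is computed independently, which is what would let separate workers handle separate points on the graph.

```python
from collections import defaultdict


def avg_latency(points, bucket_secs):
    """Average latency per (machine_id, bucket start time).

    `points` is an iterable of (timestamp_secs, machine_id, latency_ms).
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [total latency, count]
    for ts, machine_id, latency_ms in points:
        bucket_start = ts - ts % bucket_secs  # round down to the bucket edge
        acc = sums[(machine_id, bucket_start)]
        acc[0] += latency_ms
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


points = [
    (0, "m1", 10.0),
    (30, "m1", 20.0),
    (61, "m1", 40.0),
    (5, "m2", 8.0),
]
result = avg_latency(points, bucket_secs=60)
print(result)
```

In a parallel version, each worker would scan only the key range of the time bucket it owns and return one (total, count) pair, so the partial results merge by simple addition.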