Suppose we run an e-commerce web site

Requirement: top 10 in the last 24 hours for each store

  • Need to keep track of
    • For each store, maintains a top 10 list, probably in cache, since it is way more reads than writes
    • txn history with index at (store_id, created_at), so need a seperate txn history service shared by store_id
  • Each store has at most 10k txns per day, so we can afford to load them in memory and compute it at every refresh interval, e.g., 10 mins
    • In fact, 7 days means 1 mil txns, so is good too
  • The calculation of stores can be triggered on-demand, i.e., during the visit, so we don’t have to pull store catalog service, i.e., update once every 1-5 minutes
    • Either synchronously via API call or asyncly via MQ

Requirement: complete item sales rank in the last week by category

  • Category is much more coarse grained than store, espeically basic ones. Expect total < 1 mil.
  • This also means the whole data can fit into the cache
  • Category is a nested-tree like structure, we assume no more than 5 levels (20 ^ 5) = 3.5 mil. Note that on amazon it is more like a DAG with 2 paths. So we assume each item will update at most 10 categories
  • Because it is windowed, we have to keep re-calculating the whole batch to defend against the retired sales case even for the top K.
    • Mini batch can speed up the calculation at the cost of additional complexity, but is needed if we care about the update frequency
    • The brute force calculation is OK if we don’t have strong SLA on the freshness of the data (< 1 hours). It will scan and populate 10 mil per day * 7 days * 10 categories per item = about 1B intermediate rows

Suppose we want to see updates < 1 min, for popular categories, e.g., top 100 hot categories

Populating the raw data

  • Keep track of
    • Last pulled checkpoint
    • (category_id, time, item_id) -> count
  • We periodically pull the txn services to get the txns since the last pulled checkpoint to populate (category_id, time, item_id) -> count
    • Can also triggered by notifications from the main txn service, but need to watch out for frequent update problem, so still need to checkthe last pulled check point
    • Can purge pulled data after 1 week
    • During the pull need to clear all data after the checkpoint first, and do a complete calculation

Calculating the final result

  • Keep track of
    • (category_id, item_id) -> count, last_calced_at
    • (category_id) -> (item id list, last updated_at)
  • Can store the result into cache since we have < 1 mil categories
  • At each refresh interval, scan all items for
    • the count after last_calced_at
    • the count before the now - 7 days
    • Last calculated window count
    • a - b + c to get the current window count
    • update the current window count with new updated_at
  • scan all items in the categories, can compute it in memory
    • we assume it has < 10 mil items per cateogry

Requirement: show a item’s sale number in the last day/week/month/lifetime