System design: OLAP
Suppose we run an e-commerce web site
Requirement: top 10 in the last 24 hours for each store
- Need to keep track of
- For each store, maintains a top 10 list, probably in cache, since it is way more reads than writes
- txn history with index at (store_id, created_at), so need a seperate txn history service shared by store_id
- Each store has at most 10k txns per day, so we can afford to load them in memory and compute it at every refresh interval, e.g., 10 mins
- In fact, 7 days means 1 mil txns, so is good too
- The calculation of stores can be triggered on-demand, i.e., during the visit, so we don’t have to pull store catalog service, i.e., update once every 1-5 minutes
- Either synchronously via API call or asyncly via MQ
Requirement: complete item sales rank in the last week by category
- Category is much more coarse grained than store, espeically basic ones. Expect total < 1 mil.
- This also means the whole data can fit into the cache
- Category is a nested-tree like structure, we assume no more than 5 levels (20 ^ 5) = 3.5 mil. Note that on amazon it is more like a DAG with 2 paths. So we assume each item will update at most 10 categories
- Because it is windowed, we have to keep re-calculating the whole batch to defend against the retired sales case even for the top K.
- Mini batch can speed up the calculation at the cost of additional complexity, but is needed if we care about the update frequency
- The brute force calculation is OK if we don’t have strong SLA on the freshness of the data (< 1 hours). It will scan and populate 10 mil per day * 7 days * 10 categories per item = about 1B intermediate rows
Suppose we want to see updates < 1 min, for popular categories, e.g., top 100 hot categories
Populating the raw data
- Keep track of
- Last pulled checkpoint
- (category_id, time, item_id) -> count
- We periodically pull the txn services to get the txns since the last pulled checkpoint to populate (category_id, time, item_id) -> count
- Can also triggered by notifications from the main txn service, but need to watch out for frequent update problem, so still need to checkthe last pulled check point
- Can purge pulled data after 1 week
- During the pull need to clear all data after the checkpoint first, and do a complete calculation
Calculating the final result
- Keep track of
- (category_id, item_id) -> count, last_calced_at
- (category_id) -> (item id list, last updated_at)
- Can store the result into cache since we have < 1 mil categories
- At each refresh interval, scan all items for
- the count after last_calced_at
- the count before the now - 7 days
- Last calculated window count
- a - b + c to get the current window count
- update the current window count with new updated_at
- scan all items in the categories, can compute it in memory
- we assume it has < 10 mil items per cateogry