Motivations

  • More writes than reads, but we still need to keep an index for retrieval, e.g., history tables, logs, and other time-series data
  • The original paper focuses on the hardware cost the index brings, and claims the LSM-tree is much cheaper than a B-tree. The root cause is the lower disk arm (seek) cost, since writes are deferred and batched into sequential multi-page I/O

Overview

  • Multiple levels of trees; for simplicity, assume two levels: c0 and c1
    • c0 is a small, in-memory AVL tree. Implementation-wise, it is the MemTable backed by a WAL, and it keeps keys sorted
    • c1 is an on-disk B-tree, but its hot nodes’ pages are most likely cached in memory
    • In practice, the LSM tree typically consists of multiple levels, each representing a sorted run of data. Levels closer to the top hold newer and more frequently accessed data, while lower levels contain older and less frequently accessed data.
  • On insertion, the new index entry goes to c0 and is merged into c1 when c0 grows too big
  • Retrieval looks at c0 first, then c1
  • The design trades single-key read performance for better write and key-range scan performance (see the sketch after this list)
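To make the two-level picture concrete, here is a minimal Python sketch, assuming a plain in-memory map for c0 and a single in-memory sorted run standing in for the on-disk c1; TwoLevelLSM, MERGE_THRESHOLD, and the merge trigger are illustrative choices, not details from the paper.

```python
# Minimal two-level sketch: c0 is an in-memory map (standing in for the AVL-tree
# MemTable), c1 is a single sorted run (standing in for the on-disk tree).
# MERGE_THRESHOLD and all class/function names here are illustrative.
import bisect

MERGE_THRESHOLD = 4  # merge c0 into c1 once c0 holds this many entries

class TwoLevelLSM:
    def __init__(self):
        self.c0 = {}           # key -> value, small and in memory
        self.c1_keys = []      # sorted keys of the on-disk run
        self.c1_vals = []      # values aligned with c1_keys

    def put(self, key, value):
        self.c0[key] = value                  # writes always go to c0
        if len(self.c0) >= MERGE_THRESHOLD:
            self._merge_c0_into_c1()          # rolling merge when c0 is too big

    def get(self, key):
        if key in self.c0:                    # newest data first
            return self.c0[key]
        i = bisect.bisect_left(self.c1_keys, key)
        if i < len(self.c1_keys) and self.c1_keys[i] == key:
            return self.c1_vals[i]
        return None

    def _merge_c0_into_c1(self):
        # Merge the two components into a fresh run; c0 entries win on ties.
        merged = dict(zip(self.c1_keys, self.c1_vals))
        merged.update(self.c0)
        items = sorted(merged.items())
        self.c1_keys = [k for k, _ in items]
        self.c1_vals = [v for _, v in items]
        self.c0 = {}
```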

In memory c0

  • Recovery is done by reconstruction from the record log (WAL), since c0 hosts only the most recent inserts. However, we still need to define the replay starting point carefully
  • Deletion is handled by inserting a special deletion entry (tombstone)
  • When the MemTable grows too large, convert the current MemTable into an Immutable MemTable (IMT) and create a new MemTable; a minimal sketch follows
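A minimal sketch of the c0 side described above, assuming a plain dict instead of an AVL tree and a line-oriented JSON WAL; MemTable, TOMBSTONE, freeze, and max_entries are illustrative names, not taken from any particular implementation.

```python
# Sketch of c0: every mutation is appended to a WAL before touching the MemTable,
# so crash recovery can rebuild c0 by replaying the log. A plain dict stands in
# for the AVL tree; TOMBSTONE and all names here are illustrative.
import json
import os

TOMBSTONE = "__DEL__"   # special deletion entry, kept until compaction removes it

class MemTable:
    def __init__(self, wal_path, max_entries=1024):
        self.wal_path = wal_path
        self.max_entries = max_entries
        self.table = {}
        self._replay()                      # recovery: rebuild c0 from the WAL
        self.wal = open(wal_path, "a")

    def _replay(self):
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as f:
            for line in f:
                key, value = json.loads(line)
                self.table[key] = value     # later entries overwrite earlier ones

    def put(self, key, value):
        self.wal.write(json.dumps([key, value]) + "\n")
        self.wal.flush()                    # durable before acknowledging the write
        self.table[key] = value

    def delete(self, key):
        self.put(key, TOMBSTONE)            # deletes are just special inserts

    def is_full(self):
        return len(self.table) >= self.max_entries

    def freeze(self):
        """Turn this MemTable into an immutable, sorted snapshot for flushing."""
        self.wal.close()
        return sorted(self.table.items())
```

When is_full() returns True, the caller would freeze() the current MemTable into the Immutable MemTable, open a new MemTable with a fresh WAL, and flush the frozen snapshot to disk in the background.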

On disk c1

  • Single-page nodes on each level below the root are placed together in multi-page disk blocks, and each node is kept 100% full
  • No direct writes to c1 except during merges, so we never have to read-modify-write an entire block because of a random write
  • On-disk data is organized as sorted runs with no overlapping key ranges
  • Implementation-wise, this is the SSTable. Again, keys are stored sorted (see the sketch after this list)
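A minimal sketch of an SSTable-like sorted run, assuming newline-delimited JSON records and a fully in-memory key index; real SSTables use binary blocks, a sparse index, and often bloom filters. write_sstable and SSTableReader are illustrative names.

```python
# Sketch of an SSTable-like file: a sorted, immutable run written once during a
# flush/merge and only read afterwards. JSON lines stand in for the real block
# format; the file layout and all names here are illustrative.
import bisect
import json

def write_sstable(path, sorted_items):
    """sorted_items: iterable of (key, value) pairs already sorted by key."""
    with open(path, "w") as f:
        for key, value in sorted_items:
            f.write(json.dumps([key, value]) + "\n")

class SSTableReader:
    def __init__(self, path):
        # Load all keys into memory for binary search; a real reader would keep
        # only a sparse index of block boundaries and read blocks on demand.
        self.keys, self.values = [], []
        with open(path) as f:
            for line in f:
                key, value = json.loads(line)
                self.keys.append(key)
                self.values.append(value)

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None
```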

Read

  • Start from the newest data (the MemTable, then the lowest, most recent SSTable level) and keep going deeper until we first find the key; that first match is the latest version. A read-path sketch follows
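A sketch of that read path, assuming each component (MemTable, Immutable MemTable, SSTable readers ordered newest to oldest) exposes a get(key) that returns a value, a tombstone, or None; lsm_get and TOMBSTONE are illustrative.

```python
# Sketch of the read path: check the newest data first and return the first hit,
# because newer components shadow older ones.
TOMBSTONE = "__DEL__"

def lsm_get(key, memtable, immutable_memtable, sstables_newest_first):
    """Each component exposes .get(key) returning a value, TOMBSTONE, or None."""
    for component in [memtable, immutable_memtable, *sstables_newest_first]:
        if component is None:
            continue
        value = component.get(key)
        if value is not None:               # first hit wins
            return None if value == TOMBSTONE else value
    return None                             # key not present anywhere
```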

Update

  • Just like insert. This means we accept that multiple versions of the same key exist in different SSTables; reads return the newest one, and older versions are dropped during merges

Delete

  • Insert a key-deletion mark (tombstone). The value is only truly deleted when SSTables merge

How to merge

  • The idea is similar to merge sort. Assume we have a cursor, for each tree, pointing to the next leaf to merge
  • During the merge, both indices remain available for reads, except for short locking periods
  • Start from the smallest key of c0 and load the next multi-page block of c1 into memory
  • Generate new multi-page blocks as in merge sort and write them back to the rightmost (newly written) positions of c1. Note that parent node references are updated too, and old c1 blocks are not recycled immediately, for recovery purposes
  • The merged leaves of c0 are then removed from memory
  • The process repeats until the rightmost c0 leaf has been written to c1
  • The merge process itself checkpoints and flushes to disk periodically; a simplified compaction sketch follows
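A simplified compaction sketch of the idea above: a k-way merge of sorted runs (plain in-memory lists here, rather than multi-page blocks), keeping only the newest version of each key and dropping tombstones when the output is the last level; merge_runs and TOMBSTONE are illustrative names.

```python
# Sketch of the merge/compaction step. heapq.merge plays the role of the
# merge-sort cursors: it walks all runs in key order in a single pass.
import heapq

TOMBSTONE = "__DEL__"

def merge_runs(runs_newest_first, is_last_level=False):
    """runs_newest_first: list of key-sorted (key, value) lists, newest run first.
    Returns a single sorted run with at most one entry per key."""
    # Tag each entry with the run's age so that, for equal keys, the newest wins.
    tagged = [
        [(key, age, value) for key, value in run]
        for age, run in enumerate(runs_newest_first)
    ]
    merged, last_key = [], object()
    for key, age, value in heapq.merge(*tagged):
        if key == last_key:
            continue                        # older version of a key already emitted
        last_key = key
        if value == TOMBSTONE and is_last_level:
            continue                        # tombstone can finally be dropped
        merged.append((key, value))
    return merged

if __name__ == "__main__":
    new_run = merge_runs(
        [[("a", "1"), ("c", TOMBSTONE)], [("a", "0"), ("b", "2")]],
        is_last_level=True,
    )
    print(new_run)  # [('a', '1'), ('b', '2')]: newest "a" kept, tombstoned "c" dropped
```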