RocksDB concepts
- Column family: each key-value pair belongs to exactly one CF. CFs share the same WAL (which is what makes atomic writes across CFs possible), but each CF has its own memtables and table files. Note that the classic column-family model maps row_key to a sorted list of (column, value) pairs, whereas in RocksDB a CF is just a namespace: (cf, key, value). (CF usage is sketched after this list.)
- The write buffer is the memtable; the block cache is where RocksDB caches data blocks in memory for reads.
- Memtable flushes are by default scheduled on the HIGH thread pool, compactions on the LOW thread pool. A stalled memtable flush can in turn stall writes, raising p99 latency. Bound the total number of background flush/compaction jobs with max_background_jobs, and size the pools on the Env (see the options sketch after this list).
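
A minimal C++ sketch of column families, assuming a local path /tmp/cf_demo and an arbitrary extra CF named "lock"; the WriteBatch illustrates the shared-WAL atomicity mentioned above.

```cpp
#include <cassert>
#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/options.h"
#include "rocksdb/write_batch.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.create_missing_column_families = true;

  std::vector<rocksdb::ColumnFamilyDescriptor> cf_descs = {
      {rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()},
      {"lock", rocksdb::ColumnFamilyOptions()},
  };

  rocksdb::DB* db = nullptr;
  std::vector<rocksdb::ColumnFamilyHandle*> handles;
  rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/cf_demo", cf_descs, &handles, &db);
  assert(s.ok());

  // Both updates go through the shared WAL, so the batch is atomic even
  // though the two keys live in different memtables and SST files.
  rocksdb::WriteBatch batch;
  batch.Put(handles[0], "k1", "v1");     // default CF
  batch.Put(handles[1], "k1", "lock1");  // "lock" CF
  s = db->Write(rocksdb::WriteOptions(), &batch);
  assert(s.ok());

  for (auto* h : handles) db->DestroyColumnFamilyHandle(h);
  delete db;
  return 0;
}
```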
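A sketch of the knobs touched on above (write buffer, block cache, background pools); all sizes and thread counts are placeholder values, not recommendations.

```cpp
#include "rocksdb/cache.h"
#include "rocksdb/env.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options MakeOptions() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // Memtable (write buffer): 128 MiB per memtable.
  options.write_buffer_size = 128 << 20;

  // Block cache for reads: a 1 GiB LRU cache plugged in via the table factory.
  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_cache = rocksdb::NewLRUCache(1 << 30);
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));

  // Upper bound on concurrent flushes + compactions; RocksDB decides how many
  // of these jobs run in the HIGH (flush) vs LOW (compaction) pool.
  options.max_background_jobs = 8;

  // The pools themselves live on the Env and can be sized explicitly.
  options.env->SetBackgroundThreads(2, rocksdb::Env::HIGH);  // flushes
  options.env->SetBackgroundThreads(6, rocksdb::Env::LOW);   // compactions
  return options;
}
```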
Compaction
- Size-tiered compaction strategy: small SSTables are merged into medium SSTables, medium into large, and so on, trading read and space amplification for lower write amplification. A compaction merges all sorted runs in one level to create a new sorted run in the next level.
- A common tiered variant simply merges sorted runs of similar size, without any notion of levels (see the universal-compaction sketch after this list).
- Level-based compaction strategy: each level holds at most one sorted run, and some data moves to the next level when the current level approaches its size limit. It trades higher write amplification for lower space (and read) amplification, the opposite trade-off from tiered. The original paper describes an all-to-all merge, but RocksDB does some-to-some merges.
- Leveled compaction in RocksDB is really tiered+leveled: there can be N sorted runs at the memtable level via the max_write_buffer_number option, of which only one is active for writes while the rest are read-only and waiting to be flushed. A memtable flush is similar to a tiered compaction: it creates a new sorted run in L0 and doesn't read or rewrite the existing sorted runs in L0. (See the leveled-compaction sketch after this list.)
- Subcompactions speed up a single compaction job by partitioning it among multiple threads.
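
A sketch of tiered compaction in RocksDB terms, where it is called universal compaction; the specific values are illustrative only.

```cpp
#include "rocksdb/options.h"

rocksdb::Options UniversalCompactionOptions() {
  rocksdb::Options options;
  options.compaction_style = rocksdb::kCompactionStyleUniversal;
  // Treat sorted runs as "similar sized" within this percentage slack.
  options.compaction_options_universal.size_ratio = 1;
  // Require at least 4 similar-sized runs before merging them.
  options.compaction_options_universal.min_merge_width = 4;
  // Allow up to 200% space amplification before forcing a full merge.
  options.compaction_options_universal.max_size_amplification_percent = 200;
  return options;
}
```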
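And a sketch of the leveled (tiered+leveled) setup described above, again with placeholder values.

```cpp
#include "rocksdb/options.h"

rocksdb::Options LeveledCompactionOptions() {
  rocksdb::Options options;
  options.compaction_style = rocksdb::kCompactionStyleLevel;

  // Up to 4 memtables: 1 active, the rest immutable and waiting to be flushed.
  options.max_write_buffer_number = 4;

  // Start compacting L0 into L1 once 4 sorted runs accumulate in L0.
  options.level0_file_num_compaction_trigger = 4;

  // Target sizes: L1 = 512 MiB, each level below 10x larger.
  options.max_bytes_for_level_base = 512 << 20;
  options.max_bytes_for_level_multiplier = 10;

  // Split one large compaction job (e.g. L0->L1) across up to 4 threads.
  options.max_subcompactions = 4;
  return options;
}
```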
Read & write amplification
- Read amplification (RA) counts the number of disk reads needed to serve a query. It is defined separately for point queries and range queries.
- Flash-based storage can only be written a finite number of times, so write amplification shortens the device's lifetime.
- An LSM tree with leveled compaction typically has lower write amplification than a B-tree (a back-of-envelope comparison follows this list).
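
A rough, illustrative comparison for the last point (the numbers are back-of-envelope, not measurements): with leveled compaction and a per-level fanout of about 10, each byte that moves down a level is rewritten together with roughly 10 bytes of overlapping data in the next level, so with 5 levels the total write amplification is on the order of a few tens (RocksDB installations often report 10-30). In an update-in-place B-tree, modifying a 128-byte record dirties a whole 8 KiB page, a write amplification of up to 64 for that write if nothing else on the page changes before it is flushed.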
How TiKV uses RocksDB
- All data on a TiKV node is shared between two RocksDB instances: one stores the actual data, the other the Raft log.
- The data (KV) RocksDB instance stores KV data in the default, write, and lock CFs.
- Raft logs are stored as region_id/log_id -> value (see the key-encoding sketch below).
- Data is stored as key + bit-inverted ts -> value. The timestamp is inverted so that the newest version (the largest ts) sorts first under the bytewise comparator.
- TiKV uses a prefix bloom filter, i.e., the filter is built over a key prefix rather than the whole key (configuration sketched below).
- TiKV uses TableProperties in split check, MVCC GC, and compaction of a Region's range (a minimal collector is sketched below).
- When adding a replica on a new server, TiKV generates an SST snapshot and sends the file to the new server, which ingests it directly (see the snapshot sketch below).
- Disk space is released only after the tombstones themselves have been compacted away (see the range-reclaim sketch below).
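
A sketch of the key layouts above. TiKV itself is written in Rust and its real encoding has more detail (prefixes, escaping); this C++ fragment only illustrates the big-endian region_id/log_id layout and the bit-inverted timestamp.

```cpp
#include <cstdint>
#include <string>

// Append a uint64 in big-endian so that byte-wise comparison matches numeric
// order (RocksDB's default comparator is a bytewise comparator).
static void PutU64BigEndian(std::string* buf, uint64_t v) {
  for (int shift = 56; shift >= 0; shift -= 8) {
    buf->push_back(static_cast<char>((v >> shift) & 0xff));
  }
}

// Raft log key: region_id then log index, so one region's log entries are
// contiguous and ordered by index.
std::string RaftLogKey(uint64_t region_id, uint64_t log_index) {
  std::string key;
  PutU64BigEndian(&key, region_id);
  PutU64BigEndian(&key, log_index);
  return key;
}

// MVCC data key: user key followed by the bit-inverted commit timestamp.
// Inverting the timestamp makes larger (newer) timestamps sort first, so a
// forward scan from `user_key` hits the newest version first.
std::string MvccDataKey(const std::string& user_key, uint64_t commit_ts) {
  std::string key = user_key;
  PutU64BigEndian(&key, ~commit_ts);
  return key;
}
```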
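A sketch of a prefix bloom filter configuration in RocksDB; the 8-byte prefix length and other values are placeholders, not TiKV's actual settings.

```cpp
#include "rocksdb/filter_policy.h"
#include "rocksdb/options.h"
#include "rocksdb/slice_transform.h"
#include "rocksdb/table.h"

rocksdb::Options PrefixBloomOptions() {
  rocksdb::Options options;
  // Build filters (and hash indexes) over the first 8 bytes of each key.
  options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(8));

  rocksdb::BlockBasedTableOptions table_options;
  table_options.filter_policy.reset(
      rocksdb::NewBloomFilterPolicy(10 /* bits per key */));
  table_options.whole_key_filtering = false;  // filter on prefixes only
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));

  // Also build a bloom filter over prefixes inside the memtable.
  options.memtable_prefix_bloom_size_ratio = 0.1;
  return options;
}
```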
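A minimal sketch of the TableProperties mechanism: a custom collector that records put/delete counts per SST file, which a split check or GC pass could read back (e.g. via DB::GetPropertiesOfTablesInRange) instead of scanning the data. TiKV's real collectors live in its Rust code and track MVCC-specific statistics; the class names here are made up.

```cpp
#include <string>

#include "rocksdb/slice.h"
#include "rocksdb/status.h"
#include "rocksdb/table_properties.h"
#include "rocksdb/types.h"

// Counts puts and deletes while an SST file is being built.
class MvccStatsCollector : public rocksdb::TablePropertiesCollector {
 public:
  rocksdb::Status AddUserKey(const rocksdb::Slice& /*key*/,
                             const rocksdb::Slice& /*value*/,
                             rocksdb::EntryType type,
                             rocksdb::SequenceNumber /*seq*/,
                             uint64_t /*file_size*/) override {
    if (type == rocksdb::kEntryPut) {
      num_puts_++;
    } else if (type == rocksdb::kEntryDelete) {
      num_deletes_++;
    }
    return rocksdb::Status::OK();
  }

  // Persist the counters as user-collected properties of the SST file.
  rocksdb::Status Finish(rocksdb::UserCollectedProperties* props) override {
    (*props)["mvcc_stats.num_puts"] = std::to_string(num_puts_);
    (*props)["mvcc_stats.num_deletes"] = std::to_string(num_deletes_);
    return rocksdb::Status::OK();
  }

  rocksdb::UserCollectedProperties GetReadableProperties() const override {
    return {{"mvcc_stats.num_puts", std::to_string(num_puts_)},
            {"mvcc_stats.num_deletes", std::to_string(num_deletes_)}};
  }

  const char* Name() const override { return "MvccStatsCollector"; }

 private:
  uint64_t num_puts_ = 0;
  uint64_t num_deletes_ = 0;
};

// Registered via options.table_properties_collector_factories.
class MvccStatsCollectorFactory
    : public rocksdb::TablePropertiesCollectorFactory {
 public:
  rocksdb::TablePropertiesCollector* CreateTablePropertiesCollector(
      rocksdb::TablePropertiesCollectorFactory::Context /*context*/) override {
    return new MvccStatsCollector();
  }
  const char* Name() const override { return "MvccStatsCollectorFactory"; }
};
```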
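A sketch of the SST-snapshot flow using RocksDB's SstFileWriter and IngestExternalFile; paths and keys are placeholders, and TiKV's actual snapshot format carries more metadata.

```cpp
#include <cassert>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/env.h"
#include "rocksdb/options.h"
#include "rocksdb/sst_file_writer.h"

// Sender side: dump a region's data into an SST file, sorted by key.
void WriteSnapshotSst(const std::string& path) {
  rocksdb::Options options;
  rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);
  rocksdb::Status s = writer.Open(path);
  assert(s.ok());
  s = writer.Put("region_1_key_a", "v1");  // keys must be added in order
  assert(s.ok());
  s = writer.Put("region_1_key_b", "v2");
  assert(s.ok());
  s = writer.Finish();
  assert(s.ok());
}

// Receiver side: ingest the shipped file directly instead of replaying writes.
void IngestSnapshotSst(rocksdb::DB* db, const std::string& path) {
  rocksdb::IngestExternalFileOptions ingest_options;
  ingest_options.move_files = true;  // link/move the file instead of copying
  rocksdb::Status s = db->IngestExternalFile({path}, ingest_options);
  assert(s.ok());
}
```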
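A sketch of reclaiming space for a dropped key range (e.g. a Region that has moved away), combining DeleteFilesInRange with a manual CompactRange; the helper name and the key range are hypothetical.

```cpp
#include "rocksdb/convenience.h"
#include "rocksdb/db.h"

void ReclaimRange(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf,
                  const rocksdb::Slice& begin, const rocksdb::Slice& end) {
  // Drop whole SST files that lie entirely inside [begin, end): cheap, no rewrite.
  rocksdb::DeleteFilesInRange(db, cf, &begin, &end);
  // Compact what remains in the range so leftover tombstones are dropped
  // and the disk space is actually released.
  rocksdb::CompactRangeOptions opts;
  db->CompactRange(opts, cf, &begin, &end);
}
```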