Prometheus file structure

Every two hours’ data is in the same block. Each of block is a directory with chunks, one metadata file, and one index file
On average each sample takes no more than 2 bytes
Almost every section starts with a length field, and ends with a CRC32 checksum

Index file

Symbol table: a sorted list of strings, which are used in label pairs
Series (more below): a list of series. Each series hold the label set of the series and its chunks (more below) with blocks
List of postings: Each posting is a list of series references that contain a given label pair
Posting offset table: list of entries, each is a label pair to offset to its series list in the postings list section. This section is sorted
Table of content: entry point to the entire index and points to various sections in the file.

Series section in the index file

Each series holds number of lables, followed by the tuple of symbol table references
The next section is the number of indexed chunks, each of chunk index contains the min time and max time of the time, and their position in the chunk file. Note that all ts except the first one is stored as deltas

chuck

max 512 MB, each record of chuck is len + encoding + data + CRC

WAL

128MB by default. A segment is written to in pages of 32KB. Mostly borrowed the idea from rocksdb

type
len
CRC
data

WAL accepts

Series records encode the labels that identifies a series and its unique ID
Sample records encode samples as a list of triples (series_id, timestamp, value)
Tombstone records encode tombstones as a list of triples (series_id, min_time, max_time)

Example

Suppose i have one serive deployed on two machines, each will output [{service: sid}, {node: nid}] label sets

Inside the index file

Symbol table: [n1, n2, node, service, sid]
Series:
- [2, 0, 3, 4]
- min time in c1, delta for max in c1, offset in c1
- [2, 1, 3, 4]
- min time in c2, delta for max in c2, offset in c2
Postings:
- [0], [0, 1], [1]
Posting offset table:
- ((node, n1), 0), ((node, n2), 2), ((service, s1), 1)

Note in the design, we have direct index on only single lable pair, or the raw data. This means most queries, we will have to merge to find series entries, and then scatter-merge