| 1 | +# WAL Disk Format |
| 2 | + |
| 3 | +The write ahead log operates in segments that are numbered and sequential, |
| 4 | +e.g. `000000`, `000001`, `000002`, etc., and are limited to 128MB by default. |
| 5 | +A segment is written to in pages of 32KB. Only the last page of the most recent segment |
| 6 | +may be partial. A WAL record is an opaque byte slice that gets split up into sub-records |
| 7 | +should it exceed the remaining space of the current page. Records are never split across |
| 8 | +segment boundaries. |
| 9 | +The encoding of pages is largely borrowed from [LevelDB's/RocksDB's write ahead log][1]. |
| 10 | + |
| 11 | +The notable deviation is that a record fragment is encoded as: |
| 12 | + |
| 13 | +┌───────────┬──────────┬────────────┬──────────────┐ |
| 14 | +│ type <1b> │ len <2b> │ CRC32 <4b> │ data <bytes> │ |
| 15 | +└───────────┴──────────┴────────────┴──────────────┘ |
| 16 | + |
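The fragment layout and the page-splitting rule described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the big-endian byte order, the use of `zlib.crc32` (IEEE polynomial), the fragment type values, and the helper names `encode_fragment` and `split_record` are all assumptions that this document does not specify.

```python
import struct
import zlib

PAGE_SIZE = 32 * 1024      # 32KB pages, as stated above
HEADER_SIZE = 1 + 2 + 4    # type <1b> + len <2b> + CRC32 <4b>

# Assumed fragment type values (LevelDB-style convention, not given in this document).
FULL, FIRST, MIDDLE, LAST = 1, 2, 3, 4


def encode_fragment(frag_type: int, data: bytes) -> bytes:
    """Encode one fragment as type | len | CRC32 | data (big-endian assumed)."""
    crc = zlib.crc32(data) & 0xFFFFFFFF  # assumption: IEEE polynomial
    return struct.pack(">BHI", frag_type, len(data), crc) + data


def split_record(record: bytes, page_offset: int = 0) -> list:
    """Split a record into fragments so that no fragment crosses a page boundary.

    page_offset is the number of bytes already used in the current page.
    """
    fragments = []
    pos = 0
    first = True
    while True:
        space = PAGE_SIZE - page_offset - HEADER_SIZE
        if space <= 0:
            # Assumption: no room for a header plus data, so skip to a fresh page.
            page_offset = 0
            continue
        chunk = record[pos:pos + space]
        pos += len(chunk)
        done = pos >= len(record)
        if first and done:
            frag_type = FULL
        elif first:
            frag_type = FIRST
        elif done:
            frag_type = LAST
        else:
            frag_type = MIDDLE
        fragments.append(encode_fragment(frag_type, chunk))
        if done:
            return fragments
        first = False
        page_offset = 0  # the next fragment starts on a fresh page
```

For example, `split_record(b"x" * 40000)` yields two fragments: a first fragment that fills a whole page and a last fragment carrying the remaining bytes.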
| 17 | +## Record encoding |
| 18 | + |
| 19 | +The records written to the write ahead log are encoded as follows: |
| 20 | + |
| 21 | +### Series records |
| 22 | + |
| 23 | +Series records encode the labels that identify a series and its unique ID. |
| 24 | + |
| 25 | +┌────────────────────────────────────────────┐ |
| 26 | +│ type = 1 <1b> │ |
| 27 | +├────────────────────────────────────────────┤ |
| 28 | +│ ┌─────────┬──────────────────────────────┐ │ |
| 29 | +│ │ id <8b> │ n = len(labels) <uvarint> │ │ |
| 30 | +│ ├─────────┴────────────┬─────────────────┤ │ |
| 31 | +│ │ len(str_1) <uvarint> │ str_1 <bytes> │ │ |
| 32 | +│ ├──────────────────────┴─────────────────┤ │ |
| 33 | +│ │ ... │ │ |
| 34 | +│ ├───────────────────────┬────────────────┤ │ |
| 35 | +│ │ len(str_2n) <uvarint> │ str_2n <bytes> │ │ |
| 36 | +│ └───────────────────────┴────────────────┘ │ |
| 37 | +│ . . . │ |
| 38 | +└────────────────────────────────────────────┘ |
| 39 | + |
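As a rough sketch of the series record layout above, the encoder below writes the type byte followed by one entry per series. It assumes a big-endian 8-byte ID, the LEB128-style unsigned varint used by Go's `encoding/binary`, and that the `2n` strings alternate between label name and label value. The function names are hypothetical.

```python
import struct


def put_uvarint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128-style varint."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_series_record(series) -> bytes:
    """series: list of (id, [(label_name, label_value), ...]) pairs."""
    buf = bytearray([1])                     # type = 1
    for series_id, labels in series:
        buf += struct.pack(">Q", series_id)  # id <8b>, big-endian assumed
        buf += put_uvarint(len(labels))      # n = len(labels)
        for name, value in labels:
            for s in (name, value):          # 2n strings: name, value, name, value, ...
                raw = s.encode("utf-8")
                buf += put_uvarint(len(raw))
                buf += raw
    return bytes(buf)
```

A series with ID 7 and the single label `job="api"` would encode to 1 type byte, 8 ID bytes, 1 byte for `n`, and two length-prefixed strings of 4 bytes each.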
| 40 | +### Sample records |
| 41 | + |
| 42 | +Sample records encode samples as a list of triples `(series_id, timestamp, value)`. |
| 43 | +The series ID and timestamp are encoded as deltas relative to the first sample. |
| 44 | + |
| 45 | +┌──────────────────────────────────────────────────────────────────┐ |
| 46 | +│ type = 2 <1b> │ |
| 47 | +├──────────────────────────────────────────────────────────────────┤ |
| 48 | +│ ┌────────────────────┬───────────────────────────┬─────────────┐ │ |
| 49 | +│ │ id <8b> │ timestamp <8b> │ value <8b> │ │ |
| 50 | +│ └────────────────────┴───────────────────────────┴─────────────┘ │ |
| 51 | +│ ┌────────────────────┬───────────────────────────┬─────────────┐ │ |
| 52 | +│ │ id_delta <uvarint> │ timestamp_delta <uvarint> │ value <8b> │ │ |
| 53 | +│ └────────────────────┴───────────────────────────┴─────────────┘ │ |
| 54 | +│ . . . │ |
| 55 | +└──────────────────────────────────────────────────────────────────┘ |
| 56 | + |
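The delta scheme in the sample record layout above could be sketched like this. The sketch assumes big-endian fixed-width fields, a signed 64-bit timestamp, an IEEE 754 double for the value, and non-negative deltas (since the diagram uses `uvarint`). The function names are hypothetical.

```python
import struct


def put_uvarint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128-style varint."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_sample_record(samples) -> bytes:
    """samples: non-empty list of (series_id, timestamp, value) triples."""
    first_id, first_ts, first_val = samples[0]
    buf = bytearray([2])  # type = 2
    # The first sample is stored in full: id <8b>, timestamp <8b>, value <8b>.
    buf += struct.pack(">Qqd", first_id, first_ts, first_val)
    for series_id, ts, value in samples[1:]:
        # Deltas are taken against the first sample, not the previous one.
        id_delta, ts_delta = series_id - first_id, ts - first_ts
        if id_delta < 0 or ts_delta < 0:
            raise ValueError("uvarint deltas cannot be negative in this sketch")
        buf += put_uvarint(id_delta)
        buf += put_uvarint(ts_delta)
        buf += struct.pack(">d", value)
    return bytes(buf)
```

With samples `(10, 1000, 1.5)` and `(12, 1015, 2.0)`, the second sample costs 10 bytes (two 1-byte deltas plus the 8-byte value) instead of 24.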
| 57 | +### Tombstone records |
| 58 | + |
| 59 | +Tombstone records encode tombstones as a list of triples `(series_id, min_time, max_time)` |
| 60 | +and specify an interval for which samples of a series were deleted. |
| 61 | + |
| 62 | + |
| 63 | +┌─────────────────────────────────────────────────────┐ |
| 64 | +│ type = 3 <1b> │ |
| 65 | +├─────────────────────────────────────────────────────┤ |
| 66 | +│ ┌─────────┬───────────────────┬───────────────────┐ │ |
| 67 | +│ │ id <8b> │ min_time <varint> │ max_time <varint> │ │ |
| 68 | +│ └─────────┴───────────────────┴───────────────────┘ │ |
| 69 | +│ . . . │ |
| 70 | +└─────────────────────────────────────────────────────┘ |
| 71 | + |
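A minimal sketch of the tombstone record layout above follows. It assumes a big-endian 8-byte series ID and that `varint` means the zigzag-mapped signed varint used by Go's `encoding/binary`. The function names are hypothetical.

```python
import struct


def put_uvarint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128-style varint."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def put_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer: zigzag mapping, then uvarint."""
    zigzag = ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF
    return put_uvarint(zigzag)


def encode_tombstone_record(tombstones) -> bytes:
    """tombstones: list of (series_id, min_time, max_time) triples."""
    buf = bytearray([3])                     # type = 3
    for series_id, min_time, max_time in tombstones:
        buf += struct.pack(">Q", series_id)  # id <8b>, big-endian assumed
        buf += put_varint(min_time)          # min_time <varint>
        buf += put_varint(max_time)          # max_time <varint>
    return bytes(buf)
```

The signed encoding lets an interval start before timestamp zero, for example `encode_tombstone_record([(5, -1, 100)])`.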
| 72 | +[1]: https://github.com/facebook/rocksdb/wiki/Write-Ahead-Log-File-Format |