Currently the library has been tested on Win10 with the msvc v142 x64/x86 toolsets
$ git clone https://github.com/yuyoyuppe/ckvs/
$ cd ckvs
$ git submodule update --init --recursive
Note: `<arch>` in the following lists could be x86 or x64.
- Get premake5 and add it to your `PATH`
- Open msvc developer prompt
$ vcpkg install boost:<arch>-windows
$ premake5 vs2017 // no vs2019 support yet :<
$ cd build
$ msbuild ckvs.sln /p:Configuration=Release /p:Platform=<arch> /p:PlatformToolset=v142
$ cd bin_<arch>
- Get premake5 and add it to your `PATH`
- Install boost (via aptitude/pacman)
$ premake5 gmake2
$ cd build
$ make
$ cd bin_<arch>
Now you can run `property_tests`. Note that you can change the tmp storage paths to test on different hdd types.
- Concurrent hierarchical key-value store supporting `uint32`, `uint64`, `float`, `double`, `string` and `blob` types, capable of handling TBs of data; `blob`s and `string`s can be of arbitrary length
- CRUD operations on key-value pairs (currently insert/remove/find)
- Key-expiration support
- x86/x64 support
- C++17
B+-tree serves as an index, since it guarantees a very small number of random lookups on HDD. Therefore, even when the number of keys is huge, e.g. for a B+-tree of order 1024 with 100 million int64 keys, only <1% (~3500) of nodes are non-leaves, meaning that we can always keep them in RAM with a tiny footprint of ~11MB.
There have been many research papers on concurrent B*-like trees, and an excellent overview is available here. Some of the newer lock-free variants require a nontrivial amount of effort to implement and could yield non-optimal results for on-disk DBs, see figures 13-14. The classic choice is a lock-coupling scheme: perform the "crab-walk" node-locking while traversing the tree, releasing the safe nodes above. For example, to insert some value X into a tree of height 2:
Root A
+-------------------------+
| 1. Write-lock A |
| 2. select appr. child |
+------+------------------+
Internal B |
+-------------------------+ |
| 3. Write-lock B +<---+
| 4. if safe, unlock A |
+-----------------------+-+
|
v Leaf C
+--+----------------------+
| 5. Write-lock C |
| 6. If safe, unlock A, B |
+-------------------------+
We can also optimistically assume that node C won't overflow after the insertion and take read-locks along the way. However, if that assumption turns out to be wrong, we need to restart the traversal from the root, this time using write-locks again.
- Using an appropriate locking primitive is important, e.g. locking `boost::upgrade_mutex` is ~10x slower than `std::shared_mutex` (msvc v142, Win 10, i7-6700), but the latter requires additional support.
- Distributing children of two nodes without allocating an additional Node of greater arity requires very careful index manipulation
- Recursive-along-the-height functions are ok, since the height is usually too small to cause stack overflow
- Storing parent handles would possibly require updating thousands of nodes on rebalancing, so instead we trace them each time during the root-to-leaf traversal
- Unionizing/templating the node type instead of using virtual dispatch allows heavy inlining, zero-copy serialization and policy customization
- To implement slotted values which can be loaded/saved without a (de)serialization phase, we need a new concept: an assign/compare proxy, which the bptree obtains from the locked_node_t interface and then uses for all operations on _slots/_keys. Each proxy holds a slotted_storage pointer (to be able to transfer the actual data between pages) and a kv-store pointer, so it can request additional pages for overflow
To minimize IO overhead, B-tree nodes and various metadata structures are all sized as multiples of the system page size and are manipulated asynchronously through an event-loop. While memory-mapped IO seems like an obvious first choice for the task, it lacks explicit control over flushing (e.g. `PrefetchVirtualMemory` is only a suggestion, lacks an `Evict` counterpart, is a Windows-only API, etc.). Since the software knows exactly which pages should be flushed and loaded, it's reasonable to disable OS file-caching and use direct IO. The awesome llfio library is used to abstract platform details like `OVERLAPPED` etc.
Strict memory budgeting and preallocating everything beforehand alleviate almost all allocation problems at the cost of reduced flexibility. However, having data structures that are tweakable at compile time allows generating the required permutation set and hot-loading it later.
To keep binary compatibility between stores on x86/x64, all arithmetic types are named explicitly, and serialized structures do not contain any pointers and are carefully aligned. Endianness is also statically asserted. There's a peculiar issue of type punning leading to undefined behavior in C++, which can be alleviated in various ways. Keeping the structures `is_trivially_copyable`/`constructible` and `reinterpret_cast`ing them from a standard-blessed type seems to be the way to go.
- Hash-table of fixed capacity with bucket-level spinlocking
- Page descriptors are linked into the buckets' lists; their lifetime is managed by a lockfree stack
- Hybrid LRU/random replacement policy
- Need to be careful not to choose a minuscule capacity, to avoid livelocking (or a `cleanup` callback from `page_lock`)
Always at offset 0 in the physical file. Pinned in the page cache. Could be partially mounted (referenced) into some virtual store.
magic sig, 4 bytes
+
v
+--+--------------------------------------+
| ckvs | name to page id b+-tree root map |
+-----------------------------------------+
^
+
linearized tree of b+-tree roots, <rest of the page> bytes
handles folder structure, e.g.:
- parent_dir
|_ child_ckvs1 -> {tree_root pageID X}
|_ child_ckvs2 -> {tree root pageID Y}
|__|_grandchild_ckvs3 -> {tree root pageID Z}
...
Free Page Stack Page
When a new page is required or a cleaned one needs releasing, free page stack pages come to the rescue! They're located at fixed offsets in the file, specified by the `extend_granularity` setting.
number of free page IDs in this page, sizeof(least_unsigned_t<max_pages>) bytes
+
v
+--+--------------------------------------------------+
| N | free_page_id1, free_page_id2,... free_page_idN |
+----------------------------+------------------------+
^
free page ids array, <rest of the page> bytes
Leaf/Root, 1 byte
| # Keys, If type bit mask > 10 -> not actual key/value, but a slot #id in
the slotted storage, else -> actual key/value, order * 8 bytes each
| 2 bytes | |
v v v v
+--+-------+-------+-------+---+
| Kind | nKeys | Keys | Values | <- serialized node<> struct
+------------------------------+
+ /--------------------------slotted storage-----------------------\
v | 2 bytes -v 2 bytes end_offset + 2 bytes size each |
+-----------------------------------------------+-------------+--------V---------------------+
| B+-tree node | node types | slots_end_offset | nUsed slots | slot descS | slot values |
+---------------------+--------------+------------------------+ ------------------+----------+
type bit mask, -----^ ^ binary/string slot values ---^
4 bits per key |
00: uint64 offset from the page end at which to prepend new slot value, 2 bytes
01: float
10: double
11: blob
100: string
101X: overflown string/blob -> value in the slotted page is overflow_page_id + slot_id
111X: overflown huge string/blob -> value is the Huge Type, value in the storage = huge_page_id
+ size in pages
Used when the string/binary is smaller than page_size but can't fit in the embedded storage on the B+-tree Node Page. Created on demand: Node Page mutators first check up to `overflow_tolerance` existing overflow pages referenced from the embedded slotted storage, and allocate a new one if they couldn't get a slot.
+-----------------------------------------------------------------------------------------------+
| same slotted storage structure as on B+-tree Node Page, but with a capacity of an entire page |
+-----------------------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------------------+
| contiguous binary data, bulk-allocated from the Free Page Stack, so max size |
| of huge binary object = sizeof(page) * (extend_granularity - 1) |
+---------------------------------------------------------------------------------------------+
- extend the type mask to also record whether the key/value in the slotted storage is appended with a timestamp
- pros: smaller size overhead when the % of expiring values is small
- cons: we're forced to lazily remove expired values and need a GC to compact
- separate B+-tree: keys are timestamps, values are the common keys.
- pros: fast traversal from leaf to leaf (since leaves store next-sibling pointers)
- cons: possibly big overhead from duplicating huge keys
- All times are wall clock time.
- Unless specified otherwise, in all tests below the B+-tree order = 1022, page size = 16KB, and all key/values are randomly shuffled `uint64_t`
- Compiled without the `ASSERTS` define at solution scope
- Using a DDR4 2133 RAMdisk. Peak witnessed throughput = ~600MB/s, typical = ~350MB/s
- Note that having 10M+ values in a single physical store seems to choke it at page_cache_capacity = 32768 (should this be the case? investigate why - cache thrashing/bad page settings/too much memcpying etc.), and much earlier with an even smaller capacity
Synthetic `paged_file` vs single-threaded `ofstream` random write test.
Writing ~10K pages of size 16KB, async has 8 threads:
===HDD==
async completed in 1.976 sec
ofstream completed 33.468 sec
===SSD==
async completed in 1.206 sec
ofstream completed 0.784 sec
===RAM==
async completed in 1.133 sec
ofstream completed 0.541 sec
Insert 10M uint64 key/values, split randomly across 8 threads, concurrently removing odd keys, then save to file (in RAM), reopen it and find all even keys in 1 thread:
$ property_tests.exe
OK -- 14.174sec
removed in 6.925 sec
removed in 7.321 sec
Running test #0 ...
queried in 0.78 sec
OK -- 2.146sec
^ including single-threaded insert before the query threads are launched
queried in 0.74 sec
OK -- 2.514sec
queried in 4.951 sec
OK -- 13.751sec
$ property_tests.exe
queried in 9.953 sec
$ property_tests.exe
queried in 658.798 sec
OK -- 1152.19sec
$ property_tests.exe
Running test #0 ...
queried in 80.643 sec
OK -- 121.129sec
todo: more? mixed load tests? 80/20? bigger cache size (relax the power-of-2 page_cache capacity restriction)?