layout | title | excerpt |
---|---|---|
post |
MVCC in TiKV |
This document gives an overview of MVCC implementation in TiKV. |
Transaction isolation is important for database management system. Because database should provide an illusion that the user is the only one who connects to the database, which greatly simplifies application development. But, the concurrency controlling problems like data races must be resolved since there will be a lot of connections to the database. Due to this background, the database management system (DBMS) ensures that the resulting concurrent access patterns are safe, ideally by serializablity.
Though serializablity is a great concept, it is hard to implement efficiently. A classical solution is a variant of Two-Phase Locking, aka 2PL. Using 2PL, the DBMS maintains read and write locks to ensure that conflicting transactions are executed in a well-defined order, which results in serializable execution schedules. But, locking, however, has several drawbacks: First, readers and writers block each other. Second, most transactions are read-only and therefore harmless from a transaction-ordering perspective. Using a locking-based isolation mechanism, no update transaction is allowed to change a data object that has been read by a potentially long-running read transaction and thus has to wait until the read transaction finishes. This severely limits the degree of concurrency in the system.
Multi-Version Concurrency Control(MVCC) is an elegant solution for this problem, in which each update creates a new version of the data object instead of updating data objects in-place, such that concurrenct readers can still see the old version while the update transaction proceeds concurrently. Such stradegy can prevent read-only transactions from waiting, and in fact do not have to use locking at all. This is an extremely desirable property and the reason why many DBMS implements MVCC, e.g., PostgreSQL, Oracle, Microsoft SQL Server.
Let's dive into TiKV
's MVCC implementation, located at src/storage.
Since TiKV
is a disritbuted storage system, it needs a globally unique time service, called Timestamp Oracle
(TSO), to allocate a monotonic increasing timestamp. This function is provided in PD
in TiKV, which is provided by TrueTime API
by using multiple modern clock references(GPS and atomic alocks) in Spanner. So keep in mind that every TS
represents a monotonic increasing timestamp.
To divie into the Transaction part in TiKV
, src/storage is a good begining, which implements the entries. Storage
is a struct that actually receives the get/Scan commands.
pub struct Storage {
engine: Box<Engine>,
sendch: SendCh<Msg>,
handle: Arc<Mutex<StorageHandle>>,
}
impl Storage {
pub fn start(&mut self, config: &Config) -> Result<()> {
let mut handle = self.handle.lock().unwrap();
if handle.handle.is_some() {
return Err(box_err!("scheduler is already running"));
}
let engine = self.engine.clone();
let builder = thread::Builder::new().name(thd_name!("storage-scheduler"));
let mut el = handle.event_loop.take().unwrap();
let sched_concurrency = config.sched_concurrency;
let sched_worker_pool_size = config.sched_worker_pool_size;
let sched_too_busy_threshold = config.sched_too_busy_threshold;
let ch = self.sendch.clone();
let h = try!(builder.spawn(move || {
let mut sched = Scheduler::new(engine,
ch,
sched_concurrency,
sched_worker_pool_size,
sched_too_busy_threshold);
if let Err(e) = el.run(&mut sched) {
panic!("scheduler run err:{:?}", e);
}
info!("scheduler stopped");
}));
handle.handle = Some(h);
Ok(())
}
}
This start
function helps to explain how a storage runs.
Engine is the trait which describes the actual database used in storage system, which is implemented in raftkv and Enginerocksdb.
StorageHandle
is the struct that handles commands received from sendch
powered by mio
.
Then the following functions like async_get
and async_batch_get
will send the corresponding commands to the channel, which can be got by the scheduler to execute asynchronously.
All right, the MVCC protocol calling is exactly implemented in Scheduler. The storage receives commands from clients and sends commands as messages to the scheduler. Then the scheduler will process the command or call corresponding asynchronous function. There are two types of operations, reading and writing. Reading is implemented in MvccReader, which is easy to understand. Writing part is the core of MVCC implementation.
Here comes the core of the transaction model of TiKV, which called 2-Phase Commit powered by MVCC. There are two stages in one transaction.
- Select one row as the primary row, the others as the secondary rows.
- Lock the primary row. Before locking, it will check whether there is other locks on this row or whether there are some commits located after startTS. These two situations will lead to conflicts.If any of themhappens, rollback will be called.
- Repeat the operations on secondary row.
- Write to the
CF_WRITE
with commitTS. - Delete the corresponding lock.
Rollback is called when there are conflicts during the prewrite
.
It is easy to predict that there will be more and more MVCC versions if there is no Garbage Collector to remove the invalid versions. But we cannot just simply removeall the versions before a safe point. Since there maybe only one version for a key, it will be kept. In TiKV
, if there is any Put
or Delete
before the safe point, then all the latter writes can be deleted, otherwise only Delete
, Rollback
and Lock
will be deleted.
During developing and debugging, sometimes we need to know the MVCC version information.So we develop a new tool for searching the MVCC information. TiKV
stores the Key-Values, Locks and Writes informations in CF_DEFAULT
, CF_LOCK
, CF_WRITE
.
All the values of the CF are encoded as following:
default | lock | write | |
---|---|---|---|
key | z{encoded_key}{start_ts(desc)} | z{encoded_key} | z{encoded_key}{commit_ts(desc)} |
value | {value} | {flag}{primary_key}{start_ts(varint)} | {flag}{start_ts(varint)} |
Details can be found here.
Since all the MVCC version information is stored as CF Key-Values in Rocksdb, to search for a Key's version information, we just need to encode the key with different formats then search in the corresponding CF. The CF Key-Values are modeled by MvccKv.