
Introduce double write file system #323

Open

wants to merge 33 commits into master from double-write

Conversation

Connor1996 (Member) commented on Jul 7, 2023:

This PR introduces a hedged file system that double-writes every I/O operation. The HedgedFileSystem manages two directories on different cloud disks. All operations on the interface are serialized through one channel per disk, and each call waits until either channel's operation is consumed. With that, if one disk's I/O is slow for a long time, the other can still serve operations without any delay. And once the slow disk comes back to normal, it catches up by replaying the operations accumulated in its channel, after which the states of the two disks are in sync again.

Close #342

Signed-off-by: Connor1996 <[email protected]>
pub struct HedgedHandle<F: FileSystem> {
    disk1: Sender<(FileTask, Callback<usize>)>,
    disk2: Sender<(FileTask, Callback<usize>)>,
    counter1: Arc<AtomicU64>,


What do counters indicate?

Connor1996 (Member, Author) replied:

It is the sequence number of each disk's channel; by comparing them we can tell which disk is newer.

}
}

fn read(&self, offset: usize, buf: &mut [u8]) -> IoResult<usize> {
coderplay commented on Aug 23, 2023:

Are there concurrent readers and writers when operating on raft logs?

Connor1996 (Member, Author) replied:

No concurrent writers, but there may be concurrent readers.

}
}

fn write(&self, offset: usize, content: &[u8]) -> IoResult<usize> {


Is it possible that while we are doing a write, another thread is doing a read?

Connor1996 (Member, Author) replied:

Yes.

coderplay commented on Aug 23, 2023:

From the design doc, we achieve eventual consistency between the two disks, meaning that at a given time the contents of the two disks can differ. If the writer is writing to disk 1 while the reader reads from disk 2, it looks like the raft engine could take a wrong action based on a stale point of view.

Connor1996 (Member, Author) commented on Aug 24, 2023:

Note that the offset is specified explicitly. The reader always reads from the disk that is newer at that point in time. If the reader can read from disk 2, disk 2 must already have the data at that offset. It doesn't need to care about the latest write, because it is not reading at the latest write's offset.

// TODO: read both dirs at recovery, though maybe unnecessary since operations go to both disks
// TODO: consider encryption

impl<F: FileSystem> HedgedFileSystem<F> {


Is this filesystem thread safe?

Connor1996 (Member, Author) replied:

Yes, it's thread-safe via the Sync trait. But it occurs to me that we should make the sends to the two channels atomic.

Comment on lines 566 to 655
let count1 = self.counter1.load(Ordering::Relaxed);
let count2 = self.counter2.load(Ordering::Relaxed);
match count1.cmp(&count2) {
    std::cmp::Ordering::Equal => {
        if let Some(fd) = self.fd1.read().unwrap().as_ref() {
            fd.read(offset, buf)
        } else if let Some(fd) = self.fd2.read().unwrap().as_ref() {
            fd.read(offset, buf)
        } else {
            panic!("Both fd1 and fd2 are None");
        }
    }
    std::cmp::Ordering::Greater => {
        self.fd1.read().unwrap().as_ref().unwrap().read(offset, buf)
    }
    std::cmp::Ordering::Less => {
        self.fd2.read().unwrap().as_ref().unwrap().read(offset, buf)
    }
}
coderplay commented on Aug 23, 2023:

It looks likely to create two concurrent readers, one reading from disk 1 while another reads from disk 2. Any concerns?

Connor1996 (Member, Author) replied:

Yes, as long as the disk has the data, then you can read it. Any concerns?


two cases:

  1. offset > file size
  2. fd is null for one of the disks

Connor1996 (Member, Author) replied:

  1. Impossible given the raft engine's logic: how can you read data that you have never written?
  2. A null fd means the file has been purged, which means no one will visit that file anymore.

coderplay commented on Aug 25, 2023:

  1. Why are you so sure the offset is always <= file size? Is there any contract in the layer above the FileSystem that guarantees it? The upper layer LogFileReader exposes the public API below, through which an arbitrary offset larger than the file size can be passed:

    /// Polls bytes from the file. Stops only when the buffer is filled or
    /// reaching the "end of file".
    pub fn read_to(&mut self, offset: u64, mut buf: &mut [u8]) -> Result<usize> {

  2. See the comment below regarding the fd being non-null for one disk while null for the other.

match pos {
    SeekFrom::Start(offset) => self.offset = offset as usize,
    SeekFrom::Current(i) => self.offset = (self.offset as i64 + i) as usize,
    SeekFrom::End(i) => self.offset = (self.inner.file_size()? as i64 + i) as usize,
coderplay commented on Aug 23, 2023:

The file size on the two disks can be different, so you can get a different result on each call of this method. It looks like that will cause problems.

Connor1996 (Member, Author) replied:

It always returns the file size of the newer disk. The result of successive calls on a file being appended must be monotonically increasing, which is no different from the non-hedged file system.


pub struct HedgedReader<F: FileSystem> {
    inner: Arc<HedgedHandle<F>>,
    offset: usize,


We don't add any concurrency protection on the offset here; does that mean the offsets for the files on the two disks are equivalent?

coderplay commented on Aug 23, 2023:

I remember from the design doc that the contents of the two disks can differ. IIUC, in the case where raft logs have been purged on one disk but not yet on the other, inner: Arc<HedgedHandle<F>> can point to a file in one run and to nothing in another.

Connor1996 (Member, Author) replied:

No need for concurrency protection, because it's not Sync; the Rust compiler takes care of that.


@Connor1996 could you please reply to my 2nd comment?

Connor1996 (Member, Author) commented on Aug 25, 2023:

Yes, it's possible for one to point to a file while the other points to nothing. It still doesn't matter: if the file is purged, nothing will visit it anymore.

coderplay commented on Aug 25, 2023:

The fd may not be meaningful for the disk where the file is purged, but it is meaningful for the other disk where it is not yet purged. Is it possible that, due to the non-atomic counter comparison (get counter1; get counter2; compare counter1 and counter2), the reader picks the wrong fd, which is null, to read data from?

Signed-off-by: Connor1996 <[email protected]>

pub fn bootstrap(&self) -> Result<()> {
    // catch up diff
    let files1 = self.get_files(&self.path1)?;


Is it possible that files1 is the set {f1, f2, f3}, while files2 is {f2, f3, f4}?

Connor1996 (Member, Author) replied:

Yes.


Since they can differ, how does catch_up_diff work in this case?

fn catch_up_diff(&self, fromFiles: Files, toFiles: Files) -> Result<()> {

pub fn bootstrap(&self) -> Result<()> {
    // catch up diff
    let files1 = self.get_files(&self.path1)?;
    let files2 = self.get_files(&self.path2)?;


In the case where enable-log-recycle is enabled, IIUC, file names are reused, meaning the same file name might hold different raft log content over time. Is it possible that for the same file name f, the file on disk1 and the one on disk2 hold different contents?

Connor1996 (Member, Author) replied:

Impossible: log recycling doesn't reuse file names.


Oh, my bad, my understanding was wrong. How does log recycling work then?

// choose latest to perform read
let count1 = self.counter1.load(Ordering::Relaxed);
let count2 = self.counter2.load(Ordering::Relaxed);
match count1.cmp(&count2) {
coderplay commented on Aug 25, 2023:

There are 3 steps here before reading a file:

  1. get counter1
  2. get counter2
  3. compare counter1 and counter2

The 3 steps together are not atomic. If we get a wrong result that counter1 < counter2 due to the non-atomicity, and the offset > file size for counter1's disk, what will happen?

Connor1996 (Member, Author) replied:

It doesn't matter.
Let's say disk2 is the slower one, and its file size < offset:

  • first get s1 = counter1; we then know s1 >= offset >= file size,
  • then get s2 = counter2:
    • if s1 <= s2, we read from disk2; because s2 >= s1 >= offset >= file size, disk2 must have the data by now,
    • if s1 > s2, we read from disk1.

coderplay commented on Aug 25, 2023:

My overall impression of this PR is that we are leveraging certain undocumented invariants in the upper layers, e.g. a single append-only writer plus one reader instance per thread, to ensure concurrency safety in the lower-level HedgedFileSystem. Two drawbacks I can think of:

  • There are many invariants at the FileSystem layer, for example the read-write conflicts, non-atomicity of operations, etc. It's complicated to cover all the cases, and it makes the code hard to understand.
  • In the future, if any engineer is unaware of these hidden relationships, it could easily lead to significant disasters.

Please correct me if my understanding is wrong.

Signed-off-by: Connor1996 <[email protected]>
@Connor1996 force-pushed the double-write branch 2 times, most recently from 1d3ccc3 to 5b1d470, on September 18, 2023 10:07
Signed-off-by: Connor1996 <[email protected]>
Comment on lines +3 to +34
use crossbeam::channel::unbounded;
use crossbeam::channel::Receiver;
use log::info;
use std::fs;
use std::io::{Read, Result as IoResult, Seek, SeekFrom, Write};
use std::path::Path;
use std::path::PathBuf;
use std::sync::atomic::AtomicU64;
use std::sync::atomic::Ordering;
use std::sync::Arc;
use std::thread;
use std::thread::JoinHandle;

use crate::env::log_fd::LogFd;
use crate::env::DefaultFileSystem;
use crate::env::{FileSystem, Handle, Permission, WriteExt};
use futures::executor::block_on;
use futures::{join, select};

mod recover;
mod runner;
mod sender;
mod task;
mod util;

use runner::TaskRunner;
use sender::HedgedSender;
use task::{
empty_callback, paired_future_callback, Callback, FutureHandle, SeqTask, Task, TaskRes,
};
use util::replace_path;

Contributor:

Refer to #339: this import format should be reformatted.

LykxSassinator (Contributor) left a comment:

More unit tests should be added for HedgedFileSystem.

Comment on lines +235 to +242
if t1.is_finished() || t2.is_finished() {
    if t1.is_finished() {
        t1.join().unwrap();
    } else {
        t2.join().unwrap();
    }
    break;
}
Contributor:

Suggested change:

if t1.is_finished() {
    t1.join().unwrap();
    break;
} else if t2.is_finished() {
    t2.join().unwrap();
    break;
}

Comment on lines +271 to +281
std::cmp::Ordering::Equal => {
// still need to catch up, but only diff
recover::catch_up_diff(&self.base, files1, files2, false)?;
return Ok(());
}
std::cmp::Ordering::Less => {
recover::catch_up_diff(&self.base, files2, files1, false)?;
}
std::cmp::Ordering::Greater => {
recover::catch_up_diff(&self.base, files1, files2, false)?;
}
Contributor:

Suggested change:

std::cmp::Ordering::Equal | std::cmp::Ordering::Greater => {
    // still need to catch up, but only diff
    recover::catch_up_diff(&self.base, files1, files2, false)?;
}
std::cmp::Ordering::Less => {
    recover::catch_up_diff(&self.base, files2, files1, false)?;
}

And I'd recommend importing std::cmp::Ordering and abbreviating the variants as Ordering::xxx.

}
}
std::cmp::Ordering::Greater => {
self.handle1.try_get(&self.base)?.unwrap().read(offset, buf)
Contributor:

FYI, shouldn't the error also be handled here?

@@ -40,6 +40,8 @@ pub struct Config {
/// Default: None
pub spill_dir: Option<String>,

pub second_dir: Option<String>,
Member:

maybe backup_dir or mirror_dir

}
}

pub(crate) fn catch_up_diff(
Member:

"if the file in to is not in from, delete it"

Suggested change:

pub(crate) fn synchronize_files(

let count1 = recover::get_latest_valid_seq(&self.base, &files1)?;
let count2 = recover::get_latest_valid_seq(&self.base, &files2)?;

match count1.cmp(&count2) {
Member:

get_latest_valid_seq returns the number of log items in the latest file? Why is it used to determine the synchronization direction?


impl HedgedFileSystem {
    pub fn new(base: Arc<DefaultFileSystem>, path1: PathBuf, path2: PathBuf) -> Self {
        let (tx1, rx1) = unbounded::<(SeqTask, Callback)>();


Should we limit the length of the channel to prevent OOM? That would be the hard memory limit HedgedFileSystem can use.

}
}

async fn wait(&self, task1: Task, task2: Task) -> IoResult<()> {


Suggested change
async fn wait(&self, task1: Task, task2: Task) -> IoResult<()> {
async fn wait_one(&self, task1: Task, task2: Task) -> IoResult<()> {

self.sender.state()
}

async fn wait_handle(&self, task1: Task, task2: Task) -> IoResult<HedgedHandle> {


Suggested change
async fn wait_handle(&self, task1: Task, task2: Task) -> IoResult<HedgedHandle> {
async fn wait_one_handle(&self, task1: Task, task2: Task) -> IoResult<HedgedHandle> {


impl Drop for HedgedFileSystem {
fn drop(&mut self) {
block_on(self.wait(Task::Stop, Task::Stop)).unwrap();


Should it wait for both Stop tasks to finish?

// wait 1s
// one disk may be blocked for a long time,
// to avoid block shutdown process for a long time, do not join the threads
// here, only need at least to ensure one thread is exited


HedgedFileSystem is dropped but its underlying threads may still be alive, which is not ideal.

Maybe we should abort the pending tasks on the slow disk and wait for both threads to exit.

@@ -0,0 +1,9 @@
use std::path::{Path, PathBuf};

pub fn replace_path(path: &Path, from: &Path, to: &Path) -> PathBuf {


The fn could be renamed replace_prefix? Otherwise it's not obvious that the parameter from must be a prefix of path.


check_files(&from_files.rewrite_files, &to_files.rewrite_files)?;
}
check_files(&from_files.reserved_files, &to_files.reserved_files)?;
Ok(())


We should double-check that the copied files are correct via checksums.

let check2 = inner.disk2.len() > get_pause_threshold();
match (check1, check2) {
    (true, true) => {
        panic!("Both channels of disk1 and disk2 are full")


When both disks are slow, maybe we should not panic but instead fall back to a single-disk approach.

mut cfg: Config,
file_system: Arc<F>,
mut listeners: Vec<Arc<dyn EventListener>>,
) -> Result<Engine<F, FilePipeLog<F>>> {
cfg.sanitize()?;
file_system.bootstrap()?;
tonyxuqqi commented on Sep 28, 2023:

I'm wondering if it can fall back to the single-disk solution dynamically. The file system may need extra APIs to wait for all pending writes to be done, and then the engine can switch to a different file_system.


Maybe in another PR

tonyxuqqi left a comment:

Overall LGTM

let files1 = recover::get_files(&self.path1)?;
let files2 = recover::get_files(&self.path2)?;

let count1 = recover::get_latest_valid_seq(&self.base, &files1)?;


So here we only compare the last file's log count to decide which disk is newer?
What if their file counts differ and the older disk has more files?

ti-chi-bot (bot) commented on Nov 5, 2024:

@Connor1996: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: rust-nightly · Commit: 5c7e7b2 · Required: true · Rerun command: /test rust-nightly


Successfully merging this pull request may close these issues.

Support double write to avoid io spike
5 participants