A learning-oriented mini distributed file system written in Rust, focused on chunking, hashing, compression, replication, and parity recovery, loosely inspired by the design ideas of the Google File System (GFS).
Inspirational parallels to GFS:
- Large immutable chunks minimise master interactions and make sequential I/O efficient.
- Master + chunkserver mindset (here: coordinator + nodes) keeps metadata centralised but datapath peer-to-peer.
- Replication & cheap hardware: tolerate node loss by re-replicating or rebuilding from parity.
Under active development (pre-v1)
- CLI scaffold via `clap` (`put`, `get`, `node`, `coord`)
- Chunker: fixed-size splitter + SHA-256 hash table + unit tests
- In-memory catalog mapping `file → chunks → nodes`
- Configurable replication factor (round-robin placement)
- LZ4 compression on each chunk
- `put` operation
- XOR parity per block-set; offline node recovery
- Download path with checksum verification, decompression & reassembly
- Benchmarks with `hyperfine` + flamegraphs
- Parallel transfer path (threads or Tokio)
- Error handling of edge cases
- Usage examples & retrospective in README
- Stretch: gRPC streaming, consistent-hash ring, Reed–Solomon erasure coding
```mermaid
flowchart LR
    %% Client side
    subgraph CLIENT["Client Machine"]
        APP[Application]
        CLI[dfs-sim client / CLI]
        APP --> CLI
    end

    %% Coordinator
    COORD["Coordinator / Master<br/>(in-memory catalog + replication policy)"]

    %% Storage box (no title here)
    subgraph STORAGE
        direction TB
        N1[Node-1]
        N2[Node-2]
        N3[Node-3]
        MORE((...))
        %% External storage label as its own node
        L["Chunk Servers"]
        style L fill:none,stroke:none
    end

    %% Control (dashed)
    CLI -. lookup(file, chunk-idx) .-> COORD
    COORD -. locations + chunk handles .-> CLI
    N1 -. heartbeat + chunk state .-> COORD
    N2 -. heartbeat + chunk state .-> COORD
    N3 -. heartbeat + chunk state .-> COORD

    %% Data (solid)
    CLI -->|stream chunk| N1
    CLI -->|replica / parity| N2
    CLI -->|replica / parity| N3
```
Solid arrows: data path.
Dashed arrows: metadata / control.
```bash
# three nodes
dfs-sim node --id n1 --listen 127.0.0.1:4001
dfs-sim node --id n2 --listen 127.0.0.1:4002
dfs-sim node --id n3 --listen 127.0.0.1:4003

# coordinator
dfs-sim coord --nodes 127.0.0.1:4001,127.0.0.1:4002,127.0.0.1:4003 \
              --replication 2 --parity xor

# store & retrieve
dfs-sim put ./large.iso --name iso/ubuntu.iso
dfs-sim get iso/ubuntu.iso --out ./restore.iso
```
Bring one node down (Ctrl-C `n1`) and repeat `dfs-sim get …` to watch parity recovery in the logs.
- PUT: CLI streams the file ➜ Coordinator.
- The Coordinator splits the file into fixed-size 64 MB chunks, computes a SHA-256 hash per chunk, compresses each one, picks the replica + parity layout, then streams each chunk to the selected nodes in parallel (round-robin placement for now).
- GET: the Coordinator returns the ordered chunk list; the CLI concurrently pulls chunks, verifies hashes, decompresses, and reassembles (see the sketch after this list). Missing chunks trigger a parity rebuild on the fly.
- Node failure simulation: bring a node down; the catalog notices the missed heartbeats, marks the node offline, and the read path fetches from replicas or kicks off XOR reconstruction.
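A minimal sketch of the GET-side verify-and-reassemble loop; `ChunkRef`, `fetch_and_decompress`, and the catalog shape are assumptions standing in for the real network and LZ4 code paths:

```rust
use sha2::{Digest, Sha256};
use std::fs::File;
use std::io::Write;

/// One catalog entry as the GET path might see it (hypothetical shape).
struct ChunkRef {
    index: u64,
    sha256: [u8; 32],
    locations: Vec<String>, // node addresses holding a replica
}

/// Reassemble a file from its ordered chunk list, verifying each chunk.
/// `fetch_and_decompress` is a stand-in for the real network + LZ4 path.
fn reassemble(
    chunks: &[ChunkRef],
    fetch_and_decompress: impl Fn(&ChunkRef) -> std::io::Result<Vec<u8>>,
    out_path: &str,
) -> std::io::Result<()> {
    let mut out = File::create(out_path)?;
    for chunk in chunks {
        let bytes = fetch_and_decompress(chunk)?;
        // Integrity check: recompute SHA-256 and compare with the catalog hash.
        let digest = Sha256::digest(&bytes);
        if digest.as_slice() != chunk.sha256.as_slice() {
            // A corrupt or missing replica would trigger the parity rebuild path here.
            return Err(std::io::Error::new(
                std::io::ErrorKind::InvalidData,
                format!("checksum mismatch on chunk {}", chunk.index),
            ));
        }
        out.write_all(&bytes)?;
    }
    Ok(())
}
```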
- Each chunk is 64 MB: large enough to amortize metadata overhead, small enough for a demo.
- Once written, chunks are append-only, mirroring GFS's design for simpler consistency guarantees.
- The chunker reads the file in 64 MB blocks and computes a SHA-256 hash per block; the hashes are stored in a custom data structure (see the sketch below).
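A minimal sketch of that fixed-size split + hash step using the `sha2` crate; `chunk_file` and the returned record shape are illustrative, not the project's actual API:

```rust
use sha2::{Digest, Sha256};
use std::fs::File;
use std::io::Read;

const CHUNK_SIZE: usize = 64 * 1024 * 1024; // 64 MB

/// (chunk index, SHA-256 digest, chunk length in bytes)
type ChunkRecord = (u64, [u8; 32], usize);

/// Read a file in fixed 64 MB blocks and hash each block.
fn chunk_file(path: &str) -> std::io::Result<Vec<ChunkRecord>> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; CHUNK_SIZE];
    let mut records = Vec::new();
    let mut index = 0u64;
    loop {
        // Fill the buffer as far as possible; the last chunk may be short.
        let mut filled = 0;
        while filled < CHUNK_SIZE {
            let n = file.read(&mut buf[filled..])?;
            if n == 0 {
                break;
            }
            filled += n;
        }
        if filled == 0 {
            break; // end of file
        }
        let mut digest = [0u8; 32];
        digest.copy_from_slice(Sha256::digest(&buf[..filled]).as_slice());
        records.push((index, digest, filled));
        index += 1;
    }
    Ok(records)
}
```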
- Each chunk is streamed to R distinct nodes (`--replication R`).
- An optional XOR parity block is written to a further node set (`--parity xor`), allowing recovery from a single-node loss (see the sketch after this list).
- The coordinator tracks live nodes; missing replicas are lazily rebuilt in the background.
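A sketch of the single-loss recovery idea (equal-length chunks assumed, padding implied): the parity block is the byte-wise XOR of all chunks in a set, so any one missing chunk can be rebuilt by XOR-ing the parity with the survivors. Function names here are illustrative:

```rust
/// Byte-wise XOR parity over a set of equal-length chunks.
fn xor_parity(chunks: &[Vec<u8>]) -> Vec<u8> {
    let len = chunks.iter().map(|c| c.len()).max().unwrap_or(0);
    let mut parity = vec![0u8; len];
    for chunk in chunks {
        for (p, b) in parity.iter_mut().zip(chunk) {
            *p ^= b;
        }
    }
    parity
}

/// Rebuild one missing chunk from the parity block and the surviving chunks.
fn rebuild_missing(parity: &[u8], survivors: &[&[u8]]) -> Vec<u8> {
    let mut rebuilt = parity.to_vec();
    for chunk in survivors {
        for (r, b) in rebuilt.iter_mut().zip(*chunk) {
            *r ^= b;
        }
    }
    rebuilt
}

fn main() {
    let a = b"chunk-a".to_vec();
    let b = b"chunk-b".to_vec();
    let c = b"chunk-c".to_vec();
    let parity = xor_parity(&[a.clone(), b.clone(), c.clone()]);
    // Pretend the node holding `b` is offline: rebuild it from parity + survivors.
    let recovered = rebuild_missing(&parity, &[&a, &c]);
    assert_eq!(recovered, b);
}
```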
- Namespace & placement: chooses nodes, maintains in-memory catalog (serialised to disk on exit).
- Heart-beat & health: nodes ping every 3 s; missed pings mark node offline.
- Recovery: scans the catalog and schedules background `parity::rebuild()` tasks (a health-sweep sketch follows this list).
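A minimal sketch of how that heartbeat and recovery bookkeeping could look; `NodeHealth`, `Coordinator::sweep`, and the 9 s offline threshold are illustrative assumptions, not the actual implementation:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Assumed policy: a node that misses three 3 s heartbeats is considered offline.
const OFFLINE_AFTER: Duration = Duration::from_secs(9);

#[derive(Debug, PartialEq)]
enum NodeState {
    Online,
    Offline,
}

struct NodeHealth {
    last_heartbeat: Instant,
    state: NodeState,
}

struct Coordinator {
    nodes: HashMap<String, NodeHealth>,
}

impl Coordinator {
    /// Record an incoming heartbeat from a node.
    fn on_heartbeat(&mut self, node_id: &str) {
        let entry = self
            .nodes
            .entry(node_id.to_string())
            .or_insert_with(|| NodeHealth {
                last_heartbeat: Instant::now(),
                state: NodeState::Online,
            });
        entry.last_heartbeat = Instant::now();
        entry.state = NodeState::Online;
    }

    /// Periodic sweep: mark silent nodes offline and return them so the
    /// caller can schedule background rebuild tasks for their chunks.
    fn sweep(&mut self) -> Vec<String> {
        let now = Instant::now();
        let mut newly_offline = Vec::new();
        for (id, health) in self.nodes.iter_mut() {
            if health.state == NodeState::Online
                && now.duration_since(health.last_heartbeat) > OFFLINE_AFTER
            {
                health.state = NodeState::Offline;
                newly_offline.push(id.clone());
            }
        }
        newly_offline
    }
}
```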
- Chunk bytes → `lz4_flex::compress`
- Prepend 4-byte original-size header
- Store the compressed buffer; on read, reverse the steps (sketched below).
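A minimal sketch of that wrapper with `lz4_flex`'s block API; the 4-byte little-endian header layout is an assumption based on the description above:

```rust
use lz4_flex::{compress, decompress};

/// Compress a chunk and prepend the original length as a 4-byte LE header.
fn compress_chunk(raw: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + raw.len());
    out.extend_from_slice(&(raw.len() as u32).to_le_bytes());
    out.extend_from_slice(&compress(raw));
    out
}

/// Reverse the wrapper: read the header, then decompress to exactly that size.
fn decompress_chunk(stored: &[u8]) -> Result<Vec<u8>, lz4_flex::block::DecompressError> {
    let (header, body) = stored.split_at(4);
    let original_len = u32::from_le_bytes(header.try_into().unwrap()) as usize;
    decompress(body, original_len)
}
```

(`lz4_flex` also ships `compress_prepend_size` / `decompress_size_prepended`, which handle a size header for you; the manual header above just makes the on-disk layout explicit.)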
- Tokio runtime (or `std::thread` + channels) drives async transfers.
- Chunk uploads are pipelined; hashing & compression run in a scoped thread pool so CPU work overlaps I/O.
- Shared state lives behind `Arc<Mutex<…>>` to exercise Rust's ownership model (see the sketch after this list).
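A minimal sketch of the shared-state pattern with `std::thread::scope` and the `sha2` crate (the Tokio variant would use tasks and an async mutex instead); the catalog shape here is illustrative:

```rust
use sha2::{Digest, Sha256};
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Catalog: chunk index -> SHA-256 digest, shared across worker threads.
    let catalog: Arc<Mutex<HashMap<u64, [u8; 32]>>> = Arc::new(Mutex::new(HashMap::new()));

    let chunks: Vec<Vec<u8>> = vec![vec![1u8; 1024], vec![2u8; 1024], vec![3u8; 1024]];

    // Scoped threads let workers borrow `chunks` without moving ownership.
    thread::scope(|s| {
        for (idx, chunk) in chunks.iter().enumerate() {
            let catalog = Arc::clone(&catalog);
            s.spawn(move || {
                // CPU-bound work (hashing, compression) runs off the main thread,
                // overlapping with I/O done elsewhere.
                let mut digest = [0u8; 32];
                digest.copy_from_slice(Sha256::digest(chunk).as_slice());
                catalog.lock().unwrap().insert(idx as u64, digest);
            });
        }
    });

    println!("hashed {} chunks", catalog.lock().unwrap().len());
}
```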
Planned with `hyperfine` (throughput) and `cargo-flamegraph` (hot spots). The report will include:
- PUT/GET MB/s vs. replication factor
- Compression ratio vs. CPU cost
- Restore latency under node loss
- Ghemawat S., Gobioff H., Leung S-T. The Google File System. SOSP 2003.
- Project outline: dfs-simulator-project (pdf).
Contributions will be welcome once the core scaffolding stabilises. Feel free to open issues for design discussion; pull requests will be reviewed after an initial public alpha.
Released under the MIT License – see `LICENSE` for details.