This repository is based on https://github.com/duckdb/extension-template; check it out if you want to build and ship your own DuckDB extension.
DuckDB-MPP is a distributed query processing extension for DuckDB, designed to scale
data storage and analytical workloads across multiple nodes. Inspired by the Citus extension for
PostgreSQL, it transforms DuckDB into a powerful MPP (Massively Parallel Processing) system while retaining its
lightweight nature and SQL compatibility.
Warning: this is a work-in-progress (WIP) personal project, primarily aimed at deepening my understanding of DuckDB's internals. It is not intended for production use.
DuckDB extensions use VCPKG for dependency management. Enabling VCPKG is simple: follow the installation instructions or just run the following:
```shell
git clone https://github.com/Microsoft/vcpkg.git
./vcpkg/bootstrap-vcpkg.sh
export VCPKG_TOOLCHAIN_PATH=`pwd`/vcpkg/scripts/buildsystems/vcpkg.cmake
```

Now, to build the extension, run:
```shell
make
```

The main binaries that will be built are:

```shell
./build/release/duckdb
./build/release/test/unittest
./build/release/extension/mpp/mpp.duckdb_extension
```

- `duckdb` is the binary for the DuckDB shell with the extension code automatically loaded.
- `unittest` is the test runner of DuckDB. Again, the extension is already linked into the binary.
- `mpp.duckdb_extension` is the loadable binary as it would be distributed.
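To try the extension after building, the simplest route is the bundled shell. Loading the distributable artifact into a separately installed DuckDB should also work; note that local builds are unsigned, so DuckDB must be started with unsigned extensions allowed (the exact paths below assume the build layout listed above):

```shell
# Shell with the extension code already loaded:
./build/release/duckdb

# Or load the distributable artifact into a stock DuckDB CLI,
# allowing unsigned (locally built) extensions:
duckdb -unsigned -c "LOAD './build/release/extension/mpp/mpp.duckdb_extension';"
```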
- Automatically shards tables across multiple nodes using a shared-nothing architecture, enabling horizontal scalability for large datasets.
- Leverages multiple CPU cores and nodes to process queries in parallel, accelerating analytical workloads.
- Employs a central coordinator node to manage query routing and result aggregation, simplifying client interactions.
- Builds directly on DuckDB's query engine, maintaining full SQL compatibility and leveraging its efficient columnar storage.
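To make the sharding and coordinator roles concrete, here is a minimal conceptual sketch of coordinator-side row routing: a row's partition key is hashed into a bucket, and buckets are spread across workers. This is illustrative only, not the extension's actual code; the bucket count and worker addresses are the ones used in the example below.

```python
# Conceptual sketch of hash-based shard routing (illustrative, not the
# extension's implementation).
NUM_BUCKETS = 6
WORKERS = ["192.168.10.11:12345", "192.168.10.12:12345"]

def route(partition_key: int) -> str:
    """Map a row's partition key to the worker that owns its shard."""
    bucket = hash(partition_key) % NUM_BUCKETS   # stand-in for the real hash
    return WORKERS[bucket % len(WORKERS)]        # buckets spread over workers

for key in (1, 2, 3):
    print(key, "->", route(key))
```

Because routing depends only on the partition key, any node can deterministically find a row's shard without consulting the others.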
Suppose you have three nodes with IP addresses:
- 192.168.10.10 (Coordinator)
- 192.168.10.11 (Worker 1)
- 192.168.10.12 (Worker 2)
On each node, start DuckDB and create a database:

```shell
# On all nodes (coordinator and workers)
duckdb /path/to/test.db
```

On each node, attach the distributed database using the `ATTACH` command:
```sql
-- On 192.168.10.10 (Coordinator)
ATTACH 'test' AS mpp_test (type mpp, endpoint '192.168.10.10:12345');
USE mpp_test;

-- On 192.168.10.11 (Worker 1)
ATTACH 'test' AS mpp_test (type mpp, endpoint '192.168.10.11:12345');
USE mpp_test;

-- On 192.168.10.12 (Worker 2)
ATTACH 'test' AS mpp_test (type mpp, endpoint '192.168.10.12:12345');
USE mpp_test;
```

On the coordinator node, register the worker nodes using `master_add_node`:
```sql
-- On coordinator node (192.168.10.10)
SELECT * FROM master_add_node('192.168.10.11', 12345);
SELECT * FROM master_add_node('192.168.10.12', 12345);
```

```sql
-- On coordinator node (192.168.10.10)
-- Create a distributed table partitioned by 'id' with 6 shards
CREATE TABLE test(id INTEGER, name TEXT) PARTITION BY(id) WITH BUCKETS 6;

-- Insert data (automatically distributed across workers)
INSERT INTO test VALUES(1, 'alex'), (2, 'jason'), (3, 'james');

-- Query the distributed table
SELECT * FROM test;
```

- Create table
- Create table as
- Drop table
- INSERT
- SELECT
- UPDATE
- UPDATE ... RETURNING
- DELETE
- DELETE ... RETURNING
- ALTER
- Distributed transactions
- Shard rebalancing
- Global row id
- Shard pruning
- Co-located tables
- Reference tables
- TPC-H
- TPC-DS
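One item above, the global row id, lends itself to a short sketch. A common MPP technique (not taken from this extension's code) is to pack the shard id into the high bits of a 64-bit integer so that every row is addressable cluster-wide while each shard keeps assigning local row ids independently. The bit split below is an assumption for illustration:

```python
# Conceptual sketch of a 64-bit global row id: high bits identify the
# shard, low bits are the shard-local row id. Bit widths are illustrative.
SHARD_BITS = 16
LOCAL_BITS = 64 - SHARD_BITS

def make_global_rowid(shard_id: int, local_rowid: int) -> int:
    """Combine shard id and shard-local row id into one 64-bit id."""
    return (shard_id << LOCAL_BITS) | local_rowid

def split_global_rowid(gid: int) -> tuple[int, int]:
    """Recover (shard_id, local_rowid) from a global row id."""
    return gid >> LOCAL_BITS, gid & ((1 << LOCAL_BITS) - 1)

print(split_global_rowid(make_global_rowid(3, 42)))
```

This makes operations like `UPDATE ... RETURNING` on distributed tables tractable: the coordinator can tell from the id alone which shard owns a given row.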