Skip to content
/ py-dht Public

Python module that can be used to simulate how a Kademlia based DHT could be integrated into Ethereum to help with the data-sharding schema

License

Notifications You must be signed in to change notification settings

cortze/py-dht

Repository files navigation

py-dht ·

small and straightforward representation of how a kademlia-based dht could be integrated into ethereum, particularly at das-research.

What can this dht do?

the current work on this simulation of a kademlia-based dht network offers:

  • DHTNewtork object that can: spawn dhtclients, serve as main source to initialize the routing table of the dhtclients, resolve connections between dhtclients, handle and keep track of the interactions between dhtclients. the parameters to configure the a dhtnetwork are:

    • networkid: in case we want to simulate different network at the same time
    • errorrate: to define the number connections that will fall into an error (we can understand it as the likelines o f an error in %)
    • delayrage: range between the slowest possible delay and the biggest one. a random delay will be selected every time a connection is stablished between 2 nodes (if no error is raised)

    the network offers the following functions:

    • parallel_clilist_initializer
    • init_with_random_peers initializes a network using a "blazingly fast" method, which can be optimized even more if a number of threads/processes is defined
    • add_new_node adds a new node to the local Network
    • connect_to_node returns the Connection obj between node A and B
    • bootstrap_node return the "best" nodes/dhtclis to compose the routing table for the given node
    • summary return the summary of the current status of the network (number of nodes, successful connections, failed ones, etc), will evolve over time
  • Connection interface that limits how two dhtclients interact with each other (like if it was a closed api/protocol). it offers the possibility to client a (client starting the connection) to ask client b (remote client), applying if specified the delay at the moment of stablishing the connection and per each interaction:

    • get_closest_nodes_to(hash) will return the k closest nodes to the given hash that client b has in it's routing table. note: it will also return the value if it's stored locally :)

    • store_segment(segment) will add the hash and the segment as a key-value pair in b's local storage

    • retrieve_segment(hash) will ask b to return the segment of the given hash if it has it

  • DHTClient as representation of a node in the simulated dht network. the dhtclient can be created using the following parameters:

    • nodeid: id of the node that hosts the dhtclient
    • network: referece to the network obj that where the dhtclient participates in
    • kbucketsize: k value, number of nodes per kbucket
    • a: number of concurrent node connections the client does while looking for a given key
    • b: target of nodes (number of nodes) returned when asking for a hash
    • steptostop: number of iterations without finding anyone closer to stop the lookup operation

    the client serves a list of endpoints such as:

    • bootstrap uses the network reference to find the right peers for the routing table
    • lookup_for_hash will try to look the value of the hash in the network, and the closest nodes to it
    • get_closest_nodes_to will return the closest nodes to a hash from the local routing table
    • provide_block_segment will lookup for the closest nodes in the network, and store the segment on them
    • store_segment will store locally a segment value using its hash as key
    • retrieve_segment will return the value of a hash if its locally, exception raised otherwise
  • RoutingTable and KBucket classes to store locally the local representation of the network for a given node

  • Hash and BitArray classes to represent a nodeid/blocksegment/generalobject

Dependencies

The source code runs mostly on plain Python libraries. However, to speed up the performance, the plain arrays and dicts were updated to classes from collections. Thus, I recomend to have an specific virtual environment to use the module.

To install the dependencies, do:

# python -m venv venv
# or
# python -m virtualenv venv
# source venv/bin/activate
(venv)$ pip install -r requirements.txt

Tests

The repo has a list of tests to ensure that no functionality is broken whenever a new feature is added. All the tests are triggered whenever GitHub records a Push or a PullRequest. However, there are locally runable using the ./launch_test.sh script

# the script will try to activate any `venv` located at the root of the directory
py-dht$ bash launch_tests.sh

Benchmarks

Running benchmarks is a bit trickier. First install the py-dht module as editable in the venv

py-dht$ pip install -e ./

after that, feel free to change the parameters in the benchmarks/launch_benchmarks.sh and run it like if it was a test:

# the script will try to activate any `venv` located at the root of the directory
py-dht$ cd benchmarks
py-dht/benchmarks$ bash launch_benchmarks.sh

Numbers and recomendations

From the experience of running tests and benchmarks on the repo, I can say that the optimizations on #8 and #9 were more than necesary.

Recomendations:

  • to simulate a network -> use the network.init_with_random_peers() function setting the processes parameters the initialization of the network is by far the process that takes the longer, as it has to compute the best routing tables for each of the spawned nodes. So, please benefit from the concurrency to reduce the duration (check this test as example)
  • At the moment using 20 cores of a ryzen 5900x I'm able to initialize a network of 10k dhtclients with k=20 in 28 secs

Latest numbers of the network becnhmark

# 03.08.2023 
py-dht$ python benchmarks/network.py -t opt-conc -o ./csvs -i 1 -k 20 -n 10000 --threads 20
-- benchmark: opt-conc_network_n_initialization --
rounds          : 1
failed rounds   : 0
prep time (s)   : 1.5974044799804688e-05
avg (s)         : 0.09420228004455566
median (s)      : 0.09420228004455566
p90_duration (s): 0.09420228004455566
-- benchmark: opt-conc_network_bootstrap_node --
rounds          : 1
failed rounds   : 0
prep time (s)   : 0.11602234840393066
avg (s)         : 0.6563236713409424
median (s)      : 0.6563236713409424
p90_duration (s): 0.6563236713409424
-- benchmark: opt-conc_network_fast_bootstrap_network --
rounds          : 1
failed rounds   : 0
prep time (s)   : 8.344650268554688e-06
avg (s)         : 174.6814157962799
median (s)      : 174.6814157962799
p90_duration (s): 174.6814157962799
-- benchmark: opt-conc_network_fast_threaded_bootstrap_network --
rounds          : 1
failed rounds   : 0
prep time (s)   : 0.0008330345153808594
avg (s)         : 28.239949226379395
median (s)      : 28.239949226379395
p90_duration (s): 28.239949226379395

At the moment (after applying the code optimizations in #8 and the conncurrency #9) the numbers look like this:

Number of nodes Default implementation 1st Optimization (monothread) 2nd Optimization (8 processes)
500 10 2.096 0.489
1000 39 5.39 1.02
1200 56 7.22 1.48
2000 157.36 14.49 2.83
5000 995.73 60.01 11.15
10000 (beyond 2 hours) 199.97 39.15
NOTE: all the measurements displayed in the table are expresed in seconds

img.png

Maintainer

Mikel Cortes-Goicoechea - @cortze

Contributing

feel free to dive in! change proposals, issues, and prs will be more than welcome.

Support

  • the work has been supported by codex
  • feel free to support this project through buy me a coffee; help me getting the ship of caffeine that I need 😊.

License

mit, see license file

About

Python module that can be used to simulate how a Kademlia based DHT could be integrated into Ethereum to help with the data-sharding schema

Resources

License

Stars

Watchers

Forks

Packages

No packages published