generator does not uniformly distribute operations to nodes #611

@nurturenature

Description

Hi,

It appears that operations are not uniformly distributed to nodes by the generator.
The first specified node, e.g. "n1", gets so many more operations that I thought it was worth quantifying and confirming that this is not desired behavior.

Using a test that is a minimal no-op:

{:db       db/noop
:nemesis   nemesis/noop
:generator (repeat {:type  :invoke
                    :f     :node
                    :value nil})}

with a client that just augments the op map with {:type :ok :value node} and may sleep --op-latency ms to simulate latency:

(invoke!
  [{:keys [node op-latency] :as _this} _test op]
  ; simulate any desired latency
  (when (< 0 op-latency)
    (u/sleep op-latency))

  ; op is ok with a value of node
  (assoc op
         :type  :ok
         :value node))

and a checker that just calculates total ops by node:

(->> history
     h/client-ops
     (h/remove (fn [{:keys [type] :as _op}] (= type :invoke)))
     (reduce (fn [summary {:keys [value] :as _op}]
               (update summary value (fn [old] (+ 1 (or old 0)))))
             (sorted-map)))

as --workload ops-by-node yields results showing "n1" receiving almost 2x the operations of "n5":

{:valid? true,
 :nodes ["n1" "n2" "n3" "n4" "n5"],
 :ops-by-node {"n1" 1919,
               "n2" 1009,
               "n3" 978,
               "n4" 977,
               "n5" 977}}

Manually specifying the nodes in reverse order reverses the skew, with "n5" receiving almost 2x the operations of "n1":

--nodes n5,n4,n3,n2,n1

{:valid? true,
 :nodes ["n5" "n4" "n3" "n2" "n1"],
 :ops-by-node {"n1" 960,
               "n2" 960,
               "n3" 963,
               "n4" 983,
               "n5" 1895}}

Let's try adding a new workload, --workload odd-nodes-only, in which only odd-numbered nodes receive operations, by wrapping the generator in (gen/on-threads #{0 2 4}) (the fn is the set of threads corresponding to the odd-numbered nodes):

{:valid? true,
 :nodes ["n1" "n2" "n3" "n4" "n5"],
 :ops-by-node {"n1" 1936,
               "n3" 1952,
               "n5" 1949}}

And another new workload, --workload on-threads-any, in which all nodes receive operations, by wrapping the generator in (gen/on-threads any?) (any? always returns true):

{:valid? true,
 :nodes ["n1" "n2" "n3" "n4" "n5"],
 :ops-by-node {"n1" 1893,
               "n2" 993,
               "n3" 964,
               "n4" 962,
               "n5" 962}}

Let's introduce some client latency with --op-latency, the simulated amount of time, in ms, an op should take:

--op-latency 15

{:valid? true,
 :nodes ["n1" "n2" "n3" "n4" "n5"],
 :ops-by-node {"n1" 1299,
               "n2" 1240,
               "n3" 1141,
               "n4" 1102,
               "n5" 1040}}

Now try a latency that keeps a node busy for exactly as long as it takes every other node to get a chance at an op,
e.g. 1000 / rate * #-nodes ms:

--op-latency 50

{:valid? true,
 :nodes ["n1" "n2" "n3" "n4" "n5"],
 :ops-by-node {"n1" 1029,
               "n2" 1021,
               "n3" 1019,
               "n4" 1008,
               "n5" 991}}

Looking at:

; When we consume a thread, we bump the next thread index. This means we
; rotate evenly through threads instead of giving a single thread all the
; ops.

and the test:

; We want to distribute requests evenly across threads to prevent
; starvation.

would seem to confirm that this is not the desired behavior?
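As a toy model (not Jepsen's actual scheduler) of why the thread-selection policy matters: rotating the next-thread index, as the quoted comment intends, distributes ops evenly, while always handing the op to the lowest-numbered free thread sends everything to thread 0 when all threads are idle:

```python
from collections import Counter

THREADS = 5
OPS = 1000

def round_robin(n_ops, n_threads):
    """Bump the next-thread index after each op, as the quoted comment intends."""
    counts, nxt = Counter(), 0
    for _ in range(n_ops):
        counts[nxt] += 1
        nxt = (nxt + 1) % n_threads
    return counts

def lowest_free(n_ops, n_threads):
    """Always pick the lowest-numbered free thread. In this zero-latency toy
    every thread is always free, so thread 0 receives every op."""
    counts = Counter()
    for _ in range(n_ops):
        free = list(range(n_threads))  # all threads idle in this toy model
        counts[free[0]] += 1
    return counts

print(round_robin(OPS, THREADS))  # every thread gets 200 ops
print(lowest_free(OPS, THREADS))  # thread 0 gets all 1000 ops
```

The observed ~2x skew toward the first node falls between these two extremes, suggesting the rotation is only partially effective.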

A repository, jepsen-skeleton, has been created to demonstrate this issue.
