-
Notifications
You must be signed in to change notification settings - Fork 733
Open
Description
Hi,
Sometimes at the end of the test the operation which should stop disruption isn't generated. Example from logs:
2024-11-05 15:17:58,885{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :start-partition-node [:isolated {"n1" #{"n2" "n5" "n6" "n4" "n3"}, "n2" #{"n1"}, "n5" #{"n1"}, "n6" #{"n1"}, "n4" #{"n1"}, "n3" #{"n1"}}]
.
.
.
.
.
2024-11-05 15:34:08,356{GMT} INFO [jepsen worker 4] jepsen.generator.interpreter: Healing cluster.
2024-11-05 15:34:08,357{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :stop-partition-ring nil
2024-11-05 15:34:08,498{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :stop-partition-ring :network-healed
2024-11-05 15:34:08,498{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :stop-partition-halves nil
2024-11-05 15:34:08,610{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :stop-partition-halves :network-healed
2024-11-05 15:34:08,610{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :stop-partition-node nil
2024-11-05 15:34:08,732{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :stop-partition-node :network-healed
2024-11-05 15:34:08,732{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :heal-node nil
2024-11-05 15:34:08,732{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :heal-node :not-started
2024-11-05 15:34:08,732{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :stop-network-disruption nil
2024-11-05 15:34:08,847{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :stop-network-disruption [:reliable {"n1" nil, "n2" nil, "n3" nil, "n4" nil, "n5" nil, "n6" nil}]
You can see that at 15:17:58,885 start-partition-node was generated but I don't see corresponding stop-partition-node until the final generator started.
This is the code I have for nemesis:
(ns memgraph.high-availability.create.nemesis
"Memgraph nemesis for HA cluster"
(:require [jepsen
[nemesis :as nemesis]
[generator :as gen]
[net :as net]
[util :as util]
[control :as c]]
[jepsen.nemesis.combined :as nemesis-combined]
[clojure.tools.logging :refer [info]]
[clojure.string :as str]
[memgraph.high-availability.utils :as hautils]
[memgraph
[support :as s]
[query :as mgquery]
[utils :as utils]]))
(defn node-start-stopper
"Takes a targeting function which, given a list of nodes, returns a single
node or collection of nodes to affect, and two functions `(start! test node)`
invoked on nemesis start, and `(stop! test node nodes-config)` invoked on nemesis stop.
Also takes a nodes-config map needed to restart nodes.
Returns a nemesis which responds to :start and :stop by running the start!
and stop! fns on each of the given nodes. During `start!` and `stop!`, binds
the `jepsen.control` session to the given node, so you can just call `(c/exec
...)`.
The targeter can take either (targeter test nodes) or, if that fails,
(targeter nodes).
Re-selects a fresh node (or nodes) for each start--if targeter returns nil,
skips the start. The return values from the start and stop fns will become
the :values of the returned :info operations from the nemesis, e.g.:
{:value {:n1 [:killed \"java\"]}}"
[targeter start! stop! nodes-config]
(let [nodes (atom nil)]
(reify nemesis/Nemesis
(setup! [this _] this)
(invoke! [_ test op]
(locking nodes
(assoc op :type :info, :value
(case (:f op)
:start (let [ns (:nodes test)
ns (try (targeter test ns)
(catch clojure.lang.ArityException _
(targeter ns)))
ns (util/coll ns)]
(if ns
(if (compare-and-set! nodes nil ns)
(c/on-many ns (start! test c/*host*))
(str "nemesis already disrupting "
(pr-str @nodes)))
:no-target))
:stop (if-let [ns @nodes]
(let [value (c/on-many ns (stop! test c/*host* nodes-config))]
(reset! nodes nil)
value)
:not-started)))))
(teardown! [_ _]))))
(defn main-instance?
[instance]
(= (:role instance) "main"))
(defn get-current-main
"Returns the name of the main instance. If there is no main instance or more than one main, throws exception."
[instances]
(let [main-instances (filter main-instance? instances)
num-mains (count main-instances)
main-instance
(cond (= num-mains 1) (first main-instances)
(= num-mains 0) nil
:else (throw (Exception. "Expected at most one main instance.")))
main-instance-name (if (nil? main-instance) nil (:name main-instance))]
main-instance-name))
(defn leader?
[instance]
(= (:role instance) "leader"))
(defn extract-coord-name
"We could use any server. We cannot just use name because in Memgraph coordinator's name is written as 'coordinator_{id}'."
[bolt-server]
(first (str/split bolt-server #":")))
(defn get-current-leader
"Returns the name of the current leader. If there is no leader or more than one leader, throws exception."
[instances]
(let [leaders (filter leader? instances)
num-leaders (count leaders)
leader
(cond (= num-leaders 1) (first leaders)
(= num-leaders 0) nil
:else (throw (Exception. "Expected at most one leader.")))
leader-name (if (nil? leader) nil (extract-coord-name (:bolt_server leader)))]
leader-name))
(defn choose-node-to-kill-on-coord
"Chooses between the current main and the current leader. If there are no clear MAIN and LEADER instance in the cluster, we choose random node. We always connect to
coordinator 'n4' (free choice). If n4 is down, we choose random instance to kill"
[ns coord]
(let [conn (utils/open-bolt coord)]
(try
(utils/with-session conn session
(let [instances (->> (mgquery/get-all-instances session) (reduce conj []))
main (get-current-main instances)
leader (get-current-leader instances)
node-to-kill (cond
(and (nil? main) (nil? leader)) (rand-nth ns)
(nil? main) leader
(nil? leader) main
:else (rand-nth [main leader]))
node-desc (cond
(hautils/data-instance? node-to-kill) "current main"
(hautils/coord-instance? node-to-kill) "current leader"
:else "random node")]
(info "Killing" node-desc ":" node-to-kill)
node-to-kill))
(catch org.neo4j.driver.exceptions.ServiceUnavailableException _e
nil))))
(defn choose-node-to-kill
[ns]
(let [coordinators ["n4" "n5" "n6"]
chooser (partial choose-node-to-kill-on-coord ns)]
(some chooser coordinators)))
(defn node-killer
"Responds to :start by killing a random node."
[nodes-config]
(node-start-stopper choose-node-to-kill
s/stop-node!
s/start-memgraph-node!
nodes-config))
(defn network-disruptor
"Responds to :start by disrupting network on chosen nodes and to :stop by restoring network configuration."
[db]
(nemesis-combined/packet-nemesis db))
(defn full-nemesis
"Can kill, restart all processess and initiate network partitions."
[db nodes-config]
(nemesis/compose
{{:kill-node :start
:heal-node :stop} (node-killer nodes-config)
{:start-partition-halves :start
:stop-partition-halves :stop} (nemesis/partition-random-halves)
{:start-partition-ring :start
:stop-partition-ring :stop} (nemesis/partition-majorities-ring)
{:start-partition-node :start
:stop-partition-node :stop} (nemesis/partition-random-node)
{:start-network-disruption :start-packet
:stop-network-disruption :stop-packet} (network-disruptor db)}))
(defn nemesis-events
"Create a random sequence of nemesis events. Disruptions last [10-60] seconds, and the system remains undisrupted for some time afterwards."
[nodes-config]
(let [events [[{:type :info :f :start-partition-halves}
(gen/sleep (+ 10 (rand-int 51))) ; [10, 60]
{:type :info :f :stop-partition-halves}
(gen/sleep 5)]
[{:type :info :f :start-partition-ring}
(gen/sleep (+ 10 (rand-int 51))) ; [10, 60]
{:type :info :f :stop-partition-ring}
(gen/sleep 5)]
[{:type :info :f :start-partition-node}
(gen/sleep (+ 10 (rand-int 51))) ; [10, 60]
{:type :info :f :stop-partition-node}
(gen/sleep 5)]
[{:type :info :f :kill-node}
(gen/sleep (+ 10 (rand-int 51))) ; [10, 60]
{:type :info :f :heal-node}
(gen/sleep 5)]
[{:type :info :f :start-network-disruption :value [(keys nodes-config) net/all-packet-behaviors]}
(gen/sleep (+ 10 (rand-int 51))) ; [10, 60]
{:type :info :f :stop-network-disruption}
(gen/sleep 5)]]]
(mapcat identity (repeatedly #(rand-nth events)))))
(defn create
"Create a map which contains a nemesis configuration for running HA create test."
[db nodes-config]
{:nemesis (full-nemesis db nodes-config)
:generator (gen/phases
(gen/sleep 5) ; Enough time for cluster setup to finish
(nemesis-events nodes-config))
:final-generator (map utils/op [:stop-partition-ring :stop-partition-halves :stop-partition-node :heal-node :stop-network-disruption])})
Metadata
Metadata
Assignees
Labels
No labels