
Add "pharos reboot" command for rebooting hosts or the entire cluster #1003


Closed
wants to merge 15 commits

Conversation

Contributor

@kke kke commented Jan 24, 2019

Fixes #996

Adds pharos reboot for rebooting nodes.

If any of the hosts to reboot are masters, they will be rebooted first. The nodes will be drained before the reboot and uncordoned once they have finished rebooting.
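A rough sketch of the intended flow (method names hypothetical, not the actual phase code):

# Masters are rebooted one at a time so the control plane stays up;
# each host is drained first and uncordoned once it is back.
def reboot_all(hosts)
  masters, workers = hosts.partition(&:master?)
  masters.each { |host| reboot_host(host) }
  workers.each { |host| reboot_host(host) }
end

def reboot_host(host)
  drain(host)     # evict workloads before the reboot
  reboot(host)    # issue the reboot and wait for the host to come back
  uncordon(host)  # make the node schedulable again
end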

@kke kke added the enhancement New feature or request label Jan 24, 2019
@kke kke requested a review from jakolehm January 24, 2019 09:12
@kke kke self-assigned this Jan 24, 2019
Contributor

@jakolehm jakolehm left a comment


I don't understand all the changes... maybe you could comment on the changes via a review?

@@ -47,6 +47,9 @@ module CoreExt

module SSH
autoload :Client, 'pharos/ssh/client'
autoload :Error, 'pharos/ssh/client'
Contributor


Why do we need these?

Contributor Author


Pharos::SSH::Client is not yet loaded when the phases are loading.
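For context, a minimal illustration of Ruby's Module#autoload (assuming pharos/ssh/client is on the load path): the require is deferred until the constant is first referenced, so phase files can name the constant before anything has loaded it.

module Pharos
  module SSH
    autoload :Client, 'pharos/ssh/client' # required on first reference
    autoload :Error,  'pharos/ssh/client'
  end
end

Pharos::SSH::Client # triggers require 'pharos/ssh/client' right here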

@@ -37,6 +37,11 @@ def self.load(raw_data)
config.hosts.each do |host|
host.api_endpoint = config.api&.endpoint
host.config = config
# map bastion.host to an existing host
Contributor


How is this related?

Contributor Author

@kke kke Jan 24, 2019


Otherwise bastion.host points to a duplicate created via Configuration::Host.new instead of the actual host in the config (which is completely fine if the bastion host isn't in the config at all).

This means that, for example, host.ssh.gateway.shutdown! during a host reboot/reconnect will shut down a gateway that is not the duplicated bastion host's gateway.
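A sketch of what the mapping could look like (assumed, not the exact PR code):

config.hosts.each do |host|
  next unless host.bastion

  # Point bastion.host at the matching host object already in the config,
  # so both share a single SSH client and gateway.
  existing = config.hosts.find { |h| h.address == host.bastion.address }
  host.bastion.host = existing if existing
end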

@@ -122,7 +127,7 @@ def kube_client(kubeconfig)
api_port = 6443
else
api_address = 'localhost'
api_port = master_host.bastion.host.ssh.gateway(master_host.api_address, 6443)
api_port = master_host.bastion.host.ssh.gateway.open(master_host.api_address, 6443)
Contributor Author


ssh.gateway now returns a Net::SSH::Gateway instead of directly doing gateway.open.

def gateway
  @gateway ||= Net::SSH::Gateway.new(@host, @user, @opts).tap do |gw|
    gw.instance_exec do
      @thread.report_on_exception = false # silence IOError when a bastion reboots
    end
  end
end
Contributor Author


This was added because, when rebooting a node that acts as a bastion, the gateway thread for a host behind that bastion dies with an IOError and reports the exception to the terminal.

It seems to manage to reconnect thanks to all the connection/reconnection additions in this PR, and the exception appears to be harmless.
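For reference, a standalone illustration of the Ruby behaviour being silenced here (a plain Thread, not the gateway code):

t = Thread.new do
  # Without this, the dying thread prints the exception and backtrace
  # to the terminal when it terminates.
  Thread.current.report_on_exception = false
  raise IOError, 'closed stream'
end
t.join rescue nil # join re-raises the IOError; rescue to observe it quietly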

end

private

def require_session!
  raise Error, "Connection not established" if @session.nil? || @session.closed?
  connect(timeout: 3) unless connected?
Contributor Author

@kke kke Jan 24, 2019


Changed SSH::Client#exec and the other methods that use require_session! to attempt a connection instead of failing directly.
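One plausible reading of the reworked method (assumed; the hunk above doesn't mark which lines were added or removed):

def require_session!
  # Try to (re)connect first instead of raising straight away.
  connect(timeout: 3) unless connected?
  raise Error, "Connection not established" if @session.nil? || @session.closed?
end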

if bastion
  gw_opts = {}
  gw_opts[:keys] = [bastion.ssh_key_path] if bastion.ssh_key_path
  gateway = Net::SSH::Gateway.new(bastion.address, bastion.user, gw_opts)
Contributor Author


Because ssh.gateway returned the result of gateway.open, the Net::SSH::Gateway.new call was duplicated here, and the gateway was left in a "dangling" local variable that could not be reached for shutdown!.

cluster_manager.apply_reboot_hosts(master_hosts, parallel: false)
end

puts pastel.green("==> Resharpening tools ...")
Contributor Author


After rebooting the masters it seemed logical to regather facts, but now I realize it may never reach this point if the master reboot fails to bring kubelet back up. So this could be skipped, I think.

@@ -0,0 +1,30 @@
#!/bin/bash
Contributor Author


The alternative to this would be some kind of trickery with nohup (which I tried but failed) or always using +1. Now it often finishes in a couple of seconds instead of waiting a full minute every time.

Contributor


Would it be easier to generate this in Ruby code?

Contributor Author

@kke kke Jan 24, 2019


require 'time'

current_time = Time.parse(ssh.exec!('date +%H:%M:%S'))
if current_time.sec > 55
  # Too close to the next minute; let shutdown handle the rounding with +1.
  shutdown_time = "+1"
  sleepy_time = 60
else
  # Round up to the start of the next full minute.
  new_time = Time.parse((current_time + 60).strftime('%H:%M:00'))
  shutdown_time = new_time.strftime('%H:%M')
  sleepy_time = (new_time - current_time).to_i
end
ssh.exec!("shutdown -r #{shutdown_time}")
ssh.disconnect
sleep sleepy_time

Contributor


Remind me again why we need to delay reboot?

Contributor


How about:

ssh.exec!("shutdown -r now &")
sleep 1 until !ssh.connected?

Contributor Author


To disconnect the SSH client gracefully.

Contributor Author


[36] pry(main)> ssh.exec('sudo shutdown -r now &')
IOError: closed stream

It still exits instantly and possibly crashes some bastion thread. Maybe the bastion hosts need to be rebooted separately.

The transport PR changed a lot of stuff and #1090 (which I would like to have merged before finalizing this one) will add more; this PR was made before all that and is probably not up to date at all.

Contributor


Fair enough, but I feel that adding timers is not the right way to do this (they always introduce races). Maybe we need to wait until #1090 is merged.


unless master_hosts.empty?
  puts pastel.green("==> Rebooting #{master_hosts.size} master node#{'s' if master_hosts.size > 1} ...")
  cluster_manager.apply_reboot_hosts(master_hosts, parallel: false)
Contributor Author


Masters are rebooted sequentially instead of in parallel here.


].freeze

def call
  reboot && reconnect && uncordon
Contributor


Why &&? The methods don't actually return booleans, do they?
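For context, what the chain relies on (a sketch of the semantics being questioned, not new code):

def call
  # && short-circuits: if reboot returns nil or false, reconnect and
  # uncordon never run, and call itself returns that falsy value.
  # Each step must return something truthy on success for this to work.
  reboot && reconnect && uncordon
end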

def host
Host.new(address: address, user: user, ssh_key_path: ssh_key_path)
@host ||= Host.new(address: address, user: user, ssh_key_path: ssh_key_path)
Contributor


||= with attr_writer feels a bit broken?

Contributor Author

@kke kke Jan 24, 2019


Well yes, if someone were to override the host= method created by attr_writer, it would not be called.

The alternative that would use the writer method would be:

def host
  @host || self.host = Host.new
end

@kke
Contributor Author

kke commented Feb 19, 2019

Badly out of sync. Maybe wait for #1044 (this will need to refuse to reboot the localhost, I suppose).

@kke
Contributor Author

kke commented Feb 19, 2019

Oh, #1044 is in, and that's why this conflicted so badly :)

@@ -123,7 +123,7 @@ def kube_client(kubeconfig)
api_port = 6443
else
api_address = 'localhost'
api_port = master_host.bastion.host.transport.gateway(master_host.api_address, 6443)
api_port = master_host.bastion.gateway.open(master_host.api_address, 6443)
Contributor Author


This requires #1090

@@ -21,6 +21,7 @@ def call
PEER_NAME: peer_name(@host),
ARCH: @host.cpu_arch.name
)
host.checks['etcd_ca_exists'] = true
Contributor


Are we using this?

@@ -33,6 +33,7 @@ def call
end

cluster_context['master-certs'] = pull_kube_certs unless cluster_context['master-certs']
host.checks['kubelet_configured'] = true
Contributor


Are we using this?

Contributor Author


      def new?
        !checks['kubelet_configured']
      end

      # @return [Integer]
      def master_sort_score
        if checks['api_healthy']
          0
        elsif checks['kubelet_configured']
          1
        else
          2
        end
      end

@@ -90,6 +91,7 @@ def push_kube_certs(certs)
transport.file(path).write(contents)
transport.exec!("sudo chmod 0400 #{path}")
end
host.checks['ca_exists'] = true
Contributor


Are we using this?

Contributor Author


lib/pharos/phases/validate_host.rb
40:        raise Pharos::InvalidHostError, "Cannot change worker host role to master" if @host.master? && !@host.checks['ca_exists']
41:        raise Pharos::InvalidHostError, "Cannot change master host role to worker" if @host.worker? && @host.checks['ca_exists']

Contributor


I meant are we using it in this PR? How is this relevant?

@@ -99,6 +101,7 @@ def pull_kube_certs
path = File.join(KUBE_DIR, 'pki', file)
certs[file] = transport.file(path).read
end
host.checks['ca_exists'] = certs.key?('ca.key')
Contributor


Are we using this?


unless local_hosts.empty?
  puts " " + pastel.red("!" * 76)
  puts pastel.red(" The host will remain cordoned (workloads will not be scheduled on it) after the reboot")
Contributor Author


All this pastel stuff needs to be updated too.

unless local_hosts.empty?
  puts " " + pastel.red("!" * 76)
  puts pastel.red(" The host will remain cordoned (workloads will not be scheduled on it) after the reboot")
  puts pastel.red(" To uncordon, you must use: ") + pastel.cyan("pharos exec -c #{config_yaml.filename} -r master -f -- kubectl uncordon #{local_hosts.first}")
Contributor Author


Hmm, it could be possible to schedule uncordoning by using something like:

echo "kubectl uncordon #{localhost.hostname}" | at now + 10 minutes

before issuing the reboot command on the localhost


unless worker_hosts.empty?
  puts pastel.green("==> Rebooting #{worker_hosts.size} worker node#{'s' if worker_hosts.size > 1} ...")
  cluster_manager.apply_reboot_hosts(worker_hosts, parallel: true)
Contributor


It's not safe to reboot all workers in parallel. At minimum there needs to be an option for this.


@jakolehm jakolehm added this to the 2.4.0 milestone Mar 19, 2019
@kke kke force-pushed the feature/node_reboot branch from 9ef8d35 to 399988f Compare April 3, 2019 11:04
@kke
Contributor Author

kke commented Apr 3, 2019

Cleaned up after the transport changes; I feel there might still be a bit too much logic.

@kke
Contributor Author

kke commented Jun 14, 2019

Hmm, I wonder if this is still 2.4.0-worthy. Seems quite simple nowadays. Maybe add an e2e test after "worker up".

@jakolehm jakolehm modified the milestones: 2.4.0, 2.5.0 Jun 20, 2019
@jakolehm jakolehm closed this Mar 22, 2020