
Add "pharos reboot" command for rebooting hosts or the entire cluster #1003


Closed
wants to merge 15 commits

Conversation

Contributor

@kke kke commented Jan 24, 2019

Fixes #996

Adds pharos reboot for rebooting nodes.

If any of the hosts to reboot are masters, they will be rebooted first. The nodes will be drained before the reboot and uncordoned once they have finished rebooting.
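A rough sketch of the intended flow (method names hypothetical, not the actual phase code):

# Masters are rebooted one at a time so the control plane stays up;
# each host is drained first and uncordoned once it is back.
def reboot_all(hosts)
  masters, workers = hosts.partition(&:master?)
  masters.each { |host| reboot_host(host) }
  workers.each { |host| reboot_host(host) }
end

def reboot_host(host)
  drain(host)     # evict workloads before the reboot
  reboot(host)    # issue the reboot and wait for the host to come back
  uncordon(host)  # make the node schedulable again
end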

@kke kke added the enhancement New feature or request label Jan 24, 2019
@kke kke requested a review from jakolehm January 24, 2019 09:12
@kke kke self-assigned this Jan 24, 2019
Contributor

@jakolehm jakolehm left a comment


I don't understand all the changes... maybe you could comment on the changes via a review?

@@ -47,6 +47,9 @@ module CoreExt

module SSH
autoload :Client, 'pharos/ssh/client'
autoload :Error, 'pharos/ssh/client'
Contributor


Why do we need these?

Contributor Author


Pharos::SSH::Client is not yet loaded when the phases are loading.
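For context, a minimal illustration of Ruby's Module#autoload (assuming pharos/ssh/client is on the load path): the require is deferred until the constant is first referenced, so phase files can name the constant before anything has loaded it.

module Pharos
  module SSH
    autoload :Client, 'pharos/ssh/client' # required on first reference
    autoload :Error,  'pharos/ssh/client'
  end
end

Pharos::SSH::Client # triggers require 'pharos/ssh/client' right here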

@@ -37,6 +37,11 @@ def self.load(raw_data)
config.hosts.each do |host|
host.api_endpoint = config.api&.endpoint
host.config = config
# map bastion.host to an existing host
Contributor


How is this related?

Contributor Author

@kke kke Jan 24, 2019


Otherwise bastion.host points to a duplicate created via Configuration::Host.new instead of the actual host in the config (which is completely fine if the bastion host isn't in the config at all).

This means that, for example, host.ssh.gateway.shutdown! during a host reboot/reconnect will shut down a gateway that is not the duplicated bastion host's gateway.
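A sketch of what the mapping could look like (assumed, not the exact PR code):

config.hosts.each do |host|
  next unless host.bastion

  # Point bastion.host at the matching host object already in the config,
  # so both share a single SSH client and gateway.
  existing = config.hosts.find { |h| h.address == host.bastion.address }
  host.bastion.host = existing if existing
end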

@@ -122,7 +127,7 @@ def kube_client(kubeconfig)
api_port = 6443
else
api_address = 'localhost'
api_port = master_host.bastion.host.ssh.gateway(master_host.api_address, 6443)
api_port = master_host.bastion.host.ssh.gateway.open(master_host.api_address, 6443)
Contributor Author


ssh.gateway now returns a Net::SSH::Gateway instead of directly doing gateway.open.

def gateway
  @gateway ||= Net::SSH::Gateway.new(@host, @user, @opts).tap do |gw|
    gw.instance_exec do
      @thread.report_on_exception = false # silence IOError when a bastion reboots
    end
  end
end
Contributor Author


This was added because, when rebooting a node that acts as a bastion, the gateway thread for a host behind that bastion dies with an IOError and reports the exception to the terminal.

It seems to manage to reconnect thanks to all the connection/reconnection additions in this PR, and the exception appears to be harmless.
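For reference, a standalone illustration of the Ruby behaviour being silenced here (a plain Thread, not the gateway code):

t = Thread.new do
  # Without this, the dying thread prints the exception and backtrace
  # to the terminal when it terminates.
  Thread.current.report_on_exception = false
  raise IOError, 'closed stream'
end
t.join rescue nil # join re-raises the IOError; rescue to observe it quietly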

end

private

def require_session!
  raise Error, "Connection not established" if @session.nil? || @session.closed?
  connect(timeout: 3) unless connected?
Contributor Author

@kke kke Jan 24, 2019


Changed SSH::Client#exec and the other methods that use require_session! to attempt a connection instead of failing directly.
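One plausible reading of the reworked method (assumed; the hunk above doesn't mark which lines were added or removed):

def require_session!
  # Try to (re)connect first instead of raising straight away.
  connect(timeout: 3) unless connected?
  raise Error, "Connection not established" if @session.nil? || @session.closed?
end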

if bastion
  gw_opts = {}
  gw_opts[:keys] = [bastion.ssh_key_path] if bastion.ssh_key_path
  gateway = Net::SSH::Gateway.new(bastion.address, bastion.user, gw_opts)
Contributor Author


Because ssh.gateway returned the result of gateway.open, the Net::SSH::Gateway.new call was duplicated here, and the gateway was left in a "dangling" local variable that could not be reached for shutdown!.

cluster_manager.apply_reboot_hosts(master_hosts, parallel: false)
end

puts pastel.green("==> Resharpening tools ...")
Contributor Author


After rebooting the masters it seemed logical to regather facts, but now I realize it may never reach this point if the master reboot fails to bring kubelet back up. So this could be skipped, I think.

@@ -0,0 +1,30 @@
#!/bin/bash
Contributor Author


The alternative to this would be some kind of trickery with nohup (which I tried but failed) or always using +1. Now it often finishes in a couple of seconds instead of waiting a full minute every time.

Contributor


Would it be easier to generate this in Ruby code?

Contributor Author

@kke kke Jan 24, 2019


require 'time'

current_time = Time.parse(ssh.exec!('date +%H:%M:%S'))
if current_time.sec > 55
  # Too close to the next minute; let shutdown handle the rounding with +1.
  shutdown_time = "+1"
  sleepy_time = 60
else
  # Round up to the start of the next full minute.
  new_time = Time.parse((current_time + 60).strftime('%H:%M:00'))
  shutdown_time = new_time.strftime('%H:%M')
  sleepy_time = (new_time - current_time).to_i
end
ssh.exec!("shutdown -r #{shutdown_time}")
ssh.disconnect
sleep sleepy_time

Contributor


Remind me again why we need to delay reboot?

Contributor


How about:

ssh.exec!("shutdown -r now &")
sleep 1 until !ssh.connected?

Contributor Author


To disconnect the SSH client gracefully.

Contributor Author


[36] pry(main)> ssh.exec('sudo shutdown -r now &')
IOError: closed stream

It still exits instantly and possibly crashes some bastion thread. Maybe the bastion hosts need to be rebooted separately.

The transport PR changed a lot of stuff and #1090 (which I would like to have merged before finalizing this one) will add more; this PR was made before all that and is probably not up to date at all.

Contributor


Fair enough, but I feel that adding timers is not the right way to do this (they always introduce races). Maybe we need to wait until #1090 is merged.


unless master_hosts.empty?
  puts pastel.green("==> Rebooting #{master_hosts.size} master node#{'s' if master_hosts.size > 1} ...")
  cluster_manager.apply_reboot_hosts(master_hosts, parallel: false)
Contributor Author


Masters are rebooted sequentially instead of in parallel here.


].freeze

def call
  reboot && reconnect && uncordon
Contributor


Why &&? The methods don't actually return booleans, do they?
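For context, what the chain relies on (a sketch of the semantics being questioned, not new code):

def call
  # && short-circuits: if reboot returns nil or false, reconnect and
  # uncordon never run, and call itself returns that falsy value.
  # Each step must return something truthy on success for this to work.
  reboot && reconnect && uncordon
end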

def host
Host.new(address: address, user: user, ssh_key_path: ssh_key_path)
@host ||= Host.new(address: address, user: user, ssh_key_path: ssh_key_path)
Contributor


||= with attr_writer feels a bit broken?

Contributor Author

@kke kke Jan 24, 2019


Well yes, if someone were to override the host= method created by attr_writer, it would not be called.

The alternative that would use the writer method would be:

def host
  @host || self.host = Host.new
end

@kke
Contributor Author

kke commented Feb 19, 2019

Badly out of sync. Maybe wait for #1044 (this will need to refuse to reboot the localhost, I suppose).

@kke
Contributor Author

kke commented Feb 19, 2019

Oh, #1044 is in, and that's why this conflicted so badly :)

@@ -123,7 +123,7 @@ def kube_client(kubeconfig)
api_port = 6443
else
api_address = 'localhost'
api_port = master_host.bastion.host.transport.gateway(master_host.api_address, 6443)
api_port = master_host.bastion.gateway.open(master_host.api_address, 6443)
Contributor Author


This requires #1090

@@ -21,6 +21,7 @@ def call
PEER_NAME: peer_name(@host),
ARCH: @host.cpu_arch.name
)
host.checks['etcd_ca_exists'] = true
Contributor


Are we using this?

@@ -33,6 +33,7 @@ def call
end

cluster_context['master-certs'] = pull_kube_certs unless cluster_context['master-certs']
host.checks['kubelet_configured'] = true
Contributor


Are we using this?

Contributor Author


      def new?
        !checks['kubelet_configured']
      end

      # @return [Integer]
      def master_sort_score
        if checks['api_healthy']
          0
        elsif checks['kubelet_configured']
          1
        else
          2
        end
      end

@@ -90,6 +91,7 @@ def push_kube_certs(certs)
transport.file(path).write(contents)
transport.exec!("sudo chmod 0400 #{path}")
end
host.checks['ca_exists'] = true
Contributor


Are we using this?

Contributor Author


lib/pharos/phases/validate_host.rb
40:        raise Pharos::InvalidHostError, "Cannot change worker host role to master" if @host.master? && !@host.checks['ca_exists']
41:        raise Pharos::InvalidHostError, "Cannot change master host role to worker" if @host.worker? && @host.checks['ca_exists']

Contributor


I meant are we using it in this PR? How is this relevant?

@@ -99,6 +101,7 @@ def pull_kube_certs
path = File.join(KUBE_DIR, 'pki', file)
certs[file] = transport.file(path).read
end
host.checks['ca_exists'] = certs.key?('ca.key')
Contributor


Are we using this?


unless local_hosts.empty?
  puts " " + pastel.red("!" * 76)
  puts pastel.red(" The host will remain cordoned (workloads will not be scheduled on it) after the reboot")
Contributor Author


All this pastel stuff needs to be updated too.

unless local_hosts.empty?
  puts " " + pastel.red("!" * 76)
  puts pastel.red(" The host will remain cordoned (workloads will not be scheduled on it) after the reboot")
  puts pastel.red(" To uncordon, you must use: ") + pastel.cyan("pharos exec -c #{config_yaml.filename} -r master -f -- kubectl uncordon #{local_hosts.first}")
Contributor Author


Hmm, it could be possible to schedule uncordoning by using something like:

echo "kubectl uncordon #{localhost.hostname}" | at now + 10 minutes

before issuing the reboot command on the localhost


unless worker_hosts.empty?
  puts pastel.green("==> Rebooting #{worker_hosts.size} worker node#{'s' if worker_hosts.size > 1} ...")
  cluster_manager.apply_reboot_hosts(worker_hosts, parallel: true)
Contributor


It's not safe to reboot all workers in parallel. At minimum there needs to be an option for this.


@jakolehm jakolehm added this to the 2.4.0 milestone Mar 19, 2019
@kke kke force-pushed the feature/node_reboot branch from 9ef8d35 to 399988f Compare April 3, 2019 11:04
@kke
Contributor Author

kke commented Apr 3, 2019

Cleaned up after the transport changes; I feel there might still be a bit too much logic.

@kke
Contributor Author

kke commented Jun 14, 2019

Hmm, I wonder if this is still 2.4.0-worthy. Seems quite simple nowadays. Maybe add an e2e test after "worker up".

@jakolehm jakolehm modified the milestones: 2.4.0, 2.5.0 Jun 20, 2019
@jakolehm jakolehm closed this Mar 22, 2020