-
Notifications
You must be signed in to change notification settings - Fork 40
Add "pharos reboot" command for rebooting hosts or the entire cluster #1003
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand all the changes.. maybe you could comment changes via review?
lib/pharos/autoload.rb
Outdated
@@ -47,6 +47,9 @@ module CoreExt | |||
|
|||
module SSH | |||
autoload :Client, 'pharos/ssh/client' | |||
autoload :Error, 'pharos/ssh/client' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pharos::SSH::Client
is not yet loaded when the phases are loading.
lib/pharos/config.rb
Outdated
@@ -37,6 +37,11 @@ def self.load(raw_data) | |||
config.hosts.each do |host| | |||
host.api_endpoint = config.api&.endpoint | |||
host.config = config | |||
# map bastion.host to an existing host |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How this is related?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise the bastion.host
points to a duplicate created via Configuration::Host.new
instead to the actual host in config (which is completely fine if the bastion host even isn't in the config)
This means, that for example host.ssh.gateway.shutdown!
when doing a host reboot/reconnect will shut down a gateway that is not the duplicated bastion host's gateway.
lib/pharos/config.rb
Outdated
@@ -122,7 +127,7 @@ def kube_client(kubeconfig) | |||
api_port = 6443 | |||
else | |||
api_address = 'localhost' | |||
api_port = master_host.bastion.host.ssh.gateway(master_host.api_address, 6443) | |||
api_port = master_host.bastion.host.ssh.gateway.open(master_host.api_address, 6443) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ssh.gateway
now returns a Net::SSH::Gateway
instead of directly doing gateway.open
.
lib/pharos/ssh/client.rb
Outdated
def gateway | ||
@gateway ||= Net::SSH::Gateway.new(@host, @user, @opts).tap do |gw| | ||
gw.instance_exec do | ||
@thread.report_on_exception = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was added because when rebooting a node that acts as a bastion, the gateway thread for the host with bastion will die from IOError and report the exception to the terminal.
It seems to manage to reconnect with all the connection/reconnection additions in this PR and the exception appears to be harmless.
lib/pharos/ssh/client.rb
Outdated
end | ||
|
||
private | ||
|
||
def require_session! | ||
raise Error, "Connection not established" if @session.nil? || @session.closed? | ||
connect(timeout: 3) unless connected? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the SSH::Client.exec
etc that use require_session!
attempt to make a connection instead of failing directly.
lib/pharos/ssh/client.rb
Outdated
if bastion | ||
gw_opts = {} | ||
gw_opts[:keys] = [bastion.ssh_key_path] if bastion.ssh_key_path | ||
gateway = Net::SSH::Gateway.new(bastion.address, bastion.user, gw_opts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Becuse ssh.gateway
returned ssh.gateway.open
, the Net::SSH::Gateway.new
was duplicated here and was left in a "dangling" local variable that could not be accessed for shutdown!
.
lib/pharos/reboot_command.rb
Outdated
cluster_manager.apply_reboot_hosts(master_hosts, parallel: false) | ||
end | ||
|
||
puts pastel.green("==> Resharpening tools ...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After rebooting masters, it seemed logical to regather facts, but now I realize it may never reach this point if the master reboot failed to bring up kubelet again. So, this could be skipped I think.
lib/pharos/scripts/reboot-asap.sh
Outdated
@@ -0,0 +1,30 @@ | |||
#!/bin/bash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alternative to this would be some kind of trickery with nohup
(which I tried but failed) or always using +1
. Now it often finishes in a couple of seconds instead of waiting a full minute every time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be easier to generate this in ruby code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
require 'time'
current_time = Time.parse(ssh.exec!('date +%H:%M:%S`))
if current_time.sec > 55
shutdown_time = "+1"
sleepy_time = 60
else
new_time = Time.parse((current_time + 60).strftime('%H:%M:00'))
shutdown_time = new_time.strftime('%H:%M')
sleepy_time = (new_time - current_time).to_i
end
ssh.exec!("shutdown -r #{shutdown_time}")
ssh.disconnect
sleep sleepy_time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remind me again why we need to delay reboot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
ssh.exec!("shutdown -r now &")
sleep 1 until !ssh.connected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To disconnect ssh client gracefully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[36] pry(main)> ssh.exec('sudo shutdown -r now &')
IOError: closed stream
It still exits instantly and possibly crashes some bastion thread. Maybe the bastion hosts need to be rebooted separately.
The transport PR changed a lot stuff and #1090 (which I would like to have merged before finalizing this one) will add more, this PR was made before all that and is probably not up to date at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, but I feel that adding timers is not the right way to do this (they always introduce races). Maybe we need to wait until #1090 is merged.
lib/pharos/reboot_command.rb
Outdated
|
||
unless master_hosts.empty? | ||
puts pastel.green("==> Rebooting #{master_hosts.size} master node#{'s' if master_hosts.size > 1} ...") | ||
cluster_manager.apply_reboot_hosts(master_hosts, parallel: false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Masters are booted sequentially instead of parallelly here.
lib/pharos/scripts/reboot-asap.sh
Outdated
@@ -0,0 +1,30 @@ | |||
#!/bin/bash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be easier to generate this in ruby code?
lib/pharos/phases/reboot_host.rb
Outdated
].freeze | ||
|
||
def call | ||
reboot && reconnect && uncordon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why &&
? Methods don't actually return boolean?
lib/pharos/configuration/bastion.rb
Outdated
def host | ||
Host.new(address: address, user: user, ssh_key_path: ssh_key_path) | ||
@host ||= Host.new(address: address, user: user, ssh_key_path: ssh_key_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
||=
with attr_writer
feels a bit broken?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well yes, if someone would override the host=
method created by attr_writer
, it would not be called.
The alternative that would use the writer method would be:
def host
@host || self.host = Host.new
end
Badly out of sync. Maybe wait for #1044 (this will need to refuse booting the localhost I suppose). |
Oh, #1044 is in and is why this conflicted so bad :) |
lib/pharos/config.rb
Outdated
@@ -123,7 +123,7 @@ def kube_client(kubeconfig) | |||
api_port = 6443 | |||
else | |||
api_address = 'localhost' | |||
api_port = master_host.bastion.host.transport.gateway(master_host.api_address, 6443) | |||
api_port = master_host.bastion.gateway.open(master_host.api_address, 6443) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This requires #1090
lib/pharos/phases/configure_etcd.rb
Outdated
@@ -21,6 +21,7 @@ def call | |||
PEER_NAME: peer_name(@host), | |||
ARCH: @host.cpu_arch.name | |||
) | |||
host.checks['etcd_ca_exists'] = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we using this?
@@ -33,6 +33,7 @@ def call | |||
end | |||
|
|||
cluster_context['master-certs'] = pull_kube_certs unless cluster_context['master-certs'] | |||
host.checks['kubelet_configured'] = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we using this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def new?
!checks['kubelet_configured']
end
# @return [Integer]
def master_sort_score
if checks['api_healthy']
0
elsif checks['kubelet_configured']
1
else
2
end
end
@@ -90,6 +91,7 @@ def push_kube_certs(certs) | |||
transport.file(path).write(contents) | |||
transport.exec!("sudo chmod 0400 #{path}") | |||
end | |||
host.checks['ca_exists'] = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we using this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lib/pharos/phases/validate_host.rb
40: raise Pharos::InvalidHostError, "Cannot change worker host role to master" if @host.master? && [email protected]['ca_exists']
41: raise Pharos::InvalidHostError, "Cannot change master host role to worker" if @host.worker? && @host.checks['ca_exists']
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant are we using it in this PR? How is this relevant?
@@ -99,6 +101,7 @@ def pull_kube_certs | |||
path = File.join(KUBE_DIR, 'pki', file) | |||
certs[file] = transport.file(path).read | |||
end | |||
host.checks['ca_exists'] = certs.key?('ca.key') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we using this?
lib/pharos/reboot_command.rb
Outdated
|
||
unless local_hosts.empty? | ||
puts " " + pastel.red("!" * 76) | ||
puts pastel.red(" The host will remain cordoned (workloads will not be scheduled on it) after the reboot") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All this pastel stuff need to be updated too
lib/pharos/reboot_command.rb
Outdated
unless local_hosts.empty? | ||
puts " " + pastel.red("!" * 76) | ||
puts pastel.red(" The host will remain cordoned (workloads will not be scheduled on it) after the reboot") | ||
puts pastel.red(" To uncordon, you must use: ") + pastel.cyan("pharos exec -c #{config_yaml.filename} -r master -f -- kubectl uncordon #{local_hosts.first}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it could be possible to schedule uncordoning by using something like:
echo "kubectl uncordon #{localhost.hostname}" | at now + 10 minutes
before issuing the reboot command on the localhost
lib/pharos/reboot_command.rb
Outdated
|
||
unless worker_hosts.empty? | ||
puts pastel.green("==> Rebooting #{worker_hosts.size} worker node#{'s' if worker_hosts.size > 1} ...") | ||
cluster_manager.apply_reboot_hosts(worker_hosts, parallel: true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not safe to reboot all workers in parallel. At minimum there needs to be an option for this.
lib/pharos/scripts/reboot-asap.sh
Outdated
@@ -0,0 +1,30 @@ | |||
#!/bin/bash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, but I feel that adding timers is not the right way to do this (they always introduce races). Maybe we need to wait until #1090 is merged.
Cleaned up after the transport changes, I feel there might still be a bit too much logic. |
Hmm I wonder if this is still 2.4.0 -worthy. Seems quite simple nowadays. Maybe add a e2e after "worker up". |
Fixes #996
Adds
pharos reboot
for rebooting nodes.If any of the hosts to reboot are masters, they will be rebooted first. The nodes will be drained before the reboot and uncordoned once they have finished rebooting.