Open
Description
libmesos keeps supporting the previous version's API for a grace period, so that clusters can be upgraded gracefully.
Upgrade story for a libmesos-based mesos cluster, going from mesos master version X -> Y:
- ensure that all slave+executor+scheduler processes are running on top of libmesos version X or Y
- upgrade one mesos master to version Y and bounce it, then restart the other HA masters (still running X) so that leadership passes to the master on version Y
- observe cluster health metrics, staying on high alert to turn off the new master if things get wonky, so that a master on the old version will take over
- after suspicion fades, upgrade and bounce the other HA masters from X to Y
- gradually move all slave+executor+scheduler processes over to version Y, before the next upgrade (Y -> Z) happens
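
The precondition in the first step (every process on version X or Y before the master moves) can be checked mechanically. Below is a minimal sketch in Go; the `Component` type, the `unsupported` helper, and the version strings are all hypothetical names made up for illustration, and none of this is part of libmesos or mesos-go.

```go
package main

import "fmt"

// Component is a hypothetical record of one slave, executor, or scheduler
// process and the library version it is running on.
type Component struct {
	Name    string
	Version string
}

// unsupported returns the components whose version is neither the current
// master version (from) nor the target master version (to). An empty result
// means the first step of the upgrade story is satisfied.
func unsupported(components []Component, from, to string) []Component {
	var bad []Component
	for _, c := range components {
		if c.Version != from && c.Version != to {
			bad = append(bad, c)
		}
	}
	return bad
}

func main() {
	// Made-up fleet: "W" stands for any version older than X.
	fleet := []Component{
		{Name: "slave-1", Version: "X"},
		{Name: "scheduler-1", Version: "Y"},
		{Name: "executor-7", Version: "W"},
	}
	for _, c := range unsupported(fleet, "X", "Y") {
		fmt.Printf("%s runs %s; move it to X or Y before upgrading the master\n", c.Name, c.Version)
	}
}
```

A check like this only makes sense when the library honors the grace period described above; it says nothing about whether a given mesos-go build does.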
mesos-go based cluster upgrade story:
- restart everything as fast as you can, because no graceful transition is possible, and eat a guaranteed maintenance window of unavailability
- hope in your heart of hearts that nothing bad happens while nervously staring at the health metrics; meanwhile, every second you are losing money and customer trust
Furthermore, many users of mesos-go have no idea that it behaves differently from what they're used to with libmesos! mesos-go needs to make this super explicit up front, because outages are probably already happening due to this mismatch of assumptions!