Hi Ionel & Malte, hope things are great with you two. I wanted to give you an update and ask for your expert opinion/advice on a very important issue we identified recently while doing detailed throughput performance testing. First, we have finished incorporating the following scheduling functionality into the Firmament scheduler:
- Node-level affinity/anti-affinity using the network flow graph, following a similar approach as for regular workloads/tasks with no affinity/anti-affinity requirements.
- Pod-level affinity/anti-affinity using a pod-at-a-time, multi-round scheduling approach (a sketch of the idea follows this list). We had to optimize the multi-round process somewhat to get better throughput. Currently, Firmament's throughput is still approx. 2x better even though we are doing pod-at-a-time processing.
- Support for Taints/Tolerations.
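To illustrate what we mean by pod-at-a-time, multi-round scheduling, here is a minimal, hypothetical sketch (the names and the greedy stand-in solver are ours for illustration only, not Firmament's actual code): pods without inter-pod rules are batched into a single solver round, while each pod with inter-pod (anti-)affinity gets its own round, so that its placement is visible when the next pod is considered.

```python
from dataclasses import dataclass, field

@dataclass
class Pod:
    name: str
    avoid: set = field(default_factory=set)  # pods this pod must not co-locate with

def solve_round(pods, nodes, placements):
    # Stand-in for one min-cost max-flow solver run: place each pod on the
    # first node that violates none of its anti-affinity rules.
    for pod in pods:
        for node in nodes:
            co_located = {p for p, n in placements.items() if n == node}
            if not (pod.avoid & co_located):
                placements[pod.name] = node
                break

def schedule(pods, nodes):
    placements = {}
    plain = [p for p in pods if not p.avoid]
    tricky = [p for p in pods if p.avoid]
    # Pods without inter-pod rules go through one batch round: their cost
    # does not depend on where the other queued pods land.
    solve_round(plain, nodes, placements)
    # Pods with inter-pod rules go one per round, so each placement is
    # visible to the feasibility check of the following round.
    for pod in tricky:
        solve_round([pod], nodes, placements)
    return placements

print(schedule([Pod("a"), Pod("b", avoid={"a"})], ["n1", "n2"]))
```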
Overall, the throughput numbers are definitely in Firmament's favor by a large margin, as we discovered earlier as well. However, there is one caveat in all this. In these tests the job size (K8s ReplicaSet, Deployment, or Job) is quite large, with a large number of tasks per job. If the job size is smaller and there is a large number of jobs, Firmament's performance degrades badly, as you can see in the examples below, because the solver has to deal with a huge number of arcs in such cases.
Net-net, based on our assessment, the Firmament scheduler does a great job in use cases where the job size is large, primarily because equivalence classes amortize the scheduling work across a job's tasks. For a large number of jobs each consisting of a small number of tasks, the throughput benefits disappear because of the large number of arcs drawn in the graph.
The question for both of you is whether there is a way to optimize this and reduce the number of arcs in the following examples, so that Firmament can work as a general-purpose scheduler. Please let us know your thoughts and perspective on all this. I will also create an issue about this in CAMSAS. Thanks.
Node Anti-Affinity Scenario
• Let us assume there are 800 nodes in a cluster.
• In a scheduling run, let us say we are processing 15,200 pods.
• Let us say we use 800 replica-sets with 19 replicas in each set.
• Let us also assume that we have set a limit of 19 arcs between each task EC and the nodes (using per-machine ECs, each of capacity 1). This is essentially to load-balance the incoming workloads/pods across multiple nodes (each arc's cost increases incrementally in order to distribute load across the eligible machines).
• In a node anti-affinity use case, let us assume that an incoming pod can go to any of the remaining 799 nodes, since a single node in the cluster conflicts with the node-level anti-affinity rule for the incoming pods.
• Accordingly, the number of arcs in the flow graph ends up being 799 * 800 * 19 = 12,144,800 (eligible nodes × task ECs × arcs per node).
• Even though there are only 15,200 incoming pods in a scheduling run, we end up creating 12,144,800 arcs in the graph unnecessarily.
• Ideally, we should limit the arcs drawn between the task ECs and nodes (via machine ECs) to the 15,200 lowest-cost arcs only (see the sketch after this list).
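To make the arithmetic concrete, here is a small sketch of the arc count for this scenario (the variable names are ours for illustration):

```python
# Arc count for the node anti-affinity scenario above.
task_ecs = 800         # one task EC per replica-set
eligible_nodes = 799   # one node is excluded by the anti-affinity rule
arcs_per_node = 19     # per-machine ECs of capacity 1

total_arcs = eligible_nodes * task_ecs * arcs_per_node
incoming_pods = 800 * 19

print(total_arcs)      # 12,144,800 arcs
print(incoming_pods)   # 15,200 pods -- 799x fewer than the arcs we draw
```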
Normal Pod Scenario
• Let us assume there are 800 nodes in a cluster.
• In a scheduling run, let us say we are processing 15,200 pods.
• Let us say we use 3,040 replica-sets with 5 replicas in each set. Each replica-set uses a unique CPU/memory combination.
• Let us also assume that we have set a limit of 19 arcs between each task EC and the nodes (using per-machine ECs, each of capacity 1). This is essentially to load-balance the incoming workloads/pods across multiple nodes (each arc's cost increases incrementally in order to distribute load across the eligible machines).
• Let us assume incoming pods can go to any of the 800 nodes.
• Accordingly, the number of arcs in the flow graph ends up being 3,040 * 19 * 800 = 46,208,000 (task ECs × arcs per node × nodes).
• Even though there are only 15,200 incoming pods in a scheduling run, we end up creating 46,208,000 arcs in the graph unnecessarily.
• Ideally, we should limit the arcs drawn between the task ECs and nodes (via machine ECs) to the 15,200 lowest-cost arcs only (see the sketch after this list).
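Again for concreteness, a small sketch of this scenario's arc count versus the ideal 15,200-arc cap (variable names are ours for illustration):

```python
# Arc count for the normal-pod scenario above, versus the ideal cap.
task_ecs = 3040        # one task EC per replica-set (unique CPU/memory shape)
arcs_per_node = 19     # per-machine ECs of capacity 1
nodes = 800            # all nodes are eligible

total_arcs = task_ecs * arcs_per_node * nodes
incoming_pods = 3040 * 5

print(total_arcs)                   # 46,208,000 arcs
print(incoming_pods)                # 15,200 pods
print(total_arcs // incoming_pods)  # 3,040x the ideal number of arcs
```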