
This is the Cloud Scheduler Roadmap. It is meant to keep everyone interested in the development up to speed, and to keep the developers on track. Please treat the roadmap not as written in stone, but as a flexible, living document.

Each release of the software will be deployed to a dedicated testing platform and evaluated during the next short development cycle. The goal is trackable and incremental improvements in both quality and features.

Branch Plan
Development will occur on the dev branch and be merged into the master branch before each release; master will then be tagged. Should show-stopper bugs appear before the next short release cycle, they will be patched directly in both the dev and master branches.

Version 1.0.2

December 20
1.0.2 Git Tag
1.0.2 Tarball
Changes from 0.14

Features

  • Admin server added – cloud_admin can be used to help manage Cloud Scheduler resources
  • FIFO job scheduling support
  • Option to limit the total number of starting VMs to ease cloud load (see the config sketch after this list)
  • Support for OpenStack clouds
  • General bug fixes and improvements
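
A minimal sketch of what the starting-VM limit might look like in cloud_scheduler.conf. The section and option names below (global, max_starting_vm) are assumptions made for illustration only; check the sample configuration shipped with the release for the actual names.

    # Hypothetical cloud_scheduler.conf fragment -- the section and
    # option names are assumed, not confirmed by this page.
    [global]
    # Cap the number of VMs that may be in the "starting" state at once,
    # to ease the load placed on the underlying clouds.
    max_starting_vm: 10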

Version 0.14

October 4
0.14 Git Tag
0.14 Tarball
Changes from 0.13


Version 0.13

June 28
0.13 Git Tag
0.13 Tarball
Changes from 0.12

Features

  • Retire VMs before reaching their maximum lifetime – issue #153
  • Resources are now allocated properly when different users share the same vmtype / VM image, and jobs only run on the VMs booted for that user’s jobs
  • Cloud resources config information now available from cloud_status -l (used by monitoring)
  • Improvements to balancing behavior when using condor_off
  • New cloud_resources option for enabling / disabling a cloud instead of commenting it out entirely – a disabled cloud will not show up in cloud_status (see the config sketch after this list)
  • Proxy Refreshing fixes
  • VM shutdown logging improvements – each shutdown/destroy should now be preceded by a message indicating the reason for the shutdown and the VM id number
  • Added proper resource tracking for Nimbus clusters with multiple network pools.
  • Automated banning/filtering of jobs for a set time, so Cloud Scheduler will not repeatedly try to start a VM for a job that has a problem such as an expired proxy
  • cloud_status -a now lists the storage remaining on each cloud
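
A minimal sketch of how the new enable/disable switch might look in a cloud_resources entry. The section layout and key names below (cloud_type, vm_slots, enabled) are assumptions made for illustration; only the behaviour (a disabled cloud is skipped and hidden from cloud_status) comes from the list above.

    # Hypothetical cloud_resources entry -- key names are assumed, not
    # confirmed by this page.
    [example-nimbus-cloud]
    cloud_type: Nimbus
    vm_slots: 10
    # Disable this cloud without commenting out the whole section; a
    # disabled cloud will not appear in cloud_status output.
    enabled: false

cloud_status -l can then be used to confirm the configuration the scheduler has loaded, and cloud_status -a to check the storage remaining on the clouds that are still enabled.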

Version 0.12

April 4
0.12 Git Tag
0.12 Tarball
Changes from 0.11

Features

  • Retire VMs before reaching their maximum lifetime – issue #153
  • Lots of little bug fixes

Version 0.11.2

March 8
0.11.2 Git Tag
0.11.2 Tarball
Changes from 0.10

Features

  • Support for using multiple AMIs/EC2 clouds
  • VM IP output to aid with statistics via Munin
  • Bug fixes for cloud reconfiguration, error VM handling, and VM command timeouts
  • User share splitting amongst multiple VMs – issue #169
  • Fixed user share allocation issue with clouds of different size / specs – issue #168
  • Specify target clouds for jobs to run on – issue #151 (see the submit-file sketch after this list)
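
A minimal Condor submit-file sketch for issue #151. The attribute names +VMType and +TargetClouds are assumptions used for illustration; the surrounding lines are plain Condor submit syntax.

    # Hypothetical submit-file fragment -- the +VMType and +TargetClouds
    # attribute names are assumed, not confirmed by this page.
    universe      = vanilla
    executable    = analysis.sh
    # Ask Cloud Scheduler to boot this job's VM only on the named clouds.
    +VMType       = "example-vm"
    +TargetClouds = "cloud-one, cloud-two"
    queue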

Version 0.10

January 11
0.10 Git Tag
0.10 Tarball
Changes from 0.9

Features

  • Integrate MyProxy credential renewal
  • Nimbus credential delegation integration
  • Better selection of VMs for retirement
  • HTTPS image staging

Version 0.9

November 8
0.9 Git Tag
0.9 Tarball
Changes from 0.8

Features

  • Improve job management
  • Much lower resource usage
  • Lots of bug fixes

Version 0.8

September 17
0.8 Git Tag
0.8 Tarball
Changes from 0.7

Features

  • High priority jobs – issue #111
  • Create Scheduling plugin system – issue #98
  • Allow Cloud Scheduler to boot VMs with submitting user’s proxy certificates, rather than community certs – issue #97
  • Graceful Shutdown of VMs
  • Allow multiple jobs per VM – issue #94

Version 0.7

August 6
0.7 Git Tag
0.7 Tarball
Changes from 0.6

Features

  • Reduce memory consumption with thousands of jobs / hundreds of VMs
  • Shutdown of machines that fail to register with Condor – issue #116
  • Better support for Eucalyptus
  • VM banning on specific clusters – issue #117

Version 0.6

June 30
0.6 Git Tag
0.6 Tarball

Features

  • Add additional resources without causing service interruptions – issue #102
  • Allow Cloud Scheduler to boot VMs with submitting user’s proxy certificates, rather than community certs – issue #97 (moved to 0.7)
  • Integration of the Cloud Aggregator tool to get up to date cloud information – issue #84
  • Investigate using EC2 spot pricing – issue #96
  • Make Job and Resource pool thread-safe – issue #95
  • Allow multiple jobs per VM – issue #94 (moved to 0.7)
  • Create Scheduling plugin system – issue #98 (moved to 0.7)
  • Memory based VM distribution/balancing – issue #92
  • Summary information for cloud_status -m – issue #108

Version 0.5

May 20
0.5 Git Tag
0.5 Tarball

Features

  • Move job polling and VM polling to their own threads for better responsiveness
    • Right now, polling a condor schedd with 3000 jobs takes 20-30 minutes (!) which causes n VMs to take ~30*n minutes to start.
  • Option for graceful shutdown of VMs so running jobs will not be interrupted by Cloud Scheduler shutting down VMs to redistribute resources
  • Feature to have a VM stay around for a set time after jobs finish: +VMKeepAlive = “minutes” – primarily for testing/debugging where jobs may need to be resubmitted (see the submit-file sketch after this list)
  • cloud_status can display information about jobs
  • cloud_resources is now pickled (persisted) so state can be recovered after a crash
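
The +VMKeepAlive attribute quoted above is set in the job’s Condor submit file; its value is the number of minutes the VM should linger after the job finishes. The sketch below is a minimal illustration, and every line other than +VMKeepAlive is generic submit syntax added here as an assumption.

    # Minimal submit-file sketch for the keep-alive feature; only
    # +VMKeepAlive comes from this page, the rest is illustrative.
    universe     = vanilla
    executable   = test_job.sh
    # Keep the booted VM alive for 30 minutes after the job finishes,
    # so a resubmitted job can reuse it while testing/debugging.
    +VMKeepAlive = "30"
    queue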

Version 0.4

April 9
0.4 Git Tag
0.4 Tarball

Features

  • Improve life cycle management.
    • Ensure that there are no more running VMs than jobs requiring that VM type
    • Implement a grace period for destroying unneeded VMs – leave the last existing VM of a type alive for a set time. (likely low priority for Astronomers)
    • In the event that all VM slots are full, shut down VMs to create a more fair distribution of resources
  • Add multicore VM request support for Nimbus
  • Add contextualization support (Nimbus first)
  • Improve performance and reliability

Tasks

  • Add multicore VM test set
  • Add job priority scheme test set

Version 0.3

Due approximately January 29
Pushed to February 5
Released February 8

0.3 Git Tag
0.3 Tarball

Features

  • Anti-starvation model for job fairness and throttling.
  • Worst case VM scheduling such that there is an even distribution of VM slots between users.
  • Improve life cycle management.
    • Ensure that there are no more running VMs than jobs requiring that VM type.
    • Clean shutdown of VMs to support cleaner status for Condor.
  • Job priority on per user basis

Tasks

  • Fair scheduling to first order for multiple users, with 100 jobs each. Note that we don’t expect perfectly fair scheduling to occur.
  • Second Draft of Software Design Specification.

Goals

  • 1000 jobs, 4 minute jobs, single user. 100% job completion.
  • 1000 jobs, 5 competing users (with initial fairness testing).

Version 0.2

Due approximately December 18th
Pushed to January 8
Released January 11

Git Tag
Tarball

In this release we make sure that we can boot VMs on all likely IaaS software. We improve our interactions with OpenNebula and Eucalyptus.

Features

  • Improved support for EC2-like clusters including OpenNebula and Eucalyptus.
    • Amazon EC2 support added.
    • Make sure we can support the differences in the connection details between the cluster software.
  • Give users the ability to request multiple cores and scratch space when submitting a Condor job (see the submit-file sketch after this list).
  • Improve life cycle management.
    • Eliminate VM instances when no jobs require them.
    • Improvements will be based on the lessons learned during 0.1
    • Astronomer input will be required to determine what is an improvement.
  • First attempt at basic fairness and throttling
  • Blank space partition support
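
A minimal sketch of what a multi-core, scratch-space request might look like in a Condor submit file. The attribute names +VMCPUCores and +VMStorage, and the unit of the storage value, are assumptions made for illustration only.

    # Hypothetical submit-file fragment -- the +VM* attribute names and
    # the storage unit are assumed, not confirmed by this page.
    universe    = vanilla
    executable  = reduce_images.sh
    # Request a multi-core VM with local scratch space for the job.
    +VMCPUCores = "4"
    +VMStorage  = "20"
    queue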

Tasks

  • Second Draft of Software Design Specification

Goals

  • 100 jobs at a time, 4 minutes, multiple users (not very fair; first come, first served). 100% job completion.
  • 100 jobs, 4 minutes, using multiple cores and different memory requirements (not guaranteed to work in a fair way).

Version 0.1

Due approximately November 30
Released December 2

Git Tag
Tarball

The goal of 0.1 is to have a minimally functional test environment that can be used by one or two cooperating testers (i.e. Astronomers).

Features

  • Basic scheduling heuristic to boot x number of VMs for a set of jobs
    • For example, mapping n jobs to 1 VM; version 0.0 uses an unsophisticated 1:1 mapping.
  • Improved VM life cycle management
    • For example, VM life cycle managed by a default “time-to-live,” after which the VM will be destroyed if not busy (executing jobs).
    • Alternatively, a simple VM life cycle scheme in which a VM is shut down when no jobs exist that require its VM type.
  • Basic balanced scheduling between clusters. This should be considered a rudimentary first effort.
  • Able to boot VMs on Amazon EC2.

Tasks

  • First draft of Software Design Specification (SDS)
  • Fully documented install process
  • Expose all useful configuration options in the configuration file.
  • Document all configuration options

Goal

  • 100 total jobs in the system at one time. No fairness between users.
  • Make sure we have 100% job completion.

Version 0.0

Due November 12 Firm
Released November 12

Git Tag
Tarball

This is an enhanced version of the proof of concept deployment that was demonstrated at Banff Summit 09.

Features

  • Booting of user created VMs when jobs are detected in the Condor queue.
  • 1:1 (VM:job) primitive scheduling algorithm.
  • Limited support for VM parameter specification (currently: memory, CPU architecture, network).
  • No VM life cycle management.
  • No fairness between users; no job priorities.
  • No more than 10 jobs at a time.

Future Work

  • Graceful shutdown of VMs for load balancing (e.g. not killing jobs)
  • Investigate supporting OpenNebula as its query API matures (outstanding bugs must be fixed).
  • Pull dynamic cloud resource information from MDS (need to investigate MDS replacements).
  • Support a more sophisticated job priority scheme.
    • Ideas: an express queue or, alternatively, job preemption (in SGE terms, an “override queue”). This is important for things like night-to-night processing.
  • Save persistent data between application runs