
This is the Cloud Scheduler Roadmap. It is meant to keep everyone interested in the development up to speed, and to keep the developers on track. Please treat the roadmap not as written in stone, but as a flexible, living document.

Each release of the software will be deployed to a dedicated testing platform and evaluated during the next short development cycle. The goal is trackable and incremental improvements in both quality and features.

Branch Plan
Development will occur on the dev branch and be merged into the master branch before each release; master will then be tagged. Should show-stopper bugs appear before the next short release cycle, they will be patched directly in both the dev and master branches.

Version 1.0.2

December 20
1.0.2 Git Tag
1.0.2 Tarball
Changes from 0.14

Features

  • Admin server added – cloud_admin can be used to help manage Cloud Scheduler resources
  • FIFO job scheduling support
  • Option to limit the total number of starting VMs to ease cloud load (see the config sketch after this list)
  • Support for OpenStack clouds
  • General bug fixes and improvements
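
A minimal sketch of what the starting-VM limit might look like in cloud_scheduler.conf. The section and option names below (global, max_starting_vm) are assumptions made for illustration only; check the sample configuration shipped with the release for the actual names.

    # Hypothetical cloud_scheduler.conf fragment -- the section and
    # option names are assumed, not confirmed by this page.
    [global]
    # Cap the number of VMs that may be in the "starting" state at once,
    # to ease the load placed on the underlying clouds.
    max_starting_vm: 10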

Version 0.14

October 4
0.14 Git Tag
0.14 Tarball
Changes from 0.13


Version 0.13

June 28
0.13 Git Tag
0.13 Tarball
Changes from 0.12

Features

  • Retire VMs before reaching their maximum lifetime – issue #153
  • Resources are now allocated properly when different users share the same vmtype / VM image, and jobs only run on the VMs booted for that user’s jobs
  • Cloud resources config information now available from cloud_status -l (used by monitoring)
  • Improvements to balancing behavior when using condor_off
  • New cloud_resources option for enabling / disabling a cloud instead of commenting it out entirely – a disabled cloud will not show up in cloud_status (see the config sketch after this list)
  • Proxy Refreshing fixes
  • VM shutdown logging improvements – each shutdown/destroy should now be preceded by a message indicating the reason for the shutdown and the VM id number
  • Added proper resource tracking for Nimbus clusters with multiple network pools.
  • Automated banning/filtering of jobs for a set time, so Cloud Scheduler will not repeatedly try to start a VM for a job that has a problem such as an expired proxy
  • cloud_status -a now lists the storage remaining on each cloud
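
A minimal sketch of how the new enable/disable switch might look in a cloud_resources entry. The section layout and key names below (cloud_type, vm_slots, enabled) are assumptions made for illustration; only the behaviour (a disabled cloud is skipped and hidden from cloud_status) comes from the list above.

    # Hypothetical cloud_resources entry -- key names are assumed, not
    # confirmed by this page.
    [example-nimbus-cloud]
    cloud_type: Nimbus
    vm_slots: 10
    # Disable this cloud without commenting out the whole section; a
    # disabled cloud will not appear in cloud_status output.
    enabled: false

cloud_status -l can then be used to confirm the configuration the scheduler has loaded, and cloud_status -a to check the storage remaining on the clouds that are still enabled.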

Version 0.12

April 4
0.12 Git Tag
0.12 Tarball
Changes from 0.11

Features

  • Retire VMs before reaching their maximum lifetime – issue #153
  • Lots of little bug fixes

Version 0.11.2

March 8
0.11.2 Git Tag
0.11.2 Tarball
Changes from 0.10

Features

  • Support for using multiple AMIs/EC2 clouds
  • VM IP output to aid with statistics via Munin
  • Bug fixes for cloud reconfiguration, error VM handling, and VM command timeouts
  • User share splitting amongst multiple VMs – issue #169
  • Fixed user share allocation issue with clouds of different size / specs – issue #168
  • Specify target clouds for jobs to run on – issue #151 (see the submit-file sketch after this list)
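
A minimal Condor submit-file sketch for issue #151. The attribute names +VMType and +TargetClouds are assumptions used for illustration; the surrounding lines are plain Condor submit syntax.

    # Hypothetical submit-file fragment -- the +VMType and +TargetClouds
    # attribute names are assumed, not confirmed by this page.
    universe      = vanilla
    executable    = analysis.sh
    # Ask Cloud Scheduler to boot this job's VM only on the named clouds.
    +VMType       = "example-vm"
    +TargetClouds = "cloud-one, cloud-two"
    queue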

Version 0.10

January 11
0.10 Git Tag
0.10 Tarball
Changes from 0.9

Features

  • Integrate MyProxy credential renewal
  • Nimbus credential delegation integration
  • Better selection of VMs for retirement
  • HTTPS image staging

Version 0.9

November 8
0.9 Git Tag
0.9 Tarball
Changes from 0.8

Features

  • Improve job management
  • Much lower resource usage
  • Lots of bug fixes

Version 0.8

September 17
0.8 Git Tag
0.8 Tarball
Changes from 0.7

Features

  • High priority jobs – issue #111
  • Create Scheduling plugin system – issue #98
  • Allow Cloud Scheduler to boot VMs with submitting user’s proxy certificates, rather than community certs – issue #97
  • Graceful Shutdown of VMs
  • Allow multiple jobs per VM – issue #94

Version 0.7

August 6
0.7 Git Tag
0.7 Tarball
Changes from 0.6

Features

  • Reduce memory consumption with thousands of jobs / hundreds of VMs
  • Shutdown of machines that fail to register with Condor – issue #116
  • Better support for Eucalyptus
  • VM banning on specific clusters – issue #117

Version 0.6

June 30
0.6 Git Tag
0.6 Tarball

Features

  • Add additional resources without causing service interruptions – issue #102
  • Allow Cloud Scheduler to boot VMs with submitting user’s proxy certificates, rather than community certs – issue #97 (moved to 0.7)
  • Integration of the Cloud Aggregator tool to get up to date cloud information – issue #84
  • Investigate using EC2 spot pricing – issue #96
  • Make Job and Resource pool thread-safe – issue #95
  • Allow multiple jobs per VM – issue #94 (moved to 0.7)
  • Create Scheduling plugin system – issue #98 (moved to 0.7)
  • Memory based VM distribution/balancing – issue #92
  • Summary information for cloud_status -m – issue #108

Version 0.5

May 20
0.5 Git Tag
0.5 Tarball

Features

  • Move job polling and VM polling to their own threads for better responsiveness
    • Right now, polling a condor schedd with 3000 jobs takes 20-30 minutes (!) which causes n VMs to take ~30*n minutes to start.
  • Option for graceful shutdown of VMs so running jobs will not be interrupted by Cloud Scheduler shutting down VMs to redistribute resources
  • Feature to have a VM stay around for a set time after jobs finish: +VMKeepAlive = “minutes” – primarily for testing/debugging where jobs may need to be resubmitted (see the submit-file sketch after this list)
  • cloud_status can display information about jobs
  • cloud_resources is now pickled (persisted) so state can be recovered after a crash
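
The +VMKeepAlive attribute quoted above is set in the job’s Condor submit file; its value is the number of minutes the VM should linger after the job finishes. The sketch below is a minimal illustration, and every line other than +VMKeepAlive is generic submit syntax added here as an assumption.

    # Minimal submit-file sketch for the keep-alive feature; only
    # +VMKeepAlive comes from this page, the rest is illustrative.
    universe     = vanilla
    executable   = test_job.sh
    # Keep the booted VM alive for 30 minutes after the job finishes,
    # so a resubmitted job can reuse it while testing/debugging.
    +VMKeepAlive = "30"
    queue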

Version 0.4

April 9
0.4 Git Tag
0.4 Tarball

Features

  • Improve life cycle management.
    • Ensure that there are no more running VMs than jobs requiring that VM type
    • Implement a grace period for destroying unneeded VMs – leave the last existing VM of a type alive for a set time. (likely low priority for Astronomers)
    • In the event that all VM slots are full, shut down VMs to create a more fair distribution of resources
  • Add multicore VM request support for Nimbus
  • Add contextualization support (Nimbus first)
  • Improve performance and reliability

Tasks

  • Add multicore VM test set
  • Add job priority scheme test set

Version 0.3

Due approximately January 29
Pushed to February 5
Released February 8

0.3 Git Tag
0.3 Tarball

Features

  • Anti-starvation model for job fairness and throttling.
  • Worst case VM scheduling such that there is an even distribution of VM slots between users.
  • Improve life cycle management.
    • Ensure that there are no more running VMs than jobs requiring that VM type.
    • Clean shutdown of VMs to support cleaner status for Condor.
  • Job priority on per user basis

Tasks

  • Fair scheduling to first order for multiple users, with 100 jobs each. Note that we don’t expect perfectly fair scheduling to occur.
  • Second Draft of Software Design Specification.

Goals

  • 1000 jobs, 4 minute jobs, single user. 100% job completion.
  • 1000 jobs, 5 competing users (with initial fairness testing).

Version 0.2

Due approximately December 18th
Pushed to January 8
Released January 11

Git Tag
Tarball

In this release we make sure that we can boot VMs on all likely IaaS software. We improve our interactions with OpenNebula and Eucalyptus.

Features

  • Improved support for EC2-like clusters including OpenNebula and Eucalyptus.
    • Amazon EC2 support added.
    • Make sure we can support the differences in the connection details between the cluster software.
  • Give users the ability to request multiple cores and scratch space when submitting a Condor job (see the submit-file sketch after this list).
  • Improve life cycle management.
    • Eliminate VM instances when no jobs require them.
    • Improvements will be based on the lessons learned during 0.1
    • Astronomer input will be required to determine what is an improvement.
  • First attempt at basic fairness and throttling
  • Blank space partition support
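
A minimal sketch of what a multi-core, scratch-space request might look like in a Condor submit file. The attribute names +VMCPUCores and +VMStorage, and the unit of the storage value, are assumptions made for illustration only.

    # Hypothetical submit-file fragment -- the +VM* attribute names and
    # the storage unit are assumed, not confirmed by this page.
    universe    = vanilla
    executable  = reduce_images.sh
    # Request a multi-core VM with local scratch space for the job.
    +VMCPUCores = "4"
    +VMStorage  = "20"
    queue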

Tasks

  • Second Draft of Software Design Specification

Goals

  • 100 jobs at a time, 4 minutes, multiple users (not very fair; first come, first served). 100% job completion.
  • 100 jobs, 4 minutes, using multiple cores and different memory requirements (not guaranteed to work in a fair way).

Version 0.1

Due approximately November 30
Released December 2

Git Tag
Tarball

The goal of 0.1 is to have a minimally functional test environment that can be used by one or two cooperating testers (i.e. Astronomers).

Features

  • Basic scheduling heuristic to boot x number of VMs for a set of jobs
    • For example, mapping n jobs to 1 VM; version 0.0 uses an unsophisticated 1:1 mapping.
  • Improved VM life cycle management
    • For example, VM life cycle managed by a default “time-to-live,” after which the VM will be destroyed if not busy (executing jobs).
    • Alternatively, a simple VM life cycle scheme in which a VM is shut down when no jobs exist that require its VM type.
  • Basic balanced scheduling between clusters. This should be considered a rudimentary first effort.
  • Able to boot VMs on Amazon EC2.

Tasks

  • First draft of Software Design Specification (SDS)
  • Fully documented install process
  • Expose all useful configuration options in the configuration file.
  • Document all configuration options

Goal

  • 100 total jobs in the system at one time. No fairness between users.
  • Make sure we have 100% job completion.

Version 0.0

Due November 12 Firm
Released November 12

Git Tag
Tarball

This is an enhanced version of the proof of concept deployment that was demonstrated at Banff Summit 09.

Features

  • Booting of user created VMs when jobs are detected in the Condor queue.
  • 1:1 (VM:job) primitive scheduling algorithm.
  • Limited support for VM parameter specification (currently: memory, CPU architecture, network).
  • No VM life cycle management.
  • No fairness between users; no job priorities.
  • No more than 10 jobs at a time.

Future Work

  • Graceful shutdown of VMs for load balancing (e.g. not killing jobs)
  • Investigate supporting OpenNebula as its query API matures (outstanding bugs must be fixed).
  • Pull dynamic cloud resource information from MDS (need to investigate MDS replacements).
  • Support a more sophisticated job priority scheme.
    • Ideas: an express queue or, alternatively, job preemption (in SGE terms, an “override queue”). This is important for things like night-to-night processing.
  • Save persistent data between application runs