All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
- Fixed bug in parallelized Kubernetes watch processing, from @scrosby
- Make prometheus JVM metrics use compute cluster name, from @samincheva
- Parallelize Kubernetes watch processing, from @scrosby
- Do not set scalar-requests to pool specific resources for Kenzo pods, from @ahaysx
- Reverted parallel Kubernetes watch processing for future release
- Prometheus metrics
- JVM metrics, from @samincheva
- Ring metrics, from @samincheva
- Parity for remaining codahale metrics, from @samincheva
- Direct-to-Kubernetes scheduler (Kenzo)
- Use backpressure of scheduling pods to moderate launching new pods for real jobs, from @ahaysx
- Prometheus metrics parity, from @ahaysx
- Parallelize Kubernetes watch processing, from @scrosby
- Optimize getting the nodename from a node, from @scrosby
- Initial implementation for submitting jobs directly to Kubernetes Scheduler, from @ahaysx
- Better error handling in scheduler/write functions, from @ahaysx
- Launch tasks similarly for both Fenzo and K8s Scheduler pools, from @ahaysx
- Adding prometheus metrics to remaining modules, from @samincheva
- Optimizing total pod count metric and fixing mismatched metric labels, from @samincheva
- Make Fenzo config pool-specific and set up for other schedulers, from @ahaysx
- Prometheus metrics for the kubernetes, API, and tools module, from @samincheva
- Update of synthetic pods counter metric even if the current match cycle doesn't autoscale, from @samincheva
- Performance optimization for add-starting-pods, from @scrosby
- Remove metatransaction filter from match, from @scrosby
- Make job resource lookup more efficient in miss path, from @scrosby
- Relazy some list generation in rank cycle, from @scrosby
- Parallelize autoscale to run at the same time as main job launches, from @scrosby
- Add new JobSubmissionModifier and refactor JobRouter, from @laurameng
- Prometheus metrics
- Updated match cycle metric logic for 0 considerable case, from @samincheva
- Added prometheus metric for synthetic pods count, from @samincheva
- Use a factory fn for creating (future) different types of pool handlers, from @ahaysx
- Configured the /metrics endpoint to have a separate rate limit, from @samincheva
- Prometheus, from @samincheva
- Adding match cycle metrics to prometheus, from @samincheva
- Adding prometheus metric for jobs launch count, from @samincheva
- Use pools & submit pools in /jobs list endpoint, from @laurameng
- Add support for pool quotas across pools, from @scrosby
- Add support for routing jobs between pools based on constraints, from @scrosby
- Forced eval of lazy sequence in tracing span causing performance degradation, from @samincheva
- Add more opentracing spans to the match cycle, from @samincheva
- Disabled pools integration tests handle 0 quota better, from @samincheva
- Opentracing for the match cycle logic, from @samincheva
- Allow preemptions for tasks with unknown status, from @ahaysx
- Fix rebalancer integration test to handle failures better, from @ahaysx
- Moved the global launch/kill ordering lock to be per compute-cluster, from @laurameng
- Fix bug in api-only flag that would fail operations requiring a connection to the leader, from @samincheva
- Cook now determines which pool a k8s node is in via a label instead of a taint, from @scrosby
- Ascribe NodeAffinity k8s failures to node preemption, from @scrosby
- Updated instance->user cache to handle fake entities for waiting jobs, to speed up worst-case rank loop performance, from @scrosby
- Updated structured logging utility to handle failed json conversion, from @samincheva
- Updated format-map-for-structured-logging to traverse nested maps instead of flattening them, from @samincheva
- Updated scheduler, compute cluster, and kubernetes API code to use structured logging, from @samincheva
- Development documentation for Cook, from @scrosby
- Utility for emitting structured logs, from @samincheva
- Updated unit tests for GPU model types, from @scrosby
- Corrected numbers to not use string format in match cycle metrics, from @laurameng
- Splitting 'updating dynamic clusters' log into separate entries, from @samincheva
- Convert match cycle log line to structured logging, from @laurameng
- Add second init sidecar for checkpointing, from @scrosby
- Add support for a second auxiliary init container for k8s that runs in the user's image, allowing it to introspect the platform and make any custom changes.
- Add Postgres support to Vagrant environment setup, from @nsinkov
- Update sidecar dependencies and prepare to release sidecar 1.2.2, from @scrosby
- Corrected implementation of Kubernetes controller pod process log removal, from @laurameng
- Add capability for configuring default Kubernetes pod labels on a per pool basis, from @laurameng
- Reduced Cook logging to condense log volume, from @laurameng
- Remove low-value Fenzo log as part of log diet efforts
- Remove Kubernetes controller pod process logs on scans when Cook & Kubernetes agree on "running" state
- Remove taskid scan log
- Set USER env variable, in addition to COOK_JOB_USER, in Kubernetes by default, from @laurameng
- Increase logging verbosity when submission fails with an exception, from @scrosby
- Initial Liquibase support for Cook for postgres configuration, from @scrosby
- Switch from java.jdbc to next.jdbc, from @scrosby
- Switch Cook to using c3p0 for database pooling, from @scrosby
- Switch OSS test runtime from Minimesos to GKE, from @scrosby
- Fix pod label value validation regex, from @nsinkov
- Reject jobs with invalid job constraints at submission time, from @nsinkov
- Reject jobs with invalid pod labels at submission time, from @nsinkov
- Support for setting annotation to use all group IDs in Kubernetes, from @dposada
- Ability for Cook to use a Postgres database, from @scrosby
- Add missing fields to compute cluster API validation, from @nsinkov
- Metrics of gaps in Kubernetes watches, from @dposada
- Fix support for incremental default image configuration, from @nsinkov
- Support using default image with a user-specified container, from @nsinkov
- Allow incremental configurations for default job constraints, from @nsinkov
- Field for command length, from @dposada
- Logging of job instance when rebalancer preemption transaction fails, from @dposada
- Metrics for node and pod counts, from @nsinkov
- `:production?` to the config, from @dposada
- Changed logging from ERROR to INFO when a deleted cluster's watch fails, from @dposada
- Changed logging with finalizer deletion, from @scrosby
- Collected metrics for waiting jobs under quota, from @nsinkov
- Cleaned up the 'no acceptable compute cluster' log, from @dposada
- Added support for transforming job constraints via configuration, from @dposada
- Filtering out unsound GPU nodes, from @dposada
- Support upserting > 1 incremental config in one transaction, from @nsinkov
- Fix compile error on ex-info call, from @scrosby
- Fix docker environment to work with clojure 1.10, from @scrosby
- Switch to JDK-11 and Clojure 1.10, from @scrosby
- Cook can add a finalizer to pods, from @scrosby
- Support for incremental config for checkpointing volume mounts, from @nsinkov
- Support for rotating logs every hour, from @scrosby
- Trimmed down pod event logging, from @dposada
- Optional comments to incremental value configurations, from @nsinkov
- Trimmed down pod metadata logging, from @dposada
- Reverting JDK11 upgrade (back to JDK8), from @nsinkov
- Reverted 1.10 clojure change, from @scrosby
- Add incremental image configuration support for aux containers, from @nsinkov
- Fix metric reporting that was broken in JDK-11, from @scrosby
- Make progress updates compatible with checkpointing, from @nsinkov
- Upgraded Cook to work with JDK11 and Clojure 1.10, from @scrosby
- Bug in default image selection logging, from @nsinkov
- `exit-code` and `instance-exited?` to `pod-completed` passport events, from @dposada
- Revamped pod-submission-related passport events, from @dposada
- Clarified not-looking-for-offers log, from @dposada
- Incremental feature flags, from @nsinkov
- A flag for controlling which pools get telemetry-related environment variables, from @scrosby
- Support for defaulting environment variables by pool, from @scrosby
- Chunking to the `listPodForAllNamespaces` k8s API call, from @dposada
- Fast failing of job instances on 500 responses from k8s pod submissions, from @dposada
- Support for shared memory on k8s, from @scrosby
- Resource requests to the `job-submitted` passport event, from @dposada
- Bug where `pod-launched` and `pod-completed` passport events sometimes have a `nil` pool, from @calebhar12
- Adjust test_user_pool_rate_limit to make it more reliable, from @scrosby
- Fix Location header of redirects to include request parameters, from @scrosby
- Make /unscheduled endpoint redirect to leader, from @scrosby
- Avoid a lot of reflection costs in core Cook inner match and k8s loop, from @scrosby
- Use date in passport log file name, from @nsinkov
- Add pool-name, job-name, and user to Passport Logs, from @calebhar12
- Update passport event types with cook-scheduler source and namespace, from @calebhar12
- Logs info instead of warn for node-watch timeouts, from @dposada
- When adding a job to an existing job group, don't override the group, from @nsinkov
- Add pool source to job submission passport stamp, from @dposada
- Add instance uuid to job uuid cache, from @calebhar12
- Save the submitted job's pool, from @dposada
- Ability to turn rebalancer on or off by pool, from @scrosby
- Optimized a cache used by rebalancer, from @scrosby
(internal-only release)
(internal-only release)
- Support for job-routing plugins, from @dposada
- Features to compute clusters, from @dposada
- Constrained checkpointing to supported pools, from @nsinkov
- Fix the names for synthetic pod workload labels, from @dposada
- Environment variables for telemetry, from @dposada
- Improved performance for `VirtualMachineLeaseAdapter`, `TaskRequestAdapter`, and `update-host-reservation`, from @scrosby
- Hard delete pods that have been in the terminating state for too long, from @dposada
- Skip inactive pools when ranking, from @dposada
- Instance field with how long the job queued before that instance, from @dposada
- Made straggler kill a mea-culpa failure, from @nsinkov
- Added pod labels for application name and version, from @dposada
- Prefixed all application pod labels with the configured pod label prefix, from @dposada
- Calculate the time-until-waiting metric correctly, from @scrosby
- Do fenzo unassigns in batches outside of the k8s state locks, from @scrosby
- Make the k8s lock vector a vector not a sequence, from @scrosby
- Split metrics for synthetic pods and regular pods in k8s, from @scrosby
- Split k8s lock shards by compute cluster, from @scrosby
- Do watch event processing in parallel at watch startup, from @scrosby
- Optimize novel host constraint by 10%, from @scrosby
- Prevent an inactive pool from having a scheduling loop, from @dposada
- Gracefully handle nodes with nil consumption maps, from @dposada
- Made prolonged `ContainersNotReady` pod condition result in failure, from @dposada
- Added logging of watch response status field, from @dposada
- Allowed synthetic pod anti-affinity to specify a namespace, from @dposada
- Improved logging when k8s watch response object is nil, from @dposada
- Gracefully ignore nodes with no pods during consumption calculation, from @dposada
- Allowed synthetic pods to have inter-pod anti-affinity, from @dposada
- Make cook pods ignore a tenured node taint, from @scrosby
- Fix the memory request value sent to pod via environment variable to exclude sidecar memory, from @nsinkov
- Allowed synthetic pods to have a non-default termination grace period, from @dposada
- A knob letting Cook clobber synthetic pods with real jobs for k8s, from @scrosby
- Look for Cook memory labels on job labels, not pod labels, from @nsinkov
- Optimization to the match cycle, from @scrosby
- Add memory limit job label, from @nsinkov
- Optimized code for generating synthetic pods to do less work and autoscale less when we're matching more often, from @scrosby
- Support for the default pool being a k8s pool, from @dposada
- Mark failure reason correctly for pod failure from preemption, from @dposada
- Support ignoring specific group IDs when computing supplemental group IDs, from @scrosby
- Log exceptions in `deep-merge-with`, from @dposada
- Take only the top X pending jobs when triggering k8s autoscaling, from @dposada
- Support for longer pod names, from @scrosby
- Configurable validation of job resources by node type, from @dposada
- `/usage` for all users, from @dposada
- Make `job->acceptable-compute-clusters` configurable, from @dposada
- Ability to not set memory limits, from @kathryn-zhou
- Authenticator refresh logic needed for non-GKE k8s, from @scrosby
- Checkpoint locality constraint, from @dposada
- Logging the largest job and offer by resource, from @dposada
- Adds location to compute cluster, from @dposada
- Schedules and matches jobs with disk, from @kathryn-zhou
- Makes k8s API client read timeout configurable, from @dposada
- Add resource request and limit to init-container in pod, from @scrosby
- Refactor authentication initialization, from @scrosby
- Migrate to GitHub Actions from Travis CI, from @kevo1ution
- Allow users to use int values for disk request and disk limit, from @kathryn-zhou
- Add support for ignoring a taint prefix, from @scrosby
- Increase limit for launch-task-num-threads, from @scrosby
- Make progress an absolute path in k8s, from @scrosby
- Do not schedule nodes with unschedulable node-spec, from @scrosby
- Improve error handling when calculating effective image, from @nsinkov
- Per-user queue length limits, from @dposada
- API for Disk Limits, from @kathryn-zhou
- Metadata pod env vars, from @nsinkov
- Support for modifying pod image when checkpointing, from @nsinkov
- Increases default and max :controller-lock-num-shards, from @dposada
- Make the kill-lock be a ReentrantReadWriteLock and add metrics, from @scrosby
- Make pool taint / label and context configurable, from @scrosby
- Gracefully handles unknown job resource type, from @dposada
- Fix memory leak in k8s state for deleted pods, from @scrosby
- Reduced excessive logging for checkpointing and launching tasks, from @dposada
- Added supplemental groups to the pod security context, from @dposada
- Reduced excessive logging for k8s dynamic clusters and writing tasks, from @dposada
- Cache sizes to be configurable, from @scrosby
- Tracking of how rate limiting is affecting the queue, from @scrosby
- Per-user per-pool job launch rate limiting, from @scrosby
- Configurable checkpointing kill switch, from @nsinkov
- Dynamic compute cluster log from `ERROR` to `WARN`, from @scrosby
- Allowing for different rate limit for auth-bypass requests, from @dposada
- Added warning log when jobs go unmatched for too long, from @dposada
- Added the ability to flush a rate limit from the cache, from @scrosby
- Add rate limits per compute cluster, from @scrosby
- Cached job-constant fields in defrecords for gpu-host-constraint, from @kathryn-zhou
- Cache job-constant fields in defrecords for user-defined-constraint, from @kathryn-zhou
- Workload fields to job application, from @dposada
- Reduced excessive logging for k8s, from @dposada
- Added logging of offer and job resource percentiles, from @dposada
- Missing compute cluster check, from @nsinkov
- Deleting unschedulable synthetic pods, from @dposada
- Dynamic cluster configuration support, from @nsinkov
- Improved logging for launching tasks, stop launching synthetic pods, and matching offers, from @dposada
- Support for EQUALS job constraints in k8s, from @dposada
- `HOST_IP` environment variable for k8s, from @dposada
- De-lazied the list of constraints to avoid locking in Fenzo, from @scrosby
- Made `job->previous-hosts-to-avoid` use `set` instead of `mapv -> distinct` to reduce lock contention, from @sradack
- Order of per-user and pool-global quota application, from @scrosby
- Support for prefixed job labels to become k8s pod labels, from @dposada
- Attribution labels to k8s synthetic pods, from @dposada
- `agent_id` as a preferred alternative to `slave_id` on job instances, from @dposada
- `/shutdown-leader` admin-only API endpoint, from @dposada
- Improved matching log, from @dposada
- Improved per-user launch-rate-limit log, from @dposada
- Added log at start and end of job ranking, from @dposada
- Made "killing cancelled task" log INFO-level, from @dposada
- Handling of preemption on k8s pod initialization, from @nsinkov
- Using 1024*1024 (mebibytes) as the k8s memory multiplier, from @dposada
- Avoiding NPE due to missing resources when totaling resources for metrics, from @dposada
- Per-pool global quotas, from @scrosby
- Accounting for GPU tasks assigned to nodes in the current matching cycle, from @kathryn-zhou
- Force processing when state scanning in k8s, from @dposada
- GPU job support in k8s, from @kathryn-zhou
- Rapid pool skipping in k8s, from @nsinkov
- Enhanced offer generation and updated GPU constraints for k8s, from @kathryn-zhou
- Fixed some errors that caused NPE and ERROR logs, from @scrosby
- Replaced chime logic with less aggressive chime logic, from @nsinkov
- Port mapping support for k8s, from @dposada
- Total number of pods and nodes quota for k8s, from @scrosby
- Fixed several O(#pods * #nodes) bugs in k8s code, from @scrosby
- Limit autoscaling to quota of what's allowed to run, from @dposada
- Improved pool scheduling by fixing chime logic, from @nsinkov
- Validation for GPU model requests, from @kathryn-zhou
- Added extra metrics and logging around match cycle, from @scrosby
- Fixed O(#pods * #nodes) bug in calculating k8s offers, from @scrosby
- Update checkpointing settings, from @nsinkov
- Automates GKE dev environment setup, from @dposada
- Don't set cpu limit on sidecar if not setting on main container, from @nsinkov
- Upgrades k8s client library to 7.0.0, from @dposada
- Log pod metadata, from @nsinkov
- Skips match when there are no considerable jobs, from @dposada
- Do not use :missing state for preempted pod, from @nsinkov
- Added check for k8s node preemption using preemption pod label, from @nsinkov
- Resolved `ClassNotFoundException` for Mesos task-launching, from @dposada
- `Killed by user` reason code, from @nsinkov
- Fallback to k8s checkpointing disabled when max attempts exceeded, from @nsinkov
- Logging of k8s pod events, from @dposada
- Added `safe-to-evict` annotation to k8s synthetic pods, from @dposada
- Made matches go to `launch-tasks` in bulk, from @dposada
- Support for specifying the default container on a per-pool basis, from @scrosby
- Add memory overhead accounting when checkpointing, from @nsinkov
- Add lock-sharding to k8s controller, from @dposada
- Add MESOS_DIRECTORY to the k8s environment, from @dposada
- Launches k8s tasks in parallel, from @dposada
- Add flag to use google service account for authentication, from @nsinkov
- Add ability to use google metadata server for authentication, from @nsinkov
- Bring all config.edn files up to date, from @scrosby
- Help-make-cluster script uses now unavailable gke k8s version, from @scrosby
- Stop writing synthetic pod info to datomic, from @dposada
- Improved k8s autoscaling metrics, from @dposada
- Allow removing cpu limit in k8s, from @dposada
- Show rate limited users in HTTP log, from @scrosby
- Change checkpointing volume from init container to scratch space, from @nsinkov
- Main container environment variables to init container in k8s, from @nsinkov
- Writable scratch space separate from the k8s sandbox, from @nsinkov
- Experimental API and schema support for checkpointing in k8s, from @nsinkov
- Fast fail for k8s pods with un-initialized containers, from @dposada
- Made k8s sidecar readiness probe optional, from @DaoWen
- Made k8s pod watch initialization process each pod only once, from @dposada
- Removed node anti-affinity for blocklist labels from k8s synthetic pods, from @dposada
- Added node anti-affinity for blocklist labels to synthetic pods, from @dposada
- Mesos sandbox mount to k8s pods (backward compatibility for jobs that assume they're running on Mesos), from @nsinkov
- Progress reporting for k8s jobs, from @DaoWen
- SSL verification between Cook and k8s, from @scrosby
- Fast fail for unschedulable k8s pods, from @dposada
- Support for k8s synthetic pod namespace to be user's namespace, from @dposada
- k8s synthetic pod anti-affinity to previous hosts, from @dposada
- Made autoscaling (for k8s) based on pending jobs instead of match failures, from @dposada
- Renamed k8s metrics to be consistent with prior metric naming, from @scrosby
- Separated k8s job pods' workdir and sandbox, from @DaoWen
- Removed expensive log from `handle-resource-offers!`, from @dposada
- Mapping for Mesos reason `REASON_TASK_KILLED_DURING_LAUNCH`, from @dposada
- Experimental support for synthetic k8s pods to trigger the cluster autoscaler, from @dposada
- Metrics to k8s code, from @scrosby
- Made "Container launch failed" mea culpa, from @dposada
- NPE in sandbox calculation when compute cluster is not found, from @scrosby
- Integration test improvements, from @dposada and @scrosby
- Support to blocklist nodes in k8s that have certain labels, from @scrosby
- Bug in reading default pool from config when using k8s sidecar, from @nsinkov
- Bug in job progress aggregation, from @DaoWen
- Handling of node preemption, from @dposada and @scrosby
- Handling of k8s startup connection errors, from @scrosby
- Handling of bad request response from k8s, from @scrosby
- Handling when a running pod goes completely missing, from @scrosby
- REST endpoint for posting job progress updates, from @DaoWen
- Bug in k8s state machine for completed instances, from @scrosby
- Bug in k8s pod resource requests, from @DaoWen
- Handling of pod submission failures, from @dposada
- Race where Cook can kill a task then later launch it, from @scrosby
- Improved logging for k8s compute clusters, from @dposada
- Logs fileserver for k8s jobs, from @nsinkov
- Missing state pairs in the k8s controller, from @scrosby
- Default the user parameter in docker, from @shamsimam
- Improve k8s node and pod watches so that they retry forever, from @scrosby
- Correct misnumbered 403 error codes for Swagger, from @DaoWen
- Support for moving a portion of a user's jobs to a different pool, from @dposada
- Support in k8s compute clusters for max pods per node, from @dposada
- Made Mesos reconciler only reconcile Mesos tasks, from @scrosby
- Made declining Mesos offers work, from @shamsimam
- Removed incorrect rate-limit reason in `/unscheduled_jobs`, from @dposada
- Avoid using Cook executor when launching on k8s, from @dposada
- Made container defaults be compute-cluster specific, from @dposada
- Added mapping for the Failed pod phase on k8s, from @dposada
- Reverted a change that added unexpectedly expensive logging, from @scrosby
- Support for multiple kubernetes compute clusters, from @scrosby
- Support for mesos and kubernetes compute clusters simultaneously, from @scrosby
- Scripts for creating compute clusters on GKE, from @scrosby
- Optimized quota reading, from @shamsimam
- Integration test improvements, from @dposada
- Bug fixes for kubernetes support, from @scrosby
- Max ports to task constraints, from @pschorf
- Leader URL to `/info`, from @dposada
- Max priority to 16,000,000, from @nsinkov
- Integration test improvements, from @dposada
- Pool name to matching logs, from @dposada
- `COOK_INSTANCE_NUM` environment variable, from @pschorf
- Metrics on instance fetch rates, from @scrosby
- Capturing the time it takes to list jobs, from @scrosby
- Support for multiple submit plugins, from @pschorf
- Maximum command line length parameter, from @pschorf
- Improved error logging, from @pschorf
- Check quota when rebalancing, from @pschorf
- Optimize the job fetching code to not round-trip to UUID, from @scrosby
- Fixed rebalancer bug, from @pschorf
- Support for default container volumes, from @pschorf
- Limit number of jobs eligible for matching for out of quota users, from @pschorf
- Compute cluster on task objects, from @scrosby
- Support running Cook Executor in docker containers, from @pschorf
- Filter jobs that would put users out of quota from /queue endpoint, from @pschorf
- Fixed bug in user metric reporting, from @pschorf
- Added support for file_url, from @pschorf
- Fix for periodic job cleanup, from @scrosby
- Instance completion plugin, from @pschorf
- Periodic cleanup of uncommitted jobs, from @scrosby
- Pool selection plugin, from @pschorf
- Added support for suitable flag for datasets, from @pschorf
- Added plugin support for job submission and launch, from @scrosby
- Added COOK_INSTANCE_UUID to task environment, from @dposada
- Allow setting cook executor retry limit to 0, from @pschorf
- Support for docker images in mesos containerizer, from @pschorf
- Global launch rate limit, from @scrosby
- Made per-user rate limit more gradual, from @scrosby
- Sped up `/unscheduled` endpoint with new query and truncating long lists, from @pschorf
- Support for job launch rate limits, from @scrosby
- Updated dependencies for integration tests to newer versions
- Support for x-cook-pool header, from @pschorf
- Bug in reporting total usage when pools are enabled, from @pschorf
- Updated some metric names to incorporate pools, from @pschorf and @dposada
- Rate limiting on job submission, from @scrosby
- Remove stale dataset cost data, from @pschorf
- Don't show uncommitted jobs in unscheduled_jobs endpoint, from @pschorf
- Support for contacting a data local service to obtain cost data for scheduling, from @pschorf
- Bug in quota-checking when running without pools, from @dposada
- Bug in the rebalancer's retrieval of DRU divisors when running with pools, from @dposada
- Integer overflows in timer tasks when the scheduler runs for a long time, from @shamsimam
- Per-pool job scheduling, from @dposada and @pschorf
- Support for self-impersonation requests from normal users, from @DaoWen
- Exit code syncer to handle a high rate of incoming exit code messages, from @shamsimam
- Removed TTL from agent attributes cache, from @dposada
- Performance improvements to job submission, from @scrosby and @pschorf
- data-local field to jobs, from @pschorf
- Performance improvements to job submission, from @scrosby and @pschorf
- Consume entire request before sending response, from @pschorf
- Container fields to /jobs, from @dposada
- reason_mea_culpa to instance responses, from @dposada
- Support for x-forwarded-proto header for CORS requests, from @pschorf
- Removed mesos master-hosts config, from @dposada
- Removed rebalancer min-utilization-threshold, from @dposada
- Better authorization failed message on job deletion, from @dposada
- Handle edge case in estimated completion constraint, from @pschorf
- Issue where task reconciliation was failing, from @pschorf
- Issue where nil instance timestamps would cause NPEs, from @dposada
- Pool support to /jobs, from @dposada
- Estimated completion constraint, from @pschorf
- Pool submap to /quota and /share, from @pschorf
- Improvements to job query times, from @scrosby
- Added pool support to /share and /quota endpoints, from @pschorf
- Returns 409 on some retry operations instead of retrying jobs which could end up in a bad state, from @pschorf
- Fixed bug with disable_mea_culpa_retries, from @pschorf
- Improved logging for some error cases, from @dposada
- Support for pool param to /usage endpoint, from @dposada
- Support for pool param on job submission, from @dposada
- Support for SSL, from @pschorf
- Support for api-only mode, from @dposada
- Issue where monitor metrics would sometimes stop on a non-zero value, from @dposada
- Fix performance regression in list API, from @scrosby
- Support for listing custom executor jobs in /jobs endpoint, from @dposada
- Kill instances for cancelled jobs on leadership election, from @pschorf
- Performance improvements to scheduling and list APIs, from @scrosby
- Fixed GPU support, from @dPeS
- Support for CORS requests, from @pschorf
- Scheduling performance improvements, from @scrosby
- Counters for job cpu/mem/runtime by failure reason, from @dposada
- Endpoint for instance statistics, from @dposada
- Support for a configurable run as user, from @shamsimam
- Support for configuring number of instances which can fail before falling back to the mesos executor, from @shamsimam
- Performance improvements to sandbox syncer, from @shamsimam
- Rebalancer now reserves hosts after preempting, from @pschorf
- Performance improvements to DRU computation, from @shamsimam
- Added timely sandbox directory updates for tasks that are not executed by the cook executor, from @shamsimam
- Added environment variables that contain the resources requested by the job, from @shamsimam
- Converted monitor Riemann events to codahale metrics, from @dposada
- Fixed string encoding on `/rawscheduler` POST, from @pschorf
- The `start-time` timestamp on `/info` no longer re-evaluates to `now` on each request, from @DaoWen
- Added user-impersonation functionality to support services running on top of Cook Scheduler, from @DaoWen
- Jobs that exceed a user's total resource quota are rejected rather than waiting indefinitely, from @DaoWen
- Added unauthenticated /info endpoint for retrieving basic setup information, from @DaoWen
- Added metrics for message rates of Mesos status changes and framework updates, from @shamsimam
- Added check for required `reason` parameter on share and quota deletions, from @DaoWen
- Fixed error in Kerberos middleware setup, from @DaoWen
- Reclassified `MESOS_EXECUTOR_TERMINATED` as a mea-culpa error, from @shamsimam
- Fixed bug preventing group retry updates by non-admin users, from @DaoWen
- Fixed bug causing a 500 rather than a 404 for gets on non-existent groups, from @DaoWen
- Re-enabled Fenzo group constraints, from @pschorf
- Added /instances endpoint for retrieving job instances, from @dposada
- Added /jobs resource for retrieving jobs, from @dposada
- Added /usage endpoint for displaying user resource usage, from @DaoWen
- Added failed-only option for retry endpoint, from @DaoWen
- Fixed authorization check on group endpoint, from @DaoWen
- Disabled fenzo group constraints, from @pschorf
- Retries sandbox syncing of hosts when cache entries expire, from @shamsimam
- Allow partial results from /unscheduled_jobs, from @dposada
- Improve performance by deferring calculation of group components, from @pschorf
- Support millisecond time resolution for lingering tasks, from @DaoWen
- Added COOK_JOB_UUID and COOK_JOB_GROUP_UUID to the job environment, from @shamsimam
- Added support for killing a group of jobs, from @DaoWen
- Added sysouts to get job output closer to Mesos' CommandExecutor, from @shamsimam
- Added metrics for usage of /list, from @dposada
- Added support for retrying a group of jobs, from @DaoWen
- Added support for configurable environment passed to Cook Executor, from @shamsimam
- Fixed bug with job group constraints, from @pschorf
- Fixed bug where Cook Executor jobs were opting in to the heartbeat support, from @shamsimam
- Changed (simplified) the sandbox directory syncing mechanism for jobs, from @shamsimam
- Renamed to users allowed, from @dposada
- Fixes for stderr/out file handling in Cook executor, from @shamsimam
- Fixed bug with /unscheduled_jobs endpoint, from @pschorf
- Added support for allowing job to specify which executor (cook|mesos) to use, from @shamsimam
- Added support for passing state=success/failed in /list, from @dposada
- Added support for filtering by name in /list, from @dposada
- More failure codes have been classified as mea-culpa failures, from @pschorf
- /queue endpoint redirects to the master on non-master hosts, from @pschorf
- Fixed handling of detailed parameter on group queries, from @DaoWen
- Fixed bug with launching docker container jobs, from @DaoWen
- Fixed bug with docker container port mappings, from @pschorf
- Performance improvement in rank jobs, from @wyegelwel
- Added JVM metric reporting, from @pschorf
- Added support for partial results when querying for groups, from @dposada
- Added support for user allowlisting, from @dposada
- Added support for throttling rate of publishing instance progress updates, from @shamsimam
- Added authorization check for job creation, from @dposada
- The Mesos Framework ID is now configurable, from @dposada
- Added configuration for agent-query-cache, from @shamsimam
- Added support for Cook Executor, from @shamsimam
- Replaced aggregate preemption logging with individual preemption decisions, from @wyegelwel
- /debug endpoint now returns the version number, from @dposada
- Fixed a bug which was overwriting end-time on duplicate mesos messages, from @pschorf
- Fixed a bug with querying for jobs with a non-zero number of ports, from @dposada
- Parallelize in-order processing of status messages, from @shamsimam
- Change reason string from "Mesos command executor failed" to "Command exited non-zero", from @wyegelwel
- Added configuration option for the leader to report unhealthy, from @pschorf
- Optimized list endpoint query for running and waiting jobs, from @wyegelwel and @pschorf
- Lowered log level of sandbox directory fetch error to reduce noise, from @wyegelwel
- Further optimize list endpoint query, from @pschorf and @wyegelwel
- Optimized the query in the list endpoint to avoid an expensive datomic join, from @pschorf and @wyegelwel
- Change the list endpoint time range to be inclusive on start, from @wyegelwel
- Add check to ensure job/group uuids do not exist before creation, from @pschorf
- Limit rebalancer jobs to consider to max preemptions, from @wyegelwel
- Added simulator to test scheduler performance, from @wyegelwel
- Added job constraints, from @wyegelwel
- Added instance progress to query response, from @dposada
- Fixed bug where job submit errors would return 201, from @pschorf
- Optimizations in ranking to improve schedule time, from @shamsimam
- Refactor fenzo constraints to use less memory, from @pschorf
- Added disable-mea-culpa-retries to jobclient, from @WenboZhao
- Fix bug with disable-mea-culpa-retries, from @pschorf
- Make DRU order deterministic, from @wyegelwel
- Change default cycle time for checking max-runtime exceeded to 1m, from @wyegelwel
- Remove concat usage, from @pschorf
- /unscheduled_jobs API endpoint, from @mforsyth
- Added application to job description, from @dposada
- Added disable-mea-culpa-retries flag, from @pschorf
- Added docker, from @dposada
- Added support for job groups in simulator, from @mforsyth
- Added /failure_reasons API endpoint, from @mforsyth
- Added expected-runtime to job description, from @dposada
- Added /settings API endpoint, from @dposada
- Added group host placement constraints, from @DiegoAlbertoTorres
- Require an explicit reason when changing shares or quotas (from @mforsyth). This intentionally breaks backwards compatibility.
- Optimized matching code to speed up schedule time, from @wyegelwel
- Stream JSON responses, from @pschorf
- Speed up ranking with commit latch and caching, from @wyegelwel
- Fixed a bug with calculating whether we matched the head of the queue, which caused Cook to only schedule 1 job at a time (this is why 1.2.0 was yanked)
- Start of CHANGELOG. We are likely missing some items from 1.0.1, will be better from now on.
- Switch to use Fenzo for matching from @dgrnbrg and @mforsyth
- GPU support from @dgrnbrg
- Swaggerized endpoints from @mforsyth
- Groups (https://github.com/twosigma/Cook/blob/master/scheduler/docs/groups.md) from @DiegoAlbertoTorres
- Containers support from @sdegler, @leifwalsh, @wyegelwel
- Retry endpoint from @pjlegato and @wyegelwel
- Authorization on endpoints from @pjlegato and @wyegelwel
- System simulator and CI from @mforsyth
- Access logs for server from @sophaskins
- Mea culpa reasons so some failures don't count against retries from @DiegoAlbertoTorres and @mforsyth
- Switch to use mesomatic over clj-mesos from @mforsyth
- Tied to mesos 1.x.x (exact version is 1.0.1)
- State change of a job from waiting to running now occurs when Cook submits the job to mesos (not when mesos confirms the job is running) from @aadamson and @DiegoAlbertoTorres
- Performance improvements to ranking and scheduling from @wyegelwel
- Split brain on mesos / zk failover. Cook will now exit when it loses leadership with either zk or mesos. A supervisor is expected to restart it, from @wyegelwel