Releases: skypilot-org/skypilot
SkyPilot v0.7.0
SkyPilot v0.7.0: 3x faster, reservation support, observability, admin policies, new AI hardware, new UX, and more!
We are excited to announce the release of SkyPilot v0.7.0! This release brings significant performance improvements and many new features:
- Up to 3x faster provisioning
- Reservation support: AWS Capacity Reservations, AWS Capacity Blocks, GCP reservations, GCP Dynamic Workload Scheduler (DWS), and more
- Observability features
- Admin policy enforcement
- Support for H100 Mega, TPU v6, TPU v5, gVNIC, Azure Blob Storage, faster disks, and more
- New UX for the `sky` CLI
and many bug fixes and enhancements!
Release Highlights
Performance
We have made 2-3x performance improvements across cloud providers through optimizations in our provisioning stack and the images we use.
Cloud | Provisioning Time | Speedup |
---|---|---|
AWS | 1 min 10s | 3x |
GCP | 1 min 15s | 3x |
Azure | 2 min 16s | 2x |
Kubernetes | 52s | 2.5x |
Reservations
SkyPilot now supports short-term and long-term reservations across clouds:
- AWS Capacity Reservations
- AWS Capacity Blocks
- GCP reservations
- GCP Dynamic Workload Scheduler (DWS)
- Bring your own VMs or Kubernetes clusters
SkyPilot's failover includes these reservations, so they can be combined with spot instances or any other resources/clouds to create a resilient and cost-effective infrastructure.
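As a rough sketch of mixing spot and on-demand capacity in a single task, the YAML below lists both as candidates with the `any_of` field in `resources`. The accelerator type and the training command are placeholders chosen for illustration. Reservation settings are configured separately and are not shown here.

```yaml
# Illustrative task YAML; accelerator and command are placeholders.
resources:
  accelerators: A100:8
  any_of:
    - use_spot: true    # candidate: spot instances
    - use_spot: false   # candidate: on-demand instances

run: |
  python train.py  # hypothetical training script
```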
Observability on Kubernetes
SkyPilot now has two new observability features on Kubernetes:
- `sky status --kubernetes` shows all SkyPilot resources on the cluster. (#4040, #4079)

      $ sky status --kubernetes
      Kubernetes cluster state (context: mycluster)
      SkyPilot clusters
      USER   NAME                           LAUNCHED    RESOURCES                                  STATUS
      alice  infer-svc-1                    23 hrs ago  1x Kubernetes(cpus=1, mem=1, {'L4': 1})    UP
      alice  sky-jobs-controller-80b50983   2 days ago  1x Kubernetes(cpus=4, mem=4)               UP
      alice  sky-serve-controller-80b50983  23 hrs ago  1x Kubernetes(cpus=4, mem=4)               UP
      bob    dev                            1 day ago   1x Kubernetes(cpus=2, mem=8, {'H100': 1})  UP
      bob    multinode-dev                  1 day ago   2x Kubernetes(cpus=2, mem=2)               UP
      bob    sky-jobs-controller-2ea485ea   2 days ago  1x Kubernetes(cpus=4, mem=4)               UP

      Managed jobs
      In progress tasks: 1 STARTING
      USER   ID  TASK  NAME      RESOURCES   SUBMITTED   TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS
      alice  1   -     eval      1x[CPU:1+]  2 days ago  49s            8s            0            SUCCEEDED
      bob    4   -     pretrain  1x[H100:4]  1 day ago   1h 1m 11s      1h 14s        0            SUCCEEDED
      bob    3   -     bigjob    1x[CPU:16]  1 day ago   1d 21h 11m 4s  -             0            STARTING
      bob    2   -     failjob   1x[CPU:1+]  1 day ago   54s            9s            0            FAILED
      bob    1   -     shortjob  1x[CPU:1+]  2 days ago  1h 1m 19s      1h 16s        0            SUCCEEDED
- `sky show-gpus --cloud kubernetes` shows detailed GPU availability information on the cluster. (#3816, #4085)

      $ sky show-gpus --cloud kubernetes
      Kubernetes GPUs
      GPU   REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
      L4    1, 2, 4                   8           8
      H100  1, 2, 4, 8                16          16

      Kubernetes per node GPU availability
      NODE_NAME     GPU_NAME  TOTAL_GPUS  FREE_GPUS
      my-cluster-0  L4        4           4
      my-cluster-1  L4        4           4
      my-cluster-2  H100      8           8
      my-cluster-3  H100      8           8
Admin policy enforcement
SkyPilot has a new admin policy mechanism (#3966) that admins can use to enforce policies on users’ SkyPilot usage. These policies apply custom validation and mutation logic to a user’s tasks and SkyPilot config.
Example policies:
- Add Labels for all Tasks on Kubernetes
- Always Disable Public IP for AWS Tasks
- Use Spot for all GPU Tasks
- Enforce Autostop for all Tasks
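A minimal sketch of how an admin policy might be enabled, assuming the policy is referenced from the SkyPilot config file through an `admin_policy` key. The import path `mypackage.MyPolicy` is hypothetical and stands in for a policy class the admin has written and installed. Check the admin policy docs for the exact key and class interface.

```yaml
# ~/.sky/config.yaml
# `mypackage.MyPolicy` is a hypothetical import path to an admin-written
# policy class that validates and mutates user requests.
admin_policy: mypackage.MyPolicy
```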
Azure Blob Storage support
In addition to S3, GCS and R2, you can now use Azure Blob Storage as a storage backend for storing and accessing data. (#3032)
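As an illustration, a task could mount an Azure Blob Storage container through `file_mounts` roughly as follows. The storage name is a placeholder, and the `store: azure` value is assumed to be the selector for this backend.

```yaml
file_mounts:
  /data:
    name: my-sky-container   # hypothetical storage name
    store: azure             # assumed selector for Azure Blob Storage
    mode: MOUNT
```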
New AI hardware support
- New accelerators: TPU v6 (#4115), TPU v5 (#3814), H100 Mega (#4099)
- Faster networking on GCP with gVNIC (#4095)
- Faster disks: new disk tier `ultra` (#3860) for GCP and AWS.
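A short sketch of requesting the new disk tier in a task, assuming it is selected through the existing `disk_tier` field in `resources`. The cloud choice is only an example.

```yaml
resources:
  cloud: gcp          # example; the tier is listed for GCP and AWS
  disk_tier: ultra    # new tier introduced in this release
```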
UX revamp
The SkyPilot CLI is now cleaner, simpler, and even easier to parse (#4023)
New LLM Recipes
- Llama 3.1 and Llama 3.2 recipes (#3990, #3779, #3780)
- llm.c training for GPT-2 (#3611)
- Pixtral (#3938, #3940)
- Qwen2-VL and Qwen 2.5 support (#3961, #3959)
- Yi model family support (#3958)
- Nemo GPT (#3743)
- Other examples: Airflow (#3982), AWS Neuron Accelerator (#4020), and Deepspeed with k8s support (#4124)
Deprecation Notice
- All `SKY_*` environment variables are deprecated in favor of `SKYPILOT_*` variables.
  - All `SKY_*` variables will be removed in v0.9.0.
  - See docs for the list of currently supported variables.
Backend
New Features
- Managed jobs can now recover from job-level failures (e.g., GPU errors, non-zero exit codes, etc.) (#3919)
  - Set `max_restarts_on_errors` to specify the number of times SkyPilot should try to restart the job.

        resources:
          job_recovery:
            max_restarts_on_errors: 3  # Retry 3 times before marking the job as failed
- Nvidia GPUs can now disable ECC (#3676)
- New environment variable `SKYPILOT_NUM_NODES` to fetch the number of nodes in the cluster. (#3656)
- SkyPilot config can now be overridden in the task definition with `experimental.config_override` (#3689)

      experimental:
        config_override:
          docker:
            run_options: ...
          kubernetes:
            pod_config: ...
            provision_timeout: ...
          gcp:
            managed_instance_group: ...
          nvidia_gpus:
            disable_ecc: ...
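To illustrate the `SKYPILOT_NUM_NODES` variable mentioned in this list, here is a small task sketch that reads it inside `run`. The node count and the echoed message are arbitrary.

```yaml
num_nodes: 2

run: |
  # SKYPILOT_NUM_NODES is set by SkyPilot on each node of the cluster.
  echo "This cluster has ${SKYPILOT_NUM_NODES} node(s)"
```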
Enhancements
- `AddKeysToAgent` is now set for SSH keys in the ssh config file and the ssh command (#3985)
- SkyPilot runtime is now installed in a separate conda environment, reducing interference with user's environment. (#3639)
- `docker.run_options` now allows users to pass additional options when running docker containers. (#3682)
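A possible use of `docker.run_options` in the SkyPilot config file is sketched below. The specific option (mounting the host Docker socket) is only an example of an extra `docker run` argument, not a recommendation.

```yaml
# ~/.sky/config.yaml
docker:
  run_options:
    # Example extra option passed to `docker run`: mount the host Docker socket.
    - -v /var/run/docker.sock:/var/run/docker.sock
```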
Fixes
- Fix `sky cancel` not terminating all child processes (#3919)
- Fix provisioning failures when multiple versions of SkyPilot are installed (#3866)
- Shell autocomplete installation is now more robust (#3892, #3893)
Kubernetes
New Features
- Observability improvements:
- SkyPilot now helps you set up your clusters for running SkyPilot jobs.
  - If you already have a list of IPs and their SSH keys, `sky local up` can now automatically set it up as a cluster to be used for running jobs. (#3926)
  - If you don't have a cluster yet, we provide a simple one-click setup script to deploy VMs with Kubernetes on the cloud of your choice (#3929).
- SkyPilot job output is now piped to the container logs (#3758)
  - Use your existing logging tooling (`kubectl logs`, filebeat, etc.) to view SkyPilot job outputs.
- Support for Nvidia GPU operator labels (`nvidia.com/gpu.product`) for detecting GPU types. (#3493)
  - You no longer need to label GPUs if you have the Nvidia GPU operator installed.
- Spot instances are now supported on GKE clusters (#3675)
- [Experimental] Multi-context support (#3913, #3968, #3897, #3772, #4013)
Performance improvements:
- New command runner: 3x faster command submission for Kubernetes pods. (#3157)
- `sky local up` for GPUs is now ~5x faster, provisioning in 2min 30s instead of 12min (#3664)
- Our GPU images are now 3x smaller (1.5 GB), reducing the time to pull the image (#3665)
- SSH jump pod is no longer required for `port-forward` mode (#3657)
- SSH setup is now parallelized to speed up multi-node provisioning (#4158)
Enhancements and fixes
SkyPilot v0.6.1
This patch release brings many improvements and fixes to SkyPilot, including major performance improvements for Kubernetes and Azure and new features for AWS and GCP.
Stay tuned for a detailed changelog coming up in v0.7.0!
SkyPilot v0.6.0
SkyPilot v0.6.0: Jobs API, SkyServe on Kubernetes, Spot + On-demand mixing, Paperspace support and more!
We are excited to release SkyPilot v0.6.0! This release includes a number of new features:
- Managed Jobs for job execution and recovery
- SkyServe and Jobs on Kubernetes
- Mix on-demand and spot instances in SkyServe
- New cloud: Paperspace
Release Highlights
Managed Jobs
- The spot controller has been enhanced to support any job on on-demand or spot instances.
  - To use, run `sky jobs launch` instead of `sky spot launch`.
- The new job controller can automatically recover jobs from any spot preemptions or hardware failures, and also execute pipelines of jobs.
- The `sky jobs` API is identical to the `sky spot` API, but also supports on-demand instances.
SkyServe and Jobs on Kubernetes
- SkyPilot can now run SkyServe and Managed Job controllers on Kubernetes
- This means you can now run your SkyServe and Managed Jobs on your Kubernetes cluster!
- Simply run `sky jobs launch` or `sky serve up`, and SkyPilot will automatically deploy the controller on your Kubernetes cluster if available and run jobs on the cheapest available location.
Mix on-demand and spot instances in SkyServe
- SkyServe now supports a new intelligent policy for mixing spot and on-demand instances. Example.
- Uses on-demand instances to ensure availability and spot instances to save costs.
- Dynamically falls back to on-demand replicas when spot replicas are not available. Example.
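The spot and on-demand mixing policy can be sketched as a service spec like the one below. The field names `base_ondemand_fallback_replicas` and `dynamic_ondemand_fallback` reflect the SkyServe spec as best understood here and should be verified against the SkyServe docs. The probe path, port, and replica counts are placeholders.

```yaml
service:
  readiness_probe: /health                # hypothetical health-check path
  replica_policy:
    min_replicas: 2
    max_replicas: 3
    target_qps_per_replica: 1
    base_ondemand_fallback_replicas: 1    # assumed: always keep one on-demand replica
    dynamic_ondemand_fallback: true       # assumed: add on-demand replicas when spot is unavailable

resources:
  use_spot: true
  ports: 8080
```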
Paperspace support
- Newest cloud to join the Sky: Paperspace!
- Paperspace offers the latest GPUs including H100 and A100-80GB for AI training and inference.
- Simply add your Paperspace API key to `~/.paperspace/config.json` and run `sky check paperspace` to get started.
- Big thanks to @asaiacai for contributing Paperspace support!
More LLMs and Recipes
Deprecation Notes
The following features have been deprecated and will be removed in the next minor release:
- `sky spot` CLI: use `sky jobs` CLI instead.
- `core.spot_xxx` APIs: refactored to `jobs.xxx`.
- `qps_lower_threshold` and `auto_restart` in `service`: use `target_qps_per_replica` instead.
Changelog
Managed Jobs
- Changes made to the local catalog at ~/.sky/catalog are now reflected on the controller (#3289)
- The name of the spot job is now included in the `SKYPILOT_TASK_ID` environment variable (#3424)
- Legacy spot job APIs have been refactored from `core.spot_xxx` to `jobs.xxx` (#3417)
(#3417) - Cloud for the controller is now chosen based on the resources of the replicas (#3363)
- Bug fixes (#3302, #3397, #3459, #3468, #3480)
SkyServe
New Features
- New intelligent policy for mixing spot and on-demand instances in SkyServe (#3194)
- SkyServe now uses proxy instead of HTTP redirect responses for better performance (#3395)
- Readiness probe now supports headers: this is useful for authentication or other headers required for readiness checks (#3552)
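A readiness probe with headers might look roughly like this. The endpoint path is a placeholder, and the `$AUTH_TOKEN` reference assumes the token is supplied through the task's environment variables.

```yaml
service:
  readiness_probe:
    path: /v1/models                      # hypothetical endpoint to probe
    headers:
      Authorization: Bearer $AUTH_TOKEN   # assumes AUTH_TOKEN is provided via envs
```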
Enhancements
- Optimizations - replicas are reused when only service section is changed (#3214)
- Rolling updates are now the default behavior for SkyServe (#3249)
- Controller cloud is now chosen from replica resources if it is not already up (#3231)
- Bug fixes and API improvements (#3257, #3299, #3303, #3411, #3411, #3546)
Kubernetes
- Kubernetes clusters can now run SkyServe and Managed Jobs (#3377, #3524, #3521)
- `sky show-gpus` now shows realtime availability of GPUs in the cluster (#3499)
- Autoscaling Kubernetes clusters are now supported: SkyPilot can now wait for GKE node pools, Karpenter and other autoscalers to provision nodes (#3513, #3415)
- Use Kubernetes service accounts by specifying `remote_identity` in ~/.sky/config.yaml (#3377, #3527)
- `sky local up` now also automatically installs the Nginx Ingress Controller (#3223)
- Support for specifying custom pod configurations with `pod_config` (#3244)
  - Use this to modify the pod configuration for your environment, e.g., attaching volumes, specifying imagePullSecrets, increasing /dev/shm size limit, setting `HTTP_PROXY` and more! See example `pod_config` here.
- Support for specifying custom metadata to all Kubernetes resources created by SkyPilot (#3333)
- Useful for tracking resources created by SkyPilot in your Kubernetes cluster.
- Support for PodIP mode for exposing ports (#3445)
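For the `pod_config` item above, a config sketch along these lines sets an image pull secret and an `HTTP_PROXY` variable. The secret name and proxy address are hypothetical, and the nesting assumes `pod_config` accepts a standard Kubernetes pod spec fragment.

```yaml
# ~/.sky/config.yaml
kubernetes:
  pod_config:
    spec:
      imagePullSecrets:
        - name: my-secret                     # hypothetical secret name
      containers:
        - env:
            - name: HTTP_PROXY
              value: http://proxy-host:3128   # hypothetical proxy address
```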
Enhancements
- GPU Isolation: SkyPilot no longer uses privileged containers and pods can no longer use GPUs not allocated to them (#3443)
- Ingress creation requests are now batched to minimize nginx reloads and ingress paths are namespaced (#3263, #3373)
- All SkyPilot pods are now labelled with `skypilot-user` to identify the owner of the pod (#3576)
- Special characters in environment variables are now correctly parsed (#3322)
- GPU labelling is now more robust (#3274)
- Bug fixes and quality of life improvements (#3266, #3392, #3439, #3509, #3524, #3525, #3532, #3563, #3578, #3374)
CLI & Core interfaces
New Features
- `resources` now supports a `labels` field to set labels (instance tags on AWS, labels on GCP and Kubernetes) on cloud resources (#3464, #3505)
- `sky check` now supports checking credentials for specific clouds, e.g. `sky check aws gcp` (#3229)
  - You can also restrict which clouds are checked by setting `allowed_clouds` in `~/.sky/config.yaml`. (#3556)
- `any_of` or `ordered` fields in `resources` can now have clouds that are not enabled (#3567)
- A new environment variable `SKYPILOT_CLUSTER_INFO`, containing cluster name, cloud, region and zone, is now available in all tasks (#3424)
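As a small example of the `labels` field described in this list, the task below tags its cloud resources. The label keys and values are made up for illustration.

```yaml
resources:
  cloud: aws
  labels:
    team: ml-research    # hypothetical label key/value
    project: llm-eval    # hypothetical label key/value
```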
Enhancements
- Optimizer is up to 10x faster when multiple resources are specified (#3567)
- Autostop timer is now reset at the start of a new sky launch to avoid unexpected autostops (#3205)
- GCP GPUs now include `DEVICE_MEM` in `sky show-gpus` (#3375)
- Better sorting for `sky show-gpus` (#3492)
- Handling for usernames containing invalid characters (#3528)
- Null environment variables now raise an error (#3557)
Runtime & Backend
- SkyPilot now supports Python 3.11 (#3248)
- SkyPilot runtime is now isolated from any environment changes made by user code (#3575, #3326, #3339)
- Fix for jobs and services running longer than 12 days (#3460)
- Docker runtime fixes and enhancements, including fix for storage mounting in container (#3450, #3436, #3481, #3343)
- Bug fixes and optimizations (#3280, #3292, #3178, #3386, #3292, #3386, #3407, #3423, #3368, #3457, #3469, #3482, #3495, #3512, #3536, #3568)
Optimizations
- Lazy imports for 2x faster import times (#3394, #3463)
- Faster setup and job submission (#3523, #3484)
Cloud: GCP
Cloud: Azure
- Custom images are now supported on Azure. Simply specify `image_id` in the `resources` field. (#3362)
- 8x faster autostop for Azure (#3519)
- Fix GPUs not being detected in Azure (#3313)
- Provisioning fixes (#3483)
Cloud: AWS
- Fine-grained IAM roles: you can now specify IAM roles on a per-resource basis (#3488, #3514)
- SkyPilot can now be run in ECS containers by assuming `container-role` IAM roles (#3503)
- SkyPilot will not delete user-specified security groups (#3402)
Cloud: Fluidstack
- H100 and A100 Nvlink support for Fluidstack (#3467)
- Opening ports is now supported for Fluidstack (#3294)
- Bug fixes (#3254, #3265)
Other Clouds
- Bug fixes for Lambda provisioning and termination (#3409, #3410)
- Multi-GPU fixes for RunPod (#3291)
- Cudo: handle missing project errors (#3438)
Thanks to all contributors!
New contributors: @MysteryManav, @JGSweets, @Harthgar, @mjkanji
Many thanks to all contributors who contributed to this release!
Contributors: @Michaelvll, @romilbhardwaj, @concretevitamin, @cblmemo, @MaoZiming, @shethhriday29, @asaiacai, @JGSweets, @mjkanji, @MysteryManav, @landscapepainter, @Harthgar, @mjibril, @dtran24, @fozziethebeat, @JungleCatSW
Full Changelog: v0.5.0...v0.6.0
SkyPilot v0.5.0
SkyPilot v0.5.0: SkyServe, New Provisioner, LLMs, Kubernetes, and More Clouds
We are excited to release SkyPilot v0.5.0, where we introduce a significant amount of new features and enhancements, including:
- SkyPilot Serving
- New provisioner
- LLM recipes for the latest open models and engines
- Kubernetes support improvement
- 4 new clouds (contributed by the cloud providers!)
and more!
Release Highlights
New Features
- Multiple candidate resources: SkyPilot now supports multiple candidate resources for a single task (using multiple accelerators, `any_of` or `ordered` in `resources`), allowing users to significantly enlarge the resource pool and get higher availability.
- New Provisioner: The provisioner has a new implementation, which is 2x faster and more reliable for supported clouds. It supports launching clusters with more than 100 nodes. Dependency requirements for clouds are also significantly reduced.
- Disk Tier: Introducing the `best` disk tier for the best performance and cost, so you can choose the best disk for any cloud. (#2434)
- Allow 2x spot jobs to be run concurrently
- Mount storage back after cluster restart
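The multiple-candidate and disk-tier features above could be combined in a task roughly as follows. The accelerator choices are placeholders, and `ordered` is taken to mean the candidates are tried in the listed order of preference.

```yaml
resources:
  disk_tier: best    # let SkyPilot pick the best available disk tier
  ordered:           # candidates in order of preference
    - accelerators: A100:8   # placeholder accelerator
    - accelerators: V100:8   # placeholder fallback
```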
SkyServe
SkyServe is a serving system on top of SkyPilot that deploys and scales any HTTP services across one or more regions or clouds, with autoscaling, load balancing, and more.
- Introducing SkyServe: deploy and scale your AI models across multiple regions or clouds. (#2458)
- Autoscaler: Request rate based autoscaling policy. (#2868, #2878)
- Autoscaler: Support scaling to 0 when no requests (#2938)
- Rolling update: Support rolling update for existing services (#2935, #3057)
Other Enhancements
- Environment variable support in services field (#3078)
- Override task configurations with CLI arguments (#2979)
- Logging improvement for replicas (#2924, #2949)
- Smoke tests for SkyServe (#2911)
- Documents for SkyServe (#3022, #2794, #2864, #2894, #2922, #2989, #3182)
- UX improvements for SkyServe (#2895, #2940, #2961, #3054, #3176, #3094)
- Bug fixes and robustness improvement (#2811, #2822, #2860, #2995, #2983, #3058, #3075, #3226)
New LLM Recipes
- Gemma: Serve your Gemma on any cloud (#3207, #3220)
- SGLang: Speed up your LLM deployments with SGLang for 5x throughput on SkyServe (#3126, #3140, #3170, #3145)
- Mixtral 8x7B: Serving and scaling Mixtral 8x7B model on any regions/clouds (#2857, #2888, #3017, #3067, #2882)
- Mistral 7B: Official docs for hosting Mistral 7B from mistral.ai (#2615, #2856)
- CodeLlama: Hosting CodeLlama model with SkyServe and accessing it with API, chat or VSCode (#3050, #3143)
- LoRAX: efficient multi-lora LLM inference (#2883)
- axolotl: a recent LLM tool for finetuning AI models running on SkyPilot (#2784, #2789)
- Tabby: Self-host coding assistant Tabby on SkyPilot (#2597, #3068)
- vLLM: Serve with vLLM to expose OpenAI API for Vicuna and Mixtral (#2614, #2643, #2616, #2786, #2791, #2948, #3118)
- TGI: Scale the inference engine TGI with SkyServe (#3121)
Kubernetes
Kubernetes support received a number of new features and enhancements.
- Multi-node support for Kubernetes (#2609, #3019)
- Open ports support for Kubernetes (#2588, #2713, #2997, #3200)
- Support Coreweave label for GPUs in Kubernetes (Coreweave support under development) (#2650)
- Starting a Kubernetes GPU cluster locally with `sky local up` (#2890)
- Custom Image Support for Kubernetes Instances (#2729, #3019, #3210)
- New provisioner for Kubernetes for better performance and robustness (#3019)
- Supporting Kubernetes cluster launched with k3s and Rancher (#3148)
Other Enhancements
- Support H100 80GB in Kubernetes (#2840)
- Share SSH jump pod across users to reduce resources consumption (#2826)
- Allow the `KUBECONFIG` env var for config file specification (#3169)
- Make Kubernetes cluster removal more robust (#3043)
- Fixes GPU labeller (#2636, #2653)
- UX and Robustness improvement (#2638, #2712, #2589, #2785, #2551, #2795, #2884, #2913, #2795)
- Documents improvement (#2595, #2705, #2957, #2991, #2997, #3119)
More Clouds
SkyPilot now supports 13 cloud providers, including 4 new provider-contributed clouds: VMWare vSphere, RunPod, Fluidstack and Cudo Compute.
- RunPod: RunPod is a specialized AI cloud, with additional capacities for high-end GPUs. (#2980, #3018)
- Fluidstack: Fluidstack offers accessible GPUs for AI with low cost. (#3086, #3224)
- Cudo Compute: GPU cloud provides low cost GPUs powered with green energy. (#2975, #3224)
- VMWare vSphere: you can now bring your own vSphere cluster to SkyPilot. (docs) (#3000)
Clouds
AWS
New Features
- New provisioner for AWS: >2x faster for multi-node provisioning and more reliable for cluster launching. (#1702, #2719, #2792)
- Support for AWS Trainium accelerator (#2690)
- Support null for proxy command to filter regions (#2756)
- Support CUDA 12.1 with default image updates (#2788)
- Job scheduling on Inferentia and Trainium (#2969, #2798)
- Allow specifying security_group (#3133)
Enhancements
- Make public / private subnet selection robust (#2867)
- Avoid hanging for restarting an instance in STOPPING state (#2998)
- Remove sunset instance types (#2610)
- Add docs for custom VPC support (#2776)
Fixes
- Fix conda installation on AWS default image (#3206)
- Robustify the custom image support (#3216)
- Fix subnet selection for AWS and autodown for spot instances (#2921)
- Fix minimal permission for AWS (#2978)
- Improve opening ports for AWS (#2716)
- Autostop with new provisioner (#2719)
GCP
New Features
- Security: Custom VPC support for GCP. (#2764, #2772, #2854, #2944)
- Security: Support private IP with proxy jump on GCP. (#2819)
- New provisioner: Adopted new provisioner for GCP with >2x faster and more robust provisioning (#2681, #2719, #2943)
- Automatically use reserved instances from multiple reserved pools (#2836, #2681)
- Support L4 accelerator for GCP (#2724)
- Allow stopping spot clusters on GCP (#2877)
Enhancements
- Allow stopping VM with local SSD (#2587)
- Update default runtime version for TPU node (#2601, #2602)
- Handling transient error during launching GCP clusters (#2669)
- Update GCSFuse version to 1.3.0 for GCS storage mount (#2887)
- Set TPU VM the default option for TPU accelerators (#1758)
- Ignore missing gcp credentials for latest gcloud and avoid duplicating credentials (#3028, #3172, #3234)
Fixes
- Fix custom docker image support (#3218)
- Fix minimal roles required for GCP (#2704)
- Robustify the catalog fetching (#3141)
- Fix ports on TPU VM and cluster launched before 0.4.0 (#2641)
- Fix backward compatibility issue with GCP clusters (#2604)
- Fix `--disk-size` for Custom Machine Images (#2718)
- Update catalog fetcher with more options (#2562)
- Assign GCP VMs with service account (#2972)
- Fix machine image support (#3030, #3236)
- Fix error handling for failed provisioning (#2852)
- Leave out TPU v5 in catalog as it is not supported (#2656)
- Fix GCP minimal permission (#2947, #2770, #2761)
Azure
Enhancements
- Make port opening more robust (#2649, #2891, #3084)
- Additional arguments for Azure catalog fetcher and support H100 (#2561, #2844, #2847)
- Support CUDA 12.1 with default image updates (#2468)
- Support spot instances on Azure (#2871)
Fixes
- Fix custom docker image support (#3218)
- UX: Fix Azure disk tier explicitly shown in resources str (#3064)
- Fix status query for Azure (#3015)
SCP
- Fix SCP error raised in `sky check` (#3038)
CLI & Core interfaces
New Features
- Multi-node jobs fail fast on single-node failure (#3081)
- Add configurations for not uploading credentials (#2904)
- Adding `sky status --endpoints` CLI (#3199)
- Support more characters in cluster name (#3130)
- Show all regions and more accurate price in `sky show-gpus` (#2583, #2892, #2933, #2946, #3083, #3149, #3113)
- Allow inferring cloud from region or zone (#2632)
- Add `--commit` and `--version` for `sky` CLI (#2720, #2731, #2733)
Enhancements
- Robustify runtime initialization on remote cluster (#3132)
- Better error message for YAML parsing (#3040)
- Smarter GPU name completion (#3014)
- Speed up retry until up by not doing exponential backoff (#2821)
- Add schema validation for config (#2645)
- Allow `--disk-tier none` override (#2906)
- `sky check` improvement (#3174, #3212, #3160)
- Better logging for CLIs (#2535, #2691, #2728, #3139, #3175)
Fixes
- Fix permissi...
SkyPilot v0.4.1
This is a patch release to ship bug fixes faster to our users! This release includes many feature updates and bug fixes, including the new provisioner for AWS, fixing OOM and credential issues for long-running spot jobs, and some additional improvements.
Detailed changelog coming up in v0.5!
SkyPilot v0.4.0
SkyPilot v0.4.0: Kubernetes, native containers, ports and new clouds
We are excited to release SkyPilot v0.4.0, which brings a host of new features and improvements, including Kubernetes support, native container support, ability to open ports, and more.
Release Highlights
New Features
- Kubernetes support: SkyPilot tasks and clusters can now run on Kubernetes clusters, including on-prem and cloud hosted deployments (GKE, EKS).
  - If you have a working kubeconfig, simply run `sky check` and `sky launch --cloud kubernetes` to run your task on Kubernetes.
  - If desired, tasks can also fail over to the cloud when the Kubernetes cluster does not have enough resources. The same SkyPilot YAMLs and CLI work seamlessly across Kubernetes and clouds.
- Opening ports on clusters: Open ports on your clusters with the `ports` field. These ports are publicly accessible and can be used for hosting LLM inference endpoints, Jupyter notebooks, web servers, Tensorboard, and other services.
- Native container support: If your task uses docker containers, SkyPilot's `setup` and `run` commands can now directly be executed in that container. This allows you to wrap your environment in a container and run it on any cloud with SkyPilot.
- Reservation support: This release adds support for GCP reservations. SkyPilot will now prioritize using your reservations on the cloud to save costs and get higher availability.
- New Managed Spot Features
- Spot pipeline support: automatically execute a pipeline of sequential tasks.
- Spot dashboard: track all your spot jobs in your browser.
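To make the `ports` field and native container support concrete, here is a hedged sketch of a task that runs inside a Docker image and exposes one port. The image, port number, and server command are illustrative choices, and the `docker:` prefix on `image_id` is assumed to be how a container image is selected.

```yaml
resources:
  image_id: docker:ubuntu:20.04   # assumed syntax; setup/run execute inside this container
  ports: 8888                     # opened publicly on the cluster

run: |
  # Illustrative service; any server listening on the opened port would do.
  python -m http.server 8888
```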
New LLM Recipes
- vLLM on any cloud - blog, example.
- Llama 2 - Train Vicuna on Llama-2 and serve chatbots - blog, fine-tuning example, self-hosted chatbot.
- LocalGPT: chat with your pdfs - example.
- Falcon-40B fine-tuning guide - example.
More Clouds
SkyPilot now supports 8 clouds, including community contributed support for two new clouds:
SkyPilot now also supports IBM COS buckets (#1966).
Core and UX Improvements
- Faster failover: 30x faster failover with our new quota optimization which checks if quotas are available before launching a cluster (Supported on GCP, AWS).
- Easily get VM IPs: The new `--ip` flag for `sky status` returns the public IP address of the cluster (e.g., `sky status --ip mycluster`). Use this to access services such as LLM inference endpoints, Jupyter notebooks and more.
- Improved scriptability: SkyPilot YAMLs and CLI are more scriptable than ever - `file_mounts` can be dynamically defined with environment variables (docs, example), environment variables can be set through a dotenv file with the new `--env-file` flag (#2296).
- Core optimizations: Multi-node clusters stop 4x faster (#2199), `sky status` updates for stopped clusters are 10x faster (#2288), and the job queue is more memory efficient (#1636).
- Nightly releases: We now release nightly versions of SkyPilot. To get the cutting edge of SkyPilot without installing from source, run `pip install skypilot-nightly` (#1446)
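The scriptability point about `file_mounts` can be sketched as below, where a bucket name comes from an environment variable. The variable name and default value are hypothetical. The value could presumably be overridden at launch time, for example through the `--env-file` flag mentioned above.

```yaml
envs:
  BUCKET_NAME: my-default-bucket   # hypothetical default; can be overridden at launch

file_mounts:
  /data:
    name: ${BUCKET_NAME}           # resolved from the env var above
    mode: MOUNT
```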
Deprecation
- SkyPilot On-prem is now deprecated and Kubernetes will be the recommended mode of running SkyPilot on on-prem clusters.
Below is a detailed list of changes.
Managed Spot
New Features
- Spot pipeline support: automatically handles a pipeline of spot jobs. (#1982)
- Spot dashboard is now available with `sky spot dashboard`: you can now see all your spot jobs in a GUI (#2103, #2136)
- Spot callback - users can now run custom code when spot job status changes (#2106, #2364)
- Resource configuration of the spot controller can now be customized (docs, #2040)
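Customizing the spot controller's resources might look like the config sketch below. The `spot.controller.resources` key path is assumed for this release and may be named differently in later versions. The values shown are arbitrary.

```yaml
# ~/.sky/config.yaml
spot:
  controller:
    resources:        # assumed to take the same spec as `resources` in a task YAML
      cloud: gcp
      cpus: 4+
      disk_size: 100
```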
Enhancements
- SkyPilot now shows the spot job's resources and estimated cost before confirmation (#2524)
- Switch to eager failover recovery policy for better spot lifetime (#2234)
- Reduce the logging for launching spot controller (#2056)
Fixes
- We now show PENDING spot job in the spot queue before it starts (#2044)
- Robustness fixes (#2102, #2153, #2119, #2004, #2330, #1998)
CLI & YAML interfaces
New Features
- Users can now use environment variables to dynamically define file_mounts (docs, #2146)
- `sky status` can now show the head IP of the cluster with `-a` or `--ip` flags (#2305, #2563)
- `sky down/stop/start` defaults to a unique cluster if it exists and `sky cancel` without cluster cancels the latest task (#2325)
Enhancements
- `sky check` output is now friendlier with more hints for disabled clouds (#2002, #2017, #2196, #2114, #2221, #2377)
- `sky down` progress bar now reflects clusters that failed to terminate (#1595, #2005)
- We now fail early if rsync is not installed locally (#2168)
- Better messages and hints for CLI (#2027, #2028, #2077, #2083, #2085)
Fixes
- Fixed the order of VMs in optimizer table when `--cpus` is provided (#2037)
- Better handling when `sky launch` is interrupted (#2206, #2252)
Backend
New Features
- Users can now open ports for their clusters with the `ports` field (docs, #2210, #2477)
- Docker support in `image_id` - tasks can now be run inside docker containers (docs, #1910)
- Users can now clone a cluster from an existing cluster's disk with the `--clone-disk-from` flag (#2098)
- Users can now launch their own ray cluster on a SkyPilot cluster (#2020)
Enhancements
- 30x faster failover for AWS and GCP when quotas are not available (#1953, #2187, #2313)
- Faster `sky launch` by caching cluster IP address (#2400)
- Job queue is now more resource efficient, with significant memory consumption reduction on remote cluster (#1636)
- Cluster names no longer map directly to cloud cluster names. Instead, they are mapped to a unique cluster name on the cloud. This helps with isolation across users sharing cloud accounts. (#2403)
- More efficient and robust stopping/termination for AWS (#2121)
- `sky status --refresh` for STOPPED cluster is 10x faster (#2079)
- Empty YAML fields are now allowed (#1890)
Fixes
- Manually started/stopped clusters are now better handled (#2130, #2203, #2389)
- Fix edge case where existing clusters were terminated when resources are not available (#2170)
- Fixes for disk_tier UX (#2156, #2215)
- Robustness fixes (#2033, #2061, #2009, #2491, #2290, #1259, #2074, #2023, #2042)
Storage
New Features
Enhancements
- Deletion is now parallelized for faster deletion (#2058)
- UX improvements for `sky storage` CLI (#2063, #2177)
- GCS bucket mounting now uses gcsfuse v1.0.1 (#2470)
Fixes
- Fix transient failures when uploading to GCS from MacOS due to multiprocessing bug (#2125)
- Robustness fixes (#2049, #2117, #2165, #2259, #2326, #2250)
Dependencies
- Avoid buggy grpcio versions (#2055)
- Pydantic is pinned to `<2.0` (#2157)
- PyYAML is pinned to `>3.13, != 5.4.*` to avoid issues with Cython 3 (#2256, #2514)
- Ray `<= 2.6.3` is supported on local machines (#2401)
- `pycryptodome`, `oauth2client` are no longer required (#2515)
Clouds
AWS
- H100 GPUs are now supported (#2323)
- New docs for AWS cloud administrator about advanced login option (SSO and account switching) (#1888)
- Insufficient permission is now handled gracefully (#2415, #2456)
- Fixed a bug where existing AWS cluster would end up in INIT state after changing identity (#2442)
- Fix fetching AZ when describe zones permission does not exist in all regions (#2463)
GCP
- Nvidia L4 GPUs are now supported (#2212)
- Machine Images are now supported (#2280)
- GCP reservations are now...
SkyPilot v0.3.3
This patch release brings many bug fixes and features, including new mechanics for stop/down, callbacks for spot jobs and a critical dependency fix for PyYAML after the release of Cython 3.
Detailed changelog coming up in v0.4!
SkyPilot v0.3.2
This is a patch release to ship bug fixes faster to our users! This release includes many feature updates and bug fixes, including the pydantic dependency issue, disk cloning, file mounts, and cloud-specific improvements.
Detailed changelog coming up in v0.4!
SkyPilot v0.3.1
This is a patch release to ship several important enhancements and bug fixes:
Enhancements
- On-demand H100 GPU from Lambda is supported! `sky launch --gpus h100`
  - To use it, remove any previous Lambda catalog: `rm -rf ~/.sky/catalogs/v5/lambda`
- Managed spot: make job cancellation during failover more robust to mitigate a rare `FAILED_SETUP` error (#1998)
Fixes
- Provisioner / Backend
- Logging
- Managed spot
  - Fix `sky spot launch --retry-until-up` to make it actually retry until up (#2004)
- Storage
  - Fix a rare storage cloud check error if `sky check` has never been called (#2017)
- On-prem
  - Fix detecting A5000 and A6000 GPUs (#2023)
Full Changelog: v0.3.0...v0.3.1
SkyPilot v0.3.0
SkyPilot v0.3.0: LLM Support, New Clouds, Enhanced Production-Readiness
We are excited to release SkyPilot v0.3, the most significant release thus far in the project's history.
v0.3 focuses on:
- LLM support (Vicuna, LLaMA)
- New clouds (Lambda Cloud; IBM; Cloudflare R2)
- Enhanced production readiness
See the release blog post for a deep-dive into highlights.
Release notes below are as compared to v0.2 (full changelog).
Release Highlights
- LLM support
  - Vicuna LLM chatbot trained using SkyPilot for $300 on spot instances!
  - Full finetuning & serving YAMLs released here to build off of!
  - Serve your own LLaMA LLM chatbot on any cloud: full example, blog, repo
  - Significantly expanded GPU availability by leveraging the widest selection of clouds (see below)
- More clouds, more choices: delivering the highest GPU availability & cost savings
  - Lambda Cloud is now supported!
  - IBM Cloud is now supported!
    - This brings the first hyperscaler cloud after AWS/GCP/Azure to SkyPilot. (#1598)
  - Cloudflare R2 object store is now supported!
    - This brings zero-egress cost object storage to SkyPilot. (#1736)
    - To use it, see setup docs and usage docs.
- Managed Spot is made significantly more robust via a host of fixes/enhancements.
- Cluster leakage prevention and detection are significantly improved.
- CLI/API & Backend shipped many new features: `sky cost-report`; fine-grained optimizer; user identity; AWS SSO; private IP-only VPCs; Ray runtime is decoupled from user's Ray clusters; ...
CLI/API
New Features
- New CLI `sky cost-report`: show the estimated cost of launched clusters (#1301, #1621, #1780, #1680, #1788)
  - Experimental: Costs for clusters with auto{stop,down} scheduled may not be accurate.
- New resource filtering support in `sky launch` / YAML `resources:` field
- Add `--detach-setup` and `--detach-run` to `sky launch` #1379
- Add `--retry-until-up`, `--region`, `--zone`, and `--idle-minutes-to-autostop` for interactive nodes #1297
- Add autodown (#1217, #1254)
- Support calling `sky status/sky.status()` on specific clusters #1568
- Support `--region` in `sky show-gpus` #1187
- Support passing AMIs for different regions in `image_id` field under `resources` #1384
Enhancements
- Improvements to `sky show-gpus`
- Check that the image exists and that its size fits in the OS disk #1508
- Make `sky down -p` bypass identity mismatch errors. #1892
Fixes
- Make repeated `sky {cpu,gpu,tpu}node` commands correctly reuse existing cluster if possible #1787
- Fix errors from empty 'resources' field in YAML. #1816
- Make autostop more robust for AWS custom images that by default export 2 credential env vars (#1880, #1894, #1946)
Managed spot
New Features
- Latest in-progress spot jobs are shown in `sky status` (#1270, #1467, #1691)
- Detailed reasons for failed spot jobs are exposed in `sky spot queue -a` (#1655)
Enhancements
- Make `sky spot launch` default `-r/--retry-until-up` to True. #1781
- Make job termination/cancellation significantly more robust (#1433, #1745)
- Catch "pre-launch" errors early (e.g., invalid cluster names, no cloud access) to avoid unnecessary retries (#1714)
- `sky start` on the spot controller resets the default autostop #1453
- `sky spot queue` displays job states with colors (#1473)
- `sky spot queue` no longer shows a cached (and possibly stale) version of the jobs (#1742)
- Disallow `sky down` on spot controller when in-progress spot jobs exist #1667
- New state `FAILED_SETUP` for spot jobs that fail during `setup` (#1479)
- New state `CANCELLING` for spot jobs that are being cancelled (#1785)
- Keep env var `SKYPILOT_JOB_ID` the same for all recoveries of the same job #1400
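Because `SKYPILOT_JOB_ID` stays the same across recoveries, a training script can use it to find its own earlier checkpoints after a preemption. The following is a minimal sketch of that idea; the function name, the `/checkpoints` base path, and the `local-run` fallback are illustrative assumptions, not part of SkyPilot.

```python
import os
from pathlib import Path


def checkpoint_dir(base: str = "/checkpoints") -> Path:
    """Return a per-job checkpoint directory keyed on SKYPILOT_JOB_ID.

    The base path and fallback name are hypothetical choices for this sketch.
    """
    # Outside a managed spot job the variable is unset, so fall back to a fixed name.
    job_id = os.environ.get("SKYPILOT_JOB_ID", "local-run")
    return Path(base) / job_id


if __name__ == "__main__":
    # Every recovery of the same job resolves to the same directory.
    print(checkpoint_dir())
```

Pointing `base` at a mounted bucket would let a recovered job resume from the directory its previous attempt wrote to.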
Fixes
- Robustness fixes (#851, #1329, #1411, #1545, #1738, #1757, #1798, #1951, ...)
- Fixes for spot TPUs (#1249, #1470, #1500, #1555, #1717)
- Fix spot jobs with the same name (`-n`) possibly overwriting each other #1782
- Make spot job failover only use the regions in `ssh_proxy_command` if specified #1792
- Fix failing to launch spot jobs when spot controller is created with AWS SSO #1817
TPU
Robustness is enhanced for TPUs in various modes: VMs, pods, spot (#1500, #1279, #1359, #1483, #1562, ...).
Provisioner
Enhancements
- Cluster leakage prevention is significantly improved!
- Disable unattended-upgrade (nondeterministic APT lock) on cluster start
- Generate valid cluster names when username has invalid characters #1526
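As an illustration of the cluster-name item above, the sketch below shows one way a username with invalid characters could be turned into a name-safe string. The rule used here (lowercase letters, digits, and hyphens only) and the `sky` prefix are assumptions for the example; the release notes do not spell out the rule SkyPilot applies.

```python
import re


def sanitize_for_cluster_name(username: str, prefix: str = "sky") -> str:
    """Build a cluster-name-safe string from an arbitrary username.

    Hypothetical rule: keep [a-z0-9-], collapse everything else into '-'.
    """
    cleaned = re.sub(r"[^a-z0-9-]+", "-", username.lower()).strip("-")
    # If nothing usable remains, fall back to the prefix alone.
    return f"{prefix}-{cleaned}" if cleaned else prefix


if __name__ == "__main__":
    print(sanitize_for_cluster_name("Alice.Smith@corp"))
```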
Fixes
- GCP/provisioner: Handle the occasional RESOURCE_NOT_FOUND error. #1842
- Robustness fixes (#1236, #1287, #1619, #1969)
Storage
New Features
- Cloudflare R2 is now supported! #1736
  - R2 is an S3-compatible object store with zero egress fee.
  - To use it, see setup docs and usage docs.
- Support multiple paths in the `source` of a storage mount, e.g., `source: [~/mydir/myfile.txt, ~/datasets]` #1311 #1677
Enhancements
- Exclude uploading `.git` folder for cloud storage mounts #1494
- If a `file_mounts` destination path is a relative path, it is treated as being under workdir #1315
- Upgrade GCSFuse version to 0.42.3 #1829
- Mounting options improvements (#1312, #1296, #1320)
- API improvements (#1223, #1239)
- UX/logging improvements (#1200, #1285, #1457, #1833, #1857, #1908, #1858)
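The relative-path rule for `file_mounts` destinations in the list above can be sketched as follows. The helper name is hypothetical, and the default workdir location `~/sky_workdir` is an assumption for illustration; the release notes only say the path is treated as being under workdir.

```python
import posixpath


def resolve_mount_destination(dst: str, workdir: str = "~/sky_workdir") -> str:
    """Resolve a file_mounts destination; relative paths land under the workdir."""
    # Absolute paths and home-relative paths are used as given.
    if dst.startswith(("/", "~")):
        return dst
    # Anything else is interpreted relative to the (assumed) workdir.
    return posixpath.normpath(posixpath.join(workdir, dst))


if __name__ == "__main__":
    for dst in ("/data", "~/data", "data/train", "./data"):
        print(dst, "->", resolve_mount_destination(dst))
```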
Fixes
- Fix `sky storage delete` for externally deleted buckets #1875
- Disallow single files for upload to Storage #1231
- Fix rsync for paths with spaces #1190
Backend
New Features
- New feature: Fine-grained optimizer
  - Optimizing & provisioning retries at the granularity of regions/zones #975
  - In other words, SkyPilot now automatically recognizes and optimizes across the cost differences between zones (e.g., AWS zones have different prices for the same spot instance type) or regions
- New feature: User identity is associated with each cluster (#1513, #1550, #1809)
  - Identities are e.g., different AWS profiles / GCP projects
  - With this, users are free to switch across identities, and SkyPilot will properly protect each cluster
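The idea behind zone-level optimization with provisioning retries can be sketched as: rank candidate zones by price, then try them in order until one succeeds. This is a simplified illustration, not SkyPilot's optimizer; the function name, the `try_provision` callback, and any prices passed in are hypothetical.

```python
from typing import Callable, Optional

Location = tuple[str, str]  # (region, zone)


def launch_with_failover(
    hourly_price: dict[Location, float],
    try_provision: Callable[[str, str], bool],
) -> Optional[Location]:
    """Try zones from cheapest to most expensive; return the first that provisions."""
    ranked = sorted(hourly_price.items(), key=lambda item: item[1])
    for (region, zone), _price in ranked:
        # try_provision is a caller-supplied stand-in for a real provisioning call.
        if try_provision(region, zone):
            return (region, zone)
    return None  # every candidate zone failed
```

For example, with made-up prices where `us-east-1b` is cheapest but out of capacity, the sketch falls through to the next-cheapest zone instead of retrying the same one.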
Enhancements
- Ray runtime on SkyPilot clusters is upgraded to v2.4.0 (#1734)
  - All existing clusters are automatically upgraded on their next `sky launch/start`
  - Local client's ray requirement is updated to `ray[default]>=2.2.0,<=2.4.0` to fix some dependency conflicts with click/grpcio/protobuf
- Ray cluster used by the SkyPilot runt...