

AWS Deep Learning Desktop with Amazon DCV

Launch an AWS deep learning desktop with Amazon DCV for developing, training, testing, and visualizing deep learning and generative AI models.

Overview

Supported AMIs:

  • Ubuntu Server Pro 24.04 LTS, Version 20250516 (Default)
  • Ubuntu Server Pro 22.04 LTS, Version 20250516

Supported EC2 Instance Types:

Key Features:

Getting Started

Prerequisites

Requirements:

Supported AWS Regions: us-east-1, us-east-2, us-west-2, eu-west-1, eu-central-1, ap-southeast-1, ap-southeast-2, ap-northeast-1, ap-northeast-2, ap-south-1

Note: Not all EC2 instance types are available in all Availability Zones.

Setup Steps:

  1. Select your AWS Region from the supported regions above

  2. VPC and Subnets: Create a VPC or use an existing one. If needed, create three public subnets in three different Availability Zones

  3. EC2 Key Pair: Create an EC2 key pair if you don't have one (needed for KeyName parameter)

  4. S3 Bucket: Create an S3 bucket in your selected region (can be empty initially)

  5. Get Your Public IP: Use the AWS Check IP service (https://checkip.amazonaws.com) to find your public IP address (needed for the DesktopAccessCIDR parameter); a command-line alternative is shown after these steps

  6. Clone Repository: Clone this repository to your laptop:

    git clone <repository-url>
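
A command-line alternative for step 5, using the AWS Check IP endpoint; append /32 to the returned address (e.g. 1.2.3.4/32) to restrict access to that single IP:

    curl https://checkip.amazonaws.com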

Launch the Desktop

Create a CloudFormation stack using the deep-learning-ubuntu-desktop.yaml template via the AWS CloudFormation console or the AWS CLI (a CLI sketch follows the IAM note below).

See CloudFormation Parameters for template inputs and Stack Outputs for outputs.

Important: The template creates IAM resources:

  • Console: Check "I acknowledge that AWS CloudFormation might create IAM resources" during review
  • CLI: Use --capabilities CAPABILITY_NAMED_IAM flag
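
For the CLI route, the sketch below shows a minimal launch command. The stack name and every parameter value are placeholders to replace with your own; parameters with template defaults (see Desktop CloudFormation Template Parameters) are omitted here:

    aws cloudformation create-stack \
      --stack-name my-dl-desktop \
      --template-body file://deep-learning-ubuntu-desktop.yaml \
      --capabilities CAPABILITY_NAMED_IAM \
      --parameters \
        ParameterKey=DesktopInstanceType,ParameterValue=<instance-type> \
        ParameterKey=DesktopVpcId,ParameterValue=<vpc-id> \
        ParameterKey=DesktopVpcSubnetId,ParameterValue=<subnet-id> \
        ParameterKey=DesktopAccessCIDR,ParameterValue=<your-public-ip>/32 \
        ParameterKey=DesktopHasPublicIpAddress,ParameterValue=true \
        ParameterKey=KeyName,ParameterValue=<key-pair-name> \
        ParameterKey=S3Bucket,ParameterValue=<bucket-name>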

Connect via SSH

  1. Wait for stack status to show CREATE_COMPLETE in CloudFormation console
  2. Find your desktop instance in EC2 console
  3. Connect via SSH as user ubuntu using your key pair
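
For example, assuming a key pair file named my-key-pair.pem (a placeholder) and the public IPv4 address or DNS name shown for the instance in the EC2 console:

    ssh -i my-key-pair.pem ubuntu@<desktop-public-ip>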

First-time Setup:

  • If you see "Cloud init in progress! Logs: /var/log/cloud-init-output.log", disconnect and wait ~15 minutes. The desktop installs Amazon DCV server and reboots automatically.
  • When you see Deep Learning Desktop is Ready!, set a password:
    sudo passwd ubuntu

Troubleshooting: The desktop uses EC2 user-data for automatic software installation. Check logs at /var/log/cloud-init-output.log. Most transient failures can be fixed by rebooting the instance.
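
To follow the installation while it runs, tail the cloud-init log over SSH:

    tail -f /var/log/cloud-init-output.log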

Connect via Amazon DCV Client

  1. Download and install the Amazon DCV client on your laptop
  2. Login to the desktop as user ubuntu
  3. Do not upgrade the OS version when prompted on first login
  4. Configure Software Updater to only apply security updates automatically (avoid non-security updates unless you're an advanced user)
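
The DCV client connects to the desktop's public IP address (or DNS name) on port 8443, the DCV port opened by the desktop security group. If the login screen does not appear, you can check the server from an SSH session; dcvserver is the standard Amazon DCV service name and is assumed here rather than taken from this repository's scripts:

    sudo systemctl status dcvserver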

Using the Desktop

Generative AI Inference Testing

The desktop provides comprehensive inference testing frameworks for LLMs and embedding models. See Inference Testing Guide for complete documentation.

Note: Once you have successfully connected to the Deep Learning Desktop with the DCV client, perform the following steps:

  1. Clone the project's git repository to your home directory:
     cd ~ && git clone <repository-url>
  2. Open the cloned repository in Kiro (recommended) or Visual Studio Code (both are pre-installed).

Supported Inference Servers:

Supported Backends:

  • vLLM - High-performance inference (GPU and Neuron)
  • TensorRT-LLM - Optimized for NVIDIA GPUs
  • Custom Python backends for embeddings

Key Features:

  • Docker containers for all server/backend combinations
  • Locust-based load testing with configurable concurrency
  • Automatic model caching to EFS
  • Hardware auto-detection (CUDA GPUs or Neuron devices)
  • Performance metrics with latency, throughput, and error rates
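
As a generic illustration of Locust's concurrency controls, and not the repository's own test harness (the locustfile name and endpoint are assumptions), a headless run with 32 simulated users spawned at 4 per second could look like:

    locust -f locustfile.py --headless -u 32 -r 4 --run-time 5m --host http://localhost:8000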

Generative AI Training Testing

The desktop provides four frameworks for fine-tuning LLMs with PEFT (LoRA) or full fine-tuning. See Training Testing Guide for complete documentation.

Available Frameworks:

  • NeMo 2.0 - Tensor/pipeline parallelism, Megatron-LM optimizations
  • PyTorch Lightning - Full control, flexible callbacks
  • Accelerate - Simple API, minimal code
  • Ray Train - Distributed orchestration, auto-recovery

Common Features:

  • Generalized HuggingFace dataset pipeline with flexible templates
  • Multi-node, multi-GPU distributed training with FSDP
  • LoRA and full fine-tuning support
  • Automatic checkpoint conversion to HuggingFace format
  • Comprehensive testing and evaluation scripts
  • Docker containers for reproducibility
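
As one hedged illustration of a single-node, multi-GPU launch with the Accelerate framework (the script name and its flags are hypothetical, not this repository's entry point):

    accelerate launch --multi_gpu --num_processes 8 finetune.py --use-lora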

Amazon SageMaker AI

The desktop is pre-configured for Amazon SageMaker AI.

Clone the SageMaker AI Examples GitHub Repository:

mkdir ~/sagemaker-ai
cd ~/sagemaker-ai
git clone -b distributed-training-pipeline https://github.com/aws/amazon-sagemaker-examples.git

Install the Python extension in Visual Studio Code, and open the cloned amazon-sagemaker-examples repository in Visual Studio Code.

Inference Examples:

  1. Navigate to: amazon-sagemaker-examples/advanced_functionality/large-model-inference-testing/large_model_inference.ipynb
  2. Use conda base environment as kernel
  3. Skip to Initialize Notebook

Training Examples (FSx for Lustre must be enabled on the Deep Learning desktop):

  1. Navigate to: amazon-sagemaker-examples/advanced_functionality/distributed-training-pipeline/dist_training_pipeline.ipynb
  2. Use conda base environment as kernel
  3. Skip to Initialize Notebook

Data Storage and File Systems

S3 Access: The desktop has access to your specified S3 bucket. Verify access:

aws s3 ls your-bucket-name

No output means the bucket is empty (normal). An error indicates access issues.

Storage Options:

  • Amazon EBS: Root volume (deleted when instance terminates)
  • Amazon EFS: Mounted at /home/ubuntu/efs by default (persists after termination)
  • Amazon FSx for Lustre: Optional, mounted at /home/ubuntu/fsx by default (enable via FSxForLustre parameter)

Important: EBS volumes are deleted on termination. EFS file-systems persist.
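
To confirm the network file-systems are mounted, check the mount points from a desktop shell:

    df -h /home/ubuntu/efs
    df -h /home/ubuntu/fsx   # only present when FSxForLustre is enabled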

Managing the Desktop

Stopping and Restarting

You can safely reboot, stop, and restart the desktop instance anytime. EFS (and FSx for Lustre, if enabled) automatically remount on restart.
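
For example, with the AWS CLI and a placeholder instance id:

    aws ec2 stop-instances --instance-ids i-0123456789abcdef0
    aws ec2 start-instances --instance-ids i-0123456789abcdef0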

Distributed Training

For distributed training workloads, launch a deep-learning cluster with EFA and Open MPI. See the EFA Cluster Guide.

Deleting Resources

Delete CloudFormation stacks from the AWS console when no longer needed.

What Gets Deleted:

  • EC2 instances
  • EBS root volumes
  • FSx for Lustre file-systems (if enabled)

What Persists:

  • EFS file-systems are NOT automatically deleted
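
Stacks can also be deleted with the CLI; the stack name below matches the placeholder used in the launch example:

    aws cloudformation delete-stack --stack-name my-dl-desktop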

Reference

Desktop CloudFormation Template Parameters

  • AWSUbuntuAMIType - Required. Selects the AMI type. The default is Ubuntu Server Pro 24.04 LTS.
  • DesktopAccessCIDR - Public IP CIDR range for desktop access, e.g. 1.2.3.4/32 or 7.8.0.0/16. Ignored if DesktopSecurityGroupId is specified.
  • DesktopHasPublicIpAddress - Required. Specify whether the desktop has a public IP address. Set to "true" unless you have AWS VPN or Direct Connect enabled.
  • DesktopInstanceType - Required. Amazon EC2 instance type. G3, G4, P3, and P4 instance types are GPU-enabled.
  • DesktopSecurityGroupId - Optional advanced parameter. EC2 security group for the desktop. Must allow ports 22 (SSH) and 8443 (DCV) from DesktopAccessCIDR, access to EFS and FSx for Lustre, and all traffic within the security group. Leave blank to auto-create.
  • DesktopVpcId - Required. Amazon VPC id.
  • DesktopVpcSubnetId - Required. Amazon VPC subnet. Must be a public subnet with an Internet Gateway, or a private subnet with a NAT gateway, for Internet access.
  • EBSOptimized - Required. Enable network optimization for EBS (default is true).
  • EFSFileSystemId - Optional advanced parameter. Existing EFS file-system id with a network mount target accessible from DesktopVpcSubnetId. Use with DesktopSecurityGroupId. Leave blank to create a new file-system.
  • EFSMountPath - Absolute path where the EFS file-system is mounted (default is /home/ubuntu/efs).
  • EbsVolumeSize - Required. Size of the EBS volume (default is 500 GB).
  • EbsVolumeType - Required. EBS volume type (default is gp3).
  • FSxCapacity - Optional. Capacity of the FSx for Lustre file-system in multiples of 1200 GB (default is 1200 GB). See the FSxForLustre parameter.
  • FSxForLustre - Optional. Enable the FSx for Lustre file-system (disabled by default). When enabled, automatically imports data from s3://S3Bucket/S3Import. See the S3Bucket and S3Import parameters.
  • FSxMountPath - FSx file-system mount path (default is /home/ubuntu/fsx).
  • KeyName - Required. EC2 key pair name for SSH access. You must have the private key.
  • S3Bucket - Required. S3 bucket name for data storage. May be empty at stack creation.
  • S3Import - Optional. S3 import prefix for the FSx file-system. See the FSxForLustre parameter.
  • UbuntuAMIOverride - Optional advanced parameter to override the AMI. Leave blank to use the default AMIs. See AWSUbuntuAMIType.

Security

See CONTRIBUTING for more information.

License

This project is licensed under the MIT-0 License.
