AWS Readme

For maximum flexibility, we want the option to deploy our own LLM - most likely Meta's LLaMA 2 - onto Amazon SageMaker, AWS's machine learning service. This folder contains the infrastructure-as-code (IaC) resources to automate that process.

See below for a step-by-step guide on how to deploy LLaMA 2 onto AWS from scratch.

Contents

  • Creating a containerised AI model
    • Download LLaMA
    • Build Docker image
    • Run Docker container and model
  • AWS setup
    • Deploy AWS instances
    • Push containerised AI model to AWS
    • Run ML Pipeline
    • Invoke endpoint
    • Cleanup

Creating a containerised AI model

First of all, install Docker.

Download LLaMA

In this tutorial, we download Meta's LLaMA 2. You may also wish to consider deploying a different open LLM, for example one taken from the Hugging Face Open LLM Leaderboard.

  • In terminal, navigate to AWS/llama, our slightly modified copy of the Meta LLaMA repo.

  • Request a LLaMA download link from Meta's download request page.

  • Ensure you have wget and md5sum installed (example install commands are given after this list).

  • Inside AWS/llama, run

    ./download.sh

    Note: for some reason I had to run

    . download.sh

    instead.

  • Paste the download link from the email you receive and select which model you wish to download. For our purposes here, make sure it is one of the "chat" models.
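
If wget or md5sum is missing, it can usually be installed from your package manager. For example (the package names below are the usual ones, but may differ on your system):

brew install wget md5sha1sum        # macOS: the md5sha1sum formula provides md5sum
sudo apt-get install wget coreutils # Debian/Ubuntu: md5sum is part of coreutils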

Build Docker image

We can now build the Docker image.

  • Start the Docker daemon by one of the following:

    • opening Docker Desktop.
    • running sudo systemctl start docker in a Linux terminal.
    • running dockerd or sudo dockerd. See the Docker documentation for more details.
  • Build the Docker image. In terminal, navigate to the AWS/llama folder containing the Dockerfile (discussed in the note below), and run

    docker build -t llama-image .

    where the string llama-image can be whatever name you like.

    Note that this will take a while - ~600 seconds on my laptop.
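
You can confirm that the build succeeded by listing your local Docker images, which should now include llama-image:

docker images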

Note: as mentioned above, this repo contains some slightly modified LLaMA code. Here are the modifications:

  • AWS/llama/custom_chat_completion.py has been added to allow you to send LLaMA custom prompts as follows:

    # Set --ckpt_dir to the model version you downloaded.
    torchrun --nproc_per_node 1 custom_chat_completion.py \
      --ckpt_dir llama-2-7b-chat/ \
      --tokenizer_path tokenizer.model \
      --max_seq_len 512 --max_batch_size 6 \
      --user_message "Tell me about Conduit, the LLM-based text editor."
  • AWS/llama/Dockerfile specifies how to build the container.

    • To avoid incurring unnecessary AWS ECR expenses, we have removed as many files from the original LLaMA repo as possible. The Dockerfile copies all files from this folder, then removes the license (which we are required to host in the repo) and download.sh, neither of which is needed to run the model.

    • We also set the entrypoint of the model as the custom_chat_completion.py file defined above, so we can send prompts to our model.

    • It is also possible to configure the entrypoint with more arguments, so that we have to specify fewer arguments when calling the model.
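
To make the description above more concrete, here is a minimal sketch of what a Dockerfile along these lines could look like. It is an illustrative reconstruction, not the actual contents of AWS/llama/Dockerfile - the base image and default arguments in particular are assumptions:

# Illustrative sketch only - see AWS/llama/Dockerfile for the real definition.
# Assumed base image with Python and PyTorch preinstalled.
FROM pytorch/pytorch:latest

WORKDIR /opt/llama

# Copy the (already trimmed-down) LLaMA code and downloaded weights.
COPY . .

# Install the LLaMA package, then drop files that are not needed at runtime,
# as described above (the license stays in the repo, not in the image).
RUN pip install -e . && rm -f LICENSE download.sh

# Default the entrypoint to the custom chat script so that prompts can be
# passed straight through as arguments to docker run.
ENTRYPOINT ["torchrun", "--nproc_per_node", "1", "custom_chat_completion.py", "--tokenizer_path", "tokenizer.model"]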

Run Docker container and model

  • Once you have built the image, start a container from it by running:

    docker run -it --name llama-container llama-image

    where llama-container can be whatever name you like, and llama-image should match with the name you chose for the image above.

    This will start a new container that will accept commands and keep running until you stop it.

    • -i ("interactive") keeps STDIN open and -t allocates a pseudo-terminal, together allowing you to interact with the container's CLI.
  • With the image built, you can send the model prompts by passing arguments through to the entrypoint, e.g.:

    # Set --ckpt_dir to the model version you downloaded (one of the "chat" models).
    docker run -it --name llama-container llama-image \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 4 \
    --user_message "Tell me about Conduit, the LLM-based text editor."
  • When you're done, stop your container with:

    docker stop llama-container
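
Note: docker run will refuse to start a container whose --name is already in use, so if you re-run one of the docker run commands above you may first need to remove the old container (or simply pick a different name):

docker rm llama-container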

AWS setup

Now we've created our containerised AI model, we're ready to deploy it on AWS, as discussed below. See this Amazon article for more details on this process.

First of all, install Terraform.

Deploy AWS instances

With Terraform installed, the following can be run to deploy the AWS SageMaker instances specified in terraform.tfvars.

[Warning: This will cost you money if you're not careful! Make sure to clean up when you're done.]

export AWS_PROFILE=<your_aws_cli_profile_name>
cd terraform/infrastructure
terraform init
terraform plan
terraform apply
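
If you don't yet have an AWS CLI profile configured, you can create one with:

aws configure --profile <your_aws_cli_profile_name>

which will prompt you for an access key ID, secret access key, default region and output format.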

Push containerised AI model to AWS

We must now push the containerised AI model we built above to Amazon ECR ("Elastic Container Registry"). From here, our SageMaker instance can access our AI model.

To do this, first create an ECR repo, then run:

cd location/of/your/container
export AWS_PROFILE=<your_aws_cli_profile_name>
aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin <account_number>.dkr.ecr.eu-west-1.amazonaws.com
docker build -t llama-image .
docker tag llama-image:latest <account_number>.dkr.ecr.eu-west-1.amazonaws.com/<ecr_repository_name>:latest
docker push <account_number>.dkr.ecr.eu-west-1.amazonaws.com/<ecr_repository_name>

with the appropriate details substituted into the angled brackets above.
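
The ECR repo itself can be created either in the AWS console or from the CLI, for example (using the same placeholder repository name and region as above):

aws ecr create-repository --repository-name <ecr_repository_name> --region eu-west-1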

Run ML Pipeline

We now go to AWS Step Functions, a visual workflow tool, to run the ML pipeline. From here we can:

  • Run our AI model.

  • Train our AI model on data stored in an S3 bucket.

  • Create an endpoint so we can access our model for inference.
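
If you prefer to start the pipeline programmatically rather than from the Step Functions console, the same API is available via Boto3. A minimal sketch, assuming you know the ARN of the state machine created by Terraform (the ARN and input below are placeholders):

import json
import boto3

sfn = boto3.client('stepfunctions')

response = sfn.start_execution(
    stateMachineArn='arn:aws:states:eu-west-1:<account_number>:stateMachine:<state_machine_name>',
    input=json.dumps({}),  # whatever input your state machine expects
)
print(response['executionArn'])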

Invoke endpoint

Finally, we can run a Python script using Boto3, AWS's Python SDK, to invoke the endpoint of our AI model. For example:

import boto3
from io import StringIO
import pandas as pd

client = boto3.client('sagemaker-runtime')

endpoint_name = 'your-endpoint-name' # The name of the SageMaker endpoint created above.
content_type = "text/csv"   # The MIME type of the input data in the request body.

payload = pd.DataFrame([[1.5,0.2,4.4,2.6]])
csv_file = StringIO()
payload.to_csv(csv_file, sep=",", header=False, index=False)
payload_as_csv = csv_file.getvalue()

response = client.invoke_endpoint(
    EndpointName=endpoint_name, 
    ContentType=content_type,
    Body=payload_as_csv
    )

label = response['Body'].read().decode('utf-8')
print(label)

In other words, we can finally send our model data (e.g. a question), and it will send a response back (e.g. an answer).
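
Note that the CSV payload above follows the generic SageMaker example; for a chat model such as LLaMA you would more likely send a JSON prompt. The exact request format depends on how the container parses its input, so treat the following as a sketch only:

import json
import boto3

client = boto3.client('sagemaker-runtime')

response = client.invoke_endpoint(
    EndpointName='your-endpoint-name',  # Your endpoint name.
    ContentType='application/json',
    Body=json.dumps({"user_message": "Tell me about Conduit, the LLM-based text editor."}),
)
print(response['Body'].read().decode('utf-8'))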

Cleanup

When we're done, we must destroy all of our AWS infrastructure, else we will incur (substantial!) additional costs.

  1. On the Amazon S3 console, delete the training set and all the models we trained.

    (The models can also be deleted from the AWS CLI.)

  2. Delete the SageMaker endpoints, endpoint configuration, and models created via Step Functions - either via the SageMaker console or the AWS CLI (see the example CLI commands after this list).

  3. Destroy our infrastructure:

    cd terraform/infrastructure
    terraform destroy
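
For reference, steps 1 and 2 can also be carried out from the AWS CLI; for example (the bucket, endpoint, endpoint configuration and model names are placeholders):

aws s3 rm s3://<your_bucket_name> --recursive
aws sagemaker delete-endpoint --endpoint-name <endpoint_name>
aws sagemaker delete-endpoint-config --endpoint-config-name <endpoint_config_name>
aws sagemaker delete-model --model-name <model_name>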