Commit f90a11f: update with DSv0.1 improvements
Parent: 9692271

9 files changed: +561, -100 lines

Diff for: config.py (+8, -1)

```diff
@@ -27,7 +27,14 @@
 # SQS QUEUE INFORMATION:
 SQS_QUEUE_NAME = APP_NAME + 'Queue'
 SQS_MESSAGE_VISIBILITY = 4*60*60 # Timeout (secs) for messages in flight (average time to be processed)
-SQS_DEAD_LETTER_QUEUE = 'arn:aws:sqs:some-region:111111100000:DeadMessages'
+SQS_DEAD_LETTER_QUEUE = 'user_DeadMessages'
+
+# MONITORING
+AUTO_MONITOR = 'True'
+
+# CLOUDWATCH DASHBOARD CREATION
+CREATE_DASHBOARD = 'True' # Create a dashboard in Cloudwatch for run
+CLEAN_DASHBOARD = 'True' # Automatically remove dashboard at end of run with Monitor
 
 # REDUNDANCY CHECKS
 CHECK_IF_DONE_BOOL = 'False' #True or False - should it check if there is already a .zarr file and delete the job if yes?
```
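Note that `SQS_DEAD_LETTER_QUEUE` changes here from a full queue ARN to a bare queue name, which matches the documentation change below saying the queue "will be automatically made if it doesn't exist already." As a minimal sketch of how a name-based dead-letter queue can be created and attached to the main job queue with boto3 — the function and parameter names are hypothetical, not the project's actual implementation:

```python
# Hypothetical sketch: create a dead-letter queue by name (e.g. 'user_DeadMessages')
# and attach it to the main job queue via a RedrivePolicy. Not DS's actual code.
import json
import boto3

sqs = boto3.client("sqs")

def ensure_dead_letter_queue(main_queue_url, dlq_name, max_receives=10):
    # create_queue is idempotent if the queue already exists with the same attributes
    dlq_url = sqs.create_queue(QueueName=dlq_name)["QueueUrl"]
    dlq_arn = sqs.get_queue_attributes(
        QueueUrl=dlq_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]
    # after max_receives failed receives, SQS moves the message to the DLQ
    sqs.set_queue_attributes(
        QueueUrl=main_queue_url,
        Attributes={
            "RedrivePolicy": json.dumps(
                {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": str(max_receives)}
            )
        },
    )
```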

Diff for: documentation/DOZC-documentation/overview_2.md (+13, -3)

```diff
@@ -5,7 +5,7 @@
 The steps for actually running the Distributed-OMEZARRCreator code are outlined in the repository [README](https://github.com/DistributedScience/Distributed-OMEZARRCreator/blob/master/README.md), and details of the parameters you set in each step are on their respective Documentation pages ([Step 1: Config](step_1_configuration.md), [Step 2: Jobs](step_2_submit_jobs.md), [Step 3: Fleet](step_3_start_cluster.md), and optional [Step 4: Monitor](step_4_monitor.md)).
 We'll give an overview of what happens in AWS at each step here and explain what AWS does automatically once you have it set up.
 
-![Distributed-Something Chronological Overview](images/Distributed-Something_chronological_overview.png)
+![Distributed-OMEZARRCreator Chronological Overview](images/Distributed-OMEZARRCreator_chronological_overview.png)
 
 **Step 1 (A)**:
 In the Config file you set quite a number of specifics that are used by EC2, ECS, SQS, and in making Dockers.
@@ -42,7 +42,7 @@ If SQS tells them there are no visible jobs then they shut themselves down.
 **Optional Step 4 (E)**:
 If you choose to run `python3 run.py monitor` it will automatically scale down your hardware (e.g. intelligently scale down your spot fleet request) during a run and clean up all of the infrastructure you created for the run at the end of the run.
 
-## What does this look like?
+## What does an instance configuration look like?
 
 ![Example Instance Configuration](images/sample_DCP_config_1.png)
 
@@ -65,4 +65,14 @@ How long a job takes to run and how quickly you need the data may also affect ho
 * Running a few large Docker containers (as opposed to many small ones) increases the amount of memory all the copies of your software are sharing, decreasing the likelihood you'll run out of memory if you stagger your job start times.
 However, you're also at a greater risk of running out of hard disk space.
 
-Keep an eye on all of the logs the first few times you run any workflow and you'll get a sense of whether your resources are being utilized well or if you need to do more tweaking of your configuration.
+Keep an eye on all of the logs the first few times you run any workflow and you'll get a sense of whether your resources are being utilized well or if you need to do more tweaking of your configuration.
+
+## What does this look like on AWS?
+The following five are the primary resources that Distributed-OMEZARRCreator interacts with.
+After you have finished [preparing for Distributed-OMEZARRCreator](step_0_prep), you do not need to directly interact with any of these services outside of Distributed-OMEZARRCreator.
+If you would like a granular view of what Distributed-OMEZARRCreator is doing while it runs, you can open each console in a separate tab in your browser and watch their individual behaviors, though this is not necessary, especially if you run the [monitor command](step_4_monitor.md) and/or have DS automatically create a Dashboard for you (see [Configuration](step_1_configuration.md)).
+* [S3 Console](https://console.aws.amazon.com/s3)
+* [EC2 Console](https://console.aws.amazon.com/ec2/)
+* [ECS Console](https://console.aws.amazon.com/ecs/)
+* [SQS Console](https://console.aws.amazon.com/sqs/)
+* [CloudWatch Console](https://console.aws.amazon.com/cloudwatch/)
```
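The "scale down your hardware" behavior of Step 4 amounts to reconciling spot fleet capacity against what is left in SQS. Below is a minimal sketch of that idea, assuming boto3 and hypothetical function names; the actual monitor logic in `run.py` is more involved:

```python
# Hypothetical sketch of monitor-style downscaling: shrink the spot fleet toward
# the number of jobs still in SQS, and cancel the fleet when the queue is empty.
import boto3

ec2 = boto3.client("ec2")
sqs = boto3.client("sqs")

def scale_fleet_to_queue(fleet_id, queue_url):
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages",
                        "ApproximateNumberOfMessagesNotVisible"],
    )["Attributes"]
    visible = int(attrs["ApproximateNumberOfMessages"])
    in_flight = int(attrs["ApproximateNumberOfMessagesNotVisible"])
    if visible + in_flight == 0:
        # nothing left to do: tear the fleet down and terminate its instances
        ec2.cancel_spot_fleet_requests(
            SpotFleetRequestIds=[fleet_id], TerminateInstances=True
        )
    elif visible == 0:
        # only in-flight jobs remain: stop requesting replacement capacity but
        # keep busy instances alive until they finish
        ec2.modify_spot_fleet_request(
            SpotFleetRequestId=fleet_id,
            TargetCapacity=in_flight,
            ExcessCapacityTerminationPolicy="noTermination",
        )
```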

Diff for: documentation/DOZC-documentation/step_0_prep.md (+56, -59)

```diff
@@ -1,103 +1,100 @@
 # Step 0: Prep
+There are two classes of AWS resources that Distributed-OMEZARRCreator interacts with: 1) infrastructure that is made once per AWS account to enable any Distributed-OMEZARRCreator implementation to run and 2) infrastructure that is made and destroyed with every run.
+This section describes the creation of the first class of AWS infrastructure and only needs to be followed once per account.
 
-Distributed-OMEZARRCreator runs many parallel jobs in EC2 instances that are automatically managed by ECS.
-To get jobs started, a control node to submit jobs and monitor progress is needed.
-This section describes what you need in AWS and in the control node to get started.
-This guide only needs to be followed once per account.
-(Though we recommend each user has their own control node, further control nodes can be created from an AMI after this guide has been followed to completion once.)
-
-
-## 1. AWS Configuration
-
-The AWS resources involved in running Distributed-OMEZARRCreator can be primarily configured using the [AWS Web Console](https://aws.amazon.com/console/).
-The architecture of Distributed-OMEZARRCreator is based in the [worker pattern](https://aws.amazon.com/blogs/compute/better-together-amazon-ecs-and-aws-lambda/) for distributed systems.
-We have adapted and simplified that architecture for Distributed-OMEZARRCreator.
-
-You need an active account configured to proceed.
-Log in into your AWS account, and make sure the following list of resources is created:
+## AWS Configuration
+The AWS resources involved in running Distributed-OMEZARRCreator are configured using the [AWS Web Console](https://aws.amazon.com/console/) and a setup script we provide ([setup_AWS.py](../../setup_AWS.py)).
+You need an active AWS account configured to proceed.
+Login into your AWS account, and make sure the following list of resources is created:
 
-### 1.1 Access keys
-* Get [security credentials](http://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html) for your account.
+### 1.1 Manually created resources
+* **Security Credentials**: Get [security credentials](http://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html) for your account.
 Store your credentials in a safe place that you can access later.
-* You will probably need an ssh key to login into your EC2 instances (control or worker nodes).
+* **SSH Key**: You will probably need an ssh key to login into your EC2 instances (control or worker nodes).
 [Generate an SSH key](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html) and store it in a safe place for later use.
 If you'd rather, you can generate a new key pair to use for this during creation of the control node; make sure to `chmod 600` the private key when you download it.
-
-### 1.2 Roles and permissions
-* You can use your default VPC, subnet, and security groups; you should add an inbound SSH connection from your IP address to your security group.
-* [Create an ecsInstanceRole](http://docs.aws.amazon.com/AmazonECS/latest/developerguide/instance_IAM_role.html) with appropriate permissions (An S3 bucket access policy CloudWatchFullAccess, CloudWatchActionEC2Access, AmazonEC2ContainerServiceforEC2Role policies, ec2.amazonaws.com as a Trusted Entity)
-* [Create an aws-ec2-spot-fleet-tagging-role](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-fleet-requests.html) with appropriate permissions (just needs AmazonEC2SpotFleetTaggingRole); ensure that in the "Trust Relationships" tab it says "spotfleet.amazonaws.com" rather than "ec2.amazonaws.com" (edit this if necessary).
-In the current interface, it's easiest to click "Create role", select "EC2" from the main service list, then select "EC2- Spot Fleet Tagging".
+* **SSH Connection**: You can use your default AWS account VPC, subnet, and security groups.
+You should add an inbound SSH connection from your IP address to your security group.
+
+### 1.2 Automatically created resources
+* Run setup_AWS by entering `python setup_AWS.py` from your command line.
+It will automatically create:
+* an [ecsInstanceRole](http://docs.aws.amazon.com/AmazonECS/latest/developerguide/instance_IAM_role.html) with appropriate permissions.
+This role is used by the EC2 instances generated by your spot fleet request and coordinated by ECS.
+* an [aws-ec2-spot-fleet-tagging-role](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-fleet-requests.html) with appropriate permissions.
+This role grants the Spot Fleet the permissions to request, launch, terminate, and tag instances.
+* an SNS topic that is used for triggering the auto-Monitor.
+* a Monitor lambda function that is used for auto-monitoring of your runs (see [Step 4: Monitor](step_4_monitor.md) for more information).
 
 ### 1.3 Auxiliary Resources
+*You can certainly configure Distributed-OMEZARRCreator for use without S3, but most DS implementations use S3 for storage.*
 * [Create an S3 bucket](http://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html) and upload your data to it.
-* Add permissions to your bucket so that [logs can be exported to it](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/S3ExportTasksConsole.html) (Step 3, first code block)
-* [Create an SQS](http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSGettingStartedGuide/CreatingQueue.html) queue for unprocessable-messages to be dumped into (aka a DeadLetterQueue).
-
-### 1.4 Primary Resources
-The following five are the primary resources that Distributed-OMEZARRCreator interacts with.
-After you have finished preparing for Distributed-OMEZARRCreator (this guide), you do not need to directly interact with any of these services outside of Distributed-OMEZARRCreator.
-If you would like a granular view of [what Distributed-OMEZARRCreator is doing while it runs](overview_2.md), you can open each console in a separate tab in your browser and watch their individual behaviors, though this is not necessary, especially if you run the [monitor command](step_4_monitor.md) and/or enable auto-Dashboard creation in your [configuration](step_1_configuration.md).
-* [S3 Console](https://console.aws.amazon.com/s3)
-* [EC2 Console](https://console.aws.amazon.com/ec2/)
-* [ECS Console](https://console.aws.amazon.com/ecs/)
-* [SQS Console](https://console.aws.amazon.com/sqs/)
-* [CloudWatch Console](https://console.aws.amazon.com/cloudwatch/)
-
-### 1.5 Spot Limits
-AWS initially [limits the number of spot instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-limits.html) you can use at one time.
-You can request more through a process in the linked documentation.
+Add permissions to your bucket so that [logs can be exported to it](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/S3ExportTasksConsole.html) (Step 3, first code block).
+
+### 1.4 Increase Spot Limits
+AWS initially [limits the number of spot instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-limits.html) you can use at one time; you can request more through a process in the linked documentation.
 Depending on your workflow (your scale and how you group your jobs), this may not be necessary.
 
-## 2. The Control Node
-The control node can be your local machine if it is configured properly, or it can also be a small instance in AWS.
+## The Control Node
+The control node is a machine that is used for running the Distributed-OMEZARRCreator scripts.
+It can be your local machine, if it is configured properly, or it can also be a small instance in AWS.
 We prefer to have a small EC2 instance dedicated to controlling our Distributed-OMEZARRCreator workflows for simplicity of access and configuration.
-To login in an EC2 machine you need an ssh key that can be generated in the web console.
+To login in an EC2 machine you need an SSH key that can be generated in the web console.
 Each time you launch an EC2 instance you have to confirm having this key (which is a .pem file).
 This machine is needed only for submitting jobs, and does not have any special computational requirements, so you can use a micro instance to run basic scripts to proceed.
+(Though we recommend each user has their own control node, further control nodes can be created from an AMI after this guide has been followed to completion once.)
 
 The control node needs the following tools to successfully run Distributed-OMEZARRCreator.
-Here we assume you are using the command line in a Linux machine, but you are free to try other operating systems too.
+These instructions assume you are using the command line in a Linux machine, but you are free to try other operating systems too.
 
-### 2.1 Make your own
+### Create Control Node from Scratch
+#### 2.1 Install Python 3.8 or higher and pip
+Most scripts are written in Python and support Python 3.8 and 3.9.
+Follow installation instructions for your platform to install Python.
+pip should be included with the installation of Python 3.8 or 3.9, but if you do not have it installed, install pip.
 
-#### 2.1.1 Clone this repo
+#### 2.2 Clone this repository and install requirements
 You will need the scripts in Distributed-OMEZARRCreator locally available in your control node.
 <pre>
 sudo apt-get install git
 git clone https://github.com/DistributedScience/Distributed-OMEZARRCreator.git
 cd Distributed-OMEZARRCreator/
 git pull
-</pre>
-
-#### 2.1.2 Python 3.8 or higher and pip
-Most scripts are written in Python and support Python 3.8 and 3.9.
-Follow installation instructions for your platform to install python and, if needed, pip.
-After Python has been installed, you need to install the requirements for Distributed-Something following this steps:
-
-<pre>
-cd Distributed-OMEZARRCreator/files
+# install requirements
+cd files
 sudo pip install -r requirements.txt
 </pre>
 
-#### 2.1.3 AWS CLI
+#### 2.3 Install AWS CLI
 The command line interface is the main mode of interaction between the local node and the resources in AWS.
-You need to install [awscli](http://docs.aws.amazon.com/cli/latest/userguide/installing.html) for Distributed-Something to work properly:
+You need to install [awscli](http://docs.aws.amazon.com/cli/latest/userguide/installing.html) for Distributed-OMEZARRCreator to work properly:
 
 <pre>
 sudo pip install awscli --ignore-installed six
 sudo pip install --upgrade awscli
 aws configure
 </pre>
 
-When running the last step, you will need to enter your AWS credentials.
+When running the last step (`aws configure`), you will need to enter your AWS credentials.
 Make sure to set the region correctly (i.e. us-west-1 or eu-east-1, not eu-west-2a), and set the default file type to json.
 
 #### 2.1.4 s3fs-fuse (optional)
 [s3fs-fuse](https://github.com/s3fs-fuse/s3fs-fuse) allows you to mount your s3 bucket as a pseudo-file system.
 It does not have all the performance of a real file system, but allows you to easily access all the files in your s3 bucket.
 Follow the instructions at the link to mount your bucket.
 
-#### 2.1.5 Create Control Node AMI (optional)
+### Create Control Node from AMI (optional)
 Once you've set up the other software (and gotten a job running, so you know everything is set up correctly), you can use Amazon's web console to set this up as an Amazon Machine Instance, or AMI, to replicate the current state of the hard drive.
 Create future control nodes using this AMI so that you don't need to repeat the above installation.
+
+## Removing long-term infrastructure
+If you decide that you never want to run Distributed-OMEZARRCreator again and would like to remove the long-term infrastructure, follow these steps.
+
+### Remove Roles, Lambda Monitor, and Monitor SNS
+<pre>
+python setup_AWS.py destroy
+</pre>
+
+### Remove EC2 Control node
+If you made your control node as an EC2 instance, while in the AWS console, select the instance.
+Select `Instance state` => `Terminate instance`.
```
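To make the new "Automatically created resources" section concrete, here is a minimal sketch of the kind of boto3 calls a script like `setup_AWS.py` makes to provision the two roles and the Monitor SNS topic. The trust policies and managed-policy attachments shown are standard for these role names, but the topic name and the exact policy set are assumptions, not the script's actual contents:

```python
# Hypothetical sketch of one-time account setup; the real setup_AWS.py may attach
# different policies and also deploys the Monitor Lambda function.
import json
import boto3

iam = boto3.client("iam")
sns = boto3.client("sns")

EC2_TRUST = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow",
                   "Principal": {"Service": "ec2.amazonaws.com"},
                   "Action": "sts:AssumeRole"}],
})
SPOTFLEET_TRUST = EC2_TRUST.replace("ec2.amazonaws.com", "spotfleet.amazonaws.com")

def setup():
    # role assumed by the worker EC2 instances that ECS coordinates
    iam.create_role(RoleName="ecsInstanceRole",
                    AssumeRolePolicyDocument=EC2_TRUST)
    iam.attach_role_policy(
        RoleName="ecsInstanceRole",
        PolicyArn="arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role",
    )
    # role that lets the spot fleet request, launch, terminate, and tag instances
    iam.create_role(RoleName="aws-ec2-spot-fleet-tagging-role",
                    AssumeRolePolicyDocument=SPOTFLEET_TRUST)
    iam.attach_role_policy(
        RoleName="aws-ec2-spot-fleet-tagging-role",
        PolicyArn="arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetTaggingRole",
    )
    # topic that triggers the auto-Monitor (topic name is an assumption)
    sns.create_topic(Name="Monitor")
```

`python setup_AWS.py destroy` would then mirror this: detach the policies, delete the roles, the topic, and the Monitor Lambda.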

Diff for: documentation/DOZC-documentation/step_1_configuration.md (+13)

```diff
@@ -54,10 +54,23 @@ We recommend setting this to slightly longer than the average amount of time it
 See [SQS_QUEUE_information](SQS_QUEUE_information) for more information.
 * **SQS_DEAD_LETTER_QUEUE:** The name of the queue to send jobs to if they fail to process correctly multiple times.
 This keeps a single bad job (such as one where a single file has been corrupted) from keeping your cluster active indefinitely.
+This queue will be automatically made if it doesn't exist already.
 See [Step 0: Prep](step_0_prep.med) for more information.
 
 ***
 
+### MONITORING
+* **AUTO_MONITOR:** Whether or not to have Auto-Monitor automatically monitor your jobs.
+
+***
+
+### CLOUDWATCH DASHBOARD CREATION
+
+* **CREATE_DASHBOARD:** Create a Cloudwatch Dashboard that plots run metrics?
+* **CLEAN_DASHBOARD:** Automatically clean up the Cloudwatch Dashboard at the end of the run?
+
+***
+
 ### REDUNDANCY CHECKS
 
 * **CHECK_IF_DONE_BOOL:** Whether or not to check the output folder before proceeding.
```
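For orientation, the dashboard flags map onto two CloudWatch API calls: `put_dashboard` at run start and `delete_dashboards` at cleanup. A minimal sketch follows, assuming boto3; the widget layout, dashboard name, and hard-coded region are illustrative only, and note that the config flags are the *strings* `'True'`/`'False'`, so they must be compared explicitly:

```python
# Hypothetical sketch of what CREATE_DASHBOARD / CLEAN_DASHBOARD could drive;
# the widget and names here are illustrative, not DS's actual dashboard.
import json
import boto3

from config import APP_NAME, SQS_QUEUE_NAME, CREATE_DASHBOARD, CLEAN_DASHBOARD

cloudwatch = boto3.client("cloudwatch")

QUEUE_WIDGET = {
    "type": "metric",
    "x": 0, "y": 0, "width": 12, "height": 6,
    "properties": {
        "title": "Visible jobs",
        "metrics": [["AWS/SQS", "ApproximateNumberOfMessagesVisible",
                     "QueueName", SQS_QUEUE_NAME]],
        "period": 60,
        "stat": "Average",
        "region": "us-east-1",  # assumption: set to your configured region
    },
}

def create_dashboard():
    # flags are string-valued in config.py, so compare against 'True' explicitly
    if CREATE_DASHBOARD == "True":
        cloudwatch.put_dashboard(
            DashboardName=APP_NAME + "Dashboard",
            DashboardBody=json.dumps({"widgets": [QUEUE_WIDGET]}),
        )

def clean_dashboard():
    if CLEAN_DASHBOARD == "True":
        cloudwatch.delete_dashboards(DashboardNames=[APP_NAME + "Dashboard"])
```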
