Commit 19f7ed2

Merge branch 'main' of github.com:lifemapper/bison into main
zzeppozz committed Oct 31, 2024
2 parents f0396a5 + 4223f67 commit 19f7ed2
Showing 6 changed files with 199 additions and 167 deletions.
17 changes: 11 additions & 6 deletions _sphinx_config/index.rst
@@ -9,7 +9,7 @@ Current
------------

.. toctree::
:maxdepth: 2
:maxdepth: 1

pages/about
pages/workflow
@@ -18,23 +18,28 @@ Setup AWS
------------

.. toctree::
:maxdepth: 2
:maxdepth: 1

pages/aws/aws_setup
pages/aws/ec2_setup
pages/aws/lambda
pages/aws/roles
pages/aws/automation

Using BISON
------------

.. toctree::
:maxdepth: 2
:maxdepth: 1

pages/interaction/about
pages/interaction/debug
pages/interaction/deploy

History
Old Stuff
------------

.. toctree::
:maxdepth: 2
:maxdepth: 1

pages/history/year4_planB
pages/history/year4_planA
169 changes: 64 additions & 105 deletions _sphinx_config/pages/aws/automation.rst
@@ -1,27 +1,75 @@
Create lambda function to initiate processing
Workflow Automation
#####################################

Lambda Functions For Workflow Steps
=====================================

Overview
----------
Lambda functions:

* can run for 15 minutes or less
* must contain only a single function

Therefore, long-running processes and computations that require complex programming
are less suitable for Lambda. The alternative used in this workflow is to have
Lambda launch an EC2 instance that completes the processing.

For the BISON workflow the first step is to annotate RIIS records with GBIF accepted
taxa, a process that takes 30-60 minutes to resolve the approximately 15K names in RIIS.

The final step is to build 2D matrices, species by region, from the data, and compute
biogeographic statistics on them. This process requires more complex code, which is
present in the BISON codebase.

In both cases, we install the code onto the newly launched EC2 instance, and build a
Docker container to install all dependencies and run the code.

In future iterations, we will download a pre-built Docker image.

More detailed setup instructions are in the lambda page (pages/aws/lambda).


Initiate Workflow on a Schedule
------------------------------------------------
* Create a lambda function for execution when the trigger condition is activated:
aws/events/bison_find_current_gbif_lambda.py

* This trigger condition is a file deposited in the BISON bucket
Step 1: Annotate RIIS with GBIF accepted taxa
................................................

* TODO: change to the first of the month
This ensures that we can match RIIS records with the GBIF records that we
will annotate with a RIIS determination. This process requires sending the scientific
name in each RIIS record to the GBIF 'species' API, to find the accepted name,
`acceptedScientificName` (and its GBIF identifier, `acceptedTaxonKey`).
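
A minimal sketch of this resolution step, assuming the public GBIF species match
API and the ``requests`` package (exact response fields may vary)::

    import requests

    GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"

    def resolve_accepted_taxon(scientific_name):
        """Return the accepted GBIF taxon key for a RIIS scientific name."""
        resp = requests.get(GBIF_MATCH_URL, params={"name": scientific_name})
        resp.raise_for_status()
        rec = resp.json()
        # Synonyms carry an acceptedUsageKey; accepted names are their own taxon.
        if rec.get("status") == "SYNONYM" and "acceptedUsageKey" in rec:
            return rec["acceptedUsageKey"]
        return rec.get("usageKey")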

* The lambda function will delete the new file, and test the existence of
GBIF data for the current month
* Create an AWS EventBridge Schedule

* TODO: change to mount GBIF data in Redshift, subset, unmount
* Create a lambda function for execution when the trigger condition is activated, in
this case the time/date in the schedule:
aws/lambda/bison_s0_annotate_riis_lambda.py

Edit the execution role for lambda function
--------------------------------------------
* Under Configuration/Permissions see the Execution role Role name
(bison_find_current_gbif_lambda-role-fb05ks88) automatically created for this function
* Open in a new window and under Permissions policies, Add permissions
* The lambda function will first make sure the data to be created does not already
exist in S3, executing only if it is missing and returning immediately if it is
present (see the sketch below).

* bison_s3_policy
* redshift_glue_policy
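
A minimal sketch of that existence check with boto3 (bucket and key names are
hypothetical)::

    import boto3
    from botocore.exceptions import ClientError

    def output_exists(bucket, key):
        """Return True if the step's expected output is already in S3."""
        s3 = boto3.client("s3")
        try:
            s3.head_object(Bucket=bucket, Key=key)
            return True
        except ClientError as err:
            if err.response["Error"]["Code"] == "404":
                return False
            raise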

Create trigger to initiate lambda function

Triggering execution
-------------------------
The first step will be executed on a schedule, such as the second day of the month
(GBIF data is deposited on the first day of the month).

Scheduled execution (Temporary): Each step after the first is also executed on a
schedule that roughly estimates completion of the previous step. Steps with a
dependency on previous outputs will first check for the existence of required inputs,
failing immediately if inputs are not present.

Automatic execution (TODO): The successful deposition of output of the first
(scheduled) and all following steps into S3 or Redshift triggers subsequent steps.

Both automatic and scheduled execution will require examining the logs to ensure
successful completion.
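
As a sketch, such a schedule can be created with boto3's EventBridge Scheduler
client; the names and ARNs below are hypothetical::

    import boto3

    scheduler = boto3.client("scheduler")

    scheduler.create_schedule(
        Name="bison-step0-monthly",
        # 00:00 UTC on the second day of each month, after GBIF deposits data.
        ScheduleExpression="cron(0 0 2 * ? *)",
        FlexibleTimeWindow={"Mode": "OFF"},
        Target={
            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:"
                   "bison_s0_annotate_riis_lambda",
            "RoleArn": "arn:aws:iam::123456789012:role/bison_redshift_lambda_role",
        },
    )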


TODO: Create rule to initiate lambda function based on previous step
----------------------------------------------------------------------

* Check for existence of new GBIF data
@@ -49,92 +97,3 @@ Create trigger to initiate lambda function
* Select target(s)

* AWS service


Lambda to query Redshift
--------------------------------------------

https://repost.aws/knowledge-center/redshift-lambda-function-queries

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/redshift-data/client/execute_statement.html

* Connect to a serverless workgroup (bison), namespace (bison), database name (dev)

* When connecting to a serverless workgroup, specify the workgroup name and database
name. The database user name is derived from the IAM identity. For example,
arn:iam::123456789012:user:foo has the database user name IAM:foo. Also, permission
to call the redshift-serverless:GetCredentials operation is required.
* The redshift:GetClusterCredentialsWithIAM permission is needed for temporary
authentication with a role

Lambda to start EC2 for task
--------------------------------------------

Lambda functions must be single-function tasks that run in less than 15 minutes.
For complex or long-running tasks, we start an EC2 instance containing the bison code
and execute it in a Docker container.

For each task, the lambda function should create a Spot EC2 instance with a template
containing userdata that will either 1) pull the GitHub repo, then build the Docker
image, or 2) pull a Docker image directly.

Annotating the RIIS records with GBIF accepted taxa takes about 1 hour and uses
multiple bison modules.

EC2/Docker setup
....................

* Create the first EC2 Launch Template as a "one-time" Spot instance, no hibernation

* The Launch template should have the following settings::

Name: bison_spot_task
Application and OS Images: Ubuntu
AMI: Ubuntu 24.04 LTS
Architecture: 64-bit ARM
Instance type: t4g.micro
Key pair: bison-task-key
Network settings/Select existing security group: launch-wizard-1
Configure storage: 8 GB gp3 (default)
Details - encrypted
Advanced Details:
IAM instance profile: bison_ec2_s3_role
Shutdown behavior: Terminate
Cloudwatch monitoring: Enable
Purchasing option: Spot instances
Request type: One-time

* Use the launch template to create a version for each task.
* The launch template task versions must have the task name in the description, and
have the following script in the userdata::

#!/bin/bash
sudo apt-get -y update
sudo apt-get -y install docker.io
sudo apt-get -y install docker-compose-v2
git clone https://github.com/lifemapper/bison.git
cd bison
sudo docker compose -f compose.test_task.yml up
sudo shutdown -h now


* For each task **compose.test_task.yml** must be replaced with the appropriate compose file.
* On EC2 instance startup, the userdata script will execute
* The compose file sets an environment variable (TASK_APP) containing a python module
to be executed from the Dockerfile.
* Tasks should deposit outputs and logfiles into S3.
* After completion, the docker container will stop automatically and the EC2 instance
will stop because of the shutdown command in the final line of the userdata script.
* **TODO**: once the workflow is stable, to eliminate Docker build time, create a Docker
image and download it in userdata script.

Lambda setup
....................

Triggering execution
-------------------------
The first step may be executed on a schedule, such as the second day of the month (since
GBIF data is deposited on the first day of the month).

Upon successful completion, the deposition of output into S3 can trigger the
following steps.
34 changes: 34 additions & 0 deletions _sphinx_config/pages/aws/aws_setup.rst
@@ -1,9 +1,13 @@
AWS Resource Setup
###################

Security
********************

Create policies and roles
===========================================================


The :ref:`bison_redshift_lambda_role` allows access to the bison Redshift
namespace/workgroup, lambda functions, EventBridge Scheduler, and S3 data.
The Trusted Relationships on this policy allow each to
@@ -25,6 +29,36 @@ external schema, you may encounter an error indicating that the "dev" database d
exist. This refers to the external database, and may indicate that the role used by the
command and/or namespace differs from the role granted to the schema upon creation.

Create a Security Group for the region
===========================================================

* Test this group!
* Create a security group for the project/region (a boto3 sketch follows this list)

* inbound rules allow:

* Custom TCP, port 8000
* Custom TCP, port 8080
* HTTP, port 80
* HTTPS, port 443
* SSH, port 22

* Consider restricting SSH to campus

* or use launch-wizard-1 security group (created by some EC2 instance creation in 2023)

* inbound rules IPv4:

* Custom TCP 8000
* Custom TCP 8080
* SSH 22
* HTTP 80
* HTTPS 443

* outbound rules IPv4, IPv6:

* All traffic all ports
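
A sketch of creating such a group with boto3, assuming a default VPC and the
hypothetical name ``bison-sg`` (restrict the SSH range to campus addresses where
possible)::

    import boto3

    ec2 = boto3.client("ec2")

    sg = ec2.create_security_group(
        GroupName="bison-sg",
        Description="BISON project security group",
    )
    ec2.authorize_security_group_ingress(
        GroupId=sg["GroupId"],
        IpPermissions=[
            {"IpProtocol": "tcp", "FromPort": port, "ToPort": port,
             "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}
            for port in (22, 80, 443, 8000, 8080)
        ],
    )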

Redshift Namespace and Workgroup
===========================================================

47 changes: 47 additions & 0 deletions _sphinx_config/pages/aws/ec2_setup.rst
@@ -151,3 +151,50 @@ Hop Limit for AWS communication
--http-endpoint enabled

* or in console, add metadata tag/value HttpPutResponseHopLimit/2

EC2/Docker setup
....................

* Create the first EC2 Launch Template as a "one-time" Spot instance, no hibernation

* The Launch template should have the following settings::

Name: bison_spot_task
Application and OS Images: Ubuntu
AMI: Ubuntu 24.04 LTS
Architecture: 64-bit ARM
Instance type: t4g.micro
Key pair: bison-task-key
Network settings/Select existing security group: launch-wizard-1
Configure storage: 8 GB gp3 (default)
Details - encrypted
Advanced Details:
IAM instance profile: bison_ec2_s3_role
Shutdown behavior: Terminate
Cloudwatch monitoring: Enable
Purchasing option: Spot instances
Request type: One-time

* Use the launch template to create a version for each task.
* The launch template task versions must have the task name in the description, and
have the following script in the userdata::

#!/bin/bash
sudo apt-get -y update
sudo apt-get -y install docker.io
sudo apt-get -y install docker-compose-v2
git clone https://github.com/lifemapper/bison.git
cd bison
sudo docker compose -f compose.test_task.yml up
sudo shutdown -h now


* For each task **compose.test_task.yml** must be replaced with the appropriate compose file.
* On EC2 instance startup, the userdata script will execute
* The compose file sets an environment variable (TASK_APP) containing a python module
to be executed from the Dockerfile (see the sketch after this list).
* Tasks should deposit outputs and logfiles into S3.
* After completion, the docker container will stop automatically and the EC2 instance
will stop because of the shutdown command in the final line of the userdata script.
* **TODO**: once the workflow is stable, to eliminate Docker build time, create a Docker
image and download it in userdata script.
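
A hypothetical sketch of a Dockerfile entrypoint that runs the module named in
TASK_APP (the module name below is illustrative)::

    import os
    import runpy

    # TASK_APP names the python module for this task, set by the compose file.
    task_module = os.environ["TASK_APP"]  # e.g. "bison.task.annotate_riis"
    runpy.run_module(task_module, run_name="__main__")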
43 changes: 43 additions & 0 deletions _sphinx_config/pages/aws/lambda.rst
@@ -0,0 +1,43 @@
Create lambda function to initiate processing
----------------------------------------------
* Create a lambda function for execution when the trigger condition is activated,
e.g. aws/lambda/bison_s0_test_task_lambda.py
* This trigger condition can be either a schedule (e.g. midnight on the second day of
every month) or a rule (e.g. a file matching xxx* deposited in an S3 bucket)

Edit the execution role for lambda function
--------------------------------------------
* Under Configuration/Permissions set the Execution role to the Workflow role
(bison_redshift_lambda_role)



Lambda to query Redshift
--------------------------------------------

https://repost.aws/knowledge-center/redshift-lambda-function-queries

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/redshift-data/client/execute_statement.html

* Connect to a serverless workgroup (bison), namespace (bison), database name (dev);
a minimal query sketch follows this list

* When connecting to a serverless workgroup, specify the workgroup name and database
name. The database user name is derived from the IAM identity. For example,
arn:iam::123456789012:user:foo has the database user name IAM:foo. Also, permission
to call the redshift-serverless:GetCredentials operation is required.
* The redshift:GetClusterCredentialsWithIAM permission is needed for temporary
authentication with a role
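
A minimal query sketch with the boto3 Redshift Data API client (the table name is
hypothetical)::

    import boto3

    rs_data = boto3.client("redshift-data")

    # The caller's IAM identity supplies the database user (IAM:<name>).
    response = rs_data.execute_statement(
        WorkgroupName="bison",
        Database="dev",
        Sql="SELECT COUNT(*) FROM public.riis;",
    )
    statement_id = response["Id"]  # poll describe_statement for completion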

Lambda to start EC2 for task
--------------------------------------------

Lambda functions must be single-function tasks that run in less than 15 minutes.
For complex or long-running tasks, we start an EC2 instance containing the bison code
and execute it in a Docker container.

For each task, the lambda function should create a Spot EC2 instance with a template
containing userdata that will either 1) pull the GitHub repo, then build the Docker
image, or 2) pull a Docker image directly.

Annotating the RIIS records with GBIF accepted taxa takes about 1 hour and uses
multiple bison modules.
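
A sketch of launching the task instance from the launch template described in
ec2_setup (the template name and version are assumptions)::

    import boto3

    ec2 = boto3.client("ec2")

    response = ec2.run_instances(
        LaunchTemplate={
            "LaunchTemplateName": "bison_spot_task",
            "Version": "1",  # the version whose userdata runs this task
        },
        MinCount=1,
        MaxCount=1,
    )
    instance_id = response["Instances"][0]["InstanceId"]

The instance runs its userdata, executes the task in a Docker container, and
terminates itself via the shutdown command in the userdata script.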
