Commit 19f7ed2

Merge branch 'main' of github.com:lifemapper/bison into main
zzeppozz committed Oct 31, 2024
2 parents f0396a5 + 4223f67 commit 19f7ed2
Showing 6 changed files with 199 additions and 167 deletions.
17 changes: 11 additions & 6 deletions _sphinx_config/index.rst
@@ -9,7 +9,7 @@ Current
------------

.. toctree::
:maxdepth: 2
:maxdepth: 1

pages/about
pages/workflow
@@ -18,23 +18,28 @@ Setup AWS
------------

.. toctree::
:maxdepth: 2
:maxdepth: 1

pages/aws/aws_setup
pages/aws/ec2_setup
pages/aws/lambda
pages/aws/roles
pages/aws/automation

Using BISON
------------

.. toctree::
:maxdepth: 2
:maxdepth: 1

pages/interaction/about
pages/interaction/debug
pages/interaction/deploy

History
Old Stuff
------------

.. toctree::
:maxdepth: 2
:maxdepth: 1

pages/history/year4_planB
pages/history/year4_planA
169 changes: 64 additions & 105 deletions _sphinx_config/pages/aws/automation.rst
@@ -1,27 +1,75 @@
Create lambda function to initiate processing
Workflow Automation
#####################################

Lambda Functions For Workflow Steps
=====================================

Overview
----------
Lambda functions:

* can run for 15 minutes or less
* must contain only a single function

Therefore, long-running processes and computations that require complex programming
are less suitable for Lambda. The alternative used in this workflow is to have
Lambda launch an EC2 instance that completes the processing.

For the BISON workflow the first step is to annotate RIIS records with GBIF accepted
taxa, a process that takes 30-60 minutes to resolve the approximately 15K names in RIIS.

The final step is to build 2D matrices, species by region, from the data, and compute
biogeographic statistics on them. This process requires more complex code, which is
present in the BISON codebase.

In both cases, we install the code onto the newly launched EC2 instance, and build a
Docker container to install all dependencies and run the code.

In future iterations, we will download a pre-built Docker image.

More detailed setup instructions are in the lambda page (pages/aws/lambda).


Initiate Workflow on a Schedule
------------------------------------------------
* Create a lambda function for execution when the trigger condition is activated:
aws/events/bison_find_current_gbif_lambda.py

* This trigger condition is a file deposited in the BISON bucket
Step 1: Annotate RIIS with GBIF accepted taxa
................................................

* TODO: change to the first of the month
This ensures that we can match RIIS records with the GBIF records that we
will annotate with a RIIS determination. This process requires sending the scientific
name in each RIIS record to the GBIF 'species' API, to find the accepted name,
`acceptedScientificName` (and its GBIF identifier, `acceptedTaxonKey`).
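
A minimal sketch of this resolution step, assuming the public GBIF species match
API and the ``requests`` package (exact response fields may vary)::

    import requests

    GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"

    def resolve_accepted_taxon(scientific_name):
        """Return the accepted GBIF taxon key for a RIIS scientific name."""
        resp = requests.get(GBIF_MATCH_URL, params={"name": scientific_name})
        resp.raise_for_status()
        rec = resp.json()
        # Synonyms carry an acceptedUsageKey; accepted names are their own taxon.
        if rec.get("status") == "SYNONYM" and "acceptedUsageKey" in rec:
            return rec["acceptedUsageKey"]
        return rec.get("usageKey")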

* The lambda function will delete the new file, and test the existence of
GBIF data for the current month
* Create an AWS EventBridge Schedule

* TODO: change to mount GBIF data in Redshift, subset, unmount
* Create a lambda function for execution when the trigger condition is activated, in
this case the time/date in the schedule:
aws/lambda/bison_s0_annotate_riis_lambda.py

Edit the execution role for lambda function
--------------------------------------------
* Under Configuration/Permissions see the Execution role Role name
(bison_find_current_gbif_lambda-role-fb05ks88) automatically created for this function
* Open in a new window and under Permissions policies, Add permissions
* The lambda function will first make sure the data to be created does not already
exist in S3, executing only if it is missing and returning immediately if it is
present (see the sketch below).

* bison_s3_policy
* redshift_glue_policy
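
A minimal sketch of that existence check with boto3 (bucket and key names are
hypothetical)::

    import boto3
    from botocore.exceptions import ClientError

    def output_exists(bucket, key):
        """Return True if the step's expected output is already in S3."""
        s3 = boto3.client("s3")
        try:
            s3.head_object(Bucket=bucket, Key=key)
            return True
        except ClientError as err:
            if err.response["Error"]["Code"] == "404":
                return False
            raise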

Create trigger to initiate lambda function

Triggering execution
-------------------------
The first step will be executed on a schedule, such as the second day of the month
(GBIF data is deposited on the first day of the month).

Scheduled execution (Temporary): Each step after the first is also executed on a
schedule that roughly estimates completion of the previous step. Steps with a
dependency on previous outputs will first check for the existence of required inputs,
failing immediately if inputs are not present.

Automatic execution (TODO): The successful deposition of output of the first
(scheduled) and all following steps into S3 or Redshift triggers subsequent steps.

Both automatic and scheduled execution will require examining the logs to ensure
successful completion.
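
As a sketch, such a schedule can be created with boto3's EventBridge Scheduler
client; the names and ARNs below are hypothetical::

    import boto3

    scheduler = boto3.client("scheduler")

    scheduler.create_schedule(
        Name="bison-step0-monthly",
        # 00:00 UTC on the second day of each month, after GBIF deposits data.
        ScheduleExpression="cron(0 0 2 * ? *)",
        FlexibleTimeWindow={"Mode": "OFF"},
        Target={
            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:"
                   "bison_s0_annotate_riis_lambda",
            "RoleArn": "arn:aws:iam::123456789012:role/bison_redshift_lambda_role",
        },
    )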


TODO: Create rule to initiate lambda function based on previous step
----------------------------------------------------------------------

* Check for existence of new GBIF data
@@ -49,92 +97,3 @@ Create trigger to initiate lambda function
* Select target(s)

* AWS service


Lambda to query Redshift
--------------------------------------------

https://repost.aws/knowledge-center/redshift-lambda-function-queries

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/redshift-data/client/execute_statement.html

* Connect to a serverless workgroup (bison), namespace (bison), database name (dev)

* When connecting to a serverless workgroup, specify the workgroup name and database
name. The database user name is derived from the IAM identity. For example,
arn:iam::123456789012:user:foo has the database user name IAM:foo. Also, permission
to call the redshift-serverless:GetCredentials operation is required.
* The redshift:GetClusterCredentialsWithIAM permission is needed for temporary
authentication with a role

Lambda to start EC2 for task
--------------------------------------------

Lambda functions must be single-function tasks that run in less than 15 minutes.
For complex or long-running tasks, we start an EC2 instance containing the bison code
and execute it in a Docker container.

For each task, the lambda function should create a Spot EC2 instance with a template
containing userdata that will either 1) pull the GitHub repo, then build the Docker
image, or 2) pull a Docker image directly.

Annotating the RIIS records with GBIF accepted taxa takes about 1 hour and uses
multiple bison modules.

EC2/Docker setup
....................

* Create the first EC2 Launch Template as a "one-time" Spot instance, no hibernation

* The Launch template should have the following settings::

Name: bison_spot_task
Application and OS Images: Ubuntu
AMI: Ubuntu 24.04 LTS
Architecture: 64-bit ARM
Instance type: t4g.micro
Key pair: bison-task-key
Network settings/Select existing security group: launch-wizard-1
Configure storage: 8 GB gp3 (default)
Details - encrypted
Advanced Details:
IAM instance profile: bison_ec2_s3_role
Shutdown behavior: Terminate
Cloudwatch monitoring: Enable
Purchasing option: Spot instances
Request type: One-time

* Use the launch template to create a version for each task.
* The launch template task versions must have the task name in the description, and
have the following script in the userdata::

#!/bin/bash
sudo apt-get -y update
sudo apt-get -y install docker.io
sudo apt-get -y install docker-compose-v2
git clone https://github.com/lifemapper/bison.git
cd bison
sudo docker compose -f compose.test_task.yml up
sudo shutdown -h now


* For each task **compose.test_task.yml** must be replaced with the appropriate compose file.
* On EC2 instance startup, the userdata script will execute
* The compose file sets an environment variable (TASK_APP) containing a python module
to be executed from the Dockerfile.
* Tasks should deposit outputs and logfiles into S3.
* After completion, the docker container will stop automatically and the EC2 instance
will stop because of the shutdown command in the final line of the userdata script.
* **TODO**: once the workflow is stable, to eliminate Docker build time, create a Docker
image and download it in userdata script.

Lambda setup
....................

Triggering execution
-------------------------
The first step may be executed on a schedule, such as the second day of the month (since
GBIF data is deposited on the first day of the month).

Upon successful completion, the deposition of output into S3 can trigger the
following steps.
34 changes: 34 additions & 0 deletions _sphinx_config/pages/aws/aws_setup.rst
@@ -1,9 +1,13 @@
AWS Resource Setup
###################

Security
********************

Create policies and roles
===========================================================


The :ref:`bison_redshift_lambda_role` allows access to the bison Redshift
namespace/workgroup, lambda functions, EventBridge Scheduler, and S3 data.
The Trusted Relationships on this policy allow each to
@@ -25,6 +29,36 @@ external schema, you may encounter an error indicating that the "dev" database d
exist. This refers to the external database, and may indicate that the role used by the
command and/or namespace differs from the role granted to the schema upon creation.

Create a Security Group for the region
===========================================================

* Test this group!
* Create a security group for the project/region (a boto3 sketch follows this list)

* inbound rules allow:

* Custom TCP, port 8000
* Custom TCP, port 8080
* HTTP, port 80
* HTTPS, port 443
* SSH, port 22

* Consider restricting SSH to campus

* or use launch-wizard-1 security group (created by some EC2 instance creation in 2023)

* inbound rules IPv4:

* Custom TCP 8000
* Custom TCP 8080
* SSH 22
* HTTP 80
* HTTPS 443

* outbound rules IPv4, IPv6:

* All traffic all ports
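
A sketch of creating such a group with boto3, assuming a default VPC and the
hypothetical name ``bison-sg`` (restrict the SSH range to campus addresses where
possible)::

    import boto3

    ec2 = boto3.client("ec2")

    sg = ec2.create_security_group(
        GroupName="bison-sg",
        Description="BISON project security group",
    )
    ec2.authorize_security_group_ingress(
        GroupId=sg["GroupId"],
        IpPermissions=[
            {"IpProtocol": "tcp", "FromPort": port, "ToPort": port,
             "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}
            for port in (22, 80, 443, 8000, 8080)
        ],
    )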

Redshift Namespace and Workgroup
===========================================================

47 changes: 47 additions & 0 deletions _sphinx_config/pages/aws/ec2_setup.rst
@@ -151,3 +151,50 @@ Hop Limit for AWS communication
--http-endpoint enabled

* or in console, add metadata tag/value HttpPutResponseHopLimit/2

EC2/Docker setup
....................

* Create the first EC2 Launch Template as a "one-time" Spot instance, no hibernation

* The Launch template should have the following settings::

Name: bison_spot_task
Application and OS Images: Ubuntu
AMI: Ubuntu 24.04 LTS
Architecture: 64-bit ARM
Instance type: t4g.micro
Key pair: bison-task-key
Network settings/Select existing security group: launch-wizard-1
Configure storage: 8 GB gp3 (default)
Details - encrypted
Advanced Details:
IAM instance profile: bison_ec2_s3_role
Shutdown behavior: Terminate
Cloudwatch monitoring: Enable
Purchasing option: Spot instances
Request type: One-time

* Use the launch template to create a version for each task.
* The launch template task versions must have the task name in the description, and
have the following script in the userdata::

#!/bin/bash
sudo apt-get -y update
sudo apt-get -y install docker.io
sudo apt-get -y install docker-compose-v2
git clone https://github.com/lifemapper/bison.git
cd bison
sudo docker compose -f compose.test_task.yml up
sudo shutdown -h now


* For each task **compose.test_task.yml** must be replaced with the appropriate compose file.
* On EC2 instance startup, the userdata script will execute
* The compose file sets an environment variable (TASK_APP) containing a python module
to be executed from the Dockerfile (see the sketch after this list).
* Tasks should deposit outputs and logfiles into S3.
* After completion, the docker container will stop automatically and the EC2 instance
will stop because of the shutdown command in the final line of the userdata script.
* **TODO**: once the workflow is stable, to eliminate Docker build time, create a Docker
image and download it in userdata script.
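
A hypothetical sketch of a Dockerfile entrypoint that runs the module named in
TASK_APP (the module name below is illustrative)::

    import os
    import runpy

    # TASK_APP names the python module for this task, set by the compose file.
    task_module = os.environ["TASK_APP"]  # e.g. "bison.task.annotate_riis"
    runpy.run_module(task_module, run_name="__main__")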
43 changes: 43 additions & 0 deletions _sphinx_config/pages/aws/lambda.rst
@@ -0,0 +1,43 @@
Create lambda function to initiate processing
----------------------------------------------
* Create a lambda function for execution when the trigger condition is activated,
e.g. aws/lambda/bison_s0_test_task_lambda.py
* This trigger condition can be either a schedule (e.g. midnight on the second day of
every month) or a rule (e.g. a file matching xxx* deposited in an S3 bucket)

Edit the execution role for lambda function
--------------------------------------------
* Under Configuration/Permissions set the Execution role to the Workflow role
(bison_redshift_lambda_role)



Lambda to query Redshift
--------------------------------------------

https://repost.aws/knowledge-center/redshift-lambda-function-queries

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/redshift-data/client/execute_statement.html

* Connect to a serverless workgroup (bison), namespace (bison), database name (dev);
a minimal query sketch follows this list

* When connecting to a serverless workgroup, specify the workgroup name and database
name. The database user name is derived from the IAM identity. For example,
arn:iam::123456789012:user:foo has the database user name IAM:foo. Also, permission
to call the redshift-serverless:GetCredentials operation is required.
* The redshift:GetClusterCredentialsWithIAM permission is needed for temporary
authentication with a role
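
A minimal query sketch with the boto3 Redshift Data API client (the table name is
hypothetical)::

    import boto3

    rs_data = boto3.client("redshift-data")

    # The caller's IAM identity supplies the database user (IAM:<name>).
    response = rs_data.execute_statement(
        WorkgroupName="bison",
        Database="dev",
        Sql="SELECT COUNT(*) FROM public.riis;",
    )
    statement_id = response["Id"]  # poll describe_statement for completion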

Lambda to start EC2 for task
--------------------------------------------

Lambda functions must be single-function tasks that run in less than 15 minutes.
For complex or long-running tasks, we start an EC2 instance containing the bison code
and execute it in a Docker container.

For each task, the lambda function should create a Spot EC2 instance with a template
containing userdata that will either 1) pull the GitHub repo, then build the Docker
image, or 2) pull a Docker image directly.

Annotating the RIIS records with GBIF accepted taxa takes about 1 hour and uses
multiple bison modules.
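
A sketch of launching the task instance from the launch template described in
ec2_setup (the template name and version are assumptions)::

    import boto3

    ec2 = boto3.client("ec2")

    response = ec2.run_instances(
        LaunchTemplate={
            "LaunchTemplateName": "bison_spot_task",
            "Version": "1",  # the version whose userdata runs this task
        },
        MinCount=1,
        MaxCount=1,
    )
    instance_id = response["Instances"][0]["InstanceId"]

The instance runs its userdata, executes the task in a Docker container, and
terminates itself via the shutdown command in the userdata script.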
