Init

blieusong · blieusong · commit ad6e8d9a185c · 2024-03-29T00:40:23.000Z
diff --git a/README.md b/README.md
@@ -0,0 +1,26 @@
+# AWS Cookbook
+Collection of AWS commands and scripts that I use on a regular basis.
+
+- [S3](docs/s3.md)
+- [Athena](docs/athena.md)
+- [Glue](docs/glue.md)
+- [SES](docs/ses.md)
+- [IAM](docs/iam.md)
+- [STS](docs/sts.md)
+
+
+## Prerequisites
+### Dependencies
+- [**jq**](https://stedolan.github.io/jq/): A lightweight and flexible 
+  command-line JSON processor.
+- [**parallel**](https://www.gnu.org/software/parallel/): A shell tool for 
+  executing jobs in parallel using one or more computers.
+- [**s5cmd**](https://github.com/peak/s5cmd): A fast S3 command line tool written
+  in Go.
+
+### Config
+In the AWS CLI `config` file (`~/.aws/config` by default), set the output to `json`.
+
+```
+output = json
+```
diff --git a/docs/athena.md b/docs/athena.md
@@ -0,0 +1,25 @@
+# Athena
+
+## Caveats
+Queries' resultsets must be stored in an S3 bucket, and that bucket must be in
+the same region as the Athena resources queried.
+
+## Using a Local SQL Client to Query Athena
+AWS provides 
+[JDBC Drivers](https://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html)
+for Athena.
+
+Most SQL clients written in Java will work with them. 
+[DBeaver](https://dbeaver.io/) is one such client. It is also free.
+
+Be aware that DBeaver cannot retrieve the resultsets of queries run in a
+previous session. Running a query several times only stacks more
+resultsets in the specified output bucket.
+
+## Execute a query
+
+```shell
+aws athena start-query-execution \
+    --query-string "SELECT * FROM database.table limit 10;" \
+    --result-configuration "OutputLocation=s3://output-bucket/"
+```
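+
+`start-query-execution` only submits the query and returns its id. As a minimal
+sketch (the `run_athena_query` helper name and the 2-second polling interval are
+assumptions, not part of the AWS CLI), the query can be awaited and its results
+fetched like this:
+
+```shell
+# Hypothetical helper: submit a query, wait for it to finish, print the results.
+# $1 = SQL query string, $2 = S3 output location (e.g. s3://output-bucket/)
+run_athena_query() {
+    query_id=$(aws athena start-query-execution \
+        --query-string "$1" \
+        --result-configuration "OutputLocation=$2" \
+        --query 'QueryExecutionId' --output text) || return 1
+
+    # Poll until the query reaches a final state.
+    while :; do
+        state=$(aws athena get-query-execution \
+            --query-execution-id "$query_id" \
+            --query 'QueryExecution.Status.State' --output text) || return 1
+        case "$state" in
+            SUCCEEDED) break ;;
+            FAILED|CANCELLED) echo "query $query_id $state" >&2; return 1 ;;
+        esac
+        sleep 2   # assumed polling interval
+    done
+
+    aws athena get-query-results --query-execution-id "$query_id"
+}
+```
+
+For example: `run_athena_query "SELECT * FROM database.table limit 10;" s3://output-bucket/`.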
diff --git a/docs/glue.md b/docs/glue.md
@@ -0,0 +1,44 @@
+# Glue
+AWS Glue is a managed ETL service.
+
+## ARNs
+### Glue Catalog ARN Format
+<pre>
+arn:aws:glue:<i>region</i>:<i>account-id</i>:catalog
+</pre>
+
+Example:
+
+`arn:aws:glue:eu-west-1:999999999999:catalog`
+
+### Glue Database ARN Format
+<pre>
+arn:aws:glue:<i>region</i>:<i>account-id</i>:database/<i>database name</i>
+</pre>
+
+Example:
+
+`arn:aws:glue:eu-west-1:999999999999:database/salesdb`
+
+### Glue Table ARN Format
+<pre>
+arn:aws:glue:<i>region</i>:<i>account-id</i>:table/<i>database name</i>/<i>table name</i>
+</pre>
+
+Example:
+
+`arn:aws:glue:eu-west-1:999999999999:table/salesdb/salestable`
+
+## Listing All Glue Tables And Their S3 Location
+Assuming the *account id* is 999999999999, and saving the output in the file 
+`tables_location.json`:
+
+```shell
+(for database in $(aws glue get-databases --catalog-id 999999999999 | jq -r ".DatabaseList[].Name"); do
+     for table in $(aws glue get-tables --database-name "$database" --catalog-id 999999999999 | jq -r ".TableList[].Name"); do
+         aws glue get-table --database-name "$database" --name "$table" --catalog-id 999999999999 | jq -c ".Table | {database: .DatabaseName, name: .Name, location: .StorageDescriptor.Location}"
+     done
+ done) | tee tables_location.json
+ ```
+
+ This command also retrieves views, which are not tables. To filter out views,
diff --git a/docs/iam.md b/docs/iam.md
@@ -0,0 +1,28 @@
+# IAM
+
+## Searching a Given Pattern in all the Policies of an Account
+Looking for `pattern`:
+```shell
+aws iam list-policies --scope Local | jq -r '.Policies[] | "\(.Arn) \(.DefaultVersionId)"' | while read -r arn version; do
+    echo "$arn"
+    aws iam get-policy-version --no-cli-pager --policy-arn "$arn" --version-id "$version" | grep -i "pattern"
+done
+```
+
+## Finding Who Has Access to a Folder in an S3 Bucket
+This script generates a CSV file which lists the users of an account and tells
+whether each has `s3:GetObject` access to the `/specific/folder` folder in the
+`bucket_name` bucket.
+
+```shell
+(
+    echo "User;Access allowed" && \
+    for user in $(aws iam list-users --query 'Users[].Arn' --output text); do
+        echo "${user};"\
+$(aws iam simulate-principal-policy --policy-source-arn "$user" \
+            --action-names "s3:GetObject" \
+            --resource-arns "arn:aws:s3:::bucket_name/specific/folder/*" \
+        | jq -r ".EvaluationResults[].EvalDecision")
+    done
+) | tee file.csv
+```
diff --git a/docs/s3.md b/docs/s3.md
@@ -0,0 +1,161 @@
+# S3 (Simple Storage Service)
+## Good to Know...
+S3 is closer to a key / value store in which the value is the content of a file.
+There is no real concept of folders in S3; they are emulated by using prefixes
+containing `/` as the folder separator in the key name.
+
+## Deleting an object
+With `aws s3`
+
+```shell
+aws s3 rm s3://bucket_name/object_key
+```
+
+With `aws s3api`
+
+```shell
+aws s3api delete-object --bucket bucket_name --key object_key
+```
+
+## Listing all older versions of all objects
+It is highly recommended to save the output in a file, as it can be huge and slow
+to obtain on large buckets.
+
+```shell
+aws s3api list-object-versions \
+    --bucket bucket_name \
+    | jq '[.Versions[] | select(.IsLatest == false)]' \
+    | jq -c ".[] | {Key, VersionId}" > bucket_older_versions.json
+```
+
+### Filtering on a specific prefix
+
+Use option `--prefix`:
+
+```shell
+aws s3api list-object-versions \
+    --bucket bucket_name \
+    --prefix 'prefix/' \
+    | jq '[.Versions[] | select(.IsLatest == false)]' \
+    | jq -c ".[] | {Key, VersionId}" > bucket_older_versions.json
+```
+
+### Delete all older versions of all objects
+Assuming we ran the above commands and saved the output in a file named 
+`bucket_older_versions.json`, and using 20 parallel jobs:
+
+```shell
+cat bucket_older_versions.json | parallel -j 20 --linebuffer '
+    key=$(echo {} | jq -r ".Key")
+    versionId=$(echo {} | jq -r ".VersionId")
+    aws s3api delete-object --bucket bucket_name --key "$key" --version-id "$versionId"
+'
+```
+
+## List all objects from only the root folder of a bucket
+```bash
+aws s3api list-objects-v2 \
+    --bucket bucket_name \
+    --delimiter '/' \
+    --prefix ''
+```
+
+## Computing size of objects in a bucket
+First generate a full listing of all objects in a bucket and save it in a file.
+```shell
+aws s3 ls --recursive s3://bucket_name > bucket_name_listing.txt
+```
+
+The advantage of pregenerating the listing is that it can be used multiple times
+without having to query the S3 API (and get charged) again and again. It is also
+much faster to process a local file.
+
+The downside is the initial time to generate that listing and the fact it may
+not be up to date. So use that approach mostly for very large buckets.
+
+### Total size of all objects of the bucket
+```shell
+awk '{sum+=$3} END {print sum}' bucket_name_listing.txt
+```
+
+To get the size in MB, divide by (1024 * 1024). For sizes in GB, divide by 
+(1024 * 1024 * 1024).
+
+Total size of all objects in GB:
+
+```shell
+awk '{sum+=$3} END {print sum/(1024*1024*1024)}' bucket_name_listing.txt
+```
+
+### Total Size of Objects in a Specific Folder In GB
+```shell
+grep 'folder_name/' bucket_name_listing.txt \
+    | awk '{sum+=$3} END {print sum/(1024*1024*1024)}'
+```
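+
+The same listing can also be broken down per top-level folder in one pass. The
+following is a minimal sketch; the `size_per_prefix` helper name is an
+assumption, and it presumes top-level folder names contain no spaces:
+
+```shell
+# Hypothetical helper: print the total size in GB of each top-level folder.
+# $1 = listing file produced by `aws s3 ls --recursive`
+size_per_prefix() {
+    awk '{
+        # $3 is the object size in bytes, $4 is where the key starts
+        n = split($4, parts, "/")
+        prefix = (n > 1) ? (parts[1] "/") : "(root)"
+        sum[prefix] += $3
+    } END {
+        for (p in sum) printf "%s\t%.3f\n", p, sum[p] / (1024 * 1024 * 1024)
+    }' "$1" | sort
+}
+```
+
+For example: `size_per_prefix bucket_name_listing.txt`. Objects stored directly
+at the root of the bucket are grouped under `(root)`.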
+
+## Exploring a Bucket's Access Log
+If a `bucket_name` bucket has access logs enabled, and the logs are stored at
+`s3://bucket_logs/bucket_name/`, then it is possible to query them with an
+**Athena table**.
+
+Create the table with the following SQL:
+```sql
+CREATE EXTERNAL TABLE bucket_name_access_logs (
+    bucketowner STRING, 
+    bucket_name STRING, 
+    requestdatetime STRING, 
+    remoteip STRING, 
+    requester STRING, 
+    requestid STRING, 
+    operation STRING, 
+    key STRING, 
+    request_uri STRING, 
+    httpstatus STRING, 
+    errorcode STRING, 
+    bytessent BIGINT, 
+    objectsize BIGINT, 
+    totaltime STRING, 
+    turnaroundtime STRING, 
+    referrer STRING, 
+    useragent STRING, 
+    versionid STRING, 
+    hostid STRING, 
+    sigv STRING, 
+    ciphersuite STRING, 
+    authtype STRING, 
+    endpoint STRING, 
+    tlsversion STRING,
+    accesspointarn STRING,
+    aclrequired STRING)
+ROW FORMAT SERDE 
+    'org.apache.hadoop.hive.serde2.RegexSerDe' 
+WITH SERDEPROPERTIES ( 
+    'input.regex'='([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*$') 
+STORED AS INPUTFORMAT 
+    'org.apache.hadoop.mapred.TextInputFormat' 
+OUTPUTFORMAT 
+    'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
+LOCATION
+    's3://bucket_logs/bucket_name/'
+ ```
+
+Querying the logs is then straightforward, with one caveat: the
+`requestdatetime` field is not in a `datetime`-friendly format. To convert it,
+use:
+
+```sql
+parse_datetime(requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z')
+```
+
+For example, to get the most recent GET requests on the bucket:
+
+```sql
+SELECT 
+    *
+FROM
+    bucket_name_access_logs 
+WHERE 
+    operation LIKE 'REST.GET.%'
+ORDER BY
+    parse_datetime(requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z') DESC
+```
diff --git a/docs/ses.md b/docs/ses.md
@@ -0,0 +1,15 @@
+# SES (Simple Email Service)
+
+## Request production access for SES
+Assuming the AWS account is accessed with profile **my_profile** (the website URL value does not matter):
+
+```shell
+aws sesv2 put-account-details \
+--profile my_profile \
+--production-access-enabled \
+--mail-type TRANSACTIONAL \
+--website-url https://your.website.com \
+--use-case-description "describe your usecase" \
+--additional-contact-email-addresses your.email@gmail.com \
+--contact-language EN
+```
diff --git a/docs/sts.md b/docs/sts.md
@@ -0,0 +1,16 @@
+# STS (Security Token Service)
+
+## Get a session token for an MFA device
+```shell
+aws sts get-session-token \
+    --serial-number arn:aws:iam::999999999999:mfa/mfa.device.name \
+    --token-code [TOKEN]
+```
+
+Then set the environment variable (and likewise `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`):
+
+```shell
+export AWS_SESSION_TOKEN=token_from_response
+```
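+
+The temporary credentials returned by `get-session-token` consist of an access
+key id, a secret access key and a session token. As a minimal sketch (the
+`export_session_credentials` helper name is an assumption), all three can be
+exported in one go from the CLI's text output:
+
+```shell
+# Hypothetical helper: read "AccessKeyId SecretAccessKey SessionToken" from
+# stdin (whitespace separated) and export them for the current shell.
+export_session_credentials() {
+    read -r AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN
+    export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN
+}
+
+# Usage (commented out so that nothing runs by accident):
+# export_session_credentials <<EOF
+# $(aws sts get-session-token \
+#     --serial-number arn:aws:iam::999999999999:mfa/mfa.device.name \
+#     --token-code [TOKEN] \
+#     --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
+#     --output text)
+# EOF
+```
+
+The here-document keeps the function running in the current shell; piping into
+it would export the variables in a subshell only.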
+
+But it is best to use proper AWS CLI profiles.