Amazon Redshift Checklist

This checklist aims to be an exhaustive list of all elements you should consider when using Amazon Redshift.

How to use

Most items in the Amazon Redshift Checklist apply to the majority of projects, but some can be omitted depending on context. Each item is marked with one of 3 levels of flexibility:

  • 🔴 means the item can't be omitted for any reason.
  • 🟡 means the item is highly recommended and should be omitted only in very specific cases.
  • 🟢 means the item is recommended but can be omitted in some particular situations.

Some resources are marked with an emoji to show which type of content or help they provide:

  • 📖 documentation or article
  • 🔧 online tool
  • 📹 media

Checklist

Designing Tables

🔴 Select an appropriate table distribution style

In order to utilise the parallel nature of Redshift, data must be correctly distributed within each table of the cluster. Tables not distributed correctly (based on their query patterns) will generally lead to poor query performance.

🟡 Set column compression

Column compression reduces the storage space a table uses and the amount of data read from disk during queries.

🟡 Select appropriate table sort keys

Sort keys determine the order in which rows are stored on each node, so that queries filtering on the sort key can skip blocks that fall outside the requested range.
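
The three table design items above (distribution style, column compression, and sort keys) all end up in the same CREATE TABLE statement. The sketch below is one possible illustration in Python that only builds the DDL string. The `sales` table, its columns, and the helper names are hypothetical, and the choice of `customer_id` as the distribution key assumes a workload that mostly joins on that column.

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    data_type: str
    encoding: str  # Redshift column compression encoding, e.g. "az64", "zstd", "raw"


def create_table_ddl(table, columns, dist_key, sort_keys):
    """Return a CREATE TABLE statement with KEY distribution and a sort key."""
    column_lines = ",\n".join(
        f"    {c.name} {c.data_type} ENCODE {c.encoding}" for c in columns
    )
    return (
        f"CREATE TABLE {table} (\n{column_lines}\n)\n"
        f"DISTSTYLE KEY\n"
        f"DISTKEY ({dist_key})\n"
        f"SORTKEY ({', '.join(sort_keys)});"
    )


# Hypothetical fact table: joined on customer_id, filtered by sold_at.
ddl = create_table_ddl(
    "sales",
    [
        Column("sale_id", "BIGINT", "az64"),
        Column("customer_id", "BIGINT", "az64"),
        # The sort key column is left uncompressed (raw) in this sketch.
        Column("sold_at", "TIMESTAMP", "raw"),
        Column("note", "VARCHAR(256)", "zstd"),
    ],
    dist_key="customer_id",
    sort_keys=["sold_at"],
)
print(ddl)
```

Whether KEY, EVEN, ALL, or AUTO distribution is the right choice depends on the query patterns of the table, so treat the values here as placeholders.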

🟢 Define table constraints

Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift. Nonetheless, primary keys and foreign keys are used as planning hints and they should be declared if your ETL process or some other process in your application enforces their integrity.

Loading Data

🔴 Use the COPY command

Loads data into a table from data files or from an Amazon DynamoDB table. The files can be located in an Amazon Simple Storage Service (Amazon S3) bucket, an Amazon EMR cluster, or a remote host that is accessed using a Secure Shell (SSH) connection.

🟡 Compress data files

Compressed files generally load faster. Use either GZIP, LZOP, BZIP2, or ZSTD.
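
As a rough sketch of how a COPY from compressed files might look, the Python helper below assembles the statement text for CSV files in Amazon S3. The table name, bucket prefix, and IAM role ARN are placeholders, and the helper itself (`copy_statement`) is hypothetical; it only builds a string and does not connect to a cluster.

```python
SUPPORTED_COMPRESSION = {"GZIP", "LZOP", "BZIP2", "ZSTD"}


def copy_statement(table, s3_prefix, iam_role_arn, compression="GZIP"):
    """Build a COPY statement that loads compressed CSV files from S3."""
    if compression not in SUPPORTED_COMPRESSION:
        raise ValueError(f"unsupported compression: {compression}")
    parts = [
        f"COPY {table}",
        f"FROM '{s3_prefix}'",
        f"IAM_ROLE '{iam_role_arn}'",
        "FORMAT AS CSV",
        compression,  # tells COPY how the input files are compressed
    ]
    return "\n".join(parts) + ";"


# Placeholder values for illustration only.
statement = copy_statement(
    "sales",
    "s3://example-bucket/sales/part-",
    "arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(statement)
```

Because the FROM value is treated as a key prefix, a single COPY of this shape would pick up every file whose key starts with `sales/part-`.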

🟡 Use multi-row inserts

If a COPY command is not an option and you require SQL inserts, use a multi-row insert whenever possible.
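
A multi-row insert sends many rows in one INSERT statement instead of one statement per row. The helper below is a minimal sketch that builds such a statement with `%s` placeholders, assuming a DB-API driver that uses that placeholder style (psycopg2, for example). The function name and the `sales` table are hypothetical.

```python
def multi_row_insert(table, columns, rows):
    """Return (sql, params) for a single INSERT that carries every row."""
    if not rows:
        raise ValueError("rows must not be empty")
    if any(len(row) != len(columns) for row in rows):
        raise ValueError("every row must have one value per column")
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
        + ", ".join([row_placeholder] * len(rows))
    )
    # Flatten the rows so the values line up with the placeholders.
    params = [value for row in rows for value in row]
    return sql, params


sql, params = multi_row_insert(
    "sales", ["sale_id", "note"], [(1, "first"), (2, "second")]
)
print(sql)     # INSERT INTO sales (sale_id, note) VALUES (%s, %s), (%s, %s)
print(params)  # [1, 'first', 2, 'second']
```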

🟡 Pre-sort data files in a sort key order

Load your data in sort key order to avoid needing to vacuum.

🟡 Enable automatic compression

Use the COPY command with COMPUPDATE set to ON to automatically set column encoding for new tables during their first load.

🟢 Split data into multiple files

Split your load data files so that the files are about equal size, between 1 MB and 1 GB after compression.
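
One simple way to get files of about equal size is to split the records evenly before writing them out. The sketch below splits by record count, which only approximates equal file size when records are of similar length; `split_evenly` is a hypothetical helper, and writing and compressing each chunk is left out.

```python
def split_evenly(records, parts):
    """Split records into `parts` chunks whose sizes differ by at most one."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    base, extra = divmod(len(records), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks take one additional record each.
        end = start + base + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


chunks = split_evenly(list(range(10)), 4)
print([len(chunk) for chunk in chunks])  # [3, 3, 2, 2]
```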

Performance

🔴 Enable automatic workload management (WLM)

With automatic WLM, Amazon Redshift determines how many queries run concurrently and how much memory is allocated to each dispatched query.

🟡 Enable concurrency scaling

Dynamically adds cluster capacity when needed, improving concurrency for read queries.

🟡 🆕 Use AZ64 column compression encoding

Consider using Redshift's proprietary new column encoding algorithm AZ64.

🟡 Analyse query performance

The STL_ALERT_EVENT_LOG system table records alerts raised by the query optimizer, which helps users identify and resolve performance issues.
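
As an illustration, the helper below builds a query that lists recent entries from STL_ALERT_EVENT_LOG together with the suggested solution for each alert. The function name and the default time window are assumptions for the sketch; the statement would be run with whichever SQL client you already use.

```python
def recent_alerts_query(days=1, limit=50):
    """Build a query for the most recent optimizer alerts."""
    return (
        "SELECT query, TRIM(event) AS event, TRIM(solution) AS solution, event_time\n"
        "FROM stl_alert_event_log\n"
        f"WHERE event_time >= DATEADD(day, -{int(days)}, GETDATE())\n"
        "ORDER BY event_time DESC\n"
        f"LIMIT {int(limit)};"
    )


print(recent_alerts_query(days=7, limit=10))
```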

🟢 Disable automatic compression

Use the COPY command with COMPUPDATE set to OFF once column encodings are established. Re-running compression analysis on every load of an already known data set slows the load down.

🟢 🆕 Use materialized views

Materialized views can significantly boost query performance for repeated and predictable analytical workloads such as dashboarding, queries from business intelligence (BI) tools, and ELT (Extract, Load, Transform) data processing.

🟢 Enable short query acceleration (SQA)

SQA runs short-running queries in a dedicated space so that SQA queries aren't forced to wait in queues behind longer queries.

🟢 Use elastic resize scheduling

Consider scheduling an elastic cluster resize for nightly ETL workloads or to accommodate heavier workloads during the day as well as shrinking a cluster to accommodate lighter workloads at specific times of the day.

🟢 Use TRUNCATE over DELETE

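Consider using TRUNCATE instead of DELETE when emptying transient tables. TRUNCATE is much more efficient than DELETE and doesn't require a subsequent VACUUM and ANALYZE. Note that TRUNCATE commits the transaction in which it runs, so it cannot be rolled back.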

Security

🔴 Enable cluster encryption

Ensure cluster encryption is turned on to protect data at rest.

🔴 Disable public accessibility

Most clusters should not be publicly accessible and therefore should be set to private.

🔴 Enable enhanced VPC routing

Forces all COPY and UNLOAD traffic between your cluster and your data repositories through your Amazon VPC.

🔴 Use user groups

To make permission management easier, create different user groups and grant privileges based on their roles. Add and remove users to/from groups instead of granting permissions to individual users.
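
The group-based approach can be sketched as a short list of SQL statements: create the group, grant privileges to the group, then add users to it. The helper, the group name, the schema, and the user names below are all hypothetical, and read-only access is just one example of a role.

```python
def group_setup_statements(group, schema, users):
    """Return SQL that creates a read-only group and adds users to it."""
    statements = [
        f"CREATE GROUP {group};",
        f"GRANT USAGE ON SCHEMA {schema} TO GROUP {group};",
        f"GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO GROUP {group};",
    ]
    # Membership changes touch the group only, never per-user grants.
    statements += [f"ALTER GROUP {group} ADD USER {user};" for user in users]
    return statements


for statement in group_setup_statements("analysts", "reporting", ["alice", "bob"]):
    print(statement)
```

Removing a user later would be a single `ALTER GROUP ... DROP USER ...` statement, with no individual privileges to revoke.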

🟡 🆕 Use federated user access

Consider providing user access via SAML 2.0 using AD FS, PingFederate, Okta, or Azure AD.

🟡 🆕 Enable multi-factor authentication (MFA)

Consider enabling MFA for production workloads.

🟡 Use Secrets Manager for service accounts

Configure AWS Secrets Manager to automatically rotate Amazon Redshift passwords for service accounts. Rotation is carried out by a Lambda function that Secrets Manager provides.

🟢 🆕 Use column-level access controls

Consider implementing column-level access controls to restrict users from accessing certain columns.
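
A column-level grant names the permitted columns in the GRANT statement itself. The sketch below builds such a statement for a group; the helper name, the `customers` table, and the column list are hypothetical.

```python
def column_select_grant(table, columns, group):
    """Build a GRANT that allows SELECT on the listed columns only."""
    if not columns:
        raise ValueError("at least one column is required")
    return f"GRANT SELECT ({', '.join(columns)}) ON TABLE {table} TO GROUP {group};"


# Hypothetical example: analysts may read names and countries, but not emails.
grant = column_select_grant("customers", ["customer_id", "name", "country"], "analysts")
print(grant)
```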

Monitoring

🔴 Action Redshift advisor recommendations

Redshift advisor analyses your cluster and makes recommendations to improve performance and decrease costs.

🔴 Monitor long running queries

Set an alarm to notify users when queries are running for longer than expected using the QueryDuration CloudWatch metric.

🔴 Monitor underutilised or overutilised clusters

Check whether your cluster is underutilised or overutilised using the CPUUtilization CloudWatch metric.

🔴 Monitor disk space usage

Check if your cluster is running out of disk space and whether you need to consider scaling using the PercentageDiskSpaceUsed metric.
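
The CloudWatch metrics above can each back an alarm. As a sketch, the function below builds the parameters for a disk space alarm in the shape accepted by the CloudWatch PutMetricAlarm API (for instance via boto3's `put_metric_alarm`). The alarm name, the 80% threshold, and the evaluation settings are assumptions to adjust, and nothing here calls AWS.

```python
def disk_space_alarm(cluster_id, threshold_percent=80.0):
    """Return PutMetricAlarm parameters for high disk usage on one cluster."""
    return {
        "AlarmName": f"{cluster_id}-disk-space-high",  # hypothetical naming scheme
        "Namespace": "AWS/Redshift",
        "MetricName": "PercentageDiskSpaceUsed",
        "Dimensions": [{"Name": "ClusterIdentifier", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 300,           # seconds per data point (assumed)
        "EvaluationPeriods": 3,  # consecutive breaching periods (assumed)
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }


params = disk_space_alarm("example-cluster")
print(params["AlarmName"], params["Threshold"])
```

The same shape should work for `QueryDuration` or `CPUUtilization` by swapping the metric name and choosing a threshold that suits the workload.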

🔴 🆕 Enable CloudWatch anomaly detection

Applies machine-learning algorithms to the metric's past data to create a model of the metric's expected values.

🔴 Query monitoring rules

Define metrics-based performance boundaries for WLM queues and specify what action to take when a query goes beyond those boundaries.
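
A query monitoring rule pairs one or more metric predicates with an action. The sketch below builds a single rule in the shape used by the `rules` array of the WLM JSON configuration, as far as that format is understood here; check the field names against the current WLM documentation before relying on them. The rule name and the 120-second threshold are hypothetical.

```python
import json

ALLOWED_ACTIONS = {"log", "hop", "abort"}


def query_monitoring_rule(name, metric, threshold, action="log"):
    """Return one rule: act when `metric` rises above `threshold`."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return {
        "rule_name": name,
        "predicate": [
            {"metric_name": metric, "operator": ">", "value": threshold}
        ],
        "action": action,
    }


# Hypothetical rule: log any query that executes for more than 120 seconds.
rule = query_monitoring_rule("log_long_queries", "query_execution_time", 120)
print(json.dumps(rule, indent=2))
```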

🟡 Analyse workload performance

Optimise your cluster based on how much time queries spend on different stages of processing.

🟡 Use Redshift Advance Monitoring

This GitHub project provides an advanced monitoring system for Amazon Redshift that is completely serverless, based on AWS Lambda and Amazon CloudWatch. A Lambda function runs on a schedule, connects to the configured Redshift cluster, and generates CloudWatch custom alarms for common possible issues.

Consumption

🟡 🆕 Use Data API

Using this API, you can access Amazon Redshift data with web services–based applications, including AWS Lambda, AWS AppSync, Amazon SageMaker notebooks, and AWS Cloud9. You can use the HTTP endpoint to run SQL statements without managing connections. Calls to the Data API are asynchronous.

Cluster

🔴 Increase automated snapshot retention

The default retention period of 1 day can catch organisations out when they need disaster recovery or a rollback. Consider increasing it to the maximum of 35 days.
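
For the snapshot retention setting, the sketch below builds the parameters for the Redshift ModifyCluster API (for instance via boto3's `modify_cluster`). The cluster identifier is a placeholder and the helper name is hypothetical; no AWS call is made.

```python
def snapshot_retention_params(cluster_id, days=35):
    """Return ModifyCluster parameters that set automated snapshot retention."""
    if not 1 <= days <= 35:
        raise ValueError("retention must be between 1 and 35 days")
    return {
        "ClusterIdentifier": cluster_id,
        "AutomatedSnapshotRetentionPeriod": days,
    }


retention = snapshot_retention_params("example-cluster")
print(retention)
```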

🟡 🆕 Use RA3 nodes

Consider using Redshift's new RA3 nodes with a mix of local cache and S3 backed elastic storage if compute requirements exceed dense compute or dense storage node levels.

🟢 Use Redshift Spectrum

Consider using Redshift Spectrum to let users query data directly in S3 from their Redshift cluster. This can replace a staging schema: your staged data lives in your data lake and is read into Redshift via Spectrum.

🟢 🆕 Pause and resume clusters

Redshift has recently introduced the ability to pause and resume the cluster within minutes. Take advantage of this feature for non-production clusters to save money.

🟢 🆕 Use elastic resize over classic resize

Consider using elastic resize over classic resize when changing the node type, the number of nodes, or both within your Redshift cluster. Elastic resize is much quicker (minutes vs hours) and doesn't take your cluster out of commission.

Contributing

Open an issue or a pull request to suggest changes or additions.