This checklist aims to be an exhaustive list of all elements you should consider when using Amazon Redshift.
All items in the Amazon Redshift Checklist are required for the majority of projects, but some elements can be omitted or are not essential. We choose to use 3 levels of flexibility:
- 🔴 means the item can't be omitted for any reason.
- 🟡 means the item is highly recommended and can eventually be omitted in some really particular cases.
- 🟢 means the item is recommended but can be omitted in some particular situations.
Some resources possess an emoticon to help you understand which type of content/help you may find on the checklist:
- 📖 documentation or article
- 🔧 online tool
- 📹 media
In order to utilise the parallel nature of Redshift, data must be correctly distributed within each table of the cluster. Tables not distributed correctly (based on their query patterns) will generally lead to poor query performance.
- 📖 Choosing a data distribution style
- 📖 Amazon Redshift now recommends distribution keys for improved query performance
Ensures data is better compressed utilising less storage space.
Ensures data is retrieved from within each node in the most performant way.
- 📖 Choosing sort keys
- 📖 Amazon Redshift now supports changing table sort keys dynamically
- 📖 Amazon Redshift now recommends sort keys for improved query performance
- 📖 Compound and Interleaved Sort Keys
Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift. Nonetheless, primary keys and foreign keys are used as planning hints and they should be declared if your ETL process or some other process in your application enforces their integrity.
Loads data into a table from data files or from an Amazon DynamoDB table. The files can be located in an Amazon Simple Storage Service (Amazon S3) bucket, an Amazon EMR cluster, or a remote host that is accessed using a Secure Shell (SSH) connection.
Compressed files generally load faster. Use either GZIP, LZOP, BZIP2, or ZSTD.
If a COPY command is not an option and you require SQL inserts, use a multi-row insert whenever possible.
Load your data in sort key order to avoid needing to vacuum.
Use the COPY command with COMPUPDATE
set to ON
to automatically set column encoding for new tables during their first load.
Split your load data files so that the files are about equal size, between 1 MB and 1 GB after compression.
Amazon Redshift determines how many concurrent queries and how much memory is allocated to each dispatched query.
Dynamically adds concurrent clusters improving read query concurrency.
- 📖 Working with concurrency scaling
- 📖 Concurrency Scaling pricing
- 📹 Amazon Redshift Concurrency Scaling
Consider using Redshift's proprietary new column encoding algorithm AZ64.
STL_ALERT_EVENT_LOG
table allows users to analyse and improve performance issues.
Use the COPY command with COMPUPDATE
set to OFF
. Running compression computing every time on an already known data set will decrease performance.
Materialized views can significantly boost query performance for repeated and predictable analytical workloads such as dashboarding, queries from business intelligence (BI) tools, and ELT (Extract, Load, Transform) data processing.
SQA runs short-running queries in a dedicated space so that SQA queries aren't forced to wait in queues behind longer queries.
Consider scheduling an elastic cluster resize for nightly ETL workloads or to accommodate heavier workloads during the day as well as shrinking a cluster to accommodate lighter workloads at specific times of the day.
Consider using TRUNCATE
instead of DELETE
when creating transient tables. TRUNCATE
is much more efficient than DELETE
and doesn't require a VACUUM and ANALYZE.
- 📖 TRUNCATE
Ensure cluster encryption is turned on protecting data at rest.
Most clusters should not be publicly accessible and therefore should be set to private.
Forces all COPY and UNLOAD traffic between your cluster and your data repositories through your Amazon VPC.
To make permission management easier, create different user groups and grant privileges based on their roles. Add and remove users to/from groups instead of granting permissions to individual users.
Consider providing user access via SAML-2.0 using AD FS, PingFederate, Okta, or Azure AD.
- 📖 Federate Database User Authentication Easily with IAM and Amazon Redshift
- 📖 Federate Amazon Redshift access with Microsoft Azure AD single sign-on
- 📖 Federate Amazon Redshift access with Okta as an identity provider
- 📖 Options for providing IAM credentials
Consider enabling MFA for production workloads.
Configure AWS Secrets Manager to automatically rotate Amazon Redshift passwords for service accounts. Secrets Manager uses a Lambda function provided by Secrets Manager.
Consider implementing column-level access controls to restrict users from accessing certain columns.
Redshift advisor analyses your cluster and makes recommendations to improve performance and decrease costs.
Set an alarm to notify users when queries are running for longer than expected using the QueryDuration
CloudWatch metric.
Check if your cluster is underutilised or over utilised using the CPUUtilisation
CloudWatch metric.
Check if your cluster is running out of disk space and whether you need to consider scaling using the PercentageDiskSpaceUsed
metric.
Applies machine-learning algorithms to the metric's past data to create a model of the metric's expected values.
Define metrics-based performance boundaries for WLM queues and specify what action to take when a query goes beyond those boundaries.
Optimise your cluster based on how much time queries spend on different stages of processing.
This GitHub project provides an advance monitoring system for Amazon Redshift that is completely serverless, based on AWS Lambda and Amazon CloudWatch. A serverless Lambda function runs on a schedule, connects to the configured Redshift cluster, and generates CloudWatch custom alarms for common possible issues.
Using this API, you can access Amazon Redshift data with web services–based applications, including AWS Lambda, AWS AppSync, Amazon SageMaker notebooks, and AWS Cloud9.
The default retention period of 1 day can catch organisations out in case of disaster recovery or rollback. Consider changing to 35 days. You can use the HTTP endpoint to run SQL statements without managing connections. Calls to the Data API are asynchronous.
Consider using Redshift's new RA3 nodes with a mix of local cache and S3 backed elastic storage if compute requirements exceed dense compute or dense storage node levels.
- 📖 Amazon Redshift introduces RA3 nodes with managed storage enabling independent compute and storage scaling
- 📹 AWS re:Invent 2019: [NEW LAUNCH!] Amazon Redshift reimagined: RA3 and AQUA (ANT230)
- 📹 Amazon Redshift RA3 Nodes: Overview and How to Upgrade
Consider using Redshift Spectrum to allow users to query data straight from S3 using their Redshift cluster. This can be used in replacement of a staging schema whereby your staged data lives within your data lake and is read into Redshift via Spectrum.
- 📖 Getting started with Amazon Redshift Spectrum
- 📖 Why you’re better off exporting your data to Redshift Spectrum, instead of Redshift
- 📖 Redshift Spectrum pricing
- 📹 Cost and usage controls for Amazon Redshift
Redshift has recently introduced the ability to pause and resume the cluster within minutes. Take advantage of this feature for non-production clusters to save money.
Consider using elastic resize over classic resize when changing both the node types and the number of nodes within your Redshift cluster. Elastic resize is much quicker (minutes vs hours) and doesn't take your cluster out of commission.
Open an issue or a pull request to suggest changes or additions.