This repository provides CDK scripts and sample code to create an Amazon RDS zero-ETL integration with Amazon Redshift Serverless.
An Amazon RDS zero-ETL integration with Amazon Redshift enables near real-time analytics and machine learning (ML) using Amazon Redshift on petabytes of transactional data from RDS.
The `cdk.json` file tells the CDK Toolkit how to execute your app.
This project is set up like a standard Python project. The initialization process also creates a virtualenv within this project, stored under the `.venv` directory. To create the virtualenv it assumes that there is a `python3` (or `python` for Windows) executable in your path with access to the `venv` package. If for any reason the automatic creation of the virtualenv fails, you can create the virtualenv manually.
To manually create a virtualenv on macOS and Linux:

```
$ git clone --depth=1 https://github.com/aws-samples/aws-kr-startup-samples.git
$ cd aws-kr-startup-samples
$ git sparse-checkout init --cone
$ git sparse-checkout set analytics/zero-etl-integrations/rds-mysql-to-redshift
$ cd analytics/zero-etl-integrations/rds-mysql-to-redshift

$ python3 -m venv .venv
```
After the init process completes and the virtualenv is created, you can use the following step to activate your virtualenv.

```
$ source .venv/bin/activate
```

If you are on a Windows platform, you would activate the virtualenv like this:

```
% .venv\Scripts\activate.bat
```
Once the virtualenv is activated, you can install the required dependencies.

```
(.venv) $ pip install -r requirements.txt
```
To add additional dependencies, for example other CDK libraries, just add them to your `setup.py` file and rerun the `pip install -r requirements.txt` command.
Set up cdk.context.json
Before synthesizing the CloudFormation template, you need to prepare the CDK context configuration file, `cdk.context.json`:
For example,
{ "rds_instance_name": "zero-etl-source-rds", "redshift": { "db_name": "zero-etl-target-rs", "namespace": "zero-etl-target-rs-ns", "workgroup": "zero-etl-target-rs-wg" }, "zero_etl_integration": { "data_filter": "include: demodb.retail_trans", "integration_name": "zero-etl-rss" } }
Bootstrap AWS environment for AWS CDK app
Also, before any AWS CDK app can be deployed, you have to bootstrap your AWS environment to create certain AWS resources that the AWS CDK CLI (Command Line Interface) uses to deploy your AWS CDK app.
Run the `cdk bootstrap` command to bootstrap the AWS environment.

```
(.venv) $ cdk bootstrap
```
Now you can deploy the CloudFormation template for this code.
```
(.venv) $ export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
(.venv) $ export CDK_DEFAULT_REGION=$(aws configure get region)
(.venv) $ cdk list
MySQLVpcStack
MySQLStack
MySQLClientHostStack
RedshiftServerlessStack
ZeroETLfromRDStoRSS

(.venv) $ cdk deploy MySQLVpcStack MySQLStack MySQLClientHostStack
```
To set up MySQL, you need to connect to the MySQL instance from an EC2 bastion host.
ℹ️ The MySQL `username` and `password` are stored in AWS Secrets Manager under a name such as `DatabaseSecret-xxxxxxxxxxxx`.
To retrieve a secret (AWS console)
- (Step 1) Open the Secrets Manager console at https://console.aws.amazon.com/secretsmanager/.
- (Step 2) In the list of secrets, choose the secret you want to retrieve.
- (Step 3) In the Secret value section, choose Retrieve secret value.
Secrets Manager displays the current version (`AWSCURRENT`) of the secret. To see other versions of the secret, such as `AWSPREVIOUS` or custom labeled versions, use the AWS CLI.
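You can also retrieve the secret value from the command line instead of the console. The sketch below assumes the secret stores a JSON document with `username` and `password` keys (the default for database secrets generated by the CDK); replace the secret name with the one shown in your account.

```
# The secret name below is a placeholder; look yours up in the Secrets Manager
# console or with `aws secretsmanager list-secrets`.
$ aws secretsmanager get-secret-value \
    --secret-id DatabaseSecret-xxxxxxxxxxxx \
    --query SecretString \
    --output text | jq -r '"username: \(.username)\npassword: \(.password)"'
```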
To confirm that binary logging is enabled

- Connect to the database instance.

  ```
  $ BASTION_HOST_ID=$(aws cloudformation describe-stacks --stack-name MySQLClientHostStack --region ${CDK_DEFAULT_REGION} | \
      jq -r '.Stacks[0].Outputs | .[] | select(.OutputKey | endswith("EC2InstanceId")) | .OutputValue')

  $ aws ec2-instance-connect ssh --instance-id ${BASTION_HOST_ID} --os-user ec2-user

  [ec2-user@ip-172-31-7-186 ~]$ mysql -hrds-instance-name.xxxxxxxxxxxx.region-name.rds.amazonaws.com -uadmin -p
  Enter password:
  Welcome to the MariaDB monitor.  Commands end with ; or \g.
  Your MySQL connection id is 20
  Server version: 8.0.23 Source distribution

  Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

  Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

  MySQL [(none)]>
  ```
  ℹ️ `MySQLClientHostStack` is a CDK stack that creates the bastion host.

  ℹ️ You can connect to an EC2 instance using the EC2 Instance Connect CLI: `aws ec2-instance-connect ssh`. For more information, see Connect using the EC2 Instance Connect CLI.

- At the SQL prompt, run the below command to confirm that binary logging is enabled:

  ```
  MySQL [(none)]> SHOW GLOBAL VARIABLES LIKE "log_bin";
  +---------------+-------+
  | Variable_name | Value |
  +---------------+-------+
  | log_bin       | ON    |
  +---------------+-------+
  1 row in set (0.00 sec)
  ```
- Also run the following so that binary logs are retained long enough for replication:

  ```
  MySQL [(none)]> CALL mysql.rds_set_configuration('binlog retention hours', 24);
  Query OK, 0 rows affected (0.01 sec)

  MySQL [(none)]> CALL mysql.rds_show_configuration;
  +------------------------+-------+------------------------------------------------------------------------------------------------------------+
  | name                   | value | description                                                                                                |
  +------------------------+-------+------------------------------------------------------------------------------------------------------------+
  | binlog retention hours | 24    | binlog retention hours specifies the duration in hours before binary logs are automatically deleted.      |
  | source delay           | 0     | source delay specifies replication delay in seconds between current instance and its master.              |
  | target delay           | 0     | target delay specifies replication delay in seconds between current instance and its future read-replica. |
  +------------------------+-------+------------------------------------------------------------------------------------------------------------+
  3 rows in set (0.00 sec)

  Query OK, 0 rows affected (0.00 sec)

  MySQL [(none)]>
  ```
- Run the below command to create the sample database named `demodb`.

  ```
  MySQL [(none)]> SHOW DATABASES;
  +--------------------+
  | Database           |
  +--------------------+
  | information_schema |
  | mysql              |
  | performance_schema |
  | sys                |
  +--------------------+
  4 rows in set (0.00 sec)

  MySQL [(none)]> CREATE DATABASE IF NOT EXISTS demodb;
  Query OK, 1 row affected (0.01 sec)

  MySQL [(none)]> USE demodb;
  Database changed

  MySQL [demodb]> SHOW TABLES;
  Empty set (0.00 sec)
  ```
- Also run this to create the sample table named `retail_trans`:

  ```
  MySQL [demodb]> CREATE TABLE IF NOT EXISTS demodb.retail_trans (
            trans_id BIGINT(20) AUTO_INCREMENT,
            customer_id VARCHAR(12) NOT NULL,
            event VARCHAR(10) DEFAULT NULL,
            sku VARCHAR(10) NOT NULL,
            amount INT DEFAULT 0,
            device VARCHAR(10) DEFAULT NULL,
            trans_datetime DATETIME DEFAULT CURRENT_TIMESTAMP,
            PRIMARY KEY(trans_id),
            KEY(trans_datetime)
          ) ENGINE=InnoDB AUTO_INCREMENT=0;
  Query OK, 0 rows affected, 1 warning (0.04 sec)

  MySQL [demodb]> SHOW TABLES;
  +------------------+
  | Tables_in_demodb |
  +------------------+
  | retail_trans     |
  +------------------+
  1 row in set (0.00 sec)

  MySQL [demodb]> DESC retail_trans;
  +----------------+-------------+------+-----+-------------------+-------------------+
  | Field          | Type        | Null | Key | Default           | Extra             |
  +----------------+-------------+------+-----+-------------------+-------------------+
  | trans_id       | bigint      | NO   | PRI | NULL              | auto_increment    |
  | customer_id    | varchar(12) | NO   |     | NULL              |                   |
  | event          | varchar(10) | YES  |     | NULL              |                   |
  | sku            | varchar(10) | NO   |     | NULL              |                   |
  | amount         | int         | YES  |     | 0                 |                   |
  | device         | varchar(10) | YES  |     | NULL              |                   |
  | trans_datetime | datetime    | YES  | MUL | CURRENT_TIMESTAMP | DEFAULT_GENERATED |
  +----------------+-------------+------+-----+-------------------+-------------------+
  7 rows in set (0.00 sec)

  MySQL [demodb]>
  ```
After setting up MySQL, go back to the terminal where you are deploying the stacks.

```
(.venv) $ cdk deploy RedshiftServerlessStack
```
Before you create a zero-ETL integration, you must create a source database and a target Amazon Redshift data warehouse. You also must allow replication into the data warehouse by adding the database as an authorized integration source.
You can configure authorized integration sources from the Resource Policy tab on the Amazon Redshift console or using the Amazon Redshift `PutResourcePolicy` API operation.
To control the source that can create an inbound integration into the namespace, create a resource policy and attach it to the namespace. With the resource policy, you can specify the source that has access to the integration.
The following is a sample resource policy (e.g., `rs-rp.json`).

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "redshift.amazonaws.com"
      },
      "Action": "redshift:AuthorizeInboundIntegration",
      "Resource": "arn:aws:redshift-serverless:{region}:{account-id}:namespace/namespace-uuid",
      "Condition": {
        "StringEquals": {
          "aws:SourceArn": "arn:aws:rds:{region}:{account-id}:db:{rds-instance-name}"
        }
      }
    }
  ]
}
```
ℹ️ You can find the Amazon RDS instance ARN by running the following.

```
aws cloudformation describe-stacks --stack-name MySQLStack --region {region} | \
  jq -r '.Stacks[0].Outputs | .[] | select(.OutputKey | endswith("RDSZeroETLSourceArn")) | .OutputValue'
```
ℹ️ You can find the Amazon Redshift namespace ARN by running the following.

```
aws cloudformation describe-stacks --stack-name RedshiftServerlessStack --region {region} | \
  jq -r '.Stacks[0].Outputs | .[] | select(.OutputKey | endswith("NamespaceNameArn")) | .OutputValue'
```
To put a resource policy on your Amazon Redshift namespace ARN for a MySQL source, run an AWS CLI command similar to the following.

```
(.venv) $ export RSS_RESOURCE_ARN=$(aws cloudformation describe-stacks --stack-name RedshiftServerlessStack --region ${CDK_DEFAULT_REGION} | \
            jq -r '.Stacks[0].Outputs | .[] | select(.OutputKey | endswith("NamespaceNameArn")) | .OutputValue')

(.venv) $ aws redshift put-resource-policy \
            --region ${CDK_DEFAULT_REGION} \
            --policy file://rs-rp.json \
            --resource-arn ${RSS_RESOURCE_ARN}
```
In this example we only want to replicate data from the MySQL table `demodb.retail_trans` to Redshift, so we add the data filtering option to `cdk.context.json` like this:

```json
{
  ...
  "zero_etl_integration": {
    "data_filter": "include: demodb.retail_trans",
    "integration_name": "zero-etl-rss"
  }
}
```
ℹ️ Data filtering for Amazon RDS zero-ETL integrations with Amazon Redshift
Now let's create the zero-ETL integration. It takes a few minutes for the status of the zero-ETL integration to change from Creating to Active. The time varies depending on the size of the dataset already available in the source.

```
(.venv) $ cdk deploy ZeroETLfromRDStoRSS
```
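You can watch the integration status from the command line while the stack is deploying. This is a minimal sketch using the RDS `describe-integrations` API; the `--query` filter is an assumption, so adjust it to match the actual response fields in your account.

```
# List zero-ETL integrations and their current status (e.g., creating, active).
$ aws rds describe-integrations \
    --region ${CDK_DEFAULT_REGION} \
    --query 'Integrations[*].[IntegrationName,Status]' \
    --output table
```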
Connect to the EC2 bastion host again and generate sample data in the MySQL table.

```
$ BASTION_HOST_ID=$(aws cloudformation describe-stacks --stack-name MySQLClientHostStack --region ${CDK_DEFAULT_REGION} | \
    jq -r '.Stacks[0].Outputs | .[] | select(.OutputKey | endswith("EC2InstanceId")) | .OutputValue')

$ aws ec2-instance-connect ssh --instance-id ${BASTION_HOST_ID} --os-user ec2-user

[ec2-user@ip-172-31-7-186 ~]$ python3 gen_fake_mysql_data.py \
    --database {your-database-name} \
    --table {your-table-name} \
    --user {user-name} \
    --password {password} \
    --host {rds-instance-name}.xxxxxxxxxxxx.{region-name}.rds.amazonaws.com \
    --max-count 10
```
After filling data into the MySQL table, connect to the primary instance and run some queries.
For example, retrieve some records.
```
$ mysql -hrds-instance-name.xxxxxxxxxxxx.region-name.rds.amazonaws.com -uadmin -p
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MySQL connection id is 20
Server version: 8.0.23 Source distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MySQL [(none)]> USE demodb;
Database changed

MySQL [demodb]> SELECT count(*) FROM retail_trans;
+----------+
| count(*) |
+----------+
|       10 |
+----------+
1 row in set (0.01 sec)

MySQL [demodb]> SELECT * FROM retail_trans LIMIT 10;
+----------+--------------+----------+------------+--------+--------+---------------------+
| trans_id | customer_id  | event    | sku        | amount | device | trans_datetime      |
+----------+--------------+----------+------------+--------+--------+---------------------+
|        1 | 460104780596 | cart     | IQ6879MMTB |      8 | mobile | 2023-01-16 06:08:06 |
|        2 | 758933025159 | like     | RL1573WWLT |      1 | tablet | 2023-01-16 06:17:21 |
|        3 | 754384589074 | like     | PX4135DYNT |      1 | mobile | 2023-01-16 06:08:52 |
|        4 | 602811489876 | purchase | PI7913TREO |     66 | pc     | 2023-01-16 06:01:07 |
|        5 | 222732129586 | like     | AS6987HGLN |      1 | mobile | 2023-01-16 06:09:06 |
|        6 | 387378799012 | list     | AI6161BEFX |      1 | pc     | 2023-01-16 06:10:27 |
|        7 | 843982894991 | cart     | DA7930CJBR |     81 | pc     | 2023-01-16 06:11:41 |
|        8 | 818177069814 | like     | JS6166YPTE |      1 | pc     | 2023-01-16 06:17:08 |
|        9 | 248083404876 | visit    | AS8552DVOO |      1 | pc     | 2023-01-16 06:24:39 |
|       10 | 731184658511 | visit    | XZ9997LSJN |      1 | tablet | 2023-01-16 06:12:18 |
+----------+--------------+----------+------------+--------+--------+---------------------+
10 rows in set (0.00 sec)
```
To create your database, complete the following steps:

- On the Redshift Serverless dashboard, navigate to the `zero-etl-target-rs-ns` namespace.
- Connect to the Redshift Serverless data warehouse by choosing Create connection.
- Obtain the `integration_id` from the `svv_integration` system table:

  ```sql
  ---- copy this result, use in the next sql
  SELECT integration_id FROM svv_integration;
  ```

- Use the `integration_id` from the previous step to create a new database from the integration:

  ```sql
  CREATE DATABASE aurora_zeroetl FROM INTEGRATION '<result from above>';
  ```
The integration is now complete, and an entire snapshot of the source will be reflected as is in the destination. Ongoing changes will be synced in near real time.
- On the Redshift Serverless dashboard, open Query Editor v2 using the database you created as part of the integration setup. Use the following query to get information about checkpoint, snapshot, and subsequent CDC data replication:

  ```sql
  SELECT * FROM SYS_INTEGRATION_ACTIVITY
  WHERE TRUNC(INTEGRATION_START_TIME) = CURRENT_DATE
  ORDER BY INTEGRATION_START_TIME;
  ```
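You can also run a quick verification query against the replicated table without opening Query Editor v2, for example through the Amazon Redshift Data API. This is a sketch under the assumption that the workgroup is named `zero-etl-target-rs-wg` (as in `cdk.context.json`) and that the `aurora_zeroetl` database exposes the source schema as `demodb`; verify both names in your environment.

```
# Submit the query to the Redshift Serverless workgroup ...
$ STATEMENT_ID=$(aws redshift-data execute-statement \
    --region ${CDK_DEFAULT_REGION} \
    --workgroup-name zero-etl-target-rs-wg \
    --database aurora_zeroetl \
    --sql 'SELECT COUNT(*) FROM demodb.retail_trans;' \
    --query Id --output text)

# ... then fetch the result once the statement has finished.
$ aws redshift-data get-statement-result --id ${STATEMENT_ID} --region ${CDK_DEFAULT_REGION}
```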
- To delete a zero-ETL integration, run the below command.

  ```
  (.venv) $ cdk destroy --force -e ZeroETLfromRDStoRSS
  ```

  When you delete a zero-ETL integration, your transactional data isn’t deleted from Amazon RDS or Amazon Redshift, but Amazon RDS doesn’t send new data to Amazon Redshift.
- If you want to delete all CloudFormation stacks, run the below command.

  ```
  (.venv) $ cdk destroy --force --all
  ```
Useful commands

- `npm install -g aws-cdk` install the AWS CDK Toolkit (the `cdk` command).
- `npm install -g aws-cdk@latest` install the latest AWS CDK Toolkit (the `cdk` command).
- `cdk init app --language python` create a new, empty CDK Python project.
- `cdk bootstrap --profile <AWS Profile>` deploys the CDK Toolkit staging stack; see Bootstrapping.
- `cdk ls` list all stacks in the app.
- `cdk synth` emits the synthesized CloudFormation template.
- `cdk deploy` deploy this stack to your default AWS account/region.
- `cdk diff` compare deployed stack with current state.
- `cdk docs` open CDK documentation.
Enjoy!
- (User Guide) Amazon RDS zero-ETL integrations with Amazon Redshift
- (Management Guide) Amazon Redshift - Zero-ETL integrations
- Supported Regions and DB engines for Amazon RDS zero-ETL integrations with Amazon Redshift
- (AWS Blog) Getting started guide for near-real time operational analytics using Amazon Aurora zero-ETL integration with Amazon Redshift (2023-06-28)
- (AWS Blog) Unlock insights on Amazon RDS for MySQL data with zero-ETL integration to Amazon Redshift (2024-03-21)
- (Workshop) Zero ETL Integration
- (Workshop) Achieve near real-time data analytics with Zero-ETL on AWS
- AWS CDK Toolkit (cdk command)