Amazon RDS MySQL Zero-ETL integrations with Amazon Redshift Serverless

This repository provides CDK scripts and sample code to create an Amazon RDS zero-ETL integration with Amazon Redshift Serverless.

An Amazon RDS zero-ETL integration with Amazon Redshift enables near real-time analytics and machine learning (ML) using Amazon Redshift on petabytes of transactional data from RDS.

rds-mysql-zero-etl-integration-with-redshift-serverless

The cdk.json file tells the CDK Toolkit how to execute your app.

This project is set up like a standard Python project. The initialization process also creates a virtualenv within this project, stored under the .venv directory. To create the virtualenv it assumes that there is a python3 (or python for Windows) executable in your path with access to the venv package. If for any reason the automatic creation of the virtualenv fails, you can create the virtualenv manually.

To manually create a virtualenv on MacOS and Linux:

$ git clone --depth=1 https://github.com/aws-samples/aws-kr-startup-samples.git
$ cd aws-kr-startup-samples
$ git sparse-checkout init --cone
$ git sparse-checkout set analytics/zero-etl-integrations/rds-mysql-to-redshift
$ cd analytics/zero-etl-integrations/rds-mysql-to-redshift

$ python3 -m venv .venv

After the init process completes and the virtualenv is created, you can use the following step to activate your virtualenv.

$ source .venv/bin/activate

If you are on a Windows platform, you would activate the virtualenv like this:

% .venv\Scripts\activate.bat

Once the virtualenv is activated, you can install the required dependencies.

(.venv) $ pip install -r requirements.txt

To add additional dependencies, for example other CDK libraries, just add them to your setup.py file and rerun the pip install -r requirements.txt command.

Prerequisites

Set up cdk.context.json

Before synthesizing the CloudFormation, you need to prepare the cdk context configuration file, cdk.context.json:

For example,

{
  "rds_instance_name": "zero-etl-source-rds",
  "redshift": {
    "db_name": "zero-etl-target-rs",
    "namespace": "zero-etl-target-rs-ns",
    "workgroup": "zero-etl-target-rs-wg"
  },
  "zero_etl_integration": {
    "data_filter": "include: demodb.retail_trans",
    "integration_name": "zero-etl-rss"
  }
}
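Inside a CDK stack these values are typically read with `self.node.try_get_context(...)`. A minimal sketch of how the example above might be consumed (using a plain dict in place of the live CDK context; the key names simply mirror the sample file):

```python
import json

# Sample cdk.context.json content from above.
context_json = """
{
  "rds_instance_name": "zero-etl-source-rds",
  "redshift": {
    "db_name": "zero-etl-target-rs",
    "namespace": "zero-etl-target-rs-ns",
    "workgroup": "zero-etl-target-rs-wg"
  },
  "zero_etl_integration": {
    "data_filter": "include: demodb.retail_trans",
    "integration_name": "zero-etl-rss"
  }
}
"""

context = json.loads(context_json)

# In a CDK stack this would be: self.node.try_get_context("redshift")
redshift_cfg = context["redshift"]
print(redshift_cfg["namespace"])  # zero-etl-target-rs-ns
print(context["zero_etl_integration"]["integration_name"])  # zero-etl-rss
```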

Bootstrap AWS environment for AWS CDK app

Also, before any AWS CDK app can be deployed, you have to bootstrap your AWS environment to create certain AWS resources that the AWS CDK CLI (Command Line Interface) uses to deploy your AWS CDK app.

Run the cdk bootstrap command to bootstrap the AWS environment.

(.venv) $ cdk bootstrap

Now you can deploy the CloudFormation template for this code.

List all CDK Stacks

(.venv) $ export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
(.venv) $ export CDK_DEFAULT_REGION=$(aws configure get region)
(.venv) $ cdk list
MySQLVpcStack
MySQLStack
MySQLClientHostStack
RedshiftServerlessStack
ZeroETLfromRDStoRSS

Create MySQL cluster

  (.venv) $ cdk deploy MySQLVpcStack MySQLStack MySQLClientHostStack
  

Create a sample database and table

In order to set up MySQL, you need to connect to the MySQL cluster from an EC2 bastion host.

ℹ️ The MySQL username and password are stored in AWS Secrets Manager under a name such as DatabaseSecret-xxxxxxxxxxxx.

To retrieve a secret (AWS console)

  • (Step 1) Open the Secrets Manager console at https://console.aws.amazon.com/secretsmanager/.
  • (Step 2) In the list of secrets, choose the secret you want to retrieve.
  • (Step 3) In the Secret value section, choose Retrieve secret value.
    Secrets Manager displays the current version (AWSCURRENT) of the secret. To see other versions of the secret, such as AWSPREVIOUS or custom labeled versions, use the AWS CLI.
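From a script, the same secret can be fetched with the AWS CLI (`aws secretsmanager get-secret-value`) or boto3; the `SecretString` of an RDS-managed database secret is a JSON document. A small offline sketch of parsing it (the payload below is illustrative, not real credentials):

```python
import json

# Illustrative SecretString payload, shaped like an RDS database secret.
# In practice you would obtain it with, e.g.:
#   aws secretsmanager get-secret-value --secret-id DatabaseSecret-xxxxxxxxxxxx
secret_string = '{"username": "admin", "password": "example-password"}'

creds = json.loads(secret_string)
mysql_user, mysql_password = creds["username"], creds["password"]
print(mysql_user)  # admin
```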

To confirm that binary logging is enabled

  1. Connect to the cluster writer node.

     $ BASTION_HOST_ID=$(aws cloudformation describe-stacks --stack-name MySQLClientHostStack --region ${CDK_DEFAULT_REGION} | \
     jq -r '.Stacks[0].Outputs | .[] | select(.OutputKey | endswith("EC2InstanceId")) | .OutputValue')
    
     $ aws ec2-instance-connect ssh --instance-id ${BASTION_HOST_ID} --os-user ec2-user
    
     [ec2-user@ip-172-31-7-186 ~]$ mysql -hrds-instance-name.xxxxxxxxxxxx.region-name.rds.amazonaws.com -uadmin -p
     Enter password:
     Welcome to the MariaDB monitor.  Commands end with ; or \g.
     Your MySQL connection id is 20
     Server version: 8.0.23 Source distribution
    
     Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
    
     Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
    
     MySQL [(none)]>
    

    ℹ️ MySQLClientHostStack is the CDK stack that creates the bastion host.

    ℹ️ You can connect to an EC2 instance using the EC2 Instance Connect CLI: aws ec2-instance-connect ssh. For more information, see Connect using the EC2 Instance Connect CLI.

  2. At the SQL prompt, run the following command to confirm that binary logging is enabled:

     MySQL [(none)]> SHOW GLOBAL VARIABLES LIKE "log_bin";
     +---------------+-------+
     | Variable_name | Value |
     +---------------+-------+
     | log_bin       | ON    |
     +---------------+-------+
     1 row in set (0.00 sec)
    
  3. Also run the following so that binary logs are retained long enough for replication:

     MySQL [(none)]> CALL mysql.rds_set_configuration('binlog retention hours', 24);
     Query OK, 0 rows affected (0.01 sec)
    
     MySQL [(none)]> CALL mysql.rds_show_configuration;
     +------------------------+-------+-----------------------------------------------------------------------------------------------------------+
     | name                   | value | description                                                                                               |
     +------------------------+-------+-----------------------------------------------------------------------------------------------------------+
     | binlog retention hours | 24    | binlog retention hours specifies the duration in hours before binary logs are automatically deleted.      |
     | source delay           | 0     | source delay specifies replication delay in seconds between current instance and its master.              |
     | target delay           | 0     | target delay specifies replication delay in seconds between current instance and its future read-replica. |
     +------------------------+-------+-----------------------------------------------------------------------------------------------------------+
     3 rows in set (0.00 sec)
    
     Query OK, 0 rows affected (0.00 sec)
    
     MySQL [(none)]>
    
  4. Run the following commands to create the sample database named demodb.

     MySQL [(none)]> SHOW DATABASES;
     +--------------------+
     | Database           |
     +--------------------+
     | information_schema |
     | mysql              |
     | performance_schema |
     | sys                |
     +--------------------+
     4 rows in set (0.00 sec)
    
     MySQL [(none)]> CREATE DATABASE IF NOT EXISTS demodb;
     Query OK, 1 row affected (0.01 sec)
    
     MySQL [(none)]> USE demodb;
     Database changed
     MySQL [demodb]> SHOW TABLES;
     Empty set (0.00 sec)
    
  5. Then run the following to create the sample table named retail_trans:

     MySQL [demodb]> CREATE TABLE IF NOT EXISTS demodb.retail_trans (
              trans_id BIGINT(20) AUTO_INCREMENT,
              customer_id VARCHAR(12) NOT NULL,
              event VARCHAR(10) DEFAULT NULL,
              sku VARCHAR(10) NOT NULL,
              amount INT DEFAULT 0,
              device VARCHAR(10) DEFAULT NULL,
              trans_datetime DATETIME DEFAULT CURRENT_TIMESTAMP,
              PRIMARY KEY(trans_id),
              KEY(trans_datetime)
            ) ENGINE=InnoDB AUTO_INCREMENT=0;
     Query OK, 0 rows affected, 1 warning (0.04 sec)
    
     MySQL [demodb]> SHOW TABLES;
     +------------------+
     | Tables_in_demodb |
     +------------------+
     | retail_trans     |
     +------------------+
     1 row in set (0.00 sec)
    
     MySQL [demodb]> DESC retail_trans;
     +----------------+-------------+------+-----+-------------------+-------------------+
     | Field          | Type        | Null | Key | Default           | Extra             |
     +----------------+-------------+------+-----+-------------------+-------------------+
     | trans_id       | bigint      | NO   | PRI | NULL              | auto_increment    |
     | customer_id    | varchar(12) | NO   |     | NULL              |                   |
     | event          | varchar(10) | YES  |     | NULL              |                   |
     | sku            | varchar(10) | NO   |     | NULL              |                   |
     | amount         | int         | YES  |     | 0                 |                   |
     | device         | varchar(10) | YES  |     | NULL              |                   |
     | trans_datetime | datetime    | YES  | MUL | CURRENT_TIMESTAMP | DEFAULT_GENERATED |
     +----------------+-------------+------+-----+-------------------+-------------------+
     7 rows in set (0.00 sec)
    
     MySQL [demodb]>
    

After setting up MySQL, return to the terminal where you are deploying the stacks.

Create Redshift Serverless cluster

  (.venv) $ cdk deploy RedshiftServerlessStack
  

Configure authorization for your Amazon Redshift data warehouse

Before you create a zero-ETL integration, you must create a source database and a target Amazon Redshift data warehouse. You also must allow replication into the data warehouse by adding the database as an authorized integration source.

You can configure authorized integration sources from the Resource Policy tab on the Amazon Redshift console or using the Amazon Redshift PutResourcePolicy API operation.

To control the source that can create an inbound integration into the namespace, create a resource policy and attach it to the namespace. With the resource policy, you can specify the source that has access to the integration.

The following is a sample resource policy (e.g., rs-rp.json).

  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Service": "redshift.amazonaws.com"
        },
        "Action": "redshift:AuthorizeInboundIntegration",
        "Resource": "arn:aws:redshift-serverless:{region}:{account-id}:namespace/namespace-uuid",
        "Condition": {
          "StringEquals": {
            "aws:SourceArn": "arn:aws:rds:{region}:{account-id}:db:{rds-instance-name}"
          }
        }
      }
    ]
  }
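The same policy document can also be generated programmatically before calling put-resource-policy. A sketch under the assumption that you already have both ARNs in hand (the ARNs below are placeholders, and the helper name is hypothetical):

```python
import json

def build_inbound_integration_policy(namespace_arn: str, source_rds_arn: str) -> str:
    """Build a resource policy that authorizes one RDS source to create
    an inbound integration into the given Redshift Serverless namespace."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "redshift.amazonaws.com"},
                "Action": "redshift:AuthorizeInboundIntegration",
                "Resource": namespace_arn,
                "Condition": {"StringEquals": {"aws:SourceArn": source_rds_arn}},
            }
        ],
    }
    return json.dumps(policy, indent=2)

# Placeholder ARNs -- substitute the values returned by describe-stacks.
policy_json = build_inbound_integration_policy(
    "arn:aws:redshift-serverless:us-east-1:123456789012:namespace/namespace-uuid",
    "arn:aws:rds:us-east-1:123456789012:db:zero-etl-source-rds",
)
print(policy_json)
```

Writing this string to rs-rp.json reproduces the sample policy above.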
  

ℹ️ You can find out the Amazon RDS Instance ARN by running the following.

  aws cloudformation describe-stacks --stack-name MySQLStack --region {region} | \
  jq -r '.Stacks[0].Outputs | .[] | select(.OutputKey | endswith("RDSZeroETLSourceArn")) | .OutputValue'
  

ℹ️ You can find out the Amazon Redshift namespace ARN by running the following.

  aws cloudformation describe-stacks --stack-name RedshiftServerlessStack --region {region} | \
  jq -r '.Stacks[0].Outputs | .[] | select(.OutputKey | endswith("NamespaceNameArn")) | .OutputValue'
  

To put a resource policy on your Amazon Redshift namespace ARN for a MySQL source, run an AWS CLI command similar to the following.

  (.venv) $ export RSS_RESOURCE_ARN=$(aws cloudformation describe-stacks --stack-name RedshiftServerlessStack --region ${CDK_DEFAULT_REGION} | \
  jq -r '.Stacks[0].Outputs | .[] | select(.OutputKey | endswith("NamespaceNameArn")) | .OutputValue')
  (.venv) $ aws redshift put-resource-policy \
                --region ${CDK_DEFAULT_REGION} \
                --policy file://rs-rp.json \
                --resource-arn ${RSS_RESOURCE_ARN}
  

Create Zero ETL Integration with filters

In this example we only want to replicate data from the MySQL table demodb.retail_trans to Redshift. So we add the data filtering option to cdk.context.json like this:

   {
     ...
     "zero_etl_integration": {
       "data_filter": "include: demodb.retail_trans",
       "integration_name": "zero-etl-rss"
     }
   }
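As a quick sanity check before deploying, the filter expression can be validated locally. A minimal sketch, assuming the `include:`/`exclude:` prefix with comma-separated `database.table` patterns shown above (the helper is illustrative, not part of this repository):

```python
def parse_data_filter(expression: str) -> tuple[str, list[str]]:
    """Split a filter expression like 'include: demodb.retail_trans'
    into its action and its database.table patterns."""
    action, _, patterns = expression.partition(":")
    action = action.strip().lower()
    if action not in ("include", "exclude"):
        raise ValueError(f"unexpected filter action: {action!r}")
    return action, [p.strip() for p in patterns.split(",") if p.strip()]

action, patterns = parse_data_filter("include: demodb.retail_trans")
print(action, patterns)  # include ['demodb.retail_trans']
```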
  

ℹ️ Data filtering for Amazon RDS zero-ETL integrations with Amazon Redshift

Now let's create the zero-ETL integration. It takes a few minutes for the status of the zero-ETL integration to change from Creating to Active. The time varies depending on the size of the dataset already available in the source.

  (.venv) $ cdk deploy ZeroETLfromRDStoRSS
  

Test Zero-ETL Integration

(1) Load Data Into the MySQL Cluster

  $ BASTION_HOST_ID=$(aws cloudformation describe-stacks --stack-name MySQLClientHostStack --region ${CDK_DEFAULT_REGION} | jq -r '.Stacks[0].Outputs | .[] | select(.OutputKey | endswith("EC2InstanceId")) | .OutputValue')

  $ aws ec2-instance-connect ssh --instance-id ${BASTION_HOST_ID} --os-user ec2-user

  [ec2-user@ip-172-31-7-186 ~]$ python3 gen_fake_mysql_data.py \
                                  --database {your-database-name} \
                                  --table {your-table-name} \
                                  --user {user-name} \
                                  --password {password} \
                                  --host {rds-instance-name}.xxxxxxxxxxxx.{region-name}.rds.amazonaws.com \
                                  --max-count 10
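gen_fake_mysql_data.py ships with this repository; conceptually it produces rows shaped like the retail_trans table and inserts them over a MySQL connection. A simplified offline sketch of the row-generation idea (no database connection; the field choices are illustrative, not the script's exact logic):

```python
import random
import string
from datetime import datetime

def fake_retail_trans_row() -> dict:
    """Generate one record shaped like the demodb.retail_trans table."""
    event = random.choice(["visit", "list", "like", "cart", "purchase"])
    return {
        "customer_id": "".join(random.choices(string.digits, k=12)),
        "event": event,
        "sku": "".join(random.choices(string.ascii_uppercase + string.digits, k=10)),
        "amount": random.randint(1, 100) if event in ("cart", "purchase") else 1,
        "device": random.choice(["pc", "mobile", "tablet"]),
        "trans_datetime": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    }

rows = [fake_retail_trans_row() for _ in range(10)]
print(rows[0])
```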
  

After loading data into the MySQL table, connect to the primary instance and run some queries.

For example, retrieve some records.

  $ mysql -hrds-instance-name.xxxxxxxxxxxx.region-name.rds.amazonaws.com -uadmin -p
  Enter password:
  Welcome to the MariaDB monitor.  Commands end with ; or \g.
  Your MySQL connection id is 20
  Server version: 8.0.23 Source distribution

  Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

  Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

  MySQL [(none)]> USE demodb;
  Database changed

  MySQL [demodb]> SELECT count(*) FROM retail_trans;
  +----------+
  | count(*) |
  +----------+
  |      10  |
  +----------+
  1 row in set (0.01 sec)

  MySQL [demodb]> SELECT * FROM retail_trans LIMIT 10;
  +----------+--------------+----------+------------+--------+--------+---------------------+
  | trans_id | customer_id  | event    | sku        | amount | device | trans_datetime      |
  +----------+--------------+----------+------------+--------+--------+---------------------+
  |        1 | 460104780596 | cart     | IQ6879MMTB |      8 | mobile | 2023-01-16 06:08:06 |
  |        2 | 758933025159 | like     | RL1573WWLT |      1 | tablet | 2023-01-16 06:17:21 |
  |        3 | 754384589074 | like     | PX4135DYNT |      1 | mobile | 2023-01-16 06:08:52 |
  |        4 | 602811489876 | purchase | PI7913TREO |     66 | pc     | 2023-01-16 06:01:07 |
  |        5 | 222732129586 | like     | AS6987HGLN |      1 | mobile | 2023-01-16 06:09:06 |
  |        6 | 387378799012 | list     | AI6161BEFX |      1 | pc     | 2023-01-16 06:10:27 |
  |        7 | 843982894991 | cart     | DA7930CJBR |     81 | pc     | 2023-01-16 06:11:41 |
  |        8 | 818177069814 | like     | JS6166YPTE |      1 | pc     | 2023-01-16 06:17:08 |
  |        9 | 248083404876 | visit    | AS8552DVOO |      1 | pc     | 2023-01-16 06:24:39 |
  |       10 | 731184658511 | visit    | XZ9997LSJN |      1 | tablet | 2023-01-16 06:12:18 |
  +----------+--------------+----------+------------+--------+--------+---------------------+
  10 rows in set (0.00 sec)
  

(2) Create a database from the integration in Amazon Redshift

To create your database, complete the following steps:

  1. On the Redshift Serverless dashboard, navigate to the zero-etl-target-rs-ns namespace.

  2. Choose Query data to open Query Editor v2.

  3. Connect to the Redshift Serverless data warehouse by choosing Create connection.

  4. Obtain the integration_id from the svv_integration system table:

    ---- copy this result, use in the next sql
    SELECT integration_id FROM svv_integration;
    
  5. Use the integration_id from the previous step to create a new database from the integration:

    CREATE DATABASE aurora_zeroetl FROM INTEGRATION '<result from above>';
    

    The integration is now complete, and an entire snapshot of the source will reflect as is in the destination. Ongoing changes will be synced in near-real time.

  6. On the Redshift Serverless dashboard, open Query Editor v2 using the database you created as part of the integration setup. Use the following query to get information about the checkpoint, snapshot, and subsequent CDC data replication:

    SELECT * FROM SYS_INTEGRATION_ACTIVITY
    WHERE TRUNC(INTEGRATION_START_TIME)= CURRENT_DATE
    ORDER BY INTEGRATION_START_TIME;
    

Clean Up

  1. To delete a zero-ETL integration, run the following command.

    (.venv) $ cdk destroy --force -e ZeroETLfromRDStoRSS
    

    When you delete a zero-ETL integration, your transactional data isn’t deleted from Amazon RDS or Amazon Redshift, but Amazon RDS doesn’t send new data to Amazon Redshift.

  2. If you want to delete all CloudFormation stacks, run the following command.

    (.venv) $ cdk destroy --force --all
    

Useful commands

CDK CLI Installation

  • npm install -g aws-cdk install the AWS CDK Toolkit (the cdk command).
  • npm install -g aws-cdk@latest install the latest AWS CDK Toolkit (the cdk command).

CDK CLI commands

  • cdk init app --language python create a new, empty CDK Python project.
  • cdk bootstrap --profile <AWS Profile> Deploys the CDK Toolkit staging stack; see Bootstrapping
  • cdk ls list all stacks in the app
  • cdk synth emits the synthesized CloudFormation template
  • cdk deploy deploy this stack to your default AWS account/region
  • cdk diff compare deployed stack with current state
  • cdk docs open CDK documentation

Enjoy!

References