Redshift batch inserts using COPY FROM operation #25866


Open: wants to merge 1 commit into master from redshift-copy

Conversation

@brendanstennett commented May 26, 2025

Description

Fixes #24546

This PR aims to allow the use of COPY FROM statements when sinking data into Redshift.

The Redshift connector inherits from BaseJdbcConnector, which executes sink operations with batched INSERT statements. Even in non-transactional mode, this pushes only about 1,000 rows per second. This change stages the rows to a Parquet file first, then issues a COPY FROM statement to load the table. We are seeing 250K rows per second or more with this method.
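For context, the staged-load flow boils down to a Redshift COPY statement of roughly this shape. This is a sketch, not the exact statement the connector emits; the schema, table, staged file name, and option spelling are placeholders, while the bucket prefix and role ARN match the configuration shown below:

```sql
-- Hypothetical illustration: load rows staged as Parquet in S3 into the target table.
COPY my_schema.my_table
FROM 's3://my-bucket/my-prefix/staged-file.parquet'
IAM_ROLE 'arn:aws:iam::123456789000:role/redshift_iam_role'
FORMAT AS PARQUET;
```

A single COPY of a staged file lets Redshift ingest the data in bulk, which is where the throughput gain over row-batched INSERTs comes from.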

This has been running in production for 2+ months on our own branch.

This functionality needs to be enabled by specifying the following config option:

redshift.batched-inserts-copy-location=s3://my-bucket/my-prefix

The following options are also required when specifying the above:

redshift.batched-inserts-copy-iam-role=arn:aws:iam::123456789000:role/redshift_iam_role
s3.region=region
s3.aws-access-key=KEY
s3.aws-secret-key=SECRET
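Put together, a minimal catalog file might look like the following. This is a sketch: the connection URL, user, and password are placeholders for your own cluster, and the region is an example value:

```properties
# etc/catalog/redshift.properties (example values)
connector.name=redshift
connection-url=jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev
connection-user=admin
connection-password=secret
redshift.batched-inserts-copy-location=s3://my-bucket/my-prefix
redshift.batched-inserts-copy-iam-role=arn:aws:iam::123456789000:role/redshift_iam_role
s3.region=us-east-1
s3.aws-access-key=KEY
s3.aws-secret-key=SECRET
```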

A suggested IAM policy for this role and user:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Action": [
				"s3:ListBucket"
			],
			"Resource": "arn:aws:s3:::my-bucket"
		},
		{
			"Effect": "Allow",
			"Action": [
				"s3:GetObject",
				"s3:PutObject",
				"s3:DeleteObject"
			],
			"Resource": "arn:aws:s3:::my-bucket/*"
		}
	]
}

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Redshift
* Add support for Redshift COPY FROM statements for batch insert operations ({issue}`24546`)

@cla-bot added the cla-signed label May 26, 2025
@github-actions added the redshift (Redshift connector) label May 26, 2025
@brendanstennett force-pushed the redshift-copy branch 2 times, most recently from a2827db to 9187510, May 26, 2025 18:31
@ebyhr added the needs-docs (This pull request requires changes to the documentation) label May 26, 2025
@brendanstennett changed the title from "[WIP] Redshift batch inserts using COPY FROM operation" to "Redshift batch inserts using COPY FROM operation" May 30, 2025
@github-actions added the docs label May 30, 2025
@brendanstennett (Author) commented:

@ebyhr Added the requested documentation. We need an environment variable set on the repo for the tests to pass: REDSHIFT_S3_COPY_ROOT, pointing to a location similar to the one already set in REDSHIFT_S3_UNLOAD_ROOT.

@brendanstennett brendanstennett marked this pull request as ready for review June 2, 2025 13:19
Labels
cla-signed · docs · needs-docs (This pull request requires changes to the documentation) · redshift (Redshift connector)
Development

Successfully merging this pull request may close these issues.

Support Redshift COPY for bulk loads
2 participants