Reason to build this library.
A library designed to convert SQL data into a CSV.GZIP file and/or upload it to AWS S3.
I worked on a Go library which queries a SQL database and uploads the data in csv.gzip format to AWS S3. In this post, I try to describe my reasons for creating the library and the problems I was trying to solve. There might be better tools out there to achieve the same thing; however, at the time of writing I was not aware of any single tool which met my needs. Hence, I took this opportunity to create another open source project.
One of the common tasks developers do is retrieving data from a database and dumping it to external storage. AWS S3 is widely preferred as the backend storage because of its cost and availability. Data is often exported in CSV format, as it is quite popular and compatible with many programs that data scientists use.
There are a couple of reasons to dump data from a database to AWS S3.
- If the database is getting too big, archiving data in a relatively cheap S3 bucket saves money. Developers can then trim the database to keep it performant.
- As a security measure, data scientists or other teams may not have access to the database. Storing data in S3 makes it convenient for other teams to access and analyze the data.
We can write a simple bash script to achieve this.
mysql -D database -e "SELECT * INTO OUTFILE 'orders.csv' FIELDS TERMINATED BY ',' ENCLOSED BY '\"' ESCAPED BY '' LINES TERMINATED BY '\n' FROM orders;"
aws s3 cp ./orders.csv s3://myBucket/orders.csv
This approach works when we have a relatively small amount of data. However, the script's apparent stability breaks down as soon as we try to run it as part of a Docker container or Kubernetes job.
- We start to notice that our pods are getting evicted because of disk pressure. We come to the conclusion that our data has grown exponentially in size over time. Our Kubernetes job cannot run to completion because the pod runs out of disk space.
- Our S3 data usage increases every day since we are storing uncompressed CSV.
- Additionally, we also get charged for high AWS traffic.
We update our script to compress the CSV file before uploading. Since we are compressing data before uploading, this solves the second and third issues. However, the first one remains.
mysql -D database -e "SELECT * INTO OUTFILE 'orders.csv' FIELDS TERMINATED BY ',' ENCLOSED BY '\"' ESCAPED BY '' LINES TERMINATED BY '\n' FROM orders;"
cat orders.csv | gzip | s3cmd put - s3://myBucket/orders.csv.gz
At this point we start noticing various other issues:
- In order to generate the gzip file, we first need to create the CSV file. Even if we delete the CSV file after the gzip file is generated, our container still needs enough storage space to accommodate both the gzip file and the CSV file.
- If we manage to generate the gzip file without writing the CSV file to disk (using memory instead):
  - Our container still needs storage space to accommodate the gzip file.
  - Our container's memory usage would likely increase in order to keep the CSV data in memory.
- We could try uploading in batches as a workaround.
  - Iterating over batches makes a simple query complicated.
  - Iterating over batches using counters hits both application and database performance.
  - A long-lived SQL connection with retries adds to the complexity.
- Even if we find a command to stream directly into AWS S3, there would be issues.
  - We want our application to work within limited memory. We will not have enough memory to store either the CSV or the gzip in memory.
  - Streaming directly to S3 is a problem because we want to have some retry strategy.
  - In case of a retry, we have to make sure we resume our upload rather than restart it.
- If using AWS Aurora, we can use the `SELECT ... INTO OUTFILE S3` syntax, which copies data from Aurora over to AWS S3.
  - This only works with AWS databases.
  - This means we have to take care of one additional step when we move data out of Aurora.
  - We also need to configure IAM policies, which, depending on your security model, may not be allowed.
These were some of the issues I was facing myself. Hence, at this point I decided to write the library. Below are some of the goals I wanted this library to achieve:
Basic Goals:
- It should get data from a SQL database and convert it into a CSV file.
- It should be able to compress the CSV data to gzip.
- It should be able to upload the result to AWS S3.
Advanced Goals:
- It should not write anything to local disk.
  - This would enable me to use read-only pod volumes as a podSecurityPolicy.
- Given any size of data, it should use consistent memory, CPU, and bandwidth.
  - The aim was not to utilize the minimum possible resources. The goal was to ensure that the library uses the same amount of resources for the given buffer sizes.
  - The resources required should not depend on the total amount of data ingested.
- It should have resiliency while uploading.
- It should utilize concurrency.
Here is how the library meets these goals.

Basic goals:
- The `database/sql` package was used in conjunction with joho/sqltocsv to generate the CSV.
- The `compress/gzip` package was used to convert the CSV to gzip.
- aws-sdk-go was used to upload to S3.
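Below is a minimal sketch of that basic pipeline, kept self-contained for illustration: it uses encoding/csv directly instead of joho/sqltocsv, and the connection string, query, bucket, and key are placeholders, so this is not the library's actual code.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"database/sql"
	"encoding/csv"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN, query, bucket, and key.
	db, err := sql.Open("mysql", "user:password@tcp(localhost:3306)/database")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query("SELECT id, amount FROM orders")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	// CSV rows are written straight through a gzip writer into an in-memory buffer.
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	w := csv.NewWriter(gz)

	cols, err := rows.Columns()
	if err != nil {
		log.Fatal(err)
	}
	if err := w.Write(cols); err != nil { // header row
		log.Fatal(err)
	}

	vals := make([]sql.RawBytes, len(cols))
	dest := make([]interface{}, len(cols))
	for i := range vals {
		dest[i] = &vals[i]
	}
	for rows.Next() {
		if err := rows.Scan(dest...); err != nil {
			log.Fatal(err)
		}
		record := make([]string, len(cols))
		for i, v := range vals {
			record[i] = string(v)
		}
		if err := w.Write(record); err != nil {
			log.Fatal(err)
		}
	}
	w.Flush()
	if err := gz.Close(); err != nil {
		log.Fatal(err)
	}

	// Region and credentials are taken from the environment.
	sess := session.Must(session.NewSession())
	uploader := s3manager.NewUploader(sess)
	if _, err := uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String("myBucket"),
		Key:    aws.String("orders.csv.gz"),
		Body:   bytes.NewReader(buf.Bytes()),
	}); err != nil {
		log.Fatal(err)
	}
}
```

Note that this sketch holds the entire gzipped payload in a single in-memory buffer, which is exactly the limitation the advanced goals below address.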
Advanced goals:
- In order to bypass the local disk, I used two additional in-memory buffers, one for CSV data and one for gzip data. The gzip buffer is what gets uploaded to AWS S3.
- In order to use consistent resources (a sketch of this buffering flow follows this list):
  i. A SQL batch of (default value) 4096 rows is used: each batch is converted to CSV format and appended to the CSV buffer.
  ii. A CSV buffer of (default value) 1 MB is used: at 1 MB, its contents are converted into gzip format and appended to the gzip buffer.
  iii. A gzip buffer of (default value) 50 MB is used: once the gzip buffer reaches 50 MB, it is uploaded to S3.
- To have resiliency while uploading (a sketch of the part upload with retries follows this list):
  i. I used the AWS multipart upload strategy: this strategy uploads data in chunks, and after completion S3 concatenates the chunks to produce the final gzip file.
  ii. We use a maximum of 3 retries: if there is any temporary network issue, we retry the upload up to 2 more times.
  iii. If the upload has been cancelled, we let AWS know that the upload has been aborted. This makes AWS discard all the uploaded parts, so we are not charged for partially uploaded data.
- In order to use concurrency (a sketch of the uploader pool follows this list):
  i. We use a Go channel whose capacity is set to the number of cores on the machine. (This can increase memory usage by up to max_channel_length * size of the gzip buffer.)
  ii. Gzip buffers are sent to this channel as they are created.
  iii. Multiple AWS uploader goroutines listen to this channel; whichever goroutine is available uploads the gzipped data to AWS S3.
  iv. We have also since switched to pgzip as a replacement for the `compress/gzip` library, which parallelizes the compression.
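The sketch below illustrates the double-buffer flow from the consistent-resources item. The package name, the records channel standing in for scanned row batches, and the uploadPart callback standing in for the S3 part upload are all illustrative, not the library's API.

```go
package pipeline // illustrative package name

import (
	"bytes"
	"compress/gzip"
	"encoding/csv"
)

// Default thresholds described above.
const (
	rowBatchSize = 4096             // rows scanned and encoded per SQL batch
	csvFlushSize = 1 * 1024 * 1024  // 1 MB: compress the CSV buffer into the gzip buffer
	gzipPartSize = 50 * 1024 * 1024 // 50 MB: hand the gzip buffer off as one S3 part
)

// streamRecords shows the double-buffer flow: CSV buffer -> gzip buffer -> upload.
// records stands in for rows scanned via database/sql, and uploadPart stands in
// for the S3 multipart upload; both are placeholders.
func streamRecords(records <-chan []string, uploadPart func([]byte) error) error {
	var csvBuf, gzipBuf bytes.Buffer
	gz := gzip.NewWriter(&gzipBuf)
	w := csv.NewWriter(&csvBuf)

	// flushCSV compresses whatever CSV has accumulated into the gzip buffer.
	flushCSV := func() error {
		w.Flush()
		if err := w.Error(); err != nil {
			return err
		}
		if _, err := gz.Write(csvBuf.Bytes()); err != nil {
			return err
		}
		csvBuf.Reset()
		return nil
	}

	for rec := range records {
		if err := w.Write(rec); err != nil {
			return err
		}
		// 1 MB of CSV accumulated: move it into the gzip stream.
		if csvBuf.Len() >= csvFlushSize {
			if err := flushCSV(); err != nil {
				return err
			}
		}
		// 50 MB of gzip accumulated: ship it as one upload part.
		if gzipBuf.Len() >= gzipPartSize {
			if err := uploadPart(gzipBuf.Bytes()); err != nil {
				return err
			}
			gzipBuf.Reset()
		}
	}

	// Flush and upload whatever is left.
	if err := flushCSV(); err != nil {
		return err
	}
	if err := gz.Close(); err != nil {
		return err
	}
	if gzipBuf.Len() > 0 {
		return uploadPart(gzipBuf.Bytes())
	}
	return nil
}
```

Because the buffer thresholds, not the table size, decide when data is flushed, memory use stays roughly constant no matter how many rows are exported.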
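For the resiliency item, here is a hedged sketch of uploading one gzip buffer as a part of an AWS multipart upload with retries and an abort on repeated failure, using aws-sdk-go v1. The function name, maxRetries constant, and error handling are illustrative rather than the library's exact implementation.

```go
package pipeline // illustrative package name

import (
	"bytes"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/s3"
)

const maxRetries = 3 // the first attempt plus two retries

// uploadPartWithRetry uploads one gzip buffer as a single part of an existing
// multipart upload, retrying transient failures. uploadID comes from an earlier
// CreateMultipartUpload call.
func uploadPartWithRetry(svc *s3.S3, bucket, key, uploadID string, partNum int64, part []byte) (*s3.CompletedPart, error) {
	var lastErr error
	for attempt := 0; attempt < maxRetries; attempt++ {
		out, err := svc.UploadPart(&s3.UploadPartInput{
			Bucket:     aws.String(bucket),
			Key:        aws.String(key),
			UploadId:   aws.String(uploadID),
			PartNumber: aws.Int64(partNum),
			Body:       bytes.NewReader(part),
		})
		if err == nil {
			// CompleteMultipartUpload later needs the ETag to stitch the parts together.
			return &s3.CompletedPart{ETag: out.ETag, PartNumber: aws.Int64(partNum)}, nil
		}
		lastErr = err
	}

	// Give up: abort the upload so S3 discards the parts uploaded so far
	// and we are not billed for partially uploaded data.
	_, _ = svc.AbortMultipartUpload(&s3.AbortMultipartUploadInput{
		Bucket:   aws.String(bucket),
		Key:      aws.String(key),
		UploadId: aws.String(uploadID),
	})
	return nil, lastErr
}
```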
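And for the concurrency item, a small sketch of the uploader pool: a buffered channel sized to GOMAXPROCS and a group of goroutines draining it. startUploaders and uploadPart are hypothetical names; the library itself may wire this differently.

```go
package pipeline // illustrative package name

import (
	"runtime"
	"sync"
)

// startUploaders launches the uploader pool described above: a buffered channel
// of gzip buffers and one goroutine per usable core draining it. Up to GOMAXPROCS
// extra buffers can therefore sit in the channel, which is where the additional
// memory usage noted above comes from. uploadPart is a placeholder.
func startUploaders(uploadPart func([]byte) error) (chan<- []byte, *sync.WaitGroup) {
	workers := runtime.GOMAXPROCS(0)
	parts := make(chan []byte, workers)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range parts {
				// Errors are dropped in this sketch; the real flow would surface
				// them and abort the multipart upload.
				_ = uploadPart(p)
			}
		}()
	}
	return parts, &wg
}
```

The producer sends each full gzip buffer into the returned channel, closes it when the query is exhausted, and waits on the WaitGroup before completing the multipart upload.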
There are certain caveats to using this library.
- pgzip spawns the same number of goroutines as the number of cores available on the machine. This does not play well with containers, since container CPUs are controlled by cgroups, which `runtime.GOMAXPROCS(0)` does not take into consideration. This may impact pgzip performance on a pod.
- A maximum of 10,000 part uploads is allowed by AWS. Hence, the default settings support (50 MB x 10,000) 500 GB of gzipped data. Increase the buffer size if you want to reduce the number of parts or need to handle more than 500 GB of gzipped data.
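As one possible mitigation for the cgroups caveat (not something this library does itself), a caller could align GOMAXPROCS with the container's CPU quota and cap pgzip's concurrency explicitly. This sketch assumes the third-party go.uber.org/automaxprocs and github.com/klauspost/pgzip packages.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"

	"github.com/klauspost/pgzip"
	_ "go.uber.org/automaxprocs" // aligns GOMAXPROCS with the container's cgroup CPU quota at startup
)

func main() {
	// After automaxprocs runs, GOMAXPROCS(0) reflects the pod's CPU limit
	// rather than the host's physical core count.
	cores := runtime.GOMAXPROCS(0)
	fmt.Println("compression goroutines:", cores)

	var buf bytes.Buffer
	gz := pgzip.NewWriter(&buf)
	// Cap pgzip's parallelism explicitly: 1 MB blocks, one block per usable core.
	if err := gz.SetConcurrency(1<<20, cores); err != nil {
		panic(err)
	}

	if _, err := gz.Write([]byte("example payload\n")); err != nil {
		panic(err)
	}
	if err := gz.Close(); err != nil {
		panic(err)
	}
}
```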