Best Practices to Copy On-Premise Data to AWS S3 and Redshift

Follow these best practices when planning to move data from an on-premise database to AWS S3 and then into Redshift:

  1. Compress the files at every step to save storage cost and upload time.
  2. Split the source data into multiple files so that the number of files is a multiple of the number of slices in the Redshift cluster. If you don’t know the slice count of the client’s cluster yet, parameterize the split number for now so the source table can still be exported as multiple files (a splitting-and-upload sketch follows this list).
  3. To save load time, skip the compression-encoding analysis step by turning COMPUPDATE off in the COPY command (the COPY example after this list shows the syntax).
  4. Upload the files to an S3 bucket in the same region as the Redshift cluster; this avoids cross-region network latency and makes uploads noticeably faster.
  5. As a rule of thumb, when loading data from S3 to Redshift using the multiple-files method, 1 GB of data should take no more than about 30 seconds (actual times depend on the cluster configuration).
  6. When you COPY from Amazon S3 using a key prefix, there is a risk of loading unwanted files that happen to share the prefix, so using a manifest file is mandatory (a manifest sketch follows this list).
  7. For information: when loading DATE and TIMESTAMP columns, COPY expects the default formats, which are YYYY-MM-DD for dates and YYYY-MM-DD HH:MI:SS for timestamps. If the load data does not use the default format, specify it with the DATEFORMAT and TIMEFORMAT options (also shown in the COPY example below).
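
To illustrate points 1, 2, and 4, here is a minimal Python sketch that splits an exported table into a parameterized number of gzip-compressed part files and uploads them to an S3 bucket in the cluster's region. The file name, bucket, region, and split count are hypothetical placeholders, not values from this post.

```python
import gzip
import boto3

# Hypothetical placeholders -- adjust to your environment.
SOURCE_FILE = "orders_export.csv"       # table exported from the on-premise database (data rows only, no header)
BUCKET = "my-redshift-staging-bucket"   # bucket in the same region as the cluster
REGION = "us-east-1"                    # cluster / bucket region
NUM_SPLITS = 8                          # ideally a multiple of the cluster's slice count


def split_and_compress(source_file, num_splits):
    """Round-robin the source rows into num_splits gzip-compressed part files."""
    part_names = [f"orders_part_{i:02d}.csv.gz" for i in range(num_splits)]
    writers = [gzip.open(name, "wt") for name in part_names]
    with open(source_file) as src:
        for line_no, line in enumerate(src):
            writers[line_no % num_splits].write(line)
    for w in writers:
        w.close()
    return part_names


def upload_parts(part_names, bucket, region):
    """Upload every compressed part file into the staging bucket."""
    s3 = boto3.client("s3", region_name=region)
    for name in part_names:
        s3.upload_file(name, bucket, f"orders/{name}")


if __name__ == "__main__":
    parts = split_and_compress(SOURCE_FILE, NUM_SPLITS)
    upload_parts(parts, BUCKET, REGION)
```

Round-robin splitting keeps the part files roughly the same size, so every slice does a similar amount of work during the COPY.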
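
For point 6, the manifest is a small JSON document that lists the exact S3 objects COPY should read, instead of everything matching a key prefix. The sketch below builds such a manifest for the hypothetical part files from the previous example and uploads it alongside them; the bucket and key names are again placeholders.

```python
import json
import boto3


def write_manifest(part_keys, bucket, manifest_key="orders/orders.manifest"):
    """Build and upload a COPY manifest listing exactly the files to load."""
    manifest = {
        "entries": [
            {"url": f"s3://{bucket}/{key}", "mandatory": True}  # fail the load if a listed file is missing
            for key in part_keys
        ]
    }
    s3 = boto3.client("s3")
    s3.put_object(Bucket=bucket, Key=manifest_key,
                  Body=json.dumps(manifest, indent=2).encode("utf-8"))
    return f"s3://{bucket}/{manifest_key}"


# Example: manifest for the part files uploaded by the previous sketch.
keys = [f"orders/orders_part_{i:02d}.csv.gz" for i in range(8)]
manifest_url = write_manifest(keys, "my-redshift-staging-bucket")
```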
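
Points 3 and 7 (together with the manifest from point 6) come together in the COPY command itself. The sketch below runs COPY through psycopg2 with the MANIFEST, GZIP, COMPUPDATE OFF, DATEFORMAT, and TIMEFORMAT options; the table name, manifest URL, IAM role, and connection details are hypothetical and shown only to illustrate the syntax.

```python
import psycopg2

# Hypothetical connection details for the Redshift cluster.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="loader",
    password="...",
)

copy_sql = """
    COPY public.orders
    FROM 's3://my-redshift-staging-bucket/orders/orders.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    MANIFEST
    GZIP
    CSV
    COMPUPDATE OFF                      -- skip the automatic encoding analysis
    DATEFORMAT 'YYYY-MM-DD'             -- shown with the default patterns; change
    TIMEFORMAT 'YYYY-MM-DD HH:MI:SS';   -- them only if the source data differs
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)
conn.close()
```

The DATEFORMAT and TIMEFORMAT lines are only needed when the source data deviates from the defaults described in point 7.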

 
