
ETL Code for Loading Data Into a Redshift-based Data Warehouse

Overview

The goal of this project is to load data from "upstream" data sources (for example, application databases from the website or marketing) into your data warehouse. After the initial load, additional transformations follow to arrive at a data model that is conformed and easy to query. In essence, this is an ELT from multiple sources into one data warehouse.

We use an S3 bucket as the intermediate "data lake" where we dump upstream data before loading it into Redshift. This step gives you copies of your upstream data in your data warehouse. Since Redshift supports only a subset of the column types that PostgreSQL supports, the data dump is set up to map certain column types (like arrays or hstore) to types that Redshift does support.
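As a hedged illustration of that mapping step (not the project's actual implementation), the sketch below translates a few PostgreSQL column types into Redshift-compatible ones; the chosen target types and sizes are assumptions made for this example.

```python
# Minimal sketch, not the project's actual mapping: translate PostgreSQL
# column types that Redshift lacks into types Redshift accepts. The target
# types and sizes below are assumptions for illustration.

PG_TO_REDSHIFT = {
    "hstore": "varchar(65535)",   # key/value pairs serialized to text
    "jsonb": "varchar(65535)",    # no native JSON column type
    "text[]": "varchar(65535)",   # arrays serialized to a delimited string
    "uuid": "varchar(36)",
    "text": "varchar(65535)",     # unbounded text needs an explicit length
}

def redshift_type(pg_type: str) -> str:
    """Return a Redshift-compatible column type for a PostgreSQL type."""
    return PG_TO_REDSHIFT.get(pg_type, pg_type)

if __name__ == "__main__":
    for pg_type in ("integer", "hstore", "text[]"):
        print(f"{pg_type} -> {redshift_type(pg_type)}")
```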

ETL Flow diagram

Once the upstream data is loaded, it can be transformed. Your upstream data sources were likely created with their application in mind (such as your e-commerce website or your marketing tool). This data should be conformed and made ready for easy and fast queries. For great query performance, consider transforming your data into star schemas.
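To make the star-schema idea concrete, here is a hedged sketch of the kind of CREATE TABLE ... AS transformation involved. The schema, table, and column names are invented for the example and are not part of this project.

```python
# Hypothetical example of building a fact table for a star schema in Redshift.
# All schema, table, and column names below are made up for illustration.

FACT_ORDERS_SQL = """
CREATE TABLE analytics.fact_orders
DISTKEY (customer_key)
SORTKEY (order_date)
AS
SELECT
    o.order_id,
    o.created_at::date AS order_date,
    c.customer_key,               -- surrogate key from the customer dimension
    p.product_key,                -- surrogate key from the product dimension
    o.quantity,
    o.total_amount
FROM www.orders AS o
JOIN analytics.dim_customer AS c ON c.customer_id = o.customer_id
JOIN analytics.dim_product  AS p ON p.product_id  = o.product_id
"""
```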

Organizing the data model

The Hyppo database is where we aggregate third-party data, like shipping or payment information. The integration manager for Hyppo is also available as an open-source project.

The Tallboy database is the ingress location for our users' spreadsheets.

Basic flow from design to production

The CLI tool lets users iterate on table designs, dump upstream data, load it into Redshift, and create the tables and views that build out the warehouse.

Development ETLs

You can bootstrap the table design files with the CLI tool. (See the wiki pages for general information about table design files.) Then copy them into an "environment" folder in your S3 bucket (again, using the CLI tool). Then dump your upstream sources and load them into your warehouse. There are commands available to validate your setup before uploading and to explore what's actually present.
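In case it helps to picture the S3 layout, here is a rough boto3 sketch of what copying design files into an "environment" folder amounts to. The bucket name, the "schemas" prefix, and the local directory layout are assumptions for this example; in practice the CLI tool handles this step for you.

```python
# Hedged sketch only: the CLI tool performs this upload for you.
# The bucket name and "<environment>/schemas/" prefix are assumptions.

from pathlib import Path
import boto3

def upload_table_designs(local_dir: str, bucket: str, environment: str) -> None:
    """Copy local table design files into the environment folder in S3."""
    s3 = boto3.client("s3")
    for path in Path(local_dir).rglob("*"):
        if path.is_file():
            key = f"{environment}/schemas/{path.relative_to(local_dir).as_posix()}"
            s3.upload_file(str(path), bucket, key)
            print(f"uploaded s3://{bucket}/{key}")

if __name__ == "__main__":
    upload_table_designs("./schemas", bucket="my-etl-bucket", environment="development")
```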

Production ETLs

For production, a Data Pipeline can be scheduled that kicks off an EMR cluster and EC2 instances at regular intervals to ETL (or ELT) the data. You can deploy to production simply by copying your tested configuration and table designs within S3.
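As a rough illustration of that last step, the sketch below "promotes" a tested environment by copying its objects to a production prefix within the same bucket; the bucket and prefix names are hypothetical.

```python
# Minimal sketch, assuming configuration and table designs live under
# per-environment prefixes in one bucket; names here are hypothetical.

import boto3

def promote_environment(bucket: str, source_env: str, target_env: str) -> None:
    """Copy every object from the tested environment prefix to the production prefix."""
    s3 = boto3.resource("s3")
    source_bucket = s3.Bucket(bucket)
    for obj in source_bucket.objects.filter(Prefix=f"{source_env}/"):
        target_key = f"{target_env}/{obj.key[len(source_env) + 1:]}"
        source_bucket.copy({"Bucket": bucket, "Key": obj.key}, target_key)
        print(f"copied {obj.key} -> {target_key}")

if __name__ == "__main__":
    promote_environment("my-etl-bucket", source_env="development", target_env="production")
```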

Alternatives

Depending on what you're looking for, there are better alternatives out there. If you just want to consume data and have an RDS instance with PostgreSQL 9.5, then check out the AWS Database Migration Service or AWS Glue. There are also ETL products, free or paid, that will load and transform data; check out Talend, Fivetran or Pipeline. (We're not endorsing any of these, simply pointing out some options; there are more for you to discover.)

We embarked on this "PostgreSQL-to-Redshift" ETL adventure since we wanted an ETL

  • that would leave a lot of control over the tables in Redshift with us (such as column encoding, table sorting and distribution)
  • that would give us the option to run transformations in Spark (including SparkSQL), especially with the thought of having the same toolset for analytics jobs that may use the same data source but not create tables in the data warehouse
  • that would enable us to start testing all the transformations along the way
  • that would allow us to corral our data into a meaningful organization and allow us to quickly iterate over it
  • that would lend itself to production releases with standard tools and services.

(Some of this may have been driven by our previous solution, which limited the number of development iterations and had some frailties in scheduling and running.)