
Setting up Redshift

Yali Sassoon edited this page Aug 15, 2013 · 17 revisions

HOME > SNOWPLOW SETUP GUIDE > Step 4: setting up alternative data stores > Setup Redshift

Setting up Redshift is a seven-step process:

  1. Launch a cluster
  2. Authorize client connections to your cluster
  3. Connect to your cluster
  4. Set up the Snowplow database and events table
  5. Create a Redshift user with restricted permissions, just for loading data into your Snowplow table
  6. Generate Redshift-format data from Snowplow
  7. Automate the loading of Snowplow data into Redshift

Note: We recommend running all Snowplow AWS operations through an IAM user with the bare minimum permissions required to run Snowplow. Please see our IAM user setup page for more information on doing this.

## 1. Launch a Redshift Cluster

Go into the Amazon Web Services console and select "Redshift" from the list of services.

Click on the "Launch Cluster" button:

Enter suitable values for the cluster identifier, database name, port, username and password. Click the "Continue" button.

We now need to configure the cluster size. Select the values that are most appropriate to your situation. We generally recommend starting with a single node cluster with node type dw.hs1.xlarge, and then adding nodes as your data volumes grow.

You now have the opportunity to encrypt the database and set the availability zone if you wish. Select your preferences and click "Continue".

Amazon summarises your cluster information. Click "Launch Cluster" to fire your Redshift instance up. This will take a few minutes to complete.
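If you prefer the command line, the console steps above can be sketched with the AWS CLI's `aws redshift create-cluster` command. All the values below (identifier, database name, username, password) are hypothetical placeholders, not values from this guide, and the script only prints the command so you can review it before running it yourself:

```shell
# Hypothetical values -- substitute your own before running the printed command.
CLUSTER_ID="snowplow"
DB_NAME="snowplow"
MASTER_USER="admin"
MASTER_PASSWORD="ChangeMe1"   # must satisfy Redshift's password rules

# Build the create-cluster command and print it for review,
# rather than executing it (so this sketch needs no AWS credentials).
CMD="aws redshift create-cluster"
CMD="$CMD --cluster-identifier $CLUSTER_ID"
CMD="$CMD --db-name $DB_NAME"
CMD="$CMD --node-type dw.hs1.xlarge"
CMD="$CMD --cluster-type single-node"
CMD="$CMD --master-username $MASTER_USER"
CMD="$CMD --master-user-password $MASTER_PASSWORD"
echo "$CMD"
```

As with the console flow, this starts a single dw.hs1.xlarge node; you can resize the cluster later as your data volumes grow.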

## 2. Authorize client connections to your cluster

You authorize access to Redshift differently depending on whether the client you're authorizing is an EC2 instance or not.

2.1 EC2 instance
2.2 Other client

### 2.1 Granting access to an EC2 instance

TO WRITE

### 2.2 Granting access to non-EC2 boxes

To enable a direct connection between a client (e.g. your local machine) and Redshift, click on the cluster you want to access in the AWS console:

Click on "Security Groups" on the left hand menu.

Amazon lets you create several different security groups so you can manage access by different groups of people. In this tutorial, we will just update the default group to grant access to a particular IP address.

Select the default security group:

We need to enable a connection type for this security group. Amazon offers two choices: an 'EC2 Security Group' (if you want to grant access to a client running on EC2) or a CIDR/IP connection (if you want to connect a client that is not an EC2 instance).

In this example we're going to establish a direct connection between Redshift and our local machine (not on EC2), so select CIDR/IP. Amazon helpfully guesses the CIDR of the current machine. In our case, this is right, so we enter the value:

and click "Add".

We should now be able to connect a SQL client on our local machine to Amazon Redshift.
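The same authorization can be sketched with the AWS CLI's `aws redshift authorize-cluster-security-group-ingress` command, which matches the (pre-VPC) security-group flow described above. The IP address below is a hypothetical placeholder, and the script prints the command for review rather than running it:

```shell
# Hypothetical sketch of the console steps above, using the AWS CLI.
# MY_IP is a placeholder -- substitute your machine's actual public IP.
MY_IP="203.0.113.42"
CIDR="${MY_IP}/32"   # a /32 grants access to this single address only

CMD="aws redshift authorize-cluster-security-group-ingress"
CMD="$CMD --cluster-security-group-name default"
CMD="$CMD --cidrip $CIDR"

# Printed for review rather than executed:
echo "$CMD"
```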

## 3. Connect to your cluster

There are two ways to connect to your Redshift cluster:

3.1 Directly
3.2 Via SSL

### 3.1 Connect directly

Amazon has helpfully provided detailed instructions for connecting to Redshift using [SQL Workbench] (sql-workbench-tutorial). In this tutorial we will connect using Navicat, a database querying tool which we recommend (30-day trial versions are available from the Navicat website).

Note: Redshift can be accessed using PostgreSQL JDBC or ODBC drivers. Only specific versions of these drivers work with Redshift. These are:

Clients running other PostgreSQL driver versions will not be able to connect to Redshift. Note that a number of SQL and BI vendors are launching Redshift-specific drivers.
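Because Redshift speaks the PostgreSQL wire protocol, a standard PostgreSQL-style JDBC URL built from your cluster's endpoint, port and database name will work with a compatible driver. The endpoint below is a made-up placeholder; copy the real one from your cluster's detail page in the AWS console:

```shell
# Hypothetical endpoint -- copy the real one from the cluster's detail page.
ENDPOINT="snowplow.abc123defghi.us-east-1.redshift.amazonaws.com"
PORT=5439          # Redshift's default port
DB_NAME="snowplow"

# A PostgreSQL-style JDBC URL for this cluster:
JDBC_URL="jdbc:postgresql://${ENDPOINT}:${PORT}/${DB_NAME}"
echo "$JDBC_URL"
```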

If you have the drivers setup, connecting to Redshift is straightforward:

Open Navicat, select "Connection" -> "PostgreSQL" to establish a new connection:

Give your connection a suitable name. We now need to enter the host name, port, database, username and password. With the exception of password, these are all available directly from the AWS UI. Go back to your browser, open the AWS console, go to Redshift and select your cluster:

Copy the endpoint, port, database name and username into the relevant fields in Navicat, along with the password you created when you set up the cluster:

Click "Test Connection" to check that it is working. Assuming it is, click "OK".

The Redshift cluster is now visible in Navicat, alongside any other databases you have connected it to.
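If you'd rather not use a GUI client at all, the same connection details work with the command-line `psql` client (again, because Redshift accepts PostgreSQL clients). All values below are hypothetical placeholders; the script prints the command for you to run yourself:

```shell
# All values hypothetical -- use your cluster's endpoint, port, database and user.
HOST="snowplow.abc123defghi.us-east-1.redshift.amazonaws.com"
PORT=5439
DB_NAME="snowplow"
DB_USER="admin"

CMD="psql -h $HOST -p $PORT -d $DB_NAME -U $DB_USER"
echo "$CMD"   # run this yourself; psql will prompt for the password
```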

### 3.2 Connect via SSL

TO WRITE

## 4. Setting up the Snowplow events table

Now that you have Redshift up and running, you need to create your Snowplow events table.

The Snowplow events table definition for Redshift is available in the repo [here] (redshift-table-def). Execute this query in Redshift to create the Snowplow events table.
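One way to execute the table definition is to save it locally and feed the file to `psql` with the `-f` flag. The connection values and the file name `snowplow-events-table.sql` below are hypothetical; use your own cluster details and whatever name you saved the definition under:

```shell
# Hypothetical: assumes the events-table DDL from the repo was saved locally
# as snowplow-events-table.sql, and that psql can reach your cluster (section 3).
HOST="snowplow.abc123defghi.us-east-1.redshift.amazonaws.com"
DDL_FILE="snowplow-events-table.sql"

CMD="psql -h $HOST -p 5439 -d snowplow -U admin -f $DDL_FILE"
echo "$CMD"   # run this yourself to create the events table
```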

## 5. Creating a Redshift user with restricted permissions *just* for loading data into your Snowplow table

We recommend that you create a specific user in Redshift with only the permissions required to load data into your Snowplow events table, and use this user's credentials in the StorageLoader config to manage the automatic movement of data into the table. (That way, in the event that the server running StorageLoader is hacked and the hacker gets access to those credentials, they cannot use them to do any harm to your data.)

To create a new user with restrictive permissions, log into Redshift, open the database containing the Snowplow events table, and execute the following SQL:

    CREATE USER storageloader PASSWORD 'mYh4RDp4ssW0rD';
    GRANT USAGE ON SCHEMA atomic TO storageloader;
    GRANT INSERT ON TABLE "atomic"."events" TO storageloader;

You can set the user name and password (storageloader and mYh4RDp4ssW0rD in the example above) to your own values. Note them down: you will need them when you come to set up the StorageLoader in the next phase of your Snowplow setup.
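To sanity-check that the new user really is restricted, you can query the standard PostgreSQL `has_table_privilege` function (which Redshift supports) as your admin user. The sketch below just builds and prints the query; the expectation is that INSERT comes back true and DELETE false for the new user:

```shell
# Hypothetical check: confirm the storageloader user can INSERT but not DELETE.
# has_table_privilege is a standard PostgreSQL function supported by Redshift.
SQL="SELECT has_table_privilege('storageloader', 'atomic.events', 'insert'),
            has_table_privilege('storageloader', 'atomic.events', 'delete');"
echo "$SQL"   # run this in your SQL client as the admin user
```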

## 6. Generating Redshift-format data from Snowplow

Assuming you are working through the setup guide sequentially, you will have already [set up EmrEtlRunner] (emr-etl-runner). You should therefore have Snowplow events in S3, ready for uploading into Redshift.

If you have not already [set up EmrEtlRunner] (emr-etl-runner), please do so now before proceeding to the next stage.

## 7. Automating the loading of Snowplow data into Redshift

Now that you have your Snowplow database and table set up on Redshift, you are ready to [set up the StorageLoader to regularly upload Snowplow data into the table] (storage-loader). Click [here] (storage-loader) for step-by-step instructions.
