3 Scheduling EmrEtlRunner

Alex Dean edited this page May 17, 2013 · 5 revisions

  1. Overview
  2. cron
  3. Jenkins
  4. Windows Task Scheduler
  5. Next steps
## 1. Overview

Once you have the ETL process working smoothly, you can schedule a daily (or more frequent) task to automate it.

We run our daily ETL jobs at 3am UTC, so that we are sure that we have processed all of the events from the day before (CloudFront logs can take some time to arrive).

Your scheduling options are covered in turn below.

## 2. cron

The recommended way of scheduling the ETL process is as a daily cronjob, using the bash script available in the Snowplow GitHub repository at /3-enrich/emr-etl-runner/bin/snowplow-emr-etl-runner.sh.

You need to edit this script and update the three variables:

    rvm_path=/path/to/.rvm # Typically in the $HOME of the user who installed RVM
    RUNNER_PATH=/path/to/snowplow/3-enrich/emr-etl-runner
    RUNNER_CONFIG=/path/to/your-config.yml

So for example if you installed RVM as the admin user, then you would set:

    rvm_path=/home/admin/.rvm
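Before scheduling the job, it can help to confirm that the three values point at things that actually exist. The following is a minimal sketch only: `check_runner_settings` is a hypothetical helper, not part of Snowplow, and it assumes that `rvm_path` and `RUNNER_PATH` are directories and that `RUNNER_CONFIG` is a regular file.

```shell
#!/bin/bash
# Hypothetical pre-flight check (not shipped with Snowplow).
# Arguments: rvm_path, RUNNER_PATH, RUNNER_CONFIG.
# Returns 0 if all three exist, 1 otherwise, reporting each problem on stderr.
check_runner_settings() {
  local rvm_path="$1" runner_path="$2" runner_config="$3"
  local status=0

  if [ ! -d "$rvm_path" ]; then
    echo "rvm_path is not a directory: $rvm_path" >&2
    status=1
  fi
  if [ ! -d "$runner_path" ]; then
    echo "RUNNER_PATH is not a directory: $runner_path" >&2
    status=1
  fi
  if [ ! -f "$runner_config" ]; then
    echo "RUNNER_CONFIG is not a file: $runner_config" >&2
    status=1
  fi

  return "$status"
}
```

You could call it with the same values you put in the script, for example `check_runner_settings /home/admin/.rvm "$RUNNER_PATH" "$RUNNER_CONFIG" && echo "settings look OK"`.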

Now, assuming you're using the excellent cronic as a wrapper for your cronjobs, and that both cronic and Bundler are on your path, you can configure your cronjob like so:

    0 4   * * *   root    cronic /path/to/snowplow/3-enrich/emr-etl-runner/bin/snowplow-emr-etl-runner.sh

This will run the ETL job daily at 4am (in the server's time zone), emailing any failures to you via cronic.
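For the failure emails to reach you, cron needs to know where to send job output. cronic suppresses output unless the command fails, and cron then mails whatever output remains. As a sketch, the snippet below writes a system-crontab-style entry with a `MAILTO` line to a scratch file for review. The file location, the address `you@example.com` and the `RUNNER_SCRIPT` path are all placeholders, and it assumes your cron implementation honours `MAILTO`, as Vixie cron and cronie do.

```shell
#!/bin/bash
# Sketch only: every value below is a placeholder to adapt to your environment.
RUNNER_SCRIPT="/path/to/snowplow-emr-etl-runner.sh"   # your edited copy of the script
CRON_FILE="${CRON_FILE:-${TMPDIR:-/tmp}/snowplow-emr-etl-runner-cron}"

# Write the entry to a scratch file first so it can be reviewed before installing.
cat > "$CRON_FILE" <<EOF
# Mail any output from the jobs below to this address
MAILTO=you@example.com
# minute hour day-of-month month day-of-week user command
0 4   * * *   root    cronic ${RUNNER_SCRIPT}
EOF

echo "Wrote cron entry to $CRON_FILE"
```

After checking the contents, you would copy the file into place as root (on many Linux distributions, a file under /etc/cron.d).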

## 3. Jenkins

Some developers use the Jenkins continuous integration server (or Hudson, which is very similar) to schedule their Hadoop and Hive jobs.

Describing how to do this is out of scope for this guide, but the blog post "Lowtech Monitoring with Jenkins" is a great tutorial on using Jenkins for non-CI-related tasks, and could easily be adapted to schedule EmrEtlRunner.

## 4. Windows Task Scheduler

For Windows servers, it should in theory be possible to use a Windows PowerShell script plus Windows Task Scheduler instead of bash and cron. However, this has not been tested or documented.

If you get this working, please let us know!

## 5. Next steps

Now that you have installed and scheduled EmrEtlRunner, you have all your data ready for analysis in S3. Learn how to set up the StorageLoader to regularly load your data into a database such as Infobright or Redshift (for OLAP analysis, for example), or how to analyse it on S3 via EMR.
