3 Scheduling EmrEtlRunner
## Scheduling

## 1. Overview

Once you have the ETL process working smoothly, you can schedule a daily (or more frequent) task to automate it.
We run our daily ETL jobs at 3am UTC, so that we are sure that we have processed all of the events from the day before (CloudFront logs can take some time to arrive).
The sections below consider each scheduling option in turn.
## 2. cron

The recommended way of scheduling the ETL process is as a daily cronjob using the shell script available in the Snowplow GitHub repository at `/3-enrich/emr-etl-runner/bin/snowplow-emr-etl-runner.sh`.
You need to edit this script and update the three variables:

    rvm_path=/path/to/.rvm # Typically in the $HOME of the user who installed RVM
    RUNNER_PATH=/path/to/snowplow/3-enrich/snowplow-emr-etl-runner
    RUNNER_CONFIG=/path/to/your-config.yml

So, for example, if you installed RVM as the `admin` user, you would set:

    rvm_path=/home/admin/.rvm
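Before scheduling the job, it can help to confirm that the three values point at real locations. The function below is a minimal sketch of such a pre-flight check. It is not part of the Snowplow script: the name `check_runner_settings` is hypothetical, and the check assumes a standard RVM install, which places its loader at `scripts/rvm` under the RVM directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Snowplow): checks the three
# values you are about to put into snowplow-emr-etl-runner.sh.
# Usage: check_runner_settings <rvm_path> <RUNNER_PATH> <RUNNER_CONFIG>
check_runner_settings() {
  local rvm_path="$1" runner_path="$2" runner_config="$3"
  local status=0

  # Assumption: a standard RVM install has a non-empty scripts/rvm loader.
  if [ ! -s "${rvm_path}/scripts/rvm" ]; then
    echo "RVM loader not found under rvm_path: ${rvm_path}" >&2
    status=1
  fi

  # RUNNER_PATH should be the directory containing EmrEtlRunner.
  if [ ! -d "${runner_path}" ]; then
    echo "RUNNER_PATH is not a directory: ${runner_path}" >&2
    status=1
  fi

  # RUNNER_CONFIG should be a readable YAML configuration file.
  if [ ! -r "${runner_config}" ]; then
    echo "RUNNER_CONFIG is not readable: ${runner_config}" >&2
    status=1
  fi

  return "${status}"
}
```

For example, `check_runner_settings /home/admin/.rvm /path/to/snowplow/3-enrich/snowplow-emr-etl-runner /path/to/your-config.yml` returns a non-zero status and prints one message per problem if any of the three values is wrong.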
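Now, assuming you're using the excellent cronic as a wrapper for your cronjobs, and that both cronic and Bundler are on your path, you can configure your cronjob like so:

    0 4 * * * root cronic /path/to/snowplow/3-enrich/emr-etl-runner/bin/snowplow-emr-etl-runner.sh

This entry is in the system crontab format (as used in `/etc/crontab` or a file under `/etc/cron.d/`), which includes a user field (`root` here); omit that field if you add the line to a user's own crontab.

This will run the ETL job daily at 4am in the server's local time. cronic suppresses the job's output unless it fails, so cron only emails you when something goes wrong.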
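The cron entry above can also be generated as a complete file for `/etc/cron.d/`. The sketch below is one way to do that, under stated assumptions: the function name `make_cron_entry` is hypothetical, the `PATH` line assumes cronic and Bundler live in one of the listed directories, and `MAILTO` is the standard crontab variable that tells cron where to send a job's output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a system-crontab fragment for the daily ETL job.
# Usage: make_cron_entry <path-to-snowplow-emr-etl-runner.sh> <email>
make_cron_entry() {
  local script_path="$1" mailto="$2"
  cat <<EOF
MAILTO=${mailto}
PATH=/usr/local/bin:/usr/bin:/bin
0 4 * * * root cronic ${script_path}
EOF
}
```

You could review the output first and then redirect it to a file, for example `make_cron_entry /path/to/snowplow-emr-etl-runner.sh you@example.com > /etc/cron.d/snowplow-etl` (the file name is only an illustration). Adjust the `PATH` line to wherever cronic and Bundler are actually installed.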
## 3. Jenkins

Some developers use the Jenkins continuous integration server (or Hudson, which is very similar) to schedule their Hadoop and Hive jobs.

Describing how to do this is out of scope for this guide, but the blog post "Lowtech Monitoring with Jenkins" is a great tutorial on using Jenkins for non-CI-related tasks, and could easily be adapted to schedule EmrEtlRunner.
## 4. Windows Task Scheduler

For Windows servers, in theory it should be possible to use a Windows PowerShell script plus Windows Task Scheduler instead of bash and cron. However, this has not been tested or documented.

If you get this working, please let us know!
## 5. Next steps

Now that you have installed and scheduled EmrEtlRunner, you have all your data ready for analysis in S3. Learn how to set up the StorageLoader to regularly load your data into a database (e.g. Infobright or Redshift) for OLAP-style analysis, or how to analyse it on S3 via EMR.
Copyright © 2012-2013 Snowplow Analytics Ltd