<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />
<title>Automating data analysis pipelines</title>
<script src="libs/jquery-1.11.3/jquery.min.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link href="libs/bootstrap-3.3.5/css/bootstrap.min.css" rel="stylesheet" />
<script src="libs/bootstrap-3.3.5/js/bootstrap.min.js"></script>
<script src="libs/bootstrap-3.3.5/shim/html5shiv.min.js"></script>
<script src="libs/bootstrap-3.3.5/shim/respond.min.js"></script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-68219208-1', 'auto');
ga('send', 'pageview');
</script>
<style type="text/css">code{white-space: pre;}</style>
<link rel="stylesheet"
href="libs/highlight/default.css"
type="text/css" />
<script src="libs/highlight/highlight.js"></script>
<style type="text/css">
pre:not([class]) {
background-color: white;
}
</style>
<script type="text/javascript">
if (window.hljs && document.readyState && document.readyState === "complete") {
window.setTimeout(function() {
hljs.initHighlighting();
}, 0);
}
</script>
<style type="text/css">
h1 {
font-size: 34px;
}
h1.title {
font-size: 38px;
}
h2 {
font-size: 30px;
}
h3 {
font-size: 24px;
}
h4 {
font-size: 18px;
}
h5 {
font-size: 16px;
}
h6 {
font-size: 12px;
}
.table th:not([align]) {
text-align: left;
}
</style>
<link rel="stylesheet" href="libs/local/main.css" type="text/css" />
<link rel="stylesheet" href="libs/local/nav.css" type="text/css" />
<link rel="stylesheet" href="//netdna.bootstrapcdn.com/font-awesome/4.0.3/css/font-awesome.css" type="text/css" />
</head>
<body>
<style type = "text/css">
.main-container {
max-width: 940px;
margin-left: auto;
margin-right: auto;
}
code {
color: inherit;
background-color: rgba(0, 0, 0, 0.04);
}
img {
max-width:100%;
height: auto;
}
.tabbed-pane {
padding-top: 12px;
}
button.code-folding-btn:focus {
outline: none;
}
</style>
<div class="container-fluid main-container">
<!-- tabsets -->
<script src="libs/navigation-1.1/tabsets.js"></script>
<script>
$(document).ready(function () {
window.buildTabsets("TOC");
});
</script>
<!-- code folding -->
<header>
<div class="nav">
<a class="nav-logo" href="index.html">
<img src="static/img/stat545-logo-s.png" width="70px" height="70px"/>
</a>
<ul>
<li class="home"><a href="index.html">Home</a></li>
<li class="faq"><a href="faq.html">FAQ</a></li>
<li class="syllabus"><a href="syllabus.html">Syllabus</a></li>
<li class="topics"><a href="topics.html">Topics</a></li>
<li class="people"><a href="people.html">People</a></li>
</ul>
</div>
</header>
<div class="fluid-row" id="header">
<h1 class="title toc-ignore">Automating data analysis pipelines</h1>
</div>
<div id="TOC">
<ul>
<li><a href="#overview">Overview</a></li>
<li><a href="#install-make">Install <code>make</code></a></li>
<li><a href="#test-drive-make-and-rstudio">Test drive <code>make</code> and RStudio</a></li>
<li><a href="#hands-on-activity">Hands-on activity</a></li>
<li><a href="#more-examples">More examples</a></li>
<li><a href="#resources">Resources</a></li>
</ul>
</div>
<p>Although we spend a lot of time working with data interactively, this sort of hands-on babysitting is not always appropriate. We have a philosophy of “source is real” in this class and that philosophy can be implemented on a grander scale. Just as we save R code in a script so we can replay analytical steps, we can also record how a series of scripts and commands work together to produce a set of analytical results. This is what we mean by automating data analysis or building an analytical pipeline.</p>
<div id="overview" class="section level3">
<h3>Overview</h3>
<p><a href="automation01_slides/index.html" target="_blank">slides</a></p>
<p>Why and how we automate data analyses, plus examples.</p>
</div>
<div id="install-make" class="section level3">
<h3>Install <code>make</code></h3>
<p><em>2015-11-17 NOTE: since we have already set up a build environment for R packages, it is my hope that everyone has Make. These instructions were from 2014, when we did everything in a different order. Cross your fingers and ignore!</em></p>
<p><a href="automation02_windows.html">Windows installation</a></p>
<p>(If you are running Mac OS or Linux, <code>make</code> should already be installed.)</p>
</div>
<div id="test-drive-make-and-rstudio" class="section level3">
<h3>Test drive <code>make</code> and RStudio</h3>
<p><a href="automation03_make-test-drive.html">Test drive of <code>make</code></a>.</p>
<p>Walk before you run! Prove that <code>make</code> is actually installed and that it can be found and executed from the <a href="git09_shell.html">shell</a> and from RStudio. It is also important to tell RStudio NOT to replace tabs with spaces when editing a <code>Makefile</code>, because <code>make</code> requires each command line in a rule to start with a literal tab. The same applies to any text editor.</p>
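<p>As a minimal sketch of what such a check might look like (the target name and message are placeholders, not necessarily those used in the linked test drive), a two-line <code>Makefile</code> is enough. The second line must begin with a literal tab character, not spaces:</p>
<pre><code>all:
	@echo Build successful!
</code></pre>
<p>Running <code>make</code> in the directory that holds this file should print the message. An error mentioning a “missing separator” usually means the tab was replaced by spaces.</p>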
</div>
<div id="hands-on-activity" class="section level3">
<h3>Hands-on activity</h3>
<p><a href="automation04_make-activity.html">This fully developed example</a> shows you:</p>
<ul>
<li>How to run an R script non-interactively</li>
<li>How to use <code>make</code>
<ul>
<li>to record which files are inputs vs. intermediates vs. outputs</li>
<li>to capture how scripts and commands convert inputs to outputs</li>
<li>to re-run parts of an analysis that are out-of-date</li>
</ul></li>
<li>The intersection of R and <code>make</code>, i.e. how to
<ul>
<li>run snippets of R code</li>
<li>run an entire R script</li>
<li>render an R Markdown document (or R script)</li>
</ul></li>
<li>The interface between RStudio and <code>make</code></li>
<li>How to use <code>make</code> from the <a href="git09_shell.html">shell</a></li>
<li>How Git facilitates the process of building a pipeline</li>
</ul>
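<p>To make the <code>make</code> items above concrete, here is a minimal sketch of a <code>Makefile</code> for a two-step pipeline. It is not taken from the linked activity: every file name (<code>raw_data.tsv</code>, <code>01_clean.R</code>, <code>clean_data.tsv</code>, <code>report.Rmd</code>) is hypothetical. It assumes that <code>Rscript</code> is on your path, that the <code>rmarkdown</code> package is installed and can find pandoc, and that <code>01_clean.R</code> reads the raw file and writes the cleaned one.</p>
<pre><code># Hypothetical file names throughout; adapt them to your own project.
# Command lines under each rule must start with a tab, not spaces.

# Default target: the final output of the pipeline.
all: report.html

# Intermediate file: produced from the raw input by running an entire R script.
# Assumes 01_clean.R reads raw_data.tsv and writes clean_data.tsv.
clean_data.tsv: raw_data.tsv 01_clean.R
	Rscript 01_clean.R

# Output file: rendered from an R Markdown document that reads the intermediate.
report.html: report.Rmd clean_data.tsv
	Rscript -e 'rmarkdown::render("report.Rmd")'

# Remove everything that make can rebuild.
clean:
	rm -f clean_data.tsv report.html

# all and clean are commands, not files.
.PHONY: all clean
</code></pre>
<p>Each rule names a target, the files it depends on and, on the tab-indented line, the command that builds it. If you edit <code>report.Rmd</code> and run <code>make</code> again, only the report should be re-rendered; if you edit <code>01_clean.R</code> or the raw data, both steps should run.</p>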
<p><em>2015-11-19 Andrew MacDonald translated the above into a pipeline for the <a href="https://github.com/richfitz/remake"><code>remake</code> package</a> from Rich Fitzjohn: see <a href="https://gist.github.com/aammd/72a5b98356893c001001">this gist</a>.</em></p>
</div>
<div id="more-examples" class="section level3">
<h3>More examples</h3>
<p>There are three more toy pipelines, using the Lord of the Rings data, that reinforce:</p>
<ul>
<li><a href="https://github.com/STAT545-UBC/STAT545-UBC.github.io/tree/master/automation10_holding-area/01_automation-example_just-r">01_automation-example_just-r</a>: use of an R script as a pseudo-<code>Makefile</code></li>
<li><a href="https://github.com/STAT545-UBC/STAT545-UBC.github.io/tree/master/automation10_holding-area/02_automation-example_r-and-make">02_automation-example_r-and-make</a>: use of a simple <code>Makefile</code></li>
<li><a href="https://github.com/STAT545-UBC/STAT545-UBC.github.io/tree/master/automation10_holding-area/03_automation-example_render-without-rstudio">03_automation-example_render-without-rstudio</a>: use of <code>rmarkdown::render()</code> from a <code>Makefile</code>, as the default way of running an R script or an R Markdown document, leading to pretty HTML reports without any mouse clicks</li>
</ul>
</div>
<div id="resources" class="section level3">
<h3>Resources</h3>
<p><a href="http://xkcd.com/1319/">xkcd comic on automation</a>. ‘Automating’ comes from the roots ‘auto-’ meaning ‘self-’, and ‘mating’, meaning ‘screwing’.</p>
<p>Karl Broman covers GNU Make in his course <a href="http://kbroman.org/Tools4RR/pages/schedule.html">Tools for Reproducible Research</a> <em>(see first week)</em></p>
<p>Karl Broman also wrote <a href="http://kbroman.github.io/minimal_make/">An introduction to <code>Make</code></a>, aimed at stats / data science types</p>
<p><a href="http://www.bendmorris.com/2013/09/using-make-for-reproducible-scientific.html">Using Make for reproducible scientific analyses</a>, blog post by Ben Morris</p>
<p>Software Carpentry’s <a href="http://software-carpentry.org/v4/make/index.html">Slides on <code>Make</code></a></p>
<p>Zachary M. Jones wrote <a href="http://zmjones.com/make.html">GNU Make for Reproducible Data Analysis</a></p>
<p><a href="http://alaiacano.github.io/blog/2013/03/14/keeping-tabs-on-your-data-analysis-workflow/">Keeping tabs on your data analysis workflow</a>, blog post by Adam Laiacano, who works at Tumblr</p>
<p>Mike Bostock, of D3.js and New York Times fame, explains <a href="http://bost.ocks.org/mike/make/">Why Use Make</a>: “it’s about the benefits of capturing workflows via a file-based dependency-tracking build system”</p>
<p><a href="http://bitaesthetics.com/posts/make-for-data-scientists.html">Make for Data Scientists</a>, blog post by Paul Butler, who also made a <a href="https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919">beautiful map of Facebook connections</a> using R</p>
<p>Other, more modern data-oriented alternatives to <code>make</code>:</p>
<ul>
<li><a href="https://github.com/Factual/drake">Drake</a>, a kind of “make for data”</li>
<li><a href="http://www.nextflow.io">Nextflow</a> for “data-driven computational pipelines”</li>
<li><a href="https://github.com/richfitz/maker"><code>maker</code></a> (since renamed <code>remake</code>), “Make-like build management, re-imagined for R”</li>
</ul>
<p><a href="http://www.oreilly.com/openbook/make3/book/">Managing Projects with GNU Make, Third Edition</a> by Robert Mecklenburg is a fantastic book but, sadly, is very focused on compiling software</p>
<p>RStudio’s <a href="http://rmarkdown.rstudio.com">website documenting R Markdown</a> is generated from <a href="https://github.com/rstudio/rmarkdown/tree/gh-pages">this repo</a> using <a href="https://github.com/rstudio/rmarkdown/blob/gh-pages/Makefile">this 20 line Makefile</a>, which is sort of amazing. This is why we study regular expressions and follow filename conventions, people!</p>
<p><a href="http://dirk.eddelbuettel.com/code/littler.html">littler</a> is an R package maintained by Dirk Eddelbuettel that “provides the <code>r</code> program, a simplified command-line interface for GNU R.”</p>
</div>
<div class="footer">
This work is licensed under the <a href="http://creativecommons.org/licenses/by-nc/3.0/">CC BY-NC 3.0 Creative Commons License</a>.
</div>
</div>
<script>
// add bootstrap table styles to pandoc tables
$(document).ready(function () {
$('tr.header').parent('thead').parent('table').addClass('table table-condensed');
});
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
document.getElementsByTagName("head")[0].appendChild(script);
})();
</script>
</body>
</html>