R originated as a blend of Scheme, a functional programming language, and S, a statistical computing language. The language specification started as a toy used within a closed group of statisticians in the early 1990s [1], and grew organically in the public domain from the year 2000. Parts of the language will feel familiar to users of languages like Lisp or Haskell, and other parts will feel familiar to users of software like Matlab.
The language provides very few primitives, and most operators, like + or [, are implemented as functions, which can be overloaded and overridden by users. Given the power and flexibility, it is important that users of the language limit themselves to a known safe set of features, and understand their implications.
For a more comprehensive reading on the language, please see the official manuals. If you are a beginner, start with sections 1-4 of the R Language Definition, and follow up with An Introduction to R for general use. If your application is very specific and limited in scope, you may choose to skip learning the language and simply use a framework from our third-party dependencies.
At GRAIL, the following is a community-agreed set of rules to simplify the language and be consistent in our usage. It is not the aim of this guide to act as a tutorial or a reference.
The following language features are often the subject of opinion and hence we discuss them here. The decisions will gradually become part of our linter rules in Phabricator.
Functions like library, attachNamespace, and attach attach objects from their corresponding namespaces or data to the global search space, masking previously attached objects. This is equivalent to the using namespace feature in C++, or a wildcard import (from module import *) in Python.
An example of how a rogue package can break your code:
# Function defined in roguePkg
`+` <- function(a, b) {"foo"}
# Code in your R package or analysis script
1 + 1 # Returns 2
library(roguePkg)
1 + 1 # Returns "foo"
Pros: Attaching namespaces and data is convenient and makes the code more compact. This is especially true for common operators.
Cons: Although messages are printed on the console listing the objects that were masked by new definitions, these can be easily missed. This makes code behavior dependent on the order or existence of attach operations, which is obviously undesirable.
Decision: Avoid attaching as much as you can, and qualify all your objects with the namespace, e.g. dplyr::mutate. Attach only when you are absolutely sure you are not unintentionally masking. If you do have to attach, in interactive runs, visually inspect the output from your attach operation, and in non-interactive runs, check the output of the base::conflicts function. Also, see the Good Practice section on base::attach.
# Bad
library(somePkg)
some_func()
# Good
somePkg::some_func
somePkg::some_var
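To make the masking risk concrete without installing anything, the sketch below attaches a plain list instead of a package. The names rogue and rogue_demo are hypothetical stand-ins for a package that redefines a common function; the point is only that the unqualified call changes meaning after the attach, the qualified call does not, and base::conflicts reports the clash.

```r
# Hypothetical stand-in for a package that redefines a common function.
rogue <- list(sum = function(...) "foo")

before <- sum(1, 2)  # resolves to base::sum

# Attaching places the list ahead of base on the search path.
attach(rogue, name = "rogue_demo", warn.conflicts = FALSE)

after_attach <- sum(1, 2)        # now resolves to the masking definition
qualified <- base::sum(1, 2)     # the qualified call is unaffected

# Names in "rogue_demo" that also exist elsewhere on the search path.
masked <- base::conflicts(detail = TRUE)[["rogue_demo"]]

# Clean up so that later code sees base::sum again.
detach("rogue_demo", character.only = TRUE)
```

In a non-interactive script, a check such as stopifnot(length(masked) == 0) right after an attach is one possible way to fail early on unintended masking.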
A convenient way to generate sequences in R is using the : operator, e.g. 1:3. This section concerns the case when at least one side of the operator is a variable.
Pros: It is intuitive to think of 1:x as counting from 1 to x.
Cons: This pattern is often misused as a way to get a sequence of length x. It gives incorrect results when x < 1: for example, 1:0 is c(1, 0), which is not the intended result in most cases, where you would expect an empty vector.
Decision: Use the seq family of functions to generate sequences in a reliable way. seq_along and seq_len are probably the most commonly used, i.e.
- Use seq_len(x) instead of 1:x
- Use seq_along(x) instead of 1:length(x)
# Bad
1:x
1:length(y)
# Good
seq_len(x)
seq_along(y)
1:4
1:-2
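A short sketch of the failure mode described above, using a zero-length case. The variable names are illustrative only; the loop counters show that a for loop over 1:x runs twice when x is 0, while the seq_len version does not run at all.

```r
x <- 0

bad <- 1:x          # c(1, 0): counts down instead of being empty
good <- seq_len(x)  # integer(0)

iterations_bad <- 0
for (i in 1:x) {
  iterations_bad <- iterations_bad + 1  # body runs for i = 1 and i = 0
}

iterations_good <- 0
for (i in seq_len(x)) {
  iterations_good <- iterations_good + 1  # body never runs
}

# The same problem appears with 1:length(y) on an empty vector.
y <- character(0)
n_bad <- length(1:length(y))   # 2
n_good <- length(seq_along(y)) # 0
```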
R provides separate message and warning channels, which can be suppressed if the user wants.
Pros: Keeping warnings in a separate channel allows you to focus on the details of your analysis.
Cons: Warnings are often ignored by users, but they may contain hints about subtle bugs.
Decision: In production R code, and in official analysis (presented to an internal audience, etc.), set the warning level to 2 (all warnings become hard errors). In all code, set at least level 1 (all warnings are printed as they happen). Also enable warnings in additional situations when you can. We recommend this set of base R options related to warnings:
options(warn = 2)
options(warnPartialMatchArgs = TRUE)
options(warnPartialMatchAttr = TRUE)
options(warnPartialMatchDollar = TRUE)
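As an illustration of what warning level 2 changes, the sketch below runs a function that emits a coercion warning, first with warnings suppressed and then with options(warn = 2). The function name risky is hypothetical. The option is restored afterwards so the example does not leak settings into other code.

```r
# Hypothetical function: coercing "abc" to integer warns and yields NA.
risky <- function() {
  return(as.integer("abc"))
}

# With the warning ignored, the NA flows silently into later code.
silent_value <- suppressWarnings(risky())

# With warn = 2, the same warning is converted to an error.
old <- options(warn = 2)
outcome <- tryCatch(
  {
    risky()
    "completed"
  },
  error = function(e) {
    return("stopped")
  }
)
options(old)  # restore the previous warning level
```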
It pays to keep in mind that there are no scalars in R, and that vectors of insufficient length are recycled in different ways to match the dimensions needed for an operation. Also note that indexing in R takes on different semantics in different contexts, and implicit recycling may or may not generate a warning. See this course lecture for some examples of these semantics.
Pros: These semantics are a convenience for interactive analysis.
Cons: They make it hard to verify code correctness during code reviews.
Decision: Base R does not give you many tools to control these semantics, and it may help to use third-party packages specialized for the problem you are trying to solve. A future version of these guidelines will be more explicit. Some guidelines to help keep you sane in base R:
- Use drop = FALSE as a third argument when indexing a variable number of columns in a data.frame to always consistently get a data.frame back. Without this, single column subsets automatically convert to a vector.

    > df <- data.frame(foo = 1:2, bar = 3:4)
    > df[, "foo"]
    [1] 1 2
    > df[, "foo", drop = FALSE]
      foo
    1   1
    2   2

- Be careful when using logical operators in R. The longer forms && and || are meant to be used only in if clauses. In R versions before 4.2.0 they do not generate warnings when used on vectors of length larger than 1 and silently use only the first element, as in the outputs below; R 4.2.0 generates a warning, and R 4.3.0 and later raise an error.

    # Bad
    > c(TRUE, FALSE) && c(TRUE, TRUE)
    [1] TRUE
    > c(FALSE, FALSE) || c(FALSE, TRUE)
    [1] FALSE
    # Good
    > c(TRUE, FALSE) & c(TRUE, TRUE)
    [1]  TRUE FALSE
    > c(FALSE, FALSE) | c(FALSE, TRUE)
    [1] FALSE  TRUE
    # Bad
    > if (x & y) { ... }
    > ifelse(x && y, ..., ...)
    # Good
    > if (x && y) { ... }
    > ifelse(x & y, ..., ...)

- Be careful when using the c function on lists, as it will flatten its constituents. Prefer explicitly calling the constructor for the appropriate type, e.g. list, if you don't want this behavior.

    # Different behaviors of list concatenation
    > a <- list(one=1, two=2)
    > b <- list(three=3, four=4)
    > class(c(a, b))
    [1] "list"
    > length(c(a, b))
    [1] 4
    > class(c(a, b, recursive=TRUE))
    [1] "numeric"
    > length(list(a, b))
    [1] 2

- Be careful with the semantics of the [[ and the [ operators. When you want an individual element, always use the [[ operator. When you want a vector or a list, use the [ operator. The advantage is that the [[ operator will check if you are referencing more than one element. Unfortunately, the converse is not true for the [ operator.

    > x <- as.list(1:2)
    > class(x[[2]])
    [1] "integer"
    > class(x[2])
    [1] "list"
    > class(x[1:2])
    [1] "list"
    > class(x[[1:2]])
    Error in x[[1:2]] : subscript out of bounds

- When computing indices, prefer to keep them in logical form rather than numeric form. This allows you to explicitly assert or debug that the length of the indices is the same as the length of the object being indexed. This is especially useful when indices are passed to other contexts where such assertions are not obvious and the user might want to explicitly assert. This rule usually only means that you should not call which to convert logical indices into numeric.
- Set options(check.bounds = TRUE) to generate a warning whenever a vector or a list is extended by an out-of-bounds index write. By default, an out-of-bounds index access returns NA silently, and an index write will grow the object automatically to the index you specify, filling all intermediate positions with NA. Note that existing code may rely on implicit growth of vectors by indexing, so it may not be feasible for you to set this option globally.

    > x <- 1
    # Maybe unintended results
    > x[3] <- 3
    > x
    [1]  1 NA  3
R has broadly three object oriented programming systems: S3 and S4, which are implemented from the S language specification, and reference classes (a.k.a. R5), which are implemented in base R, with an alternative lightweight implementation provided in the CRAN package R6.
Here, it will suffice to say that base R uses the simpler S3 system, Bioconductor packages use the more complex S4 system, and most other contemporary packages use the R6 system. For a detailed discussion on these systems, see Advanced R, pt III.
We propose using the R6 system for GRAIL packages, whenever there is non-trivial complexity.
Pros: The R6 system has reference semantics which allows a method on a class to modify its own state and return a value. Moreover, the methods are encapsulated within an object and do not live in the global namespace.
Cons: R6 leads to non-idiomatic R code because its semantics are very different from the S language specification.
Decision: OOP is not expected to be very common in GRAIL R code. When an OOP system makes sense in new code, prefer R6. Avoid mixing S3 and S4: S4 methods ignore S3 inheritance and vice-versa. Between S3 and S4, prefer S3 for its simplicity.
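For the simple cases where S3 is sufficient, a minimal sketch could look like the following. The class name sample_batch and the generic n_samples are hypothetical, chosen only for illustration; the sketch uses base R alone and does not show R6, which needs the R6 package.

```r
# Constructor for a hypothetical S3 class "sample_batch".
new_sample_batch <- function(ids) {
  stopifnot(is.character(ids))
  return(structure(list(ids = ids), class = "sample_batch"))
}

# Generic function; dispatches on the class attribute of x.
n_samples <- function(x) {
  UseMethod("n_samples")
}

# Method for the "sample_batch" class.
n_samples.sample_batch <- function(x) {
  return(length(x$ids))
}

batch <- new_sample_batch(c("s1", "s2", "s3"))
batch_size <- n_samples(batch)
```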
R was designed as a REPL (read-eval-print-loop), which relies on dynamic creation of expressions. R made a design choice in its early days to expose expressions as first class objects to users of the language. This makes it possible to have powerful metaprogramming features, in which code can generate more code dynamically to be evaluated in the interpreter loop. See documentation for expression and eval, or the R manual.
Metaprogramming makes code referentially opaque and consequently makes it harder to read and to reason about correctness. It is so pervasive in R that the two programming styles have been named "Standard Evaluation" and "Nonstandard Evaluation". See this article from the early years of R (2003) for a glimpse into its evolution.
For a more detailed reading, see Advanced R, pt IV.
Pros: Metaprogramming allows you to define convenient usage patterns in your code. For example, plot(x, sin(x)) can automatically infer the axes labels to be "x" and "sin(x)" by deparsing the arguments. Arguments to functions in packages like dplyr do not have to be valid R symbols, as they can be parsed lazily in a different context, like column names in the context of a table, by the underlying function.
Cons: The convenience this adds to the language also brings mysticism to it because it allows multiple domain specific languages (DSL) to be mixed together in the same code. For example, both data.table and dplyr packages implement their own DSLs which look very different from each other. This makes code extremely hard to read.
Decision: Use metaprogramming only when the alternative is much more complex. While this language feature is not difficult to use, it is much better to forsake convenience to be explicit for future users, readers and maintainers of your code. Always favor code correctness, maintainability and an intuitive usage signature over syntactic sugar.
# Hypothetical example of metaprogramming
a <- 1
# Avoid
> my_function <- function(x) sprintf("Value of %s = %d", as.character(substitute(x)), x)
> my_function(a)
[1] "Value of a = 1"
# Prefer
> my_function <- function(x, var_name) sprintf("Value of %s = %d", var_name, x)
> my_function(a, "a")
[1] "Value of a = 1"
The following functions from standard packages are discouraged because of unexpected behavior.
base::sample
Use base::sample.int instead on the indices of the data you want to sample, e.g. x[sample.int(length(x))] instead of sample(x). The sample function has a convenience feature that leads to undesired behavior when the argument is a single number > 1. See Details and Examples in the documentation.
base::subset
See warning in the documentation about caveats with nonstandard evaluation of the argument.
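The convenience feature of sample is easiest to see when the data happens to have length one. In the sketch below (variable names are illustrative), sample treats the single number 10 as "sample from 1:10", while indexing with sample.int returns a permutation of the data itself.

```r
x <- 10  # a data vector that happens to contain a single value

surprising <- sample(x)                # a permutation of 1:10, not of x
expected <- x[sample.int(length(x))]   # a permutation of x, i.e. just 10
```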
"Dependencies are invitations for other people to break your code"
R has a thriving ecosystem of community contributions in the form of established package repositories like CRAN and Bioconductor. CRAN Task Views and Bioc Views provide a catalogued view of these ~13,000 packages.
Still, follow the Tinyverse advice and try to keep your dependencies to a minimum; not all available packages are written and maintained to high standards.
Here, we discuss decisions on the usage of some features from the most common packages used at GRAIL.
https://tidyeval.tidyverse.org/
Tidy Evaluation is a set of principles made possible using metaprogramming. This framework modernizes similar metaprogramming frameworks in base R, and is primarily intended for data transformations without the boilerplate code from standard evaluation. Usual caveats of metaprogramming apply.
dplyr::arrange, dplyr::filter, dplyr::mutate, dplyr::select, etc.
One of the major objectives of Tidy Evaluation is to make the data your execution environment, extending the idea introduced by the formula interface, and by functions like base::with et al. To facilitate this concept, several functions exist in various tidyverse packages that take quoted expressions as arguments.
Pros: These functions reduce the boilerplate for the most common uses, when using them without any safety checks.
Cons: Use of safety checks (like quasiquotation or the .data pronoun) actually makes the code comparable in verbosity and readability to conventional idioms.
Decision:
- Prefer conventional idioms, if the readability is better or comparable.
- Use the programming recipes in Tidy Evaluation cheat sheet (p. 2), whenever correctness and maintainability of your code are important.
- Prefer alternative implementations which have cleaner standard evaluation semantics, e.g. the *_se variants in the seplyr package (introduction).
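As one possible reading of "conventional idioms", the sketch below filters and selects from a data.frame with base R standard evaluation, where the column name is held in an ordinary character variable. The data and the names df, col and threshold are made up for illustration; this is not taken from the linked recipes.

```r
# Hypothetical data for illustration.
df <- data.frame(
  id = c("a", "b", "c"),
  score = c(0.2, 0.9, 0.6),
  stringsAsFactors = FALSE
)

col <- "score"    # column name passed around as a plain string
threshold <- 0.5

# Filter rows and select columns with standard evaluation only.
high <- df[df[[col]] > threshold, c("id", col), drop = FALSE]
```

Because col is an ordinary value, the same lines work unchanged inside a function that receives the column name as an argument.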
magrittr::`%>%`
Read this blog post for some discussion on semantics of the pipe operator, and alternatives.
Pros: Great for readability as the code is organized as a flow.
Cons: The pipe operator will give different results if your RHS function call uses the current environment, or uses lazy evaluation (see how it works). See Technical Notes in documentation.
Decision:
- Prefer conventional idioms (example), if the readability is better or comparable.
- Do not use the pipe operator in a loop; your code will become slow.
- Use the operator only when the flow is linear.
- Break long flows (> 10 steps) by assigning intermediate variable names.
- Prefer alternative implementations which have cleaner standard evaluation semantics, e.g. the dot pipe operator (introduction, technical article).
R packages provide a namespace scoped collection of functions to other users. This provides much needed encapsulation when distributing your code and using code from other people.
Prefer writing and maintaining your code as an R package. Some useful guidelines:
- Use the official R manual as your reference whenever in doubt. All other works are derivations of this document.
- For GRAIL internal packages, use an empty or dummy License file.
- Avoid using the Depends clause in the DESCRIPTION file for packages. Package requirements listed there serve only to automatically attach those packages before your package is attached.
- Avoid importing entire packages in the NAMESPACE file. In Roxygen parlance, avoid using @import.
- Be selective in the symbols you import from other packages in the NAMESPACE file. In almost all cases, you should qualify your symbols with their namespace, e.g. dplyr::filter instead of just filter, which will remove the need for you to import the symbol. Symbols like %>% are worthy exceptions. In Roxygen parlance, this is the @importFrom directive.
- When using Roxygen, collect all package level directives, i.e. package documentation, @importFrom directives, etc., above a single symbol (typically NULL or "_PACKAGE"), usually in the file zzz.R.
- Write unit tests for your package when possible. Any executable file in the tests directory is considered a test. The preferred unit testing framework at GRAIL is testthat.
This section focuses on consistency in coding style as opposed to the focus on avoiding error-prone coding style as in the language section above.
The GRAIL style guide is based on the Tidyverse style guide, which is a stricter set than the Hadley style guide, which in turn is based on the Google style guide.
Notable observations and exceptions are listed below.
- Use lower case letters and hyphens (-) only.
- Use capital R in all relevant file extensions (.R, .Rmd, .Rdata, .Rds, etc.)
- For scripts meant to be run sequentially, prefix with left padded numbers, e.g. 00_setup.sh, 01_preprocess.sh, etc.
# Bad
import_data.R # "_" should be "-"
importData.R # Should not be camel case
import-data.r # Should be "import-data.R" with upper case "R"
# Good
import-data.R
- Keep them clear and short, in that order.
- Use all lower case letters, no punctuation.
- If the name references a GRAIL project that is a somewhat unique name, say 'ccga', then do not prefix 'grail'. Only prefix when you think there may be a name collision with a public package from CRAN, etc.
- Do not suffix the package name with 'r' if it does not make sense. This is in contrast with the Tidyverse guideline which says that you should suffix with 'r'.
# Bad
grailwgcnar # Do not use the r suffix.
grailstriveadsl # STRIVE is already a GRAIL project.
StriveADSL # Do not use mixed case.
# Good
grailwgcna # Use of grail prefix OK because name is too generic.
striveadsl
- Keep them clear and short, in that order.
- Use snake case (all lower case, words separated by underscore).
- Try to use verbs for function names and nouns for others.
- Do not reuse exported names from default packages that are attached on startup (base, stats, etc.). Use a syntax highlighter to guide yourself on which symbols are exported from the default packages.
# Bad
# Not following snake case; all the below should be day_one.
DayOne
dayone
day.one
# Too long or too cryptic.
first_day_of_the_month
djm1
# Overriding symbols from default packages.
T <- FALSE
c <- 10
mean <- function(x) sum(x)
- Be consistent in your usage of quotes, at least in the same file, and preferably in your package. Either use double quotes, or single quotes.
- Prefer using return explicitly, especially in longer functions. This is in contrast with the Google style guide which says you should use return only when you are using imperative style programming, i.e. using control flow operations. At GRAIL, we prefer to be consistent and not ask the user to be cognizant of which style a particular function is following because it is easy to mix the two.
- For functions that return an object that you do not want printed on the console, wrap your object in invisible.
- For side-effect functions that do not return anything, prefer using return(invisible(NULL)).
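A small sketch of the return conventions above. Both function names, log_step and rescale, are hypothetical. The invisible wrapper only suppresses auto-printing at the console; the value can still be assigned and used.

```r
# Hypothetical side-effect function: emits a message, returns nothing useful.
log_step <- function(msg) {
  message("[step] ", msg)
  return(invisible(NULL))
}

# Hypothetical function whose result should not auto-print when called
# at the console.
rescale <- function(x) {
  scaled <- (x - min(x)) / (max(x) - min(x))
  return(invisible(scaled))
}

scaled <- rescale(c(2, 4, 6))  # the value is still available on assignment
```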
- Omit argument names if usage is obvious, but be explicit otherwise.
- Do not use partial matching of argument names, e.g. val = TRUE instead of value = TRUE.
- Set default values for arguments only when the default conveys a meaning. For example, a default value of NULL rarely conveys a meaning. Use base::missing otherwise to check if an argument was set or not.
- Use match.arg (documentation) when you want your argument to accept one of a finite set of possible values.
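The argument guidelines above can be sketched together in one function. The function center_values and its arguments are hypothetical: method accepts one of a fixed set of values through match.arg, and offset has no default, so base::missing is used to detect whether it was supplied.

```r
center_values <- function(x, method = c("mean", "median"), offset) {
  # match.arg picks the first choice by default and rejects unknown values.
  method <- match.arg(method)
  if (method == "mean") {
    center <- mean(x)
  } else {
    center <- stats::median(x)
  }
  # offset has no meaningful default, so check whether it was supplied.
  if (!missing(offset)) {
    center <- center + offset
  }
  return(x - center)
}

by_mean <- center_values(c(1, 2, 6))
by_median <- center_values(c(1, 2, 6), method = "median")
with_offset <- center_values(c(1, 2, 6), offset = 1)
```

A call such as center_values(c(1, 2, 6), method = "mode") fails inside match.arg with an error, which is usually preferable to silently falling through to a default branch.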
- In non-testing code, raise all errors with stop or stopifnot.
- Use tryCatch judiciously.
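A minimal sketch of these two rules, assuming a hypothetical function safe_ratio: input checks use stopifnot, a domain-specific failure uses stop with an explicit message, and tryCatch appears only once, at the call site that knows how to recover.

```r
# Hypothetical function that validates its inputs before computing.
safe_ratio <- function(num, den) {
  stopifnot(is.numeric(num), is.numeric(den))
  if (any(den == 0)) {
    stop("den must not contain zeros")
  }
  return(num / den)
}

ok <- safe_ratio(6, 3)

# Catch the error only where a sensible fallback exists.
error_message <- NULL
fallback <- tryCatch(
  safe_ratio(1, 0),
  error = function(e) {
    error_message <<- conditionMessage(e)
    return(NA_real_)
  }
)
```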
- Less than 100 characters.
- OK to exceed for long strings that do not naturally break; it is better than pasting parts of the string together.
- Use two spaces as indent. Never use tabs or mix tabs and spaces. Exception: When a line break occurs inside parentheses, align the wrapped line with the first character inside the parenthesis, or use 4 characters as indent on next line.
- Place spaces around all infix operators (=, +, -, <-, etc.). Exception: Spaces around =s are optional (but preferred) when passing parameters in a function call.
- Do not place a space before a comma, but always place one after a comma. Exception: This rule is a paradox when there are consecutive commas.
- Extra spacing (i.e., more than one consecutive space) is OK if it improves alignment.
- Do not place spaces around code in parentheses or square brackets.
- Use { as last character on the line (except comments), and } as first non-indent character on the line.
- Do not use semicolons.
- Inline statements (no {}) are OK for short, simple expressions without control flow operations.
- Surround else with braces.
- Entire commented lines should begin with # and one space.
- Short comments can be placed after code preceded by two spaces, #, and then one space.
- Comments can be added between piped operations when the purpose of an operation is not obvious.
- Title is a single sentence in sentence case, without a period at the end. Separate from the next documentation block with one blank line.
- Description can be multiple paragraphs. Separate from the next documentation block with one blank line.
- Parameters are a single block with each parameter description starting with a lower case letter and ending in a period. Multi-line parameter descriptions should be indented 2 extra spaces or aligned with the lines above. When aligning, align all parameter descriptions the same way.
- No blank lines are needed between the function definition and the last documentation block.
- Have at least one space between #' and the text.
# Good
#' Get best labels after collapsing multiple labels into a group
#'
#' Given a set of multiclass predictions per sample and a mapping from fine-grained
#' multiclass labels to a set of more high-level labels, finds the high-level
#' label w/ the highest score.
#'
#' @param too_scores a data.frame w/ sample scores: must have an accession_id
#' column; multiclass score columns must have the same labels as too_class
#' column in too_labels_map.
#' @param too_labels_map a data.frame w/ too_class, full_name columns where
#' too_class represents a fine-grained set of class labels and full_name
#' aggregates/translates too_class labels to a smaller set of human-readable
#' class labels.
#' @return data.frame with accession_id, predicted_label with a row per sample.
#' @export
get_best_label <- function(too_scores, too_labels_map) { ... }
- Use a consistent style for TODOs throughout your code.
TODO(username): Explicit description of action to be taken
- Pros and cons of tibble as an alternative to data.frame. Looks like accessing non-existent columns generates a warning, and drop=FALSE is default.
- Performance benchmarks of Tidy Evaluation compared to alternatives.
- Preference to single quotes over double quotes for string constants.
- Generating errors for unintended use of switch, ifelse, case_when, etc.
- Examples of well formatted code blocks.
[1]: Thieme, N. (2018), R generation. Significance, 15: 14-19. doi:10.1111/j.1740-9713.2018.01169.x