Simplify R for data analysis commands #143

Open
dkcoxie opened this issue Oct 28, 2021 · 1 comment

dkcoxie commented Oct 28, 2021

Reduce the length of individual commands so they are easier to type, and streamline the lesson to leave sufficient time for later material, specifically cleaning 'messy' data.


MrFlick commented Nov 4, 2021

The two main areas where we had used copy/paste for long code sections were (1) the col_names during read

read_csv("data/co2-un-data.csv", skip=2,
         col_names=c("region", "country", "year", "series", "value", "footnotes", "source"))

and (2) the series names during recode

  mutate(series = recode(series, "Emissions (thousand metric tons of carbon dioxide)" = "total",
                         "Emissions per capita (metric tons of carbon dioxide)" = "per_capita")) %>%

For (1) I propose that we skip the col_names argument and use the rename verbs. I think this is a bit better because it can be easy to mix up column names and not realize it, especially if data formats change over time. At least with explicit renames it's clearer what the intention is. With the latest version of dplyr/readr I think it makes sense to instead use

read_csv("co2-un-data.csv", skip=1) %>% 
  rename_with(tolower) %>% 
  rename(region = `region/country/area`, country=...2)
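
To make the renaming approach easy to try without the real file, here is a minimal sketch that reads a small inline CSV instead. The `csv_text` contents are a made-up stand-in for the first lines of co2-un-data.csv (an assumption, not the actual data), and `I()` for literal data assumes readr 2.0 or later. The blank second header cell is what readr names `...2`.

```r
library(readr)
library(dplyr)

# Hypothetical stand-in for the top of co2-un-data.csv; the values are invented.
csv_text <- I("Title line to skip,,,,,,
Region/Country/Area,,Year,Series,Value,Footnotes,Source
8,Albania,1975,Emissions (thousand metric tons of carbon dioxide),4338.3,,Example source
")

co2 <- read_csv(csv_text, skip = 1, show_col_types = FALSE) %>%
  rename_with(tolower) %>%                  # lower-case every column name
  rename(region = `region/country/area`,    # explicit, self-documenting renames
         country = `...2`)                  # readr's name for the blank header cell

names(co2)
```

If the file layout changes, the explicit `rename()` fails with an error about the missing column instead of silently mislabelling it, which is the main argument for this approach over a positional col_names vector.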

For (2), the concern was that you just have to type really long names exactly for the recode to work and the IDE can't provide autocomplete in that case. An alternative to consider would be case_when. For example

  mutate(series = case_when(
    str_starts(series, "Emissions per capita") ~ "per_capita",
    str_starts(series, "Emissions") ~ "total"
  ))
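
One thing worth pointing out to learners if we go this way: case_when uses the first condition that matches, so the more specific prefix has to come first. A small sketch on a two-row tibble (built here only for illustration) shows what happens when the order is swapped:

```r
library(dplyr)
library(stringr)

df <- tibble(series = c(
  "Emissions (thousand metric tons of carbon dioxide)",
  "Emissions per capita (metric tons of carbon dioxide)"
))

# Specific prefix first: each series gets its own label.
ordered <- df %>%
  mutate(series = case_when(
    str_starts(series, "Emissions per capita") ~ "per_capita",
    str_starts(series, "Emissions") ~ "total"
  ))

# General prefix first: "Emissions" matches both rows, so both become "total".
swapped <- df %>%
  mutate(series = case_when(
    str_starts(series, "Emissions") ~ "total",
    str_starts(series, "Emissions per capita") ~ "per_capita"
  ))
```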

Unfortunately case_when isn't the most straightforward function to use, and it's kind of annoying to have to explain the ~ syntax, but it can help out with a lot of data manipulation tasks. A simpler alternative would be if_else

  mutate(series = if_else(str_starts(series, "Emissions per capita"), "per_capita", "total"))

which works fine when there are just two categories.
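
To illustrate the two-category caveat, here is a rough sketch with a hypothetical third series value added (it is not in the real data, as far as I know). if_else puts everything that isn't per capita into "total", while case_when leaves unmatched rows as NA, which is easier to spot:

```r
library(dplyr)
library(stringr)

df <- tibble(series = c(
  "Emissions (thousand metric tons of carbon dioxide)",
  "Emissions per capita (metric tons of carbon dioxide)",
  "Some other series"  # hypothetical extra category, for illustration only
))

labelled <- df %>%
  mutate(
    # Anything that is not per capita is labelled "total", including the extra row.
    via_if_else = if_else(str_starts(series, "Emissions per capita"),
                          "per_capita", "total"),
    # Rows matching no condition become NA.
    via_case_when = case_when(
      str_starts(series, "Emissions per capita") ~ "per_capita",
      str_starts(series, "Emissions") ~ "total"
    )
  )
```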

If anyone has other suggestions, let me know. Otherwise if these look good, I can create a pull request for the changes.
