Non-standard evaluation (NSE) mode in the dplyr package

From now on, I will try my best to keep my notes in English whenever possible, to practice the written English I need for work.

Recently I discussed the non-standard evaluation mode in the dplyr package with a colleague. Before that conversation, I had always thought of it as "dynamic variables" and used that phrase when searching Google for solutions to related problems. I then learned that in dplyr this dynamic mode is called “non-standard evaluation”.

Most dplyr verbs use tidy evaluation, a special type of non-standard evaluation used throughout the tidyverse. It introduces the concept of data masking, which lets you use data variables as if they were variables in the environment. It also provides tidy selection, which lets you choose variables based on their position (e.g. 1, 2, 3), name, or type (e.g. is.numeric).
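As a minimal sketch of these two ideas, the snippet below uses the built-in cars and iris data sets. The object names (fast, by_position, by_name, by_type) are only illustrative.

```r
library(dplyr)

# Data masking: `speed` and `dist` are columns of `cars`,
# but they are written like ordinary variables.
fast <- cars %>% filter(speed > 20, dist > 50)

# Tidy selection: choose columns by position, by name, or by type.
by_position <- iris %>% select(1, 2)
by_name     <- iris %>% select(starts_with("Petal"))
by_type     <- iris %>% select(where(is.numeric))
```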

In this post, I mainly record some common solutions for using dynamic variables (also called intermediate variables, or NSE) in dplyr. Although the features above make some tasks easier, it can be confusing to work out how to use NSE in mutate(), summarise(), group_by(), and filter(), especially inside the arguments of self-defined functions or ggplot2 calls.

In other words, I need to learn how to use non-standard evaluation (NSE) in dplyr calls.

Use the .data pronoun to refer to a column whose name is stored in a string variable.

library(dplyr)

GraphVar <- "dist"
cars %>% 
  group_by(.data[["speed"]]) %>% 
  summarise(Sum = sum(.data[[GraphVar]], na.rm = TRUE), 
            Count = n())
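The same pronoun should also work in filter(). Here is a small sketch; filter_var and threshold are hypothetical names that I made up for illustration.

```r
library(dplyr)

filter_var <- "speed"  # hypothetical: column name held in a string
threshold  <- 20       # hypothetical cutoff value

# .data[[filter_var]] looks up the column named by the string.
fast_cars <- cars %>%
  filter(.data[[filter_var]] > threshold)
```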

Use := to name an output column after the value of a string variable.

var <- "value"
iris %>%
  mutate(!!var := ifelse(Sepal.Length > 5, 1, 0))
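As a variation, recent versions of rlang (0.4.3 or later, as far as I know) also accept glue-style strings on the left-hand side of :=. This sketch assumes such a version is installed; the suffix _gt5 is just an illustrative choice.

```r
library(dplyr)

var <- "Sepal.Length"

# "{var}_gt5" is expanded to "Sepal.Length_gt5" before the column is created.
flagged <- iris %>%
  mutate("{var}_gt5" := ifelse(.data[[var]] > 5, 1, 0))
```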

The easiest approach to remember and apply is the constructor sym(), which turns a string into a symbol so that we can unquote something that looks like code instead of a string. This is often needed in ggplot2 and R Shiny.

grp.var <- "Species"
uniq.var <- "Sepal.Width"
iris %>%
  group_by(!!sym(grp.var)) %>%
  summarise(n_uniq = n_distinct(!!sym(uniq.var)))

For functions, the tricks fall into two situations depending on the type of variable involved: env-variables or data-variables.

  • Env-variables are “programming” variables that live in an environment. They are usually created with <-.
  • Data-variables are “statistical” variables that live in a data frame. I understand them as column names.
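To make the distinction concrete, here is a small sketch in which a column and an environment variable share the name x. The data frame df is made up for illustration; the .data and .env pronouns come from rlang and are available inside dplyr verbs.

```r
library(dplyr)

df <- data.frame(x = 1:3)  # data-variable `x`: a column of df
x  <- 10                   # env-variable `x`: created with <-

res <- df %>%
  mutate(
    masked   = x * 2,            # bare `x` resolves to the column first
    explicit = .data$x * .env$x  # column value times environment value
  )
```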

If the function arguments are bare column names rather than strings, surround them with doubled braces ({{ }}) inside the function. The braces capture the argument and unquote it in a single step.

mean_by <- function(data, var, group) {
  data %>%
    group_by({{group}}) %>%
    summarise(avg = mean({{var}}))
}

mean_by(starwars, group = species, var = height) %>% head()
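Doubled braces can also be used inside a glue-style string to build the output column name, again assuming a reasonably recent rlang. The function name mean_by_named is hypothetical, and I add na.rm = TRUE only so that missing heights do not turn the averages into NA.

```r
library(dplyr)

mean_by_named <- function(data, var, group) {
  data %>%
    group_by({{ group }}) %>%
    # "mean_{{ var }}" becomes "mean_height" when var = height
    summarise("mean_{{ var }}" := mean({{ var }}, na.rm = TRUE))
}

out <- mean_by_named(starwars, var = height, group = species)
```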

If we would like to pass the column names as character strings instead, we need to convert the strings into symbols with sym().

mean_by <- function(data, var, group) {
  group <- sym(group)
  var <- sym(var)
  data %>%
    group_by(!!group) %>%
    summarise(avg = mean(!!var))
}

mean_by(starwars, group = "species", var = "height") %>% head()
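An alternative sketch for string arguments avoids sym() altogether by combining the .data pronoun with across(all_of()). This is one possible approach, not the only one; mean_by_chr is a hypothetical name.

```r
library(dplyr)

mean_by_chr <- function(data, var, group) {
  data %>%
    # all_of() selects the column(s) named in the character vector `group`
    group_by(across(all_of(group))) %>%
    # .data[[var]] looks up the column named by the string `var`
    summarise(avg = mean(.data[[var]], na.rm = TRUE))
}

out <- mean_by_chr(starwars, var = "height", group = "species")
```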

If you want to pass user-supplied expressions, such as height * 100, doubled braces work as they are, but sym() does not, because it expects a string that names a single column. In this situation, we need to replace sym() with enquo().

mean_by <- function(data, var, group) {
  group <- enquo(group)
  var <- enquo(var)
  data %>%
    group_by(!!group) %>%
    summarise(avg = mean(!!var))
}

mean_by(starwars, var = height * 100, group = as.factor(species)) %>% head()
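Because enquo() keeps the whole expression, it can also be turned into a label for the output column. The sketch below uses rlang::as_label() for that; mean_by_labelled is a hypothetical name, and na.rm = TRUE is my own addition.

```r
library(dplyr)

mean_by_labelled <- function(data, var, group) {
  var   <- enquo(var)
  group <- enquo(group)
  data %>%
    group_by(!!group) %>%
    # as_label() turns the captured expression into a string used as the column name
    summarise(!!rlang::as_label(var) := mean(!!var, na.rm = TRUE))
}

out <- mean_by_labelled(starwars, var = height * 100, group = species)
names(out)
```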

dplyr allows multiple grouping variables, which can be passed through with the dots (...).

mean_by <- function(data, var, ...) {
  var <- enquo(var)

  data %>%
    group_by(...) %>%
    summarise(avg = mean(!!var))
}

mean_by(starwars, height, species, eye_color)
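When the grouping columns arrive as a character vector rather than as bare names, one possible sketch is to convert them with syms() and splice them in with !!!. The function name mean_by_many and the argument name groups are hypothetical.

```r
library(dplyr)

mean_by_many <- function(data, var, groups) {
  data %>%
    # syms() makes a list of symbols; !!! splices them in as separate arguments
    group_by(!!!syms(groups)) %>%
    summarise(avg = mean(.data[[var]], na.rm = TRUE), .groups = "drop")
}

out <- mean_by_many(starwars, "height", c("species", "eye_color"))
```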

The above supplements a previous blog post (https://www.bioinfo-scrounger.com/archives/R-dplyr-tricks/).



Please indicate the source: http://www.bioinfo-scrounger.com