Recode values for multiple variables

Package: dplyr

Function: `case_when()`

Note: dplyr::case_when() will evaluate in order, so proceed from most specific to most general. The function will not write over a previous evaluation even if something meets a new criteria.

1. Recode across multiple character variables (all variables)

Review the data (d19)

# A tibble: 3 x 3
  Var1  Var2  Var3 
  <chr> <chr> <chr>
1 y     n     Y    
2 Y     N     Y    
3 n     N     n

When this data was entered it appears that both lower and upper case letters were used.

We want to recode all “y” to “Y” and all “n” to “N” for Var1, Var2 and Var3.

Note: Unlike dplyr::recode() where we did not have to put quotes around the old character values, we now have to put quotes around the old and new character values.
Note: I use dplyr::across() to select multiple variables to apply my function to.
Note: We use the tidyselect selection helper everything() to refer to all variables. I could also have selected variables Var1:Var3.
Note: The . is used here to say apply this function to all of the variables I selected above.
Note: TRUE means, if a value was not evaluated in my arguments above, replace with the value I give. In this case, I am saying just replace with the existing value.

d19 %>%
  dplyr::mutate(dplyr::across(everything(),
                                 ~case_when(
                                   . == "y" ~ "Y",
                                   . == "n" ~ "N",
                                   TRUE ~ .
                                 )))

# A tibble: 3 x 3
  Var1  Var2  Var3 
  <chr> <chr> <chr>
1 Y     N     Y    
2 Y     N     Y    
3 N     N     N

A more sophisticated way to do this might be to bypass the dplyr::case_when() statement all together though and use something like stringr::str_to_upper() though.

d19 %>%
  dplyr::mutate(dplyr::across(everything(), ~ stringr::str_to_upper(.)))

# A tibble: 3 x 3
  Var1  Var2  Var3 
  <chr> <chr> <chr>
1 Y     N     Y    
2 Y     N     Y    
3 N     N     N

2. Recode across multiple character variables (Var1:Var3)

Review the data (d15)

# A tibble: 3 x 4
     id Var1  Var2  Var3 
  <dbl> <chr> <chr> <chr>
1   123 1.5   2.2   3.0  
2   234 2.2   UI    6.9  
3   345 MR    5.4   MR

In this example, missing values were entered as character values in a numeric column. So now these columns are considered character.

Recode “MR” to -98 and “UI” to -99 for Var1, Var2, Var3.

Notice that I am putting -98 and -99 in quotes. That is because this variable is currently considered a character variable. So we will recode first, and then convert the variable to numeric.

d15 <- d15 %>%
  dplyr::mutate(dplyr::across(Var1:Var3,
                       ~ case_when(
                       . == "MR" ~ "-98",
                       . == "UI" ~ "-99",
                       TRUE ~ .)))

d15

# A tibble: 3 x 4
     id Var1  Var2  Var3 
  <dbl> <chr> <chr> <chr>
1   123 1.5   2.2   3.0  
2   234 2.2   -99   6.9  
3   345 -98   5.4   -98

Remember I still need to do one additional step to convert this variable to numeric (using the base::as.numeric() function).

However, before I do that, I want to be certain I have no other character values left in my variables. If I do still have other character values and I try to use base::as.numeric() those values will be converted to NA.

I can check that there are no more instances of MR or UI using dplyr::filter() in conjunction with the dplyr::if_any() predicate function to select columns where the condition applies to any of the columns. This function is available in version 1.0.5 of dplyr.

.x (or just .) is used to say, for all of the variables I have selected

d15 %>% 
  dplyr::filter(dplyr::if_any(everything(), ~ .x %in% c('MR', 'UI')))

# A tibble: 0 x 4
# ... with 4 variables: id <dbl>, Var1 <chr>, Var2 <chr>, Var3 <chr>

If I wanted to be more general, I could also just check for any alphabetical character left in my variables using stringr::str_detect() and including the regex pattern [a-zA-Z]

d15 %>%
  dplyr::filter(dplyr::if_any(everything(), ~ stringr::str_detect(., pattern = "[a-zA-Z]")))

# A tibble: 0 x 4
# ... with 4 variables: id <dbl>, Var1 <chr>, Var2 <chr>, Var3 <chr>

Now that I know my variable has all character values removed, I can convert them to numeric.

d15 %>%
  dplyr::mutate(dplyr::across(Var1:Var3, as.numeric))

# A tibble: 3 x 4
     id  Var1  Var2  Var3
  <dbl> <dbl> <dbl> <dbl>
1   123   1.5   2.2   3  
2   234   2.2 -99     6.9
3   345 -98     5.4 -98

3. Recode across multiple numeric variables (sum and mean variables)

Review the data (d20)

# A tibble: 4 x 5
     id  sc_1  sc_2 sc_sum sc_mean
  <dbl> <dbl> <dbl>  <dbl>   <dbl>
1    10     1     2      3     1.5
2    11   -80   -80     NA    NA  
3    12     2     3      5     2.5
4    13    NA    NA     NA    NA

In this situation, -80 means the item was not provided to a student. If a student has a -80 for all items in a scale, we want to assign a -80 for the sum and mean of that scale to denote that this scale was not provided (to differentiate between a value of NA meaning the data are actually missing).

The scale “sc” is made up of 2 items.

Note: I am using the dplyr::if_all() predicate function to select columns where the condition applies to all columns. This function is available in version 1.0.5 of dplyr.

d20 %>%
  dplyr::mutate(dplyr::across(
    sc_sum:sc_mean,
    ~ dplyr::case_when(dplyr::if_all(sc_1:sc_2, ~ . == -80) ~ -80, TRUE ~ .)
  ))

# A tibble: 4 x 5
     id  sc_1  sc_2 sc_sum sc_mean
  <dbl> <dbl> <dbl>  <dbl>   <dbl>
1    10     1     2      3     1.5
2    11   -80   -80    -80   -80  
3    12     2     3      5     2.5
4    13    NA    NA     NA    NA

Return to Recode

Recode values for multiple variables

Package: dplyr

Function: case_when()

Function: `case_when()`