case_when()Note: dplyr::case_when() will evaluate
in order, so proceed from most specific to most general. The function
will not write over a previous evaluation even if something meets a new
criteria.
1. Recode across multiple character variables (all variables)
Review the data (d19)
# A tibble: 3 x 3
Var1 Var2 Var3
<chr> <chr> <chr>
1 y n Y
2 Y N Y
3 n N n
When this data was entered it appears that both lower and upper case letters were used.
We want to recode all “y” to “Y” and all “n” to “N” for
Var1, Var2 and Var3.
Note: Unlike dplyr::recode() where we did not have
to put quotes around the old character values, we now have to put quotes
around the old and new character values.
Note: I use dplyr::across() to select multiple
variables to apply my function to.
Note: We use the tidyselect selection helper
everything() to refer to all variables. I could also have
selected variables Var1:Var3.
Note: The . is used here to say apply this function to all of the variables I selected above.
Note: TRUE means, if a value was not evaluated in my arguments above, replace with the value I give. In this case, I am saying just replace with the existing value.
d19 %>%
dplyr::mutate(dplyr::across(everything(),
~case_when(
. == "y" ~ "Y",
. == "n" ~ "N",
TRUE ~ .
)))
# A tibble: 3 x 3
Var1 Var2 Var3
<chr> <chr> <chr>
1 Y N Y
2 Y N Y
3 N N N
A more sophisticated way to do this might be to bypass the
dplyr::case_when() statement all together though and use
something like stringr::str_to_upper() though.
d19 %>%
dplyr::mutate(dplyr::across(everything(), ~ stringr::str_to_upper(.)))
# A tibble: 3 x 3
Var1 Var2 Var3
<chr> <chr> <chr>
1 Y N Y
2 Y N Y
3 N N N
2. Recode across multiple character variables (Var1:Var3)
Review the data (d15)
# A tibble: 3 x 4
id Var1 Var2 Var3
<dbl> <chr> <chr> <chr>
1 123 1.5 2.2 3.0
2 234 2.2 UI 6.9
3 345 MR 5.4 MR
In this example, missing values were entered as character values in a numeric column. So now these columns are considered character.
Recode “MR” to -98 and “UI” to -99 for Var1,
Var2, Var3.
Notice that I am putting -98 and -99 in quotes. That is because this variable is currently considered a character variable. So we will recode first, and then convert the variable to numeric.
d15 <- d15 %>%
dplyr::mutate(dplyr::across(Var1:Var3,
~ case_when(
. == "MR" ~ "-98",
. == "UI" ~ "-99",
TRUE ~ .)))
d15
# A tibble: 3 x 4
id Var1 Var2 Var3
<dbl> <chr> <chr> <chr>
1 123 1.5 2.2 3.0
2 234 2.2 -99 6.9
3 345 -98 5.4 -98
Remember I still need to do one additional step to convert this
variable to numeric (using the base::as.numeric()
function).
However, before I do that, I want to be certain I have no other
character values left in my variables. If I do still have other
character values and I try to use base::as.numeric() those
values will be converted to NA.
I can check that there are no more instances of MR or UI using
dplyr::filter() in conjunction with the
dplyr::if_any() predicate function to select columns where
the condition applies to any of the columns. This function is available
in version 1.0.5 of dplyr.
d15 %>%
dplyr::filter(dplyr::if_any(everything(), ~ .x %in% c('MR', 'UI')))
# A tibble: 0 x 4
# ... with 4 variables: id <dbl>, Var1 <chr>, Var2 <chr>, Var3 <chr>
If I wanted to be more general, I could also just check for
any alphabetical character left in my variables using
stringr::str_detect() and including the regex pattern
[a-zA-Z]
d15 %>%
dplyr::filter(dplyr::if_any(everything(), ~ stringr::str_detect(., pattern = "[a-zA-Z]")))
# A tibble: 0 x 4
# ... with 4 variables: id <dbl>, Var1 <chr>, Var2 <chr>, Var3 <chr>
Now that I know my variable has all character values removed, I can convert them to numeric.
d15 %>%
dplyr::mutate(dplyr::across(Var1:Var3, as.numeric))
# A tibble: 3 x 4
id Var1 Var2 Var3
<dbl> <dbl> <dbl> <dbl>
1 123 1.5 2.2 3
2 234 2.2 -99 6.9
3 345 -98 5.4 -98
3. Recode across multiple numeric variables (sum and mean variables)
Review the data (d20)
# A tibble: 4 x 5
id sc_1 sc_2 sc_sum sc_mean
<dbl> <dbl> <dbl> <dbl> <dbl>
1 10 1 2 3 1.5
2 11 -80 -80 NA NA
3 12 2 3 5 2.5
4 13 NA NA NA NA
In this situation, -80 means the item was not provided to a student. If a student has a -80 for all items in a scale, we want to assign a -80 for the sum and mean of that scale to denote that this scale was not provided (to differentiate between a value of NA meaning the data are actually missing).
The scale “sc” is made up of 2 items.
dplyr::if_all() predicate function
to select columns where the condition applies to all columns. This
function is available in version 1.0.5 of dplyr.d20 %>%
dplyr::mutate(dplyr::across(
sc_sum:sc_mean,
~ dplyr::case_when(dplyr::if_all(sc_1:sc_2, ~ . == -80) ~ -80, TRUE ~ .)
))
# A tibble: 4 x 5
id sc_1 sc_2 sc_sum sc_mean
<dbl> <dbl> <dbl> <dbl> <dbl>
1 10 1 2 3 1.5
2 11 -80 -80 -80 -80
3 12 2 3 5 2.5
4 13 NA NA NA NA
Return to Recode