Package: dplyr


Function: mutate()


1. Create an indicator column (scale1_risk)

# A tibble: 5 x 4
     id grade cut_scale1 scale1_sum
  <dbl> <dbl>      <dbl>      <dbl>
1     1     2        2.5          1
2     2     3        2            4
3     3     3        2            1
4     4    NA       NA            3
5     5     2        2.5          4

I have a sum score and a cut score for a scale. The cut scores are different depending on grade level.

I want to create an indicator of risk level:

sum score < cut score = 0 score score >= cut score = 1

  • Note: I use dplyr::case_when() to recode my sum variable into a dichotomous variable.

My logic says:

  • If either grade or my selected sum score variable is NA, then my new variable should be coded to NA.
  • Then if my selected sum score variable is less than my selected cut score variable, code my new variable to 0.
  • Else (TRUE) code to 1.
d9 %>%
  dplyr::mutate(scale1_risk =
                  dplyr::case_when(
    is.na(grade) | is.na(scale1_sum) ~ NA_real_,
    scale1_sum < cut_scale1 ~ 0,
    TRUE ~ 1
  ))
# A tibble: 5 x 5
     id grade cut_scale1 scale1_sum scale1_risk
  <dbl> <dbl>      <dbl>      <dbl>       <dbl>
1     1     2        2.5          1           0
2     2     3        2            4           1
3     3     3        2            1           0
4     4    NA       NA            3          NA
5     5     2        2.5          4           1

2. Create multiple _risk indicator columns

Review the data (d6)

# A tibble: 5 x 8
     id grade cut_scale1 cut_scale2 cut_scale3 scale1_sum scale2_sum scale3_sum
  <dbl> <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
1     1     2        2.5        3.1          3          1          2          4
2     2     3        2          3            2          2          4          1
3     3     3        2          3            2          3          1          2
4     4    NA       NA         NA           NA          3          4          1
5     5     2        2.5        3.1          3          4          2         NA

This time I have sum scores for 3 different scales. I also have 3 different cut scores for each of those scales. The cut scores are different depending on grade level.

I want to create an indicator of risk level:

sum score < cut score = 0 score score >= cut score = 1

I could write this calculation out one piece at a time, but I would prefer to create a function that I can use that will minimize duplication.

First I create a cutoff function that I can use.

In this function I call my data frame and I use dplyr::mutate() to create my new at_risk variable which will be named dynamically.

  • Note: In order to name this variable dynamically I have to use :=. I also have to surround all data frame variables with {{}} to denote that I am referring to data frame variables, not environment variables.
cutoff <- function(df, new_name, cut_score, sum_score) {
  df %>%
    dplyr::mutate({{new_name}} := dplyr::case_when(
      is.na(grade) | is.na({{sum_score}}) ~ NA_real_,
      {{sum_score}} < {{cut_score}} ~ 0,
      TRUE ~ 1))
}

Now I can use my new function to create my new _risk variables.

d6 <- cutoff(d6, scale1_risk, cut_scale1, scale1_sum)

d6 <- cutoff(d6, scale2_risk, cut_scale2, scale2_sum)

d6 <- cutoff(d6, scale3_risk, cut_scale3, scale3_sum)

d6 %>%
  dplyr::select(tidyselect::contains("risk"))
# A tibble: 5 x 3
  scale1_risk scale2_risk scale3_risk
        <dbl>       <dbl>       <dbl>
1           0           0           1
2           1           1           0
3           1           0           1
4          NA          NA          NA
5           1           0          NA

Return to Create New Variables