Package: dplyr


Function: mutate()


1. Create new dummy coded variables from a categorical column with multiple items selected per cell (cat)

Review the data (d2)

# A tibble: 5 x 2
     id cat     
  <dbl> <chr>   
1    10 1, 4, 11
2    11 4, 5, 10
3    12 5, 2, 11
4    13 5, 1    
5    14 <NA>    

We see here that our cat variable was a select all that apply question (where options are separated by a comma). We want these options to separated out into their own dummy coded variables that are 1 if the selection was chosen and 0 if the selection was not chosen.

We can create our new variables using dplyr::mutate(), adding all variables in our one mutate function.

cat1 = 1
cat2 = 2
cat4 = 4
cat5 = 5
cat10 = 10
cat11 = 11

  • Note: I used stringr::str_count() to count the number of instances for each categorical variable (there should only be one per row). We used regular expressions to pull out the values of interest.
d2 %>%
  mutate(
    cat1 = str_count(cat, " 1,|^1,|, 1$"),
    cat2 = str_count(cat, "2"),
    cat4 = str_count(cat, "4"),
    cat5 = str_count(cat, "5"),
    cat10 = str_count(cat, "^10| 10,|, 10$"),
    cat11 = str_count(cat, "^11| 11,|, 11$"))
# A tibble: 5 x 8
     id cat       cat1  cat2  cat4  cat5 cat10 cat11
  <dbl> <chr>    <int> <int> <int> <int> <int> <int>
1    10 1, 4, 11     1     0     1     0     0     1
2    11 4, 5, 10     0     0     1     1     1     0
3    12 5, 2, 11     0     1     0     1     0     1
4    13 5, 1         1     0     0     1     0     0
5    14 <NA>        NA    NA    NA    NA    NA    NA

We could have also used stringr::str_detect() if we were worried a category could duplicate in each row.

  • Note: With dplyr::case_when() we need to specifically recode NA values back to NA, otherwise they will be coded to 0.
d2 %>%
  mutate(cat1 =
           case_when(
             str_detect(cat, " 1,|^1,|, 1$") ~ 1,
             is.na(cat) ~ NA_real_,
             .default = 0
           ))
# A tibble: 5 x 3
     id cat       cat1
  <dbl> <chr>    <dbl>
1    10 1, 4, 11     1
2    11 4, 5, 10     0
3    12 5, 2, 11     0
4    13 5, 1         1
5    14 <NA>        NA

Or dplyr::if_else()

d2 %>%
  mutate(cat1 = if_else(str_detect(cat, " 1,|^1,|, 1$"), 1, 0))
# A tibble: 5 x 3
     id cat       cat1
  <dbl> <chr>    <dbl>
1    10 1, 4, 11     1
2    11 4, 5, 10     0
3    12 5, 2, 11     0
4    13 5, 1         1
5    14 <NA>        NA

2. Create new dummy coded race variables from a race variable with multiple items selected per cell

Review the data (d12)

# A tibble: 4 x 2
  stu_id race 
   <dbl> <chr>
1    100 6    
2    101 1, 3 
3    102 99   
4    103 <NA> 

Here we want to create one dummy variable per race option BUT when “99” is selected (prefer not to answer), all values for race dummy variables should be 99.

While I could write out logic to create all 6 variables, I could instead create a function which would remove some of the repetitiveness.

# Create function

race_vars <- function(Var, Num) {
  x <- case_when(
    is.na(Var) ~ NA,
    Var == 99 ~ 99,
    str_detect(Var, Num) ~ 1, 
    .default = 0
  )
}

# Use function to create race variables

d12 %>% 
  mutate(race1 = 
           race_vars(race, "1"),
         race2 =
           race_vars(race, "2"),
         race3 = 
           race_vars(race, "3"),
         race4 =
           race_vars(race, "4"),
         race5 =
           race_vars(race, "5"),
         race6 = 
           race_vars(race, "6"))
# A tibble: 4 x 8
  stu_id race  race1 race2 race3 race4 race5 race6
   <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    100 6         0     0     0     0     0     1
2    101 1, 3      1     0     1     0     0     0
3    102 99       99    99    99    99    99    99
4    103 <NA>     NA    NA    NA    NA    NA    NA


Package: tidyr


Function: pivot_wider()


1. Create new dummy coded variables from a string categorical column (cert)

Review the data (d3)

# A tibble: 5 x 2
     id cert        
  <dbl> <chr>       
1    10 elem ed     
2    11 special ed  
3    12 elem ed     
4    13 special ed  
5    14 secondary ed

In this situation, we only have one word per cell selected.

Using tidyr::pivot_wider() we can pivot our categorical variable into separate dummy variables where 1 = yes the category was selected and 0 = the category was not selected.

Before doing this, we will need to add one new count variable using the dplyr::mutate() function. This count variable will be used in our values_from argument in for tidyr::pivot_wider().

  • Note: We also add the argument values_fill to fill in any cell where the response option was not selected.

  • Note: I added janitor::clean_names() at the end to replace the space within variable names with an underscore.

d3 %>%
  mutate(count = 1) %>%
  pivot_wider(names_from = cert,
              values_from = count,
              values_fill = 0) %>%
  janitor::clean_names()
# A tibble: 5 x 4
     id elem_ed special_ed secondary_ed
  <dbl>   <dbl>      <dbl>        <dbl>
1    10       1          0            0
2    11       0          1            0
3    12       1          0            0
4    13       0          1            0
5    14       0          0            1

There are many other ways to do this using dplyr::if_else() or dplyr::case_when(). Here is an example below. However, I think using tidyr::pivot_wider() is a quicker and more efficient option.

d3 %>%
  mutate(elem_ed = if_else(cert == "elem ed", 1, 0))
# A tibble: 5 x 3
     id cert         elem_ed
  <dbl> <chr>          <dbl>
1    10 elem ed            1
2    11 special ed         0
3    12 elem ed            1
4    13 special ed         0
5    14 secondary ed       0

2. Create new dummy coded variables from a categorical column with multiple items selected per cell (cert)

Review the data (d7)

# A tibble: 5 x 3
     id cert                                age
  <dbl> <chr>                             <dbl>
1    10 elem ed, secondary ed                30
2    11 special ed                           42
3    12 elem ed, special ed, secondary ed    23
4    13 special ed, secondary ed             51
5    14 secondary ed                         29

Now we have a select all question where multiple responses may exist in the cert variable. This is okay, we can still use tidyr::pivot_wider() here.

First we need to separate our variable into multiple rows. We can do this using tidyr::separate_rows() like we did in the separate section. When we do the separation, I save the output over the original “cert” variable. I could have also created a new variable if I wanted to.

Once we do that our cases will repeat with each response to the cert variable appearing on a new line. This may be scary at first but we will clean it up in just a second.

Next we can create a count variable like we did above using dplyr::mutate(). We can call this variable whatever we want. Here I am calling it count.

Now we can pivot. After we pivot, we notice that our data has been regrouped to one line per case again. No more duplicate rows. And our columns such as age are still in our dataset, which is what we wanted.

d7 %>% 
  separate_rows(cert, sep = ", ") %>%
  mutate(count = 1) %>%
  pivot_wider(
    names_from = cert,
    values_from = count,
    values_fill = 0
  ) %>%
  janitor::clean_names()
# A tibble: 5 x 5
     id   age elem_ed secondary_ed special_ed
  <dbl> <dbl>   <dbl>        <dbl>      <dbl>
1    10    30       1            1          0
2    11    42       0            0          1
3    12    23       1            1          1
4    13    51       0            1          1
5    14    29       0            1          0

If we wanted to more explicitly keep our age and id variables, we could name them in the id_cols argument to make sure they carry over.

d7 %>% 
  separate_rows(cert, sep = ", ") %>%
  mutate(count = 1) %>%
  pivot_wider(
    id_cols = c(id, age),
    names_from = cert,
    values_from = count,
    values_fill = 0
  ) %>%
  janitor::clean_names()
# A tibble: 5 x 5
     id   age elem_ed secondary_ed special_ed
  <dbl> <dbl>   <dbl>        <dbl>      <dbl>
1    10    30       1            1          0
2    11    42       0            0          1
3    12    23       1            1          1
4    13    51       0            1          1
5    14    29       0            1          0

3. Create new dummy coded variables from a categorical column with multiple items selected per cell (cat)

Review the data (d2)

# A tibble: 5 x 2
     id cat     
  <dbl> <chr>   
1    10 1, 4, 11
2    11 4, 5, 10
3    12 5, 2, 11
4    13 5, 1    
5    14 <NA>    

This is the same example from #1 at the top of this page. However, now we can create our dummy coded variables using tidy::pivot_wider() saving us so much time compared to the stringr::str_count() or stringr::str_detect() method.

The only thing I am adding here, compared to the example directly above, is the names_prefix argument.

I am also adding a dplyr::select() statement at the end to reorder the variables and drop the “catNA” variable that is created due to id 14 having a value of NA for “cat”.

d2 %>% 
  separate_rows(cat, sep = ", ") %>%
  mutate(count = 1) %>%
  pivot_wider(
    names_from = cat,
    names_prefix = "cat",
    values_from = count, 
    values_fill = 0
  ) %>%
  select(id, cat1, cat2, cat4, cat5, cat10, cat11)
# A tibble: 5 x 7
     id  cat1  cat2  cat4  cat5 cat10 cat11
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    10     1     0     1     0     0     1
2    11     0     0     1     1     1     0
3    12     0     1     0     1     0     1
4    13     1     0     0     1     0     0
5    14     0     0     0     0     0     0

Last, you’ll notice above that id 14 now has zeros for all of the categories. Whereas before they had NAs. If the NAs actually mean that their data is missing, then we don’t want id 14 to have zeros for all of our dummy categories. We want them to have NAs.

We could fix this by keeping the “catNA” variable and then using this variable as an indicator for when to recode our other categorical variables back to NA.

  • Note: We use dplyr::case_when() and dplyr::across() to recode across all of our variables.
d2 %>% 
  separate_rows(cat, sep = ", ") %>%
  mutate(count = 1) %>%
  pivot_wider(
    names_from = cat,
    names_prefix = "cat",
    values_from = count, 
    values_fill = 0
  ) %>%
  select(id, catNA, cat1, cat2, cat4, cat5, cat10, cat11) %>%
  mutate(across(cat1:cat11,
                       ~ case_when(
                         catNA == 1 ~ NA_real_,
                         .default = .
                       ))) %>%
  select(-catNA)
# A tibble: 5 x 7
     id  cat1  cat2  cat4  cat5 cat10 cat11
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    10     1     0     1     0     0     1
2    11     0     0     1     1     1     0
3    12     0     1     0     1     0     1
4    13     1     0     0     1     0     0
5    14    NA    NA    NA    NA    NA    NA

Return to Create New Variables