mutate()1. Create new dummy coded variables from a categorical column
with multiple items selected per cell (cat)
Review the data (d2)
# A tibble: 5 x 2
id cat
<dbl> <chr>
1 10 1, 4, 11
2 11 4, 5, 10
3 12 5, 2, 11
4 13 5, 1
5 14 <NA>
We see here that our cat variable was a select all that
apply question (where options are separated by a comma). We want these
options to separated out into their own dummy coded variables that are 1
if the selection was chosen and 0 if the selection was not chosen.
We can create our new variables using dplyr::mutate(),
adding all variables in our one mutate function.
cat1 = 1
cat2 = 2
cat4 = 4
cat5 = 5
cat10 = 10
cat11 = 11
stringr::str_count() to count the number
of instances for each categorical variable (there should only be one per
row). We used regular expressions to pull out the values of
interest.d2 %>%
mutate(
cat1 = str_count(cat, " 1,|^1,|, 1$"),
cat2 = str_count(cat, "2"),
cat4 = str_count(cat, "4"),
cat5 = str_count(cat, "5"),
cat10 = str_count(cat, "^10| 10,|, 10$"),
cat11 = str_count(cat, "^11| 11,|, 11$"))
# A tibble: 5 x 8
id cat cat1 cat2 cat4 cat5 cat10 cat11
<dbl> <chr> <int> <int> <int> <int> <int> <int>
1 10 1, 4, 11 1 0 1 0 0 1
2 11 4, 5, 10 0 0 1 1 1 0
3 12 5, 2, 11 0 1 0 1 0 1
4 13 5, 1 1 0 0 1 0 0
5 14 <NA> NA NA NA NA NA NA
We could have also used stringr::str_detect() if we were
worried a category could duplicate in each row.
dplyr::case_when() we need to specifically
recode NA values back to NA, otherwise they will be coded to 0.d2 %>%
mutate(cat1 =
case_when(
str_detect(cat, " 1,|^1,|, 1$") ~ 1,
is.na(cat) ~ NA_real_,
.default = 0
))
# A tibble: 5 x 3
id cat cat1
<dbl> <chr> <dbl>
1 10 1, 4, 11 1
2 11 4, 5, 10 0
3 12 5, 2, 11 0
4 13 5, 1 1
5 14 <NA> NA
Or dplyr::if_else()
d2 %>%
mutate(cat1 = if_else(str_detect(cat, " 1,|^1,|, 1$"), 1, 0))
# A tibble: 5 x 3
id cat cat1
<dbl> <chr> <dbl>
1 10 1, 4, 11 1
2 11 4, 5, 10 0
3 12 5, 2, 11 0
4 13 5, 1 1
5 14 <NA> NA
2. Create new dummy coded race variables from a race variable with multiple items selected per cell
Review the data (d12)
# A tibble: 4 x 2
stu_id race
<dbl> <chr>
1 100 6
2 101 1, 3
3 102 99
4 103 <NA>
Here we want to create one dummy variable per race option BUT when “99” is selected (prefer not to answer), all values for race dummy variables should be 99.
While I could write out logic to create all 6 variables, I could instead create a function which would remove some of the repetitiveness.
# Create function
race_vars <- function(Var, Num) {
x <- case_when(
is.na(Var) ~ NA,
Var == 99 ~ 99,
str_detect(Var, Num) ~ 1,
.default = 0
)
}
# Use function to create race variables
d12 %>%
mutate(race1 =
race_vars(race, "1"),
race2 =
race_vars(race, "2"),
race3 =
race_vars(race, "3"),
race4 =
race_vars(race, "4"),
race5 =
race_vars(race, "5"),
race6 =
race_vars(race, "6"))
# A tibble: 4 x 8
stu_id race race1 race2 race3 race4 race5 race6
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100 6 0 0 0 0 0 1
2 101 1, 3 1 0 1 0 0 0
3 102 99 99 99 99 99 99 99
4 103 <NA> NA NA NA NA NA NA
pivot_wider()1. Create new dummy coded variables from a string categorical
column (cert)
Review the data (d3)
# A tibble: 5 x 2
id cert
<dbl> <chr>
1 10 elem ed
2 11 special ed
3 12 elem ed
4 13 special ed
5 14 secondary ed
In this situation, we only have one word per cell selected.
Using tidyr::pivot_wider() we can pivot our categorical
variable into separate dummy variables where 1 = yes the category was
selected and 0 = the category was not selected.
Before doing this, we will need to add one new count variable using
the dplyr::mutate() function. This count variable will be
used in our values_from argument in for
tidyr::pivot_wider().
Note: We also add the argument values_fill to fill in any cell where the response option was not selected.
Note: I added janitor::clean_names() at the end to
replace the space within variable names with an underscore.
d3 %>%
mutate(count = 1) %>%
pivot_wider(names_from = cert,
values_from = count,
values_fill = 0) %>%
janitor::clean_names()
# A tibble: 5 x 4
id elem_ed special_ed secondary_ed
<dbl> <dbl> <dbl> <dbl>
1 10 1 0 0
2 11 0 1 0
3 12 1 0 0
4 13 0 1 0
5 14 0 0 1
There are many other ways to do this using
dplyr::if_else() or dplyr::case_when(). Here
is an example below. However, I think using
tidyr::pivot_wider() is a quicker and more efficient
option.
d3 %>%
mutate(elem_ed = if_else(cert == "elem ed", 1, 0))
# A tibble: 5 x 3
id cert elem_ed
<dbl> <chr> <dbl>
1 10 elem ed 1
2 11 special ed 0
3 12 elem ed 1
4 13 special ed 0
5 14 secondary ed 0
2. Create new dummy coded variables from a categorical column
with multiple items selected per cell (cert)
Review the data (d7)
# A tibble: 5 x 3
id cert age
<dbl> <chr> <dbl>
1 10 elem ed, secondary ed 30
2 11 special ed 42
3 12 elem ed, special ed, secondary ed 23
4 13 special ed, secondary ed 51
5 14 secondary ed 29
Now we have a select all question where multiple responses may exist
in the cert variable. This is okay, we can still use
tidyr::pivot_wider() here.
First we need to separate our variable into multiple rows. We can do
this using tidyr::separate_rows() like we did in the separate
section. When we do the separation, I save the output over the original
“cert” variable. I could have also created a new variable if I wanted
to.
Once we do that our cases will repeat with each response to the
cert variable appearing on a new line. This may be scary at
first but we will clean it up in just a second.
Next we can create a count variable like we did above using
dplyr::mutate(). We can call this variable whatever we
want. Here I am calling it count.
Now we can pivot. After we pivot, we notice that our data has been regrouped to one line per case again. No more duplicate rows. And our columns such as age are still in our dataset, which is what we wanted.
d7 %>%
separate_rows(cert, sep = ", ") %>%
mutate(count = 1) %>%
pivot_wider(
names_from = cert,
values_from = count,
values_fill = 0
) %>%
janitor::clean_names()
# A tibble: 5 x 5
id age elem_ed secondary_ed special_ed
<dbl> <dbl> <dbl> <dbl> <dbl>
1 10 30 1 1 0
2 11 42 0 0 1
3 12 23 1 1 1
4 13 51 0 1 1
5 14 29 0 1 0
If we wanted to more explicitly keep our age and id variables, we could name them in the id_cols argument to make sure they carry over.
d7 %>%
separate_rows(cert, sep = ", ") %>%
mutate(count = 1) %>%
pivot_wider(
id_cols = c(id, age),
names_from = cert,
values_from = count,
values_fill = 0
) %>%
janitor::clean_names()
# A tibble: 5 x 5
id age elem_ed secondary_ed special_ed
<dbl> <dbl> <dbl> <dbl> <dbl>
1 10 30 1 1 0
2 11 42 0 0 1
3 12 23 1 1 1
4 13 51 0 1 1
5 14 29 0 1 0
3. Create new dummy coded variables from a categorical column with multiple items selected per cell (cat)
Review the data (d2)
# A tibble: 5 x 2
id cat
<dbl> <chr>
1 10 1, 4, 11
2 11 4, 5, 10
3 12 5, 2, 11
4 13 5, 1
5 14 <NA>
This is the same example from #1 at the top of this page. However,
now we can create our dummy coded variables using
tidy::pivot_wider() saving us so much time compared to the
stringr::str_count() or stringr::str_detect()
method.
The only thing I am adding here, compared to the example directly above, is the names_prefix argument.
I am also adding a dplyr::select() statement at the end
to reorder the variables and drop the “catNA” variable that is created
due to id 14 having a value of NA for “cat”.
d2 %>%
separate_rows(cat, sep = ", ") %>%
mutate(count = 1) %>%
pivot_wider(
names_from = cat,
names_prefix = "cat",
values_from = count,
values_fill = 0
) %>%
select(id, cat1, cat2, cat4, cat5, cat10, cat11)
# A tibble: 5 x 7
id cat1 cat2 cat4 cat5 cat10 cat11
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10 1 0 1 0 0 1
2 11 0 0 1 1 1 0
3 12 0 1 0 1 0 1
4 13 1 0 0 1 0 0
5 14 0 0 0 0 0 0
Last, you’ll notice above that id 14 now has zeros for all of the categories. Whereas before they had NAs. If the NAs actually mean that their data is missing, then we don’t want id 14 to have zeros for all of our dummy categories. We want them to have NAs.
We could fix this by keeping the “catNA” variable and then using this variable as an indicator for when to recode our other categorical variables back to NA.
dplyr::case_when() and
dplyr::across() to recode across all of our variables.d2 %>%
separate_rows(cat, sep = ", ") %>%
mutate(count = 1) %>%
pivot_wider(
names_from = cat,
names_prefix = "cat",
values_from = count,
values_fill = 0
) %>%
select(id, catNA, cat1, cat2, cat4, cat5, cat10, cat11) %>%
mutate(across(cat1:cat11,
~ case_when(
catNA == 1 ~ NA_real_,
.default = .
))) %>%
select(-catNA)
# A tibble: 5 x 7
id cat1 cat2 cat4 cat5 cat10 cat11
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10 1 0 1 0 0 1
2 11 0 0 1 1 1 0
3 12 0 1 0 1 0 1
4 13 1 0 0 1 0 0
5 14 NA NA NA NA NA NA
Return to Create New Variables