pivot_longer()tidyr::pivot_longer(). Type ?pivot_longer in your
console to see all arguments. But the most basic arguments to be aware
of include:1. Restructure our “select all” items into long format
Review the data (d9)
# A tibble: 3 x 5
id school tool1 tool2 tool3
<dbl> <chr> <chr> <chr> <chr>
1 1234 a google meet <NA> other
2 2345 a <NA> zoom <NA>
3 3456 b google meet zoom <NA>
Oftentimes when we collect survey data that includes a “select all” question, the data exports with each item displayed in its own column. With the response option shown if selected and NA shown if not selected.
If we want to visualize the options in this select all question, it can be very helpful to restructure these items into long format.
d9 %>%
pivot_longer(
cols = contains("tool"),
names_to = "tools",
values_to = "response"
)
# A tibble: 9 x 4
id school tools response
<dbl> <chr> <chr> <chr>
1 1234 a tool1 google meet
2 1234 a tool2 <NA>
3 1234 a tool3 other
4 2345 a tool1 <NA>
5 2345 a tool2 zoom
6 2345 a tool3 <NA>
7 3456 b tool1 google meet
8 3456 b tool2 zoom
9 3456 b tool3 <NA>
You can then manipulate the “response” column to create your visualization, potentially filtering out any NA values.
Your “select all” data may also come in this way, with the columns named for the response options and the responses given as 1 or NA.
# A tibble: 3 x 5
id school google_meet zoom other
<dbl> <chr> <dbl> <dbl> <dbl>
1 1234 a 1 NA 1
2 2345 a NA 1 NA
3 3456 b 1 1 NA
This data can be restructured just as easily.
Except instead of working with the “response” column, you will work with the “tools” column, potentially filtering data based on NA values in the “response” column.
d10 %>%
pivot_longer(
cols = google_meet:other,
names_to = "tools",
values_to = "response"
)
# A tibble: 9 x 4
id school tools response
<dbl> <chr> <chr> <dbl>
1 1234 a google_meet 1
2 1234 a zoom NA
3 1234 a other 1
4 2345 a google_meet NA
5 2345 a zoom 1
6 2345 a other NA
7 3456 b google_meet 1
8 3456 b zoom 1
9 3456 b other NA
2. Restructure our longitudinal data into long format
Review the data (d1)
# A tibble: 3 x 4
tch_id treatment w1_q1 w2_q1
<dbl> <dbl> <dbl> <dbl>
1 1234 1 3 5
2 2345 0 4 5
3 3456 1 NA 2
In education research we are often wanting to restructure longitudinal data for analysis purposes.
In this example, we want to restructure the data into long format, making a new “wave” variable
Note: We use a tidyselect selection helper
matches() in the cols argument to select the
columns we wish to pivot long. In this case we want any column that has
the prefix “w#_“. Using matches() rather than a selection
helper such as contains(), we can use regular expressions
to select columns.
Note: Here we use the names_to argument to state that we want two new columns, the first one being wave and the second column being whatever our remaining value is after wave (in this case “q1”). “.value” defines the column containing the cell values and overrides the values_to argument.
Note: We use the names_sep argument to state how the wave and question variable should be separated (by “_“). Using names_pattern is another option if you have more complicated variable names.
Note: We use the names_prefix argument to remove the “w” that would be attached to every number in the wave column since that is redundant information now. Be aware that the wave column will still be considered character since it contained “w” before we removed it. You may choose to convert that column type.
d1 %>%
pivot_longer(
cols = tidyselect::matches("w[0-9]_"),
names_to = c("wave", ".value"),
names_sep = "_",
names_prefix = "w")
# A tibble: 6 x 4
tch_id treatment wave q1
<dbl> <dbl> <chr> <dbl>
1 1234 1 1 3
2 1234 1 2 5
3 2345 0 1 4
4 2345 0 2 5
5 3456 1 1 NA
6 3456 1 2 2
3. Restructure our longitudinal data into long format
Review the data (d8)
# A tibble: 3 x 4
tch_id treatment q1_sum_2018 q1_sum_2019
<dbl> <dbl> <dbl> <dbl>
1 1234 1 3 5
2 2345 0 4 5
3 3456 1 NA 2
Unlike the prior example, we now have underscores used in our variable name to separate other pieces of metadata besides just our time component. Because of this, we can no longer simply use the argument names_sep. Instead we can use the names_pattern argument to develop a regular expression to describe the variable name and year groups of the variable.
d8 %>%
pivot_longer(
cols = -c(tch_id:treatment),
names_to = c(".value", "year"),
names_pattern = "^(.*)_([0-9]{4})$")
# A tibble: 6 x 4
tch_id treatment year q1_sum
<dbl> <dbl> <chr> <dbl>
1 1234 1 2018 3
2 1234 1 2019 5
3 2345 0 2018 4
4 2345 0 2019 5
5 3456 1 2018 NA
6 3456 1 2019 2
4. Restructure our data into tidy format
Review the data (d5)
# A tibble: 2 x 5
school enroll_6 enroll_7 attend_6 attend_7
<chr> <dbl> <dbl> <dbl> <dbl>
1 schoolx 50 40 87.5 90.2
2 schooly 70 80 88.2 91.4
The wide format above is not conducive for calculating summary statistics by group. So we need to reformat the data to long/tidy format. We want to end up with 4 columns (school, grade level, attendance and enrollment).
Note: Similar to above we use the argument cols to select the columns we wish to pivot. Here we select all columns associated with a grade level (6 or 7).
Note: We use the names_to argument again but this time we switch the order. Since the value we want to group by is second in the variable name, we want to put that second in the names_to argument.
d5 %>%
pivot_longer(
cols = enroll_6:attend_7,
names_to = c(".value", "grade"),
names_sep = "_")
# A tibble: 4 x 4
school grade enroll attend
<chr> <chr> <dbl> <dbl>
1 schoolx 6 50 87.5
2 schoolx 7 40 90.2
3 schooly 6 70 88.2
4 schooly 7 80 91.4
If we only had one column type to pivot longer, we could simplify the arguments. We would no longer need to use the names_sep argument and we would no longer need to use “.value” for the names_to argument.
For example, here we only have enrollment to pivot longer.
# A tibble: 2 x 3
school `6` `7`
<chr> <dbl> <dbl>
1 schoolx 50 40
2 schooly 70 80
And we can now just name the column we want the grouping variable to be in, and we use the argument values_to rather than names_sep to state the name of the values column.
d6 %>%
pivot_longer(
cols = -school,
names_to = "grade",
values_to = "enroll")
# A tibble: 4 x 3
school grade enroll
<chr> <chr> <dbl>
1 schoolx 6 50
2 schoolx 7 40
3 schooly 6 70
4 schooly 7 80
5. Restructure our wide repeating data into one row per student
Review the data (d11)
# A tibble: 3 x 7
tch_id stu1 stu2 stu1_q1 stu2_q1 stu1_q2 stu2_q2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1001 20 21 1 1 2 1
2 1002 25 27 2 3 2 4
3 1004 30 23 4 3 1 2
We should be able to pivot similar to how we did with our longitudinal data. However, in order to split the information into multiple columns (student identifier column and then question item columns), we need to have similar patterns across all items.
So first we will add “_stuid” to the end of all of the student identifier variables so the pattern will match the question items.
d11 <- d11 %>%
dplyr::rename_with(~stringr::str_c(., "_stuid"), .cols = stu1:stu2)
Now we can pivot using our same arguments from example #2, except now we can change our first names_to value to NA (which discards the column), because we have no need for a column that tells us if the student was #1 or #2.
d11 %>%
pivot_longer(
cols = stu1_stuid:stu2_q2,
names_to = c(NA, ".value"),
names_sep = "_")
# A tibble: 6 x 4
tch_id stuid q1 q2
<dbl> <dbl> <dbl> <dbl>
1 1001 20 1 2
2 1001 21 1 1
3 1002 25 2 2
4 1002 27 3 4
5 1004 30 4 1
6 1004 23 3 2
6. Restructure to one row per organization
Review the data (d12)
# A tibble: 1 x 13
rater org1 org1_item1 org1_item2 org1_item2_text org2 org2_item1 org2_item2
<dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 12 Org1 ~ 1 3 interesting te~ Org2~ 2 4
# i 5 more variables: org2_item2_text <chr>, org3 <chr>, org3_item1 <dbl>,
# org3_item2 <dbl>, org3_item2_text <chr>
In this example, a rater may rate up to 3 organizations in a row of data. We want each of these organizations to be on their own line.
We are unable to use names_sep because the underscores in our variable names are used in different places, so we will use names_pattern again. We will also use the values_drop_na argument to remove any empty row that is created in our pivot process.
d12 %>%
pivot_longer(
cols = -rater,
names_to = c("organization", ".value"),
names_pattern = "^(org[0-9]{1})_(.*)$",
values_drop_na = TRUE)
# A tibble: 3 x 5
rater organization item1 item2 item2_text
<dbl> <chr> <dbl> <dbl> <chr>
1 12 org1 1 3 "interesting text"
2 12 org2 2 4 "other text"
3 12 org3 2 1 ""
What if our data also contained additional variables to denote that a rater has another organization to rate (1 = yes, 0 = no), as in (d13)?
We can use the same code, but remove the values_drop_na argument because now empty rows are not created.
# A tibble: 1 x 13
rater org1_rate org1_item1 org1_item2 org1_item2_text org2_rate org2_item1
<dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
1 12 1 1 3 interesting text 1 2
# i 6 more variables: org2_item2 <dbl>, org2_item2_text <chr>, org3_rate <dbl>,
# org3_item1 <dbl>, org3_item2 <dbl>, org3_item2_text <chr>
d13 %>%
pivot_longer(
cols = -rater,
names_to = c("organization", ".value"),
names_pattern = "^(org[0-9]{1})_(.*)$")
# A tibble: 3 x 6
rater organization rate item1 item2 item2_text
<dbl> <chr> <dbl> <dbl> <dbl> <chr>
1 12 org1 1 1 3 "interesting text"
2 12 org2 1 2 4 "other text"
3 12 org3 1 2 1 ""
Return to Restructure