Restructure data to long format

Package: tidyr

Function: `pivot_longer()`

Note: There are many arguments available in tidyr::pivot_longer(). Type ?pivot_longer in your console to see all arguments. But the most basic arguments to be aware of include:
cols = cols you want to pivot,
names_to = name of your new categorical column/s typically,
values_to = name of the new value column/s

1. Restructure our “select all” items into long format

Review the data (d9)

# A tibble: 3 x 5
     id school tool1       tool2 tool3
  <dbl> <chr>  <chr>       <chr> <chr>
1  1234 a      google meet <NA>  other
2  2345 a      <NA>        zoom  <NA> 
3  3456 b      google meet zoom  <NA>

Oftentimes when we collect survey data that includes a “select all” question, the data exports with each item displayed in its own column. With the response option shown if selected and NA shown if not selected.

If we want to visualize the options in this select all question, it can be very helpful to restructure these items into long format.

d9 %>%
  pivot_longer(
    cols = contains("tool"),
    names_to = "tools",
    values_to = "response"
  )

# A tibble: 9 x 4
     id school tools response   
  <dbl> <chr>  <chr> <chr>      
1  1234 a      tool1 google meet
2  1234 a      tool2 <NA>       
3  1234 a      tool3 other      
4  2345 a      tool1 <NA>       
5  2345 a      tool2 zoom       
6  2345 a      tool3 <NA>       
7  3456 b      tool1 google meet
8  3456 b      tool2 zoom       
9  3456 b      tool3 <NA>

You can then manipulate the “response” column to create your visualization, potentially filtering out any NA values.

Your “select all” data may also come in this way, with the columns named for the response options and the responses given as 1 or NA.

# A tibble: 3 x 5
     id school google_meet  zoom other
  <dbl> <chr>        <dbl> <dbl> <dbl>
1  1234 a                1    NA     1
2  2345 a               NA     1    NA
3  3456 b                1     1    NA

This data can be restructured just as easily.

Except instead of working with the “response” column, you will work with the “tools” column, potentially filtering data based on NA values in the “response” column.

d10 %>%
  pivot_longer(
    cols = google_meet:other,
    names_to = "tools",
    values_to = "response"
  )

# A tibble: 9 x 4
     id school tools       response
  <dbl> <chr>  <chr>          <dbl>
1  1234 a      google_meet        1
2  1234 a      zoom              NA
3  1234 a      other              1
4  2345 a      google_meet       NA
5  2345 a      zoom               1
6  2345 a      other             NA
7  3456 b      google_meet        1
8  3456 b      zoom               1
9  3456 b      other             NA

2. Restructure our longitudinal data into long format

Review the data (d1)

# A tibble: 3 x 4
  tch_id treatment w1_q1 w2_q1
   <dbl>     <dbl> <dbl> <dbl>
1   1234         1     3     5
2   2345         0     4     5
3   3456         1    NA     2

In education research we are often wanting to restructure longitudinal data for analysis purposes.

In this example, we want to restructure the data into long format, making a new “wave” variable

Note: We use a tidyselect selection helper matches() in the cols argument to select the columns we wish to pivot long. In this case we want any column that has the prefix “w#_“. Using matches() rather than a selection helper such as contains(), we can use regular expressions to select columns.
Note: Here we use the names_to argument to state that we want two new columns, the first one being wave and the second column being whatever our remaining value is after wave (in this case “q1”). “.value” defines the column containing the cell values and overrides the values_to argument.
Note: We use the names_sep argument to state how the wave and question variable should be separated (by “_“). Using names_pattern is another option if you have more complicated variable names.
Note: We use the names_prefix argument to remove the “w” that would be attached to every number in the wave column since that is redundant information now. Be aware that the wave column will still be considered character since it contained “w” before we removed it. You may choose to convert that column type.

d1 %>%
  pivot_longer(
    cols = tidyselect::matches("w[0-9]_"),
                      names_to = c("wave", ".value"),
                      names_sep = "_",
    names_prefix = "w")

# A tibble: 6 x 4
  tch_id treatment wave     q1
   <dbl>     <dbl> <chr> <dbl>
1   1234         1 1         3
2   1234         1 2         5
3   2345         0 1         4
4   2345         0 2         5
5   3456         1 1        NA
6   3456         1 2         2

3. Restructure our longitudinal data into long format

Review the data (d8)

# A tibble: 3 x 4
  tch_id treatment q1_sum_2018 q1_sum_2019
   <dbl>     <dbl>       <dbl>       <dbl>
1   1234         1           3           5
2   2345         0           4           5
3   3456         1          NA           2

Unlike the prior example, we now have underscores used in our variable name to separate other pieces of metadata besides just our time component. Because of this, we can no longer simply use the argument names_sep. Instead we can use the names_pattern argument to develop a regular expression to describe the variable name and year groups of the variable.

Note: Instead of selecting the columns I want to pivot, I chose to select the columns I did NOT want to pivot in the cols argument.

d8 %>%
  pivot_longer(
    cols = -c(tch_id:treatment),
    names_to = c(".value", "year"),
    names_pattern = "^(.*)_([0-9]{4})$")

# A tibble: 6 x 4
  tch_id treatment year  q1_sum
   <dbl>     <dbl> <chr>  <dbl>
1   1234         1 2018       3
2   1234         1 2019       5
3   2345         0 2018       4
4   2345         0 2019       5
5   3456         1 2018      NA
6   3456         1 2019       2

4. Restructure our data into tidy format

Review the data (d5)

# A tibble: 2 x 5
  school  enroll_6 enroll_7 attend_6 attend_7
  <chr>      <dbl>    <dbl>    <dbl>    <dbl>
1 schoolx       50       40     87.5     90.2
2 schooly       70       80     88.2     91.4

The wide format above is not conducive for calculating summary statistics by group. So we need to reformat the data to long/tidy format. We want to end up with 4 columns (school, grade level, attendance and enrollment).

Note: Similar to above we use the argument cols to select the columns we wish to pivot. Here we select all columns associated with a grade level (6 or 7).
Note: We use the names_to argument again but this time we switch the order. Since the value we want to group by is second in the variable name, we want to put that second in the names_to argument.

d5 %>%
  pivot_longer(
    cols = enroll_6:attend_7,
    names_to = c(".value", "grade"),
    names_sep = "_")

# A tibble: 4 x 4
  school  grade enroll attend
  <chr>   <chr>  <dbl>  <dbl>
1 schoolx 6         50   87.5
2 schoolx 7         40   90.2
3 schooly 6         70   88.2
4 schooly 7         80   91.4

If we only had one column type to pivot longer, we could simplify the arguments. We would no longer need to use the names_sep argument and we would no longer need to use “.value” for the names_to argument.

For example, here we only have enrollment to pivot longer.

# A tibble: 2 x 3
  school    `6`   `7`
  <chr>   <dbl> <dbl>
1 schoolx    50    40
2 schooly    70    80

And we can now just name the column we want the grouping variable to be in, and we use the argument values_to rather than names_sep to state the name of the values column.

d6 %>%
  pivot_longer(
    cols = -school,
    names_to = "grade",
    values_to = "enroll")

# A tibble: 4 x 3
  school  grade enroll
  <chr>   <chr>  <dbl>
1 schoolx 6         50
2 schoolx 7         40
3 schooly 6         70
4 schooly 7         80

5. Restructure our wide repeating data into one row per student

Review the data (d11)

# A tibble: 3 x 7
  tch_id  stu1  stu2 stu1_q1 stu2_q1 stu1_q2 stu2_q2
   <dbl> <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1   1001    20    21       1       1       2       1
2   1002    25    27       2       3       2       4
3   1004    30    23       4       3       1       2

We should be able to pivot similar to how we did with our longitudinal data. However, in order to split the information into multiple columns (student identifier column and then question item columns), we need to have similar patterns across all items.

So first we will add “_stuid” to the end of all of the student identifier variables so the pattern will match the question items.

d11 <- d11 %>%
  dplyr::rename_with(~stringr::str_c(., "_stuid"), .cols = stu1:stu2)

Now we can pivot using our same arguments from example #2, except now we can change our first names_to value to NA (which discards the column), because we have no need for a column that tells us if the student was #1 or #2.

d11 %>%
  pivot_longer(
    cols = stu1_stuid:stu2_q2,
                      names_to = c(NA, ".value"),
                      names_sep = "_")

# A tibble: 6 x 4
  tch_id stuid    q1    q2
   <dbl> <dbl> <dbl> <dbl>
1   1001    20     1     2
2   1001    21     1     1
3   1002    25     2     2
4   1002    27     3     4
5   1004    30     4     1
6   1004    23     3     2

6. Restructure to one row per organization

Review the data (d12)

# A tibble: 1 x 13
  rater org1   org1_item1 org1_item2 org1_item2_text org2  org2_item1 org2_item2
  <dbl> <chr>       <dbl>      <dbl> <chr>           <chr>      <dbl>      <dbl>
1    12 Org1 ~          1          3 interesting te~ Org2~          2          4
# i 5 more variables: org2_item2_text <chr>, org3 <chr>, org3_item1 <dbl>,
#   org3_item2 <dbl>, org3_item2_text <chr>

In this example, a rater may rate up to 3 organizations in a row of data. We want each of these organizations to be on their own line.

We are unable to use names_sep because the underscores in our variable names are used in different places, so we will use names_pattern again. We will also use the values_drop_na argument to remove any empty row that is created in our pivot process.

d12 %>%
  pivot_longer(
    cols = -rater,
    names_to = c("organization", ".value"),
    names_pattern = "^(org[0-9]{1})_(.*)$",
    values_drop_na = TRUE)

# A tibble: 3 x 5
  rater organization item1 item2 item2_text        
  <dbl> <chr>        <dbl> <dbl> <chr>             
1    12 org1             1     3 "interesting text"
2    12 org2             2     4 "other text"      
3    12 org3             2     1 ""

What if our data also contained additional variables to denote that a rater has another organization to rate (1 = yes, 0 = no), as in (d13)?

We can use the same code, but remove the values_drop_na argument because now empty rows are not created.

# A tibble: 1 x 13
  rater org1_rate org1_item1 org1_item2 org1_item2_text  org2_rate org2_item1
  <dbl>     <dbl>      <dbl>      <dbl> <chr>                <dbl>      <dbl>
1    12         1          1          3 interesting text         1          2
# i 6 more variables: org2_item2 <dbl>, org2_item2_text <chr>, org3_rate <dbl>,
#   org3_item1 <dbl>, org3_item2 <dbl>, org3_item2_text <chr>

d13 %>%
  pivot_longer(
    cols = -rater,
    names_to = c("organization", ".value"),
    names_pattern = "^(org[0-9]{1})_(.*)$")

# A tibble: 3 x 6
  rater organization  rate item1 item2 item2_text        
  <dbl> <chr>        <dbl> <dbl> <dbl> <chr>             
1    12 org1             1     1     3 "interesting text"
2    12 org2             1     2     4 "other text"      
3    12 org3             1     2     1 ""

Return to Restructure

Restructure data to long format

Package: tidyr

Function: pivot_longer()

Function: `pivot_longer()`