Complete long data using existing values

Package: tidyr

Function: `fill()`

1. Fill NA values for both school and year with existing values

Review the data (d1)

# A tibble: 12 × 4
   school    year grade n_students
   <chr>    <dbl> <dbl>      <dbl>
 1 School A  2009     3        100
 2 <NA>        NA     4         80
 3 <NA>        NA     5         90
 4 <NA>      2010     3         98
 5 <NA>        NA     4         88
 6 <NA>        NA     5         91
 7 School B  2009     6        150
 8 <NA>        NA     7        145
 9 <NA>        NA     8        150
10 <NA>      2010     6        155
11 <NA>        NA     7        151
12 <NA>        NA     8        160

Fill values using the default argument direction = “down”.

d1 %>% 
  tidyr::fill(school, year)

# A tibble: 12 × 4
   school    year grade n_students
   <chr>    <dbl> <dbl>      <dbl>
 1 School A  2009     3        100
 2 School A  2009     4         80
 3 School A  2009     5         90
 4 School A  2010     3         98
 5 School A  2010     4         88
 6 School A  2010     5         91
 7 School B  2009     6        150
 8 School B  2009     7        145
 9 School B  2009     8        150
10 School B  2010     6        155
11 School B  2010     7        151
12 School B  2010     8        160

2. Fill NA values for cohort with existing values (grouped by the school)

Review the data (d4)

# A tibble: 11 × 4
   school    year enroll cohort
   <chr>    <dbl>  <dbl>  <dbl>
 1 School A  2009    350      1
 2 School B  2009    400      1
 3 School C  2009    500      2
 4 School D  2009    300      3
 5 School A  2010    360     NA
 6 School B  2010    460     NA
 7 School C  2010    560     NA
 8 School D  2010    360     NA
 9 School B  2011    465     NA
10 School C  2011    660     NA
11 School D  2011    260     NA

Fill values using the default argument direction = “down”.

You have two options. You can either use dplyr::group_by() to group by school.

d4 %>% 
  dplyr::group_by(school) %>% 
  tidyr::fill(cohort) %>% 
  dplyr::ungroup()

# A tibble: 11 × 4
   school    year enroll cohort
   <chr>    <dbl>  <dbl>  <dbl>
 1 School A  2009    350      1
 2 School B  2009    400      1
 3 School C  2009    500      2
 4 School D  2009    300      3
 5 School A  2010    360      1
 6 School B  2010    460      1
 7 School C  2010    560      2
 8 School D  2010    360      3
 9 School B  2011    465      1
10 School C  2011    660      2
11 School D  2011    260      3

Or you can arrange your data by school using dplyr::arrange() and then fill values. This option is only possible because there are consistently 2 rows for each school, and data is always missing from the later years. If the data was more inconsistent, you would want to stick with the dplyr::group_by() option.

d4 %>% 
  dplyr::arrange(school) %>% 
  tidyr::fill(cohort)

# A tibble: 11 × 4
   school    year enroll cohort
   <chr>    <dbl>  <dbl>  <dbl>
 1 School A  2009    350      1
 2 School A  2010    360      1
 3 School B  2009    400      1
 4 School B  2010    460      1
 5 School B  2011    465      1
 6 School C  2009    500      2
 7 School C  2010    560      2
 8 School C  2011    660      2
 9 School D  2009    300      3
10 School D  2010    360      3
11 School D  2011    260      3

3. Fill NA values for cohort with existing values (grouped by the school)

Review the data (d7)

# A tibble: 8 × 4
  school    year enroll cohort
  <chr>    <dbl>  <dbl>  <dbl>
1 School A  2009    350      1
2 School B  2009    400      1
3 School C  2009    500      3
4 School A  2010    360     NA
5 School C  2010    560     NA
6 School D  2010    360     NA
7 School C  2011    660     NA
8 School D  2011    260      2

In this scenario, the data is inconsistently organized. Sometimes we only have values in later years and need to fill earlier years, sometimes data doesn’t exist at all for certain years. In this case, we will need to make sure to use dplyr::group_by() as well as the argument direction = “downup” to make sure our values are correctly filled in.

d7 %>% 
  dplyr::group_by(school) %>% 
  tidyr::fill(cohort, .direction = "downup") %>% 
  dplyr::ungroup()

# A tibble: 8 × 4
  school    year enroll cohort
  <chr>    <dbl>  <dbl>  <dbl>
1 School A  2009    350      1
2 School B  2009    400      1
3 School C  2009    500      3
4 School A  2010    360      1
5 School C  2010    560      3
6 School D  2010    360      2
7 School C  2011    660      3
8 School D  2011    260      2

Function: `complete()`

1. Create rows for any missing time periods in the data

Review the data (d2)

# A tibble: 7 × 3
   s_id time_period test_score
  <dbl>       <dbl>      <dbl>
1   123           1        456
2   123           2        480
3   145           1        460
4   150           1        422
5   150           2        435
6   164           1        445
7   164           1        440

Complete the data by creating rows for any student that is missing a time period and filling the test score with NA.

d2 %>% 
  tidyr::complete(s_id, time_period)

# A tibble: 9 × 3
   s_id time_period test_score
  <dbl>       <dbl>      <dbl>
1   123           1        456
2   123           2        480
3   145           1        460
4   145           2         NA
5   150           1        422
6   150           2        435
7   164           1        445
8   164           1        440
9   164           2         NA

Note: If you want to fill with a value other than NA, add the argument fill = list() and add the value you want to replace NA with.

Here is an example, using fill = list(test_score=0)

d2 %>% 
  tidyr::complete(s_id, time_period, fill = list(test_score=0))

# A tibble: 9 × 3
   s_id time_period test_score
  <dbl>       <dbl>      <dbl>
1   123           1        456
2   123           2        480
3   145           1        460
4   145           2          0
5   150           1        422
6   150           2        435
7   164           1        445
8   164           1        440
9   164           2          0

2. Create rows for any missing categories in the data

Review the data (d8)

# A tibble: 5 × 4
  dist_id sch_id cat   count
    <dbl>  <dbl> <chr> <dbl>
1       1     10 cat1     20
2       1     10 cat2     13
3       1     11 cat2     30
4       2     10 cat1     NA
5       2     10 cat2     25

In this data there are two situations - Categories with counts that are suppressed have a value of NA in the count column - Categories that have a count of 0 are not included in the data

We want to make the implicit categories explicit by bringing in those categories with counts of 0

There are a few additional arguments we are going to add in this time.

nesting is added to account for the fact that both dist_id and sch_id make up unique rows explicit is added with a value of FALSE to denote that we want our existing NA values to remain NA since these are suppressed values

d8 %>% 
  tidyr::complete(nesting(dist_id, sch_id), cat, fill = list(count = 0),
                  explicit = FALSE)

# A tibble: 6 × 4
  dist_id sch_id cat   count
    <dbl>  <dbl> <chr> <dbl>
1       1     10 cat1     20
2       1     10 cat2     13
3       1     11 cat1      0
4       1     11 cat2     30
5       2     10 cat1     NA
6       2     10 cat2     25

Return to Complete

Complete long data using existing values

Package: tidyr

Function: fill()

Function: complete()

Function: `fill()`

Function: `complete()`