fill()1. Fill NA values for both school and
year with existing values
Review the data (d1)
# A tibble: 12 × 4
school year grade n_students
<chr> <dbl> <dbl> <dbl>
1 School A 2009 3 100
2 <NA> NA 4 80
3 <NA> NA 5 90
4 <NA> 2010 3 98
5 <NA> NA 4 88
6 <NA> NA 5 91
7 School B 2009 6 150
8 <NA> NA 7 145
9 <NA> NA 8 150
10 <NA> 2010 6 155
11 <NA> NA 7 151
12 <NA> NA 8 160
Fill values using the default argument direction = “down”.
d1 %>%
tidyr::fill(school, year)
# A tibble: 12 × 4
school year grade n_students
<chr> <dbl> <dbl> <dbl>
1 School A 2009 3 100
2 School A 2009 4 80
3 School A 2009 5 90
4 School A 2010 3 98
5 School A 2010 4 88
6 School A 2010 5 91
7 School B 2009 6 150
8 School B 2009 7 145
9 School B 2009 8 150
10 School B 2010 6 155
11 School B 2010 7 151
12 School B 2010 8 160
2. Fill NA values for cohort with existing
values (grouped by the school)
Review the data (d4)
# A tibble: 11 × 4
school year enroll cohort
<chr> <dbl> <dbl> <dbl>
1 School A 2009 350 1
2 School B 2009 400 1
3 School C 2009 500 2
4 School D 2009 300 3
5 School A 2010 360 NA
6 School B 2010 460 NA
7 School C 2010 560 NA
8 School D 2010 360 NA
9 School B 2011 465 NA
10 School C 2011 660 NA
11 School D 2011 260 NA
Fill values using the default argument direction = “down”.
You have two options. You can either use
dplyr::group_by() to group by school.
d4 %>%
dplyr::group_by(school) %>%
tidyr::fill(cohort) %>%
dplyr::ungroup()
# A tibble: 11 × 4
school year enroll cohort
<chr> <dbl> <dbl> <dbl>
1 School A 2009 350 1
2 School B 2009 400 1
3 School C 2009 500 2
4 School D 2009 300 3
5 School A 2010 360 1
6 School B 2010 460 1
7 School C 2010 560 2
8 School D 2010 360 3
9 School B 2011 465 1
10 School C 2011 660 2
11 School D 2011 260 3
Or you can arrange your data by school using
dplyr::arrange() and then fill values. This option is only
possible because there are consistently 2 rows for each school, and data
is always missing from the later years. If the data was more
inconsistent, you would want to stick with the
dplyr::group_by() option.
d4 %>%
dplyr::arrange(school) %>%
tidyr::fill(cohort)
# A tibble: 11 × 4
school year enroll cohort
<chr> <dbl> <dbl> <dbl>
1 School A 2009 350 1
2 School A 2010 360 1
3 School B 2009 400 1
4 School B 2010 460 1
5 School B 2011 465 1
6 School C 2009 500 2
7 School C 2010 560 2
8 School C 2011 660 2
9 School D 2009 300 3
10 School D 2010 360 3
11 School D 2011 260 3
3. Fill NA values for cohort with existing
values (grouped by the school)
Review the data (d7)
# A tibble: 8 × 4
school year enroll cohort
<chr> <dbl> <dbl> <dbl>
1 School A 2009 350 1
2 School B 2009 400 1
3 School C 2009 500 3
4 School A 2010 360 NA
5 School C 2010 560 NA
6 School D 2010 360 NA
7 School C 2011 660 NA
8 School D 2011 260 2
In this scenario, the data is inconsistently organized. Sometimes we
only have values in later years and need to fill earlier years,
sometimes data doesn’t exist at all for certain years. In this case, we
will need to make sure to use dplyr::group_by() as well as
the argument direction = “downup” to make sure our values are
correctly filled in.
d7 %>%
dplyr::group_by(school) %>%
tidyr::fill(cohort, .direction = "downup") %>%
dplyr::ungroup()
# A tibble: 8 × 4
school year enroll cohort
<chr> <dbl> <dbl> <dbl>
1 School A 2009 350 1
2 School B 2009 400 1
3 School C 2009 500 3
4 School A 2010 360 1
5 School C 2010 560 3
6 School D 2010 360 2
7 School C 2011 660 3
8 School D 2011 260 2
complete()1. Create rows for any missing time periods in the data
Review the data (d2)
# A tibble: 7 × 3
s_id time_period test_score
<dbl> <dbl> <dbl>
1 123 1 456
2 123 2 480
3 145 1 460
4 150 1 422
5 150 2 435
6 164 1 445
7 164 1 440
Complete the data by creating rows for any student that is missing a time period and filling the test score with NA.
d2 %>%
tidyr::complete(s_id, time_period)
# A tibble: 9 × 3
s_id time_period test_score
<dbl> <dbl> <dbl>
1 123 1 456
2 123 2 480
3 145 1 460
4 145 2 NA
5 150 1 422
6 150 2 435
7 164 1 445
8 164 1 440
9 164 2 NA
Here is an example, using fill = list(test_score=0)
d2 %>%
tidyr::complete(s_id, time_period, fill = list(test_score=0))
# A tibble: 9 × 3
s_id time_period test_score
<dbl> <dbl> <dbl>
1 123 1 456
2 123 2 480
3 145 1 460
4 145 2 0
5 150 1 422
6 150 2 435
7 164 1 445
8 164 1 440
9 164 2 0
2. Create rows for any missing categories in the data
Review the data (d8)
# A tibble: 5 × 4
dist_id sch_id cat count
<dbl> <dbl> <chr> <dbl>
1 1 10 cat1 20
2 1 10 cat2 13
3 1 11 cat2 30
4 2 10 cat1 NA
5 2 10 cat2 25
In this data there are two situations - Categories with counts that are suppressed have a value of NA in the count column - Categories that have a count of 0 are not included in the data
We want to make the implicit categories explicit by bringing in those categories with counts of 0
There are a few additional arguments we are going to add in this time.
nesting is added to account for the fact that both
dist_id and sch_id make up unique rows
explicit is added with a value of FALSE to denote that we want
our existing NA values to remain NA since these are suppressed
values
d8 %>%
tidyr::complete(nesting(dist_id, sch_id), cat, fill = list(count = 0),
explicit = FALSE)
# A tibble: 6 × 4
dist_id sch_id cat count
<dbl> <dbl> <chr> <dbl>
1 1 10 cat1 20
2 1 10 cat2 13
3 1 11 cat1 0
4 1 11 cat2 30
5 2 10 cat1 NA
6 2 10 cat2 25
Return to Complete