filter()
Examples of removing rows through filtering while keeping NA values
dplyr::across()
in
dplyr::filter()
is deprecated. dplyr::if_any()
and dplyr::if_all()
are predicate functions used to select
columns within dplyr::filter()
. This function is available
in version 1.0.5 of dplyr
. dplyr::if_any()
returns a true when the statement is true for any of
the variables. dplyr::if_all()
returns a true when the
statement is true for all of the variables. See Filter
using if_all or if_any for further explanation1. Remove any row where extra2
is equal to -999,
while also keeping rows that have NA for
extra2
.
Review the data (d5).
# A tibble: 5 x 6
extra1 extra2 extra3 stu_id test_score tch_name
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 a 1 2 10 205 harris
2 b -999 0 11 220 steve
3 c 3 -999 12 250 harris
4 d 4 0 13 217 lewis
5 e NA NA NA NA ogunti
Filter out any row that has -999 for extra2.
extra2
in addition to removing rows with the
value of -999 for extra2
, using the | operator to
denote and/or. A drawback of the filter function is that if you filter
on a certain variable (such as extra2
) it will not only
filter out the values you request but it will also drop any row that has
NA for that variable. And we may not necessarily want that. In this
case, I want to keep the last row of data that has NA for
extra2
.d5 %>%
dplyr::filter(extra2 != -999 | is.na(extra2))
# A tibble: 4 x 6
extra1 extra2 extra3 stu_id test_score tch_name
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 a 1 2 10 205 harris
2 c 3 -999 12 250 harris
3 d 4 0 13 217 lewis
4 e NA NA NA NA ogunti
Another way to get around the issue of dropping NAs is to use the
within operator %in% rather than == to make sure we
keep the NA values for extra2
. The == (or
!=) is a logical operator that retains/drops rows that produce
a value of TRUE for all conditions (and NA cannot be evaluated and is
therefore dropped). However %in% matches values and therefore
doesn’t drop NA.
d5 %>%
dplyr::filter(!extra2 %in% -999)
# A tibble: 4 x 6
extra1 extra2 extra3 stu_id test_score tch_name
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 a 1 2 10 205 harris
2 c 3 -999 12 250 harris
3 d 4 0 13 217 lewis
4 e NA NA NA NA ogunti
2. Remove any row that has -999 for extra2
AND/OR has -999 for extra3
, while also keeping
rows that have NA values for those variables.
Review the data (d5)
# A tibble: 5 x 6
extra1 extra2 extra3 stu_id test_score tch_name
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 a 1 2 10 205 harris
2 b -999 0 11 220 steve
3 c 3 -999 12 250 harris
4 d 4 0 13 217 lewis
5 e NA NA NA NA ogunti
Filter out any row that has -999 for extra2
and/or -999
for extra3
.
Note: You’ll notice that I again make an explicit call to
keep rows with NA values for extra2
and
extra3
, using the |
operator to denote
and/or.
Note: Notice that for the first part of the equation I put the operator ! outside of the parentheses. While there are other ways to write this equation, I find this the least confusing way to remove rows based on criteria from multiple variables.
d5 %>%
dplyr::filter(!(extra2 == -999 | extra3 == -999) | (is.na(extra2) | is.na(extra3)))
# A tibble: 3 x 6
extra1 extra2 extra3 stu_id test_score tch_name
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 a 1 2 10 205 harris
2 d 4 0 13 217 lewis
3 e NA NA NA NA ogunti
Again, another way to get around the issue of dropping NAs is to use the within operator %in% rather than ==.
d5 %>%
dplyr::filter(!(extra2 %in% -999 | extra3 %in% -999))
# A tibble: 3 x 6
extra1 extra2 extra3 stu_id test_score tch_name
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 a 1 2 10 205 harris
2 d 4 0 13 217 lewis
3 e NA NA NA NA ogunti
And yet one other way to write this is using
dplyr::if_all()
.
d5 %>%
dplyr::filter(dplyr::if_all(extra2:extra3, ~ !. %in% -999))
# A tibble: 3 x 6
extra1 extra2 extra3 stu_id test_score tch_name
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 a 1 2 10 205 harris
2 d 4 0 13 217 lewis
3 e NA NA NA NA ogunti
3. Remove any row that has -999 for AT LEAST ONE numeric variable, while also keeping rows that have NA values for those variables.
Review the data (d9)
# A tibble: 6 x 5
extra1 extra2 extra3 id test_score
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 1 2 10 205
2 b -999 0 11 220
3 c 3 -999 12 250
4 d NA 0 13 217
5 <NA> NA NA NA NA
6 e NA NA NA NA
Filter out any row that has -999 for AT LEAST ONE numeric variable
Note: You need to wrap is.numeric
, a predicate function
(returns a true/false), in the tidyselect
selection helper
where()
.
d9 %>%
dplyr::filter(!dplyr::if_any(where(is.numeric), ~ . %in% -999))
# A tibble: 4 x 5
extra1 extra2 extra3 id test_score
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 1 2 10 205
2 d NA 0 13 217
3 <NA> NA NA NA NA
4 e NA NA NA NA
I could also write this using dplyr::if_all()
, saying
“If all values for numeric variables are NOT -999, then return the
rows.
d9 %>%
dplyr::filter(dplyr::if_all(where(is.numeric), ~ !. %in% -999))
# A tibble: 4 x 5
extra1 extra2 extra3 id test_score
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 1 2 10 205
2 d NA 0 13 217
3 <NA> NA NA NA NA
4 e NA NA NA NA
filter()
Examples of filtering on NA values
1. Remove any row that has NA for
extra2
.
Review the data (d7).
# A tibble: 4 x 6
extra1 extra2 extra3 stu_id test_score tch_name
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 a 1 2 NA 205 harris
2 b NA 0 11 220 steve
3 c 3 NA 12 250 harris
4 d 4 0 13 217 lewis
Remove any row that has NA for extra2
.
d7 %>%
dplyr::filter(!is.na(extra2))
# A tibble: 3 x 6
extra1 extra2 extra3 stu_id test_score tch_name
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 a 1 2 NA 205 harris
2 c 3 NA 12 250 harris
3 d 4 0 13 217 lewis
2. Remove any row that has NA for extra2
AND/OR
extra3
.
Review the data (d7).
# A tibble: 4 x 6
extra1 extra2 extra3 stu_id test_score tch_name
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 a 1 2 NA 205 harris
2 b NA 0 11 220 steve
3 c 3 NA 12 250 harris
4 d 4 0 13 217 lewis
d7 %>%
dplyr::filter(!(is.na(extra2) | is.na(extra3)))
# A tibble: 2 x 6
extra1 extra2 extra3 stu_id test_score tch_name
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 a 1 2 NA 205 harris
2 d 4 0 13 217 lewis
And yet one other way to write this is using
dplyr::if_any()
.
d7 %>%
dplyr::filter(!dplyr::if_any(extra2:extra3, ~ is.na(.)))
# A tibble: 2 x 6
extra1 extra2 extra3 stu_id test_score tch_name
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 a 1 2 NA 205 harris
2 d 4 0 13 217 lewis
3. Keep rows where extra2
is NA but
extra3
is not NA
Review the data (d10).
# A tibble: 6 x 4
extra2 extra3 id test_score
<dbl> <dbl> <dbl> <dbl>
1 1 2 10 205
2 -999 0 11 220
3 3 -999 12 250
4 NA 0 13 217
5 NA NA NA NA
6 -999 -999 -999 -999
Keep rows where extra2
is NA but extra3
is
not NA
d10 %>%
dplyr::filter(is.na(extra2) & !is.na(extra3))
# A tibble: 1 x 4
extra2 extra3 id test_score
<dbl> <dbl> <dbl> <dbl>
1 NA 0 13 217
4. Remove any row that has NA for ALL numeric variables.
Review the data (d9)
# A tibble: 6 x 5
extra1 extra2 extra3 id test_score
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 1 2 10 205
2 b -999 0 11 220
3 c 3 -999 12 250
4 d NA 0 13 217
5 <NA> NA NA NA NA
6 e NA NA NA NA
Filter out any row that has NA for ALL numeric variables
tidyselect
selection helper
everything()
to refer to all variables.d9 %>%
dplyr::filter(!dplyr::if_all(tidyselect::everything(), ~ is.na(.)))
# A tibble: 5 x 5
extra1 extra2 extra3 id test_score
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 1 2 10 205
2 b -999 0 11 220
3 c 3 -999 12 250
4 d NA 0 13 217
5 e NA NA NA NA
5. Remove any row where the sum of NA values is greater than 1 across selected variables (var1:var3).
Review the data (d21)
# A tibble: 7 x 5
sch stu_id var1 var2 var3
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 10 1 20 205
2 a 15 NA NA NA
3 a 11 2 15 220
4 a 21 3 12 NA
5 b 22 9 NA 217
6 b 31 10 NA NA
7 b 32 10 14 251
Filter out any row that has more than 1 NA across var1:var3
Note: We use the base::rowSums()
to calculate the
number of NA values across var1:var3.
Note: We use dplyr::across()
to select our variables
within base::rowSums()
d21 %>%
dplyr::filter(rowSums(dplyr::across(var1:var3, ~ is.na(.))) <= 1)
# A tibble: 5 x 5
sch stu_id var1 var2 var3
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 10 1 20 205
2 a 11 2 15 220
3 a 21 3 12 NA
4 b 22 9 NA 217
5 b 32 10 14 251
This could also be written with is.na()
wrapping the
dplyr::across()
statement.
d21 %>%
dplyr::filter(rowSums(is.na(dplyr::across(var1:var3))) <= 1)
# A tibble: 5 x 5
sch stu_id var1 var2 var3
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 10 1 20 205
2 a 11 2 15 220
3 a 21 3 12 NA
4 b 22 9 NA 217
5 b 32 10 14 251
tidyr::drop_na()
1. Remove any rows where stu_id
is
NA
Review the data (d7)
# A tibble: 4 x 6
extra1 extra2 extra3 stu_id test_score tch_name
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 a 1 2 NA 205 harris
2 b NA 0 11 220 steve
3 c 3 NA 12 250 harris
4 d 4 0 13 217 lewis
Filter out any row that has NA for stu_id
.
d7 %>%
tidyr::drop_na(stu_id)
# A tibble: 3 x 6
extra1 extra2 extra3 stu_id test_score tch_name
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 b NA 0 11 220 steve
2 c 3 NA 12 250 harris
3 d 4 0 13 217 lewis
2. Remove any row that has an NA value for any variable
Review the data (d7)
# A tibble: 4 x 6
extra1 extra2 extra3 stu_id test_score tch_name
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 a 1 2 NA 205 harris
2 b NA 0 11 220 steve
3 c 3 NA 12 250 harris
4 d 4 0 13 217 lewis
Filter out any row that has NA for any variable.
d7 %>%
tidyr::drop_na()
# A tibble: 1 x 6
extra1 extra2 extra3 stu_id test_score tch_name
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 d 4 0 13 217 lewis
Return to Filter