Filter rows with NA values

Package: dplyr

Function: `filter()`

Examples of removing rows through filtering while keeping NA values

Note: Using dplyr::across() in dplyr::filter() is deprecated. dplyr::if_any() and dplyr::if_all() are predicate functions used to select columns within dplyr::filter(). This function is available in version 1.0.5 of dplyr. dplyr::if_any() returns a true when the statement is true for any of the variables. dplyr::if_all() returns a true when the statement is true for all of the variables. See Filter using if_all or if_any for further explanation

1. Remove any row where extra2 is equal to -999, while also keeping rows that have NA for extra2.

Review the data (d5).

# A tibble: 5 x 6
  extra1 extra2 extra3 stu_id test_score tch_name
  <chr>   <dbl>  <dbl>  <dbl>      <dbl> <chr>   
1 a           1      2     10        205 harris  
2 b        -999      0     11        220 steve   
3 c           3   -999     12        250 harris  
4 d           4      0     13        217 lewis   
5 e          NA     NA     NA         NA ogunti

Filter out any row that has -999 for extra2.

Note: You’ll notice that I make an explicit call to keep rows with NA values for extra2 in addition to removing rows with the value of -999 for extra2, using the | operator to denote and/or. A drawback of the filter function is that if you filter on a certain variable (such as extra2) it will not only filter out the values you request but it will also drop any row that has NA for that variable. And we may not necessarily want that. In this case, I want to keep the last row of data that has NA for extra2.

d5 %>% 
  dplyr::filter(extra2 != -999 | is.na(extra2))

# A tibble: 4 x 6
  extra1 extra2 extra3 stu_id test_score tch_name
  <chr>   <dbl>  <dbl>  <dbl>      <dbl> <chr>   
1 a           1      2     10        205 harris  
2 c           3   -999     12        250 harris  
3 d           4      0     13        217 lewis   
4 e          NA     NA     NA         NA ogunti

Another way to get around the issue of dropping NAs is to use the within operator %in% rather than == to make sure we keep the NA values for extra2. The == (or !=) is a logical operator that retains/drops rows that produce a value of TRUE for all conditions (and NA cannot be evaluated and is therefore dropped). However %in% matches values and therefore doesn’t drop NA.

d5 %>% 
  dplyr::filter(!extra2 %in% -999)

# A tibble: 4 x 6
  extra1 extra2 extra3 stu_id test_score tch_name
  <chr>   <dbl>  <dbl>  <dbl>      <dbl> <chr>   
1 a           1      2     10        205 harris  
2 c           3   -999     12        250 harris  
3 d           4      0     13        217 lewis   
4 e          NA     NA     NA         NA ogunti

2. Remove any row that has -999 for extra2 AND/OR has -999 for extra3, while also keeping rows that have NA values for those variables.

Review the data (d5)

# A tibble: 5 x 6
  extra1 extra2 extra3 stu_id test_score tch_name
  <chr>   <dbl>  <dbl>  <dbl>      <dbl> <chr>   
1 a           1      2     10        205 harris  
2 b        -999      0     11        220 steve   
3 c           3   -999     12        250 harris  
4 d           4      0     13        217 lewis   
5 e          NA     NA     NA         NA ogunti

Filter out any row that has -999 for extra2 and/or -999 for extra3.

Note: You’ll notice that I again make an explicit call to keep rows with NA values for extra2 and extra3, using the | operator to denote and/or.
Note: Notice that for the first part of the equation I put the operator ! outside of the parentheses. While there are other ways to write this equation, I find this the least confusing way to remove rows based on criteria from multiple variables.

d5 %>% 
  dplyr::filter(!(extra2 == -999 | extra3 == -999) | (is.na(extra2) | is.na(extra3)))

# A tibble: 3 x 6
  extra1 extra2 extra3 stu_id test_score tch_name
  <chr>   <dbl>  <dbl>  <dbl>      <dbl> <chr>   
1 a           1      2     10        205 harris  
2 d           4      0     13        217 lewis   
3 e          NA     NA     NA         NA ogunti

Again, another way to get around the issue of dropping NAs is to use the within operator %in% rather than ==.

d5 %>% 
  dplyr::filter(!(extra2 %in% -999 | extra3 %in% -999))

# A tibble: 3 x 6
  extra1 extra2 extra3 stu_id test_score tch_name
  <chr>   <dbl>  <dbl>  <dbl>      <dbl> <chr>   
1 a           1      2     10        205 harris  
2 d           4      0     13        217 lewis   
3 e          NA     NA     NA         NA ogunti

And yet one other way to write this is using dplyr::if_all().

d5 %>% 
  dplyr::filter(dplyr::if_all(extra2:extra3, ~ !. %in% -999))

# A tibble: 3 x 6
  extra1 extra2 extra3 stu_id test_score tch_name
  <chr>   <dbl>  <dbl>  <dbl>      <dbl> <chr>   
1 a           1      2     10        205 harris  
2 d           4      0     13        217 lewis   
3 e          NA     NA     NA         NA ogunti

3. Remove any row that has -999 for AT LEAST ONE numeric variable, while also keeping rows that have NA values for those variables.

Review the data (d9)

# A tibble: 6 x 5
  extra1 extra2 extra3    id test_score
  <chr>   <dbl>  <dbl> <dbl>      <dbl>
1 a           1      2    10        205
2 b        -999      0    11        220
3 c           3   -999    12        250
4 d          NA      0    13        217
5 <NA>       NA     NA    NA         NA
6 e          NA     NA    NA         NA

Filter out any row that has -999 for AT LEAST ONE numeric variable

Note: You’ll notice here that I used %in% rather than == to ensure I keep the NA values.

Note: You need to wrap is.numeric, a predicate function (returns a true/false), in the tidyselect selection helper where().

d9 %>% 
  dplyr::filter(!dplyr::if_any(where(is.numeric), ~ . %in% -999))

# A tibble: 4 x 5
  extra1 extra2 extra3    id test_score
  <chr>   <dbl>  <dbl> <dbl>      <dbl>
1 a           1      2    10        205
2 d          NA      0    13        217
3 <NA>       NA     NA    NA         NA
4 e          NA     NA    NA         NA

I could also write this using dplyr::if_all(), saying “If all values for numeric variables are NOT -999, then return the rows.

d9 %>% 
  dplyr::filter(dplyr::if_all(where(is.numeric), ~ !. %in% -999))

# A tibble: 4 x 5
  extra1 extra2 extra3    id test_score
  <chr>   <dbl>  <dbl> <dbl>      <dbl>
1 a           1      2    10        205
2 d          NA      0    13        217
3 <NA>       NA     NA    NA         NA
4 e          NA     NA    NA         NA

Function: `filter()`

Examples of filtering on NA values

1. Remove any row that has NA for extra2.

Review the data (d7).

# A tibble: 4 x 6
  extra1 extra2 extra3 stu_id test_score tch_name
  <chr>   <dbl>  <dbl>  <dbl>      <dbl> <chr>   
1 a           1      2     NA        205 harris  
2 b          NA      0     11        220 steve   
3 c           3     NA     12        250 harris  
4 d           4      0     13        217 lewis

Remove any row that has NA for extra2.

Note: Use the logical operator ! to denote NOT.

d7 %>% 
  dplyr::filter(!is.na(extra2))

# A tibble: 3 x 6
  extra1 extra2 extra3 stu_id test_score tch_name
  <chr>   <dbl>  <dbl>  <dbl>      <dbl> <chr>   
1 a           1      2     NA        205 harris  
2 c           3     NA     12        250 harris  
3 d           4      0     13        217 lewis

2. Remove any row that has NA for extra2 AND/OR extra3.

Review the data (d7).

# A tibble: 4 x 6
  extra1 extra2 extra3 stu_id test_score tch_name
  <chr>   <dbl>  <dbl>  <dbl>      <dbl> <chr>   
1 a           1      2     NA        205 harris  
2 b          NA      0     11        220 steve   
3 c           3     NA     12        250 harris  
4 d           4      0     13        217 lewis

Note: Notice that I put the operator ! outside of the parentheses. While there are other ways to write this equation, I find this the least confusing way to remove rows based on removal criteria from multiple variables.

d7 %>% 
  dplyr::filter(!(is.na(extra2) | is.na(extra3)))

# A tibble: 2 x 6
  extra1 extra2 extra3 stu_id test_score tch_name
  <chr>   <dbl>  <dbl>  <dbl>      <dbl> <chr>   
1 a           1      2     NA        205 harris  
2 d           4      0     13        217 lewis

And yet one other way to write this is using dplyr::if_any().

d7 %>%
  dplyr::filter(!dplyr::if_any(extra2:extra3, ~ is.na(.)))

# A tibble: 2 x 6
  extra1 extra2 extra3 stu_id test_score tch_name
  <chr>   <dbl>  <dbl>  <dbl>      <dbl> <chr>   
1 a           1      2     NA        205 harris  
2 d           4      0     13        217 lewis

3. Keep rows where extra2 is NA but extra3 is not NA

Review the data (d10).

# A tibble: 6 x 4
  extra2 extra3    id test_score
   <dbl>  <dbl> <dbl>      <dbl>
1      1      2    10        205
2   -999      0    11        220
3      3   -999    12        250
4     NA      0    13        217
5     NA     NA    NA         NA
6   -999   -999  -999       -999

Keep rows where extra2 is NA but extra3 is not NA

d10 %>%
  dplyr::filter(is.na(extra2) & !is.na(extra3))

# A tibble: 1 x 4
  extra2 extra3    id test_score
   <dbl>  <dbl> <dbl>      <dbl>
1     NA      0    13        217

4. Remove any row that has NA for ALL numeric variables.

Review the data (d9)

# A tibble: 6 x 5
  extra1 extra2 extra3    id test_score
  <chr>   <dbl>  <dbl> <dbl>      <dbl>
1 a           1      2    10        205
2 b        -999      0    11        220
3 c           3   -999    12        250
4 d          NA      0    13        217
5 <NA>       NA     NA    NA         NA
6 e          NA     NA    NA         NA

Filter out any row that has NA for ALL numeric variables

Note: We use the tidyselect selection helper everything() to refer to all variables.

d9 %>% 
  dplyr::filter(!dplyr::if_all(tidyselect::everything(), ~ is.na(.)))

# A tibble: 5 x 5
  extra1 extra2 extra3    id test_score
  <chr>   <dbl>  <dbl> <dbl>      <dbl>
1 a           1      2    10        205
2 b        -999      0    11        220
3 c           3   -999    12        250
4 d          NA      0    13        217
5 e          NA     NA    NA         NA

5. Remove any row where the sum of NA values is greater than 1 across selected variables (var1:var3).

Review the data (d21)

# A tibble: 7 x 5
  sch   stu_id  var1  var2  var3
  <chr>  <dbl> <dbl> <dbl> <dbl>
1 a         10     1    20   205
2 a         15    NA    NA    NA
3 a         11     2    15   220
4 a         21     3    12    NA
5 b         22     9    NA   217
6 b         31    10    NA    NA
7 b         32    10    14   251

Filter out any row that has more than 1 NA across var1:var3

Note: We use the base::rowSums() to calculate the number of NA values across var1:var3.
Note: We use dplyr::across() to select our variables within base::rowSums()

d21 %>%
  dplyr::filter(rowSums(dplyr::across(var1:var3, ~ is.na(.))) <= 1)

# A tibble: 5 x 5
  sch   stu_id  var1  var2  var3
  <chr>  <dbl> <dbl> <dbl> <dbl>
1 a         10     1    20   205
2 a         11     2    15   220
3 a         21     3    12    NA
4 b         22     9    NA   217
5 b         32    10    14   251

This could also be written with is.na() wrapping the dplyr::across() statement.

d21 %>%
  dplyr::filter(rowSums(is.na(dplyr::across(var1:var3))) <= 1)

# A tibble: 5 x 5
  sch   stu_id  var1  var2  var3
  <chr>  <dbl> <dbl> <dbl> <dbl>
1 a         10     1    20   205
2 a         11     2    15   220
3 a         21     3    12    NA
4 b         22     9    NA   217
5 b         32    10    14   251

Filter rows with NA values

Package: dplyr

Function: `filter()`

Function: `filter()`

Package: tidyr

Function: `tidyr::drop_na()`

Filter rows with NA values

Package: dplyr

Function: filter()

Function: filter()

Package: tidyr

Function: tidyr::drop_na()

Function: `filter()`

Function: `filter()`

Function: `tidyr::drop_na()`