Package: dplyr


Function: semi_join()


1. Filter your data to only those who are in your study sample using a participant roster.

Review the data (pretest)

# A tibble: 4 x 3
     id pre_var1 pre_var2
  <dbl>    <dbl>    <dbl>
1    20        3        4
2    21        4        5
3    22        3        1
4    23        5        2

Let’s review our roster

# A tibble: 3 x 2
     id consent
  <dbl> <chr>  
1    20 yes    
2    21 yes    
3    22 yes    

Filter our pretest data to only those participants who are on our study roster.

We are going to filter using semi_join which is a filtering join. Filtering joins filter cases in your left table (the dataset you want filtered) based on information in your right table (your filtering dataset).

  • Note: Using the by argument, we add the variable/s we want to join on.
pretest %>%
  dplyr::semi_join(roster, by = "id")
# A tibble: 3 x 3
     id pre_var1 pre_var2
  <dbl>    <dbl>    <dbl>
1    20        3        4
2    21        4        5
3    22        3        1

There may be other ways to achieve something similar using a join like right_join (which is considered a mutating join rather than a filtering join), but it would bring in variables from both datasets. And if our roster dataset has more variables than just id (in this case there is an extra variable called “consent”), we may not want that. We may only want the variables in our original dataset.

pretest %>%
  dplyr::right_join(roster, by = "id")
# A tibble: 3 x 4
     id pre_var1 pre_var2 consent
  <dbl>    <dbl>    <dbl> <chr>  
1    20        3        4 yes    
2    21        4        5 yes    
3    22        3        1 yes    

2. Filter to only schools and grade levels that appear in a second dataset.

Review the data (d15)

# A tibble: 9 x 5
  sch   stu_id grade  var2  var3
  <chr>  <dbl> <dbl> <dbl> <dbl>
1 a         10     1    20   205
2 a         15     1    14   201
3 a         11     2    15   220
4 a         21     3    12   250
5 b         22     9    13   217
6 b         31    10    NA    NA
7 b         32    10    14   251
8 c         40     2    10   200
9 c         41     3    14   220

We want to filter based on our second dataset (s_g).

# A tibble: 2 x 2
  sch   grade
  <chr> <dbl>
1 a         1
2 b         9

We see our second dataset includes just grade = 1 for sch = “a” and just grade = 9 for sch = “b”. We could conditionally filter grade levels based on school name using the dplyr::filter() function (see conditional-filtering). But instead, here we are going to use another data frame to filter. In this case we have the existing dataset (s_g) or you could create one.

We are going to filter using our semi_join.

  • Note: You can add an id variable/s to join on here in the by argument, but in this case, it’s fine to leave the argument NULL and a join will be performed using all variables in common across both datasets.

  • Note: You can see in the output below that a and b are retained (with the grade levels we wanted) but c is no longer in the dataset. This is because using semi_join returns all rows from your left table that have a match in your right table. Since there is no school c in our filtering table, school c will not be retained in our output.

d15 %>%
  dplyr::semi_join(s_g)
# A tibble: 3 x 5
  sch   stu_id grade  var2  var3
  <chr>  <dbl> <dbl> <dbl> <dbl>
1 a         10     1    20   205
2 a         15     1    14   201
3 b         22     9    13   217

You could also include the by argument and include the variables you want to filter on.

d15 %>%
  dplyr::semi_join(s_g, by = c("sch", "grade"))
# A tibble: 3 x 5
  sch   stu_id grade  var2  var3
  <chr>  <dbl> <dbl> <dbl> <dbl>
1 a         10     1    20   205
2 a         15     1    14   201
3 b         22     9    13   217

There are benefits here to using semi_join over the similar inner_join (which is considered a mutating join rather than a filtering join).

  1. It only keeps the variables in your left dataset
  2. If for some reason your right dataset has duplicates in it, semi_join will not duplicate your joined data, whereas inner_join will.

For example, say our second dataset (s_g_2) had an extra variable and a duplicate school.

# A tibble: 3 x 3
  sch   grade    id
  <chr> <dbl> <dbl>
1 a         1    30
2 b         9    40
3 a         1    30

If we use our semi_join our filter works great still.

d15 %>%
  dplyr::semi_join(s_g_2)
# A tibble: 3 x 5
  sch   stu_id grade  var2  var3
  <chr>  <dbl> <dbl> <dbl> <dbl>
1 a         10     1    20   205
2 a         15     1    14   201
3 b         22     9    13   217

However, if we use an inner_join, we get extra variables and duplicate rows.

d15 %>%
  dplyr::inner_join(s_g_2)
# A tibble: 5 x 6
  sch   stu_id grade  var2  var3    id
  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl>
1 a         10     1    20   205    30
2 a         10     1    20   205    30
3 a         15     1    14   201    30
4 a         15     1    14   201    30
5 b         22     9    13   217    40

Also if the filtering dataset had an extra school in it, that is not in our left table, semi_join will still work fine as it will only return cases from the left table that are matched in the right table.

Here is an example with an extra school (school z) in the filtering dataset (s_g_3) that is not in our left data table.

# A tibble: 4 x 3
  sch   grade    id
  <chr> <dbl> <dbl>
1 a         1    30
2 b         9    40
3 a         1    30
4 z         2    80
d15 %>%
  dplyr::semi_join(s_g_3)
# A tibble: 3 x 5
  sch   stu_id grade  var2  var3
  <chr>  <dbl> <dbl> <dbl> <dbl>
1 a         10     1    20   205
2 a         15     1    14   201
3 b         22     9    13   217

3. Filter to only schools and grade levels that appear in a second dataset.

Review the data (d11)

# A tibble: 7 x 5
  sch   stu_id grade  var2  var3
  <chr>  <dbl> <dbl> <dbl> <dbl>
1 a         10     1    20   205
2 a         15     1    14   201
3 a         11     2    15   220
4 a         21     3    12   250
5 b         22     9    13   217
6 b         31    10    NA    NA
7 b         32    10    14   251

Review our joining dataset (s_g_more)

# A tibble: 3 x 2
  sch   grade
  <chr> <dbl>
1 a         1
2 a         2
3 b         9

This time our dataset contains more than one grade levels per school. We will be filtering to (grade = 1 & 2 for sch = “a”) and (grade = 9 for sch = “b”)

Similar to above, we could conditionally filter grade levels based on school name using the dplyr::filter() function (see conditional-filtering), but in this case we are going to filter using a second dataset (s_g_more).

  • Note: You can add an id variable/s to join on here in the by argument, but in this case, it’s fine to leave the argument NULL and a join will be performed using all variables in common across both datasets.
d11 %>%
  dplyr::semi_join(s_g_more)
# A tibble: 4 x 5
  sch   stu_id grade  var2  var3
  <chr>  <dbl> <dbl> <dbl> <dbl>
1 a         10     1    20   205
2 a         15     1    14   201
3 a         11     2    15   220
4 b         22     9    13   217

Function: anti_join()


1. Filter your pre-test data to those who did not complete a post-test.

Review the data (pretest)

# A tibble: 4 x 3
     id pre_var1 pre_var2
  <dbl>    <dbl>    <dbl>
1    20        3        4
2    21        4        5
3    22        3        1
4    23        5        2

Let’s review the post-test

# A tibble: 3 x 3
     id post_var1 post_var2
  <dbl>     <dbl>     <dbl>
1    21         4         4
2    22         2         1
3    23         5         1

Filter using an anti_join which returns everyone from your left table that does not appear in your right (filtering) table. It also only returns the variables from the left table.

pretest %>%
  dplyr::anti_join(posttest)
# A tibble: 1 x 3
     id pre_var1 pre_var2
  <dbl>    <dbl>    <dbl>
1    20        3        4

Return to Filter