Find and remove duplicates

Package: dplyr

Function: `distinct()`

1. Remove rows associated with duplicate student IDs in our data

In the case of the data above (d1), I only expect to have one row per student id (stu_id). It appears one student completed the survey twice.

The duplicate entries are not identical. I need to make a plan for how to remove duplicate data. For example purposes, let’s say I have made the decision to always keep the most recently completed survey.

In order to ensure I remove duplicates in that specific manner, I need to first arrange my data descending by date. I need to do this because dplyr::distinct() always keeps the first instance of a survey (based on order).

Note: I need to add the dplyr::distinct() argument .keep_all = TRUE, otherwise all variables except for stu_id will be dropped

d1 %>%
  dplyr::arrange(desc(date)) %>%
  dplyr::distinct(stu_id, .keep_all = TRUE)

# A tibble: 5 x 5
  stu_id date       item1 item2 item3
   <dbl> <date>     <dbl> <dbl> <dbl>
1    102 2022-11-18     3     4     1
2    100 2022-11-17     2     2     4
3    106 2022-11-16     2     1     1
4    103 2022-11-15     2     5     2
5    105 2022-11-15     5     4     3

It is really important to never randomly drop duplicate data. This means, that you will ALWAYS want to arrange your data before using the dplyr::distinct() function. The reason for this is that if for some reason your raw data is ever rearranged, when you run this function again, you may be dropping a different duplicate than you intended to.

2. Remove rows associated with duplicate student IDs in our data (according to specific decision rules)

Review the data (d2)

# A tibble: 8 x 5
  stu_id date       item1 item2 item3
   <dbl> <date>     <dbl> <dbl> <dbl>
1    100 2022-11-15     3     4     2
2    103 2022-11-15     2     5    NA
3    105 2022-11-15     5     4     3
4    106 2022-11-16     2     1     1
5    100 2022-11-17     2    NA    NA
6    102 2022-11-18     3     4     1
7    103 2022-11-18     2     4     2
8    105 2022-11-18     5     4     1

We can see in our new data we have several instances of duplicate data (for IDS 100, 103, and 105). Sometimes those duplicates have both incomplete and complete data.

In this new scenario, my new decision rule for keeping duplicates is to keep data based on 2 criteria:
1. First I want to keep the record that is complete (if one is incomplete).
2. If both records are complete, I want to keep the most recent record.

We can ensure we meet this criteria by doing 2 things:

First, creating a way calculate completeness. In this case I just calculate a new variable using dplyr::mutate() called incomplete. Using base::rowSums() I calculate the number of NA values across the main survey items. If the incomplete value is 0, it means the survey is missing no data.
Next we can arrange by both the incomplete variable and date descending again.

Doing both of these steps, we can see that we will ultimately keep the surveys we want when we use dplyr::distinct() because the function keeps the first instance of the survey in your data.

Duplicate IDs 103, 105 and 100 all have the most complete, most recently completed surveys at the top.

d2 %>%
  dplyr::mutate(incomplete = rowSums(is.na(dplyr::across(item1:item3)))) %>%
  dplyr::arrange(incomplete, desc(date))

# A tibble: 8 x 6
  stu_id date       item1 item2 item3 incomplete
   <dbl> <date>     <dbl> <dbl> <dbl>      <dbl>
1    102 2022-11-18     3     4     1          0
2    103 2022-11-18     2     4     2          0
3    105 2022-11-18     5     4     1          0
4    106 2022-11-16     2     1     1          0
5    100 2022-11-15     3     4     2          0
6    105 2022-11-15     5     4     3          0
7    103 2022-11-15     2     5    NA          1
8    100 2022-11-17     2    NA    NA          2

We can now put it all together to remove duplicates

d2 %>%
  dplyr::mutate(incomplete = rowSums(is.na(dplyr::across(item1:item3)))) %>%
  dplyr::arrange(incomplete, desc(date)) %>%
  dplyr::distinct(stu_id, .keep_all = TRUE)

# A tibble: 5 x 6
  stu_id date       item1 item2 item3 incomplete
   <dbl> <date>     <dbl> <dbl> <dbl>      <dbl>
1    102 2022-11-18     3     4     1          0
2    103 2022-11-18     2     4     2          0
3    105 2022-11-18     5     4     1          0
4    106 2022-11-16     2     1     1          0
5    100 2022-11-15     3     4     2          0

3. Remove duplicate rows according to two variables (first_name and last_name)

Review the data (d3)

# A tibble: 4 x 3
  first_name last_name    q1
  <chr>      <chr>     <dbl>
1 micah      lewis         4
2 amy        henry         2
3 oscar      bobsen        1
4 micah      lewis         3

Now our data are uniquely identified by more than one variable.

d3 %>%
  dplyr::distinct(first_name, last_name, .keep_all = TRUE)

# A tibble: 3 x 3
  first_name last_name    q1
  <chr>      <chr>     <dbl>
1 micah      lewis         4
2 amy        henry         2
3 oscar      bobsen        1

To select columns using tidy selection, you can use dplyr::pick(), available in version 1.1.0. Complementary to dplyr::across(), however with pick you typically apply a function to the full data frame, while with select you typically apply a function to each column.

d3 %>%
  dplyr::distinct(pick(contains("name")), .keep_all = TRUE)

# A tibble: 3 x 3
  first_name last_name    q1
  <chr>      <chr>     <dbl>
1 micah      lewis         4
2 amy        henry         2
3 oscar      bobsen        1

Return to Duplicates

Find and remove duplicates

Package: janitor

Function: `get_dupes()`

Package: dplyr

Function: `distinct()`

Find and remove duplicates

Package: janitor

Function: get_dupes()

Package: dplyr

Function: distinct()

Function: `get_dupes()`

Function: `distinct()`