get_dupes()1. Review the duplicate student IDs in our student survey data
Review the data (d1)
# A tibble: 6 x 5
stu_id date item1 item2 item3
<dbl> <date> <dbl> <dbl> <dbl>
1 100 2022-11-15 3 4 2
2 103 2022-11-15 2 5 2
3 105 2022-11-15 5 4 3
4 106 2022-11-16 2 1 1
5 100 2022-11-17 2 2 4
6 102 2022-11-18 3 4 1
I can see that I have one stu_id that is duplicated in
my data.
d1 %>%
janitor::get_dupes(stu_id)
# A tibble: 2 x 6
stu_id dupe_count date item1 item2 item3
<dbl> <int> <date> <dbl> <dbl> <dbl>
1 100 2 2022-11-15 3 4 2
2 100 2 2022-11-17 2 2 4
distinct()1. Remove rows associated with duplicate student IDs in our data
In the case of the data above (d1), I only expect to have one row per
student id (stu_id). It appears one student completed the
survey twice.
The duplicate entries are not identical. I need to make a plan for how to remove duplicate data. For example purposes, letโs say I have made the decision to always keep the most recently completed survey.
In order to ensure I remove duplicates in that specific manner, I
need to first arrange my data descending by date. I need to
do this because dplyr::distinct() always keeps the
first instance of a survey (based on order).
dplyr::distinct() argument
.keep_all = TRUE, otherwise all variables except for
stu_id will be droppedd1 %>%
dplyr::arrange(desc(date)) %>%
dplyr::distinct(stu_id, .keep_all = TRUE)
# A tibble: 5 x 5
stu_id date item1 item2 item3
<dbl> <date> <dbl> <dbl> <dbl>
1 102 2022-11-18 3 4 1
2 100 2022-11-17 2 2 4
3 106 2022-11-16 2 1 1
4 103 2022-11-15 2 5 2
5 105 2022-11-15 5 4 3
It is really important to never randomly drop duplicate
data. This means, that you will ALWAYS want to arrange your
data before using the dplyr::distinct() function. The
reason for this is that if for some reason your raw data is ever
rearranged, when you run this function again, you may be dropping a
different duplicate than you intended to.
2. Remove rows associated with duplicate student IDs in our data (according to specific decision rules)
Review the data (d2)
# A tibble: 8 x 5
stu_id date item1 item2 item3
<dbl> <date> <dbl> <dbl> <dbl>
1 100 2022-11-15 3 4 2
2 103 2022-11-15 2 5 NA
3 105 2022-11-15 5 4 3
4 106 2022-11-16 2 1 1
5 100 2022-11-17 2 NA NA
6 102 2022-11-18 3 4 1
7 103 2022-11-18 2 4 2
8 105 2022-11-18 5 4 1
We can see in our new data we have several instances of duplicate data (for IDS 100, 103, and 105). Sometimes those duplicates have both incomplete and complete data.
In this new scenario, my new decision rule for keeping duplicates is
to keep data based on 2 criteria:
1. First I want to keep the record that is complete (if one is
incomplete).
2. If both records are complete, I want to keep the most recent
record.
We can ensure we meet this criteria by doing 2 things:
First, creating a way calculate completeness. In this case I just
calculate a new variable using dplyr::mutate() called
incomplete. Using base::rowSums() I calculate
the number of NA values across the main survey items. If the incomplete
value is 0, it means the survey is missing no data.
Next we can arrange by both the incomplete variable
and date descending again.
Doing both of these steps, we can see that we will ultimately keep
the surveys we want when we use dplyr::distinct() because
the function keeps the first instance of the survey in your data.
Duplicate IDs 103, 105 and 100 all have the most complete, most recently completed surveys at the top.
d2 %>%
dplyr::mutate(incomplete = rowSums(is.na(dplyr::across(item1:item3)))) %>%
dplyr::arrange(incomplete, desc(date))
# A tibble: 8 x 6
stu_id date item1 item2 item3 incomplete
<dbl> <date> <dbl> <dbl> <dbl> <dbl>
1 102 2022-11-18 3 4 1 0
2 103 2022-11-18 2 4 2 0
3 105 2022-11-18 5 4 1 0
4 106 2022-11-16 2 1 1 0
5 100 2022-11-15 3 4 2 0
6 105 2022-11-15 5 4 3 0
7 103 2022-11-15 2 5 NA 1
8 100 2022-11-17 2 NA NA 2
We can now put it all together to remove duplicates
d2 %>%
dplyr::mutate(incomplete = rowSums(is.na(dplyr::across(item1:item3)))) %>%
dplyr::arrange(incomplete, desc(date)) %>%
dplyr::distinct(stu_id, .keep_all = TRUE)
# A tibble: 5 x 6
stu_id date item1 item2 item3 incomplete
<dbl> <date> <dbl> <dbl> <dbl> <dbl>
1 102 2022-11-18 3 4 1 0
2 103 2022-11-18 2 4 2 0
3 105 2022-11-18 5 4 1 0
4 106 2022-11-16 2 1 1 0
5 100 2022-11-15 3 4 2 0
3. Remove duplicate rows according to two variables
(first_name and last_name)
Review the data (d3)
# A tibble: 4 x 3
first_name last_name q1
<chr> <chr> <dbl>
1 micah lewis 4
2 amy henry 2
3 oscar bobsen 1
4 micah lewis 3
Now our data are uniquely identified by more than one variable.
d3 %>%
dplyr::distinct(first_name, last_name, .keep_all = TRUE)
# A tibble: 3 x 3
first_name last_name q1
<chr> <chr> <dbl>
1 micah lewis 4
2 amy henry 2
3 oscar bobsen 1
To select columns using tidy selection, you can use
dplyr::pick(), available in version 1.1.0. Complementary to
dplyr::across(), however with pick you
typically apply a function to the full data frame, while with
select you typically apply a function to each column.
d3 %>%
dplyr::distinct(pick(contains("name")), .keep_all = TRUE)
# A tibble: 3 x 3
first_name last_name q1
<chr> <chr> <dbl>
1 micah lewis 4
2 amy henry 2
3 oscar bobsen 1
Return to Duplicates