diffdf()
1. Compare two data frames that should be identical to see if there are any differences in their values and output those differences
Read in the data
entry1 <- readr::read_csv("project-a_forms_entry1.csv")
entry2 <- readr::read_csv("project-a_forms_entry2.csv")
Review entry1
# A tibble: 4 x 5
stu_id grade q1 q2 q3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 1 2 4 6
2 1235 2 1 5 6
3 1236 1 NA 12 4
4 1237 3 3 2 4
Review entry2
# A tibble: 4 x 5
stu_id grade q1 q2 q3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 1 1 4 6
2 1235 2 1 5 6
3 1236 1 NA 1 4
4 1237 3 3 2 4
Now check if there are any differences in the two data frames.
Note: I set the keys argument to stu_id because stu_id is what makes each row unique. If you need multiple variables to make a row unique, combine them with c(). For example, keys = c("stu_id", "tch_id").
Note: I’ve included the base arguments you need for the function to run. Run ?diffdf in the console to see additional arguments that can be added.
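To illustrate the multi-key case from the note above, here is a minimal sketch. The data is hypothetical and built inline rather than read from a file: each student appears once per teacher, so only the combination of stu_id and tch_id identifies a row. The result is assigned and printed explicitly, and suppress_warnings = TRUE is used only to keep the console quiet.

```r
library(diffdf)

# Hypothetical data: stu_id repeats, so stu_id alone is not a unique key.
# The pair (stu_id, tch_id) is unique in both data frames.
multi1 <- data.frame(
  stu_id = c(1234, 1234, 1235),
  tch_id = c(10, 11, 10),
  q1     = c(2, 3, 1)
)

# Second entry of the same forms, with one value keyed differently
multi2 <- multi1
multi2$q1[2] <- 4

# Pass both key variables as a character vector
multi_result <- diffdf(
  multi1, multi2,
  keys = c("stu_id", "tch_id"),
  suppress_warnings = TRUE
)
print(multi_result)
```

With both keys supplied, the printed difference table should identify the mismatched row by stu_id and tch_id together rather than by row number.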
diffdf(entry1, entry2, keys = "stu_id")
Differences found between the objects!
A summary is given below.
Not all Values Compared Equal
All rows are shown in table below
=============================
Variable No of Differences
-----------------------------
q1 1
q2 1
-----------------------------
All rows are shown in table below
=================================
VARIABLE stu_id BASE COMPARE
---------------------------------
q1 1234 2 1
---------------------------------
All rows are shown in table below
=================================
VARIABLE stu_id BASE COMPARE
---------------------------------
q2 1236 12 1
---------------------------------
Our output tells us that there are differences between the two data frames.
The first difference is in the row where stu_id = 1234. In the base data frame (entry1), "q1" was entered as 2; in the compare data frame (entry2), it was entered as 1.
The second difference is in the row where stu_id = 1236. In the base data frame, "q2" was entered as 12; in the compare data frame, it was entered as 1.
You could use this information to return to your original forms to see what the correct answer should be and then rectify the errors in whichever data frame has the incorrect response. You would then read your revised data back into R and then run this code again to see if all errors have been corrected.
If all errors have been corrected, you will see this message.
No issues were found!
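If you would rather check the result in code than read the printed summary, the diffdf package also exports helper functions for that. The sketch below is an illustration rather than part of the workflow above: it rebuilds the two entry data frames inline instead of reading the CSV files, and it assumes diffdf_has_issues() and diffdf_issuerows() behave as documented in your installed version of diffdf.

```r
library(diffdf)

# Rebuild the example data inline (stand-in for the CSV files)
entry1 <- data.frame(
  stu_id = c(1234, 1235, 1236, 1237),
  grade  = c(1, 2, 1, 3),
  q1     = c(2, 1, NA, 3),
  q2     = c(4, 5, 12, 2),
  q3     = c(6, 6, 4, 4)
)
entry2 <- entry1
entry2$q1[1] <- 1   # stu_id 1234: q1 entered as 1 instead of 2
entry2$q2[3] <- 1   # stu_id 1236: q2 entered as 1 instead of 12

# Store the comparison instead of printing it
result <- diffdf(entry1, entry2, keys = "stu_id", suppress_warnings = TRUE)

# TRUE while any difference remains, FALSE once the entries agree
has_issues <- diffdf_has_issues(result)

# Pull the rows of entry1 that are involved in a difference
issue_rows <- diffdf_issuerows(entry1, result)
print(issue_rows)
```

After correcting the data, rerunning the comparison and checking has_issues gives a simple yes/no answer that can be used in a script.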
all_equal()
1. Compare two data frames to see if they are identical
In this hypothetical scenario, someone has made a copy of our student data file, and we are unsure which one to use. We simply want to see if they are identical or not. We don’t care about the intricacies of the differences. A simple summary on differences in the contents and/or structure will suffice.
Read in the data
file1 <- readr::read_csv("project-a_forms_file1.csv")
file1_copy <- readr::read_csv("project-a_forms_file1_copy.csv")
Review file1
# A tibble: 4 x 5
stu_id grade q1 q2 q3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 1 2 4 6
2 1235 2 1 5 6
3 1236 1 NA 12 4
4 1237 3 3 2 4
Review file1_copy
# A tibble: 4 x 5
stu_id grade q1 q3 q2
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 1 2 6 4
2 1235 2 1 6 5
3 1236 1 NA 4 12
4 1237 3 3 4 2
Now check if there are any differences in the two data frames.
all_equal(file1, file1_copy)
[1] TRUE
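In this scenario we find that the files are identical, even though q2 and q3 appear in a different order, because all_equal() ignores column order by default. Awesome! We can simply choose one to work with and archive the other.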
But, what would the output look like if they were not identical?
Let’s read in another copy, file1_copy2.
# A tibble: 4 x 5
stu_id grade q1 q3 q2
<chr> <dbl> <dbl> <dbl> <dbl>
1 1234 1 2 6 4
2 1235 2 1 6 5
3 1236 1 NA 4 12
4 1237 3 3 4 2
It looks like the same data? Let’s check.
all_equal(file1, file1_copy2)
Different types for column `stu_id`: double vs character.
We find out that the files are not identical!
stu_id is now a character variable rather than numeric. We would need to make a decision regarding which file to use.
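If you decide the numeric version of stu_id is the one to keep, one option is to convert the column and compare again. This is a minimal sketch with the data rebuilt inline (not read from the CSV files), and it assumes every stu_id value in the copy can be converted to a number cleanly.

```r
library(dplyr)

file1 <- data.frame(
  stu_id = c(1234, 1235, 1236, 1237),
  grade  = c(1, 2, 1, 3),
  q1     = c(2, 1, NA, 3),
  q2     = c(4, 5, 12, 2),
  q3     = c(6, 6, 4, 4)
)

# Same data as file1, but with q2/q3 swapped and stu_id stored as character
file1_copy2 <- file1[, c("stu_id", "grade", "q1", "q3", "q2")]
file1_copy2$stu_id <- as.character(file1_copy2$stu_id)

# Before the fix: all_equal() returns a message describing the type mismatch
before_fix <- all_equal(file1, file1_copy2)

# Convert stu_id back to numeric, then compare again
file1_copy2$stu_id <- as.numeric(file1_copy2$stu_id)
after_fix <- all_equal(file1, file1_copy2)
```

If the type difference comes from how the file was read, setting col_types in readr::read_csv() is another way to get matching types from the start.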
Let’s read in yet another copy to see what differences might exist, file1_copy3.
# A tibble: 4 x 5
stu_id grade q1 q3 q2
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 1 2 6 4
2 1235 2 1 6 5
3 1237 3 3 4 2
4 1236 1 NA 4 12
It looks like some of the rows and columns are out of order, which is okay, but is the data the same?
all_equal(file1, file1_copy3)
[1] TRUE
It is! So again, we can choose which copy to work with and archive the other.
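The result above is TRUE because all_equal() ignores row and column order by default. If order matters to you, the ignore_row_order and ignore_col_order arguments can be set to FALSE. The sketch below rebuilds the data inline as a stand-in for the CSV files. Note that dplyr 1.1.0 and later mark all_equal() as deprecated, so these calls may emit a deprecation warning.

```r
library(dplyr)

file1 <- data.frame(
  stu_id = c(1234, 1235, 1236, 1237),
  grade  = c(1, 2, 1, 3),
  q1     = c(2, 1, NA, 3),
  q2     = c(4, 5, 12, 2),
  q3     = c(6, 6, 4, 4)
)

# Same values as file1, with the last two rows swapped and q2/q3 swapped
file1_copy3 <- file1[c(1, 2, 4, 3), c("stu_id", "grade", "q1", "q3", "q2")]
rownames(file1_copy3) <- NULL

# Default behaviour: row and column order are ignored
default_check <- all_equal(file1, file1_copy3)

# Strict behaviour: any difference in order is reported as a message
strict_check <- all_equal(
  file1, file1_copy3,
  ignore_col_order = FALSE,
  ignore_row_order = FALSE
)
```

Here default_check is TRUE, while strict_check holds a character message instead of TRUE, so wrap the result in isTRUE() when using it in a condition.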
Let’s look at one last copy to see what the differences are, file1_copy4.
# A tibble: 4 x 5
stu_id grade q1 q2 q3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 1 2 4 6
2 1235 2 1 5 6
3 1236 1 NA 12 4
4 1238 3 3 2 4
all_equal(file1, file1_copy4)
[1] "- Rows in x but not in y: 4\n- Rows in y but not in x: 4\n"
The output tells us that row 4 of each file has no match in the other: the last row has stu_id 1237 in file1 but 1238 in file1_copy4. I would need to dig into which copy of the file is correct before using either one.
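To see the unmatched rows themselves rather than just their row numbers, one approach is dplyr::anti_join(), which returns the rows of the first data frame that have no match in the second. This is a sketch with the data rebuilt inline, and it matches on every column, so a row counts as "the same" only if all of its values agree.

```r
library(dplyr)

file1 <- data.frame(
  stu_id = c(1234, 1235, 1236, 1237),
  grade  = c(1, 2, 1, 3),
  q1     = c(2, 1, NA, 3),
  q2     = c(4, 5, 12, 2),
  q3     = c(6, 6, 4, 4)
)

# Same as file1 except the last stu_id is 1238 instead of 1237
file1_copy4 <- file1
file1_copy4$stu_id[4] <- 1238

# Rows in file1 with no exact match in file1_copy4
only_in_file1 <- anti_join(file1, file1_copy4, by = names(file1))

# Rows in file1_copy4 with no exact match in file1
only_in_copy4 <- anti_join(file1_copy4, file1, by = names(file1))

print(only_in_file1)
print(only_in_copy4)
```

Each result contains the single row that differs, which makes it easier to check the stu_id against the original forms.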
Side note: If these examples haven’t been a case study for why you should version and document your file changes, I’m not sure what is! :)