Data frame comparison

Package: diffdf

Function: `diffdf()`

1. Compare two data frames that should be identical to see if there are any differences in their values and output those differences

Read in the data

Note: Here we read in CSV files, but you can read in any type of data (.xlsx, .sav, etc.).

entry1 <- readr::read_csv("project-a_forms_entry1.csv")
entry2 <- readr::read_csv("project-a_forms_entry2.csv")

Review entry1

# A tibble: 4 x 5
  stu_id grade    q1    q2    q3
   <dbl> <dbl> <dbl> <dbl> <dbl>
1   1234     1     2     4     6
2   1235     2     1     5     6
3   1236     1    NA    12     4
4   1237     3     3     2     4

Review entry2

# A tibble: 4 x 5
  stu_id grade    q1    q2    q3
   <dbl> <dbl> <dbl> <dbl> <dbl>
1   1234     1     1     4     6
2   1235     2     1     5     6
3   1236     1    NA     1     4
4   1237     3     3     2     4
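If you don't have the CSV files on hand, the two data frames shown above can be recreated by typing the values in directly. This is only a sketch for following along: it uses base R data.frame() rather than a tibble, so the printed output will look slightly different.

```r
# Stand-ins for the two CSV files, typed in by hand from the printed tables
entry1 <- data.frame(
  stu_id = c(1234, 1235, 1236, 1237),
  grade  = c(1, 2, 1, 3),
  q1     = c(2, 1, NA, 3),
  q2     = c(4, 5, 12, 2),
  q3     = c(6, 6, 4, 4)
)

entry2 <- data.frame(
  stu_id = c(1234, 1235, 1236, 1237),
  grade  = c(1, 2, 1, 3),
  q1     = c(1, 1, NA, 3),
  q2     = c(4, 5, 1, 2),
  q3     = c(6, 6, 4, 4)
)
```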

Now check if there are any differences in the two data frames.

Note: For the keys argument I added stu_id, because stu_id is what makes each row unique. If you need multiple variables to make a row unique, combine them with c(). For example, keys = c("stu_id", "tch_id").
Note: I’ve included the base arguments you need for the function to run. Run ?diffdf in the console to see additional arguments that can be added.
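Before choosing keys, it can help to confirm that they really do identify each row uniquely. Here is a minimal base R sketch of that check. The helper name keys_are_unique and the ratings data (including the tch_id values) are made up for illustration; they are not part of diffdf.

```r
# Hypothetical helper: TRUE when the key columns identify every row uniquely
keys_are_unique <- function(df, keys) {
  anyDuplicated(df[, keys, drop = FALSE]) == 0
}

# Hypothetical data: one student rated by two different teachers
ratings <- data.frame(
  stu_id = c(1234, 1234, 1235),
  tch_id = c(10, 11, 10),
  q1     = c(2, 3, 1)
)

keys_are_unique(ratings, "stu_id")               # FALSE: stu_id 1234 appears twice
keys_are_unique(ratings, c("stu_id", "tch_id"))  # TRUE
```

If the check returns FALSE, add variables to the keys until it returns TRUE.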

diffdf::diffdf(entry1, entry2, keys = "stu_id")

Differences found between the objects!

A summary is given below.

Not all Values Compared Equal
All rows are shown in table below

  =============================
   Variable  No of Differences 
  -----------------------------
      q1             1         
      q2             1         
  -----------------------------


All rows are shown in table below

  =================================
   VARIABLE  stu_id  BASE  COMPARE 
  ---------------------------------
      q1      1234    2       1    
  ---------------------------------


All rows are shown in table below

  =================================
   VARIABLE  stu_id  BASE  COMPARE 
  ---------------------------------
      q2      1236    12      1    
  ---------------------------------

You can see our output tells us that there are differences between the two data frames.

The first difference is in the row where stu_id = 1234. The base data frame (entry1) has 2 for “q1” and the compare data frame (entry2) has 1.
The second difference is in the row where stu_id = 1236. The base data frame has 12 for “q2” and the compare data frame has 1.

You could use this information to return to your original forms to see what the correct answer should be and then rectify the errors in whichever data frame has the incorrect response. You would then read your revised data back into R and then run this code again to see if all errors have been corrected.
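If you would rather pull the problem rows out as a data frame than read them off the printed report, diffdf also provides diffdf_issuerows(). The sketch below assumes the two entries are recreated by hand from the tables above instead of being read from the CSV files.

```r
library(diffdf)

# Hand-typed copies of the two entries shown earlier
entry1 <- data.frame(
  stu_id = c(1234, 1235, 1236, 1237),
  grade  = c(1, 2, 1, 3),
  q1     = c(2, 1, NA, 3),
  q2     = c(4, 5, 12, 2),
  q3     = c(6, 6, 4, 4)
)
entry2 <- entry1
entry2$q1[entry2$stu_id == 1234] <- 1
entry2$q2[entry2$stu_id == 1236] <- 1

# Store the comparison instead of only printing it.
# suppress_warnings = TRUE silences the warning raised when differences are found.
result <- diffdf(entry1, entry2, keys = "stu_id", suppress_warnings = TRUE)

# Rows from each data frame that have at least one mismatch
rows_entry1 <- diffdf_issuerows(entry1, result)
rows_entry2 <- diffdf_issuerows(entry2, result)
rows_entry1
rows_entry2
```

These are the rows you would look up in the original forms.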

If all errors have been corrected, you will see this message.

No issues were found!
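If you want a script to check for this automatically instead of reading the message, diffdf_has_issues() returns TRUE or FALSE for a stored comparison. A small sketch, using made-up corrected data in which both entries now agree:

```r
library(diffdf)

# Hypothetical corrected data: the second entry now matches the first
entry1 <- data.frame(
  stu_id = c(1234, 1236),
  q1     = c(2, NA),
  q2     = c(4, 12)
)
entry2 <- entry1

result <- diffdf(entry1, entry2, keys = "stu_id")

# Stop the script if any differences remain
if (diffdf_has_issues(result)) {
  stop("The entries still differ; review the diffdf output.")
} else {
  message("The entries match.")
}
```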

Package: dplyr

Function: `all_equal()`

1. Compare two data frames to see if they are identical

In this hypothetical scenario, someone has made a copy of our student data file, and we are unsure which one to use. We simply want to see if they are identical or not. We don’t care about the intricacies of the differences. A simple summary on differences in the contents and/or structure will suffice.

Read in the data

Note: Here we read in CSV files, but you can read in any type of data (.xlsx, .sav, etc.).

file1 <- readr::read_csv("project-a_forms_file1.csv")
file1_copy <- readr::read_csv("project-a_forms_file1_copy.csv")

Review file1

# A tibble: 4 x 5
  stu_id grade    q1    q2    q3
   <dbl> <dbl> <dbl> <dbl> <dbl>
1   1234     1     2     4     6
2   1235     2     1     5     6
3   1236     1    NA    12     4
4   1237     3     3     2     4

Review file1_copy

# A tibble: 4 x 5
  stu_id grade    q1    q3    q2
   <dbl> <dbl> <dbl> <dbl> <dbl>
1   1234     1     2     6     4
2   1235     2     1     6     5
3   1236     1    NA     4    12
4   1237     3     3     4     2

Now check if there are any differences in the two data frames.

dplyr::all_equal(file1, file1_copy)

[1] TRUE

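In this scenario we find that the files contain the same data, even though q2 and q3 appear in a different order in the copy (all_equal() ignores column order by default). Awesome! We can simply choose one to work with and archive the other.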
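For contrast, a stricter check such as base R's identical() does care about column order. Here is a minimal sketch with the same values, typed in by hand rather than read from the CSV files:

```r
file1 <- data.frame(
  stu_id = c(1234, 1235, 1236, 1237),
  grade  = c(1, 2, 1, 3),
  q1     = c(2, 1, NA, 3),
  q2     = c(4, 5, 12, 2),
  q3     = c(6, 6, 4, 4)
)

# Same data, but with q3 placed before q2
file1_copy <- file1[, c("stu_id", "grade", "q1", "q3", "q2")]

# Strict comparison: the column order differs, so this is FALSE
strict_check <- identical(file1, file1_copy)

# Put the copy's columns in file1's order, then compare again: TRUE
realigned_check <- identical(file1, file1_copy[, names(file1)])

strict_check
realigned_check
```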

But, what would the output look like if they were not identical?

Let’s read in another copy, file1_copy2.

# A tibble: 4 x 5
  stu_id grade    q1    q3    q2
  <chr>  <dbl> <dbl> <dbl> <dbl>
1 1234       1     2     6     4
2 1235       2     1     6     5
3 1236       1    NA     4    12
4 1237       3     3     4     2

It looks like the same data? Let’s check.

dplyr::all_equal(file1, file1_copy2)

Different types for column `stu_id`: double vs character.

We find out that the files are not identical! stu_id is now a character variable rather than numeric. We would need to make a decision regarding which file to use.
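To see every type mismatch at once, you can line up the column classes of the two files side by side. This is a base R sketch on a cut-down, hand-typed version of the data. The last step assumes the numeric version of stu_id is the one you want to keep, which is a decision you would need to make for your own project.

```r
file1 <- data.frame(
  stu_id = c(1234, 1235),
  grade  = c(1, 2)
)
file1_copy2 <- data.frame(
  stu_id = c("1234", "1235"),
  grade  = c(1, 2),
  stringsAsFactors = FALSE
)

# First class of each column, side by side
first_class <- function(x) class(x)[1]
types <- data.frame(
  column      = names(file1),
  file1       = vapply(file1, first_class, character(1)),
  file1_copy2 = vapply(file1_copy2[names(file1)], first_class, character(1)),
  row.names   = NULL
)

# Keep only the columns whose types disagree
mismatch <- types[types$file1 != types$file1_copy2, ]
mismatch

# Assumption: numeric is the intended type, so convert the copy and re-check
file1_copy2$stu_id <- as.numeric(file1_copy2$stu_id)
same_after_fix <- identical(file1, file1_copy2)
same_after_fix
```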

Let’s read in yet another copy to see what differences might exist, file1_copy3.

# A tibble: 4 x 5
  stu_id grade    q1    q3    q2
   <dbl> <dbl> <dbl> <dbl> <dbl>
1   1234     1     2     6     4
2   1235     2     1     6     5
3   1237     3     3     4     2
4   1236     1    NA     4    12

It looks like some of the rows and columns are out of order, which is okay, but is the data the same?

dplyr::all_equal(file1, file1_copy3)

[1] TRUE

It is! So again, we can choose which copy to work with and archive the other.
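all_equal() ignores row and column order by default, which is why it returns TRUE here. As far as I know, it has been deprecated since dplyr 1.1.0 in favor of reordering the rows and columns yourself and then calling base R's all.equal(). Here is a rough sketch of that approach. The normalize() helper is my own name, not part of dplyr, and the data is typed in by hand.

```r
file1 <- data.frame(
  stu_id = c(1234, 1235, 1236, 1237),
  grade  = c(1, 2, 1, 3),
  q1     = c(2, 1, NA, 3),
  q2     = c(4, 5, 12, 2),
  q3     = c(6, 6, 4, 4)
)
file1_copy3 <- data.frame(
  stu_id = c(1234, 1235, 1237, 1236),
  grade  = c(1, 2, 3, 1),
  q1     = c(2, 1, 3, NA),
  q3     = c(6, 6, 4, 4),
  q2     = c(4, 5, 2, 12)
)

# Hypothetical helper: put columns in a reference order and sort rows by a key
normalize <- function(df, col_order, key) {
  df <- df[, col_order, drop = FALSE]
  df <- df[order(df[[key]]), , drop = FALSE]
  rownames(df) <- NULL  # reset row names so they do not affect the comparison
  df
}

a <- normalize(file1, names(file1), "stu_id")
b <- normalize(file1_copy3, names(file1), "stu_id")

same_data <- isTRUE(all.equal(a, b))
same_data
```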

Let’s look at one last copy to see what the differences are, file1_copy4.

# A tibble: 4 x 5
  stu_id grade    q1    q2    q3
   <dbl> <dbl> <dbl> <dbl> <dbl>
1   1234     1     2     4     6
2   1235     2     1     5     6
3   1236     1    NA    12     4
4   1238     3     3     2     4

dplyr::all_equal(file1, file1_copy4)

[1] "- Rows in x but not in y: 4\n- Rows in y but not in x: 4\n"

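The output tells us that row 4 of file1 has no match in file1_copy4, and vice versa. Looking at the data, the stu_id in that row is 1237 in one file and 1238 in the other, so I would need to dig into which copy of the file is correct.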
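One way to pull out the unmatched rows themselves, rather than just their positions, is dplyr's anti_join(). A minimal sketch, assuming the two files are recreated by hand and that every column should match exactly:

```r
library(dplyr)

file1 <- data.frame(
  stu_id = c(1234, 1235, 1236, 1237),
  grade  = c(1, 2, 1, 3),
  q1     = c(2, 1, NA, 3),
  q2     = c(4, 5, 12, 2),
  q3     = c(6, 6, 4, 4)
)

# Same as file1 except for the stu_id in the last row
file1_copy4 <- file1
file1_copy4$stu_id[4] <- 1238

# Rows of file1 with no exact match in file1_copy4, and the reverse
only_in_file1 <- anti_join(file1, file1_copy4, by = names(file1))
only_in_copy4 <- anti_join(file1_copy4, file1, by = names(file1))

only_in_file1
only_in_copy4
```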

Side note: If these examples haven’t been a case study for why you should version and document your file changes, I’m not sure what is! :)
