Package: dplyr


Function: rows_update()


1. Fill in missing values using a second dataset

Review the data (d5)

# A tibble: 5 x 4
     id grade item1 item2
  <dbl> <dbl> <dbl> <dbl>
1    22     3     4     5
2    23     3     5     4
3    24     4     2     1
4    25     5    NA    NA
5    26     4    NA    NA

Notice we are missing data for id 25 and 26, item1 and item2

Luckily that data was collected later on another form.

Let’s review that form (extra_form)

  • Note: Your new form does not need to have all of the same variables as your original form. Only variables that match across both forms will update. Your new form CANNOT, however, have MORE variables than the original form. This will cause an error.
# A tibble: 2 x 4
     id grade item1 item2
  <dbl> <dbl> <dbl> <dbl>
1    25     5     2     3
2    26     4     4     5

We can use dplyr::rows_update() to fill in the missing data on our original form with our new form data.

  • Note: Add your identifying variable to the by argument to denote the unique identifier linking key.
d5 %>%
  dplyr::rows_update(extra_form, by = "id")
# A tibble: 5 x 4
     id grade item1 item2
  <dbl> <dbl> <dbl> <dbl>
1    22     3     4     5
2    23     3     5     4
3    24     4     2     1
4    25     5     2     3
5    26     4     4     5

What would happen if our extra_form had different values than we already have for existing data though?

For example in this case where we have a new form (extra_form2), where the grade level has changed for id 25 (from 5 -> NA) and id 26 (from 4 -> 5).

# A tibble: 2 x 4
     id grade item1 item2
  <dbl> <dbl> <dbl> <dbl>
1    25    NA     2     3
2    26     5     4     5

We see that if we use this new form with updated information for existing variables, that all of our data will update based on the new data, not just the missing data. See that grade level for id 25 has changed from 5 to NA and grade level for id 26 has changed from 4 to 5.

We would need to make a decision about which values for grade level are more accurate, the old values or the new values.

d5 %>%
  dplyr::rows_update(extra_form2, by = "id")
# A tibble: 5 x 4
     id grade item1 item2
  <dbl> <dbl> <dbl> <dbl>
1    22     3     4     5
2    23     3     5     4
3    24     4     2     1
4    25    NA     2     3
5    26     5     4     5

If we do not want to overwrite all data with new information, we can use another function, dplyr::rows_patch() which only overwrites NA values.

d5 %>%
  dplyr::rows_patch(extra_form2, by = "id")
# A tibble: 5 x 4
     id grade item1 item2
  <dbl> <dbl> <dbl> <dbl>
1    22     3     4     5
2    23     3     5     4
3    24     4     2     1
4    25     5     2     3
5    26     4     4     5

Return to Complete