Complete wide data using existing values

Package: dplyr

Function: `coalesce()`

1. Create a complete 20-21 gender variable (gender_2021) with no missing data

Review the data (d3)

# A tibble: 5 x 5
   s_id gender_1718 gender_1819 gender_1920 gender_2021
  <dbl> <chr>       <chr>       <chr>       <chr>      
1   123 male        male        <NA>        male       
2   145 non-binary  non-binary  non-binary  <NA>       
3   150 female      female      female      female     
4   164 male        <NA>        male        <NA>       
5   170 female      male        <NA>        male

Fill in missing 20-21 self-reported gender information from the most recently available year of data.

Note: We are creating a new variable (gender_2021_new) by using dplyr::mutate().
Note: The values will be filled in the order that variables are presented. In the example below, if data is not available for 2021, then the function will look for data in 1920, then 1819, then 1718. And the value will be filled by the first available data in that order.

d3 %>% 
  dplyr::mutate(gender_2021_new = dplyr::coalesce(gender_2021, gender_1920, gender_1819, gender_1718))

# A tibble: 5 x 6
   s_id gender_1718 gender_1819 gender_1920 gender_2021 gender_2021_new
  <dbl> <chr>       <chr>       <chr>       <chr>       <chr>          
1   123 male        male        <NA>        male        male           
2   145 non-binary  non-binary  non-binary  <NA>        non-binary     
3   150 female      female      female      female      female         
4   164 male        <NA>        male        <NA>        male           
5   170 female      male        <NA>        male        male

However, imagine you have 20 years worth of gender variables and you do not want to list all of those variables out. Instead you can use the splice operator (!!!) to select all variables that meet a certain criteria. The coalescing happens by variable order, so if you wanted to use the most recent values and fill from there, you would need to reorder your variables first.

d3 <- d3 %>%
  dplyr::relocate(s_id, gender_2021, gender_1920, gender_1819)

d3 %>% 
    dplyr::mutate(gender_2021_new = dplyr::coalesce(!!! dplyr::select(., contains("gender"))))

# A tibble: 5 x 6
   s_id gender_2021 gender_1920 gender_1819 gender_1718 gender_2021_new
  <dbl> <chr>       <chr>       <chr>       <chr>       <chr>          
1   123 male        <NA>        male        male        male           
2   145 <NA>        non-binary  non-binary  non-binary  non-binary     
3   150 female      female      female      female      female         
4   164 <NA>        male        <NA>        male        male           
5   170 male        <NA>        male        female      male

Note: Coalesce works when your missing values are denoted as actual NAs (NA, NA_character, etc.). If you are working with a character variable where blanks are not denoted as NAs but are instead shown as blanks or ” “, then coalesce will not work. It will assume the” ” are actual values. In this case you can either first recode all ” ” to NA_character or you can use something like dplyr::case_when() to fill empty values (similar to what we did in the Recode section). If you are planning to coalesce using many variables, I think recoding blanks to NA_character first would be your best option.

Here is an example below. Our new data has blanks rather than NAs.

d6

# A tibble: 5 x 5
   s_id gender_1718 gender_1819  gender_1920  gender_2021
  <dbl> <chr>       <chr>        <chr>        <chr>      
1   123 male        "male"       ""           "male"     
2   145 non-binary  "non-binary" "non-binary" ""         
3   150 female      "female"     "female"     "female"   
4   164 male        ""           "male"       ""         
5   170 female      "male"       ""           "male"

I can select all gender variables that I want to recode using dplyr::across() along with tidyselect::contains() and then recode all “” to NAs by using dplyr::na_if(). Now that my blank values are coded to NA, coalesce will work fine.

d6 %>%
  dplyr::mutate(dplyr::across(tidyselect::contains("gender"), 
                              ~ dplyr::na_if(., ""))) %>%
  dplyr::mutate(gender_2021_new = dplyr::coalesce(gender_2021, gender_1920, gender_1819, gender_1718))

# A tibble: 5 x 6
   s_id gender_1718 gender_1819 gender_1920 gender_2021 gender_2021_new
  <dbl> <chr>       <chr>       <chr>       <chr>       <chr>          
1   123 male        male        <NA>        male        male           
2   145 non-binary  non-binary  non-binary  <NA>        non-binary     
3   150 female      female      female      female      female         
4   164 male        <NA>        male        <NA>        male           
5   170 female      male        <NA>        male        male

Return to Complete

Complete wide data using existing values

Package: dplyr

Function: coalesce()

Function: `coalesce()`