Package: tidyr


Function: replace_na()


1. Replace NA values for one variable (Var2)

Review the data (d18)

# A tibble: 3 x 3
   Var1  Var2  Var3
  <dbl> <dbl> <dbl>
1    NA    NA     3
2     1    NA     4
3     1     0     2

Recode NA to 0

d18 %>% 
  dplyr::mutate(Var2 = tidyr::replace_na(Var2, replace = 0))
# A tibble: 3 x 3
   Var1  Var2  Var3
  <dbl> <dbl> <dbl>
1    NA     0     3
2     1     0     4
3     1     0     2

2. Replace NA for multiple variables (Var1 and Var2) with the same NA value

Review the data (d18)

# A tibble: 3 x 3
   Var1  Var2  Var3
  <dbl> <dbl> <dbl>
1    NA    NA     3
2     1    NA     4
3     1     0     2

Recode NA for Var1 and Var2 to 0

  • Note: We are using dplyr::across() to apply a transformation to multiple columns.
d18 %>%
  dplyr::mutate(dplyr::across(Var1:Var2, ~ tidyr::replace_na(., 0)))
# A tibble: 3 x 3
   Var1  Var2  Var3
  <dbl> <dbl> <dbl>
1     0     0     3
2     1     0     4
3     1     0     2

3. Replace NA for multiple variables (Var2 and Var3) with the same NA value using your project data dictionary

Review the data (d18)

# A tibble: 3 x 3
   Var1  Var2  Var3
  <dbl> <dbl> <dbl>
1    NA    NA     3
2     1    NA     4
3     1     0     2

It’s possible that you may have built a data dictionary before collecting data and you may have added a column to your dictionary that helps you identify which columns need to be recoded in a particular way.

Let’s review our very simple data dictionary.

dict
# A tibble: 3 x 3
  var_name label                    var_type
  <chr>    <chr>                    <chr>   
1 Var1     Select all: Option 1     binary  
2 Var2     Select all: Option 2     binary  
3 Var3     Question wording of Var3 likert  

Here we see our data dictionary has a column called “var_type” which tells us which variables are binary (or coded 0 and 1).

Oftentimes when you download “select all” questions from a survey platform, each option is downloaded with a 1 (if selected) and NA (if not selected). There are times when this format makes sense (say for visualizations). However, for analysis purposes we probably want those variables to show 1s (Yes) and 0s (No) instead of 1s and NAs.

Instead of listing out all of the variables we want to recode, we can pull a vector of variable names using our dictionary.

vars <- dict %>%
  dplyr::filter(var_type == "binary") %>%
  dplyr::select(var_name) %>%
  dplyr::pull()

We can now use this vector in our tidyr::replace_na() function.

  • Note: We use tidyselect::all_of() to denote that we are selecting variable names from an external vector (“vars”).
d18 %>%
  dplyr::mutate(dplyr::across(tidyselect::all_of(vars), ~tidyr::replace_na(., 0)))
# A tibble: 3 x 3
   Var1  Var2  Var3
  <dbl> <dbl> <dbl>
1     0     0     3
2     1     0     4
3     1     0     2

4. Replace NA for multiple variables of the same class with the same NA value (Var2 and Var3)

Review the data (d1)

# A tibble: 3 x 3
  Var1   Var2  Var3
  <chr> <dbl> <dbl>
1 a         2   3.6
2 b        NA   8.5
3 c         3  NA  

Recode NA for Var2 and Var3 to -999

  • Note: You must wrap is.numeric, a predicate function (returns a true/false), in the tidyselect selection helper where().
d1 %>% 
  dplyr::mutate(dplyr::across(where(is.numeric), ~tidyr::replace_na(., -999)))
# A tibble: 3 x 3
  Var1   Var2   Var3
  <chr> <dbl>  <dbl>
1 a         2    3.6
2 b      -999    8.5
3 c         3 -999  

5. Replace NA Var2 and Var3 with different values

Review the data (d1)

# A tibble: 3 x 3
  Var1   Var2  Var3
  <chr> <dbl> <dbl>
1 a         2   3.6
2 b        NA   8.5
3 c         3  NA  

Replace NA with -999 for Var2 and replace NA with 0 for Var3.

Here we can use base::list() to list the variables we want to transform. Because we are not replacing NAs in a vector, we do not have to use dplyr::mutate() to do our transformation.

d1 %>% 
  tidyr::replace_na(list(Var2 = -999, Var3 = 0))
# A tibble: 3 x 3
  Var1   Var2  Var3
  <chr> <dbl> <dbl>
1 a         2   3.6
2 b      -999   8.5
3 c         3   0  

Package: dplyr


Function: na_if()


1. Replace a value for the entire dataset with NA

Review the data (d3)

# A tibble: 3 x 3
  stu_id    q1    q2
   <dbl> <dbl> <dbl>
1      1     1     1
2      2   -99     1
3      3   -99   -99

Replace -999 for the entire dataset with NA

d24 %>% 
  mutate(across(everything(), ~na_if(., -99)))
# A tibble: 3 x 3
  stu_id    q1    q2
   <dbl> <dbl> <dbl>
1      1     1     1
2      2    NA     1
3      3    NA    NA

2. Replace a value for select variables (Var2 and Var3) with NA

Review the data (d3)

# A tibble: 3 x 3
  Var1   Var2  Var3
  <chr> <dbl> <dbl>
1 a         1     2
2 b      -999     0
3 c         3  -999

Replace -999 for for Var2 and Var3 with NA

d3 %>% 
  dplyr::mutate(dplyr::across(Var2:Var3, ~dplyr::na_if(., -999)))
# A tibble: 3 x 3
  Var1   Var2  Var3
  <chr> <dbl> <dbl>
1 a         1     2
2 b        NA     0
3 c         3    NA

3. Replace a value for multiple variables of the same class (in this case numeric variables) with NA

Review the data (d3)

# A tibble: 3 x 3
  Var1   Var2  Var3
  <chr> <dbl> <dbl>
1 a         1     2
2 b      -999     0
3 c         3  -999

Replace -999 for numeric variables with NA

d3 %>%
  dplyr::mutate(dplyr::across(where(is.numeric), ~ dplyr::na_if(.,-999)))
# A tibble: 3 x 3
  Var1   Var2  Var3
  <chr> <dbl> <dbl>
1 a         1     2
2 b        NA     0
3 c         3    NA

Note: A potentially preferred method to using ~ (which comes from {rlang}), is to instead use \(x) which is native to R 4.1 and up.

d3 %>%
  dplyr::mutate(dplyr::across(where(is.numeric), \(x) dplyr::na_if(x,-999)))
# A tibble: 3 x 3
  Var1   Var2  Var3
  <chr> <dbl> <dbl>
1 a         1     2
2 b        NA     0
3 c         3    NA

Return to Recode