Select columns using a data dictionary

Package: dplyr

Function: `select()`

1. Drop columns using a data dictionary

Review the data (d2)

# A tibble: 5 x 6
  extra1 extra2 extra3    id test_score1 test_score2
  <chr>   <dbl>  <dbl> <dbl>       <dbl>       <dbl>
1 a           1      2    10         205         500
2 b        -999      0    11         220         480
3 c           3   -999    12         250         540
4 d           4      0    13         217         499
5 <NA>       NA     NA    NA          NA          NA

Review our data dictionary.

A data dictionary often lists every variable that exists in the raw data. It is reasonable to add a column that records what you plan to do with each variable, such as whether to keep or drop it.

# A tibble: 6 x 3
  var_name    label                    keep 
  <chr>       <chr>                    <chr>
1 extra1      extra var from qualtrics no   
2 extra2      extra var from qualtrics no   
3 extra3      extra var from qualtrics no   
4 id          student id               yes  
5 test_score1 1st test score           yes  
6 test_score2 2nd test score           yes
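
To follow along, both objects can be recreated by hand. This is a minimal sketch that assumes the tibble package is installed. The original page does not show how `d2` and `dictionary` were built, so the construction below is an assumption that simply mirrors the printed output.

```r
library(tibble)

# Recreate the example data (values copied from the printout of d2)
d2 <- tribble(
  ~extra1, ~extra2, ~extra3, ~id, ~test_score1, ~test_score2,
  "a",        1,       2,     10,  205,          500,
  "b",     -999,       0,     11,  220,          480,
  "c",        3,    -999,     12,  250,          540,
  "d",        4,       0,     13,  217,          499,
  NA,        NA,      NA,     NA,   NA,           NA
)

# Recreate the data dictionary: one row per variable in d2
dictionary <- tribble(
  ~var_name,     ~label,                     ~keep,
  "extra1",      "extra var from qualtrics", "no",
  "extra2",      "extra var from qualtrics", "no",
  "extra3",      "extra var from qualtrics", "no",
  "id",          "student id",               "yes",
  "test_score1", "1st test score",           "yes",
  "test_score2", "2nd test score",           "yes"
)
```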

We can now create a character vector of all of the variables we wish to drop using our data dictionary.

Note: We use dplyr::filter() to keep only the dictionary rows for the variables we wish to drop (keep == "no").
Note: We use dplyr::pull() to extract the var_name column from our data dictionary as a character vector.

drop_vars <- dictionary %>%
  dplyr::filter(keep == "no") %>%
  dplyr::pull(var_name)

drop_vars

[1] "extra1" "extra2" "extra3"
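
Because the dictionary and the data are maintained separately, it can be worth confirming that every name in `drop_vars` actually exists in the data before selecting. The sketch below uses base R's `setdiff()`. `data_names` is a hypothetical stand-in for `names(d2)` so that the snippet runs on its own.

```r
# Names flagged for removal in the dictionary
drop_vars <- c("extra1", "extra2", "extra3")

# Hypothetical stand-in for names(d2)
data_names <- c("extra1", "extra2", "extra3", "id", "test_score1", "test_score2")

# Names listed in the dictionary but absent from the data
missing_vars <- setdiff(drop_vars, data_names)
missing_vars
```

An empty result means every variable marked for removal is present in the data.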

We can now use this character vector to select/remove variables from our dataset.

Note: We use tidyselect::all_of() to select variables whose names are stored in a character vector (an object in our environment rather than a column in the data).
Note: We add the - to denote that we want to remove variables.

d2 %>%
  dplyr::select(-all_of(drop_vars))

# A tibble: 5 x 3
     id test_score1 test_score2
  <dbl>       <dbl>       <dbl>
1    10         205         500
2    11         220         480
3    12         250         540
4    13         217         499
5    NA          NA          NA
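
The same dictionary can drive the opposite operation: selecting the columns flagged `keep == "yes"` instead of removing the ones flagged `"no"`. The sketch below assumes dplyr and tibble are installed and uses cut-down, hypothetical copies of the data and dictionary (`d2_small`, `dictionary_small`) so that it runs on its own. It also shows `any_of()`, which, unlike `all_of()`, does not raise an error when a name in the vector is missing from the data.

```r
library(dplyr)

# Hypothetical, cut-down versions of d2 and the dictionary
d2_small <- tibble::tibble(
  extra1      = c("a", "b"),
  id          = c(10, 11),
  test_score1 = c(205, 220)
)
dictionary_small <- tibble::tibble(
  var_name = c("extra1", "id", "test_score1"),
  keep     = c("no", "yes", "yes")
)

# Character vector of the variables we wish to keep
keep_vars <- dictionary_small %>%
  filter(keep == "yes") %>%
  pull(var_name)

# Select (rather than drop) using the dictionary
kept <- d2_small %>%
  select(all_of(keep_vars))

# any_of() skips names not found in the data; "not_a_column" is made up
dropped <- d2_small %>%
  select(-any_of(c("extra1", "not_a_column")))
```

Selecting by `keep == "yes"` may be the safer choice when new, unreviewed columns can appear in the raw data, since anything not listed in the dictionary is left out by default.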
