Package: dplyr


Function: select()


1. Drop columns (extra1, extra2, extra3)

Review the data (d1)

# A tibble: 5 x 5
  extra1 extra2 extra3    id test_score
  <chr>   <dbl>  <dbl> <dbl>      <dbl>
1 a           1      2    10        205
2 b        -999      0    11        220
3 c           3   -999    12        250
4 d           4      0    13        217
5 <NA>       NA     NA    NA         NA

Drop extra1, extra2, extra3

  • Note: Note the use of - to remove columns.
d1 %>% 
  dplyr::select(-c(extra1:extra3))
# A tibble: 5 x 2
     id test_score
  <dbl>      <dbl>
1    10        205
2    11        220
3    12        250
4    13        217
5    NA         NA

You can also drop columns based on shared column attributes using tidyselect selection helpers such as starts_with().

d1 %>% 
  dplyr::select(-(tidyselect::starts_with("extra")))
# A tibble: 5 x 2
     id test_score
  <dbl>      <dbl>
1    10        205
2    11        220
3    12        250
4    13        217
5    NA         NA

2. Keep selected columns (id, test_score1, test_score2)

Review the data (d2)

# A tibble: 5 x 6
  extra1 extra2 extra3    id test_score1 test_score2
  <chr>   <dbl>  <dbl> <dbl>       <dbl>       <dbl>
1 a           1      2    10         205         500
2 b        -999      0    11         220         480
3 c           3   -999    12         250         540
4 d           4      0    13         217         499
5 <NA>       NA     NA    NA          NA          NA

Keep id, test_score1 and test_score2

d2 %>% 
  dplyr::select(id:test_score2)
# A tibble: 5 x 3
     id test_score1 test_score2
  <dbl>       <dbl>       <dbl>
1    10         205         500
2    11         220         480
3    12         250         540
4    13         217         499
5    NA          NA          NA

You can also keep columns based on shared column attributes using tidyselect selection helpers such as contains().

d2 %>% 
  dplyr::select(id, tidyselect::contains("test"))
# A tibble: 5 x 3
     id test_score1 test_score2
  <dbl>       <dbl>       <dbl>
1    10         205         500
2    11         220         480
3    12         250         540
4    13         217         499
5    NA          NA          NA

You could also write this with both id and test in the tidyselect::contains selection helper. Note the use of c() to create a character vector.

d2 %>% 
  dplyr::select(tidyselect::contains(c("id", "test")))
# A tibble: 5 x 3
     id test_score1 test_score2
  <dbl>       <dbl>       <dbl>
1    10         205         500
2    11         220         480
3    12         250         540
4    13         217         499
5    NA          NA          NA

Another way you can write this is using the selection helper matches() which matches based on a regex operator rather than a literal string like contains().

d2 %>% 
  dplyr::select(tidyselect::matches("id|test"))
# A tibble: 5 x 3
     id test_score1 test_score2
  <dbl>       <dbl>       <dbl>
1    10         205         500
2    11         220         480
3    12         250         540
4    13         217         499
5    NA          NA          NA

3. Keep only numeric columns

Review the data (d2)

# A tibble: 5 x 6
  extra1 extra2 extra3    id test_score1 test_score2
  <chr>   <dbl>  <dbl> <dbl>       <dbl>       <dbl>
1 a           1      2    10         205         500
2 b        -999      0    11         220         480
3 c           3   -999    12         250         540
4 d           4      0    13         217         499
5 <NA>       NA     NA    NA          NA          NA

Keep numeric columns by adding the selection helper where().

  • Note: You must wrap is.numeric, a predicate function (returns a true/false), in the tidyselect selection helper where().
d2 %>% 
  dplyr::select(where(is.numeric))
# A tibble: 5 x 5
  extra2 extra3    id test_score1 test_score2
   <dbl>  <dbl> <dbl>       <dbl>       <dbl>
1      1      2    10         205         500
2   -999      0    11         220         480
3      3   -999    12         250         540
4      4      0    13         217         499
5     NA     NA    NA          NA          NA

And if I wanted all numeric columns except extra2, I could simply add this to my statement with the - operator.

d2 %>% 
  dplyr::select(where(is.numeric), -extra2)
# A tibble: 5 x 4
  extra3    id test_score1 test_score2
   <dbl> <dbl>       <dbl>       <dbl>
1      2    10         205         500
2      0    11         220         480
3   -999    12         250         540
4      0    13         217         499
5     NA    NA          NA          NA

4. Select all columns except extra1

Review the data (d4)

# A tibble: 5 x 5
  extra1 extra2 extra3 stu_id test_score
  <chr>  <lgl>   <dbl>  <dbl>      <dbl>
1 a      NA          2     10        205
2 b      NA          0     11        220
3 c      NA       -999     12        250
4 d      NA          0     13        217
5 e      NA         NA     NA         NA

Similar to above, I could use the - to remove columns

d4 %>% 
  dplyr::select(-extra1)
# A tibble: 5 x 4
  extra2 extra3 stu_id test_score
  <lgl>   <dbl>  <dbl>      <dbl>
1 NA          2     10        205
2 NA          0     11        220
3 NA       -999     12        250
4 NA          0     13        217
5 NA         NA     NA         NA

Another way I could select all columns except extra1 is to select extra2 through all columns until the last column, using the tidyselect selection helper last_col(). This is helpful if you don’t always know what the last column is but you know you want to keep everything until the end.

d4 %>% 
  dplyr::select(extra2:last_col())
# A tibble: 5 x 4
  extra2 extra3 stu_id test_score
  <lgl>   <dbl>  <dbl>      <dbl>
1 NA          2     10        205
2 NA          0     11        220
3 NA       -999     12        250
4 NA          0     13        217
5 NA         NA     NA         NA

5. Keep selected columns (stress1_reverse, anxiety1_reverse)

Review the data (d7)

# A tibble: 3 x 6
  stress1 stress1_reverse anxiety1 anxiety1_reverse extra1 extra1_reverse
    <dbl>           <dbl>    <dbl>            <dbl>  <dbl>          <dbl>
1       1               4        4                1      2              4
2       2               3        2                3      5              1
3       4               1        2                3      3              2

Because there are both reversed and non-reversed versions of these variables, we will need to be careful to not select the non-reversed versions.

Currently this is a small number of variables to select and we could easily just list these variables out in a select statement. But if there were 40 of these variables, we may want to be more strategic about selecting these variables.

We could add 2 selection criteria to select our variables that start with “stress” or “anxiety” and end in “reverse” in order to grab all of our variables of interest.

d7 %>%
  dplyr::select(tidyselect::starts_with(c("stress", "anxiety")) & tidyselect::ends_with("reverse"))
# A tibble: 3 x 2
  stress1_reverse anxiety1_reverse
            <dbl>            <dbl>
1               4                1
2               3                3
3               1                3

Return to Select Variables