select()1. Drop columns (extra1, extra2,
extra3)
Review the data (d1)
# A tibble: 5 x 5
extra1 extra2 extra3 id test_score
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 1 2 10 205
2 b -999 0 11 220
3 c 3 -999 12 250
4 d 4 0 13 217
5 <NA> NA NA NA NA
Drop extra1, extra2,
extra3
d1 %>%
dplyr::select(-c(extra1:extra3))
# A tibble: 5 x 2
id test_score
<dbl> <dbl>
1 10 205
2 11 220
3 12 250
4 13 217
5 NA NA
You can also drop columns based on shared column attributes using
tidyselect selection helpers such as
starts_with().
d1 %>%
dplyr::select(-(tidyselect::starts_with("extra")))
# A tibble: 5 x 2
id test_score
<dbl> <dbl>
1 10 205
2 11 220
3 12 250
4 13 217
5 NA NA
2. Keep selected columns (id,
test_score1, test_score2)
Review the data (d2)
# A tibble: 5 x 6
extra1 extra2 extra3 id test_score1 test_score2
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 1 2 10 205 500
2 b -999 0 11 220 480
3 c 3 -999 12 250 540
4 d 4 0 13 217 499
5 <NA> NA NA NA NA NA
Keep id, test_score1 and
test_score2
d2 %>%
dplyr::select(id:test_score2)
# A tibble: 5 x 3
id test_score1 test_score2
<dbl> <dbl> <dbl>
1 10 205 500
2 11 220 480
3 12 250 540
4 13 217 499
5 NA NA NA
You can also keep columns based on shared column attributes using
tidyselect selection helpers such as
contains().
d2 %>%
dplyr::select(id, tidyselect::contains("test"))
# A tibble: 5 x 3
id test_score1 test_score2
<dbl> <dbl> <dbl>
1 10 205 500
2 11 220 480
3 12 250 540
4 13 217 499
5 NA NA NA
You could also write this with both id and test in the
tidyselect::contains selection helper. Note the use of
c() to create a character vector.
d2 %>%
dplyr::select(tidyselect::contains(c("id", "test")))
# A tibble: 5 x 3
id test_score1 test_score2
<dbl> <dbl> <dbl>
1 10 205 500
2 11 220 480
3 12 250 540
4 13 217 499
5 NA NA NA
Another way you can write this is using the selection helper
matches() which matches based on a regex operator rather
than a literal string like contains().
d2 %>%
dplyr::select(tidyselect::matches("id|test"))
# A tibble: 5 x 3
id test_score1 test_score2
<dbl> <dbl> <dbl>
1 10 205 500
2 11 220 480
3 12 250 540
4 13 217 499
5 NA NA NA
3. Keep only numeric columns
Review the data (d2)
# A tibble: 5 x 6
extra1 extra2 extra3 id test_score1 test_score2
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 1 2 10 205 500
2 b -999 0 11 220 480
3 c 3 -999 12 250 540
4 d 4 0 13 217 499
5 <NA> NA NA NA NA NA
Keep numeric columns by adding the selection helper
where().
tidyselect selection helper
where().d2 %>%
dplyr::select(where(is.numeric))
# A tibble: 5 x 5
extra2 extra3 id test_score1 test_score2
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 10 205 500
2 -999 0 11 220 480
3 3 -999 12 250 540
4 4 0 13 217 499
5 NA NA NA NA NA
And if I wanted all numeric columns except extra2, I
could simply add this to my statement with the -
operator.
d2 %>%
dplyr::select(where(is.numeric), -extra2)
# A tibble: 5 x 4
extra3 id test_score1 test_score2
<dbl> <dbl> <dbl> <dbl>
1 2 10 205 500
2 0 11 220 480
3 -999 12 250 540
4 0 13 217 499
5 NA NA NA NA
4. Select all columns except extra1
Review the data (d4)
# A tibble: 5 x 5
extra1 extra2 extra3 stu_id test_score
<chr> <lgl> <dbl> <dbl> <dbl>
1 a NA 2 10 205
2 b NA 0 11 220
3 c NA -999 12 250
4 d NA 0 13 217
5 e NA NA NA NA
Similar to above, I could use the - to remove columns
d4 %>%
dplyr::select(-extra1)
# A tibble: 5 x 4
extra2 extra3 stu_id test_score
<lgl> <dbl> <dbl> <dbl>
1 NA 2 10 205
2 NA 0 11 220
3 NA -999 12 250
4 NA 0 13 217
5 NA NA NA NA
Another way I could select all columns except extra1 is
to select extra2 through all columns until the last column,
using the tidyselect selection helper
last_col(). This is helpful if you don’t always know what
the last column is but you know you want to keep everything until the
end.
d4 %>%
dplyr::select(extra2:last_col())
# A tibble: 5 x 4
extra2 extra3 stu_id test_score
<lgl> <dbl> <dbl> <dbl>
1 NA 2 10 205
2 NA 0 11 220
3 NA -999 12 250
4 NA 0 13 217
5 NA NA NA NA
5. Keep selected columns (stress1_reverse,
anxiety1_reverse)
Review the data (d7)
# A tibble: 3 x 6
stress1 stress1_reverse anxiety1 anxiety1_reverse extra1 extra1_reverse
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 4 1 2 4
2 2 3 2 3 5 1
3 4 1 2 3 3 2
Because there are both reversed and non-reversed versions of these variables, we will need to be careful to not select the non-reversed versions.
Currently this is a small number of variables to select and we could easily just list these variables out in a select statement. But if there were 40 of these variables, we may want to be more strategic about selecting these variables.
We could add 2 selection criteria to select our variables that start with “stress” or “anxiety” and end in “reverse” in order to grab all of our variables of interest.
d7 %>%
dplyr::select(tidyselect::starts_with(c("stress", "anxiety")) & tidyselect::ends_with("reverse"))
# A tibble: 3 x 2
stress1_reverse anxiety1_reverse
<dbl> <dbl>
1 4 1
2 3 3
3 1 3
Return to Select Variables