sample()
1. Add a new column (study_id) to the data
Review the data (d5)
# A tibble: 3 x 6
f_name l_name item1 item2 item3 item4
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 randi ivana 3 5 3 NA
2 nellie lorie 3 5 1 5
3 mike skuld 3 1 3 5
To de-identify my data, I want to assign each participant a random, unique numeric identifier (with a value between 400 and 500). At some point I will remove first and last name from my data, and my de-identified data will contain only a study ID.
I can first create my new variable using the
dplyr::mutate() function.
Next, to assign my random numbers, I can use the
base::sample() function to sample values between 400 and
500. In the size argument I denote that I need 3 of those values
(one for each of my 3 cases). This function has another argument,
replace, which defaults to FALSE. That means values are drawn
without replacement, so all numbers will be unique.
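As a quick check of how replace behaves, the base R sketch below draws three values with the default replace = FALSE and confirms that none repeat. It also shows that base::sample() refuses to draw more unique values than the range contains. The object names (draw, too_many) are only illustrative.

```r
# Draw 3 values from 400-500 without replacement (the default)
draw <- base::sample(400:500, size = 3)

# anyDuplicated() returns 0 when every value is unique
base::anyDuplicated(draw) == 0

# Asking for more unique values than the range holds is an error
# when replace = FALSE, so we catch it here instead of stopping
too_many <- tryCatch(
  base::sample(400:402, size = 5),
  error = function(e) "error"
)
too_many
```

In practice this means the range of possible IDs has to be at least as large as the number of participants.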
Before I calculate my new variable, I need to do one thing: call
base::set.seed() with any number I choose as my seed.
Setting this seed ensures that if I run this code again at a
later time (on the same version of R), I will get the same randomly
generated numbers each time. Very important!
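A small base R sketch of what the seed does: resetting the same seed before each draw reproduces the draw exactly. The names first_draw and second_draw are made up for this illustration.

```r
# Same seed, same call: the two draws are identical
base::set.seed(3456)
first_draw <- base::sample(400:500, size = 3)

base::set.seed(3456)
second_draw <- base::sample(400:500, size = 3)

base::identical(first_draw, second_draw)
```

Without the second set.seed() call, the second draw would continue from wherever the random number generator left off and would generally differ from the first.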
Last, I have added the dplyr::mutate() argument
.after to denote that I want my new variable to be located
after the variable “l_name”.
base::set.seed(3456)
d5 %>%
  dplyr::mutate(study_id = base::sample(400:500, size = 3), .after = l_name)
# A tibble: 3 x 7
f_name l_name study_id item1 item2 item3 item4
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 randi ivana 481 3 5 3 NA
2 nellie lorie 409 3 5 1 5
3 mike skuld 405 3 1 3 5
Instead of typing in the exact size argument, I could also
use dplyr::n() to get the number of cases.
base::set.seed(3456)
d5 %>%
  dplyr::mutate(study_id = base::sample(400:500, size = dplyr::n()), .after = l_name)
# A tibble: 3 x 7
f_name l_name study_id item1 item2 item3 item4
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 randi ivana 481 3 5 3 NA
2 nellie lorie 409 3 5 1 5
3 mike skuld 405 3 1 3 5
2. Add study_id values for newly added participants
Review the new data (d6)
# A tibble: 2 x 6
f_name l_name item1 item2 item3 item4
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 oscar lewis 3 1 3 4
2 nate purdy 2 5 1 1
First we need to get a vector of the study_id values
that we have already used. This assumes the study_id column created
in step 1 was assigned back to d5.
used_ids <- d5 %>%
  dplyr::pull(study_id)
Now we can create a vector of available study_id values.
new_ids <- tibble::tibble(study_id = 400:500) %>%
  dplyr::filter(!study_id %in% used_ids) %>%
  dplyr::pull(study_id)
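As an alternative sketch, the same vector of unused IDs can be built in base R with base::setdiff(), which returns the values in the first vector that are not in the second. Here used_ids is hard-coded from the IDs printed in step 1 so the example runs on its own.

```r
# IDs already assigned in step 1 (copied from the printed output)
used_ids <- c(481L, 409L, 405L)

# All candidate IDs minus the ones already used
new_ids <- base::setdiff(400:500, used_ids)

length(new_ids)            # 101 candidates minus 3 used
any(new_ids %in% used_ids) # none of the used IDs remain
```

Either approach gives the same pool to sample from; the tibble version simply stays within the dplyr pipeline style used elsewhere on this page.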
Now we can randomly assign each of our new participants an ID that doesn’t overlap with the IDs we’ve already used.
base::set.seed(1234)
d6 <- d6 %>%
  dplyr::mutate(study_id = base::sample(new_ids, size = dplyr::n()), .after = l_name)
d6
# A tibble: 2 x 7
f_name l_name study_id item1 item2 item3 item4
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 oscar lewis 429 3 1 3 4
2 nate purdy 482 2 5 1 1
Now we can append the two datasets to get our full sample.
dplyr::bind_rows(d5, d6)
# A tibble: 5 x 7
f_name l_name study_id item1 item2 item3 item4
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 randi ivana 481 3 5 3 NA
2 nellie lorie 409 3 5 1 5
3 mike skuld 405 3 1 3 5
4 oscar lewis 429 3 1 3 4
5 nate purdy 482 2 5 1 1
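After appending, it can be worth confirming that no study_id appears twice in the combined file. The sketch below rebuilds trimmed-down versions of d5 and d6 from the printed output (names and IDs only) and runs that check; the object name combined is an assumption for this example.

```r
library(dplyr)

# Trimmed copies of d5 and d6, with IDs taken from the printed output
d5 <- tibble::tibble(
  f_name = c("randi", "nellie", "mike"),
  l_name = c("ivana", "lorie", "skuld"),
  study_id = c(481L, 409L, 405L)
)
d6 <- tibble::tibble(
  f_name = c("oscar", "nate"),
  l_name = c("lewis", "purdy"),
  study_id = c(429L, 482L)
)

combined <- dplyr::bind_rows(d5, d6)

# anyDuplicated() returns 0 when there are no repeated IDs
base::anyDuplicated(combined$study_id) == 0
nrow(combined)
```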
3. Add a new column (study_id) to the data but keep the ID the same within groups
Review the data (d7)
# A tibble: 8 x 3
email f_name l_name
<chr> <chr> <chr>
1 crystal@email.com crystal harris
2 randy@email.com randy lewis
3 crystalharris@email.com crystal harris
4 charris@email.com Crystal harris
5 andy@email.com Andrew lemon
6 vince@email.com Vince lewis
7 kristin@email.com kristin black
8 andrewl@email.com andrew lemon
Here I have duplicate names in my list, but I’m not sure which email is correct, so I want to assign all duplicates the same ID.
First I will standardize capitalization in the names. Then I can assign random IDs that differ across groups and are the same within groups.
Here I am using tidyr::nest() rather than
dplyr::group_by() because with the latter, sample() runs
separately within each group: rows in the same group would get different
IDs, and IDs could repeat across groups.
base::set.seed(2489)
d7 %>%
  dplyr::mutate(dplyr::across(c(f_name, l_name), stringr::str_to_title)) %>%
  tidyr::nest(data = c(-l_name, -f_name)) %>%
  dplyr::mutate(id = base::sample(200:205, size = dplyr::n(), replace = FALSE)) %>%
  tidyr::unnest(data)
# A tibble: 8 x 4
f_name l_name email id
<chr> <chr> <chr> <int>
1 Crystal Harris crystal@email.com 201
2 Crystal Harris crystalharris@email.com 201
3 Crystal Harris charris@email.com 201
4 Randy Lewis randy@email.com 200
5 Andrew Lemon andy@email.com 203
6 Andrew Lemon andrewl@email.com 203
7 Vince Lewis vince@email.com 204
8 Kristin Black kristin@email.com 202
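To see why dplyr::group_by() is not a drop-in replacement here, the sketch below runs the same sampling on grouped data and counts the IDs each person receives. The d7 values are retyped from the printed data, and grouped and per_person are hypothetical object names.

```r
library(dplyr)

# d7 retyped from the printed tibble above
d7 <- tibble::tibble(
  email = c("crystal@email.com", "randy@email.com", "crystalharris@email.com",
            "charris@email.com", "andy@email.com", "vince@email.com",
            "kristin@email.com", "andrewl@email.com"),
  f_name = c("crystal", "randy", "crystal", "Crystal",
             "Andrew", "Vince", "kristin", "andrew"),
  l_name = c("harris", "lewis", "harris", "harris",
             "lemon", "lewis", "black", "lemon")
)

# With group_by(), sample() is called once per group and n() is the group size
base::set.seed(2489)
grouped <- d7 %>%
  dplyr::mutate(dplyr::across(c(f_name, l_name), stringr::str_to_title)) %>%
  dplyr::group_by(f_name, l_name) %>%
  dplyr::mutate(id = base::sample(200:205, size = dplyr::n())) %>%
  dplyr::ungroup()

# Each person ends up with as many different IDs as they have rows
per_person <- grouped %>%
  dplyr::group_by(f_name, l_name) %>%
  dplyr::summarise(n_rows = dplyr::n(), n_ids = dplyr::n_distinct(id), .groups = "drop")
per_person
```

Because each group is sampled independently, nothing prevents two different people from drawing the same ID either, which is the opposite of what is needed here.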
Similarly, we could assign IDs to the unique names and then join those IDs to the full file.
base::set.seed(2489)
# One row per unique (standardized) name, each with its own random ID
ids <- d7 %>%
  dplyr::mutate(dplyr::across(c(f_name, l_name), stringr::str_to_title)) %>%
  dplyr::distinct(f_name, l_name) %>%
  dplyr::mutate(id = base::sample(200:205, size = dplyr::n()))
# Join the IDs back to the full file by name
d7 %>%
  dplyr::mutate(dplyr::across(c(f_name, l_name), stringr::str_to_title)) %>%
  dplyr::left_join(ids, by = c("f_name", "l_name"))
# A tibble: 8 x 4
email f_name l_name id
<chr> <chr> <chr> <int>
1 crystal@email.com Crystal Harris 201
2 randy@email.com Randy Lewis 200
3 crystalharris@email.com Crystal Harris 201
4 charris@email.com Crystal Harris 201
5 andy@email.com Andrew Lemon 203
6 vince@email.com Vince Lewis 204
7 kristin@email.com Kristin Black 202
8 andrewl@email.com Andrew Lemon 203
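Whichever approach is used, a quick sanity check is that every row received an ID, every person has exactly one ID, and no ID is shared by two people. The sketch below repeats the join approach on the names only (already title-cased, retyped from the output above) and runs those checks; d7_names, joined, and check are illustrative names.

```r
library(dplyr)

# Standardized names, retyped from the printed output
d7_names <- tibble::tibble(
  f_name = c("Crystal", "Randy", "Crystal", "Crystal",
             "Andrew", "Vince", "Kristin", "Andrew"),
  l_name = c("Harris", "Lewis", "Harris", "Harris",
             "Lemon", "Lewis", "Black", "Lemon")
)

base::set.seed(2489)
ids <- d7_names %>%
  dplyr::distinct(f_name, l_name) %>%
  dplyr::mutate(id = base::sample(200:205, size = dplyr::n()))

joined <- d7_names %>%
  dplyr::left_join(ids, by = c("f_name", "l_name"))

# Unique person-ID pairs after the join
check <- joined %>%
  dplyr::distinct(f_name, l_name, id)

!base::anyNA(joined$id)             # every row got an ID
nrow(check) == nrow(ids)            # one ID per person
base::anyDuplicated(check$id) == 0  # no ID shared across people
```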