Add random id column

Package: base

Function: `sample()`

1. Add a new column (study_id) to the data

Review the data (d5)

# A tibble: 3 x 6
  f_name l_name item1 item2 item3 item4
  <chr>  <chr>  <dbl> <dbl> <dbl> <dbl>
1 randi  ivana      3     5     3    NA
2 nellie lorie      3     5     1     5
3 mike   skuld      3     1     3     5

To de-identify my data, I want to assign each participant a random, unique numeric identifier (with a value between 400 and 500). At some point I will remove first and last name from my data, and my de-identified data will contain only a study ID.

I can first create my new variable using the dplyr::mutate() function.

Next, to assign my random numbers, I can use the base::sample() function to draw values from 400 to 500, and I use the size argument to request 3 of those values (one for each of my 3 cases). The function's replace argument defaults to FALSE, which means values are drawn without replacement, so all numbers will be unique.

Before I calculate my new variable, I need to do one thing: call base::set.seed() with any number as my seed. Setting the seed ensures that if I rerun this code later, I will get the same randomly generated numbers each time. Very important!

Last, I have added the dplyr::mutate() argument .after to place my new variable directly after the variable l_name.

base::set.seed(3456)

d5 %>% 
  dplyr::mutate(study_id = base::sample(400:500, size = 3), .after = l_name)

# A tibble: 3 x 7
  f_name l_name study_id item1 item2 item3 item4
  <chr>  <chr>     <int> <dbl> <dbl> <dbl> <dbl>
1 randi  ivana       481     3     5     3    NA
2 nellie lorie       409     3     5     1     5
3 mike   skuld       405     3     1     3     5
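To see what the seed does, here is a minimal base R sketch (the object names first_draw and second_draw are only illustrative). Resetting the same seed before each call should reproduce the same draw.

```r
# Draw three IDs twice, resetting the same seed before each draw.
set.seed(3456)
first_draw <- sample(400:500, size = 3)

set.seed(3456)
second_draw <- sample(400:500, size = 3)

# Same seed and same call: the two draws are identical (TRUE).
identical(first_draw, second_draw)

# With the default replace = FALSE, no value repeats within a draw (TRUE).
anyDuplicated(first_draw) == 0
```

Note that reproducibility holds for the same R setup. R 3.6.0 changed the default sampling algorithm, so the same seed can give different numbers in older versions.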

Instead of typing in the exact size argument, I could also use dplyr::n() to get the number of cases.

base::set.seed(3456)

d5 %>% 
  dplyr::mutate(study_id = base::sample(400:500, size = dplyr::n()), .after = l_name)

# A tibble: 3 x 7
  f_name l_name study_id item1 item2 item3 item4
  <chr>  <chr>     <int> <dbl> <dbl> <dbl> <dbl>
1 randi  ivana       481     3     5     3    NA
2 nellie lorie       409     3     5     1     5
3 mike   skuld       405     3     1     3     5
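One thing to keep in mind when size comes from the number of rows: with replace = FALSE, sample() cannot return more values than the range contains. Below is a small base R sketch of a guard, where the names id_pool and n_cases are hypothetical.

```r
id_pool <- 400:500   # 101 candidate IDs
n_cases <- 3         # number of participants needing an ID

# Stop early with a clear message if the pool is too small.
if (n_cases > length(id_pool)) {
  stop("Not enough IDs in the pool for the number of cases.")
}

set.seed(3456)
study_id <- sample(id_pool, size = n_cases)

# Asking for more values than the pool holds is an error when replace = FALSE.
too_many <- tryCatch(
  sample(id_pool, size = length(id_pool) + 1),
  error = function(e) "error"
)
```

Widening the range (for example, 400:999) is usually a better fix than switching to replace = TRUE, which would allow duplicate IDs.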

2. Add study_id values for newly added participants

Review the new data (d6)

# A tibble: 2 x 6
  f_name l_name item1 item2 item3 item4
  <chr>  <chr>  <dbl> <dbl> <dbl> <dbl>
1 oscar  lewis      3     1     3     4
2 nate   purdy      2     5     1     1

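First we need to get a vector of the study_id values that we have already used. This assumes the study_id column created in step 1 was saved back to d5 (that is, the result of the mutate() call was assigned with d5 <- d5 %>% ...).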

used_ids <- d5 %>%
  dplyr::pull(study_id)

Now we can create a vector of the study_id values that are still available.

new_ids <- tibble::tibble(study_id = 400:500) %>%
  dplyr::filter(!study_id %in% used_ids) %>%
  dplyr::pull(study_id)
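The same vector can also be built in base R with setdiff(), shown here as an alternative sketch. The used_ids values are copied from the step 1 output above.

```r
# IDs already assigned in step 1 (taken from the d5 output above).
used_ids <- c(481L, 409L, 405L)

# Everything in the 400 to 500 range that has not been used yet.
new_ids <- setdiff(400:500, used_ids)

length(new_ids)             # 101 candidates minus 3 used leaves 98
any(new_ids %in% used_ids)  # FALSE: no overlap with the used IDs
```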

Now we can randomly assign each new participant an ID that doesn't overlap with the IDs we've already used.

base::set.seed(1234)

d6 <- d6 %>%
  dplyr::mutate(study_id = base::sample(new_ids, size = n()), .after = l_name)

d6

# A tibble: 2 x 7
  f_name l_name study_id item1 item2 item3 item4
  <chr>  <chr>     <int> <dbl> <dbl> <dbl> <dbl>
1 oscar  lewis       429     3     1     3     4
2 nate   purdy       482     2     5     1     1
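A caveat from the sample() documentation applies here: if the vector of available IDs ever shrinks to a single number, sample() treats that number as the upper end of 1:x instead of as the only choice. Below is a sketch of a safer helper that samples by position, following the pattern suggested in ?sample (the name safe_sample is hypothetical).

```r
# Sample by position so a length-one pool behaves as expected.
safe_sample <- function(x, size) {
  x[sample.int(length(x), size = size)]
}

one_left <- 499L

set.seed(1234)
safe_pick <- safe_sample(one_left, size = 1)  # always 499, the only ID left

set.seed(1234)
risky_pick <- sample(one_left, size = 1)      # draws from 1:499, so it may not be 499
```

For pools with more than one value, safe_sample() and sample() draw from the same set, so the helper can be swapped in without changing the rest of the pipeline.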

Now we can append the two datasets to get our full sample.

dplyr::bind_rows(d5, d6)

# A tibble: 5 x 7
  f_name l_name study_id item1 item2 item3 item4
  <chr>  <chr>     <int> <dbl> <dbl> <dbl> <dbl>
1 randi  ivana       481     3     5     3    NA
2 nellie lorie       409     3     5     1     5
3 mike   skuld       405     3     1     3     5
4 oscar  lewis       429     3     1     3     4
5 nate   purdy       482     2     5     1     1
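After appending, it is worth confirming that no ID was assigned twice. Here is a quick base R check, using the study_id values from the combined output above (the object names are illustrative).

```r
# study_id values from the combined data (d5 followed by d6).
all_ids <- c(481L, 409L, 405L, 429L, 482L)

# anyDuplicated() returns 0 when every value is unique.
no_duplicates <- anyDuplicated(all_ids) == 0

# Every ID should also fall inside the intended 400 to 500 range.
in_range <- all(all_ids >= 400 & all_ids <= 500)
```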

3. Add a new column (study_id) to the data but keep the ID the same within groups

Review the data (d7)

# A tibble: 8 x 3
  email                   f_name  l_name
  <chr>                   <chr>   <chr> 
1 crystal@email.com       crystal harris
2 randy@email.com         randy   lewis 
3 crystalharris@email.com crystal harris
4 charris@email.com       Crystal harris
5 andy@email.com          Andrew  lemon 
6 vince@email.com         Vince   lewis 
7 kristin@email.com       kristin black 
8 andrewl@email.com       andrew  lemon

Here I have duplicate names in my list, but I'm not sure which email is correct, so I want to assign all duplicates the same ID.

First I will standardize capitalization in the names. Then I can assign random IDs that differ across groups but are the same within each group.

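Here I am using tidyr::nest() rather than dplyr::group_by() because, after group_by(), mutate() would run sample() separately within each group: n() would be the group size, so rows within a group would get different IDs, and the same ID could be drawn for more than one group. Nesting leaves one row per unique name, so a single call to sample() assigns one unique ID per person.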

set.seed(2489)

d7 %>%
  dplyr::mutate(dplyr::across(c(f_name, l_name), stringr::str_to_title)) %>%
  tidyr::nest(data = c(-l_name, -f_name)) %>%
  dplyr::mutate(id = base::sample(200:205, n(), replace = FALSE)) %>%
  tidyr::unnest(data)

# A tibble: 8 x 4
  f_name  l_name email                      id
  <chr>   <chr>  <chr>                   <int>
1 Crystal Harris crystal@email.com         201
2 Crystal Harris crystalharris@email.com   201
3 Crystal Harris charris@email.com         201
4 Randy   Lewis  randy@email.com           200
5 Andrew  Lemon  andy@email.com            203
6 Andrew  Lemon  andrewl@email.com         203
7 Vince   Lewis  vince@email.com           204
8 Kristin Black  kristin@email.com         202

Alternatively, we could assign IDs to the unique names and then join those IDs back to the full file.

set.seed(2489)

ids <- d7 %>%
  dplyr::mutate(dplyr::across(c(f_name, l_name), stringr::str_to_title)) %>%
  dplyr::distinct(f_name, l_name) %>%
  dplyr::mutate(id = base::sample(200:205, n()))

d7 %>%
  dplyr::mutate(dplyr::across(c(f_name, l_name), stringr::str_to_title)) %>%  
  dplyr::left_join(ids, by = c("f_name", "l_name"))

# A tibble: 8 x 4
  email                   f_name  l_name    id
  <chr>                   <chr>   <chr>  <int>
1 crystal@email.com       Crystal Harris   201
2 randy@email.com         Randy   Lewis    200
3 crystalharris@email.com Crystal Harris   201
4 charris@email.com       Crystal Harris   201
5 andy@email.com          Andrew  Lemon    203
6 vince@email.com         Vince   Lewis    204
7 kristin@email.com       Kristin Black    202
8 andrewl@email.com       Andrew  Lemon    203
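The lookup idea can also be sketched in base R with match(). This is an illustrative alternative, not the approach used above. The names key, unique_keys, and id_lookup are hypothetical, and the drawn IDs are not guaranteed to match the output shown above.

```r
# Names from d7, already standardized to title case.
f_name <- c("Crystal", "Randy", "Crystal", "Crystal",
            "Andrew", "Vince", "Kristin", "Andrew")
l_name <- c("Harris", "Lewis", "Harris", "Harris",
            "Lemon", "Lewis", "Black", "Lemon")

# One key per person; duplicate names share the same key.
key <- paste(f_name, l_name)
unique_keys <- unique(key)

# Draw one ID per unique person, without replacement.
set.seed(2489)
id_lookup <- sample(200:205, size = length(unique_keys))

# Map each row back to its person's ID.
id <- id_lookup[match(key, unique_keys)]
```

Because the IDs are drawn once for the unique names and then looked up, every row for the same person receives the same ID and no two people share one.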
