Package: dplyr


Function: mutate()


1. Create a new calculated variable (age in months)

Review the data (d4)

# A tibble: 4 x 3
     id dob        test_date 
  <dbl> <date>     <date>    
1    10 2010-11-12 2021-10-14
2    11 2011-07-08 2021-10-15
3    12 2012-01-22 2021-10-15
4    13 2011-12-13 2021-10-17

Many times, in order to de-identify data we will remove date of birth and testing date from our data. However, age may still be relevant in our analysis, so we can calculate an age variable to be added to our data before removing dob and test_date.

Here we can use dplyr::mutate() to create our new age variable.

Then, we can calculate our age variable.

d4 %>%
  mutate(age_months = lubridate::interval(start = dob, end = test_date) /
                  months(1))
# A tibble: 4 x 4
     id dob        test_date  age_months
  <dbl> <date>     <date>          <dbl>
1    10 2010-11-12 2021-10-14  131.06452
2    11 2011-07-08 2021-10-15  123.22581
3    12 2012-01-22 2021-10-15  116.76667
4    13 2011-12-13 2021-10-17  118.12903

Let’s step through each part of this calculation to understand what is happening. Calculating variables from date ranges can be a little tricky because a) they are dates and b) calendar days are not consistent. While there are probably many ways to calculate our age variable, here I am first using the lubridate::interval() function to create a time span between two dates (as a sequence of seconds). We add our starting date to the start argument and our ending date to the end argument.

d4_new <- d4 %>%
  mutate(age_interval = lubridate::interval(start = dob, end = test_date))

d4_new
# A tibble: 4 x 4
     id dob        test_date  age_interval                  
  <dbl> <date>     <date>     <Interval>                    
1    10 2010-11-12 2021-10-14 2010-11-12 UTC--2021-10-14 UTC
2    11 2011-07-08 2021-10-15 2011-07-08 UTC--2021-10-15 UTC
3    12 2012-01-22 2021-10-15 2012-01-22 UTC--2021-10-15 UTC
4    13 2011-12-13 2021-10-17 2011-12-13 UTC--2021-10-17 UTC

If I want to see how many seconds this interval is, I can use the function base::as.numeric().

d4_new %>%
  mutate(age_interval_seconds = as.numeric(age_interval))
# A tibble: 4 x 5
     id dob        test_date  age_interval                  
  <dbl> <date>     <date>     <Interval>                    
1    10 2010-11-12 2021-10-14 2010-11-12 UTC--2021-10-14 UTC
2    11 2011-07-08 2021-10-15 2011-07-08 UTC--2021-10-15 UTC
3    12 2012-01-22 2021-10-15 2012-01-22 UTC--2021-10-15 UTC
4    13 2011-12-13 2021-10-17 2011-12-13 UTC--2021-10-17 UTC
# i 1 more variable: age_interval_seconds <dbl>

The next step in the above function is to calculate the number of months that occur in that interval. To do that we can use the base::months() function. Adding the input of 1 to this function generates a duration of one month in seconds (2629800). You can get this number by typing as.numeric(months(1)) in your console. Then we can divide our interval (which is in seconds) by our number of seconds in a month to get our total number of months (age) in an interval of time.

d4_new %>%
  mutate(age_months = age_interval/months(1))
# A tibble: 4 x 5
     id dob        test_date  age_interval                   age_months
  <dbl> <date>     <date>     <Interval>                          <dbl>
1    10 2010-11-12 2021-10-14 2010-11-12 UTC--2021-10-14 UTC  131.06452
2    11 2011-07-08 2021-10-15 2011-07-08 UTC--2021-10-15 UTC  123.22581
3    12 2012-01-22 2021-10-15 2012-01-22 UTC--2021-10-15 UTC  116.76667
4    13 2011-12-13 2021-10-17 2011-12-13 UTC--2021-10-17 UTC  118.12903

Another option is to wrap our interval in lubridate::time_length() rather than dividing by base::months()

d4_new %>%
     mutate(age_months = lubridate::time_length(age_interval, "months"))
# A tibble: 4 x 5
     id dob        test_date  age_interval                   age_months
  <dbl> <date>     <date>     <Interval>                          <dbl>
1    10 2010-11-12 2021-10-14 2010-11-12 UTC--2021-10-14 UTC  131.06452
2    11 2011-07-08 2021-10-15 2011-07-08 UTC--2021-10-15 UTC  123.22581
3    12 2012-01-22 2021-10-15 2012-01-22 UTC--2021-10-15 UTC  116.76667
4    13 2011-12-13 2021-10-17 2011-12-13 UTC--2021-10-17 UTC  118.12903

You could even do this by year if you wanted to.

d4_new %>%
  mutate(age_years = lubridate::time_length(age_interval, "years"))
# A tibble: 4 x 5
     id dob        test_date  age_interval                    age_years
  <dbl> <date>     <date>     <Interval>                          <dbl>
1    10 2010-11-12 2021-10-14 2010-11-12 UTC--2021-10-14 UTC 10.920548 
2    11 2011-07-08 2021-10-15 2011-07-08 UTC--2021-10-15 UTC 10.271233 
3    12 2012-01-22 2021-10-15 2012-01-22 UTC--2021-10-15 UTC  9.7287671
4    13 2011-12-13 2021-10-17 2011-12-13 UTC--2021-10-17 UTC  9.8438356

Last if you want a clean round number of months (and you most likely will want the number to be rounded to create more anonymity) you have several options. One option is to use the integral division operator %/% rather than /. Using this method divides two numbers and returns the integer part of the result.

d4 %>%
  mutate(age_months = lubridate::interval(start = dob, end = test_date)%/%months(1))
# A tibble: 4 x 4
     id dob        test_date  age_months
  <dbl> <date>     <date>          <dbl>
1    10 2010-11-12 2021-10-14        131
2    11 2011-07-08 2021-10-15        123
3    12 2012-01-22 2021-10-15        116
4    13 2011-12-13 2021-10-17        118

You could also use base::round() but as I’ve mentioned in other parts of this wiki, pay attention to how this function rounds. See Rounding for more information.

d4 %>%
  mutate(age_months = round(lubridate::interval(start = dob, end = test_date)/months(1)))
# A tibble: 4 x 4
     id dob        test_date  age_months
  <dbl> <date>     <date>          <dbl>
1    10 2010-11-12 2021-10-14        131
2    11 2011-07-08 2021-10-15        123
3    12 2012-01-22 2021-10-15        117
4    13 2011-12-13 2021-10-17        118

You could also use base::ceiling() or base::floor() as well.

d4 %>%
  mutate(age_months = ceiling(lubridate::interval(start = dob, end = test_date)/months(1)))
# A tibble: 4 x 4
     id dob        test_date  age_months
  <dbl> <date>     <date>          <dbl>
1    10 2010-11-12 2021-10-14        132
2    11 2011-07-08 2021-10-15        124
3    12 2012-01-22 2021-10-15        117
4    13 2011-12-13 2021-10-17        119

Return to Create New Variables