mutate()1. Create a new calculated variable (age in months)
Review the data (d4)
# A tibble: 4 x 3
id dob test_date
<dbl> <date> <date>
1 10 2010-11-12 2021-10-14
2 11 2011-07-08 2021-10-15
3 12 2012-01-22 2021-10-15
4 13 2011-12-13 2021-10-17
Many times, in order to de-identify data we will remove date of birth
and testing date from our data. However, age may still be relevant in
our analysis, so we can calculate an age variable to be added to our
data before removing dob and test_date.
Here we can use dplyr::mutate() to create our new age
variable.
Then, we can calculate our age variable.
d4 %>%
mutate(age_months = lubridate::interval(start = dob, end = test_date) /
months(1))
# A tibble: 4 x 4
id dob test_date age_months
<dbl> <date> <date> <dbl>
1 10 2010-11-12 2021-10-14 131.06452
2 11 2011-07-08 2021-10-15 123.22581
3 12 2012-01-22 2021-10-15 116.76667
4 13 2011-12-13 2021-10-17 118.12903
Let’s step through each part of this calculation to understand what
is happening. Calculating variables from date ranges can be a little
tricky because a) they are dates and b) calendar days are not
consistent. While there are probably many ways to calculate our age
variable, here I am first using the lubridate::interval()
function to create a time span between two dates (as a sequence of
seconds). We add our starting date to the start argument and
our ending date to the end argument.
d4_new <- d4 %>%
mutate(age_interval = lubridate::interval(start = dob, end = test_date))
d4_new
# A tibble: 4 x 4
id dob test_date age_interval
<dbl> <date> <date> <Interval>
1 10 2010-11-12 2021-10-14 2010-11-12 UTC--2021-10-14 UTC
2 11 2011-07-08 2021-10-15 2011-07-08 UTC--2021-10-15 UTC
3 12 2012-01-22 2021-10-15 2012-01-22 UTC--2021-10-15 UTC
4 13 2011-12-13 2021-10-17 2011-12-13 UTC--2021-10-17 UTC
If I want to see how many seconds this interval is, I can use the
function base::as.numeric().
d4_new %>%
mutate(age_interval_seconds = as.numeric(age_interval))
# A tibble: 4 x 5
id dob test_date age_interval
<dbl> <date> <date> <Interval>
1 10 2010-11-12 2021-10-14 2010-11-12 UTC--2021-10-14 UTC
2 11 2011-07-08 2021-10-15 2011-07-08 UTC--2021-10-15 UTC
3 12 2012-01-22 2021-10-15 2012-01-22 UTC--2021-10-15 UTC
4 13 2011-12-13 2021-10-17 2011-12-13 UTC--2021-10-17 UTC
# i 1 more variable: age_interval_seconds <dbl>
The next step in the above function is to calculate the number of
months that occur in that interval. To do that we can use the
base::months() function. Adding the input of 1 to this
function generates a duration of one month in seconds (2629800). You can
get this number by typing as.numeric(months(1)) in your
console. Then we can divide our interval (which is in seconds) by our
number of seconds in a month to get our total number of months (age) in
an interval of time.
d4_new %>%
mutate(age_months = age_interval/months(1))
# A tibble: 4 x 5
id dob test_date age_interval age_months
<dbl> <date> <date> <Interval> <dbl>
1 10 2010-11-12 2021-10-14 2010-11-12 UTC--2021-10-14 UTC 131.06452
2 11 2011-07-08 2021-10-15 2011-07-08 UTC--2021-10-15 UTC 123.22581
3 12 2012-01-22 2021-10-15 2012-01-22 UTC--2021-10-15 UTC 116.76667
4 13 2011-12-13 2021-10-17 2011-12-13 UTC--2021-10-17 UTC 118.12903
Another option is to wrap our interval in
lubridate::time_length() rather than dividing by
base::months()
d4_new %>%
mutate(age_months = lubridate::time_length(age_interval, "months"))
# A tibble: 4 x 5
id dob test_date age_interval age_months
<dbl> <date> <date> <Interval> <dbl>
1 10 2010-11-12 2021-10-14 2010-11-12 UTC--2021-10-14 UTC 131.06452
2 11 2011-07-08 2021-10-15 2011-07-08 UTC--2021-10-15 UTC 123.22581
3 12 2012-01-22 2021-10-15 2012-01-22 UTC--2021-10-15 UTC 116.76667
4 13 2011-12-13 2021-10-17 2011-12-13 UTC--2021-10-17 UTC 118.12903
You could even do this by year if you wanted to.
d4_new %>%
mutate(age_years = lubridate::time_length(age_interval, "years"))
# A tibble: 4 x 5
id dob test_date age_interval age_years
<dbl> <date> <date> <Interval> <dbl>
1 10 2010-11-12 2021-10-14 2010-11-12 UTC--2021-10-14 UTC 10.920548
2 11 2011-07-08 2021-10-15 2011-07-08 UTC--2021-10-15 UTC 10.271233
3 12 2012-01-22 2021-10-15 2012-01-22 UTC--2021-10-15 UTC 9.7287671
4 13 2011-12-13 2021-10-17 2011-12-13 UTC--2021-10-17 UTC 9.8438356
Last if you want a clean round number of months (and you most likely
will want the number to be rounded to create more anonymity) you have
several options. One option is to use the integral division operator
%/% rather than /. Using this method divides
two numbers and returns the integer part of the result.
d4 %>%
mutate(age_months = lubridate::interval(start = dob, end = test_date)%/%months(1))
# A tibble: 4 x 4
id dob test_date age_months
<dbl> <date> <date> <dbl>
1 10 2010-11-12 2021-10-14 131
2 11 2011-07-08 2021-10-15 123
3 12 2012-01-22 2021-10-15 116
4 13 2011-12-13 2021-10-17 118
You could also use base::round() but as I’ve mentioned
in other parts of this wiki, pay attention to how this function rounds.
See Rounding
for more information.
d4 %>%
mutate(age_months = round(lubridate::interval(start = dob, end = test_date)/months(1)))
# A tibble: 4 x 4
id dob test_date age_months
<dbl> <date> <date> <dbl>
1 10 2010-11-12 2021-10-14 131
2 11 2011-07-08 2021-10-15 123
3 12 2012-01-22 2021-10-15 117
4 13 2011-12-13 2021-10-17 118
You could also use base::ceiling() or
base::floor() as well.
d4 %>%
mutate(age_months = ceiling(lubridate::interval(start = dob, end = test_date)/months(1)))
# A tibble: 4 x 4
id dob test_date age_months
<dbl> <date> <date> <dbl>
1 10 2010-11-12 2021-10-14 132
2 11 2011-07-08 2021-10-15 124
3 12 2012-01-22 2021-10-15 117
4 13 2011-12-13 2021-10-17 119
Return to Create New Variables