rowSums()
1. Calculate a sum score for each student called
measure_sum
Review the data (d2)
# A tibble: 5 x 4
stu_id item1 item2 item3
<dbl> <dbl> <dbl> <dbl>
1 1234 3 2 4
2 2345 4 3 1
3 3456 NA 1 1
4 4567 4 5 1
5 5678 1 3 2
Calculate a sum score across item1, item2,
and item3
Note: We are calculating a new variable using
dplyr::mutate()
Note: Adding dplyr::across() allows you to select
the specific columns you want to calculate the rowSums()
for. Otherwise rowSums will be applied across all
columns.
Note: The default for base::rowSums() is to
not calculate a sum if any NA value exists. If you want
to still calculate a sum despite missing values, you can add the
argument na.rm = TRUE.
d2 %>%
mutate(measure_sum = rowSums(across(c(item1, item2, item3))))
# A tibble: 5 x 5
stu_id item1 item2 item3 measure_sum
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 3 2 4 9
2 2345 4 3 1 8
3 3456 NA 1 1 NA
4 4567 4 5 1 10
5 5678 1 3 2 6
2. Calculate a sum score for each student called
measure_sum no matter what values are missing
Review the data (d10)
# A tibble: 5 x 4
stu_id item1 item2 item3
<dbl> <dbl> <dbl> <dbl>
1 1234 3 2 4
2 2345 4 3 1
3 3456 NA 1 1
4 4567 4 5 1
5 5678 NA NA NA
Calculate a sum score across item1, item2,
and item3 even if values are missing
d10 %>%
mutate(measure_sum = rowSums(across(c(item1, item2, item3)), na.rm = TRUE))
# A tibble: 5 x 5
stu_id item1 item2 item3 measure_sum
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 3 2 4 9
2 2345 4 3 1 8
3 3456 NA 1 1 2
4 4567 4 5 1 10
5 5678 NA NA NA 0
Notice that we got a value of 0 when all columns have missing values. Most likely this is not what you want. To return a value of NA when all columns have missing values, you need to add some logic.
Here I’m using dplyr::case_when() to first set our
new variable to NA when all items are NA. For every
other row, we calculate rowSums with na.rm = TRUE.
d10 %>%
  mutate(measure_sum =
           case_when(
             if_all(item1:item3, ~ is.na(.)) ~ NA,
             .default = rowSums(across(item1:item3), na.rm = TRUE)))
# A tibble: 5 x 5
stu_id item1 item2 item3 measure_sum
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 3 2 4 9
2 2345 4 3 1 8
3 3456 NA 1 1 2
4 4567 4 5 1 10
5 5678 NA NA NA NA
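To see where the 0 in the earlier output comes from, here is a minimal sketch in base R. The object names all_missing and m are hypothetical and used only for illustration. Once na.rm = TRUE drops every value, nothing is left to add, and the sum of nothing is 0.

```r
# Hypothetical vector standing in for a row where every item is missing
all_missing <- c(NA_real_, NA_real_, NA_real_)

# With na.rm = TRUE all values are dropped, so the sum of "nothing" is 0
empty_sum <- sum(all_missing, na.rm = TRUE)

# rowSums() behaves the same way on a one-row matrix of missing values
m <- matrix(all_missing, nrow = 1)
empty_row_sum <- rowSums(m, na.rm = TRUE)

print(empty_sum)
print(empty_row_sum)
```

This is why the extra case_when() step is needed if an all-missing row should come back as NA rather than 0.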
rowMeans()
1. Calculate a mean score (measure_mean) for
each student even with missing data
Review the data (d2)
# A tibble: 5 x 4
stu_id item1 item2 item3
<dbl> <dbl> <dbl> <dbl>
1 1234 3 2 4
2 2345 4 3 1
3 3456 NA 1 1
4 4567 4 5 1
5 5678 1 3 2
Calculate the mean score of item1, item2,
and item3
Note: The default for base::rowMeans() is to
not calculate a mean if any NA value exists. If you
want to still calculate a mean despite missing values, you can add the
argument na.rm = TRUE.
Note: I think it’s really important to point out
that the closing parenthesis for dplyr::across() goes at
the end of item3, rather than at the end of the
rowMeans function. This is different from how you add
parentheses when using dplyr::across() for something like a
dplyr::case_when() function. If you do not put the
parenthesis at the end of item3, the rowMeans
function will still run, but the na.rm = TRUE argument will not
be evaluated!
d2 %>%
mutate(measure_mean = rowMeans(across(item1:item3), na.rm = TRUE))
# A tibble: 5 x 5
stu_id item1 item2 item3 measure_mean
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 3 2 4 3
2 2345 4 3 1 2.67
3 3456 NA 1 1 1
4 4567 4 5 1 3.33
5 5678 1 3 2 2
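As a rough illustration of why the placement of na.rm matters, the sketch below rebuilds a data frame with the same values as d2 (an assumption made so the example is self-contained) and compares the result when na.rm = TRUE reaches rowMeans() with the result when it does not. The object names with_na_rm and without_na_rm are hypothetical.

```r
library(dplyr)

# Assumed reconstruction of d2 so the example runs on its own
d2 <- tibble(
  stu_id = c(1234, 2345, 3456, 4567, 5678),
  item1 = c(3, 4, NA, 4, 1),
  item2 = c(2, 3, 1, 5, 3),
  item3 = c(4, 1, 1, 1, 2)
)

# na.rm = TRUE is an argument of rowMeans(), placed after across() closes
with_na_rm <- d2 %>%
  mutate(measure_mean = rowMeans(across(item1:item3), na.rm = TRUE))

# rowMeans() never receives na.rm, so any row with an NA returns NA
without_na_rm <- d2 %>%
  mutate(measure_mean = rowMeans(across(item1:item3)))

with_na_rm$measure_mean[3]
without_na_rm$measure_mean[3]
```

For student 3456, the first call averages the two available items, while the second returns NA. Per the note above, the second result is effectively what you get when the closing parenthesis of across() is misplaced.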
You can set the number of decimal places you want the new variable to
have by wrapping the calculation in the function
base::round() and setting the digits equal to the number
you want.
Note: base::round() does rounding. See Rounding
for more information.
d2 %>%
mutate(measure_mean = round(rowMeans(across(item1:item3), na.rm=TRUE), digits= 1))
# A tibble: 5 x 5
stu_id item1 item2 item3 measure_mean
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 3 2 4 3
2 2345 4 3 1 2.7
3 3456 NA 1 1 1
4 4567 4 5 1 3.3
5 5678 1 3 2 2
2. Calculate a mean score (toca_mean) for all
the variables that begin with “toca”, for each student even with missing
data
Review the data (d5)
# A tibble: 5 x 5
stu_id toca1 toca2 toca3 toca4
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 3 2 NA 3
2 2345 4 3 1 4
3 3456 2 NA 1 NA
4 4567 4 5 1 6
5 5678 1 3 2 2
Calculate the mean toca score
Note: We use tidyselect::starts_with() to
grab all variables that start with “toca”.
d5 %>%
  mutate(
    toca_mean = rowMeans(
      across(starts_with("toca")), na.rm = TRUE))
# A tibble: 5 x 6
stu_id toca1 toca2 toca3 toca4 toca_mean
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 3 2 NA 3 2.67
2 2345 4 3 1 4 3
3 3456 2 NA 1 NA 1.5
4 4567 4 5 1 6 4
5 5678 1 3 2 2 2
If you wanted to exclude some of the “toca” items (say just
toca2), you could exclude them this way, by wrapping your
selected and excluded variables in c().
d5 %>%
mutate(
toca_mean = rowMeans(
across(c(starts_with("toca"), -toca2)), na.rm=TRUE))
# A tibble: 5 x 6
stu_id toca1 toca2 toca3 toca4 toca_mean
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 3 2 NA 3 3
2 2345 4 3 1 4 3
3 3456 2 NA 1 NA 1.5
4 4567 4 5 1 6 3.67
5 5678 1 3 2 2 1.67
3. Calculate a mean score (toca_mean) for all
the variables that begin with “toca”, for each student who is
missing a value for at most one item
Review the data (d5)
# A tibble: 5 x 5
stu_id toca1 toca2 toca3 toca4
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 3 2 NA 3
2 2345 4 3 1 4
3 3456 2 NA 1 NA
4 4567 4 5 1 6
5 5678 1 3 2 2
Calculate the mean toca score
d5 %>%
  mutate(
    toca_mean =
      case_when(
        rowSums(is.na(across(starts_with("toca")))) <= 1 ~
          rowMeans(across(starts_with("toca")), na.rm = TRUE),
        .default = NA_real_))
# A tibble: 5 x 6
stu_id toca1 toca2 toca3 toca4 toca_mean
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 3 2 NA 3 2.67
2 2345 4 3 1 4 3
3 3456 2 NA 1 NA NA
4 4567 4 5 1 6 4
5 5678 1 3 2 2 2
4. Calculate a mean score (toca_mean) for all
the variables that begin with “toca”, when the toca variables contain
labelled NA values
Review the data (d6)
# A tibble: 5 x 6
stu_id toca1 toca2 toca3 item1 item2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 3 2 4 3 4
2 2345 4 3 1 4 5
3 3456 2 1 -99 5 5
4 4567 4 5 1 6 1
5 5678 -99 3 2 2 6
Let’s review the labelled NA values.
d6 %>%
labelled::na_values()
$stu_id
NULL
$toca1
[1] -99
$toca2
[1] -99
$toca3
[1] -99
$item1
NULL
$item2
NULL
We see that -99 is labelled as an NA value for the toca variables. However, R does not recognize this as an NA value when calculating row scores.
Note: First we use base::ifelse() to say that if our current
toca variables (denoted by .) are equal to -99, then make them NA;
otherwise keep the value as is. Using dplyr::if_else()
gives us errors in this process. I’m not sure why, except that I read the
function is stricter than base::ifelse() in checking that all
alternatives are of the same variable type.
Note: We use the dplyr::across() .names
argument to give these new variables the suffix “_temp” because we
do not want to alter our original toca variables.
Second we calculate our mean score within the same
dplyr::mutate() statement.
Last we drop our temporary toca variables using
dplyr::select(). We don’t need them anymore.
d6 %>%
  mutate(across(contains("toca"),
                ~ ifelse(. == -99, NA, .), .names = "{col}_temp"),
         toca_mean = rowMeans(across(contains("temp")),
                              na.rm = TRUE)) %>%
  select(-contains("temp"))
# A tibble: 5 x 7
stu_id toca1 toca2 toca3 item1 item2 toca_mean
<dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl> <dbl>
1 1234 3 2 4 3 4 3
2 2345 4 3 1 4 5 2.67
3 3456 2 1 -99 (NA) 5 5 1.5
4 4567 4 5 1 6 1 3.33
5 5678 -99 (NA) 3 2 2 6 2.5
We could also use haven::zap_missing() rather than
base::ifelse().
d6 %>%
  mutate(
    across(contains("toca"), haven::zap_missing, .names = "{col}_temp"),
    toca_mean = rowMeans(across(contains("temp")),
                         na.rm = TRUE)) %>%
  select(-contains("temp"))
# A tibble: 5 x 7
stu_id toca1 toca2 toca3 item1 item2 toca_mean
<dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl> <dbl>
1 1234 3 2 4 3 4 3
2 2345 4 3 1 4 5 2.67
3 3456 2 1 -99 (NA) 5 5 1.5
4 4567 4 5 1 6 1 3.33
5 5678 -99 (NA) 3 2 2 6 2.5
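If the -99 codes are stored as plain numbers rather than labelled values, one more option is dplyr::na_if(). The sketch below is only an illustration under that assumption: d6_plain is a hypothetical, unlabelled copy of the toca columns from d6, not the labelled data used above, and the same temporary-column pattern is kept so the originals stay untouched.

```r
library(dplyr)

# Hypothetical plain (unlabelled) version of the toca columns in d6
d6_plain <- tibble(
  stu_id = c(1234, 2345, 3456, 4567, 5678),
  toca1 = c(3, 4, 2, 4, -99),
  toca2 = c(2, 3, 1, 5, 3),
  toca3 = c(4, 1, -99, 1, 2)
)

result <- d6_plain %>%
  mutate(
    # Recode -99 to NA in temporary copies of the toca columns
    across(starts_with("toca"), ~ na_if(.x, -99), .names = "{.col}_temp"),
    # Average the temporary copies, ignoring the recoded NAs
    toca_mean = rowMeans(across(ends_with("_temp")), na.rm = TRUE)
  ) %>%
  # Drop the temporary columns once the mean is calculated
  select(-ends_with("_temp"))

result
```

The original toca columns still contain -99, while toca_mean is calculated from the recoded copies.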
rowwise()
1. Calculate a sum score for each student called
measure_sum
Review the data (d2)
# A tibble: 5 x 4
stu_id item1 item2 item3
<dbl> <dbl> <dbl> <dbl>
1 1234 3 2 4
2 2345 4 3 1
3 3456 NA 1 1
4 4567 4 5 1
5 5678 1 3 2
Calculate the sum score across item1,
item2, and item3.
Here we use dplyr::rowwise() to calculate values across
rows, in conjunction with base::sum() to calculate a sum
score.
Note: The default for base::sum() is to not calculate a
sum if any NA value exists in your columns. If you want to still
calculate a sum despite missing values, you can add the argument
na.rm = TRUE.
d2 %>%
rowwise() %>%
mutate(measure_sum = sum(c(item1, item2, item3)))
# A tibble: 5 x 5
# Rowwise:
stu_id item1 item2 item3 measure_sum
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 3 2 4 9
2 2345 4 3 1 8
3 3456 NA 1 1 NA
4 4567 4 5 1 10
5 5678 1 3 2 6
If you don’t want to list out every single item in a measure, you can
use dplyr::c_across(), which uses tidy selection syntax to
select variables. It is essentially a combination of
base::c() and dplyr::across(). So instead of
mutate(measure_sum = sum(c(item1, item2, item3))), you
can simplify to the code below.
Note: dplyr::c_across() cannot be used with
base::rowSums() and base::rowMeans() because
those functions expect a data frame or matrix as input, whereas
dplyr::c_across() returns a vector for use after
dplyr::rowwise().
d2 %>%
rowwise() %>%
mutate(measure_sum = sum(c_across(item1:item3)))
# A tibble: 5 x 5
# Rowwise:
stu_id item1 item2 item3 measure_sum
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 3 2 4 9
2 2345 4 3 1 8
3 3456 NA 1 1 NA
4 4567 4 5 1 10
5 5678 1 3 2 6
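To make the pairing of functions concrete, here is a small side-by-side sketch. It rebuilds a data frame with the same values as d2 (an assumption so the example is self-contained); the names via_rowsums and via_rowwise are hypothetical. across() hands rowSums() a data frame, while c_across() hands sum() a vector for one row at a time.

```r
library(dplyr)

# Assumed reconstruction of d2 so the example runs on its own
d2 <- tibble(
  stu_id = c(1234, 2345, 3456, 4567, 5678),
  item1 = c(3, 4, NA, 4, 1),
  item2 = c(2, 3, 1, 5, 3),
  item3 = c(4, 1, 1, 1, 2)
)

# across() returns a data frame, which is what rowSums() expects
via_rowsums <- d2 %>%
  mutate(measure_sum = rowSums(across(item1:item3)))

# c_across() returns a vector for the current row, which is what sum() expects
via_rowwise <- d2 %>%
  rowwise() %>%
  mutate(measure_sum = sum(c_across(item1:item3))) %>%
  ungroup()  # drop the rowwise grouping before further steps

all.equal(via_rowsums$measure_sum, via_rowwise$measure_sum)
```

Both approaches give the same sums here. The ungroup() call at the end is optional in this sketch, but it removes the rowwise grouping so that later operations are not applied row by row.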
2. Calculate a mean score (measure_mean) for
each student even with missing data
Review the data (d2)
# A tibble: 5 x 4
stu_id item1 item2 item3
<dbl> <dbl> <dbl> <dbl>
1 1234 3 2 4
2 2345 4 3 1
3 3456 NA 1 1
4 4567 4 5 1
5 5678 1 3 2
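Calculate the mean score
Note: If every item is NA for a row, base::mean() with na.rm = TRUE
returns NaN (unless you do something like the
rowSums example, where you specifically
code those values to NA).
d2 %>%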
rowwise() %>%
mutate(measure_mean = mean(c_across(item1:item3), na.rm=TRUE))
# A tibble: 5 x 5
# Rowwise:
stu_id item1 item2 item3 measure_mean
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 3 2 4 3
2 2345 4 3 1 2.67
3 3456 NA 1 1 1
4 4567 4 5 1 3.33
5 5678 1 3 2 2
You can set the number of decimal places you want the new variable to
have by wrapping the calculation in the function
base::round() and setting the digits equal to the number
you want.
d2 %>%
rowwise() %>%
mutate(measure_mean =
round(mean(c_across(item1:item3), na.rm=TRUE), digits=2))
# A tibble: 5 x 5
# Rowwise:
stu_id item1 item2 item3 measure_mean
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 3 2 4 3
2 2345 4 3 1 2.67
3 3456 NA 1 1 1
4 4567 4 5 1 3.33
5 5678 1 3 2 2
3. Calculate a mean score (toca_mean) for all
the variables that contain the word “toca”, for each student even with
missing data
Review the data (d4)
# A tibble: 5 x 6
stu_id toca1 toca2 toca3 toca4 item1
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 3 2 4 3 4
2 2345 4 3 1 4 5
3 3456 2 1 1 5 5
4 4567 4 5 1 6 1
5 5678 1 3 2 2 6
Calculate the mean toca score
d4 %>%
rowwise() %>%
mutate(toca_mean =
mean(c_across(contains("toca")), na.rm=TRUE))
# A tibble: 5 x 7
# Rowwise:
stu_id toca1 toca2 toca3 toca4 item1 toca_mean
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 3 2 4 3 4 3
2 2345 4 3 1 4 5 3
3 3456 2 1 1 5 5 2.25
4 4567 4 5 1 6 1 4
5 5678 1 3 2 2 6 2
If you wanted to exclude some of the “toca” items (say
toca2 and toca4 this time), you could exclude
them this way, by wrapping your selected and excluded variables in
c().
d4 %>%
rowwise() %>%
mutate(toca_mean =
mean(c_across(c(contains("toca"),-toca2,-toca4)),
na.rm = TRUE))
# A tibble: 5 x 7
# Rowwise:
stu_id toca1 toca2 toca3 toca4 item1 toca_mean
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1234 3 2 4 3 4 3.5
2 2345 4 3 1 4 5 2.5
3 3456 2 1 1 5 5 1.5
4 4567 4 5 1 6 1 2.5
5 5678 1 3 2 2 6 1.5
adorn_totals()
1. Calculate a sum score for each student called
measure_sum
Review the data (d2)
# A tibble: 5 x 4
stu_id item1 item2 item3
<dbl> <dbl> <dbl> <dbl>
1 1234 3 2 4
2 2345 4 3 1
3 3456 NA 1 1
4 4567 4 5 1
5 5678 1 3 2
Calculate the sum score across item1, item2,
and item3
Note: We can add the argument where = "col" to janitor::adorn_totals() to denote that we want rows summed across columns, rather than the default, which is column totals.
Note: We can add the argument name = to give the new column the name we want rather than the default “Total”.
Note: The default for the NA argument is na.rm = TRUE, so in this case, if we don’t want a sum when there is missing data, we need to change it to FALSE.
d2 %>%
adorn_totals(where = "col", name = "measure_sum", na.rm = FALSE)
stu_id item1 item2 item3 measure_sum
1234 3 2 4 9
2345 4 3 1 8
3456 NA 1 1 NA
4567 4 5 1 10
5678 1 3 2 6