Package: base


Function: rowSums()


1. Calculate a sum score for each student called measure_sum

Review the data (d2)

# A tibble: 5 x 4
  stu_id item1 item2 item3
   <dbl> <dbl> <dbl> <dbl>
1   1234     3     2     4
2   2345     4     3     1
3   3456    NA     1     1
4   4567     4     5     1
5   5678     1     3     2

Calculate a sum score across item1, item2 and item3

  • Note: We are calculating a new variable using dplyr::mutate()

  • Note: Adding dplyr::across() allows you to select the specific columns you want to calculate the rowSums() for. Otherwise rowSums() will be applied across all columns.

  • Note: By default, base::rowSums() returns NA for any row that contains an NA value. If you still want to calculate a sum despite missing values, you can add the argument na.rm = TRUE.

d2 %>% 
  mutate(measure_sum = rowSums(across(c(item1, item2, item3))))
# A tibble: 5 x 5
  stu_id item1 item2 item3 measure_sum
   <dbl> <dbl> <dbl> <dbl>       <dbl>
1   1234     3     2     4           9
2   2345     4     3     1           8
3   3456    NA     1     1          NA
4   4567     4     5     1          10
5   5678     1     3     2           6
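
The section does not show how d2 is built. Here is a minimal sketch that recreates it, assuming tibble::tribble() as the constructor (any other would do). The extra column sum_with_id is a hypothetical name added only to show what happens when the ID column is swept into the selection.

```r
library(dplyr)

# Recreate the example data shown above (the construction method is an assumption)
d2 <- tibble::tribble(
  ~stu_id, ~item1, ~item2, ~item3,
     1234,      3,      2,      4,
     2345,      4,      3,      1,
     3456,     NA,      1,      1,
     4567,      4,      5,      1,
     5678,      1,      3,      2
)

d2_scored <- d2 %>%
  mutate(
    # Sum only the three item columns
    measure_sum = rowSums(across(c(item1, item2, item3))),
    # For contrast: a wider selection also adds stu_id into the total
    sum_with_id = rowSums(across(stu_id:item3))
  )

d2_scored
```

For the first student, sum_with_id is 1234 + 9 = 1243, which is why it pays to name the item columns explicitly.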

2. Calculate a sum score for each student called measure_sum no matter what values are missing

Review the data (d10)

# A tibble: 5 x 4
  stu_id item1 item2 item3
   <dbl> <dbl> <dbl> <dbl>
1   1234     3     2     4
2   2345     4     3     1
3   3456    NA     1     1
4   4567     4     5     1
5   5678    NA    NA    NA

Calculate a sum score across item1, item2 and item3 even if values are missing

d10 %>% 
  mutate(measure_sum = rowSums(across(c(item1, item2, item3)), na.rm = TRUE))
# A tibble: 5 x 5
  stu_id item1 item2 item3 measure_sum
   <dbl> <dbl> <dbl> <dbl>       <dbl>
1   1234     3     2     4           9
2   2345     4     3     1           8
3   3456    NA     1     1           2
4   4567     4     5     1          10
5   5678    NA    NA    NA           0

NOTICE that we got a value of 0 when all columns have missing values. Most likely this is not what you want. To return a value of NA when all columns have missing values, you need to add in logic.

Here I’m using dplyr::case_when() to first set our new variable to NA when all items are NA, and otherwise calculate rowSums.

d10 %>%
  mutate(measure_sum =
           case_when(
             if_all(item1:item3, ~ is.na(.)) ~ NA,
             .default = rowSums(across(item1:item3), na.rm = TRUE)))
# A tibble: 5 x 5
  stu_id item1 item2 item3 measure_sum
   <dbl> <dbl> <dbl> <dbl>       <dbl>
1   1234     3     2     4           9
2   2345     4     3     1           8
3   3456    NA     1     1           2
4   4567     4     5     1          10
5   5678    NA    NA    NA          NA
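
The same logic can also be packaged as a small function. This is only a sketch: row_sum_or_na() is a hypothetical helper name, not something provided by base R or dplyr, and d10 is recreated here with tibble::tribble() as an assumption.

```r
library(dplyr)

# Recreate the example data shown above (the construction method is an assumption)
d10 <- tibble::tribble(
  ~stu_id, ~item1, ~item2, ~item3,
     1234,      3,      2,      4,
     2345,      4,      3,      1,
     3456,     NA,      1,      1,
     4567,      4,      5,      1,
     5678,     NA,     NA,     NA
)

# Hypothetical helper: row sums that ignore NA values,
# but return NA when every value in the row is missing
row_sum_or_na <- function(df) {
  all_missing <- rowSums(!is.na(df)) == 0   # TRUE where a row has no observed values
  totals <- rowSums(df, na.rm = TRUE)
  totals[all_missing] <- NA_real_
  totals
}

d10_scored <- d10 %>%
  mutate(measure_sum = row_sum_or_na(across(item1:item3)))

d10_scored
```

A helper like this may be handy when the same rule is applied to several measures, since the column selection only has to be written once per call.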

Function: rowMeans()


1. Calculate a mean score (measure_mean) for each student even with missing data

Review the data (d2)

# A tibble: 5 x 4
  stu_id item1 item2 item3
   <dbl> <dbl> <dbl> <dbl>
1   1234     3     2     4
2   2345     4     3     1
3   3456    NA     1     1
4   4567     4     5     1
5   5678     1     3     2

Calculate the mean score of item1, item2, and item3

  • Note: The default for base::rowMeans() is to not calculate a mean if any NA value exists. If you want to still calculate a mean despite missing values, you can set the argument to na.rm = TRUE.

  • Note: I think it’s really important to point out that the closing parenthesis for dplyr::across() goes at the end of item3, rather than at the end of the rowMeans function. This is different from how you add parentheses when using dplyr::across() for something like a dplyr::case_when() function. And if you do not put the parenthesis at the end of item3, the rowMeans function will still run, but the na.rm = TRUE argument will not be evaluated!

d2 %>% 
  mutate(measure_mean = rowMeans(across(item1:item3), na.rm = TRUE))
# A tibble: 5 x 5
  stu_id item1 item2 item3 measure_mean
   <dbl> <dbl> <dbl> <dbl>        <dbl>
1   1234     3     2     4         3   
2   2345     4     3     1         2.67
3   3456    NA     1     1         1   
4   4567     4     5     1         3.33
5   5678     1     3     2         2   

You can set the number of decimal places you want the new variable to have by wrapping the calculation in the function base::round() and setting the digits equal to the number you want.

  • Note: It is very important to note how base::round() does rounding. See Rounding for more information.

d2 %>% 
  mutate(measure_mean = round(rowMeans(across(item1:item3), na.rm=TRUE), digits= 1))
# A tibble: 5 x 5
  stu_id item1 item2 item3 measure_mean
   <dbl> <dbl> <dbl> <dbl>        <dbl>
1   1234     3     2     4          3  
2   2345     4     3     1          2.7
3   3456    NA     1     1          1  
4   4567     4     5     1          3.3
5   5678     1     3     2          2  
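
To make the rounding note concrete, here is a short base R illustration. For values that sit exactly halfway, base::round() rounds to the even digit rather than always rounding up. The helper round_half_up() below is hypothetical (written here for illustration) and ignores floating-point edge cases.

```r
# base::round() sends exact halves to the nearest even digit
half_even <- round(c(0.5, 1.5, 2.5))
half_even
# 0 2 2

# Hypothetical helper: round exact halves away from zero instead
round_half_up <- function(x, digits = 0) {
  scale <- 10^digits
  sign(x) * floor(abs(x) * scale + 0.5) / scale
}

half_up <- round_half_up(c(0.5, 1.5, 2.5))
half_up
# 1 2 3
```

Which behavior you want depends on your reporting conventions, so it is worth deciding before rounding scores.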

2. Calculate a mean score (toca_mean) for all the variables that begin with “toca”, for each student even with missing data

Review the data (d5)

# A tibble: 5 x 5
  stu_id toca1 toca2 toca3 toca4
   <dbl> <dbl> <dbl> <dbl> <dbl>
1   1234     3     2    NA     3
2   2345     4     3     1     4
3   3456     2    NA     1    NA
4   4567     4     5     1     6
5   5678     1     3     2     2

Calculate the mean toca score

  • Note: Here we are using tidyselect::starts_with() to grab all variables that start with “toca”

d5 %>% 
  mutate(
    toca_mean = rowMeans(
      across(starts_with("toca")), na.rm=TRUE))
# A tibble: 5 x 6
  stu_id toca1 toca2 toca3 toca4 toca_mean
   <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
1   1234     3     2    NA     3      2.67
2   2345     4     3     1     4      3   
3   3456     2    NA     1    NA      1.5 
4   4567     4     5     1     6      4   
5   5678     1     3     2     2      2   

If you wanted to exclude some of the “toca” items (say just toca2), you could exclude them this way, by wrapping your selected and excluded variables in c().

d5 %>% 
  mutate(
    toca_mean = rowMeans(
      across(c(starts_with("toca"), -toca2)), na.rm=TRUE))
# A tibble: 5 x 6
  stu_id toca1 toca2 toca3 toca4 toca_mean
   <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
1   1234     3     2    NA     3      3   
2   2345     4     3     1     4      3   
3   3456     2    NA     1    NA      1.5 
4   4567     4     5     1     6      3.67
5   5678     1     3     2     2      1.67
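
Tidy selection also supports the Boolean operators & and !, so the same exclusion can be written without c(). A sketch, assuming d5 is recreated with tibble::tribble():

```r
library(dplyr)

# Recreate the example data shown above (the construction method is an assumption)
d5 <- tibble::tribble(
  ~stu_id, ~toca1, ~toca2, ~toca3, ~toca4,
     1234,      3,      2,     NA,      3,
     2345,      4,      3,      1,      4,
     3456,      2,     NA,      1,     NA,
     4567,      4,      5,      1,      6,
     5678,      1,      3,      2,      2
)

# Select columns that start with "toca" AND are not toca2
d5_scored <- d5 %>%
  mutate(
    toca_mean = rowMeans(across(starts_with("toca") & !toca2), na.rm = TRUE)
  )

d5_scored
```

Both spellings select toca1, toca3 and toca4 here, so pick whichever reads more clearly to you.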

3. Calculate a mean score (toca_mean) for all the variables that begin with “toca”, for each student who is missing one or fewer item values

Review the data (d5)

# A tibble: 5 x 5
  stu_id toca1 toca2 toca3 toca4
   <dbl> <dbl> <dbl> <dbl> <dbl>
1   1234     3     2    NA     3
2   2345     4     3     1     4
3   3456     2    NA     1    NA
4   4567     4     5     1     6
5   5678     1     3     2     2

Calculate the mean toca score

d5 %>% 
  mutate(
    toca_mean = 
      case_when(
        rowSums(is.na(across(starts_with("toca")))) <= 1 ~ 
          rowMeans(across(starts_with("toca")), na.rm = TRUE),
        .default = NA_real_))
# A tibble: 5 x 6
  stu_id toca1 toca2 toca3 toca4 toca_mean
   <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
1   1234     3     2    NA     3      2.67
2   2345     4     3     1     4      3   
3   3456     2    NA     1    NA     NA   
4   4567     4     5     1     6      4   
5   5678     1     3     2     2      2   
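
If the missing-data rule is reused across measures, it can be wrapped in a function with the threshold as an argument. This is a sketch only: row_mean_max_missing() and its max_missing argument are hypothetical names, and d5 is recreated with tibble::tribble() as an assumption.

```r
library(dplyr)

# Recreate the example data shown above (the construction method is an assumption)
d5 <- tibble::tribble(
  ~stu_id, ~toca1, ~toca2, ~toca3, ~toca4,
     1234,      3,      2,     NA,      3,
     2345,      4,      3,      1,      4,
     3456,      2,     NA,      1,     NA,
     4567,      4,      5,      1,      6,
     5678,      1,      3,      2,      2
)

# Hypothetical helper: row means that are set to NA
# when a row has more than `max_missing` missing values
row_mean_max_missing <- function(df, max_missing = 1) {
  n_missing <- rowSums(is.na(df))        # count of NA values per row
  means <- rowMeans(df, na.rm = TRUE)
  means[n_missing > max_missing] <- NA_real_
  means
}

d5_scored <- d5 %>%
  mutate(
    toca_mean = row_mean_max_missing(across(starts_with("toca")), max_missing = 1)
  )

d5_scored
```

Student 3456 is missing two of four items, so the mean is withheld; the others keep the same values as in the output above.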

4. Calculate a mean score (toca_mean) for all the variables that begin with “toca”, when the toca variables contain labelled NA values

Review the data (d6)

# A tibble: 5 x 6
  stu_id toca1 toca2 toca3 item1 item2
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1   1234     3     2     4     3     4
2   2345     4     3     1     4     5
3   3456     2     1   -99     5     5
4   4567     4     5     1     6     1
5   5678   -99     3     2     2     6

Let’s review the labelled NA values.

d6 %>%
  labelled::na_values()
$stu_id
NULL

$toca1
[1] -99

$toca2
[1] -99

$toca3
[1] -99

$item1
NULL

$item2
NULL

We see that -99 is labelled as an NA value for the toca variables. However, R does not recognize this as an NA value when calculating row scores.

  1. First, we need to recode these values to NA before creating our mean score.

  • Note: We use base::ifelse() to say that if a toca variable (denoted by .) is equal to -99, make it NA; otherwise keep the value as is. Using dplyr::if_else() gave me errors in this process. I’m not sure why, except that I have read that it is stricter than base::ifelse() about checking that all alternatives are of the same type.

  • Note: We use the dplyr::across() .names argument to give these recoded variables the suffix “_temp” because we do not want to alter our original toca variables.

  2. Second, we calculate our mean score within the same dplyr::mutate() statement.

  3. Last, we drop our temporary toca variables using dplyr::select(). We don’t need them anymore.

d6 %>%
  mutate(across(contains("toca"),
                ~ ifelse(. == -99 , NA, .), .names = "{col}_temp"),
         toca_mean = rowMeans(across(contains("temp")),
                              na.rm = TRUE)) %>%
  select(-contains("temp"))
# A tibble: 5 x 7
  stu_id toca1     toca2     toca3     item1 item2 toca_mean
   <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl>     <dbl>
1   1234   3       2           4           3     4      3   
2   2345   4       3           1           4     5      2.67
3   3456   2       1         -99 (NA)      5     5      1.5 
4   4567   4       5           1           6     1      3.33
5   5678 -99 (NA)  3           2           2     6      2.5 

We could also use haven::zap_missing() rather than base::ifelse().

d6 %>%
  mutate(
    across(contains("toca"), haven::zap_missing, .names = "{col}_temp"),
    toca_mean = rowMeans(across(contains("temp")),
                         na.rm = TRUE)) %>%
  select(-contains("temp"))
# A tibble: 5 x 7
  stu_id toca1     toca2     toca3     item1 item2 toca_mean
   <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl>     <dbl>
1   1234   3       2           4           3     4      3   
2   2345   4       3           1           4     5      2.67
3   3456   2       1         -99 (NA)      5     5      1.5 
4   4567   4       5           1           6     1      3.33
5   5678 -99 (NA)  3           2           2     6      2.5 
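
To see the effect of the recode on a single variable, here is a small sketch. It assumes the toca variables are SPSS-style labelled vectors of the kind haven::labelled_spss() creates; the vector x is made up for illustration and is not taken from d6.

```r
# An illustrative labelled vector where -99 is declared as a user-defined missing value
x <- haven::labelled_spss(c(3, -99, 2), na_values = -99)

# The underlying data still hold -99, so a plain numeric mean is distorted
raw_mean <- mean(as.numeric(x))

# zap_missing() converts the declared missing values into real NA values
x_clean <- haven::zap_missing(x)
clean_mean <- mean(as.numeric(x_clean), na.rm = TRUE)

raw_mean
clean_mean
```

Here raw_mean averages 3, -99 and 2, while clean_mean averages only 3 and 2.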


Package: dplyr


Function: rowwise()


1. Calculate a sum score for each student called measure_sum

Review the data (d2)

# A tibble: 5 x 4
  stu_id item1 item2 item3
   <dbl> <dbl> <dbl> <dbl>
1   1234     3     2     4
2   2345     4     3     1
3   3456    NA     1     1
4   4567     4     5     1
5   5678     1     3     2

Calculate the sum score across item1, item2, and item3.

Here we use dplyr::rowwise() to calculate values across rows, in conjunction with base::sum() to calculate a sum score.

  • Note: By default, base::sum() returns NA if any NA value exists in your columns. If you still want to calculate a sum despite missing values, you can add the argument na.rm = TRUE.

d2 %>%
  rowwise() %>%
  mutate(measure_sum = sum(c(item1, item2, item3)))
# A tibble: 5 x 5
# Rowwise: 
  stu_id item1 item2 item3 measure_sum
   <dbl> <dbl> <dbl> <dbl>       <dbl>
1   1234     3     2     4           9
2   2345     4     3     1           8
3   3456    NA     1     1          NA
4   4567     4     5     1          10
5   5678     1     3     2           6

If you don’t want to list out every single item in a measure, you can use dplyr::c_across(), which uses tidy selection syntax to select variables. It is essentially a combination of base::c() and dplyr::across(). So instead of mutate(measure_sum = sum(c(item1, item2, item3))), you can simplify to the code below.

  • Note: dplyr::c_across() cannot be used with base::rowSums() and base::rowMeans(). It returns a vector of the current row’s values, while those functions expect a data frame or matrix (which is what dplyr::across() provides).

d2 %>%
  rowwise() %>%
  mutate(measure_sum = sum(c_across(item1:item3)))
# A tibble: 5 x 5
# Rowwise: 
  stu_id item1 item2 item3 measure_sum
   <dbl> <dbl> <dbl> <dbl>       <dbl>
1   1234     3     2     4           9
2   2345     4     3     1           8
3   3456    NA     1     1          NA
4   4567     4     5     1          10
5   5678     1     3     2           6
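
Notice the "# Rowwise:" line in the output: the result stays grouped by row until you remove that grouping. A sketch of the usual pattern, assuming d2 is recreated with tibble::tribble(), which calls dplyr::ungroup() afterwards and checks the result against the rowSums() approach:

```r
library(dplyr)

# Recreate the example data shown above (the construction method is an assumption)
d2 <- tibble::tribble(
  ~stu_id, ~item1, ~item2, ~item3,
     1234,      3,      2,      4,
     2345,      4,      3,      1,
     3456,     NA,      1,      1,
     4567,      4,      5,      1,
     5678,      1,      3,      2
)

# Row-wise sum, then drop the row-wise grouping so later verbs see the whole table
by_rowwise <- d2 %>%
  rowwise() %>%
  mutate(measure_sum = sum(c_across(item1:item3), na.rm = TRUE)) %>%
  ungroup()

# The same calculation with base::rowSums() for comparison
by_rowsums <- d2 %>%
  mutate(measure_sum = rowSums(across(item1:item3), na.rm = TRUE))

all.equal(by_rowwise$measure_sum, unname(by_rowsums$measure_sum))
```

Both approaches give the same sums here; the rowwise() version works with any function that takes a vector, while rowSums() is limited to sums.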

2. Calculate a mean score (measure_mean) for each student even with missing data

Review the data (d2)

# A tibble: 5 x 4
  stu_id item1 item2 item3
   <dbl> <dbl> <dbl> <dbl>
1   1234     3     2     4
2   2345     4     3     1
3   3456    NA     1     1
4   4567     4     5     1
5   5678     1     3     2

Calculate the mean score

  • Note: If all column values in a row are NA, base::mean() with na.rm = TRUE returns NaN (base::sum() returns 0 in the same situation). If you don’t want this, you will need to add logic (like we saw in the above rowSums example, where you specifically code those values to NA).

d2 %>%
  rowwise() %>%
  mutate(measure_mean = mean(c_across(item1:item3), na.rm=TRUE))
# A tibble: 5 x 5
# Rowwise: 
  stu_id item1 item2 item3 measure_mean
   <dbl> <dbl> <dbl> <dbl>        <dbl>
1   1234     3     2     4         3   
2   2345     4     3     1         2.67
3   3456    NA     1     1         1   
4   4567     4     5     1         3.33
5   5678     1     3     2         2   
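
None of the rows in d2 is entirely missing, so here is a quick base R check of what sum() and mean() return when every value is NA. The vectors below are made up for illustration.

```r
all_missing <- c(NA_real_, NA_real_, NA_real_)

# With na.rm = TRUE nothing is left to aggregate
empty_sum  <- sum(all_missing, na.rm = TRUE)    # sum of nothing is 0
empty_mean <- mean(all_missing, na.rm = TRUE)   # mean of nothing is NaN

# One way to turn NaN results into NA after the fact
means <- c(3, empty_mean, 2.67)
means[is.nan(means)] <- NA_real_
means
```

Recoding after the calculation is an alternative to the case_when() approach when you only need to clean up the final column.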

You can set the number of decimal places you want the new variable to have by wrapping the calculation in the function base::round() and setting the digits equal to the number you want.

d2 %>%
  rowwise() %>%
  mutate(measure_mean = 
           round(mean(c_across(item1:item3), na.rm=TRUE), digits=2))
# A tibble: 5 x 5
# Rowwise: 
  stu_id item1 item2 item3 measure_mean
   <dbl> <dbl> <dbl> <dbl>        <dbl>
1   1234     3     2     4         3   
2   2345     4     3     1         2.67
3   3456    NA     1     1         1   
4   4567     4     5     1         3.33
5   5678     1     3     2         2   

3. Calculate a mean score (toca_mean) for all the variables that contain the word “toca”, for each student even with missing data

Review the data (d4)

# A tibble: 5 x 6
  stu_id toca1 toca2 toca3 toca4 item1
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1   1234     3     2     4     3     4
2   2345     4     3     1     4     5
3   3456     2     1     1     5     5
4   4567     4     5     1     6     1
5   5678     1     3     2     2     6

Calculate the mean toca score

d4 %>%
  rowwise() %>%
  mutate(toca_mean = 
           mean(c_across(contains("toca")), na.rm=TRUE))
# A tibble: 5 x 7
# Rowwise: 
  stu_id toca1 toca2 toca3 toca4 item1 toca_mean
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
1   1234     3     2     4     3     4      3   
2   2345     4     3     1     4     5      3   
3   3456     2     1     1     5     5      2.25
4   4567     4     5     1     6     1      4   
5   5678     1     3     2     2     6      2   

If you wanted to exclude some of the “toca” items (say toca2 and toca4 this time), you could exclude them this way, by wrapping your selected and excluded variables in c().

d4 %>%
  rowwise() %>%
  mutate(toca_mean = 
           mean(c_across(c(contains("toca"),-toca2,-toca4)), 
                na.rm = TRUE))
# A tibble: 5 x 7
# Rowwise: 
  stu_id toca1 toca2 toca3 toca4 item1 toca_mean
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
1   1234     3     2     4     3     4       3.5
2   2345     4     3     1     4     5       2.5
3   3456     2     1     1     5     5       1.5
4   4567     4     5     1     6     1       2.5
5   5678     1     3     2     2     6       1.5


Package: janitor


Function: adorn_totals()


1. Calculate a sum score for each student called measure_sum

Review the data (d2)

# A tibble: 5 x 4
  stu_id item1 item2 item3
   <dbl> <dbl> <dbl> <dbl>
1   1234     3     2     4
2   2345     4     3     1
3   3456    NA     1     1
4   4567     4     5     1
5   5678     1     3     2

Calculate the sum score across item1, item2 and item3

  • Note: We add the argument where = "col" to say we want rows summed across columns, rather than the default, which is column totals.

  • Note: We can add the argument name = to give the new column the name we want rather than the default “Total”.

  • Note: The default for the na.rm argument is TRUE, so in this case, if we don’t want a sum when there is missing data, we need to change it to FALSE.

d2 %>% 
  adorn_totals(where = "col", name = "measure_sum", na.rm = FALSE)
 stu_id item1 item2 item3 measure_sum
   1234     3     2     4           9
   2345     4     3     1           8
   3456    NA     1     1          NA
   4567     4     5     1          10
   5678     1     3     2           6
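
For comparison, here is a sketch of the same call with na.rm left at its default, assuming d2 is recreated with tibble::tribble(). As in the output above, the first column (stu_id) is treated as an identifier and left out of the total.

```r
library(dplyr)
library(janitor)

# Recreate the example data shown above (the construction method is an assumption)
d2 <- tibble::tribble(
  ~stu_id, ~item1, ~item2, ~item3,
     1234,      3,      2,      4,
     2345,      4,      3,      1,
     3456,     NA,      1,      1,
     4567,      4,      5,      1,
     5678,      1,      3,      2
)

# Default na.rm = TRUE: missing values are skipped instead of producing NA
d2_totals <- d2 %>%
  adorn_totals(where = "col", name = "measure_sum")

d2_totals
```

With the default, student 3456 gets a sum of 2 rather than NA, so choose the setting that matches how you want missing items handled.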
