Import multiple files and combine

Package: purrr

Function: `map_dfr()`

1a. Read in all Excel files from the “attendance” folder and bind them together. Note that an asterisk was used to denote NAs and we need to notify R of that.

Review the files in our directory

Note: I am currently working in an R project. I used the here::here() function which implicitly sets my directory to the top level (root directory) of my current project. Then any subsequent folders can be listed in descending order. In this case my folder “attendance” is within another folder called “import-data” which is located in the top level of my project directory. You could also just list your entire file path.

fs::dir_ls(path = here::here("import-data", "attendance"))

[1] "C:/Users/Me/project/import-data/attendance/school-1_attendance.csv" 
[2] "C:/Users/Me/project/import-data/attendance/school-2_attendance.xlsx"
[3] "C:/Users/Me/project/import-data/attendance/school-3_attendance.xlsx"
[4] "C:/Users/Me/project/import-data/attendance/school-4_attendance.xlsx"
[5] "C:/Users/Me/project/import-data/attendance/school-5_attendance.xlsx"

We see that there are 4 Excel files in the directory and one csv file. We only want to import the Excel files.

We can read in all of our files using fs::dir_ls() and bind them together using purrr::map_dfr()

Note: By adding the dir_ls() argument glob = “*.xlsx” you are saying you only want to read in the names of all Excel files with the extension “.xlsx”. In this case, while there are 5 files in the directory, only 4 of them were Excel files, one file was csv format.
Note: While basic purrr::map() always returns a list, purrr::map_dfr() combines and returns tibbles (data frames) by stacking them on top of each other (using dplyr::bind_rows). The main arguments you include are .x, a list or atomic vector (files to read), and .f, which in this scenario is a function.
Note: Below I have not added a “.” for the .x argument to denote that I am referring to the character vector we created in the previous step. In this instance, it is implied.
Note: For this example I am using readxl::read_excel() for the .f argument. If you wanted to read in other file types, you would simply change the function to the one required to read in your file type.
Note: If not all files you read in have the same variables, the appended file will include all variables and whatever data is available for those variables.
Note: If files contain the same variable names but the variable classes aren’t the same for a variable (ex: V1 = character in one file and V1 = numeric in another file) then datasets cannot be combined. In the case of readxl::read_excel(), you could add the col_types argument to make that variable type the same across all files. If you are unsure what types of variables exist in each dataset, you can use janitor::compare_df_cols() to see an output of each data frames’ variables and variable types. By adding the argument return = “mismatch” the output will only show any variables that are different types across data frames.
Note: In the function readxl::read_excel() I added the argument na= because I know the files I am reading in use “*” to denote missing values.

d <- fs::dir_ls(path = here::here("import-data", "attendance"), glob ="*.xlsx") %>%
  purrr::map_dfr(readxl::read_excel, na="*")

head(d, n = c(10,5))

# A tibble: 10 x 5
    YEAR PROPORTIONAL_ATTENDANCE~1 PROPORTIONAL_ATTENDA~2 PROPORTIONAL_ATTENDA~3
   <dbl>                     <dbl>                  <dbl>                  <dbl>
 1  2015                      270                    282                    95.8
 2  2016                      275.                   299                    91.8
 3  2017                      362                    391.                   92.7
 4  2018                      368                    408.                   90.3
 5  2019                      396                    440                    90.1
 6  2015                      547.                   628                    87.0
 7  2016                      578                    643                    89.9
 8  2017                      568                    640.                   88.9
 9  2018                      554                    625                    88.6
10  2019                      536                    614                    87.3
# i abbreviated names: 1: PROPORTIONAL_ATTENDANCE_TOTAL_90ABV,
#   2: PROPORTIONAL_ATTENDANCE_TOTAL, 3: PROPORTIONAL_ATTENDANCE_TOTAL_PCT
# i 1 more variable: PROPORTIONAL_ATTENDANCE_BLACK_90ABV <dbl>

Note: If you did not want to use here::here() you could just list the full file path.

d <- fs::dir_ls(path = "C:/Users/Name/project/import-data/attendance", 
                glob ="*.xlsx") %>%
  purrr::map_dfr(readxl::read_excel, na="*")

Update 2024-06-13: purrr::map_dfr() is now superseded. You can still use it but the preferred method of binding files now is using purrr::list_rbind() from purrr version 1.0.2

d <-
  fs::dir_ls(path = here::here("import-data", "attendance"),
             glob = "*.xlsx") %>%
  purrr::map(readxl::read_excel, na = "*") %>%
  purrr::list_rbind()

1b. Add an identifier to differentiate which rows came from which files, by adding the argument “.id” and have it equal to whatever you want that new variable to be called. This will give you the full file name as the row identifier to differentiate which rows came from which file.

Add an identifier using the argument .id= and naming the variable whatever you want, in this case “source”

d <- fs::dir_ls(path = here::here("import-data", "attendance"), glob ="*.xlsx") %>%
  map_dfr(readxl::read_excel, na="*", .id = "source")

# A tibble: 10 x 5
   source                     YEAR PROPORTIONAL_ATTENDA~1 PROPORTIONAL_ATTENDA~2
   <chr>                     <dbl>                  <dbl>                  <dbl>
 1 C:/Users/Me/project/impo~  2015                   270                    282 
 2 C:/Users/Me/project/impo~  2016                   275.                   299 
 3 C:/Users/Me/project/impo~  2017                   362                    391.
 4 C:/Users/Me/project/impo~  2018                   368                    408.
 5 C:/Users/Me/project/impo~  2019                   396                    440 
 6 C:/Users/Me/project/impo~  2015                   547.                   628 
 7 C:/Users/Me/project/impo~  2016                   578                    643 
 8 C:/Users/Me/project/impo~  2017                   568                    640.
 9 C:/Users/Me/project/impo~  2018                   554                    625 
10 C:/Users/Me/project/impo~  2019                   536                    614 
# i abbreviated names: 1: PROPORTIONAL_ATTENDANCE_TOTAL_90ABV,
#   2: PROPORTIONAL_ATTENDANCE_TOTAL
# i 1 more variable: PROPORTIONAL_ATTENDANCE_TOTAL_PCT <dbl>

Update 2024-06-13: purrr::map_dfr() is now superseded. You can still use it but the preferred method of binding files is using purrr::list_rbind() from purrr version 1.0.2, and then adding the argument names_to rather than .id

d <-
  fs::dir_ls(path = here::here("import-data", "attendance"),
             glob = "*.xlsx") %>%
  purrr::map(readxl::read_excel, na = "*") %>%
  purrr::list_rbind(names_to = "source")

2. Read in all Excel files except one from the directory and bind them together.

Review the files in our “attendance” folder again

base::dir(path = here::here("import-data", "attendance"))

[1] "school-1_attendance.csv"  "school-2_attendance.xlsx"
[3] "school-3_attendance.xlsx" "school-4_attendance.xlsx"
[5] "school-5_attendance.xlsx"

Read in all files except “school-5_attendance.xlsx”

Note: You do not have to write the entire file name “school-5_attendance.xlsx” because stringr::str_subset() includes an argument to detect a pattern.

d <- fs::dir_ls(path = here::here("import-data", "attendance"), 
                glob ="*.xlsx") %>%
  stringr::str_subset(., "school-5_attendance", negate = TRUE) %>%
  purrr::map_dfr(readxl::read_excel, .id = "source")

If you wanted to exclude more than one file, you can add a “|” between the file names.

d <- fs::dir_ls(path = here::here("import-data", "attendance"), 
                glob ="*.xlsx") %>%
  stringr::str_subset(., "school-4_attendance|school-5_attendance", negate = TRUE) %>%
  purrr::map_dfr(readxl::read_excel, .id = "source")

Package: readr

Function: `read_csv()`

1. Read in a vector of CSV file names and bind them.

Review the files in our “csv_data” folder

base::dir(path = here::here("import-data", "csv_data"))

[1] "school-2_attendance.csv" "school-3_attendance.csv"

Note: The file argument for read_csv() allows you to provide “Either a path to a file, a connection, or literal data (either a single string or a raw vector). Since it allows vectors, you directly provide paths to multiple files as a vector in the file argument. read_excel() does not allows this.

If you want, you can add an identifier using the argument id= and name the variable whatever you want, in this case “source”

readr::read_csv(c("csv_data/school-2_attendance.csv",
           "csv_data/school-3_attendance.csv"), id = "source")

# A tibble: 10 x 27
   source                     YEAR PROPORTIONAL_ATTENDA~1 PROPORTIONAL_ATTENDA~2
   <chr>                     <dbl>                  <dbl>                  <dbl>
 1 csv_data/school-2_attend~  2015                   270                    282 
 2 csv_data/school-2_attend~  2016                   275.                   299 
 3 csv_data/school-2_attend~  2017                   362                    391.
 4 csv_data/school-2_attend~  2018                   368                    408.
 5 csv_data/school-2_attend~  2019                   396                    440 
 6 csv_data/school-3_attend~  2015                   547.                   628 
 7 csv_data/school-3_attend~  2016                   578                    643 
 8 csv_data/school-3_attend~  2017                   568                    640.
 9 csv_data/school-3_attend~  2018                   554                    625 
10 csv_data/school-3_attend~  2019                   536                    614 
# i abbreviated names: 1: PROPORTIONAL_ATTENDANCE_TOTAL_90ABV,
#   2: PROPORTIONAL_ATTENDANCE_TOTAL
# i 23 more variables: PROPORTIONAL_ATTENDANCE_TOTAL_PCT <dbl>,
#   PROPORTIONAL_ATTENDANCE_BLACK_90ABV <chr>,
#   PROPORTIONAL_ATTENDANCE_BLACK <chr>,
#   PROPORTIONAL_ATTENDANCE_BLACK_PCT <chr>,
#   PROPORTIONAL_ATTENDANCE_HISPANIC_90ABV <dbl>, ...

Return to Import Files

Import multiple files and combine

Package: purrr

Function: map_dfr()

Package: readr

Function: read_csv()

Function: `map_dfr()`

Function: `read_csv()`