map_dfr()1a. Read in all Excel files from the “attendance” folder and bind them together. Note that an asterisk was used to denote NAs and we need to notify R of that.
Review the files in our directory
here::here() function which implicitly sets my directory to
the top level (root directory) of my current project. Then any
subsequent folders can be listed in descending order. In this case my
folder “attendance” is within another folder called “import-data” which
is located in the top level of my project directory. You could also just
list your entire file path.fs::dir_ls(path = here::here("import-data", "attendance"))
[1] "C:/Users/Me/project/import-data/attendance/school-1_attendance.csv"
[2] "C:/Users/Me/project/import-data/attendance/school-2_attendance.xlsx"
[3] "C:/Users/Me/project/import-data/attendance/school-3_attendance.xlsx"
[4] "C:/Users/Me/project/import-data/attendance/school-4_attendance.xlsx"
[5] "C:/Users/Me/project/import-data/attendance/school-5_attendance.xlsx"
We see that there are 4 Excel files in the directory and one csv file. We only want to import the Excel files.
We can read in all of our files using fs::dir_ls() and
bind them together using purrr::map_dfr()
Note: By adding the dir_ls() argument glob =
“*.xlsx” you are saying you only want to read in the names of all
Excel files with the extension “.xlsx”. In this case, while there are 5
files in the directory, only 4 of them were Excel files, one file was
csv format.
Note: While basic purrr::map() always returns a
list, purrr::map_dfr() combines and returns tibbles (data
frames) by stacking them on top of each other (using
dplyr::bind_rows). The main arguments you include are
.x, a list or atomic vector (files to read), and .f,
which in this scenario is a function.
Note: Below I have not added a “.” for the .x argument to denote that I am referring to the character vector we created in the previous step. In this instance, it is implied.
Note: For this example I am using
readxl::read_excel() for the .f argument. If you
wanted to read in other file types, you would simply change the function
to the one required to read in your file type.
Note: If not all files you read in have the same variables, the appended file will include all variables and whatever data is available for those variables.
Note: If files contain the same variable names but the variable
classes aren’t the same for a variable (ex: V1 = character
in one file and V1 = numeric in another file) then datasets
cannot be combined. In the case of readxl::read_excel(),
you could add the col_types argument to make that variable type
the same across all files. If you are unsure what types of variables
exist in each dataset, you can use
janitor::compare_df_cols() to see an output of each data
frames’ variables and variable types. By adding the argument return
= “mismatch” the output will only show any variables that are
different types across data frames.
Note: In the function readxl::read_excel() I added
the argument na= because I know the files I am reading in use
“*” to denote missing values.
d <- fs::dir_ls(path = here::here("import-data", "attendance"), glob ="*.xlsx") %>%
purrr::map_dfr(readxl::read_excel, na="*")
head(d, n = c(10,5))
# A tibble: 10 x 5
YEAR PROPORTIONAL_ATTENDANCE~1 PROPORTIONAL_ATTENDA~2 PROPORTIONAL_ATTENDA~3
<dbl> <dbl> <dbl> <dbl>
1 2015 270 282 95.8
2 2016 275. 299 91.8
3 2017 362 391. 92.7
4 2018 368 408. 90.3
5 2019 396 440 90.1
6 2015 547. 628 87.0
7 2016 578 643 89.9
8 2017 568 640. 88.9
9 2018 554 625 88.6
10 2019 536 614 87.3
# i abbreviated names: 1: PROPORTIONAL_ATTENDANCE_TOTAL_90ABV,
# 2: PROPORTIONAL_ATTENDANCE_TOTAL, 3: PROPORTIONAL_ATTENDANCE_TOTAL_PCT
# i 1 more variable: PROPORTIONAL_ATTENDANCE_BLACK_90ABV <dbl>
here::here() you could
just list the full file path.d <- fs::dir_ls(path = "C:/Users/Name/project/import-data/attendance",
glob ="*.xlsx") %>%
purrr::map_dfr(readxl::read_excel, na="*")
purrr::map_dfr() is now superseded.
You can still use it but the preferred method of binding files now is
using purrr::list_rbind() from purrr version
1.0.2d <-
fs::dir_ls(path = here::here("import-data", "attendance"),
glob = "*.xlsx") %>%
purrr::map(readxl::read_excel, na = "*") %>%
purrr::list_rbind()
1b. Add an identifier to differentiate which rows came from which files, by adding the argument “.id” and have it equal to whatever you want that new variable to be called. This will give you the full file name as the row identifier to differentiate which rows came from which file.
Add an identifier using the argument .id= and naming the variable whatever you want, in this case “source”
d <- fs::dir_ls(path = here::here("import-data", "attendance"), glob ="*.xlsx") %>%
map_dfr(readxl::read_excel, na="*", .id = "source")
# A tibble: 10 x 5
source YEAR PROPORTIONAL_ATTENDA~1 PROPORTIONAL_ATTENDA~2
<chr> <dbl> <dbl> <dbl>
1 C:/Users/Me/project/impo~ 2015 270 282
2 C:/Users/Me/project/impo~ 2016 275. 299
3 C:/Users/Me/project/impo~ 2017 362 391.
4 C:/Users/Me/project/impo~ 2018 368 408.
5 C:/Users/Me/project/impo~ 2019 396 440
6 C:/Users/Me/project/impo~ 2015 547. 628
7 C:/Users/Me/project/impo~ 2016 578 643
8 C:/Users/Me/project/impo~ 2017 568 640.
9 C:/Users/Me/project/impo~ 2018 554 625
10 C:/Users/Me/project/impo~ 2019 536 614
# i abbreviated names: 1: PROPORTIONAL_ATTENDANCE_TOTAL_90ABV,
# 2: PROPORTIONAL_ATTENDANCE_TOTAL
# i 1 more variable: PROPORTIONAL_ATTENDANCE_TOTAL_PCT <dbl>
purrr::map_dfr() is now superseded.
You can still use it but the preferred method of binding files is using
purrr::list_rbind() from purrr version 1.0.2,
and then adding the argument names_to rather than
.idd <-
fs::dir_ls(path = here::here("import-data", "attendance"),
glob = "*.xlsx") %>%
purrr::map(readxl::read_excel, na = "*") %>%
purrr::list_rbind(names_to = "source")
2. Read in all Excel files except one from the directory and bind them together.
Review the files in our “attendance” folder again
base::dir(path = here::here("import-data", "attendance"))
[1] "school-1_attendance.csv" "school-2_attendance.xlsx"
[3] "school-3_attendance.xlsx" "school-4_attendance.xlsx"
[5] "school-5_attendance.xlsx"
Read in all files except “school-5_attendance.xlsx”
stringr::str_subset()
includes an argument to detect a pattern.d <- fs::dir_ls(path = here::here("import-data", "attendance"),
glob ="*.xlsx") %>%
stringr::str_subset(., "school-5_attendance", negate = TRUE) %>%
purrr::map_dfr(readxl::read_excel, .id = "source")
If you wanted to exclude more than one file, you can add a “|” between the file names.
d <- fs::dir_ls(path = here::here("import-data", "attendance"),
glob ="*.xlsx") %>%
stringr::str_subset(., "school-4_attendance|school-5_attendance", negate = TRUE) %>%
purrr::map_dfr(readxl::read_excel, .id = "source")
read_csv()1. Read in a vector of CSV file names and bind them.
Review the files in our “csv_data” folder
base::dir(path = here::here("import-data", "csv_data"))
[1] "school-2_attendance.csv" "school-3_attendance.csv"
Note: The file argument for
read_csv() allows you to provide “Either a path to a file,
a connection, or literal data (either a single string or a raw vector).
Since it allows vectors, you directly provide paths to multiple files as
a vector in the file argument. read_excel() does
not allows this.
If you want, you can add an identifier using the argument id= and name the variable whatever you want, in this case “source”
readr::read_csv(c("csv_data/school-2_attendance.csv",
"csv_data/school-3_attendance.csv"), id = "source")
# A tibble: 10 x 27
source YEAR PROPORTIONAL_ATTENDA~1 PROPORTIONAL_ATTENDA~2
<chr> <dbl> <dbl> <dbl>
1 csv_data/school-2_attend~ 2015 270 282
2 csv_data/school-2_attend~ 2016 275. 299
3 csv_data/school-2_attend~ 2017 362 391.
4 csv_data/school-2_attend~ 2018 368 408.
5 csv_data/school-2_attend~ 2019 396 440
6 csv_data/school-3_attend~ 2015 547. 628
7 csv_data/school-3_attend~ 2016 578 643
8 csv_data/school-3_attend~ 2017 568 640.
9 csv_data/school-3_attend~ 2018 554 625
10 csv_data/school-3_attend~ 2019 536 614
# i abbreviated names: 1: PROPORTIONAL_ATTENDANCE_TOTAL_90ABV,
# 2: PROPORTIONAL_ATTENDANCE_TOTAL
# i 23 more variables: PROPORTIONAL_ATTENDANCE_TOTAL_PCT <dbl>,
# PROPORTIONAL_ATTENDANCE_BLACK_90ABV <chr>,
# PROPORTIONAL_ATTENDANCE_BLACK <chr>,
# PROPORTIONAL_ATTENDANCE_BLACK_PCT <chr>,
# PROPORTIONAL_ATTENDANCE_HISPANIC_90ABV <dbl>, ...
Return to Import Files