Package: purrr


Function: map()


1. Read in all Excel files from the directory and create individual data frames for each file.

Review the files in our directory

  • Note: I am currently working in an R project. I used the here::here() function which implicitly sets my directory to the top level (root directory) of my current project. Then any subsequent folders can be listed in descending order. In this case my folder “attendance” is within another folder called “import-data” which is located in the top level of my project directory. You could also just list your entire file path.
fs::dir_ls(path = here::here("import-data", "attendance"))
[1] "C:/Users/Me/project/import-data/attendance/school-1_attendance.csv" 
[2] "C:/Users/Me/project/import-data/attendance/school-2_attendance.xlsx"
[3] "C:/Users/Me/project/import-data/attendance/school-3_attendance.xlsx"
[4] "C:/Users/Me/project/import-data/attendance/school-4_attendance.xlsx"
[5] "C:/Users/Me/project/import-data/attendance/school-5_attendance.xlsx"

We see that there are 4 Excel files in the directory and one csv file. We only want to import the Excel files.

There are several ways to do this.

  1. First, we can read in our files using fs::dir_ls(). By adding the fs::dir_ls() argument glob = “*.xlsx” we are saying we only want to read in the Excel files with the extension “.xlsx”. In this case, while there are 5 files in the directory, only 4 of them were Excel files, one file was csv format.
  • Then when I use that path in the readxl::read_excel() function, I use base R bracket indexing [] to extract the element we are referencing (only shown for 2 of the 4 files here)

  • Note: In the function readxl::read_excel() I added the argument na= because I know the files I am reading in use “*” to denote missing values.

# Create a list of files
files <- fs::dir_ls(here::here("import-data", "attendance"), glob = "*.xlsx")

# Read in each file
File1 <- readxl::read_excel(path = files[1], na = "*")

File2 <- readxl::read_excel(path = files[2], na = "*")
  1. Another approach if you have several files to read in, is that you can make a function to retrieve all of your separate data frames
f <- function(x) {
  fs::dir_ls(here::here("import-data", "attendance"), glob = "*.xlsx")[x] %>%
    readxl::read_excel(path = ., na = "*")
}

File1 <- f(1)
File2 <- f(2)
File3 <- f(3)
File4 <- f(4)
  1. Last, and probably the best solution I have for now, is to use purrr::map() in conjunction with the base::list2env() function, which creates environment variables from a named list.
  • Note: We use fs::dir_ls() to read the names of our files into a vector and we use purrr::set_names() (to create a named vector), in conjunction with base::basename() and stringr::str_remove() to create more usable names.

  • Note: We use purrr::map() to first create a list of our files. The main arguments you include for purrr::map() are .x, a list or atomic vector (files to read), and .f, which in this scenario is a function. Below “.” refers to the character vector we created in the previous step.

# Create our list of data frames

my_file_list <- fs::dir_ls(here::here("import-data", "attendance"), 
                glob ="*.xlsx") %>%
  purrr::set_names(basename(stringr::str_remove(., ".xlsx"))) %>%
  purrr::map(., .f = readxl::read_excel, na="*")

# Create environment variables from our list

list2env(my_file_list, envir=.GlobalEnv)
<environment: R_GlobalEnv>
# Review all objects in environment

ls()
[1] "my_file_list"        "school-2_attendance" "school-3_attendance"
[4] "school-4_attendance" "school-5_attendance"

2. Read in all Excel files from the directory, select the same variables from each file, and create individual data frames for each file.

Using the same files as above, if we only cared about certain variables in our data frames, we could add a dplyr::select() statement in a purrr::map() function to select our variables from our list of data frames before creating our environment variables.

  • Note: If your variable names change across files (for instance they have different prefix or suffix) you’ll need to use a tidyselection helper to grab similar names across files, such as tidyselect::contains().
# Create our list of data frames
my_file_list <- fs::dir_ls(path = here::here("import-data", "attendance"), 
                glob ="*.xlsx") %>%
  purrr::set_names(basename(stringr::str_remove(., ".xlsx"))) %>%
  purrr::map(., .f = readxl::read_excel, na="*") %>%
  purrr::map(., .f = . %>% dplyr::select(YEAR, tidyselect::contains("TOTAL")))

# Create environment variables from our list
list2env(my_file_list, envir=.GlobalEnv)
<environment: R_GlobalEnv>
# Review names of one object
names(`school-2_attendance`)
[1] "YEAR"                                "PROPORTIONAL_ATTENDANCE_TOTAL_90ABV"
[3] "PROPORTIONAL_ATTENDANCE_TOTAL"       "PROPORTIONAL_ATTENDANCE_TOTAL_PCT"  

Return to Import Files