Import files from multiple URLs and bind them

Package: purrr

Function: `map_dfr()`

1. Read in all CSV files (listed as links in an existing file) and bind them into one data frame.

In this instance, we are using a file downloaded on 2022-05-05 from Health and Human Services that provides a link to each archived dataset from the COVID-19 Community Profile Report County Level Data.

View the data

There are many columns in this dataset, but the one we care about is the column Archive Link, which provides the link to each CSV file stored in AWS.

archive <- readr::read_csv(here::here("import-data", "data", "hhs_archive.csv")) 

archive %>%
  dplyr::select(`Archive Link`) %>%
  head()

# A tibble: 6 x 1
  `Archive Link`                                                                
  <chr>                                                                         
1 https://us-dhhs-aa.s3.us-east-2.amazonaws.com/di4u-7yu6_2022-04-18T16-41-42.c~
2 https://us-dhhs-aa.s3.us-east-2.amazonaws.com/di4u-7yu6_2021-02-23T22-06-38.c~
3 https://us-dhhs-aa.s3.us-east-2.amazonaws.com/di4u-7yu6_2021-02-24T20-51-14.c~
4 https://us-dhhs-aa.s3.us-east-2.amazonaws.com/di4u-7yu6_2021-02-24T23-25-42.c~
5 https://us-dhhs-aa.s3.us-east-2.amazonaws.com/di4u-7yu6_2021-02-25T23-32-48.c~
6 https://us-dhhs-aa.s3.us-east-2.amazonaws.com/di4u-7yu6_2021-02-27T00-25-08.c~

To create a character vector to pass to purrr::map_dfr(), we select this column and then extract it with dplyr::pull().

files <- archive %>%
  dplyr::select(`Archive Link`) %>%
  dplyr::pull()

class(files)

[1] "character"
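
As a side note, the select-then-pull step can usually be collapsed into a single dplyr::pull() call that names the column. The sketch below uses a small made-up tibble (`archive_demo`, with placeholder example.com links) in place of the real archive file, so the names and URLs are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical stand-in for `archive`, with the same non-syntactic column name
archive_demo <- tibble::tibble(
  `Archive Link` = c("https://example.com/a.csv", "https://example.com/b.csv")
)

# Two-step version, as in the text above
files_two_step <- archive_demo %>%
  dplyr::select(`Archive Link`) %>%
  dplyr::pull()

# One-step version: pull() accepts the column name directly
files_one_step <- dplyr::pull(archive_demo, `Archive Link`)

# Both approaches return the same character vector
identical(files_two_step, files_one_step)
```

Either form works as input to purrr::map_dfr(); the one-step version is simply shorter.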

Last, we pass this character vector to purrr::map_dfr() along with readr::read_csv() to read in all of our files and bind them into one data frame.

Note: There is a variable in the files called fema_region that is not the same class across all files. We know from earlier that the files will not bind if the classes are not the same across files, so for the purpose of this example, I am removing that variable using the readr::read_csv() argument col_select.
Note: Some files have 37 variables and other files have 39 variables. As mentioned above, this is okay: the combined data frame will include all variables, filled with whatever data is available for them and NA elsewhere.
Note: While purrr::map_dfr() has an .id argument, in this case it would only give us the index of each file. However, readr::read_csv() has its own id argument, which stores the path or link each row was read from, so we can track which link the data came from.
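
The notes above can be checked on a toy example before running the full import. This is a minimal sketch, not the approach used in this tutorial: the two temporary files and their contents are made up, and instead of dropping `fema_region` it forces the column to character with `col_types`, which is one possible alternative when the column is needed.

```r
library(readr)
library(purrr)

# Two hypothetical files: fema_region would parse as a number in the first
# and as text in the second; the second file also has an extra column
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
writeLines(c("fips,fema_region", "1001,4"), f1)
writeLines(c("fips,fema_region,extra", "1003,Region 4,x"), f2)

# Read fema_region as character in every file so the classes match,
# and record the originating path in a `source` column
toy <- purrr::map_dfr(
  c(f1, f2),
  readr::read_csv,
  col_types = readr::cols(fema_region = readr::col_character()),
  id = "source"
)

# `extra` is NA for the row that came from the first file
toy
```

Whether to drop or coerce a mismatched column depends on whether it is needed downstream; dropping it, as done in this example, is the simpler choice.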

all_files <- purrr::map_dfr(files, readr::read_csv, col_select = -fema_region, id = "source")

all_files %>%
  dplyr::select(`source`:cases_last_7_days) %>%
  head()

# A tibble: 6 x 6
  source                                         fips county state date  cases~1
  <chr>                                         <dbl> <chr>  <chr> <chr>   <dbl>
1 https://us-dhhs-aa.s3.us-east-2.amazonaws.co~  1000 Unall~ AL    04/1~       0
2 https://us-dhhs-aa.s3.us-east-2.amazonaws.co~  1001 Autau~ AL    04/1~       8
3 https://us-dhhs-aa.s3.us-east-2.amazonaws.co~  1003 Baldw~ AL    04/1~      58
4 https://us-dhhs-aa.s3.us-east-2.amazonaws.co~  1005 Barbo~ AL    04/1~       0
5 https://us-dhhs-aa.s3.us-east-2.amazonaws.co~  1007 Bibb ~ AL    04/1~       6
6 https://us-dhhs-aa.s3.us-east-2.amazonaws.co~  1009 Bloun~ AL    04/1~      30
# ... with abbreviated variable name 1: cases_last_7_days
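
In purrr 1.0.0 and later, map_dfr() is marked as superseded in favor of purrr::map() followed by purrr::list_rbind(). The sketch below shows that pattern on two made-up temporary files (the file contents and the `paths`, `pieces`, and `combined` names are assumptions for illustration); it is an equivalent formulation, not a change to the example above.

```r
library(readr)
library(purrr)

# Two hypothetical files with different numbers of columns
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
writeLines(c("fips,cases", "1001,8", "1003,58"), f1)
writeLines(c("fips,cases,extra", "1005,0,x"), f2)
paths <- c(f1, f2)

# Step 1: read each file into its own tibble, tagging rows with their source
pieces <- purrr::map(
  paths,
  function(path) readr::read_csv(path, id = "source", show_col_types = FALSE)
)

# Step 2: row-bind the list; columns missing from a file are filled with NA
combined <- purrr::list_rbind(pieces)
combined
```

Splitting the read and bind steps also makes it easier to inspect an individual file (for example, `pieces[[2]]`) when one of them fails to combine.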
