map_dfr()
1. Read in all CSV files (from an existing file with a list of links) and bind them into one data frame.
In this instance, we are using a file downloaded on 2022-05-05 from Health and Human Services that provides a link to all archived datasets from the COVID-19 Community Profile Report County Level Data.
View the data
There are many columns in this dataset, but the one we care about is the column Archive Link, which provides the links to the csv files stored in AWS.
archive <- readr::read_csv(here::here("import-data", "data", "hhs_archive.csv"))
archive %>%
dplyr::select(`Archive Link`) %>%
head()
# A tibble: 6 x 1
`Archive Link`
<chr>
1 https://us-dhhs-aa.s3.us-east-2.amazonaws.com/di4u-7yu6_2022-04-18T16-41-42.c~
2 https://us-dhhs-aa.s3.us-east-2.amazonaws.com/di4u-7yu6_2021-02-23T22-06-38.c~
3 https://us-dhhs-aa.s3.us-east-2.amazonaws.com/di4u-7yu6_2021-02-24T20-51-14.c~
4 https://us-dhhs-aa.s3.us-east-2.amazonaws.com/di4u-7yu6_2021-02-24T23-25-42.c~
5 https://us-dhhs-aa.s3.us-east-2.amazonaws.com/di4u-7yu6_2021-02-25T23-32-48.c~
6 https://us-dhhs-aa.s3.us-east-2.amazonaws.com/di4u-7yu6_2021-02-27T00-25-08.c~
To create a character vector to pass to purrr::map_dfr(), we select this column and then use dplyr::pull().
files <- archive %>%
dplyr::select(`Archive Link`) %>%
dplyr::pull()
class(files)
[1] "character"
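As a side note, dplyr::pull() can also take the column directly, which gives the same kind of character vector in one step. The sketch below uses a toy tibble with placeholder URLs (both are assumptions for illustration, not the HHS archive file).

```r
# Toy stand-in for the archive table; the URLs are placeholders.
toy_archive <- tibble::tibble(
  `Archive Link` = c("https://example.com/a.csv", "https://example.com/b.csv"),
  `Other Column` = c("x", "y")
)

# pull() extracts a single column as a plain vector.
links <- dplyr::pull(toy_archive, `Archive Link`)

class(links)
```

Either form works; the two-step version above simply makes the column selection explicit.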
Finally, we can pass this character vector to purrr::map_dfr(), along with readr::read_csv(), to read in all of our files and bind them into one data frame.
Note: There is a variable in the files called fema_region that is not the same class across all files. We know from earlier that the files will not bind if a variable's class differs across files, so for the purpose of this example, I am removing that variable using the col_select argument of readr::read_csv().
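To illustrate why a class mismatch blocks the bind, here is a minimal sketch using two toy tibbles in place of the real files. The values in fema_region are invented for the example; the actual contents of that column in the HHS files are not shown here.

```r
# Two toy tibbles standing in for two archived files (hypothetical values).
file_a <- tibble::tibble(fips = c(1001, 1003), fema_region = c(4, 4))
file_b <- tibble::tibble(fips = c(6001, 6003), fema_region = c("Region 9", "Region 9"))

# fema_region is double in one tibble and character in the other,
# so bind_rows() raises an error; tryCatch() captures it here.
attempt <- tryCatch(
  dplyr::bind_rows(file_a, file_b),
  error = function(e) "bind failed"
)

# Dropping the column first (the approach taken in this example) lets the bind succeed.
bound <- dplyr::bind_rows(
  dplyr::select(file_a, -fema_region),
  dplyr::select(file_b, -fema_region)
)

# An alternative, not used here: coerce the column to one class before binding.
coerced <- dplyr::bind_rows(
  dplyr::mutate(file_a, fema_region = as.character(fema_region)),
  file_b
)
```

Dropping the column is the simplest option when the variable is not needed; coercing keeps it at the cost of storing everything as text.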
Note: Some files have 37 variables and other files have 39 variables. As mentioned above, this is okay: the combined data frame will include all variables, with NA wherever a file does not supply a given variable.
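A small sketch of that behavior, using two toy tibbles with different numbers of columns. The column name extra_metric is hypothetical and only stands in for a variable that appears in some files but not others.

```r
# One tibble has an extra column, the other does not (toy data).
wide   <- tibble::tibble(fips = 1001, cases_last_7_days = 8, extra_metric = 0.5)
narrow <- tibble::tibble(fips = 1003, cases_last_7_days = 58)

# bind_rows() keeps the union of columns and fills the gaps with NA.
combined <- dplyr::bind_rows(wide, narrow)

names(combined)
combined$extra_metric
```

The row that came from the narrower tibble carries NA in extra_metric, so nothing is dropped silently.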
Note: While purrr::map_dfr() has an argument, .id, it will not tell us much beyond the index of the file in this case, because our character vector is unnamed. However, readr::read_csv() has its own id argument, which records the value from the character vector, so we can track which link each row came from.
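The difference between the two arguments can be sketched with two small CSV files written to a temporary directory. The file names and contents are made up for the example, and show_col_types = FALSE is added only to quiet the console output.

```r
# Write two toy CSV files to a temporary directory (not the HHS files).
paths <- file.path(tempdir(), c("file_a.csv", "file_b.csv"))
readr::write_csv(tibble::tibble(fips = c(1001, 1003)), paths[1])
readr::write_csv(tibble::tibble(fips = 6001), paths[2])

# purrr's .id: for an unnamed vector, records each element's position.
by_index <- purrr::map_dfr(
  paths, readr::read_csv, show_col_types = FALSE, .id = "source"
)

# readr's id: passed through to read_csv(), records the file each row came from.
by_path <- purrr::map_dfr(
  paths, readr::read_csv, show_col_types = FALSE, id = "source"
)

unique(by_index$source)
unique(basename(by_path$source))
```

With .id the source column only holds the positions of the files in the vector, while with id it holds the path of each file, which is what makes the rows traceable.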
all_files <- purrr::map_dfr(files, readr::read_csv, col_select = -fema_region, id = "source")
all_files %>%
dplyr::select(`source`:cases_last_7_days) %>%
head()
# A tibble: 6 x 6
source fips county state date cases~1
<chr> <dbl> <chr> <chr> <chr> <dbl>
1 https://us-dhhs-aa.s3.us-east-2.amazonaws.co~ 1000 Unall~ AL 04/1~ 0
2 https://us-dhhs-aa.s3.us-east-2.amazonaws.co~ 1001 Autau~ AL 04/1~ 8
3 https://us-dhhs-aa.s3.us-east-2.amazonaws.co~ 1003 Baldw~ AL 04/1~ 58
4 https://us-dhhs-aa.s3.us-east-2.amazonaws.co~ 1005 Barbo~ AL 04/1~ 0
5 https://us-dhhs-aa.s3.us-east-2.amazonaws.co~ 1007 Bibb ~ AL 04/1~ 6
6 https://us-dhhs-aa.s3.us-east-2.amazonaws.co~ 1009 Bloun~ AL 04/1~ 30
# ... with abbreviated variable name 1: cases_last_7_days