Time | Topic |
---|---|
8:45 - 9:00 | Intro/Logistics |
9:00 - 9:45 | Fundamentals of Data Organization |
9:45 - 10:30 | Standardized Data Cleaning Checklist |
10:30 - 10:45 | Break |
10:45 - 12:30 | Data Cleaning Functions |
12:30 - 12:45 | Documentation for Data Sharing |
Materials
Cghlewis.github.io/ncme-data-cleaning-workshop/
Exercises
Login to Posit Cloud workspace: https://posit.cloud/content/7872027.
If Posit Cloud doesn’t work, download materials locally:
Feel free to interrupt me with questions/comments at any time ✋.
Get up and move around as much as you need to 🚶.
Analyzable
Interpretable
Complete
Valid
Accurate
Consistent
De-identified
What data quality issues do you detect for the analyzable indicator?
- Variable names like Q4 are not meaningful (e.g., gender instead of Q1)
- Variable names contain characters other than letters, numbers, and _, ., or -
- Variable names do not follow the planned naming convention wave_responder_scale# (e.g., w1_t_mast1)
What data quality issues do you detect for the interpretable indicator?
What data quality issues do you detect for the complete indicator?
Variable values adhere to the planned variable type (numeric, character, date) and fall within the expected range (e.g., 1-5).
What data quality issues do you detect for the valid indicator?
AGE/YR does not adhere to our planned variable type. Score values fall out of our expected range.
What data quality issues do you detect for the accurate indicator?
TEACHING LEVEL and SCHOOL contain inaccurate values.
Variable values are consistently measured, formatted, or categorized within a column
Variables are consistently measured across collections of the same form
What data quality issues do you detect for the consistent indicator?
GENDER values are not consistently categorized.
🚨 Consider this in the context of risk.
What data quality issues do you detect for the de-identified indicator?
The number one way to reduce data errors is to make a plan before you collect data
1. Correct data at the source
2. Import the raw data
3. Review the raw data
4. Find missing data
5. Adjust the sample
6. De-identify data
7. Drop irrelevant columns
8. Split columns
9. Rename variables
10. Normalize variables
11. Standardize variables
12. Update variable types
13. Recode variables
14. Construct new variables
15. Add missing values
16. Validate data
17. Save clean data
Source | Resource |
---|---|
Alena Filip | Table 2 provides pros and cons of various de-identification methods |
J-PAL | Table 3 provides a list of direct and indirect identifiers and recommended removal methods |
Schatschneider et al. | Deidentifying Data Guide |
When you export your files, it’s important to name them consistently and clearly. Avoid spaces and special characters in file names (a - is allowed).
Which gives you a better idea of what is in the file? 🤔
R Script File
R Markdown File
To see a function’s documentation and arguments, type ?functionname in your console, e.g., ?mean shows mean(x, trim, na.rm).
Function | Task |
---|---|
View() | View object |
str() | Display internal structure of an object |
c() | Combine elements |
class() | Check the class/type of an object |
We can also use functions that live in packages that we can install onto our computer
Once installed, there are two ways to call packages
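For example, both of these run the same dplyr function:

```r
# Option 1: attach the package once, then call its functions directly
library(dplyr)
glimpse(mtcars)

# Option 2: prefix the function with its package name (no library() call needed)
dplyr::glimpse(mtcars)
```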
2014+ magrittr pipe %>%
2021+ (R ≥ 4.1.0) native R pipe |>
Isabella Velásquez’s blog post Understanding the native R pipe |> (2022)
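Both pipes pass the result on their left into the function on their right, so these two lines are equivalent:

```r
library(dplyr)

# magrittr pipe (available after loading dplyr or other tidyverse packages)
mtcars %>% filter(mpg > 25) %>% count()

# native R pipe (built into R 4.1.0 and later)
mtcars |> filter(mpg > 25) |> count()
```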
To turn on the native pipe: Tools → Global Options → Code → Editing → Use Native Pipe Operator
Open the data folder and download “sample_tch_svy_raw.xlsx” to view the file on your computer.
Common data importing functions:
- read_csv() and read_delim() from the readr package
- read_excel() from the readxl package
- read_sav(), read_sas(), and read_dta() from the haven package
Learn more about importing multiple files at once here
Which function should we use to read in our sample data? 🤔
read_excel() has several arguments. Type ?read_excel in your console to see more arguments.
If you get stuck, you can open “solutions.Rmd”.
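A minimal sketch of the import step; the data/ path and the svy_raw object name are assumptions, not part of the workshop materials:

```r
library(readxl)

# read the raw teacher survey (path assumes the file sits in a data/ folder)
svy_raw <- read_excel("data/sample_tch_svy_raw.xlsx")
```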
There are several functions we can use to explore data
dplyr::glimpse()
skimr::skim()
base::summary()
summarytools::dfSummary()
Hmisc::describe()
For our sample data we will use dfSummary().
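A quick sketch, assuming the imported data is stored in an object called svy_raw:

```r
library(summarytools)

# one summary per variable: type, common values, stats, and missingness
dfSummary(svy_raw)
```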
Open and review the files in the docs folder.
There are two key functions we can use to both identify and remove duplicates in our data
janitor::get_dupes()
dplyr::distinct()
In each function, replace id_vars with your unique identifier(s): for example, c(first_name, last_name) or the tch_id number.
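A sketch of both steps, assuming svy_raw holds the data and tch_id is the unique identifier:

```r
library(dplyr)
library(janitor)

# show every row whose tch_id appears more than once
svy_raw |> get_dupes(tch_id)

# keep one row per tch_id, retaining all other columns
svy_raw |> distinct(tch_id, .keep_all = TRUE)
```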
The functions used here will depend on what is required. Examples of functions you might use:
- dplyr::select() to drop variables
- dplyr::case_when() or dplyr::recode() to collapse categories/recode values
- dplyr::*_join() to merge in study unique IDs
- stringr::str_remove_all() or stringr::str_replace_all() to redact character values
For our sample data we are going to use the following.
- dplyr::case_when() to recategorize names into our unique study ID values (the sch_id variable)
- dplyr::select() to drop identifying variables
To learn more about case_when(), type ?case_when in your console. There is also a newer function, case_match(), that is worth looking into; it reduces repetition in the syntax.
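A minimal sketch of both steps; the school_name column and the name-to-ID pairs are hypothetical:

```r
library(dplyr)

svy_deid <- svy_raw |>
  # recategorize identifying names into study ID values (pairs are made up)
  mutate(sch_id = case_when(
    school_name == "Park Elementary" ~ 1,
    school_name == "Lakeside Middle" ~ 2
  )) |>
  # drop the identifying variable
  select(-school_name)
```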
Two functions we can use to rename variables
dplyr::rename()
purrr::set_names()
For our sample data we will use set_names().
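A sketch of both approaches; the old and new variable names are hypothetical:

```r
library(dplyr)
library(purrr)

# rename selected variables one at a time: new_name = old_name
svy_deid |> rename(tch_yrs = Q5)

# or replace every name at once (the vector must match the number of columns)
svy_deid |> set_names(c("sch_id", "tch_yrs", "gender"))
```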
There are several functions that can help us remove unexpected values from our variables.
Some examples of those include:
stringr::str_remove_all()
stringr::str_replace_all()
readr::parse_number()
Say this is our data
# A tibble: 3 x 2
id income
<dbl> <chr>
1 1 $32,000
2 2 120000
3 3 $45,000
Remove character values with stringr::str_remove_all()
# A tibble: 3 x 2
id income
<dbl> <chr>
1 1 32000
2 2 120000
3 3 45000
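The call that produces the cleaned column above might look like this (df stands in for the tibble shown):

```r
library(dplyr)
library(stringr)

# strip dollar signs and commas so only digits remain
df |> mutate(income = str_remove_all(income, "\\$|,"))

# readr::parse_number("$32,000") is an alternative that extracts 32000 directly
```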
Exercise: remove the unexpected character values from the tch_yrs variable, then review tch_yrs.
Depending on what is needed, there are several functions we can use to change variable types.
Some examples include:
- as.numeric()
- as.character()
- as.Date(), with the lubridate package to assist with converting dates
- janitor::excel_numeric_to_date() can be very helpful at times
It’s important to normalize variables before converting types (especially when converting from character to numeric).
Without normalizing first, converting tch_yrs to numeric introduces unexpected NAs.
Exercise: convert the normalized tch_yrs to numeric, then review tch_yrs.
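A sketch of the conversion, assuming tch_yrs was normalized to digits in the previous step:

```r
library(dplyr)

# convert character to numeric; anything still non-numeric becomes NA (with a warning)
svy_deid |> mutate(tch_yrs = as.numeric(tch_yrs))
```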
Common functions for recoding values are
dplyr::case_when()
dplyr::recode()
tidyr::replace_na()
dplyr::across() allows you to apply the same transformation across multiple columns.
Used inside mutate(), case_when(), recode(), or replace_na() will save over existing variables. You can instead pair them with the dplyr::across() function, which can create new variables with new names.
Exercise: recode the grade variables and the mathanx variable (see the sketch below).
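A sketch of recoding with across(); the mathanx item prefix, the -99 missing code, and the _r suffix are assumptions:

```r
library(dplyr)

# recode -99 to NA across every math anxiety item,
# writing results to new columns ending in _r
svy_deid |>
  mutate(across(starts_with("mathanx"),
                ~ case_when(.x == -99 ~ NA_real_, TRUE ~ .x),
                .names = "{.col}_r"))
```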
The most important function for constructing new variables is dplyr::mutate(). From there, other functions may be required. For today, we are going to use rowSums(), which allows us to create sum scores for each row.
Exercise: construct the gad_sum variable, then review gad_sum.
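A sketch of the sum score; the item names gad1 through gad7 are assumptions:

```r
library(dplyr)

# sum the GAD item responses within each row;
# by default rowSums() returns NA for any row with a missing item
svy_deid |> mutate(gad_sum = rowSums(across(gad1:gad7)))
```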
- Explore values: dplyr::count(), janitor::tabyl(), ggplot2
- Create codebooks: codebookr, memisc, sjPlot
- Run validation tests: pointblank, validate, assertr, dataquieR
Here we are using the pointblank package to develop some validation tests
library(pointblank)

# create an agent, define the validation rules, then interrogate the data
create_agent(df) |>
  rows_distinct(columns = vars(stu_id)) |>     # no duplicate student IDs
  col_vals_not_null(columns = vars(stu_id)) |> # no missing student IDs
  col_vals_between(columns = vars(stu_id), left = 300, right = 500, na_pass = FALSE) |>
  col_is_numeric(columns = vars(age, test_score)) |>
  col_vals_between(columns = vars(test_score), left = 0, right = 500, na_pass = TRUE) |>
  interrogate()
Similar to importing data, we have many options for exporting our data depending on the format we want to export to.
- openxlsx::write.xlsx()
- haven::write_sav(), haven::write_dta(), haven::write_sas()
- readr::write_csv()
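For example, exporting the cleaned file as a CSV (the svy_clean object and output path are assumptions):

```r
library(readr)

# write the cleaned data to the data folder
write_csv(svy_clean, "data/sample_tch_svy_clean.csv")
```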
✔️ Allows your data to be correctly interpreted and used
✔️ Allows your work to be reproducible
✔️ Allows your work to be findable
README
Data Cleaning Plan
Data dictionary
Codebook
🌟 Please provide feedback on this workshop using this short survey.
https://forms.gle/qVTvVgP8nafbAuXT8
🌟 Stay connected!
https://cghlewis.com/
https://www.linkedin.com/in/crystal-lewis-922b4193/
https://github.com/Cghlewis