| Time | Topic |
|---|---|
| 8:45 - 9:00 | Intro/Logistics |
| 9:00 - 9:45 | Fundamentals of Data Organization |
| 9:45 - 10:30 | Standardized Data Cleaning Checklist |
| 10:30 - 10:45 | Break |
| 10:45 - 12:30 | Data Cleaning Functions |
| 12:30 - 12:45 | Documentation for Data Sharing |
Materials
Cghlewis.github.io/ncme-data-cleaning-workshop/
Exercises
Log in to the Posit Cloud workspace: https://posit.cloud/content/7872027.
If Posit Cloud doesn’t work, download materials locally:
Feel free to interrupt me with questions/comments at any time ✋.
Get up and move around as much as you need to 🚶.


Analyzable
Interpretable
Complete
Valid
Accurate
Consistent
De-identified

What data quality issues do you detect for the analyzable indicator?
01:00
Q4-style variable names are not meaningful (use gender instead of Q1).
Variable names should not contain special characters such as . or -.
A naming convention such as wave_responder_scale# (w1_t_mast1) keeps names analyzable.
What data quality issues do you detect for the interpretable indicator?
01:00
What data quality issues do you detect for the complete indicator?
01:00
Values adhere to our planned variable types (numeric, character, date).
Values fall within our expected ranges (e.g., 1-5).
What data quality issues do you detect for the valid indicator?
01:00
AGE/YR does not adhere to our planned variable type. Scores fall out of our expected range.
What data quality issues do you detect for the accurate indicator?
01:00
TEACHING LEVEL and SCHOOL contain inaccurate values.
Variable values are consistently measured, formatted, or categorized within a column.
Variables are consistently measured across collections of the same form
What data quality issues do you detect for the consistent indicator?
01:00
GENDER values are not consistently categorized.
🚨 Consider this in the context of risk.
What data quality issues do you detect for the de-identified indicator?
01:00
The number one way to reduce data errors is to make a plan before you collect data
Correct data at the source
Import the raw data
Review the raw data
Find missing data
Adjust the sample
De-identify data
Drop irrelevant columns
Split columns
Rename variables
Normalize variables
Standardize variables
Update variable types
Recode variables
Construct new variables
Add missing values
Validate data
Save clean data
| Source | Resource |
|---|---|
| Alena Filip | Table 2 provides pros and cons of various de-identification methods |
| J-PAL | Table 3 provides a list of direct and indirect identifiers and recommended removal methods |
| Schatschneider et al. | Deidentifying Data Guide |
When you export your files, it’s important to name them consistently and clearly.
Avoid spaces and special characters in file names (though - is allowed).
Which gives you a better idea of what is in the file? 🤔
15:00
Image from RStudio User Guide
R Script File
R Markdown File
To learn more about a function and its arguments, type ?functionname in your console (e.g., ?mean shows mean(x, trim, na.rm)).
| Function | Task |
|---|---|
| View() | View object |
| str() | Display internal structure of an object |
| c() | Combine elements |
| class() | Check the class/type of an object |
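For example, a quick sketch of these base functions in action:

```r
# c() combines elements into a vector
x <- c(1, 2, 3)

# class() returns the class/type of an object
class(x)  # "numeric"

# str() displays the internal structure of an object
str(x)    # num [1:3] 1 2 3

# View(x) opens the object in a spreadsheet-style viewer (in RStudio)
```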
We can also use functions that live in packages, which we can install on our computer.
Once installed, there are two ways to call a package's functions, as shown below.
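A minimal sketch of both approaches, using dplyr and the built-in mtcars data:

```r
# Option 1: load the package once, then call its functions directly
library(dplyr)
glimpse(mtcars)

# Option 2: prefix the function with the package name (no library() call needed)
dplyr::glimpse(mtcars)
```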
2014+ magrittr pipe %>%
2021+ (R ≥ 4.1.0) native R pipe |>
Isabella Velásquez’s blog post Understanding the native R pipe |> (2022)
To turn on the native pipe:
Tools → Global Options → Code → Editing → Use Native Pipe Operator
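A quick comparison of the two pipes, using dplyr::count() on the built-in mtcars data:

```r
library(dplyr)

# magrittr pipe (re-exported by dplyr and the rest of the tidyverse)
mtcars %>% count(cyl)

# native R pipe (R >= 4.1.0), no package required
mtcars |> count(cyl)
```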
Go to the data folder and download “sample_tch_svy_raw.xlsx” to view the file on your computer.
05:00
Common data importing functions
read_csv(), read_delim() from the readr package
read_excel() from the readxl package
read_sav(), read_sas(), read_dta() from the haven package
Learn more about importing multiple files at once here
Which function should we use to read in our sample data? 🤔
read_excel() has several arguments.
Type ?read_excel in your console to see more arguments. If you get stuck, you can open “solutions.Rmd”.
02:00
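If you want a starting point, here is a minimal sketch of the import; the sheet and na arguments are illustrative, not the exercise solution:

```r
library(readxl)

# Read the first sheet of the sample file; treat blanks and "NA" as missing
svy_raw <- read_excel(
  "data/sample_tch_svy_raw.xlsx",
  sheet = 1,
  na = c("", "NA")
)
```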
There are several functions we can use to explore data
dplyr::glimpse()
skimr::skim()
base::summary()
summarytools::dfSummary()
Hmisc::describe()
For our sample data we will use dfSummary().
03:00
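For example, a sketch of exploring the imported data (assuming it was saved as svy_raw):

```r
library(summarytools)

# One-page summary of every variable: type, frequent values, missingness
dfSummary(svy_raw)

# Quick alternatives
dplyr::glimpse(svy_raw)
summary(svy_raw)
```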
Open the accompanying documents in the docs folder.
05:00
There are two key functions we can use to both identify and remove duplicates in our data
janitor::get_dupes()
dplyr::distinct()
Replace id_vars with your unique identifier, e.g., c(first_name, last_name).
For our sample data, replace id_vars with our unique identifier/s: the tch_id number.
03:00
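A sketch of both steps, assuming svy_raw holds our imported data:

```r
library(dplyr)
library(janitor)

# Flag any rows that share the same tch_id
svy_raw |> get_dupes(tch_id)

# Keep one row per tch_id, retaining all other columns
svy_dedup <- svy_raw |> distinct(tch_id, .keep_all = TRUE)
```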
The functions used here will depend on what is required.
Examples of functions you might use:
dplyr::select() to drop variables
dplyr::case_when() or dplyr::recode() to collapse categories/recode values
dplyr::*_join() to merge in study unique IDs
stringr::str_remove_all() or stringr::str_replace_all() to redact character values
For our sample data we are going to use the following:
dplyr::case_when() to recategorize names into our unique study ID values, and
dplyr::select() to drop identifying variables
To learn more about case_when(), type ?case_when in your console.
There is also a newer function, case_match(), that is worth looking into. It reduces repetition in the syntax.
Do the same for the sch_id variable.
03:00
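A minimal sketch of this pattern; the name column (t_name) and ID values are hypothetical, not the exercise solution:

```r
library(dplyr)

# Replace names with study IDs, then drop the identifying column
svy_deid <- svy_dedup |>
  mutate(tch_id = case_when(
    t_name == "Smith" ~ 105,
    t_name == "Lopez" ~ 106
  )) |>
  select(-t_name)
```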
Two functions we can use to rename variables
dplyr::rename()
purrr::set_names()
For our sample data we will use set_names().
03:00
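A sketch of both approaches; the variable names here are hypothetical:

```r
library(dplyr)
library(purrr)

# rename() takes new_name = old_name pairs
svy <- svy_deid |> rename(years_teaching = tch_yrs)

# set_names() replaces every column name at once; the vector length
# must match the number of columns
svy <- svy_deid |> set_names(c("tch_id", "grade", "tch_yrs"))
```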
There are several functions that can help us remove unexpected values from our variables.
Some examples of those include:
stringr::str_remove_all()
stringr::str_replace_all()
readr::parse_number()
Say this is our data
# A tibble: 3 x 2
id income
<dbl> <chr>
1 1 $32,000
2 2 120000
3 3 $45,000
Remove character values with stringr::str_remove_all()
# A tibble: 3 x 2
id income
<dbl> <chr>
1 1 32000
2 2 120000
3 3 45000
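The call that produces the cleaned output above might look like this, assuming the tibble is named df:

```r
library(dplyr)
library(stringr)

# Strip dollar signs and commas; income is still character afterwards
df |> mutate(income = str_remove_all(income, "[$,]"))
```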
First review the values in the tch_yrs variable. Then remove the unexpected characters from tch_yrs and review tch_yrs again.
02:00
Depending on what is needed, there are several functions we can use to change variable types.
Some examples include:
as.numeric()
as.character()
as.Date()
The lubridate package can assist with converting dates, and janitor::excel_numeric_to_date() can be very helpful at times.
It’s important to normalize variables before converting types (especially when converting from character to numeric).
Our data without normalizing
Convert tch_yrs to numeric. Check the class of tch_yrs before and after converting.
01:00
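A sketch using the income example from earlier; note the NA produced when we skip normalizing:

```r
library(dplyr)
library(stringr)

# Without normalizing, coercion silently produces NA (plus a warning)
as.numeric("$32,000")  # NA

# Normalize first, then convert
df |>
  mutate(income = str_remove_all(income, "[$,]"),
         income = as.numeric(income))
```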
Common functions for recoding values are
dplyr::case_when()
dplyr::recode()
tidyr::replace_na()
dplyr::across() allows you to apply the same transformation across multiple columns
Using case_when(), recode(), or replace_na() directly will save over existing variables.
Here we add the .names argument to the dplyr::across() function. This creates new variables with new names.
Recode the grade variables.
Then recode the mathanx variable.
02:00
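A minimal sketch of this pattern; the recoded values are hypothetical:

```r
library(dplyr)

# Recode every column whose name starts with "grade"; .names writes
# the results to new columns instead of overwriting the originals
svy <- svy |>
  mutate(across(
    starts_with("grade"),
    ~ case_when(.x == 1 ~ 0,
                .x == 2 ~ 1),
    .names = "{.col}_recode"
  ))
```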
The most important function for constructing new variables is dplyr::mutate()
From there, other functions may be required. For today, we are going to use rowSums(), which allows us to create sum scores for each row.
Use mutate() and rowSums() to create the gad_sum variable, then review gad_sum.
01:00
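A minimal sketch, assuming the items are named gad1 through gad3 (hypothetical names):

```r
library(dplyr)

# Sum the GAD items into a total score for each row
svy <- svy |>
  mutate(gad_sum = rowSums(across(gad1:gad3)))
```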
dplyr::count(), janitor::tabyl() for tabular summaries
ggplot2 for visual checks
codebookr, memisc, sjPlot for codebooks
pointblank, validate, assertr, dataquieR for dedicated validation
Here we are using the pointblank package to develop some validation tests.
create_agent(df) |>
rows_distinct(columns = vars(stu_id)) |>
col_vals_not_null(columns = vars(stu_id)) |>
col_vals_between(columns = vars(stu_id), left = 300, right = 500, na_pass = FALSE) |>
col_is_numeric(columns = vars(age, test_score)) |>
col_vals_between(columns = vars(test_score), left = 0, right = 500, na_pass = TRUE) |>
  interrogate()
05:00
Similar to importing data, we have many options for exporting our data depending on the format we want to export to.
openxlsx::write.xlsx()
haven::write_sav(), haven::write_dta(), haven::write_sas()
readr::write_csv()
02:00
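For example, a sketch of exporting the clean data (file names are hypothetical):

```r
library(readr)

# Write the clean data to a CSV in the data folder
write_csv(svy, "data/sample_tch_svy_clean.csv")

# Or export for SPSS users, preserving variable labels
haven::write_sav(svy, "data/sample_tch_svy_clean.sav")
```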
✔️ Allows your data to be correctly interpreted and used
✔️ Allows your work to be reproducible
✔️ Allows your work to be findable
README
Data Cleaning Plan
Data dictionary
Codebook
🌟 Please provide feedback on this workshop using this short survey.
https://forms.gle/qVTvVgP8nafbAuXT8
🌟 Stay connected!
https://cghlewis.com/
https://www.linkedin.com/in/crystal-lewis-922b4193/
https://github.com/Cghlewis

Data Cleaning for Data Sharing // Cghlewis.github.io/ncme-data-cleaning-workshop/