You can view slides from this talk here.
In this training I continue to cover establishing systems that help make a project successful. These structures are the backbone of your project; without them in place, project staff face many headaches, the confidentiality of your data can be compromised, and your data may even become unusable.
If you are collecting your own original data as part of your study, for example a randomized controlled trial study, data management best practices should be interwoven throughout your data collection process. I will discuss the role of data management in data collection instrument design, tracking of participants and data collection, as well as data storage and security. I will not go into the minutiae of project management, including things such as recruiting participants, consenting participants, training data collectors, or scheduling data collection, as those are less tied to data management and more aligned with project coordination.
However, I will note that it is important that project coordinators and PIs work with a data manager to develop the language used in a consent form. Most, but not all, education research that is anonymous or de-identified falls under the exempt Institutional Review Board (IRB) category (minimal risk), allowing you to write your own consent form rather than using an IRB template. If you plan to share your data upon conclusion of your project, either via a repository or your own data request system, you will want to make sure your consent form has clear language about your intent to share your de-identified data. If applicable, also include language regarding your intent to collect identifying information in a linking key table for tracking purposes that will be stored separately from your de-identified data. Shero and Hart from Florida State University have a great Informed Consent Template.
Meyer (2018) has several helpful dos and don’ts for language to use in your consent form, including:
✔️ Don’t promise to destroy your data (unless your funder explicitly requires it)
✔️ Don’t promise to not share data
✔️ Do get consent to retain and share data
✔️ Do incorporate data-retention and sharing clauses into IRB templates
✔️ Do be thoughtful when considering risks of re-identification (ex: small sample size for sub-groups)
✔️ Don’t promise that research analyses of the collected data will be limited to certain topics
Other helpful consent resources:
📑 Within & Between podcast
📑 University of Pittsburgh
📑 University of Guelph
📑 ICPSR
📑 Shero and Hart: Working with your IRB
Final note before diving into this content: Before any project begins, all data collection instruments and protocols must be submitted to an Institutional Review Board (IRB) for approval. The IRB is a formal organization designated to review and monitor human participant research and to ensure that the welfare, rights, and privacy of research participants are maintained throughout the project. Some of the systems I cover throughout this series will be vetted by an IRB (ex: original data collection), others will not (ex: documentation, style guide). This training will not cover the ins and outs of the IRB, but I wanted to note that while this training provides many suggestions for setting up your data collection systems, you must always have the required forms approved by your IRB before moving forward with original data collection.
When it comes to the intersection of data management and data collection, there is a lot to consider. If at all possible, take some planning time to create data collection instruments and procedures that keep your data secure and valid, minimize errors, and relieve future data cleaning headaches. I am going to cover some common instruments created to collect original data, and ways you may be able to make those instruments more streamlined and efficient.
Quick note: Everything you read below is from my personal experience, as well as a summary of what I have heard while interviewing other researchers. As such, I may be missing even better ways to set up these systems. I am open to feedback!
Research teams may be restricted in how they collect their survey data due to limited resources, research design, or even the population being studied. However, if at all possible, I highly recommend collecting surveys using software/web-based tools that directly feed into a table or database rather than through paper forms. The reasons are:
Note: With the COVID-19 pandemic in 2020, I think many of us doing education research have had to make a huge switch, very quickly, to collecting data electronically and remotely. This is yet another reason to create/utilize web-based data collection instruments: to prepare for the unknown and provide flexibility to collect data even when you aren’t able to physically be in schools.
There are many web-based survey platforms (Ex: Qualtrics, Survey Monkey, Google Forms, Microsoft Forms, SurveyCTO) you can use to develop surveys. If those existing platforms do not meet your needs, you can always build your own web-based application to collect data.
No matter what platform you use, building your survey requires some up-front planning. First and foremost, make sure to build a survey that is both valid and reliable. Second, take time to think about how the data you collect will be translated into a database. Remember, every answered question is stored in a database within the platform that you will later download to a file. Here are several suggestions to adhere to when creating your survey that will provide you with more accurate data and reduce future data management headaches (while I will not provide a tutorial for how to implement each suggestion, they are all possible in a platform such as Qualtrics):
📑 Pew Research and Qualtrics both provide comprehensive best practices in questionnaire design.
If your site or participants do not have WiFi and/or email access, you generally have a few options.
One option is to work with a survey platform (such as Qualtrics or SurveyCTO) that allows you to create a survey online, administer it using an offline survey app on your phone or tablet, and then upload that data back to the survey platform once you have an internet connection again.
Still, depending on your study, participants, and resources, building a web-based survey may not be possible and/or it may not be the best option for your project. So your second option is a paper survey.
For the same reasons I recommend web-based tools for surveys, I also recommend web-based tools for assessments.
If a web-based option already exists for your assessment, the obvious answer is simply to use it. For example, the Renaissance STAR assessment is administered online.
However, if your assessment is only available in paper form, consider converting it to a web-based form if at all possible without altering the integrity of the assessment. You can do that in several ways:
Once data is collected, proceed with scoring as usual, following the guidelines of the specific assessment.
If there are no web-based options, or connecting to WiFi is not an option for your project, then of course, keep with the paper assessments and follow the guidelines of the specific assessment. If the assessment requires any manual scoring (such as the Woodcock Johnson), be sure to implement an error checking system (such as the one mentioned in the Offline Surveys section).
In-person classroom observations, whether qualitative or quantitative, can be collected using a web-based tool.
Observation forms can be built into an online survey platform, or into your own application, that data collectors can access on their phones, tablets, or laptops in the field. If the observation has duration codes, you can build these into your app, or there are existing applications with built-in timers. If the observation simply needs to last X minutes, use a web-based form and have observers set timers on their phones, similar to what they may do using a paper form.
If WiFi is not available to you in the field, consider making an electronic form that still eliminates the need for data entry (so not a paper form) and that ideally connects to or creates some type of database or table. Then have data collectors enter data into this form on a device in the field. This does not have to be high-tech. It could simply be:
Collecting electronic data in the field becomes tricky when you have people simultaneously using tools that aren’t connected to a shared database (such as separate Excel files or Access databases). When you use an online/offline survey platform, everyone’s data feeds into the same database/table. However, when you are using separate Excel files, for instance, everyone is storing data on their own device. This requires you to set up a way for collectors to share files (ex: Dropbox, Box, or whatever your institution deems a secure system) and then for you to merge those files together across data collectors, as sketched below. It is definitely trickier, although possibly still a more efficient and reliable method than paper forms and hand entry of data.
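The eventual merge can be fairly painless once each collector’s file has been imported into a shared database as its own table (something Access, for example, can do with Excel files). As a minimal sketch, assuming one hypothetical table per data collector with identical columns, the files stack with a single query:

```sql
-- Stack the imported observation tables from each data collector
-- into one table for the project (table names are illustrative).
SELECT * FROM obs_collector_a
UNION ALL
SELECT * FROM obs_collector_b
UNION ALL
SELECT * FROM obs_collector_c;
```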
All of the above assumes observations completed in the field. Of course, if you record observations and code them from the luxury of your office, then the possibilities are wide open, and I would recommend following any of the above methods that connect to a shared database, such as the online observation form, to collect all data seamlessly in one location.
I don’t work much with qualitative data, so I won’t say much here. Interviews and focus groups are typically recorded (at least audio), whether they are collected in person or online. If you transcribe (some people simply take notes), transcription is either outsourced to a company, done in-house, or done using transcription software, and the transcript is usually kept in a file such as a Word document or Google Doc. For this type of qualitative data, I have no preference for how it is collected and transcribed, as long as the transcription is accurate and the format is appropriate for any analysis software you plan to use.
If you receive any non-public, confidential, secondary data sources, such as school district records, make sure you ask for them in a format that is usable to you. For instance, I have received school district records in PDF form, and while they can be made usable after much data wrangling, it is much easier to request this information in a tabular format.
Most studies require researchers to collect consent and/or assent from participants. Informed consent is collected from any participant 18 years or older. If a student is under 18 years old, parent consent as well as student assent is collected. Consents and assents are typically collected in one of two ways.
Last, I want to mention that it is also possible to build all, or almost all, of your study into one tool (depending on your study). Using a tool like Qualtrics, or building your own tools in applications like R Shiny, you can build your consent, randomization, study ID assignment, survey, and assessment all into one tool that participants can access through one simple link. Lucy D’Agostino McGowan has a great example in her slides here.
If you end up having to use paper forms to collect any of your data and need to manually enter that data, then there are several steps you can take to minimize error in that process.
📑 Reynolds and Schatschneider from Florida State University have some great suggestions here for setting up a data entry station and protocol for any manual data entry that needs to occur.
📑 Data Management Episodes 1 and 2 of the Within & Between podcast have some great examples of how to set up a data entry protocol.
📑 While I haven’t read The Practice of Survey Research, it appears to have an entire chapter devoted to data entry that may be worth reading.
📑 Read Data Organization in Spreadsheets to understand how data should and should not be organized if you use a spreadsheet tool for data entry.
📑 Barchard & Verenikina have an article: Improving data accuracy: Selecting the best data checking technique
📑 Barchard & Pace have another article: Preventing human error: The impact of data entry methods on data accuracy and statistical results
The simplest way to maintain participant privacy is to collect anonymous data. If you collect anonymous data, no participant identifying information should be collected in your instruments (name, email, date of birth, etc.). You will also want to make sure to not collect IP addresses as they can be used to identify an individual’s computer. It’s also important to recognize that if you collect anonymous data, you will not be able to link data across measures or across time.
More commonly in education research you will be collecting some identifiers (ex: name). However, in order to maintain participant confidentiality you cannot keep those identifiers in your research data. You will need to assign unique IDs to all participants and replace all identifying information with that study ID. In longitudinal studies in particular, you will be using that unique participant ID to link data across time and across measures.
Quick thought: Identifiers
It is important to remember that you must have some sort of identifier built into every data collection instrument (survey, observation form, assessment, etc.) in order to link data across measures and time. This might mean having participants enter an identifier, like their name, into a form, or having a study ID already linked to the form in some way. You just need to make sure you aren’t accidentally collecting anonymous data; if you do, you will not be able to link your data across measures and time.
Even if you are purposefully collecting anonymous data, if you have randomized participants, say schools, you will still need an identifier for the randomization block/cluster built into your instrument (ex: sch_id) or you will not be able to cluster based on that information.
When sending out or completing in-person web-based surveys/assessments, you can do one of the following to connect your data to a participant (this is not all-inclusive):
This image shows what a data de-identification system looks like where identifiers are collected and need to be removed for the analysis dataset. Table 1 would be the incoming survey data with identifiers, Table 2 would be your Participant Database (which I talk about more below), and Table 3 is your de-identified analysis dataset.
Source: J-PAL
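In database terms, that de-identification step is just a join: attach study IDs from the participant database to the incoming survey data, then keep only the study ID and the survey responses. Here is a minimal SQL sketch, with hypothetical table and field names:

```sql
-- Table 1 (raw_survey) holds responses with names; Table 2
-- (participant_db) maps names to study IDs. The result is Table 3:
-- study IDs plus responses, with all direct identifiers dropped.
SELECT p.stu_id, s.item_1, s.item_2, s.item_3
FROM raw_survey AS s
INNER JOIN participant_db AS p
    ON s.first_name = p.first_name
   AND s.last_name  = p.last_name;
```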
Ultimately, which method you use for web-based links will depend, first, on the data collection instrument. For instance, an online assessment may only have one link, so option #2 or #3 above may be your only choice. However, for online surveys, any option is possible. Second, it will depend on your method of dissemination. If you are sending survey links out yourselves to teachers, option #1 is going to be the best method. If a school system will not give you student emails, and you need to have teachers send out links to students, #3 may be your best option. That method reduces the burden on teachers (having to send multiple links in #1) and potentially reduces error (asking teachers to track multiple links or multiple IDs to send to students in #1 and #2).
Quick thought: Public links
All of the options above assume you are making a private link that you are only sending to existing or potential study participants (i.e.: students in a classroom, teachers in a school). However, there may be times you need to publicly recruit and collect data for your study. I’ve been reading about the threat of bots with public online surveys and the havoc they can wreak on a research study. If you need to use a public link, some suggestions for securing your survey include:
But even with these additions to the survey, you will want to check your data thoroughly before analyzing it and before providing payments to participants. Look for things such as participants who sped through the survey too fast, inconsistent answers on questions, nonsensical answers, timestamps that don’t make sense, or identical surveys.
Ultimately, the recommendation is to not have public survey links, and to only use unique links for participants, even if this means extra work: for example, posting a public link to a screener and then, after participants are verified through the screener, sending them a private link.
Resources:
📑 Melissa Simone, Behavioral Scientist
📑 Melissa Simone, STAT
📑 Cloud Research
If you take paper forms into the field, consider doing the following to connect your data to a participant:
Source: Poverty Action Lab
Resources:
A participant database is often called a “master list” or “key” by the IRB, as this is the only database that links your participants’ true identities to their study IDs. As you recruit and consent participants, you will record their name, assign them a study ID, and enter any other necessary identifying information (ex: email) into the participant database.
In addition to recording a participant’s name and assigned study ID in this database, this is also the location you will track any and all activities that occur with those participants.
There are many pieces of information to document/track for a research study including:
And all of this needs to be documented across time (waves and cohorts) and space (classrooms and sites).
A thorough and complete participant database is vital to:
While the coordinator typically oversees the updating of this database, it is extremely helpful to consult with a data manager when setting up this system to make sure you include all relevant fields and to make sure the database is understandable when someone needs to verify information.
In a nutshell, the participant database will either be set up in a database (a series of tables), or a series of spreadsheets. At the most basic level, you will create one table/sheet per entity in your study (entities being participants, sites, districts, etc.). Example: Student table, Teacher table, School table, District table.
A participant database can be set up in a program such as:
The possibilities really are endless and it all depends on what your team has access to and how tech-savvy your team is. Some systems require specific programming knowledge (ex: setting up a SQL database). I personally prefer using a relational database system, which allows relating tables to one another (a schema), eliminating redundant data. Without getting too technical, database normalization increases performance, decreases storage, and makes it much easier to make updates to tables as changes occur. There are many options out there, but Access is just one example of a tool that allows you to relate tables and allows some form of querying.
Consider this first structure below, with 3 very simple tables (a student table, a teacher table, and a school table). Each table has a primary key that makes individuals within that table unique, and each table can be connected through a foreign key. For example, in the student table, the primary key is stu_id and the foreign key is tch_id, which connects students to the teacher table. Using a query language (such as SQL) in systems such as Access, we can pull multiple tables together ad hoc to make a table with all the pieces of information we need.
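As a rough sketch of what that schema could look like (the data types and a few field names are my own assumptions, based on the roster query below), the three tables might be defined in SQL like this:

```sql
-- Each table has a primary key; Student and Teacher also carry
-- foreign keys linking them up to the next table in the hierarchy.
CREATE TABLE School (
    sch_id   INTEGER PRIMARY KEY,  -- primary key: unique school
    sch_name TEXT
);

CREATE TABLE Teacher (
    tch_id      INTEGER PRIMARY KEY,  -- primary key: unique teacher
    first_name  TEXT,
    last_name   TEXT,
    grade_level INTEGER,
    sch_id      INTEGER REFERENCES School (sch_id)  -- foreign key to School
);

CREATE TABLE Student (
    stu_id     INTEGER PRIMARY KEY,  -- primary key: unique student
    first_name TEXT,
    last_name  TEXT,
    tch_id     INTEGER REFERENCES Teacher (tch_id)  -- foreign key to Teacher
);
```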
Note: Not all relational databases require technical skills and coding. Systems such as QuickBase build relations between tables and allow querying through their point-and-click, low-code application.
Say, for example, we needed to pull together a roster for each teacher. We could easily run a query, such as this SQL query, that joins the student and teacher tables by tch_id and then pulls the relevant teacher and student information from both tables:

```sql
SELECT Student.first_name, Student.last_name,
       Teacher.first_name AS t_f_name,
       Teacher.last_name  AS t_l_name,
       Teacher.grade_level
FROM Student
INNER JOIN Teacher ON Student.tch_id = Teacher.tch_id
ORDER BY Teacher.last_name, Teacher.first_name,
         Student.last_name, Student.first_name;
```
This would produce a roster like this:
| first_name | last_name | t_f_name | t_l_name | grade_level |
|------------|-----------|----------|----------|-------------|
| Johnny     | Rose      | Stevie   | Budd     | 3           |
| Moira      | Rose      | Stevie   | Budd     | 3           |
| Jocelyn    | Schitt    | Stevie   | Budd     | 3           |
| Patrick    | Brewer    | Twyla    | Sands    | 4           |
| Ray        | Butani    | Twyla    | Sands    | 4           |
| Ted        | Mullens   | Twyla    | Sands    | 4           |
| Alexis     | Rose      | Twyla    | Sands    | 4           |
Now consider these 3 tables below that are not relational (such as 3 tabs in an Excel spreadsheet). Since we are unable to set up a system that links these tables together, we need to enter redundant information into each table (such as teacher or school name) in order to see that information within each table without having to flip back and forth across tables to find the information we need.
Using a relational structure allows us to eliminate redundant data. This not only saves us time and energy but reduces errors as well. You can imagine how useful that is.
Ultimately though, set up this participant database in whatever system works for you. It just needs to be set up. This system is vital to protecting the confidentiality of your participant data, as well as to record keeping for all of the data you collect. If it is easier for you and your project, you can build these tables as individual spreadsheets in a workbook rather than as relational tables. If you do, remember that it will be more difficult to query information, and you will most likely want to include duplicate information across tables (teacher and school name in your student table, for instance). This will prevent you from having to flip back and forth across sheets.
Relational database resources:
📑 Database vs. Spreadsheet
📑 7 Excel Spreadsheet Problems
📑 Jenna Jordan has a great example of how and why to build a relational database
📑 QuickBase explanation of table-to-table relationships
📑 Relating tables in Google Tables
📑 Airtable vs Google Sheets
As you build your tables/sheets, you will need to decide what fields to include. As you create those fields, consider the following:
Fields to include at the beginning of your study:
Quick thought: Study IDs
As you recruit and consent participants/sites, you will add them to a participant database under an assigned study ID (ex: stu_id, tch_id). These IDs allow participants/sites to remain confidential in your research data. That ID (typically a 2-6 digit random numeric or alphanumeric value) will follow that individual/site throughout the life of the study and should be unique to that individual/site. This number never changes.
Depending on your study design, some participants may have the opportunity to be re-recruited into the study more than once. For example, we had a study where three cohorts of 5th grade students were recruited over three years. Therefore, the same 5th grade teachers were often recruited back into the study each cohort. If you have a similar study design, you still keep that same study ID (in this case the same tch_id) for that participant. If you want to identify the unique wave or cohort that a participant is brought back into the study, make sure you have other variables in your data that help you understand why a study ID occurs in your data more than once (ex: a cohort variable). The study ID concatenated with the cohort provides you with a unique row identifier.
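If your participant database lives in a relational system, you can even have it enforce that rule for you. Here is a sketch, assuming a hypothetical tracking table with one row per teacher per cohort (the table and field names are illustrative):

```sql
-- tch_id alone repeats across cohorts, but the combination of
-- tch_id and cohort must be unique, so it serves as the primary key.
CREATE TABLE TeacherByCohort (
    tch_id         INTEGER REFERENCES Teacher (tch_id),
    cohort         INTEGER,
    date_consented TEXT,
    PRIMARY KEY (tch_id, cohort)
);
```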
Jenna Jordan has a great blog post on UIDs.
After your participants are recruited and you begin data collection, you can imagine there are many pieces of information to track such as incentives/payments and data collection completion. These fields will also need to be added to your participant database.
Fields to track over time (for each wave of data collection):
This data collection information/tracking may seem trivial, but it is vitally important to the integrity of your research study for these reasons:
The participant database is what holds the project together. I cannot stress this enough.
Quick thought: Tracking Best Practices
A few best practices for tracking that improve project coordination and data management:
For the purposes of this database, there are two methods of entering data that are probably the most common.
There are many other ways to load data into a database but for this use case (tracking participation in an education research study) these are probably the most common methods. However, you can read about several other methods here.
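Whichever method you choose, in a SQL-based system a hand-entered record ultimately amounts to an INSERT statement, whether typed directly or generated behind the scenes by a data entry form. A sketch with illustrative values:

```sql
-- Add a newly consented teacher to the participant database
INSERT INTO Teacher (tch_id, first_name, last_name, grade_level)
VALUES (210, 'Ronnie', 'Lee', 3);
```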
Also, more recently I’ve heard people mention the idea of automating at least some data entry/tracking through the use of unique scannable codes, or through integration of your online data collection platform and your tracking database (ex: Qualtrics and QuickBase). Setting up this type of linking most likely requires some technical skill. But it’s a cool idea and may reduce at least some of the error that occurs in manually tracking data. For example, by creating a unique QR code/barcode (linked to a participant ID) that is added to every paper survey, similar to what is used for tracking inventory, you can scan that code when data collection is complete to “check in” a survey to your participant database. Or, by integrating your survey platform and participant database, you could add new participants to your database as they complete their consents and surveys online.
QR code/barcode resources:
📑 Creating Barcodes in Excel
📑 Scanning Barcodes into an Excel Spreadsheet
📑 Creating Barcodes in Word
📑 Scanning QR codes into Google Sheets
📑 Creating Barcodes in Access
📑 Scanning Barcodes into Access
Integration resources:
📑 Examples of Qualtrics Integration
📑 Examples of QuickBase Integration
There are two issues you may want to consider when developing these systems:
At the most basic level you will have one table per entity (student, teacher, etc.), as we discussed. However, you can imagine these tables/sheets can become very wide if you are collecting data on a lot of measures or many waves of data collection and very long if you are collecting data on a lot of participants over several cohorts. If this is the case, you may want to consider building a more complex schema (such as different tables for each time period or for each cohort) that again link together through primary and foreign keys. You can read more about creating a schema here.
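As a sketch of one such schema (the table and field names are hypothetical), each wave’s tracking could live in its own table, tied back to the student table through stu_id:

```sql
-- Keeps the student table narrow; each wave of tracking fields
-- lives in its own table, joined back to Student by stu_id.
CREATE TABLE Wave1Tracking (
    stu_id          INTEGER PRIMARY KEY REFERENCES Student (stu_id),
    survey_complete INTEGER,  -- 0 = no, 1 = yes
    incentive_paid  INTEGER   -- 0 = no, 1 = yes
);

CREATE TABLE Wave2Tracking (
    stu_id          INTEGER PRIMARY KEY REFERENCES Student (stu_id),
    survey_complete INTEGER,
    incentive_paid  INTEGER
);
```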
Another issue that may occur is, if team members other than the project coordinator will be tracking data collection, you may want to limit their access to participant identifying information. If this is true, similar to above, you may consider building a more complex schema where you create separate tables for participant demographic/identifying information and tables with only participant ID and data collection tracking, that again can be linked through primary and foreign keys.
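A sketch of that split (again with hypothetical names): identifying information sits in one restricted table, while the table most staff touch holds only study IDs and tracking fields, with the two linked by stu_id:

```sql
-- Restricted table: only the coordinator/data manager can open it.
CREATE TABLE StudentIdentifiers (
    stu_id     INTEGER PRIMARY KEY,
    first_name TEXT,
    last_name  TEXT,
    email      TEXT
);

-- Tracking table: staff see study IDs only, never names or emails.
CREATE TABLE StudentTracking (
    stu_id         INTEGER PRIMARY KEY REFERENCES StudentIdentifiers (stu_id),
    wave1_complete INTEGER,
    wave2_complete INTEGER
);
```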
Other overall preferences for your participant database include:
Although data storage is our next topic, I want to mention now that this participant database clearly contains identifiable and protected information. Therefore, it should be stored securely and apart from all other study data. This is the only file that directly links your participants’ true identities to your confidential study data, and it should never be stored in the same folder as your study data. Additionally, it should have limited access. For security purposes, only those who need access, such as the project coordinator, data manager, and any other staff assisting with tracking, should have access to this file.
Resources on securing a participant database:
📑 University of Guelph
📑 Poverty Action Lab
Last, most of what I have covered here is relevant for studies where you plan to de-identify your data. If you are collecting anonymous data, you do not need a participant database per se, because you are not collecting any identifying information on your participants. With that said, you may still wish to (and probably should) set up a simple system to track your data collection efforts.
Whether you are working with your own original collected data, or you are working with extant data provided to you from an entity such as a school district, or publicly available data, you need to consider short-term secure data storage. I say short-term because this is how you will store your data while you are actively collecting data for your project. Data retention/long-term storage, after your study is complete, will be discussed in a later training.
In general you will most likely be working with one of four types of data and will need to store your data according to the type of data:
For this section I am going to generally describe the level of security needed to store your data depending on the type of data. However, I am not a security expert and in an effort to not explain security incorrectly, I would like to refer readers to this document from J-PAL. It is one of the most comprehensive documents I have come across regarding data security.
Because your participant database contains identifiable information about your participants, this data must have the highest security. The specifics of this security will be set by your university or institution. However, the general rules for storing this type of data are:
Your typical study data files are in formats such as .csv, .tsv, .xlsx, .txt, .docx, or a statistical program file such as .sav, .dta, .rds, or .sas. No matter the file type, your study data should all be de-identified/confidential, only including study IDs and no other identifiers. However, there are still some security precautions you should take.
This includes items such as external hard drives, flash drives, or CDs.
A newer type of data collection has emerged since the COVID-19 pandemic: observations, interviews, and focus groups conducted via video conferencing. This data is especially sensitive as it may include names and faces as part of the recording. Again, you will want to refer to your specific institution’s guidelines, but generally:
Note: These same data security rules also apply to data that is recorded in person. Furthermore, any data that is recorded on detachable media will need to follow the detachable media guidelines.
Paper data can be one of the most difficult data types to securely manage and track the storage of. When you have a distributed team, collecting data at many sites, sometimes in different states, it can be difficult to track the path and storage of physical data. You will want to make sure you have a very detailed plan for the storage of your data added to your protocol. Never play fast and loose with your paper data. Losing physical files can lead to breaches in confidentiality, loss of data, and other negative consequences. This holds true even for consents and assents. Even though those forms may not have analyzable data, you never know when your IRB may conduct an audit for participant consent, or if you may need to present them for future data requests with an entity such as a school district. If you lose consents/assents, it can look like you are collecting data on those who did not agree to be in the study.
Again, I am not an IT professional, but some general security rules to keep in mind for all data and devices are:
Resources on data security:
📑 University of Guelph
📑 University of Pittsburgh
📑 University of Michigan
📑 Princeton University
📑 University of Nevada
📑 Pacific University Oregon
📑 Florida International University
📑 DataONE
📑 IPA
📑 Karl Broman
📑 University of Missouri
📑 Foundational Practices of Research Data Management
📑 University of Pittsburgh, RDM
📑 Veeam
📑 University of Connecticut
📑 Brandeis University
📑 Reasons to consider SharePoint for data storage
📑 Reasons to consider Institutional Servers for data storage
No matter where your team stores your files, it is important to develop a logical directory structure. It makes it so much easier to find your files and it facilitates sharing within and outside of your team. You will want to build this structure into your style guide and implement it generally across all of your projects to create cohesion. The most important thing is to allow files to be findable, have a structure that allows you to control user access, and to keep folders shallow enough that you don’t reach a limit on the allowable length of a path name for any file.
At the highest, organizational level, you will want separate folders for each of your projects, as well as an overall Team folder that houses general documents related to your team’s functioning (meeting notes, HR documents, team expectations). You will want to develop a style guide that covers general rules across all folders and house that style guide in an easily accessible location (ex: a Team Wiki and/or a README in each project folder).
Then, within each project folder you will want to build a hierarchy, something like this:
```
  levelName
1 project-name
2  °--life-cycle-folder
3      °--time
4          °--content
5              °--participant
6                  °--archive
```
Level 1: The name of the project
Level 2: General research life cycle folders (Ex: data, documentation, project management, intervention, tracking)
Level 3: For longitudinal studies I tend to like the third level to be time period or grouping sub-folders (Ex: cohort 1, or wave 1)
Level 4: Specific content folders (Ex: raw data, syntax, clean data)
Level 5: Participant specific folder (ex: student, teacher).
Level 6: All previous versions of files can go in an archive folder to reduce clutter
More details on setting up this structure can be found in training_3.
Resources: