You can view slides from this talk here
In this training I continue to cover establishing systems that help make a project successful. These structures are the backbone of your project. Without them in place, you can create many, many headaches for project staff, compromise the confidentiality of your data, and even make your data unusable.
If you are collecting your own original data as part of your study, for example a randomized controlled trial study, data management best practices should be interwoven throughout your data collection process. I will discuss the role of data management in data collection instrument design, tracking of participants and data collection, as well as data storage and security. I will not go into the minutiae of project management, including things such as recruiting participants, consenting participants, training data collectors, or scheduling data collection, as those are less tied to data management and more aligned with project coordination.
However, I will note that it is important that project coordinators and PIs work with a data manager to develop the language used in a consent form. Most, not all, education research that is anonymous or de-identified will fall under the exempt IRB category (minimal risk), allowing you to write your own consent rather than using an IRB template. If you plan to share your data upon conclusion of your project, either via a repository or your own data request system, you will want to make sure your consent has clear language about your intent to share your de-identified data. If applicable, also include language regarding your intent to collect identifying information in a linking key table for tracking purposes that will be stored separately from your de-identified data. Shero and Hart from Florida State University have a great Informed Consent Template.
Meyer (2018) has several helpful dos and don’ts for language to use in your consent form, including:
✔️ Don’t promise to destroy your data (unless your funder explicitly requires it)
✔️ Don’t promise to not share data
✔️ Do get consent to retain and share data
✔️ Do incorporate data-retention and sharing clauses into IRB templates
✔️ Do be thoughtful when considering risks of re-identification (ex: small sample size for sub-groups)
✔️ Don’t promise that research analyses of the collected data will be limited to certain topics
Other helpful consent resources:
Final note before diving into this content: Before any project begins, all data collection instruments and protocol must be submitted to an Institutional Review Board (IRB) for approval. The IRB, a formal organization designated to review and monitor human participant research, ensures that the welfare, rights, and privacy of research participants are maintained throughout the project. Some of the systems I cover throughout this series will be vetted by an IRB (ex: original data collection), others will not (ex: documentation, style guide). This training will not cover the ins and outs of the IRB, but I wanted to note that while this training provides many suggestions for setting up your data collection systems, you must always have required forms approved by IRB before moving forward with original data collection.
When it comes to the intersection of data management and data collection, there is a lot to consider. If at all possible, take some planning time to create data collection instruments and procedures that keep your data secure and valid, minimize errors, and relieve future data cleaning headaches. I am going to cover some common instruments teams create to collect original data, and ways you may be able to make these instruments more streamlined and efficient.
Quick note: Everything you read below is from my personal experience as well as a summary of what I have heard while interviewing other researchers. As such I may be missing out on even better ways to set up these systems. I am open to feedback!
Research teams may be restricted in how they collect their survey data due to limited resources, research design, or even the population being studied. However, if at all possible, I highly recommend collecting surveys using software/web-based tools that directly feed into a table or database rather than through paper forms. The reasons are:
Note: With the COVID-19 pandemic in 2020, I think many of us doing education research have had to make a huge switch, very quickly, to collecting data electronically and remotely. This is yet another reason to create/utilize web-based data collection instruments in particular, to prepare for the unknown, and provide flexibility to collect data even when you aren’t able to physically be in schools.
There are many web-based survey platforms (Ex: Qualtrics, SurveyMonkey, Google Forms, Microsoft Forms, SurveyCTO) you can use to develop surveys. If those existing platforms do not meet your needs, you can always build your own web-based application to collect data.
No matter what platform you use, building your survey requires some up front planning. First and foremost make sure to build a survey that is both valid and reliable. Secondly, take time to think how the data you collect will be translated into a database. Remember, every question answered is then stored in a database within the platform that you will later download to a file. Here are several suggestions to adhere to when creating your survey that will provide you with more accurate data and reduce future data management headaches (While I will not provide a tutorial for how to implement each suggestion, they are all possible in a platform such as Qualtrics):
If your site or participants do not have WiFi and/or email access, you generally have a few options.
One option is to work with a survey platform (such as Qualtrics or SurveyCTO) that allows you to create a survey online, administer it using an offline survey app on your phone or tablet, and then upload that data back to the survey platform once you have an internet connection again.
Still, depending on your study, participants, and resources, building a web-based survey may not be possible and/or it may not be the best option for your project. So your second option is a paper survey.
📑 Reynolds and Schatschneider from Florida State University have some great suggestions here for setting up a data entry station and protocol for any manual data entry that needs to occur.
📑 Data Management Episodes 1 and 2 of the Within and Between podcast have some great examples of how to set up a data entry protocol.
For the same reasons I recommend web-based tools for surveys, I also recommend web-based tools for assessments.
If a web-based option already exists for your assessment, the obvious answer is to simply use this. For example the Renaissance STAR assessment is administered online.
However, if your assessment is only available in paper form, consider converting it to a web-based form if at all possible without altering the integrity of the assessment. You can do that in several ways:
Once data is collected, proceed with scoring as usual, following the guidelines of the specific assessment.
If there are no web-based options, or connecting to WiFi is not an option for your project, then of course, keep with the paper assessments and follow the guidelines of the specific assessment. If the assessment requires any manual scoring (such as the Woodcock Johnson), be sure to implement an error checking system (such as the one mentioned in the paper surveys section).
Classroom observations, qualitative or quantitative, can be collected using a web-based tool.
Observation forms can be built into an online survey platform or your own application that data collectors can access on their phones, tablets or laptops in the field. If the observation has duration codes, you can build this into your app or there are existing applications that have built in timers. If the observation simply needs to last X amount of minutes, use a web-based form and you can always just have observers set timers on their phones, similar to what they may do using a paper form.
If WiFi is not available to you in the field, consider making an electronic form that still eliminates the need for data entry (so not a paper form) and that ideally connects to or creates some type of database or table. Then have data collectors enter data into this form on a device in the field. This does not have to be high-tech. It could simply be:
Collecting electronic data in the field becomes tricky when you are having people use tools simultaneously that aren’t connected to a shared database (such as separate Excel files or Access databases). When you use an online/offline survey platform, everyone’s data is feeding into the same database/table. However, when you are using separate Excel files for instance, everyone is storing data on their own device. This requires you to set up a way for collectors to share files (ex: Dropbox, Box, or whatever your institution deems is a secure system) and then for you to merge those files together across data collectors. It is definitely trickier, although possibly still a more efficient and reliable method than paper forms and hand entry of data.
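As a rough illustration of that merge step, here is a minimal Python sketch. It assumes each data collector saves a CSV file with the same columns; the column names (stu_id, obs_score) and collector labels are made up for the example, and in-memory text stands in for the files on each device.

```python
import csv
import io


def merge_collector_files(files):
    """Stack rows from several collectors' CSV files into one list.

    `files` maps a (hypothetical) collector label to an open text file.
    Each row is tagged with its source so problems can be traced back.
    """
    merged = []
    expected_columns = None
    for collector, handle in files.items():
        reader = csv.DictReader(handle)
        columns = reader.fieldnames
        if expected_columns is None:
            expected_columns = columns
        elif columns != expected_columns:
            # Stop early rather than silently misaligning variables.
            raise ValueError(f"{collector}: columns differ from the first file")
        for row in reader:
            row["source_file"] = collector
            merged.append(row)
    return merged


def find_duplicate_ids(rows, id_field="stu_id"):
    """Return study IDs that appear more than once after merging."""
    seen, duplicates = set(), set()
    for row in rows:
        value = row[id_field]
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return sorted(duplicates)


# Two in-memory stand-ins for files saved on separate devices.
collector_a = io.StringIO("stu_id,obs_score\n1001,3\n1002,4\n")
collector_b = io.StringIO("stu_id,obs_score\n1003,2\n1002,5\n")

rows = merge_collector_files({"collector_a": collector_a, "collector_b": collector_b})
duplicate_ids = find_duplicate_ids(rows)  # IDs entered by more than one collector
```

Checking the column names before stacking, and checking for repeated study IDs afterwards, are the two places where merged files most often go wrong, which is why the sketch does both.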
I don’t work much with qualitative data so I won’t say much here. Interviews and focus groups are typically recorded (at least audio) whether they are collected in-person or online. Then if you transcribe (some people simply just take notes), transcription is either outsourced to a company, done in-house, or done using transcription software and is usually kept in a file such as a Word document or Google doc. For this type of qualitative data, I have no preference for how it is collected and transcribed as long as the transcription is accurate and it is a format that is appropriate for any analysis software you plan to use.
If you receive any non-public, confidential, secondary data sources, such as school district records, just make sure you ask for them in a format that is usable to you. For instance, I have received school district records in PDF form, and while they can be made usable after much data wrangling, it is much easier to request this information in tabular format.
Consents and assents can also be collected through secure web-based tools, such as Qualtrics or DocuSign. This does, however, require your participants to have access to the internet and/or email. Consult with your IRB to see what tools are approved.
Last I wanted to mention that it is also possible to build all or almost all of your study into one tool (depending on your study). Using a tool like Qualtrics, or building your own tools in applications like R Shiny, you can build your consent, randomization, study ID assignment, survey and assessment all in one tool that participants can access in one simple link. Lucy D’Agostino McGowan has a great example in her slides here.
The simplest way to maintain participant privacy is to collect anonymous data. This is easily done through methods such as anonymous survey links (or even scannable QR codes linked to an anonymous form), as well as distribution of paper forms or data collector entry into applications. However, if you collect anonymous data, no participant identifying information should be collected in your instruments (name, email, date of birth, etc.). You will also want to make sure to not collect IP addresses as they can be used to identify an individual’s computer. It’s also important to recognize that if you collect anonymous data, you will not be able to link data across measures or across time.
More commonly in education research you will be collecting some identifiers (ex: name). However, in order to maintain participant confidentiality you cannot keep those identifiers in your research data. You will need to assign unique IDs to all participants and replace all identifying information with that study ID. In longitudinal studies in particular, you will be using that unique participant ID to link data across time and across measures.
To the best of your ability, all of your data collection instruments should have that study ID, rather than a name, embedded in the form when you send it out into the field. This ensures confidentiality (participant name is never actually on your study data and therefore does not need to be removed) and reduces errors that may occur when replacing names with study IDs. This is especially important for longitudinal studies, where maintaining the accuracy of your study IDs is critical to linking data over time.
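To make the ID assignment concrete, here is a minimal Python sketch, assuming 4-digit random numeric IDs and a simple name-to-ID linking key. The function and field names are hypothetical, and the example assumes names are unique; a real linking key would carry additional identifiers to tell apart participants who share a name.

```python
import random


def assign_study_ids(names, existing_ids=(), low=1000, high=9999, seed=None):
    """Give each name a random study ID that has not been used before."""
    rng = random.Random(seed)
    used = set(existing_ids)
    key = {}
    for name in names:
        if len(used) >= high - low + 1:
            raise ValueError("no unused IDs left in the range")
        new_id = rng.randint(low, high)
        while new_id in used:  # redraw until the ID is unused
            new_id = rng.randint(low, high)
        used.add(new_id)
        key[name] = new_id
    return key


def deidentify(records, key, name_field="name", id_field="stu_id"):
    """Return copies of the records with the name swapped for the study ID."""
    cleaned = []
    for record in records:
        row = {k: v for k, v in record.items() if k != name_field}
        row[id_field] = key[record[name_field]]
        cleaned.append(row)
    return cleaned


# The linking key belongs in the participant database, never with study data.
linking_key = assign_study_ids(["Ana Diaz", "Ben Lee"], seed=1)

# If names do end up on raw data, the key can be used to replace them.
raw = [{"name": "Ana Diaz", "score": 5}, {"name": "Ben Lee", "score": 7}]
study_data = deidentify(raw, linking_key)
```

Embedding the ID in the instrument up front, as recommended above, means the `deidentify` step is never needed; it is shown only for the case where names were collected on the form.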
Quick thought: Identifiers
While I am highly recommending you attach study ID alone to your data collection instruments, it is important to remember that you must have some sort of identifier (either name OR the study ID) built into every data collection instrument (survey, observation form, assessment, etc.) in order to link data across measures and time. If you do not collect identifiers, you will not be able to link your data across measures and time. Your data will essentially be anonymous.
Even if you are purposefully collecting anonymous data, if you have randomized participants, say schools, you will still need an identifier for the randomization block/cluster built into your instrument (ex: sch_id) or you will not be able to know which participants are treatment and which are control.
When sending out or completing in-person web-based surveys/assessments, you can do one of the following to connect your data to a participant (this is not all-inclusive):
For method #1 and #2, build a data check into the system. When a participant opens their unique link, or enters their study ID, have the system verify their identity by asking, “Are you first name, last name?”. If they say yes, they move forward, if they say no, the system redirects them to someone to contact. This ensures that participants are not completing someone else’s survey and IDs are connected to the correct participant.
Ultimately, which method you use for web-based links will depend first on the data collection instrument. For instance, an online assessment may only have one link and so option #2 or #3 above may be your only option. However, for online surveys, any option is possible. Secondly, it will depend on your method of dissemination. If you are sending survey links out yourselves to teachers, option #1 is going to be the best method. If a school system will not give you student emails, and you need to have teachers send out links to students, #3 may be your best option. That method reduces the burden on teachers (having to send multiple links in #1) and potentially reduces error (asking teachers to track multiple links or multiple IDs to send to students in #1 and #2).
Quick thought: Public links
All of the options above assume you are making a private link that you are only sending to existing or potential study participants (i.e.: students in a classroom, teachers in a school). However, there may be times you need to publicly recruit and collect data for your study. I’ve been reading about the threat of bots with public online surveys and the havoc they can wreak on a research study. If you need to use a public link, some suggestions for securing your survey include:
But even with these additions to the survey, you will want to check your data thoroughly before analyzing it and before providing payments to participants. Look for things such as participants who sped through the survey too fast, inconsistent answers on questions, nonsensical answers, timestamps that don’t make sense, or identical surveys.
Ultimately, the recommendation is to not have public survey links, and to only use unique links for participants. Even if this means extra work where you have a public link with a screener, and then after participants are verified through the screener, you then send a private link.
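The data checks described above (speeders, timestamps that don’t make sense, identical surveys) can be partly automated. Here is a minimal Python sketch, assuming each response has a response_id, ISO-formatted start_time and end_time, and a few answer fields. All of these names, and the 60-second threshold, are assumptions for illustration; flagged responses still need a manual look.

```python
from collections import Counter
from datetime import datetime


def duration_seconds(response):
    """Seconds between the recorded start and end of a response."""
    start = datetime.fromisoformat(response["start_time"])
    end = datetime.fromisoformat(response["end_time"])
    return (end - start).total_seconds()


def flag_suspicious(responses, answer_fields=("q1", "q2", "q3"), min_seconds=60):
    """Return {response_id: [reasons]} for responses worth reviewing by hand."""
    # Count how often each exact combination of answers occurs.
    patterns = Counter(tuple(r[f] for f in answer_fields) for r in responses)
    flags = {}
    for r in responses:
        reasons = []
        seconds = duration_seconds(r)
        if seconds < 0:
            reasons.append("end time precedes start time")
        elif seconds < min_seconds:
            reasons.append("completed too quickly")
        if patterns[tuple(r[f] for f in answer_fields)] > 1:
            reasons.append("identical to another response")
        if reasons:
            flags[r["response_id"]] = reasons
    return flags


responses = [
    {"response_id": "R1", "start_time": "2024-03-01T10:00:00",
     "end_time": "2024-03-01T10:12:00", "q1": 4, "q2": 2, "q3": 5},
    {"response_id": "R2", "start_time": "2024-03-01T10:05:00",
     "end_time": "2024-03-01T10:05:20", "q1": 3, "q2": 3, "q3": 3},
    {"response_id": "R3", "start_time": "2024-03-01T11:00:00",
     "end_time": "2024-03-01T11:09:00", "q1": 3, "q2": 3, "q3": 3},
    {"response_id": "R4", "start_time": "2024-03-01T12:00:00",
     "end_time": "2024-03-01T11:50:00", "q1": 1, "q2": 2, "q3": 3},
]
flags = flag_suspicious(responses)
```

With only a handful of questions, identical answer patterns will also occur among genuine participants, so the duplicate check is most useful on longer surveys or when open-ended answers are included in `answer_fields`.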
If you take paper forms into the field consider doing the following to connect your data to a participant:
Source: Poverty Action Lab
A participant database is often called a “master list” or “key” by IRB as this is your only database that can link your participants’ true identity to their study ID. As you recruit and consent participants, you will record their name, assign them a study ID, and enter any other necessary identifying information (ex: email) into the participant database.
In addition to recording a participant’s name and assigned study ID in this database, this is also the location you will track any and all activities that occur with those participants.
There are many pieces of information to document/track for a research study including:
And all of this needs to be documented across time (waves and cohorts) and space (classrooms and sites).
A thorough and complete participant database is vital to:
While the coordinator typically oversees the updating of this database, it is extremely helpful to consult with a data manager when setting up this system to make sure you include all relevant fields and to make sure the database is understandable when someone needs to verify information.
In a nutshell, the participant database will either be set up in a database (a series of tables), or a series of spreadsheets. At the most basic level, you will create one table/sheet per entity in your study (entities being participants, sites, districts, etc.). Example: Student table, Teacher table, School table, District table.
A participant database can be set up in a program such as:
The possibilities really are endless and it all depends on what your team has access to and how tech-savvy your team is. Some systems require specific programming knowledge (ex: setting up a SQL database). I personally prefer using a relational database system, which allows relating tables to one another (a schema), eliminating redundant data. Without getting too technical, database normalization increases performance, decreases storage, and makes it much easier to make updates to tables as changes occur. There are many options out there, but Access is just one example of a tool that allows you to relate tables and allows some form of querying.
Consider this first structure below, with 3 very simple tables (a student table, a teacher table, and a school table). Each table has a primary key that makes individuals within that table unique and each table can be connected through a foreign key. For example in the student table, the primary key is stu_id and the foreign key is tch_id, which connects students to the teacher table. Using a query language (such as SQL) in systems such as Access, we can pull multiple tables together ad hoc to make a table with all the pieces of information we need.
Note: Not all relational databases require technical skills and coding. Systems such as QuickBase build relations between tables and allow querying through their point and click, low-code application.
Say for example, we needed to pull a roster together for each teacher. We could easily run a query, such as this SQL query, that joins the student and teacher tables by tch_id and then pulls the relevant teacher and student information from both tables:
SELECT Student.first_name, Student.last_name, Teacher.first_name AS t_f_name, Teacher.last_name AS
FROM Student INNER JOIN Teacher ON Student.tch_id = Teacher.tch_id
ORDER BY Teacher.last_name, Teacher.first_name, Student.last_name, Student.first_name
Would produce a roster like this:
Now consider these 3 tables below that are not relational (such as 3 tabs in an Excel spreadsheet). Since we are unable to set up a system that links these tables together, we need to enter redundant information into each table (such as teacher or school name) in order to see that information within each table without having to flip back and forth across tables to find the information we need.
Using a relational structure allows us to eliminate redundant data. This not only saves us time and energy but reduces errors as well. You can imagine how useful that is.
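To make the relational structure concrete, here is a minimal sketch using Python’s built-in sqlite3 module rather than Access. The table layout follows the student, teacher, and school tables described above, but the specific columns, the example rows, and the t_l_name alias are my assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this is switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE School (
    sch_id   INTEGER PRIMARY KEY,
    sch_name TEXT
);
CREATE TABLE Teacher (
    tch_id     INTEGER PRIMARY KEY,
    first_name TEXT,
    last_name  TEXT,
    sch_id     INTEGER REFERENCES School(sch_id)   -- foreign key to School
);
CREATE TABLE Student (
    stu_id     INTEGER PRIMARY KEY,
    first_name TEXT,
    last_name  TEXT,
    tch_id     INTEGER REFERENCES Teacher(tch_id)  -- foreign key to Teacher
);
INSERT INTO School VALUES (10, 'Example Elementary');
INSERT INTO Teacher VALUES (201, 'Maria', 'Lopez', 10), (202, 'Sam', 'Carter', 10);
INSERT INTO Student VALUES
    (3001, 'Ava', 'Nguyen', 201),
    (3002, 'Eli', 'Brooks', 202),
    (3003, 'Zoe', 'Adams', 201);
""")

# Roster query: join students to teachers on tch_id, sorted by teacher.
roster = conn.execute("""
SELECT Student.first_name, Student.last_name,
       Teacher.first_name AS t_f_name, Teacher.last_name AS t_l_name
FROM Student INNER JOIN Teacher ON Student.tch_id = Teacher.tch_id
ORDER BY Teacher.last_name, Teacher.first_name,
         Student.last_name, Student.first_name
""").fetchall()

# A student pointing at a teacher who does not exist is rejected.
try:
    conn.execute("INSERT INTO Student VALUES (3004, 'Max', 'Reed', 999)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Teacher names are stored once, in the Teacher table, and only pulled next to student names at query time. If a teacher’s name changes, it is updated in one row rather than in every student record.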
Ultimately though, set up this participant database in whatever system works for you. It just needs to be set up. This system is vital to protecting the confidentiality of your participant data as well as record keeping for all of your data collected.
Relational database resources:
📑 Database vs. Spreadsheet
📑 7 Excel Spreadsheet Problems
📑 Jenna Jordan has a great example of how and why to build a relational database
📑 QuickBase explanation of table-to-table relationships
📑 Relating tables in Google Tables
As you build your tables/sheets, you will need to decide what fields to include. As you create those fields, consider the following:
Fields to include at the beginning of your study:
Quick thought: Study IDs
As you recruit and consent participants/sites, you will add them to a participant database under an assigned study ID (ex: stu_id, tch_id). These IDs allow participants/sites to remain confidential in your research data. That ID (typically a 2-6 digit random numeric or alphanumeric value) will follow that individual/site throughout the life of the study and should be unique to that individual/site. This number never changes.
Depending on your study design, some participants may have the opportunity to be re-recruited into the study more than once. For example, we had a study where three cohorts of 5th grade students were recruited over three years. Therefore, the same 5th grade teachers were often recruited back into the study each cohort. If you have a similar study design, you still keep that same study ID (in this case the same tch_id) for that participant. If you want to identify the unique wave or cohort that a participant is brought back into the study, make sure you have other variables in your data that help you understand why a study ID occurs in your data more than once (ex: a cohort variable). The study ID concatenated with the cohort provides you with a unique row identifier.
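As a small illustration of that last point, here is a Python sketch that builds a row identifier from the study ID and the cohort and then checks that it really is unique. The field names (tch_id, cohort, row_id) are assumptions for the example.

```python
from collections import Counter


def add_row_ids(rows, id_field="tch_id", cohort_field="cohort"):
    """Attach a row identifier built from study ID and cohort."""
    for row in rows:
        # The separator keeps ID 20 / cohort 11 distinct from ID 201 / cohort 1.
        row["row_id"] = f"{row[id_field]}_{row[cohort_field]}"
    return rows


def repeated_row_ids(rows):
    """Return row identifiers that occur more than once."""
    counts = Counter(row["row_id"] for row in rows)
    return sorted(rid for rid, n in counts.items() if n > 1)


# Teacher 201 was recruited in two cohorts and keeps the same tch_id.
teacher_rows = [
    {"tch_id": 201, "cohort": 1},
    {"tch_id": 201, "cohort": 2},
    {"tch_id": 202, "cohort": 1},
]
add_row_ids(teacher_rows)
problems = repeated_row_ids(teacher_rows)  # empty when every row is unique
```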
Jenna Jordan has a great blog post on UIDs.
After your participants are recruited and you begin data collection, you can imagine there are many pieces of information to track such as incentives/payments and data collection completion. These fields will also need to be added to your participant database.
Fields to track over time (for each wave of data collection):
This data collection information/tracking may seem trivial, but it is vitally important to the integrity of your research study for these reasons:
The participant database is what holds the project together. I cannot stress this enough.
Quick thought: Tracking Best Practices
A few best practices for tracking that improve project coordination and data management:
Again if you are building these as individual spreadsheets in a workbook rather than relational tables that can be queried, you will most likely want to include duplicate information across tables (teacher and school name in your student table for instance). This will prevent you from having to flip back and forth across sheets.
For the purposes of this database, there are probably two most common methods of entering data.
There are many other ways to load data into a database but for this use case (tracking participation in an education research study) these are probably the most common methods. However, you can read about several other methods here.
Also, most recently I’ve heard people mention the idea of automating at least some data entry/tracking through the use of unique scannable codes or through integration of your online data collection platform and your tracking database (ex: Qualtrics and QuickBase). This type of tracking most likely requires some technical skills to set up the linking process. But it’s a cool idea and may reduce at least some error that occurs in manually tracking data. For example, by creating a unique QR code/barcode (linked to a participant ID) that is added to every paper survey, similar to what is used for tracking inventory, you can scan that code when data collection is complete to “check in” a survey to your participant database. Or by integrating your survey platform and participant database, you could add new participants to your database as they complete their consents and surveys online.
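I have not built this myself, but the “check in” step might look something like the Python sketch below. It assumes the scanner simply types the participant ID encoded in the QR code/barcode as text, and that tracking records are held in a dictionary keyed by that ID. The function name and the field name svy_w1_complete are hypothetical.

```python
from datetime import date


def check_in(tracking, scanned_code, field="svy_w1_complete", today=None):
    """Mark one instrument as received for the participant whose ID was scanned.

    Returns True if the record was updated, False if it was already checked in.
    """
    stu_id = scanned_code.strip()  # scanners often append a newline
    if stu_id not in tracking:
        raise KeyError(f"Scanned ID {stu_id!r} is not in the participant database")
    record = tracking[stu_id]
    if record.get(field):
        return False  # keep the original check-in date
    record[field] = True
    record[field + "_date"] = (today or date.today()).isoformat()
    return True


# Stand-in for the tracking table in a participant database.
tracking = {"3001": {}, "3002": {}}

first_scan = check_in(tracking, "3001\n", today=date(2024, 3, 1))
second_scan = check_in(tracking, "3001\n", today=date(2024, 3, 5))
```

Rejecting IDs that are not in the database and refusing to overwrite an existing check-in are the two safeguards that replace the judgment a person would apply when ticking a box by hand.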
QR code/barcode resources:
📑 Creating Barcodes in Excel
📑 Scanning Barcodes into an Excel Spreadsheet
📑 Creating Barcodes in Word
📑 Scanning QR codes into Google Sheets
📑 Creating Barcodes in Access
📑 Scanning Barcodes into Access
There are two issues you may want to consider when developing these systems:
At the most basic level you will have one table per entity (student, teacher, etc.), as we discussed. However, you can imagine these tables/sheets can become very wide if you are collecting data on a lot of measures or many waves of data collection and very long if you are collecting data on a lot of participants over several cohorts. If this is the case, you may want to consider building a more complex schema (such as different tables for each time period or for each cohort) that again link together through primary and foreign keys. You can read more about creating a schema here.
Another issue that may occur is, if team members other than the project coordinator will be tracking data collection, you may want to limit their access to participant identifying information. If this is true, similar to above, you may consider building a more complex schema where you create separate tables for participant demographic/identifying information and tables with only participant ID and data collection tracking, that again can be linked through primary and foreign keys.
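Here is a minimal sketch of that split, again using Python’s built-in sqlite3 module. The table and column names are assumptions for illustration. Note that separate tables only give you the structure; actually restricting who can open the identifying table depends on the permissions your database system or file storage provides.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Identifying information: restricted to the coordinator.
CREATE TABLE StudentIdentity (
    stu_id     INTEGER PRIMARY KEY,
    first_name TEXT,
    last_name  TEXT,
    email      TEXT
);
-- Tracking information: study ID only, one row per student per wave.
CREATE TABLE StudentTracking (
    stu_id       INTEGER REFERENCES StudentIdentity(stu_id),
    wave         INTEGER,
    svy_complete INTEGER DEFAULT 0,
    PRIMARY KEY (stu_id, wave)
);
INSERT INTO StudentIdentity VALUES
    (3001, 'Ava', 'Nguyen', 'ava@example.org'),
    (3003, 'Zoe', 'Adams', 'zoe@example.org');
INSERT INTO StudentTracking VALUES (3001, 1, 1), (3003, 1, 0);
""")

# Columns a data collector would see: no names or emails.
tracking_columns = [col[1] for col in db.execute("PRAGMA table_info(StudentTracking)")]

# The coordinator can still join the tables to follow up on missing surveys.
coordinator_view = db.execute("""
SELECT i.first_name, i.last_name, t.wave, t.svy_complete
FROM StudentTracking AS t
JOIN StudentIdentity AS i ON t.stu_id = i.stu_id
WHERE t.svy_complete = 0
ORDER BY i.last_name
""").fetchall()
```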
Other overall preferences for your participant database include:
Although data storage is our next topic, I want to mention right now, that this participant database clearly has identifiable and protected information. Therefore it should be stored securely and apart from all other study data. This is the only file that directly links your participants’ true identity to your confidential study data and it should never be stored in the same folder as your study data. Additionally, it should have limited access. Only those who need access, such as the project coordinator, data manager, and any other staff assisting with tracking, should have access to this file for security purposes.
Resources on securing a participant database:
Last, most of what I have covered here is relevant for studies where you plan to de-identify your data. If you are collecting anonymous data, you do not need a participant database per se because you are not collecting any identifying information on your participants. With that said, you may still wish to and probably should set up a simple system to track your data collection efforts.
Whether you are working with your own original collected data, or you are working with extant data provided to you from an entity such as a school district, or publicly available data, you need to consider short-term secure data storage. I say short-term because this is how you will store your data while you are actively collecting data for your project. Data retention/long-term storage, after your study is complete, will be discussed in a later training.
In general you will most likely be working with one of four types of data and will need to store your data according to the type of data:
For this section I am going to generally describe the level of security needed to store your data depending on the type of data. However, I am not a security expert and in an effort to not explain security incorrectly, I would like to refer readers to this document from J-PAL. It is one of the most comprehensive documents I have come across regarding data security.
Because your participant database contains identifiable information about your participants, this data must have the highest security. The specifics of this security will be set by your university or institution. However, the general rules for storing this type of data are:
Your typical study data files are in formats such as CSV, tab-delimited, Excel, text, Word doc, or a statistical program file such as .sav, .dta, .R, or .sas. No matter the file type, your study data should all be de-identified/confidential, only including study IDs and no other identifiers. However, there are still some security precautions you should take.
This includes items such as external hard drives, flash drives, or CDs.
A newer type of data collection is occurring since the COVID-19 pandemic, and that is observations, interviews, and focus groups occurring via video conferencing. This data is especially sensitive as it may include names and faces as part of the recording. Again, you will want to refer to your specific institution guidelines, but generally:
Note: These same data security rules will also apply to data that is recorded in person. And furthermore, any data that is recorded on detachable media will need to follow the detachable media guidelines.
Again, I am not an IT professional, but some general security rules to keep in mind for all data and devices are:
Resources on data security:
📑 University of Guelph
📑 University of Pittsburgh
📑 University of Michigan
📑 Princeton University
📑 University of Nevada
📑 Pacific University Oregon
📑 Florida International University
📑 Karl Broman
📑 University of Missouri
📑 Foundational Practices of Research Data Management
📑 University of Pittsburgh, RDM
📑 University of Connecticut
📑 Brandeis University
📑 Reasons to consider SharePoint for data storage
📑 Reasons to consider Institutional Servers for data storage
No matter where your team stores your files, it is important to develop a logical directory structure. It makes it so much easier to find your files and it facilitates sharing within and outside of your team. You will want to build this structure into your style guide and implement it generally across all of your projects to create cohesion. The most important things are to make files findable, to have a structure that allows you to control user access, and to keep folders shallow enough that you don’t reach a limit on the allowable length of a path name for any file.
At the highest, organizational level, you will want to have separate folders for each of your projects, as well as an overall Team folder that houses general documents related to your team functioning (meeting notes, HR documents, team expectations). You will want to develop a style guide that covers general rules across all folders and house that style guide in an easily accessible location (ex: Team Wiki and/or a README in each project folder).
Then, within each project folder you will want to build a hierarchy, something like this:
project-name
 °--life-cycle-folder
     °--time
         °--content
             °--participant
                 °--archive
Level 1: The name of the project
Level 2: General research life cycle folders (Ex: data, documentation, project management, intervention, tracking)
Level 3: For longitudinal studies I tend to like the third level to be time period or grouping sub-folders (Ex: cohort 1, or wave 1)
Level 4: Specific content folders (Ex: raw data, syntax, clean data)
Level 5: Participant specific folder (ex: student, teacher).
Level 6: All previous versions of files can go in an archive folder to reduce clutter
More details on setting up this structure can be found in training_3.
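As a rough way to try out a structure before creating it, here is a Python sketch that lists every folder implied by the six levels above and flags any that would leave too little room for a file name. The folder names are examples only. The 260-character limit is the traditional Windows path limit and the 40-character file name is a guess; check what your own storage system allows. The paths here are also relative to the project folder, so the length of the drive or server prefix would need to be added.

```python
from itertools import product
from pathlib import PurePosixPath


def planned_folders(project, life_cycle, times, contents, participants):
    """List every participant-level folder, plus its archive folder."""
    folders = []
    for levels in product(life_cycle, times, contents, participants):
        folder = PurePosixPath(project, *levels)   # levels 1 through 5
        folders.append(folder)
        folders.append(folder / "archive")         # level 6
    return folders


def too_long(folders, longest_file_name=40, limit=260):
    """Return folders where a file name of that length would exceed the limit."""
    return [f for f in folders
            if len(str(f)) + 1 + longest_file_name > limit]


folders = planned_folders(
    "project-name",
    life_cycle=["data"],
    times=["wave1", "wave2"],
    contents=["raw", "clean"],
    participants=["student", "teacher"],
)
problem_folders = too_long(folders)
```

The sketch uses `PurePosixPath` so nothing is written to disk. Once you are happy with the list, the same paths could be created with `pathlib.Path(...).mkdir(parents=True, exist_ok=True)`.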