This module is about connecting all of the pieces of the puzzle and thinking about how we can move from simply knowing what good data management should look like, to planning how you will actually manage data in your lab, for your projects, and with your team.
For the most part, the modules on this website have discussed phases of data management and best practices as independent steps in the data management life cycle. Yet we know very well that all of these phases are dependent and connected. Before we begin to choose which practices to implement, we need to be able to put practices in the context of outcomes (that is, the benefits of implementing different practices). We then review the data management and research life cycle to better understand how each module fits into the larger process. Next, we talk about when to start choosing the practices you want to implement, and provide checklists as tools to use in the planning process. Last, we talk about putting it all together by creating a data management workflow that works for you.
While there are MANY reasons to implement good data management practices (e.g., funder requirements, legal and ethical mandates, contributing to open science), I think we can boil the benefits of data management down to three basic outcomes. Good data management produces reproducible, reliable, and secure data for you and future users.
Let’s connect these outcomes to actions that we have covered in previous modules:
Reproducible
Reproducible is defined as being able to produce the same results using the same materials and procedure. This could be anything from reproducing a data collection effort to reproducing a clean data file.
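One way to check that a clean data file is reproducible is to rerun the cleaning step on the same raw data and confirm that the output is identical. The sketch below is only an illustration under stated assumptions: the `clean_rows` function, its cleaning rules, the `NA` missing-value code, and the sample records are all hypothetical and not taken from any earlier module.

```python
import csv
import hashlib
import io

# Hypothetical raw records, standing in for a raw survey export.
RAW_ROWS = [
    {"stu_id": "101", "grade": " 5 ", "score": "12"},
    {"stu_id": "102", "grade": "6", "score": ""},
]

def clean_rows(rows):
    """Apply the same deterministic cleaning rules on every run (illustrative rules only)."""
    cleaned = []
    for row in rows:
        cleaned.append({
            "stu_id": row["stu_id"].strip(),
            "grade": row["grade"].strip(),
            # Assumed convention: a blank score becomes an explicit missing-value code.
            "score": row["score"].strip() or "NA",
        })
    # Sort so the output order never depends on the input order.
    return sorted(cleaned, key=lambda r: r["stu_id"])

def file_digest(rows):
    """Write rows to CSV text in memory and return its SHA-256 digest."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["stu_id", "grade", "score"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return hashlib.sha256(buffer.getvalue().encode("utf-8")).hexdigest()

# Run the cleaning twice, the second time with the raw rows in a different order.
first_run = file_digest(clean_rows(RAW_ROWS))
second_run = file_digest(clean_rows(list(reversed(RAW_ROWS))))
is_reproducible = first_run == second_run
print("Reproducible:", is_reproducible)
```

If the two digests ever differ, something in the procedure (a manual edit, an unsorted export, a changed rule) is not being repeated the same way each time.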
Data Management practices that contribute to reproducibility include:
Reliable
Reliable data is accurate/true and complete data that you can trust. Unreliable data might include problems such as inaccurately entered data, incorrectly coded variables, or missing values, and can lead to inaccurate decision making.
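The kinds of problems described above (missing values and incorrectly coded variables) can often be flagged automatically before anyone analyzes the data. The following is a minimal sketch, assuming a simple codebook of allowed values; the variable names, allowed codes, and records are hypothetical.

```python
# Hypothetical codebook: the allowed values for each variable.
CODEBOOK = {
    "grade": {"5", "6", "7"},
    "consent": {"0", "1"},
}

# Hypothetical entered records.
RECORDS = [
    {"stu_id": "101", "grade": "5", "consent": "1"},
    {"stu_id": "102", "grade": "55", "consent": "1"},  # likely keying error
    {"stu_id": "103", "grade": "", "consent": "2"},    # missing grade, invalid code
]

def find_problems(records, codebook):
    """Return a list of (stu_id, variable, problem) tuples."""
    problems = []
    for record in records:
        for variable, allowed in codebook.items():
            value = record.get(variable, "").strip()
            if value == "":
                problems.append((record["stu_id"], variable, "missing"))
            elif value not in allowed:
                problems.append(
                    (record["stu_id"], variable, f"unexpected value {value!r}")
                )
    return problems

problems = find_problems(RECORDS, CODEBOOK)
for stu_id, variable, problem in problems:
    print(stu_id, variable, problem)
```

A check like this does not prove the data is accurate, but it narrows down which records someone needs to compare against the original forms.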
Practices that contribute to reliable data include:
Secure
Data security involves storing and sharing data in a way that protects participant confidentiality and prevents loss of information.
Practices that contribute to data security include:
We touched on the research life cycle in the Project Management module. However, now that we have gone through best practices, I think it’s important to put them all into the context of where they fit in the data management life cycle.
Below we have a research life cycle image that shows how data management and project coordination work in parallel and collaboratively throughout a study. I typically think of the project management/coordination path as consisting of the PI/Co-PI as well as the project coordinator, and any other staff in charge of implementing the project as well as any intervention. The data management path consists of anyone in charge of working with data or data products (such as documentation or data collection tools), and again could still include PIs/Co-PIs, project coordinators, data managers and any other staff working with data. Sometimes the project team and the data team are the same people (especially if the team is small). Either way, it is still helpful to see how these paths work simultaneously and collaboratively.
Moving from left to right:
Data management planning is the most important step you can implement in the data management life cycle. Data management planning is the catalyst for reaping all of the benefits mentioned above. Without planning, the chances of inconsistencies, lost data, and human errors increase greatly. Think about a project where data is collected inconsistently, files are saved haphazardly, data cleaning is not well documented, and data is stored and shared without rules for security. It sounds like the story line of a data management horror narrative.
We saw in our data management flow chart above that data management planning is mentioned twice. First, it is mentioned in the context of a data management plan (DMP), the 2-5 page document required by federal funders that we reviewed in Module 5. And while DMPs provide a hopeful guide for future practices, there is often a disconnect between the broad theory behind those plans and the actual complex implementation of those plans in practice (Borycz, 2021). This is where the second planning phase comes into play. Planning data management refers to making detailed decisions and creating actionable steps to implement your DMP. This data management planning happens at the same time that the project team is planning for project implementation (things like how to collect data, how to hire staff, planning supplies, how to recruit participants, how to communicate with sites, etc.).
Planning checklists can be really useful in helping you remember the various data management decisions that need to be made before your project begins. Below are checklists broken out by each phase. While these checklists will not encompass everything that every project will need to consider, they are a jumping-off point for starting these team discussions. When reviewing these checklists, take into consideration all the variations that are unique to your team and project such as:
Note that work on many of these checklists will occur alongside (or may overlap with) general project planning, which should have its own set of checklists.
You can see other examples of helpful checklists here:
📑 Kristin Briney Data Management Plan Checklist
📑 Harvard Longwood Research Data Management Series of Checklists
📑 Stanford Medicine Lane Medical Library
📑 UK Data Service
Another part of the planning phase is developing data management workflows.
A workflow, often illustrated with a flow diagram, is a series of repeatable tasks that help you move through the stages of the research life cycle in an “organized and efficient manner” (Concordia-Saint Paul). A workflow is personalized. It is where you start to choose which “best practices” work for your project and your team. One team may collect survey data on paper because their participants are young children, hand enter it into Excel because it is the tool their team is familiar with, and double enter 20% because they don’t have the capacity to enter more than that. Another team may collect paper data because they are collecting data in the field, hand enter the data into FileMaker because that is the only tool they have access to, and double enter 100% because they have the budget and capacity to do that.
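To make the double-entry choice concrete, here is a minimal sketch of both options described above: selecting 20% of forms for a second entry, and comparing two entries of the same forms field by field. The function names `select_for_double_entry` and `compare_entries`, the form IDs, and the question fields are hypothetical; a real project would do this with whatever tool the team already uses.

```python
import random

def select_for_double_entry(form_ids, proportion, seed=2024):
    """Pick a reproducible random sample of forms to enter a second time."""
    rng = random.Random(seed)  # fixed seed so the same forms are chosen on a rerun
    sample_size = max(1, round(len(form_ids) * proportion))
    return sorted(rng.sample(form_ids, sample_size))

def compare_entries(first_entry, second_entry):
    """Return (form_id, field, first_value, second_value) for every mismatch."""
    discrepancies = []
    for form_id, second_values in second_entry.items():
        first_values = first_entry[form_id]
        for field, second_value in second_values.items():
            if first_values.get(field) != second_value:
                discrepancies.append(
                    (form_id, field, first_values.get(field), second_value)
                )
    return discrepancies

# Hypothetical first-pass entry of ten paper forms.
first_entry = {f"F{n:02d}": {"q1": "3", "q2": "1"} for n in range(1, 11)}
first_entry["F04"]["q2"] = "7"  # simulated keying error in the first pass

all_forms = sorted(first_entry)
sample_20 = select_for_double_entry(all_forms, proportion=0.20)
sample_all = select_for_double_entry(all_forms, proportion=1.0)

# Hypothetical second pass: a second person keys every form again.
second_entry = {form_id: {"q1": "3", "q2": "1"} for form_id in sample_all}

discrepancies = compare_entries(first_entry, second_entry)
print(len(sample_20), "of", len(first_entry), "forms selected at 20%")
print(discrepancies)
```

With 100% double entry the simulated error on form `F04` is always caught. With a 20% sample it is only caught if that form happens to be selected, which is the trade-off a team accepts when capacity is limited.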
Borghi and Van Gulick view a workflow as a series of steps that a research team chooses out of the many possibilities not chosen. Maybe you won’t always be able to implement the “best practices”, but you can decide what is good enough for your team based on motivations, incentives, needs, resources, skill set, and rules and regulations.
Here is a very simplified example of the decision-making process, based on the Borghi and Van Gulick flow chart. Of course, in real life we are often choosing between many more than just two options!
Your checklists are guides for what decisions need to be made. As you walk through your checklists, you can begin to enter your decisions into a workflow diagram. The order of your steps should follow the general order of the data management life cycle (specifically the data collection cycle). You will want to have a workflow diagram for every piece of data that you collect. So for example, if you collect the following:
You will have 3 workflow diagrams for these 3 processes.
Your diagrams should include the who, what, where, and when of each task/step in the process.
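If it helps to draft the steps before drawing them, the who, what, where, and when of each task can be written down as a small structured list and checked for gaps. This is only a sketch: the `WorkflowStep` structure and the example steps, roles, and locations are hypothetical placeholders, not a recommended workflow.

```python
from dataclasses import dataclass

@dataclass
class WorkflowStep:
    what: str
    who: str
    where: str
    when: str

# Hypothetical steps for a paper survey; every value here is a placeholder.
survey_workflow = [
    WorkflowStep("Collect paper surveys", "Data collectors",
                 "School site", "During the collection window"),
    WorkflowStep("Enter survey data", "Research assistants",
                 "", "Within one week of collection"),
    WorkflowStep("Clean entered data", "Data manager",
                 "Secure project folder", "After entry is complete"),
]

def missing_details(steps):
    """List (step number, detail) pairs where a who/what/where/when is blank."""
    gaps = []
    for number, step in enumerate(steps, start=1):
        for detail in ("who", "what", "where", "when"):
            if not getattr(step, detail).strip():
                gaps.append((number, detail))
    return gaps

gaps = missing_details(survey_workflow)
print(gaps)
```

Here the second step has no "where" recorded, which is exactly the kind of undecided detail that is easier to catch in a draft than after data collection has started.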
Your diagram can be displayed in any format that works for you. Here are a few examples of workflow diagrams.
And while all of these diagrams are good jumping off points, I think an effective diagram really needs to call out (at least minimally) the who, what, where, and when of each task. I think a template like this one below works very well. Remember, this is a repeatable process. So while this diagram is linear (steps laid out in the order in which we expect them to happen), this process will be repeated every time we collect this same piece of data.
Here is how I might complete this diagram for a student survey.
Visualizing these decisions in diagram format has many benefits. First, it allows your team to see how their roles and responsibilities fit into the larger research process. Second, showing how data management is integrated into the larger research workflow can help team members view data management as part of their daily routine, rather than “extra work”. And last, reviewing workflows as a team and allowing members to provide feedback can help create buy-in for data management processes.
While these workflow diagrams are excellent for high level views of what the process will be, we can easily see that we are unable to put fine details into this visual diagram.
So the last step of creating a workflow is to put all steps into a protocol. In your protocol you will add all necessary details of the process. You can also attach your visual diagram as an addendum to the protocol for reference.
Here is an example of how I might translate the student survey workflow from above into a detailed protocol. Notice that I mention that I have a separate protocol just for the data cleaning portion of this workflow (this might be because the same data cleaning workflow is used across many different types of data).
**NOTE:** All workflows should be written into protocols, but not all protocols are created from workflows. Sometimes protocols are simply documentation of decisions that have been made. Take, for example, a protocol on how study IDs will be assigned, or a protocol for inclusion/exclusion criteria. These don’t necessarily require workflows, yet they still need to be documented in a protocol.
As with the questions you considered when reviewing your planning checklists, you also need to evaluate the following when developing your workflow.
✔️ Does your flow preserve the integrity of your data? Is there any
point where you might lose or compromise data?
✔️ Is there any point in the flow where data is not being handled
securely? Could someone who should not have access gain access to
identifiable information?
✔️ Is your flow in accordance with all of your compliance requirements
(IRB, FERPA, HIPAA, Institutional Data Policies, etc.)?
✔️ Is your flow feasible for your team (based on size, skill level,
motivation, etc.)?
✔️ Is your flow feasible for your budget and available resources?
✔️ Is your flow feasible for the amount and types of data you are
collecting?
✔️ Are there any bottlenecks in the workflow? Areas where resources or
training are needed? Any areas where tasks should be re-directed?
Workflow resources:
📑 Borycz
📑 Borghi and Van Gulick
📑 Briney, Coates, and Goben
📑 Data Flow Toolkit
Data management is complicated and the concepts can feel nebulous at times. A lot of what works great for one team may not work at all for another. Even what works great for one round of data collection may not work great for the next round. Things change: staff, situations, data, tools, life events, etc. Everything that is suggested in this entire series is just that, suggestions. They are ways that may help you get closer to having a better data management process than you had in your previous project, or in the last year of your current project, or even in the last week. By now I think we’ve all learned that data management is important. How we get to well-managed data doesn’t have to be through the same means, and it doesn’t have to involve implementing everything mentioned in this series. Ultimately, if you care about data management, if you are taking time to plan and think through your processes, if you are documenting those processes, and if you are able to get your team on board with those processes, I call that a win!