Data sharing has become one of the most discussed areas of data management. With a growing number of funders (e.g., NIH) requesting a data sharing plan, more people want to know how and why they should share their study data. In a 2019 paper, Pasek and Mayer found that graduate students across two universities cited data curation and re-use as the area most needing improvement. In a way, data sharing is a great jumping-off point for many people to think about data management. With an end goal in mind, researchers can start to think about what structures they need to put in place to curate data that will be acceptable for sharing.
While data sharing may sometimes encompass things such as sharing correlation matrices, summary tables, or study results, that is not what this module is about. This module is about sharing raw, item- and case-level primary data collected as part of a research study. It can also include extant data that you obtain and add to the data you collect. Beyond that overview, what types of data you share, how you share, and where you share may depend more on things like your funder, your budget, your project and participants, and your field. However, this module provides many recommendations for what and how to share that will lead to the most benefits for you, your project, and your field.
As a reminder of what we covered in the Data Management Module, in 2016 the FAIR Principles were published in Scientific Data, outlining 4 guiding principles for scientific data management and stewardship. These principles should be referred to when choosing when, where and how to share your research data.
Findable: All data should be findable through a persistent identifier and have good data documentation, also known as metadata. As we move towards automation in our work and life, machine-readable metadata becomes more important for the automatic discovery of data.
Accessible: Your data is accessible if humans can access it. This can mean your data is available in a repository or through a request system.
Interoperable: Use standardized vocabularies as well as formats. Both humans and machines should be able to read and interpret your data. Software licenses should not pose a barrier to usage. Data should be available in open formats, such as .csv, .txt, or .dat, that can be accessed by any software. Furthermore, thorough data documentation should accompany the data.
Reusable: Your metadata should provide information on the broad context of your project as well as your data collection to allow for accurate use of your data. You should also have clear licensing for data use.
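As a rough illustration of what "machine-readable metadata" and "open formats" can look like in practice, the sketch below writes a small dataset as CSV text and pairs it with a JSON metadata record. Everything in it is hypothetical: the variable names, the metadata field names, and the placeholder identifier are assumptions made for the example, not a formal metadata standard. A repository will usually supply its own schema.

```python
import csv
import io
import json

# Hypothetical study data; names and values are illustrative only.
rows = [
    {"stu_id": "1001", "grade": "3", "math_score": "412"},
    {"stu_id": "1002", "grade": "4", "math_score": "455"},
]

# Hypothetical metadata record. The field names are assumptions for this
# sketch, not a formal standard.
metadata = {
    "title": "Example Study Student Data",
    "identifier": "doi:10.0000/example",  # placeholder, not a real DOI
    "license": "CC-BY-4.0",
    "format": "text/csv",
    "variables": [
        {"name": "stu_id", "label": "Student study ID", "type": "string"},
        {"name": "grade", "label": "Grade level", "type": "integer"},
        {"name": "math_score", "label": "Math scale score", "type": "integer"},
    ],
}


def write_csv(rows, fieldnames):
    """Serialize rows to CSV text, an open format readable by any software."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def undocumented_variables(rows, metadata):
    """List variables that appear in the data but not in the metadata."""
    documented = {variable["name"] for variable in metadata["variables"]}
    present = {name for row in rows for name in row}
    return sorted(present - documented)


csv_text = write_csv(rows, [v["name"] for v in metadata["variables"]])
metadata_json = json.dumps(metadata, indent=2)
```

A check like `undocumented_variables` is one simple way to confirm that every variable you share is described in your documentation before you deposit the files.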
While data sharing may often happen at the end of a project, planning for data sharing should happen at the beginning. For many funders, you’ll be required to write a brief overview of your data sharing plan in your data management plan (DMP) as part of your grant proposal, and you will also be allowed to include DMP-associated costs in your grant application budget requests. DMPs are often a supplement or appendix to your grant application and are restricted to anywhere from 1-2 pages (NIH, NIJ, and NSF) to a 5 page maximum (IES). For most funders these DMPs are not part of the scoring process, but they are reviewed by a panel or program officer. Some funders may provide feedback or ask for revisions if they believe your plan or your budgeted costs are not adequate.
What to include in a DMP varies somewhat across funding agencies. While you should check each funding agency’s site for their specific DMP requirements, this comparison table provides an overview of 11 common categories covered in a data management plan, whether four large funding agencies ask applicants to address these categories in their data management plans, and any additional guidance they provide for each category.
Categories to Include in a Data Management Plan and Guidance Provided by Funder
However, even if your funder does not require a data sharing plan, there are still many reasons to consider sharing your data, as we covered above. Planning what, how, and when to share before your project even begins is the best way to ensure you have everything in order by the time you need to share your data. You can always update your plan during or after your project completion. If data sharing is required by your funder, it may be helpful to keep in contact with your program officer regarding any potential changes throughout your project.
Data Management Plan resources:
Implementation Guide for Public Access to Research Data
📑 U.S. Department of Education Plan and Policy Development Guidance for Public Access
📑 IES Data Sharing FAQ
📑 IES Policy Regarding Public Access to Research
📑 DMP Tool Template IES
While your funder may have guidance as well as requirements for your data sharing plan, there are also generally accepted best practices that you should consider when you construct your plan. Following required guidelines and best practices will help you provide data that is useful and accessible to researchers.
However, the Institute of Education Sciences put it best when they said to keep the big picture in mind. They listed four big ideas to consider when planning for data sharing:
Focus on sharing well-organized and well-documented data. Include all documentation necessary for someone with no familiarity with your project to pick up your data and make sense of it. Consider everything you can include so that future researchers aren’t reaching back out to you with ongoing questions. Also, organize your data with a well-designed structure. Don’t share messy files that are inconsistent across the project. Share files that are standardized, uniform, and can be easily linked if necessary.
Commit to sharing some data or code to facilitate analysis. If possible, share more data than those used to produce a study’s main findings. And if your data have restrictions and you are unable to share them, is there anything you can share that is still in the vein of open science, such as code or aggregated data?
Don’t get stuck on one single way to share data. Each project is unique and has its own opportunities and constraints. Think outside the box about what means of data sharing work best for you, while also considering how to maximize impact.
Some projects, though not most, will need to consider the possibility of tradeoffs. There may be times when sharing all of your data requires you to share with restricted access, while removing some variables and sharing only some of your data allows you to share the dataset openly. Researchers may need to consider what makes the most sense here, and in some circumstances you may be able to share data through a combination of methods.
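The tradeoff in the last idea can be sketched in a few lines of code. The example below is a minimal illustration, assuming a dataset held as a list of records. The variable names and the choice of which variables count as restricted are hypothetical; in a real project that decision comes from your own disclosure review and your agreements with participants.

```python
# Hypothetical set of variables that cannot be shared openly. Which
# variables belong here is an assumption made for illustration only.
RESTRICTED_VARIABLES = {"birth_date", "zip_code"}


def make_public_file(rows, restricted=RESTRICTED_VARIABLES):
    """Return a copy of the rows with restricted variables removed."""
    return [
        {name: value for name, value in row.items() if name not in restricted}
        for row in rows
    ]


# The full dataset, which might only be shared through restricted access.
full_data = [
    {"stu_id": 1001, "birth_date": "2012-04-03", "zip_code": "63101", "math_score": 412},
    {"stu_id": 1002, "birth_date": "2011-11-19", "zip_code": "63110", "math_score": 455},
]

# A reduced version of the same dataset that could be shared openly.
public_data = make_public_file(full_data)
```

Because the function builds new records rather than editing the originals, the full dataset stays intact for restricted-access sharing while the reduced file is prepared for open sharing. Dropping variables alone does not guarantee that a file is de-identified, so treat this as one step in a larger review.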
Data sharing is a great service to the larger community, and that service is built upon trust. It’s built upon you trusting that your own data is accurate and represents the true information that you collected. It is also built upon users putting their trust in you and your data: that the data you provide are accurate and free from errors, and that the findings you publish are based on data that are free from errors and manipulation.
This means, when you find errors in the data that you have publicly shared, you have an obligation to do something about it.
If your data is deposited in a repository:
Consider making a comment in your project, notifying users of errors in the data. If the repository requires/allows users to make an account before accessing the data, they may have a system to email current users to let them know a new comment was added to a project they have downloaded.
If the errors in your data are fixable, take time to go back to the raw data and re-clean, making the appropriate edits.
Upload a new version of your data to the repository. Make comments about the revisions so that users know what changes have been made between the previous and new version of your data.
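One way to draft the revision comments described above is to compare the previous and corrected versions of the data case by case. The sketch below is a minimal example under stated assumptions: both versions are lists of records, each case has a unique identifier (here the hypothetical variable `stu_id`), and the function name `describe_changes` is invented for illustration.

```python
def describe_changes(old_rows, new_rows, key="stu_id"):
    """List the differences between two versions of a dataset, by case."""
    old_by_id = {row[key]: row for row in old_rows}
    new_by_id = {row[key]: row for row in new_rows}
    changes = []
    for case_id in sorted(old_by_id.keys() | new_by_id.keys()):
        old = old_by_id.get(case_id)
        new = new_by_id.get(case_id)
        if old is None:
            changes.append(f"{key}={case_id}: case added")
        elif new is None:
            changes.append(f"{key}={case_id}: case removed")
        else:
            # Compare every variable present in either version of the case.
            for name in sorted(old.keys() | new.keys()):
                if old.get(name) != new.get(name):
                    changes.append(
                        f"{key}={case_id}: {name} changed "
                        f"from {old.get(name)!r} to {new.get(name)!r}"
                    )
    return changes


# Hypothetical example: version 1 contains a data entry error (4550).
version_1 = [
    {"stu_id": 1001, "math_score": 412},
    {"stu_id": 1002, "math_score": 4550},
]
version_2 = [
    {"stu_id": 1001, "math_score": 412},
    {"stu_id": 1002, "math_score": 455},
]

change_log = describe_changes(version_1, version_2)
```

The resulting list can be pasted into the repository's version notes and edited into plain language, so users can see exactly which values differ between versions.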
If your data is not in a repository:
If you have used your data in a publication, you will also need to consider retracting the paper from the journal. Contact your journal to make them aware of the errors you found. Consider the story of Dr. Kate Laskowski, who began finding errors in her data after publishing with it and decided she had no other choice but to retract her publication. You can find the retracted article here.
An article on ScienceDirect found that, across a three-year span, data processing errors were the number two reason for retraction among the retracted articles indexed in PubMed. Errors happen, and we are all human, but it is best to take the time and care to manage your data during your project so that retractions are not necessary later on.