Data sharing has become one of the most discussed areas of data management. With a growing number of funders (e.g., NIH) requesting a data sharing plan, more people want to know how and why they should share their study data. In a 2019 paper, Pasek and Mayer found that graduate students across two universities cited data curation and re-use as the area most needing improvement. In a way, data sharing is a great jumping-off point for thinking about data management. With an end goal in mind, researchers can start to think about what structures they need to put in place to curate data that will be acceptable for sharing.
While data sharing may sometimes encompass things such as sharing correlation matrices, summary tables, or study results, that is not what this module is about. This module is about sharing raw, item- and case-level primary data collected as part of a research study. It can also include extant data that you add to the data you collect. Beyond that overview, what types of data you share, how you share, and where you share may depend more on things like your funder, your budget, your project and participants, and your field. However, this module provides many recommendations for what and how to share that will lead to the most benefits for you, your project, and your field.
As a reminder of what we covered in the Data Management Module, in 2016 the FAIR Principles were published in Scientific Data, outlining 4 guiding principles for scientific data management and stewardship. These principles should be referred to when choosing when, where and how to share your research data.
F: Findable
All data should be findable through a persistent identifier and have good data documentation, aka metadata. As we move towards automation in our work and life, the need for machine-readable metadata becomes more prevalent for automatic discovery of data.
A: Accessible
Your data is accessible if humans can access it. This can mean your data is available in a repository or through a request system.
I: Interoperable
Use standardized vocabularies as well as formats. Both humans and machines should be able to read and interpret your data. Software licenses should not pose a barrier to usage. Data should be available in open formats that can be accessed by any software such as .csv, .txt, .dat, etc. Furthermore, thorough data documentation should accompany data.
R: Reusable
Your metadata should provide information on the broad context of your project as well as your data collection to allow for accurate use of your data. You should also have clear licensing for data use.
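To make the Findable, Interoperable, and Reusable ideas a little more concrete, here is a minimal Python sketch, not a prescribed workflow. It writes a small dataset to CSV (an open format) and builds a machine-readable metadata record as JSON. The function names, the example variables, the metadata field names, and the identifier `doi:10.0000/example` are all hypothetical. In practice your repository will usually assign the persistent identifier and may expect a specific metadata standard.

```python
import csv
import io
import json


def to_csv_text(rows, fieldnames):
    """Serialize rows (a list of dicts) to CSV text, an open, software-neutral format."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def build_metadata(title, identifier, license_name, variables):
    """Build a machine-readable metadata record as a JSON string.

    The field names are illustrative only, not a formal metadata standard.
    """
    record = {
        "title": title,
        "identifier": identifier,  # Findable: persistent identifier
        "license": license_name,   # Reusable: clear terms of use
        "format": "text/csv",      # Interoperable: open format
        "variables": variables,    # variable-level documentation
    }
    return json.dumps(record, indent=2)


# Hypothetical example data, for illustration only.
rows = [
    {"stu_id": 1001, "grade": 3, "read_score": 212},
    {"stu_id": 1002, "grade": 4, "read_score": 230},
]
variables = [
    {"name": "stu_id", "label": "Proxy student ID", "type": "integer"},
    {"name": "grade", "label": "Grade level", "type": "integer"},
    {"name": "read_score", "label": "Reading assessment scale score", "type": "integer"},
]

csv_text = to_csv_text(rows, ["stu_id", "grade", "read_score"])
metadata_json = build_metadata(
    title="Example reading study, wave 1",
    identifier="doi:10.0000/example",  # placeholder, not a real DOI
    license_name="CC-BY-4.0",
    variables=variables,
)
print(csv_text)
print(metadata_json)
```

Keeping the metadata in a separate JSON file alongside the CSV is one simple option; some repositories instead collect the same information through their own deposit forms.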
Additional resources:
📑 Within & Between podcast
📑 Practical Solutions for Sharing Data and Materials from Psychological Research
While data sharing often happens at the end of a project, planning for data sharing should happen at the beginning. Many funders require a brief overview of your data sharing plan in the data management plan (DMP) that is part of your grant proposal, and also allow you to include DMP-associated costs in your grant application budget. DMPs are often a supplement or appendix to your grant application and are restricted to anywhere from 1-2 pages (NIH, NIJ, and NSF) to a 5-page maximum (IES). For most funders these DMPs are not part of the scoring process, but they are reviewed by a panel or program officer. Some funders may provide feedback or ask for revisions if they believe your plan or your budgeted costs are not adequate.
What to include in a DMP varies somewhat across funding agencies. While you should check each funding agency’s site for its specific DMP requirements, this comparison table provides an overview of 11 common categories covered in a data management plan, whether four large funding agencies ask applicants to address each category, and any additional guidance they provide.
Categories to Include in a Data Management Plan and Guidance Provided by Funder
Content Category | IES - Data Management Plan | NIH - Data Management and Sharing Plan | NIJ - Data Archiving Plan | NSF - Data Management Plan |
---|---|---|---|---|
Roles and Responsibilities | YES | YES | YES | YES |
What are staff roles in management and long-term preservation of data? Who ensures accessibility, reliability, and quality of data? Is there a plan if a core project member leaves the project or institution? | Staff responsible for data creation and management should be identified and their duties described. Identify any hardware/software required for managing data. | | | |
Type of Data to be Shared | YES | YES | YES | YES |
How is data captured (surveys, assessments, observations)? Will data be item-level and summary scores? Will you share raw datasets and clean datasets? What is the expected # of files? Expected # of participants in files? | Full item/case-level datasets, not summary statistics or tables, are expected. This includes primary data collected through the grant, as well as data created by linking to extant data sources. Share, at a minimum, the data underlying any peer-reviewed publication; this includes “final analytic measures as well as the source measures used to construct them”. However, researchers are encouraged to share any data that will “inform the field more broadly”. Providing access to final cleaned data is required. Including raw data and derived variables is optional. | All primary data collected through the grant regardless of whether the data are used to support scholarly publications. Data should be of sufficient quality to validate and replicate research findings. | All cleaned data collected under the grant award. Data should include all variables used to produce analysis, tables, and descriptive information provided in the final report (ex: computed, derived, and weight variables - if applicable). | Share primary data collected during the course of the grant. |
Format of Data | YES | YES | YES | YES |
Electronic? Nonproprietary software format (ex: csv)? More than one format (.sav and .csv)? Any related tools needed to manipulate shared data? | Must be an electronic format. Advised to provide data in multiple formats (one being a non-proprietary format). Identify any hardware/software required for sharing data. | Datasets should be provided in formats consistent with those used in the community that the repository serves. Preferred non-proprietary formats. | For Quantitative: SPSS preferred, Stata and SAS are acceptable. Embedded variable and value labels, and missing values assigned. For Qualitative: txt, rtf, Microsoft Word or PDF, Excel, databases | |
Documentation | YES | YES | YES | YES |
What metadata will you create (data dictionaries, codebooks)? Consider project level, dataset level and variable level documentation. What format will it be in (xml, csv, pdf)? What other documentation do you plan to include when sharing data (code, data collection instruments, protocols)? | Provide sufficient documentation to support responsible use by other researchers. “Information can be embedded directly into the file or can be included with it as a separate file (e.g., a ReadMe.txt file, a .pdf, etc.).” “Documentation should include a summary of the purpose of the data collection, methodology and procedures used to collect the data, timing of the data collection, as well as details of the data codes, definition of variables, variable field locations, and frequencies.” | Examples include: Methodology and procedures used to collect the data, Data labels, Definitions of variables, Any other information necessary to reproduce and understand the data | Documentation should include: IRB approved protocol, Final project report/journal article, IRB approved consent (if applicable), Data use agreements (if applicable), Description of file formats, data anomalies, frequently used terms/acronyms, instructions for merging files, documentation on decisions, guidance for using weights (if applicable), citations, contact information, codebooks if variable and value labels are not embedded in the data files, syntax files containing statistical programming code as well as data manipulation code. PDF format preferred: Word, RTF and ASCII accepted | |
Standards | YES | YES | YES | YES |
Any data/documentation standards being used (ex: DDI, International standard)? | “There are emerging metadata standards in many fields, but currently there is not a set of standards in education research.” | “While many scientific fields have developed and adopted common data standards, others have not. In such cases, the Plan may indicate that no consensus data standards exist for the scientific data and metadata to be generated, preserved, and shared.” | | “DMPs submitted to EHR should be appropriate to the data being generated and reflect the procedures, standards and best practices developed by the communities of practice in the area of research being proposed.” |
Method of Data Sharing | YES | YES | YES | YES |
How will you share data (institutional repository, data archive, PI website)? Will data be restricted and is a data enclave required? Is a data agreement required? | Highly encouraged to deposit data in publicly accessible repositories. However, other methods may be used including the PI taking responsibility for data sharing or some combination of repository and PI data sharing. Datasets should be discoverable and citable (ex: metadata and DOI). | Strongly encourage the use of established repositories to the extent possible, specifically, domain-specific repositories where possible. Data should be findable and identifiable (ex: via a persistent unique identifier or other standard indexing tools). While data kept by the researcher or institution and provided on request are not preferred, NIH recognizes and respects that many communities (ex: AI/AN communities) may want to manage, preserve and share their own data. | In most cases, the NIJ requires grantees to deposit their data in the National Archive of Criminal Justice Data (NACJD), which is hosted by ICPSR. If this is not the appropriate repository for the datasets, the data archiving plan should include submission of study level information and link to data location. | “If data or products are to be preserved by a third party, please refer to their preservation plans if available.” |
Circumstances preventing sharing | YES | YES | YES | YES |
Do you have any data covered by FERPA/HIPAA that doesn’t allow data sharing? Do you work with any partners that do not allow you to share data (ex: School districts, Tribal regulations)? Are you working with proprietary data? | “Specify appropriate restrictions on access to and usage of the data to ensure protection of human subjects while not unduly restricting access to the data.” If a DMP states data cannot be shared, researchers must provide a compelling rationale. | Will there be any restrictions on data collected from human subjects? “Any restrictions imposed by federal, Tribal, or state laws, regulations, or policies, or existing or anticipated agreements.” Any limitations should be communicated to the data repository and described in the Data Management and Sharing Plan for review. | “Unless otherwise specified in writing by the NIJ grant manager, as authorized by the appropriate NIJ authority, data submission is required for all research, development, and evaluation awards, and the requirement may not be unilaterally modified or waived.” | |
Privacy and rights of participants | YES | YES | YES | YES |
How will you maintain participant confidentiality during your project and when data is shared, and prevent disclosure of PII? Did participants sign informed consent? Did the consent communicate how participant data are expected to be used and shared? | Protect confidentiality in final data and follow rules in accordance with IRB and any state/federal laws and regulations. Proxy IDs may be used to protect direct disclosure of participants in data, but means of indirect exposure should be identified as well (ex: small numbers, sample characteristics) and remedies provided. Consent forms and IRB approvals should reference future data sharing so that participants, schools, and other participating organizations are made aware of conditions that will be put in place to protect privacy prior to any data collection. | Describe informed consent as well as how you will protect privacy and confidentiality “consistent with applicable federal, Tribal, state, and local laws, regulations, and policies” | “All direct identifiers to be removed and all indirect identifiers to be recoded to prohibit re-identification.” | “Access and sharing of data and products should reflect appropriate protections for IRB, privacy, confidentiality, data security, and intellectual property.” |
Data Security | YES | NO | YES | YES |
How will you keep data secure on site during the project? Consider IRB requirements. | Protect confidentiality in final data and follow rules in accordance with IRB and any state/federal laws and regulations. Proxy IDs may be used to protect direct disclosure of participants in data, but means of indirect exposure should be identified as well (ex: small numbers, sample characteristics) and remedies provided. Consent forms and IRB approvals should reference future data sharing so that participants, schools, and other participating organizations are made aware of conditions that will be put in place to protect privacy prior to any data collection. | You do not have to speak to security, as NIH acknowledges that data security falls within the purview of the institutional IRB (while the project is active) and the repository (once data is shared). | “All direct identifiers to be removed and all indirect identifiers to be recoded to prohibit re-identification.” | “Access and sharing of data and products should reflect appropriate protections for IRB, privacy, confidentiality, data security, and intellectual property.” |
Schedule for Data Sharing | YES | YES | YES | YES |
When will you share data and for how long? | No later than when the main findings from the final study dataset are published in a peer-reviewed scholarly publication. However, researchers may share data earlier as well, if appropriate. Data should be available for 10 years unless a shorter period of time is required to comply with Federal or State laws. | Data should be shared no later than the time of an associated publication or end of the grant period, whichever comes first. A single project can share data at different times (ex: share data underlying publication during the period of award but ALSO share data that have not yet led to a publication by the end of the award period). Data should be available as long as it is useful for the larger research community. | “Grant recipients are strongly encouraged to submit data sets 90 days or earlier prior to the end of the award project period.” | “Access to data and products should be provided, and data and the products of research shared, as soon as is reasonably possible.” |
Pre-registration | YES | NO | NO | NO |
Where and when will you pre-register your study? | Causal impact studies must be pre-registered in a recognized study registry, documenting their confirmatory research questions and planned analytic activities. Must be registered within the first year of the project. Example registry options: OSF, Registry of Efficacy and Effectiveness Studies (REES). | | | |
However, even if your funder does not require a data sharing plan, there are still many reasons to consider sharing your data, as we covered above. Planning what, how, and when to share before your project even begins is the best way to ensure you have everything in order by the time you need to share your data. You can always update your plan during or after your project completion. If data sharing is required by your funder, it may be helpful to keep in contact with your program officer regarding any potential changes throughout your project.
Data Management Plan resources:
IES:
📑 IES Implementation Guide for Public Access to Research Data
📑 U.S. Department of Education Plan and Policy Development Guidance for Public Access
📑 IES Data Sharing FAQ
📑 IES Policy Regarding Public Access to Research
📑 DMP Tool Template IES
NIH:
📑 NIH Writing a Data Management and Sharing Plan
📑 Final NIH Policy for Data Management and Sharing
📑 NIH DMSP Guidance for Data Support Services Working Group
NIJ:
NSF:
📑 Data Management for NSF EHR Directorate Proposals and Awards
📑 NSF Dissemination and Sharing of Research Results
📑 SPARC Data Sharing Requirement Comparison
While your funder may have guidance as well as requirements for your data sharing plan, there are also generally accepted best practices that you should consider when you construct your plan. Following required guidelines and best practices will help you provide data that is useful and accessible to researchers.
The Institute of Education Sciences put it best when they said to keep the big picture in mind. They listed four big ideas to consider when planning for data sharing:
Focus on sharing well-organized and well-documented data. Include all documentation necessary for someone with no familiarity with your project to pick up your data and make sense of it. Consider everything you can include so that future researchers aren’t reaching back out to you with ongoing questions. Also, organize your data with a well-designed structure. Don’t share messy files that are inconsistent across the project. Share files that are standardized, uniform, and can be easily linked if necessary.
Commit to sharing some data or code to facilitate analysis. If possible share more data beyond those used to produce a study’s main findings. And if your data have restrictions and you are unable to share, is there anything you can share that is still in the vein of open science, such as code or aggregated data?
Don’t get stuck on one single way to share data. Each project is unique and has its own opportunities and constraints. Think outside the box about what means of data sharing works best for you, while also considering how to maximize impact.
Some projects, though not most, will need to consider the possibility of tradeoffs. There may be times when sharing all of your data requires you to share with restricted access, while removing some variables and sharing only some of your data allows you to share the dataset openly. Researchers may need to consider what makes the most sense here, and in some circumstances you may be able to share data through a combination of methods.
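The tradeoff between open and restricted sharing can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the variable lists, the example records, and the function names are hypothetical, and the real decision about which variables are identifying or sensitive belongs to you, your IRB, and your funder. Dropping columns also does not by itself rule out indirect identification (for example, through small numbers or sample characteristics), so treat this as a starting point rather than a de-identification procedure.

```python
# Hypothetical variable lists; real choices depend on your consent forms,
# IRB requirements, and funder guidance.
DIRECT_IDENTIFIERS = {"name", "email"}
SENSITIVE_VARIABLES = {"school_name", "disability_status"}


def drop_variables(records, excluded):
    """Return copies of the records without the excluded variables."""
    return [
        {key: value for key, value in record.items() if key not in excluded}
        for record in records
    ]


def make_restricted_version(records):
    """Remove direct identifiers only; intended for restricted-access sharing."""
    return drop_variables(records, DIRECT_IDENTIFIERS)


def make_public_version(records):
    """Remove direct identifiers and sensitive variables; intended for open sharing."""
    return drop_variables(records, DIRECT_IDENTIFIERS | SENSITIVE_VARIABLES)


# Made-up example records, for illustration only.
records = [
    {"proxy_id": 1, "name": "A. Student", "email": "a@example.org",
     "school_name": "Hypothetical Elementary", "disability_status": "yes",
     "grade": 3, "read_score": 212},
    {"proxy_id": 2, "name": "B. Student", "email": "b@example.org",
     "school_name": "Hypothetical Elementary", "disability_status": "no",
     "grade": 4, "read_score": 230},
]

restricted_rows = make_restricted_version(records)
public_rows = make_public_version(records)
print(sorted(restricted_rows[0]))
print(sorted(public_rows[0]))
```

Producing both files from the same cleaned source keeps the two versions consistent, so a user who later gains restricted access can link back to the public file through the proxy ID.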
Data sharing is a great service to the larger community, and that service is built upon trust. It’s built upon you trusting that your own data is accurate and represents the true information that you collected. It is also built upon users putting their trust in you and your data: that the data you provide are accurate and free from errors, and that the findings you publish are based on data that are free from errors and manipulation.
This means, when you find errors in the data that you have publicly shared, you have an obligation to do something about it.
If your data is deposited in a repository:
Consider making a comment in your project, notifying users of errors in the data. If the repository requires/allows users to make an account before accessing the data, they may have a system to email current users to let them know a new comment was added to a project they have downloaded.
If the errors in your data are fixable, take time to go back to the raw data and re-clean, making the appropriate edits.
Upload a new version of your data to the repository. Make comments about the revisions so that users know what changes have been made between the previous and new version of your data.
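When you upload a corrected version, the revision comments are easier to write if you compare the two versions programmatically. The sketch below is one possible approach, not a repository feature: the function name, the `stu_id` key, and the example values are hypothetical, and it assumes each record has a unique ID that is stable across versions.

```python
def summarize_changes(old_rows, new_rows, id_field):
    """Return plain-language notes describing how new_rows differs from old_rows.

    Assumes id_field uniquely identifies a record in both versions.
    """
    old_by_id = {row[id_field]: row for row in old_rows}
    new_by_id = {row[id_field]: row for row in new_rows}
    notes = []

    # Records present only in the old version were removed.
    for rec_id in sorted(old_by_id.keys() - new_by_id.keys()):
        notes.append(f"Removed record {rec_id}")

    # Records present only in the new version were added.
    for rec_id in sorted(new_by_id.keys() - old_by_id.keys()):
        notes.append(f"Added record {rec_id}")

    # Records present in both versions are compared field by field.
    for rec_id in sorted(old_by_id.keys() & new_by_id.keys()):
        old, new = old_by_id[rec_id], new_by_id[rec_id]
        for field in sorted(old.keys() | new.keys()):
            if old.get(field) != new.get(field):
                notes.append(
                    f"Record {rec_id}: {field} changed from "
                    f"{old.get(field)!r} to {new.get(field)!r}"
                )
    return notes


# Made-up example: version 1 contains a data entry error (2300 instead of 230).
version_1 = [
    {"stu_id": 1001, "read_score": 212},
    {"stu_id": 1002, "read_score": 2300},
]
version_2 = [
    {"stu_id": 1001, "read_score": 212},
    {"stu_id": 1002, "read_score": 230},
    {"stu_id": 1003, "read_score": 198},
]

change_notes = summarize_changes(version_1, version_2, id_field="stu_id")
for note in change_notes:
    print(note)
```

Notes like these can be pasted into the repository's version comments, alongside a short explanation of why each change was made.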
If your data is not in a repository:
If you have used your data in a publication, you will also need to consider retracting the paper from the journal. Contact your journal to make them aware of the errors you found. Consider the story of Dr. Kate Laskowski, who began finding errors in her data after publishing with it and decided she had no other choice but to retract her publication. You can find the retracted article here.
An article available on ScienceDirect found that, across a three-year span, among the retracted articles indexed in PubMed, data processing errors were the number two reason for retraction. Errors happen, and we are all human, but it is best to take the time and care to manage your data during your project so that retractions are not necessary later on.
To learn more about reasons for retractions, you can read the blog Retraction Watch or search their database.