Preparing your data for shareability
Effective data sharing requires data to be organized, well-documented, and appropriately preserved. Human subjects research also requires the informed consent of study participants before data can be shared, so make sure data sharing is addressed in your IRB protocol and informed consent forms. It is also crucial to de-identify data prior to sharing to reduce the risk of identifying individuals in datasets.
- Guidance on the HIPAA Privacy Rule in Research
Details the HIPAA Privacy Rule, which outlines the circumstances under which covered entities may use or disclose protected health information for research.
- Preparing raw clinical data for publication
Many peer-reviewed journals now require authors to be prepared to share their raw, unprocessed data with other scientists or to state the availability of raw data in published articles, but little information has emerged on how such data should be prepared for sharing. In this article from the BMJ, Iain Hrynaszkiewicz and colleagues propose a minimum standard for de-identifying datasets to ensure patient privacy when sharing clinical research data.
- J-PAL Guide to De-Identifying Data
Researchers who plan to publish data on human subjects should take careful steps to protect the confidentiality of study participants through data de-identification—a process that reduces the risk of re-identifying individuals within a given dataset. This guide provides further details on the de-identification process, including various procedures for de-identifying a dataset, a list of common identifiers that need to be reviewed, and sample code that can be used to de-identify data intended for publication.
- Guide for De-identifying Qualitative Research
A guide from the Qualitative Data Repository that discusses different types of potential identifiers and how to deal with them when sharing research data.
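As a rough illustration of the de-identification steps these guides describe, the Python sketch below drops direct identifiers, replaces the participant ID with a salted hash, keeps only the year of a date, and top-codes ages above 89. It is not the sample code from the J-PAL guide, and the field names (`participant_id`, `name`, `email`, `phone`, `visit_date`, `age`, `score`) and the helper functions are hypothetical. A real dataset also needs a review of indirect identifiers, so treat this as a starting point rather than a complete procedure.

```python
import hashlib
import secrets

# Hypothetical field names for illustration; adjust to your own dataset.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}


def pseudonymize(value: str, salt: bytes) -> str:
    """Return a salted SHA-256 digest, truncated to 12 hex characters."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()[:12]


def deidentify(record: dict, salt: bytes) -> dict:
    """Return a copy of the record with identifiers removed or coarsened."""
    # Drop fields that identify a person directly.
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Replace the original ID with a pseudonym that is stable within this release.
    out["participant_id"] = pseudonymize(str(record["participant_id"]), salt)
    # Keep only the year of an ISO-formatted date (YYYY-MM-DD).
    out["visit_year"] = out.pop("visit_date")[:4]
    # Top-code ages above 89 into a single "90+" category.
    out["age"] = "90+" if record["age"] > 89 else str(record["age"])
    return out


# The salt must stay private; never publish it alongside the data.
salt = secrets.token_bytes(16)

records = [
    {"participant_id": 101, "name": "Jane Doe", "email": "jane@example.org",
     "phone": "555-0100", "visit_date": "2021-03-14", "age": 93, "score": 7},
    {"participant_id": 102, "name": "John Roe", "email": "john@example.org",
     "phone": "555-0101", "visit_date": "2020-11-02", "age": 41, "score": 5},
]

shared = [deidentify(r, salt) for r in records]
```

The salted hash keeps records linkable within the shared file without exposing the original IDs, provided the salt is kept secret. Coarsening dates and ages reduces detail, so check that the shared data still supports the analyses you expect others to run.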
Licensing Your Data
When you share your data, it’s important to include a license. A license tells others exactly how they can use your data and how to give you credit. Without a license, people may be unsure what they’re allowed to do, which can lead to confusion and discourage reuse.
Licensing data is different from licensing other open access materials. Because datasets are often combined, reused, and built from many sources, requiring detailed attribution can quickly become complicated and make your data harder to reuse.
To avoid these issues, many researchers choose a license that doesn’t require attribution, such as CC0 or the Open Data Commons Public Domain Dedication and License (PDDL). These licenses make it easier for others to reuse your data without legal uncertainty.
The resources linked below can help you understand what needs to be considered when licensing your data.
Where can data be shared?
Domain Specific Repositories
The NIH supports a large number of domain-specific data sharing repositories. These repositories are described in two lists: one for repositories that allow open submission and access and one for repositories that may restrict submission and access to specific researchers. Best practices and many policies dictate that data should be shared via a domain-specific repository when one is available.
Generalist Repositories
The repositories listed below accept datasets from all research disciplines and are appropriate when a domain-specific repository does not exist. They also accept deposits of other scholarly outputs, such as preprints and software.
- Zenodo: general-purpose open-access repository operated by CERN. It allows researchers to deposit research papers, datasets, research software, reports, and many other types of research outputs.
- Figshare: online open access repository where researchers can preserve and share their research outputs, including figures, datasets, images, and videos. It is free to upload content and free to access.
- Open Science Framework (OSF): open source software project that facilitates open collaboration in scientific research. It can be used for both research data management and research project management.
- Harvard Dataverse: free data repository open to all researchers from any discipline, both inside and outside of the Harvard community, where you can share, archive, cite, access, and explore research data.
- Mendeley Data: open repository for sharing research data and a search engine that indexes both domain-specific and cross-domain data repositories.
- Dryad: international open-access repository of research data. It is free to access, but submission may involve a Data Publishing Charge (DPC).
- Vivli: global clinical research data sharing platform from the Center for Global Clinical Research Data.