Open Data: What is it and why is it important?

tristankk · April 30, 2021, 8:57pm

Open data is the practice of making research data publicly accessible for others to use. Sharing the data gathered for a specific study allows other groups to confirm, critique, or extend that study. If the data associated with a study is public, it allows others to repeat the authors’ analysis, verifying their claimed results. The awareness that this confirmation is possible may also incentivize researchers to gather data thoughtfully and analyze it properly. “[Sharing data] improves the potential for aggregation of raw data for research synthesis (Cooper, Hedges, & Valentine, 2009), it presents opportunities for applications with the same data that may not have been pursued by the original authors, and it creates a new opportunity for citation credit and reputation building.” (Nosek et al., 2012)

More recently, the question of how best to share data so that it can be as useful as possible to as wide an audience as possible has resulted in the GO FAIR initiative, advocating for FAIR data (Wilkinson et al., 2016): data that is findable, accessible, interoperable, and reusable. As the GO FAIR initiative has gained traction, many data repositories have begun to enforce policies that encourage datasets to adhere to the FAIR principles. For this reason and on the FAIR principles’ own merits, it’s useful for researchers to keep them in mind, especially as they consider a format for their (meta)data and a public repository to use for sharing their data.

High-level procedure

Sharing data is often a good place to start when adopting Open Science principles, because it is often possible without other changes to the study protocol. To share your data, take the following steps:

Ensure you’re allowed to share your data under any ethics and/or privacy agreements associated with that data.
Organize your data in an accessible way. Ideally, this includes organizing your data with a standard, non-proprietary directory structure and/or file format that includes as much relevant metadata as possible.
Identify an appropriate public repository for your data. Some repositories enforce a specific structure and/or size limits for datasets, so be aware of the requirements of a repository you have in mind. See this topic for a more in-depth discussion.
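
As a rough illustration of the second step, here is a minimal Python sketch of one way to store tabular data in a non-proprietary format (CSV) next to a machine-readable metadata file (JSON). The function name `save_with_metadata`, the file naming, and the metadata fields are all hypothetical choices made for this example, not a standard; your field or repository may expect something different.

```python
import csv
import json
from pathlib import Path


def save_with_metadata(rows, fieldnames, metadata, out_dir, stem):
    """Write rows to <stem>.csv and metadata to <stem>.json inside out_dir.

    Hypothetical helper: the CSV-plus-JSON layout is an assumption for this
    example, not a requirement of any particular repository.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Plain-text CSV can be opened without proprietary software.
    data_path = out_dir / f"{stem}.csv"
    with data_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # Keep the metadata (units, descriptions, license, ...) next to the data.
    meta_path = out_dir / f"{stem}.json"
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")

    return data_path, meta_path
```

Calling it with a list of dictionaries, for example `save_with_metadata([{"subject": "s01", "rt_ms": 412}], ["subject", "rt_ms"], {"units": {"rt_ms": "milliseconds"}}, "shared_data", "reaction_times")`, would produce `reaction_times.csv` and `reaction_times.json` in the `shared_data` directory.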

Data structure/organization

Some fields have specific standards for how their datasets are organized. It may be a good idea to explore open datasets from publications or repositories associated with your field. As a starting point, here are some existing standards for neuroimaging-adjacent modalities:

Brain Imaging Data Structure (BIDS) - A general specification for organizing neuroimaging data. BIDS initially came from the (functional) MRI community, so MRI modalities tend to have the most mature specifications, but BIDS has extensions, some accepted and some in progress, for other neuroimaging modalities.
Neurodata Without Borders: Neurophysiology (NWB:N) - A data standard for neurophysiology data, where a single file contains the structure and metadata for recorded data.
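
To give a feel for what a BIDS-style layout looks like, here is a small Python sketch that builds the path for one file and writes a minimal `dataset_description.json`. It is only an approximation under simplifying assumptions: the helper names `bids_file_path` and `init_dataset` are hypothetical, only the subject and session entities are handled, and the version string passed in should be checked against the current specification. For real datasets, the BIDS specification and the BIDS validator are the authoritative sources.

```python
import json
from pathlib import Path


def bids_file_path(root, subject, datatype, suffix, extension, session=None):
    """Build a BIDS-style path such as root/sub-01/anat/sub-01_T1w.nii.gz.

    Hypothetical helper: it covers only the subject and (optional) session
    entities, which is a simplification of the full specification.
    """
    entities = [f"sub-{subject}"]
    if session is not None:
        entities.append(f"ses-{session}")

    # Directories mirror the entities, followed by the datatype folder.
    directory = Path(root).joinpath(*entities, datatype)
    # The filename repeats the entities, joined by underscores, then the suffix.
    filename = "_".join(entities + [suffix]) + extension
    return directory / filename


def init_dataset(root, name, bids_version):
    """Create the dataset root and a minimal dataset_description.json."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    description = {"Name": name, "BIDSVersion": bids_version}
    description_path = root / "dataset_description.json"
    description_path.write_text(json.dumps(description, indent=2), encoding="utf-8")
    return description_path
```

For example, `bids_file_path("my_dataset", "01", "anat", "T1w", ".nii.gz")` gives `my_dataset/sub-01/anat/sub-01_T1w.nii.gz`, and adding `session="pre"` gives `my_dataset/sub-01/ses-pre/anat/sub-01_ses-pre_T1w.nii.gz`.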