Choosing an Open Data Repository

There are already many public repositories that you can use to share an open dataset, and the number of repositories continues to grow. With that in mind, several criteria can help identify the most appropriate repository for a given dataset:

  • How much storage space is available for a given dataset? 50 GB is a common limit for dataset size, but datasets, especially in neuroimaging, can be larger; this will narrow the set of repositories that are available.
  • How does the repository handle metadata? A repository that processes a lot of relevant metadata for all the datasets it hosts will more easily be able to expose a dataset to interested users.
  • Who is responsible for the repository? What guarantees do they place on the persistence of your data? Especially if you link to your Open Data in a publication, it is important that the data remains accessible from that link for as long as possible.
  • What subject matter does the repository concern itself with? Repositories range from totally general-purpose research repositories that will host any kind of data from any region, to specialized repositories that accept only datasets from a specific discipline and/or region.
    • This choice is a trade-off: Researchers who know they’re interested in a given discipline will more easily find relevant datasets in a repository dedicated to that discipline, but potentially interested researchers outside the discipline will have a harder time finding them in a specialized repository.
  • Is the repository curated? Some repositories review datasets in a process similar to peer review of papers, in which they screen datasets for relevance to the repository, appropriate metadata, and/or data organization, among other factors.
    • Curated repositories will generally contain high-quality datasets, but submitting to a curated repository may have an associated financial cost, and acceptance of a dataset will take more time.
  • Does the repository provide a persistent DOI for uploaded datasets? In line with FAIR principle F1, it is good practice for an open dataset to have a globally unique and persistent identifier like a DOI.

A sample of available data repositories follows, with notes on their advantages and disadvantages:

  • OSF Projects offer a number of research project management features, including storage for research data.
    • OSF is general-purpose and enforces no structure for data, so it is often a good choice for sharing a complete dataset without needing to make many changes.
    • For public projects, OSF limits data storage to 50 GB.
    • OSF produces DOIs for projects and registrations (static snapshots) of projects.
  • Zenodo is a repository for storing datasets and code.
    • Zenodo does not enforce any structure for uploads.
    • Zenodo datasets are limited to 50 GB (without making a special request).
    • Zenodo produces a DOI for a dataset that always points to the current version of the dataset and a separate DOI for every version of the dataset.
  • FRDR is “a scalable federated platform for digital research data management (RDM) and discovery” hosted by Portage, Compute Canada, and the Canadian Association of Research Libraries.
    • FRDR is designed to handle medium-to-large datasets that are not easily handled by other platforms.
    • Data deposited to FRDR is curated, so there is a review process before data can be accepted.
    • FRDR does not appear to have any hard limits on dataset size, but each PI can have up to 1 TB of data associated with them.
    • FRDR is general-purpose, but limited to Canadian researchers.
  • The CONP Portal is an open science infrastructure for sharing neuroscience data. It exposes Canadian neuroscience-related data from a variety of modalities with a unified interface (based on DataLad), but without an enforced data structure.
    • The CONP Portal's major requirement for a dataset is a DATS.json file, which exposes structured metadata about the dataset and makes it easier to discover.
    • The CONP Portal crawls OSF and Zenodo for datasets (and parses associated metadata), so a dataset on OSF or Zenodo can be added to the CONP Portal simply by adding a specific tag, as documented here.
    • If a dataset isn’t on OSF or Zenodo, it can be manually shared with the CONP portal using a DataLad installation on any machine that has access to the dataset. This capability allows the CONP Portal to federate data from decentralized sources.
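
As an illustration, a minimal DATS.json might look roughly like the sketch below. The field names are based on the general DATS metadata schema and all values are hypothetical; consult the CONP documentation for the exact properties the portal requires:

```json
{
  "title": "Example Neuroimaging Dataset",
  "description": "Resting-state fMRI scans from a hypothetical study.",
  "creators": [
    { "firstName": "Jane", "lastName": "Doe" }
  ],
  "licenses": [
    { "name": "CC BY 4.0" }
  ],
  "keywords": [
    { "value": "fMRI" },
    { "value": "resting state" }
  ],
  "version": "1.0.0"
}
```

Because this file is machine-readable, the portal can index fields like keywords and licenses directly, which is what makes DATS-described datasets easier to discover than unstructured uploads.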
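
The DataLad-based sharing route can be sketched with generic DataLad commands like the following. This is a rough outline, not the exact CONP submission workflow (which is described in their documentation); the `/path/to/data` location and the sibling name `conp` are placeholders:

```
# Initialize a new DataLad dataset on a machine with access to the data
datalad create my-dataset
cd my-dataset

# Add the data files and record them in the dataset's history
cp -r /path/to/data .
datalad save -m "Add initial data"

# Push the dataset to a previously configured sibling (placeholder name)
datalad push --to conp
```

Because DataLad tracks the dataset wherever it lives, the portal only needs to index the dataset's metadata; the data itself can stay on the original host, which is what allows the CONP Portal to federate data from decentralized sources.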