Installing Datalad on Compute Canada

Datalad

Datalad is essentially Git for large datasets. You can datalad install a dataset (e.g. HCP1200, an OpenNeuro dataset, etc.) to get it without downloading all of the large files, and then datalad get to actually download the files you need.

See http://datasets.datalad.org/ for datasets available, and http://www.datalad.org/ for general info.
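
As a minimal sketch of that install/get pattern (the dataset and file paths below are just examples):
datalad install ///openneuro/ds000102      # clones the dataset layout; large files are not downloaded yet
cd ds000102
datalad get sub-01/anat/sub-01_T1w.nii.gz  # fetches only the file you need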

Instructions for using Datalad on Compute Canada:

  1. Install by loading git-annex, and creating a virtualenv for datalad:
module load git-annex python/3
virtualenv ~/venv_datalad
source ~/venv_datalad/bin/activate
pip install datalad
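
To confirm the install worked (with the virtualenv still active), check that both tools are available:
datalad --version
git annex version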

Note: this has only been tested with StdEnv/2020 (which provides a newer version of git-annex). If you are not running StdEnv/2020 and are having issues, set it as your default:

echo "module-version StdEnv/2020 default" >> $HOME/.modulerc
  2. Give it a try by installing a datalad dataset:
cd $SCRATCH
datalad install ///labs/hasson/narratives

You should now see the narratives folder containing a BIDS dataset, and the small files should already be downloaded and accessible (try opening dataset_description.json).
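
For example (paths assume the install above; the sub-001/func path is just an example and the exact layout may differ):
cd $SCRATCH/narratives
cat dataset_description.json    # small text files are fetched at install time
ls -l sub-001/func              # annexed (large) files show up as broken symlinks until you datalad get them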

Note: you should always use scratch for datalad datasets (and not /project) because of the large number of files created when installing a dataset.

  3. Get all the data from sub-001 with:
cd narratives
datalad get -r -J 1 sub-001

Note: you must use the -J 1 option to force one download job at a time, otherwise the process may be killed for using too much memory.
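
If you only need part of a subject, or want to free scratch space afterwards, you can get and later drop specific paths (the sub-directory below is just an example):
datalad get -J 1 sub-001/func    # fetch only a subset of the files
datalad drop sub-001/func        # release the downloaded content; it can be re-fetched later with datalad get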

You can also download the HCP1200 Open Access data; see http://datasets.datalad.org/?dir=/hcp-openaccess for instructions (you need a ConnectomeDB account, and you use it to obtain S3 access credentials). Note that not all of the HCP1200 data is available via Open Access, but it is worth checking first, since datalad is the easiest way to download HCP1200 data.
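
A rough sketch of what that can look like once you have your S3 credentials (the subject ID and sub-paths are only examples; datalad will prompt for the credentials the first time you get data):
cd $SCRATCH
datalad install ///hcp-openaccess
cd hcp-openaccess
datalad get -n HCP1200/100206             # install the subject subdataset without downloading data
datalad get -r -J 1 HCP1200/100206/T1w    # then fetch, e.g., just the structural data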

TODO: make sure datalad is installed on CBS server too