Brussels / 1 & 2 February 2020


DataLad

Perpetual decentralized management of digital objects for collaborative open science


Contemporary sciences are heavily data-driven, but today's data management technologies and sharing practices lag at least a decade behind their counterparts in the software ecosystem. Merely providing file access is insufficient for a simple reason: data are not static. Data often do (and should!) continue to evolve: file formats change, bugs are fixed, new data are added, and derived data need to be integrated. While (distributed) version control systems are a de facto standard for open source software development, a similar level of tooling and culture is not present in the open data community.

The lecture introduces DataLad, software that aims to address this problem by providing a feature-rich API (command line and Python) for the joint management of all digital objects of science: source code, data artifacts (as well as their derivatives), and essential utilities, such as container images of the computational environments employed. A DataLad dataset represents a comprehensive and actionable unit that can be used privately or be published on today's cyberinfrastructure (GitLab, GitHub, Figshare, S3, Google Drive, etc.) to facilitate large- and small-scale collaborations.

In addition to essential version control tasks, DataLad aids data discovery by supporting a plurality of evolving metadata description standards. Moreover, DataLad is able to capture data provenance information in a way that enables programmatic re-execution of computations, and as such provides a key feature for the implementation of reproducible science. DataLad is extensible and can be customized to fine-tune its functionality for specific domains (e.g., a field of science or organizational requirements).
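To make the idea of re-executable provenance concrete, here is a minimal sketch in plain Python. It is not DataLad's implementation or record format (DataLad exposes this capability through its `datalad run` and `datalad rerun` commands); the function names `run_and_record` and `rerun`, and the dictionary layout of the record, are hypothetical and chosen only for illustration. The sketch assumes a deterministic command whose inputs and outputs are known in advance.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_and_record(cmd, inputs, outputs, cwd):
    """Run cmd in cwd and return a provenance record (hypothetical format)."""
    subprocess.run(cmd, cwd=cwd, check=True)
    return {
        "cmd": list(cmd),
        "inputs": {p: sha256_of(Path(cwd) / p) for p in inputs},
        "outputs": {p: sha256_of(Path(cwd) / p) for p in outputs},
    }


def rerun(record, cwd):
    """Re-execute a recorded command and report whether outputs are identical."""
    # Refuse to re-execute if an input no longer matches the recorded checksum.
    for path, digest in record["inputs"].items():
        if sha256_of(Path(cwd) / path) != digest:
            raise ValueError(f"input changed since recording: {path}")
    subprocess.run(record["cmd"], cwd=cwd, check=True)
    return all(
        sha256_of(Path(cwd) / path) == digest
        for path, digest in record["outputs"].items()
    )


def demo():
    """Record a toy computation, delete its output, and reproduce it."""
    with tempfile.TemporaryDirectory() as workdir:
        (Path(workdir) / "raw.txt").write_text("3\n1\n2\n")
        # Toy computation: sort the lines of raw.txt into sorted.txt.
        script = (
            "lines = sorted(open('raw.txt').read().split());"
            "open('sorted.txt', 'w').write('\\n'.join(lines))"
        )
        record = run_and_record(
            [sys.executable, "-c", script], ["raw.txt"], ["sorted.txt"], workdir
        )
        (Path(workdir) / "sorted.txt").unlink()
        return record, rerun(record, workdir)
```

Calling `demo()` returns the record together with a flag that tells whether the re-executed command reproduced the recorded output byte for byte. Because the record stores the command alongside checksums of what it read and wrote, a result can be regenerated and verified rather than merely described.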

DataLad is built on a few key principles:

  1. DataLad only knows about two things: datasets and files. A DataLad dataset is a collection of files in folders, and a file is the smallest unit any dataset can contain. At its core, DataLad is a completely domain-agnostic, general-purpose tool to manage data.

  2. A dataset is a Git repository. All features of the version control system Git also apply to everything managed by DataLad.

  3. A DataLad dataset can take care of managing and version controlling arbitrarily large data. To do this, it has an optional annex for (large) file content. Thanks to this annex, DataLad can track files that are terabytes in size, something that Git alone could not do. This allows you to restore previous versions of data, transform and work with data while capturing all provenance, or share data with whomever you want. At the same time, DataLad quietly does everything necessary in the background to make this feature work: the annex is set up automatically, and the tool git-annex manages it all under the hood.

  4. DataLad follows the social principle of minimizing custom procedures and data structures. DataLad will not transform your files into something that only DataLad or a specialized tool can read. A PDF file (or any other type of file) stays a PDF file (or whatever other type of file it was), whether it is managed by DataLad or not. This guarantees that users will not lose data or data access if DataLad vanishes from their system, or even from the face of the Earth. Using DataLad thus neither requires nor generates data structures that can only be used or read with DataLad. DataLad does not tie you down, it liberates you.

  5. Furthermore, DataLad is developed for complete decentralization. No central server or service is required to use DataLad. In this way, no central infrastructure needs to be maintained (or paid for): your own laptop is the perfect place for your DataLad project to live, as is your institution's web server, or any other common computational infrastructure you might be using.

  6. Simultaneously, though, DataLad aims to maximize the (re-)use of existing third-party data resources and infrastructure. Users can use existing central infrastructure should they want to. DataLad works with any infrastructure from GitHub to Dropbox, Figshare, or institutional repositories, enabling users to reap all of the advantages of their preferred infrastructure without tying anyone down to central services.
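Principles 3 and 4 can be illustrated with a toy model of an annex: large content is stored unchanged under a key derived from its checksum, while the version-controlled tree holds only a small pointer to that key. This is a conceptual sketch only, not how git-annex is implemented; the key format `SHA256-<digest>`, the directory names, and the functions `annex_add` and `annex_get` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def annex_add(store, tracked_dir, name, content):
    """Store content under its checksum key; track only a small pointer file."""
    key = "SHA256-" + hashlib.sha256(content).hexdigest()  # hypothetical key format
    store.mkdir(parents=True, exist_ok=True)
    (store / key).write_bytes(content)    # the bytes are stored as-is, untransformed
    (tracked_dir / name).write_text(key)  # this pointer is what gets versioned
    return key


def annex_get(store, tracked_dir, name):
    """Resolve a pointer file to its content and verify the checksum."""
    key = (tracked_dir / name).read_text()
    content = (store / key).read_bytes()
    if "SHA256-" + hashlib.sha256(content).hexdigest() != key:
        raise ValueError(f"content for {name} does not match its key")
    return content


def demo():
    """Add a large stand-in file, then retrieve it through its pointer."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        store, tracked = root / "store", root / "tracked"
        tracked.mkdir()
        payload = b"%PDF-1.4 " + b"x" * 100_000  # stand-in for a large file
        key = annex_add(store, tracked, "paper.pdf", payload)
        pointer_size = (tracked / "paper.pdf").stat().st_size
        restored = annex_get(store, tracked, "paper.pdf")
        return key, pointer_size, restored == payload
```

In this sketch the tracked pointer stays a few dozen bytes regardless of how large the payload is, and the stored content is byte-identical to the original file, so it remains readable with ordinary tools even if the managing software is gone.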

Speakers

Michael Hanke
