prompted by @satra in slack.

We need to refactor our interface/approach to support that, since at the moment dandischema validation is purely against the schema, i.e. against the pydantic models. Properties of the current approach:
P1: dandischema validates metadata only, i.e. without looking into any data file (hence it can be used by the dandi-api server, which has no direct access to data blob files); see the sketch after this list
a significant portion of "BIDS dataset" validation requires looking into sidecar .json files and, IIRC, even actual data files
P2: operates "independently" on Dandiset and Asset records:
only the Dandiset record seems to carry some indication that the Dandiset is a BIDS dataset, since assetsSummary.dataStandard is populated while summarizing the dandiset (it should contain identifier: RRID:SCR_016124)
so whenever it gets to validate an Asset, it has no information on the corresponding Dandiset, and thus no indication that it is a "BIDS asset" (note: if we allow assets to be reused across dandisets, the same asset might be part of both a BIDS dataset and a non-BIDS dataset)
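For illustration, P1 means validation consumes nothing but a metadata record; a minimal sketch, assuming the pydantic models exposed in `dandischema.models` (the field values are illustrative placeholders, not a complete valid record):

```python
# Minimal sketch of metadata-only validation (P1): dandischema checks a
# metadata dict against its pydantic models, never opening a data blob.
from pydantic import ValidationError

from dandischema.models import Dandiset

meta = {
    "identifier": "DANDI:000000",  # illustrative placeholder values
    "name": "Example dandiset",
    # ... remaining required fields ...
}

try:
    Dandiset(**meta)  # raises on any schema violation
except ValidationError as e:
    print(e)  # errors reported without ever touching data files
```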
If we were to stay in pure "python land" and just try to use the bids-validator python module, its "usefulness" would be quite limited - I think it can only test whether filenames correspond to BIDS (which suits P1, since we cannot access data anyway).
And I don't think it even uses the WiP stock BIDS schema yet: https://github.com/bids-standard/bids-specification/tree/master/src/schema . But AFAIK we do not even have a stable Python API library providing interfaces to load/use that stock schema for validation (relevant original discussion etc.: bids-standard/bids-specification#543).
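To be concrete, the filename-only check that module offers looks roughly like this (the paths are made-up examples):

```python
# bids_validator's pure-python check: path-level BIDS compliance only,
# with no inspection of sidecar or data file contents.
from bids_validator import BIDSValidator

validator = BIDSValidator()
# paths are given relative to the dataset root, with a leading slash
print(validator.is_bids("/sub-01/anat/sub-01_T1w.nii.gz"))    # True
print(validator.is_bids("/sub-01/anat/sub-01_bogus.nii.gz"))  # False
```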
pybids uses that bids_validator module solely for that purpose, but I guess it does more internal checks while constructing the layout. But even if pybids provided some extra power for validation, it is quite a heavy dependency:
list of stuff `pip install pybids` would pull in on top of dandischema:

```console
$> pip install pybids
Collecting pybids
Using cached pybids-0.13.1-py3-none-any.whl (3.2 MB)
Collecting pandas>=0.23
Using cached pandas-1.3.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
Collecting sqlalchemy<1.4.0.dev0
Using cached SQLAlchemy-1.3.24-cp39-cp39-manylinux2010_x86_64.whl (1.3 MB)
Collecting numpy
Using cached numpy-1.21.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.8 MB)
Collecting scipy
Using cached scipy-1.7.1-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (28.5 MB)
Collecting click
Using cached click-8.0.1-py3-none-any.whl (97 kB)
Collecting bids-validator
Using cached bids_validator-1.8.0-py2.py3-none-any.whl (19 kB)
Collecting nibabel>=2.1
Using cached nibabel-3.2.1-py3-none-any.whl (3.3 MB)
Collecting num2words
Using cached num2words-0.5.10-py3-none-any.whl (101 kB)
Collecting patsy
Using cached patsy-0.5.1-py2.py3-none-any.whl (231 kB)
Collecting packaging>=14.3
Using cached packaging-21.0-py3-none-any.whl (40 kB)
Collecting pyparsing>=2.0.2
Using cached pyparsing-2.4.7-py2.py3-none-any.whl (67 kB)
Collecting python-dateutil>=2.7.3
Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting pytz>=2017.3
Using cached pytz-2021.1-py2.py3-none-any.whl (510 kB)
Requirement already satisfied: six>=1.5 in ./venvs/dev3/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas>=0.23->pybids) (1.16.0)
Collecting docopt>=0.6.2
Using cached docopt-0.6.2-py2.py3-none-any.whl
Installing collected packages: pyparsing, pytz, python-dateutil, packaging, numpy, docopt, sqlalchemy, scipy, patsy, pandas, num2words, nibabel, click, bids-validator, pybids
```
Even if we decided to just use the "official" JS bids-validator, I don't think we would have much luck, since AFAIK it needs the content of at least the .json sidecar files (well, I guess those could be fetched, but the sheer number of them might make that expensive for some datasets). So I am not sure that option is easy to realize for remote (on dandi-api server) execution either, short of developing FUSE mount support for dandisets based on our `/assets` listing and feeding that to bids-validator.
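As a rough feasibility sketch of the "fetch the sidecars" option: enumerate a dandiset version's .json assets and pull their contents, so a validator could see the sidecar metadata without the data blobs. The endpoint shapes and the `glob` filter below are assumptions about the public dandi-api, not verified client code:

```python
# Hypothetical sketch: fetch only the .json sidecars of a dandiset version.
# Endpoint paths and query parameters are assumptions, and pagination of
# the asset listing is ignored for brevity.
import requests

API = "https://api.dandiarchive.org/api"
dandiset_id, version = "000000", "draft"  # placeholders

assets = requests.get(
    f"{API}/dandisets/{dandiset_id}/versions/{version}/assets/",
    params={"glob": "*.json"},  # assumption: server-side path filtering
).json()["results"]

sidecars = {
    a["path"]: requests.get(f"{API}/assets/{a['asset_id']}/download/").json()
    for a in assets
}
print(f"fetched {len(sidecars)} sidecar files")
```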
Altogether -- I do not see an easy way to support "ultimate bids-validation", but I feel we can get quite far if we use the official WiP schema, even if for starters just to ensure that file naming/presence complies with the dataset's BIDS version. We might want to instigate/contribute toward bids-standard/bids-specification#543 (I will follow up there).
But I also think we might need to add a more explicit indicator (not just the auto-summarized mention) to the dandiset-level metadata (a dedicated metadata entry) stating that the dataset is a BIDS dataset.
To address P2 I think we would need to "couple" the notion of a Dandiset and its asset(s) for validation, and most likely validate a BIDS dandiset as a whole for the purpose of BIDS validation, not per asset/file.
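One hypothetical shape such coupling could take (the function and its path-validator hook are illustrations, not an existing dandischema API):

```python
# Hypothetical sketch of whole-dandiset BIDS validation (P2): gather all
# asset paths belonging to one dandiset and validate them together, so the
# validator sees the dataset layout rather than isolated assets.
from typing import Callable, Iterable, List


def validate_bids_dandiset(
    asset_paths: Iterable[str],
    path_validator: Callable[[str], bool],
) -> List[str]:
    """Return the asset paths that fail BIDS path validation.

    ``asset_paths`` would come from the dandi-api /assets listing;
    ``path_validator`` could be bids_validator.BIDSValidator().is_bids
    or a future schema-driven checker.
    """
    return [p for p in asset_paths if not path_validator("/" + p.lstrip("/"))]
```

e.g. `validate_bids_dandiset(paths, BIDSValidator().is_bids)` would flag every non-compliant path in one pass over the dandiset.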
note: paths of BIDS datasets can now be validated using bidsschematools, as done in dandi-cli -- no need for heavy pybids. Overall path validation: #157
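For reference, the bidsschematools route looks roughly like this; the exact keyword arguments and result keys vary between versions, so treat it as a sketch:

```python
# Sketch of schema-driven path validation with bidsschematools, roughly as
# dandi-cli uses it; the result key below is an assumption that may differ
# across bidsschematools versions.
from bidsschematools.validator import validate_bids

result = validate_bids(["/data/my-bids-dataset"])
# paths that matched no regex generated from the BIDS schema:
print(result.get("path_tracking", []))
```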