Valid/Invalid datasets #18

Open
bleonar5 opened this issue Oct 26, 2023 · 1 comment
@bleonar5 (Contributor)

It seems like we have a mixture of valid and invalid datasets included here, some intentional and some not. Seeing this, and thinking about how to build out a validator-testing process, a couple of questions come up.

Do we want all of these datasets to be valid?

There's certainly a benefit to having both valid and invalid datasets for testing, since it gives us a wide range of cases to cover. But for that process to work, we need to know which datasets we expect to be valid and which we expect to be invalid. This is also important, naturally, for community members who would want to use these datasets as exemplars worth emulating.

Should we create two subdirectories here, one for valid datasets and one for invalid?

Or should we keep track of this information in a metadata file and put all the datasets in the same bucket? Or should we include only valid datasets in the public-facing repo, and keep the invalid ones solely for testing purposes, to reduce confusion?

In the case where we learn that one of the datasets from this repo is unintentionally invalid, should we fix it?

Or should we just move it into the invalid collection? Or, lastly, should we create a new, corrected version for the valid collection and keep the original around in the invalid collection?

@mekline (Contributor) commented Oct 26, 2023

Working on updates to the readme now! For now, the top priority for this repo is easy dataset contribution by a new user. People are primarily being encouraged to attempt by-hand conversion, and then upload the result (and optionally the original). Since conversion instructions are likely inexact, mistakes should be expected, and hopefully informative.

Going forward, we'll probably want a subdirectory for datasets that are actually used for testing (either valid, or invalid in a known way). That will probably mean moving and/or correcting some of the current datasets, but let's leave the current ones at the top level so people can see what they are supposed to be doing. Feel free to create and organize that subdirectory for known/tested datasets when and how it makes sense to do so!
