Valid/Invalid datasets #18

Open
bleonar5 opened this issue Oct 26, 2023 · 1 comment
@bleonar5 (Contributor)

It seems like we have a mixture of valid and invalid datasets included here, some intentional and some not. Seeing this, and thinking about how to build out a validator-testing process, a couple of questions come up.

Do we want all of these datasets to be valid?

There's certainly a benefit to having both valid and invalid datasets for testing, since it gives us a wide range of cases to cover. But for that process to work, we need to know which datasets we expect to be valid and which we expect to be invalid. This is also important, naturally, for community members who would want to use these datasets as exemplars worth emulating.

Should we create two subdirectories here, one for valid datasets and one for invalid?

Or should we keep track of this information in a metadata file and put all the datasets in the same bucket? Or should we include only valid datasets in the public-facing repo, and keep the invalid ones solely for testing purposes, to reduce confusion?

In the case where we learn that one of the datasets from this repo is unintentionally invalid, should we fix it?

Or should we just move it into the invalid collection? Or, lastly, should we create a new, corrected version for the valid collection and keep the original around in the invalid collection?

@mekline (Contributor) commented Oct 26, 2023

Working on updates to the readme now! For now, the top priority for this repo is easy dataset contribution by a new user. People are primarily being encouraged to attempt by-hand conversion, and then upload the result (and optionally the original). Since conversion instructions are likely inexact, mistakes should be expected, and hopefully informative.

Going forward, we'll probably want a subdirectory for datasets that are actually used for testing (either valid, or invalid in a known way). That will probably mean moving and/or correcting some of the current datasets, but let's leave the current ones at the top level so people can see what they are supposed to be doing. Feel free to create and organize that subdirectory for known/tested datasets when and how it makes sense to do so!
