Discussion: how to describe distribution of a training dataset #631

Open
ljgarcia opened this issue Jan 30, 2023 · 0 comments


The ELIXIR Machine Learning Focus Group (including the task force on synthetic data) and NFDI4DataScience (and possibly the RDA FAIR4ML IG) are interested in using metadata to describe the distribution of a dataset for ML training purposes (including the DOME recommendations for Data).

During the BioHackathon, the subject was discussed for DOME and Synthetic Data. The current suggestion is to use variableMeasured in combination with PropertyValue to describe any distribution or subset of interest in the Dataset, for example attributes/features, classes (if intended for classification training), data points under each class, or the biological sex of the samples. For instance:

  • Data splits: [{unitText: "Training", referenceValue: {unitText: "Positive", value: 40000}, measurementTechnique: "Splits"}, {unitText: "Validation", referenceValue: {unitText: "Positive", value: 5000}, measurementTechnique: "Splits"}]
    • Note: the reference value refers to the classes defined (if available)
  • Data classes: [{unitText: "Positive", value: 75000, measurementTechnique: "Classes"}, {unitText: "Negative", value: 15000, measurementTechnique: "Classes"}]
    • Note: the full size/number of records would be needed to detect, e.g., overlaps between classes
  • Biological sex: {unitText: "Biological sex", propertyID: "http://purl.obolibrary.org/obo/PATO_0000047", value: "female"} or {unitText: "Biological sex", propertyID: "http://purl.obolibrary.org/obo/PATO_0000047", referenceValue: {unitText: "Female", value: 30000}}

Note: the measurementTechnique, unitText, value, and propertyID could come from a controlled vocabulary, e.g., a DefinedTerm, which is not currently supported. A discussion about extending the coverage of DefinedTerm is ongoing.
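
To make the suggestion above easier to discuss, here is a minimal sketch of how the three examples might be serialized together as JSON-LD on a single Dataset. It is only an illustration under assumptions: the dataset name is hypothetical, the helper functions `property_value` and `class_counts` are made up for this example, and typing the nested nodes as PropertyValue is my guess rather than part of the proposal. The key `referenceValue` is copied verbatim from the examples above; schema.org's PropertyValue defines a property named `valueReference`, so the exact property name may still need to be settled.

```python
import json

# Ontology term for "biological sex", taken from the examples in this post.
PATO_BIOLOGICAL_SEX = "http://purl.obolibrary.org/obo/PATO_0000047"


def property_value(unit_text, **fields):
    """Build a PropertyValue node (hypothetical helper).

    Keyword names mirror the keys used in the post (value, referenceValue,
    measurementTechnique, propertyID); fields set to None are dropped.
    """
    node = {"@type": "PropertyValue", "unitText": unit_text}
    node.update({k: v for k, v in fields.items() if v is not None})
    return node


variable_measured = [
    # Data splits: the reference value points at a class and its count.
    property_value("Training", measurementTechnique="Splits",
                   referenceValue=property_value("Positive", value=40000)),
    property_value("Validation", measurementTechnique="Splits",
                   referenceValue=property_value("Positive", value=5000)),
    # Data classes: number of records per class.
    property_value("Positive", measurementTechnique="Classes", value=75000),
    property_value("Negative", measurementTechnique="Classes", value=15000),
    # Biological sex: count of samples annotated as female.
    property_value("Biological sex", propertyID=PATO_BIOLOGICAL_SEX,
                   referenceValue=property_value("Female", value=30000)),
]

dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example training dataset",  # hypothetical name
    "variableMeasured": variable_measured,
}


def class_counts(nodes):
    """Collect per-class record counts from nodes tagged as "Classes"."""
    return {
        n["unitText"]: n["value"]
        for n in nodes
        if n.get("measurementTechnique") == "Classes"
    }


dataset_json = json.dumps(dataset, indent=2)

if __name__ == "__main__":
    print(dataset_json)
    print(class_counts(variable_measured))
```

A consumer could use something like `class_counts` to read the class distribution back out of the markup, but as noted for the data classes, the total number of records would still be needed to reason about overlaps.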

Please share your thoughts.
