Automating metadata file maintenance (templated metadata with a metadata discovery and rendering service) #22
Comments
This would be really useful for Astropy, which also has quite a few repositories, and where manual curation would be difficult. We should discuss this!
I would advocate a slightly different approach: rather than having a full template file, we could have a YAML configuration file that includes, e.g., a list of fields to automatically update, and options such as how to handle new contributors (adding missing contributors, always sorting authors in some way, etc.). There would then still be a single JSON file (codemeta.json) that authors edit, but the YAML file would allow whitelisting of certain fields that can be auto-updated.
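No syntax for this configuration file has been agreed on yet, but a sketch of what it might look like (every key name below is purely illustrative):

```yaml
# Hypothetical metadata-maintenance config; all keys are illustrative,
# not an agreed-upon schema.
auto_update:             # fields the tool may overwrite in codemeta.json
  - version
  - dateModified
  - softwareRequirements
contributors:
  add_missing: true      # append contributors discovered in the Git history
  sort: family-name      # keep the author list deterministically ordered
protected:               # fields only humans may edit
  - author
  - license
```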
I like your approach @astrofrog. The codemeta.json file in the repo would be fully rendered (updated on a per-release basis, say), while, as you say, the YAML file would control which fields can be auto-updated. I suppose some next steps are to more clearly write down some user stories, and to start defining a syntax for the configuration YAML file.
We've created a repository to start developing this idea: https://github.com/codemeta-gen/metagen

I think our first task there is to begin documenting the YAML configuration files as a means of designing the system.
👋 @jonathansick & @astrofrog - I've probably mentioned this to you in the past, but a couple of years back I started work on a RubyGem to do something similar to this: https://github.com/arfon/metamatter

I'd be happy to walk you through the library sometime if it's not obvious what it's doing.
@jonathansick The specification you proposed for generating the codemeta.json file from intrinsic metadata, and for minimizing the maintenance of the codemeta files (which is problematic even with only 10 files), is exactly what we should strive for when implementing the software citation workflow. @cboettig has developed a tool that generates codemeta.json for R packages: codemetar. @arfon that's great! It's the first time I'm seeing this tool. We should reference it in the tools page; I think it should be more visible.
@arfon @moranegg Those both look great. My application is generating metadata in the Python ecosystem that my project (@lsst) lives in. I also want a metadata generator that's extremely pluggable. LSST has a lot of unconventional software packaging and metadata sources, so being able to write plugins that extract and transform metadata is a high priority for me. R and Ruby place a higher barrier to entry for that than Python, for us at least. For sure, though, I want to study what you're doing in metamatter and codemetar. Maybe I'll take you up on that walk-through at some point :)
@jonathansick et al. Really excellent thread here raising some important issues; in particular, developing a robust maintenance strategy for the codemeta.json file. The approach we have taken with codemetar is to generate codemeta.json from the metadata an R package already declares (e.g. in its DESCRIPTION file), filling in the rest automatically.
I suppose we're basically doing the "hydrate a partially filled codemeta.json" approach, as you so succinctly put it. Ideally this would be wired into the release process, which would handle chicken-and-egg issues like updating the version in both the language metadata and codemeta.json. It is far from obvious that this is the ideal strategy, but I just thought I'd share what we've done so far.

I like your idea of allowing users to enter metadata directly as YAML instead of JSON, since it's more user-friendly and has a well-defined automated mapping into JSON. (Of course, any valid JSON is also valid YAML.) This still leaves open the issues of how to resolve conflicts between metadata fields declared here and inferred fields, and how to avoid stale metadata (particularly versions and DOIs) slipping in here. I think you're suggesting that the YAML configuration would whitelist which fields automation is allowed to overwrite, which seems like a reasonable way to resolve this.
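The conflict-resolution question raised here can be framed as a simple precedence rule: human-declared fields always win unless they are explicitly whitelisted for automation. A minimal sketch of that rule (the whitelist contents and field names are assumptions for illustration, not anything the thread has agreed on):

```python
# Sketch: merge human-declared metadata with machine-inferred metadata.
# Fields in AUTO_UPDATABLE may be overwritten by freshly inferred values
# (to avoid stale versions/dates); all other declared fields always win.
# The whitelist below is illustrative only.
AUTO_UPDATABLE = {"version", "dateModified", "softwareRequirements"}

def merge_metadata(declared: dict, inferred: dict) -> dict:
    merged = dict(inferred)          # start from everything we can infer
    for key, value in declared.items():
        if key in AUTO_UPDATABLE and key in inferred:
            continue                 # possibly stale human value: automation wins
        merged[key] = value          # otherwise the human-declared value wins
    return merged

declared = {"name": "mypkg", "version": "1.0.0", "license": "MIT"}
inferred = {"version": "1.2.3", "dateModified": "2017-10-01"}
print(merge_metadata(declared, inferred))
```

Under this rule the stale declared version ("1.0.0") is replaced by the inferred one, while the name and license, which only a human can assert, are preserved.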
I agree this is a great thread with some really good discussion of an important issue. My take is:

As I mention in https://danielskatzblog.wordpress.com/2017/09/25/software-heritage-and-repository-metadata-a-software-citation-solution/, I would like to get to a point where authors who want credit provide the needed metadata, just as they create README, CONTRIBUTING, and LICENSE files. I was thinking that they would do this in a codemeta.json file, but a YAML version also seems fine; whatever we think is easier.

One question is which fields are properties of the template (the organization) and which are properties of the repo (the software) itself. I'm not sure there will be a single answer to this that covers both a project like LSST and a small lab project. And I agree that, since almost everything there (except the authors) can be generated automatically, if the authors have provided information it should be used rather than what would be generated.

The author metadata is the only thing that I think cannot be generated automatically. The author list is not the set of GitHub contributors, and the set of GitHub contributors is not the author list, though there is likely some overlap. If the authors are not specified, I really think the best thing to do is to name the project as the author, rather than guessing inaccurately.

To give two brief examples: one case is a person who had the idea for a piece of code, got the funding to develop it, and designed it on a whiteboard, but never committed anything to the repo. This person likely should be an author. On the other side, imagine an administrator who updates the license file, and is thus a committer, without making any intellectual contribution to the software. This person likely should not be an author.
@danielskatz Great points about the author issues; I agree entirely that the GitHub commit records are not a good indicator of authorship, for exactly the reasons you outline. In the R community we are pretty used to specifying author information, even along with author roles (e.g. a "contributor" in an R package is distinct from an "author" and is omitted from the author list R generates when citing the package). Other language/distribution config files have somewhat similar support for identifying authors (e.g. Python's setup.py has author and maintainer fields).
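For reference, codemeta can already express authorship explicitly and independently of the commit log, distinguishing authors from contributors. A hand-written fragment might look like this (the people and the ORCID are made up):

```json
{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "author": [
    {
      "@type": "Person",
      "givenName": "Jane",
      "familyName": "Doe",
      "@id": "https://orcid.org/0000-0000-0000-0000"
    }
  ],
  "contributor": [
    {"@type": "Person", "givenName": "Sam", "familyName": "Smith"}
  ]
}
```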
I'm approaching this working group from a software producer's perspective. At LSST we have a few hundred repositories on GitHub (https://github.com/lsst, https://github.com/lsst-dm, https://github.com/lsst-sqre, https://github.com/lsst-sims are our major GitHub organizations), and have a large group of people contributing to these repos. As much as possible, we rely on automation to move towards a continuous delivery ideal to ensure our code releases are reliable.
The idea of putting something like a CITATION file (#2) or codemeta.json (#4) in our repositories is great, and I think we're going that route. We especially like CodeMeta / JSON-LD because it means we can add LSST-specific metadata for our own internal purposes. At the same time, deploying codemeta.json at scale across all of our repositories could cause some maintenance challenges. If LSST has 500 GitHub repositories, we'd have 500 codemeta.json files. And, like documentation, it's sometimes difficult to rely on software developers in each project to keep that metadata accurate and up-to-date. For example, every time there is a new contributor we'd need to add them to codemeta.json. We might add a new code dependency, so we'd have to ensure the dependency metadata is up to date. Or, at worst, every new commit on GitHub is in some sense a new release/version of the software for provenance purposes; it's not tractable to have a codemeta.json file committed to a repo reflect that sort of continuous versioning information.

A solution I'm interested in is combining codemeta.json metadata committed to a repository with metadata that's intrinsic to the repository itself. Things you can discover from a software repository include:

- the version (from Git tags or a setup.py file, for example; see also How to manage versioning when citing software, #16)
- dependencies (from setup.py or Node.js's package.json)

Here's a system I'm envisioning:

1. A web service renders a codemeta.json object on-demand for a Git repository at any Git ref. The web service inspects the Git repository for metadata and merges that metadata with the existing, manually maintained template metadata file.
2. At release time, the full codemeta.json is rendered and committed into the Git repository/software distribution. Potentially the master branch could even carry the codemeta.json rendered from the latest release. This metadata rendering and committing happens automatically on the continuous integration server.

In some ways, this is similar to how we're approaching software documentation. Combining code and its documentation in the same repository helps make a software product more self-contained from a developer's perspective and makes it easier to maintain versioned documentation. In the same way, codemeta.json embedded in a repository is useful for maintaining versioned metadata. But we also rely on automation in a continuous integration service to help us produce, render, and validate the documentation (for example, generating an API reference by inspecting the code base and merging API signatures with human-written documentation strings).

I'm curious whether others have thought about the maintenance of codemeta.json files at scale, and whether this approach is generally tractable. A significant challenge is that the web service needs to know how to introspect the software. At LSST we have some non-standard practices for building software, so we'd need to implement a web service that knows about the LSST build system, in addition to standard Python/PyPI packaging, for example.
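The "extremely pluggable" discovery step described in this thread could be organized as a set of small plugins, each extracting one kind of fact from a repository checkout and contributing a fragment of metadata. A rough sketch of that shape (nothing here reflects an actual metagen or LSST API; the plugins are toy examples):

```python
# Sketch of pluggable metadata discovery: each registered plugin inspects
# a repository checkout and returns a fragment of codemeta-like terms.
# The registry mechanism and plugin names are illustrative only.
from pathlib import Path

DISCOVERERS = []

def discoverer(func):
    """Register a metadata-discovery plugin."""
    DISCOVERERS.append(func)
    return func

@discoverer
def version_from_file(repo: Path) -> dict:
    # A real plugin might shell out to `git describe --tags` or parse
    # setup.py; here we just read a VERSION file if one exists.
    version_file = repo / "VERSION"
    if version_file.exists():
        return {"version": version_file.read_text().strip()}
    return {}

@discoverer
def name_from_dir(repo: Path) -> dict:
    # Fallback: use the checkout directory name as the package name.
    return {"name": repo.name}

def discover(repo: Path) -> dict:
    """Run every plugin and merge the fragments (later plugins win)."""
    metadata = {}
    for plugin in DISCOVERERS:
        metadata.update(plugin(repo))
    return metadata
```

Site-specific knowledge (such as an LSST build-system reader) would then be just one more registered plugin rather than a change to the core service.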
A spin-off of this approach is a "linting" service that runs in continuous integration and identifies when metadata in codemeta.json is out of date. In this case, a developer would still maintain codemeta.json manually, but would be forced to resolve metadata discrepancies before merging a PR.