Ingest - collect information from (pre)processing steps #1198

Open
fvankrieken opened this issue Oct 16, 2024 · 0 comments
fvankrieken commented Oct 16, 2024

In ingest, various preprocessing steps are defined and run as part of dataset ingestion. The inputs to these steps are defined, but it would also be useful to log their effects in the Config object that gets stored with every processed and archived dataset. A few use cases:

  • filter_rows: number of rows filtered out (or both number of rows before filter and number of rows after)
  • append_prev: the version that was actually appended to

Currently, each ingest processor takes a dataframe and some number of kwargs and returns a dataframe. There are a few ways this could be handled. One is a general approach that records some statistics before and after each processing step. But it seems like the logging needs to be specific to each step: something like append_prev really should log the version appended to, and there's no generalized way to capture that info outside of the step itself.
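
One possible shape for the step-specific option, as a rough sketch rather than a proposal for the actual code: each processor returns its output together with a small summary dict, and the runner collects those summaries so they could be stored on the Config. Everything here is hypothetical, including `StepResult`, `run_steps`, and the function signatures. Plain lists of dicts stand in for dataframes so the sketch runs without pandas.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Stand-in for a dataframe (assumption: the real processors use pandas).
Rows = list[dict[str, Any]]


@dataclass
class StepResult:
    """Hypothetical return type: processed data plus step-specific details."""
    data: Rows
    summary: dict[str, Any] = field(default_factory=dict)


def filter_rows(data: Rows, column: str, value: Any) -> StepResult:
    """Keep rows where `column` equals `value`; report row counts."""
    kept = [row for row in data if row[column] == value]
    return StepResult(
        kept,
        {
            "rows_before": len(data),
            "rows_after": len(kept),
            "rows_removed": len(data) - len(kept),
        },
    )


def append_prev(
    data: Rows, previous_versions: dict[str, Rows], version: str = "latest"
) -> StepResult:
    """Append `data` to a previous version; report which version was used."""
    # Assumption: version labels sort lexicographically, so max() is the latest.
    resolved = max(previous_versions) if version == "latest" else version
    return StepResult(
        previous_versions[resolved] + data, {"version_appended_to": resolved}
    )


def run_steps(
    data: Rows, steps: list[tuple[Callable[..., StepResult], dict[str, Any]]]
) -> tuple[Rows, list[dict[str, Any]]]:
    """Run each step in order and collect one summary entry per step."""
    log = []
    for func, kwargs in steps:
        result = func(data, **kwargs)
        log.append({"name": func.__name__, "summary": result.summary})
        data = result.data
    return data, log


if __name__ == "__main__":
    rows = [{"boro": "BK", "n": 1}, {"boro": "MN", "n": 2}, {"boro": "BK", "n": 3}]
    archive = {"2024-01": [{"boro": "BK", "n": 0}], "2024-06": [{"boro": "BK", "n": 9}]}
    final, log = run_steps(
        rows,
        [
            (filter_rows, {"column": "boro", "value": "BK"}),
            (append_prev, {"previous_versions": archive}),
        ],
    )
    print(log)
```

The tradeoff is that every processor's return type changes. A generic before/after wrapper would leave the signatures alone, but it could only record things like row counts, not which version append_prev resolved to.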
