
Automated QA checks #236

Open

stucka opened this issue Jan 12, 2024 · 5 comments

Comments

@stucka
Contributor

stucka commented Jan 12, 2024

@Kirkman realized scrapers can begin failing and produce empty CSVs, but there's no process in place to flag those failures. See biglocalnews/warn-scraper#598

As I understand it, warn-transformer's consolidate step pulls everything together from historical data and new scrapes and then eliminates duplicates. That's the only point where we have all the new scrapes in one place, so we can check whether new scrape files are empty or have fewer entries than the historical data that's available.

It might be easy enough to build in an error/alert here that doesn't stop the rest of the transform from working but does send a message through the internal BLN ETL alerts channel -- likely by mimicking what's in the GitHub workflow, though I think that requires a logger.error or something similar to actually get triggered.

There may also be legitimate cases in which counts should shrink -- e.g., a state decides a notice in the system isn't actually a WARN notice but a non-WARN layoff. I think we saw that in Maine early on, where a WARN notice disappeared. Once recorded by warn-transformer, those records aren't coming back, so the count of missing entries will grow.

There will also be cases where a state takes down a previous year's data, and the scraper will have less to work with.

So ... where to go? CSVs with only a header row are never going to be correct. That's a bare minimum for flagging, but building only against that case might make it harder to implement more in-depth QA work. A rough sketch of that bare-minimum flag follows.
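Something along these lines could cover the header-only case -- a minimal sketch, not actual warn-transformer code; the directory path and the logging hook are placeholders for wherever the raw scrape output and alerting actually live:

```python
# Minimal sketch, not the actual warn-transformer code: flag any scraped
# CSV that has a header row but no data rows. The raw_dir argument is a
# placeholder for wherever the raw scrape output lands.
import csv
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def flag_empty_csvs(raw_dir: Path) -> list[Path]:
    """Return the CSVs in raw_dir that contain no data rows."""
    empties = []
    for csv_path in sorted(raw_dir.glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as infile:
            rows = list(csv.reader(infile))
        if len(rows) <= 1:  # empty file, or header row only
            logger.warning("Scrape produced no data rows: %s", csv_path)
            empties.append(csv_path)
    return empties
```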

@stucka
Contributor Author

stucka commented Jan 16, 2024

I think this can be easily detected in consolidate. Will logger.warning make it into GitHub Actions logs? Should it be integrated with the existing alerts workflow?

source_list = module.Transformer(input_dir).transform()
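Roughly the shape I have in mind, as a sketch only -- this assumes transform() hands back a list of rows for one source, which may not be how consolidate actually sees the data:

```python
# Sketch only: wrap the line above with a row-count check so a zero-row
# transform at least shows up as a warning in the Actions log.
source_list = module.Transformer(input_dir).transform()
if not source_list:
    logger.warning(
        "Transformer for %s returned zero rows; the scrape may have failed",
        getattr(module, "__name__", module),
    )
```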

stucka added a commit to stucka/warn-transformer that referenced this issue Jan 16, 2024
@stucka
Contributor Author

stucka commented Jan 16, 2024

It ... cannot be easily detected in consolidate. Logging has been improved, though.

@chriszs
Contributor

chriszs commented Mar 17, 2024

So one approach would be to throw an error, failing the transformation for that state and creating a failed status that could be reported. Granted, that might obscure an otherwise successful run.

Okay, so how to do reporting on data quality without halting or logging ineffectually? I'm looking a little bit at Great Expectations, which seems to have gotten very enterprise-y and been rebranded "GX OSS" to clear the way for a parallel SaaS business, but which might still be the right general tool for this sort of thing.

The Great Expectations way of doing this seems way more complicated than the nice one-line if-statement test you've got there, but it does seem to have ways of building data docs with validation results and configurable alerting. I wonder what other things it could be used to test for.

As you say, testing for this one case may be a different thing than the general case of QA checks.

@chriszs
Contributor

chriszs commented Mar 17, 2024

Did some exploration using Great Expectations in #252, creating a check that looks at each raw file and verifies the row count is three or greater.

After this, I tend to agree with the thrust of the Reddit post I found headlined "Great Expectations is annoyingly cumbersome" (the Dickens novel doesn't appear to be well-loved either). Ah well, I had such high hopes, but maybe SaaS ruins everything. On the other hand, maybe it's not so bad once you learn the concepts and get past the initial setup.

This doesn't do exactly what the current if check does, because it looks at each raw data file before transformation, though we could also look at the data after consolidation and build some sort of list of sources we expect to see there.
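For reference, the check is roughly along these lines -- shown here against Great Expectations' older from_pandas interface, which may not match the exact API used in #252 or in newer GX releases, and with a placeholder path for the raw files:

```python
# Illustrative only: per-file row-count expectation using the legacy
# ge.from_pandas interface. Newer GX versions route this through a
# data context instead, and the glob path here is a placeholder.
import glob

import great_expectations as ge
import pandas as pd

for path in sorted(glob.glob("exports/*.csv")):
    frame = ge.from_pandas(pd.read_csv(path))
    result = frame.expect_table_row_count_to_be_between(min_value=3)
    if not result.success:
        print(f"Row-count check failed for {path}")
```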

@chriszs
Contributor

chriszs commented Mar 18, 2024

One of the things this might help address: the case where a state's data or a scraper's output loses quality over time, or runs into issues on as-yet-unseen documents. It'd be nice to have some row- and/or column-level expectations set up that could flag that.
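Sketching what that might look like with the same legacy interface -- the column names here ("employer", "jobs") and the file path are stand-ins, not the real consolidated schema:

```python
# Hypothetical column-level expectations on the consolidated output.
# Column names and the file path are stand-ins for illustration.
import great_expectations as ge
import pandas as pd

frame = ge.from_pandas(pd.read_csv("processed/consolidated.csv"))
checks = [
    frame.expect_column_values_to_not_be_null("employer"),
    frame.expect_column_values_to_be_between("jobs", min_value=0),
]
failures = [check for check in checks if not check.success]
```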
