Automated QA checks #236
I think this can be easily detected in consolidate. Will logger.warning make it into GitHub Actions logs? Should it be integrated with the existing alerts workflow?
It ... cannot be easily detected in consolidate. Logging has been improved, though.
So one approach would be to throw an error, failing the transformation for that state and creating a failed status that could be reported. Granted, that might obscure an otherwise successful run. Okay, so how to do reporting on data quality without halting or logging ineffectually? I'm looking a little bit at Great Expectations, which seems to have gotten very enterprise-y and been rebranded "GX OSS" to clear the way for a parallel SaaS business, but which might still be the right general tool for this sort of thing. The Great Expectations way of doing this seems far more complicated than the nice one-line if-statement test you've got there, but it does seem to have ways of building data docs with validation results and configurable alerting. I wonder what other things it could be used to test for. As you say, testing for this one case may be a different thing than the general case of QA checks.
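The "failed status that could be reported" idea could be sketched roughly like this. This is a hypothetical illustration, not warn-transformer's actual API: the `transforms` mapping and the status strings are invented names, and real per-state transforms would take arguments.

```python
import logging

logger = logging.getLogger(__name__)


def run_all_transforms(transforms):
    """Run each state's transform, recording failures instead of halting.

    `transforms` maps a state code to a zero-argument callable -- both are
    hypothetical stand-ins for whatever warn-transformer actually uses.
    Returns a per-state status dict that could feed a report or alert.
    """
    statuses = {}
    for state, transform in transforms.items():
        try:
            transform()
            statuses[state] = "ok"
        except Exception as err:
            # Record the failure and keep going, so one bad state
            # doesn't obscure an otherwise successful run.
            logger.error("Transform failed for %s: %s", state, err)
            statuses[state] = f"failed: {err}"
    return statuses
```

The trade-off mentioned above goes away here: the run completes, but the failed states are still visible in the returned statuses and the logs.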
Did some exploration using Great Expectations in #252, creating a check that looks at each raw file and verifies the row count is three or greater. After this, I tend to agree with the thrust of the Reddit post I found headlined "Great Expectations is annoyingly cumbersome" (the Dickens novel doesn't appear to be well-loved either). Ah well, I had such high hopes, but maybe SaaS ruins everything. On the other hand, maybe it's not so bad once you learn the concepts and get past the initial setup. This doesn't do exactly what the current if check does, because it looks at each raw data file before transformation, though we could also look at the data after consolidation and build some sort of list of sources we expect to see in there.
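For comparison, the same "three or more rows per raw file" check can be done without Great Expectations in a few lines of stdlib Python. This is just a sketch, assuming the raw files are plain CSVs with a single header row; the directory layout and function name are invented:

```python
import csv
from pathlib import Path


def find_thin_files(raw_dir, min_rows=3):
    """Return (filename, data_row_count) pairs for CSVs in raw_dir
    that have fewer than min_rows data rows, excluding the header."""
    thin = []
    for path in sorted(Path(raw_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as f:
            # Count every row, then subtract the header (floor at zero
            # so a completely empty file counts as zero data rows).
            n_data_rows = max(sum(1 for _ in csv.reader(f)) - 1, 0)
        if n_data_rows < min_rows:
            thin.append((path.name, n_data_rows))
    return thin
```

What GX adds over this is the surrounding machinery: persisted validation results, data docs, and alert hooks, which is exactly the cumbersome part.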
One of the things this might help address: the case where a state's data or a scraper's output loses quality over time, or runs into issues on as-yet-unseen documents. It'd be nice to have some row- and/or column-level expectations set up that could flag that.
@Kirkman realized scrapers can begin failing and produce empty CSVs, but there's no process in place to flag those failures. See biglocalnews/warn-scraper#598
As I understand it, warn-transformer consolidate is going to pull everything together from historical data and new scrapes and then eliminate duplicates. This would be the only time in which we've got all the new scrapes, and we can see whether new scrape files are empty, or have fewer entries than the historical data that's available.
It might be easy enough to build in an error/alert here that doesn't stop the rest of the transform from working, but does send a message through the internal BLN ETL alerts channel -- likely by mimicking what's in the GitHub workflow, except I think that requires a logger.error or something similar to actually get triggered.
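A non-fatal version of that check might look like the following. This is a sketch under the assumption that the alert workflow keys off ERROR-level log lines; the function name, logger name, and status strings are hypothetical:

```python
import logging

logger = logging.getLogger("warn_transformer.qa")


def check_scrape_counts(state, new_rows, historical_rows):
    """Flag suspicious scrape counts without stopping the transform.

    Logs at ERROR for an empty scrape (so a hypothetical log-scanning
    alert workflow can pick it up) and WARNING for a shrinking count,
    which may be legitimate. Never raises, so the run keeps going.
    """
    if new_rows == 0:
        logger.error("QA: %s scrape is empty (0 data rows)", state)
        return "empty"
    if new_rows < historical_rows:
        logger.warning(
            "QA: %s scrape has %d rows, below the %d historical rows",
            state, new_rows, historical_rows,
        )
        return "shrunk"
    return "ok"
```

Using WARNING rather than ERROR for the shrinking case reflects the caveat below: counts can legitimately go down, so that condition probably shouldn't page anyone on its own.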
It's also possible there may be some cases in which counts should be reduced -- e.g., a state decides a notice in the system isn't actually a WARN notice but is a non-WARN layoff. I think we saw that in Maine early on, where a WARN notice disappeared. Once recorded by warn-transformer, those aren't coming back, so the count of the missing will grow.
There will also be cases where a state takes down a previous year's data, and the scraper will have less to work with.
So ... where to go? CSVs with only a header row are not ever going to be correct. That's a bare minimum for flagging, but building against that might make it harder to implement more in-depth QA work.
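That bare-minimum flag is cheap to implement on its own, independent of any heavier QA framework. A minimal sketch, assuming raw files are CSVs with at most one header row; the function name is illustrative:

```python
import csv


def is_header_only(path):
    """True if the CSV at `path` contains at most a header row --
    i.e., no data rows, which is never a correct scrape result."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header, if present
        return next(reader, None) is None
```

Because it only reads the first two rows, this stays fast even on large files, and it could run as a standalone gate without constraining whatever fuller QA system comes later.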