
Automated QA checks #236

Open

stucka opened this issue Jan 12, 2024 · 5 comments

Comments

@stucka
Contributor

stucka commented Jan 12, 2024

@Kirkman realized scrapers can begin failing and produce empty CSVs, but there's no process in place to flag those failures. See biglocalnews/warn-scraper#598

As I understand it, warn-transformer's consolidate step pulls everything together from historical data and new scrapes and then eliminates duplicates. That's the only point where we have all the new scrapes in one place, so we can check whether new scrape files are empty or have fewer entries than the historical data that's available.

It might be easy enough to build in an error/alert here that doesn't stop the rest of the transform from working but does send a message through the internal BLN ETL alerts channel -- likely by mimicking what's in the GitHub workflow, though I think that requires a logger.error or something similar to actually get triggered.

There may also be legitimate cases in which counts should shrink -- e.g., a state decides a notice in the system isn't actually a WARN notice but a non-WARN layoff. I think we saw that in Maine early on, where a WARN notice disappeared. Once recorded by warn-transformer, those records aren't coming back, so the count of missing entries will grow.

There will also be cases where a state takes down a previous year's data, and the scraper will have less to work with.

So ... where to go? CSVs with only a header row are never going to be correct. That's a bare minimum for flagging, but building only against that case might make it harder to implement more in-depth QA work. A rough sketch of that bare-minimum flag follows.
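Something along these lines could cover the header-only case -- a minimal sketch, not actual warn-transformer code; the directory path and the logging hook are placeholders for wherever the raw scrape output and alerting actually live:

```python
# Minimal sketch, not the actual warn-transformer code: flag any scraped
# CSV that has a header row but no data rows. The raw_dir argument is a
# placeholder for wherever the raw scrape output lands.
import csv
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def flag_empty_csvs(raw_dir: Path) -> list[Path]:
    """Return the CSVs in raw_dir that contain no data rows."""
    empties = []
    for csv_path in sorted(raw_dir.glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as infile:
            rows = list(csv.reader(infile))
        if len(rows) <= 1:  # empty file, or header row only
            logger.warning("Scrape produced no data rows: %s", csv_path)
            empties.append(csv_path)
    return empties
```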

@stucka
Contributor Author

stucka commented Jan 16, 2024

I think this can be easily detected in consolidate. Will logger.warning make it into GitHub Actions logs? Should it be integrated with the existing alerts workflow?

source_list = module.Transformer(input_dir).transform()
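Roughly the shape I have in mind, as a sketch only -- this assumes transform() hands back a list of rows for one source, which may not be how consolidate actually sees the data:

```python
# Sketch only: wrap the line above with a row-count check so a zero-row
# transform at least shows up as a warning in the Actions log.
source_list = module.Transformer(input_dir).transform()
if not source_list:
    logger.warning(
        "Transformer for %s returned zero rows; the scrape may have failed",
        getattr(module, "__name__", module),
    )
```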

stucka added a commit to stucka/warn-transformer that referenced this issue Jan 16, 2024
@stucka
Contributor Author

stucka commented Jan 16, 2024

It ... cannot be easily detected in consolidate. Logging has been improved, though.

@chriszs
Contributor

chriszs commented Mar 17, 2024

So one approach would be to throw an error, failing the transformation for that state and creating a failed status that could be reported. Granted, that might obscure an otherwise successful run.

Okay, so how to do reporting on data quality without halting or logging ineffectually? I'm looking a little bit at Great Expectations, which seems to have gotten very enterprise-y and been rebranded "GX OSS" to clear the way for a parallel SaaS business, but which might still be the right general tool for this sort of thing.

The Great Expectations way of doing this seems way more complicated than the nice one-line if-statement test you've got there, but it does seem to have ways of building data docs with validation results and configurable alerting. I wonder what other things it could be used to test for.

As you say, testing for this one case may be a different thing than the general case of QA checks.

@chriszs
Contributor

chriszs commented Mar 17, 2024

Did some exploration using Great Expectations in #252, creating a check that looks at each raw file and verifies the row count is three or greater.

After this, I tend to agree with the thrust of the Reddit post I found headlined "Great Expectations is annoyingly cumbersome" (the Dickens novel doesn't appear to be well-loved either). Ah well, I had such high hopes, but maybe SaaS ruins everything. On the other hand, maybe it's not so bad once you learn the concepts and get past the initial setup.

This doesn't do exactly what the current if check does, because it looks at each raw data file before transformation, though we could also look at the data after consolidation and build some sort of list of sources we expect to see there.
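For reference, the check is roughly along these lines -- shown here against Great Expectations' older from_pandas interface, which may not match the exact API used in #252 or in newer GX releases, and with a placeholder path for the raw files:

```python
# Illustrative only: per-file row-count expectation using the legacy
# ge.from_pandas interface. Newer GX versions route this through a
# data context instead, and the glob path here is a placeholder.
import glob

import great_expectations as ge
import pandas as pd

for path in sorted(glob.glob("exports/*.csv")):
    frame = ge.from_pandas(pd.read_csv(path))
    result = frame.expect_table_row_count_to_be_between(min_value=3)
    if not result.success:
        print(f"Row-count check failed for {path}")
```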

@chriszs
Contributor

chriszs commented Mar 18, 2024

One of the things this might help address: the case where a state's data or a scraper's output loses quality over time, or runs into issues on as-yet-unseen documents. It'd be nice to have some row- and/or column-level expectations set up that could flag that.
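Sketching what that might look like with the same legacy interface -- the column names here ("employer", "jobs") and the file path are stand-ins, not the real consolidated schema:

```python
# Hypothetical column-level expectations on the consolidated output.
# Column names and the file path are stand-ins for illustration.
import great_expectations as ge
import pandas as pd

frame = ge.from_pandas(pd.read_csv("processed/consolidated.csv"))
checks = [
    frame.expect_column_values_to_not_be_null("employer"),
    frame.expect_column_values_to_be_between("jobs", min_value=0),
]
failures = [check for check in checks if not check.success]
```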
