How to deal with multiple valid test results #149
Comments
Hi @TimoRoth

The way I use pytest-mpl, and the way I have seen other projects use baseline images, is to work hard to maintain a set of baseline images per constrained environment (versions of mpl etc.) and then dynamically select the path to the baseline images based on the versions installed at test time.

See this code in sunpy which selects the hash library: https://github.com/sunpy/sunpy/blob/main/sunpy/tests/helpers.py#L72, which is then used in a wrapper decorator: https://github.com/sunpy/sunpy/blob/main/sunpy/tests/helpers.py#L103-L105. Or see Astropy, which uses baseline images rather than a hash library, where the path to the baseline images is similarly generated dynamically based on the matplotlib version: https://github.com/astropy/astropy/blob/main/astropy/tests/image_tests.py

I have thought about ways we could make these kinds of workflows easier for users, but making them general always feels like you need a way to specify the baseline images or hash library for an arbitrarily complex set of dependencies (sometimes C libs like freetype). If you have any suggestions based on how your fork works then I would love to hear them.
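For readers unfamiliar with that pattern, here is a minimal sketch of the idea; the file naming and helper names are made up for illustration, this is not sunpy's actual helper code:

```python
# Illustrative sketch of version-dependent selection (file names and helper
# names are hypothetical; this is not sunpy's actual helper code).
from pathlib import Path

import matplotlib
import pytest


def _hash_library_path():
    # e.g. matplotlib 3.5.2 -> "figure_hashes_mpl_35.json" (naming is hypothetical)
    major, minor = matplotlib.__version__.split(".")[:2]
    name = f"figure_hashes_mpl_{major}{minor}.json"
    return Path(__file__).parent / "figure_hashes" / name


def figure_test(func):
    # Wrap a test so pytest-mpl compares it against the version-specific hash library.
    return pytest.mark.mpl_image_compare(hash_library=_hash_library_path())(func)
```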
The basic idea of the way it's implemented in our fork is that, instead of a single baseline image, we point at a baseline directory, and as long as one image in there matches within the desired tolerance, the test passes. You can see one of those baseline image collections here: https://github.com/OGGM/oggm-sample-data/tree/master/baseline_images/freetype_28/test_raster

We used to use the same method you described (hence the freetype version still being part of the path), but there are so many moving parts that can affect the output (matplotlib version, version of libfreetype used by matplotlib, proj version, gdal version, and probably more dependencies that can cause slight variations in results) that it became absolutely unmaintainable. We'd have to maintain several hundred sets of different baseline images, and would probably still be missing some.
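A rough sketch of what such a directory-based check could look like, using matplotlib's `compare_images` helper and a hypothetical `matches_any_baseline` function (this is not the fork's actual implementation):

```python
# Rough sketch of the "any baseline in a directory may match" idea
# (not the fork's actual code). matplotlib's compare_images returns None when
# the two images agree within the given RMS tolerance, otherwise a message.
from pathlib import Path

from matplotlib.testing.compare import compare_images


def matches_any_baseline(result_image, baseline_dir, tol=20):
    failures = []
    for baseline in sorted(Path(baseline_dir).glob("*.png")):
        msg = compare_images(str(baseline), str(result_image), tol=tol)
        if msg is None:
            return True  # at least one accepted baseline is close enough
        failures.append(msg)
    raise AssertionError("No baseline matched:\n" + "\n".join(failures))
```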
I can see how that approach would be useful, especially when you get to a very large dependency tree. I would be happy to accept a contribution with this feature; my only reservation is further complicating an already exceptionally flexible API, see the beginnings of a discussion about that in #154.
I'll need to rebase, or even rewrite, the patch. The introduction of the summary mechanism has complicated it quite a bit.
I'm curious to know how much #150 might alleviate this issue. If it gets banked, then it certainly opens the door to hashing being a better alternative to maintaining multiple baseline image sets.

How might that work? Consider only having one baseline image per test, generated by a single "golden" environment of your choice. Extend the hash library such that it supports a hash per named environment, and when running in a given environment the result is compared against that environment's hash. This approach certainly reduces the bloat and problem of managing multiple baseline image collections per environment.

With respect to the baseline images, for a failed test the image difference contains the difference that tripped the failure, but for "non-golden" environments it also includes any acceptable difference (that you originally signed off on when creating the hash) against the "golden" environment image. This might be enough for you to ascertain the source of the image change and whether that is acceptable or not given the failure.

Just a thought... dunno if it has legs or not 🤔
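As a very rough illustration of the per-environment hash idea, the library could look something like this; the structure and key names are hypothetical, not an existing pytest-mpl format:

```python
# Hypothetical hash library layout for the "golden environment" idea sketched
# above; the structure and key names are illustrative only.
hash_library = {
    "test_raster": {
        "golden": "d1f7...",            # hash produced by the "golden" environment
        "mpl35_freetype28": "d1e7...",  # accepted hashes recorded for other environments
        "mpl36_freetype210": "c1f7...",
    },
}


def expected_hash(test_name, environment):
    entry = hash_library[test_name]
    # Fall back to the golden hash when no hash was recorded for this environment.
    return entry.get(environment, entry["golden"])
```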
The issue I see with that is still that it's A LOT of permutations. For the vast majority of cases, the change isn't meaningful enough to cause a difference that violates the tolerance. Maintaining such a massive set of baselines would also be quite bothersome.
@TimoRoth Understood. Is it possible for you to point me to a repo containing your baseline images? I'd love to take a peek at them, if you don't mind? No worries if that's not possible 👍
Already linked to them above, in my first reply: https://github.com/OGGM/oggm-sample-data/tree/master/baseline_images/freetype_28/test_raster That's probably the clearest demonstration of the random differences that can occur. The other tests show similar differences. Some also have numerical differences, though those are negligible.
@TimoRoth I did some quick analysis on your 5 test images using 256-bit perceptual image hashes. If you take the hash of each image and compare it pairwise against the others, the hamming distances between images 1, 2, 3 and 4 all come out small.
This is telling me that images 1, 2, 3 and 4 are sufficiently similar images 👍 However, with a hamming distance of 22, image 5 is quite different. On inspection this is indeed true, as the plot is larger overall, and the scale on the colorbar is different. Does this help confirm how perceptual image hashing might help? |
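For anyone wanting to reproduce this kind of check, something like the following works with the `imagehash` package (assumed here purely for illustration; pytest-mpl may do its hashing differently):

```python
# One way to compute a 256-bit perceptual hash distance with the imagehash
# package (an assumption for illustration, not necessarily pytest-mpl's approach).
import imagehash
from PIL import Image


def hamming_distance(path_a, path_b):
    # hash_size=16 yields a 16x16 = 256-bit perceptual hash.
    hash_a = imagehash.phash(Image.open(path_a), hash_size=16)
    hash_b = imagehash.phash(Image.open(path_b), hash_size=16)
    return hash_a - hash_b  # number of differing bits

# A distance of a few bits suggests "the same" plot; a distance around 22 bits,
# as with image 5 here, flags a genuinely different figure.
```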
It's primarily telling me that we should investigate using (perceptual?) hashes instead of a growing library of baseline images :D
Okay, how about this as a compromise, given what I suggested in the above comment and what you'd like as part of your workflow (which I'm assuming is a pretty common pain point)...

Entirely drop the environment name concept. It's not necessary. Also, instead of a single hash per test, the hash library maps each test to a list of accepted image hashes.

There are two important further enhancements necessary, and then I think we might have struck gold.

Firstly, when testing a result image hash for a test against the entries in the list, because perceptual image hashes are comparable (resulting in a hamming bit distance), we compare the result hash against the "most similar" accepted image hash. If the hamming distance is ≤ the hamming bit tolerance, then "bingo", the test passes; otherwise it fails. What's great about this approach is that we're minimising the hash list, but also testing a result automatically against the most appropriate and likely accepted "similar" baseline hash/image. Also, it avoids the need to support a matrix of hamming tolerances per test/environment: we simply support the global hamming tolerance or a customised per-test tolerance via the marker.

Secondly, for the baseline images, rather than have a flat directory of ".png" files, we use the name of the test to create a directory associated with the test, and within that directory we have "<hash-value>.png" files, one for each actual image associated with that hash. Having separate test sub-directories within the baseline directory avoids any hash value conflict issues, plus it makes the management of which images belong to which test intuitive.

So... given all of the above, I genuinely think that this might be what we need here. It will certainly work for your test-raster example above, where the one image that's noticeably different from the rest would simply get its own accepted hash. Also, the other big win here is that image comparison summaries with result/diff/baseline images still work.

If there's enough agreement, and I've not missed anything obvious, then I'd be happy to implement this in a feature branch if/once #150 lands... And if you'd be willing to collaborate and test it @TimoRoth with your workflow (and anyone else that cares), and give feedback, then we can make an informed decision on whether this is the way forward.

Thoughts?
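As a concrete, non-authoritative sketch of the matching rule described in this proposal (function and parameter names are made up for illustration):

```python
# Non-authoritative sketch of the proposed matching rule; the function and
# parameter names are hypothetical.
import imagehash


def closest_match(result_hash_hex, accepted_hash_hexes, hamming_tolerance=10):
    """Compare a result hash against the list of accepted hashes for one test."""
    result = imagehash.hex_to_hash(result_hash_hex)
    distances = [
        (result - imagehash.hex_to_hash(accepted), accepted)
        for accepted in accepted_hash_hexes
    ]
    best_distance, best_hash = min(distances)
    # Pass when the closest accepted hash is within the hamming bit tolerance.
    # best_hash also identifies which baseline image
    # (<baseline_dir>/<test_name>/<best_hash>.png) to show in the summary.
    return best_distance <= hamming_tolerance, best_hash
```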
That sounds much more elaborate than anything I had thought of so far, but it'll definitely solve the issue in an elegant way that's hopefully useful for a lot more users.
Depending on the version of matplotlib, the way it was installed, the way other dependencies were installed and a bunch of other random factors, the results of our image comparison tests can vary slightly.
Enough to fall way outside the tolerance window, but they are still perfectly valid results.
So we have a set of multiple valid baseline images for most of our tests, but don't really have a good way to actually test if any of them matches.
Right now, we are using our own fork of pytest-mpl, which does support testing against multiple baselines: https://github.com/OGGM/pytest-mpl
But maintaining that fork has been a consistent pain, and it's getting harder as upstream pytest-mpl evolves.
This can't be an issue only we have. What is the proper way to deal with it?