Make more tests robust and independent #877

glatterf42 · 2024-09-12T10:15:00Z

Closes #797.

This is a follow-up PR to #784 with the aim of bringing even more test stability into our CI suite.

For the moment (at least while it's free), this PR experiments with codecov's new flaky-test-detection feature.

How to review

Read the diff and note that the CI checks all pass.
Read the PR comments and ensure the reasoning makes sense.

PR checklist

Continuous integration checks all ✅
Update tests; coverage checks both ✅
~~[ ] Add, expand, or update documentation.~~ Just CI.
Update release notes.

codecov · 2024-09-12T10:20:13Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.5%. Comparing base (ef66e37) to head (f70e76a).
Report is 20 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##            main    #877     +/-   ##
=======================================
- Coverage   95.6%   95.5%   -0.1%     
=======================================
  Files         46      46             
  Lines       4344    4340      -4     
=======================================
- Hits        4154    4148      -6     
- Misses       190     192      +2

Files with missing lines	Coverage Δ
message_ix/cli.py	`94.2% <100.0%> (+1.5%)`	⬆️
message_ix/tests/conftest.py	`100.0% <ø> (ø)`
message_ix/tests/test_cli.py	`90.6% <100.0%> (-9.4%)`	⬇️
message_ix/tests/test_core.py	`100.0% <100.0%> (ø)`
...age_ix/tests/test_feature_bound_activity_shares.py	`100.0% <100.0%> (ø)`
message_ix/tests/test_feature_storage.py	`100.0% <100.0%> (ø)`
message_ix/tests/test_macro.py	`100.0% <100.0%> (ø)`
message_ix/tests/test_report.py	`100.0% <100.0%> (ø)`
message_ix/tests/test_util.py	`100.0% <100.0%> (ø)`
message_ix/tests/tools/test_add_year.py	`95.6% <100.0%> (ø)`
... and 1 more

glatterf42 · 2024-09-13T13:32:37Z

Most items from #797 seem addressed now. Mainly, I'm not yet sure why the tests in test_feature_bound_activity_shares are flaky. The only thing I can think of is that despite our previous efforts (identifying scenario names via request.node.name), some scenarios could still get mixed up so that bounds would be applied sort of randomly instead of as intended. So for now, I have doubled down on this approach with the help of pytest-xdist's recommended way of uniquely identifying each test run.
For now, I've only used it in test_feature_bound_activity_share, but if it works (and we don't intend to drop pytest-xdist), we could use this feature elsewhere, too, to replace request.node.name.
However, in general, I thought request.node.name was uniquely identifying scenarios already, so I'm not quite convinced this new approach works any better.

glatterf42 · 2024-09-17T08:07:13Z

I have to admit: while I did make changes to test_feature_bound_activity_shares, I was not sure they would address the flakiness seen in test_add_share_mode_lo or test_commodity_share_lo. However, I have re-run all tests five times now and have not observed a single flaky test, so maybe one of the changes I made did resolve that flakiness.

For future reference: I have tried replicating the numbers reported in #797 (comment): called make_dantzig locally, defined calc_share as it is in the test file and I find:

>>> 1.05 * calc_share(scen1)
0.56875
>>> calc_share(scen1)
0.5416666666666666

This seems odd since 0.56875 is not one of the reported values when flakiness occurs, seemingly suggesting either that the numbers are reported from a different scenario altogether or that the scenario in question had also been changed by another test (in the same module; since all of them use the module-scoped fixture test_mp). For example, maybe the test_add_share_mode_up frequently added a 0.95 multiplier to the scenario, resulting in the test_add_share_mode_lo scenario finding the "normal" 0.5416... value (due to its reverse-factor of 1 / 0.95 = 1.05).
In either of those cases, the changes I made (ensuring each clone gets its own unique scenario name) could well be the changes needed to eliminate the flakiness.

glatterf42 · 2024-09-23T10:22:22Z

Today, I'm seeing these flaky errors for the first time:

 __________________________________ test_dl[] ___________________________________
[gw1] linux -- Python 3.9.20 /opt/hostedtoolcache/Python/3.9.20/x64/bin/python

message_ix_cli = <bound method message_ix_cli.<locals>.Runner.invoke of <message_ix.tests.conftest.message_ix_cli.<locals>.Runner object at 0x7fc27797a2b0>>
opts = ''
tmp_path = PosixPath('/tmp/pytest-of-runner/pytest-0/popen-gw1/test_dl__0')

    @pytest.mark.parametrize(
        "opts",
        [
            "",
            "--branch=main",
            "--tag=v1.2.0",
            # Nonexistent tag
            pytest.param("--tag=v999", marks=pytest.mark.xfail(raises=AssertionError)),
        ],
    )
    def test_dl(message_ix_cli, opts, tmp_path):
        r = message_ix_cli("dl", opts, str(tmp_path))
    
>       assert r.exit_code == 0, (r.exception, r.output)
E       AssertionError: (URLError(TimeoutError(110, 'Connection timed out')), '')
E       assert 1 == 0
E        +  where 1 = <Result URLError(TimeoutError(110, 'Connection timed out'))>.exit_code

message_ix/tests/test_cli.py:66: AssertionError
____________________________ test_dl[--tag=v1.2.0] _____________________________
[gw0] linux -- Python 3.9.20 /opt/hostedtoolcache/Python/3.9.20/x64/bin/python

message_ix_cli = <bound method message_ix_cli.<locals>.Runner.invoke of <message_ix.tests.conftest.message_ix_cli.<locals>.Runner object at 0x7f3e3f6a8e20>>
opts = '--tag=v1.2.0'
tmp_path = PosixPath('/tmp/pytest-of-runner/pytest-0/popen-gw0/test_dl___tag_v1_2_0_0')

    @pytest.mark.parametrize(
        "opts",
        [
            "",
            "--branch=main",
            "--tag=v1.2.0",
            # Nonexistent tag
            pytest.param("--tag=v999", marks=pytest.mark.xfail(raises=AssertionError)),
        ],
    )
    def test_dl(message_ix_cli, opts, tmp_path):
        r = message_ix_cli("dl", opts, str(tmp_path))
    
>       assert r.exit_code == 0, (r.exception, r.output)
E       AssertionError: (URLError(TimeoutError(110, 'Connection timed out')), '')
E       assert 1 == 0
E        +  where 1 = <Result URLError(TimeoutError(110, 'Connection timed out'))>.exit_code

message_ix/tests/test_cli.py:66: AssertionError
 ____________________________ test_dl[--branch=main] ____________________________
[gw1] linux -- Python 3.9.20 /opt/hostedtoolcache/Python/3.9.20/x64/bin/python

message_ix_cli = <bound method message_ix_cli.<locals>.Runner.invoke of <message_ix.tests.conftest.message_ix_cli.<locals>.Runner object at 0x7fc27797a2b0>>
opts = '--branch=main'
tmp_path = PosixPath('/tmp/pytest-of-runner/pytest-0/popen-gw1/test_dl___branch_main_0')

    @pytest.mark.parametrize(
        "opts",
        [
            "",
            "--branch=main",
            "--tag=v1.2.0",
            # Nonexistent tag
            pytest.param("--tag=v999", marks=pytest.mark.xfail(raises=AssertionError)),
        ],
    )
    def test_dl(message_ix_cli, opts, tmp_path):
        r = message_ix_cli("dl", opts, str(tmp_path))
    
>       assert r.exit_code == 0, (r.exception, r.output)
E       AssertionError: (URLError(TimeoutError(110, 'Connection timed out')), 'Retrieving https://github.com/iiasa/message_ix/archive/main.zip
E         ')
E       assert 1 == 0
E        +  where 1 = <Result URLError(TimeoutError(110, 'Connection timed out'))>.exit_code

message_ix/tests/test_cli.py:66: AssertionError

Unfortunately, they don't provide too much information. If I'm not mistaken, the first two tests timed out retrieving the tags_info from GitHub, while the last test apparently timed out retrieving the data from GitHub. In both cases, I'm not sure what we can do to avoid the flakiness (other than re-running tests on fails): there's just one GitHub address we need to address and it's the same for all tests and we're retrieving all data to temporary directories (that should be unique for each test run), anyway. So this is mainly documenting for future reference.

Lastly, I noticed that we didn't test calling the CLI command dl with both --tag and --branch, so I added a test for that. Unfortunately, click handles the exception somewhat awkwardly, so that the test is for now XPassing: it is triggering the right click.UsageError, but this gets printed to stdout and raises a SystemExit, which doesn't make it through our message_ix_cli fixture: this still returns r, only now it has an .exit_code != 0. While this is not ideal, I didn't want to spend too much time on this right now.

khaeru · 2024-09-24T07:26:33Z

.github/workflows/pytest.yaml

@@ -98,7 +98,7 @@ jobs:
        pytest message_ix \
          -m "not nightly and not tutorial" \
          -rA --verbose --color=yes --durations=20 \
-          --cov-report=xml \
+          --cov --junitxml=junit.xml \


As I understand, the --cov-report=xml file (coverage.xml) includes the line coverage, whereas the --junitxml file includes the results of tests.

I guess the latter is needed for Codecov to detect flaky tests. But doesn't this change also turn off generation of the former? How does Codecov then receive info on line coverage?

Happy to drop this commit since flaky-test-detection is going to become a paid feature sooner or later, anyway. I mainly wanted to see if it could be useful for this PR, no need to keep these changes :)

khaeru

Thanks for the thorough tracking and many fixes to flaky tests! The use of isolation across the test suite now exhibits a more consistent style that can be followed by others making additions to the code.

I put some notes inline, and a few small requested changes. With those, this will be good to merge.

message_ix/util/__init__.py

khaeru · 2024-09-24T07:45:21Z

message_ix/util/__init__.py

@@ -46,28 +50,29 @@ def output(str):
    else:
        output = print

-    # Identify the source directory using importlib.resources


Just recording for posterity that:

importlib.resources was used in Make tests robust & independent #784 because it's in the Python standard library, thus appears to be recommended/preferred over pathlib.Path(__file__).

However, there are strong dissenting views from Python users.

It was decided to switch to using pathlib.Path(__file__). This reverses changes made in the earlier PR.

We don't know if this has any impact on flakiness of tests.

message_ix/cli.py

message_ix/tests/test_cli.py

glatterf42 added enh New features & functionality ci Continuous integration labels Sep 12, 2024

glatterf42 self-assigned this Sep 12, 2024

glatterf42 requested a review from khaeru September 17, 2024 08:07

khaeru reviewed Sep 24, 2024

View reviewed changes

khaeru approved these changes Sep 24, 2024

View reviewed changes

glatterf42 added 20 commits September 24, 2024 14:20

Ensure order of test parametrization values

ef76943

Always use pathlib to find source dir

2ee5258

Modify source dir in test-specific way

20810aa

Add optional source_dir parameter to copy_model

5d14698

Avoid changing source dir directly in tests

ab8811d

Use test-specific name in GDX file names

bf62531

Use request.node.name in all make_ helpers

64094a8

Exempt fixtures from request.node.name

33b2497

Use pytest-xdist recommendation to identify test runs

b65341e

Remove testrun_uid from test_b_a_u_all_modes

bec2fd7

Fix typo

8e721d8

Use request-specific scenario names in fixture

eca6b05

List all files to copy-model at once

693627f

Use always applicable error instead of too-specific one

c0da4b3

Stop click from turning errors into simple stdout

9b0594e

Test providing --tag and --branch for dl CLI

d78aac3

List all files to copy-model at once

ec34dd3

Make new function argument keyword-only

d155377

Avoid duplicating test code

f68c092

Add PR to release notes

f70e76a

glatterf42 force-pushed the enh/ci-test-stability branch from 3bece75 to f70e76a Compare September 24, 2024 12:20

glatterf42 merged commit f070ebe into main Sep 24, 2024
26 checks passed

glatterf42 deleted the enh/ci-test-stability branch September 24, 2024 12:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make more tests robust and independent #877

Make more tests robust and independent #877

glatterf42 commented Sep 12, 2024 •

edited

Loading

codecov bot commented Sep 12, 2024 •

edited

Loading

glatterf42 commented Sep 13, 2024

glatterf42 commented Sep 17, 2024

glatterf42 commented Sep 23, 2024

khaeru Sep 24, 2024

glatterf42 Sep 24, 2024

khaeru left a comment

khaeru Sep 24, 2024

Make more tests robust and independent #877

Make more tests robust and independent #877

Conversation

glatterf42 commented Sep 12, 2024 • edited Loading

How to review

PR checklist

codecov bot commented Sep 12, 2024 • edited Loading

Codecov Report

glatterf42 commented Sep 13, 2024

glatterf42 commented Sep 17, 2024

glatterf42 commented Sep 23, 2024

khaeru Sep 24, 2024

Choose a reason for hiding this comment

glatterf42 Sep 24, 2024

Choose a reason for hiding this comment

khaeru left a comment

Choose a reason for hiding this comment

khaeru Sep 24, 2024

Choose a reason for hiding this comment

glatterf42 commented Sep 12, 2024 •

edited

Loading

codecov bot commented Sep 12, 2024 •

edited

Loading