Repeated CI failures on Windows #238

Closed · MilesCranmer opened this issue Dec 10, 2022 · 14 comments

MilesCranmer (Owner) commented Dec 10, 2022

Many of the Windows tests are now failing with various segmentation faults, which appear to be randomly triggered:

They seem to occur more frequently on older versions of Julia, and rarely on Julia 1.8.3. Regardless, a segfault anywhere is cause for concern and should be tracked down.

The errors include:

  1. Segmentation faults: an early segfault at the first run (Julia 1.6.7), segfaults during the noise test (Julia 1.6.7 and others), and segfaults during the warm start test.

e.g., Windows:

 D:\a\_temp\221410f9-8bf7-4099-901d-eb9813d86c45.sh: line 1:  1098 Segmentation fault      python -m pysr.test main
Started!
This also occurs on Ubuntu sometimes:
signal (11): Segmentation fault
in expression starting at none:0
unknown function (ip: 0x7fd6a19bc215)
unknown function (ip: 0x7fd6a19947ff)
macro expansion at /home/runner/.julia/packages/PyCall/ygXW2/src/exception.jl:95 [inlined]
convert at /home/runner/.julia/packages/PyCall/ygXW2/src/conversions.jl:94
pyjlwrap_getattr at /home/runner/.julia/packages/PyCall/ygXW2/src/pytype.jl:378
unknown function (ip: 0x7fd68d30b1bd)
unknown function (ip: 0x7fd6a19babda)
unknown function (ip: 0x7fd6a198e9d4)
pyisinstance at /home/runner/.julia/packages/PyCall/ygXW2/src/PyCall.jl:170 [inlined]
pysequence_query at /home/runner/.julia/packages/PyCall/ygXW2/src/conversions.jl:752
pytype_query at /home/runner/.julia/packages/PyCall/ygXW2/src/conversions.jl:773
pytype_query at /home/runner/.julia/packages/PyCall/ygXW2/src/conversions.jl:806 [inlined]
convert at /home/runner/.julia/packages/PyCall/ygXW2/src/conversions.jl:831
julia_kwarg at /home/runner/.julia/packages/PyCall/ygXW2/src/callback.jl:19 [inlined]
#57 at ./none:0 [inlined]
iterate at ./generator.jl:47 [inlined]
collect_to! at ./array.jl:728
unknown function (ip: 0x7fd68d341d9a)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
collect_to! at ./array.jl:736
unknown function (ip: 0x7fd68d33e35a)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
collect_to! at ./array.jl:736
collect_to_with_first! at ./array.jl:706
unknown function (ip: 0x7fd68d33d775)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
collect at ./array.jl:687
unknown function (ip: 0x7fd68d33afb4)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
_pyjlwrap_call at /home/runner/.julia/packages/PyCall/ygXW2/src/callback.jl:31
unknown function (ip: 0x7fd68d3348d5)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
pyjlwrap_call at /home/runner/.julia/packages/PyCall/ygXW2/src/callback.jl:44
unknown function (ip: 0x7fd68d30aeee)
unknown function (ip: 0x7fd6a19980c7)
_PyObject_VectorcallTstate at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:116 [inlined]
_PyObject_VectorcallTstate at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:103 [inlined]
PyObject_Vectorcall at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:127 [inlined]
call_function at /home/runner/work/_temp/SourceCode/Python/ceval.c:5077 [inlined]
_PyEval_EvalFrameDefault at /home/runner/work/_temp/SourceCode/Python/ceval.c:3537
unknown function (ip: 0x7fd6a19ebbb7)
_PyFunction_Vectorcall at /home/runner/work/_temp/SourceCode/Objects/call.c:396
unknown function (ip: 0x7fd6a199a1e0)
unknown function (ip: 0x7fd6a19ed97b)
unknown function (ip: 0x7fd6a19ebbb7)
_PyFunction_Vectorcall at /home/runner/work/_temp/SourceCode/Objects/call.c:396
unknown function (ip: 0x7fd6a19ecdf6)
unknown function (ip: 0x7fd6a1998972)
unknown function (ip: 0x7fd6a199a1e0)
unknown function (ip: 0x7fd6a19ecb12)
unknown function (ip: 0x7fd6a1998972)
unknown function (ip: 0x7fd6a19ecdf6)
unknown function (ip: 0x7fd6a19ebbb7)
_PyFunction_Vectorcall at /home/runner/work/_temp/SourceCode/Objects/call.c:396
unknown function (ip: 0x7fd6a199a28d)
unknown function (ip: 0x7fd6a19ef9b1)
unknown function (ip: 0x7fd6a19ebbb7)
unknown function (ip: 0x7fd6a1997d4c)
unknown function (ip: 0x7fd6a1998f2b)
unknown function (ip: 0x7fd6a1a46421)
unknown function (ip: 0x7fd6a199802f)
_PyObject_VectorcallTstate at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:116 [inlined]
_PyObject_VectorcallTstate at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:103 [inlined]
PyObject_Vectorcall at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:127 [inlined]
call_function at /home/runner/work/_temp/SourceCode/Python/ceval.c:5077 [inlined]
_PyEval_EvalFrameDefault at /home/runner/work/_temp/SourceCode/Python/ceval.c:3520
unknown function (ip: 0x7fd6a19ebbb7)
_PyFunction_Vectorcall at /home/runner/work/_temp/SourceCode/Objects/call.c:396
unknown function (ip: 0x7fd6a199a28d)
unknown function (ip: 0x7fd6a19ef9b1)
unknown function (ip: 0x7fd6a19ebbb7)
unknown function (ip: 0x7fd6a1997d4c)
unknown function (ip: 0x7fd6a1998f2b)
unknown function (ip: 0x7fd6a1a46421)
unknown function (ip: 0x7fd6a199802f)
_PyObject_VectorcallTstate at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:116 [inlined]
_PyObject_VectorcallTstate at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:103 [inlined]
PyObject_Vectorcall at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:127 [inlined]
call_function at /home/runner/work/_temp/SourceCode/Python/ceval.c:5077 [inlined]
_PyEval_EvalFrameDefault at /home/runner/work/_temp/SourceCode/Python/ceval.c:3520
unknown function (ip: 0x7fd6a1998972)
unknown function (ip: 0x7fd6a19ecdf6)
unknown function (ip: 0x7fd6a1998972)
unknown function (ip: 0x7fd6a19ecb12)
unknown function (ip: 0x7fd6a19ebbb7)
_PyEval_EvalCodeWithName at /home/runner/work/_temp/SourceCode/Python/ceval.c:4361
unknown function (ip: 0x7fd6a19eb876)
PyEval_EvalCode at /home/runner/work/_temp/SourceCode/Python/ceval.c:828
unknown function (ip: 0x7fd6a1a6399f)
cfunction_vectorcall_FASTCALL at /home/runner/work/_temp/SourceCode/Objects/methodobject.c:430
unknown function (ip: 0x7fd6a19ecb12)
unknown function (ip: 0x7fd6a19ebbb7)
_PyFunction_Vectorcall at /home/runner/work/_temp/SourceCode/Objects/call.c:396
unknown function (ip: 0x7fd6a19ecb12)
unknown function (ip: 0x7fd6a19ebbb7)
_PyFunction_Vectorcall at /home/runner/work/_temp/SourceCode/Objects/call.c:396
unknown function (ip: 0x7fd6a1a7fdd6)
unknown function (ip: 0x7fd6a1a7faae)
Py_BytesMain at /home/runner/work/_temp/SourceCode/Modules/main.c:731
unknown function (ip: 0x7fd6a1642d8f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_start at python (unknown line)
Allocations: 185387713 (Pool: 185351460; Big: 36253); GC: 470
/home/runner/work/_temp/bdd49862-48fd-4e82-bed8-685329606248.sh: line 1:  2324 Segmentation fault      (core dumped) python -m pysr.test main
  2. Git errors (Julia 1.8.2):
PyCall is installed and built successfully.
     Cloning git-repo `https://github.com/MilesCranmer/SymbolicRegression.jl`
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/runner/work/PySR/PySR/pysr/julia_helpers.py", line 87, in install
    _add_sr_to_julia_project(Main, io_arg)
  File "/Users/runner/work/PySR/PySR/pysr/julia_helpers.py", line 240, in _add_sr_to_julia_project
    Main.eval(f"Pkg.add([sr_spec, clustermanagers_spec], {io_arg})")
  File "/Users/runner/hostedtoolcache/Python/3.9.14/x64/lib/python3.9/site-packages/julia/core.py", line 627, in eval
    ans = self._call(src)
  File "/Users/runner/hostedtoolcache/Python/3.9.14/x64/lib/python3.9/site-packages/julia/core.py", line 555, in _call
    self.check_exception(src)
  File "/Users/runner/hostedtoolcache/Python/3.9.14/x64/lib/python3.9/site-packages/julia/core.py", line 609, in check_exception
    raise JuliaError(u'Exception \'{}\' occurred while calling julia code:\n{}'
julia.core.JuliaError: Exception 'failed to clone from https://github.com/MilesCranmer/SymbolicRegression.jl, error: GitError(Code:ERROR, Class:Net, SecureTransport error: connection closed via error)' occurred while calling julia code:
Pkg.add([sr_spec, clustermanagers_spec], io=stderr)
  3. Access errors during scikit-learn tests (these ones don't even fail the CI, which is a bit worrisome)

e.g.,

Failed check_fit2d_predict1d with:
    Traceback (most recent call last):
      File "D:\a\PySR\PySR\pysr\test\test.py", line 671, in test_scikit_learn_compatibility
        check(model)
      File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\sklearn\utils\_testing.py", line 188, in wrapper
        return fn(*args, **kwargs)
      File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\sklearn\utils\estimator_checks.py", line 1300, in check_fit2d_predict1d
        estimator.fit(X, y)
      File "D:\a\PySR\PySR\pysr\sr.py", line 1792, in fit
        self._run(X, y, mutated_params, weights=weights, seed=seed)
      File "D:\a\PySR\PySR\pysr\sr.py", line 1493, in _run
        Main = init_julia(self.julia_project, julia_kwargs=julia_kwargs)
      File "D:\a\PySR\PySR\pysr\julia_helpers.py", line 180, in init_julia
        Julia(**julia_kwargs)
      File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\julia\core.py", line 519, in __init__
        self._call("const PyCall = Base.require({0})".format(PYCALL_PKGID))
      File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\julia\core.py", line 554, in _call
        ans = self.api.jl_eval_string(src.encode('utf-8'))
    OSError: exception: access violation reading 0x000001BC1C501000
  4. Torch errors.

One other curious thing is that this error is raised on some Windows tests (https://github.com/MilesCranmer/PySR/actions/runs/3664894286/jobs/6195713513), but it should not take place there...

Run python -m pysr.test torch
D:\a\PySR\PySR\pysr\julia_helpers.py:139: UserWarning: `torch` was loaded before the Julia instance started. This may cause a segfault when running `PySRRegressor.fit`. To avoid this, please run `pysr.julia_helpers.init_julia()` *before* importing `torch`. For updates, see https://github.com/pytorch/pytorch/issues/78829
  warnings.warn(
D:\a\_temp\8727c9f4-d0f6-4345-84e6-e774762771ab.sh: line 1:   258 Segmentation fault      python -m pysr.test torch
Started!
MilesCranmer changed the title from "Repeated CI failures due to segmentation fault on Windows" to "Repeated CI failures on Windows" on Dec 10, 2022
MilesCranmer (Owner, Author) commented Dec 10, 2022

@mkitti, for error 3 in particular, do you have an idea of where I should check in PyJulia? It almost looks like Python garbage-collected the pointer to the Julia runtime, which is strange.

mkitti (Contributor) commented Dec 10, 2022

What changed?

MilesCranmer (Owner, Author) commented:

So I have seen a few of these on and off for a while, especially on Windows, but the rate has gone up recently. Perhaps this is because I have added more unit tests over time and tested more complex functionality (e.g., LoopVectorization.jl), so there is cumulatively a higher chance of each error occurring. I am really not sure what causes errors 1 and 3. Errors 2 and 4 seem doable to debug but more related to CI than to the code itself, so I am mostly worried about 1 and 3.

MilesCranmer (Owner, Author) commented:

I wonder if it has to do with the _LIBJULIA variable in PyJulia being cleaned up by the Python GC?
https://github.com/JuliaPy/pyjulia/blob/1e3de7bbd27312f9abd200761a0c04a03c40a23d/src/julia/libjulia.py#L90-L94

self.api is set to the result of get_libjulia, which is defined here and returns the global variable _LIBJULIA. However, that variable is not actually declared as global inside the function; it is only referenced there. I wonder if that is the source of the issue?

i.e., maybe the fix is

   def get_libjulia():
+      global _LIBJULIA
       return _LIBJULIA

MilesCranmer (Owner, Author) commented:

Edit: looks like the access error in particular was introduced between these two commits: https://github.com/MilesCranmer/PySR/compare/c97f60de90203bd5091c3f49e031f49b17a0c6fa..da0bef974b69dc9215a0986145c53f5f7f4462a9. Maybe it has to do with setting optimize=3 on Julia?

MilesCranmer (Owner, Author) commented:

Nope; neither the optimize=2 nor the global change fixed it. Very confused...

It seems like the access errors first show up in test_scikit_learn_compatibility, which passes PySRRegressor to an internal test suite of scikit-learn. I wonder if a recent change to that test suite is what suddenly caused this breakage in the Windows tests.

MilesCranmer (Owner, Author) commented:

I can't reproduce the errors on a local copy of Windows (in Parallels) - Python 3.10, Julia 1.8.3. I wonder if the GitHub action is just running out of memory or something...

mkitti (Contributor) commented Dec 11, 2022

Running out of memory would definitely put pressure on the garbage collector.

MilesCranmer (Owner, Author) commented:

Indeed, I think it is excessive memory use from some sort of garbage not being properly collected from threads:

[Screenshot: memory usage while repeatedly launching searches, 2022-12-20]

I was launching searches repeatedly from IPython, and at one point 10 GB of RAM was allocated. Even when I set model = None, none of the memory was cleared by the Python/Julia GCs, indicating it is somehow sticking around.

The short-term solution is to split the CI into separate launches of Python, so that memory is forced to clear between groups of tests.

The long-term solution is to debug exactly why memory is not being freed. Perhaps it has something to do with jobs being added to this list through the use of @async: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/367d155f26c5a7f0faf26bf529b95f097f1f7f22/src/SymbolicRegression.jl#L652, with the garbage then not being collected when this function exits?
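
To make that suspicion concrete, here is a minimal sketch of the pattern (hypothetical names such as `jobs` and `run_search`; this is not the actual SymbolicRegression.jl code): if the list of @async tasks, or the data those tasks capture, stays referenced after the search returns, the GC cannot reclaim it.

```julia
# Minimal sketch of the suspected pattern (hypothetical names, not the
# actual SymbolicRegression.jl source). Tasks launched with @async are
# pushed onto a list, and each task closes over a large array. If `jobs`
# stays referenced after the search returns (here it is a module-level
# constant), the GC cannot free the arrays the tasks captured.
const jobs = Task[]                      # imagine this outlives the search

function run_search(n_populations::Int)
    results = Vector{Vector{Float64}}(undef, n_populations)
    for i in 1:n_populations
        pop = randn(100_000)             # stand-in for one population's data
        push!(jobs, @async begin
            results[i] = pop .^ 2        # the task's closure keeps `pop` alive
        end)
    end
    foreach(wait, jobs)
    return results                       # `jobs` still roots every `pop`
end
```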

MilesCranmer (Owner, Author) commented Dec 21, 2022

Debugging list:

- [ ] Does the memory leak appear in Julia, or just PyJulia?
- [ ] Is the memory leak due to parallelism?
- [ ] Does the memory leak occur when running in serial mode?
- [ ] Does the memory leak occur when running until completion, rather than early stopping?
- [ ] How does the memory leak scale with # populations, dataset size, etc.?
- [ ] Does the memory leak appear only on some operating systems?
- [ ] Is the memory leak due to running everything directly on Main in PyJulia, rather than in a scope?

Edit: seems like there isn't actually a memory leak; it's just the JIT cache.
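
For reference, one quick way to distinguish a genuine heap leak from JIT-cache growth between repeated searches (a sketch, not the diagnostic actually used here): bytes tracked by the Julia GC should stay roughly flat, while resident set size also counts compiled code and can keep growing without anything leaking.

```julia
# Sketch: compare live GC-tracked heap with total process memory between runs.
GC.gc()
live_mib = Base.gc_live_bytes() / 2^20   # bytes currently tracked by the GC
rss_mib  = Sys.maxrss() / 2^20           # peak resident set size, incl. JIT code
println("live heap: $(round(live_mib, digits=1)) MiB, max RSS: $(round(rss_mib, digits=1)) MiB")
```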

MilesCranmer (Owner, Author) commented:

Even just splitting it into 10 different subsets of tests seems to cause segfaults: https://github.com/MilesCranmer/PySR/actions/runs/3752052933.

MilesCranmer (Owner, Author) commented Feb 18, 2023

Got some cloud compute to try to debug this. Looks like the test triggering the series of access violations is TestPipeline.test_high_dim_selection_early_stop in test.py. In particular, something in the second half of this test (the second model.fit) seems to trigger it:

PySR/pysr/test/test.py, lines 300 to 317 in d045586:

```python
def test_high_dim_selection_early_stop(self):
    X = pd.DataFrame({f"k{i}": self.rstate.randn(10000) for i in range(10)})
    Xresampled = pd.DataFrame({f"k{i}": self.rstate.randn(100) for i in range(10)})
    y = X["k7"] ** 2 + np.cos(X["k9"]) * 3
    model = PySRRegressor(
        unary_operators=["cos"],
        select_k_features=3,
        early_stop_condition=1e-4,  # Stop once most accurate equation is <1e-4 MSE
        maxsize=12,
        **self.default_test_kwargs,
    )
    model.set_params(model_selection="accuracy")
    model.fit(X, y, Xresampled=Xresampled)
    self.assertLess(np.average((model.predict(X) - y) ** 2), 1e-4)
    # Again, but with numpy arrays:
    model.fit(X.values, y.values, Xresampled=Xresampled.values)
    self.assertLess(np.average((model.predict(X.values) - y.values) ** 2), 1e-4)
```


Updates:

  1. Turned off early_stop_condition, and the bug went away. So perhaps stopping early is triggering some sort of memory access bug (e.g., from threads that haven't completed yet)?
    • It looks like threads could continue to modify the contents of returnPops even after it has been returned to Python. Perhaps that is the issue.
    • This could be tested by seeing if the problem goes away when serial mode is used instead, or when returnPops stores an explicit copy of the populations (see the sketch below).
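
For the second bullet, the copy test could look roughly like this (a sketch using the names from the discussion, `returnPops` and `populations`; not the actual source):

```julia
# Sketch of the proposed test: hand back deep copies, so that any tasks
# still mutating `populations` cannot touch the data returned to Python.
returnPops = [deepcopy(pop) for pop in populations]
```

If the access violations disappear with this change (or in serial mode), that would point at the race rather than something like memory pressure.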

MilesCranmer (Owner, Author) commented Feb 18, 2023

The poster in #266 confirmed that multiprocessing got rid of their issue, so it seems like a data race issue. I wonder if this is because EquationSearch exits before some threads are finished, since there is no safe way to cancel threads, whereas for processes I simply call rmprocs(procs): https://github.com/MilesCranmer/SymbolicRegression.jl/blob/51d205c518eb3e99cfd45ac6a2d3dbbbd1944f32/src/SymbolicRegression.jl#L915

One possible solution is to implement a task handler that will safely kill tasks, as described here: https://discourse.julialang.org/t/how-to-kill-thread/34236/8.
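
A minimal sketch of that idea, along the lines of the linked Discourse thread (hypothetical names; not the eventual implementation): each worker task periodically checks a shared flag and exits on its own, so an early stop can wait for every task instead of returning while they are still mutating shared state.

```julia
# Cooperative cancellation sketch: workers poll a shared atomic flag and
# exit cleanly, so the caller can stop the search early and still wait for
# every task before handing results back to Python.
function run_workers(n_workers::Integer)
    stop = Threads.Atomic{Bool}(false)
    tasks = Task[]
    for _ in 1:n_workers
        push!(tasks, Threads.@spawn begin
            while !stop[]
                # ... perform one unit of search work ...
                yield()
            end
        end)
    end
    # ... when the early-stop condition fires:
    stop[] = true
    foreach(wait, tasks)    # every worker has finished; now safe to return
end
```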

MilesCranmer self-assigned this on Mar 25, 2023
MilesCranmer removed their assignment on Apr 20, 2023
MilesCranmer (Owner, Author) commented:

Presumably fixed by #535
