Repeated CI failures on Windows #238

Closed · MilesCranmer opened this issue Dec 10, 2022 · 14 comments

MilesCranmer (Owner) commented Dec 10, 2022

Many of the Windows tests are now failing with various segmentation faults, which appear to be randomly triggered:

They seem to occur more frequently on older versions of Julia, and rarely on Julia 1.8.3. Regardless, a segfault anywhere is cause for concern and should be tracked down.

The errors include:

  1. Segmentation faults: an early segfault at the first run (Julia 1.6.7), segfaults during the noise test (Julia 1.6.7 and others), and segfaults during the warm start test.

e.g., Windows:

 D:\a\_temp\221410f9-8bf7-4099-901d-eb9813d86c45.sh: line 1:  1098 Segmentation fault      python -m pysr.test main
Started!
This also occurs on Ubuntu sometimes:
signal (11): Segmentation fault
in expression starting at none:0
unknown function (ip: 0x7fd6a19bc215)
unknown function (ip: 0x7fd6a19947ff)
macro expansion at /home/runner/.julia/packages/PyCall/ygXW2/src/exception.jl:95 [inlined]
convert at /home/runner/.julia/packages/PyCall/ygXW2/src/conversions.jl:94
pyjlwrap_getattr at /home/runner/.julia/packages/PyCall/ygXW2/src/pytype.jl:378
unknown function (ip: 0x7fd68d30b1bd)
unknown function (ip: 0x7fd6a19babda)
unknown function (ip: 0x7fd6a198e9d4)
pyisinstance at /home/runner/.julia/packages/PyCall/ygXW2/src/PyCall.jl:170 [inlined]
pysequence_query at /home/runner/.julia/packages/PyCall/ygXW2/src/conversions.jl:752
pytype_query at /home/runner/.julia/packages/PyCall/ygXW2/src/conversions.jl:773
pytype_query at /home/runner/.julia/packages/PyCall/ygXW2/src/conversions.jl:806 [inlined]
convert at /home/runner/.julia/packages/PyCall/ygXW2/src/conversions.jl:831
julia_kwarg at /home/runner/.julia/packages/PyCall/ygXW2/src/callback.jl:19 [inlined]
#57 at ./none:0 [inlined]
iterate at ./generator.jl:47 [inlined]
collect_to! at ./array.jl:728
unknown function (ip: 0x7fd68d341d9a)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
collect_to! at ./array.jl:736
unknown function (ip: 0x7fd68d33e35a)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
collect_to! at ./array.jl:736
collect_to_with_first! at ./array.jl:706
unknown function (ip: 0x7fd68d33d775)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
collect at ./array.jl:687
unknown function (ip: 0x7fd68d33afb4)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
_pyjlwrap_call at /home/runner/.julia/packages/PyCall/ygXW2/src/callback.jl:31
unknown function (ip: 0x7fd68d3348d5)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
pyjlwrap_call at /home/runner/.julia/packages/PyCall/ygXW2/src/callback.jl:44
unknown function (ip: 0x7fd68d30aeee)
unknown function (ip: 0x7fd6a19980c7)
_PyObject_VectorcallTstate at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:116 [inlined]
_PyObject_VectorcallTstate at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:103 [inlined]
PyObject_Vectorcall at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:127 [inlined]
call_function at /home/runner/work/_temp/SourceCode/Python/ceval.c:5077 [inlined]
_PyEval_EvalFrameDefault at /home/runner/work/_temp/SourceCode/Python/ceval.c:3537
unknown function (ip: 0x7fd6a19ebbb7)
_PyFunction_Vectorcall at /home/runner/work/_temp/SourceCode/Objects/call.c:396
unknown function (ip: 0x7fd6a199a1e0)
unknown function (ip: 0x7fd6a19ed97b)
unknown function (ip: 0x7fd6a19ebbb7)
_PyFunction_Vectorcall at /home/runner/work/_temp/SourceCode/Objects/call.c:396
unknown function (ip: 0x7fd6a19ecdf6)
unknown function (ip: 0x7fd6a1998972)
unknown function (ip: 0x7fd6a199a1e0)
unknown function (ip: 0x7fd6a19ecb12)
unknown function (ip: 0x7fd6a1998972)
unknown function (ip: 0x7fd6a19ecdf6)
unknown function (ip: 0x7fd6a19ebbb7)
_PyFunction_Vectorcall at /home/runner/work/_temp/SourceCode/Objects/call.c:396
unknown function (ip: 0x7fd6a199a28d)
unknown function (ip: 0x7fd6a19ef9b1)
unknown function (ip: 0x7fd6a19ebbb7)
unknown function (ip: 0x7fd6a1997d4c)
unknown function (ip: 0x7fd6a1998f2b)
unknown function (ip: 0x7fd6a1a46421)
unknown function (ip: 0x7fd6a199802f)
_PyObject_VectorcallTstate at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:116 [inlined]
_PyObject_VectorcallTstate at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:103 [inlined]
PyObject_Vectorcall at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:127 [inlined]
call_function at /home/runner/work/_temp/SourceCode/Python/ceval.c:5077 [inlined]
_PyEval_EvalFrameDefault at /home/runner/work/_temp/SourceCode/Python/ceval.c:3520
unknown function (ip: 0x7fd6a19ebbb7)
_PyFunction_Vectorcall at /home/runner/work/_temp/SourceCode/Objects/call.c:396
unknown function (ip: 0x7fd6a199a28d)
unknown function (ip: 0x7fd6a19ef9b1)
unknown function (ip: 0x7fd6a19ebbb7)
unknown function (ip: 0x7fd6a1997d4c)
unknown function (ip: 0x7fd6a1998f2b)
unknown function (ip: 0x7fd6a1a46421)
unknown function (ip: 0x7fd6a199802f)
_PyObject_VectorcallTstate at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:116 [inlined]
_PyObject_VectorcallTstate at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:103 [inlined]
PyObject_Vectorcall at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:127 [inlined]
call_function at /home/runner/work/_temp/SourceCode/Python/ceval.c:5077 [inlined]
_PyEval_EvalFrameDefault at /home/runner/work/_temp/SourceCode/Python/ceval.c:3520
unknown function (ip: 0x7fd6a1998972)
unknown function (ip: 0x7fd6a19ecdf6)
unknown function (ip: 0x7fd6a1998972)
unknown function (ip: 0x7fd6a19ecb12)
unknown function (ip: 0x7fd6a19ebbb7)
_PyEval_EvalCodeWithName at /home/runner/work/_temp/SourceCode/Python/ceval.c:4361
unknown function (ip: 0x7fd6a19eb876)
PyEval_EvalCode at /home/runner/work/_temp/SourceCode/Python/ceval.c:828
unknown function (ip: 0x7fd6a1a6399f)
cfunction_vectorcall_FASTCALL at /home/runner/work/_temp/SourceCode/Objects/methodobject.c:430
unknown function (ip: 0x7fd6a19ecb12)
unknown function (ip: 0x7fd6a19ebbb7)
_PyFunction_Vectorcall at /home/runner/work/_temp/SourceCode/Objects/call.c:396
unknown function (ip: 0x7fd6a19ecb12)
unknown function (ip: 0x7fd6a19ebbb7)
_PyFunction_Vectorcall at /home/runner/work/_temp/SourceCode/Objects/call.c:396
unknown function (ip: 0x7fd6a1a7fdd6)
unknown function (ip: 0x7fd6a1a7faae)
Py_BytesMain at /home/runner/work/_temp/SourceCode/Modules/main.c:731
unknown function (ip: 0x7fd6a1642d8f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_start at python (unknown line)
Allocations: 185387713 (Pool: 185351460; Big: 36253); GC: 470
/home/runner/work/_temp/bdd49862-48fd-4e82-bed8-685329606248.sh: line 1:  2324 Segmentation fault      (core dumped) python -m pysr.test main
  2. Git errors (Julia 1.8.2):
PyCall is installed and built successfully.
     Cloning git-repo `https://github.com/MilesCranmer/SymbolicRegression.jl`
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/runner/work/PySR/PySR/pysr/julia_helpers.py", line 87, in install
    _add_sr_to_julia_project(Main, io_arg)
  File "/Users/runner/work/PySR/PySR/pysr/julia_helpers.py", line 240, in _add_sr_to_julia_project
    Main.eval(f"Pkg.add([sr_spec, clustermanagers_spec], {io_arg})")
  File "/Users/runner/hostedtoolcache/Python/3.9.14/x64/lib/python3.9/site-packages/julia/core.py", line 627, in eval
    ans = self._call(src)
  File "/Users/runner/hostedtoolcache/Python/3.9.14/x64/lib/python3.9/site-packages/julia/core.py", line 555, in _call
    self.check_exception(src)
  File "/Users/runner/hostedtoolcache/Python/3.9.14/x64/lib/python3.9/site-packages/julia/core.py", line 609, in check_exception
    raise JuliaError(u'Exception \'{}\' occurred while calling julia code:\n{}'
julia.core.JuliaError: Exception 'failed to clone from https://github.com/MilesCranmer/SymbolicRegression.jl, error: GitError(Code:ERROR, Class:Net, SecureTransport error: connection closed via error)' occurred while calling julia code:
Pkg.add([sr_spec, clustermanagers_spec], io=stderr)
  3. Access errors during scikit-learn tests (these ones don't even fail the CI, which is a bit worrisome)

e.g.,

Failed check_fit2d_predict1d with:
    Traceback (most recent call last):
      File "D:\a\PySR\PySR\pysr\test\test.py", line 671, in test_scikit_learn_compatibility
        check(model)
      File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\sklearn\utils\_testing.py", line 188, in wrapper
        return fn(*args, **kwargs)
      File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\sklearn\utils\estimator_checks.py", line 1300, in check_fit2d_predict1d
        estimator.fit(X, y)
      File "D:\a\PySR\PySR\pysr\sr.py", line 1792, in fit
        self._run(X, y, mutated_params, weights=weights, seed=seed)
      File "D:\a\PySR\PySR\pysr\sr.py", line 1493, in _run
        Main = init_julia(self.julia_project, julia_kwargs=julia_kwargs)
      File "D:\a\PySR\PySR\pysr\julia_helpers.py", line 180, in init_julia
        Julia(**julia_kwargs)
      File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\julia\core.py", line 519, in __init__
        self._call("const PyCall = Base.require({0})".format(PYCALL_PKGID))
      File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\julia\core.py", line 554, in _call
        ans = self.api.jl_eval_string(src.encode('utf-8'))
    OSError: exception: access violation reading 0x000001BC1C501000
  4. Torch errors.

One other curious thing is that this error is raised on some Windows tests (https://github.com/MilesCranmer/PySR/actions/runs/3664894286/jobs/6195713513), but it should not take place there...

Run python -m pysr.test torch
D:\a\PySR\PySR\pysr\julia_helpers.py:139: UserWarning: `torch` was loaded before the Julia instance started. This may cause a segfault when running `PySRRegressor.fit`. To avoid this, please run `pysr.julia_helpers.init_julia()` *before* importing `torch`. For updates, see https://github.com/pytorch/pytorch/issues/78829
  warnings.warn(
D:\a\_temp\8727c9f4-d0f6-4345-84e6-e774762771ab.sh: line 1:   258 Segmentation fault      python -m pysr.test torch
Started!
MilesCranmer changed the title from "Repeated CI failures due to segmentation fault on Windows" to "Repeated CI failures on Windows" on Dec 10, 2022
MilesCranmer (Owner, Author) commented Dec 10, 2022

@mkitti, for error 3 in particular, do you have an idea of where I should check in PyJulia? It almost looks like Python garbage-collected the pointer to the Julia runtime, which is strange.

mkitti (Contributor) commented Dec 10, 2022

What changed?

MilesCranmer (Owner, Author) commented:

So I have seen a few of these on and off for a while, especially on Windows, but the rate has gone up recently. Perhaps this is because I have added more unit tests over time and tested more complex functionality (e.g., LoopVectorization.jl), so there is cumulatively a higher chance of each error occurring. I am really not sure what causes errors 1 and 3. Errors 2 and 4 seem doable to debug but more related to CI than to the code itself, so I am mostly worried about 1 and 3.

MilesCranmer (Owner, Author) commented:

I wonder if it has to do with the _LIBJULIA variable in PyJulia being cleaned up by the Python GC?
https://github.com/JuliaPy/pyjulia/blob/1e3de7bbd27312f9abd200761a0c04a03c40a23d/src/julia/libjulia.py#L90-L94

self.api is set to the result of get_libjulia, which is defined here and returns the global variable _LIBJULIA. However, that variable is not actually declared as global inside the function; it is only referenced there. I wonder if that is the source of the issue?

i.e., maybe the fix is

   def get_libjulia():
+      global _LIBJULIA
       return _LIBJULIA

MilesCranmer (Owner, Author) commented:

Edit: looks like the access error in particular was introduced between these two commits: https://github.com/MilesCranmer/PySR/compare/c97f60de90203bd5091c3f49e031f49b17a0c6fa..da0bef974b69dc9215a0986145c53f5f7f4462a9. Maybe it has to do with setting optimize=3 on Julia?

MilesCranmer (Owner, Author) commented:

Nope; neither the optimize=2 nor the global change fixed it. Very confused...

It seems like the access errors first show up in test_scikit_learn_compatibility, which passes PySRRegressor to an internal test suite of scikit-learn. I wonder if a recent change to that test suite is what suddenly caused this breakage in the Windows tests.

MilesCranmer (Owner, Author) commented:

I can't reproduce the errors on a local copy of Windows (in Parallels) - Python 3.10, Julia 1.8.3. I wonder if the GitHub action is just running out of memory or something...

mkitti (Contributor) commented Dec 11, 2022

Running out of memory would definitely put pressure on the garbage collector.

MilesCranmer (Owner, Author) commented:

Indeed, I think it is excessive memory use from some sort of garbage not being properly collected from threads:

[Screenshot: memory usage while repeatedly launching searches, 2022-12-20]

I was launching searches repeatedly from IPython, and at one point 10 GB of RAM was allocated. Even when I set model = None, none of the memory was cleared by the Python/Julia GCs, indicating it is somehow sticking around.

The short-term solution is to split the CI into separate launches of Python, so that memory is forced to clear between groups of tests.

The long-term solution is to debug exactly why memory is not being freed. Perhaps it has something to do with jobs being added to this list through the use of @async: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/367d155f26c5a7f0faf26bf529b95f097f1f7f22/src/SymbolicRegression.jl#L652, with the garbage then not being collected when this function exits?
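
To make that suspicion concrete, here is a minimal sketch of the pattern (hypothetical names such as `jobs` and `run_search`; this is not the actual SymbolicRegression.jl code): if the list of @async tasks, or the data those tasks capture, stays referenced after the search returns, the GC cannot reclaim it.

```julia
# Minimal sketch of the suspected pattern (hypothetical names, not the
# actual SymbolicRegression.jl source). Tasks launched with @async are
# pushed onto a list, and each task closes over a large array. If `jobs`
# stays referenced after the search returns (here it is a module-level
# constant), the GC cannot free the arrays the tasks captured.
const jobs = Task[]                      # imagine this outlives the search

function run_search(n_populations::Int)
    results = Vector{Vector{Float64}}(undef, n_populations)
    for i in 1:n_populations
        pop = randn(100_000)             # stand-in for one population's data
        push!(jobs, @async begin
            results[i] = pop .^ 2        # the task's closure keeps `pop` alive
        end)
    end
    foreach(wait, jobs)
    return results                       # `jobs` still roots every `pop`
end
```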

MilesCranmer (Owner, Author) commented Dec 21, 2022

Debugging list:

- [ ] Does the memory leak appear in Julia, or just PyJulia?
- [ ] Is the memory leak due to parallelism?
- [ ] Does the memory leak occur when running in serial mode?
- [ ] Does the memory leak occur when running until completion, rather than early stopping?
- [ ] How does the memory leak scale with # populations, dataset size, etc.?
- [ ] Does the memory leak appear only on some operating systems?
- [ ] Is the memory leak due to running everything directly on Main in PyJulia, rather than in a scope?

Edit: seems like there isn't actually a memory leak; it's just the JIT cache.
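
For reference, one quick way to distinguish a genuine heap leak from JIT-cache growth between repeated searches (a sketch, not the diagnostic actually used here): bytes tracked by the Julia GC should stay roughly flat, while resident set size also counts compiled code and can keep growing without anything leaking.

```julia
# Sketch: compare live GC-tracked heap with total process memory between runs.
GC.gc()
live_mib = Base.gc_live_bytes() / 2^20   # bytes currently tracked by the GC
rss_mib  = Sys.maxrss() / 2^20           # peak resident set size, incl. JIT code
println("live heap: $(round(live_mib, digits=1)) MiB, max RSS: $(round(rss_mib, digits=1)) MiB")
```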

MilesCranmer (Owner, Author) commented:

Even just splitting it into 10 different subsets of tests seems to cause segfaults: https://github.com/MilesCranmer/PySR/actions/runs/3752052933.

MilesCranmer (Owner, Author) commented Feb 18, 2023

Got some cloud compute to try to debug this. Looks like the test triggering the series of access violations is TestPipeline.test_high_dim_selection_early_stop in test.py. In particular, something in the second half of this test (the second model.fit) seems to trigger it:

PySR/pysr/test/test.py, lines 300 to 317 in d045586:

```python
def test_high_dim_selection_early_stop(self):
    X = pd.DataFrame({f"k{i}": self.rstate.randn(10000) for i in range(10)})
    Xresampled = pd.DataFrame({f"k{i}": self.rstate.randn(100) for i in range(10)})
    y = X["k7"] ** 2 + np.cos(X["k9"]) * 3
    model = PySRRegressor(
        unary_operators=["cos"],
        select_k_features=3,
        early_stop_condition=1e-4,  # Stop once most accurate equation is <1e-4 MSE
        maxsize=12,
        **self.default_test_kwargs,
    )
    model.set_params(model_selection="accuracy")
    model.fit(X, y, Xresampled=Xresampled)
    self.assertLess(np.average((model.predict(X) - y) ** 2), 1e-4)
    # Again, but with numpy arrays:
    model.fit(X.values, y.values, Xresampled=Xresampled.values)
    self.assertLess(np.average((model.predict(X.values) - y.values) ** 2), 1e-4)
```


Updates:

  1. Turned off early_stop_condition, and the bug went away. So perhaps stopping early is triggering some sort of memory access bug (e.g., from threads that haven't completed yet)?
    • It looks like threads could continue to modify the contents of returnPops even after it has been returned to Python. Perhaps that is the issue.
    • This could be tested by seeing if the problem goes away when serial mode is used instead, or when returnPops stores an explicit copy of the populations (see the sketch below).
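
For the second bullet, the copy test could look roughly like this (a sketch using the names from the discussion, `returnPops` and `populations`; not the actual source):

```julia
# Sketch of the proposed test: hand back deep copies, so that any tasks
# still mutating `populations` cannot touch the data returned to Python.
returnPops = [deepcopy(pop) for pop in populations]
```

If the access violations disappear with this change (or in serial mode), that would point at the race rather than something like memory pressure.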

MilesCranmer (Owner, Author) commented Feb 18, 2023

The poster in #266 confirmed that multiprocessing got rid of their issue, so it seems like a data race issue. I wonder if this is because EquationSearch exits before some threads are finished, since there is no safe way to cancel threads, whereas for processes I simply call rmprocs(procs): https://github.com/MilesCranmer/SymbolicRegression.jl/blob/51d205c518eb3e99cfd45ac6a2d3dbbbd1944f32/src/SymbolicRegression.jl#L915

One possible solution is to implement a task handler that will safely kill tasks, as described here: https://discourse.julialang.org/t/how-to-kill-thread/34236/8.
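
A minimal sketch of that idea, along the lines of the linked Discourse thread (hypothetical names; not the eventual implementation): each worker task periodically checks a shared flag and exits on its own, so an early stop can wait for every task instead of returning while they are still mutating shared state.

```julia
# Cooperative cancellation sketch: workers poll a shared atomic flag and
# exit cleanly, so the caller can stop the search early and still wait for
# every task before handing results back to Python.
function run_workers(n_workers::Integer)
    stop = Threads.Atomic{Bool}(false)
    tasks = Task[]
    for _ in 1:n_workers
        push!(tasks, Threads.@spawn begin
            while !stop[]
                # ... perform one unit of search work ...
                yield()
            end
        end)
    end
    # ... when the early-stop condition fires:
    stop[] = true
    foreach(wait, tasks)    # every worker has finished; now safe to return
end
```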

MilesCranmer self-assigned this on Mar 25, 2023
MilesCranmer removed their assignment on Apr 20, 2023
MilesCranmer (Owner, Author) commented:

Presumably fixed by #535
