Refactor test pool to be less optional #75

mmcgr · 2023-08-11T01:22:25Z

The 'optional' test pool was over complicated and offered little advantage over using it directly. Functionality is almost unchanged after this refactor, but with a simpler interface.

set and get methods for test engine names now go directly to the engine pool.
Increasing the size of the engine pool also provisions the engines.
Getting an engine from the pool now assumes it exists.
The resize function grew a bit much so has been split into helpers.

src/engines.jl

mcmcgrath13 · 2023-08-15T14:41:53Z

src/engines.jl


 Create an XS engine with default settings and the provided name.

 If the engine already exists, return immediately. If not, create the engine then
 return once the provisioning process is complete, or failed.
 """
-function create_default_engine(name::String)
-    size = "XS"
+function create_default_engine(name::String, size::String="XS")


in case there are other attributes we want to add later, can we make size a kwarg instead of a positional optional arg?

Done. The name isn't so great now, since there are no defaults. create_engine_and_wait, maybe?

src/engines.jl

src/testpool.jl

src/engines.jl

mcmcgrath13 · 2023-08-15T14:53:27Z

src/testpool.jl

 """
 function add_test_engine!(name::String)
    @lock TEST_SERVER_LOCK begin
        engines = TEST_ENGINE_POOL.engines
        engines[name] = 0
    end

+    # Provision the engine if it does not already exist.
+    TEST_ENGINE_POOL.creater(new_name)
+
    return nothing
 end

 function get_next_engine_name(id::Int64)
    return "julia-sdk-test-$(id)"


not for this PR, but this default name could probably use another pass

mcmcgrath13 · 2023-08-15T14:55:04Z

src/testpool.jl

+                end
+                # The engine exists, but is not provisioned despite our best attempts
+            catch
+                # The engine does not exist


is there a specific error we can test for here?

404 means the engine does not exist. There are plans for a 429 with a Retry-After, although that shouldn't be handled here. Then there are HTTP connection errors that are not about the engine.

Reporting the error, whatever it is, could be useful. 404 is a definitive error, but handling that would be covered by reporting the error. The response, if there is one, could also be reported.

but it sounds like we do expect those to be HTTPError and not some other error? Maybe check for that just to avoid a situation where we make a typo in the try block and swallow it

Ah, right. Only HTTPErrors make sense for us. If we do get a different error, something is seriously wrong and we should rethrow.

mcmcgrath13 · 2023-08-15T15:02:09Z

src/testpool.jl

    end

+    @lock TEST_SERVER_LOCK begin


If I'm understanding the concurrency model correctly (which is a big if to be fair), the main place where we could see data update races is when marking engines as available for testing or not. It looks like all of the pool maintenance isn't really threaded. If this is indeed the case, could we use the lock just on those availability updates or use an atomic there?

Pool maintenance can be performed while testing is in progress. This is mainly useful for replacing engines while tests are being run (which we've never actually done as they've never failed during testing). The single global lock for adding/removing/claiming engines is the simplest, most reliable approach to allow this.

If we're not modifying the list during testing (and we aren't in current use) then we could add atomics, and drop the add/remove lock entirely, marking those methods as non-threadsafe.

So, yes we could change it. Is there an advantage in reliability or readability either way?

Not a huge one, but it seems like there are some opportunities for helpers that could be simplified without the lock or would require smaller chunks of code locking. For instance, we delete engines in a similar try/catch + warn pattern in a bunch of places that could maybe be abstracted a bit

Modifying engines mid-testing didn't prove to be needed, so simplifying the threading model sounds reasonable to me. I'll do this in a followup PR so it's easier to see the changes.

I found potential issues so I went ahead and made the changes in this PR. More work is now done by helpers, and the locks are held for less of the code.

src/testpool.jl

mcmcgrath13 · 2023-08-17T22:38:49Z

src/testpool.jl

@@ -18,7 +18,7 @@ Base.@kwdef mutable struct TestEnginePool
    engines::Dict{String, Int64} = Dict()
    # This is used to enable unique, simple, naming of engines
    # Switching to randomly generated UUIDs would be needed if tests are run independently
-    next_id::Int64 = 0
+    next_id::Threads.Atomic{Int} = Threads.Atomic{Int}(0)
    # Number of tests per engine. Values > 1 invalidate test timing, and require careful
    # attention to engine sizing
    concurrency::Int64 = 1


should concurrency also be an atomic?

ah, concurrency is the general limit, not the number of tests actually running

mcmcgrath13 · 2023-08-17T22:39:52Z

src/testpool.jl

+function _get_new_id()
+    return Threads.atomic_add!(TEST_ENGINE_POOL.next_id, 1)
+end
+
 function get_free_test_engine_name()::String
    delay = 1
    # One lock guards name acquisition, forming a queue
    # The second lock guards modification of the engine pool
    @lock TEST_SERVER_ACQUISITION_LOCK while true


with next id being an atomic, is the TEST_SERVER_ACQUISITION_LOCK still needed?

next_id is for engine naming, so they don't interact. It's there to simplify starting up/naming/renaming multiple engines at once.

TEST_SERVER_ACQUISITION_LOCK is there to restrict engine claiming to one test a time - which forms a queue once all engines are claimed. The inner lock then gives 'privileged' access to anything else trying to modify the list. A standard test has to claim both locks, restricting access to the inner lock to one test thread at a time. Thus an attempt to replace an engine has to contend for the lock with only one thread that spends most of its time sleeping. Otherwise attempts to modify the list during testing could wait until all the tests had finished.

Disallowing modification of the list while tests are running would also work, of course.

mcmcgrath13

nice! much more readable now 🎉

src/engines.jl

mmcgr added 4 commits August 11, 2023 11:19

Refactor test pool to be less optional

49ce841

Be less lazy when provisioning engines

a00fcdc

Comment

271233f

Remove debug code

6de754b

mmcgr marked this pull request as ready for review August 11, 2023 08:42

mmcgr added 2 commits August 11, 2023 18:47

Rename

9bb99d0

Remove debug code

6b5e906

mmcgr requested a review from mcmcgrath13 August 11, 2023 08:50