
conduit-lwt-unix: use accept_n on the server #387

Closed
wants to merge 11 commits
Conversation

@mseri commented Apr 22, 2021

In benchmarks of cohttp-lwt-unix's latency we see evidence of connection starvation.
Using accept_n seems to improve the situation. You can see here an example of the artificial benchmarks that I used to check this.

Ping @avsm
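
As a rough, hedged sketch of what this means in practice (not the PR's actual diff): the server loop asks Lwt_unix.accept_n for a batch of already-pending connections instead of calling Lwt_unix.accept once per iteration. The names accept_loop, max_batch and handle below are placeholders for illustration.

    let rec accept_loop socket ~max_batch ~handle =
      let open Lwt.Syntax in
      (* accept_n returns up to [max_batch] connections that are already
         pending on [socket], plus an optional exception if accepting
         stopped early. *)
      let* (connections, _early_exn) = Lwt_unix.accept_n socket max_batch in
      (* Start handling the whole batch concurrently before accepting more. *)
      let* () = Lwt_list.iter_p (fun (fd, _sockaddr) -> handle fd) connections in
      accept_loop socket ~max_batch ~handle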

@@ -10,6 +10,7 @@ val process_accept :
unit Lwt.t

val init :
Author (@mseri):

I should probably document this; I was first waiting to see whether you are interested at all, and what the review says about the implementation.

@mseri commented Apr 22, 2021

Adding the yield call greatly improves the latency, but apparently at the cost of lost connections.
In the picture below, the green line is accept_n + yield from e490592, the red one is the current conduit, the blue one is accept_n with all the code related to set_max_active removed, and the orange one is accept_n without the extra yielding (cdeda3c).

[latency plot]
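
For clarity, here is a sketch of where the "extra yielding" sits relative to the batch, assuming the accept_n loop sketched in the description above; Lwt.pause stands in for the yield and hands control back to the scheduler so the freshly accepted connections get serviced before the next accept_n call.

    let rec accept_loop_with_yield socket ~max_batch ~handle =
      let open Lwt.Syntax in
      let* (connections, _early_exn) = Lwt_unix.accept_n socket max_batch in
      let* () = Lwt_list.iter_p (fun (fd, _addr) -> handle fd) connections in
      (* The extra yield: let already-accepted work run before accepting again. *)
      let* () = Lwt.pause () in
      accept_loop_with_yield socket ~max_batch ~handle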

| `Accepted connections ->
    Lwt_list.iter_p
      (fun v ->
        connected ();
Author (@mseri):

There may also be a race when updating connected here, since I am calling iter_p, right?

Member:

I don't quite see the race you're referring to here -- where is the iter_p?

Author (@mseri):

If I am not mistaken, iter_p runs the iterations in parallel, and connected () increments the active-connection count without any protection (let connected () = incr active). My fear was that multiple connections handled in parallel could try to increment the variable simultaneously (and I don't want to put a mutex there).
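
For reference, a minimal standalone sketch (not conduit code) of the pattern under discussion: a shared counter incremented inside Lwt_list.iter_p. Standard Lwt is single-threaded and cooperative, so callbacks only interleave at bind/yield points; a plain incr is therefore not torn, and a lost update could only occur if the read and the write of the counter were separated by such a point.

    let () =
      let active = ref 0 in
      let connected () = incr active in
      Lwt_main.run
        (Lwt_list.iter_p
           (fun _conn ->
              connected ();
              Lwt.pause ())
           [ (); (); () ]);
      (* All three increments are observed: prints "active = 3". *)
      Printf.printf "active = %d\n" !active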

@mseri commented May 17, 2021

I have updated the test, using the test server suggested in inhabitedtype/httpaf#200 and the wrk2 call used there.
The new results are in https://gist.github.com/mseri/d40130fef7c8673e55b9e337f411c56d and summarized below

[latency plot]

In practice, it seems that using accept_n does improve the latency with respect to both the current implementation and the one with the extra yield call.

I would be curious to see if this makes any difference on Linux (I am testing on a MacBook Pro).

@avsm (Member) left a comment

This looks like an improvement to me for sure. I'd like to run tests on Linux before merging it, just to quantify the win.

@mseri commented May 21, 2021

If somebody can benchmark it on Linux, can you also check what difference it makes with and without cf7c62f?

@anuragsoni

@mseri If somebody can benchmark it on Linux, can you also check what difference it makes with and without cf7c62f?

I gave this a run on Linux today. I did a set of different tests. I first tried running the tests through cohttp + conduit and took the example server from your gist. I didn't notice a big difference between the currently released conduit and the version with accept_n. I then tried to use conduit for my own WIP HTTP/1.1 server implementation and tried the following combinations:

All servers run a service of the form req -> res Lwt.t (both handler shapes are sketched right after the list below):

  • Server with upstream conduit where handler returns ASAP

  • Server with accept_n where handler returns ASAP
    With these two servers there wasn't a big difference between accept_n and accept.

  • Server with upstream conduit where handler uses Lwt.pause before responding

  • Server with accept_n where handler uses Lwt.pause before responding
    In these two servers I noticed an improvement in the accept_n branch vs the currently released conduit.
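
A hypothetical sketch of the two handler shapes compared in the list above, assuming a service of type req -> res Lwt.t; the req and res types and the response value are placeholders, not the actual server code.

    type req = unit
    type res = string

    (* Variant 1: the handler returns as soon as possible. *)
    let immediate_handler (_ : req) : res Lwt.t = Lwt.return "hello"

    (* Variant 2: the handler yields to the scheduler once before responding. *)
    let pausing_handler (_ : req) : res Lwt.t =
      let open Lwt.Syntax in
      let* () = Lwt.pause () in
      Lwt.return "hello"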

All of the data can be found at https://gist.github.com/anuragsoni/33ae4a750dd7e110cbe94ed69614b7d5
Do let me know if something looks missing there and I'd be happy to run more tests.

As an aside (and this is most likely off-topic for conduit, and maybe more a discussion about how the Lwt scheduler works?), another potential data point came from a test that simulates handlers that do a little bit of work. I added a sleep of 1 millisecond in the handler before it responds. With this change I noticed a big spike in latency, and the numbers went up from milliseconds to seconds. https://gist.github.com/anuragsoni/33ae4a750dd7e110cbe94ed69614b7d5#file-h1_conduit_accept_n_1ms_delay-txt
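
A sketch of that "little bit of work" variant, with the same placeholder request/response types as above: the handler sleeps for one millisecond before responding.

    let delayed_handler (_req : unit) : string Lwt.t =
      let open Lwt.Syntax in
      (* Simulate ~1 ms of work before producing the response. *)
      let* () = Lwt_unix.sleep 0.001 in
      Lwt.return "hello"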

I've been running some benchmarks against servers in other languages, and I don't see a similar spike when I add a millisecond delay in a handler for a Tokio-based Rust web server, nor when I run my h1 server on Async instead of Lwt. The Rust and h1-async results can be seen at https://github.com/anuragsoni/h1/tree/3c16f5eccf4a17c1009814a5178537fb50e5aaad/wrk_bench
I'm still early in my experiments, so I'm not 100% sure about what causes these spikes for Lwt, but I'll spend some more time to see if I can get more information about what's happening!

@mseri commented May 31, 2021

Dear @anuragsoni, thanks a lot for this!

The idea of adding the millisecond pause is brilliant.

There are a number of tickets on this latency issue; it would be great if we could figure out how to improve/fix it.

@anuragsoni

@mseri There are a number of tickets on this latency issue; it would be great if we could figure out how to improve/fix it.

For sure! I remember seeing some tickets on cohttp and httpaf; are those the tickets you are referring to?

I'm wondering if some of the latency-related investigation/work will be needed at a layer below that? I've only recently started using Lwt seriously, but the fact that I'm seeing the large latency spike only on Lwt and not on Async leads me to believe that it'd be worthwhile to come up with a small reproduction case that's independent of cohttp/conduit etc., to see if the issue lies somewhere in Lwt. When I tested the 1ms delay on Async with my h1 library, I saw a similar spike in latency when I was using the higher-level Reader/Writer modules that do their own buffering and scheduling for writes. I noticed pretty significant improvements when I switched to using Async's file_descr/socket APIs directly to orchestrate my own reads/writes and do my own buffering. I was hoping to see similar improvements on my Lwt backend, and I do see improvements when the handler doesn't do any work, but with the 1ms delay, going down to read/write calls on Lwt_unix.file_descr still shows latency spikes similar to when I was testing with the Lwt_io managed buffered channels.
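
A minimal sketch of the "do my own buffering" approach mentioned above: reading directly from a Lwt_unix.file_descr into a caller-managed buffer rather than going through Lwt_io's buffered channels. The buffer size and the return type are arbitrary choices for illustration, not the h1 implementation.

    let read_some (fd : Lwt_unix.file_descr) : string option Lwt.t =
      let open Lwt.Syntax in
      let buf = Bytes.create 4096 in
      let* n = Lwt_unix.read fd buf 0 (Bytes.length buf) in
      (* A zero-length read means the peer closed the connection. *)
      if n = 0 then Lwt.return_none
      else Lwt.return_some (Bytes.sub_string buf 0 n)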

Do note that everything I said might very well be wrong, as I'm still exploring the Lwt API and learning more about it :D

@mseri commented Jun 1, 2021

What you see is reasonable and in line with some of the comments in the tickets I am referring to. Give me a day and I will link them here.

@mseri commented Jun 1, 2021

Here we go, the latency issue is discussed in

as well as httpaf#200 and eliom#270 -- ping also @smorimoto, who made some tests there --

It can be mitigated a bit by adjusting

    Conduit_lwt_unix.set_max_active 100;
    Lwt_io.set_default_buffer_size 0x10000

and, as you observed, by using accept_n and yielding there.
There may also be a relation to "Possible memory leak in client.read_response" (cohttp#688), which, in fact, goes hand in hand with another issue in Lwt, closed for lack of manpower.

@mseri commented Jun 30, 2021

There is a new benchmark showing how httpaf crushes cohttp on unikernels: robur-coop/unipi#4 (comment)

@anuragsoni

There is a new benchmark showing how httpaf crushes cohttp on unikernels: roburio/unipi#4 (comment)

Some more benchmarks comparing various HTTP server implementations can be found at https://github.com/ocaml-multicore/retro-httpaf-bench/

This pull request was closed.