
Memory leak under heavy load #545

Open
hcarty opened this issue Apr 12, 2017 · 15 comments

@hcarty
Contributor

hcarty commented Apr 12, 2017

Tested with OCaml 4.04.0+flambda, cohttp 0.22.0, conduit 0.15.0, Lwt w/libev, CentOS 7 64bit VM:

let server http_port =
  let callback _conn _req _body =
    Cohttp_lwt_unix.Server.respond ~status:`OK ~body:Cohttp_lwt_body.empty ()
  in
  Cohttp_lwt_unix.Server.create
    ~mode:(`TCP (`Port http_port))
    (Cohttp_lwt_unix.Server.make ~callback ())

let () = Lwt_main.run (server 8080)

Then hitting it with ab on the same system (may require ulimit adjustments):

ab -c 10000 -n 100000 http://127.0.0.1:8080/

In my tests, ab gets through ~99% of the requests just fine, but the last few hang for a bit and the cohttp server process jumps from <20 megabytes of RAM used to >150 megabytes used. Repeating the ab invocation shows the same behavior: ~99% of requests complete, then RAM usage jumps for the cohttp server process.

The cohttp RAM use never drops back down so there seems to be a resource leak somewhere.

@hcarty
Contributor Author

hcarty commented Apr 12, 2017

If I add a line inside the server definition to limit the number of active conduit connections, the leak isn't eliminated, but it is reduced to a few megabytes leaked per ab call rather than >100 megabytes:

let server http_port =
  Conduit_lwt_server.set_max_active 1_000;
  ...
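Put together with the repro server from the first comment, that looks roughly like this (the cap of 1_000 is just the value used above):

let server http_port =
  (* Cap the number of simultaneously active conduit connections. *)
  Conduit_lwt_server.set_max_active 1_000;
  let callback _conn _req _body =
    Cohttp_lwt_unix.Server.respond ~status:`OK ~body:Cohttp_lwt_body.empty ()
  in
  Cohttp_lwt_unix.Server.create
    ~mode:(`TCP (`Port http_port))
    (Cohttp_lwt_unix.Server.make ~callback ())

let () = Lwt_main.run (server 8080)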

@hcarty
Contributor Author

hcarty commented Apr 12, 2017

I take back my comment that the leak isn't eliminated when Conduit_lwt_server.set_max_active is used: memory usage grows with the maximum number of concurrent connections attempted, but it caps out once it peaks for that level of concurrency.

ciarancourtney added a commit to ciarancourtney/FrameworkBenchmarks that referenced this issue Aug 16, 2017
@hannesm
Member

hannesm commented Jul 29, 2018

I observe a memory leak with cohttp 1.0.2 using mirage-cohttp and conduit 1.0.3 (mirage-conduit 3.0.1) and OCaml 4.06.0. Furthermore, I observe failures such as:

Error handling ((headers   ((accept */*) (accept-encoding gzip,deflate) (host hannes.nqsb.io) (user-agent "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10"))) (meth GET) (resource /Posts/Jackline) (version HTTP_1_1) (encoding Unknown)): Out of memory

which seems to originate from response_stream in cohttp-lwt/server.ml, which catches all exceptions from the callback, prints them to stderr, and replies with respond_error ~body:"Internal Server Error" () -- it is not clear to me whether this catch-all should apply to the Out_of_memory exception.
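Something like the following is what I have in mind -- a sketch only, not cohttp's actual code; callback and respond_error here stand in for the real cohttp-lwt values:

let run_callback callback req respond_error =
  Lwt.catch
    (fun () -> callback req)
    (function
      | Out_of_memory | Stack_overflow as e ->
        (* Fatal runtime exceptions: re-raise instead of masking them. *)
        Lwt.fail e
      | e ->
        (* Everything else: log and reply with a 500 as before. *)
        prerr_endline ("Error handling request: " ^ Printexc.to_string e);
        respond_error ~body:"Internal Server Error" ())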

@kandu

kandu commented Jun 23, 2022

I suspect this memory leak occurs in Lwt, or that it is not really a memory leak at all but an implementation strategy.

The minimal zero-dependency HTTP server below can reproduce a similar 'problem'.

After ab -c 10000 -n 100000 http://127.0.0.1:8000/, the minimal server occupies 600 MiB to 1.2 GiB of memory, which is higher than my Java (Vert.x, Netty) and .NET (ASP.NET Core MVC) implementations. The periodic compacting GC reduces the memory to 400 MiB to 600 MiB.

System and environment: Debian 11 amd64, 8 cores, 16 GiB RAM, OCaml 4.14.0, Lwt 5.5.0.

dune

(executable
  (public_name hello)
  (libraries lwt lwt.unix)
  (preprocess (pps lwt_ppx)))

hello.ml

open Lwt

let rec gc ()=
  Lwt_unix.sleep 5.;%lwt
  print_endline "full major compact";
  Gc.compact ();
  gc ()

let rec read_request ic=
  (* read and drop the http Request-Line, all the Request Headers  and the last CRLF *)
  let%lwt s= Lwt_io.read_line ic in
  let len= String.length s in
  if len > 0 then read_request ic
  else return ()


let handler _addr (ic, oc)=
  let msg= "hello" in
  (try%lwt read_request ic with _-> return ());%lwt
  Lwt_io.fprint oc "HTTP/1.0 200 OK\r\n";%lwt
  Lwt_io.fprintf oc "Content-Length:%d\r\n" (String.length msg);%lwt
  Lwt_io.fprint oc "Content-Type:text/html\r\n";%lwt
  Lwt_io.fprint oc "\r\n";%lwt
  Lwt_io.fprint oc msg;%lwt
  Lwt_io.flush oc;%lwt
  return ()

let main ()=
  async gc;
  let sockaddr= Lwt_unix.ADDR_INET (Unix.inet_addr_any, 8000) in
  let%lwt server= Lwt_io.establish_server_with_client_address
    ~no_close:false (* channels and socks are closed automatically after the handler, fd/channel leak is not possible *)
    ~backlog:4096 (* enlarge listen backlog to reduce the probability of failure connection *)
    sockaddr
    handler
  in
  let%lwt _= Lwt_io.read_line Lwt_io.stdin in
  Lwt_io.shutdown_server server


let () =
  Lwt_main.run @@ main ()

The number of requests doesn't affect the memory occupied; ab -c 10 -n 100000 is resource-thrifty on my system, which implies it is very unlikely that there is a memory leak in the Lwt system. The level of concurrency, however, matters much more.

My guess is that Lwt holds promises and channel buffers in some data structure which, for performance reasons, grows on demand but doesn't shrink after the promises are resolved.

@kandu

kandu commented Jun 25, 2022

Occasionally, a full compacting GC can't recycle enough memory: after an ab benchmark, more than 1 GiB of memory remains occupied. This could be improved by a more intelligent strategy. Perhaps we should cc the Lwt developers on this issue?

@kandu

kandu commented Jul 1, 2022

I tried rewriting the server with Lwt_unix directly, that is, without Lwt_io, and the memory consumption dropped considerably.
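Roughly, the response-writing part without Lwt_io looks like this (a sketch with the same hard-coded reply as the minimal server above; request reading would need to be redone on top of Lwt_unix.read as well):

let write_response fd =
  let msg= "hello" in
  let response=
    Printf.sprintf
      "HTTP/1.0 200 OK\r\nContent-Length:%d\r\nContent-Type:text/html\r\n\r\n%s"
      (String.length msg) msg
  in
  (* Write straight to the socket; no Lwt_io channel buffer is allocated. *)
  let rec write_all pos=
    if pos >= String.length response then Lwt.return_unit
    else
      let%lwt n= Lwt_unix.write_string fd response pos (String.length response - pos) in
      write_all (pos + n)
  in
  write_all 0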

Lwt_io uses Lwt_bytes.t as its buffer, which is a Bigarray.Array1.t and depends heavily on custom C stubs.
Another experiment was to implement Lwt_bytes on top of Stdlib.Bytes; memory consumption decreases there too.

Lwt_bytes and its C stubs look like one of the memory-leak holes.

After some testing and after replacing some of the components in Lwt, there seem to be more memory-leak holes in the Unix/IO-related modules.

@rgrinberg
Member

Lwt_bytes is fine; you're probably just encountering the Bigarray-related GC issues, which are well documented.

The issue is indeed with Lwt_io, but it's not related to Lwt_bytes. Lwt_io has this queuing layer for "atomic" operation (see Lwt_io.primitive) that sometimes works very poorly in practice. Taking your handler as an example, each write will enqueue itself and wait its turn until the channel is "Idle". All of this queueing overhead is quite costly (especially if we consider cancellation) and just mercilessly stresses the GC. Especially given that most write operations should be extremely cheap blits to the internal buffer.

You can change your server to use Lwt_io.direct_access and see for yourself that that's all it takes to get decent performance. If you'd like, you can modify cohttp-server-lwt-unix to use direct access. I'd be happy to review such a PR.
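As a rough illustration (a sketch against the handler above, not a drop-in for cohttp), writing the response through direct access is just blitting into the channel buffer and flushing when it fills up:

let write_response oc msg =
  let open Lwt.Infix in
  let response =
    Printf.sprintf
      "HTTP/1.0 200 OK\r\nContent-Length:%d\r\nContent-Type:text/html\r\n\r\n%s"
      (String.length msg) msg
  in
  let bytes = Bytes.of_string response in
  Lwt_io.direct_access oc (fun da ->
    let open Lwt_io in
    let len = Bytes.length bytes in
    let rec go pos =
      if pos >= len then Lwt.return_unit
      else if da.da_ptr = da.da_max then begin
        (* Buffer full: let the channel write it out, then keep going. *)
        da.da_perform () >>= fun _written ->
        go pos
      end
      else begin
        let n = min (da.da_max - da.da_ptr) (len - pos) in
        (* A plain blit into the channel's internal buffer, no queueing. *)
        Lwt_bytes.blit_from_bytes bytes pos da.da_buffer da.da_ptr n;
        da.da_ptr <- da.da_ptr + n;
        go (pos + n)
      end
    in
    go 0)

The final Lwt_io.flush oc from the original handler is still needed; direct access only fills the channel buffer.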

@kandu

kandu commented Jul 4, 2022

Thanks for the explanation.

Unfortunately, the performance after changing the server to use Lwt_io.direct_access is not as good as expected.
I'm considering switching to Jane Street's Async: its performance is not comparable with Lwt's, but its memory consumption is low and its task scheduling is fair :)

@kandu

kandu commented Aug 29, 2022

Hi, @rgrinberg

The issue is indeed with Lwt_io, but it's not related to Lwt_bytes. Lwt_io has this queuing layer for "atomic" operation (see Lwt_io.primitive) that sometimes works very poorly in practice. Taking your handler as an example, each write will enqueue itself and wait its turn until the channel is "Idle". All of this queueing overhead is quite costly (especially if we consider cancellation) and just mercilessly stresses the GC. Especially given that most write operations should be extremely cheap blits to the internal buffer.

Indeed, this strategy stacks all the buffers up together. When thousands of connections arrive, it causes a really high peak memory usage. And the default memory allocator, glibc's malloc on Linux, doesn't release freed memory back to the OS.

I made a binding to malloc_trim from <malloc.h> in lwt_unix and call it after every major GC cycle (via Gc.create_alarm) to force the glibc memory allocator to release free memory back to the OS. With that, the long-term memory usage is constant -- about 20 MiB.
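For anyone who wants to try the same thing without patching lwt_unix, here is a rough equivalent using ctypes.foreign (an assumption on my part: my actual binding lives inside lwt_unix as a C stub; this sketch just calls glibc's malloc_trim directly and requires glibc plus the ctypes and ctypes.foreign libraries):

open Ctypes
open Foreign

(* int malloc_trim(size_t pad): returns 1 if some memory was released to the OS. *)
let malloc_trim = foreign "malloc_trim" (size_t @-> returning int)

let install_trim_alarm () =
  (* Gc.create_alarm runs the callback at the end of every major GC cycle. *)
  ignore (Gc.create_alarm (fun () ->
    ignore (malloc_trim Unsigned.Size_t.zero)))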

So

I suspect this memory leak occurs in Lwt, or that it is not really a memory leak at all but an implementation strategy.

This is not so much a cohttp bug as an implementation flaw in Lwt_io.
I think this issue could also be forwarded to the Lwt developers. At the least, we could replace the bug tag with an enhancement tag.

@gasche
Contributor

gasche commented Nov 1, 2022

I learned of the present issue from Caml Weekly News reporting on @kandu's comment on Discuss.

It seems to point to a fundamental memory-usage issue with Lwt_io -- consuming large amounts of otherwise-free memory is surprising but acceptable; having dependent users crash with Out_of_memory is not great. I don't understand from reading the discussion whether the issue is related to Bigarray usage in Lwt_bytes or not.

Has this issue actually been reported to the Lwt folks? If yes, can you point to the corresponding issue there?

If the issue is in fact related to Bigarray usage, note that the GC's ways of dealing with out-of-heap memory usage have improved in the past years (a few years ago, but after Lwt was initially implemented), and the potential "well-known issues" may be solvable more or less easily today. (Possibly that would involve discussing with the upstream OCaml runtime folks, but it makes sense to go through the Lwt people first.)
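For instance, if I remember correctly, since OCaml 4.08 the Gc control record exposes ratios governing how out-of-heap memory held by custom blocks (which includes Bigarrays) drives collections; lowering them makes the GC collect Bigarray-backed buffers more eagerly. A hedged illustration only, with arbitrary values, not a recommendation:

let () =
  let c = Gc.get () in
  Gc.set { c with
           Gc.custom_major_ratio = 22;  (* default 44 *)
           Gc.custom_minor_ratio = 50   (* default 100 *) }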

@rgrinberg
Member

Has this issue actually been reported to the Lwt folks? If yes, can you point to the corresponding issue there?

Nope, it was not. The issue was successfully worked around in the new cohttp client and servers though.

@hansole

hansole commented Nov 13, 2022

I have had similar problems with Ocsigen, reported in:
ocsigen/ocsigen-start#658
ocsigen/eliom#569
Ocsigen seems to be using cohttp version 2.5.6

I tried to read through the changelog of cohttp but I was not able to find this fix.
Which version of cohttp has the workaround?

@gasche
Contributor

gasche commented Nov 13, 2022

I created an upstream issue for Lwt at ocsigen/lwt#972. (I wish people more knowledgeable about the cohttp issue had done it themselves, because I couldn't give much useful information.)

@mseri
Collaborator

mseri commented Nov 14, 2022

@hansole the upcoming 6.0.0. The first alpha release is on opam-repository and will likely be merged soon.

@rgrinberg
Member

Note that to address the problem on the server side, you will need to switch to cohttp-server-lwt-unix.
