
Memory leak under heavy load #545

Open
hcarty opened this issue Apr 12, 2017 · 15 comments

@hcarty
Contributor

hcarty commented Apr 12, 2017

Tested with OCaml 4.04.0+flambda, cohttp 0.22.0, conduit 0.15.0, Lwt w/libev, CentOS 7 64bit VM:

let server http_port =
  let callback _conn _req _body =
    Cohttp_lwt_unix.Server.respond ~status:`OK ~body:Cohttp_lwt_body.empty ()
  in
  Cohttp_lwt_unix.Server.create
    ~mode:(`TCP (`Port http_port))
    (Cohttp_lwt_unix.Server.make ~callback ())

let () = Lwt_main.run (server 8080)

Then hitting it with ab on the same system (may require ulimit adjustments):

ab -c 10000 -n 100000 http://127.0.0.1:8080/

In my tests, ab gets through ~99% of the requests just fine, but the last few hang for a bit and the cohttp server process jumps from <20 megabytes of RAM used to >150 megabytes used. Repeating the ab invocation shows the same behavior: ~99% of requests complete, then RAM usage jumps for the cohttp server process.

The cohttp RAM use never drops back down so there seems to be a resource leak somewhere.

@hcarty
Contributor Author

hcarty commented Apr 12, 2017

If I add a line inside the server definition to limit the number of active conduit connections, the leak isn't eliminated, but it is reduced to a few megabytes leaked per ab call rather than >100 megabytes:

let server http_port =
  Conduit_lwt_server.set_max_active 1_000;
  ...
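Put together with the repro server from the first comment, that looks roughly like this (the cap of 1_000 is just the value used above):

let server http_port =
  (* Cap the number of simultaneously active conduit connections. *)
  Conduit_lwt_server.set_max_active 1_000;
  let callback _conn _req _body =
    Cohttp_lwt_unix.Server.respond ~status:`OK ~body:Cohttp_lwt_body.empty ()
  in
  Cohttp_lwt_unix.Server.create
    ~mode:(`TCP (`Port http_port))
    (Cohttp_lwt_unix.Server.make ~callback ())

let () = Lwt_main.run (server 8080)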

@hcarty
Contributor Author

hcarty commented Apr 12, 2017

I take back my comment that the leak isn't eliminated when Conduit_lwt_server.set_max_active is used: memory usage grows with the maximum number of concurrent connections attempted, but it caps out once it peaks for that level of concurrency.

ciarancourtney added a commit to ciarancourtney/FrameworkBenchmarks that referenced this issue Aug 16, 2017
@hannesm
Member

hannesm commented Jul 29, 2018

I observe a memory leak with cohttp 1.0.2 using mirage-cohttp and conduit 1.0.3 (mirage-conduit 3.0.1) and OCaml 4.06.0. Furthermore, I observe failures such as:

Error handling ((headers   ((accept */*) (accept-encoding gzip,deflate) (host hannes.nqsb.io) (user-agent "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10"))) (meth GET) (resource /Posts/Jackline) (version HTTP_1_1) (encoding Unknown)): Out of memory

which seems to originate from response_stream in cohttp-lwt/server.ml, which catches all exceptions from the callback, prints them to stderr, and replies with respond_error ~body:"Internal Server Error" () -- it is not clear to me whether this catch-all should apply to the Out_of_memory exception.
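Something like the following is what I have in mind -- a sketch only, not cohttp's actual code; callback and respond_error here stand in for the real cohttp-lwt values:

let run_callback callback req respond_error =
  Lwt.catch
    (fun () -> callback req)
    (function
      | Out_of_memory | Stack_overflow as e ->
        (* Fatal runtime exceptions: re-raise instead of masking them. *)
        Lwt.fail e
      | e ->
        (* Everything else: log and reply with a 500 as before. *)
        prerr_endline ("Error handling request: " ^ Printexc.to_string e);
        respond_error ~body:"Internal Server Error" ())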

@kandu

kandu commented Jun 23, 2022

I suspect this memory leak occurs in Lwt, or that it is not really a memory leak at all but an implementation strategy.

The minimal zero-dependency HTTP server below can reproduce a similar 'problem'.

After ab -c 10000 -n 100000 http://127.0.0.1:8000/, the minimal server occupies 600 MiB to 1.2 GiB of memory, which is higher than my Java (Vert.x, Netty) and .NET (ASP.NET Core MVC) implementations. The periodic compacting GC reduces the memory to 400 MiB to 600 MiB.

System and environment: Debian 11 amd64, 8 cores, 16 GiB RAM, OCaml 4.14.0, Lwt 5.5.0.

dune

(executable
  (public_name hello)
  (libraries lwt lwt.unix)
  (preprocess (pps lwt_ppx)))

hello.ml

open Lwt

let rec gc ()=
  Lwt_unix.sleep 5.;%lwt
  print_endline "full major compact";
  Gc.compact ();
  gc ()

let rec read_request ic=
  (* read and drop the http Request-Line, all the Request Headers  and the last CRLF *)
  let%lwt s= Lwt_io.read_line ic in
  let len= String.length s in
  if len > 0 then read_request ic
  else return ()


let handler _addr (ic, oc)=
  let msg= "hello" in
  (try%lwt read_request ic with _-> return ());%lwt
  Lwt_io.fprint oc "HTTP/1.0 200 OK\r\n";%lwt
  Lwt_io.fprintf oc "Content-Length:%d\r\n" (String.length msg);%lwt
  Lwt_io.fprint oc "Content-Type:text/html\r\n";%lwt
  Lwt_io.fprint oc "\r\n";%lwt
  Lwt_io.fprint oc msg;%lwt
  Lwt_io.flush oc;%lwt
  return ()

let main ()=
  async gc;
  let sockaddr= Lwt_unix.ADDR_INET (Unix.inet_addr_any, 8000) in
  let%lwt server= Lwt_io.establish_server_with_client_address
    ~no_close:false (* channels and socks are closed automatically after the handler, fd/channel leak is not possible *)
    ~backlog:4096 (* enlarge listen backlog to reduce the probability of failure connection *)
    sockaddr
    handler
  in
  let%lwt _= Lwt_io.read_line Lwt_io.stdin in
  Lwt_io.shutdown_server server


let () =
  Lwt_main.run @@ main ()

The number of requests doesn't affect the memory occupied; ab -c 10 -n 100000 is resource-thrifty on my system, which implies it is very unlikely that there is a memory leak in the Lwt system. The level of concurrency, however, matters much more.

My guess is that Lwt holds promises and channel buffers in some data structure which, for performance reasons, grows on demand but doesn't shrink after the promises are resolved.

@kandu

kandu commented Jun 25, 2022

Occasionally, a full compacting GC can't recycle enough memory: after an ab benchmark, more than 1 GiB of memory remains occupied. This could be improved by a more intelligent strategy. Perhaps we should cc the Lwt developers on this issue?

@kandu

kandu commented Jul 1, 2022

I tried rewriting the server with Lwt_unix directly, that is, without Lwt_io, and the memory consumption dropped considerably.
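Roughly, the response-writing part without Lwt_io looks like this (a sketch with the same hard-coded reply as the minimal server above; request reading would need to be redone on top of Lwt_unix.read as well):

let write_response fd =
  let msg= "hello" in
  let response=
    Printf.sprintf
      "HTTP/1.0 200 OK\r\nContent-Length:%d\r\nContent-Type:text/html\r\n\r\n%s"
      (String.length msg) msg
  in
  (* Write straight to the socket; no Lwt_io channel buffer is allocated. *)
  let rec write_all pos=
    if pos >= String.length response then Lwt.return_unit
    else
      let%lwt n= Lwt_unix.write_string fd response pos (String.length response - pos) in
      write_all (pos + n)
  in
  write_all 0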

Lwt_io uses Lwt_bytes.t as its buffer, which is a Bigarray.Array1.t and depends heavily on custom C stubs.
Another experiment was to implement Lwt_bytes on top of Stdlib.Bytes; memory consumption decreases there too.

Lwt_bytes and its C stubs look like one of the memory-leak holes.

After some testing and after replacing some of the components in Lwt, there seem to be more memory-leak holes in the Unix/IO-related modules.

@rgrinberg
Member

Lwt_bytes is fine; you're probably just encountering the Bigarray-related GC issues, which are well documented.

The issue is indeed with Lwt_io, but it's not related to Lwt_bytes. Lwt_io has this queuing layer for "atomic" operation (see Lwt_io.primitive) that sometimes works very poorly in practice. Taking your handler as an example, each write will enqueue itself and wait its turn until the channel is "Idle". All of this queueing overhead is quite costly (especially if we consider cancellation) and just mercilessly stresses the GC. Especially given that most write operations should be extremely cheap blits to the internal buffer.

You can change your server to use Lwt_io.direct_access and see for yourself that that's all it takes to get decent performance. If you'd like, you can modify cohttp-server-lwt-unix to use direct access. I'd be happy to review such a PR.
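As a rough illustration (a sketch against the handler above, not a drop-in for cohttp), writing the response through direct access is just blitting into the channel buffer and flushing when it fills up:

let write_response oc msg =
  let open Lwt.Infix in
  let response =
    Printf.sprintf
      "HTTP/1.0 200 OK\r\nContent-Length:%d\r\nContent-Type:text/html\r\n\r\n%s"
      (String.length msg) msg
  in
  let bytes = Bytes.of_string response in
  Lwt_io.direct_access oc (fun da ->
    let open Lwt_io in
    let len = Bytes.length bytes in
    let rec go pos =
      if pos >= len then Lwt.return_unit
      else if da.da_ptr = da.da_max then begin
        (* Buffer full: let the channel write it out, then keep going. *)
        da.da_perform () >>= fun _written ->
        go pos
      end
      else begin
        let n = min (da.da_max - da.da_ptr) (len - pos) in
        (* A plain blit into the channel's internal buffer, no queueing. *)
        Lwt_bytes.blit_from_bytes bytes pos da.da_buffer da.da_ptr n;
        da.da_ptr <- da.da_ptr + n;
        go (pos + n)
      end
    in
    go 0)

The final Lwt_io.flush oc from the original handler is still needed; direct access only fills the channel buffer.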

@kandu

kandu commented Jul 4, 2022

Thanks for the explanation.

Unfortunately, the performance after changing the server to use Lwt_io.direct_access is not as good as expected.
I'm considering switching to Jane Street's Async: its performance is not comparable with Lwt's, but its memory consumption is low and its task scheduling is fair :)

@kandu

kandu commented Aug 29, 2022

Hi, @rgrinberg

The issue is indeed with Lwt_io, but it's not related to Lwt_bytes. Lwt_io has this queuing layer for "atomic" operation (see Lwt_io.primitive) that sometimes works very poorly in practice. Taking your handler as an example, each write will enqueue itself and wait its turn until the channel is "Idle". All of this queueing overhead is quite costly (especially if we consider cancellation) and just mercilessly stresses the GC. Especially given that most write operations should be extremely cheap blits to the internal buffer.

Indeed, this strategy stacks all the buffers up together. When thousands of connections arrive, it causes a really high peak memory usage. And the default memory allocator, glibc's malloc on Linux, doesn't release freed memory back to the OS.

I made a binding to malloc_trim from <malloc.h> in lwt_unix and call it after every major GC cycle (via Gc.create_alarm) to force the glibc memory allocator to release free memory back to the OS. With that, the long-term memory usage is constant -- about 20 MiB.
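For anyone who wants to try the same thing without patching lwt_unix, here is a rough equivalent using ctypes.foreign (an assumption on my part: my actual binding lives inside lwt_unix as a C stub; this sketch just calls glibc's malloc_trim directly and requires glibc plus the ctypes and ctypes.foreign libraries):

open Ctypes
open Foreign

(* int malloc_trim(size_t pad): returns 1 if some memory was released to the OS. *)
let malloc_trim = foreign "malloc_trim" (size_t @-> returning int)

let install_trim_alarm () =
  (* Gc.create_alarm runs the callback at the end of every major GC cycle. *)
  ignore (Gc.create_alarm (fun () ->
    ignore (malloc_trim Unsigned.Size_t.zero)))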

So

I suspect this memory leak occurs in Lwt, or that it is not really a memory leak at all but an implementation strategy.

This is not so much a cohttp bug as an implementation flaw in Lwt_io.
I think this issue could also be forwarded to the Lwt developers. At the least, we could replace the bug tag with an enhancement tag.

@gasche
Contributor

gasche commented Nov 1, 2022

I learned of the present issue from Caml Weekly News reporting on @kandu's comment on Discuss.

It seems to point to a fundamental memory-usage issue with Lwt_io -- consuming large amounts of otherwise-free memory is surprising but acceptable; having dependent users crash with Out_of_memory is not great. I don't understand from reading the discussion whether the issue is related to Bigarray usage in Lwt_bytes or not.

Has this issue actually been reported to the Lwt folks? If yes, can you point to the corresponding issue there?

If the issue is in fact related to Bigarray usage, note that the GC's ways of dealing with out-of-heap memory usage have improved in the past years (a few years ago, but after Lwt was initially implemented), and the potential "well-known issues" may be solvable more or less easily today. (Possibly that would involve discussing with the upstream OCaml runtime folks, but it makes sense to go through the Lwt people first.)
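For instance, if I remember correctly, since OCaml 4.08 the Gc control record exposes ratios governing how out-of-heap memory held by custom blocks (which includes Bigarrays) drives collections; lowering them makes the GC collect Bigarray-backed buffers more eagerly. A hedged illustration only, with arbitrary values, not a recommendation:

let () =
  let c = Gc.get () in
  Gc.set { c with
           Gc.custom_major_ratio = 22;  (* default 44 *)
           Gc.custom_minor_ratio = 50   (* default 100 *) }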

@rgrinberg
Member

Has this issue actually been reported to the Lwt folks? If yes, can you point to the corresponding issue there?

Nope, it was not. The issue was successfully worked around in the new cohttp client and servers though.

@hansole

hansole commented Nov 13, 2022

I have had similar problems with Ocsigen, reported in:
ocsigen/ocsigen-start#658
ocsigen/eliom#569
Ocsigen seems to be using cohttp version 2.5.6

I tried to read through the changelog of cohttp but I was not able to find this fix.
Which version of cohttp has the workaround?

@gasche
Contributor

gasche commented Nov 13, 2022

I created an upstream issue for Lwt at ocsigen/lwt#972. (I wish people more knowledgeable about the cohttp issue had done it themselves, because I couldn't give much useful information.)

@mseri
Collaborator

mseri commented Nov 14, 2022

@hansole the upcoming 6.0.0. The first alpha release is on opam-repository and will likely be merged soon.

@rgrinberg
Member

Note that to address the problem on the server side, you will need to switch to cohttp-server-lwt-unix.
