Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

infinite loop on windows graceful restart with proxy websocket #277

Open
nono303 opened this issue Apr 2, 2024 · 11 comments
Open

infinite loop on windows graceful restart with proxy websocket #277

nono303 opened this issue Apr 2, 2024 · 11 comments

Comments

@nono303
Copy link

nono303 commented Apr 2, 2024

Hi @icing
I have a bug with mod_h2 and "proxyfied" websockets when restarting the service (graceful I suppose) on Windows
This problem reminds me of https://bz.apache.org/bugzilla/show_bug.cgi?id=65180 but is not fully identical.

httpd 2.4.58 self compiled vs17 x64 mod_h2 2.0.26

ProtocolsHonorOrder On
Protocols h2 h2c http/1.1 acme-tls/1
H2Upgrade On
H2Direct On
H2Push On
H2EarlyHints On
H2ModernTLSOnly On
H2WebSockets Off # as unsupported on Windows
H2WindowSize 1460000000
H2StreamMaxMemSize 1460000000
H2MinWorkers 8
H2MaxWorkers 100
H2MaxSessionStreams 100
H2MaxDataFrameLen 0

  1. I have 3 vhosts configured (grafana, gitlab and mattermost) like this (with different ports...)
  ProxyPass / http://127.0.0.1:7000/ upgrade=websocket nocanon retry=0 keepalive=On
  ProxyPassReverse / http://127.0.0.1:7000/

Note: I did the test described below by deactivating nocanon retry=0 keepalive=On properties one by one but that doesn't change anything.

  1. I start httpd and launch my browser tabs which requires a websocket connection
    • ✅ httpd fork is ~100% idle
    • F12 101 Switching Protocols for all and I see the exchanges in the console
   wss://gitlab.mydomain.com/-/cable
   wss://mattermost.mydomain.com/api/v4/websocket?connection_id=&sequence_number=0
   wss://grafana.mydomain.com/api/live/ws	
  1. I restart httpd (httpd -k restart) while still having my 3 tabs open (websocket connected)
    • ⚠️ the newly created httpd fork is a ~25% sys and stay as it
  2. I attach debugger and find an infinite loop on the 3 threads on this stack:
   ntdll!NtDeviceIoControlFile+0x14
   MSWSOCK!MSAFD_WSPPoll+0x30e
   MSWSOCK!WSPIoctl+0xed11()
   WS2_32!WSAIoctl+0x1c7()
   WS2_32!WSAPoll+0x22f
   libapr_1!apr_winapi_WSAPoll+0x1b(pollfd * fdArray = 0x1778f464738) [C:\sdk\src\apr\include\arch\win32\apr_arch_misc.h @ 477]
   libapr_1!impl_pollset_poll+0xc5(apr_pollset_t * pollset = 0x1778f4646c0, __int64 timeout = 0, int * num = 0x134bfffb78 : 0, apr_pollfd_t * * descriptors = 0x134bfffb60, apr_pollfd_t * * descriptors = 0x134bfffb60) [C:\sdk\src\apr\poll\unix\poll.c @ 263]
   mod_http2!mplx_pollset_poll+0x136(h2_mplx * m = 0x1778f44c7d8, __int64 timeout = 0, void * on_ctx = 0x1778f44c160, void (__cdecl*)(void *,h2_stream *) on_stream_input = 0x44894c20244c894c : 0x44894c20244c894c, void (__cdecl*)(void *,h2_stream *) on_stream_output =    0x44894c20244c894c : 0x44894c20244c894c) [C:\sdk\src\mod_h2\mod_http2\h2_mplx.c @ 1210]
   mod_http2!h2_mplx_c1_poll+0x40(h2_mplx * m = 0x1778f44c7d8) [C:\sdk\src\mod_h2\mod_http2\h2_mplx.c @ 637]
   mod_http2!h2_session_process+0xce3(h2_session * session = 0x1778f44c160, int async = 610044232) [C:\sdk\src\mod_h2\mod_http2\h2_session.c @ 1911]
   mod_http2!h2_c1_run+0xaf(conn_rec * c = 0x1778f5004d8) [C:\sdk\src\mod_h2\mod_http2\h2_c1.c @ 132]
   mod_http2!h2_c1_hook_process_connection+0x4e3(conn_rec * c = 0x1778f5004d8) [C:\sdk\src\mod_h2\mod_http2\h2_c1.c @ 279]
   libhttpd!ap_run_process_connection+0x35(conn_rec * c = 0x1778f5004d8) [C:\sdk\src\httpd\server\connection.c @ 42]
   libhttpd!ap_process_connection+0x57(void * csd = 0x1778f5002c0) [C:\sdk\src\httpd\server\connection.c @ 217]
   libhttpd!worker_main+0x3a8(void * thread_num_val = 0xc5) [C:\sdk\src\httpd\server\mpm\winnt\child.c @ 846]
   KERNEL32!BaseThreadInitThunk+0x1d()
   ntdll!RtlUserThreadStart+0x28

...and to be clear about it...

diff --git "a/mod_http2/h2_mplx.c" "b/mod_http2/h2_mplx.c"
index 9f9ce80..1823d5c 100644
--- "a/mod_http2/h2_mplx.c"
+++ "b/mod_http2/h2_mplx.c"
@@ -1207,6 +1207,7 @@ static apr_status_t mplx_pollset_poll(h2_mplx *m, apr_interval_time_t timeout,
 
             H2_MPLX_LEAVE(m);
             rv = apr_pollset_poll(m->pollset, timeout >= 0? timeout : -1, &nresults, &results);
+			ap_log_cerror(APLOG_MARK, APLOG_ERR, rv, m->c1, APLOGNO(10310) H2_MPLX_MSG(m, "###### "));
             H2_MPLX_ENTER_ALWAYS(m);
             if (APR_STATUS_IS_EINTR(rv) && m->shutdown) {
                 if (!m->aborted) {

indeed, log show an infinite loop after graceful restart..

So

  1. did I configure something wrong? (If you need other information about my configuration, I will give it to you)
  2. if we need to investigate further, tell me what to do
  3. ⚠️ by starting (httpd -k start) with only changing to H2Upgrade Off, I directly fall into the infinite loop
@icing
Copy link
Owner

icing commented Apr 2, 2024

I am a bit confused. Does the browser make WebSocket connections over HTTP/2? It should not, if the server does not support it. Is this another request that produces the infinite loop? Strange.

Can you produce a log with LogLevel http2:trace2 on such a situation?

@nono303
Copy link
Author

nono303 commented Apr 2, 2024

I am a bit confused

same for me cause my big picture understanding is that in this situation, h2_switch.c would have to decline and forward to mod_proxy_wstunnel....

Does the browser make WebSocket connections over HTTP/2

no, over HTTP/1.1
2024-04-02 17:52:06.551 proxy-server "GET /grafana/api/live/ws HTTP/1.1" 101 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0" sI:2676 sO:->0:-:- sT:3657 tI:- tO:24711096 ka:0-

Can you produce a log with LogLevel http2:trace2 on such a situation?

just have this relevant:
2024-04-02 17:52:06.551648 grafana.mydomain.com 127.0.0.1:50694 ZgwppqjK1zgVxTggMqo8iAAAAMM http2:debug h2_switch.c(92) pid:13140 tid:4048 ka:0 [AH03085: upgrade without HTTP2-Settings declined]
full trace after a graceful restart: http2-trace2.txt

@nono303
Copy link
Author

nono303 commented Apr 5, 2024

Hi @icing,
Continuing digging on it…

To be clear:

  • Disabling load_module h2 or changing Protocols h2 http/1.1 > Protocols http/1.1 h2
    • ✅ All works fine in http/1.1
  • Remove upgrade=websocket in ProxyPass
    • ✅ All works fine in http2 except websocket that are not upgraded

But as you said, I’m not sure that the infinite loop is originated by a gracefully restarted websocket cnx!
As server-status scoreboard doesn’t show tid in mpm_winnt mode (only pid of main process) and as I don’t see issue in scoreboard nor broken cnx in browser network call stack, I suspect to have some old connection (from previous httpd forked process before graceful restart) to be stuck

  • either on client side brower <> httpd
  • or in backend side (no cnx exception in logs) httpd <> [grafana|mattermost|gitlab]

As it’s difficult to work on Windows with tracing on a “kill & up process fork” and as I’m not fluent with APR, would it be possible to give a tmp patch (better than mine... easy ;) to trace which request are stuck in the infinite loop?
The idea is to quickly have maximum of information (tid, request uri rtime, etc.) when it's stuck in do while greater than n time (like 1000 or whatever) and exit.

@icing
Copy link
Owner

icing commented Apr 9, 2024

If mod_http2 is indeed polling in a loop, as your stacktrace indicates, something weird is going on. The trace is a call from h2_session.c:1911 where the session is in state H2_SESSION_ST_BUSY. When the poll returns, it calls h2_c1_read(session). That call should either get data from the client or move the session to another state.

I think we really need a log from such a case with LogLevel http2:debug to see what state changes the HTTP/2 session does.

@nono303
Copy link
Author

nono303 commented May 26, 2024

Hi @icing,I'm back on it
Yesterday I put the http2:debug logs and reactivated the configuration Protocols h2 http/1.1 acme-tls/1
Surprise! No infinite loop has been triggered in the last 12 hours (and the usage /throughput is globally the same)

Originally, just had to wait few second after a graceful to see 1 to 4 thread in infinite loop stae regarding cpu usage

Changes since last month:

  • I don't think that the debug logs inhibit the issue^^
  • Updated the drivers of my network card (Intel 82599 10gbps: 4.1.251.0 > 4.1.254.0
    • The only notable change is that the Encapsulated Packet Task Offload property is set to 0... think unrelated
  • nghttp2 1.60 > 1.62.1 🤔

I'll keep you posted...

@icing
Copy link
Owner

icing commented May 26, 2024

Thanks for the update. This is a mysterious one indeed.

@nono303
Copy link
Author

nono303 commented May 28, 2024

Hi @icing, I think I've caught the loop
Use case:

  1. start httpd
  2. launch browser tabs for mattermost, grafana, webmail & mainsite (4 different vhost proxing different backend expect mainsite)
[AH03079: h2_session(10292-0,INIT,0): started on mattermost:443]
[AH03079: h2_session(10292-1,INIT,0): started on webmail:443]
[AH03079: h2_session(10292-2,INIT,0): started on grafana:443]
[AH03079: h2_session(10292-3,INIT,0): started on mainsite:443]
[AH03079: h2_session(10292-5,INIT,0): started on mattermost:443]
[AH03079: h2_session(10292-6,INIT,0): started on mattermost:443]
[AH03079: h2_session(10292-7,INIT,0): started on grafana:443]
[AH03079: h2_session(10292-8,INIT,0): started on mainsite:443]
  • all are well exchanging in h2
  • mattermost & mrafana have a "101" upgrade websocket http1./1 active

mattermost & mrafana proxy config

RequestHeader set X-Forwarded-Proto "https"
RequestHeader set X-Forwarded-Ssl "on"
ProxyPass / http://backend/ upgrade=websocket retry=0 keepalive=On timeout=10
ProxyPassReverse / http://backend /
  1. few seconds later, a worker (Windows thread) is 100% cpu for ~1 minute
  2. waiting it to stop looping
  3. kill browser
  4. wait 5 minutes
  • Server-status : http/1.1 websocket workers still in "W" Sending Reply
  • Netstat : cnx httpd <> mattermost still in close wait
[httpd.exe] TCP    httpd.ip:59709     mattermost.ip:8065    CLOSE_WAIT
[httpd.exe] TCP    httpd.ip:59765     mattermost.ip:8065    CLOSE_WAIT
[httpd.exe] TCP    httpd.ip:59795     mattermost.ip:8065    CLOSE_WAIT
[httpd.exe] TCP    httpd.ip:59798     mattermost.ip:8065    CLOSE_WAIT
[httpd.exe] TCP    httpd.ip:59799     mattermost.ip:8065    CLOSE_WAIT
[httpd.exe] TCP    httpd.ip:59802     mattermost.ip:8065    CLOSE_WAIT

Here is the full 2 minutes httpd-error.log
Hard to read for me but... We see the 4 websocket connection declined [AH03085: upgrade without HTTP2-Settings declined]
So hypothesis that I don't control... Is there any chance that a pointer of these declined connections is still referenced in the h2_session and produce a "busy" zombie
I say this after reading the lines below which correspond almost to the moment when the infinite loop ended

2024-05-28 08:02:22.967261 - client_ip:60500 - http2:debug h2_session.c(1393) pid:10292 tid:3784 ka:0 [AH03078: h2_session(10292-5,BUSY,0): transit [BUSY] -- input exhausted, no streams --> [IDLE]]
2024-05-28 08:02:23.429743 - client_ip:60500 - http2:debug h2_session.c(1393) pid:10292 tid:3784 ka:0 [AH03078: h2_session(10292-5,BUSY,0): transit [BUSY] -- input exhausted, no streams --> [IDLE]]
2024-05-28 08:02:23.451635 - client_ip:60500 - http2:debug h2_session.c(1393) pid:10292 tid:3784 ka:0 [AH03078: h2_session(10292-5,BUSY,0): transit [BUSY] -- input exhausted, no streams --> [IDLE]]
2024-05-28 08:02:23.527443 - client_ip:60500 - http2:debug h2_session.c(1393) pid:10292 tid:3784 ka:0 [AH03078: h2_session(10292-5,BUSY,0): transit [BUSY] -- input exhausted, no streams --> [IDLE]]

That's the context. If you need more traces, information... just tell me

@nono303
Copy link
Author

nono303 commented Oct 19, 2024

Hi @icing,

I sometime continue diging on this issue and as my dbg is currently “hot” with another bug (php), Here are the new some fresh traces and clues.

just to reminde

  • The 100% cpu on one or mode thread only occurs when I activate Protocols h2 http/1.1 acme-tls/1 at server level for all my vhosts and having some ProxyPass www upgrade=websocket on some vhosts
  • I can only reproduce the issue on a graceful restart with some active connection
  • The 100% cpu on one or more thread occur within few second after the graceful restart

Affected thread only hang some h2 (non websocket) connection de be closed (at least I've never seen)

Ex.

00000000`97FFF6F0 00000000`1C7ECFE0  -> AH03045: h2_session(%d-%lu,%s,%d): process, closing conn
00000000`97FFECF0 00000000`97FFED00  -> : stream 279, GET /plugins/com.mattermost.apps/api/v1/bindings?channel_id=xxx&user_agent=webapp

I always have the same call stack :


[mod_http2.so!mplx_pollset_poll(h2_mplx * m, __int64 timeout, void(*)(void *, h2_stream *) on_ctx, void(*)(void *, h2_stream *)) Line 1234
	at C:\sdk\src\mod_h2\mod_http2\h2_mplx.c(1234)
[Inline Frame] mod_http2.so!h2_mplx_c1_poll(h2_mplx * m, __int64) Line 661
	at C:\sdk\src\mod_h2\mod_http2\h2_mplx.c(661)
mod_http2.so!h2_session_process(h2_session * session, int async, int * pkeepalive) Line 1925
	at C:\sdk\src\mod_h2\mod_http2\h2_session.c(1925)
mod_http2.so!h2_c1_run(conn_rec * c) Line 135
	at C:\sdk\src\mod_h2\mod_http2\h2_c1.c(135)
[Inline Frame] libhttpd.dll!ap_run_process_connection(conn_rec *) Line 42
	at C:\sdk\src\httpd\server\connection.c(42)
[Inline Frame] libhttpd.dll!ap_process_connection(conn_rec * csd, void *) Line 217
	at C:\sdk\src\httpd\server\connection.c(217)](url)

refreshing the stack, line only change in impl_pollset_poll() method

tid 34272: libapr-1.dll!impl_pollset_poll(apr_pollset_t * pollset, __int64 timeout, int * num, const apr_pollfd_t * * descriptors) Line 288
	at C:\sdk\src\apr\poll\unix\poll.c(288)
tid 5152: libapr-1.dll!impl_pollset_poll(apr_pollset_t * pollset, __int64 timeout, int * num, const apr_pollfd_t * * descriptors) Line 263
	at C:\sdk\src\apr\poll\unix\poll.c(263)

This confirm that the condition } while (APR_STATUS_IS_EINTR(rv)); l.1246 is always true in mplx_pollset_poll and imply the infinite 100% cpu loop (rv value is always 4)
Beside the fact that APR poll implementation is in an unix folder, I just googled and find that “wsa_poll cpu” had a lot of results!

like https://lists.apache.org/thread/k5sjldo71gw8w9cwov29tol99ybd894z referring to
https://learn.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-wsapoll

that's where my understanding ends 😊

@nono303
Copy link
Author

nono303 commented Oct 19, 2024

Note:
I found many reference to https://daniel.haxx.se/blog/2012/10/10/wsapoll-is-broken/ with some patch / alternative to WSAPoll()

In 2023, a user made me aware that the Microsoft documentation now says:
Note As of Windows 10 version 2004, when a TCP socket fails to connect, (POLLHUP | POLLERR | POLLWRNORM) is indicated.
Maybe it is time to do new tests.

@icing
Copy link
Owner

icing commented Oct 21, 2024

if the Windows poll always returns EINTR, this would lead to a busy loop. I think you are correct in your read on this. You could try the following patch that will treat EINTR as an error during shutdowns.

Index: modules/http2/h2_mplx.c
===================================================================
--- modules/http2/h2_mplx.c	(Revision 1921209)
+++ modules/http2/h2_mplx.c	(Arbeitskopie)
@@ -1232,9 +1232,15 @@
             rv = apr_pollset_poll(m->pollset, timeout >= 0? timeout : -1, &nresults, &results);
             H2_MPLX_ENTER_ALWAYS(m);
             if (APR_STATUS_IS_EINTR(rv) && m->shutdown) {
+#ifdef WIN32
+            /* See https://github.com/icing/mod_h2/issues/277
+             * WSAPoll seems to always deliver EINTR during server graceful
+             * restarts. Treat this is an error condition */
+#else
                 if (!m->aborted) {
                     rv = APR_SUCCESS;
                 }
+#endif
                 goto cleanup;
             }
         } while (APR_STATUS_IS_EINTR(rv));

@nono303
Copy link
Author

nono303 commented Oct 21, 2024

Hi @icing, Thx for the feedback!
I’ve just applied the patch but that didn’t correspond to the infinite loop use case because m-> shutdown is false
I added a log in the loop and here is the context which come across every time

http2:error h2_mplx.c(1235) pid:15140 tid:3800 ka:0 [AH10310: h2_mplx(15140-0): loop rv=4 eintr=1 shutdown=0 aborted=0 pcount=0]

diff --git a/mod_http2/h2_mplx.c b/mod_http2/h2_mplx.c
index 71a7431..fc9b81e 100644
--- a/mod_http2/h2_mplx.c
+++ b/mod_http2/h2_mplx.c
@@ -1232,10 +1232,17 @@ static apr_status_t mplx_pollset_poll(h2_mplx *m, apr_interval_time_t timeout,
             H2_MPLX_LEAVE(m);
             rv = apr_pollset_poll(m->pollset, timeout >= 0? timeout : -1, &nresults, &results);
             H2_MPLX_ENTER_ALWAYS(m);
+			ap_log_cerror(APLOG_MARK, APLOG_ERR, rv, m->c1, APLOGNO(10310) H2_MPLX_MSG(m, "loop rv=%d eintr=%d shutdown=%d aborted=%d pcount=%d"),(int)rv,APR_STATUS_IS_EINTR(rv),m->shutdown,m->aborted,m->processing_count);
             if (APR_STATUS_IS_EINTR(rv) && m->shutdown) {
+#ifdef WIN32

+            /* See https://github.com/icing/mod_h2/issues/277

+             * WSAPoll seems to always deliver EINTR during server graceful

+             * restarts. Treat this is an error condition */

+#else

                 if (!m->aborted) {
                     rv = APR_SUCCESS;
                 }
+#endif

                 goto cleanup;
             }
         } while (APR_STATUS_IS_EINTR(rv));

if other information is needed in this line for debugging, just tell me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants