infinite loop on windows graceful restart with proxy websocket #277

nono303 · 2024-04-02T14:24:00Z

Hi @icing
I have a bug with mod_h2 and "proxyfied" websockets when restarting the service (graceful I suppose) on Windows
This problem reminds me of https://bz.apache.org/bugzilla/show_bug.cgi?id=65180 but is not fully identical.

httpd 2.4.58 self compiled vs17 x64 mod_h2 2.0.26

ProtocolsHonorOrder On
Protocols h2 h2c http/1.1 acme-tls/1
H2Upgrade On
H2Direct On
H2Push On
H2EarlyHints On
H2ModernTLSOnly On
H2WebSockets Off # as unsupported on Windows
H2WindowSize 1460000000
H2StreamMaxMemSize 1460000000
H2MinWorkers 8
H2MaxWorkers 100
H2MaxSessionStreams 100
H2MaxDataFrameLen 0

I have 3 vhosts configured (grafana, gitlab and mattermost) like this (with different ports...)

  ProxyPass / http://127.0.0.1:7000/ upgrade=websocket nocanon retry=0 keepalive=On
  ProxyPassReverse / http://127.0.0.1:7000/

Note: I did the test described below by deactivating nocanon retry=0 keepalive=On properties one by one but that doesn't change anything.

I start httpd and launch my browser tabs which requires a websocket connection
- ✅ httpd fork is ~100% idle
- F12 101 Switching Protocols for all and I see the exchanges in the console

   wss://gitlab.mydomain.com/-/cable
   wss://mattermost.mydomain.com/api/v4/websocket?connection_id=&sequence_number=0
   wss://grafana.mydomain.com/api/live/ws

I restart httpd (httpd -k restart) while still having my 3 tabs open (websocket connected)
- ⚠️ the newly created httpd fork is a ~25% sys and stay as it
I attach debugger and find an infinite loop on the 3 threads on this stack:

   ntdll!NtDeviceIoControlFile+0x14
   MSWSOCK!MSAFD_WSPPoll+0x30e
   MSWSOCK!WSPIoctl+0xed11()
   WS2_32!WSAIoctl+0x1c7()
   WS2_32!WSAPoll+0x22f
   libapr_1!apr_winapi_WSAPoll+0x1b(pollfd * fdArray = 0x1778f464738) [C:\sdk\src\apr\include\arch\win32\apr_arch_misc.h @ 477]
   libapr_1!impl_pollset_poll+0xc5(apr_pollset_t * pollset = 0x1778f4646c0, __int64 timeout = 0, int * num = 0x134bfffb78 : 0, apr_pollfd_t * * descriptors = 0x134bfffb60, apr_pollfd_t * * descriptors = 0x134bfffb60) [C:\sdk\src\apr\poll\unix\poll.c @ 263]
   mod_http2!mplx_pollset_poll+0x136(h2_mplx * m = 0x1778f44c7d8, __int64 timeout = 0, void * on_ctx = 0x1778f44c160, void (__cdecl*)(void *,h2_stream *) on_stream_input = 0x44894c20244c894c : 0x44894c20244c894c, void (__cdecl*)(void *,h2_stream *) on_stream_output =    0x44894c20244c894c : 0x44894c20244c894c) [C:\sdk\src\mod_h2\mod_http2\h2_mplx.c @ 1210]
   mod_http2!h2_mplx_c1_poll+0x40(h2_mplx * m = 0x1778f44c7d8) [C:\sdk\src\mod_h2\mod_http2\h2_mplx.c @ 637]
   mod_http2!h2_session_process+0xce3(h2_session * session = 0x1778f44c160, int async = 610044232) [C:\sdk\src\mod_h2\mod_http2\h2_session.c @ 1911]
   mod_http2!h2_c1_run+0xaf(conn_rec * c = 0x1778f5004d8) [C:\sdk\src\mod_h2\mod_http2\h2_c1.c @ 132]
   mod_http2!h2_c1_hook_process_connection+0x4e3(conn_rec * c = 0x1778f5004d8) [C:\sdk\src\mod_h2\mod_http2\h2_c1.c @ 279]
   libhttpd!ap_run_process_connection+0x35(conn_rec * c = 0x1778f5004d8) [C:\sdk\src\httpd\server\connection.c @ 42]
   libhttpd!ap_process_connection+0x57(void * csd = 0x1778f5002c0) [C:\sdk\src\httpd\server\connection.c @ 217]
   libhttpd!worker_main+0x3a8(void * thread_num_val = 0xc5) [C:\sdk\src\httpd\server\mpm\winnt\child.c @ 846]
   KERNEL32!BaseThreadInitThunk+0x1d()
   ntdll!RtlUserThreadStart+0x28

...and to be clear about it...

diff --git "a/mod_http2/h2_mplx.c" "b/mod_http2/h2_mplx.c"
index 9f9ce80..1823d5c 100644
--- "a/mod_http2/h2_mplx.c"
+++ "b/mod_http2/h2_mplx.c"
@@ -1207,6 +1207,7 @@ static apr_status_t mplx_pollset_poll(h2_mplx *m, apr_interval_time_t timeout,
 
             H2_MPLX_LEAVE(m);
             rv = apr_pollset_poll(m->pollset, timeout >= 0? timeout : -1, &nresults, &results);
+			ap_log_cerror(APLOG_MARK, APLOG_ERR, rv, m->c1, APLOGNO(10310) H2_MPLX_MSG(m, "###### "));
             H2_MPLX_ENTER_ALWAYS(m);
             if (APR_STATUS_IS_EINTR(rv) && m->shutdown) {
                 if (!m->aborted) {

indeed, log show an infinite loop after graceful restart..

So

did I configure something wrong? (If you need other information about my configuration, I will give it to you)
if we need to investigate further, tell me what to do
⚠️ by starting (httpd -k start) with only changing to H2Upgrade Off, I directly fall into the infinite loop

The text was updated successfully, but these errors were encountered:

icing · 2024-04-02T14:55:16Z

I am a bit confused. Does the browser make WebSocket connections over HTTP/2? It should not, if the server does not support it. Is this another request that produces the infinite loop? Strange.

Can you produce a log with LogLevel http2:trace2 on such a situation?

nono303 · 2024-04-02T16:19:25Z

I am a bit confused

same for me cause my big picture understanding is that in this situation, h2_switch.c would have to decline and forward to mod_proxy_wstunnel....

Does the browser make WebSocket connections over HTTP/2

no, over HTTP/1.1
2024-04-02 17:52:06.551 proxy-server "GET /grafana/api/live/ws HTTP/1.1" 101 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0" sI:2676 sO:->0:-:- sT:3657 tI:- tO:24711096 ka:0-

Can you produce a log with LogLevel http2:trace2 on such a situation?

just have this relevant:
2024-04-02 17:52:06.551648 grafana.mydomain.com 127.0.0.1:50694 ZgwppqjK1zgVxTggMqo8iAAAAMM http2:debug h2_switch.c(92) pid:13140 tid:4048 ka:0 [AH03085: upgrade without HTTP2-Settings declined]
full trace after a graceful restart: http2-trace2.txt

nono303 · 2024-04-05T12:37:10Z

Hi @icing,
Continuing digging on it…

To be clear:

Disabling load_module h2 or changing Protocols h2 http/1.1 > Protocols http/1.1 h2
- ✅ All works fine in http/1.1
Remove upgrade=websocket in ProxyPass
- ✅ All works fine in http2 except websocket that are not upgraded

But as you said, I’m not sure that the infinite loop is originated by a gracefully restarted websocket cnx!
As server-status scoreboard doesn’t show tid in mpm_winnt mode (only pid of main process) and as I don’t see issue in scoreboard nor broken cnx in browser network call stack, I suspect to have some old connection (from previous httpd forked process before graceful restart) to be stuck

either on client side brower <> httpd
or in backend side (no cnx exception in logs) httpd <> [grafana|mattermost|gitlab]

As it’s difficult to work on Windows with tracing on a “kill & up process fork” and as I’m not fluent with APR, would it be possible to give a tmp patch (better than mine... easy ;) to trace which request are stuck in the infinite loop?
The idea is to quickly have maximum of information (tid, request uri rtime, etc.) when it's stuck in do while greater than n time (like 1000 or whatever) and exit.

icing · 2024-04-09T10:58:37Z

If mod_http2 is indeed polling in a loop, as your stacktrace indicates, something weird is going on. The trace is a call from h2_session.c:1911 where the session is in state H2_SESSION_ST_BUSY. When the poll returns, it calls h2_c1_read(session). That call should either get data from the client or move the session to another state.

I think we really need a log from such a case with LogLevel http2:debug to see what state changes the HTTP/2 session does.

nono303 · 2024-05-26T09:23:57Z

Hi @icing,I'm back on it
Yesterday I put the http2:debug logs and reactivated the configuration Protocols h2 http/1.1 acme-tls/1
Surprise! No infinite loop has been triggered in the last 12 hours (and the usage /throughput is globally the same)

Originally, just had to wait few second after a graceful to see 1 to 4 thread in infinite loop stae regarding cpu usage

Changes since last month:

I don't think that the debug logs inhibit the issue^^
Updated the drivers of my network card (Intel 82599 10gbps: 4.1.251.0 > 4.1.254.0
- The only notable change is that the Encapsulated Packet Task Offload property is set to 0... think unrelated
nghttp2 1.60 > 1.62.1 🤔

I'll keep you posted...

icing · 2024-05-26T11:12:46Z

Thanks for the update. This is a mysterious one indeed.

nono303 · 2024-05-28T07:25:47Z

Hi @icing, I think I've caught the loop
Use case:

start httpd
launch browser tabs for mattermost, grafana, webmail & mainsite (4 different vhost proxing different backend expect mainsite)

[AH03079: h2_session(10292-0,INIT,0): started on mattermost:443]
[AH03079: h2_session(10292-1,INIT,0): started on webmail:443]
[AH03079: h2_session(10292-2,INIT,0): started on grafana:443]
[AH03079: h2_session(10292-3,INIT,0): started on mainsite:443]
[AH03079: h2_session(10292-5,INIT,0): started on mattermost:443]
[AH03079: h2_session(10292-6,INIT,0): started on mattermost:443]
[AH03079: h2_session(10292-7,INIT,0): started on grafana:443]
[AH03079: h2_session(10292-8,INIT,0): started on mainsite:443]

all are well exchanging in h2
mattermost & mrafana have a "101" upgrade websocket http1./1 active

mattermost & mrafana proxy config

RequestHeader set X-Forwarded-Proto "https"
RequestHeader set X-Forwarded-Ssl "on"
ProxyPass / http://backend/ upgrade=websocket retry=0 keepalive=On timeout=10
ProxyPassReverse / http://backend /

few seconds later, a worker (Windows thread) is 100% cpu for ~1 minute
waiting it to stop looping
kill browser
wait 5 minutes

Server-status : http/1.1 websocket workers still in "W" Sending Reply
Netstat : cnx httpd <> mattermost still in close wait

[httpd.exe] TCP    httpd.ip:59709     mattermost.ip:8065    CLOSE_WAIT
[httpd.exe] TCP    httpd.ip:59765     mattermost.ip:8065    CLOSE_WAIT
[httpd.exe] TCP    httpd.ip:59795     mattermost.ip:8065    CLOSE_WAIT
[httpd.exe] TCP    httpd.ip:59798     mattermost.ip:8065    CLOSE_WAIT
[httpd.exe] TCP    httpd.ip:59799     mattermost.ip:8065    CLOSE_WAIT
[httpd.exe] TCP    httpd.ip:59802     mattermost.ip:8065    CLOSE_WAIT

Here is the full 2 minutes httpd-error.log
Hard to read for me but... We see the 4 websocket connection declined [AH03085: upgrade without HTTP2-Settings declined]
So hypothesis that I don't control... Is there any chance that a pointer of these declined connections is still referenced in the h2_session and produce a "busy" zombie
I say this after reading the lines below which correspond almost to the moment when the infinite loop ended

2024-05-28 08:02:22.967261 - client_ip:60500 - http2:debug h2_session.c(1393) pid:10292 tid:3784 ka:0 [AH03078: h2_session(10292-5,BUSY,0): transit [BUSY] -- input exhausted, no streams --> [IDLE]]
2024-05-28 08:02:23.429743 - client_ip:60500 - http2:debug h2_session.c(1393) pid:10292 tid:3784 ka:0 [AH03078: h2_session(10292-5,BUSY,0): transit [BUSY] -- input exhausted, no streams --> [IDLE]]
2024-05-28 08:02:23.451635 - client_ip:60500 - http2:debug h2_session.c(1393) pid:10292 tid:3784 ka:0 [AH03078: h2_session(10292-5,BUSY,0): transit [BUSY] -- input exhausted, no streams --> [IDLE]]
2024-05-28 08:02:23.527443 - client_ip:60500 - http2:debug h2_session.c(1393) pid:10292 tid:3784 ka:0 [AH03078: h2_session(10292-5,BUSY,0): transit [BUSY] -- input exhausted, no streams --> [IDLE]]

That's the context. If you need more traces, information... just tell me

nono303 · 2024-10-19T14:20:04Z

Hi @icing,

I sometime continue diging on this issue and as my dbg is currently “hot” with another bug (php), Here are the new some fresh traces and clues.

just to reminde

The 100% cpu on one or mode thread only occurs when I activate Protocols h2 http/1.1 acme-tls/1 at server level for all my vhosts and having some ProxyPass www upgrade=websocket on some vhosts
I can only reproduce the issue on a graceful restart with some active connection
The 100% cpu on one or more thread occur within few second after the graceful restart

⏩ Affected thread only hang some h2 (non websocket) connection de be closed (at least I've never seen)

Ex.

00000000`97FFF6F0 00000000`1C7ECFE0  -> AH03045: h2_session(%d-%lu,%s,%d): process, closing conn
00000000`97FFECF0 00000000`97FFED00  -> : stream 279, GET /plugins/com.mattermost.apps/api/v1/bindings?channel_id=xxx&user_agent=webapp

I always have the same call stack :


[mod_http2.so!mplx_pollset_poll(h2_mplx * m, __int64 timeout, void(*)(void *, h2_stream *) on_ctx, void(*)(void *, h2_stream *)) Line 1234
	at C:\sdk\src\mod_h2\mod_http2\h2_mplx.c(1234)
[Inline Frame] mod_http2.so!h2_mplx_c1_poll(h2_mplx * m, __int64) Line 661
	at C:\sdk\src\mod_h2\mod_http2\h2_mplx.c(661)
mod_http2.so!h2_session_process(h2_session * session, int async, int * pkeepalive) Line 1925
	at C:\sdk\src\mod_h2\mod_http2\h2_session.c(1925)
mod_http2.so!h2_c1_run(conn_rec * c) Line 135
	at C:\sdk\src\mod_h2\mod_http2\h2_c1.c(135)
[Inline Frame] libhttpd.dll!ap_run_process_connection(conn_rec *) Line 42
	at C:\sdk\src\httpd\server\connection.c(42)
[Inline Frame] libhttpd.dll!ap_process_connection(conn_rec * csd, void *) Line 217
	at C:\sdk\src\httpd\server\connection.c(217)](url)

refreshing the stack, line only change in impl_pollset_poll() method

tid 34272: libapr-1.dll!impl_pollset_poll(apr_pollset_t * pollset, __int64 timeout, int * num, const apr_pollfd_t * * descriptors) Line 288
	at C:\sdk\src\apr\poll\unix\poll.c(288)
tid 5152: libapr-1.dll!impl_pollset_poll(apr_pollset_t * pollset, __int64 timeout, int * num, const apr_pollfd_t * * descriptors) Line 263
	at C:\sdk\src\apr\poll\unix\poll.c(263)

This confirm that the condition } while (APR_STATUS_IS_EINTR(rv)); l.1246 is always true in mplx_pollset_poll and imply the infinite 100% cpu loop (rv value is always 4)
Beside the fact that APR poll implementation is in an unix folder, I just googled and find that “wsa_poll cpu” had a lot of results!

like https://lists.apache.org/thread/k5sjldo71gw8w9cwov29tol99ybd894z referring to
https://learn.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-wsapoll

that's where my understanding ends 😊

nono303 · 2024-10-19T15:25:26Z

Note:
I found many reference to https://daniel.haxx.se/blog/2012/10/10/wsapoll-is-broken/ with some patch / alternative to WSAPoll()

In 2023, a user made me aware that the Microsoft documentation now says:
Note As of Windows 10 version 2004, when a TCP socket fails to connect, (POLLHUP | POLLERR | POLLWRNORM) is indicated.
Maybe it is time to do new tests.

icing · 2024-10-21T08:23:37Z

if the Windows poll always returns EINTR, this would lead to a busy loop. I think you are correct in your read on this. You could try the following patch that will treat EINTR as an error during shutdowns.

Index: modules/http2/h2_mplx.c
===================================================================
--- modules/http2/h2_mplx.c	(Revision 1921209)
+++ modules/http2/h2_mplx.c	(Arbeitskopie)
@@ -1232,9 +1232,15 @@
             rv = apr_pollset_poll(m->pollset, timeout >= 0? timeout : -1, &nresults, &results);
             H2_MPLX_ENTER_ALWAYS(m);
             if (APR_STATUS_IS_EINTR(rv) && m->shutdown) {
+#ifdef WIN32
+            /* See https://github.com/icing/mod_h2/issues/277
+             * WSAPoll seems to always deliver EINTR during server graceful
+             * restarts. Treat this is an error condition */
+#else
                 if (!m->aborted) {
                     rv = APR_SUCCESS;
                 }
+#endif
                 goto cleanup;
             }
         } while (APR_STATUS_IS_EINTR(rv));

nono303 · 2024-10-21T12:39:02Z

Hi @icing, Thx for the feedback!
I’ve just applied the patch but that didn’t correspond to the infinite loop use case because m-> shutdown is false
I added a log in the loop and here is the context which come across every time

http2:error h2_mplx.c(1235) pid:15140 tid:3800 ka:0 [AH10310: h2_mplx(15140-0): loop rv=4 eintr=1 shutdown=0 aborted=0 pcount=0]

diff --git a/mod_http2/h2_mplx.c b/mod_http2/h2_mplx.c
index 71a7431..fc9b81e 100644
--- a/mod_http2/h2_mplx.c
+++ b/mod_http2/h2_mplx.c
@@ -1232,10 +1232,17 @@ static apr_status_t mplx_pollset_poll(h2_mplx *m, apr_interval_time_t timeout,
             H2_MPLX_LEAVE(m);
             rv = apr_pollset_poll(m->pollset, timeout >= 0? timeout : -1, &nresults, &results);
             H2_MPLX_ENTER_ALWAYS(m);
+			ap_log_cerror(APLOG_MARK, APLOG_ERR, rv, m->c1, APLOGNO(10310) H2_MPLX_MSG(m, "loop rv=%d eintr=%d shutdown=%d aborted=%d pcount=%d"),(int)rv,APR_STATUS_IS_EINTR(rv),m->shutdown,m->aborted,m->processing_count);
             if (APR_STATUS_IS_EINTR(rv) && m->shutdown) {
+#ifdef WIN32

+            /* See https://github.com/icing/mod_h2/issues/277

+             * WSAPoll seems to always deliver EINTR during server graceful

+             * restarts. Treat this is an error condition */

+#else

                 if (!m->aborted) {
                     rv = APR_SUCCESS;
                 }
+#endif

                 goto cleanup;
             }
         } while (APR_STATUS_IS_EINTR(rv));

if other information is needed in this line for debugging, just tell me

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

infinite loop on windows graceful restart with proxy websocket #277

infinite loop on windows graceful restart with proxy websocket #277

nono303 commented Apr 2, 2024

icing commented Apr 2, 2024

nono303 commented Apr 2, 2024

nono303 commented Apr 5, 2024

icing commented Apr 9, 2024

nono303 commented May 26, 2024 •

edited

Loading

icing commented May 26, 2024

nono303 commented May 28, 2024

nono303 commented Oct 19, 2024 •

edited

Loading

nono303 commented Oct 19, 2024

icing commented Oct 21, 2024

nono303 commented Oct 21, 2024

infinite loop on windows graceful restart with proxy websocket #277

infinite loop on windows graceful restart with proxy websocket #277

Comments

nono303 commented Apr 2, 2024

indeed, log show an infinite loop after graceful restart..

icing commented Apr 2, 2024

nono303 commented Apr 2, 2024

nono303 commented Apr 5, 2024

icing commented Apr 9, 2024

nono303 commented May 26, 2024 • edited Loading

icing commented May 26, 2024

nono303 commented May 28, 2024

nono303 commented Oct 19, 2024 • edited Loading

nono303 commented Oct 19, 2024

icing commented Oct 21, 2024

nono303 commented Oct 21, 2024

nono303 commented May 26, 2024 •

edited

Loading

nono303 commented Oct 19, 2024 •

edited

Loading