issue: 3795997 Allow split segment with unacked q #145

Open
wants to merge 169 commits into base: vNext

Conversation

@iftahl (Collaborator) commented May 5, 2024

When the send window is not big enough for the TCP segment we need to send, we may split the segment so that it fits into the window. Before this change, we did not split the segment when there were unacked segments in flight. The rationale was that we expect ACKs for the in-flight segments, which will trigger the next send operation. This flow depends on the RTT of receiving ACKs, which may be delayed by the remote side. When the RTT is long, we would block sending even though the TCP send window allows it.

The change is to split TCP segments even when there is unacked data, provided the send window is big enough (at least one MSS).
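
As a rough sketch, the condition change can be pictured like this; `tx_state`, `snd_wnd`, `mss`, and `unacked_cnt` are illustrative names, not XLIO's actual identifiers:

```cpp
#include <cstdint>
#include <cstdio>

// Simplified model of the send-path decision described above.
struct tx_state {
    uint32_t snd_wnd;     // usable TCP send window (bytes)
    uint32_t mss;         // maximum segment size
    uint32_t unacked_cnt; // number of in-flight (unacked) segments
};

// Before: never split an oversized segment while data is in flight.
static bool may_split_before(const tx_state &s, uint32_t seg_len) {
    return seg_len > s.snd_wnd && s.unacked_cnt == 0;
}

// After: split whenever at least one MSS still fits in the window,
// so a long RTT no longer stalls transmission.
static bool may_split_after(const tx_state &s, uint32_t seg_len) {
    return seg_len > s.snd_wnd && s.snd_wnd >= s.mss;
}

int main() {
    tx_state s{4096, 1460, 3}; // window open, but 3 segments unacked
    std::printf("before: %d, after: %d\n",
                may_split_before(s, 8192), may_split_after(s, 8192));
    return 0;
}
```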

Change type

What kind of change does this PR introduce?

  • Bugfix
  • Feature
  • Code style update
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • CI related changes
  • Documentation content changes
  • Tests
  • Other

Check list

  • Code follows the de facto style guidelines of this project
  • Comments have been inserted in hard to understand places
  • Documentation has been updated (if necessary)
  • Test has been added (if possible)

AlexanderGrissik and others added 30 commits January 14, 2024 16:02
Signed-off-by: Alexander Grissik <[email protected]>
At most a single element of this vector is ever used.
Once the rfs constructor completes, there must be exactly one attach_flow_data element in the ring_simple case.
For ring_tap this element remains null.

Signed-off-by: Alexander Grissik <[email protected]>
Signed-off-by: Alexander Grissik <[email protected]>
Set errno to ETIMEDOUT and return -1 from recv() when a socket has timed out, instead of returning 0 with errno 0.
For instance, in the case of a TCP keepalive timeout.

Signed-off-by: Alexander Grissik <[email protected]>
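
A hedged illustration of how a caller would observe the new behavior; `read_some` is a hypothetical helper, not part of XLIO:

```cpp
#include <cerrno>
#include <cstdio>
#include <sys/socket.h>
#include <sys/types.h>

// Under the new behavior, a timed-out socket (e.g. TCP keepalive
// timeout) makes recv() return -1 with errno == ETIMEDOUT instead of
// the old 0 return with errno 0, so it is distinguishable from EOF.
ssize_t read_some(int fd, char *buf, size_t len) {
    ssize_t n = recv(fd, buf, len, 0);
    if (n == 0) {
        std::printf("peer performed an orderly shutdown\n");
    } else if (n < 0 && errno == ETIMEDOUT) {
        std::printf("socket timed out\n");
    }
    return n;
}
```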
The idea is to scan all rpm/deb packages for personal emails;
we should not be releasing packages with such emails.
The scan is done on both the metadata info and the changelog
of each specific package.

Issue: HPCINFRA-919
Signed-off-by: Daniel Pressler <[email protected]>
pasis and others added 21 commits April 1, 2024 00:09
The XLIO Socket API must guarantee that XLIO_SOCKET_EVENT_TERMINATED is
not followed by any other events. Therefore, all the TX completion
events must have been delivered by that moment.

Do a polling iteration before calling the socket destructor to increase the
chance that all the relevant WQEs have completed. This mechanism needs to
be improved in the future.

Signed-off-by: Dmytro Podgornyi <[email protected]>
xlio_init_ex() changes some default parameters. However, a global object
can trigger the safe_mce_sys() constructor at startup. Therefore, we need
to re-read the environment variables to guarantee that the changed
parameters take effect.

Signed-off-by: Dmytro Podgornyi <[email protected]>
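
A hypothetical, self-contained illustration of the ordering problem and the re-read; `env_params` and `init_with_overrides` are made-up names for the sketch:

```cpp
#include <cstdlib>

// A global object constructed before main() may snapshot the
// environment; overrides applied later must trigger a re-read.
struct env_params {
    long memory_limit = 0;
    void read_env() {
        if (const char *v = std::getenv("XLIO_MEMORY_LIMIT"))
            memory_limit = std::strtol(v, nullptr, 0);
    }
};

static env_params &params() {
    static env_params p; // may already exist before init is called
    return p;
}

void init_with_overrides() {
    setenv("XLIO_MEMORY_LIMIT", "2147483648", 1); // override a default
    params().read_env(); // re-read so the override takes effect
}
```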
Avoid using connect() with the sock fd interface, because fd_collection
doesn't keep xlio_socket_t objects.

Signed-off-by: Dmytro Podgornyi <[email protected]>
xlio_socket_t objects aren't connected to the fd_collection anymore.
Therefore, all the methods must be called from the sockinfo_tcp objects
directly.

Also, xlio_socket_fd() is not relevant anymore and can be removed.

Signed-off-by: Dmytro Podgornyi <[email protected]>
Fix iterating over the std::list of TCP sockets while
erasing a socket during the iteration.
Solved by incrementing the iterator before the erase.

Signed-off-by: Iftah Levi <[email protected]>
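
A minimal sketch of the fix pattern for `std::list`, where only the erased element's iterator is invalidated:

```cpp
#include <list>

// Advance the iterator before calling erase() so the erased element's
// iterator is never reused on the next loop step.
void erase_closed(std::list<int> &socks) {
    for (auto it = socks.begin(); it != socks.end();) {
        if (*it < 0) {         // pretend a negative fd means "closed"
            socks.erase(it++); // increment first, then erase the old position
        } else {
            ++it;
        }
    }
}
```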
rdma-core limits the number of UARs per context to 16 by default. After
creating 16 QPs, XLIO receives duplicates of the BlueFlame registers for
each subsequent QP. As a result, the BlueFlame doorbell method can write
WQEs concurrently without serialization, and this leads to data corruption.

BlueFlame can impact throughput, since the copy to the BlueFlame
register is expensive. It can improve latency in some low-latency
scenarios; however, XLIO targets high traffic/PPS rates.
Removing the BlueFlame method also slightly improves performance in some
scenarios.

BlueFlame can be brought back in the future to improve low-latency
scenarios; however, that will require some rework to avoid the data
corruption.

Signed-off-by: Dmytro Podgornyi <[email protected]>
The inline WQE branch is unlikely to be taken in most throughput scenarios.

Signed-off-by: Dmytro Podgornyi <[email protected]>
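
A sketch of this kind of branch annotation; `xlio_unlikely` and `post_send` are illustrative names, though the `__builtin_expect` wrapper is a common idiom:

```cpp
#include <cstddef>

// Mark the inline-WQE path as unlikely so the compiler keeps the common
// non-inline path on the hot, fall-through side.
#define xlio_unlikely(x) __builtin_expect(!!(x), 0)

void post_send(std::size_t len, std::size_t inline_threshold) {
    if (xlio_unlikely(len <= inline_threshold)) {
        // copy the payload directly into the WQE (inline send)
    } else {
        // post a gather entry pointing at the payload buffer (common case)
    }
}
```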
Avoid calling register_socket_timer_event when a socket is already registered (TIME-WAIT).
Although there is no functional issue with that, it produces too high a rate of event posting for the internal thread.
This leads to lock contention inside the internal thread and degraded HTTP CPS performance.

Signed-off-by: Alexander Grissik <[email protected]>
Signed-off-by: Gal Noam <[email protected]>
UTLS uses tcp_tx_express() for non-blocking sockets. However, this TX
method doesn't support XLIO_RX_POLL_ON_TX_TCP. The additional RX polling
improves scenarios such as web servers.

Insert RX polling into the UTLS TX path to resolve the performance degradation.

Signed-off-by: Dmytro Podgornyi <[email protected]>
In heavy CPS scenarios a socket may enter the TIME-WAIT state and be reused before the first TCP timer registration is performed by the internal thread.
1. Setting timer_registered=true while posting the event prevents a second attempt to post the event again.
2. Adding a sanity check in add_new_timer that verifies the socket is not already in the timer map.

Signed-off-by: Alexander Grissik <[email protected]>
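
A hypothetical sketch of the guard in point 1; the names are illustrative, not XLIO's actual code:

```cpp
#include <atomic>

struct tcp_sock {
    std::atomic<bool> timer_registered{false};
};

void post_timer_registration(tcp_sock &s) {
    // Mark the socket as registered at posting time, not when the
    // internal thread later processes the event, so a socket reused
    // out of TIME-WAIT cannot post a duplicate registration.
    bool expected = false;
    if (s.timer_registered.compare_exchange_strong(expected, true)) {
        // ... post the registration event to the internal thread ...
    }
}
```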
Added a new env parameter: XLIO_MAX_TSO_SIZE.
It allows the user to control the maximum TSO size,
instead of taking the maximum capability of the HW.
The default size is 256KB (the maximum of current HW).
Values higher than the HW capability won't be taken into account.

Signed-off-by: Iftah Levi <[email protected]>
Signed-off-by: Gal Noam <[email protected]>
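
An illustrative clamp of the new parameter against the HW capability; the parsing and function name are assumptions, not XLIO's actual code:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>

uint32_t effective_max_tso(uint32_t hw_max_tso) {
    uint32_t requested = 256U * 1024U; // documented default: 256KB
    if (const char *env = std::getenv("XLIO_MAX_TSO_SIZE"))
        requested = static_cast<uint32_t>(std::strtoul(env, nullptr, 0));
    // Values above the HW capability won't be taken into account.
    return std::min(requested, hw_max_tso);
}
```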
When sock_stats was static, its destructor was called before xlio_exit, which destroys the internal thread, which in turn destroys sockets.
We should avoid global objects with non-trivial constructors/destructors, since there is no control over their execution order.

Signed-off-by: Alexander Grissik <[email protected]>
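
One common way to avoid this class of problem is the construct-on-first-use idiom with an intentional leak; a sketch, not necessarily the exact fix applied here:

```cpp
// The instance is intentionally leaked, so it can never be destroyed
// before late shutdown paths (e.g. xlio_exit) that still use it.
struct sock_stats_t {
    // counters, per-socket maps, ...
};

sock_stats_t &sock_stats() {
    static sock_stats_t *instance = new sock_stats_t(); // never deleted
    return *instance;
}
```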
When a TCP socket is destroyed, it frees the preallocated buffers after the dst_entry is deleted.
This returns the buffers directly to the global pool and breaks the m_tx_num_bufs and m_zc_num_bufs ring counters.

1. Move the preallocated buffers cleanup before dst_entry destruction.
2. Add ring stats for m_tx_num_bufs and m_zc_num_bufs.

Signed-off-by: Alexander Grissik <[email protected]>
1. Removing the hardcoded check that switches AIM to latency mode.
In the case of a low packet rate, the calculation results in a 0 count anyway.
In case the packet rate is higher than the desired interrupt rate, we do want to utilize AIM correctly.
2. Changing the default AIM values to more reasonable ones.
3. Removing the Nginx-specific default values and using AIM by default.
This significantly improves CPU utilization in lightly congested cases.

Signed-off-by: Alexander Grissik <[email protected]>
These parameters are deprecated and will be removed in the future. Use
XLIO_MEMORY_LIMIT instead.

Signed-off-by: Dmytro Podgornyi <[email protected]>
MCE_MAX_CQ_POLL_BATCH usage requires it to be small enough. However,
this is a logical upper limit and we want to be able to raise it if
necessary.

Remove unused cq_mgr_tx::clean_cq() which uses MCE_MAX_CQ_POLL_BATCH
for an array on stack.

Adjust the condition for RX buffer compensation to remove
MCE_MAX_CQ_POLL_BATCH. However, this changes the logic: now we
forcibly compensate only the last RX buffer in the RQ.

Signed-off-by: Dmytro Podgornyi <[email protected]>
MCE_MAX_CQ_POLL_BATCH is a logical upper limit for the CQ polling batch
size. There is no hard limitation on it, so raise it to the maximum
CQ size.

The value can even exceed the CQ size, because the HW continues receiving
packets during polling.

By default, this change has no effect unless a higher value
for XLIO_CQ_POLL_BATCH_MAX is set explicitly. This can be helpful
in a scenario where a high traffic rate stops for a long time and the
number of packets in an RQ exceeds the batch size.

Signed-off-by: Dmytro Podgornyi <[email protected]>
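
A sketch of batch polling with a runtime-configured cap instead of a fixed stack array; `poll_cq_batch` and `batch_max` are illustrative, though `ibv_poll_cq` is the standard verbs call:

```cpp
#include <vector>
#include <infiniband/verbs.h>

// Poll up to a runtime-configured batch of completions per iteration;
// batch_max is assumed to come from XLIO_CQ_POLL_BATCH_MAX.
int poll_cq_batch(ibv_cq *cq, unsigned batch_max) {
    std::vector<ibv_wc> wcs(batch_max); // heap-allocated, so large batches are safe
    int n = ibv_poll_cq(cq, static_cast<int>(wcs.size()), wcs.data());
    for (int i = 0; i < n; ++i) {
        // ... process wcs[i] ...
    }
    return n;
}
```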
Signed-off-by: Gal Noam <[email protected]>
When the send window is not big enough for the TCP segment we need to send,
we may split the segment so that it fits into the window.
Before this change, we did not split the segment when there were unacked segments in flight.
The rationale was that we expect ACKs for the in-flight segments,
which will trigger the next send operation.
This flow depends on the RTT of receiving ACKs, which may be delayed by the remote side.
When the RTT is long, we would block sending even though the TCP send window allows it.

The change is to split TCP segments even when there is unacked data,
provided the send window is big enough (at least one MSS).

Signed-off-by: Iftah Levi <[email protected]>
@iftahl (Collaborator, Author) commented May 9, 2024

bot:retest

@iftahl (Collaborator, Author) commented May 12, 2024

@AlexanderGrissik can we merge it?

@galnoam (Collaborator) commented Jun 5, 2024

@iftahl, please add statistics, as @AlexanderGrissik requested.
