issue: 3795997 Allow split segment with unacked q #145

Open
wants to merge 169 commits into base: vNext

Conversation

@iftahl (Collaborator) commented May 5, 2024

When the send window is not big enough for the TCP segment we need to send, we may split the segment so that it fits into the window. Before this change, we did not split the segment when there were unacked segments in flight. The rationale was that we expect ACKs for the in-flight segments, which will trigger the next send operation. This flow depends on the RTT of receiving ACKs, which may be delayed by the remote side. When the RTT is long, we would block sending even though the TCP send window allows it.

The change is to split TCP segments even when there is unacked data, provided the send window is big enough (at least one MSS).
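
As a rough sketch, the condition change can be pictured like this; `tx_state`, `snd_wnd`, `mss`, and `unacked_cnt` are illustrative names, not XLIO's actual identifiers:

```cpp
#include <cstdint>
#include <cstdio>

// Simplified model of the send-path decision described above.
struct tx_state {
    uint32_t snd_wnd;     // usable TCP send window (bytes)
    uint32_t mss;         // maximum segment size
    uint32_t unacked_cnt; // number of in-flight (unacked) segments
};

// Before: never split an oversized segment while data is in flight.
static bool may_split_before(const tx_state &s, uint32_t seg_len) {
    return seg_len > s.snd_wnd && s.unacked_cnt == 0;
}

// After: split whenever at least one MSS still fits in the window,
// so a long RTT no longer stalls transmission.
static bool may_split_after(const tx_state &s, uint32_t seg_len) {
    return seg_len > s.snd_wnd && s.snd_wnd >= s.mss;
}

int main() {
    tx_state s{4096, 1460, 3}; // window open, but 3 segments unacked
    std::printf("before: %d, after: %d\n",
                may_split_before(s, 8192), may_split_after(s, 8192));
    return 0;
}
```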

Change type

What kind of change does this PR introduce?

  • Bugfix
  • Feature
  • Code style update
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • CI related changes
  • Documentation content changes
  • Tests
  • Other

Check list

  • Code follows the de facto style guidelines of this project
  • Comments have been inserted in hard to understand places
  • Documentation has been updated (if necessary)
  • Test has been added (if possible)

AlexanderGrissik and others added 30 commits January 14, 2024 16:02
Signed-off-by: Alexander Grissik <[email protected]>
At most a single element of this vector is ever used.
Once the rfs constructor completes, there must be exactly one attach_flow_data element in the ring_simple case.
For ring_tap this element remains null.

Signed-off-by: Alexander Grissik <[email protected]>
Signed-off-by: Alexander Grissik <[email protected]>
Set errno to ETIMEDOUT and return -1 from recv() when a socket has timed out, instead of returning 0 with errno 0.
For instance, in the case of a TCP keepalive timeout.

Signed-off-by: Alexander Grissik <[email protected]>
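
A hedged illustration of how a caller would observe the new behavior; `read_some` is a hypothetical helper, not part of XLIO:

```cpp
#include <cerrno>
#include <cstdio>
#include <sys/socket.h>
#include <sys/types.h>

// Under the new behavior, a timed-out socket (e.g. TCP keepalive
// timeout) makes recv() return -1 with errno == ETIMEDOUT instead of
// the old 0 return with errno 0, so it is distinguishable from EOF.
ssize_t read_some(int fd, char *buf, size_t len) {
    ssize_t n = recv(fd, buf, len, 0);
    if (n == 0) {
        std::printf("peer performed an orderly shutdown\n");
    } else if (n < 0 && errno == ETIMEDOUT) {
        std::printf("socket timed out\n");
    }
    return n;
}
```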
The idea is to scan all rpm/deb packages for personal emails;
we should not be releasing packages with such emails.
The scan is done on both the metadata info and the changelog
of each specific package.

Issue: HPCINFRA-919
Signed-off-by: Daniel Pressler <[email protected]>
pasis and others added 21 commits April 1, 2024 00:09
The XLIO Socket API must guarantee that XLIO_SOCKET_EVENT_TERMINATED is
not followed by any other events. Therefore, all the TX completion
events must have been delivered by that moment.

Do a polling iteration before calling the socket destructor to increase the
chance that all the relevant WQEs have completed. This mechanism needs to
be improved in the future.

Signed-off-by: Dmytro Podgornyi <[email protected]>
xlio_init_ex() changes some default parameters. However, a global object
can trigger the safe_mce_sys() constructor at startup. Therefore, we need
to re-read the environment variables to guarantee that the changed
parameters take effect.

Signed-off-by: Dmytro Podgornyi <[email protected]>
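
A hypothetical, self-contained illustration of the ordering problem and the re-read; `env_params` and `init_with_overrides` are made-up names for the sketch:

```cpp
#include <cstdlib>

// A global object constructed before main() may snapshot the
// environment; overrides applied later must trigger a re-read.
struct env_params {
    long memory_limit = 0;
    void read_env() {
        if (const char *v = std::getenv("XLIO_MEMORY_LIMIT"))
            memory_limit = std::strtol(v, nullptr, 0);
    }
};

static env_params &params() {
    static env_params p; // may already exist before init is called
    return p;
}

void init_with_overrides() {
    setenv("XLIO_MEMORY_LIMIT", "2147483648", 1); // override a default
    params().read_env(); // re-read so the override takes effect
}
```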
Avoid using connect() with the sock fd interface, because fd_collection
doesn't keep xlio_socket_t objects.

Signed-off-by: Dmytro Podgornyi <[email protected]>
xlio_socket_t objects aren't connected to the fd_collection anymore.
Therefore, all the methods must be called from the sockinfo_tcp objects
directly.

Also, xlio_socket_fd() is not relevant anymore and can be removed.

Signed-off-by: Dmytro Podgornyi <[email protected]>
Fix iterating over the std::list of TCP sockets while
erasing a socket during the iteration.
Solved by incrementing the iterator before the erase.

Signed-off-by: Iftah Levi <[email protected]>
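
A minimal sketch of the fix pattern for `std::list`, where only the erased element's iterator is invalidated:

```cpp
#include <list>

// Advance the iterator before calling erase() so the erased element's
// iterator is never reused on the next loop step.
void erase_closed(std::list<int> &socks) {
    for (auto it = socks.begin(); it != socks.end();) {
        if (*it < 0) {         // pretend a negative fd means "closed"
            socks.erase(it++); // increment first, then erase the old position
        } else {
            ++it;
        }
    }
}
```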
rdma-core limits the number of UARs per context to 16 by default. After
creating 16 QPs, XLIO receives duplicates of the BlueFlame registers for
each subsequent QP. As a result, the BlueFlame doorbell method can write
WQEs concurrently without serialization, and this leads to data corruption.

BlueFlame can impact throughput, since the copy to the BlueFlame
register is expensive. It can improve latency in some low-latency
scenarios; however, XLIO targets high traffic/PPS rates.
Removing the BlueFlame method also slightly improves performance in some
scenarios.

BlueFlame can be brought back in the future to improve low-latency
scenarios; however, that will require some rework to avoid the data
corruption.

Signed-off-by: Dmytro Podgornyi <[email protected]>
The inline WQE branch is unlikely to be taken in most throughput scenarios.

Signed-off-by: Dmytro Podgornyi <[email protected]>
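
A sketch of this kind of branch annotation; `xlio_unlikely` and `post_send` are illustrative names, though the `__builtin_expect` wrapper is a common idiom:

```cpp
#include <cstddef>

// Mark the inline-WQE path as unlikely so the compiler keeps the common
// non-inline path on the hot, fall-through side.
#define xlio_unlikely(x) __builtin_expect(!!(x), 0)

void post_send(std::size_t len, std::size_t inline_threshold) {
    if (xlio_unlikely(len <= inline_threshold)) {
        // copy the payload directly into the WQE (inline send)
    } else {
        // post a gather entry pointing at the payload buffer (common case)
    }
}
```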
Avoid calling register_socket_timer_event when a socket is already registered (TIME-WAIT).
Although there is no functional issue with that, it produces too high a rate of event posting for the internal thread.
This leads to lock contention inside the internal thread and degraded HTTP CPS performance.

Signed-off-by: Alexander Grissik <[email protected]>
Signed-off-by: Gal Noam <[email protected]>
UTLS uses tcp_tx_express() for non-blocking sockets. However, this TX
method doesn't support XLIO_RX_POLL_ON_TX_TCP. The additional RX polling
improves scenarios such as web servers.

Insert RX polling into the UTLS TX path to resolve the performance degradation.

Signed-off-by: Dmytro Podgornyi <[email protected]>
In heavy CPS scenarios a socket may enter the TIME-WAIT state and be reused before the first TCP timer registration is performed by the internal thread.
1. Setting timer_registered=true while posting the event prevents a second attempt to post the event again.
2. Adding a sanity check in add_new_timer that verifies the socket is not already in the timer map.

Signed-off-by: Alexander Grissik <[email protected]>
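
A hypothetical sketch of the guard in point 1; the names are illustrative, not XLIO's actual code:

```cpp
#include <atomic>

struct tcp_sock {
    std::atomic<bool> timer_registered{false};
};

void post_timer_registration(tcp_sock &s) {
    // Mark the socket as registered at posting time, not when the
    // internal thread later processes the event, so a socket reused
    // out of TIME-WAIT cannot post a duplicate registration.
    bool expected = false;
    if (s.timer_registered.compare_exchange_strong(expected, true)) {
        // ... post the registration event to the internal thread ...
    }
}
```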
Added a new env parameter: XLIO_MAX_TSO_SIZE.
It allows the user to control the maximum TSO size,
instead of taking the maximum capability of the HW.
The default size is 256KB (the maximum of current HW).
Values higher than the HW capability won't be taken into account.

Signed-off-by: Iftah Levi <[email protected]>
Signed-off-by: Gal Noam <[email protected]>
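
An illustrative clamp of the new parameter against the HW capability; the parsing and function name are assumptions, not XLIO's actual code:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>

uint32_t effective_max_tso(uint32_t hw_max_tso) {
    uint32_t requested = 256U * 1024U; // documented default: 256KB
    if (const char *env = std::getenv("XLIO_MAX_TSO_SIZE"))
        requested = static_cast<uint32_t>(std::strtoul(env, nullptr, 0));
    // Values above the HW capability won't be taken into account.
    return std::min(requested, hw_max_tso);
}
```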
When sock_stats was static, its destructor was called before xlio_exit, which destroys the internal thread, which in turn destroys sockets.
We should avoid global objects with non-trivial constructors/destructors, since there is no control over their execution order.

Signed-off-by: Alexander Grissik <[email protected]>
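
One common way to avoid this class of problem is the construct-on-first-use idiom with an intentional leak; a sketch, not necessarily the exact fix applied here:

```cpp
// The instance is intentionally leaked, so it can never be destroyed
// before late shutdown paths (e.g. xlio_exit) that still use it.
struct sock_stats_t {
    // counters, per-socket maps, ...
};

sock_stats_t &sock_stats() {
    static sock_stats_t *instance = new sock_stats_t(); // never deleted
    return *instance;
}
```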
When a TCP socket is destroyed, it frees the preallocated buffers after the dst_entry is deleted.
This returns the buffers directly to the global pool and breaks the m_tx_num_bufs and m_zc_num_bufs ring counters.

1. Move the preallocated buffers cleanup before dst_entry destruction.
2. Add ring stats for m_tx_num_bufs and m_zc_num_bufs.

Signed-off-by: Alexander Grissik <[email protected]>
1. Removing the hardcoded check that switches AIM to latency mode.
In the case of a low packet rate, the calculation results in a 0 count anyway.
In case the packet rate is higher than the desired interrupt rate, we do want to utilize AIM correctly.
2. Changing the default AIM values to more reasonable ones.
3. Removing the Nginx-specific default values and using AIM by default.
This significantly improves CPU utilization in lightly congested cases.

Signed-off-by: Alexander Grissik <[email protected]>
These parameters are deprecated and will be removed in the future. Use
XLIO_MEMORY_LIMIT instead.

Signed-off-by: Dmytro Podgornyi <[email protected]>
MCE_MAX_CQ_POLL_BATCH usage requires it to be small enough. However,
this is a logical upper limit and we want to be able to raise it if
necessary.

Remove unused cq_mgr_tx::clean_cq() which uses MCE_MAX_CQ_POLL_BATCH
for an array on stack.

Adjust the condition for RX buffer compensation to remove
MCE_MAX_CQ_POLL_BATCH. However, this changes the logic: now we
forcibly compensate only the last RX buffer in the RQ.

Signed-off-by: Dmytro Podgornyi <[email protected]>
MCE_MAX_CQ_POLL_BATCH is a logical upper limit for the CQ polling batch
size. There is no hard limitation on it, so raise it to the maximum
CQ size.

The value can even exceed the CQ size, because the HW continues receiving
packets during polling.

By default, this change has no effect unless a higher value
for XLIO_CQ_POLL_BATCH_MAX is set explicitly. This can be helpful
in a scenario where a high traffic rate stops for a long time and the
number of packets in an RQ exceeds the batch size.

Signed-off-by: Dmytro Podgornyi <[email protected]>
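
A sketch of batch polling with a runtime-configured cap instead of a fixed stack array; `poll_cq_batch` and `batch_max` are illustrative, though `ibv_poll_cq` is the standard verbs call:

```cpp
#include <vector>
#include <infiniband/verbs.h>

// Poll up to a runtime-configured batch of completions per iteration;
// batch_max is assumed to come from XLIO_CQ_POLL_BATCH_MAX.
int poll_cq_batch(ibv_cq *cq, unsigned batch_max) {
    std::vector<ibv_wc> wcs(batch_max); // heap-allocated, so large batches are safe
    int n = ibv_poll_cq(cq, static_cast<int>(wcs.size()), wcs.data());
    for (int i = 0; i < n; ++i) {
        // ... process wcs[i] ...
    }
    return n;
}
```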
Signed-off-by: Gal Noam <[email protected]>
When the send window is not big enough for the TCP segment we need to send,
we may split the segment so that it fits into the window.
Before this change, we did not split the segment when there were unacked segments in flight.
The rationale was that we expect ACKs for the in-flight segments,
which will trigger the next send operation.
This flow depends on the RTT of receiving ACKs, which may be delayed by the remote side.
When the RTT is long, we would block sending even though the TCP send window allows it.

The change is to split TCP segments even when there is unacked data,
provided the send window is big enough (at least one MSS).

Signed-off-by: Iftah Levi <[email protected]>
@iftahl (Collaborator, Author) commented May 9, 2024

bot:retest

@iftahl (Collaborator, Author) commented May 12, 2024

@AlexanderGrissik can we merge it?

@galnoam (Collaborator) commented Jun 5, 2024

@iftahl, please add statistics, as @AlexanderGrissik requested.
