-
Notifications
You must be signed in to change notification settings - Fork 6.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
buffer release when BT_CONN_DISCONNECT_COMPLETE #64114
buffer release when BT_CONN_DISCONNECT_COMPLETE #64114
Conversation
I have created a PR, but frankly - most of the changes come from clang-format. Was the file formatted properly in the main branch? |
You should not format the whole file, only your changes and in general keep with the style of the file. clang-format supports formatting the diff only. The way it is now makes it very hard to review. |
OK, I will roll back and format the diff only. But could you explain to me why the file is not formatted as a whole according to the clang-format definition? What sense is in diff formatting only, when the rest of the file is a 'jungle'? |
The clang-format file is not enforced in CI, checkpatch is used instead. Different subsystems may have slightly different styles. Different versions of clang-format support different options. The column width is recommended 80 chars but max 100, which tools dont handle at all. Formatting the whole file will create a large diff making backporting of changes more difficult. |
This reverts commit cf3e379.
OK, I have reverted the file and added only the fix with proper formatting. |
The recommendation for 80 characters is no longer in effect in the Linux kernel (where we got this from), so the current "recommendation" is, AFAIK, actually 100. |
subsys/bluetooth/host/conn.c
Outdated
bool disconnecting = conn->state == BT_CONN_DISCONNECTED || | ||
conn->state == BT_CONN_DISCONNECTING || | ||
conn->state == BT_CONN_DISCONNECT_COMPLETE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if it would make more sense to check for conn->state != BT_CONN_CONNECTED
.
Arguably we only want to process TX when in the connected state, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was also my initial idea, but frankly, the whole state machine in conn.c looks a little overcomplicated and didn't have time to analyze it deeply. On the other hand, you are correct, inside the send_buf() function there is a check to validate the state is CONNECTED.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any scenario where we are not in the BT_CONN_CONNECTED
state and do not want to flush our TX queue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will take a look, but probably this weekend, and I can redo tests next week to see if '!= CONNECTED' is OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Thalley
I did some additional analysis and testing and this is what I found:
- BT_CONN_CLEANUP flag is set by bt_conn_set_state() only when transition from BT_CONN_DISCONNECT_COMPLETE -> BT_CONN_DISCONNECTED. See this:
/* Notify disconnection and queue a dummy buffer to wake
* up and stop the tx thread for states where it was
* running.
*/
switch (old_state) {
case BT_CONN_DISCONNECT_COMPLETE:
tx_notify(conn);
/* Cancel Connection Update if it is pending */
if ((conn->type == BT_CONN_TYPE_LE) &&
(k_work_delayable_busy_get(&conn->deferred_work) &
(K_WORK_QUEUED | K_WORK_DELAYED))) {
k_work_cancel_delayable(&conn->deferred_work);
}
atomic_set_bit(conn->flags, BT_CONN_CLEANUP);
k_poll_signal_raise(&conn_change, 0);
/* The last ref will be dropped during cleanup */
break;
so in bt_conn_process_tx(), the call to conn_cleanup() will be executed only when the state is BT_CONN_DISCONNECTED and BT_CONN_CLEANUP set. This means my change here is useless.
-
What fixes memory buffer release is the check for errors from send_buf(). If the connection state is different than BT_CONN_CONNECTED (tested in do_send_frag()) this function returns '-ENOTCONN'. And this is what should be handled.
-
What is still not clear to me is how other error codes should be treated. Like '-ENOBUFS' or '-ENOMEM'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-ENOBUFS
is only in the case of not enough "credits" for sending data to the controller. In that case, we still haven't poppedbuf
from the tx queue, so no leaks.-ENOMEM
is in the case where we don't have a buffer for allocating a fragment for sending parts ofbuf
. In that case we also haven't poppedbuf
from TX queue, so no leaks again- Any other error codes from the controller are converted to
-EIO
to avoid a scenario when the above cases are not true.
@@ -926,7 +926,7 @@ void bt_conn_process_tx(struct bt_conn *conn) | |||
err = send_buf(conn, buf); | |||
net_buf_unref(buf); | |||
|
|||
if (err == -EIO) { | |||
if (err == -EIO || err == -ENOTCONN) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this already being done in
https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/bluetooth/host/conn.c#L643-L656
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's the intention at least. But this PR proves that the flow is obfuscated enough that no one here is really sure of what's happening anymore.
I had a branch that massively simplifies all of this, but couldn't merge it as it exposed other issues. I'll see if we can split out some parts of it.
@watsug I took the liberty of editing your PR message to link the issue you're fixing, in order to give more context to the reviewers .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jori-nordic What should we do with this PR?
It may sound like it's fixing an issue, but it also sounds like that the fix here already exist and has just been moved, so to me it doesn't seem to actually change anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have another look at the explanation (next week) and report back.
Sometimes, fixes like these are impossible to test for in bsim upstream, as the ZLL is not compatible with BT_RECV_WORKQ_SYS
and BT_RECV_WORKQ_BT
. These two options can lead to a very different scheduling pattern, highlighting issues like this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of singling out "problematic" error codes, we could only check for the cases where we want to keep the buffer in the queue, E.g. -EAGAIN.
TODO for myself: document what error codes should trigger a destroy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@watsug Any chance to continue work on this PR? |
I implemented what I suggested in #68748 Would one of you have the time to double-check that PR fixes the bug? |
This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note, that you can always re-open a closed pull request at any time. |
Connection cleanup is performed not only when BT_CONN_DISCONNECTED, but also when BT_CONN_DISCONNECTING and BT_CONN_DISCONNECT_COMPLETE. In addition send_buf() function can return -ENOTCONN error code - the handling was added. The fix has been tested on our side and it fixes the issue.
Fixes #63217