Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

buffer release when BT_CONN_DISCONNECT_COMPLETE #64114

Conversation

watsug
Copy link

@watsug watsug commented Oct 19, 2023

Connection cleanup is performed not only when BT_CONN_DISCONNECTED, but also when BT_CONN_DISCONNECTING and BT_CONN_DISCONNECT_COMPLETE. In addition send_buf() function can return -ENOTCONN error code - the handling was added. The fix has been tested on our side and it fixes the issue.

Fixes #63217

@watsug
Copy link
Author

watsug commented Oct 19, 2023

I have created a PR, but frankly - most of the changes come from clang-format. Was the file formatted properly in the main branch?

@hermabe
Copy link
Member

hermabe commented Oct 19, 2023

I have created a PR, but frankly - most of the changes come from clang-format. Was the file formatted properly in the main branch?

You should not format the whole file, only your changes and in general keep with the style of the file. clang-format supports formatting the diff only. The way it is now makes it very hard to review.

@watsug
Copy link
Author

watsug commented Oct 19, 2023

OK, I will roll back and format the diff only. But could you explain to me why the file is not formatted as a whole according to the clang-format definition? What sense is in diff formatting only, when the rest of the file is a 'jungle'?

@hermabe
Copy link
Member

hermabe commented Oct 19, 2023

OK, I will roll back and format the diff only. But could you explain to me why the file is not formatted as a whole according to the clang-format definition? What sense is in diff formatting only, when the rest of the file is a 'jungle'?

The clang-format file is not enforced in CI, checkpatch is used instead. Different subsystems may have slightly different styles. Different versions of clang-format support different options. The column width is recommended 80 chars but max 100, which tools dont handle at all. Formatting the whole file will create a large diff making backporting of changes more difficult.

@watsug
Copy link
Author

watsug commented Oct 19, 2023

OK, I have reverted the file and added only the fix with proper formatting.

@Thalley
Copy link
Collaborator

Thalley commented Oct 19, 2023

The column width is recommended 80 chars but max 100

The recommendation for 80 characters is no longer in effect in the Linux kernel (where we got this from), so the current "recommendation" is, AFAIK, actually 100.

Comment on lines 906 to 908
bool disconnecting = conn->state == BT_CONN_DISCONNECTED ||
conn->state == BT_CONN_DISCONNECTING ||
conn->state == BT_CONN_DISCONNECT_COMPLETE;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if it would make more sense to check for conn->state != BT_CONN_CONNECTED.

Arguably we only want to process TX when in the connected state, right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was also my initial idea, but frankly, the whole state machine in conn.c looks a little overcomplicated and didn't have time to analyze it deeply. On the other hand, you are correct, inside the send_buf() function there is a check to validate the state is CONNECTED.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any scenario where we are not in the BT_CONN_CONNECTED state and do not want to flush our TX queue?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will take a look, but probably this weekend, and I can redo tests next week to see if '!= CONNECTED' is OK.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Thalley

I did some additional analysis and testing and this is what I found:

  1. BT_CONN_CLEANUP flag is set by bt_conn_set_state() only when transition from BT_CONN_DISCONNECT_COMPLETE -> BT_CONN_DISCONNECTED. See this:
		/* Notify disconnection and queue a dummy buffer to wake
		 * up and stop the tx thread for states where it was
		 * running.
		 */
		switch (old_state) {
		case BT_CONN_DISCONNECT_COMPLETE:
			tx_notify(conn);

			/* Cancel Connection Update if it is pending */
			if ((conn->type == BT_CONN_TYPE_LE) &&
			    (k_work_delayable_busy_get(&conn->deferred_work) &
			     (K_WORK_QUEUED | K_WORK_DELAYED))) {
				k_work_cancel_delayable(&conn->deferred_work);
			}

			atomic_set_bit(conn->flags, BT_CONN_CLEANUP);
			k_poll_signal_raise(&conn_change, 0);
			/* The last ref will be dropped during cleanup */
			break;

so in bt_conn_process_tx(), the call to conn_cleanup() will be executed only when the state is BT_CONN_DISCONNECTED and BT_CONN_CLEANUP set. This means my change here is useless.

  1. What fixes memory buffer release is the check for errors from send_buf(). If the connection state is different than BT_CONN_CONNECTED (tested in do_send_frag()) this function returns '-ENOTCONN'. And this is what should be handled.

  2. What is still not clear to me is how other error codes should be treated. Like '-ENOBUFS' or '-ENOMEM'.

Copy link
Collaborator

@jori-nordic jori-nordic Nov 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • -ENOBUFS is only in the case of not enough "credits" for sending data to the controller. In that case, we still haven't popped buf from the tx queue, so no leaks.
  • -ENOMEM is in the case where we don't have a buffer for allocating a fragment for sending parts of buf. In that case we also haven't popped buf from TX queue, so no leaks again
  • Any other error codes from the controller are converted to -EIO to avoid a scenario when the above cases are not true.

@@ -926,7 +926,7 @@ void bt_conn_process_tx(struct bt_conn *conn)
err = send_buf(conn, buf);
net_buf_unref(buf);

if (err == -EIO) {
if (err == -EIO || err == -ENOTCONN) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's the intention at least. But this PR proves that the flow is obfuscated enough that no one here is really sure of what's happening anymore.
I had a branch that massively simplifies all of this, but couldn't merge it as it exposed other issues. I'll see if we can split out some parts of it.

@watsug I took the liberty of editing your PR message to link the issue you're fixing, in order to give more context to the reviewers .

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jori-nordic What should we do with this PR?

It may sound like it's fixing an issue, but it also sounds like that the fix here already exist and has just been moved, so to me it doesn't seem to actually change anything.

Copy link
Collaborator

@jori-nordic jori-nordic Nov 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have another look at the explanation (next week) and report back.

Sometimes, fixes like these are impossible to test for in bsim upstream, as the ZLL is not compatible with BT_RECV_WORKQ_SYS and BT_RECV_WORKQ_BT. These two options can lead to a very different scheduling pattern, highlighting issues like this one.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of singling out "problematic" error codes, we could only check for the cases where we want to keep the buffer in the queue, E.g. -EAGAIN.
TODO for myself: document what error codes should trigger a destroy.

Copy link
Collaborator

@jori-nordic jori-nordic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you instead move the connection test from here to here?

edit: That way, buf is never popped from the TX queue, and conn_cleanup will end up freeing it and its metadata later.

edit2: and also cleanup the commit history

@kruithofa kruithofa removed their request for review January 11, 2024 10:14
@r2r0
Copy link
Member

r2r0 commented Jan 16, 2024

@watsug Any chance to continue work on this PR?

@jori-nordic
Copy link
Collaborator

@r2r0 @watsug

I implemented what I suggested in #68748

Would one of you have the time to double-check that PR fixes the bug?

Copy link

github-actions bot commented Apr 9, 2024

This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note, that you can always re-open a closed pull request at any time.

@github-actions github-actions bot added the Stale label Apr 9, 2024
@github-actions github-actions bot closed this Apr 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
7 participants