Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

usbd: device_next: cdc: wait IN EP ready before enqueue data to EP #79104

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

zeonchew
Copy link
Contributor

Under the current structure of usbd_cdc_acm, there is a possibility that TX data is enqueued to UDC driver before the previous transaction is completed when host is sending data to device faster than it receives data from device, causing dropped data.

Example of such flow is that:

  1. USB_OUT transaction completed. Data passed to sample's application code. Applicaiton code echo the data back into tx_fifo.
  2. USB_IN transaction is enqueued to UDC driver.
  3. Before USB_IN transaction is completed, another USB_OUT transaction is completed. Application code echo the data back into tx_fifo again.
  4. Since the echo raised tx_fifo.altered flag, tx_fifo_work is queued again.
  5. When cdc_acm_tx_fifo_handler() tries to enqueue packet to UDC driver, it is rejected because the USB_IN endpoint is still busy, causing the packet to be dropped.

This PR adds mechanism to wait for IN endpoint to complete existing transaction before allowing subsequent transaction to be enqueued to UDC drive.

@zephyrbot zephyrbot added the area: USB Universal Serial Bus label Sep 27, 2024
@zeonchew zeonchew force-pushed the usbd_acm_out_ep_busy_fix branch 2 times, most recently from 1fc6702 to 6ad3ad4 Compare September 27, 2024 13:25
@@ -540,6 +544,11 @@ static void cdc_acm_tx_fifo_handler(struct k_work *work)
return;
}

if (atomic_test_and_set_bit(&data->state, CDC_ACM_TX_EP_BUSY)) {
cdc_acm_work_submit(&data->tx_fifo_work);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this work? Will it simply keep retriggering the TX fifo work until the request is completed? This sounds way too much CPU intensive to me that could starve system workqueue and other lower priority threads.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tmon-nordic, you are right. Indeed keep requeuing tx_fifo_work would cause the issue you mentioned, I have totally missed that. I have changed the handling to raised the tx_fifo altered flag instead. Please help and see whether it make sense for you.

@tmon-nordic
Copy link
Contributor

The commit message should not be a concatenation of squashed commit messages like it is now.

@zeonchew
Copy link
Contributor Author

zeonchew commented Oct 1, 2024

The commit message should not be a concatenation of squashed commit messages like it is now.

Hi @tmon-nordic, thanks again. Sorry that I am unaware for the rules and workflow. Have removed the "review history" from the commit message.

May I check with you whether there is documentation regarding correct workflow for addressing review?

From what I understand from what is happening so far is that, for addressing review, we wanted the amendment to be squashed with the original commit, without needing record of the review history.
Do I understand this correctly?

Thanks again @tmon-nordic.

@tmon-nordic
Copy link
Contributor

May I check with you whether there is documentation regarding correct workflow for addressing review?

It can be derived from https://docs.zephyrproject.org/latest/contribute/contributor_expectations.html but only indirectly. The PR in Zephyr is essentially equivalent to what a patch series in Linux Kernel is.

From what I understand from what is happening so far is that, for addressing review, we wanted the amendment to be squashed with the original commit, without needing record of the review history. Do I understand this correctly?

The review history is not really useful in the "final patch set" (each subsequent force push should be possible to consider as "final patch set", there shouldn't really be "WIP" changes unless you mark the PR as draft).

There are so many commits that if all review history was in the main history it would be orders of magnitude harder to navigate it. If someone is interested in review history then the GitHub is the place to look for it (Linux Kernel IMHO has it better with the lore.kernel.org archive, but that's related to tooling and not necessarily the development process).

@zeonchew
Copy link
Contributor Author

zeonchew commented Oct 1, 2024

@tmon-nordic, undestood. Thanks for the sharing again 😄

Copy link
Collaborator

@jfischer-no jfischer-no left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under the current structure of usbd_cdc_acm, there is a possibility that
TX data is enqueued to UDC driver before the previous transaction is
completed, hence causing the dropped data.

Where is data dropped? Also please take a look at #75439

@zeonchew
Copy link
Contributor Author

zeonchew commented Oct 1, 2024

Hi @jfischer-no,

Where is data dropped?

from my current understanding, because that cdc_acm_irq_cb_handler() is invoked-able by a number of events, there is a possibility that when the handler is invoked before TX is completed, and tx_fifo.altered is raised. in this case usbd_cdc_acm would still enqueue the data from tx_fifo to UDC driver at cdc_acm_tx_fifo_handler() by calling usbd_ep_enqueue(). Since the EP status is still busy at UDC driver side, depending on the enqueue mechanism at UDC, it would be either one of the following scenario:

  1. UDC driver push the request into a queue and returns success, but when the UDC thread tries to execute the TX transaction, it fails and discarded by UDC driver.
  2. UDC driver return failure for the calls, and the failure case handling in cdc_acm_tx_fifo_handler() discards the buffer:
	ret = usbd_ep_enqueue(c_data, buf);
	if (ret) {
		LOG_ERR("Failed to enqueue");
		net_buf_unref(buf);
	}

Also please take a look at #75439

From my understanding of the code change in #75439, it seems that the k_work_delayable is very useful to delay the tx_fifo_work when the EP is busy. It would be good if we could use delayed tx_fifo_work to retrigger the work instead of setting data->tx_fifo.altered = true as what is done in this PR. What do you think?

By the way, it seems that the case where "IN endpoint is busy" is not included in #75439 yet. Please correct me if I understand that wrongly.

@tmon-nordic
Copy link
Contributor

tmon-nordic commented Oct 1, 2024

From my understanding of the code change in #75439, it seems that the k_work_delayable is very useful to delay the tx_fifo_work when the EP is busy. It would be good if we could use delayed tx_fifo_work to retrigger the work instead of setting data->tx_fifo.altered = true as what is done in this PR. What do you think?

I think blindly retriggering with some arbitrary delay is bad design. It should really be a flag that executes once the underlying condition has changed (which is the case here with data->tx_fifo.altered).

The two issues seem independent to me. But it would be best @zeonchew could try to trigger the issue this PR is aiming to fix with #75439 changes applied (and without this PR).

@zeonchew
Copy link
Contributor Author

zeonchew commented Oct 2, 2024

From my understanding of the code change in #75439, it seems that the k_work_delayable is very useful to delay the tx_fifo_work when the EP is busy. It would be good if we could use delayed tx_fifo_work to retrigger the work instead of setting data->tx_fifo.altered = true as what is done in this PR. What do you think?

I think blindly retriggering with some arbitrary delay is bad design. It should really be a flag that executes once the underlying condition has changed (which is the case here with data->tx_fifo.altered).

The two issues seem independent to me. But it would be best @zeonchew could try to trigger the issue this PR is aiming to fix with #75439 changes applied (and without this PR).

@tmon-nordic, sure. Let me have a try on it soonest possible and get back here to update the status. 😉

Under the current structure of usbd_cdc_acm, there is a possibility that
TX data is enqueued to UDC driver before the previous transaction is
completed, hence causing the dropped data. This commit adds a mechanism
for usbd_cdc_acm driver to wait for IN endpoint to complete existing
transaction before subsequent transaction is enqueued.

Signed-off-by: Chew Zeh Yang <[email protected]>
@@ -114,6 +115,8 @@ struct cdc_acm_uart_data {
struct k_work rx_fifo_work;
atomic_t state;
struct k_sem notif_sem;
/* USBD CDC ACM TX transaction deferred due to BULK IN EP busy */
bool tx_deferred;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why new variable instead of just using data->tx_fifo.altered like it was earlier?

Copy link
Contributor Author

@zeonchew zeonchew Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tmon-nordic,

I was just about to add my report on what I have done here.

Since that #75439 has been merged into zephyr main, I have rebased my works and retested with the changes of #75439 applied. However, I am still able to reproduce this issue.

I have also run more stress test on my fixed ported over to the latest main, and found out that raising data->tx_fifo.altered wouldn't help in retrigger the TX transaction, as tx_fifo.altered is cleared in cdc_acm_irq_cb_handler().

data->tx_fifo.altered = false;
data->rx_fifo.altered = false;

The previous testing wasn't rigorous enough, that the subsequent transaction was actually triggered by new data coming in from RX instead (and hence the stale data was flushed out together).

This being said, I have added another handling here to make sure that remaining data in tx_fifo, which the BULK IN endpoint is unable to send out in current transaction, gets sent out immediately after current TX is completed as well.

/*
* If ring buffer contains more data than what the endpoint could
* send in one go, raise deferred flag for remaining data to be
* sent out once current transaction is completed.
*/
if (len == buf->size) {
if (!ring_buf_is_empty(data->tx_fifo.rb)) {
data->tx_deferred = true;
}
}

Please help to review whether this make sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand cdc_acm_irq_cb_handler(). Isn't the data->tx_fifo.altered = false; going to make the if (data->tx_fifo.altered) effectively a dead code? Shouldn't this be fixed instead of adding new boolean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand cdc_acm_irq_cb_handler(). Isn't the data->tx_fifo.altered = false; going to make the if (data->tx_fifo.altered) effectively a dead code? Shouldn't this be fixed instead of adding new boolean?

I took me a while to understand this as well, from my understanding, in this handler, it will first clear all the altered flag, and then callback to the interrupt handler at application side: data->cb(...).

if (atomic_test_bit(&data->state, CDC_ACM_IRQ_RX_ENABLED) ||
atomic_test_bit(&data->state, CDC_ACM_IRQ_TX_ENABLED)) {
data->cb(usbd_class_get_private(c_data), data->cb_data);
}

If there is more data to be transmitted by the application, the tx_fifo.altered flag will be raised by application before the callback returns. Then here we go, continue to send data out again.

I see that this part is checked in by @jfischer-no. @jfischer-no, can you please help to confirm whether this understanding is correct?
Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there a problem when there is more data in the tx ringbuffer? The sequence I think about is:

  • cdc_acm_irq_cb_handler() sets data->tx_fifo.altered = false;
  • data->cb(usbd_class_get_private(c_data), data->cb_data); is called, application writes to ring buffer more bytes than a CDC ACM TX buffer can fit
  • TX fifo work is triggered because data->tx_fifo.altered was set to true alongside ring buffer write
  • cdc_acm_tx_fifo_handler() calls len = ring_buf_get(data->tx_fifo.rb, buf->data, buf->size); and submits the transfer. The ring buffer still has pending TX data.
  • host reads all the data, usbd_cdc_acm_request() submits cdc_acm_work_submit(&data->irq_cb_work);
  • cdc_acm_irq_cb_handler() sets data->tx_fifo.altered = false;
  • data->cb(usbd_class_get_private(c_data), data->cb_data); is called, application doesn't have more data to write
  • TX fifo work is not triggered despite there being data in the ring buffer.

Is there anything preventing the above scenario?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tmon-nordic,

I added this for your case you mentioned in the latest commit just now ☺️

/*
* If ring buffer contains more data than what the endpoint could
* send in one go, raise deferred flag for remaining data to be
* sent out once current transaction is completed.
*/
if (len == buf->size) {
if (!ring_buf_is_empty(data->tx_fifo.rb)) {
data->tx_deferred = true;
}
}

Copy link
Contributor

@tmon-nordic tmon-nordic Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why to have two booleans for roughly the same? Wouldn't it be better to change the data->tx_fifo.altered logic completely? If we go with separate variable, I think bool tx_deferred; should be moved inside tx_fifo structure.

But is it really necessary to have tx_deferred? I think that the CDC_ACM_TX_EP_BUSY in combination with ring_buf_is_empty(data->tx_fifo.rb)) should be enough to determine whether the TX work has to be raised or not.

My idea would be to replace if (data->tx_fifo.altered) with if (!ring_buf_is_empty(data->tx_fifo.rb)) && !atomic_test_and_set_bit(&data->state, CDC_ACM_TX_EP_BUSY)). Then the CDC_ACM_TX_EP_BUSY would be cleared in usbd_cdc_acm_request() (and inside the TX handler if it fails to enqueue the buffer) when the IN request completes. This would essentially eliminate the need for data->tx_fifo.altered altogether and potentially make the code simpler to follow and more robust. Am I missing something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open Source development works on technical merit. I would say that there is no need to consult the original author when proposing a change that solves issues or otherwise enhances the code base.

While there may be times where you want to ask what was intended, there is also the possibility that the original author simply is no longer around.

Copy link
Contributor Author

@zeonchew zeonchew Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, was going to replace my comment with another one just now but somehow I lost all of them.
(The comment on saying I am new to Open-Source and ZephyrOS, asking on how this community works, and whether it is okay to make such a change without consulting original author is correct way to do things.)

I am kind of agree that in this case, checking for ring_buf_is_empty() at cdc_acm_irq_cb_handler() plus checking for CDC_ACM_TX_EP_BUSY flag at cdc_acm_tx_fifo_handler() would be optimal as checking the ring buffer status directly gives us the most accurate status for actions to be taken.

It seems that the original intend of altered flag is to provide a fast access to the status whether application layer has accessed the ring buffer. However, it seems to me that the bit might not be good enough for the scenario in tx_fifo here anymore.

@jfischer-no , what do you think? Is @tmon-nordic suggestion here to check whether to trigger tx_fifo_work based on whether ring_buf_is_empty() sounds good to you as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: USB Universal Serial Bus
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants