usbd: device_next: cdc: wait IN EP ready before enqueue data to EP #79104

zeonchew · 2024-09-27T10:05:41Z

Under the current structure of usbd_cdc_acm, there is a possibility that TX data is enqueued to UDC driver before the previous transaction is completed when host is sending data to device faster than it receives data from device, causing dropped data.

Example of such flow is that:

USB_OUT transaction completed. Data passed to sample's application code. Applicaiton code echo the data back into tx_fifo.
USB_IN transaction is enqueued to UDC driver.
Before USB_IN transaction is completed, another USB_OUT transaction is completed. Application code echo the data back into tx_fifo again.
Since the echo raised tx_fifo.altered flag, tx_fifo_work is queued again.
When cdc_acm_tx_fifo_handler() tries to enqueue packet to UDC driver, it is rejected because the USB_IN endpoint is still busy, causing the packet to be dropped.

This PR adds mechanism to wait for IN endpoint to complete existing transaction before allowing subsequent transaction to be enqueued to UDC drive.

tmon-nordic · 2024-09-30T09:27:26Z

subsys/usb/device_next/class/usbd_cdc_acm.c

@@ -540,6 +544,11 @@ static void cdc_acm_tx_fifo_handler(struct k_work *work)
 		return;
 	}

+	if (atomic_test_and_set_bit(&data->state, CDC_ACM_TX_EP_BUSY)) {
+		cdc_acm_work_submit(&data->tx_fifo_work);


How does this work? Will it simply keep retriggering the TX fifo work until the request is completed? This sounds way too much CPU intensive to me that could starve system workqueue and other lower priority threads.

Hi @tmon-nordic, you are right. Indeed keep requeuing tx_fifo_work would cause the issue you mentioned, I have totally missed that. I have changed the handling to raised the tx_fifo altered flag instead. Please help and see whether it make sense for you.

tmon-nordic · 2024-10-01T07:20:08Z

The commit message should not be a concatenation of squashed commit messages like it is now.

zeonchew · 2024-10-01T07:45:40Z

The commit message should not be a concatenation of squashed commit messages like it is now.

Hi @tmon-nordic, thanks again. Sorry that I am unaware for the rules and workflow. Have removed the "review history" from the commit message.

May I check with you whether there is documentation regarding correct workflow for addressing review?

From what I understand from what is happening so far is that, for addressing review, we wanted the amendment to be squashed with the original commit, without needing record of the review history.
Do I understand this correctly?

Thanks again @tmon-nordic.

tmon-nordic · 2024-10-01T08:10:30Z

May I check with you whether there is documentation regarding correct workflow for addressing review?

It can be derived from https://docs.zephyrproject.org/latest/contribute/contributor_expectations.html but only indirectly. The PR in Zephyr is essentially equivalent to what a patch series in Linux Kernel is.

From what I understand from what is happening so far is that, for addressing review, we wanted the amendment to be squashed with the original commit, without needing record of the review history. Do I understand this correctly?

The review history is not really useful in the "final patch set" (each subsequent force push should be possible to consider as "final patch set", there shouldn't really be "WIP" changes unless you mark the PR as draft).

There are so many commits that if all review history was in the main history it would be orders of magnitude harder to navigate it. If someone is interested in review history then the GitHub is the place to look for it (Linux Kernel IMHO has it better with the lore.kernel.org archive, but that's related to tooling and not necessarily the development process).

zeonchew · 2024-10-01T08:13:37Z

@tmon-nordic, undestood. Thanks for the sharing again 😄

jfischer-no

Under the current structure of usbd_cdc_acm, there is a possibility that
TX data is enqueued to UDC driver before the previous transaction is
completed, hence causing the dropped data.

Where is data dropped? Also please take a look at #75439

zeonchew · 2024-10-01T08:58:46Z

Hi @jfischer-no,

Where is data dropped?

from my current understanding, because that cdc_acm_irq_cb_handler() is invoked-able by a number of events, there is a possibility that when the handler is invoked before TX is completed, and tx_fifo.altered is raised. in this case usbd_cdc_acm would still enqueue the data from tx_fifo to UDC driver at cdc_acm_tx_fifo_handler() by calling usbd_ep_enqueue(). Since the EP status is still busy at UDC driver side, depending on the enqueue mechanism at UDC, it would be either one of the following scenario:

UDC driver push the request into a queue and returns success, but when the UDC thread tries to execute the TX transaction, it fails and discarded by UDC driver.
UDC driver return failure for the calls, and the failure case handling in cdc_acm_tx_fifo_handler() discards the buffer:

	ret = usbd_ep_enqueue(c_data, buf);
	if (ret) {
		LOG_ERR("Failed to enqueue");
		net_buf_unref(buf);
	}

Also please take a look at #75439

From my understanding of the code change in #75439, it seems that the k_work_delayable is very useful to delay the tx_fifo_work when the EP is busy. It would be good if we could use delayed tx_fifo_work to retrigger the work instead of setting data->tx_fifo.altered = true as what is done in this PR. What do you think?

By the way, it seems that the case where "IN endpoint is busy" is not included in #75439 yet. Please correct me if I understand that wrongly.

tmon-nordic · 2024-10-01T11:13:46Z

From my understanding of the code change in #75439, it seems that the k_work_delayable is very useful to delay the tx_fifo_work when the EP is busy. It would be good if we could use delayed tx_fifo_work to retrigger the work instead of setting data->tx_fifo.altered = true as what is done in this PR. What do you think?

I think blindly retriggering with some arbitrary delay is bad design. It should really be a flag that executes once the underlying condition has changed (which is the case here with data->tx_fifo.altered).

The two issues seem independent to me. But it would be best @zeonchew could try to trigger the issue this PR is aiming to fix with #75439 changes applied (and without this PR).

zeonchew · 2024-10-02T08:40:31Z

From my understanding of the code change in #75439, it seems that the k_work_delayable is very useful to delay the tx_fifo_work when the EP is busy. It would be good if we could use delayed tx_fifo_work to retrigger the work instead of setting data->tx_fifo.altered = true as what is done in this PR. What do you think?

I think blindly retriggering with some arbitrary delay is bad design. It should really be a flag that executes once the underlying condition has changed (which is the case here with data->tx_fifo.altered).

The two issues seem independent to me. But it would be best @zeonchew could try to trigger the issue this PR is aiming to fix with #75439 changes applied (and without this PR).

@tmon-nordic, sure. Let me have a try on it soonest possible and get back here to update the status. 😉

Under the current structure of usbd_cdc_acm, there is a possibility that TX data is enqueued to UDC driver before the previous transaction is completed, hence causing the dropped data. This commit adds a mechanism for usbd_cdc_acm driver to wait for IN endpoint to complete existing transaction before subsequent transaction is enqueued. Signed-off-by: Chew Zeh Yang <[email protected]>

tmon-nordic · 2024-10-03T08:53:16Z

subsys/usb/device_next/class/usbd_cdc_acm.c

@@ -114,6 +115,8 @@ struct cdc_acm_uart_data {
 	struct k_work rx_fifo_work;
 	atomic_t state;
 	struct k_sem notif_sem;
+	/* USBD CDC ACM TX transaction deferred due to BULK IN EP busy */
+	bool tx_deferred;


Why new variable instead of just using data->tx_fifo.altered like it was earlier?

Hi @tmon-nordic,

I was just about to add my report on what I have done here.

Since that #75439 has been merged into zephyr main, I have rebased my works and retested with the changes of #75439 applied. However, I am still able to reproduce this issue.

I have also run more stress test on my fixed ported over to the latest main, and found out that raising data->tx_fifo.altered wouldn't help in retrigger the TX transaction, as tx_fifo.altered is cleared in cdc_acm_irq_cb_handler().

zephyr/subsys/usb/device_next/class/usbd_cdc_acm.c

Lines 836 to 837 in a7aac8d

data->tx_fifo.altered = false;

data->rx_fifo.altered = false;

The previous testing wasn't rigorous enough, that the subsequent transaction was actually triggered by new data coming in from RX instead (and hence the stale data was flushed out together).

This being said, I have added another handling here to make sure that remaining data in tx_fifo, which the BULK IN endpoint is unable to send out in current transaction, gets sent out immediately after current TX is completed as well.

zephyr/subsys/usb/device_next/class/usbd_cdc_acm.c

Lines 571 to 580 in a7aac8d

/*

* If ring buffer contains more data than what the endpoint could

* send in one go, raise deferred flag for remaining data to be

* sent out once current transaction is completed.

*/

if (len == buf->size) {

if (!ring_buf_is_empty(data->tx_fifo.rb)) {

data->tx_deferred = true;

}

}

Please help to review whether this make sense.

I don't understand cdc_acm_irq_cb_handler(). Isn't the data->tx_fifo.altered = false; going to make the if (data->tx_fifo.altered) effectively a dead code? Shouldn't this be fixed instead of adding new boolean?

I don't understand cdc_acm_irq_cb_handler(). Isn't the data->tx_fifo.altered = false; going to make the if (data->tx_fifo.altered) effectively a dead code? Shouldn't this be fixed instead of adding new boolean?

I took me a while to understand this as well, from my understanding, in this handler, it will first clear all the altered flag, and then callback to the interrupt handler at application side: data->cb(...).

zephyr/subsys/usb/device_next/class/usbd_cdc_acm.c

Lines 841 to 844 in a7aac8d

if (atomic_test_bit(&data->state, CDC_ACM_IRQ_RX_ENABLED) ||

atomic_test_bit(&data->state, CDC_ACM_IRQ_TX_ENABLED)) {

data->cb(usbd_class_get_private(c_data), data->cb_data);

}

If there is more data to be transmitted by the application, the tx_fifo.altered flag will be raised by application before the callback returns. Then here we go, continue to send data out again.

I see that this part is checked in by @jfischer-no. @jfischer-no, can you please help to confirm whether this understanding is correct?
Thanks.

Isn't there a problem when there is more data in the tx ringbuffer? The sequence I think about is:

cdc_acm_irq_cb_handler() sets data->tx_fifo.altered = false;

data->cb(usbd_class_get_private(c_data), data->cb_data); is called, application writes to ring buffer more bytes than a CDC ACM TX buffer can fit

TX fifo work is triggered because data->tx_fifo.altered was set to true alongside ring buffer write

cdc_acm_tx_fifo_handler() calls len = ring_buf_get(data->tx_fifo.rb, buf->data, buf->size); and submits the transfer. The ring buffer still has pending TX data.

host reads all the data, usbd_cdc_acm_request() submits cdc_acm_work_submit(&data->irq_cb_work);

cdc_acm_irq_cb_handler() sets data->tx_fifo.altered = false;

data->cb(usbd_class_get_private(c_data), data->cb_data); is called, application doesn't have more data to write

TX fifo work is not triggered despite there being data in the ring buffer.

Is there anything preventing the above scenario?

Hi @tmon-nordic,

I added this for your case you mentioned in the latest commit just now ☺️

zephyr/subsys/usb/device_next/class/usbd_cdc_acm.c

Lines 571 to 580 in a7aac8d

/*

* If ring buffer contains more data than what the endpoint could

* send in one go, raise deferred flag for remaining data to be

* sent out once current transaction is completed.

*/

if (len == buf->size) {

if (!ring_buf_is_empty(data->tx_fifo.rb)) {

data->tx_deferred = true;

}

}

Why to have two booleans for roughly the same? Wouldn't it be better to change the data->tx_fifo.altered logic completely? If we go with separate variable, I think bool tx_deferred; should be moved inside tx_fifo structure.

But is it really necessary to have tx_deferred? I think that the CDC_ACM_TX_EP_BUSY in combination with ring_buf_is_empty(data->tx_fifo.rb)) should be enough to determine whether the TX work has to be raised or not.

My idea would be to replace if (data->tx_fifo.altered) with if (!ring_buf_is_empty(data->tx_fifo.rb)) && !atomic_test_and_set_bit(&data->state, CDC_ACM_TX_EP_BUSY)). Then the CDC_ACM_TX_EP_BUSY would be cleared in usbd_cdc_acm_request() (and inside the TX handler if it fails to enqueue the buffer) when the IN request completes. This would essentially eliminate the need for data->tx_fifo.altered altogether and potentially make the code simpler to follow and more robust. Am I missing something?

Open Source development works on technical merit. I would say that there is no need to consult the original author when proposing a change that solves issues or otherwise enhances the code base.

While there may be times where you want to ask what was intended, there is also the possibility that the original author simply is no longer around.

sorry, was going to replace my comment with another one just now but somehow I lost all of them.
(The comment on saying I am new to Open-Source and ZephyrOS, asking on how this community works, and whether it is okay to make such a change without consulting original author is correct way to do things.)

I am kind of agree that in this case, checking for ring_buf_is_empty() at cdc_acm_irq_cb_handler() plus checking for CDC_ACM_TX_EP_BUSY flag at cdc_acm_tx_fifo_handler() would be optimal as checking the ring buffer status directly gives us the most accurate status for actions to be taken.

It seems that the original intend of altered flag is to provide a fast access to the status whether application layer has accessed the ring buffer. However, it seems to me that the bit might not be good enough for the scenario in tx_fifo here anymore.

@jfischer-no , what do you think? Is @tmon-nordic suggestion here to check whether to trigger tx_fifo_work based on whether ring_buf_is_empty() sounds good to you as well?

zephyrbot added the area: USB Universal Serial Bus label Sep 27, 2024

zephyrbot requested review from jfischer-no and tmon-nordic September 27, 2024 10:06

zephyrbot assigned jfischer-no Sep 27, 2024

zeonchew force-pushed the usbd_acm_out_ep_busy_fix branch 2 times, most recently from 1fc6702 to 6ad3ad4 Compare September 27, 2024 13:25

tmon-nordic requested changes Sep 30, 2024

View reviewed changes

zeonchew force-pushed the usbd_acm_out_ep_busy_fix branch from 18778a2 to 27486c5 Compare September 30, 2024 15:22

zeonchew requested a review from tmon-nordic September 30, 2024 15:22

zeonchew force-pushed the usbd_acm_out_ep_busy_fix branch from 27486c5 to bea015b Compare October 1, 2024 07:37

jfischer-no requested changes Oct 1, 2024

View reviewed changes

zeonchew requested a review from jfischer-no October 1, 2024 09:04

zeonchew force-pushed the usbd_acm_out_ep_busy_fix branch from bea015b to 87f93a9 Compare October 3, 2024 08:25

zeonchew force-pushed the usbd_acm_out_ep_busy_fix branch from 87f93a9 to a7aac8d Compare October 3, 2024 08:40

tmon-nordic reviewed Oct 3, 2024

View reviewed changes

zeonchew requested a review from tmon-nordic October 3, 2024 14:25

tmon-nordic mentioned this pull request Oct 4, 2024

usbd: device_next: cdc: add option to enable short packet terminate #79193

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

usbd: device_next: cdc: wait IN EP ready before enqueue data to EP #79104

usbd: device_next: cdc: wait IN EP ready before enqueue data to EP #79104

zeonchew commented Sep 27, 2024

tmon-nordic Sep 30, 2024

zeonchew Sep 30, 2024

tmon-nordic commented Oct 1, 2024

zeonchew commented Oct 1, 2024

tmon-nordic commented Oct 1, 2024

zeonchew commented Oct 1, 2024

jfischer-no left a comment

zeonchew commented Oct 1, 2024

tmon-nordic commented Oct 1, 2024 •

edited

Loading

zeonchew commented Oct 2, 2024

tmon-nordic Oct 3, 2024

zeonchew Oct 3, 2024 •

edited

Loading

tmon-nordic Oct 3, 2024

zeonchew Oct 3, 2024

tmon-nordic Oct 3, 2024

zeonchew Oct 3, 2024

tmon-nordic Oct 3, 2024 •

edited

Loading

tmon-nordic Oct 3, 2024

zeonchew Oct 3, 2024 •

edited

Loading

	data->tx_fifo.altered = false;
	data->rx_fifo.altered = false;

	/*
	* If ring buffer contains more data than what the endpoint could
	* send in one go, raise deferred flag for remaining data to be
	* sent out once current transaction is completed.
	*/
	if (len == buf->size) {
	if (!ring_buf_is_empty(data->tx_fifo.rb)) {
	data->tx_deferred = true;
	}
	}

	if (atomic_test_bit(&data->state, CDC_ACM_IRQ_RX_ENABLED) \|\|
	atomic_test_bit(&data->state, CDC_ACM_IRQ_TX_ENABLED)) {
	data->cb(usbd_class_get_private(c_data), data->cb_data);
	}

usbd: device_next: cdc: wait IN EP ready before enqueue data to EP #79104

Are you sure you want to change the base?

usbd: device_next: cdc: wait IN EP ready before enqueue data to EP #79104

Conversation

zeonchew commented Sep 27, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tmon-nordic commented Oct 1, 2024

zeonchew commented Oct 1, 2024

tmon-nordic commented Oct 1, 2024

zeonchew commented Oct 1, 2024

jfischer-no left a comment

Choose a reason for hiding this comment

zeonchew commented Oct 1, 2024

tmon-nordic commented Oct 1, 2024 • edited Loading

zeonchew commented Oct 2, 2024

Choose a reason for hiding this comment

zeonchew Oct 3, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tmon-nordic Oct 3, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zeonchew Oct 3, 2024 • edited Loading

Choose a reason for hiding this comment

tmon-nordic commented Oct 1, 2024 •

edited

Loading

zeonchew Oct 3, 2024 •

edited

Loading

tmon-nordic Oct 3, 2024 •

edited

Loading

zeonchew Oct 3, 2024 •

edited

Loading