Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

samples: bluetooth: central_iso: fix synchronization of data submission #72622

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

fgrandel
Copy link
Contributor

@fgrandel fgrandel commented May 11, 2024

This PR synchronizes preparation and scheduling of isochronous data to the Bluetooth clock.

It uses TX synchronization info to set and maintain a well-defined phase between the application timer and the bluetooth clock including ongoing drift compensation (see https://github.com/nrfconnect/sdk-nrf/blob/main/samples/bluetooth/iso_time_sync/src/iso_tx.c and the "LE Read ISO TX Sync" HCI command in the Bluetooth Core Spec 5.4, Vol 4 Part E, 7.8.96).

The PR introduces a synchronization approach that works on both, central and peripheral, and could therefore be ported in later PRs to showcase sending synchronized data from multiple peripherals to one central.

Fixes #72452

@fgrandel fgrandel added platform: nRF Nordic nRFx area: Bluetooth ISO Bluetooth LE Isochronous Channels labels May 11, 2024
@fgrandel fgrandel requested review from Thalley and cvinayak May 11, 2024 15:36
@fgrandel fgrandel self-assigned this May 11, 2024
@fgrandel fgrandel force-pushed the fix/72452-ble-iso-sample-sync branch from fb95936 to 916630e Compare May 11, 2024 15:52
@fgrandel fgrandel force-pushed the fix/72452-ble-iso-sample-sync branch 2 times, most recently from d269363 to 84989c5 Compare May 11, 2024 16:35
@fgrandel fgrandel marked this pull request as ready for review May 11, 2024 16:36
@zephyrbot zephyrbot added the area: Samples Samples label May 11, 2024
@fgrandel fgrandel force-pushed the fix/72452-ble-iso-sample-sync branch from 84989c5 to 723196f Compare May 11, 2024 16:43
@fgrandel fgrandel marked this pull request as draft May 12, 2024 08:08
@fgrandel
Copy link
Contributor Author

Update: This approach is too brittle. I tested it under different circumstances and it turns out that my initial assumption was wrong that the connected event would be a stable anchor for synchronization. I'll have to implement the timstamp-based approach right away. (Therefore converted back to Draft.)

@fgrandel fgrandel force-pushed the fix/72452-ble-iso-sample-sync branch from 723196f to 12f1ef9 Compare May 12, 2024 15:43
@fgrandel
Copy link
Contributor Author

fgrandel commented May 12, 2024

Done.

Known issue (cf Discord), IMO to be fixed in a separate PR:

Just wondering why the peripheral's ACK does not trigger the sent-callback immediately. In the example almost a full ISO interval (~10 ms) elapses before the callback is called. This leaves very little time to prepare and send the next SDU. IMHO it would be nicer if one had time to prepare the next SDU triggered by the sent-callback. Is this hard to fix? I saw you already have a TODO for early ACK in the code.

If you don't like the surrounding changes (refactoring, reversal of payload length, etc.) just let me know and I'll remove this unrelated stuff.

@fgrandel fgrandel marked this pull request as ready for review May 12, 2024 15:49
struct bt_iso_tx_info iso_tx_info = {0};
int err;

err = bt_iso_chan_get_tx_sync(chan, &iso_tx_info);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fgrandel , thanks for spending time on improving this sample!
I would recommend to not use timestamps when providing data in this sample.
The increase in sequence number gives the controller enough information about when it is to be scheduled on air. See https://developer.nordicsemi.com/nRF_Connect_SDK/doc/latest/nrfxlib/softdevice_controller/doc/isochronous_channels.html#providing-data for more information on how the Nordic SoftDevice Controller processing incoming HCI ISO Data.

When using timestamps it is possible to provide the data to the controller right before it is going to be sent on air.
Unfortunately, this timing information cannot be obtained in a controller-agnostic way:
The timestamp returned from the HCI LE Read ISO TX Sync command is ambiguous, the returned value depends on how the controller implementation interprets the spec. Also, the command documentation refers to CIG/BIG reference anchor points which may not align with the SDU generation if ISO_Interval != SDU interval (up to the controller). These things make it hard to rely on this return value.
See the open spec errata https://bluetooth.atlassian.net/browse/ES-23138 for more details.

Another note: The timing of the iso_sent is also implementation dependent. It is raised when the controller raises the HCI Number of Completed Packets event. This event may be raised when the controller is ready to receive more data. This may be the case when the payload is scheduled for transmission. That is, even before it is sent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLDR; I see why the sent-callback + HCI LE Read ISO TX Sync + timestamp approach is lacking and depends on controller implementation details. But IMO it is still the only approach that works at all to reliably synchronize ISO channels for the moment being.

Arguments following:

Thanks for the link. Unfortunately I don't have permission to access it - guess that's only for SIG members.

The SDU vs. ISO interval argument is convincing - however as long as all our options are to some degree depending on the controller implementation, I'd still prefer the approach recommended by the core spec. By the time we'll implement more controllers, I guess the spec will have been updated and the errata removed in a somehow backwards compatible way, so we can move on from there.

In Zephyr we can and should decouple from implementation details by wrapping the controller specific parts behind more usable high-level APIs in the meantime. This would also improve developer experience by protecting them from the hacky and error-prone synchronization details. Plus it guarantees a migration path.

The dependence of iso_sent on the controller implementation is also a good argument. But we still need to provide developers with a working solution until the spec does so. That's AFAICS, why Nordic recommends the HCI LE Read ISO TX Sync + timestamp solution for the moment being and uses the send callback as a trigger. Also see the recommended approach in the link you provided:

This is the preferred way of providing data to the SoftDevice Controller and guarantees the highest degree of control.

Re sent-callback latency: Introducing a delay of one ISO interval w/o technical need is not a good way of interpreting the slack in the spec IMO. We should at least provide best-effort latency as far as technically possible - even more as it seems to be the only workable solution for deterministic synchronization for the time being. So the controller really should deliver some sufficient degree of latency guarantees beyond what is required by the spec for this specific callback/HCI command until the spec provides a workable solution itself.

The seq-number and "time-of-arrival" approaches don't work unless you admit at least two ISO intervals of extra latency to avoid systematic package loss (as demonstrated by the bug I'm fixing here). This I find inacceptable because it counters the concept of ISO channels with unframed PDUs and deterministic latency. Also see the latency promises of the CIG config API in Zephyr which will not be met. You need to add at least two sequence numbers all the time to ensure that:

  1. the initially (but non-deterministically) skipped event gets accounted for.
  2. source-to-CIG anchor point offset doesn't play a role even if you happen to synchronize too closely to the next anchor point so that the controller will pick up your SDUs too late.

In addition the "time-of-arrival" approach is not consistent with the spec AFAICS as flushing cannot be controlled deterministically by the application. It's the point of the ISO protocol that you can synchronize your source with the sink with deterministic latency and time-to-live of data packets.

On the peripheral both approaches don't work at all because neither the time-of-arrival nor the seq number approach are able to account for drift of the peripheral's clock. This cannot work by design if you want well-synchronized data across devices. This also explains why the (local) timer approach does systematically not work.

In our case for example we do decentralized measurements on peripherals in sync with the central's distributed clock across several devices and do explicitly want those measurements be sent out in the same CIG event with deterministic latency so we don't need timestamps and get cheap syntonization of all peripherals to the master clock w/o having to implement our own syntonization approach.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the peripheral both approaches don't work at all because neither the time-of-arrival nor the seq number approach are able to account for drift of the peripheral's clock. This cannot work by design if you want well-synchronized data across devices. This also explains why the (local) timer approach does systematically not work.

Fun fact: An ISO peripheral from BT Core spec 5.4 is not aware of the SDU interval and can actually never transmit correctly (reliably) 8)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another comment for the use of bt_iso_chan_get_tx_sync: When we use it like this here, we can use the current response to compare with the previous response. IIRC (and please correct me @rugeGerritsen) the return values of bt_iso_chan_get_tx_sync are only when an SDUs have been sent (or scheduled for transmission). In the case that the timestamp is too low or the sequence number is our of order, the return value of the current and previous should be the same.

e.g.

  1. App send SDU_1 with TS_1 and PSN_1
  2. App gets sent callback and gets TS_1 and PSN_1 from bt_iso_chan_get_tx_sync
  3. App sends SDU_2 wit TS_1 and PSN_2 (or TS_2 and PSN_1)
  4. Controller reject this as invalid (which not all controllers may do), but still triggers a Number of Completed Packets event
  5. App gets sent callback and gets TS_1 and PSN_1 from bt_iso_chan_get_tx_sync (since it was not transmitted or scheduled for transmission)
  6. App is now aware that the previous TX went wrong and could attempt to fix by increasing PSN or TS accordingly

Copy link
Contributor

@cvinayak cvinayak May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case that the timestamp is too low or the sequence number is our of order, the return value of the current and previous should be the same.

Only when the Controller has received a SDU from Host, the returned Sequence Number and Timestamp correspond to the last transmitted or scheduled SDU are calculated.

  1. If bt_iso_chan_get_tx_sync is called multiple times without a SDU enqueued, it will return same Sequence Number and Timestamp corresponding to the instant the previous bt_iso_chan_send was called.
  2. Every bt_iso_chan_get_tx_sync after bt_iso_chan_send call will return the Sequence Number and Timestamp of the last transmitted or scheduled SDU. Hence, for the consecutive call to bt_iso_chan_send, the next Sequence Number (returned value plus one) and next Timestamp (returned value plus one SDU interval) be provided to bt_iso_chan_send_ts.
  3. If bt_iso_chan_get_tx_sync followed by bt_iso_chan_send_ts is called in the sent-callback, then the returned Sequence Number and Timestamp could be stale depending on the SDU intervals elapsed from the last call of bt_iso_chan_send_ts to when the current sent-callback executes. If sufficient SDU buffers into the future has been already provided to the Controller, then bt_iso_chan_get_tx_sync will continue to return future Sequence Number and Timestamp ensuring new SDU enqueued to the Controller are not stale.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to implement a drift compensation algorithm now that:

  • does no longer rely on the sent callback
  • does not rely on the exact point in time that controllers store the TX sync information (even if the sync timestamp lies in the future)
  • should continue to work if SDU interval != ISO interval
  • uses an algorithm that is able to calculate the SDU interval even on a peripheral
  • works with BT > 1 and FT > 1
  • tolerates a certain amount of package loss
  • compensates for initial non-standard delay of the second SDU (split controller ISOAL bug?)
  • uses seq_num only when sending the sdu (no more timestamp dependency)

For obvious reasons I can not compensate for TX timestamp specification ambiguities but as I have to rely on hardware-specific timing anyway as long as the bluetooth clock is not exposed by the API in some hardware-independent way, this is not a problem in practice as the split controller defines the timestamp in the same way as the Nordic SD controller AFAICS. Let's hope that by the time more hardware is being supported by the sample, this ambiguity will have been fixed.

An ISO peripheral from BT Core spec 5.4 is not aware of the SDU interval

Hrm, I think I found a way to derive the SDU interval from channel info that is available both, on the central and the peripheral:

Transport_Latency = CIG_Sync_Delay + FT x ISO_Interval - SDU_Interval, see Bluetooth Core, Version 5.4, Vol 6, Part G, Section 3.2.2.

Therefore:
SDU_Interval = CIG_Sync_Delay + FT x ISO_Interval - Transport_Latency

These parameters are available from the HCI LE CIS Established event.

@cvinayak
Copy link
Contributor

cvinayak commented May 13, 2024

Done.

Known issue (cf Discord), IMO to be fixed in a separate PR:

Just wondering why the peripheral's ACK does not trigger the sent-callback immediately. In the example almost a full ISO interval (~10 ms) elapses before the callback is called. This leaves very little time to prepare and send the next SDU. IMHO it would be nicer if one had time to prepare the next SDU triggered by the sent-callback. Is this hard to fix? I saw you already have a TODO for early ACK in the code.

If you don't like the surrounding changes (refactoring, reversal of payload length, etc.) just let me know and I'll remove this unrelated stuff.

Zephyr Controller follows a similar design (implementation choice) to ACL Sending Data for Sending ISO data too, refer to Bluetooth Core Specification Version 5.4, Vol 6, Part D, Section 6.1 Sending Data. In this design, the Number of Completed Packets is generated when corresponding buffer space is freed when acknowledged and/or flushed (in the case of CIS in the Zephyr Controller). But in the implementation of ISO ack/flush, the Zephyr Controller implementation reuses the code (control path) that is determining stale PDUs hence the Number of Completed Packets is delayed by one ISO Interval. An optimization is good to have, but not necessary, as the Specification states (Bluetooth Core Specification Version 5.4, Vol 4, Part E, Section 7.7.19 Number Of Completed Packets event):
While the Controller has HCI Data packets or HCI ISO Data packets in its buffer, it shall keep sending the HCI_Number_Of_Completed_Packets event to the Host at least periodically, until it finally reports that all the pending packets have been completed. The rate with which this event is sent is manufacturer specific.

Hence, an application should not rely on the rate at which the sent callback is called. Application will need to ensure there is enough buffers available/configured in the Controller to be able to provide the ISO data with corresponding sequence numbers for the future SDU intervals.

Zephyr Controller also needs that flush timeout is considered when determining the buffer number, as the implementation does not generate Number of Complete Packets until they are flushed under bad transmissions and NACK conditions. i.e for FT=2, Zephyr Controller needs 5 HCI ISO Data buffers.

Copy link
Collaborator

@Thalley Thalley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As others have pointed out, proper synchronization of ISO data over HCI as a generic host (i.e. not knowing what controller we use) is pretty hard (actually impossible, but can be done "good enough").

The use and requirements of both sequence numbers and timestamps are pretty vague / implementation specific in the core spec as per 5.4 unfortunately. Hopefully this will be improved in the future.

Some comments that need to be addressed

*/
static void iso_timer_timeout(struct k_work *work)
static void iso_send_sdu(uint32_t seq_num, int32_t next_sdu_ts)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static void iso_send_sdu(uint32_t seq_num, int32_t next_sdu_ts)
static void iso_send_sdu(uint32_t seq_num, uint32_t next_sdu_ts, bool is_first_sdu)

I think this will make more sense. There's not really any reason to encode the seq_num and "is_first" into the same variable, and since -1 was only called one place this seems like a better solution to me.

Furthermore, the timestamp is a uint32_t, so for this to have worked the variable should otherwise have been a int64_t.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach is now obsolete, so should be resolved. I returned to a timer-based approach as recommended.

static uint8_t buf_data[CONFIG_BT_ISO_TX_MTU];
static bool data_initialized;
static size_t len_to_send = ARRAY_SIZE(buf_data);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static size_t len_to_send = ARRAY_SIZE(buf_data);
static const size_t len_to_send = ARRAY_SIZE(buf_data);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

len_to_send is now updated again, so this should be obsolete, please check if I'm not being confused here. ;-)

samples/bluetooth/central_iso/src/main.c Outdated Show resolved Hide resolved
Comment on lines 156 to 325
.connected = iso_connected,
.sent = iso_sent,
.disconnected = iso_disconnected,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.connected = iso_connected,
.sent = iso_sent,
.disconnected = iso_disconnected,
.connected = iso_connected,
.disconnected = iso_disconnected,
.sent = iso_sent,

or

Suggested change
.connected = iso_connected,
.sent = iso_sent,
.disconnected = iso_disconnected,
.connected = iso_connected,
.disconnected = iso_disconnected,
.sent = iso_sent,

Personal opinion, but it kind of stood out with sent in the middle of the connected and disconnected callbacks, and also not using the same indentation scheme :D

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sent callback is no longer needed, so I returned to the previous code. This should therefore be obsolete.

@fgrandel
Copy link
Contributor Author

An optimization is good to have, but not necessary, as the Specification states (Bluetooth Core Specification Version 5.4, Vol 4, Part E, Section 7.7.19 Number Of Completed Packets event):

Agreed. But there are good arguments why it should be improved regardless of what the spec allows (see my response to @rugeGerritsen above). This is required to cover up for weaknesses in the spec in other places to achieve a workable solution at all. Guess you agree that it is also not against the spec to improve latency on the sent-callback. Gaining one ISO interval should provide us with enough leeway so that the sent-callback latency is deterministic enough under all practical circumstances to synchronize ISO channels even if HCI works over reasonable remote links. Can you agree with that statement?

samples/bluetooth/central_iso/src/main.c Outdated Show resolved Hide resolved
struct bt_iso_tx_info iso_tx_info = {0};
int err;

err = bt_iso_chan_get_tx_sync(chan, &iso_tx_info);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another comment for the use of bt_iso_chan_get_tx_sync: When we use it like this here, we can use the current response to compare with the previous response. IIRC (and please correct me @rugeGerritsen) the return values of bt_iso_chan_get_tx_sync are only when an SDUs have been sent (or scheduled for transmission). In the case that the timestamp is too low or the sequence number is our of order, the return value of the current and previous should be the same.

e.g.

  1. App send SDU_1 with TS_1 and PSN_1
  2. App gets sent callback and gets TS_1 and PSN_1 from bt_iso_chan_get_tx_sync
  3. App sends SDU_2 wit TS_1 and PSN_2 (or TS_2 and PSN_1)
  4. Controller reject this as invalid (which not all controllers may do), but still triggers a Number of Completed Packets event
  5. App gets sent callback and gets TS_1 and PSN_1 from bt_iso_chan_get_tx_sync (since it was not transmitted or scheduled for transmission)
  6. App is now aware that the previous TX went wrong and could attempt to fix by increasing PSN or TS accordingly

samples/bluetooth/central_iso/src/main.c Outdated Show resolved Hide resolved
struct bt_iso_tx_info iso_tx_info = {0};
int err;

err = bt_iso_chan_get_tx_sync(chan, &iso_tx_info);
Copy link
Contributor

@cvinayak cvinayak May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case that the timestamp is too low or the sequence number is our of order, the return value of the current and previous should be the same.

Only when the Controller has received a SDU from Host, the returned Sequence Number and Timestamp correspond to the last transmitted or scheduled SDU are calculated.

  1. If bt_iso_chan_get_tx_sync is called multiple times without a SDU enqueued, it will return same Sequence Number and Timestamp corresponding to the instant the previous bt_iso_chan_send was called.
  2. Every bt_iso_chan_get_tx_sync after bt_iso_chan_send call will return the Sequence Number and Timestamp of the last transmitted or scheduled SDU. Hence, for the consecutive call to bt_iso_chan_send, the next Sequence Number (returned value plus one) and next Timestamp (returned value plus one SDU interval) be provided to bt_iso_chan_send_ts.
  3. If bt_iso_chan_get_tx_sync followed by bt_iso_chan_send_ts is called in the sent-callback, then the returned Sequence Number and Timestamp could be stale depending on the SDU intervals elapsed from the last call of bt_iso_chan_send_ts to when the current sent-callback executes. If sufficient SDU buffers into the future has been already provided to the Controller, then bt_iso_chan_get_tx_sync will continue to return future Sequence Number and Timestamp ensuring new SDU enqueued to the Controller are not stale.

}

iso_send_sdu(iso_tx_info.seq_num + 1, iso_tx_info.ts + INTERVAL_US);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
iso_send_sdu(iso_tx_info.seq_num + 1, iso_tx_info.ts + INTERVAL_US);
iso_send_sdu(iso_tx_info.seq_num + 1U, iso_tx_info.ts + INTERVAL_US);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach has been reverted, so should be resolved.

Comment on lines 57 to 113
if (unlikely(is_first_sdu)) {
ret = bt_iso_chan_send(&iso_chan, buf, seq_num);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If is_first_sdu, call bt_iso_chan_send once then followed by call to bt_iso_chan_get_tx_sync and bt_iso_chan_send_ts consecutively ((iso_tx.rtn + 1U) * 2U) times, so as to ensure the Controller has sufficient SDUs buffered before sent-callback under the case of retransmissions.

Give this a try under noisy conditions or when devices are far away. Let me know if the Sequence Number and Timestamp get back in sync when the radio condition get better, when devices move back close.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe, I now implemented an approach that should tolerate package loss. If it doesn't 100% I hope that my new approach has enough merit that it can be merged anyway. Should I experience problems with synchronization under package loss in practical usage in the future I'll open up an other PR if you don't mind.

@zephyrbot zephyrbot requested a review from thoh-ot May 30, 2024 15:59
@fgrandel fgrandel force-pushed the fix/72452-ble-iso-sample-sync branch 4 times, most recently from 74a1bce to eb8f0b6 Compare May 31, 2024 06:24
fgrandel and others added 10 commits May 31, 2024 08:29
Previously the BT_CTLR_CONN_ISO_SDU_LEN_MAX value defaulted to the
"magic" number 251 which actually is the default MTU. We make this
dependency explicit (and thereby also transparently configurable) by
setting the default to BT_ISO_TX_MTU explicitly.

This also reduces the probability of inadvertent configuration errors or
inconsistencies.

Signed-off-by: Florian Grandel <[email protected]>
Fixes a minor typo in the Zephyr BT controller configuration overlay.

Signed-off-by: Florian Grandel <[email protected]>
Introduces reversed xmas-tree format to local variables before making
changes to them for better readability of subsequent commits.

Signed-off-by: Florian Grandel <[email protected]>
The change replaces magic numbers in the ISO CIG configuration by
preprocessor definitions.

Signed-off-by: Florian Grandel <[email protected]>
Replaces static constant variables by preprocessor definitions for
improved readability and (minor) RAM usage improvement.

Signed-off-by: Florian Grandel <[email protected]>
Remove CONFIG_BT_CTRL_SCAN_DATA_LEN_MAX: The sample does not employ
extended scanning.

Remove CONFIG_BT_CTLR_CONN_ISO_SDU_LEN_MAX: This setting defaults to
BT_ISO_TX_MTU which is set in the controller-independent prj.conf file
and should default to that value.

Signed-off-by: Florian Grandel <[email protected]>
The sample used deferred work to schedule ISO SDUs. The deferred work
API does not guarantee deterministic timing, though.

This commit switches to kernel timers which allow for deterministic
periodic or one-off timing as long as timers are re-scheduled in the
timer callback which runs directly in ISR context.

Signed-off-by: Florian Grandel <[email protected]>
This change introduces a drift compensation algorithm that keeps SDU
preparation synchronized with the bluetooth clock in a role, HCI
transport and controller agnostic way based on standard HCI commands.

Signed-off-by: Florian Grandel <[email protected]>
This change increases the max SDU size to twice the available max PDU
size which results in a burst number (BN) larger than one.

The sample thereby demonstrates that the drift compensation algorithm
can deal with BN > 1.

Signed-off-by: Florian Grandel <[email protected]>
Setting the SDU latency parameter to a value larger than the SDU
interval results in a flush timeout (FT) larger than one if at the same
time the ISO policy is switched from "low latency" to "reliability".

This demonstrates that the synchronization algorithm can deal with
FT > 1.

Signed-off-by: Florian Grandel <[email protected]>
@fgrandel fgrandel force-pushed the fix/72452-ble-iso-sample-sync branch from eb8f0b6 to a2f74de Compare May 31, 2024 06:29
*
* @brief Bluetooth clock hardware abstraction for nRF5x
*
* @details Currently, there is no way to access the Bluetooth clock
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you have seen earlier, the timestamps in the HCI packets are defined by the controller.
For the Zephyr open source link layer, the timestamp maps directly to RTC0 ticks. For the SoftDevice Controller the timestamps also takes wrapping into account. Therefore, the approach taken here will only work until the RTC wraps (8 minutes). Also, the approach taken here will only work when run on a single-core 52 series device. That is, it will not work on the nRF53 app core.

Looking at this it looks like the only feasible way forward:

  • Remove reading out the controller clock
  • Never issue SDUs with timestamps.
  • Not use timers to trigger submission of SDUs, just push them to controller as quickly as possible.

The drawback will be that the total end-to-end latency becomes slightly longer.
If we want to optimize for reduced latency, controller-specific assumptions are needed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @rugeGerritsen here

As the current BT Core spec works w.r.t. ISO over HCI, we can't really make any (good) assumptions, and the best we really can do is just to always enqueue at least 2 SDUs and TX as fast as possible.

There are no guarantees we can make without knowing the specific controller.

I've recently refactored TX for a similar sample: #73406

Effectively moving a timer-based approach to just sending as fast and much as possible, based on the available buffers.

Keep in mind that the buffer handling will also change as part of #72090

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, see my comment re next steps.

Copy link
Collaborator

@Thalley Thalley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a lot of good changes in this PR, but not sure if some of the platform specific ones are good to have.

This is, after all, a sample and not a full application.

Comment on lines +2 to +3
# Max 2 PDUs per SDU (BN = 2)
CONFIG_BT_ISO_TX_MTU=494
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand this.

CONFIG_BT_ISO_TX_MTU effectively reflects the maximum size of the SDUs. The SDUs can be any value below this, and thus increasing this size does not necessarily affect the number of PDUs per SDU.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sample now counts up to 494 bytes, so two PDUs will effectively be required to transfer that SDU (as the MAX PDU size payload is limited in the sample to exactly half of this value). This is required to prove that timing will work even if SDU interval != ISO interval.

Comment on lines +23 to +24
/* Currently only nRF5 hardware is supported by this sample. */
#include "bt_clock_hal_nrf5.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least in upstream. I don't think we should make too many assumptions of how samples may be used by other projects, and samples should not (without proper guards at least) assume or require any specific hardware.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, see my comment re next steps.

NET_BUF_POOL_FIXED_DEFINE(tx_pool, 1, BT_ISO_SDU_BUF_SIZE(CONFIG_BT_ISO_TX_MTU),

static struct bt_iso_tx_info last_iso_tx_sync_info;
static inline bool is_synchronized(void)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static inline bool is_synchronized(void)
static bool is_synchronized(void)

inlines are rarely an improvement, and should generally be left to the compiler

@@ -945,7 +945,7 @@ config BT_CTLR_CONN_ISO_SDU_LEN_MAX
int "Maximum Connected Isochronous Channel SDU Length"
depends on BT_CTLR_CONN_ISO
range 1 4095
default 251
default BT_ISO_TX_MTU
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cvinayak is BT_CTLR_CONN_ISO_SDU_LEN_MAX including the size of the bt_hci_iso_sdu_hdr/bt_hci_iso_sdu_ts_hdr headers?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is SDU length (no HCI layer)

@fgrandel
Copy link
Contributor Author

fgrandel commented Jun 2, 2024

@Thalley @rugeGerritsen First of all: Thx again for your extensive and fast review. Highly appreciated!!

Before I try to fix remaining test failures let's see how and whether I can progress in a way that fits both, your needs and mine.

I totally agree with your feedback re hardware-specific code in the sample (and more or less expected it). OTOH sending data as fast as possible is really a boring solution that wouldn't motivate me to further spend my free time (as I obviously scratch my own itch here and the existing HW specific solution is totally fine for what we need).

Plus sending data as fast as possible and relying on (variable) backpressure and non-deterministic latency from the stack defeats the very idea of isochronous data sources.

Therefore I'd like to propose something different:

I've been thinking about a generic clock syntonization (aka drift compensation) framework for a long time already. See the linked resource for a definition of clock syntonization (as opposed to synchronization).

I already verified most elements of this approach within my (currently stalled) 802.15.4 TSCH protocol project. And I still believe that a general clock source (counter management/syntonization/synchronization) framework would really be a great and long overdue addition to Zephyr. And I might be able to make it useful in this case.

Basically I propose to buffer local timestamps until I get corresponding TX sync info via HCI. Once I get that info I can syntonize the local clock to the remote clock via a configurable syntonization strategy within the clocksource framework. Initially I'd use Zephyr's existing clock syntonization algorithm internally. But as the syntonization approach will be pluggable, something like a PID controller could be added any time and re-used across the system (as is required for more precise clock syntonization and quicker convergence, e.g. in PTP).

Assuming arbitrary latency between data preparation and data sending but limited jitter (which I'd expect to be a reasonable assumption, even in the HCI over UART case) such an approach would make the solution completely hardware / HCI agnostic and still optimize latency and (more importantly) provide stable isochronous timing of measurements across an arbitrary number of peripherals which is the very idea of isochronous channels.

The syntonization framework implementation would be kept locally to the network subsystem for the time being and would leverage the existing k_time (ns precision timestamp) framework already introduced there for network timestamps. The sample would only use the API and could therefore be simplified. Would that be a compromise acceptable upstream?

@Thalley
Copy link
Collaborator

Thalley commented Jun 3, 2024

I totally agree with your feedback re hardware-specific code in the sample (and more or less expected it). OTOH sending data as fast as possible is really a boring solution that wouldn't motivate me to further spend my free time (as I obviously scratch my own itch here and the existing HW specific solution is totally fine for what we need).

I think board specific behavior is something that fits in an application, but not necessarily a sample. If in a somewhat generic way that does not make the sample depend on it and does not make the code hard to follow, I would be OK with having board (or platform) dependent code here (so it's not an automatic rejection).

Plus sending data as fast as possible and relying on (variable) backpressure and non-deterministic latency from the stack defeats the very idea of isochronous data sources.

Therefore I'd like to propose something different:

I've been thinking about a generic clock syntonization (aka drift compensation) framework for a long time already. See the linked resource for a definition of clock syntonization (as opposed to synchronization).

I already verified most elements of this approach within my (currently stalled) 802.15.4 TSCH protocol project. And I still believe that a general clock source (counter management/syntonization/synchronization) framework would really be a great and long overdue addition to Zephyr. And I might be able to make it useful in this case.

Basically I propose to buffer local timestamps until I get corresponding TX sync info via HCI. Once I get that info I can syntonize the local clock to the remote clock via a configurable syntonization strategy within the clocksource framework. Initially I'd use Zephyr's existing clock syntonization algorithm internally. But as the syntonization approach will be pluggable, something like a PID controller could be added any time and re-used across the system (as is required for more precise clock syntonization and quicker convergence, e.g. in PTP).

Assuming arbitrary latency between data preparation and data sending but limited jitter (which I'd expect to be a reasonable assumption, even in the HCI over UART case) such an approach would make the solution completely hardware / HCI agnostic and still optimize latency and (more importantly) provide stable isochronous timing of measurements across an arbitrary number of peripherals which is the very idea of isochronous channels.

The syntonization framework implementation would be kept locally to the network subsystem for the time being and would leverage the existing k_time (ns precision timestamp) framework already introduced there for network timestamps. The sample would only use the API and could therefore be simplified. Would that be a compromise acceptable upstream?

That sounds like a very interesting project, and could be a nice way of doing this.
However, the fact of the matter is that HCi does not provide any guarantees whatsoever with any timing.

The timestamps generated by the controller is basically "whatever it wants":
image

In my eyes, the main issue is that you cannot do anything "good" (e.g. proper use of timestamps) as a generic host. ISO over HCI is honestly just pretty terribly designed for generic hosts such as Zephyr. If* (and that's a big if) we want the sample here to assume anything about a controller, it should be based on the Zephyr controller.

If you want to go the road of improving and optimizing TX, then we should add this sample to a babblesim tests in CI, and run tests there that can verify that such optimizations work, and will also ensure that they keep working by adding some pass criteria

@fgrandel
Copy link
Contributor Author

@Thalley @rugeGerritsen I have to apologize again for being inactive for such a long time. I've not given up on this ticket and will work on it again as soon as I've time.

Copy link

This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note, that you can always re-open a closed pull request at any time.

@github-actions github-actions bot added the Stale label Aug 17, 2024
@github-actions github-actions bot closed this Aug 31, 2024
@fgrandel fgrandel reopened this Sep 1, 2024
@fgrandel fgrandel removed the Stale label Sep 1, 2024
@kruithofa kruithofa removed their request for review September 3, 2024 05:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

samples: Bluetooth: central_iso: Desynchronized submission of data
5 participants