Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bluetooth Controller assertion on sys_reboot() with active connections (lll_preempt_calc: Actual EVENT_OVERHEAD_START_US) #52833

Closed
caco3 opened this issue Dec 6, 2022 · 14 comments
Assignees
Labels
area: Bluetooth Controller area: Bluetooth bug The issue is a bug, or the PR is fixing a bug Regression Something, which was working, does not anymore

Comments

@caco3
Copy link
Contributor

caco3 commented Dec 6, 2022

Describe the bug
We have a BLE application based on a nrf52.
With Zephyr 3.2.0 it runs stable, we didn't see any crashes for a long time.
Now we tried the latest commit da23050 and we sporadically see crashes when we try to restart the sensor (either using sys_reboot(SYS_REBOOT_COLD) or NVIC_SystemReset()).

[00:00:06.817,962] \x1b[1;31m<err> peripherals: restart: \n\n*** RESTART ***\n
ASSERTION FAIL [0] @ WEST_TOPDIR/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:473\r\n
\tlll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 2044\r\n
[00:00:06.844,299] \x1b[1;31m<err> os: hard_fault: ***** HARD FAULT *****
[00:00:06.851,074] \x1b[1;31m<err> os: hard_fault:   Fault escalation (see below)
[00:00:06.858,551] \x1b[1;31m<err> os: hard_fault: ARCH_EXCEPT with reason 3\n

[00:00:06.865,722] \x1b[1;31m<err> os: esf_dump: r0/a1:  0x00000003  r1/a2:  0x20016698  r2/a3:  0x20016698
[00:00:06.875,579] \x1b[1;31m<err> os: esf_dump: r3/a4:  0x200169c8 r12/ip:  0x00000000 r14/lr:  0x00002c13
[00:00:06.885,406] \x1b[1;31m<err> os: esf_dump:  xpsr:  0x61000028
[00:00:06.891,693] \x1b[1;31m<err> os: esf_dump: Faulting instruction address (r15/pc): 0x00002c1e
[00:00:06.900,726] \x1b[1;31m<err> os: z_fatal_error: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:00:06.909,912] \x1b[1;31m<err> os: z_fatal_error: Fault during interrupt handling\n

[00:00:06.917,877] \x1b[1;31m<err> os: z_fatal_error: Current thread: 0x20016698 (unknown)
[00:00:06.926,208] \x1b[1;31m<err> os: k_sys_fatal_error_handler: Halting system

Relevant code:

oid restart(void) {   
    err = bt_conn_disconnect(bt_connection, BT_HCI_ERR_REMOTE_USER_TERM_CONN);

    log_panic(); // Disable log buffer, all messages get flushed immediately

    LOG_ERR("\n\n*** RESTART ***\n");

    sys_reboot(SYS_REBOOT_COLD);
    NVIC_SystemReset(); // Alternative call in case above does not work
    k_msleep(100);
}

Prj.conf:

## Optimization
CONFIG_SPEED_OPTIMIZATIONS=y
CONFIG_BT_CTLR_OPTIMIZE_FOR_SPEED=y
CONFIG_COMPILER_OPT="-Os"

## Assert Check
CONFIG_ASSERT=y

## Bluetooth
CONFIG_BT=y
CONFIG_BT_CTLR_LLCP_LOCAL_PROC_CTX_BUF_NUM=6
CONFIG_BT_AUTO_PHY_UPDATE=y
CONFIG_BT_AUTO_DATA_LEN_UPDATE=y

## BLE
CONFIG_HEAP_MEM_POOL_SIZE=2048
CONFIG_BT_BUF_ACL_RX_SIZE=254
CONFIG_BT_BUF_EVT_RX_SIZE=255
CONFIG_BT_BUF_CMD_TX_SIZE=255

CONFIG_BT_GATT_CLIENT=y
CONFIG_BT_ATT_PREPARE_COUNT=2
CONFIG_BT_CONN_TX_MAX=255
CONFIG_BT_L2CAP_TX_BUF_COUNT=10
CONFIG_BT_L2CAP_TX_MTU=247
CONFIG_BT_BUF_ACL_RX_SIZE=251
CONFIG_BT_L2CAP_DYNAMIC_CHANNEL=y
CONFIG_BT_CTLR_PHY_2M=y
CONFIG_BT_BUF_ACL_TX_COUNT=10
CONFIG_BT_BUF_ACL_TX_SIZE=251

CONFIG_BT_USER_PHY_UPDATE=y
CONFIG_BT_USER_DATA_LEN_UPDATE=y

CONFIG_BT_CTLR_DATA_LENGTH_MAX=251
CONFIG_BT_CTLR_XTAL_THRESHOLD=500000

CONFIG_BT_PHY_UPDATE=y
CONFIG_BT_DATA_LEN_UPDATE=y

CONFIG_ZERO_LATENCY_IRQS=y
CONFIG_BT_CTLR_ZLI=y
CONFIG_BT_CTLR_ULL_HIGH_PRIO=2
CONFIG_BT_CTLR_ADVANCED_FEATURES=y

# Mcumgr SMP Service
CONFIG_MCUMGR_SMP_BT=y
CONFIG_MCUMGR_SMP_BT_AUTHEN=n

# MCUMgr Functionality
CONFIG_MCUMGR=y
CONFIG_MCUMGR_CMD_IMG_MGMT=y
CONFIG_MCUMGR_CMD_OS_MGMT=y

To Reproduce

Expected behavior
no assert

Impact
showstopper

Logs and console output

Environment (please complete the following information):

  • Target: nrf52840
  • OS: Linux (Ubuntu)
  • Toolchain: Zephyr SDK 0.15.0
  • Commit SHA or Version used: v3.2.99 (da23050)

Additional context

@caco3 caco3 added the bug The issue is a bug, or the PR is fixing a bug label Dec 6, 2022
@caco3
Copy link
Contributor Author

caco3 commented Dec 6, 2022

@cvinayak FYI

@caco3 caco3 changed the title HARD FAULT on restarting sensor (lll.c:473) HARD FAULT on restarting sensor (lll_preempt_calc: Actual EVENT_OVERHEAD_START_US) Dec 6, 2022
@nordicjm nordicjm added the Regression Something, which was working, does not anymore label Dec 7, 2022
@cvinayak
Copy link
Contributor

cvinayak commented Dec 7, 2022

The encountered assertion check was added recently to catch skipped BLE radio events that could happen due to any externally introduced latencies.

Does this only happen on restart? Please check if waiting for disconnected() helps. Or alternately, you can use bt_disable() interface before restart.

@caco3
Copy link
Contributor Author

caco3 commented Dec 8, 2022

Thanks for the quick response!

We only saw it on the restart() and one other code call (Power Off function) where we also call bt_conn_disconnect() beforehand. The Power Off function just uses an I2C call to an external chip to tell it to switch off the power.
I added now bt_disable() after both calls to bt_conn_disconnect() and did not see any other Hard Faults.

It looks to me that the issue is not related to the sys_reboot() and NVIC_SystemReset() calls but due to the skipped BLE radio events after the bt_conn_disconnect(). If you agree please feel free to update the issue title.

For my point of view this issue can be closed.

@cvinayak cvinayak changed the title HARD FAULT on restarting sensor (lll_preempt_calc: Actual EVENT_OVERHEAD_START_US) Bluetooth Controller assertion on sys_reboot() with active connections (lll_preempt_calc: Actual EVENT_OVERHEAD_START_US) Dec 8, 2022
@cvinayak
Copy link
Contributor

cvinayak commented Dec 8, 2022

Thank you for verifying the solution.

@cvinayak cvinayak closed this as completed Dec 8, 2022
@cvinayak
Copy link
Contributor

As you use CONFIG_BT_CTLR_ZLI=y, this may be interesting: #55114

@caco3
Copy link
Contributor Author

caco3 commented Jun 12, 2023

@cvinayak Thanks for the notification.
I retested now with Zephyr 3.4-RC2 where I expect #55114 to be contained.

How ever I am still able to provoke this error sporadically:

ASSERTION FAIL [0] @ ZEPHYR_BASE/subsys/bluetooth/controller/ll_sw/nordic/lll/lll.c:473
        lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[00:27:40.987,335] <err> os: ***** HARD FAULT *****
[00:27:40.987,335] <err> os:   Fault escalation (see below)
[00:27:40.987,365] <err> os: ARCH_EXCEPT with reason 3

[00:27:40.987,396] <err> os: r0/a1:  0x00000003  r1/a2:  0x00000000  r2/a3:  0x00000000
[00:27:40.987,426] <err> os: r3/a4:  0x00000000 r12/ip:  0x20023a34 r14/lr:  0x0000361d
[00:27:40.987,457] <err> os:  xpsr:  0x41000011
[00:27:40.987,487] <err> os: Faulting instruction address (r15/pc): 0x00003628
[00:27:40.987,518] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:27:40.987,548] <err> os: Fault during interrupt handling

[00:27:40.987,609] <err> os: Current thread: 0x200168f8 (unknown)
[00:27:41.119,689] <err> os: Halting system

The way I am able provoke it sporadically (after 30 minutes or more) is to have my measurement system active (meaning I am sending notifications at a high sampling rate of 333 notifications/second) and adittionally running GATT read/writes as fast as possible.

@cvinayak
Copy link
Contributor

The assertion you see could very well be due to any latencies introduced from any user/application ISRs in ZLI priority.

Is this assertion be reproducible without any user ISRs in ZLI (or highest priority level)?

@caco3
Copy link
Contributor Author

caco3 commented Jun 12, 2023

hmm...

To my understanding, I have no ISRs using ZLI.

proj.conf:

CONFIG_ZERO_LATENCY_IRQS=y
CONFIG_BT_CTLR_ZLI=y
CONFIG_BT_CTLR_ULL_HIGH_PRIO=2
CONFIG_BT_CTLR_ADVANCED_FEATURES=y

And I only have to following relevant ISR:

/* Priority of Timer 2 (ADC) interrupt
 * It must be below the BLE Radio IRQs (0) and above any other IRQ. */
#define ADC_DMA_TIMER_IRQ_PRIORITY 1

#define SPIM_TRANSFER_COUNT_TIMER_INSTANCE  2   // Increment on every SPIM3 transfer,
#define SPIM_TRANSFER_COUNT_TIMER_IRQ  NRFX_CONCAT_3(TIMER, SPIM_TRANSFER_COUNT_TIMER_INSTANCE, _IRQn)
#define SPIM_TRANSFER_COUNT_TIMER_ISR  NRFX_CONCAT_3(TIMER, SPIM_TRANSFER_COUNT_TIMER_INSTANCE, _IRQHandler)

IRQ_DIRECT_CONNECT(SPIM_TRANSFER_COUNT_TIMER_IRQ, ADC_DMA_TIMER_IRQ_PRIORITY, SPIM_TRANSFER_COUNT_TIMER_ISR, 0);

ISR_DIRECT_DECLARE(SPIM_TRANSFER_COUNT_TIMER_ISR) {
   // ...
}

According to https://docs.zephyrproject.org/latest/kernel/services/interrupts.html#zero-latency-interrupts I would have to use IRQ_ZERO_LATENCY inside a IRQ_CONNECT/IRQ_DIRECT_CONNECT call to use ZLI for them, but I don't.

In the ISR itself I use several k_work_submit() to schedule BLE taskse, and a k_sem_give() to wake up a thread, but I don't call any BLE functionallity directly.

@cvinayak
Copy link
Contributor

I hope you have the default peripheral interrupt level value greater than 2? i.e. use lower priorities for other SoC peripherals used in your application by having the below

#define NRF_DEFAULT_IRQ_PRIORITY 5

inside your application folder in dts/arm/nordic/override.dtsi file.

In spite these, you still get the assertion, then the value of EVENT_OVERHEAD_START_US has to be increased from 275 to 306 or higher as reported by the assertion when it happens.

@caco3
Copy link
Contributor Author

caco3 commented Jun 13, 2023

Yes, I have set #define NRF_DEFAULT_IRQ_PRIORITY 5 as needed.

So should I simply increase the define in https://github.com/zephyrproject-rtos/zephyr/blob/v3.4.0-rc3/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_vendor.h#L39?

What is the disadventage on setting it to high?

SInc eit is very difficult to provoke this issue, it also wil lbe hard to check that the set value is sufficient.

I also saw your suggestion in 52491 (comment) to simply disable the ASSERT. Would this be a safer approach?

@cvinayak
Copy link
Contributor

What is the disadventage on setting it to high?

Higher value reduces radio utilization. I.e. each radio event has this possible CPU use overhead margin between next radio use.

SInc eit is very difficult to provoke this issue, it also wil lbe hard to check that the set value is sufficient.

I also saw your suggestion in 52491 (comment) to simply disable the ASSERT. Would this be a safer approach?

Since you mention that it is difficult to reproduce, yes, it is safe to disable the assertion check.

@caco3
Copy link
Contributor Author

caco3 commented Jun 15, 2023

Since you mention that it is difficult to reproduce, yes, it is safe to disable the assertion check.

Thanks for the confirmation.

I also now have a test run where I see that changing it to a LOG message does not harm our applications functionality (No BLE disconnects or reduced throughput, no crashes).

May I suggest to provide a Kconfig to disable this assert or to adjust the value?

For me I always see the same pattern:


[00:54:17.256,286] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:07:18.482,543] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:07:19.997,589] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:07:25.157,745] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:13:38.122,833] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:13:40.537,902] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:13:42.187,957] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:13:42.892,974] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:13:43.102,996] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:13:44.198,028] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:13:44.963,043] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:13:45.923,095] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:13:46.703,094] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:13:46.973,114] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:13:47.198,120] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:13:48.458,160] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:13:49.028,167] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:20:11.728,576] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:20:12.778,594] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:20:13.738,616] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:20:13.813,629] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:20:13.918,640] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:20:15.088,684] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:20:15.613,677] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:20:23.668,945] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:26:42.634,216] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:26:43.234,222] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:26:43.474,243] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:26:43.729,248] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:26:44.149,261] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:26:45.034,301] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:26:45.409,301] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305
[01:26:46.654,327] <err> lll: lll_preempt_calc: Actual EVENT_OVERHEAD_START_US = 305

@cvinayak
Copy link
Contributor

May I suggest to provide a Kconfig to disable this assert or to adjust the value?

I have some work in progress in a local repo, I will see what I can do:

  1. I want to move the assertion to individual state/role c source file instead of the current lll.c
  2. Have a Kconfig to disable assertion and permit radio event skipping

But this will take some time before I have a clean PR to upstream.

@cvinayak
Copy link
Contributor

@caco3 hope you can check for any regressions by this PR: #59801

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Bluetooth Controller area: Bluetooth bug The issue is a bug, or the PR is fixing a bug Regression Something, which was working, does not anymore
Projects
None yet
Development

No branches or pull requests

4 participants