[BUG] multiple-pause-resume.sh sof-test fails to FW_GEN_MSG failed with err 3018 #8792

Closed
kv2019i opened this issue Jan 25, 2024 · 6 comments
Assignees
Labels
bug (Something isn't working as expected), Display Audio (Issues with display audio via external HDMI or DP display), P1 (Blocker bugs or important features), TGL (Applies to Tiger Lake)

Comments

kv2019i commented Jan 25, 2024

Describe the bug
The multiple-pause-resume.sh sof-test started to fail with the latest Zephyr on Intel cAVS2.5-based platforms (TGL, ADL). First seen in #8764.

To Reproduce
Build SOF with recent Zephyr, run sof-test.

Reproduction Rate
50+%

Expected behavior
Chain DMA host DMA stop should succeed. Currently it fails and the firmware reports "FW_GEN_MSG failed with err 3018".

Impact
SOF CI PR tests are failing at a high rate.

Environment

  1. Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver).
    See [DNM] Zephyr smp rework #8764
  2. Name of the topology file
  3. Name of the platform(s) on which the bug is observed.
    • Platform: tgl, adl, rpl

Screenshots or console output
See https://sof-ci.01.org/sofpr/PR8764/build2195/devicetest/index.html?model=TGLU_UP_HDA-ipc4&testcase=multiple-pause-resume-50

[  825.139933] <inf> ipc: ipc_cmd: rx	: 0xe050000|0x0
[  825.140968] <inf> chain_dma: chain_link_stop: comp:128 0x80 chain_link_stop(): dma_stop() link chan_index = 0
[  825.142083] <inf> ll_schedule: zephyr_ll_task_done: task complete 0xbe0bf810 0x204b0U
[  825.142095] <inf> ll_schedule: zephyr_ll_task_done: num_tasks 3 total_num_tasks 3
[  825.142133] <err> ipc: ipc_cmd: ipc4: FW_GEN_MSG failed with err 3018
[  825.142873] <inf> ipc: ipc_cmd: rx	: 0xe040000|0x0
[  825.142901] <inf> dma: dma_put: dma_put(), dma = 0x9e09f650, sref = 1
[  825.142915] <inf> dma: dma_put: dma_put(), dma = 0x9e09f6f0, sref = 1

kv2019i added the bug, P1, TGL and Display Audio labels Jan 25, 2024
kv2019i self-assigned this Jan 25, 2024

kv2019i commented Jan 25, 2024

zephyrproject-rtos/zephyr#68101 didn't work. 5000us is already a REALLY long time to wait and we still had failures, so something else is going wrong. This is not seen on other platforms using chain-dma, which makes it quite curious. It seems to be specific to cAVS2.5.

kv2019i added a commit to kv2019i/linux that referenced this issue Jan 26, 2024
The host DMA (controlled by BE ops) must be stopped before
sending PAUSE/STOP IPC (sent from FE ops) to chain DMA.
Unless this is done, the DMA stop flow is not following programming
sequence and DMA engine may get stuck in busy state.

Link: thesofproject/sof#8792
Signed-off-by: Kai Vehmanen <[email protected]>

kv2019i commented Jan 26, 2024

I now suspect that the DMA stop problem uncovered by the Zephyr update is actually a problem in our host DMA stop sequence. With the current kernel and SOF, the order of the stop steps seems wrong:

  • CHAIN_DMA IPC is sent -> this clears DGCS_GEN and DGCS_FIFORDY bits to stop the DMA (in Zephyr driver)
  • BE trigger is run in kernel and this clears the RUN bit on host side

As far as I can tell, this is against the recommended programming flow. In local stress tests, thesofproject/linux#4798 seems to fix the issue (GBUSY is no longer stuck), but more stress testing is needed.
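
The ordering concern above can be modeled with a small sketch. This is only an illustration under stated assumptions: the struct, the function names and the bit positions are hypothetical stand-ins, not the real register layout or the kernel or Zephyr driver code. It shows the order proposed in the referenced kernel commit (host side RUN cleared first, chain DMA stop afterwards) and flags the reverse order.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only (not the real layout). */
#define HOST_RUN     (1u << 0) /* stand-in for the host-side RUN bit */
#define DGCS_GEN     (1u << 1) /* stand-in for the DSP-side DGCS_GEN bit */
#define DGCS_FIFORDY (1u << 2) /* stand-in for the DSP-side DGCS_FIFORDY bit */

/* Hypothetical model of one stream: one host register, one DSP register. */
struct fake_hda_stream {
	uint32_t host_ctl;
	uint32_t dgcs;
};

/* Models the BE trigger step: the kernel clears RUN on the host side. */
static void host_dma_stop(struct fake_hda_stream *s)
{
	s->host_ctl &= ~HOST_RUN;
}

/*
 * Models the firmware handling of the CHAIN_DMA stop IPC: clear GEN and
 * FIFORDY. Returns false when the host side was still running, i.e. when
 * the steps were issued in the order the comment above describes as wrong.
 */
static bool chain_dma_stop_ipc(struct fake_hda_stream *s)
{
	bool order_ok = !(s->host_ctl & HOST_RUN);

	s->dgcs &= ~(DGCS_GEN | DGCS_FIFORDY);
	return order_ok;
}

/* Proposed order: stop the host DMA first, then send the chain DMA stop. */
static bool stop_stream_in_order(struct fake_hda_stream *s)
{
	host_dma_stop(s);
	return chain_dma_stop_ipc(s);
}
```

Calling chain_dma_stop_ipc() on a stream whose HOST_RUN bit is still set returns false in this model, while stop_stream_in_order() returns true. The model only captures the ordering and says nothing about how the real hardware behaves.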

FYI @jsarha @ujfalusi @plbossart @ranj063 @RanderWang

kv2019i added a commit to kv2019i/sof that referenced this issue Jan 26, 2024
Starting with Zephyr commit e021ccfc745221c6 ("drivers: dma:
intel-adsp-hda: add delay to stop host dma"), the pause-resume
sof-test cases started failing on Intel cAVS2.5 platforms.

Add a delay loop around DMA stop code in chain DMA to workaround
the issue while a proper fix is under investigation. This allows
to resume integration of newer Zephyr versions to SOF and ensure
we detect any new regressions in time.

Link: thesofproject#8792
Signed-off-by: Kai Vehmanen <[email protected]>
kv2019i added a commit to kv2019i/sof that referenced this issue Jan 29, 2024
Starting with Zephyr commit e021ccfc745221c6 ("drivers: dma:
intel-adsp-hda: add delay to stop host dma"), the pause-resume
sof-test cases started failing on Intel cAVS2.5 platforms.

Add a temporary workaround to ignore this error on the affected
Intel platforms. This allows to resume integration of newer Zephyr
versions to SOF.

Link: thesofproject#8792
Signed-off-by: Kai Vehmanen <[email protected]>
lyakh pushed a commit to lyakh/sof that referenced this issue Jan 30, 2024
Starting with Zephyr commit e021ccfc745221c6 ("drivers: dma:
intel-adsp-hda: add delay to stop host dma"), the pause-resume
sof-test cases started failing on Intel cAVS2.5 platforms.

Add a delay loop around DMA stop code in chain DMA to workaround
the issue while a proper fix is under investigation. This allows
to resume integration of newer Zephyr versions to SOF and ensure
we detect any new regressions in time.

Link: thesofproject#8792
Signed-off-by: Kai Vehmanen <[email protected]>

kv2019i commented Jan 30, 2024

Tried a combination of zephyrproject-rtos/zephyr#68304 and thesofproject/linux#4798, but the test still fails.

lyakh pushed a commit to lyakh/sof that referenced this issue Jan 31, 2024
Starting with Zephyr commit e021ccfc745221c6 ("drivers: dma:
intel-adsp-hda: add delay to stop host dma"), the pause-resume
sof-test cases started failing on Intel cAVS2.5 platforms.

Add a delay loop around DMA stop code in chain DMA to workaround
the issue while a proper fix is under investigation. This allows
to resume integration of newer Zephyr versions to SOF and ensure
we detect any new regressions in time.

Link: thesofproject#8792
Signed-off-by: Kai Vehmanen <[email protected]>

kv2019i commented Jan 31, 2024

This kernel PR thesofproject/linux#4801 seems to help. One test plan (#37589) has passed, with err 3018 not seen at all.

kv2019i commented Feb 1, 2024

A test PR with a Zephyr-side fix for the issue -> #8826

kv2019i commented Mar 6, 2024

Issue closed with zephyrproject-rtos/zephyr#68415 merged via #8903 today.

kv2019i closed this as completed Mar 6, 2024
dztang pushed a commit to dztang/zephyr that referenced this issue Apr 23, 2024
After commit e021ccf ("drivers: dma: intel-adsp-hda: add delay to
stop host dma"), SOF project tests for the "chain DMA" feature started
failing at a high rate on Intel cAVS2.5 ADSP platforms.

Debugging shows that the 1000us timeout is not enough to clear the GBUSY
bit on these platforms and the chain DMA tests.

Link: thesofproject/sof#8792
Signed-off-by: Kai Vehmanen <[email protected]>
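
The timeout behaviour discussed in this thread can be illustrated with a bounded polling loop. This is a minimal sketch, not the Zephyr driver code: the GBUSY bit position, the callback types, wait_gbusy_clear() and the fake_* helpers are all hypothetical names introduced for illustration, and the numbers in the usage note are made up.

```c
#include <stdbool.h>
#include <stdint.h>

#define GBUSY (1u << 0) /* hypothetical bit position, for illustration only */

typedef uint32_t (*read_status_fn)(void *ctx); /* reads a status register */
typedef void (*delay_us_fn)(uint32_t us);      /* busy-waits or sleeps */

/*
 * Poll until GBUSY clears or timeout_us has elapsed.
 * Returns true if the bit cleared in time, false on timeout.
 */
static bool wait_gbusy_clear(read_status_fn read_status, void *ctx,
			     delay_us_fn delay_us,
			     uint32_t timeout_us, uint32_t step_us)
{
	uint32_t waited = 0;

	if (step_us == 0)
		step_us = 1; /* avoid looping forever on a zero step */

	for (;;) {
		if (!(read_status(ctx) & GBUSY))
			return true;
		if (waited >= timeout_us)
			return false;
		delay_us(step_us);
		waited += step_us;
	}
}

/* Hypothetical stand-in for hardware: reports busy for a fixed poll count. */
struct fake_dma {
	uint32_t busy_polls_left;
};

static uint32_t fake_read_status(void *ctx)
{
	struct fake_dma *d = ctx;

	if (d->busy_polls_left > 0) {
		d->busy_polls_left--;
		return GBUSY;
	}
	return 0;
}

/* No real delay is needed in the model. */
static void fake_delay_us(uint32_t us)
{
	(void)us;
}
```

With a fake device that stays busy for 150 polls and a 10us step, a 1000us budget times out while a 5000us budget succeeds. That mirrors the shape of the problem described above (a timeout too short for the hardware to clear GBUSY), without claiming anything about the actual values the fix uses.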