Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fw panic support for IPC4 #4369

Merged
merged 9 commits into from
Sep 7, 2023

Conversation

RanderWang
Copy link

@RanderWang RanderWang commented May 23, 2023

It gets fw panic information from memory window and reuse the debug framework to dump information to kernel log.

The FW support is: zephyrproject-rtos/zephyr#57329

[43991.605559] sof-audio-pci-intel-tgl 0000:00:1f.3: Create widget copier.host.1.1 instance 0 - pipe 1 - core 0
[43991.605563] sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx : 0x40000004|0x15: MOD_INIT_INSTANCE [data size: 84]
[43991.607399] sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx : 0x1b0a0000|0x0: GLB_NOTIFICATION|EXCEPTION_CAUGHT
[43991.607413] sof-audio-pci-intel-tgl 0000:00:1f.3: ------------[ IPC dump start ]------------
[43991.607437] sof-audio-pci-intel-tgl 0000:00:1f.3: hda irq intsts 0x80000000 intlctl 0x40000000 rirb 00
[43991.607444] sof-audio-pci-intel-tgl 0000:00:1f.3: dsp irq ppsts 0x80000000 adspis 0x00000001
[43991.607472] sof-audio-pci-intel-tgl 0000:00:1f.3: Host IPC initiator: 0xc0000004|0x15|0x0, target: 0x9b0a0000|0x0|0x0, ctl: 0x3
[43991.607487] sof-audio-pci-intel-tgl 0000:00:1f.3: ------------[ IPC dump end ]------------
[43991.607492] sof-audio-pci-intel-tgl 0000:00:1f.3: ------------[ DSP dump start ]------------
[43991.607496] sof-audio-pci-intel-tgl 0000:00:1f.3: FW Panic
[43991.607501] sof-audio-pci-intel-tgl 0000:00:1f.3: fw_state: SOF_FW_BOOT_COMPLETE (7)
[43991.607512] sof-audio-pci-intel-tgl 0000:00:1f.3: 0x00000005: module: ROM, state: FW_ENTERED, running
[43991.607610] sof-audio-pci-intel-tgl 0000:00:1f.3: Core exception record version 0x2, soc: 3
[43991.607670] sof-audio-pci-intel-tgl 0000:00:1f.3: FW is built with XCC toolchain
[43991.607687] sof-audio-pci-intel-tgl 0000:00:1f.3: error: DSP Firmware Oops
[43991.607698] sof-audio-pci-intel-tgl 0000:00:1f.3: error: Exception Cause: IntegerDivideByZeroCause, QUOS, QUOU, REMS, or REMU divisor operand is zero
[43991.607708] sof-audio-pci-intel-tgl 0000:00:1f.3: EXCCAUSE 0x00000006 EXCVADDR 0x00000000 PS 0x00060220 SAR 0x0000000f
[43991.607717] sof-audio-pci-intel-tgl 0000:00:1f.3: EPC1 0xbe0391da EPC2 0x00000000 EPC3 0x00000000 EPC4 0x00000000
[43991.607724] sof-audio-pci-intel-tgl 0000:00:1f.3: EPC5 0x00000000 EPC6 0x00000000 EPC7 0x00000000 DEPC 0x00000000
[43991.607730] sof-audio-pci-intel-tgl 0000:00:1f.3: EPS2 0x00000000 EPS3 0x00000000 EPS4 0x00000000 EPS5 0x00000000
[43991.607735] sof-audio-pci-intel-tgl 0000:00:1f.3: EPS6 0x00000000 EPS7 0x00000000 INTENABL 0x00000000 INTERRU 0x00000000
[43991.607741] sof-audio-pci-intel-tgl 0000:00:1f.3: stack dump from 0x00000000
[43991.607746] sof-audio-pci-intel-tgl 0000:00:1f.3: dump ar register number 16
[43991.607750] sof-audio-pci-intel-tgl 0000:00:1f.3: [00]: 0xbe026aa6 0xbe0945c0 0x9e0a0b40 0xbe094610
[43991.607757] sof-audio-pci-intel-tgl 0000:00:1f.3: [04]: 0xbe006000 0xbe0a0bc0 0x9e0a0b00 0x9e084ec8
[43991.607762] sof-audio-pci-intel-tgl 0000:00:1f.3: [08]: 0xbe0391c3 0x000001 0x040000 0x000000
[43991.607769] sof-audio-pci-intel-tgl 0000:00:1f.3: [12]: 0xbe0391c3 0x000001 0x040000 0x000000
[43991.607801] sof-audio-pci-intel-tgl 0000:00:1f.3: extended rom status: 0x5 0x0 0x0 0x0 0x0 0x0 0x0 0x1
[43991.607806] sof-audio-pci-intel-tgl 0000:00:1f.3: ------------[ DSP dump end ]------------

Copy link
Collaborator

@ranj063 ranj063 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @RanderWang except for a minor nit

sound/soc/sof/intel/telemetry.c Show resolved Hide resolved
ranj063
ranj063 previously approved these changes May 24, 2023
Copy link
Collaborator

@ranj063 ranj063 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @RanderWang except for a minor nit

@RanderWang
Copy link
Author

reform the variable definition style in function sof_ipc4_intel_dump_telemetry_state

Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool @RanderWang ! A few comments inline.

sound/soc/sof/xtensa/core.c Outdated Show resolved Hide resolved
sound/soc/sof/intel/telemetry.h Show resolved Hide resolved
sound/soc/sof/intel/telemetry.h Show resolved Hide resolved
sound/soc/sof/intel/telemetry.h Outdated Show resolved Hide resolved
@RanderWang
Copy link
Author

@kv2019i update this PR for exception debugfs node support. Thanks!

kv2019i
kv2019i previously approved these changes Jun 1, 2023
Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent, looks good to go!

sound/soc/sof/intel/telemetry.h Show resolved Hide resolved
@kv2019i
Copy link
Collaborator

kv2019i commented Jun 7, 2023

@plbossart @bardliao @ujfalusi @ranj063 Could this be expedited? The FW part is now merged and it seems as a side-effect , the coredump info is not duplicated to mtrace anymore. If the review needs more time, let me know and we'll need to look at reverting the FW change until driver is ready. FYI @RanderWang

Copy link
Member

@plbossart plbossart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the directions. I thought there was a plan to use zephyr for coredumps, not it's back with the intel telemetry with weird definitions of target code.

Not sure where this is going @RanderWang @kv2019i

COREDUMP_TGT_X86_64,
COREDUMP_TGT_ARM_CORTEX_M,
COREDUMP_TGT_RISC_V,
COREDUMP_TGT_XTENSA,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What on earth is all this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@plbossart This is interface of the coredump structure defined in Zephyr and adopted by SOF FW in current mainline.

We can drop this and only expose this via "exception" debugfs, but based on past experience, enabling kernel to parse the essential bits of the coredump directly to kernel dmesg is valuable for debug. Collecting the exception dumps via debugfs nodes is going to be harder to do in practise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this intel specific with non-intel CPUs listed?

Something does not add up.

Either this file should be moved as sound/soc/sof/ipc4-telemetry.h, or it's intel-specific and needs to be simplified.

Pick one :-)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@plbossart Thanks for your suggestion. Now I divide it into two parts : one in sof core layer for common part in zephyr and the other part is for intel specific. Other vendors can reuse the common part.

ujfalusi
ujfalusi previously approved these changes Aug 25, 2023
plbossart
plbossart previously approved these changes Aug 28, 2023
@plbossart plbossart requested a review from kv2019i August 28, 2023 22:57
Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RanderWang Uh, sorry, one late discovery in the first patch. Can you check? I think we need to fix this before merge. Otherwise re-reviewed the series and looks good.

sound/soc/sof/xtensa/core.c Show resolved Hide resolved
@RanderWang
Copy link
Author

@kv2019i updated to check max AR count. Thanks!

@@ -15,6 +15,9 @@
* Architecture specific debug
*/

/* AR0 ~ AR15 */
#define XTENSA_CORE_MAX_AR_REGS_COUNT 16

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, sorry @RanderWang , your original code was correct.

I now realize xoops->plat_hdr is not directly read from the memory window, but rather parsed in kernel. On line 84 of intel/telemetry.c (you add in this patch), you already limit the AR count to 16. So this number can be trusted after all, it cannot ever be larger than 16 no matter what is in the telemetry contents.

So you can revert to the original version after all (which already had the reviews and was ready to go, apologies!).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NP , reverted

kv2019i
kv2019i previously approved these changes Aug 30, 2023
Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the last update is longer strictly necessary (my fault, there's a check in telemetry.c), the additional check doesn't hurt either and is valid check to be done in core/xtensa.c. So giving my +1 for this in any case.

Copy link
Collaborator

@ujfalusi ujfalusi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that the extra check made things better, for me it is quite the opposite.


dev_printk(level, sdev->dev, "AR registers:\n");
/* the number of ar registers is a multiple of 4 */
for (i = 0; i < xoops->plat_hdr.numaregs && i < XTENSA_CORE_MAX_AR_REGS_COUNT; i += 4) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, if the xoops->plat_hdr.numaregs is invalid then likely the whole dump is corrupted..
Printing a known corrupted dump have questionable benefit.

and the line can be shortened if it is really desired to print even if the numregs is out of range:

	u32 num_regs;
	...
	num_regs = min(xoops->plat_hdr.numaregs, XTENSA_CORE_MAX_AR_REGS_COUNT);
	for (i = 0; i < num_regs; i += 4) {

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ujfalusi thanks, also suggested by Kai. Now revert it

On Xtensa platform ar0 is for caller address and ar1 is for stack
address. The ar register dump can be used to rebuild call stack with
FW elf file by debug tools.

Signed-off-by: Rander Wang <[email protected]>
The macro definitions of debug slot can be used by gdb, telemetry
and mtrace log, so move these definitions to header.h from mtrace.
Then these macro definitions can be shared

Signed-off-by: Rander Wang <[email protected]>
Currently IPC4 supports GDB slot, telemetry slot and
debug slot. This helper function will be used to get
the slot offset in debug windows for further processing.

Signed-off-by: Rander Wang <[email protected]>
Core dump includes hardware platform information, cpu registers and
exception call stack. FW saves core dump to telemetry slot in shared
memory window for host in the event of FW exception. This patch creates
exception node in debugfs for user to dump telemetry data.

Signed-off-by: Rander Wang <[email protected]>
The exception node is created when FW is ready and clear to
zero when FW post boot.

Signed-off-by: Rander Wang <[email protected]>
Telemetry data is decoded based on intel xtensa design and printed in
kernel log by sof debug framework.

Signed-off-by: Rander Wang <[email protected]>
Dump dsp stack with sof_ipc4_intel_dump_telemetry_state since dsp stack
information is included by telemetry data. This also supports lnl since
the mtl code is reused.

Signed-off-by: Rander Wang <[email protected]>
Get the FW panic information from telemetry data in memory window and
dump it to kernel log. The old platforms before CAVS 2.5+ don't support
it since there is no support in FW for them.

Signed-off-by: Rander Wang <[email protected]>
Driver will receive exception IPC message and process it by
snd_sof_dsp_panic.

Signed-off-by: Rander Wang <[email protected]>
@RanderWang
Copy link
Author

RanderWang commented Aug 31, 2023

@plbossart @ujfalusi @kv2019i @lyakh did null change after getting your approve.

@plbossart plbossart requested a review from lyakh September 6, 2023 13:15
@ujfalusi ujfalusi merged commit 65fdc4e into thesofproject:topic/sof-dev Sep 7, 2023
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants