Increase performance of SWD and JTAG bitbanging #1688

ALTracer · 2023-11-25T20:13:15Z

Detailed description

This is a refactoring of core bitbanging routines.
The existing problem was IMO suboptimal efficiency of swdptap, jtagtap functions.
This PR solves this problem by extracting and improving the busy-looping delay; dropping double STM32F4 GPIO_BSRR writes; and rebalancing some of the no_delay routines to adjust for changes.

Tested on blackpill-f411ce platform connected to a stm32f103cb DUT. Firmware does not crash on exceptions from spurious ACKs. The 24MHz fx2la does not let me verify setup-and-hold timing compliancy, but the target silicon replies okay. I could retest with other targets accessible to me, possibly with low HCLKs. I was focusing ot stm32f4 platforms, but lm4f launchpad-icdi may be affected, too.
LINERESET waveform improved from 2.988 MHz to ~8.000 MHz, that is swd_seq_out_no_delay().
BMDA -r -f 8M was 22.5 and 41 KiB/s in JTAG and SWD mode, becomes 28 and 44 KiB/s respectively (but that bottleneck is different).

The PR aims to:

increase raw SWD performance of capable platforms (stm32f4), and by extension JTAG performance;
improve granularity of selecting max frequency (thanks to a quicker busy-delay loop); and fix said frequency calculations;
rebalance edge-case timings where needed;
optionally reduce AP WAIT polling spam (this makes logs and wire capture snapshots cleaner, but may hurt performance).

Important mentions:
GPIO_BSRR hack was introduced in f1ea5ed ; the reasoning behind it is not clear to me, but we can either set some default divider to target lower (2 MHz) frequency, and BMDA already asks for some 3 or 4 MHz. I wired everything on a breadboard with flying wires and a couple ground connections (no ribbon cables).
swdptap.c has optimize(3) since 4698a26 (but not jtagtap.c). -O3 increases flash footprint and may alter the timings.

Your checklist for this pull request

I've read the Code of Conduct
I've read the guidelines for contributing to this repository
It builds for hardware native (make PROBE_HOST=native)
It builds as BMDA (make PROBE_HOST=hosted)
I've tested it to the best of my ability
My commit messages provide a useful short description of what the commits do

Closing issues

No known issues w.r.t. SWJ speed.

dragonmux

This is looking provisionally very good and we see a lot of promise here - we'll have to do a pile of testing to make sure this doesn't regress the setup-and-hold timings, but mostly the notes and comments in this review are about form not function. The whole volatile vs register thing in particular raises some interesting questions to explore.

src/platforms/common/jtagtap.c

dragonmux · 2023-11-25T22:56:06Z

src/platforms/common/jtagtap.c

 static bool jtagtap_next_no_delay()
 {
 	gpio_set(TCK_PORT, TCK_PIN);
+	__asm__("nop" ::: "memory");


Please include a comment on what this is doing - we don't want to assume that the reader has seen the other uses of this and knows what's going on. Is there any particular reason for the inserting of a no-op rather than only a reordering barrier?

Annotated for the readers. This routine is too fast in comparison to others, because it has nothing to do besides toggle the jTCK pin, and it certainly does not move data into (or out of) BMP. I may even want to add more nops or something similar to tone down this one.

dragonmux · 2023-11-25T23:04:43Z

src/platforms/common/jtagtap.c

-static bool jtagtap_next(bool tms, bool tdi);
-static void jtagtap_cycle(bool tms, bool tdi, size_t clock_cycles);
+static inline void platform_delay_busy(const uint32_t loops) __attribute__((always_inline));
+static void jtagtap_tms_seq(uint32_t tms_states, size_t ticks) __attribute__((optimize(3)));


We'll have to do some captures and testing with this attribute to make sure it's not shooting the timings in the foot - as you note in the PR description. We do note that this particular function signature still has the old ticks nomenclature in it, so please switch that to clock_cycles.

750 kHz of tdi_tdo_seq on native was unremarkable, I hope performance improves -- please note, during capture/analysis, whether a given hardware relies on ancient FLITF Prefetch (2x64-bit lines) or runs from "0-wait-state" shadow SRAM.
Updated jtagtap_tms_seq() declaration and definition.

dragonmux · 2023-11-25T23:11:42Z

src/platforms/common/stm32/timing_stm32.c

+/* Busy-looping delay for GPIO bitbanging operations. SUBS+BNE.N take 4 cycles. */
+void platform_delay_busy(const uint32_t loops)
+{
+	/* Avoid using `volatile` variables which incur stack accesses */


The cppreference page on C's volatile does not mention anything about forced stack usage and we've never seen it imply any such - are we sure that is what the compiler's doing here? Either way, we would strongly prefer you to use a for loop not a do-while here. This is a for loop in disguise.

The disappearance of the loop when using continue; without volatile can be entirely explained not so much by DCE (it's the result, not the cause) but by there being no observable side effects thus allowing the compiler to assume that nothing of value is done by the loop. This is the reason for using volatile in the jtagtap code. Perhaps we can use volatile register to force the compiler to consider the operation to have observable side-effects but also keep the loop counter register-bound?

Indeed volatile register was the first thing I tried, only to be yelled at by the compiler to swap qualifiers... and identical codegen. I specifically wanted to avoid extraneous ldr/str of loop counters, so I monitored Puncover and objdump -CSd. Maybe Compiler Explorer can shed some light on why this codegens this way.

Oh how interesting - ok, that'll be an interesting session in CE then as yes, that might well provide some insight into how and why the compiler's choosing to do what it does.

Probably qualifying a variable as volatile forces the compiler to assume that it should be observable elsewhere, and hence needs to be pinned to SRAM (if only on the stack). This directly contradicts with the intent of register-allocated counters (which debuggers don't display due to "optimized away" but you can always dump the gp_regs). Something "dead store elimination".
And the for-loop form is longer and has an extra jump, I didn't subjectively like it; objectively it has to fit into Flash prefetch buffer which is a few (4) 64-bit lines on F103 (?) to not incur wait-states of 72MHz MCU against 24MHz flash on random jumps.

From our perspective re the for loop part, the for loop form should code gen to the same or very near the same as the do-while (perhaps needs the condition expressed as --counter > 0 and --counter in the update step removed to do it) while being a much clearer expression of your intent. Think this is very much going to be "take it to Compiler Explorer and tinker for an hour" to find the right thing to do and the right way to express this.

Your suggestion of --counter > 0 gets the codegen I want and the C idiomatic form you look to preserve, thanks.
Updated the three places accordingly.
m3/m4/m7 still get the SUBS+BNE.N, m0 (f072) gets subs; cmp; bne.n so its divisor timings will be off.

src/platforms/common/swdptap.c

dragonmux · 2023-11-25T23:15:19Z

src/platforms/common/swdptap.c

@@ -108,8 +121,9 @@ static uint32_t swdptap_seq_in_no_delay(const size_t clock_cycles)
 	for (size_t cycle = 0; cycle < clock_cycles; ++cycle) {
 		gpio_clear(SWCLK_PORT, SWCLK_PIN);
 		value |= gpio_get(SWDIO_IN_PORT, SWDIO_IN_PIN) ? 1U << cycle : 0U;
+		__asm__("" ::: "memory");


Please insert a note here about this being a reordering barrier - same as in other places the trick is used.

Added a comment. Both swdptap_seq_in_no_delay() and swdptap_seq_out_no_delay() need careful consideration in code influencing bitbanged timings, including effect of gpio_set()/clear() or gpio_set_val() and gpio_get().
I would like to see this entire routine fit into ART cache of STM32F4, or at least not jump around thanks to all the gpio_* bits inlined.

src/platforms/common/swdptap.c

koendv · 2023-11-28T10:02:35Z

With #1688 and #1690 applied, on an stm32f411 platform, rtt has a measured speed of 114.292 chars/s if target is stm32f102, 121.297 chars/s if target is stm32f411.
The documentation UsingRTT.md mentions a speed of 50.000 chars/s, perhaps update this number if PR is merged.

ALTracer · 2024-01-01T17:14:20Z

With #1688 and #1690 applied, on an stm32f411 platform, rtt has a measured speed of 114.292 chars/s if target is stm32f102, 121.297 chars/s if target is stm32f411. The documentation UsingRTT.md mentions a speed of 50.000 chars/s, perhaps update this number if PR is merged.

Thanks; the documented speed will certainly need updating, more so because of #1536.
I repeated the tests myself on blackpill-f411ce platform (96 MHz) and a gd32f103cb DUT clocked to 72 MHz; then 96, and 64, and even 48 MHz by providing a modified SystemClockConfig(). At default poll interval settings of 256 8 10 it is ~97000 chars/s, at 50 8 10 it is as you say, ~120000 chars/s, but altering them to monitor rtt poll 25 5 10 yields ~167000-17000 chars/s. This means the bottleneck is not in DUT downclocked to 48MHz, but rather in the BMD/libopencm3 code. orbtop could provide more detailed data.
BMDA, for the record, manages only ~27000 chars/s using BMPremote with this platform (and non-1-byte read libc calls).

* Copy the snippet verbatim from 10 identical places * Ask GCC to please inline it back (so codegen does not change) compromising 116 bytes of flash for less jumping around

* Copy the snippet verbatim from 10 similar places, preserving the +1 in argument (turnaround/parity) for automatic 0 clock delay instead of checking against UINT32_MAX * Mark as static for GCC to inline it back (so codegen does not change) compromising 96 bytes of flash for less jumping around

* Use predecrement and check against zero, as before * Stop touching the stack with `volatile` variables, a register is sufficient * Use normal SysTick platform_delay() in jtagtap_reset(), which is both longer duration and does not depend on platform HCLK.

* Avoid a zero loops delay, which underflows in `platform_delay_busy()`, hanging the adapter

* For STM32F4, dial in 12 & 4 as measured by LA and MSO. This became faster thanks to dropping double writes and volatile on-stack counters. * For STM32F0, dial in slightly bigger 16 & 6 because armv6-m codegen is different. * For STM32F1, dial in 18 & 4: less than 22 thanks to no volatile on-stack counters, and similar busy delay impact (armv7-m). * For everything else, keep the old numbers in hope someone measures it later.

* Reference: jtagtap_tms_seq_no_delay(). Also add comments. * Use appropriate type for TMS state output.

* This allows dropping the 0-loops check from delay_busy

* Also update signature of jtagtap_tms_seq() (s/ticks/clock_cycles/)

* SWD line reset needs 50 clocks with high TMS/SWDIO, this was achieved by looping over jtagtap_next() which incurs function call delays; * Use jtagtap_cycle() instead which seems appropriate here, and discards TDO anyways; * The adiv5_swd.c:swd_line_reset_sequence() counterpart seems less fitting to duplicate in jtagtap. * Annotate any new nops in hot no_delay path.

...by redirecting to faster tdi_seq * Do not use nullability attributes yet

tlyu · 2024-01-14T23:15:42Z

Thanks for working on this! I tried it out a bit, and it seems to sometimes set target_clk_divider to zero, causing underflow and very long delays. This would be on platforms that set BITBANG_CALIBRATED_FREQS. On the other platforms, there are added checks that prevent that.

At least the clock speeds seem more consistent across different routines.

There are lots of remaining oddities about the timing that I see on the logic analyzer, but I'm still investigating those.

tlyu · 2024-01-15T03:17:14Z

The main improvement is making the parity read/write functions have a consistent pulse width compared to the non-parity ones, at least in the case that does delays.

For the non-delay case, there seems to be increased asymmetry in the clock waveform, possibly due to replacing the NOP instructions with barriers. Some of these barriers are redundant, because the GPIO accesses are already volatile.

Also, I only have a 24MHz logic analyzer, so my ability to analyze differences in the no-delay case are going to be confounded by inadequate sample rate.

dragonmux added Enhancement General project improvement BMP Firmware Black Magic Probe Firmware (not PC hosted software) labels Nov 25, 2023

dragonmux added this to the v2.0 release milestone Nov 25, 2023

ALTracer force-pushed the fix/tap-timing branch from f59f76f to 8d73f51 Compare November 25, 2023 21:56

dragonmux requested changes Nov 25, 2023

View reviewed changes

dragonmux changed the title ~~Increase performance of SWJ-DP bitbanging~~ Increase performance of SWD and JTAG bitbanging Nov 25, 2023

ALTracer mentioned this pull request Nov 27, 2023

Fix: Speed up bus turnaround for blackpill-f4 like other platforms #1690

Merged

6 tasks

ALTracer mentioned this pull request Dec 17, 2023

Fix: extract method for SWD parity calculation #1700

Merged

6 tasks

dragonmux mentioned this pull request Dec 27, 2023

Inconsistent clock signal during SWDIO turnaround #1706

Closed

ALTracer force-pushed the fix/tap-timing branch 4 times, most recently from ecb692b to 47a39ca Compare December 31, 2023 05:10

tlyu mentioned this pull request Jan 4, 2024

common/swdptap: make SWD timing more consistent #1714

Merged

6 tasks

ALTracer added 12 commits January 5, 2024 00:53

jtagtap: Extract platform_delay_busy()

04b0080

* Copy the snippet verbatim from 10 identical places * Ask GCC to please inline it back (so codegen does not change) compromising 116 bytes of flash for less jumping around

timing_stm32: Copy the platform_delay_busy() here

e70d536

stm32/gpio: Drop double writes on set/reset since F4Discovery days

39673a0

timing_stm32: Do not default clk_divider to 0; fix frequency_get()

1b4bbd9

* Avoid a zero loops delay, which underflows in `platform_delay_busy()`, hanging the adapter

swdptap: Tweak _no_delay() timings for a more even duty cycle

12c40bc

* Reference: jtagtap_tms_seq_no_delay(). Also add comments. * Use appropriate type for TMS state output.

swdptap: Handle no_delay parity/turnaround as well

4bff6d1

* This allows dropping the 0-loops check from delay_busy

jtagtap: optimize _no_delay() paths at -O3 like swdptap

1a39f13

* Also update signature of jtagtap_tms_seq() (s/ticks/clock_cycles/)

jtagtap: Protect tdi_tdo_seq from writing into NULL data_out

58a2425

...by redirecting to faster tdi_seq * Do not use nullability attributes yet

ALTracer force-pushed the fix/tap-timing branch from 47a39ca to 58a2425 Compare January 4, 2024 22:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Increase performance of SWD and JTAG bitbanging #1688

Increase performance of SWD and JTAG bitbanging #1688

ALTracer commented Nov 25, 2023

dragonmux left a comment

dragonmux Nov 25, 2023

ALTracer Dec 31, 2023

dragonmux Nov 25, 2023

ALTracer Dec 2, 2023

dragonmux Nov 25, 2023

ALTracer Nov 25, 2023

dragonmux Nov 25, 2023

ALTracer Nov 25, 2023

dragonmux Nov 25, 2023

ALTracer Dec 2, 2023

dragonmux Nov 25, 2023

ALTracer Dec 31, 2023

koendv commented Nov 28, 2023 •

edited

Loading

ALTracer commented Jan 1, 2024

tlyu commented Jan 14, 2024

tlyu commented Jan 15, 2024

Increase performance of SWD and JTAG bitbanging #1688

Are you sure you want to change the base?

Increase performance of SWD and JTAG bitbanging #1688

Conversation

ALTracer commented Nov 25, 2023

Detailed description

Your checklist for this pull request

Closing issues

dragonmux left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

koendv commented Nov 28, 2023 • edited Loading

ALTracer commented Jan 1, 2024

tlyu commented Jan 14, 2024

tlyu commented Jan 15, 2024

koendv commented Nov 28, 2023 •

edited

Loading