Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase performance of SWD and JTAG bitbanging #1688

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

ALTracer
Copy link
Contributor

Detailed description

  • This is a refactoring of core bitbanging routines.
  • The existing problem was IMO suboptimal efficiency of swdptap, jtagtap functions.
  • This PR solves this problem by extracting and improving the busy-looping delay; dropping double STM32F4 GPIO_BSRR writes; and rebalancing some of the no_delay routines to adjust for changes.

Tested on blackpill-f411ce platform connected to a stm32f103cb DUT. Firmware does not crash on exceptions from spurious ACKs. The 24MHz fx2la does not let me verify setup-and-hold timing compliancy, but the target silicon replies okay. I could retest with other targets accessible to me, possibly with low HCLKs. I was focusing ot stm32f4 platforms, but lm4f launchpad-icdi may be affected, too.
LINERESET waveform improved from 2.988 MHz to ~8.000 MHz, that is swd_seq_out_no_delay().
BMDA -r -f 8M was 22.5 and 41 KiB/s in JTAG and SWD mode, becomes 28 and 44 KiB/s respectively (but that bottleneck is different).

The PR aims to:

  • increase raw SWD performance of capable platforms (stm32f4), and by extension JTAG performance;
  • improve granularity of selecting max frequency (thanks to a quicker busy-delay loop); and fix said frequency calculations;
  • rebalance edge-case timings where needed;
  • optionally reduce AP WAIT polling spam (this makes logs and wire capture snapshots cleaner, but may hurt performance).

Important mentions:
GPIO_BSRR hack was introduced in f1ea5ed ; the reasoning behind it is not clear to me, but we can either set some default divider to target lower (2 MHz) frequency, and BMDA already asks for some 3 or 4 MHz. I wired everything on a breadboard with flying wires and a couple ground connections (no ribbon cables).
swdptap.c has optimize(3) since 4698a26 (but not jtagtap.c). -O3 increases flash footprint and may alter the timings.

Your checklist for this pull request

  • I've read the Code of Conduct
  • I've read the guidelines for contributing to this repository
  • It builds for hardware native (make PROBE_HOST=native)
  • It builds as BMDA (make PROBE_HOST=hosted)
  • I've tested it to the best of my ability
  • My commit messages provide a useful short description of what the commits do

Closing issues

No known issues w.r.t. SWJ speed.

@dragonmux dragonmux added Enhancement General project improvement BMP Firmware Black Magic Probe Firmware (not PC hosted software) labels Nov 25, 2023
@dragonmux dragonmux added this to the v2.0 release milestone Nov 25, 2023
Copy link
Member

@dragonmux dragonmux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking provisionally very good and we see a lot of promise here - we'll have to do a pile of testing to make sure this doesn't regress the setup-and-hold timings, but mostly the notes and comments in this review are about form not function. The whole volatile vs register thing in particular raises some interesting questions to explore.

src/platforms/common/jtagtap.c Outdated Show resolved Hide resolved
src/platforms/common/jtagtap.c Outdated Show resolved Hide resolved
static bool jtagtap_next_no_delay()
{
gpio_set(TCK_PORT, TCK_PIN);
__asm__("nop" ::: "memory");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include a comment on what this is doing - we don't want to assume that the reader has seen the other uses of this and knows what's going on. Is there any particular reason for the inserting of a no-op rather than only a reordering barrier?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Annotated for the readers. This routine is too fast in comparison to others, because it has nothing to do besides toggle the jTCK pin, and it certainly does not move data into (or out of) BMP. I may even want to add more nops or something similar to tone down this one.

static bool jtagtap_next(bool tms, bool tdi);
static void jtagtap_cycle(bool tms, bool tdi, size_t clock_cycles);
static inline void platform_delay_busy(const uint32_t loops) __attribute__((always_inline));
static void jtagtap_tms_seq(uint32_t tms_states, size_t ticks) __attribute__((optimize(3)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll have to do some captures and testing with this attribute to make sure it's not shooting the timings in the foot - as you note in the PR description. We do note that this particular function signature still has the old ticks nomenclature in it, so please switch that to clock_cycles.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

750 kHz of tdi_tdo_seq on native was unremarkable, I hope performance improves -- please note, during capture/analysis, whether a given hardware relies on ancient FLITF Prefetch (2x64-bit lines) or runs from "0-wait-state" shadow SRAM.
Updated jtagtap_tms_seq() declaration and definition.

/* Busy-looping delay for GPIO bitbanging operations. SUBS+BNE.N take 4 cycles. */
void platform_delay_busy(const uint32_t loops)
{
/* Avoid using `volatile` variables which incur stack accesses */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cppreference page on C's volatile does not mention anything about forced stack usage and we've never seen it imply any such - are we sure that is what the compiler's doing here? Either way, we would strongly prefer you to use a for loop not a do-while here. This is a for loop in disguise.

The disappearance of the loop when using continue; without volatile can be entirely explained not so much by DCE (it's the result, not the cause) but by there being no observable side effects thus allowing the compiler to assume that nothing of value is done by the loop. This is the reason for using volatile in the jtagtap code. Perhaps we can use volatile register to force the compiler to consider the operation to have observable side-effects but also keep the loop counter register-bound?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed volatile register was the first thing I tried, only to be yelled at by the compiler to swap qualifiers... and identical codegen. I specifically wanted to avoid extraneous ldr/str of loop counters, so I monitored Puncover and objdump -CSd. Maybe Compiler Explorer can shed some light on why this codegens this way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh how interesting - ok, that'll be an interesting session in CE then as yes, that might well provide some insight into how and why the compiler's choosing to do what it does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably qualifying a variable as volatile forces the compiler to assume that it should be observable elsewhere, and hence needs to be pinned to SRAM (if only on the stack). This directly contradicts with the intent of register-allocated counters (which debuggers don't display due to "optimized away" but you can always dump the gp_regs). Something "dead store elimination".
And the for-loop form is longer and has an extra jump, I didn't subjectively like it; objectively it has to fit into Flash prefetch buffer which is a few (4) 64-bit lines on F103 (?) to not incur wait-states of 72MHz MCU against 24MHz flash on random jumps.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From our perspective re the for loop part, the for loop form should code gen to the same or very near the same as the do-while (perhaps needs the condition expressed as --counter > 0 and --counter in the update step removed to do it) while being a much clearer expression of your intent. Think this is very much going to be "take it to Compiler Explorer and tinker for an hour" to find the right thing to do and the right way to express this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your suggestion of --counter > 0 gets the codegen I want and the C idiomatic form you look to preserve, thanks.
Updated the three places accordingly.
m3/m4/m7 still get the SUBS+BNE.N, m0 (f072) gets subs; cmp; bne.n so its divisor timings will be off.

src/platforms/common/swdptap.c Outdated Show resolved Hide resolved
@@ -108,8 +121,9 @@ static uint32_t swdptap_seq_in_no_delay(const size_t clock_cycles)
for (size_t cycle = 0; cycle < clock_cycles; ++cycle) {
gpio_clear(SWCLK_PORT, SWCLK_PIN);
value |= gpio_get(SWDIO_IN_PORT, SWDIO_IN_PIN) ? 1U << cycle : 0U;
__asm__("" ::: "memory");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please insert a note here about this being a reordering barrier - same as in other places the trick is used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment. Both swdptap_seq_in_no_delay() and swdptap_seq_out_no_delay() need careful consideration in code influencing bitbanged timings, including effect of gpio_set()/clear() or gpio_set_val() and gpio_get().
I would like to see this entire routine fit into ART cache of STM32F4, or at least not jump around thanks to all the gpio_* bits inlined.

src/platforms/common/swdptap.c Outdated Show resolved Hide resolved
@dragonmux dragonmux changed the title Increase performance of SWJ-DP bitbanging Increase performance of SWD and JTAG bitbanging Nov 25, 2023
@koendv
Copy link
Contributor

koendv commented Nov 28, 2023

With #1688 and #1690 applied, on an stm32f411 platform, rtt has a measured speed of 114.292 chars/s if target is stm32f102, 121.297 chars/s if target is stm32f411.
The documentation UsingRTT.md mentions a speed of 50.000 chars/s, perhaps update this number if PR is merged.

@ALTracer
Copy link
Contributor Author

ALTracer commented Jan 1, 2024

With #1688 and #1690 applied, on an stm32f411 platform, rtt has a measured speed of 114.292 chars/s if target is stm32f102, 121.297 chars/s if target is stm32f411. The documentation UsingRTT.md mentions a speed of 50.000 chars/s, perhaps update this number if PR is merged.

Thanks; the documented speed will certainly need updating, more so because of #1536.
I repeated the tests myself on blackpill-f411ce platform (96 MHz) and a gd32f103cb DUT clocked to 72 MHz; then 96, and 64, and even 48 MHz by providing a modified SystemClockConfig(). At default poll interval settings of 256 8 10 it is ~97000 chars/s, at 50 8 10 it is as you say, ~120000 chars/s, but altering them to monitor rtt poll 25 5 10 yields ~167000-17000 chars/s. This means the bottleneck is not in DUT downclocked to 48MHz, but rather in the BMD/libopencm3 code. orbtop could provide more detailed data.
BMDA, for the record, manages only ~27000 chars/s using BMPremote with this platform (and non-1-byte read libc calls).

* Copy the snippet verbatim from 10 identical places
* Ask GCC to please inline it back (so codegen does not change)
  compromising 116 bytes of flash for less jumping around
* Copy the snippet verbatim from 10 similar places,
  preserving the +1 in argument (turnaround/parity)
  for automatic 0 clock delay instead of checking against UINT32_MAX
* Mark as static for GCC to inline it back (so codegen does not change)
  compromising 96 bytes of flash for less jumping around
* Use predecrement and check against zero, as before
* Stop touching the stack with `volatile` variables, a register is sufficient
* Use normal SysTick platform_delay() in jtagtap_reset(),
  which is both longer duration and does not depend on platform HCLK.
* Avoid a zero loops delay, which underflows in `platform_delay_busy()`, hanging the adapter
* For STM32F4, dial in 12 & 4 as measured by LA and MSO.
  This became faster thanks to dropping double writes and volatile on-stack counters.
* For STM32F0, dial in slightly bigger 16 & 6 because armv6-m codegen is different.
* For STM32F1, dial in 18 & 4: less than 22 thanks to no volatile on-stack counters,
  and similar busy delay impact (armv7-m).
* For everything else, keep the old numbers in hope someone measures it later.
* Reference: jtagtap_tms_seq_no_delay(). Also add comments.
* Use appropriate type for TMS state output.
* This allows dropping the 0-loops check from delay_busy
* Also update signature of jtagtap_tms_seq() (s/ticks/clock_cycles/)
* SWD line reset needs 50 clocks with high TMS/SWDIO,
  this was achieved by looping over jtagtap_next() which incurs function call delays;
* Use jtagtap_cycle() instead which seems appropriate here, and discards TDO anyways;
* The adiv5_swd.c:swd_line_reset_sequence() counterpart seems less fitting to duplicate in jtagtap.

* Annotate any new nops in hot no_delay path.
...by redirecting to faster tdi_seq
* Do not use nullability attributes yet
@tlyu
Copy link
Contributor

tlyu commented Jan 14, 2024

Thanks for working on this! I tried it out a bit, and it seems to sometimes set target_clk_divider to zero, causing underflow and very long delays. This would be on platforms that set BITBANG_CALIBRATED_FREQS. On the other platforms, there are added checks that prevent that.

At least the clock speeds seem more consistent across different routines.

There are lots of remaining oddities about the timing that I see on the logic analyzer, but I'm still investigating those.

@tlyu
Copy link
Contributor

tlyu commented Jan 15, 2024

The main improvement is making the parity read/write functions have a consistent pulse width compared to the non-parity ones, at least in the case that does delays.

For the non-delay case, there seems to be increased asymmetry in the clock waveform, possibly due to replacing the NOP instructions with barriers. Some of these barriers are redundant, because the GPIO accesses are already volatile.

Also, I only have a 24MHz logic analyzer, so my ability to analyze differences in the no-delay case are going to be confounded by inadequate sample rate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BMP Firmware Black Magic Probe Firmware (not PC hosted software) Enhancement General project improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants