neorv32 coremark scores vs zero-riscy #631

biosbob · 2023-06-12T16:04:42Z

biosbob
Jun 12, 2023
Collaborator

why is the neorv32 benchmark of 0.95 Coremark/MHz so low compared with the 2.44 Coremark/MHz reported for zero-riscy???

zero-riscy (now called ibex and will eventually become CV32E20 under the openhwgroup) was created in ETH zurich as part of the PULP project.... an early presentation can be found here....

possible reasons might be:

von-neumann versus harvard architecture???
differences in pipeline depth???
cache performance???

just want to increase my understanding of neorv32's current design -- so perhaps we can make it better 😉

stnolting · 2023-06-17T16:27:56Z

stnolting
Jun 17, 2023
Maintainer

Hey bob!
Sorry for the delay... busy week 🙈

possible reasons might be: ...

You are absolutely right there, but let me summarize the main differences that impact overall performance.

Pipeline

Basically, NEORV32 is a multi-cycle architecture. Hence, the execution of a single instruction requires several cycles. The fastest instructions (most ALU operations) need two cycles to complete, memory accesses (see below) require 5 cycles and jumps about 6 cycles.

Bus System

The CPU provides two independent interfaces for fetching instructions and accessing data making it a real Harvard architecture. However, the entire processor uses a single bus for instruction fetch and data access making it a von Neumann architecture. Obviously, this single bus is the system's bottleneck.

Bus Protocol

The bus interface protocol of the CPU ensures that each and every memory transaction is completed and acknowledged before the CPU moves on with execution.

Why?

There are a lot more design details that (might) impact overall performance. However, I chose those design aspects on purpose to make the design as small as possible. Ibex / Zero-riscy might be more performant but 1.) it requires more hardware resources and 2.) it cannot operate at higher clock frequencies because of the quite long critical path(s).

5 replies

stnolting Jun 19, 2023
Maintainer

as part of my exploration of #619, one could certainly interface a separate (32K) SRAM to the ibus and dbus exposed by the CPU complex.... putting aside some added complexity to handle dbus read/write to the ibus SRAM, wouldn't this help with the single-bus bottleneck???

Having separate buses would definitely increase performance. For an FPGA target, one could just use one large dual-ported BRAM (one port for data, one port for instructions) plus a little gateway in both CPU buses to access peripherals (data) and a boot ROM (instructions).

could you walk me through a "simple read/write" operation which would show the ibus and dbus active at the same time....

Do you mean the protocol of the CPU interfaces? A simple transaction can be found in the data sheet:

If the data and instruction ports try to access the unified processor bus at the same time, the instruction fetch request stays pending (no ACK) until the data access request is completed.
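The arbitration rule quoted above can be sketched as a tiny cycle-by-cycle model. This is only an illustration under simplifying assumptions (every granted access completes in one cycle, each port has at most one outstanding request); the function name `arbitrate` and the request encoding are made up for this example and are not taken from the NEORV32 sources.

```python
def arbitrate(ibus_req, dbus_req):
    """Return which port owns the shared bus in each cycle.

    ibus_req / dbus_req: one boolean per cycle, True when the port raises
    a request in that cycle. A request that is not granted stays pending
    (no ACK) until it is served.
    """
    grants = []
    i_pending = False
    d_pending = False
    for i_new, d_new in zip(ibus_req, dbus_req):
        i_pending = i_pending or i_new
        d_pending = d_pending or d_new
        if d_pending:        # data access is served first
            d_pending = False
            grants.append("D")
        elif i_pending:      # instruction fetch only if no data request waits
            i_pending = False
            grants.append("I")
        else:                # bus idle
            grants.append("-")
    return grants


# Both ports request in cycle 1: the fetch is delayed by one cycle.
print(arbitrate([True, True, False, False], [False, True, False, False]))
```

In the second cycle both ports request the bus, the data port wins, and the instruction fetch is served one cycle later.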

are we really saying that neorv32 does NOT have a multi-stage pipeline???? or is this feature independent of that decision???

Well, it depends on your point of view 😉

For a single instruction the CPU is a plain multi-cycle architecture. For the default setup, it requires one cycle for the instruction fetch request, another cycle to get the memory response (= the actual instruction word), another cycle for dispatching the instruction into execution and a final cycle for the actual execution.

For a stream of instructions the CPU is a 2-stage pipelined architecture. The first stage is the front-end that is responsible for the instruction fetch (taking 2 cycles per fetch; see above). The second stage is the back-end, which does the actual execution. Both stages operate independently of each other - so they can overlap; e.g. you can fetch further instructions and store them to the instruction prefetch buffer while the back-end is still crunching a 32-bit division.

hypothetically, what is the impact of NOT acknowledging a memory transaction -- are we trading some speedup for a lack of robustness???

Correct. Having a full ACK for each and every single memory transaction ensures a defined state of the system at all times (i.e. no out-of-order behavior at all, which makes predictability (and safety) much easier to achieve).

related to the whole topic of coremark, i'm currently implementing an EM-based version of the benchmark -- essentially translating the existing C code into EM while leveraging any number of language features to reduce code-size and hopefully to improve performance.... if nothing else, the entire EM coremark program will only require 8K of IMEM and 2K of DMEM -- enabling us to execute completely out of tightly-coupled SRAM without relying on FLASH/CACHE....

That sounds great! Keep me informed when you have some results available 😉 Btw, it would be quite interesting to benchmark your version with the CPU hardware performance monitors to see where all the execution cycles are spent.

one part of the ULP suite is measuring how many coremark iterations can i run per-mJ of energy!!!!

This would be really great as we do not have any kind of energy efficiency evaluation so far. 👍

without a "real chip" based on neorv32, however, we can't perform a side-by-side comparison of any of these ULP MCUs -- at least not by measuring current draw over time....

But we could benchmark energy consumption for an FPGA setup. Especially the low-power Lattice FPGAs would be interesting here - I think the results would be quite close to an ASIC implementation using an older technology node (maybe 180nm or 90nm).

biosbob Jun 20, 2023
Collaborator Author

so just to clarify something....

If the data and instruction ports try to access the unified processor bus at the same time, the instruction fetch request stays pending (no ACK) until the data access request is completed.

can this happen with just the CPU???? that is, can both the ibus and dbus of the CPU have requests at the same time????

and if so, does this mean the CPU is fetching the "next" instruction while read/writing memory for the "current" instruction???

stnolting-ims Jun 20, 2023
Collaborator

can this happen with just the CPU???? that is, can both the ibus and dbus of the CPU have requests at the same time????

Yes, that will happen 😉 The ibus is driven by the CPU front-end (= instruction fetch). The dbus is driven by the CPU back-end (= instruction execution). As front-end and back-end are basically two independent pipeline stages (see above) they operate in parallel and therefore also issue bus requests in parallel.

and if so, does this mean the CPU is fetching the "next" instruction while read/writing memory for the "current" instruction???

That's right. The front-end and the back-end are decoupled by a simple FIFO - the instruction prefetch buffer. The front-end will always try to keep this buffer full at any time. Hence, the front-end will fetch instructions way beyond the current program counter address.
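To make the overlap concrete, here is a rough cycle-level sketch of a front-end and a back-end decoupled by a FIFO. It assumes the 2-cycle fetch mentioned earlier in the thread and a straight-line instruction stream with no branches and no bus contention; the per-instruction execution latencies and the function name `run` are hypothetical and only serve the illustration.

```python
from collections import deque


def run(exec_cycles, ipb_entries=2, fetch_cycles=2):
    """Cycles needed to execute a straight-line instruction stream.

    exec_cycles: back-end cycles per instruction (hypothetical numbers).
    ipb_entries: depth of the instruction prefetch buffer (FIFO).
    """
    fifo = deque()
    total = len(exec_cycles)
    fetched = 0      # instructions requested by the front-end so far
    fetch_left = 0   # remaining cycles of the fetch in flight
    exec_left = 0    # remaining cycles of the instruction in execution
    done = 0
    cycle = 0
    current = None
    while done < total:
        cycle += 1
        # Back-end: start the next buffered instruction when idle.
        if exec_left == 0 and fifo:
            exec_left = exec_cycles[fifo.popleft()]
        if exec_left > 0:
            exec_left -= 1
            if exec_left == 0:
                done += 1
        # Front-end: keep fetching as long as the buffer has room.
        if fetch_left == 0 and fetched < total and len(fifo) < ipb_entries:
            fetch_left = fetch_cycles
            current = fetched
            fetched += 1
        if fetch_left > 0:
            fetch_left -= 1
            if fetch_left == 0:
                fifo.append(current)
    return cycle


stream = [1, 10, 1, 1]  # one long operation, e.g. a division
print(run(stream), "cycles with overlap,",
      sum(2 + e for e in stream), "cycles if fetch and execute never overlapped")
```

In this toy model the two instructions after the long operation are already waiting in the buffer when the back-end becomes free, so their fetch latency is hidden completely.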

biosbob Jun 20, 2023
Collaborator Author

i think that explains the overlap.... by the way, this is even more significant when you are using the C extension; one 32-bit read on the ibus could in fact yield two uncompressed instructions -- each of which could potentially use the dbus....

Answer selected by biosbob

biosbob · 2023-06-26T21:43:19Z

biosbob
Jun 26, 2023
Collaborator Author

i now have an EM-based implementation of coremark, which i've run on my icebreaker board.... my results are in line with the 0.95 coremarks/MHz reported in the datasheet, though i'm curious what board was used here.... with a clock-speed of 100MHz, was that a simulator????

for what it's worth, the C version in sw/example/coremark consumes 20280 bytes of code space; the corresponding EM version consumes only 9240 bytes of code.... note that i'm using LLVM, whereas sw/example/coremark is using GCC.... building the EM version with the most aggressive size optimization yields just 3672 bytes of code -- with execution time degraded by about 30%....

@stnolting -- i know you like the icebreaker board.... are you able to run coremark on this board????

6 replies

biosbob Jun 27, 2023
Collaborator Author

my score was 0.91 coremarks/MHz versus 0.95 reported in the datasheet.... here's the configuration i'm using with icebreaker:

        CLOCK_FREQUENCY              => 18000000,           -- clock frequency of clk_i in Hz
        INT_BOOTLOADER_EN            => true,               -- boot configuration: true = boot explicit bootloader; false = boot from int/ext (I)MEM
        -- RISC-V CPU Extensions --
        CPU_EXTENSION_RISCV_C        => true,               -- implement compressed extension?
        CPU_EXTENSION_RISCV_M        => true,               -- implement mul/div extension?
        CPU_EXTENSION_RISCV_Zicntr   => true,               -- implement base counters?
        CPU_EXTENSION_RISCV_Zifencei => true,               -- implement instruction stream sync.?
        -- Tuning Options --
        FAST_MUL_EN                  => true,               -- use DSPs for M extension's multiplier
        FAST_SHIFT_EN                => true,               -- use barrel shifter for shift operations
        CPU_IPB_ENTRIES              => 2,                  -- entries in instruction prefetch buffer, has to be a power of 2, min 1
        -- Internal Instruction memory --
        MEM_INT_IMEM_EN              => true,               -- implement processor-internal instruction memory
        MEM_INT_IMEM_SIZE            => 32*1024,             -- size of processor-internal instruction memory in bytes
        -- Internal Data memory --
        MEM_INT_DMEM_EN              => true,               -- implement processor-internal data memory
        MEM_INT_DMEM_SIZE            => 32*1024,             -- size of processor-internal data memory in bytes

what other extensions did you use in the benchmark running on the cyclone board????

stnolting Jun 27, 2023
Maintainer

FAST_MUL_EN => true

This really works? If I enable that option (using Lattice Radiant) I am running out of DSP slices for the M extension. 🤔

But basically, yes, this seems to be the most performant configuration. I am using CPU_IPB_ENTRIES => 2, and -O3 for the software compilation. Beyond that, I have a compiler with limited B support so I have also enabled CPU_EXTENSION_RISCV_B.

biosbob Jun 28, 2023
Collaborator Author

i tried enabling CPU_EXTENSION_RISCV_B but it exceeded the capacity of my IceBreaker board.... i just filed feature request #640, which suggests enabling Zb* sub-extensions individually.... i'm curious as to the impact on the coremark score if you were to DISABLE the B extension; that would give a more accurate comparison to the generics i listed above....

another difference is that i'm using LLVM as my compiler, while you were using GCC.... when i'm (usually!!!!) focused on code-size, LLVM is noticeably better than GCC; but when using -O3 optimization, GCC beats LLVM (at least on ARM)....

i'll give GCC a try and report back what i've found.... in the meanwhile, could you re-run your benchmarks without B....

biosbob Jun 28, 2023
Collaborator Author

GCC results (compared with LLVM) were quite impressive in execution time.... using -O3 on both compilers with my EM-based coremark yields the following:

compiler   text+data+bss     coremarks/MHz   notes
GCC        20280+12+2152     0.95            baseline C code
GCC        13740+276+1448    1.21            EM version
LLVM        9240+284+1448    0.91            EM version
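For readers who want the relative numbers, the following sketch normalizes the figures from the table above against the baseline C build. The dictionary keys and the helper name `relative` are made up for this example; the last line uses the 18 MHz clock from the iCEBreaker configuration posted earlier and the usual definition of CoreMark/MHz as iterations per second divided by the clock in MHz.

```python
# Figures copied from the table above: (text size in bytes, CoreMark/MHz).
results = {
    "GCC, baseline C": (20280, 0.95),
    "GCC, EM": (13740, 1.21),
    "LLVM, EM": (9240, 0.91),
}

base_size, base_score = results["GCC, baseline C"]


def relative(name):
    """Return (code-size ratio, score ratio) relative to the baseline C build."""
    size, score = results[name]
    return size / base_size, score / base_score


for name in results:
    size_ratio, score_ratio = relative(name)
    print(f"{name:16s} code size x{size_ratio:.2f}, score x{score_ratio:.2f}")

# CoreMark/MHz times the clock in MHz gives CoreMark iterations per second.
iterations_per_second = 0.91 * 18
print(f"LLVM/EM build at 18 MHz: {iterations_per_second:.2f} iterations/s")
```

So the GCC build of the EM version is roughly a quarter faster than the baseline while using about two thirds of the code space, and the LLVM build is slightly slower at less than half the code size.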

and from what i've heard, the baseline benchmark uses the B extension -- which exceeded my FPGA capacity....

very, very interesting results!!!!! and imagine how good it will get once we have the compact address space and separate IMEM + DMEM accesses....

stnolting Jun 30, 2023
Maintainer

GCC 13740+276+1448 1.21

That's great! Finally we're above 1!! 😅

and from what i've heard, the baseline benchmark uses the B extension -- which exceeded my FPGA capacity....

Well, it is optional. Of course you can turn that off. As compiler support for that extension is still quite limited in my GCC version, the performance gain is quite small even when all supported extensions are enabled (just ~2%).

very, very interesting results!!!!! and imagine how good it will get once we have the compact address space and separate IMEM + DMEM accesses....

I made some experiments with a Harvard-style memory system. The results are... surprising... and different than I would have expected. I will post them here.

stnolting · 2023-06-30T06:16:40Z

stnolting
Jun 30, 2023
Maintainer

I have tested a version of the core using a Harvard-style memory (the CPU can access IMEM and DMEM in parallel).

This is the basic setup:

CPU_I    CPU_D
  |        |
+------------+
|  CROSSBAR  |
+------------+
 |          |
IMEM      DMEM
          BOOT
       PERIPHERALS

Here is the crossbar I am using: neorv32_crossbar.txt - it is just some combinatorial logic based on the project's bus muxes.

I am using (a slightly modified version of) CoreMark as test workload. It is compiled using the following flags: make USER_FLAG+=-DRUN_COREMARK MARCH=rv32imc EFFORT=-O3 clean_all exe

This is the relevant part of the CPU configuration:

    -- RISC-V CPU Extensions --
    CPU_EXTENSION_RISCV_B        => false,
    CPU_EXTENSION_RISCV_C        => true,
    CPU_EXTENSION_RISCV_M        => true,
    CPU_EXTENSION_RISCV_U        => true,
    CPU_EXTENSION_RISCV_Zfinx    => false,
    CPU_EXTENSION_RISCV_Zicntr   => true,
    CPU_EXTENSION_RISCV_Zicond   => false,
    CPU_EXTENSION_RISCV_Zihpm    => true,
    CPU_EXTENSION_RISCV_Zifencei => true,
    CPU_EXTENSION_RISCV_Zxcfu    => false,
    -- Extension Options --
    FAST_MUL_EN                  => true,
    FAST_SHIFT_EN                => true,
    CPU_IPB_ENTRIES              => 4,

And here are the results.

                 Von-Neumann (default)   Harvard (crossbar)   Notes
CoreMark score   95                      95                   same
Total ticks      2169619 k               2147276 k            "same"
LUTs             5,489                   5,633                crossbar is quite big...
f_max            127.45 MHz              123.76 MHz           ... and entirely combinatorial

HPM counter                     Harvard (crossbar)   Von-Neumann (default)   Notes
Active clock cycles:            2147276135           2169619367              total clock cycles
Retired instructions:           1088907104           1123926444              what happened here??
Retired compr. instructions:    342475868            342475868
Instr.-fetch wait cycles:       0                    365050191               instruction fetch never needs to wait with the crossbar
Instr.-issue wait cycles:       254001852            253542381               mainly caused by branches
Multi-cycle ALU wait cycles:    96659050             96659050
Load operations:                108277848            108277848               number of load instructions
Store operations:               28390960             28390960                number of store instructions
Load/store wait cycles:         0                    22802703                data access never needs to wait with the crossbar
Unconditional jumps:            16654334             16654334
Conditional branches (all):     115064467            115064467
Conditional branches (taken):   57388389             57388389
Entered traps:                  0                    0
Illegal operations:             0                    0
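A quick way to read these counters is to turn them into cycles per instruction (CPI) and wait-cycle shares. The sketch below does that with the values copied from the counter list above. Which column belongs to which setup is inferred from the matching tick totals and from the notes (zero wait cycles with the crossbar), and the dictionary layout and helper names are made up for this example. Since the retired-instruction counts differ between the two runs for an unexplained reason, the CPI comparison should be taken with a grain of salt.

```python
# Counter values copied from the list above.
counters = {
    "harvard": {
        "cycles": 2147276135,
        "instructions": 1088907104,
        "fetch_wait": 0,
        "issue_wait": 254001852,
        "alu_wait": 96659050,
        "lsu_wait": 0,
    },
    "von_neumann": {
        "cycles": 2169619367,
        "instructions": 1123926444,
        "fetch_wait": 365050191,
        "issue_wait": 253542381,
        "alu_wait": 96659050,
        "lsu_wait": 22802703,
    },
}


def cpi(c):
    """Average clock cycles per retired instruction."""
    return c["cycles"] / c["instructions"]


def share(c, key):
    """Fraction of all active cycles attributed to one wait counter."""
    return c[key] / c["cycles"]


for name, c in counters.items():
    print(f"{name:12s} CPI = {cpi(c):.3f}, "
          f"fetch wait = {share(c, 'fetch_wait'):.1%}, "
          f"load/store wait = {share(c, 'lsu_wait'):.1%}")

# Relative cycle saving of the crossbar setup over the unified bus.
saving = 1 - counters["harvard"]["cycles"] / counters["von_neumann"]["cycles"]
print(f"cycle saving with crossbar: {saving:.2%}")
```

The crossbar removes all fetch and load/store wait cycles, yet the total cycle count drops by only about 1%, which matches the unchanged CoreMark score.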

Evaluation

This was quite unexpected I admit 😅 The current version that squeezes all memory operations into a single bus performs quite well! But why is that the case?! Well, there are several points that play together here:

I analyzed the instruction mix of some CoreMark sections. The most "memory intensive" sections consist of 30% load/store operations. So in ten instructions we can find three memory instructions.
The bus switch, which merges the instruction and data buses into a single SoC bus, prioritizes data requests. If the CPU executes a load/store then it will get served without wait states in most cases (see Load/store wait cycles). Yes, this blocks the instruction fetch, but the CPU has to wait for the data access anyway. Furthermore, there is no additional overhead for the instruction fetch because of the instruction prefetch buffer.

Lessons Learned

The unified bus is not the bottleneck when talking about performance. The real "bottleneck" is the strict and global in-order execution of the CPU itself (wait for final commit of each and every instruction before moving on). Yes, this costs performance. But this also reduces resources and makes the CPU so much more deterministic.

Btw, I have no idea why the number of retired instructions is different here... Maybe a bug in the counter logic? Something else? Any ideas?

Thoughts? Ideas? Comments? 😉

2 replies

biosbob Jun 30, 2023
Collaborator Author

this is interesting.... one question i've always had was whether the underlying FPGA has independent SRAM blocks that can be assigned to IMEM and DMEM.... if that is NOT the case, wouldn't memory throughput (and hence the coremark) remain unchanged....

putting aside LUTs, can i truly have independent IMEM+DMEM blocks on my IceBreaker that can be accessed simultaneously????

stnolting Jun 30, 2023
Maintainer

one question i've always had was whether the underlying FPGA has independent SRAM blocks that can be assigned to IMEM and DMEM

The FPGA has individual SRAM blocks that can be connected to form wider, larger or entirely independent blocks. And this is what is done by the processor setup: BOOTROM, IMEM and DMEM are mapped to individual SRAM blocks that can be accessed in parallel.

putting aside LUTs, can i truly have independent IMEM+DMEM blocks on my IceBreaker that can be accessed simultaneously????

This is exactly what I have tested using the crossbar. IMEM and DMEM were accessed in parallel. Since loads/stores are quite rare in comparison to instruction fetches there is no real performance gain as 1.) load/store operations are served first anyway and 2.) the CPU waits until the load/store operation is completed.
