Hey @biosbob!
I think the bus architecture is the relevant factor here. I have no idea about your ibex setup, but maybe it uses a Harvard architecture with separate instruction and data memories and buses. In that case there are no wait states when the instruction fetch and data ports access memory at the same time. NEORV32 uses a von-Neumann approach: instruction fetch and data access share a single bus, so there is congestion when both ports access memory at the same time. The caches can help to reduce this congestion. Another interesting factor would be the FPGA utilization of the two setups. Furthermore, the maximum clock frequency of both cores would be interesting. Of course, these can only be compared when both setups are synthesized for the same technology/platform.
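One way to make this shared-bus congestion visible in software is to time an ALU-only loop against a load-heavy loop: on a von-Neumann core the second loop should cost noticeably more cycles per iteration, since fetch and data traffic compete for the same bus. A rough sketch, assuming a bare-metal machine-mode environment where `mcycle` is readable:

```c
#include <stdint.h>

/* Read the low 32 bits of the machine cycle counter (enough for short runs). */
static inline uint32_t rdcycle(void) {
    uint32_t c;
    __asm__ volatile ("csrr %0, mcycle" : "=r"(c));
    return c;
}

#define N 1024
static uint32_t buf[N];

/* ALU-only loop: instruction fetch is the only bus traffic. */
uint32_t alu_cycles(void) {
    uint32_t acc = 1, t0 = rdcycle();
    for (int i = 0; i < N; i++) {
        acc = acc * 3 + 1;
        __asm__ volatile ("" : "+r"(acc)); /* keep the loop from being folded away */
    }
    return rdcycle() - t0;
}

/* Load-heavy loop: instruction fetch and the data port now compete for the bus. */
uint32_t load_cycles(void) {
    uint32_t acc = 0, t0 = rdcycle();
    for (int i = 0; i < N; i++) {
        acc += buf[i];
        __asm__ volatile ("" : "+r"(acc));
    }
    return rdcycle() - t0;
}
```

Comparing the two per-iteration cycle counts (with and without the instruction cache enabled) should show how much of the difference comes from bus contention rather than from the cores themselves.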
I feel you 😉 The Lattice iCE FPGAs are among the tiniest FPGAs available out there. I like them and they are capable enough for small RISC-V SoCs. But if you want to implement more complex SoCs you might need to switch to a different FPGA / vendor.
-
i have a (portable) CFFT that operates on Q15 real/imag values, which i've used to benchmark cycles-per-instruction on neorv32 versus ibex (zero_riscy).... for the latter, i'm using the x-heep project....
the CFFT itself is written in (portable) Em and has been used on a variety of CPUs (RISC-V, ARM M0+, etc).... the code itself is quite compact (<300 bytes) and executes a 128-point transform in ~1-3 ms depending upon the CPU architecture and its clock-speed....
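For anyone unfamiliar with the Q15 format: a minimal C sketch of the fixed-point multiply at the heart of such a transform (illustrative only; the actual benchmark code is written in Em, and saturation handling is omitted):

```c
#include <stdint.h>

typedef int16_t q15_t; /* 1 sign bit + 15 fractional bits, range [-1.0, 1.0) */

/* Q15 multiply with rounding: the 32-bit product is Q30, so shift back by 15.
 * (The -1.0 * -1.0 overflow case would need saturation; omitted here.) */
static q15_t q15_mul(q15_t a, q15_t b) {
    int32_t p = (int32_t)a * (int32_t)b + (1 << 14);
    return (q15_t)(p >> 15);
}

/* Complex multiply on Q15 real/imag pairs, as used in an FFT butterfly:
 * (ar + j*ai) * (br + j*bi) */
static void cmul_q15(q15_t ar, q15_t ai, q15_t br, q15_t bi,
                     q15_t *cr, q15_t *ci) {
    *cr = (q15_t)(q15_mul(ar, br) - q15_mul(ai, bi));
    *ci = (q15_t)(q15_mul(ar, bi) + q15_mul(ai, br));
}
```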
in my neorv32 vs ibex benchmark, the total number of instructions (captured via `minstret`) in both executable images is 22564.... the architecture of both CPUs is `rv32imc`; and in the case of neorv32, i've configured `FAST_MULT_EN` and `FAST_SHIFT_EN` as true.... both images are executing out of SRAM (IMEM+DMEM); needless to say, the images are quite small....

here are the measured cycle counts (captured via `mcycle`):

one factor that might explain the ~2x difference is the FPGA itself: ibex is running on a PYNQ-Z2 board (which i believe has multiple SRAM banks); neorv32 is running on an iCEBreaker board (and using a single SRAM bank as IMEM and DMEM)....
question: does the neorv32 CPU (like ibex) have a "harvard architecture" in which IMEM and DMEM could be accessed simultaneously??? and if so, what suggestions would you have for an FPGA board (other than PYNQ!!!!) i could use for this and other benchmarks....
FWIW -- i'm using >90% of the LUTs on my iCEBreaker board, and have LOTS of other stuff i still want to add into my SoC....