-
why is the neorv32 benchmark of 0.95 Coremark/MHz so low compared with the 2.44 Coremark/MHz reported for zero-riscy??? zero-riscy (now called ibex and will eventually become CV32E20 under the openhwgroup) was created at ETH zurich as part of the PULP project.... an early presentation can be found here.... possible reasons might be:
just want to increase my understanding of neorv32's current design -- so perhaps we can make it better 😉
-
Hey bob!
You are absolutely right there, but let me summarize the main differences that impact overall performance.

Pipeline
Basically, NEORV32 is a multi-cycle architecture. Hence, the execution of a single instruction requires several cycles. The fastest instructions (most ALU operations) need two cycles to complete, memory accesses (see below) require 5 cycles and jumps about 6 cycles.

Bus System
The CPU provides two independent interfaces for fetching instructions and accessing data, making it a real Harvard architecture. However, the processor as a whole uses a single bus for instruction fetch and data access, making it a von Neumann architecture. Obviously, this single bus is the system's bottleneck.

Bus Protocol
The bus interface protocol of the CPU ensures that each and every memory transaction is completed and acknowledged before it moves on with execution.

Why?
There are a lot more design details that (might) impact overall performance. However, I chose those design aspects on purpose to make the design as small as possible. Ibex / Zero-riscy might be more performant, but 1.) it requires more hardware resources and 2.) it cannot operate at higher clock frequencies because of its quite long critical path(s).
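The cycle counts quoted in the Pipeline paragraph above can be turned into a rough average cycles-per-instruction (CPI) estimate. This is only a back-of-the-envelope sketch: the instruction mix below is a made-up assumption, not something measured on CoreMark, and the function name is hypothetical.

```python
# Cycles per instruction class, as quoted in the post above.
CYCLES = {"alu": 2, "mem": 5, "jump": 6}

# ASSUMPTION: a hypothetical instruction mix, NOT measured on CoreMark.
MIX = {"alu": 0.60, "mem": 0.25, "jump": 0.15}


def average_cpi(cycles, mix):
    """Weighted average of per-class cycle counts."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("instruction mix fractions must sum to 1")
    return sum(fraction * cycles[kind] for kind, fraction in mix.items())


cpi = average_cpi(CYCLES, MIX)
print(f"average CPI = {cpi:.2f}")
```

With any plausible mix the average stays well above 2, which is the basic reason a multi-cycle core scores lower per MHz than a pipelined core that approaches one instruction per cycle.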
-
i now have an EM-based implementation of coremark, which i've run on my icebreaker board.... my results are in line with the 0.95 coremarks/MHz reported in the datasheet, though i'm curious what board was used here.... with a clock-speed of 100MHz, was that a simulator???? for what it's worth, the C version in

@stnolting -- i know you like the icebreaker board.... are you able to run coremark on this board????
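For comparing numbers across boards and clock speeds, it helps to note that CoreMark/MHz does not depend on the clock frequency once you count cycles: score = iterations / seconds, seconds = cycles / f_clk, so score per MHz = iterations * 1e6 / cycles. A minimal sketch (the function name and the example figures are hypothetical, not results from this thread):

```python
def coremark_per_mhz(iterations, cycles):
    """CoreMark/MHz from the iteration count and the elapsed clock cycles.

    Derivation: (iterations / (cycles / f_hz)) / (f_hz / 1e6)
                = iterations * 1e6 / cycles
    so the clock frequency cancels out.
    """
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return iterations * 1e6 / cycles


# ASSUMPTION: illustrative numbers only, not a measured run.
example = coremark_per_mhz(iterations=95, cycles=100_000_000)
print(f"{example:.2f} CoreMark/MHz")
```

This is why a cycle-accurate simulation and a real board should agree on CoreMark/MHz as long as the memory timing is the same.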
-
I have tested a version of the core using a Harvard-style memory (the CPU can access IMEM and DMEM in parallel). This is the basic setup:
Here is the crossbar I am using: neorv32_crossbar.txt - it is just some combinational logic based on the project's bus muxes.

I am using (a slightly modified version of) CoreMark as test workload. It is compiled using the following flags:

This is the relevant part of the CPU configuration:

-- RISC-V CPU Extensions --
CPU_EXTENSION_RISCV_B => false,
CPU_EXTENSION_RISCV_C => true,
CPU_EXTENSION_RISCV_M => true,
CPU_EXTENSION_RISCV_U => true,
CPU_EXTENSION_RISCV_Zfinx => false,
CPU_EXTENSION_RISCV_Zicntr => true,
CPU_EXTENSION_RISCV_Zicond => false,
CPU_EXTENSION_RISCV_Zihpm => true,
CPU_EXTENSION_RISCV_Zifencei => true,
CPU_EXTENSION_RISCV_Zxcfu => false,
-- Extension Options --
FAST_MUL_EN => true,
FAST_SHIFT_EN => true,
CPU_IPB_ENTRIES => 4,

And here are the results.
Evaluation
This was quite unexpected, I admit 😅 The current version that squeezes all memory operations into a single bus performs quite well! But why is that the case?! Well, there are several points that play together here:
Lessons Learned
The unified bus is not the bottleneck when talking about performance. The real "bottleneck" is the strict and global in-order execution of the CPU itself (it waits for the final commit of each and every instruction before moving on). Yes, this costs performance. But it also reduces resources and makes the CPU so much more deterministic.

Btw, I have no idea why the number of retired instructions is different here... Maybe a bug in the counter logic? Something else? Any ideas? Thoughts? Comments? 😉
i think that explains the overlap.... by the way, this is even more significant when you are using the C extension; one 32-bit read on the ibus could in fact yield two compressed instructions -- each of which could potentially use the dbus....
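To make the point about the C extension concrete, here is a small sketch of how many complete instructions a single aligned 32-bit fetch can deliver. It relies only on the standard RISC-V length encoding (a 16-bit parcel whose two lowest bits are not 0b11 is a compressed instruction); the function name is hypothetical, and the sketch assumes the word starts on an instruction boundary.

```python
def complete_instructions_in_word(word):
    """Count the instructions fully contained in one 32-bit fetch word.

    ASSUMPTION: the low halfword is the start of an instruction.
    """
    low = word & 0xFFFF
    high = (word >> 16) & 0xFFFF

    def is_compressed(parcel):
        # RISC-V length encoding: bits [1:0] != 0b11 means a 16-bit instruction.
        return (parcel & 0b11) != 0b11

    if not is_compressed(low):
        return 1  # one 32-bit instruction fills the whole word
    if is_compressed(high):
        return 2  # two compressed instructions from a single fetch
    return 1      # compressed + first half of a 32-bit instruction


C_NOP = 0x0001      # c.nop
NOP = 0x00000013    # addi x0, x0, 0

print(complete_instructions_in_word((C_NOP << 16) | C_NOP))
print(complete_instructions_in_word(NOP))
```

In the two-instruction case a single ibus read feeds two execution slots, so the instruction side needs the bus less often and leaves more room for dbus traffic.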