CFU minimum clock cycles #1045

H1alus · 2024-10-01T14:45:56Z


I was trying to implement a simple 32-bit RCA inside the neorv32 in order to learn how to use the CFU for later work on custom accelerators. However, even though the RCA is not pipelined, the CFU operation is reported to take 7 clock cycles in Vivado's behavioural simulation. Here are the files that I created by modifying the example files.
Is the CFU bound to a minimum of 7 clock cycles, or am I doing something wrong?

Thank you in advance

files.zip

stnolting · 2024-10-01T18:30:06Z

Maintainer

The execution time of a CFU instruction depends on the architecture of the CFU itself. In your example the CFU hardwires valid_o to one. Hence, execution is considered as completed within the next cycle. This makes two cycles. Together with the CPU's "dispatch" (get next instruction) and "execute" (execute the current instruction) this makes a total of 4 execution cycles for the CFU operation.

Here is a cut-out from the Vivado simulation of your design: execution starts at the yellow marker and ends at the blue marker.


H1alus · 2024-10-01T18:44:48Z

Author

Thank you so much for your kindness and your detailed explanation as expected from your wonderful work on this project.

So can I consider both the sum from the built in ALU of neorv32 and the one from the RCA of the CFU to take the same number of overall clock cycles? (I ask you this stupid question since my next work will hopefully be the implementation of an approximate computing adder.)
Why does the CSR_MCYCLE count 1 cycle for the sum with alu and 7 for the custom RCA?

The C code that bugs me is this:

neorv32_cpu_csr_write(CSR_MCYCLE, 0); // start timing
uint32_t sum_sw = sum_sw_func(100, 100);
time_sum_sw = neorv32_cpu_csr_read(CSR_MCYCLE); // stop timing

// sum using RCA CFU
neorv32_uart0_printf("sum using RCA CFU\n");

neorv32_cpu_csr_write(CSR_MCYCLE, 0); // start timing
uint32_t sum_hw = custom_rca_sum(100, 100);
time_sum_hw = neorv32_cpu_csr_read(CSR_MCYCLE); // stop timing

Which results in this output comparing the two times:

Execution timing:
SUM SW = 1 cycles
SUM HW = 7 cycles
Average speedup: ~0x

Thank you so much again

1 reply

stnolting Oct 1, 2024
Maintainer

So can I consider both the sum from the built in ALU of neorv32 and the one from the RCA of the CFU to take the same number of overall clock cycles?

Actually, no. Maybe this section in the data sheet needs some rework... 🙈

Normal ALU operations (excluding shifts) complete within 2 clock cycles ("dispatch" and "execute"). CFU instructions also require the "dispatch" and "execute" cycles. Additionally, they need another cycle for triggering the CFU and another one for getting the result. The latter one is caused by the "CFU proxy logic" (see below). This proxy logic is intended to ensure correct operation of the CPU even if a CFU designer messes up the interface. 😅 Unfortunately, there is no built-in way to remove/bypass this proxy.

neorv32/rtl/core/neorv32_cpu_alu.vhd

Lines 328 to 345 in a7f56cc

    
    -- operation proxy --
    cfu_arbiter: process(rstn_i, clk_i)
    begin
      if (rstn_i = '0') then
        cfu_wait <= (others => '0');
      elsif rising_edge(clk_i) then
        cfu_wait(1) <= cfu_wait(0);
        if (cfu_wait(0) = '0') then -- CFU is idle
          cfu_wait(0) <= ctrl_i.alu_cp_cfu; -- trigger new CFU operation
        elsif (cfu_done = '1') or (ctrl_i.cpu_trap = '1') then -- operation done or abort if trap (exception)
          cfu_wait(0) <= '0';
        end if;
      end if;
    end process cfu_arbiter;

    cfu_run      <= ctrl_i.alu_cp_cfu or cfu_wait(0); -- CFU operation in progress
    cp_result(4) <= cfu_res when (cfu_wait(1) = '1') else (others => '0'); -- output gate
    cp_valid(4)  <= cfu_wait(0) and cfu_done;

As a summary: the CFU needs 4 cycles (if your CFU logic has no additional wait states).
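For readers who want to step through the proxy by hand, here is a rough cycle-by-cycle model of the `cfu_arbiter` process in plain C. It is only a sketch and not part of the NEORV32 sources: the struct and function names are made up for illustration, and it assumes that `ctrl_i.alu_cp_cfu` is a single-cycle pulse and that "dispatch" and "execute" contribute one cycle each, as described above.

```c
#include <stdbool.h>

// Register state of the proxy: mirrors cfu_wait(0) and cfu_wait(1).
typedef struct {
  bool wait0;
  bool wait1;
} cfu_proxy_t;

// Combinational output: cp_valid(4) <= cfu_wait(0) and cfu_done
static bool cfu_proxy_valid(const cfu_proxy_t *p, bool cfu_done) {
  return p->wait0 && cfu_done;
}

// One rising clock edge of the cfu_arbiter process.
static void cfu_proxy_clock(cfu_proxy_t *p, bool alu_cp_cfu, bool cfu_done, bool cpu_trap) {
  bool next_wait0 = p->wait0;
  if (!p->wait0) {
    next_wait0 = alu_cp_cfu;   // CFU is idle: trigger new CFU operation
  } else if (cfu_done || cpu_trap) {
    next_wait0 = false;        // operation done or aborted by a trap
  }
  p->wait1 = p->wait0;         // cfu_wait(1) <= cfu_wait(0)
  p->wait0 = next_wait0;
}

// Cycles contributed by the proxy, from the trigger cycle up to and including
// the cycle in which cp_valid is high.
// Assumptions: alu_cp_cfu is a single-cycle pulse; cfu_done rises after
// `wait_states` extra cycles (0 models valid_o hardwired to '1').
static int cfu_proxy_cycles(int wait_states) {
  cfu_proxy_t p = { false, false };
  for (int cycle = 1; cycle < 1000; cycle++) {
    bool trigger = (cycle == 1);
    bool done = (wait_states == 0) || (cycle >= 2 + wait_states);
    if (cfu_proxy_valid(&p, done)) {
      return cycle;
    }
    cfu_proxy_clock(&p, trigger, done, false);
  }
  return -1; // never completed (not expected in this model)
}

// Total instruction cycles: "dispatch" + "execute" + proxy cycles.
static int cfu_instruction_cycles(int wait_states) {
  return 2 + cfu_proxy_cycles(wait_states);
}
```

Under these assumptions `cfu_instruction_cycles(0)` evaluates to 4, matching the summary above, and every extra wait state in the CFU adds one cycle.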

Why does the CSR_MCYCLE count 1 cycle for the sum with alu and 7 for the custom RCA?

Be careful with these cycle counts. Always have a look at the generated assembly code. I assume the compiler has trimmed this expression into a simple compile-time-constant expression:

uint32_t sum_sw = sum_sw_func(100, 100);

Hence, the mcycle-accessing instructions might be executing right one after another.

neorv32_cpu_csr_write(CSR_MCYCLE, 0); // start timing
time_sum_sw = neorv32_cpu_csr_read(CSR_MCYCLE); // stop timing

You should also have a look at the assembly code here:

uint32_t sum_hw = custom_rca_sum(100, 100);

Is custom_rca_sum a function or a define? If it is a function then you might have an additional call/return overhead.
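One common way to stop the compiler from folding `sum_sw_func(100, 100)` into a constant is to route the operands through `volatile` objects. The sketch below is only an illustration: the body of `sum_sw_func` is an assumption (the original was only shared as an attachment), and `op_a`, `op_b`, `sum_sink` and `timed_region_body` are hypothetical names.

```c
#include <stdint.h>

// Stand-in for the software adder from the thread (its body is an assumption).
static uint32_t sum_sw_func(uint32_t a, uint32_t b) {
  return a + b;
}

// Reading the operands through volatile objects forces run-time loads, so the
// compiler cannot replace the call with the constant 200.
static volatile uint32_t op_a = 100;
static volatile uint32_t op_b = 100;

// Storing the result to a volatile keeps the addition from being discarded.
static volatile uint32_t sum_sink;

static uint32_t timed_region_body(void) {
  uint32_t a = op_a;            // run-time load
  uint32_t b = op_b;            // run-time load
  sum_sink = sum_sw_func(a, b); // the add now has to execute at run time
  return sum_sink;
}
```

Note that the two loads and the store cost cycles of their own, so they shift the measured value. Checking the generated assembly remains the only reliable way to confirm what actually sits between the two mcycle accesses.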

H1alus · 2024-10-01T20:58:02Z

Author

Ok so even by trying to toy with the proxy logic I'm not able to have a 1:1 execution 🤦‍♂️.

The custom_rca_sum() is a macro as you've done in the C code example for the CFU with the enc/dec functions.

I'll look into the sum being reduced to a constant expression, but I'll still have those 7 clock cycles to watch out for.

Thank you again.

1 reply

stnolting Oct 2, 2024
Maintainer

Ok so even by trying to toy with the proxy logic I'm not able to have a 1:1 execution 🤦‍♂️

Unfortunately, that is true. CFU instructions have a minimum execution time of 4 cycles while ALU operations (except shifts) only need 2 cycles.

I'll look into the sum being reduced to a constant expression, but I'll still have those 7 clock cycles to watch out for.

There might be another offset because CSR read and write instructions require a different number of clock cycles until they "commit". Hence, I recommend tracking time deltas (using only CSR-read operations) instead of absolute values:

neorv32_cpu_csr_write(CSR_MCYCLE, 0); // reset counter only once at the beginning of the program
...
uint32_t t_0 = neorv32_cpu_csr_read(CSR_MCYCLE);
stuff();
uint32_t t_1 = neorv32_cpu_csr_read(CSR_MCYCLE);
things();
uint32_t t_2 = neorv32_cpu_csr_read(CSR_MCYCLE);
uint32_t delta_1 = t_1 - t_0; // = execution cycles of "stuff"
uint32_t delta_2 = t_2 - t_1; // = execution cycles of "things"
...
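The delta approach comes down to subtracting consecutive raw counter readings. A small host-runnable sketch of that arithmetic follows; `cycles_between` and `stamps_to_deltas` are hypothetical helper names, not part of the NEORV32 software library. It relies only on the fact that unsigned 32-bit subtraction in C is performed modulo 2^32, so a delta stays correct even if the low word of the counter wraps once between two readings.

```c
#include <stdint.h>

// Elapsed cycles between two raw 32-bit counter readings.
// Unsigned subtraction is modulo 2^32, so the result is still correct if the
// counter wrapped around once between the two readings.
static uint32_t cycles_between(uint32_t start, uint32_t end) {
  return end - start;
}

// Turn a list of raw timestamps into per-section deltas:
// deltas[i] = stamps[i + 1] - stamps[i]
static void stamps_to_deltas(const uint32_t *stamps, uint32_t *deltas, int num_stamps) {
  for (int i = 0; i + 1 < num_stamps; i++) {
    deltas[i] = cycles_between(stamps[i], stamps[i + 1]);
  }
}
```

For example, a reading of `0xFFFFFFF0` followed by `0x00000010` yields a delta of `0x20` cycles.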

H1alus · 2024-10-02T08:30:57Z

Author

Would it be useful to have a way to bypass the proxy logic? Is it actually possible to implement a feature like this?

2 replies

stnolting Oct 2, 2024
Maintainer

That seems to be a good idea! I think the proxy logic can be adjusted reducing the minimal execution time of a CFU instruction from 4 cycles down to 3 cycles. I'll have a look at this.

stnolting Oct 2, 2024
Maintainer

An optimized version of the CFU handshake and the CFU proxy is being developed in #1046. 🚀

H1alus · 2024-10-02T19:27:53Z

Author

Damn, impressive work, I scraped something this morning obtaining 3 clock cycles indeed but I think I may have compromised the proxy a bit. Here's the waveform:

The data as you can see is available for 2 clock cycles so I'll have to try this new one you made.

I took a quick dive into the assembly code from a testbench with 3 sums in C; however, CSR_MCYCLE counts 1 cycle for the assembly add instructions, while for the RCA with the exploited proxy it counted 6 cycles.

I guess the owner of the house knows what's best

4 replies

stnolting Oct 2, 2024
Maintainer

I scraped something this morning obtaining 3 clock cycles indeed

That looks great!

but I think I may have compromised the proxy a bit.

Don't worry too much about that. If the CPU still works then everything seems fine! 👍

I am now curious. 😅 Maybe the decoding can be further accelerated/improved to actually get to a minimum execution time of 2 cycles for CFU operations... 🤔

I took a quick dive into the assembly code from a testbench with 3 sums in C; however, CSR_MCYCLE counts 1 cycle for the assembly add instructions, while for the RCA with the exploited proxy it counted 6 cycles.

CSR instructions take longer than "normal" ALU instructions. Polling the cycle counter might not be the best approach for benchmarking single instructions. 🙈 It is better to benchmark a whole set of several hundred executions and calculate the average (removing the overhead bias).
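A rough sketch of that averaging idea: time a loop without the operation, time the same loop with it, subtract, and divide by the iteration count. To keep the example runnable on a host, the cycle counter is simulated. `read_cycles()` is a hypothetical stand-in for `neorv32_cpu_csr_read(CSR_MCYCLE)`, and all cycle costs in the simulation are made-up numbers, not NEORV32 timings.

```c
#include <stdint.h>

// --- host-side simulation (all costs below are made-up numbers) ---
static uint32_t sim_cycles = 0;

// Stand-in for neorv32_cpu_csr_read(CSR_MCYCLE); the read itself costs cycles.
static uint32_t read_cycles(void) {
  sim_cycles += 3;
  return sim_cycles;
}

// Loop body without the operation: models pure loop overhead.
static void body_empty(void) { sim_cycles += 2; }

// Loop body with the operation: same overhead plus a 4-cycle operation.
static void body_with_op(void) { sim_cycles += 2 + 4; }

// Cycles spent running `body` n times, including measurement overhead.
static uint32_t measure(void (*body)(void), uint32_t n) {
  uint32_t start = read_cycles();
  for (uint32_t i = 0; i < n; i++) {
    body();
  }
  return read_cycles() - start;
}

// Average cycles per operation with loop and counter-read overhead removed.
static uint32_t average_op_cycles(uint32_t n) {
  uint32_t baseline = measure(body_empty, n);
  uint32_t with_op  = measure(body_with_op, n);
  return (with_op - baseline) / n;
}
```

Because the counter-read cost and the loop overhead appear in both measurements, they cancel in the subtraction, and `average_op_cycles(500)` recovers the simulated 4-cycle cost. On real hardware the two loops would also need a look at the disassembly to make sure they differ only by the instruction under test.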

I guess the owner of the house knows what's best

Absolutely not. I lose track of certain aspects here far too often. That's why I'm glad to get tips like this.

H1alus Oct 2, 2024
Author

I tried to look into the CPU control but I couldn't find a way of avoiding the ALU_WAIT state, which I think is the main cause of this. Could it be possible to consider a CFU for purely combinational blocks, hence 1 cycle, and a CFU for multi-cycle blocks? The CFU could implement hardware that is similar to basic ALU functions while also having the benefit of the dedicated CSR for configuration.
I don't know what the best design choice would be in this case.

I'm quite limited on how many things I can do given that I then need to test for power reports in post implementation on vivado; I need those damn .saif files for power estimation and the simulation takes a lot even for 100 ms.

stnolting Oct 3, 2024
Maintainer

I tried to look into the CPU control but I couldn't find a way of avoiding the ALU_WAIT state, which I think is the main cause of this

That's right. This additional ALU_WAIT state helps to relax the logic depth in the previous EXECUTE stage making the critical path a bit shorter.

could it be possible to consider a CFU for purely combinational blocks hence 1 cycle and a CFU for multi cycle blocks?

Sure, that would be possible. But then the CFU logic would be right in the middle of the ALU's critical path (register file -> ALU -> write-back mux -> register file).

It is always difficult to weigh up which way you want to go here. If the CFU is purely combinatorial and located right in the middle of the ALU, then it needs one clock cycle less. But this also means that the critical path is a little bit longer, making all instructions (not just those for the CFU) theoretically run more slowly. 🙈
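The trade-off can be put into numbers with a back-of-the-envelope model: total run time is the total cycle count multiplied by the clock period. The function below is only a sketch, and the instruction mix and clock periods used with it are hypothetical values, not measurements of any NEORV32 configuration.

```c
// Total run time in nanoseconds for a mix of ALU and CFU instructions.
// All inputs are supplied by the caller; nothing here is measured data.
static double run_time_ns(unsigned n_alu, unsigned n_cfu,
                          unsigned alu_cycles, unsigned cfu_cycles,
                          double clock_period_ns) {
  unsigned long total_cycles = (unsigned long)n_alu * alu_cycles
                             + (unsigned long)n_cfu * cfu_cycles;
  return (double)total_cycles * clock_period_ns;
}
```

Take a hypothetical program with 1000 ALU instructions (2 cycles each) and 100 CFU instructions. With a 3-cycle CFU and a 10 ns clock period, `run_time_ns(1000, 100, 2, 3, 10.0)` gives 23000 ns. If moving the CFU into the ALU path saves one cycle per CFU instruction but stretches the clock period to 10.5 ns, `run_time_ns(1000, 100, 2, 2, 10.5)` gives 23100 ns, so the program as a whole gets slower even though each CFU instruction needs fewer cycles.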

I'm quite limited on how many things I can do given that I then need to test for power reports in post implementation on vivado; I need those damn .saif files for power estimation and the simulation takes a lot even for 100 ms.

I know these problems only too well - I sometimes have simulations that run for a whole night or a whole weekend...

Just as a tip: disable all hardware modules, extensions, etc. that you do not explicitly need to simplify the resulting netlist and to reduce simulation time.

H1alus Oct 3, 2024
Author

Thank you so much for such detailed explanation, I was thinking about merging the CFU alongside the alu basic instructions but I didn't want to go down that route so I think we're stuck with a minimum of 3 clock cycles for the CFU.

I have been running a minimal setup for the processor as you might remember from the issue I opened a couple days ago.

H1alus · 2024-10-02T19:37:32Z

Author

Just for the sake of sharing the code, here's what I came up with:

    -- operation proxy --
    cfu_wait(0) <= ctrl_i.alu_cp_cfu; -- reduce alu_wait
    cfu_arbiter: process(rstn_i, clk_i)
    begin
      if (rstn_i = '0') then
        cfu_wait(1) <= '0';
      elsif rising_edge(clk_i) then
        if (cfu_wait(1) /= cfu_wait(0)) then
          cfu_wait(1) <= cfu_wait(0); -- forcing the output to be valid on the next cycle
        else -- prepare for next operation
          cfu_wait(1) <= '0';
        end if;
      end if;
    end process cfu_arbiter;

    cfu_run      <= ctrl_i.alu_cp_cfu or cfu_wait(0); -- CFU operation in progress
    cp_result(4) <= cfu_res when (cfu_wait(1) = '1') else (others => '0'); -- output gate
    cp_valid(4)  <= cfu_wait(0) and cfu_done;

I know that this is a pretty naive approach. The drawback is a longer critical path when the CFU contains a purely combinational block.

1 reply

stnolting Oct 2, 2024
Maintainer

If it works for your setup then it is absolutely fine! 👍
