-
The execution time of a CFU instruction depends on the architecture of the CFU itself. In your example the CFU hardwires
Here is a cut-out from the Vivado simulation of your design: execution starts at the yellow marker and ends at the blue marker.
-
Thank you so much for your kindness and your detailed explanation, as expected from your wonderful work on this project. So can I assume that the sum from the built-in ALU of the neorv32 and the one from the RCA in the CFU take the same overall number of clock cycles? (I ask this stupid question because my next task will, hopefully, be the implementation of an approximate-computing adder.) The C code that bugs me is this:
neorv32_cpu_csr_write(CSR_MCYCLE, 0); // start timing
uint32_t sum_sw = sum_sw_func(100, 100);
time_sum_sw = neorv32_cpu_csr_read(CSR_MCYCLE); // stop timing
// sum using RCA CFU
neorv32_uart0_printf("sum using RCA CFU\n");
neorv32_cpu_csr_write(CSR_MCYCLE, 0); // start timing
uint32_t sum_hw = custom_rca_sum(100, 100);
time_sum_hw = neorv32_cpu_csr_read(CSR_MCYCLE); // stop timing
Which results in this output comparing the two times:
Thank you so much again
-
OK, so even after toying with the proxy logic I'm not able to get a 1:1 execution 🤦♂️. custom_rca_sum() is a macro, as in your C code example for the CFU with the enc/dec functions. I'll look into the software sum being treated as a constant expression, as you mentioned, but I'll still have those 7 clock cycles to watch for too. Thank you again.
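To illustrate the constant-expression concern mentioned above: when both operands are literals, an optimizing compiler may compute the sum at build time, so the timed region contains no addition at all. A minimal sketch of one common workaround is shown here, reading the operands through volatile objects. The body of sum_sw_func is an assumption (the thread does not show it), and sum_sw_runtime_operands and check_runtime_sum are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed body: the thread only shows the call, not the definition. */
uint32_t sum_sw_func(uint32_t a, uint32_t b) {
    return a + b;
}

/* Hypothetical helper: the operands are read from volatile objects, so the
   compiler has to load them at run time and cannot replace the call with
   the precomputed constant 200. */
uint32_t sum_sw_runtime_operands(void) {
    volatile uint32_t a = 100;
    volatile uint32_t b = 100;
    uint32_t lhs = a; /* real load */
    uint32_t rhs = b; /* real load */
    return sum_sw_func(lhs, rhs);
}

/* Hypothetical sanity check of the helper above. */
void check_runtime_sum(void) {
    assert(sum_sw_runtime_operands() == 200u);
}
```

In a real measurement the two volatile loads would probably be done before the cycle counter is reset, so that only the addition itself falls inside the timed region.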
-
Would it be useful to have a way to bypass the proxy logic? Is it actually possible to implement a feature like this?
-
Just for the sake of sharing the code, here's what I came up with:
-- operation proxy --
cfu_wait(0) <= ctrl_i.alu_cp_cfu; -- reduce alu_wait
cfu_arbiter: process(rstn_i, clk_i)
begin
if (rstn_i = '0') then
cfu_wait(1) <= '0';
elsif rising_edge(clk_i) then
if(cfu_wait(1) /= cfu_wait(0)) then
cfu_wait(1) <= cfu_wait(0); -- forcing the output to be valid on the next cycle
else -- prepare for next operation
cfu_wait(1) <= '0';
end if;
end if;
end process cfu_arbiter;
cfu_run <= ctrl_i.alu_cp_cfu or cfu_wait(0); -- CFU operation in progress
cp_result(4) <= cfu_res when (cfu_wait(1) = '1') else (others => '0'); -- output gate
cp_valid(4) <= cfu_wait(0) and cfu_done;
I know that this is a pretty naive approach. The drawback is a longer critical path when the CFU contains a purely combinational block.
-
I was trying to implement a simple 32-bit RCA inside the neorv32 in order to learn how to use the CFU for future custom accelerators. However, even without pipelining the RCA, the CFU operation is reported to take 7 clock cycles in Vivado's behavioural simulation. Here are the files that I created by modifying the example files.
Is the CFU bound to take a minimum of 7 clock cycles, or am I doing something wrong?
Thank you in advance
files.zip
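For readers without the attachment, here is a minimal bit-level C model of a 32-bit ripple-carry adder that could serve as a software reference when checking the CFU result. It is only a sketch: rca32_model and rca32_self_check are hypothetical names and are not taken from files.zip or from the neorv32 sources.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-level model of a 32-bit ripple-carry adder: one full adder per bit,
   with the carry out of bit i feeding bit i+1. The final carry out is
   discarded, which matches 32-bit wrap-around addition. */
uint32_t rca32_model(uint32_t a, uint32_t b) {
    uint32_t sum = 0;
    uint32_t carry = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t ai = (a >> i) & 1u;
        uint32_t bi = (b >> i) & 1u;
        uint32_t si = ai ^ bi ^ carry;           /* full-adder sum bit   */
        carry = (ai & bi) | (carry & (ai ^ bi)); /* full-adder carry out */
        sum |= si << i;
    }
    return sum;
}

/* Hypothetical self-check against the native '+' operator. */
void rca32_self_check(void) {
    assert(rca32_model(100u, 100u) == 200u);
    assert(rca32_model(0xFFFFFFFFu, 1u) == 0u); /* carry out is dropped */
}
```

The model says nothing about timing. It only gives an expected value to compare against whatever the hardware adder returns.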