What ensures that addition/subtraction completes within a single clock cycle? #141

Aaron1011 · 2021-08-14T04:36:35Z

I was taking a look at how addition and subtraction are implemented in this CPU.
From what I understand of the logic, the actual computation occurs here:

neorv32/rtl/core/neorv32_cpu_alu.vhd (line 131 in 8bf4b70):

    addsub_res <= std_ulogic_vector(unsigned(op_a_v) + unsigned(op_y_v) + unsigned(cin_v(0 downto 0)));

This result gets used during instruction execution:

neorv32/rtl/core/neorv32_cpu_control.vhd (lines 1035 to 1037 in 49fdd28):

    else -- single cycle ALU operation
      ctrl_nxt(ctrl_rf_wb_en_c) <= '1'; -- valid RF write-back
      execute_engine.state_nxt <= DISPATCH;

During the next clock cycle, we will read the result from alu_i:

neorv32/rtl/core/neorv32_cpu_regfile.vhd (lines 97 to 98 in 49fdd28):

    if (ctrl_i(ctrl_rf_in_mux_c) = '0') then
      rf_wdata <= alu_i;

and store it into the destination register specified in the instruction we executed:

neorv32/rtl/core/neorv32_cpu_regfile.vhd (lines 115 to 116 in 49fdd28):

    if (ctrl_i(ctrl_rf_wb_en_c) = '1') then
      reg_file(to_integer(unsigned(opa_addr(4 downto 0)))) <= rf_wdata;

Assuming my understanding is correct:

From my very limited knowledge of VHDL, the implementation of + for the unsigned type is provided by the ieee.numeric_std.all package. Does the VHDL synthesis process automatically 'compile' this to use a dedicated adder on the target FPGA? If so, is the adder guaranteed to be fast enough to 'complete' within a single clock cycle?

Thanks for making this amazing project!

Answered by stnolting

stnolting (Maintainer) · 2021-08-14T08:09:36Z

Assuming my understanding is correct:

You are right. Simple ALU operations like "add" are processed within a single cycle: on the first rising edge of the clock, the operands are output from the register file and applied to the ALU (and thus to the adder). The data propagates through the circuit and arrives at the input of the register file (rf_wdata). Since the register file's write enable (ctrl_i(ctrl_rf_wb_en_c)) is also applied on this first edge, the computation result is written back to the register file on the next rising edge.

In the next cycle, the CPU moves to the DISPATCH state to get the next instruction and to prepare the output of the operation's operands from the register file.
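To make the timing described above concrete, here is a minimal sketch of a register-to-register path with an adder in between. The entity `single_cycle_add` and all of its port and signal names are hypothetical and not taken from the NEORV32 sources; it only mirrors the idea that operands leave a register on one edge and the sum is captured on the following edge.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity for illustration; not part of NEORV32.
entity single_cycle_add is
  port (
    clk_i : in  std_ulogic;
    we_i  : in  std_ulogic;                     -- write enable for the result register
    a_i   : in  std_ulogic_vector(31 downto 0);
    b_i   : in  std_ulogic_vector(31 downto 0);
    res_o : out std_ulogic_vector(31 downto 0)
  );
end entity single_cycle_add;

architecture rtl of single_cycle_add is
  signal op_a, op_b : std_ulogic_vector(31 downto 0); -- operand registers
  signal sum        : std_ulogic_vector(31 downto 0); -- combinational adder output
begin

  -- Purely combinational: no clock here, the sum settles after the adder's propagation delay.
  sum <= std_ulogic_vector(unsigned(op_a) + unsigned(op_b));

  regs: process(clk_i)
  begin
    if rising_edge(clk_i) then
      -- Edge N: new operands appear at the adder inputs.
      op_a <= a_i;
      op_b <= b_i;
      -- Edge N+1: the sum of the operands registered at edge N is captured.
      if (we_i = '1') then
        res_o <= sum;
      end if;
    end if;
  end process regs;

end architecture rtl;
```

The only requirement for this to work in hardware is that `sum` has settled before the next rising edge, which is exactly the clock period question discussed further down.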

From my very limited knowledge of VHDL, the implementation of + for the unsigned type is provided by the ieee.numeric_std.all package. Does the VHDL synthesis process automatically 'compile' this to use a dedicated adder on the target FPGA?

This is a great question I never really thought about 😄
I will try to break it down:

A package like ieee.numeric_std provides a fairly abstract but still very basic definition of each function it offers (see https://www.csee.umbc.edu/portal/help/VHDL/packages/numeric_std.vhd). The synthesis tool takes this description and tries to map it to the basic logic cells provided by the specific FPGA.

ADD is probably one of the most common operations, so most FPGAs provide optimized logic cells that allow a small and fast (low propagation delay) mapping of ADD-related functions, for example a "carry chain" that propagates the carry from one bit position to the next.

The synthesis tool is aware of those special FPGA features and can create efficient hardware for the addition.
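For intuition about what a "carry chain" computes, here is a hand-written ripple-carry adder. This is only a sketch of the logical function: the entity `ripple_add` and its names are made up for illustration, and a real synthesis tool is free to map a plain `+` to dedicated carry logic, LUTs, or something else entirely.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity for illustration; not part of NEORV32.
entity ripple_add is
  generic (WIDTH : positive := 32);
  port (
    a_i    : in  std_ulogic_vector(WIDTH-1 downto 0);
    b_i    : in  std_ulogic_vector(WIDTH-1 downto 0);
    cin_i  : in  std_ulogic;
    sum_o  : out std_ulogic_vector(WIDTH-1 downto 0);
    cout_o : out std_ulogic
  );
end entity ripple_add;

architecture rtl of ripple_add is
  signal carry : std_ulogic_vector(WIDTH downto 0); -- carry(i) is the carry into bit i
begin

  carry(0) <= cin_i;

  -- One full adder per bit; the carry ripples from bit 0 up to bit WIDTH-1.
  bits: for i in 0 to WIDTH-1 generate
    sum_o(i)   <= a_i(i) xor b_i(i) xor carry(i);
    carry(i+1) <= (a_i(i) and b_i(i)) or (carry(i) and (a_i(i) xor b_i(i)));
  end generate bits;

  cout_o <= carry(WIDTH);

end architecture rtl;
```

The path from `carry(0)` to `carry(WIDTH)` passes through every bit position, which is why FPGAs tend to provide fast dedicated routing for exactly this signal.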

If so, is the adder guaranteed to be fast enough to 'complete' within a single clock cycle?

I would see it the other way around. Synthesis creates a circuit that implements the addition. Let's assume the tool is "allowed to do whatever it wants" (no constraints, see below), so it will create some circuit. The longest path (= worst-case path) an electric signal can take from the circuit's input to the circuit's output defines the critical path. This path is characterized by a delay, since electric signals have a limited propagation speed. Hence, the critical path defines the maximum frequency the circuit can reliably operate at (f_max = 1 / critical path delay in seconds) and thus the minimum "length" (= time) of a single cycle.

There are options to give the synthesis tool some constraints, like specifying a minimum clock frequency that has to be reached, or more implementation-specific options like defining how to actually build the addition circuit (which FPGA primitives to use).
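As a quick worked example of that relation, assume a critical path delay of 10 ns. This number is made up for illustration and is not a measured NEORV32 value:

```latex
f_{\max} = \frac{1}{t_{\text{crit}}}
         = \frac{1}{10\,\text{ns}}
         = \frac{1}{10 \times 10^{-9}\,\text{s}}
         = 100\,\text{MHz}
```

If a better mapping of the adder cut that delay to 5 ns, the same design could be clocked at up to 200 MHz. The behaviour per cycle would stay the same.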

Thanks for making this amazing project!

❤️

umarcor (Collaborator) · 2021-09-12T21:25:03Z

Does the VHDL synthesis process automatically 'compile' this to use a dedicated adder on the target FPGA? If so, is the adder guaranteed to be fast enough to 'complete' within a single clock cycle?

For clarification: the synthesis tool might use a dedicated adder, it might create the adder from LUTs/Logic Elements or it might produce any other result; however, the logical operation will always complete within a single clock cycle, because that is what the designer described in (V)HDL. If the synthesis tool decided to introduce a register that delays the path, that'd be incorrect, because that is not the described behaviour.

Therefore, in practice, the difference between using a dedicated adder and not-so-optimised resources will be the resulting maximum clock frequency, in other words the delay between registers (as explained by @stnolting). The behaviour needs to be the same regardless.
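A small sketch of why an extra register would change the described behaviour rather than just the timing. The entity `add_latency` and its port names are hypothetical: both outputs carry the same sum, but `piped_o` only shows it one clock cycle later, so the two descriptions are not interchangeable.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity for illustration; not part of NEORV32.
entity add_latency is
  port (
    clk_i   : in  std_ulogic;
    a_i     : in  std_ulogic_vector(31 downto 0);
    b_i     : in  std_ulogic_vector(31 downto 0);
    comb_o  : out std_ulogic_vector(31 downto 0); -- follows the inputs within the same cycle
    piped_o : out std_ulogic_vector(31 downto 0)  -- same value, one clock cycle later
  );
end entity add_latency;

architecture rtl of add_latency is
  signal sum : std_ulogic_vector(31 downto 0);
begin

  -- Combinational adder: described without a clock, so it must stay without one.
  sum    <= std_ulogic_vector(unsigned(a_i) + unsigned(b_i));
  comb_o <= sum;

  -- Explicit extra register: this adds one cycle of latency by design.
  extra_reg: process(clk_i)
  begin
    if rising_edge(clk_i) then
      piped_o <= sum;
    end if;
  end process extra_reg;

end architecture rtl;
```

A synthesis tool has to preserve whichever of the two the designer wrote. It can only influence how long the `sum` path takes to settle, and therefore how fast `clk_i` may run.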
