Data Bypass and Scoreboard

This page is maintained by Denis Los

Data bypassing

Data bypassing (or data forwarding) is an optimization in pipelined CPUs to limit performance deficits which occur due to the pipeline stalls caused by RAW (Read after Write) dependencies.

Let us look at the following example:

add R8, R2, R3 ; (R8 <- R2 + R3)
sub R5, R8, R1 ; (R5 <- R8 - R1)

The second instruction has to wait for the first one to write its result back to the register file. However, the result will be ready as soon as sub approaches the Decode stage (assuming that there are no cache misses and such arithmetic instructions as add require only one cycle of execution to be complete), hence there would be no need to stall the pipeline if data was able to be obtained before an actual execution of the second instruction.

Scoreboard for data forwarding

Control Unit should be aware of the current state of the pipeline so as to be able to decide whether the stall is needed and identify a specific pipeline unit where data should be get from. It can be achieved by the implementation of the following structure which has much in common with a scoreboard:

Register	EXE_0	MEM	is in RF?	is ready for bypass?
R0		✓	no	yes
R1			yes	no
...
RN	✓		no	no

The first column denotes a set of registers which are used by the simulated CPU, while the next ones (EXE_0,...,EXE_N,MEM,WB) represent corresponding pipeline stages. The following flags are used to get the information whether a value of a source register should be obtained from a bypass, the register file should be accessed in order to get valid data or the pipeline has to be stalled because of the sources being not ready yet.

Implementation

Control and bypass logic is mainly kept inside DataBypass class with some common interface features segregated into separate classes.

RegisterStage is the class which is used to represent an execution stage in the pipeline an instruction which writes a result of its execution back to a specific register is in. For example, with the simulation of only four following pipeline stages (fetch, decode, exec, writeback) for simple arithmetic instructions possible options of RegisterStage value for R1, being used in add R1, R2, R3 pseudo-instruction, would be exec and writeback stages, because of the reference to the first two being not informative in the current context.
BypassCommand class is meant to be the tool of simulation connecting Decode stage with the first execution stage. Transferred through ports, its instances provide the execution stage with the information such as source registers which should obtain their values through bypass and ports of the simulator the corresponding data should be get from.

Constructing an object

An object of DataBypass class can be created using the following constructor with a latency of a long-latency execution unit passed as a parameter.

template <typename ISA>
explicit DataBypass( uint64 complex_alu_latency);

Entry in scoreboard

In our simulator a structure called RegisterInfo is used to implement an entry in the scoreboard.

struct RegisterInfo
{
    RegisterStage current_stage;
    RegisterStage ready_stage;
    RegisterStage next_stage_after_first_execution_stage;
    bool is_bypassible;
    bool is_traced;
};

It means that for each register we keep the accurate information about a specific stage it is being processed in at the moment of the investigation and about a pipeline stage when a functional unit is expected to be ready with a result, so that a register can be considered as ready for bypass as soon as the instruction which exploits it as a destination target has a result of execution ready too. Moreover, the information about the execution flow is kept as well in order to be able to update a state of the scoreboard properly with respect to a particular type of an instruction. Different execution flows will be discussed in detail in the chapter, dedicated to the implementation of non-unified pipeline.

Interface

DataBypass class

`bool is_stall( const Instr& instr)`

Determines whether the pipeline should be stalled due to data hazards, structural hazards or the support for precise exceptions. This method returns true, if an instruction instr can be successfully processed in the pipeline, and false otherwise. Please, refer to Stalling the pipeline for more details.

`bool is_in_RF( const Instr& instr, uint8 src_index)`

Determines whether a register file should be accessed so as to get a value of a source register with the index src_index of the instruction instr. It returns true, if the state of the source register is in_RF, or false otherwise.

`bool is_bypassible( const Instr& instr, uint8 src_index)`

Determines whether a value of a source register with the index src_index of the instruction instr can be obtained from bypass. It compares a current stage of the source register with the stage when it is expected to be ready and returns true, if the register is in the same stage or further in the pipeline, and false otherwise.

`void trace_new_instr( const Instr& instr)`

Garners information about new instruction instr and updates the scoreboard with its destination register.

Note: This function should be invoked after calling update method, for a status of recently renewed register does not need to be updated.

`void update()`

Updates the scoreboard and all concomitant structures. This method changes the current stages of all traced registers to the next ones in the pipeline in accordance with types of respective instructions and updates information about admissibility of data bypassing. Furthermore, control status used to determine the need for stalling the pipeline because of structural hazards and handling precise exceptions is updated as well.

`void handle_flush()`

Removes all the information regarding a current state of the pipeline due to a flush caused by a misprediction of a branch. It resets the information about traced registers by changing their states to in_RF.

Note: Information about destination registers of a mispredicted branch and all instructions processed in the subsequent stages (correctness of branch prediction is determined now in the mem stage) is discarded as well on the basis of assumption that all of these instructions will retire before any of the fetched anew approach the decode stage. It might become incorrect due to following reappraisals of the simulated model.

`BypassCommand get_bypass_command( const Instr& instr, uint8 src_index)`

Returns an instance of BypassCommand class carrying the information which pipeline stage a value of a source register with the index src_index of the instruction instr should be bypassed from.

RegisterStage class

void set_to_first_execution_stage();
void set_to_last_execution_stage( uint8 complex_alu_latency);
void set_to_mem_stage();
void set_to_writeback();
void set_to_in_RF();

bool is_first_execution_stage();
bool is_last_execution_stage( uint8 complex_alu_latency);
bool is_mem_stage();
bool is_writeback();
bool is_in_RF();

void inc();

Note: Not having initialized an instance of the RegisterStage class properly with the listed methods, the one should not expect it to work correctly. By default, the value is set to in RF.

BypassCommand class

`uint8 get_bypass_direction()`

Returns an integer number indicating which execution unit bypassing data should be get from.

Returned number	Type of bypass
0	EXE_0 — EXE_0
1	EXE_N — EXE_0
2	MEM — EXE_0
3	WB — EXE_0

Non-unified pipeline

Non-unified pipeline is a term which is used in the current context to describe an architecture of a pipelined in-order machine with non-unified latencies of execution units. Exact latency of a functional unit is determined by specifications of the simulated machine and complexity of its operations.

Now in our model of simulated CPU there are three major categories of instructions which can be distinguished on the basis of the set of pipeline stages they are in during the execution:

FE - fetch

DEC - decode

EXE_0 - first execution stage

EXE_N - last execution stage

MEM - mem

WB - writeback

Three categories

The first figure shows the execution flow of simple arithmetic instructions. There is a result produced in one of the arithmetic logic units (ALU) in the EXE_0 stage.
Branch executions, memory accesses, etc. Note that EXE and MEM are only the names of pipeline stages, but not a detailed description of operations usually done. For instance, EXE_0 stage is an address generation in one of address generation units (AGU) for loads and stores, but a calculation of condition of a branch in one of branch execution units (BEU) for branch instructions.
Arithmetic such as multiplication and division is processed in complicated long-latency arithmetic logic unit, which is simulated as fully pipelined. However, data bypassing from internal stages (EXE_1, EXE_2, ..., EXE_N-1) is not supported.

Command line options

A latency of arithmetic logic unit which operates on complicated arithmetic such as multiplication and division of integer numbers can be set from the command line with an integer number from 2 to 64 passed as a parameter

--complex-alu-latency - latency of the complicated arithmetic logic unit

Stalling the pipeline

MIPT-V / MIPT-MIPS — Cycle-accurate pre-silicon simulation.

Main Page
Lectures
Getting started
MIPS
C++
Build and Deployment
Tools
Communication
Functional simulator internals
Performance simulator internals
Studies
- Cache associativity
- IPC sensitivity to cache line size
Integration
- Interactive simulation with GDB
Library
About Us

Provide feedback

Saved searches

Use saved searches to filter your results more quickly