Tutorial

How to use the RISC-V C++ Performance Model

This wiki gives a general overview of the RISC-V Performance Model (Olympia) and walks through trace generation, report generation, and pipeline-view (pipeout) generation.

What Olympia is

As described in the README.md of the main Olympia page, Olympia is a Performance Model written in C++ for the RISC-V community as an example of an Out-of-Order RISC-V CPU Performance Model based on the Sparta Modeling Framework.

The pipeline design is very rudimentary, with simple Fetch, Decode, Rename, Dispatch, and Execution blocks in the main pipeline. The memory system consists of a simple in-order load/store pipeline coupled to a simple bus interface unit communicating with a very simple memory subsystem. The design layout is very similar to the Sparta Core Example.

Tutorial Flow

This tutorial starts with trace generation using Dromajo, specifically Dhrystone (included in the repository). After the trace is generated the tutorial will focus on running that trace and looking for performance bottlenecks using tools like reporting, Argos pipeline viewing, and time series analysis.

Assumptions made:

The reader of this tutorial has successfully built the Olympia model using the directions found in the main README.md.
The traces, report definition files, etc., have not changed since this tutorial was composed.

Generating a Trace

Generating a trace for Olympia involves instrumenting a functional model like Spike or Dromajo (or any functional simulator that can run RISC-V software) with the STF library's writer API. Included in Olympia is a patch for Dromajo as well as documentation to build, run, and trace Dhrystone on Dromajo.

Traces are instruction streams -- the path an application took running on a RISC-V core. STF traces are binary files and can only be viewed using the STF library's reader API or with the STF tools (like stf_dump and stf_imem) found in the STF tracing tools repository.

For example, this is the command to view the instruction stream of the provided Dhrystone trace:

% stf_dump traces/dhry_riscv.zstf | less
VERSION             1.5
GENERATOR           Dromajo
GEN_VERSION         1.1.0
GEN_COMMENT         Trace from Dromajo
INST_IEM            RISCV
PID                 00000000:00000000:00000000
INST16    1         00000000000101ba                              00006722             c.ldsp      x14,8(x2)
                                     MEM READ          0000       0000003fffa90cb8     0000000000000000
INST32    2         00000000000101bc                              4f805d63             bge         x0,x24,0x00000000000106b6
INST32    3         00000000000101c0                              000247b7             lui         x15,0x24
INST32    4         00000000000101c4                              b4078793             addi        x15,x15,-1216 # 0x0000000000023b40
INST16    5         00000000000101c8                              00006398             c.ld        x14,0(x15)
                                     MEM READ          0000       0000000000023b40     0000000000000000
INST32    6         00000000000101ca                              000244b7             lui         x9,0x24
INST16    7         00000000000101ce                              00004905             c.li        x18,1
...
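Because stf_dump emits plain text, its output can be post-processed with ordinary scripting. The sketch below is only an illustration: it assumes the column layout seen in the excerpt above (record type first, then index, PC, opcode, and disassembly), and the helper name count_record_types is hypothetical, not part of the STF tools.

```python
from collections import Counter

# A few lines copied from the stf_dump excerpt above.
SAMPLE = """\
INST16    1         00000000000101ba                              00006722             c.ldsp      x14,8(x2)
                                     MEM READ          0000       0000003fffa90cb8     0000000000000000
INST32    2         00000000000101bc                              4f805d63             bge         x0,x24,0x00000000000106b6
INST32    3         00000000000101c0                              000247b7             lui         x15,0x24
"""


def count_record_types(dump_text):
    """Count INST16/INST32 records and memory accesses in stf_dump text."""
    counts = Counter()
    for line in dump_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] in ("INST16", "INST32"):
            counts[fields[0]] += 1
        elif fields[0] == "MEM" and len(fields) > 1:
            # Memory access lines look like "MEM READ ..." in the excerpt.
            counts["MEM " + fields[1]] += 1
    return counts


counts = count_record_types(SAMPLE)
for name, n in sorted(counts.items()):
    print(f"{name}: {n}")
```

On a real trace, the same function could be fed the captured output of stf_dump to get a rough ratio of compressed (16-bit) to full-width (32-bit) instructions.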

Similarly, this is the command to view a sorted instruction memory dump of the provided Dhrystone trace:

% stf_imem -S traces/dhry_riscv.zstf | less

Traces can be extended to include registers and their values per instruction, PTE entries, escape records with speculative paths, exception information, etc. This is beyond the scope of this tutorial, however.

Running a Generated Trace

Running a trace (expected extension zstf or stf) on Olympia is as simple as providing the trace to the simulator:

% ./olympia <trace_file>.zstf

Olympia does, however, support simple JSON input files as well. This is handy if a performance architect is interested in a simple what-if analysis, like load-to-use latency:

[
    {
        "mnemonic": "lw",
        "rs1": 4,
        "rd": 5,
        "imm": 100,
        "vaddr" : "0xdeadbeef"
    },
    {
        "mnemonic": "add",
        "rs1": 5,
        "rs2": 2,
        "rd":  1
    }
]

Running this JSON file on olympia with "infinite caches" gives a general idea of the latency from load issue time to the add execution time.

./olympia -p top.cpu.core0.lsu.params.dl1_always_hit true load_add_dependency.json
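Hand-written JSON is easy to get subtly wrong, so one option is to generate the workload file from a script. The following is a minimal sketch that uses only the fields shown in the example above; whether Olympia accepts other fields is not covered here.

```python
import json

# Two dependent instructions: the add reads x5, which the load writes.
instructions = [
    {"mnemonic": "lw", "rs1": 4, "rd": 5, "imm": 100, "vaddr": "0xdeadbeef"},
    {"mnemonic": "add", "rs1": 5, "rs2": 2, "rd": 1},
]

# Serialize, then parse the result back to confirm it is well-formed JSON.
text = json.dumps(instructions, indent=4)
round_trip = json.loads(text)

print(text)
```

Saving the printed text as load_add_dependency.json gives a file that can be passed to the command above.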

More on analyzing such an example later in the tutorial.

Run Dhrystone on the Simulator

Run the provided trace of Dhrystone on the simulator, specifically from the build directory where the olympia binary resides:

./olympia traces/dhry_riscv.zstf --auto-summary on

This will run the default configuration of olympia on 2.3 million instructions of Dhrystone trace in roughly 6 seconds.

Tweaking Parameters

Each unit in Olympia has parameters that it uses at startup/runtime to change/manipulate behavior. A comprehensive list of parameters can be viewed using the following command line options:

./olympia --no-run <parameter option>
#  --show-parameters                                # Dump to the console the parameters found in the tree
#  --write-final-config <config name>.yaml          # Dump the final parameters to a YAML file
#  --write-final-config-verbose  <config name>.yaml # Dump the final parameters to a YAML file with descriptions

--no-run is handy to prevent the simulator from complaining that no workload was provided.

Parameters can be changed on the command line using the -p option or via a configuration YAML file allowing for a list of parameters:

# Set the Dispatch Queue Depth
./olympia -p top.cpu.core0.dispatch.params.dispatch_queue_depth 12 traces/dhry_riscv.zstf

Create a configuration file and run it.

cat > dispatch_params.yaml
top:
    cpu.core0.dispatch.params.dispatch_queue_depth: 12
    cpu.core0.dispatch.params.num_to_dispatch: 3
# <ctrl-D>
./olympia -c dispatch_params.yaml traces/dhry_riscv.zstf
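When sweeping parameters from a script, it can be convenient to build the command line from a dictionary. This is a sketch under the assumption that -p may be given once per parameter; the helper olympia_command is hypothetical and only constructs the argument list, it does not run the simulator.

```python
import shlex


def olympia_command(params, workload, binary="./olympia"):
    """Build an argv list with one '-p <name> <value>' group per parameter."""
    argv = [binary]
    for name, value in params.items():
        argv += ["-p", name, str(value)]
    argv.append(workload)
    return argv


params = {
    "top.cpu.core0.dispatch.params.dispatch_queue_depth": 12,
    "top.cpu.core0.dispatch.params.num_to_dispatch": 3,
}
argv = olympia_command(params, "traces/dhry_riscv.zstf")

# shlex.join quotes arguments so the line can be pasted into a shell.
print(shlex.join(argv))
```

The resulting list could be handed to subprocess.run from the build directory to launch each experiment.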

Running Architectures

Architectures are another way to group parameters that together represent an architecture configuration. In Olympia, three made-up architectures are provided:

ls arches/*.yaml
# big_core.yaml  medium_core.yaml  small_core.yaml

Each architecture builds on top of the previous one:

head -8 arches/big_core.yaml 
#
# Set up the pipeline for a 8-wide machine
#

# Build on top of a medium core
include: medium_core.yaml

The include statement allows big_core to build on top of medium_core, etc. This allows changes, for example, in medium_core to be automatically included in big_core.

To run an architecture, supply the name of the architecture to the --arch <arch_name> command line option:

./olympia --arch medium_core traces/dhry_riscv.zstf

olympia automatically looks in the arches directory for named architectures.

By default, olympia runs the small_core architecture.

The main difference between each of the cores is the width of the machine and the number of ALU/FPU/BR units.

Report Generation

One of the most powerful features of the Sparta Modeling Framework is the ability to generate precise reports in a multitude of formats. Reports are the first insight to how an application (trace) is performing on a given modeled architecture.

Reports in Sparta are provided in two forms: definitions (.def) and contents (.yaml). Depending on how the modeler wants to collect a report, one of the two formats will be provided.

Report definitions (.def) allow the modeler to control how and when a report is collected. For example, collecting a report starting from the very first instruction of a trace will include statistic/counter values that reflect cold cache/branch prediction effects. Starting a report at a distance into the trace will avoid those cold cache effects and provide a more "steady state" report.

Report content files (.yaml) indicate to the reporting mechanism what to collect from the simulator. A content file can include all of the statistics/counters or a small subset.

Report definition files include report content files. But either file can be provided to the --report command line option.

Start with the simplest report, the auto-summary report. This is a report of every stat regardless of visibility.

# Run 1 million instructions for brevity
% ./olympia --auto-summary on --workload traces/dhry_riscv.zstf -i 1M

Next, generate a report that constrains the statistics/counters to only those that are not hidden using a content report file (.yaml):

% cat reports/core_stats.yaml 
#
# Auto populate the report with stats that are not marked hidden
# https://sparcians.github.io/map/classsparta_1_1InstrumentationNode.html#a855b6ecdd93e412052ae264032002ce1
#
content:
    autopopulate:
         attributes: "!=vis:hidden"
% ./olympia --report "top" reports/core_stats.yaml 1 text  --workload traces/dhry_riscv.zstf -i1M

The command line option --report broken down:

top means start collection at the given node: top.
Try this: top.cpu.core0.dispatch
reports/core_stats.yaml is the content file.
Try this: reports/dhry_report.yaml
1 means send the report to stdout.
Try this: 2 for stderr or my_report.out to save it in a file
text means save the report in text format.
Try this: html for html output, but save it to my_report.html.

Another example to try:

# Supply the --report option as many times as needed
./olympia --report "top.cpu.core0.dispatch" reports/core_stats.yaml dispatch_stats.text text  \
          --report "top.cpu.core0.rob"      reports/core_stats.yaml rob_stats.text      text  \
          --report "top.cpu.core0"          reports/core_stats.yaml core0_stats.text    text  \
          --workload traces/dhry_riscv.zstf -i1M

The above examples generate reports directly from a content file. Using a definition file, more report control can be added.

Included in Olympia is a core report definition file core_report.def in the reports directory. In this definition file, there are two reports expected to be generated:

A report for the entire workload
A report started after a certain number of instructions have elapsed

The definition file:

% cat reports/core_report.def
content:
  # Report 1: Start from time/inst == 0 and collect everything
  report:
    pattern:   top
    def_file:  reports/core_stats.yaml
    dest_file: %OUT_BASE%.%OUT_FORMAT%
    format:    %OUT_FORMAT%
  # Report 2: Start from inst == INST_START
  report:
    pattern:   top
    def_file:  reports/core_stats.yaml
    dest_file: %OUT_BASE%_delayed.%OUT_FORMAT%
    format:    %OUT_FORMAT%
    trigger:
      start:   cpu.core0.rob.stats.total_number_retired >= %INST_START%

The report keyword starts a new report, at the given pattern using the content file core_stats.yaml.

Each report has a given destination file (dest_file) and format (format), but the names are replacement strings. Anything enclosed in % characters is a placeholder expected to be replaced at simulation runtime using the command line option --report-yaml-replacements <placeholder_name> <value> [<placeholder_name> <value>]. This is handy when a performance architect is running an experiment on many workloads.

Finally, the second report generated is different from the first in that it is triggered to start statistics/counter collection at the given start. In this case, when the total number of retired instructions is equal to or exceeds the replaced INST_START the report will begin. For more information on triggers, see this README in the Sparta Modeling Framework repository.
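To make the placeholder mechanism concrete, the sketch below mimics the substitution on a fragment of the definition file. It is an illustration of the idea only, not Sparta's implementation; the function apply_replacements is hypothetical.

```python
import re

# Fragment of the second report from reports/core_report.def.
DEFINITION = """\
dest_file: %OUT_BASE%_delayed.%OUT_FORMAT%
format:    %OUT_FORMAT%
trigger:
  start:   cpu.core0.rob.stats.total_number_retired >= %INST_START%
"""


def apply_replacements(text, replacements):
    """Replace each %NAME% token; an unknown placeholder raises KeyError."""
    return re.sub(r"%(\w+)%", lambda m: replacements[m.group(1)], text)


resolved = apply_replacements(
    DEFINITION,
    {"OUT_BASE": "dhry_1M_insts", "OUT_FORMAT": "text", "INST_START": "100k"},
)
print(resolved)
```

Raising on an unknown placeholder is a deliberate choice in this sketch: a silently unreplaced %NAME% would otherwise end up in a file name or trigger expression.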

Here's an example of how to use a report definition file:

./olympia --report reports/core_report.def \
          --report-yaml-replacements OUT_BASE dhry_1M_insts OUT_FORMAT text INST_START 100k \
          --workload traces/dhry_riscv.zstf -i 1M

This command will generate reports dhry_1M_insts.text and dhry_1M_insts_delayed.text. Diff these files to notice the difference in stats, particularly in instruction count (it will be 100K fewer):

% diff -y dhry_1M_insts.text dhry_1M_insts_delayed.text | grep total_number_retired
          total_number_retired = 1000000		      |	          total_number_retired = 900000
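For larger reports, a short script can be easier to read than diff output. The sketch below assumes the text report consists of "name = value" lines, as in the excerpts in this tutorial; parse_report and changed_stats are hypothetical helper names, and example_unchanged_stat is a made-up entry included only to show that identical values are filtered out.

```python
def parse_report(text):
    """Return {stat_name: value} for every numeric 'name = value' line."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue  # headers and blank lines carry no '='
        try:
            stats[name.strip()] = float(value)
        except ValueError:
            pass  # skip lines whose right-hand side is not numeric
    return stats


def changed_stats(before, after):
    """Return the stats present in both reports whose values differ."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


full = parse_report(
    "    total_number_retired = 1000000\n    example_unchanged_stat = 0\n"
)
delayed = parse_report(
    "    total_number_retired = 900000\n    example_unchanged_stat = 0\n"
)

for name, (a, b) in changed_stats(full, delayed).items():
    print(f"{name}: {a:.0f} -> {b:.0f} (delta {b - a:+.0f})")
```

In practice the two strings would be replaced by the contents of dhry_1M_insts.text and dhry_1M_insts_delayed.text.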

Command line option breakdown:

Option --report reports/core_report.def: Use the definition file for report generation. Note the lack of other options as compared to the previous example
Option --report-yaml-replacements OUT_BASE dhry_1M_insts OUT_FORMAT text INST_START 100k:
Replace OUT_BASE in the definition file with dhry_1M_insts
Replace OUT_FORMAT in the definition file with text
Replace INST_START in the definition file with 100k

Report content files are very flexible in the types of expressions that can be written. For example, since the trace is Dhrystone, a report that provides a DMIPs calculation would be very helpful.

Included in Olympia is a report content file called reports/dhry_report.yaml that calculates DMIPs for this example model:

content:
  top:
    "cpu.core0.rob.stats.ipc" : "IPC"

    # Assuming 1 iteration is about 265 instructions, calculate the
    # cycles per iteration
    "265 / cpu.core0.rob.stats.ipc" : "Cycles Per Iteration"

    # Assume a 1MHz part, determine the iterations per second
    "1e6 / (265 / cpu.core0.rob.stats.ipc)" : "Iterations Per Second"

    # Assume a 1MHz part, determine the DMIPs based on the VAX 11/780
    # achieving a 1757 Dhrystones per second score
    "(1e6 / (265 / cpu.core0.rob.stats.ipc)) / 1757" : "DMIPS Per MHz"

The report content YAML file contains name/value pairs of the form <expression> : <final stat name>. Sparta will parse the expression and determine whether it is something to be evaluated or resolves to an already existing statistic in the model (read about Sparta Statistical Expressions).

Running this report is the same as the previous example:

% ./olympia --report "" reports/dhry_report.yaml 1 text  --workload traces/dhry_riscv.zstf

... output

Report "reports/dhry_report.yaml on _SPARTA_global_node_" [0,2230144]
    IPC = 1.07169
    Cycles Per Iteration = 247.273
    Iterations Per Second = 4044.12
    DMIPS Per MHz = 2.30172

Here the collection node is blank ("") since the report itself specifies top.
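The arithmetic behind the report expressions can be checked by hand. The sketch below repeats the three expressions from reports/dhry_report.yaml in Python, using the IPC printed above as input; the constant and function names are illustrative. Because the printed IPC is rounded, the results match the report only to within rounding.

```python
INSTS_PER_ITERATION = 265       # assumption stated in the report file
CLOCK_HZ = 1e6                  # the report file assumes a 1 MHz part
VAX_DHRYSTONES_PER_SEC = 1757   # reference score used by the report file


def dhrystone_metrics(ipc):
    """Mirror the three expressions in reports/dhry_report.yaml."""
    cycles_per_iteration = INSTS_PER_ITERATION / ipc
    iterations_per_second = CLOCK_HZ / cycles_per_iteration
    dmips_per_mhz = iterations_per_second / VAX_DHRYSTONES_PER_SEC
    return cycles_per_iteration, iterations_per_second, dmips_per_mhz


cycles, iterations, dmips = dhrystone_metrics(1.07169)
print(f"Cycles Per Iteration  ~ {cycles:.3f}")
print(f"Iterations Per Second ~ {iterations:.2f}")
print(f"DMIPS Per MHz         ~ {dmips:.5f}")
```

Note that with these assumptions DMIPS per MHz is simply IPC scaled by a constant (1e6 / 265 / 1757), so any IPC improvement translates directly into the DMIPS figure.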

To see the impact of "larger" cores on this trace, try the following commands and notice the improvement in performance. Remember that olympia defaults to the small_core "architecture."

./olympia                    --report "" reports/dhry_report.yaml small_core_dhry.text text  --workload traces/dhry_riscv.zstf &
./olympia --arch medium_core --report "" reports/dhry_report.yaml medium_core_dhry.text text --workload traces/dhry_riscv.zstf &
./olympia --arch big_core    --report "" reports/dhry_report.yaml big_core_dhry.text text    --workload traces/dhry_riscv.zstf &

Look at the DMIPS results:

% grep DMIPS *.text | column -t | sort  -r
small_core_dhry.text:   DMIPS  Per  MHz  =  2.30172
medium_core_dhry.text:  DMIPS  Per  MHz  =  2.59233
big_core_dhry.text:     DMIPS  Per  MHz  =  2.65948
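A quick way to put these three scores in perspective is to normalize them to the small core. The sketch below uses only the values printed above.

```python
# DMIPS Per MHz values copied from the grep output above.
dmips = {
    "small_core": 2.30172,
    "medium_core": 2.59233,
    "big_core": 2.65948,
}

baseline = dmips["small_core"]
speedup = {arch: score / baseline for arch, score in dmips.items()}

for arch, factor in speedup.items():
    print(f"{arch:12s} {factor:.3f}x ({(factor - 1) * 100:+.1f}%)")
```

With these numbers the step from small_core to medium_core is noticeably larger than the step from medium_core to big_core, which suggests that something other than machine width limits this trace on the larger cores.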

Time Series Report Generation

Running workloads through a performance simulator can show varying effects on IPC. For example, if a workload is 10M instructions, there might be "pockets" of high IPC performance followed by low performance. Overall, however, the IPC will average out to something in between. This can be misleading and "hide" potential performance bottlenecks.

Generating a time series report can help find where those "ups" and "downs" are in time. Time series reports should always be generated using a constant such as instruction count since from one run to another this will remain the same -- the simulator will always get through 10M instructions for example.
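The following sketch shows how an overall IPC can hide a slow region. The cycle counts are hypothetical numbers invented for this illustration and are not taken from any Olympia run.

```python
# Hypothetical cycle counts for five consecutive 100K-instruction intervals.
INTERVAL_INSTS = 100_000
cycles_per_interval = [50_000, 55_000, 200_000, 60_000, 52_000]

# IPC over the whole run versus IPC within each interval.
total_insts = INTERVAL_INSTS * len(cycles_per_interval)
overall_ipc = total_insts / sum(cycles_per_interval)
interval_ipc = [INTERVAL_INSTS / cycles for cycles in cycles_per_interval]

# Flag intervals running at less than half of the overall IPC.
slow = [i for i, ipc in enumerate(interval_ipc) if ipc < 0.5 * overall_ipc]

print(f"overall IPC: {overall_ipc:.3f}")
for i, ipc in enumerate(interval_ipc):
    marker = "  <-- slow" if i in slow else ""
    print(f"interval {i}: IPC {ipc:.3f}{marker}")
```

Sampling on a fixed instruction count, as recommended above, keeps the intervals comparable between runs: interval 2 covers the same instructions whichever configuration is simulated.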

Time series reports are generated only in csv format and can be generated only by using a report definition file (see examples above for information regarding report definition files).

Included in olympia is the report definition file reports/core_timeseries.def to make a time series report:

% cat reports/core_timeseries.def
content:
  report:
    pattern: _global
    def_file: reports/core_stats.yaml
    dest_file: %OUT_BASE%_time_series_all.csv
    format: csv
    trigger:
      update-count: top.cpu.core0.rob.stats.total_number_retired %TS_PERIOD%

Using a command line similar to the one above, generate a time series report for every 100K instructions:

./olympia --report reports/core_timeseries.def \
          --report-yaml-replacements OUT_BASE dhry_2M_insts TS_PERIOD 100k \
          --workload traces/dhry_riscv.zstf

This will generate a file called dhry_2M_insts_time_series_all.csv with a little over 20 lines. To view this csv file, either import it into a spreadsheet and plot it or try using the given plot_ts.py script (requires pandas and matplotlib):

python3 reports/plot_ts.py dhry_2M_insts_time_series_all.csv

The script is set up to only show a few stats from the csv file. Add more stats to the python script as needed.

Try this:

% ./olympia --arch small_core \
   --report reports/core_timeseries.def \
   --report-yaml-replacements OUT_BASE dhry_2M_insts_small TS_PERIOD 100k \
   --workload traces/dhry_riscv.zstf 

% ./olympia --arch big_core \
   --report reports/core_timeseries.def \
   --report-yaml-replacements OUT_BASE dhry_2M_insts_big TS_PERIOD 100k \
   --workload traces/dhry_riscv.zstf 

% python3 reports/plot_ts.py \
     dhry_2M_insts_small_time_series_all.csv \
     dhry_2M_insts_big_time_series_all.csv

Pipeout Generation

As documented in the main README.md of the RISC-V Performance Model, olympia can generate data for Sparta's pipeline viewing tools.

% ./olympia -z small_core_pipeout --workload traces/dhry_riscv.zstf -i 100K
% python $MAP_BASE/helios/pipeViewer/pipe_view/argos.py -d small_core_pipeout -l ../layouts/small_core.alf

% ./olympia --arch medium_core -z medium_core_pipeout --workload traces/dhry_riscv.zstf -i 100K
% python $MAP_BASE/helios/pipeViewer/pipe_view/argos.py -d medium_core_pipeout -l ../layouts/medium_core.alf

% ./olympia --arch big_core -z big_core_pipeout --workload traces/dhry_riscv.zstf -i 100K
% python $MAP_BASE/helios/pipeViewer/pipe_view/argos.py -d big_core_pipeout -l ../layouts/big_core.alf

Putting it All Together

At this juncture, familiarity with olympia's parameter manipulation, architecture differences, report generation, and pipeout generation will enable a performance architect to begin performance analysis.

Making Dhrystone Run Faster

Running the Dhrystone trace on Olympia using the small core architecture yields a very low performance score:

% ./olympia --workload traces/dhry_riscv.zstf --report "" reports/dhry_report.yaml 1 text 
...
Report "reports/dhry_report.yaml on _SPARTA_global_node_" [0,2230144]
    IPC = 1.07169
    Cycles Per Iteration = 247.273
    Iterations Per Second = 4044.12
    DMIPS Per MHz = 2.30172

But what is the cause of this low performance? Begin by looking at a core report, starting with dispatch's stall conditions (where out-of-"orderness" begins):

% ./olympia --workload traces/dhry_riscv.zstf --report top.cpu.core0.dispatch reports/core_stats.yaml 1 text
...
Report "reports/core_stats.yaml on top.cpu.core0.dispatch" [0,2230144]
...
    stall_alu_busy = 499991
    stall_fpu_busy = 0
    stall_br_busy = 0
    stall_lsu_busy = 940119
    stall_no_rob_credits = 0
    stall_not_stalled = 790033
...
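The stall counters above are easier to compare as fractions of the total. The sketch below assumes the six counters shown are the only ones of interest; the report excerpt is elided, so other counters may exist.

```python
# Dispatch stall counters copied from the small_core report above.
stalls = {
    "stall_alu_busy": 499991,
    "stall_fpu_busy": 0,
    "stall_br_busy": 0,
    "stall_lsu_busy": 940119,
    "stall_no_rob_credits": 0,
    "stall_not_stalled": 790033,
}

total = sum(stalls.values())
share = {name: count / total for name, count in stalls.items()}

# Print the largest contributors first.
for name, fraction in sorted(share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} {fraction:6.1%}")
```

The sum of these counters is within one of the 2230144 shown in the report header, which is consistent with each counter recording one dispatch state per cycle.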

Notice that dispatch stalled on just two main unit clusters: ALU and LSU. The small core architecture is as follows:

top.cpu.core0.extension.core_extensions:
  execution_topology:
    [["alu", "1"],
     ["fpu", "1"],
     ["br",  "1"]]

It only contains 1 ALU. Perhaps if the number of ALUs were increased, performance would increase? There are two ways to try that experiment:

Supply a new execution topology on the command line
Use the medium_core architecture, but that also increases the width of the machine and the DL1 cache size

First, try increasing the ALU count to 2. This will require the forwarding paths (scoreboards) to include the new ALU:

% cat > alu2.yaml
top.cpu.core0.extension.core_extensions:
  execution_topology:
    [["alu", "2"],
     ["fpu", "1"],
     ["br",  "1"]]

top.cpu.core0.rename.scoreboards:
  # From
  # |
  # V
  integer.params.latency_matrix: |
    [["",     "alu0", "alu1", "fpu0", "fpu1", "br0"],   # <-- TO
     ["alu0",    "1",    "1",    "1",    "1",   "1"],
     ["alu1",    "1",    "1",    "1",    "1",   "1"],
     ["fpu0",    "1",    "1",    "1",    "1",   "1"],
     ["fpu1",    "1",    "1",    "1",    "1",   "1"],
     ["br0",     "1",    "1",    "1",    "1",   "1"]]

  float.params.latency_matrix: |
    [["",     "alu0", "alu1", "fpu0", "fpu1", "br0"],   # <-- TO
     ["alu0",    "1",    "1",    "1",    "1",   "1"],
     ["alu1",    "1",    "1",    "1",    "1",   "1"],
     ["fpu0",    "1",    "1",    "1",    "1",   "1"],
     ["fpu1",    "1",    "1",    "1",    "1",   "1"],
     ["br0",     "1",    "1",    "1",    "1",   "1"]]
<ctrl-D>
% ./olympia -c alu2.yaml --workload traces/dhry_riscv.zstf --report "top.cpu.core0.dispatch" reports/core_stats.yaml 1 text
...
Report "reports/core_stats.yaml on top.cpu.core0.dispatch" [0,2120142]
    stall_alu_busy = 0
    stall_fpu_busy = 0
    stall_br_busy = 0
    stall_lsu_busy = 1170104
    stall_no_rob_credits = 0
    stall_not_stalled = 950037
...

Hooray! The ALU stalls disappeared! Now, compare the old vs new dhrystone DMIPs results:

% ./olympia              --workload traces/dhry_riscv.zstf --report "" reports/dhry_report.yaml dhry_1_alu.text text
% ./olympia -c alu2.yaml --workload traces/dhry_riscv.zstf --report "" reports/dhry_report.yaml dhry_2_alu.text text
% diff -y dhry_1_alu.text dhry_2_alu.text | grep DMIP
    DMIPS Per MHz = 2.30172             |	    DMIPS Per MHz = 2.42114

DMIPs improved. But can it be made better? Dhrystone is a very ALU-intensive benchmark. Adding more ALUs should help... and it does. Running the big core design, which has 6 ALUs, gives Dhrystone a big boost:

% ./olympia --arch big_core --workload traces/dhry_riscv.zstf --report "" reports/dhry_report.yaml dhry_big_core.text text
% diff -y dhry_1_alu.text  dhry_big_core.text | grep DMIP
    DMIPS Per MHz = 2.30172	      |	    DMIPS Per MHz = 2.65948

Conclusion

Olympia is still a very simple simulator with no operand dependency tracking (yet), no branch prediction, no icache, no execution pipelining, and a very simple load/store unit. But this simulation platform opens the door to micro-architectural exploration, analysis, and next generation modeling practices.
