Tutorial
This wiki gives a general overview of the RISC-V Performance Model (olympia) and covers trace generation, report generation, and pipeline viewing.
As described in the README.md of the main Olympia page, Olympia is a Performance Model written in C++ for the RISC-V community as an example of an Out-of-Order RISC-V CPU Performance Model based on the Sparta Modeling Framework.
The pipeline design is very rudimentary, with simple Fetch, Decode, Rename, Dispatch, and Execution blocks in the main pipeline. The memory system consists of a simple in-order load/store pipeline coupled to a simple bus interface unit communicating with a very simple memory subsystem. The design layout is very similar to the Sparta Core Example.
This tutorial starts with trace generation using Dromajo, specifically Dhrystone (included in the repository). After the trace is generated the tutorial will focus on running that trace and looking for performance bottlenecks using tools like reporting, Argos pipeline viewing, and time series analysis.
Assumptions made:
- The reader of this tutorial has successfully built the Olympia model using the directions found in the main README.md.
- The traces, report definition files, etc., have not changed since this tutorial was composed.
Generating a trace for Olympia involves instrumenting a functional model like Spike or Dromajo (or any functional simulator that can run RISC-V software) with the STF library's writer API. Included in Olympia is a patch for Dromajo as well as documentation to build, run, and trace Dhrystone on Dromajo.
Traces are instruction streams -- the path an application took running on a RISC-V core. STF traces are binary files and can only be viewed using the STF library's reader API or with the STF tools (like stf_dump and stf_imem) found in the STF tracing tools repository.
For example, this is the command to view the instruction stream of the provided Dhrystone trace:
% stf_dump traces/dhry_riscv.zstf | less
VERSION 1.5
GENERATOR Dromajo
GEN_VERSION 1.1.0
GEN_COMMENT Trace from Dromajo
INST_IEM RISCV
PID 00000000:00000000:00000000
INST16 1 00000000000101ba 00006722 c.ldsp x14,8(x2)
MEM READ 0000 0000003fffa90cb8 0000000000000000
INST32 2 00000000000101bc 4f805d63 bge x0,x24,0x00000000000106b6
INST32 3 00000000000101c0 000247b7 lui x15,0x24
INST32 4 00000000000101c4 b4078793 addi x15,x15,-1216 # 0x0000000000023b40
INST16 5 00000000000101c8 00006398 c.ld x14,0(x15)
MEM READ 0000 0000000000023b40 0000000000000000
INST32 6 00000000000101ca 000244b7 lui x9,0x24
INST16 7 00000000000101ce 00004905 c.li x18,1
...
For example, this is the command to view a sorted instruction memory dump of the provided Dhrystone trace:
% stf_imem -S traces/dhry_riscv.zstf | less
Traces can be extended to include registers and their values per instruction, PTE entries, escape records with speculative paths, exception information, etc. This is beyond the scope of this tutorial, however.
Running a trace (expected extension zstf or stf) on Olympia is as simple as providing the trace to the simulator:
% ./olympia <trace_file>.zstf
Olympia does, however, support simple JSON input files as well. This is handy if a performance architect is interested in a simple what-if analysis, like load-to-use latency:
[
{
"mnemonic": "lw",
"rs1": 4,
"rd": 5,
"imm": 100,
"vaddr" : "0xdeadbeef"
},
{
"mnemonic": "add",
"rs1": 5,
"rs2": 2,
"rd": 1
}
]
Running this JSON file on olympia with "infinite caches" gives a general idea of the latency from load issue time to the add execution time.
./olympia -p top.cpu.core0.lsu.params.dl1_always_hit true load_add_dependency.json
More on analyzing such an example later in the tutorial.
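When a what-if experiment needs several such instruction sequences, the JSON input can be generated rather than written by hand. The following is a minimal Python sketch, assuming only the fields shown in the example above (mnemonic, rs1, rs2, rd, imm, vaddr). The helper names make_load_use_pair and to_json_text are hypothetical and not part of Olympia.

```python
import json


def make_load_use_pair(load_rd=5, vaddr="0xdeadbeef"):
    """Build a load followed by a dependent add, mirroring the example above."""
    load = {"mnemonic": "lw", "rs1": 4, "rd": load_rd, "imm": 100, "vaddr": vaddr}
    # The add reads the register written by the load, creating the dependency.
    add = {"mnemonic": "add", "rs1": load_rd, "rs2": 2, "rd": 1}
    return [load, add]


def to_json_text(instructions):
    """Serialize the instruction list as a JSON array."""
    return json.dumps(instructions, indent=4)


if __name__ == "__main__":
    # Print to stdout; redirect into a .json file to hand it to olympia.
    print(to_json_text(make_load_use_pair()))
```

Using json.dumps guarantees that the separators between the objects are emitted correctly, which is easy to get wrong when editing the file by hand.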
Run the provided trace of Dhrystone on the simulator, specifically from the build directory where the olympia binary resides:
./olympia traces/dhry_riscv.zstf --auto-summary on
This will run the default configuration of olympia on 2.3 million instructions of the Dhrystone trace in roughly 6 seconds.
Each unit in Olympia has parameters that it uses at startup/runtime to change/manipulate behavior. A comprehensive list of parameters can be viewed using the following command line options:
./olympia --no-run <parameter option>
# --show-parameters # Dump to the console the parameters found in the tree
# --write-final-config <config name>.yaml # Dump the final parameters to a YAML file
# --write-final-config-verbose <config name>.yaml # Dump the final parameters to a YAML file with descriptions
--no-run is handy to prevent the simulator from complaining that no workload was provided.
Parameters can be changed on the command line using the -p option or via a configuration YAML file allowing for a list of parameters:
# Set the Dispatch Queue Depth
./olympia -p top.cpu.core0.dispatch.params.dispatch_queue_depth 12 traces/dhry_riscv.zstf
Create a configuration file and run it.
cat > dispatch_params.yaml
top:
  cpu.core0.dispatch.params.dispatch_queue_depth: 12
  cpu.core0.dispatch.params.num_to_dispatch: 3
# <ctrl-D>
./olympia -c dispatch_params.yaml traces/dhry_riscv.zstf
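For experiments that vary one parameter across several values, the command lines can be produced by a small script. The sketch below is an illustration only: it builds the same -p invocation shown above for a list of values and prints the commands instead of running them. The sweep values 8 and 16 and the function name sweep_commands are hypothetical.

```python
import shlex

PARAM = "top.cpu.core0.dispatch.params.dispatch_queue_depth"
TRACE = "traces/dhry_riscv.zstf"


def sweep_commands(values, olympia="./olympia"):
    """Return one olympia argument list per parameter value."""
    return [[olympia, "-p", PARAM, str(value), TRACE] for value in values]


if __name__ == "__main__":
    for argv in sweep_commands([8, 12, 16]):
        # Print rather than execute; each argument list could also be
        # passed to subprocess.run from the build directory.
        print(shlex.join(argv))
```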
Architectures are another methodology to group parameters together that represent an architecture configuration. In Olympia, three made-up architectures are provided:
ls arches/*.yaml
# big_core.yaml medium_core.yaml small_core.yaml
Each architecture builds on top of the previous one:
head -8 arches/big_core.yaml
#
# Set up the pipeline for a 8-wide machine
#
# Build on top of a medium core
include: medium_core.yaml
The include statement allows big_core to build on top of medium_core, etc. This allows changes, for example, in medium_core to be automatically included in big_core.
To run an architecture, supply the name of the architecture to the --arch <arch_name> command line option:
./olympia --arch medium_core traces/dhry_riscv.zstf
olympia automatically looks in the arches directory (defined here) for named architectures.
By default, olympia runs the small_core architecture.
The main difference between each of the cores is the width of the machine and the number of ALU/FPU/BR units.
One of the most powerful features of the Sparta Modeling Framework is the ability to generate precise reports in a multitude of formats. Reports are the first insight to how an application (trace) is performing on a given modeled architecture.
Reports in Sparta are provided in two forms: definitions (.def) and contents (.yaml). Depending on how the modeler wants to collect a report, one of the two formats will be provided.
Report definitions (.def) allow the modeler to control how and when a report is collected. For example, collecting a report starting from the very first instruction of a trace will include statistic/counter values that reflect cold cache/branch prediction effects. Starting a report at a distance into the trace will avoid those cold cache effects and provide a more "steady state" report.
Report content files (.yaml) indicate to the reporting mechanism what to collect from the simulator. A content file can include all of the statistics/counters or a small subset.
Report definition files include report content files. But either file can be provided to the --report command line option.
Start with the simplest report, the auto-summary report. This is a report of every stat regardless of visibility.
# Run 1 million instructions for brevity
% ./olympia --auto-summary on --workload traces/dhry_riscv.zstf -i 1M
Next, generate a report that constrains the statistics/counters to only those that are not hidden using a content report file (.yaml):
% cat reports/core_stats.yaml
#
# Auto populate the report with stats that are not marked hidden
# https://sparcians.github.io/map/classsparta_1_1InstrumentationNode.html#a855b6ecdd93e412052ae264032002ce1
#
content:
  autopopulate:
    attributes: "!=vis:hidden"
% ./olympia --report "top" reports/core_stats.yaml 1 text --workload traces/dhry_riscv.zstf -i1M
The command line option --report broken down:
- top means start collection at the given node: top. Try this: top.cpu.core0.dispatch
- reports/core_stats.yaml is the content file. Try this: reports/dhry_report.yaml
- 1 means send the report to stdout. Try this: 2 for stderr or my_report.out to save it in a file
- text means save the report in text format. Try this: html for html output, but save it to my_report.html.
Another example to try:
# Supply the --report option as many times as needed
./olympia --report "top.cpu.core0.dispatch" reports/core_stats.yaml dispatch_stats.text text \
--report "top.cpu.core0.rob" reports/core_stats.yaml rob_stats.text text \
--report "top.cpu.core0" reports/core_stats.yaml core0_stats.text text \
--workload traces/dhry_riscv.zstf -i1M
The examples above generate reports using only a content file. Using a definition file, more report control can be added.
Included in Olympia is a core report definition file core_report.def in the reports directory. In this definition file, there are two reports expected to be generated:
- A report for the entire workload
- A report started after a certain number of instructions have elapsed
The definition file:
% cat reports/core_report.def
content:
  # Report 1: Start from time/inst == 0 and collect everything
  report:
    pattern: top
    def_file: reports/core_stats.yaml
    dest_file: %OUT_BASE%.%OUT_FORMAT%
    format: %OUT_FORMAT%
  # Report 2: Start from inst == INST_START
  report:
    pattern: top
    def_file: reports/core_stats.yaml
    dest_file: %OUT_BASE%_delayed.%OUT_FORMAT%
    format: %OUT_FORMAT%
    trigger:
      start: cpu.core0.rob.stats.total_number_retired >= %INST_START%
The report keyword starts a new report at the given pattern using the content file core_stats.yaml.
Each report has a given destination file (dest_file) and format (format), but the values are replacement strings. Anything enclosed in % characters is a keyword expected to be replaced at simulation runtime using the command line option --report-yaml-replacements <placeholder_name> <value> [<placeholder_name> <value>]. This is handy when a performance architect is running an experiment on many workloads.
Finally, the second report generated is different from the first in that it is triggered to start statistics/counter collection at the given start. In this case, the report will begin when the total number of retired instructions is equal to or exceeds the replaced INST_START. For more information on triggers, see this README in the Sparta Modeling Framework repository.
Here's an example of how to use a report definition file:
./olympia --report reports/core_report.def \
--report-yaml-replacements OUT_BASE dhry_1M_insts OUT_FORMAT text INST_START 100k \
--workload traces/dhry_riscv.zstf -i 1M
This command will generate reports dhry_1M_insts.text and dhry_1M_insts_delayed.text. Diff these files to notice the difference in stats, particularly in instruction count (it will be 100K fewer):
% diff -y dhry_1M_insts.text dhry_1M_insts_delayed.text | grep total_number_retired
total_number_retired = 1000000 | total_number_retired = 900000
Command line option breakdown:
- Option --report reports/core_report.def: Use the definition file for report generation. Note the lack of other options as compared to the previous example.
- Option --report-yaml-replacements OUT_BASE dhry_1M_insts OUT_FORMAT text INST_START 100k:
  - Replace OUT_BASE in the definition file with dhry_1M_insts
  - Replace OUT_FORMAT in the definition file with text
  - Replace INST_START in the definition file with 100k
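To make the placeholder mechanism concrete, the following Python sketch mimics the textual substitution on one line of the definition file. It is an illustration of the idea only and is not Sparta's implementation; the function name apply_replacements is hypothetical.

```python
import re


def apply_replacements(text, replacements):
    """Replace each %NAME% placeholder with its value; leave unknown names alone."""
    def substitute(match):
        name = match.group(1)
        return replacements.get(name, match.group(0))

    return re.sub(r"%([A-Z_]+)%", substitute, text)


line = "dest_file: %OUT_BASE%_delayed.%OUT_FORMAT%"
values = {"OUT_BASE": "dhry_1M_insts", "OUT_FORMAT": "text", "INST_START": "100k"}
print(apply_replacements(line, values))
```

With the values from the command above, the destination line resolves to dhry_1M_insts_delayed.text, which is the name of the second generated report.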
Report content files are very flexible in the types of expressions that can be written. For example, since the trace is Dhrystone, a report that provides a DMIPs calculation would be very helpful.
Included in Olympia is a report content file called reports/dhry_report.yaml that calculates DMIPs for this example model:
content:
  top:
    "cpu.core0.rob.stats.ipc" : "IPC"
    # Assuming 1 iteration is about 265 instructions, calculate the
    # cycles per iteration
    "265 / cpu.core0.rob.stats.ipc" : "Cycles Per Iteration"
    # Assume a 1MHz part, determine the iterations per second
    "1e6 / (265 / cpu.core0.rob.stats.ipc)" : "Iterations Per Second"
    # Assume a 1MHz part, determine the DMIPs based on the VAX 1780
    # achieving a 1757 Dhrystones per second score
    "(1e6 / (265 / cpu.core0.rob.stats.ipc)) / 1757" : "DMIPS Per MHz"
In the report content YAML file are name/value pairs that are of the form <expression> : <final stat name>. Sparta will parse the expression and determine if it is something to be evaluated or resolves to an already existing statistic in the model (read about Sparta Statistical Expressions).
Running this report is the same as the previous example:
% ./olympia --report "" reports/dhry_report.yaml 1 text --workload traces/dhry_riscv.zstf
... output
Report "reports/dhry_report.yaml on _SPARTA_global_node_" [0,2230144]
IPC = 1.07169
Cycles Per Iteration = 247.273
Iterations Per Second = 4044.12
DMIPS Per MHz = 2.30172
Here the collection node is blank ("") since the report itself specifies top.
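The arithmetic behind these numbers can be checked by hand. The Python sketch below repeats the three expressions from the report content file, using the same constants the file assumes (265 instructions per iteration, a 1 MHz clock, and a reference score of 1757 Dhrystones per second). The function name dmips_per_mhz is hypothetical.

```python
INSTS_PER_ITERATION = 265      # assumption taken from the report content file
CLOCK_HZ = 1e6                 # 1 MHz part, so the result is already per MHz
VAX_DHRYSTONES_PER_SEC = 1757  # reference score used by the report content file


def dmips_per_mhz(ipc):
    """Convert an IPC value into DMIPS per MHz using the report's expressions."""
    cycles_per_iteration = INSTS_PER_ITERATION / ipc
    iterations_per_second = CLOCK_HZ / cycles_per_iteration
    return iterations_per_second / VAX_DHRYSTONES_PER_SEC


# IPC value copied from the report output above
print(dmips_per_mhz(1.07169))
```

Because the IPC printed in the report is rounded, the result agrees with the reported DMIPS Per MHz value to about five significant digits rather than exactly.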
To see the impact of "larger" cores on this trace, try the following commands and notice the improvement in performance. Remember that olympia defaults to the small_core "architecture."
./olympia --report "" reports/dhry_report.yaml small_core_dhry.text text --workload traces/dhry_riscv.zstf &
./olympia --arch medium_core --report "" reports/dhry_report.yaml medium_core_dhry.text text --workload traces/dhry_riscv.zstf &
./olympia --arch big_core --report "" reports/dhry_report.yaml big_core_dhry.text text --workload traces/dhry_riscv.zstf &
Look at the DMIPs
% grep DMIPS *.text | column -t | sort -r
small_core_dhry.text: DMIPS Per MHz = 2.30172
medium_core_dhry.text: DMIPS Per MHz = 2.59233
big_core_dhry.text: DMIPS Per MHz = 2.65948
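To express these scores as relative improvements over the small core, the grep output can be post-processed. This is a minimal Python sketch that parses lines in the format shown above; the function name parse_dmips is hypothetical, and the input is the output copied from this tutorial.

```python
def parse_dmips(grep_output):
    """Map file name to DMIPS value from `grep DMIPS *.text` style lines."""
    scores = {}
    for line in grep_output.strip().splitlines():
        name, rest = line.split(":", 1)
        scores[name.strip()] = float(rest.split("=")[1])
    return scores


output = """
small_core_dhry.text: DMIPS Per MHz = 2.30172
medium_core_dhry.text: DMIPS Per MHz = 2.59233
big_core_dhry.text: DMIPS Per MHz = 2.65948
"""
scores = parse_dmips(output)
base = scores["small_core_dhry.text"]
for name, score in scores.items():
    # Ratio relative to the default small_core run
    print(f"{name}: {score / base:.3f}x")
```

The medium core comes out roughly 13% faster than the small core and the big core roughly 16% faster, so most of the gain on this trace arrives with the first step up in core size.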
Running workloads through a performance simulator can show varying effects on IPC. For example, if a workload is 10M instructions, there might be "pockets" of high IPC performance followed by low performance. Overall, however, the IPC will average out to something in between. This can be misleading and "hide" potential performance bottlenecks.
Generating a time series report can help find where those "ups" and "downs" are in time. Time series reports should always be generated using a constant such as instruction count since from one run to another this will remain the same -- the simulator will always get through 10M instructions for example.
Time series reports are generated only in csv format and can be generated only by using a report definition file (see examples above for information regarding report definition files).
Included in olympia is the report definition file reports/core_timeseries.def to make a time series report:
% cat reports/core_timeseries.def
content:
  report:
    pattern: _global
    def_file: reports/core_stats.yaml
    dest_file: %OUT_BASE%_time_series_all.csv
    format: csv
    trigger:
      update-count: top.cpu.core0.rob.stats.total_number_retired %TS_PERIOD%
Using a command line similar to the one above, generate a time series report for every 100K instructions:
./olympia --report reports/core_timeseries.def \
--report-yaml-replacements OUT_BASE dhry_2M_insts TS_PERIOD 100k \
--workload traces/dhry_riscv.zstf
This will generate a file called dhry_2M_insts_time_series_all.csv with a little over 20 lines. To view this csv file, either import it into a spreadsheet and plot it or try using the given plot_ts.py script (requires pandas and matplotlib):
python3 reports/plot_ts.py dhry_2M_insts_time_series_all.csv
The script is set up to only show a few stats from the csv file. Add more stats to the python script as needed.
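Besides plotting, a time series file can be scanned for its worst interval. The sketch below uses only the Python standard library and is written under stated assumptions: the column name is passed in by the caller because the exact headers depend on the content file, lines starting with '#' are treated as metadata and skipped, and the sample data is synthetic rather than real olympia output. The function name lowest_interval is hypothetical.

```python
import csv
import io


def lowest_interval(csv_text, column):
    """Return (row_index, value) of the smallest value in the given column."""
    # Assumption: metadata lines, if present, start with '#'; drop them first.
    data_lines = [line for line in csv_text.splitlines()
                  if line and not line.startswith("#")]
    rows = list(csv.DictReader(io.StringIO("\n".join(data_lines))))
    values = [float(row[column]) for row in rows]
    index = min(range(len(values)), key=values.__getitem__)
    return index, values[index]


# Synthetic stand-in data, not real olympia output
sample = (
    "# made-up metadata line\n"
    "ipc,total_number_retired\n"
    "1.2,100000\n"
    "0.7,200000\n"
    "1.1,300000\n"
)
print(lowest_interval(sample, "ipc"))
```

The returned row index identifies the update period with the lowest value, which is a reasonable place to start looking with the pipeline viewer.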
Try this:
% ./olympia --arch small_core \
--report reports/core_timeseries.def \
--report-yaml-replacements OUT_BASE dhry_2M_insts_small TS_PERIOD 100k \
--workload traces/dhry_riscv.zstf
% ./olympia --arch big_core \
--report reports/core_timeseries.def \
--report-yaml-replacements OUT_BASE dhry_2M_insts_big TS_PERIOD 100k \
--workload traces/dhry_riscv.zstf
% python3 reports/plot_ts.py \
dhry_2M_insts_small_time_series_all.csv \
dhry_2M_insts_big_time_series_all.csv
As documented in the main README.md of the RISC-V Performance Model, olympia can generate data for Sparta's pipeline viewing tools.
% ./olympia -z small_core_pipeout --workload traces/dhry_riscv.zstf -i 100K
% python $MAP_BASE/helios/pipeViewer/pipe_view/argos.py -d small_core_pipeout -l ../layouts/small_core.alf
% ./olympia --arch medium_core -z medium_core_pipeout --workload traces/dhry_riscv.zstf -i 100K
% python $MAP_BASE/helios/pipeViewer/pipe_view/argos.py -d medium_core_pipeout -l ../layouts/medium_core.alf
% ./olympia --arch big_core -z big_core_pipeout --workload traces/dhry_riscv.zstf -i 100K
% python $MAP_BASE/helios/pipeViewer/pipe_view/argos.py -d big_core_pipeout -l ../layouts/big_core.alf
At this juncture, familiarity with olympia's parameter manipulation, architecture differences, report generation, and pipeout generation will enable a performance architect to begin performance analysis.
Running the dhrystone trace on Olympia using the small core architecture has a very low performance score:
% ./olympia --workload traces/dhry_riscv.zstf --report "" reports/dhry_report.yaml 1 text
...
Report "reports/dhry_report.yaml on _SPARTA_global_node_" [0,2230144]
IPC = 1.07169
Cycles Per Iteration = 247.273
Iterations Per Second = 4044.12
DMIPS Per MHz = 2.30172
But what is the cause of this low performance? Begin by looking at a core report, specifically, starting with dispatch's stall conditions (where out-of-"orderness" begins):
% ./olympia --workload traces/dhry_riscv.zstf --report top.cpu.core0.dispatch reports/core_stats.yaml 1 text
...
Report "reports/core_stats.yaml on top.cpu.core0.dispatch" [0,2230144]
...
stall_alu_busy = 499991
stall_fpu_busy = 0
stall_br_busy = 0
stall_lsu_busy = 940119
stall_no_rob_credits = 0
stall_not_stalled = 790033
...
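Raw stall counts are easier to compare as fractions of the total. The Python sketch below uses the counter values from the report above. It assumes the counters are mutually exclusive per-cycle reasons; their sum, 2230143, is within one of the upper bound shown in the report header, which is consistent with that assumption but does not prove it. The function name stall_fractions is hypothetical.

```python
# Counter values copied from the dispatch report above
stalls = {
    "stall_alu_busy": 499991,
    "stall_fpu_busy": 0,
    "stall_br_busy": 0,
    "stall_lsu_busy": 940119,
    "stall_no_rob_credits": 0,
    "stall_not_stalled": 790033,
}


def stall_fractions(counters):
    """Return each counter as a fraction of the sum of all counters."""
    total = sum(counters.values())
    return {name: count / total for name, count in counters.items()}


fractions = stall_fractions(stalls)
for name, fraction in sorted(fractions.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:22s} {fraction:6.1%}")
```

Under that assumption, dispatch is blocked on the LSU about 42% of the time and on the ALU about 22% of the time, and is not stalled about 35% of the time.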
Notice that dispatch stalled on just two main unit clusters: ALU and LSU. The small core architecture is as follows:
top.cpu.core0.extension.core_extensions:
  execution_topology:
    [["alu", "1"],
     ["fpu", "1"],
     ["br",  "1"]]
It only contains 1 ALU. Perhaps if the number of ALUs were increased, performance would increase? There are two ways to try that experiment:
- Supply a new execution topology on the command line
- Use the medium_core architecture, but that also increases the width of the machine and the DL1 cache size
First, try increasing the ALU count to 2. This will require the forwarding paths (scoreboards) to include the new ALU:
% cat > alu2.yaml
top.cpu.core0.extension.core_extensions:
  execution_topology:
    [["alu", "2"],
     ["fpu", "1"],
     ["br",  "1"]]
top.cpu.core0.rename.scoreboards:
  # From
  # |
  # V
  integer.params.latency_matrix: |
      [["",     "alu0", "alu1", "fpu0", "fpu1", "br0"], # <-- TO
       ["alu0", "1",    "1",    "1",    "1",    "1"],
       ["alu1", "1",    "1",    "1",    "1",    "1"],
       ["fpu0", "1",    "1",    "1",    "1",    "1"],
       ["fpu1", "1",    "1",    "1",    "1",    "1"],
       ["br0",  "1",    "1",    "1",    "1",    "1"]]
  float.params.latency_matrix: |
      [["",     "alu0", "alu1", "fpu0", "fpu1", "br0"], # <-- TO
       ["alu0", "1",    "1",    "1",    "1",    "1"],
       ["alu1", "1",    "1",    "1",    "1",    "1"],
       ["fpu0", "1",    "1",    "1",    "1",    "1"],
       ["fpu1", "1",    "1",    "1",    "1",    "1"],
       ["br0",  "1",    "1",    "1",    "1",    "1"]]
<ctrl-D>
% ./olympia -c alu2.yaml --workload traces/dhry_riscv.zstf --report "top.cpu.core0.dispatch" reports/core_stats.yaml 1 text
...
Report "reports/core_stats.yaml on top.cpu.core0.dispatch" [0,2120142]
stall_alu_busy = 0
stall_fpu_busy = 0
stall_br_busy = 0
stall_lsu_busy = 1170104
stall_no_rob_credits = 0
stall_not_stalled = 950037
...
Hooray! The ALU stalls disappeared! Now, compare the old vs new dhrystone DMIPs results:
% ./olympia --workload traces/dhry_riscv.zstf --report "" reports/dhry_report.yaml dhry_1_alu.text text
% ./olympia -c alu2.yaml --workload traces/dhry_riscv.zstf --report "" reports/dhry_report.yaml dhry_2_alu.text text
% diff -y dhry_1_alu.text dhry_2_alu.text | grep DMIP
DMIPS Per MHz = 2.30172 | DMIPS Per MHz = 2.42114
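As a quick sanity check, the DMIPS gain can be compared against the run lengths printed in the two dispatch report headers ([0,2230144] and [0,2120142]). The Python sketch below assumes the second number in each header is the run length in simulator ticks and that ticks map one-to-one to cycles; both are assumptions, not something this tutorial states.

```python
def ratio(new, old):
    """Return new divided by old as a speedup factor."""
    return new / old


# Upper bounds from the report headers (assumed to be run lengths in ticks)
tick_speedup = ratio(2230144, 2120142)   # 1 ALU run length / 2 ALU run length
# DMIPS Per MHz values from the diff output above
dmips_speedup = ratio(2.42114, 2.30172)  # 2 ALU score / 1 ALU score

print(f"speedup from run length: {tick_speedup:.4f}")
print(f"speedup from DMIPS:      {dmips_speedup:.4f}")
```

Both ratios come out at about 1.052, so the roughly 5% DMIPS improvement matches the reduction in run length for the same trace.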
DMIPs improved. But can it be made better? Dhrystone is a very ALU-intensive benchmark. Adding more ALUs should help... and it does. Running the big core design, which has 6 ALUs, gives Dhrystone a big boost:
% ./olympia --arch big_core --workload traces/dhry_riscv.zstf --report "" reports/dhry_report.yaml dhry_big_core.text text
% diff -y dhry_1_alu.text dhry_big_core.text | grep DMIP
DMIPS Per MHz = 2.30172 | DMIPS Per MHz = 2.65948
Olympia is still a very simple simulator with no operand dependency tracking (yet), no branch prediction, no icache, no execution pipelining, and a very simple load/store unit. But this simulation platform opens the door to micro-architectural exploration, analysis, and next generation modeling practices.