section-modelling.tex

\nopagebreak
\section{Energy Modelling}\label{sec:energy-models}

%An energy model provides information on the energy consumed when running a program on a given hardware platform, based on parameters that characterise the program and its execution. The energy model contains measured energy consumption costs for these parameters and links these costs to program specific values for the parameters to arrive at an overall energy consumption estimation. 

An energy model supporting 
%Models that support 
software energy analysis 
%typically 
associates energy 
consumption costs
with 
basic
program constructs such as source code blocks, basic blocks in the intermediate representation 
%of the program 
used during compilation or machine code instructions.
%, with energy consumption costs. 
In addition, other costs arising from the execution of a program may need to be considered, depending on the micro architectural features of the hardware; examples are costs associated with the memory hierarchy, such as the cost of a cache hit and miss or the cost of accessing on-chip and off-chip memory, and also costs associated with the processor pipeline, such as the cost of pipeline stalls. In addition, the cost of the processor being idle and the cost of processing multiple threads concurrently may also need to be considered. 

An energy model, as understood in this chapter, is program-independent.  It captures the energy costs of basic software constructs in a given language executed on a given hardware platform. The model is used during program analysis (Section \ref{sec:energy-analysis}) to obtain information about the energy consumption of a given program.
%
%In general, to instantiate an energy model, model parameters can be obtained either from analysing execution or simulation traces, e.g.\ counting instructions, cache hits and misses, etc, or through static analysis as will be discussed next, in Section~\ref{sec:energy-analysis}.

The challenge in energy modelling for software energy analysis is in finding a good compromise between the accuracy of the model and the ease with which the information can be mapped onto software constructs. Regarding the former, model accuracy tends to be higher for models at the lower-levels of abstraction, i.e.\ instruction-level energy models are typically more accurate than energy models at the intermediate representation of the compiler, and source code energy models are less accurate in comparison. 
%
However, understanding which source code lines or blocks consume most energy is much more useful to software developers looking to optimise their code for energy efficiency, than knowing the energy consumed by the sequence of machine instructions issued by the compiler. The higher the level of abstraction at which the information is presented to the software developer, the easier it is for them to comprehend the impact of algorithms and coding on the energy consumed during program execution. Yet, taking measurements to characterise energy models is simplest and most accurate when performed at the lower levels of abstraction, where energy costs of low-level software constructs such as machine instructions can be determined directly.


\subsection{Defining and constructing an energy model at ISA level}\label{subsec:isa}

The Instruction Set Architecture (ISA) is the interface between hardware and
software. 
%
It defines the hardware architecture and its behaviour in terms of low-level
programming constructs such as supported data types, machine instructions,
registers, the memory architecture, any interrupt and exception handling as
well as I/O operations.
%
The ISA provides a practical level of abstraction for energy modelling, because
it is possible to directly correlate the energy consumption of the hardware
operations associated with instruction execution to low-level software
constructs.

Energy modelling at ISA level dates back to 1994 when Tiwari et
al.~\cite{Tiwari-embedded-1994} first proposed a
%generic 
method to develop instruction-level power models for arbitrary
processor architectures to estimate the power consumption caused by software.
%
Such models could overcome the limitations of hardware design power analysis
methods, which require access to gate level design information including layout
and tend to be impractically slow at producing results for system-level power
analysis. Instruction-level power models, instead, are orders of magnitude
faster at estimating the power consumption of embedded software and can achieve
accuracy within 10\% of what hardware design power analysis methods deliver.
This is a worthwhile tradeoff because software development involves numerous
iterations during the coding phase, and rapid feedback of resource usage is
critical for software developers to make energy aware decisions.

The model in reference~\cite{Tiwari-embedded-1994} captures the energy consumption
directly associated with processing each instruction, obtained by measuring the
average current drawn while executing a dedicated loop that only contains
independent instances of the respective instruction to be profiled, multiplied
by the supply voltage $V_{CC}$, and further multiplied by the number and
duration of the clock cycles required to execute the instruction.
%
Variations in instruction base costs can be observed during measurements and
are due to different operand values being used during execution. It was observed that
different operand registers lead to negligible variation, while using different
immediate values or different memory addresses leads to observable, yet small
variation of no more than 5\% for the architectures analysed.
%
Because the exact operand values for instructions are only known at runtime,
the energy model associates a single base cost with each instruction,
representing averaged values. This is a very important feature of a single
instruction cost Tiwari-style energy model as it has implications on the safety
of the bounds inferred by worst case static analysis techniques. This will be
discussed further in Section~\ref{subsec:data}.

Instruction base costs intentionally do not include any extra costs arising
from executing an instruction within the context of other instructions, i.e.\
the overheads of executing arbitrary instruction sequences.
%
One such cost is associated with switching the circuit state from executing one
instruction to executing the next, termed the circuit state overhead. It
captures the extra energy consumed due to switching on busses, e.g.\ as a
consequence of changing op-codes and operand values, and using different
functional units within the processor. The circuit state overhead is determined
for all pairs of instructions by measuring loops that contain alternating
sequences of the two instructions per pair.
%
While including circuit state overheads into the energy model improves the
accuracy of the model, the variation observed for individual instruction pairs
was very limited for the architectures considered. It may thus be sufficient to
determine a constant circuit state overhead cost and to use that instead of
profiling all instruction pairs.

The execution of instruction sequences may give rise to other costs beyond the
cost of circuit state switching, depending on the micro architecture of the
processor. Resource contention due to data dependencies between instructions
may cause pipeline stalls. Thus, the cost of pipeline stalls needs to be
determined together with the number of stall cycles.
%
In addition, there may be costs associated with cache misses, which typically
cause execution delays of varying durations, depending on whether the fetch is
from other cache levels of main memory. Their energy consumption also needs to
be accounted for in an energy model, potentially sourcing information from a
cache model that can provide cache miss rates for a given program. 
%
Thus, while instruction-level energy modelling techniques can be very accurate
for simple architectures, the presence of complex architectural and micro
architectural features such as several layers of caches, pipelines, superscalar
processing, speculative execution etc, can make it very difficult to achieve
acceptable levels of accuracy when modelling at the instruction level.

In~\cite{TiwariWolfeInstructionLevelPowerAnalysi:1996} this energy model is
used to derive the energy consumption of a program $Prog$ by 
 by Equation~\ref{e:tiwari}.
%
\begin{equation}\label{e:tiwari}
E_{Prog} = \sum_i (B_i \times N_i) + \sum_{i,j} (O_{i,j} \times N_{i,j}) + \sum_k E_k
\end{equation}

According to Equation~\ref{e:tiwari}, the energy consumption of $Prog$, namely $E_{Prog}$, is
calculated as the sum of three components: the base cost
of instruction
execution, the
circuit state overhead, and other inter instruction effects.
%
The first term in the sum in Equation~\ref{e:tiwari} represents the base cost, where $B_i$ is
the base cost of instruction $i$ multiplied with the number of times $i$ occurs
in program $\mathit{Prog}$, $N_i$.
%
The second term is the circuit state overhead, where $O_{i,j}$ represents the
cost incurred by switching the circuit state of the processor from executing
instruction $i$ to executing instruction $j$. This is multiplied by the number of
times instruction $i$ is followed by instruction $j$ in program $\mathit{Prog}$, $N_{i,j}$. 
%
Finally, the third term in the sum accounts for the cost of $k$ inter-instruction
effects that may impact on software related energy consumption, e.g.\ cache
misses or pipeline stalls that can be characterised using external cache models
or models of the micro architecture of the processor.

Equation~\ref{e:tiwari} shows clearly the relation between the energy model and the
analysis of the program.  The terms $B_i$, $O_{i,j}$ and $E_k$ are obtained from the energy
model, and are program independent. The terms $N_i$ and $N_{i,j}$ are obtained by
analysis of the program, either dynamically, by profiling and counting the number of 
times each instruction or pair of instructions is executed, or statically as will be seen
in Section \ref{sec:energy-analysis}.

\todo[inline]{Maybe include Steinke / Sarta model that have far more detail - neither is useful for static analysis though.}

Recently, instruction-level energy models have been developed for modern
processors such as the Intel Xeon Phi, a many integrated core architecture for
high-performance computing, and the hardware multi-threaded XMOS XCore embedded
microprocessor~\cite{XMOS:Arch}.

The XMOS XCore instruction-level energy
model~\cite{DBLP:journals/tecs/KerrisonE15} is based on the original model by
Tiwari et al. However, it re-defines the notion of base cost to be the power
dissipated while the processor is idle, and uses individual instruction costs,
scaled by the level of concurrency in the processor's pipeline as well as a
constant overhead to account for circuit state switching between instructions.
%
Model characterisation was performed using a measurement setup and instruction
loops similar to those originally proposed in reference~\cite{Tiwari-embedded-1994}. The
individual instruction costs represent averages over measurements obtained from
running loops with instructions using operands that were generated
pseudo-randomly, constraining values to those valid for the respective
instruction.
%
Evaluation of this multi-threaded instruction-level model showed average error
margins of less than 7\%. \todo{should say how this evaluation was done - with ISS, then move to SA}
% when used with instruction stream simulation to
%instantiate the model parameters.
%
However, the model was designed to be used for static energy consumption analysis,
requiring static analysis techniques to determine the number of idle cycles and
the level of concurrent thread activity, in addition to the standard
instruction stream statistics.

In contrast, the Xeon Phi instruction-level energy model~\cite{phimodel} relies
on performance counter statistics that are obtained at runtime, rather than
through static analysis at compile time. This model is designed to be used with
software profiling tools to support energy efficient software development. The
model is built by characterising the Energy per Instruction (EPI) of selected
instruction types using microbenchmarks executed on different processor
configurations in terms of numbers of cores and threads per core. Instructions
are classified in terms of op-code and operand locations, both of which
influence the EPI.
%
The energy consumption of a given workload can then be determined by
multiplying the runtime instruction statistics with the respective EPIs.
%
This model achieves an average error rate of less than 5\%.


Energy modelling at the ISA level gives us the following benefits: energy
costs can be assigned at the instruction level; the same level as is output by
the compiler; there are strong correlations between instruction properties and
energy consumption, for example the number of operands used in the instruction;
and machine instructions can be related back to the original programming
statements written by the software developer, as well as to various
intermediate representations. 

The construction of an energy model at the ISA level also has to address several
challenges.
%
Measurements need to be taken to determine both the base cost for each
instruction and also the circuit state overhead. To achieve this, instructions
are placed into infinite loops, i.e.\ loops of single instructions to obtain
that instruction's base costs or loops of alternating instructions to
characterise pair-wise circuit state overhead. The average current is measured
while the loop is executed. Care needs to be taken to ensure the loop runs for
a sufficiently long time to minimise measurement errors due to loop overheads.
%
However, typically not all instructions can be directly profiled, requiring
indirect or statistical approaches to their characterisation.

In general, for a modern processor with hundreds of instructions the
characterisation of the entire ISA is a significant effort.
%
To reduce the measurement effort, rather than determining a base cost for each
instruction, it may be sufficient to group instructions into classes of similar
energy cost and to determine a single instruction class base cost. Likewise,
instead of measuring circuit state overheads for individual instruction pairs,
a cost that represents switching between instruction classes, or a single
constant circuit state overhead may be sufficient.

In addition, other properties such as the cost of running multiple threads and
the cost of idle periods must be determined for multi-threaded architectures,
and communication costs must be considered for interacting multi-threaded
programs running on multi-core platforms. 
%
%While characterising these costs can be very challenging, it can be even more
%challenging to determine the corresponding model parameters, e.g.\ based on
%execution or simulation statistics of the program or by statically extracting
%them from the program.
 

\subsection{Energy modelling at higher levels of software abstraction}
\label{subsec:mapping}

Instruction-level energy models are 
%practical 
useful 
due to the close, almost direct
link of measurements to programming constructs within the ISA. Energy models at
higher levels of abstraction, however, provide more intuitive feedback to
software and tool developers, 
%all be it 
albeit 
at the cost of accuracy of the
predictions. 

Modelling at the level of the Intermediate Representation (IR) used by
compilers can be a useful compromise between the accuracy of a lower-level
(ISA) model and the high-level source code. Since the compiler is a natural
place for optimisation, modelling and predicting the energy consumption at IR
level could therefore enable energy specific optimisations.

IR-level energy models have been built using two distinct techniques. One is
based on statistical methods, the other on mapping a lower-level model, i.e.\
one at ISA level, up to the IR level at compile time; both have been developed
for LLVM IR~\cite{LattnerLLVM2004} in the context of the LLVM
toolchain~\cite{LLVM}.

In reference ~\cite{Brandolese2011} statistical analysis was employed to build an
energy model for LLVM IR for the purpose of fast and accurate early-stage
prediction of the energy consumption of embedded software at the function level
to enable compile-time energy optimisation.
%
Modelling starts with instrumenting and compiling the source code of a large
set of benchmarks into architecture-independent LLVM IR to extract block-level
statistics capturing the structural features of the source code in terms of
LLVM instructions. 
%
Profiling of the LLVM IR basic blocks is then performed on a host computer to
capture their dynamic behaviour in terms of basic block execution counts. 
%
The final step then factors in the timing and energy consumption of the target
architecture. Using native compilation, back-annotation techniques to associate
LLVM IR instructions with target machine instructions and statistical analysis,
the target machine specific costs are associated with the
architecture-independent LLVM IR. This requires a cost model of the instruction
set for the respective target machine, which is assumed to be provided by the
manufacturer. The resulting model can be used to estimate the energy
consumption of code for the target hardware, based solely on its
target-independent LLVM IR.

A 
%direct 
mapping technique that lifts an energy model at ISA level to LLVM IR
level is described in~\cite{DBLP:journals/corr/GeorgiouKE15}. The energy characteristics of LLVM IR
instructions are determined from the costs of the associated machine
instructions based on a mapping that tracks which LLVM IR instructions created
which machine instructions during the lowering phase of compilation, i.e.\
after optimisation passes. 

%
The approach provides on-the-fly LLVM-IR energy characterisation that takes
into consideration the context of instructions since there is no
program-independent mapping between ISA instructions and LLVM instructions.
Strictly speaking,
the energy model is still at the ISA level; a program-dependent mapping is used to
obtain energy costs of LLVM-IR instructions and blocks at compile time.
%
The technique has been used for static energy consumption analysis at the LLVM
IR level in~\cite{DBLP:journals/corr/GeorgiouKE15},~\cite{isa-vs-llvm-fopara} and also
in~\cite{grech15}. The accuracy of energy consumption estimation at LLVM IR
level is typically within 1\% to 3\% of that achieved using ISA-level models.
This indicates excellent potential for exploiting this energy transparency
during code generation.
%
In principle, the same mapping technique may be used to map the energy consumption
of programs to even higher levels, such as source code blocks or functions. 

An alternative approach to building a source-level energy model, used in \cite{LiGallagherMobiquitous2016}
to obtain a source code energy model for Android code, 
is to identify
basic energy-consuming operations from the source code and correlate them to
energy costs by measuring energy consumption in a large number of benchmark programs
and analysing the results using techniques based on regression analysis. The
resulting energy model of the basic operations implicitly includes the effect
of all the layers of the software stack down to the hardware, including
compiled code, virtual machine and operating system layers. The approach is
inherently approximate; nevertheless such an approach may be the only feasible
one in cases where the software stack has many complex layers, rendering a
mapping-based approach difficult or impossible.

\subsection{The impact of data on the energy model}\label{subsec:data}

The classic instruction level energy model as described in
Section~\ref{subsec:isa} does not capture the impact of data on software energy
consumption. For instance, a single energy cost is assigned to each instruction
or instruction type.
%
In practice, however, energy consumption is dependent on the data being
processed and, for simple processors, data can make a significant contribution
to energy consumption.
%
\begin{figure}[h]
    \centering
    \includegraphics[width = 0.95\linewidth]{Figs/and_map}
    \caption{Dynamic power in mW for the XMOS XCore {\tt and} instruction.}
    \label{f:and-heatmap}
\end{figure}
%
This is illustrated in Figure~\ref{f:and-heatmap}, which shows the dynamic
power, in mW, for the single cycle XMOS XCore bitwise {\tt and} instruction for
all 65536 combinations of 8-bit operands from 0 to 255. The colours in the
``heatmap'' range from dark blue, indicating low power, to dark red, indicating
high power. 
%
In~\cite{DBLP:journals/corr/PallisterKME15} 15\% data induced variation has been reported for
the 8-bit AVR processor, while up to 1.7x data-dependent variation was observed
for the 32-bit XMOS XCore in~\cite{DBLP:journals/tecs/KerrisonE15}. Variation
of as much as 50\% is reported in~\cite{highdatadep} for an ST20 32-bit
microprocessor.

These observations naturally lead to the question of which energy cost should
be associated with an instruction in a single cost instruction-level energy
model. Assigning the averages measured using randomly generated, valid data for
the given instructions is a popular choice. 
%
Estimations based on such models may over- or under-predict the energy
consumption of a program when compared to measurements. Error margins reported
in the literature are typically below 10\%, and over- or under-prediction are
both acceptable when the model is used to obtain estimations of energy
consumption and no guarantees are required.

However, static resource consumption bound analyses must provide bounds on
resource consumption that are both safe and tight, i.e.\ sufficiently close to
the actual values. A good example is Worst Case Execution Time (WCET)
analysis~\cite{DBLP:journals/tecs/WilhelmEEHTWBFHMMPPSS08}, where
under-approximation is not acceptable, i.e.\ unsafe, and significant
over-approximation is considered not useful.
%
Thus, for bound analysis models must support the derivation of safe and tight
bounds. 
%
A key prerequisite to achieve this for WCET analysis is timing predictability
within a system, which enables precise bounds to be established with acceptable
effort, without sacrificing performance of the computation in the general case.
In fact, the architecture of processors can significantly impact on the design
of analysis tools and the properties of the analysis results these can
deliver~\cite{influence-wilhelm-2003} in terms of safety and precision. This is
equally important for static energy consumption analysis, i.e.\ predictable
architectures enable precise models to be developed.

It may be tempting to assign to an instruction the lowest or highest values
observed during measurements to support best and worst case analyses,
respectively. However, this approach has been shown to lead to high
over-approximation~\cite{Wagemann-2015-WCEC} in worst case energy consumption
static analysis, so this may not be a suitable option.
%
In fact, the estimations based on such models may never be reachable in
practice. Intuitively, this is because the data that causes the highest energy
consumption for one instruction is very unlikely to produce output that will
trigger the highest energy consumption also in subsequent instructions. 
%

This leads us to the question of which input data causes the worst energy
consumption for a given program. This question is investigated
in~\cite{DBLP:journals/corr/MorseKE16}, where the problem of data-dependent energy
consumption during program execution is formalised in terms of circuit
switching and a formal proof is presented demonstrating that in general
analysing switching in processor datapaths is NP-hard. Thus, 
%accurate
optimal
data-sensitive worst case energy consumption analysis of programs is, in
general, not achievable efficiently and alternative approaches 
giving good approximations
must be
developed. This is an area of ongoing research.

In~\cite{DBLP:journals/corr/PallisterKME15} energy modelling for worst case energy consumption
analysis has been explored. The most promising approach uses probabilistic
energy distributions to characterise individual instruction pairs and proposes
techniques to compose these to block-level instruction sequences.
%
In~\cite{highdatadep} activity indices were introduced into a single cost
instruction-level model to achieve higher precision of energy consumption
predictions and to enable bound analysis for architectures where data
significantly impacts on energy consumption.

\todo[inline]{Should there be a ``summary'' at the end?}