bench: cleanup redundant code and add Operator and XDSLOperator at same file #21

Merged: 61 commits, Oct 3, 2023

Commits (61)
687e857
mpi: Add custom topology from devito codebase
georgebisbas Jul 14, 2023
9af33a5
mpi: Add tests for Custom topology
georgebisbas Jul 14, 2023
2d33ad2
devito/mpi/distributed.py
georgebisbas Jul 14, 2023
aa7e25a
Clean benchmark
georgebisbas Jul 14, 2023
d4d0ed3
Cleanup benchmark
georgebisbas Jul 14, 2023
416f1ec
bench: cleanup
georgebisbas Jul 19, 2023
51dbc2d
bench: cleanup 3d
georgebisbas Jul 19, 2023
fe1650c
bench: more cleanup, drop redundant
georgebisbas Jul 19, 2023
479d478
bench: more cleanup
georgebisbas Jul 19, 2023
1142684
add 'set -eo pipefail' to compiler pipeline to catch errors early
AntonLydike Jul 20, 2023
8e5e7ae
Use /bin/bash for set -eo pipefail.
PapyChacal Jul 20, 2023
335b5f5
add todo.
PapyChacal Jul 20, 2023
e9c76a7
Make xDSL flow use a temp .mlir file just like the usual temp .c file.
PapyChacal Jul 20, 2023
5df0f76
operator: Add fixed for xdsloperator compilation - apply_kernel
georgebisbas Jul 21, 2023
6fc0ff7
Lower subviews.
PapyChacal Jul 21, 2023
b346498
Link to MLIR runner utils.
PapyChacal Jul 21, 2023
bd48ffa
c_runner_utils rather.
PapyChacal Jul 21, 2023
d9c4239
Merge branch 'add_custom_topology' into bench_edits
PapyChacal Jul 21, 2023
c389f96
Reverse stencil.apply inputs and try to name accordingly.
PapyChacal Jul 22, 2023
c5f9552
fix data copy, buffer play
georgebisbas Jul 26, 2023
d8b5dba
wave2d.py
georgebisbas Jul 27, 2023
a67394e
Comment out pdb.
PapyChacal Jul 27, 2023
6de1971
Fix initial buffer order.
PapyChacal Jul 27, 2023
afca66d
add canonicalize-dmp pass to dmp pipeline
AntonLydike Jul 31, 2023
f2604c6
Merge pull request #22 from xdslproject/emilien/try-fix-wave
georgebisbas Aug 2, 2023
8087f95
Add tiling.
PapyChacal Aug 1, 2023
63ab369
Add proper quoting.
PapyChacal Aug 1, 2023
e15ce97
Add dimensionality-1 tiling dimensions logic.
PapyChacal Aug 2, 2023
bacf1af
mpi: Init effort for serial modelling on wave operator
georgebisbas Aug 2, 2023
7f3b37e
mpi: wip
georgebisbas Aug 2, 2023
e54cd7b
mpi: wip
georgebisbas Aug 2, 2023
625e976
wave3d: cleanup
georgebisbas Aug 3, 2023
b73489a
wave2d: cleanup
georgebisbas Aug 3, 2023
e4db7f9
mpi-mfe: Add
georgebisbas Aug 3, 2023
ea7fe19
hacky fix for row major dmp.grid
AntonLydike Aug 3, 2023
d45c113
bench: Conditional execution heat2d
georgebisbas Aug 4, 2023
ae3d586
bench: Conditional execution heat3d
georgebisbas Aug 4, 2023
9a973cc
bench: Generalize benchmarking scripts
georgebisbas Aug 4, 2023
4b34fc5
bench: Generalize wave3d
georgebisbas Aug 4, 2023
7d7e639
wave: TryAdd example with no Operator
georgebisbas Aug 4, 2023
26abd2e
add datatest
georgebisbas Aug 4, 2023
fd312d5
setup: Add necessary data
georgebisbas Aug 5, 2023
8db89d7
bench: Load dt to XDSL
georgebisbas Aug 5, 2023
bcd7a8d
setup: Save extent
georgebisbas Aug 5, 2023
8c8858a
bench: Add so to saved data
georgebisbas Aug 5, 2023
72fb27d
bench: Add wave3d setup
georgebisbas Aug 5, 2023
722e998
bench: compress saved data
georgebisbas Aug 5, 2023
aa73e33
bench: compress properly u.data[:]
georgebisbas Aug 5, 2023
163db36
Merge pull request #23 from xdslproject/emilien/stencil-tiling
georgebisbas Aug 6, 2023
ad55988
bench: More cleanup and tiling merge
georgebisbas Aug 6, 2023
6cfe569
bench: Hide pyvista req
georgebisbas Aug 6, 2023
80c0d31
Merge pull request #24 from xdslproject/bench_edits-2
georgebisbas Aug 9, 2023
dde653e
Insert necessary boilerplate. stencil lowerings doesn't handle it.
PapyChacal Aug 9, 2023
0d73b6e
Add more sensible and resilient tile sizes.
PapyChacal Aug 4, 2023
9082d9f
Try with a arguments-minimizing pipeline.
PapyChacal Aug 4, 2023
84b5522
Improve args-minimization pipeline (by still folding all compile-time…
PapyChacal Aug 6, 2023
c694e02
Remove superfluous GPU passes.
PapyChacal Aug 7, 2023
cca08a4
Add direct .so backdoor capability, and XDSL_SKIP_CLEAN env variable …
PapyChacal Aug 7, 2023
f057a23
Use DeVito's par-tile.
PapyChacal Aug 8, 2023
baa38fb
Use the boilerplate flag to not always copy to GPU.
PapyChacal Aug 9, 2023
fa0276f
Merge pull request #26 from xdslproject/emilien/gpu-again
georgebisbas Aug 9, 2023
devito/ir/ietxdsl/cluster_to_ssa.py (59 additions & 21 deletions)
@@ -1,7 +1,9 @@
# ------------- devito import -------------#

from sympy import Add, Expr, Float, Indexed, Integer, Mod, Mul, Pow, Symbol
from xdsl.dialects import arith, builtin, func, memref, scf, stencil
from devito.arch.archinfo import NvidiaDevice
from devito.parameters import configuration
from xdsl.dialects import arith, builtin, func, memref, scf, stencil, gpu
from xdsl.dialects.experimental import dmp, math
from xdsl.ir import Attribute, Block, Operation, OpResult, Region, SSAValue
from typing import Any
@@ -113,7 +115,7 @@ def _convert_eq(self, eq: LoweredEq):
), f"can only write to offset [0,0,0], given {offsets[1:]}"

self.block.add_op(stencil.ReturnOp.get([rhs_result]))
outermost_block.add_op(func.Return.get())
outermost_block.add_op(func.Return())

return func.FuncOp.from_region(
"apply_kernel", [], [], Region([outermost_block])
@@ -272,7 +274,7 @@ def _ensure_same_type(self, *vals: SSAValue):
new_vals.append(val)
continue
# insert an integer to float cast op
conv = arith.SIToFPOp.get(val, builtin.f32)
conv = arith.SIToFPOp(val, builtin.f32)
self.block.add_op(conv)
new_vals.append(conv.result)
return new_vals
@@ -323,6 +325,38 @@ def is_float(val: SSAValue):

from xdsl.dialects import llvm

@dataclass
class WrapFunctionWithTransfers(RewritePattern):
func_name: str
done: bool = field(default=False)

@op_type_rewrite_pattern
def match_and_rewrite(self, op: func.FuncOp, rewriter: PatternRewriter):
if op.sym_name.data != self.func_name or self.done:
return
self.done = True

op.sym_name = builtin.StringAttr("gpu_kernel")
print("Doing GPU STUFF")
# GPU STUFF
wrapper = func.FuncOp(self.func_name, op.function_type, Region(Block([func.Return()], arg_types=op.function_type.inputs)))
body = wrapper.body.block
wrapper.body.block.insert_op_before(func.Call("gpu_kernel", body.args, []), body.last_op)
for arg in wrapper.args:
shapetype = arg.type
if isinstance(shapetype, stencil.FieldType):
memref_type = memref.MemRefType.from_element_type_and_shape(shapetype.get_element_type(), shapetype.get_shape())
alloc = gpu.AllocOp(memref.MemRefType.from_element_type_and_shape(shapetype.get_element_type(), shapetype.get_shape()))
outcast = builtin.UnrealizedConversionCastOp.get(alloc, shapetype)
arg.replace_by(outcast.results[0])
incast = builtin.UnrealizedConversionCastOp.get(arg, memref_type)
copy = gpu.MemcpyOp(source=incast, destination=alloc)
body.insert_ops_before([alloc, outcast, incast, copy], body.ops.first)

copy_out = gpu.MemcpyOp(source=alloc, destination=incast)
dealloc = gpu.DeallocOp(alloc)
body.insert_ops_before([copy_out, dealloc], body.ops.last)
rewriter.insert_op_after_matched_op(wrapper)
@dataclass
class MakeFunctionTimed(RewritePattern):
"""
@@ -340,16 +374,16 @@ def match_and_rewrite(self, op: func.FuncOp, rewriter: PatternRewriter):
self.seen_ops.add(op)

rewriter.insert_op_at_start([
t0 := func.Call.get('timer_start', [], [builtin.f64])
t0 := func.Call('timer_start', [], [builtin.f64])
], op.body.block)

ret = op.get_return_op()
assert ret is not None

rewriter.insert_op_before([
timers := iet_ssa.LoadSymbolic.get('timers', llvm.LLVMPointerType.typed(builtin.f64)),
t1 := func.Call.get('timer_end', [t0], [builtin.f64]),
llvm.StoreOp.get(t1, timers),
t1 := func.Call('timer_end', [t0], [builtin.f64]),
llvm.StoreOp(t1, timers),
], ret)

rewriter.insert_op_after_matched_op([
@@ -405,8 +439,8 @@ def match_and_rewrite(self, op: iet_ssa.Stencil, rewriter: PatternRewriter, /):

for field in op.input_indices:
rewriter.insert_op_before_matched_op(load_op := stencil.LoadOp.get(field))
input_temps.append(load_op.res)
load_op.res.name_hint = field.name_hint + "_temp"
input_temps.insert(0, load_op.res)

rewriter.replace_matched_op(
[
@@ -479,8 +513,10 @@ def match_and_rewrite(self, op: iet_ssa.LoadSymbolic, rewriter: PatternRewriter,
if symb_name not in args:
body = parent.body.blocks[0]
args[symb_name] = body.insert_arg(op.result.type, len(body.args))


op.result.replace_by(args[symb_name])

rewriter.erase_matched_op()
parent.update_function_type()
# attach information on parameter names to func
@@ -492,26 +528,28 @@ def match_and_rewrite(self, op: iet_ssa.LoadSymbolic, rewriter: PatternRewriter,
)


def convert_devito_stencil_to_xdsl_stencil(module):
grpa = GreedyRewritePatternApplier(
[
_DevitoStencilToStencilStencil(),
LowerIetForToScfFor(),
MakeFunctionTimed('apply_kernel'),
def convert_devito_stencil_to_xdsl_stencil(module, timed:bool=True):
patterns:list[RewritePattern] = [
_DevitoStencilToStencilStencil(),
LowerIetForToScfFor(),
]
)
if timed:
patterns.append(MakeFunctionTimed('apply_kernel'))
grpa = GreedyRewritePatternApplier(patterns)
perf("DevitoStencil to stencil.stencil")
perf("LowerIetForToScfFor")

PatternRewriteWalker(grpa, walk_regions_first=True).rewrite_module(module)



def finalize_module_with_globals(module: builtin.ModuleOp, known_symbols: dict[str, Any]):
grpa = GreedyRewritePatternApplier(
[
_InsertSymbolicConstants(known_symbols),
_LowerLoadSymbolidToFuncArgs(),
]
)
def finalize_module_with_globals(module: builtin.ModuleOp, known_symbols: dict[str, Any], gpu_boilerplate):
patterns = [
_InsertSymbolicConstants(known_symbols),
_LowerLoadSymbolidToFuncArgs(),
]
grpa = GreedyRewritePatternApplier(patterns)
PatternRewriteWalker(grpa).rewrite_module(module)
if gpu_boilerplate:
walker = PatternRewriteWalker(GreedyRewritePatternApplier([WrapFunctionWithTransfers('apply_kernel')]))
walker.rewrite_module(module)
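The two functions above are this module's entry points: `convert_devito_stencil_to_xdsl_stencil` rewrites Devito stencils into `stencil` ops (optionally wrapping `apply_kernel` with timers), and `finalize_module_with_globals` folds in known symbol values and, when `gpu_boilerplate` is set, applies `WrapFunctionWithTransfers`. A minimal sketch of how they compose; the empty module and the `dt` value are placeholders, not taken from this PR:

```python
from xdsl.dialects import builtin
from devito.ir.ietxdsl.cluster_to_ssa import (
    convert_devito_stencil_to_xdsl_stencil,
    finalize_module_with_globals,
)

# Placeholder; in Devito this is the module produced by the IET lowering
module = builtin.ModuleOp([])

# Rewrite Devito stencils to stencil.stencil, skipping the timer wrapper
convert_devito_stencil_to_xdsl_stencil(module, timed=False)

# Fold in known symbol values; gpu_boilerplate=True would additionally wrap
# apply_kernel with host<->device transfers via WrapFunctionWithTransfers
finalize_module_with_globals(module, {'dt': 0.001}, gpu_boilerplate=False)
```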
devito/ir/ietxdsl/iet_ssa.py (1 addition & 1 deletion)
@@ -466,7 +466,7 @@ def get(
stencil.TempType(len(shape), typ)
] * (time_buffers - 1))

for block_arg, idx_arg in zip(block.args, time_indices):
for block_arg, idx_arg in zip(block.args, reversed(inputs)):
name = SSAValue.get(idx_arg).name_hint
if name is None:
continue
devito/ir/ietxdsl/ietxdsl_functions.py (6 additions & 7 deletions)
@@ -20,12 +20,11 @@
# XDSL specific imports
from xdsl.irdl import AnyOf, Operation, SSAValue
from xdsl.dialects.builtin import (ContainerOf, Float16Type, Float32Type,
Float64Type, Builtin, i32, f32)
Float64Type, i32, f32)

from xdsl.dialects.arith import Muli, Addi
from devito.ir.ietxdsl import iet_ssa

from xdsl.dialects import memref, arith, builtin, llvm
from xdsl.dialects import memref, arith, builtin
from xdsl.dialects.experimental import math

import devito.types
@@ -74,7 +73,7 @@ def print_calls(cgen, calldefs):
print("Call not translated in calldefs")
return

call = Call.get(call_name, C_names, C_typenames, C_typeqs, prefix, retval)
call = Call(call_name, C_names, C_typenames, C_typeqs, prefix, retval)

cgen.printCall(call, True)

@@ -180,10 +179,10 @@ def add_to_block(expr, arg_by_expr: dict[Any, Operation], result):
# reconcile differences

if isinstance(rhs.typ, builtin.IntegerType):
rhs = arith.SIToFPOp.get(rhs, lhs.typ)
rhs = arith.SIToFPOp(rhs, lhs.typ)
result.append(rhs)
else:
lhs = arith.SIToFPOp.get(lhs, rhs.typ)
lhs = arith.SIToFPOp(lhs, rhs.typ)
result.append(lhs)


@@ -426,7 +425,7 @@ def myVisit(node, block: Block, ssa_vals={}):
print(f"Call {node.name} instance translated as comment")
return

call = Call.get(call_name, C_names, C_typenames, C_typeqs, prefix, retval)
call = Call(call_name, C_names, C_typenames, C_typeqs, prefix, retval)
block.add_ops([call])

print(f"Call {node.name} translated")
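Many of the one-line edits in this file (and in cluster_to_ssa.py above) are the same mechanical migration: xdsl's legacy `.get(...)` factory classmethods replaced with plain constructors. A minimal sketch of the pattern, assuming an xdsl version that supports the constructor-style API:

```python
from xdsl.dialects import arith, builtin
from xdsl.ir import Block

# A block with a single i32 argument to feed the cast
block = Block(arg_types=[builtin.i32])
int_val = block.args[0]

# Before this PR: conv = arith.SIToFPOp.get(int_val, builtin.f32)
# After this PR, the constructor is called directly:
conv = arith.SIToFPOp(int_val, builtin.f32)
block.add_op(conv)
```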
devito/ir/ietxdsl/lowering.py (4 additions & 1 deletion)
@@ -116,6 +116,9 @@ def match_and_rewrite(self, op: iet_ssa.For, rewriter: PatternRewriter, /):
]
rewriter.insert_op_before_matched_op(subindice_vals)

subindice_vals = list(reversed(subindice_vals))
subindice_vals.append(subindice_vals.pop(0))

rewriter.replace_matched_op([
cst1 := arith.Constant.from_int_and_width(1, builtin.IndexType()),
new_ub := arith.Addi(op.ub, cst1),
@@ -368,7 +371,7 @@ def match_and_rewrite(self, op: memref.Store, rewriter: PatternRewriter,
ssa_indices=[idx],
result_type=llvm.LLVMPointerType.typed(op.memref.memref.element_type)
),
store := llvm.StoreOp.get(op.value, gep),
store := llvm.StoreOp(op.value, gep),
],
[],
)
devito/ir/ietxdsl/xdsl_passes.py (1 addition & 1 deletion)
@@ -143,7 +143,7 @@ def _op_to_func(op: Operator):
ietxdsl_functions.myVisit(i, block=block, ssa_vals=ssa_val_dict)

# add a trailing return
block.add_op(func.Return.get())
block.add_op(func.Return())

func_op = func.FuncOp.from_region(str(op.name), arg_types, [], Region([block]))

devito/mpi/distributed.py (113 additions & 3 deletions)
@@ -1,7 +1,9 @@
from abc import ABC, abstractmethod
from ctypes import c_int, c_void_p, sizeof
from itertools import groupby, product
from math import ceil
from abc import ABC, abstractmethod
from math import ceil, pow
from sympy import factorint

import atexit

from cached_property import cached_property
@@ -194,6 +196,9 @@ def __init__(self, shape, dimensions, input_comm=None, topology=None):
# mpi4py takes care of that when the object gets out of scope
self._input_comm = (input_comm or MPI.COMM_WORLD).Clone()

if len(shape) == 3:
topology = ('*', '*', 1)

if topology is None:
# `MPI.Compute_dims` sets the dimension sizes to be as close to each other
# as possible, using an appropriate divisibility algorithm. Thus, in 3D:
@@ -204,6 +209,9 @@ def __init__(self, shape, dimensions, input_comm=None, topology=None):
# guarantee that 9 ranks are arranged into a 3x3 grid when shape=(9, 9))
self._topology = compute_dims(self._input_comm.size, len(shape))
else:
# A custom topology may contain integers or the wildcard '*'
topology = CustomTopology(topology, self._input_comm)

self._topology = topology

if self._input_comm is not input_comm:
@@ -253,9 +261,18 @@ def nprocs(self):
def topology(self):
return self._topology

@property
def topology_logical(self):
if isinstance(self.topology, CustomTopology):
return self.topology.logical
else:
return None

@cached_property
def is_boundary_rank(self):
""" MPI rank interfaces with the boundary of the domain. """
"""
MPI rank interfaces with the boundary of the domain.
"""
return any([True if i == 0 or i == j-1 else False for i, j in
zip(self.mycoords, self.topology)])

@@ -550,6 +567,99 @@ def _arg_values(self, *args, **kwargs):
return self._arg_defaults()


class CustomTopology(tuple):

"""
The CustomTopology class provides a mechanism to describe parametric domain
decompositions. It allows users to specify how the dimensions of a domain are
decomposed into chunks based on certain parameters.

Examples
--------
For example, let's consider a domain with three distributed dimensions: x, y, and z,
and an MPI communicator with N processes. Here are a few examples of CustomTopology:

With N known, say N=4:
* `(1, 1, 4)`: the z Dimension is decomposed into 4 chunks
* `(2, 1, 2)`: the x Dimension is decomposed into 2 chunks and the z Dimension
is decomposed into 2 chunks

With N unknown:
* `(1, '*', 1)`: the wildcard `'*'` indicates that the runtime should decompose the y
Dimension into N chunks
* `('*', '*', 1)`: the wildcards `'*'` indicate that the runtime should decompose both
the x and y Dimensions into `nstars` factors of N, prioritizing
the outermost dimension

If the number of ranks `N` cannot be decomposed evenly across the requested wildcard
positions, the decomposition is made as even as possible, prioritising the outermost dimension:

For N=3
* `('*', '*', 1)` gives: (3, 1, 1)
* `('*', 1, '*')` gives: (3, 1, 1)
* `(1, '*', '*')` gives: (1, 3, 1)

For N=6
* `('*', '*', 1)` gives: (3, 2, 1)
* `('*', 1, '*')` gives: (3, 1, 2)
* `(1, '*', '*')` gives: (1, 3, 2)

For N=8
* `('*', '*', '*')` gives: (2, 2, 2)
* `('*', '*', 1)` gives: (4, 2, 1)
* `('*', 1, '*')` gives: (4, 1, 2)
* `(1, '*', '*')` gives: (1, 4, 2)

Notes
-----
Users should not directly use the CustomTopology class. It is instantiated
by the Devito runtime based on user input.
"""

def __new__(cls, items, input_comm):
# Keep track of nstars and already defined decompositions
nstars = items.count('*')

# If no stars exist we are ready
if nstars == 0:
processed = items
else:
# Init decomposition list and track star positions
processed = [1] * len(items)
star_pos = []
for i, item in enumerate(items):
if isinstance(item, int):
processed[i] = item
else:
star_pos.append(i)

# Compute the remaining procs to be allocated
alloc_procs = np.prod([i for i in items if i != '*'])
rem_procs = int(input_comm.size // alloc_procs)

# List of all factors of rem_procs in decreasing order
factors = factorint(rem_procs)
vals = [k for (k, v) in factors.items() for _ in range(v)][::-1]

# Split in number of stars
split = np.array_split(vals, nstars)

# Reduce
star_vals = [int(np.prod(s)) for s in split]

# Apply computed star values to the processed
for index, value in zip(star_pos, star_vals):
processed[index] = value

# Final check that topology matches the communicator size
assert np.prod(processed) == input_comm.size

obj = super().__new__(cls, processed)
obj.logical = items

return obj


def compute_dims(nprocs, ndim):
# We don't do anything clever here. In fact, we do something very basic --
# we just try to distribute `nprocs` evenly over the number of dimensions,
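The star-allocation logic in `CustomTopology.__new__` can be exercised without MPI. Below is a self-contained sketch of the same algorithm; the communicator size is passed in directly, whereas in Devito it comes from `input_comm.size`:

```python
import numpy as np
from sympy import factorint

def resolve_topology(items, comm_size):
    """Mirror of CustomTopology.__new__'s decomposition logic."""
    nstars = items.count('*')
    if nstars == 0:
        return tuple(items)

    # Track star positions; explicit integers pass through unchanged
    processed = [1] * len(items)
    star_pos = []
    for i, item in enumerate(items):
        if isinstance(item, int):
            processed[i] = item
        else:
            star_pos.append(i)

    # Processes left over after honoring the explicit integer entries
    alloc_procs = np.prod([i for i in items if i != '*'])
    rem_procs = int(comm_size // alloc_procs)

    # Prime factors of the remainder, largest first
    factors = factorint(rem_procs)
    vals = [k for (k, v) in factors.items() for _ in range(v)][::-1]

    # Spread the factors over the starred slots, outermost first
    split = np.array_split(vals, nstars)
    star_vals = [int(np.prod(s)) for s in split]
    for index, value in zip(star_pos, star_vals):
        processed[index] = value

    assert np.prod(processed) == comm_size
    return tuple(processed)

print(resolve_topology(('*', '*', 1), 6))  # (3, 2, 1)
print(resolve_topology(('*', 1, '*'), 8))  # (4, 1, 2)
```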