Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorization error with predicates depending on vectorized loop #615

Open
kaushikcfd opened this issue May 7, 2022 · 13 comments
Open

Vectorization error with predicates depending on vectorized loop #615

kaushikcfd opened this issue May 7, 2022 · 13 comments

Comments

@kaushikcfd
Copy link
Collaborator

The following kernel --

knl = lp.make_kernel(
    "{[i, j]: 0<=i<100 and 0<=j<4}",
    """
    for i
        for j
            <> tmp1[j] = i+j
            <> tmp2[j] = 0
            if j
                tmp2[j] = 2 * tmp1[j]
            end
            out[i, j] = 2*tmp2[j]
        end
    end
    """, seq_dependencies=True)

knl = lp.tag_array_axes(knl, "tmp1,tmp2", "vec")
knl = lp.tag_inames(knl, "j:vec")

print(lp.generate_code_v2(knl).device_code())

generates:

__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global int *__restrict__ out)
{
  int4 tmp1;
  int4 tmp2;

  for (int i = 0; i <= 99; ++i)
  {
    tmp1.s0 = i;
    tmp1.s1 = i + 1;
    tmp1.s2 = i + 2;
    tmp1.s3 = i + 3;
    tmp2 = 0;
    if (j)
      tmp2 = 2 * tmp1;
    out[4 * i] = 2 * tmp2.s0;
    out[1 + 4 * i] = 2 * tmp2.s1;
    out[2 + 4 * i] = 2 * tmp2.s2;
    out[3 + 4 * i] = 2 * tmp2.s3;
  }
}

Notice the stray (j) in the conditional. A short term solution which is not vectorizing such instructions will be included as a part of #557.

/cc @sv2518

@kaushikcfd
Copy link
Collaborator Author

kaushikcfd commented Jul 4, 2022

@inducer: I think this is unavoidable for the kernels that @sv2518 is working with. Since changing the logic for emitting predicates in loopy.codegen.instruction would be too intrusive, I propose that loop distribution would give us an easy fallback as follows:

tunit = lp.make_kernel(
    "{[i, j]: 0<=i<100 and 0<=j<4}",
    """
    for i
        for j
            <> tmp1[j] = i+j
            <> tmp2[j] = 0
            if j>2
                tmp2[j] = 2 * tmp1[j]  {id=w_tmp2}
            end
            out[i, j] = 2*tmp2[j]
        end
    end
    """, seq_dependencies=True)


tunit = lp.tag_array_axes(tunit, "tmp1,tmp2", "vec")
tunit = lp.tag_inames(tunit, "j:vec")

# {{{ fallback -->

knl = lp.distribute_loops(tunit.default_entrypoint,
                          "id:w_tmp2",
                          outer_inames=frozenset("i"))
renamed_j, = knl.id_to_insn["w_tmp2"].within_inames - {"i"}
knl = lp.untag_inames(knl, renamed_j, VectorizeTag)
knl = lp.tag_inames(knl, {renamed_j: "unr"})

# }}}

tunit = tunit.with_kernel(knl)
print(lp.generate_code_v2(tunit).device_code())

With the potential use-case for loop distribution, any opinions on (me) moving forward with it?

@kaushikcfd
Copy link
Collaborator Author

@sv2518: If you want to access this functionality, please checkout the branch cvec_x_distribute_loops.

@sv2518
Copy link
Contributor

sv2518 commented Jul 6, 2022

Thanks!

@sv2518
Copy link
Contributor

sv2518 commented Jul 6, 2022

cvec_x_distribute_loops is not updated to the new version of the c_vecextensions_target, is it?

@kaushikcfd
Copy link
Collaborator Author

cvec_x_distribute_loops is not updated to the new version of the c_vecextensions_target, is it?

Oops, yep. Messed up while cherry-picking the commits. Pushed a fix.

@sv2518
Copy link
Contributor

sv2518 commented Jul 7, 2022

Thanks!

Is the fallback code meant to go in our codebase or can this be done in loopy?

I added this snippet to PyOP2

        all_insn_cinsn = list(insn for insn in wrapper.default_entrypoint.instructions  if isinstance(insn, lp.CInstruction))
        # {{{ fallback -->
        for insn in all_insn_cinsn:
            wrapper = lp.distribute_loops(wrapper.default_entrypoint,
                                            insn.id,
                                            outer_inames=shifted_iname)
            renamed_j, = wrapper.id_to_insn[insn.id].within_inames - shifted_iname
            wrapper = lp.untag_inames(wrapper, renamed_j, VectorizeTag)
            wrapper = lp.tag_inames(wrapper, {renamed_j: "unr"})

        # }}}

but I run into the following error. Did I drive this wrong?

../PyOP2/pyop2/global_kernel.py:501: in vectorise
    wrapper = lp.distribute_loops(wrapper.default_entrypoint,
../loopy/loopy/transform/iname.py:1179: in wrapper
    transformed_kernel = transformation_func(kernel, *args, **kwargs)
../loopy/loopy/transform/loop_distribution.py:337: in distribute_loops
    within = match.parse_match(insn_match)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

expr = 'mtf__cond'

    def parse_match(expr):
        """Syntax examples::
    
        * ``id:yoink and writes:a_temp``
        * ``id:yoink and (not writes:a_temp or tag:input)``
        """
        if not expr:
            return All()
    
        def parse_terminal(pstate):
            next_tag = pstate.next_tag()
            if next_tag is _id:
                result = Id(pstate.next_match_obj().group(1))
                pstate.advance()
                return result
            elif next_tag is _tag:
                result = Tagged(pstate.next_match_obj().group(1))
                pstate.advance()
                return result
            elif next_tag is _writes:
                result = Writes(pstate.next_match_obj().group(1))
                pstate.advance()
                return result
            elif next_tag is _reads:
                result = Reads(pstate.next_match_obj().group(1))
                pstate.advance()
                return result
            elif next_tag is _in_kernel:
                result = InKernel(pstate.next_match_obj().group(1))
                pstate.advance()
                return result
            elif next_tag is _iname:
                result = Iname(pstate.next_match_obj().group(1))
                pstate.advance()
                return result
            else:
                pstate.expected("terminal")
    
        def inner_parse(pstate, min_precedence=0):
            pstate.expect_not_end()
    
            if pstate.is_next(_not):
                pstate.advance()
                left_query = Not(inner_parse(pstate, _PREC_NOT))
            elif pstate.is_next(_openpar):
                pstate.advance()
                left_query = inner_parse(pstate)
                pstate.expect(_closepar)
                pstate.advance()
            else:
                left_query = parse_terminal(pstate)
    
            did_something = True
            while did_something:
                did_something = False
                if pstate.is_at_end():
                    return left_query
    
                next_tag = pstate.next_tag()
    
                if next_tag is _and and _PREC_AND > min_precedence:
                    pstate.advance()
                    left_query = And(
                            (left_query, inner_parse(pstate, _PREC_AND)))
                    did_something = True
                elif next_tag is _or and _PREC_OR > min_precedence:
                    pstate.advance()
                    left_query = Or(
                            (left_query, inner_parse(pstate, _PREC_OR)))
                    did_something = True
    
            return left_query
    
        if isinstance(expr, MatchExpressionBase):
            return expr
    
        from pytools.lex import LexIterator, lex, InvalidTokenError
        try:
            pstate = LexIterator(
                [(tag, s, idx, matchobj)
                 for (tag, s, idx, matchobj) in lex(_LEX_TABLE, expr,
                     match_objects=True)
                 if tag is not _whitespace], expr)
        except InvalidTokenError as e:
            from loopy.diagnostic import LoopyError
>           raise LoopyError(
                    "invalid match expression: '{match_expr}' ({err_type}: {err_str})"
                    .format(
                        match_expr=expr,
                        err_type=type(e).__name__,
                        err_str=str(e)))
E           loopy.diagnostic.LoopyError: invalid match expression: 'mtf__cond' (InvalidTokenError: at index 0: ...mtf__cond...)

../loopy/loopy/match.py:403: LoopyError

@inducer
Copy link
Owner

inducer commented Jul 7, 2022

I think this is unavoidable

I'm not sure I agree. I think we can expect the predicate to be an expression, and thus we can determine that there's a dependency on that iname in the predicates. Based on that, we just need to raise Unvectorizable somewhere in the target.

@kaushikcfd
Copy link
Collaborator Author

kaushikcfd commented Jul 7, 2022

we just need to raise Unvectorizable somewhere in the target.

We did that in #617. The issue has to do more with the way predicates are emitted in the codegen pipeline. It needs to be taught about UnvectorizableErrors.

Did I drive this wrong?

I think there is a minor error there. I think it should be like:

from loopy.math import Id, Or
cinsn_ids = [cinsn.id
             for cinsn in kernel.instructions
             if (isinstance(cinsn, lp.CInstruction) and cinsn.predicates)]
cinsn_match = Or(tuple(Id(cinsn_id) for cinsn_id in cinsns_ids))
outer_inames = frozenset([shifted_iname+"_outer"])
kernel = lp.distribute_loops(kernel,
                             cinsn_match,
                             outer_inames=outer_inames)
inames_to_untag = [kernel.id_to_insn[cinsn_id].within_inames - outer_inames
                   for cinsn_id in cinsn_ids]
kernel = lp.untag_inames(kernel, inames_to_untag, VectorizeTag)
kernel = lp.tag_inames(kernel, {iname_to_untag: "unr"
                                for iname_to_untag in inames_to_untag})

@sv2518
Copy link
Contributor

sv2518 commented Jul 7, 2022

Ah yes I got further now, but now I error with

../PyOP2/pyop2/global_kernel.py:502: in vectorise
    kernel = lp.distribute_loops(kernel,
../loopy/loopy/transform/iname.py:1179: in wrapper
    transformed_kernel = transformation_func(kernel, *args, **kwargs)
../loopy/loopy/transform/loop_distribution.py:400: in distribute_loops
    if not _is_loop_distribution_sound(kernel,
../loopy/loopy/transform/loop_distribution.py:133: in _is_loop_distribution_sound
    amaps_pred = [
../loopy/loopy/transform/loop_distribution.py:134: in <listcomp>
    process_amap(get_insn_access_map(kernel, insn_id, var))
../loopy/loopy/kernel/tools.py:2166: in get_insn_access_map
    indices = list(_IndexCollector(var)((insn.expression,
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:244: in __call__
    result = super().rec(expr, *args, **kwargs)
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:151: in __call__
    return self.map_foreign(expr, *args, **kwargs)
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:206: in map_foreign
    return self.map_tuple(expr, *args, **kwargs)
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:359: in map_list
    return self.combine(self.rec(child, *args, **kwargs) for child in expr)
../loopy/loopy/kernel/tools.py:2127: in combine
    return reduce(operator.or_, values, frozenset())
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:359: in <genexpr>
    return self.combine(self.rec(child, *args, **kwargs) for child in expr)
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:244: in __call__
    result = super().rec(expr, *args, **kwargs)
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:153: in __call__
    return method(expr, *args, **kwargs)
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:314: in map_quotient
    self.rec(expr.numerator, *args, **kwargs),
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:244: in __call__
    result = super().rec(expr, *args, **kwargs)
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:153: in __call__
    return method(expr, *args, **kwargs)
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:281: in map_call
    (self.rec(expr.function, *args, **kwargs),)
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:244: in __call__
    result = super().rec(expr, *args, **kwargs)
../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:148: in __call__
    return self.handle_unsupported_expression(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <loopy.kernel.tools._IndexCollector object at 0x15075e700>
expr = ResolvedFunction(Variable('sqrt')), args = (), kwargs = {}

    def handle_unsupported_expression(self, expr, *args, **kwargs):
        """Mapper method that is invoked for
        :class:`pymbolic.primitives.Expression` subclasses for which a mapper
        method does not exist in this mapper.
        """
    
>       raise UnsupportedExpressionError(
                "{} cannot handle expressions of type {}".format(
                    type(self), type(expr)))
E       pymbolic.mapper.UnsupportedExpressionError: <class 'loopy.kernel.tools._IndexCollector'> cannot handle expressions of type <class 'loopy.symbolic.ResolvedFunction'>

../../lib/python3.9/site-packages/pymbolic/mapper/__init__.py:122: UnsupportedExpressionError

The CInstruction does not contain an sqrt function, but the instruction it depends on does.

@kaushikcfd
Copy link
Collaborator Author

The CInstruction does not contain an sqrt function, but the instruction it depends on does.

It needed a map_resolved_function. Pushed a fix to the branch.

@sv2518
Copy link
Contributor

sv2518 commented Jul 7, 2022

Okay, cool, thanks! Now one step further I run into

../PyOP2/pyop2/global_kernel.py:502: in vectorise
    kernel = lp.distribute_loops(kernel,
../loopy/loopy/transform/iname.py:1179: in wrapper
    transformed_kernel = transformation_func(kernel, *args, **kwargs)
../loopy/loopy/transform/loop_distribution.py:400: in distribute_loops
    if not _is_loop_distribution_sound(kernel,
../loopy/loopy/transform/loop_distribution.py:138: in _is_loop_distribution_sound
    amaps_succ = [
../loopy/loopy/transform/loop_distribution.py:139: in <listcomp>
    process_amap(get_insn_access_map(kernel, insn_id, var))
../loopy/loopy/kernel/tools.py:2167: in get_insn_access_map
    indices = list(_IndexCollector(var)((insn.expression,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = CInstruction(groups=frozenset(), code='break;', read_variables=frozenset(), tags=frozenset(), within_inames_is_final=T...tf__atol_crit_id'}), no_sync_with=frozenset(), within_inames=frozenset({'mtf__i_c', 'n_shift_outer', 'n_shift_batch'}))
name = 'expression'

    def __getattr__(self, name):
        # This method is implemented to avoid pylint 'no-member' errors for
        # attribute access.
>       raise AttributeError(
                "'{}' object has no attribute '{}'".format(
                    self.__class__.__name__, name))
E       AttributeError: 'CInstruction' object has no attribute 'expression'

../../lib/python3.9/site-packages/pytools/__init__.py:449: AttributeError

@kaushikcfd
Copy link
Collaborator Author

Sorry, hadn't accounted for CInstruction. Pushed a fix that runs the following snippet as expected:

import loopy as lp
from loopy.symbolic import parse


tunit = lp.make_kernel(
    "{[i]: 0<=i<4}",
    ["<> tmp[i] = 0 {id=w_tmp}",
     lp.CInstruction(iname_exprs=("i", "i"),
                     code="break;",
                     predicates={parse("tmp[i] > n")},
                     read_variables={"i", "n"},
                     depends_on=frozenset({"w_tmp"}),
                     id="break",),
     lp.Assignment("out_callee",
                   "i",
                   depends_on=frozenset(["break"]))
     ],
    [lp.ValueArg("n", dtype="int32"), ...],
    name="circuit_breaker")
knl = tunit.default_entrypoint

knl = lp.tag_inames(knl, "i:vec")
knl = lp.distribute_loops(knl, "id:break", frozenset())
knl = lp.untag_inames(knl, "i_1", lp.VectorizeTag)
knl = lp.tag_inames(knl, "i_1:unr")

tunit = tunit.with_kernel(knl)
print(lp.generate_code_v2(knl).device_code())

@sv2518
Copy link
Contributor

sv2518 commented Jul 14, 2022

Fixed as part of #557

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants