How to promise alignment of Input and Output buffers? #6495
-
See test/correctness/constraints.cpp. Normally I say something like:
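A sketch of that kind of constraint, with my_input standing in for the ImageParam or Input<Buffer<>> (the exact expressions may differ, but this is the shape of it):

```cpp
// Constrain dim 0 so that its min and extent equal themselves rounded down
// to the next multiple of 32. This both tells the compiler about the
// alignment and inserts a runtime check on the buffer that gets passed in.
Expr min0 = my_input.dim(0).min();
Expr extent0 = my_input.dim(0).extent();
my_input.dim(0).set_bounds((min0 / 32) * 32, (extent0 / 32) * 32);
```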
That asserts that the min and extent are equal to themselves rounded down to the next multiple of 32. I think you can also say my_pipeline.add_requirement(my_input.dim(0).extent() % 32 == 0) (and similar for the min).
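A sketch of the add_requirement form, with an error message attached (again using my_input and my_pipeline as placeholder names):

```cpp
// Checked once at pipeline entry, in the order added. If a condition fails,
// the pipeline calls halide_error with the remaining arguments and returns
// halide_error_code_requirement_failed.
my_pipeline.add_requirement(my_input.dim(0).extent() % 32 == 0,
                            "dim 0 extent must be a multiple of 32, got ",
                            my_input.dim(0).extent());
my_pipeline.add_requirement(my_input.dim(0).min() % 32 == 0,
                            "dim 0 min must be a multiple of 32");
```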
-
Aha, cool! Thanks for the reply. I think this could use a helper function. If you think that's a good idea, I'll make a PR for it.
-
I think the simplifier doesn't exploit my constraint to its full potential. With

coord.dim(0).set_bounds(0, max(32, (coord.dim(0).extent() / 32) * 32));

the .stmt still yields:

let sum_prob_per_sample_global_wrapper$0.s0.sample.max.s = min((min(coord.extent.0, 16) + (((coord.extent.0 + -1) / 16) * 16)), coord.extent.0)

(Update: promising a multiple of 16 also doesn't work.)
-
I attempted
-
Hrm, I do this sort of thing constantly, so something else must be going on. Could you post a full repro?
-
I don't know where to start on making an MWE, so I'll drop a gist of my actual generator here. It's lengthy, and still contains preprocessor macros.
-
Where are you seeing the bad indexing? Here's the .stmt I get:
-
You don't have a GPU target selected. Sorry if that wasn't clear. I use:
-
This is not a discussion. This is a bug. Why was this made into a discussion?
-
What I'm seeing is some code that doesn't exploit the alignment, but it occurs before the assert. That's expected, because at that point in the code the extent still might not be a multiple of 16. If we incorrectly assumed that the assert can't fail later, we could return a garbled or incorrect error before reaching it.

It's possible that a let computed before the assert should be known to be a multiple of 16 but isn't, because it's computed before the assert, and that the same let is later used, after the assert, in a context where we'd like to know it's a multiple of 16. Halide isn't smart enough to infer the alignment of dependent lets like that, but I'm not seeing anything of the sort here.

I do see things like coord.extent.0/16*16 scattered throughout, but that's fine. That's how we communicate to inner code that this value is a multiple of 16. It'll be lifted as a loop invariant and computed once by masking off the low bits of coord.extent.0. So I don't think I'm seeing a bug here; it's exploiting the alignment to the extent necessary. Are you seeing unnecessary indexing code in the PTX somewhere?
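As an aside, a minimal standalone sketch (plain C++, not Halide) of why expressions like coord.extent.0/16*16 are cheap once lifted out of the loop: for a non-negative value, rounding down to a multiple of 16 is the same as masking off the low bits mentioned above.

```cpp
#include <cassert>
#include <cstdint>

int main() {
    for (uint32_t x = 0; x < 4096; x++) {
        // For non-negative x, (x / 16) * 16 rounds down to a multiple of 16,
        // which is the same as clearing the low four bits.
        assert((x / 16) * 16 == (x & ~15u));
    }
    return 0;
}
```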
-
The is_bounds_query stuff exists so that a user can call into a Halide pipeline
to ask what the constraints *are*. They need to be able to pass in an
(unallocated) buffer that fails the bounds checks and get back an updated shape
that would pass them. That's supposed to work with things like set_stride and
so on. So the bounds inference code and the bounds querying code need to come
before those asserts.
However, I'm not sure it needs to come before the add_requirement asserts,
because those aren't actionable by a bounds query (it can't easily change
the buffer shape to pass the assert). I'm not sure how people would feel
about add_requirement applying even to bounds queries. This is something
we'd have to raise more broadly.
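For context, a minimal sketch of what a bounds query looks like from the caller's side; my_pipeline here is a placeholder for an AOT-compiled pipeline:

```cpp
#include "HalideBuffer.h"

// Placeholder declaration; normally this comes from the generated header.
extern "C" int my_pipeline(halide_buffer_t *input, halide_buffer_t *output);

int main() {
    // Buffers with a shape but a null host pointer mark a bounds query:
    // the pipeline rewrites their shapes to what it actually requires
    // instead of doing any real work.
    Halide::Runtime::Buffer<float> input(nullptr, 100, 100);
    Halide::Runtime::Buffer<float> output(nullptr, 100, 100);
    my_pipeline(input.raw_buffer(), output.raw_buffer());
    // input.dim(0).min()/.extent() etc. now describe the required region;
    // allocate real buffers with those shapes and call the pipeline again.
    return 0;
}
```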
…On Thu, Dec 16, 2021 at 2:13 AM Martijn Courteaux ***@***.***> wrote:
Okay, looking at the generated code (now with %64 instead of 16), I indeed see
that the /64*64 and +63)/64 are not there anymore, except in the allocate[]
statements, which is not bad for performance but is the most confusing place
to investigate manually.
However, inserting the asserts that late in the pipeline leaves out a lot of
opportunities to optimize and to simplify for readability. I think the asserts
(at least the ones I'm dealing with) can be inserted higher: right after the
block of code that fetches all the buffer specs with
_halide_buffer_get_stride, _halide_buffer_get_extent and
_halide_buffer_get_min. It seems that the assert is currently inserted after
the bounds inference computations, which are exactly the thing I want to
simplify and influence with my add_requirement.
Judging from the documentation of add_requirement:
/** Add a top-level precondition to the generated pipeline,
 * expressed as a boolean Expr. The Expr may depend on parameters
 * only, and may not call any Func or use a Var. If the condition
 * is not true at runtime, the pipeline will call halide_error
 * with the remaining arguments, and return
 * halide_error_code_requirement_failed. Requirements are checked
 * in the order added. */
and looking at the statement file, it seems that the requirement-asserts can
be inserted earlier. Moreover, the requirement-asserts are currently inserted
after a bunch of Halide-generated asserts. So let's analyse whether they can
move up to before the bounds inference without trouble:
1. These Halide-generated asserts and the requirement-asserts are
commutative (meaning the requirement-asserts can go above them just fine).
2. Then there is a block of buffer initialization like this:

if ((uint1)_halide_buffer_is_bounds_query((halide_buffer_t *)acc_d.buffer)) {
  (halide_buffer_t *)_halide_buffer_init((halide_buffer_t *)acc_d.buffer, (halide_dimension_t *)_halide_buffer_get_shape((halide_buffer_t *)acc_d.buffer), (void *)reinterpret((uint64)0), (uint64)0, (halide_device_interface_t *)reinterpret((uint64)0), 2, 32, 2, (halide_dimension_t *)make_struct(0, indices.extent.0, 1, 0, 0, 7, indices.extent.0, 0), (uint64)0)
}

I don't have a clue what this does (the then-body runs if host is 0 and
device is 0, then proceeds to set host and device to 0 again, and then just
copies the other values (like shapes, strides, etc.) into the buffer, values
that were already there). However, it looks commutative with the
requirement-asserts.
3. Then above that, there is a block of bounds inference (my target for
optimizing / simplifying) which contains a lot of let statements that are
used throughout the rest of the statement file and cause unnecessary /64*64
and the like. This is exactly the part where they (the requirement-asserts
and the let-statements) are not commutative, as changing the order influences
the computations produced. However, they are functionally commutative, which
is exactly what I want, and would simplify the following let statements, and
by extension the entire statement file.
For completeness, this is the block of let statements:
let coord.extent.0.required = max(min((min(coord.extent.0, 16) + ((coord.extent.0/16)*16)) + -16, coord.extent.0), (max(logit.s0.sample.max.s + 15, coord.extent.0)/16)*16)
let indices.extent.0.required = min(max(min(num_kernels, 64) + (((num_kernels + -1)/64)*64), min(num_kernels, 32) + (((num_kernels + -1)/32)*32)), num_kernels)
assert(!(uint1)_halide_buffer_is_bounds_query((halide_buffer_t *)invCholR.buffer) || ((16 <= num_kernels) && (num_kernels <= indices.extent.0.required)), halide_error_constraints_make_required_region_smaller("Input buffer invCholR", 0, 0, indices.extent.0.required + -1, min(num_kernels, 16) + -16, num_kernels + -1))
assert(!(uint1)_halide_buffer_is_bounds_query((halide_buffer_t *)logit_kernel_offset.buffer) || ((16 <= num_kernels) && (num_kernels <= indices.extent.0.required)), halide_error_constraints_make_required_region_smaller("Input buffer logit_kernel_offset", 0, 0, indices.extent.0.required + -1, min(num_kernels, 16) + -16, num_kernels + -1))
assert(!(uint1)_halide_buffer_is_bounds_query((halide_buffer_t *)mu.buffer) || ((16 <= num_kernels) && (num_kernels <= indices.extent.0.required)), halide_error_constraints_make_required_region_smaller("Input buffer mu", 0, 0, indices.extent.0.required + -1, min(num_kernels, 16) + -16, num_kernels + -1))
4. Then we arrive at the block of _halide_buffer_get_stride,
_halide_buffer_get_extent and _halide_buffer_get_min calls, which we of
course don't want to reorder, as it just fetches all of the variables we
want to define asserts over.
So, in conclusion, I believe the requirement-asserts -- in general --
could be placed higher in the pipeline, and produce valid results, while
simplifying a lot in the statement file. I have not investigated how to do
that, but it would be very interesting to try it, and see if all the tests
still pass.
-
I'd like to simplify addressing and the intermediate-buffer allocation-size computations. Basically, I want to promise in the scheduling language that a generator's Input<Buffer<>> will have dimension extents with a zero remainder modulo 32. It can generate a pipeline assert to validate that the buffer I pass in effectively satisfies that. Right now I have a bunch of computations I don't grasp easily, like these:

So it does a bunch of calculations that will yield results that are aligned to 16. But I know that the coord buffer will always be a non-zero multiple of 32 along the sample dimension. If I manually substitute and propagate this information, I'd get:

In fact, I think coord.extent.0.required.s would even disappear. This would make the statement file so much cleaner. But I don't know whether it is possible to add buffer constraints.