-
Notifications
You must be signed in to change notification settings - Fork 210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add f128 int to float conversions #624
base: master
Are you sure you want to change the base?
Conversation
4332449
to
a062626
Compare
This should be ready now Cc @m-ou-se since you wrote the original int to float algorithms. The generic function's monos seem to get reasonably similar codegen, running the benchmarks seems like changes are within the noise margin. |
Haven't looked at it in detail, but was it necessary to make it generic? For f128, all integers up to 64 bit fit into it losslessly without rounding, so those conversions should be nearly trivial, right? Anyway, happy to review this in detail next week. r? m-ou-se |
Not strictly, but nine near-identical functions would be a lot. I think the constants are easier to follow than the magic numbers in any case.
Thanks! |
a062626
to
0150c63
Compare
Having reviewed in detail now, I'm still not convinced we should make this function generic. It now has quite a few different cases to handle, adding complexity. And now it always starts with an "if 0" check, even though a few functions (e.g. |
Doing each of them separately makes it possible to spot nice optimizations that only apply to specific cases. For example, in the u32 to f128 case, the lower 64 bits will always be zero, so all the operations can be done on u64 rather than u128, which results in much shorter and faster assembly: pub fn u32_to_f128_bits(i: u32) -> u128 {
if i == 0 {
return 0;
}
let n = i.leading_zeros();
let m = (i as u64) << (17 + n); // High 64 significant bits, with bit 113 still in tact.
let e = 16413 - n as u64; // Exponent plus 16383, minus one.
let h = (e << 48) + m; // High 64 bits of f128 representation.
(h as u128) << 64 // Low bits are always 0.
} |
4dc63a4
to
b084ae4
Compare
Thanks for taking a look. I think I struck a happier medium - moved the redundant parts to generic functions to make things less magic-number-y, but did not combine the algorithms. Also included your suggested u32 -> f128 conversion. |
b084ae4
to
3d2857a
Compare
3d2857a
to
0bd5245
Compare
It looks like the system NaN functions are converting to max values rather than zero for f128. Not sure why this would be but I'll disable them and look into it. @m-ou-se gentle nudge, could you take another look at this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't find it more readable when split up over multiple functions like this. For example, the exp
function doesn't actually calculate the exponent; it calculates the exponent minus one, because of the trick with the mantissa containing an extra 1 bit that will unconditinally overflow into the exponent later. Separating tricks over multiple functions just makes it harder to follow and review.
That said, if you truly find this more maintainable, I'll be happy to approve this given that the code is well tested and benchmarked.
src/float/conv.rs
Outdated
/// Calculate the exponent from the number of leading zeros. | ||
fn exp<I: Int, F: Float<Int: CastFrom<u32>>>(n: u32) -> F::Int { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't calculate the exponent. It calculates the exponent minus one.
src/float/conv.rs
Outdated
/// Shift the integer into the float's mantissa bits. Keep the lowest exponent bit intact. | ||
fn m_base<I: Int, F: Float<Int: CastFrom<I>>>(i_m: I) -> F::Int { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It keeps the highest bit in tact.
src/float/conv.rs
Outdated
m_base + adj | ||
} | ||
|
||
/// Combine a final float repr from an exponent and mantissa. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's an important detail that it doesn't just combine the exponent and mantissa fields, but instead is designed to take an off-by-one exponent and a mantissa with an extra 1 bit on the high side.
6f7e12b
to
32c548f
Compare
d1aa791
to
0d807c2
Compare
63e7e9e
to
43214fe
Compare
Extract some common routines to separate functions in order to deduplicate code and remove some of the magic.
43214fe
to
ded247f
Compare
Ah this kept falling to the back of my todo list, but I finally got around to it. I changed it so none of the mantissa shifting is hidden in functions, only the values they shift by, so compared to the original it's just easier to understand where the numbers come from. I left the obvious functions Take a look when you get the chance @m-ou-se, feel free to push on top of this if you feel it's needed. |
No description provided.