Return `char` instead of `u8` from `InvalidCharError::invalid_char` #100

Kixunil · 2024-08-19T17:18:52Z

If the string contains a multi-byte UTF-8 char then the error returns confusing value unsuitable for the user.

This change is kinda breaking but we can phase it slowly: add invalid_char_human(&self) -> char method (please invent a better name!), implement char -> u8 conversion inside invalid_char by encoding the first byte of the char and deprecate it.

I suspect this may need a bunch of changes to be able to get the entire char though.

The text was updated successfully, but these errors were encountered:

apoelstra · 2024-08-19T18:04:12Z

We had a similar bug in rust-bech32 in our new Hrp::from_display API. Fortunately we caught it before releasing.

Agreed with your strategy. Ideas for the new method name

invalid
invalid_character
character
char_ (lol)

Kixunil · 2024-08-19T18:08:30Z

character sounds least bad to me.

tcharding · 2024-09-04T02:05:40Z

I had a play with this which led me to the view that

Adding checks for utf-8 slows down parsing (because without some complex trickery we need to parse the input string twice)
Giving a nice error message for utf-8 adds either complexity to the lib or a perf hit or both
Parsing utf-8 (non ascii) characters is an edge case not the normal case
We should not have a perf hit or unnecessary complexity for an edge case

I hacked up a validate_hex_string function but it is so trivial and undiscoverable that users will likely just get stuck debugging then write their own when they think of it.

/// Validates that input `string` contains valid hex characters.
pub fn validate_hex_string(s: &str) -> Result<(), NonHexDigitError> {
    for c in s.chars() {
        if !c.is_ascii_hexdigit() {
            return Err(NonHexDigitError { invalid: c});
        }
    }
    Ok(())
}

Perhaps a middle ground would be to add this to crate docs under some sort of # Debugging section?

apoelstra · 2024-09-04T02:30:15Z

UTF-8 is designed so that as soon as you hit a non-ASCII byte you know that it's the start of a non-ASCII character. So we shouldn't need to use the slow chars iterator or anything. Just, when we hit a non-ASCII character then we need to maybe do some slow stuff to extract the full character that we've run into. (Taking s[pos..].chars().next().unwrap() would work for example, if we know the byte position of the bad character. And knowing this won't slow us down at all.)

tcharding · 2024-09-04T02:35:27Z

That sounds feasible, I'll have a play with it. Thanks

Kixunil · 2024-09-04T07:13:01Z

Another possible approach that doesn't require explicitly tracking the position and relies on iterators only:

// We're not using `.bytes()` because it doesn't have the `as_slice` method.
let mut bytes = s.as_bytes().iter();
// clone() is actually a copy and it allows peeking behavior without losing access to the `as_slice` method on the iterator.
while let Some(byte) = bytes.clone().next() {
    match decode_hex_digit(byte) {
        Some(value) => { /* do something with the value here */ },
        None => {
            // SAFETY: all bytes up to this one were ACII which implies valid UTF-8 and the input was valid UTF-8 to begin with so splitting the string here is sound. We're also making sure we split it at correct position by cloning the iterator.
            let remaining_str = unsafe { core::str::from_utf8_unchecked(bytes.as_slice()) };
            return Err(InvalidChar { c: remaining_str.chars().next().unwrap(), pos: s.len() - remaining_str.len() });
        },
    }
    bytes.next()
}

apoelstra · 2024-09-04T12:52:10Z

👀 can you really call .next() directly on a byteslice like that?

Kixunil · 2024-09-04T18:06:33Z

LOl, I forgot to write .iter(). :D

apoelstra · 2024-09-04T19:03:12Z

Ah! The .iter() is on bytes. Gotcha. Yep, that would work. I don't feel strongly in favor of either solution.

Kixunil · 2024-09-05T16:46:39Z

FTR I was also considering having enum InvalidChar { Utf8(char), Other(u8) } to allow parsing raw &[u8] slices without prior UTF-8 validation (e.g. when parsing from stdin). But I think it's better to have separate types for those cases.

apoelstra · 2024-09-06T13:04:15Z

But I think it's better to have separate types for those cases.

What do you mean by this?

Kixunil · 2024-09-06T19:30:15Z

Basically if we ever add functions that take &[u8] we would make them return separate error types to indicate that the bytes may not be UTF-8.

apoelstra · 2024-09-06T22:03:09Z

Ah. yeah, that's probably best.

Kixunil added enhancement New feature or request 1.0 Issues and PRs required or helping to stabilize the API labels Aug 19, 2024

tankyleo mentioned this issue Nov 9, 2024

Add InvalidCharError::char #124

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Return `char` instead of `u8` from `InvalidCharError::invalid_char` #100

Return `char` instead of `u8` from `InvalidCharError::invalid_char` #100

Kixunil commented Aug 19, 2024

apoelstra commented Aug 19, 2024

Kixunil commented Aug 19, 2024

tcharding commented Sep 4, 2024

apoelstra commented Sep 4, 2024

tcharding commented Sep 4, 2024

Kixunil commented Sep 4, 2024 •

edited

Loading

apoelstra commented Sep 4, 2024

Kixunil commented Sep 4, 2024

apoelstra commented Sep 4, 2024

Kixunil commented Sep 5, 2024

apoelstra commented Sep 6, 2024

Kixunil commented Sep 6, 2024

apoelstra commented Sep 6, 2024

Return char instead of u8 from InvalidCharError::invalid_char #100

Return char instead of u8 from InvalidCharError::invalid_char #100

Comments

Kixunil commented Aug 19, 2024

apoelstra commented Aug 19, 2024

Kixunil commented Aug 19, 2024

tcharding commented Sep 4, 2024

apoelstra commented Sep 4, 2024

tcharding commented Sep 4, 2024

Kixunil commented Sep 4, 2024 • edited Loading

apoelstra commented Sep 4, 2024

Kixunil commented Sep 4, 2024

apoelstra commented Sep 4, 2024

Kixunil commented Sep 5, 2024

apoelstra commented Sep 6, 2024

Kixunil commented Sep 6, 2024

apoelstra commented Sep 6, 2024

Return `char` instead of `u8` from `InvalidCharError::invalid_char` #100

Return `char` instead of `u8` from `InvalidCharError::invalid_char` #100

Kixunil commented Sep 4, 2024 •

edited

Loading