Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BuiltinByteString literals aren't UTF8 encoded. #6655

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Unisay
Copy link
Contributor

@Unisay Unisay commented Nov 12, 2024

  • BuiltinByteString literals are utf-8 decoded in the plugin.
  • Haskell.IsString BuiltinByteString instance delegates to the Haskell ByteString instance.
  • Unit tests to verify that bytes in range from 0 to 255 (\x00 - \xFF) found in string literals are passed to the BuiltinByteString without the UTF8 conversion.

Closes #6505

@Unisay Unisay self-assigned this Nov 12, 2024
@@ -699,13 +732,21 @@ compileExpr e = traceCompilation 2 ("Compiling expr:" GHC.<+> GHC.ppr e) $ do
_ -> throwPlain $ CompilationError "No info for Pair builtin"

-- TODO: Maybe share this to avoid repeated lookups. Probably cheap, though.
(stringTyName, sbsName) <- case (Map.lookup ''Builtins.BuiltinString nameInfo, Map.lookup 'Builtins.stringToBuiltinString nameInfo) of
(stringTyName, sbsName) <-
case
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrap long line

| let name = GHC.getName n
, name == GHC.unpackCStringName || name == GHC.unpackCStringUtf8Name ->
stringExprContent expr
data StringExprContentAs = AsBytes | AsText
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stringExprContent function is used to extract bytes for the BuiltinByteString as well as for the BuiltinString literals;
I have added this datatype as its parameter to distinguish between these scenarios as we want to utf-8 decode BuiltinByteString literals only.

This isn't a full UTF-8 decoder: it only decodes the subset of UTF-8 that
is expected to be found in bytestring literals: 0x00 - 0xFF
-}
fromUtf8 :: [Word8] -> Maybe [Word8]
Copy link
Contributor Author

@Unisay Unisay Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried using Data.Text.Encoding.decodeUtf8 and it didn't work: the error was saying about invalid utf-8 bytes. Maybe the implementation in GHC is not compatible with the one in Text (Currently GHC ships with at least five UTF-8 implementations)
It turned out that decoding rules are not super complex so I have implemented them below.

Copy link
Contributor Author

@Unisay Unisay Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a unit test to prove it works correctly on bytes 00 - FF

(bsTyName, sbbsName) <- case (Map.lookup ''Builtins.BuiltinByteString nameInfo, Map.lookup 'Builtins.stringToBuiltinByteString nameInfo) of
(Just t1, Just t2) -> pure (GHC.getName t1, GHC.getName t2)
_ -> throwPlain $ CompilationError "No info for ByteString builtin"
(builtinByteStringTyName, sbbsName) <-
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrap another long line

Just tc ->
if
| GHC.getName tc == builtinByteStringTyName ->
case stringExprContent AsBytes (strip content) of
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracting content as bytes undoes the UTF8 encoding for BuiltinByteString

Just bs ->
pure $ PIR.Constant annMayInline $ PLC.someValue bs
| GHC.getName tc == stringTyName ->
case stringExprContent AsText (strip content) of
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracting content as text leaves original UTF8 encoding for BuiltinString

throwPlain $
CompilationError $
"Text literal with invalid UTF-8 content: " <> (T.pack $ show err)
(strip -> GHC.Var n) `GHC.App` (strip -> stringExprContent AsText -> Just bs)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrapped long line

@@ -91,21 +91,23 @@ map2 =
\n ->
let m1 =
Data.AssocMap.unsafeFromList
[ (n PlutusTx.+ 1, "one")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that this test relies on UTF-8 encoding for BuiltinByteStrings is accidental

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've re-worked the test to use Integers instead of BuiltinByteStrings.

@Unisay Unisay force-pushed the yura/builtin-byte-string-literals branch from 0f08f16 to 439f7e5 Compare November 13, 2024 10:54
@Unisay Unisay force-pushed the yura/builtin-byte-string-literals branch from 439f7e5 to b4f1bb2 Compare November 13, 2024 10:57
@Unisay Unisay force-pushed the yura/builtin-byte-string-literals branch from 8c7865f to 7cd0c62 Compare November 13, 2024 11:20
@Unisay Unisay requested review from zliu41 and a team November 13, 2024 11:21
@Unisay Unisay marked this pull request as ready for review November 13, 2024 11:21
@Unisay Unisay force-pushed the yura/builtin-byte-string-literals branch from 7cd0c62 to 03011f0 Compare November 13, 2024 11:23
@Unisay Unisay force-pushed the yura/builtin-byte-string-literals branch from 03011f0 to b718e0b Compare November 13, 2024 12:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix BuiltinByteString literal construction
2 participants