-
Notifications
You must be signed in to change notification settings - Fork 476
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BuiltinByteString literals aren't UTF8 encoded. #6655
base: master
Are you sure you want to change the base?
Conversation
...lc/evaluation/builtin/semantics/bls12_381_G1_scalarMul/mul-neg-one/mul-neg-one.uplc.expected
Outdated
Show resolved
Hide resolved
@@ -699,13 +732,21 @@ compileExpr e = traceCompilation 2 ("Compiling expr:" GHC.<+> GHC.ppr e) $ do | |||
_ -> throwPlain $ CompilationError "No info for Pair builtin" | |||
|
|||
-- TODO: Maybe share this to avoid repeated lookups. Probably cheap, though. | |||
(stringTyName, sbsName) <- case (Map.lookup ''Builtins.BuiltinString nameInfo, Map.lookup 'Builtins.stringToBuiltinString nameInfo) of | |||
(stringTyName, sbsName) <- | |||
case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrap long line
...t-cases/uplc/evaluation/builtin/semantics/replicateByte/case-07/case-07.uplc.budget.expected
Outdated
Show resolved
Hide resolved
| let name = GHC.getName n | ||
, name == GHC.unpackCStringName || name == GHC.unpackCStringUtf8Name -> | ||
stringExprContent expr | ||
data StringExprContentAs = AsBytes | AsText |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stringExprContent
function is used to extract bytes for the BuiltinByteString
as well as for the BuiltinString
literals;
I have added this datatype as its parameter to distinguish between these scenarios as we want to utf-8 decode BuiltinByteString
literals only.
This isn't a full UTF-8 decoder: it only decodes the subset of UTF-8 that | ||
is expected to be found in bytestring literals: 0x00 - 0xFF | ||
-} | ||
fromUtf8 :: [Word8] -> Maybe [Word8] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried using Data.Text.Encoding.decodeUtf8
and it didn't work: the error was saying about invalid utf-8 bytes. Maybe the implementation in GHC is not compatible with the one in Text (Currently GHC ships with at least five UTF-8 implementations)
It turned out that decoding rules are not super complex so I have implemented them below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a unit test to prove it works correctly on bytes 00 - FF
(bsTyName, sbbsName) <- case (Map.lookup ''Builtins.BuiltinByteString nameInfo, Map.lookup 'Builtins.stringToBuiltinByteString nameInfo) of | ||
(Just t1, Just t2) -> pure (GHC.getName t1, GHC.getName t2) | ||
_ -> throwPlain $ CompilationError "No info for ByteString builtin" | ||
(builtinByteStringTyName, sbbsName) <- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrap another long line
Just tc -> | ||
if | ||
| GHC.getName tc == builtinByteStringTyName -> | ||
case stringExprContent AsBytes (strip content) of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extracting content as bytes undoes the UTF8 encoding for BuiltinByteString
Just bs -> | ||
pure $ PIR.Constant annMayInline $ PLC.someValue bs | ||
| GHC.getName tc == stringTyName -> | ||
case stringExprContent AsText (strip content) of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extracting content as text leaves original UTF8 encoding for BuiltinString
throwPlain $ | ||
CompilationError $ | ||
"Text literal with invalid UTF-8 content: " <> (T.pack $ show err) | ||
(strip -> GHC.Var n) `GHC.App` (strip -> stringExprContent AsText -> Just bs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrapped long line
@@ -91,21 +91,23 @@ map2 = | |||
\n -> | |||
let m1 = | |||
Data.AssocMap.unsafeFromList | |||
[ (n PlutusTx.+ 1, "one") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that this test relies on UTF-8 encoding for BuiltinByteStrings is accidental
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've re-worked the test to use Integer
s instead of BuiltinByteString
s.
0f08f16
to
439f7e5
Compare
439f7e5
to
b4f1bb2
Compare
8c7865f
to
7cd0c62
Compare
7cd0c62
to
03011f0
Compare
03011f0
to
b718e0b
Compare
b718e0b
to
5ff83c0
Compare
BuiltinByteString
literals are utf-8 decoded in the plugin.Haskell.IsString BuiltinByteString
instance delegates to the HaskellByteString
instance.BuiltinByteString
without the UTF8 conversion.Closes #6505