Broken Text-Encoding in Filenames after Upload #2825
Comments
I haven't seen an obvious commit that might have broken this. |
Oh, that is very strange! I didn't change anything on the client for some time. I'll need to reproduce it here. What browser are you using? |
I did this on Firefox, but I can also test with chromium. |
Here is an example file that reproduces this for me in Firefox and Chromium: äöüÄÖÜß.pdf |
#2842 might be related to this (I can confirm this behavior with Firefox and 0.42.0, too). |
So it seems that
Here is what happens when doing the same wrong conversion in Python:

    >>> "Vertragsübersicht".encode("utf-8").decode("iso-8859-1")
    'VertragsÃ¼bersicht'
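To make the byte-level picture concrete, here is a minimal sketch using the example filename from above. It only illustrates the mechanism (UTF-8 bytes read back as ISO-8859-1, then reversed); it is not taken from the project's code, and the variable names are chosen for illustration.

```python
# Sketch: each non-ASCII character is two bytes in UTF-8; reading those
# bytes as ISO-8859-1 turns every such character into two wrong characters.
original = "äöüÄÖÜß.pdf"

utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("iso-8859-1")

# ISO-8859-1 maps every byte to exactly one character, so the wrong
# conversion is lossless and can be undone by running it backwards.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(len(original), len(utf8_bytes), len(garbled))
print(repaired == original)
```

Because no information is lost in the wrong direction, a server-side repair is possible in principle, as long as the bytes really were UTF-8 to begin with.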
For reference, AI suggested the following code as an example to provide a dedicated header:

    encodeURIComponent : String -> String
    encodeURIComponent str =
        String.join ""
            (List.map encodeChar (String.toList str))

    encodeChar : Char -> String
    encodeChar char =
        case char of
            'ä' ->
                "%C3%A4"

            _ ->
                String.fromChar char

    fileParts : List File -> List (Http.Part msg)
    fileParts files =
        List.map
            (\f ->
                let
                    filename =
                        "ä.txt"

                    encodedFilename =
                        encodeURIComponent filename

                    contentDisposition =
                        "form-data; name=\"file[]\"; filename*=UTF-8''" ++ encodedFilename
                in
                Http.filePartWithHeaders "file[]" f [ ( "Content-Disposition", contentDisposition ) ]
            )
            files

So somehow parts of this code example must be integrated. I currently have no Elm/Scala dev environment set up, so I can't tinker with the frontend code, but I hope it helps to understand the encoding problems.

PS:
|
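The example above hard-codes the percent-encoding for 'ä' only. As a rough sketch of the general encoding step, the same header value can be built in Python with the standard library's `urllib.parse.quote`. The helper names `percent_encode_filename` and `content_disposition` are made up for illustration. Note also that RFC 7578 section 4.2 says the `filename*` form from RFC 5987 must not be used in multipart/form-data, so this shows the encoding idea rather than a recommended fix.

```python
from urllib.parse import quote


def percent_encode_filename(filename: str) -> str:
    """Percent-encode every byte of the UTF-8 form except unreserved characters."""
    # safe="" means that reserved characters such as "/" are encoded as well.
    return quote(filename, safe="", encoding="utf-8")


def content_disposition(field: str, filename: str) -> str:
    """Build a header value in the style of the Elm example (hypothetical helper)."""
    encoded = percent_encode_filename(filename)
    return 'form-data; name="{}"; filename*=UTF-8\'\'{}'.format(field, encoded)


print(content_disposition("file[]", "ä.txt"))
```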
@nekrondev sounds like a good solution. What I still wonder about: Why did it become an issue now? Especially since @eikek didn't change anything about the client. The http4s version used since right after the 0.41.0 release contains this PR, which seems to perfectly explain the behavior: http4s/http4s#7419 |
Yea, that PR makes sense and the maintainers are right that The back to UTF-8 encoding I think needs to be fixed here where the multipart filenames are processed. |
The issue that I still see: We don't have any information on what the encoding originally was, right? Would we just blindly reinterpret it as UTF-8 in the hope that things at least don't get worse? The last section of https://datatracker.ietf.org/doc/html/rfc7578#section-4.2 suggests that this could be a reasonable approach. It also seems like that's what http4s' filename method defaults to. I'll send a PR. |
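The "reinterpret as UTF-8 and hope it doesn't get worse" idea could look roughly like the following. This is a Python sketch of the logic only, not the actual server-side fix; `reinterpret_as_utf8` is a hypothetical name, and it assumes the filename arrived as a string that was decoded as ISO-8859-1.

```python
def reinterpret_as_utf8(name: str) -> str:
    """Undo an ISO-8859-1 decoding if the underlying bytes are valid UTF-8."""
    try:
        # Recover the raw bytes that were (presumably) sent by the client.
        raw = name.encode("iso-8859-1")
    except UnicodeEncodeError:
        # Characters outside Latin-1: the name was not decoded as ISO-8859-1.
        return name
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so keep the name unchanged rather than make it worse.
        return name


print(reinterpret_as_utf8("VertragsÃ¼bersicht"))
print(reinterpret_as_utf8("ü.pdf"))
```

The fallback keeps pure ASCII names and genuine ISO-8859-1 names whose bytes are not valid UTF-8 untouched. A genuine ISO-8859-1 name that happens to form valid UTF-8 would still be changed, which is the remaining risk of guessing the encoding.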
I noticed this recently, but I'm not sure when exactly this started happening, possibly with the 0.42 update.
Any German umlauts are affected by this, but I assume this might be a general UTF-8 encoding issue.
Here it is in the upload form before uploading:
Here is the document after uploading without modifying it:
It feels like a classic case of interpreting UTF-8 as ASCII/ISO 8859-1 on the byte level.