Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broken Text-Encoding in Filenames after Upload #2825

Open
pschichtel opened this issue Oct 21, 2024 · 12 comments · May be fixed by #2853
Open

Broken Text-Encoding in Filenames after Upload #2825

pschichtel opened this issue Oct 21, 2024 · 12 comments · May be fixed by #2853

Comments

@pschichtel
Copy link

I noticed this recently, but not sure when exactly this started happening, possibly with the 0.42 update.

Any german Umlaute are affected by this, but I assume this might be a general UTF-8 encoding issue.

Here it is in the upload form before uploading:
image

Here is the document after uploading without modifying it:
image

I feels like a classic case of interpreting UTF-8 as ASCII/ISO 8859-1 on the byte level.

@pschichtel
Copy link
Author

image

The POST request uploading the file has the broken name already.

@pschichtel
Copy link
Author

I haven't seen an obvious commit that might have broken this.

@eikek
Copy link
Owner

eikek commented Oct 22, 2024

Oh, that is very strange! I didn't change anything on the client for some time. I'll need to reproduce it here. What browser are you using?

@pschichtel
Copy link
Author

I did this on Firefox, but I can also test with chromium.

@pschichtel
Copy link
Author

Here is an example file that reproduces this for me in Firefox and Chromium: äöüÄÖÜß.pdf

@pschichtel
Copy link
Author

image

It's interesting to see Firefox (left) and Chromium (right) display the characters differently.

@nekrondev
Copy link
Contributor

nekrondev commented Nov 1, 2024

#2842 might be related to this (I can approve this behavior with Firefox and 0.42.0, too).

@nekrondev
Copy link
Contributor

So it seems that

  • a current web browser like Firefox or Chrome will send an UTF-8 based filename like e.g. "Vertragsübersicht" as ISO-8859-1 encoded string inside the Content-Disposition mime multipart header. This will result into a wrong formatted Docspell filename, because http4s will decode the UTF-8 formatted string as ISO-8859-1 (that's what RFC 5987 tells, see https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition)

Here is what will happen doing the same wrong conversion using Python:

>>> bytes("Vertragsübersicht".encode("utf-8")).decode("iso-8859-1")
'Vertragsübersicht'
  • Docspells Elm UI needs to force the special filename*=UTF-8'' attribute inside the Content-Disposition header. Not being an Elm programmer myself, but asking our AI for assistance there should be a method for the Http package called Http.filePartWithHeaders that should allow Elm to provide a dedicated Content-Disposition adding the nowadays default UTF-8 encoding for the filename.

For reference AI suggested the following code as an example to provide a dedicated header:

encodeURIComponent : String -> String
encodeURIComponent str =
    String.join ""
        (List.map encodeChar (String.toList str))


encodeChar : Char -> String
encodeChar char =
    case char of
        'ä' ->
            "%C3%A4"

        _ ->
            String.fromChar char


fileParts : List File -> List (Http.Part msg)
fileParts files =
    List.map (\f -> 
        let
            filename = "ä.txt"
            encodedFilename = encodeURIComponent filename
            contentDisposition = "form-data; name=\"file[]\"; filename*=UTF-8''" ++ encodedFilename
        in
        Http.filePartWithHeaders "file[]" f [ ( "Content-Disposition", contentDisposition ) ]
    ) files

So somehow parts of this code example must be integrated to

https://github.com/eikek/docspell/blob/fc3629e3b7efd3e68bf29c9804bac03dfeaab53a/modules/webapp/src/main/elm/Api.elm#L1075C9-L1076C60

I've currently no Elm/Scala dev environment setup so can't tinker around with the frontend code, but hope it helps to understand the encoding problems.

PS:

grafik
(gibt's hier).

@pschichtel
Copy link
Author

pschichtel commented Nov 2, 2024

@nekrondev sounds like a good solution. What I still wonder about: Why did it a become an issue now? Especially since @eikek didn't change anything about the client. The http4s version used since right after the the 0.41.0 release contains this PR, which seems to perfectly explain the behavior: http4s/http4s#7419

@nekrondev
Copy link
Contributor

Yea, that PR makes sense and the maintainers are right that filename*= is forbidden for newer HTML5 RFCs. That's why they convert the UTF-8 encoded filename= attribute back to ISO-8859-1 default and allow manual transformation by http4s API methods if a specific conversation is needed. Fixing the ELM web UI by adding filename*= won't work, because it's no longer supported by http4s framework. The burden it takes will be on Dospells backend to re-encode the ISO8859-1 filename string back to UTF-8. The main problem I see here is that you get no reliable information from the browser which encoding should be used. The legacy filename*= provided that information, but the HTML5 RFC mentioned in that https4 PR tells us otherwise that it's forbidden to do so.

The back to UTF-8 encoding I think needs to be fixed here where the multipart filenames are processed.

@pschichtel
Copy link
Author

The issue that I still see: We don't have any information on what the encoding originally was, right? Would we just blindly reinterpret it as UTF-8 in the hope that things at least don't get worse? The last section of https://datatracker.ietf.org/doc/html/rfc7578#section-4.2 suggest that could be a reasonable approach. It also seems like that's what http4s' filename method defaults to. I'll send a PR.

@pschichtel pschichtel linked a pull request Nov 11, 2024 that will close this issue
@pschichtel
Copy link
Author

@nekrondev @eikek #2853

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants
@pschichtel @eikek @nekrondev and others