Broken Text-Encoding in Filenames after Upload #2825

pschichtel · 2024-10-21T22:40:07Z

I noticed this recently, but not sure when exactly this started happening, possibly with the 0.42 update.

Any german Umlaute are affected by this, but I assume this might be a general UTF-8 encoding issue.

Here it is in the upload form before uploading:

Here is the document after uploading without modifying it:

I feels like a classic case of interpreting UTF-8 as ASCII/ISO 8859-1 on the byte level.

pschichtel · 2024-10-21T22:54:25Z

The POST request uploading the file has the broken name already.

pschichtel · 2024-10-21T23:01:07Z

I haven't seen an obvious commit that might have broken this.

eikek · 2024-10-22T12:25:49Z

Oh, that is very strange! I didn't change anything on the client for some time. I'll need to reproduce it here. What browser are you using?

pschichtel · 2024-10-22T13:15:34Z

I did this on Firefox, but I can also test with chromium.

pschichtel · 2024-10-22T22:20:10Z

Here is an example file that reproduces this for me in Firefox and Chromium: äöüÄÖÜß.pdf

pschichtel · 2024-10-22T22:22:18Z

It's interesting to see Firefox (left) and Chromium (right) display the characters differently.

nekrondev · 2024-11-01T10:36:28Z

#2842 might be related to this (I can approve this behavior with Firefox and 0.42.0, too).

nekrondev · 2024-11-01T18:10:06Z

So it seems that

a current web browser like Firefox or Chrome will send an UTF-8 based filename like e.g. "Vertragsübersicht" as ISO-8859-1 encoded string inside the Content-Disposition mime multipart header. This will result into a wrong formatted Docspell filename, because http4s will decode the UTF-8 formatted string as ISO-8859-1 (that's what RFC 5987 tells, see https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition)

Here is what will happen doing the same wrong conversion using Python:

>>> bytes("Vertragsübersicht".encode("utf-8")).decode("iso-8859-1")
'VertragsÃ¼bersicht'

Docspells Elm UI needs to force the special filename*=UTF-8'' attribute inside the Content-Disposition header. Not being an Elm programmer myself, but asking our AI for assistance there should be a method for the Http package called Http.filePartWithHeaders that should allow Elm to provide a dedicated Content-Disposition adding the nowadays default UTF-8 encoding for the filename.

For reference AI suggested the following code as an example to provide a dedicated header:

encodeURIComponent : String -> String
encodeURIComponent str =
    String.join ""
        (List.map encodeChar (String.toList str))


encodeChar : Char -> String
encodeChar char =
    case char of
        'ä' ->
            "%C3%A4"

        _ ->
            String.fromChar char


fileParts : List File -> List (Http.Part msg)
fileParts files =
    List.map (\f -> 
        let
            filename = "ä.txt"
            encodedFilename = encodeURIComponent filename
            contentDisposition = "form-data; name=\"file[]\"; filename*=UTF-8''" ++ encodedFilename
        in
        Http.filePartWithHeaders "file[]" f [ ( "Content-Disposition", contentDisposition ) ]
    ) files

So somehow parts of this code example must be integrated to

https://github.com/eikek/docspell/blob/fc3629e3b7efd3e68bf29c9804bac03dfeaab53a/modules/webapp/src/main/elm/Api.elm#L1075C9-L1076C60

I've currently no Elm/Scala dev environment setup so can't tinker around with the frontend code, but hope it helps to understand the encoding problems.

PS:

(gibt's hier).

pschichtel · 2024-11-02T17:49:09Z

@nekrondev sounds like a good solution. What I still wonder about: Why did it a become an issue now? Especially since @eikek didn't change anything about the client. The http4s version used since right after the the 0.41.0 release contains this PR, which seems to perfectly explain the behavior: http4s/http4s#7419

nekrondev · 2024-11-03T11:24:12Z

Yea, that PR makes sense and the maintainers are right that filename*= is forbidden for newer HTML5 RFCs. That's why they convert the UTF-8 encoded filename= attribute back to ISO-8859-1 default and allow manual transformation by http4s API methods if a specific conversation is needed. Fixing the ELM web UI by adding filename*= won't work, because it's no longer supported by http4s framework. The burden it takes will be on Dospells backend to re-encode the ISO8859-1 filename string back to UTF-8. The main problem I see here is that you get no reliable information from the browser which encoding should be used. The legacy filename*= provided that information, but the HTML5 RFC mentioned in that https4 PR tells us otherwise that it's forbidden to do so.

The back to UTF-8 encoding I think needs to be fixed here where the multipart filenames are processed.

pschichtel · 2024-11-11T00:59:40Z

The issue that I still see: We don't have any information on what the encoding originally was, right? Would we just blindly reinterpret it as UTF-8 in the hope that things at least don't get worse? The last section of https://datatracker.ietf.org/doc/html/rfc7578#section-4.2 suggest that could be a reasonable approach. It also seems like that's what http4s' filename method defaults to. I'll send a PR.

pschichtel · 2024-11-11T01:08:01Z

@nekrondev @eikek #2853

pschichtel linked a pull request Nov 11, 2024 that will close this issue

Use http4s' new Part.filename method #2853

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Broken Text-Encoding in Filenames after Upload #2825

Broken Text-Encoding in Filenames after Upload #2825

pschichtel commented Oct 21, 2024

pschichtel commented Oct 21, 2024

pschichtel commented Oct 21, 2024

eikek commented Oct 22, 2024

pschichtel commented Oct 22, 2024

pschichtel commented Oct 22, 2024

pschichtel commented Oct 22, 2024

nekrondev commented Nov 1, 2024 •

edited

Loading

nekrondev commented Nov 1, 2024

pschichtel commented Nov 2, 2024 •

edited

Loading

nekrondev commented Nov 3, 2024

pschichtel commented Nov 11, 2024

pschichtel commented Nov 11, 2024

Broken Text-Encoding in Filenames after Upload #2825

Broken Text-Encoding in Filenames after Upload #2825

Comments

pschichtel commented Oct 21, 2024

pschichtel commented Oct 21, 2024

pschichtel commented Oct 21, 2024

eikek commented Oct 22, 2024

pschichtel commented Oct 22, 2024

pschichtel commented Oct 22, 2024

pschichtel commented Oct 22, 2024

nekrondev commented Nov 1, 2024 • edited Loading

nekrondev commented Nov 1, 2024

pschichtel commented Nov 2, 2024 • edited Loading

nekrondev commented Nov 3, 2024

pschichtel commented Nov 11, 2024

pschichtel commented Nov 11, 2024

nekrondev commented Nov 1, 2024 •

edited

Loading

pschichtel commented Nov 2, 2024 •

edited

Loading