RPM: not all strings are UTF-8 #672

armijnhemel · 2023-05-15T18:45:03Z

In the current rpm.ksy the encoding for strings is set to UTF-8. There are RPM files that fail to parse, because as it turns out not everyone has been playing nice with encodings.

An example is this file from Fedora Core 3:

https://archives.fedoraproject.org/pub/archive/fedora/linux/core/3/x86_64/os/Fedora/RPMS/bash-3.0-17.x86_64.rpm

One of the tags is a record_type_string_array related to ChangeLogs, and some people seem to have used Latin-1 characters instead of UTF-8, for example:

Trond Eivind Glomsr\xf8d <[email protected]> 2.0.5a-10
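
To illustrate why an entry like this breaks a UTF-8 string field, here is a minimal Python sketch. The byte string is reconstructed from the name shown above (only the name, not the full changelog entry), so it is an assumption about the bytes rather than data read from the RPM itself.

```python
# Name from the changelog entry above, with "ø" stored as the single
# Latin-1 byte 0xF8 (assumed reconstruction, not read from the RPM file).
raw = b"Trond Eivind Glomsr\xf8d"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xF8 can never appear in well-formed UTF-8, so strict decoding fails.
    utf8_ok = False

# Latin-1 maps every byte to a code point, so this decode cannot fail.
latin1_text = raw.decode("latin-1")
print(utf8_ok, latin1_text)
```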

Currently record_type_string_array is defined as follows:

  record_type_string_array:
    params:
      - id: num_values
        type: u4
    seq:
      - id: values
        type: strz
        repeat: expr
        repeat-expr: num_values

and the default encoding is UTF-8, so this will obviously not work. I don't know how I could fix this.


generalmimon · 2023-05-15T18:51:02Z

@armijnhemel:

and the default encoding is UTF-8, so this will obviously not work. I don't know how I could fix this.

Well, if there isn't a single character encoding we could specify in the .ksy, the "next best thing" is to downgrade to a byte array:

   record_type_string_array:
     params:
       - id: num_values
         type: u4
     seq:
       - id: values
-        type: strz
+        terminator: 0
         repeat: expr
         repeat-expr: num_values

A byte array is the implicit type in .ksy specs when no type is given but the field size is delimited by size, size-eos: true or terminator.
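
As a rough picture of what the byte-array variant hands back, the following Python sketch mimics reading num_values zero-terminated byte strings from a buffer. It is plain Python rather than Kaitai Struct runtime code; the helper name read_terminated_values and the sample data are hypothetical, and it assumes the terminator is consumed but not included in each value.

```python
from typing import List


def read_terminated_values(buf: bytes, num_values: int) -> List[bytes]:
    """Read num_values NUL-terminated byte strings from buf.

    The terminator is consumed but left out of each returned value.
    """
    values = []
    pos = 0
    for _ in range(num_values):
        # bytes.index raises ValueError if no terminator remains.
        end = buf.index(b"\x00", pos)
        values.append(buf[pos:end])
        pos = end + 1
    return values


# Hypothetical sample: one ASCII entry and one Latin-1 entry.
data = b"first entry\x00Glomsr\xf8d\x00"
values = read_terminated_values(data, 2)
print(values)
```

Because the values stay as raw bytes, nothing is decoded during parsing, and the choice of encoding is left to the caller.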

armijnhemel · 2023-05-15T18:52:57Z

I actually had been thinking about that and looked at the docs, but that seems to indicate that terminator was only for strings. Using a byte array and then processing the strings in an external script would work for me.

armijnhemel · 2023-05-15T19:05:29Z

Thinking a bit more about this: probably this isn't a good idea, as \x00 can be part of a valid UTF-8 string.

armijnhemel · 2023-05-16T19:51:38Z

I found it easier to just work around it like this:

1. Parse regularly (which will parse the vast majority of RPM files out there).
2. If that fails, reparse with a copy of the RPM specification that has the above change (byte array instead of strz).
3. Decode all the strings to valid UTF-8, trying some common encodings.

This is cleaner than trying to fix it here.
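
The workaround described above could be sketched in Python roughly as follows. The names parse_with_fallback, strict_parser, bytes_parser and decode_with_fallbacks are hypothetical; the two parser arguments stand in for whatever parsers are generated from the original rpm.ksy and from the byte-array copy. The sketch also assumes the strict parser signals an encoding problem with UnicodeDecodeError, and the list of encodings is only an example.

```python
def decode_with_fallbacks(raw, encodings=("utf-8", "latin-1")):
    """Decode raw bytes with the first encoding that accepts them."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Only reached if every listed encoding failed; latin-1 never fails,
    # so anything listed after it would not be tried.
    return raw.decode("utf-8", errors="replace")


def parse_with_fallback(data, strict_parser, bytes_parser):
    """Try the regular parser first, then the byte-array variant."""
    try:
        return strict_parser(data)
    except UnicodeDecodeError:
        # Assumption: an invalid string surfaces as UnicodeDecodeError.
        raw_values = bytes_parser(data)
        return [decode_with_fallbacks(v) for v in raw_values]


# Stand-in parsers for demonstration only (not generated from rpm.ksy).
def strict_parser(data):
    return [v.decode("utf-8") for v in data.split(b"\x00") if v]


def bytes_parser(data):
    return [v for v in data.split(b"\x00") if v]


result = parse_with_fallback(b"ok\x00Glomsr\xf8d\x00", strict_parser, bytes_parser)
print(result)
```

Trying UTF-8 first keeps correctly encoded strings intact, and only strings that are not valid UTF-8 fall through to the legacy encoding.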

armijnhemel changed the title ~~RPM: not all strings UTF-8~~ RPM: not all strings are UTF-8 May 15, 2023

generalmimon added the bug label May 15, 2023
