I have an NSString that was converted from little-endian Unicode NSData. The first character of the string is the Unicode byte-order marker (BOM), which in little endian is 0xFF 0xFE.
The first time _loadMoreIfNecessary calls initWithBytes:length:encoding:, the BOM is in the buffer and the buffer is read correctly. However, when the second buffer is converted there is no BOM, and the data is treated as big-endian. This means that the second and all subsequent buffers of data are corrupted.
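Since Objective-C isn't easy to run standalone, here is a Python analogue of the failure mode (a sketch, not the original code): decoding each buffer independently loses the decoder's context, so a buffer boundary that falls mid-sequence, or a later buffer that lacks the BOM, cannot be decoded correctly on its own.

```python
import codecs

# Build little-endian UTF-16 data with a leading BOM (0xFF 0xFE),
# as described in the report.
text = "héllo"
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Simulate reading fixed-size buffers; the split point (5 bytes)
# deliberately lands in the middle of a 2-byte code unit.
chunk1, chunk2 = data[:5], data[5:]

try:
    # Stateless per-buffer decoding, analogous to calling
    # initWithBytes:length:encoding: on each buffer independently.
    chunk1.decode("utf-16")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print("independent decode failed:", exc)
```

Even when the split happens to fall on a code-unit boundary, the second buffer has no BOM, which is exactly the ambiguity that corrupts the subsequent buffers in the report.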
In one sense, the bug is that _loadMoreIfNecessary is converting each buffer of text independently, rather than maintaining conversion context from one buffer to the next. In general, text encodings require context to handle multi-byte characters, byte order markers and such. A more robust version of this function would use the lower-level Text Encoding Converter, which maintains context from one buffer to the next.
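The context-preserving approach can be sketched with Python's incremental decoder standing in for the Text Encoding Converter (an analogue, not Apple's API): it consumes the BOM once, remembers the detected endianness, and buffers any partial code unit across buffer boundaries.

```python
import codecs

text = "héllo"
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
chunk1, chunk2 = data[:5], data[5:]  # boundary splits a code unit

# One decoder instance carries state from one buffer to the next,
# which is the behavior _loadMoreIfNecessary would need.
decoder = codecs.getincrementaldecoder("utf-16")()
result = decoder.decode(chunk1) + decoder.decode(chunk2, final=True)
print(result)  # the BOM is consumed once; both buffers decode correctly
```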
But an easier fix might be to change initWithCSVString: to use a fixed encoding like NSUTF16BigEndianStringEncoding rather than calling [csv fastestEncoding], which evaluates to NSUnicodeStringEncoding, whose byte order is ambiguous. I believe that using an unambiguous encoding would prevent the error, even if it's not as general a solution as using Text Encoding Converter.
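The fixed-encoding fix can also be sketched in Python (again an analogue, assuming buffers split on 2-byte code-unit boundaries): with an explicit-endianness encoding, a BOM-less second buffer can never be misread as big-endian, and the BOM itself is just a U+FEFF character that can be stripped.

```python
import codecs

text = "héllo"
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
chunk1, chunk2 = data[:6], data[6:]  # even split, on a code-unit boundary

# Decode every buffer with a fixed, unambiguous encoding
# (utf-16-le here, matching the 0xFF 0xFE BOM in the report).
decoded = "".join(c.decode("utf-16-le") for c in (chunk1, chunk2))

# Under a fixed encoding the BOM decodes to U+FEFF; strip it.
print(decoded.lstrip("\ufeff"))
```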
The convenience initializers now use a fixed encoding (NSUTF8StringEncoding), but this would still be an issue for NSInputStreams provided to the designated initializer.