EOF error when sending non-UTF-8 strings #376

dorjorg · 2020-09-12T16:45:20Z

When you send a non-English message in an event, it prints "EOF" because UTF-8 is not supported. Please fix it.
Thanks

adrianmxb · 2020-09-12T17:16:46Z

This is not easy to implement, because of how strings are encoded into an engine.io payload and because JavaScript uses UTF-16 for strings.

I've got it working as proof of concept in a side project of mine though, so it is possible.
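
To make the mismatch concrete, here is a minimal sketch (not code from go-socket.io; `utf16Len` is a hypothetical helper) that compares Go's byte length with the UTF-16 code unit count that JavaScript's `length` reports:

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// utf16Len returns the number of UTF-16 code units needed for s,
// which is what JavaScript's String.prototype.length reports.
func utf16Len(s string) int {
	return len(utf16.Encode([]rune(s)))
}

func main() {
	for _, s := range []string{"hello", "héllo", "日本語", "😀"} {
		fmt.Printf("%q: bytes=%d runes=%d utf16=%d\n",
			s, len(s), utf8.RuneCountInString(s), utf16Len(s))
	}
}
```

For plain ASCII all three counts agree, which is presumably why the problem only shows up with non-English text.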

erkie · 2020-09-16T07:07:38Z

I take it this only happens when not using something like JSON that encodes UTF-8 characters?

adrianmxb · 2020-09-22T10:56:49Z

Yeah, this happens when you have raw UTF-8 data.

adrianmxb · 2020-09-23T19:27:12Z

@erkie The biggest problem is actually that JavaScript uses UTF-16 and engine.io payloads contain the data size, which the JavaScript side counts in UTF-16 code units rather than bytes.
Decoding: https://github.com/adrianmxb/goseio/blob/4ea3bf17ed72b00181159f19f1ff7d01d8f92973/pkg/eio/parser/parser.go#L123-L184
Encoding:
https://github.com/adrianmxb/goseio/blob/4ea3bf17ed72b00181159f19f1ff7d01d8f92973/pkg/eio/parser/parser.go#L74-L90

I built a small proof-of-concept library that handles it, though it is horrible performance-wise.
Maybe we can use some parts or ideas of it when we plan to implement it here.
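
As an illustration of why the size matters, here is a rough sketch of `<length>:<packet>` framing, assuming the length counts UTF-16 code units as a JavaScript peer would for engine.io v3 text payloads. The names `units`, `encodeFrame` and `decodeFrame` are hypothetical and are not taken from either linked parser:

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
	"unicode/utf8"
)

// units reports how many UTF-16 code units a rune occupies.
func units(r rune) int {
	if r >= 0x10000 {
		return 2 // outside the BMP: encoded as a surrogate pair
	}
	return 1
}

// encodeFrame returns "<length>:<packet>" with length in UTF-16 code units.
func encodeFrame(packet string) string {
	n := 0
	for _, r := range packet {
		n += units(r)
	}
	return strconv.Itoa(n) + ":" + packet
}

// decodeFrame splits the first frame off payload and returns the packet
// and the unread remainder.
func decodeFrame(payload string) (string, string, error) {
	sep := strings.IndexByte(payload, ':')
	if sep < 0 {
		return "", "", errors.New("missing ':' after length")
	}
	n, err := strconv.Atoi(payload[:sep])
	if err != nil || n < 0 {
		return "", "", errors.New("invalid length prefix")
	}
	body := payload[sep+1:]
	i := 0
	for n > 0 {
		if i >= len(body) {
			return "", "", errors.New("payload shorter than declared length")
		}
		// Consume whole runes so a multi-byte character is never split.
		r, size := utf8.DecodeRuneInString(body[i:])
		n -= units(r)
		i += size
	}
	if n < 0 {
		return "", "", errors.New("length ends inside a surrogate pair")
	}
	return body[:i], body[i:], nil
}

func main() {
	payload := encodeFrame("4héllo 😀") + encodeFrame("2")
	for payload != "" {
		packet, rest, err := decodeFrame(payload)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("%q\n", packet)
		payload = rest
	}
}
```

Here `encodeFrame("4héllo 😀")` declares a length of 9 even though the packet is 12 bytes long, so a decoder that treats the length as a byte count stops early and misreads the rest of the payload.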

grahamjenson · 2023-07-28T10:59:35Z

I have no idea about performance, but changing the Read method in payload/decoder.go to the code below works:

func (d *decoder) Read(p []byte) (int, error) {
	if d.b64Reader != nil {
		return d.b64Reader.Read(p)
	}
	dd, err := d.limitReader.Read(p)
	unicodeCount := 0
	for i := range p[:dd] {
		b := p[i]
		// Add the additional bytes of multi-byte unicode characters
		if b>>3 == 30 {
			// starts with 11110 4 byte unicode char
			unicodeCount = unicodeCount + 3
		} else if b>>4 == 14 {
			// starts with 1110 3 byte unicode char
			unicodeCount = unicodeCount + 2
		} else if b>>5 == 6 {
			// starts with 110 2 byte unicode char
			unicodeCount = unicodeCount + 1
		}

	}
	d.limitReader.N = d.limitReader.N + int64(unicodeCount)
	return dd, err
}

This works by scanning the read bytes for unicode header bytes and then adding the additional bytes onto the limit reader.

I may have to also change the encoder, but I will look at that soon :)

This code and a bunch of other changes are over at my fork https://github.com/grahamjenson/go-socket.io, where I am also writing a client and fixing other issues (like Ack packet decoding...)

grahamjenson · 2023-07-29T09:24:54Z

The above doesn't quite work. The differences between UTF-8 and UTF-16 are annoying. I got something working, but I'm pretty sure it will only work in most cases. (It is a really annoying problem, which is why EIO v4 looks much better.)

Decoder

func (d *decoder) Read(p []byte) (int, error) {
	if d.b64Reader != nil {
		return d.b64Reader.Read(p)
	}
	dd, err := d.limitReader.Read(p)
	unicodeCount := 0
	for i := range p[:dd] {
		b := p[i]
		if b>>3 == 30 {
			// starts with 11110: 4-byte unicode char, probably 2 code units in JS, so 2 extra bytes
			unicodeCount = unicodeCount + 2
		} else if b>>4 == 14 {
			// starts with 1110: 3-byte unicode char, probably 1 code unit in JS, so 2 extra bytes
			unicodeCount = unicodeCount + 2
		} else if b>>5 == 6 {
			// starts with 110: 2-byte unicode char, probably 1 code unit in JS, so 1 extra byte
			unicodeCount = unicodeCount + 1
		}
	}

	d.limitReader.N = d.limitReader.N + int64(unicodeCount)
	return dd, err
}

Encoder


func (e *encoder) writeTextHeader() error {
	err := writeTextLen(e.calcCodeUnitLength(), &e.header)
	if err == nil {
		err = e.header.WriteByte(e.pt.StringByte())
	}
	return err
}

// calcCodeUnitLength returns the length of the cached frame in UTF-16 code units.
// The count starts at 1, which appears to account for the packet type byte
// written by writeTextHeader.
func (e *encoder) calcCodeUnitLength() int64 {
	var l int64 = 1
	var codeUnitSize int64
	bytes := e.frameCache.Bytes()
	for i := range bytes {
		b := bytes[i]
		if b>>3 == 30 {
			// starts with 11110: 4-byte unicode char, probably 2 code units in JS
			codeUnitSize = 2
		} else if b>>4 == 14 {
			// starts with 1110: 3-byte unicode char, probably 1 code unit in JS
			codeUnitSize = 1
		} else if b>>5 == 6 {
			// starts with 110: 2-byte unicode char, probably 1 code unit in JS
			codeUnitSize = 1
		} else if b>>6 == 2 {
			// starts with 10: continuation byte, already counted with its lead byte
			codeUnitSize = 0
		} else {
			// plain ASCII byte: 1 code unit
			codeUnitSize = 1
		}
		l = l + codeUnitSize
	}

	return l
}

grahamjenson · 2023-08-21T09:12:03Z

Fixed(ish) with #608

dorjorg added the bug label Sep 12, 2020

sshaplygin changed the title ~~PLZ SUPPORT UTF-8 [BUG]~~ [BUG] plz support utf-8 Sep 14, 2020

sshaplygin changed the title ~~[BUG] plz support utf-8~~ [BUG] Support utf-8 Sep 14, 2020

sshaplygin added enhancement and removed bug labels Sep 14, 2020

erkie changed the title ~~[BUG] Support utf-8~~ Support utf-8 Sep 16, 2020

erkie added the bug label Sep 16, 2020

erkie changed the title ~~Support utf-8~~ EOF error when sending non-UTF-8 strings Sep 16, 2020

googollee closed this as completed Sep 29, 2024
