This repository has been archived by the owner on Sep 29, 2024. It is now read-only.

EOF error when sending non-UTF-8 strings #376

Closed
dorjorg opened this issue Sep 12, 2020 · 7 comments

Comments

@dorjorg

dorjorg commented Sep 12, 2020

When you send a non-English message in an event, it prints "EOF" because UTF-8 text is not supported. Please fix it.
Thanks

@dorjorg dorjorg added the bug label Sep 12, 2020
@adrianmxb
Contributor

adrianmxb commented Sep 12, 2020

This is not easy to implement, because of how data is encoded into an engine.io payload and the fact that JavaScript uses UTF-16 encoding for strings.

I've got it working as proof of concept in a side project of mine though, so it is possible.

@sshaplygin sshaplygin changed the title PLZ SUPPORT UTF-8 [BUG] [BUG] plz support utf-8 Sep 14, 2020
@sshaplygin sshaplygin changed the title [BUG] plz support utf-8 [BUG] Support utf-8 Sep 14, 2020
@erkie erkie changed the title [BUG] Support utf-8 Support utf-8 Sep 16, 2020
@erkie erkie added the bug label Sep 16, 2020
@erkie erkie changed the title Support utf-8 EOF error when sending non-UTF-8 strings Sep 16, 2020
@erkie
Collaborator

erkie commented Sep 16, 2020

I take it this only happens when not using something like JSON that encodes UTF-8 characters?

@adrianmxb
Contributor

Yeah, this happens when you have raw utf-8 data.

@adrianmxb
Contributor

adrianmxb commented Sep 23, 2020

@erkie The biggest problem is actually that JavaScript uses UTF-16 and engine.io payloads contain the data size.
Decoding: https://github.com/adrianmxb/goseio/blob/4ea3bf17ed72b00181159f19f1ff7d01d8f92973/pkg/eio/parser/parser.go#L123-L184
Encoding:
https://github.com/adrianmxb/goseio/blob/4ea3bf17ed72b00181159f19f1ff7d01d8f92973/pkg/eio/parser/parser.go#L74-L90

I built a small proof-of-concept library that handles it, though its performance is horrible.
Maybe we can use some parts or ideas of it when we plan to implement it here.
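
To make the size mismatch concrete, here is a minimal Go sketch. It is not taken from go-socket.io or goseio, and the helper name `utf16Len` is hypothetical. It assumes the length prefix written by a JavaScript peer counts UTF-16 code units (as `String.length` does), while Go's `len` on a string counts UTF-8 bytes.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf16Len is a hypothetical helper: it returns the number of UTF-16
// code units needed to represent s, i.e. what JavaScript's
// String.length would report for the same text.
func utf16Len(s string) int {
	return len(utf16.Encode([]rune(s)))
}

func main() {
	for _, s := range []string{"hello", "héllo", "日本語", "😀"} {
		// len(s) is the UTF-8 byte count that a Go reader sees on the wire.
		fmt.Printf("%q: %d UTF-8 bytes, %d UTF-16 code units\n", s, len(s), utf16Len(s))
	}
}
```

For pure ASCII the two counts agree, which would explain why the problem only shows up with non-English text.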

@grahamjenson
Contributor

grahamjenson commented Jul 28, 2023

I have no idea about performance, but changing the Read method in payload/decoder.go to the code below works:

func (d *decoder) Read(p []byte) (int, error) {
	if d.b64Reader != nil {
		return d.b64Reader.Read(p)
	}
	dd, err := d.limitReader.Read(p)
	unicodeCount := 0
	for i := range p[:dd] {
		b := p[i]
		// Add additional unicode character bytes
		if b>>3 == 30 {
			// starts with 11110 4 byte unicode char
			unicodeCount = unicodeCount + 3
		} else if b>>4 == 14 {
			// starts with 1110 3 byte unicode char
			unicodeCount = unicodeCount + 2
		} else if b>>5 == 6 {
			// starts with 110 2 byte unicode char
			unicodeCount = unicodeCount + 1
		}

	}
	d.limitReader.N = d.limitReader.N + int64(unicodeCount)
	return dd, err
}

This works by scanning the bytes just read for unicode header (leading) bytes and then adding the additional bytes onto the limit reader.

I may have to also change the encoder, but I will look at that soon :)

This code and a bunch of other changes are over at my fork https://github.com/grahamjenson/go-socket.io, where I am also writing a client and fixing other issues (like Ack packet decoding...)

@grahamjenson
Contributor

grahamjenson commented Jul 29, 2023

The above doesn't quite work. The differences between UTF-8 and UTF-16 are annoying. I got something working, but I'm pretty sure it will only work in most cases. (It is a really annoying problem, which is why EIO v4 looks much better.)

Decoder

func (d *decoder) Read(p []byte) (int, error) {
	if d.b64Reader != nil {
		return d.b64Reader.Read(p)
	}
	dd, err := d.limitReader.Read(p)
	unicodeCount := 0
	for i := range p[:dd] {
		b := p[i]
		if b>>3 == 30 {
			// starts with 11110: 4 byte unicode char, 2 length in JS, so 2 extra bytes
			unicodeCount = unicodeCount + 2
		} else if b>>4 == 14 {
			// starts with 1110: 3 byte unicode char, 1 length in JS, so 2 extra bytes
			unicodeCount = unicodeCount + 2
		} else if b>>5 == 6 {
			// starts with 110: 2 byte unicode char, 1 length in JS, so 1 extra byte
			unicodeCount = unicodeCount + 1
		}
	}

	d.limitReader.N = d.limitReader.N + int64(unicodeCount)
	return dd, err
}

Encoder


func (e *encoder) writeTextHeader() error {

	err := writeTextLen(e.calcCodeUnitLength(), &e.header)
	if err == nil {
		err = e.header.WriteByte(e.pt.StringByte())
	}
	return err
}


func (e *encoder) calcCodeUnitLength() int64 {
	// Start at 1, presumably for the packet type byte that
	// writeTextHeader writes right after the length.
	var l int64 = 1
	for _, b := range e.frameCache.Bytes() {
		switch {
		case b>>3 == 30:
			// starts with 11110: 4 byte unicode char, 2 length in JS
			l += 2
		case b>>6 == 2:
			// starts with 10: continuation byte, already counted with its leading byte
		default:
			// ASCII byte, or leading byte of a 2 or 3 byte unicode char: 1 length in JS
			l++
		}
	}

	return l
}
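
One way to gain some confidence in the byte-based counting rule is to compare it with Go's `unicode/utf16` package. The sketch below is an illustration under assumptions, not part of the library: `codeUnits` is a hypothetical standalone version of the rule (without the initial 1), and the comparison is only meaningful for valid UTF-8 input.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// codeUnits counts UTF-16 code units by looking only at UTF-8 bytes:
// a 4-byte leading byte counts 2, continuation bytes count 0, and
// every other byte counts 1.
func codeUnits(b []byte) int64 {
	var n int64
	for _, c := range b {
		switch {
		case c>>3 == 30: // 11110xxx: surrogate pair in UTF-16
			n += 2
		case c>>6 == 2: // 10xxxxxx: continuation byte
		default:
			n++
		}
	}
	return n
}

func main() {
	for _, s := range []string{"plain", "héllo", "日本語", "a😀b"} {
		// Reference value computed by the standard library.
		want := int64(len(utf16.Encode([]rune(s))))
		fmt.Printf("%q: byte rule=%d, utf16 package=%d\n", s, codeUnits([]byte(s)), want)
	}
}
```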

@grahamjenson
Contributor

Fixed(ish) with #608
