Does len(string) might be a candidate? #4

joanlopez · 2020-05-01T18:07:02Z

Even though it's properly documented (see here), IMHO the behaviour of the len built-in function when its argument is a string is a candidate for this repository.

I assume there are good reasons why this behaviour was chosen, but I'd say that returning the number of bytes for strings is quite confusing.

Example:

len("si") // 2
len("sí") // 3
len("世界") // 6

So, as discussed here, the usual way to get the number of characters (runes) in a given string is len([]rune(string)):

len([]rune("si")) // 2
len([]rune("sí")) // 2
len([]rune("世界")) // 2

Additionally, I'd say it could be interesting to open a new Go proposal to add a wrapper function to the strings package. If Go's spirit is keeping things simple and backwards compatible, I'd keep the len behaviour but add that function, as the rune hack is not simple at all.

PS: I'm not really sure whether the proposed function already exists; I only did a quick lookup 😇
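
For what it's worth, the standard library's unicode/utf8 package has RuneCountInString, which counts runes without the []rune conversion. A minimal sketch of the kind of wrapper being asked for, assuming the hypothetical name runeLen (it is not part of the strings package):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeLen reports the number of runes (Unicode code points) in s.
// The name is hypothetical; it only wraps utf8.RuneCountInString.
func runeLen(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	for _, s := range []string{"si", "sí", "世界"} {
		// len counts bytes, runeLen counts code points.
		fmt.Printf("%q: len=%d runes=%d\n", s, len(s), runeLen(s))
	}
}
```

Note that a rune is a code point, so text built from combining marks can still count more runes than a reader would call characters.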

colega · 2020-06-02T08:01:49Z

Hi, sorry for the late response. I think that len(string) doesn't fit well here.

Although it might be confusing for people coming from other programming languages like Python (where Unicode code points are counted by default), it is consistent in Go, and I think that these paragraphs from the official docs make it clear:

In Go, a string is in effect a read-only slice of bytes. If you're at all uncertain about what a slice of bytes is or how it works, please read the previous blog post; we'll assume here that you have.

It's important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes.

Here is a string literal (more about those soon) that uses the \xNN notation to define a string constant holding some peculiar byte values. (Of course, bytes range from hexadecimal values 00 through FF, inclusive.)

So strings are just slices of bytes and know nothing about Unicode, so their behaviour is consistent with byte slices.
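
A small sketch of that point, assuming a made-up byte sequence and a hypothetical helper named describe: a string can hold bytes that are not valid UTF-8, and len still just reports the byte count.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper: it returns the byte length of s
// and whether those bytes happen to form valid UTF-8.
func describe(s string) (n int, valid bool) {
	return len(s), utf8.ValidString(s)
}

func main() {
	raw := "\xbd\xb2\x3d\xbc" // arbitrary bytes chosen for illustration, not valid UTF-8
	n, ok := describe(raw)
	fmt.Println(n, ok)                        // byte count and UTF-8 validity
	fmt.Println(len([]byte(raw)) == len(raw)) // same length as the equivalent byte slice
	fmt.Printf("% x\n", raw)                  // hex dump of the bytes held by the string
}
```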

Also note that len([]rune(...)) is not just a different syntax: the cost goes from O(1) to O(n), so making that explicit in the code is always good.
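
To make the O(n) part concrete, here is a rough sketch of what counting runes involves, using a hypothetical countRunes function: every rune has to be decoded in turn, whereas len(s) only reads the stored byte length.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countRunes is a hypothetical, hand-written rune counter. It decodes
// one rune at a time, so its cost grows with the length of s.
func countRunes(s string) int {
	n := 0
	for len(s) > 0 {
		// size is the number of bytes the decoded rune occupies (at least 1).
		_, size := utf8.DecodeRuneInString(s)
		s = s[size:]
		n++
	}
	return n
}

func main() {
	s := "世界"
	fmt.Println(len(s))         // byte length, no scan needed
	fmt.Println(countRunes(s))  // rune count, scans the whole string
	fmt.Println(len([]rune(s))) // same count via the conversion
}
```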

joanlopez · 2020-06-13T18:15:06Z

Sure, fair enough! Thanks for your time 🙏 Happy to keep learning every day 😇

colega · 2020-11-23T14:28:26Z

I'm reopening this as I feel we can have a page for strings behaviour.

While len(unicodeString) is not unexpected enough on its own, IMO, the whole Unicode/bytes duality of strings definitely deserves documenting, especially the fact that, as @joanlopez pointed out in a private conversation at some point, a for range loop iterates over runes but yields byte indexes: https://play.golang.org/p/lEYcSV4Btgh
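
A minimal sketch of that behaviour (not the contents of the playground link; the helper runeOffsets and the sample string are assumptions for illustration):

```go
package main

import "fmt"

// runeOffsets is a hypothetical helper that collects the index a
// for range loop yields for each rune of s. These are byte offsets,
// not rune positions.
func runeOffsets(s string) []int {
	var idx []int
	for i := range s {
		idx = append(idx, i)
	}
	return idx
}

func main() {
	s := "sí世界"
	for i, r := range s {
		// r is a decoded rune, i is the byte offset where it starts.
		fmt.Printf("byte index %d: %c\n", i, r)
	}
	fmt.Println(runeOffsets(s)) // indexes skip ahead after multi-byte runes
	fmt.Printf("%x\n", s[1])    // indexing yields a single byte, not the rune í
}
```

Because the indexes are byte offsets, using them as character positions, or indexing with s[i] and expecting a rune, goes wrong as soon as the string contains a multi-byte rune.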

joanlopez · 2020-12-13T12:34:33Z

Additional context here.

colega · 2024-05-07T11:16:35Z

Just as a heads-up, I just hit a bug because of this.

colega closed this as completed Jun 13, 2020

colega reopened this Nov 23, 2020
