Does len(string) might be a candidate? #4

Open
joanlopez opened this issue May 1, 2020 · 5 comments

@joanlopez
Contributor

Even though it's properly documented (see here), IMHO the behaviour of the len built-in function when its argument is a string is a candidate for this repository.

I assume there are good reasons why this behaviour was chosen, but I'd say that returning the number of bytes for strings is quite confusing.

Example:

len("si") // 2
len("sí") // 3
len("世界") // 6

So, as discussed here, the proper way to get the number of characters (runes) in a given string is to use len([]rune(string)):

len([]rune("si")) // 2
len([]rune("sí")) // 2
len([]rune("世界")) // 2

Additionally, I'd say it could be interesting to open a new Go proposal to include a wrapper function in the strings package. If Go's spirit is keeping things simple and backwards compatible, I'd keep the len behaviour but add that function, as the rune hack is not simple at all.

PS: I'm not really sure whether the proposed function already exists; I only did a quick lookup 😇

@colega
Owner

colega commented Jun 2, 2020

Hi, sorry for the late response. I think that len(string) doesn't fit well here.

Although it might be confusing for people coming from other programming languages like Python (where Unicode code points are counted by default), it is consistent in Go, and I think these paragraphs from the official docs make it clear:

In Go, a string is in effect a read-only slice of bytes. If you're at all uncertain about what a slice of bytes is or how it works, please read the previous blog post; we'll assume here that you have.

It's important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes.

Here is a string literal (more about those soon) that uses the \xNN notation to define a string constant holding some peculiar byte values. (Of course, bytes range from hexadecimal values 00 through FF, inclusive.)

So strings are just slices of bytes that know nothing about Unicode, and thus their behaviour is consistent with byte slices.

Also note that len([]rune(...)) is not just a different syntax: it's an operation whose cost increases from O(1) to O(n), so making that explicit in the code is always good.

@colega colega closed this as completed Jun 13, 2020
@joanlopez
Contributor Author

Sure, fair enough! Thanks for your time 🙏 Happy to keep learning every day 😇

@colega colega reopened this Nov 23, 2020
@colega
Owner

colega commented Nov 23, 2020

I'm reopening this as I feel we can have a page for strings behaviour.

While len(unicodeString) is not unexpected enough, IMO, the whole Unicode/bytes duality of strings can definitely be documented, especially the fact that, as @joanlopez pointed out in a private conversation at some point, a for range loop iterates over runes but provides byte indexes: https://play.golang.org/p/lEYcSV4Btgh

@joanlopez
Contributor Author

Additional context here.

@colega
Owner

colega commented May 7, 2024

Just as a heads-up, I just hit a bug because of this.
