UTF-8 output gets mangled in the Scala Worksheet #185

Open
Blaisorblade opened this issue Jul 2, 2014 · 3 comments

@Blaisorblade

My Scala code (a lambda-calculus implementation) produces UTF-8 output. The worksheet is exactly what I'd want, except that it doesn't cope with UTF-8 program output. As far as I can tell, the whole project uses UTF-8, as does the workspace.

For instance, compare an output fragment, as seen by running the Scala REPL inside Eclipse:
((ℤ → ℤ) → ℤ → ℤ) → (ℤ → ℤ) → ℤ → ℤ)
with what I get in the Worksheet:

((��� ��� ���) ��� ��� ��� ���) ��� (��� ��� ���) ��� ��� ��� ���)

Each Unicode character translates to three replacement characters because each of these characters takes 3 bytes in UTF-8 (they are inside the BMP, in the U+0800 to U+FFFF range; characters outside the BMP would take 4 bytes).
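To make the byte counts concrete, here is a minimal Java sketch (the class and method names are made up for illustration; this is not the worksheet's code). It encodes ℤ (U+2124) and → (U+2192) as UTF-8 and then decodes the bytes with the wrong charset, US-ASCII being just one example of a non-UTF-8 decoder, which yields one replacement character per byte:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Width {
    // Number of bytes the given text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Encode as UTF-8, then decode with the wrong charset (US-ASCII here),
    // the way a reader that ignores the real encoding would.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String z = "\u2124";      // ℤ, written as an escape so the source file encoding does not matter
        String arrow = "\u2192";  // →
        System.out.println(utf8Length(z));           // bytes needed for ℤ
        System.out.println(utf8Length(arrow));       // bytes needed for →
        System.out.println(misdecode(z).length());   // chars produced by the wrong decoder
    }
}
```

Which wrong charset the worksheet actually ends up using is not established in this report; the sketch only shows why one 3-byte character can turn into three garbage characters.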

This is with version 3.0.4 of Scala IDE. More precisely:
Scala Worksheet 0.2.3.v-2_11-201405200954-4f7988d org.scalaide.worksheet.feature.feature.group Scala IDE
Scala IDE for Eclipse 3.0.4.v-2_11-201405200946-c46f499 org.scala-ide.sdt.feature.feature.group scala-ide.org

(Plus Scala Search & ScalaTest plugins, I could provide those version numbers if needed).

I've looked at the current source code (which maybe was a bad idea), and it seems that the conversion should be done purely by Eclipse libraries here, and I can't see anything wrong with that:

@skyluc
Member

skyluc commented Jul 2, 2014

Are you running on a non-UTF-8 system?
From Java's point of view, nearly all operating systems, except correctly configured Linux machines, use a default encoding other than UTF-8.
It is relevant because the execution of the worksheet code is done in a forked process, and it is likely that the encoding is not forced to UTF-8.
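For illustration only, forcing the encoding on a forked JVM could look roughly like the sketch below. This is not the worksheet's actual launcher code; the class name, the `buildCommand` helper, and the way the classpath is passed are all assumptions.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ForkedJvmCommand {
    // Build a command line for a child JVM whose default encoding is pinned to UTF-8,
    // instead of being inherited from the host configuration.
    static List<String> buildCommand(String classpath, String mainClass) {
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        List<String> cmd = new ArrayList<>();
        cmd.add(javaBin);
        cmd.add("-Dfile.encoding=UTF-8");
        cmd.add("-cp");
        cmd.add(classpath);
        cmd.add(mainClass);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand(System.getProperty("java.class.path"), "charset.TestCharset");
        // Passing cmd to new ProcessBuilder(cmd).start() would launch the child process.
        System.out.println(String.join(" ", cmd));
    }
}
```

The parent would still have to decode the child's output as UTF-8 explicitly for this to help.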

@Blaisorblade
Author

Thanks for the prompt answer!
I assumed this would be a problem when decoding from the stream, but you might still be right.

Do you agree that using the host configuration would be a bug?

I investigated a bit, and before answering your question, I'll give my analysis: Eclipse is correctly configured to use UTF-8 (according to this: http://stackoverflow.com/a/9181068/53974), and that should be enough. Instead, I also need to set -Dfile.encoding=UTF8 in eclipse.ini, and the worksheet works correctly if and only if that option is active. (When relaunching Eclipse, I also need to modify & save the worksheet to update the output).

Analysis: Since the documented setting is inside Eclipse itself, it seems that what I'm doing is a hack, needed because some code uses the default encoding instead of passing the Eclipse-configured one.
Now, I don't envy the poor soul who's supposed to debug this (you forget to thread the encoding once and you have a bug), even though I suppose that threading is needed for people configuring multiple encodings.
So I'll be OK with any resolution other than "not-a-bug"; for instance, I'd be happy with WontFix or a late milestone/low priority, as long as the workaround is documented.
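To spell out what "threading the encoding" means, here is a minimal Java sketch (hypothetical names, not taken from the worksheet source). The charset is an explicit parameter all the way down; the bug-prone variant would call `new InputStreamReader(in)` without a charset and silently pick up the JVM default.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitDecoding {
    // Read the whole stream, decoding with the charset the caller passes in
    // rather than the platform default.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the program output: "ℤ → ℤ" encoded as UTF-8.
        byte[] programOutput = "\u2124 \u2192 \u2124".getBytes(StandardCharsets.UTF_8);
        String decoded = readAll(new ByteArrayInputStream(programOutput), StandardCharsets.UTF_8);
        System.out.println(decoded.length()); // number of chars after correct decoding
    }
}
```

In the worksheet, the charset argument would presumably come from the Eclipse-configured encoding; how that setting is obtained is outside this sketch.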

Side note/additional issue: line breaking seems very much not Unicode-aware, both in practice:

  val test1T: Term = test1                        //> test1T  : ilc.feature.let.ANormalFormTest.v.Term = App(Abs(Var(id,((ℤ → 
                                                  //| ℤ) → ℤ → ℤ) → (ℤ → ℤ) → ℤ → ℤ),App(Abs(Var(id_i,�
                                                  //| � → ℤ),App(Abs(Var(apply,(ℤ → ℤ) → ℤ → ℤ),App(App(App(Var(

This maybe happens because the implementation works in terms of bytes: it adds newlines after a certain byte count. I didn't run anything with debugging, though:
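As a sketch of what Unicode-aware breaking could look like (again hypothetical, not the worksheet's implementation), the following Java helper cuts a string every `width` code points, so a break can never fall inside a multi-byte character:

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointWrap {
    // Split text into chunks of at most `width` code points.
    // Offsets are computed with offsetByCodePoints, so neither a UTF-8 byte
    // sequence nor a surrogate pair is ever split across two chunks.
    static List<String> wrap(String text, int width) {
        if (width <= 0) {
            throw new IllegalArgumentException("width must be positive");
        }
        List<String> lines = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int remaining = text.codePointCount(start, text.length());
            int end = text.offsetByCodePoints(start, Math.min(width, remaining));
            lines.add(text.substring(start, end));
            start = end;
        }
        return lines;
    }

    public static void main(String[] args) {
        // "ℤ → ℤ" is 5 code points but 11 bytes in UTF-8.
        for (String line : wrap("\u2124 \u2192 \u2124", 2)) {
            System.out.println(line.codePointCount(0, line.length()));
        }
    }
}
```

A real fix would probably also want to account for display width and combining characters; this only addresses the split-character problem visible in the output above.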


> Are you running on a non-UTF-8 system?

As far as I can tell, no. I'd be happy to try a test of your choice.

I'm using OS X 10.9, but almost everything else on my system is handling Unicode correctly. I say "almost" because IIRC some programs (TextEdit) still dare offer me "Mac OS Roman" as default encoding.

Regarding -Dfile.encoding=UTF8, most of my JVMs have that option (according to jvisualvm). Eclipse didn't, but still, both in the Scala REPL and in the worksheet, the property seems correctly set. However, setting -Dfile.encoding made a difference, not sure why.

In the Scala REPL, both inside and outside Eclipse, and in the worksheet:

scala> sys.props("file.encoding")
res4: String = UTF-8

sys.props("file.encoding")                      //> res0: String = UTF-8

Also, from the prompt:

$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"

Finally, I ran this program:

package charset;

public class TestCharset {
  public static void main(String[] args) {
    System.out.println(System.getProperty("file.encoding"));
  }
}

and got this output:

$ java charset.TestCharset
UTF-8

So the default encoding seems to be the right one. But I must be missing something, since -Dfile.encoding=UTF8 made a difference for Eclipse.
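One more check that might narrow this down: the `file.encoding` property and the charset the JVM actually uses by default are not necessarily the same thing, since the default charset is fixed at JVM startup while the property can be changed afterwards. A small variation of the test program (the class name is made up) prints both, and could be run from the worksheet as well as from the prompt:

```java
import java.nio.charset.Charset;

public class EncodingReport {
    // Report both the system property and the charset the JVM really uses by default.
    static String report() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + " defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

If the two values differ inside the forked worksheet process, that would be consistent with the encoding being overridden somewhere after startup; I haven't verified that this is what happens.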

Blaisorblade added a commit to inc-lc/ilc-scala that referenced this issue Jul 2, 2014
@dragos dragos added the bug label Jul 28, 2014
@skogler

skogler commented Nov 3, 2014

I am also getting this issue on a UTF-8 system. All files are correctly configured to use UTF-8. The line splitting in the worksheet messes up the output.
