Explicitly set LANG=C for serial login; better handling for CP437 terminals #5

Open
knghtbrd opened this issue Oct 26, 2015 · 11 comments

Comments

@knghtbrd
Member

From @IvanExpert on October 25, 2015 22:24

UTF-16 is 2 or 4 bytes (not relevant, but just saying)
UTF-8 is one byte for 0-127, ASCII compatible; 2-4 bytes for everything else
this screws up Apple II term programs for non-ASCII chars (e.g. hyphen, smart quote)

ISO-8859-* is one byte 0-255, with 128-255 varying by "part" 1-16
ISO-8859-1 is "Latin-1", revision is ISO-8859-15, others are language-specific
Apple II text comm programs are going to display 0-127 anyway, since
Apple II 128-255 are redundant or MouseText
"ANSI" in a comm program means pseudo VT-100, and may also mean the "DOS CodePage 437"
(IBM PC character set), as is the case with Spectrum ANSI emulation
So it doesn't matter which ISO-8859 part, since the comm programs aren't going to use
any of them. The main thing is that it's one byte per character, unlike UTF-8
TERM=vt100 on Pi makes Linux programs mostly display B&W, and makes ctrl-chars
display on Spectrum ANSI
TERM=pcansi on Pi makes Linux programs do color for Spectrum ANSI
(TERM=ansi just breaks everything)
LANG=en_US (as opposed to en_US.UTF-8) gets you ISO-8859-1, which is better for
Spectrum ANSI, but the en_US ISO-8859-1 locale has to be available (from raspi-config)
See A2CLOUD setup for how to generate locales from Linux prompt
ProTERM VT-100 just repeats 128-255; ANSI BBS uses ASCII and mousetext to approximate DOS Code Page 437
Spectrum VT-100 is sort of arbitrary in 128-255
TERM=vt100 doesn't work with "ANSI" emulation because it outputs ctrl-O around
text styling, and ctrl-O is a displayed character in CP437

single-byte:
ASCII is single byte 0-127 (0-31 are "C0" control codes, plus 127 is DEL)
ISO-8859-* (1-16) is ASCII for 0-127, 128-159 are "C1" control codes, 160-255 are regional characters
ISO-8859-1 is standard "Latin-1", ISO-8859-15 is updated for Euro and other chars

Microsoft has its own "codepage" numbers for character sets.
Codepage 437 (aka "ANSI BBS") is the DOS character set: ASCII from 32-126,
plus printable chars at 1-31 and 127-255; (all chars are also represented in UTF-8)
"Linedraw" font for Windows provides characters 128+ for codepage 437: ftp://ftp.microsoft.com/Softlib/MSLFILES/GC0651.EXE (use 64.4.17.176 if doesn't resolve)
Also "Terminal" font in XP provides most of it; Courier New is a Unicode font with most of the same characters
Windows-1252 (codepage 1252) is ISO-8859-1 with additional chars from 128-159 instead of C1,
including all chars in ISO-8859-15
Mac has "macintosh" or "MacRoman" encoding which is ASCII for 0-127 and
its own characters for 128-255

UTF-8 characters 0-127 are the same as ASCII
UTF-8 characters 128+ are between two and four bytes and can represent everything (I guess)
UTF-16 characters are either two or four bytes, and are endian-sensitive
UTF-32 characters are always four bytes, and are endian-sensitive
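
To make the byte counts above concrete, here is a minimal Python sketch using only the standard library codecs. Python is just a convenient illustration language here (nothing in A2CLOUD depends on it), and the sample characters are arbitrary picks, not anything taken from the setup scripts.

```python
# Byte length of a few sample characters in the encodings discussed above.
# SAMPLES is an arbitrary, illustrative selection.
SAMPLES = {
    "A": "A",                    # plain ASCII
    "e-acute": "\u00e9",         # in Latin-1 and CP437
    "em dash": "\u2014",         # in neither Latin-1 nor CP437
    "emoji": "\U0001F600",       # outside the Basic Multilingual Plane
}
ENCODINGS = ["ascii", "latin-1", "cp437", "utf-8", "utf-16-le", "utf-32-le"]


def encoded_length(char, encoding):
    """Return how many bytes char needs in encoding, or None if it cannot be encoded."""
    try:
        return len(char.encode(encoding))
    except UnicodeEncodeError:
        return None


def length_table():
    """Map each sample name to its byte length per encoding."""
    return {
        name: {enc: encoded_length(char, enc) for enc in ENCODINGS}
        for name, char in SAMPLES.items()
    }


if __name__ == "__main__":
    for name, row in length_table().items():
        print(name, row)
```

The single-byte encodings either need exactly one byte or cannot represent the character at all, which is the property the Apple II comm programs rely on.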

Copied from original issue: RasppleII/a2server#39

@knghtbrd
Member Author

The real solution for terminals would be to define the appropriate terminal definitions for ProTERM, Spectrum, etc. Character sets and locales are more of an issue since these tend to be offered as iso8859-* or utf-8 or sometimes multibyte character sets that don't relate to the Apple // at all. We should be able to get cp437 working for Spectrum. It's possible that we could also get MouseText working for limited boxdraw support in things like dialog.

Not sure how to tag this one. It's a bug certainly, but a bug in what exactly, aside from A2CLOUD in general? I'll move this there, but the fix is going to be complicated.

@IvanExpert
Contributor

This isn't a bug. It behaves correctly already. The above info is just for reference.

The most important detail is that an ISO-8859-* locale is available for LANG to use, which it isn't out of the box on Raspbian -- only UTF-8 is. That makes everything look good on an Apple II, whose comm programs don't know about Unicode, and therefore can't handle multi-byte characters.

The packaging of Raspple II ensures ISO-8859-1 is made available, and that locale is specified for the serial console during A2CLOUD setup (I just forget how offhand, but I know that's what I made it do).

If you set ProTERM, Z-Link, or Spectrum to VT-100, it works great out of the box; if you set ProTERM or Spectrum for ANSI, and type "term color" at the prompt (which is an alias to TERM=pcansi), then cp437 is used instead of VT-100. There's questionable benefit to this in ProTERM but it's great for Spectrum's color display.

Of course, it would be theoretically possible to create a different emulation table for the proprietary "special" emulations offered by those programs, but a) why hardcode for one no-longer-maintained program, and b) in my observation, much of Linux is hardcoded for VT-100 and its derivatives, regardless of what TERM is set to.

See further conversation about this here, particularly my posts: https://groups.google.com/d/msg/comp.sys.apple2/WZ3p8IcrPrw/LToNxoh88IgJ
http://appleii.ivanx.com/prnumber6/a2cloud-log-in-from-your-apple-ii/

@knghtbrd
Member Author

I don't think iso8859-1 does resolve the problem for the Apple // though—not really. The Apple // character ROM is 7 bit, and iso8859-1 is very 8 bit. Characters such as é and ü and ç and æ simply don't exist on the Apple, but are part of iso8859-1.

It's actually kind of too bad that we cannot easily load a soft-font into the Apple // text mode. Much cool stuff could be done with language support on the Apple // if we could.

@IvanExpert
Contributor

No, having an ISO-8859-1 locale doesn't fully resolve the problem of the Apple II not having the ISO-8859-1 character set. However, it solves a lot of problems by virtue of simply being any kind of 8-bit character set with standard ASCII in characters 0-127.

Otherwise, the Apple II comm programs attempt to render the default, multibyte UTF-8 character set by displaying every single byte of multibyte characters, which causes a host of formatting problems even in things as trivial as man pages, and certainly web pages in things like Lynx.
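
As a rough illustration of that effect, the Python sketch below simulates a terminal that treats every incoming byte as one character. It models two behaviors described elsewhere in this thread: looking each byte up in CP437 (Spectrum's ANSI emulation) and repeating 0-127 for 128-255 (ProTERM's VT-100). The function name is made up for this example, and real terminal behavior will differ in detail.

```python
def as_single_byte_terminal(text, strip_high_bit=False):
    """Approximate what a one-byte-per-character terminal shows for UTF-8 output.

    Hypothetical helper for illustration; not part of A2CLOUD.
    """
    raw = text.encode("utf-8")
    if strip_high_bit:
        # Bytes 128-255 repeat 0-127, i.e. the high bit is ignored.
        return bytes(b & 0x7F for b in raw).decode("ascii")
    # Every byte is looked up in the CP437 table on its own.
    return raw.decode("cp437")


if __name__ == "__main__":
    word = "caf\u00e9"
    print(len(word), "characters sent")
    print(len(as_single_byte_terminal(word)), "characters displayed")
    print(repr(as_single_byte_terminal(word, strip_high_bit=True)))
```

Each non-ASCII character becomes two or more displayed characters, so column counts no longer match what the Linux program assumed. With the high bit stripped, some UTF-8 continuation bytes also land in the control-code range, which may be one source of the formatting problems.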

I don't know whether it's possible to soft-load fonts for Spectrum based on Spectrum's ANSI mode, which is graphical -- I'd think you could replace the CP437 characters in the upper half with ISO-8859-1 characters. You could ask Ewen.

What might be interesting, and only occurs to me now, is that if CP437 is an available Linux locale, maybe that could be used instead of ISO-8859-1 for the Apple II shell login, and then Spectrum (and to a lesser extent, ProTERM) would be able to accurately represent what is intended, to the extent that CP437 can represent characters in other encodings.
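
A quick way to see "the extent that CP437 can represent characters in other encodings" is to try encoding them. This is only a sketch using Python's built-in cp437 codec; the helper name and the sample characters are my own choices, and a real CP437 locale would be handled by the C library rather than by anything like this.

```python
def to_cp437(text):
    """Encode text as CP437; characters with no CP437 equivalent become '?'.

    Hypothetical helper for illustration only.
    """
    return text.encode("cp437", errors="replace")


if __name__ == "__main__":
    samples = {
        "e-acute": "\u00e9",
        "u-diaeresis": "\u00fc",
        "c-cedilla": "\u00e7",
        "box horizontal": "\u2500",
        "a-tilde": "\u00e3",   # in ISO-8859-1 but not in CP437
        "euro sign": "\u20ac",  # in ISO-8859-15 but not in CP437
    }
    for name, char in samples.items():
        print(name, to_cp437(char))
```

Most Western European accented letters survive, along with the box-drawing characters, but some ISO-8859-1 letters have no CP437 slot and would need a fallback.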

@IvanExpert
Contributor

So, this is interesting. Due to a bug I just found in a2cloud-setup, which fails to write "en_US" into /usr/local/etc/a2cloud-lang during setup, and instead writes nothing, serial login is behaving probably better than was intended, by supporting only the ASCII character set, with no high characters. This bug should be made the permanent behavior.

This is because /usr/local/etc/a2cloudrc is supposed to be setting LANG=en_US, which uses ISO-8859-1, but instead it's setting LANG=C, which uses ANSI_X3.4-1968, aka ASCII. That is the fallback, but it's happening because of the bug.

So, having discovered the bug and realizing that all this time I had been looking at LANG=C on the Apple II, I looked at a French web site in Lynx via SSH on my Mac (LANG=en_US.UTF-8); all accented characters looked good. I then looked at it on my Apple II with LANG=C. Also looked good, with no accents, but otherwise readable. Then I set the Apple II to LANG=en_US. Looked bad, with some characters incorrectly displayed, and formatting problems. So, ISO-8859-1 should probably be avoided altogether.
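
For reference, the charset name reported under LANG=C really is an alias for ASCII, and the "no accents, but otherwise readable" result can be approximated in a few lines of Python. This is only a sketch of the general idea: it is not how Lynx does its conversion, and the helper name is invented for the example.

```python
import codecs
import unicodedata

# ANSI_X3.4-1968 is registered in Python as an alias of the ascii codec.
LANG_C_CHARSET = codecs.lookup("ANSI_X3.4-1968").name


def ascii_fallback(text):
    """Drop accents so text fits in 7-bit ASCII (illustrative, not Lynx's method)."""
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" then discards the combining marks.
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


if __name__ == "__main__":
    print(LANG_C_CHARSET)
    print(ascii_fallback("fran\u00e7ais d\u00e9j\u00e0"))
```

Note that this simple approach drops characters that have no decomposition (such as æ) entirely, whereas a real program would more likely substitute something readable.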

I think the correct course of action is to always set LANG=C for serial login in a2cloudrc, and remove reference to a2cloud-lang; and if we do this, then there's no need to generate the en_US locale at all, either in a2cloud-setup or in the Raspple II packaging steps. (However, I'd still prefer to generate the en_US.UTF-8 locale in the packaging steps to replace the default en_GB.)

As a footnote, I took a look, and there's no Debian locale in /usr/share/i18n/SUPPORTED that uses the CP437 charset, even though that charset exists in /usr/share/i18n/charmaps. I might try, just for grins, to see if I can get a locale to use it, because it would be slick if Spectrum's graphical ANSI emulation could actually represent accented letters correctly.

@IvanExpert
Contributor

Update: while there is no locale that supports the IBM437 charset, it might be worthwhile to create one as an option for accented character support in Spectrum's ANSI display; alternatively, users can use the included Links browser, which ignores the locale and provides its own character set menu, from which CP437 can be selected.

I picked a little-used locale ('eo', which is Esperanto, and which is the only locale to use ISO-8859-3), then simply took /usr/share/i18n/charmaps/IBM437.gz and renamed it to ISO-8859-3.gz, replacing the original charmap. I then created the 'eo' locale using dpkg-reconfigure, and then set LANG=eo.ibm437 and TERM=pcansi in Spectrum's ANSI display. (To make things readable in color text, in Spectrum choose Settings -> ...More Display Options -> Support color -> Use high intensity.)

Apart from the surprise of Lynx's menus being in Esperanto, I was able to render accented characters correctly. A demo of the character set can be seen here in Lynx:
http://www.kostis.net/charsets/cp437.htm
http://symbolcodes.tlt.psu.edu/bylanguage/french.html

And you can use them in Links as well, if you need to select Setup -> Character Set -> CP437.

TL;DR: If we think generalized, system-wide accented character support is desirable on the Apple II in Spectrum's ANSI emulation, we could make a new locale that uses the CP437 character set; if we think it's mostly useful for browsing only, users can use Links without us doing anything as long as we document how. (Or if we actually wanted to support ISO-8859-*, we could presumably edit Spectrum's ANSI character set; I assume Ewen would be receptive.)

As an aside, I noticed that Spectrum gets CP437 character 130 wrong; it's supposed to be a lowercase e with a "forward slash" accent (acute), but it has a "two dots" (diaeresis or umlaut) instead.

@knghtbrd
Member Author

We could perhaps add en_US.ascii and en_US.cp437 locales. I don't know exactly how to add them to the dpkg-reconfigure menu without building some custom locale packages, but it is easy enough to generate a new locale. One reason to use something other than C for LANG is so that accented letters appear in correct sorting order. That's not really a big deal for ASCII-only terminals like ProTERM, but it would matter for CP437.
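
The sorting point can be shown with a small Python sketch. Real locale-aware sorting goes through LC_COLLATE (for example via `locale.strxfrm`), which depends on which locales are installed, so the example below only approximates it by folding accents before comparing. The helper name and the word list are made up for illustration.

```python
import unicodedata


def fold_accents(word):
    """Sort key that ignores accents and case (a rough stand-in for LC_COLLATE)."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.lower()


WORDS = ["zebra", "\u00e9tude", "eagle"]

# Plain code point order, comparable to what LANG=C gives: accented words sort last.
codepoint_order = sorted(WORDS)

# Accent-folded order, closer to what a language-aware locale would produce.
folded_order = sorted(WORDS, key=fold_accents)

if __name__ == "__main__":
    print(codepoint_order)
    print(folded_order)
```

In code point order the accented word ends up after "zebra", which is the kind of result a non-C locale is meant to avoid.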

Did you file an issue against Spectrum for its charset problem?

@IvanExpert
Contributor

I wrote an issue about Spectrum's character set issue here and notified Ewen, and also asked him if the character set is easily editable, for possible creation of an ISO-8859-1 character set that could be used with LANG=en_US.

This issue could probably be separated into two: one to make LANG=C the permanent default, and one to create the CP437 locale for those wanting to use Spectrum ANSI. (And possibly another for creating an ISO-8859-1 alternative character set for Spectrum ANSI.)

@IvanExpert
Contributor

Just did some homework on this pursuant to recent emails. Summary:

  • Contrary to my previous belief, ProTERM appears to only provide MouseText characters using ProTERM Special.
  • When you choose ANSI BBS emulation, it tries to represent CP437 using the closest conventional ASCII equivalents for characters 128-255, but alas does not employ MouseText characters. Missed golden opportunity right there.
  • ProTERM Special treats characters 128-255 as though their high bits are stripped; that is, they repeat characters 0-127. The same behavior happens with "No Emulation" or VT-100.
  • To enable MouseText when using PSE, you send a ctrl-P; to disable it, you send ctrl-N. The MouseText characters are in $40-$5F (64-95).
  • Spectrum's ProTERM Special and VT-100 behave the same as ProTERM's.
  • Spectrum's ANSI emulation faithfully represents CP437 using the SHGR display (at the expense of speed).
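
The ProTERM Special behavior in the list above can be sketched in Python as follows. The control codes and the $40-$5F range come from the findings above; the function names are invented for this example, and no particular MouseText glyph assignments are assumed.

```python
CTRL_P = 0x10  # enables MouseText in ProTERM Special (per the list above)
CTRL_N = 0x0E  # disables MouseText again
MOUSETEXT_RANGE = range(0x40, 0x60)  # $40-$5F


def strip_high_bit(data):
    """Model how 128-255 repeat 0-127 in PSE, No Emulation, and VT-100."""
    return bytes(b & 0x7F for b in data)


def mousetext(codes):
    """Wrap MouseText codes in ctrl-P ... ctrl-N so only they are affected.

    Hypothetical helper; raises ValueError for bytes outside $40-$5F.
    """
    if any(b not in MOUSETEXT_RANGE for b in codes):
        raise ValueError("MouseText codes must be in $40-$5F")
    return bytes([CTRL_P]) + bytes(codes) + bytes([CTRL_N])


if __name__ == "__main__":
    print(strip_high_bit(b"\xc1\xc2\xc3"))  # same as "ABC"
    print(mousetext(b"@"))                  # $40 wrapped in ctrl-P / ctrl-N
```

A PSE terminal definition would need a table from box-drawing characters to specific codes in that range; that table is deliberately left out here because the thread does not establish it.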

So, it might indeed be worthwhile to create a PSE termcap, based on VT-100 (because so much of Linux is hardcoded to VT-100 and derivatives), that maps box-drawing chars (Unicode? CP437?) and anything else suitable to MouseText.

For whatever reason, raspi-config in the OS X terminal displays box drawing characters even when LANG=C and TERM=vt100; however, using Spectrum's VT-100 emulation, it uses appropriate ASCII equivalents (when TERM=vt100), while its ANSI emulation (when TERM=pcansi) shows accurate box-drawing characters.

To use ProTERM within GSport, set it to use the IIgs Modem Port with a Null Modem (RTS/CTS) driver. Then, in the window, type ATZ followed by ATS0=1. Then in a telnet window, pipe whatever you want to send to ProTERM to nc on port 6502, e.g.: echo -e "\x10\x40" | nc localhost 6502 will output a solid-apple if ProTERM Special is turned on.
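
If nc is not handy, the same bytes can be sent with a few lines of Python. This is a minimal sketch that assumes, as in the nc example, that GSport is listening on localhost port 6502; the function and constant names are made up for illustration.

```python
import socket

# Same bytes as: echo -e "\x10\x40"  (ctrl-P, $40, plus echo's trailing newline)
SOLID_APPLE = b"\x10\x40\n"


def send_to_emulator(payload, host="localhost", port=6502, timeout=5.0):
    """Open a TCP connection to the emulated modem port and send payload.

    Assumes something (e.g. GSport, as described above) is listening
    on host:port; raises OSError if the connection fails.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)


# Usage, once ProTERM is waiting for a call:
#     send_to_emulator(SOLID_APPLE)
```

Nothing is sent until `send_to_emulator` is called, so the block can be pasted into a session and used interactively.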

To use Spectrum, you can do the same, but you can also use Telnet if A2SERVER 1.5.0+ is running somewhere.

@knghtbrd
Member Author

Still not sure what exactly to do with this one, so I'm marking it for requested help in case some old UNIX hand who's had more experience with these issues can offer some advice about the best way to do this stuff.

@knghtbrd changed the title from "Info: character sets" to "Explicitly set LANG=C for serial login; better handling for CP437 terminals" on Jun 24, 2018

@knghtbrd
Member Author

Added the informational component to the wiki page.

2 participants