Croatian digraphs: When the digraph is to be output and the font is missing one, mimic it #497

Closed
ivankokan opened this issue Apr 18, 2021 · 17 comments

@ivankokan
Contributor

ivankokan commented Apr 18, 2021

Related to #216.

Hyphenation within the "fallback" DŽ, Dž, dž, LJ, Lj, lj, NJ, Nj and nj must be suppressed (regardless of what the hyphenation patterns provide). I am not sure which of \mbox{}, \nobreak, \nolinebreak etc. would be correct (or at least the most appropriate).

Other:

@jspitz
Collaborator

jspitz commented Apr 20, 2021

In this case, \mbox seems most suitable, as it allows for this one-line fix:

diff --git a/tex/gloss-croatian.ldf b/tex/gloss-croatian.ldf
index 54074f0..001f8ea 100644
--- a/tex/gloss-croatian.ldf
+++ b/tex/gloss-croatian.ldf
@@ -112,7 +112,7 @@
    \ifcroatian@disableligatures
      \bgroup#2\egroup%
    \else
-     \charifavailable{#1}{#2}%
+     \charifavailable{#1}{\mbox{#2}}%
    \fi%
 }

Please test whether this works for you.

@ivankokan
Contributor Author

ivankokan commented Apr 21, 2021

Actually, if disableligatures is set, then hyphenation must be suppressed for both occurrences of #2.

(What is the purpose of the grouping within the true branch of \ifcroatian@disableligatures? Can it be omitted?)

@jspitz
Collaborator

jspitz commented Apr 22, 2021

One thing that keeps me wondering: If hyphenation occurs at this place, either this is orthographically correct (and we shouldn't suppress it) or the hyphenation patterns would need to be corrected (rather than letting polyglossia block hyphenation points).

@ivankokan
Contributor Author

ivankokan commented Apr 22, 2021

> One thing that keeps me wondering: If hyphenation occurs at this place, either this is orthographically correct (and we shouldn't suppress it) or the hyphenation patterns would need to be corrected (rather than letting polyglossia block hyphenation points).

Consider these two words: poluinjektivnost (semi-injectivity) and nestručnjak (nonprofessional). The first contains the separate letters n and j; the second contains the single letter nj.

Note the way I wrote them here: not using the Unicode digraph ǌ for the latter, but consecutive n and j in both words. That is (fortunately or unfortunately, but mostly for historical reasons) the common way of typing: the Croatian/Serbian QWERTZ layout has no slots for the digraphs at all, whereas the Serbian Cyrillic layout does have keys for the corresponding letters (which are not considered digraphs in Cyrillic, though).
image
image

When it comes to hyphenation, for the first word it is not forbidden to hyphenate between n and j because they are (etymologically) different letters. On the other hand, it is forbidden to hyphenate between n and j in nestručnjak, simply because those two consecutive n and j denote one letter in Croatian. (Note that I used "forbidden", not "good", "bad" or something similar.)

So, if both cases are typeset the common way (not using Unicode digraphs when applicable), it is up to hyphenation patterns to judge.

But if somebody typesets, for example, nestruč"njak, they really want to typeset the letter nj. If disableligatures is set or the font is missing the respective ligature, polyglossia falls back to the separate n and j, and we must forbid hyphenation between them: it would be orthographically incorrect to leave any possibility of a break there (regardless of the hyphenation patterns, which might treat consecutive n and j / l and j / d and ž more or less strictly or pragmatically).

> ...or the hyphenation patterns would need to be corrected (rather than letting polyglossia block hyphenation points).

That would be possible and reasonable if all Croatian (and Serbian and other) editors started using Unicode digraphs exclusively instead of consecutive letters, which I believe will never happen. As long as consecutive letters are used in both contexts, hyphenation patterns really cannot judge their nature. But at least we can make sure that those who use Unicode digraphs do not get any incorrect hyphenation.
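
To see what the loaded patterns do with the two plainly typed words, something like the following could be compiled. This is only a sketch, assuming XeLaTeX or LuaLaTeX and a LaTeX kernel recent enough that \showhyphens works with Unicode fonts; it makes no claim about what the Croatian patterns actually return.

```latex
% Compile with XeLaTeX or LuaLaTeX.
\documentclass{article}
\usepackage{fontspec}
\usepackage{polyglossia}
\setmainlanguage{croatian}

\begin{document}
% Both words are typed "the common way", with consecutive n and j.
% The hyphenation points found by the patterns are written to the .log file.
\showhyphens{poluinjektivnost nestručnjak}

poluinjektivnost nestručnjak
\end{document}
```

Whatever the log shows for these two words is decided by the patterns alone, since nothing in the input marks the nj of nestručnjak as a digraph.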

@yannis1962

yannis1962 commented Apr 22, 2021 via email

@jspitz
Collaborator

jspitz commented Apr 22, 2021

I also still think this is a hyphenation pattern issue, not a polyglossia issue. @reutenauer what is your take?

@ivankokan
Contributor Author

ivankokan commented Apr 22, 2021

> Sorry for this but I join Jürgen in not understanding why you say "regardless of hyphenation patterns"? If you add nestručnjak as a pattern with a high even number between n and j, or as a hyphenation exception, then hyphenation is forbidden. What would make the system overrule the hyphenation patterns so that you worry that hyphenation may occur after all?

There are two options in order to suppress wrong hyphenations in this matter:

  1. Enrich the hyphenation patterns with all known patterns
    The process is not finite, and it could happen that e.g. word i(nj)e gets hyphenated like in-je because patterns contain "in" and "je". (The claim does not hold for the particular example, but it shows what could happen.) The hyphenation patterns yield mostly good but also bad results, they will never be perfect.
  2. Suppress in one place
    This is very finite. We do it at the place where we really know the context ("this is really a digraph, although it is typeset using two separate letters that must be kept together" - WYSIWYM). With this approach we firstly obey the orthography, and secondly stop the hyphenation process from producing something that is not (contextually) allowed (remember that the same two consecutive letters might represent two letters in some other case).
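
For a single word, the pattern-side approach (option 1) can be approximated with a hyphenation exception. A minimal sketch, assuming XeLaTeX or LuaLaTeX; the break points given for nestručnjak are for illustration only and have not been checked against Croatian orthography.

```latex
% Compile with XeLaTeX or LuaLaTeX.
\documentclass{article}
\usepackage{fontspec}
\usepackage{polyglossia}
\setmainlanguage{croatian}

\begin{document}
% Exceptions are stored for the language that is current when
% \hyphenation is called, so this is issued while Croatian is active.
% Only the listed break points are allowed; none falls between n and j.
% (The break points themselves are an illustrative assumption.)
\hyphenation{ne-struč-njak}

\showhyphens{nestručnjak}
nestručnjak
\end{document}
```

An exception covers exactly one word form, so inflected forms and compounds would each need their own entry, which is the concern raised under option 1.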

Additionally, I would prefer \nobreak or \nolinebreak because \mbox{...} would completely hide characters away from the hyphenation engine. For example, having in(\penalty 10000)je vs injekcija:

  • injekcija might still be hyphenated like in-je... if there is such rule/pattern
  • compound consisting of word i(nj)e would still be more hyphenable than in the case where the implementation is based on \mbox{...}.
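
The three candidates can be compared side by side with a test file along these lines. It is a sketch only: the macro names \njBox, \njNobreak, \njNolinebreak and \narrow are made up for this comparison and are not part of polyglossia, and no particular output is claimed, since the result depends on the engine and the loaded patterns.

```latex
% Compile with XeLaTeX or LuaLaTeX.
\documentclass{article}
\usepackage{fontspec}
\usepackage{polyglossia}
\setmainlanguage{croatian}

% Hypothetical fallbacks for the digraph nj, one per candidate command.
\newcommand*\njBox{\mbox{nj}}
\newcommand*\njNobreak{n\nobreak j}
\newcommand*\njNolinebreak{n\nolinebreak j}

% Typeset #1 in a very narrow column so that the break points TeX
% is willing to use become visible. \hspace{0pt} lets the first word
% of the paragraph be considered for hyphenation.
\newcommand*\narrow[1]{\parbox[t]{1pt}{\hspace{0pt}#1}\qquad\qquad\qquad}

\begin{document}
\noindent
\narrow{nestručnjak}%          the common way, patterns decide
\narrow{nestruč\njBox ak}%     fallback wrapped in \mbox
\narrow{nestruč\njNobreak ak}% fallback with \nobreak
\narrow{nestruč\njNolinebreak ak}% fallback with \nolinebreak
\end{document}
```

Comparing the four columns shows where breaks remain possible around the digraph with each variant.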

@yannis1962

yannis1962 commented Apr 22, 2021 via email

@ivankokan
Contributor Author

Comparing "contact"/"contact" vs "inje"/"injekcija" (or some compounds built of this):

  • The one who typesets "inje" and "injekcija" cannot expect the engine to know that it is really i(nj)e and i(n)(j)ekcija, the same as the system cannot know whether "contact" is a verb or a noun
  • The one who typesets "i"nje" and "injekcija" intentionally uses the digraph in one place; that digraph will be output either as the digraph from the font or as two consecutive characters. But they would definitely like a guarantee that "nj will not be hyphenated in any case. (And with a simple \nobreak, \nolinebreak or even \mbox{...} we have an opportunity to guarantee that.)

> People will not have the reflex of writing "nj, they expect the system to solve these issues, and are ready to take measures if the system behaves incorrectly. You could choose a general solution that privileges inertia (= no hyphenation) between n and j, so that in the worst case a word like poluinjektivnost (or bijou in Flemish) will not be hyphenated between n and j, not a big loss, there are many other places to break the word.

Exactly, they expect the system to deal with it because they are intentionally using it.

I mean, polyglossia is the one who introduces and maps the "nj to something for the later processing:

\newcommand*\hr@charifavailable[2]{%
   \ifcroatian@disableligatures
     \bgroup#2\egroup%
   \else
     \charifavailable{#1}{#2}%
   \fi%
}
% Provide croatian ligatures if available in current font
\def\xpg@hr@lig#1#2{%
  \bgroup%
  % 1. DŽ, Dž and dž
  \ifx#1D%
    \ifx#2Z\relax%
      \hr@charifavailable{01C4}{DŽ}%
    \else%
      \ifx#2z\relax
        \hr@charifavailable{01C5}{Dž}%
      \else
        D#2%
      \fi%
    \fi%
  \fi%
  \ifx#1d%
    \ifx#2z\relax
      \hr@charifavailable{01C6}{dž}%
    \else
      d#2%
    \fi%
  \fi%
  % 2. LJ, Lj and lj
  \ifx#1L%
    \ifx#2J\relax%
      \hr@charifavailable{01C7}{LJ}%
    \else%
      \ifx#2j\relax
        \hr@charifavailable{01C8}{Lj}%
      \else
        L#2%
      \fi%
    \fi%
  \fi%
  \ifx#1l%
    \ifx#2j\relax
      \hr@charifavailable{01C9}{lj}%
    \else
      l#2%
    \fi%
  \fi%
  % 3. NJ, Nj and nj
  \ifx#1N%
    \ifx#2J\relax%
      \hr@charifavailable{01CA}{NJ}%
    \else%
      \ifx#2j\relax
        \hr@charifavailable{01CB}{Nj}%
      \else
        N#2%
      \fi%
    \fi%
  \fi%
  \ifx#1n%
    \ifx#2j\relax
      \hr@charifavailable{01CC}{nj}%
    \else
      n#2%
    \fi%
  \fi%
  \egroup%
}

We/it know(s) what that "nj represents: a digraph, which is output either by picking the very digraph from the font or with two consecutive letters (with no hyphenation allowed in between). And that mapping is not complete if no penalty on hyphenation is made.

@yannis1962

yannis1962 commented Apr 22, 2021 via email

@ivankokan
Contributor Author

ivankokan commented Apr 22, 2021

> in2je. in3jek1cija when you type i(nj)e patterns are not even needed, but when, for some reason, your nj digraph is replaced by characters n and j then the pattern in2je. steps into action and prevents hyphenation. What am I missing in the above reasoning?

Assuming that you are aware of the fact that it is only one among infinitely many cases (don't remember the compounds and other words with the same roots) - of course, that infinity is not so big but still - you are not missing anything.

Let me get back to the beginning instead - to the semantics. I think we have lost the focus.

The shorthands "DZ, "Dz, "dz, "LJ, "Lj, "lj, "NJ, "Nj and "nj were introduced to help Croatian typesetters properly typeset the Croatian digraphs (which are considered to be letters just like the other 27 letters of the Croatian alphabet, 30 in total), either by:
a) picking them from the font if it contains them (U+01C4-U+01CC)
b) mimicking them using two consecutive "basic" characters if the font is missing them (or the user explicitly decides so).

In any case, those shorthands must be mapped to outputs:

  1. which must be treated as a unit, e.g. "whoever gets either a) Ǆ or b) DŽ later in the processing, the glyphs D and Ž must never be separated (because those two glyphs together form a single letter) - it happens automatically for a), the mapping must guarantee that the same holds for b)"
  2. respective outputs for a) and b) should provide more or less the same output ("normal" fonts containing our Unicode digraphs will not have any weird kerning within Unicode digraphs, meaning that the expected widths for a) and b) are equal).

Now consider the following MWE and the respective output with the current implementation:

\documentclass{article}

\usepackage{fontspec}
\setmainfont{Arial} % contains digraph dž

\usepackage{polyglossia}
\setmainlanguage[babelshorthands]{croatian}


\begin{document}

\Large

\selectlanguage[disableligatures=false]{croatian}

"dzamija "dzep "dzezva "dzihad fere"dza ho"dza "dzepni ho"dzin bri"dz "dzem "dzemper "dzentlmen "dzip "dzokej
"dzul "dzungla "dzentlmenski "dzokejski jedna"dzba naru"dzba svjedo"dzba bun"dzija bureg"dzija ćevab"dzija galam"dzija

\bigskip

\selectlanguage[disableligatures=true]{croatian}

"dzamija "dzep "dzezva "dzihad fere"dza ho"dza "dzepni ho"dzin bri"dz "dzem "dzemper "dzentlmen "dzip "dzokej
"dzul "dzungla "dzentlmenski "dzokejski jedna"dzba naru"dzba svjedo"dzba bun"dzija bureg"dzija ćevab"dzija galam"dzija

\end{document}

image

Local intervention at

\hr@charifavailable{01C6}{dž}%

changing it to

       \hr@charifavailable{01C6}{d\nobreak ž}%

produces the following:
image

So \nobreak addresses point 1, while the new output confirms point 2.
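
Extending the same idea to all nine fallbacks might look as follows. This is a standalone sketch, not the change made in polyglossia: \gluedpair is a hypothetical helper introduced here only to show the construction.

```latex
% Compile with XeLaTeX or LuaLaTeX.
\documentclass{article}
\usepackage{fontspec}
\usepackage{polyglossia}
\setmainlanguage{croatian}

% Hypothetical helper, not part of polyglossia: two glyphs with an
% infinite penalty between them, so a line break is never taken there.
\newcommand*\gluedpair[2]{#1\nobreak #2}

\begin{document}
\gluedpair{D}{Ž} \gluedpair{D}{ž} \gluedpair{d}{ž}

\gluedpair{L}{J} \gluedpair{L}{j} \gluedpair{l}{j}

\gluedpair{N}{J} \gluedpair{N}{j} \gluedpair{n}{j}

% Inside a word, as the fallback would appear:
\gluedpair{d}{ž}emper, nestruč\gluedpair{n}{j}ak
\end{document}
```

In the .ldf file the equivalent would be to write the second argument of each \hr@charifavailable call in this glued form, as done above for U+01C6.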

@yannis1962

yannis1962 commented Apr 22, 2021 via email

@ivankokan
Contributor Author

ivankokan commented Apr 22, 2021

> On 22 Apr 2021 at 22:21, Ivan Kokan @.***> wrote: Assuming that you are aware of the fact that it is only one among infinitely many cases (don't remember the compounds and other words with the same roots) - of course, that infinity is not so big but still - you are not missing anything.
>
> I do not wish to monopolize this thread and spend more time on it than necessary. When you say "that there are infinitely many cases" you in fact mean that we are talking of an open non-deterministic set of words, non-deterministic in the sense that there is no algorithm to obtain all cases. This may be occupational hazard (déformation professionnelle) as I'm working in Natural Language Processing, but your problem can be tackled by processing a large corpus. A corpus-driven approach may give an asymptotic solution to it. I'm not talking for nothing. If you wish we can start a project with the goal to obtain a generative list (generative in the sense of compound creation) of such words by extracting them from a large corpus. The result will not be perfect but it will be an improvement over the current situation (you don't want people to have to use the double quote when writing Croatian, TeX should be as natural and spontaneous as possible). And it will improve academic knowledge on Croatian language. And it also may be a lot of fun from the computer science point of view. One could apply machine learning methods to build a model classifying words into those with individual nj and those with digraphs, and then use that predictive model to build hyphenation patterns. A typical NLP project with a lot of fun. If you are interested in such a project, let us discuss it off-line. If not, I won't bother you anymore. Cheers, Yannis

I am not saying that hyphenation patterns cannot or should not be improved, even though I drastically described that as something unsolvable. Please do not take my comments that way. I think just the opposite: whenever it is possible to improve / enrich them, great!

In my latest comment I wanted to reset everything and point out that this is not about hyphenation; hyphenation is just where the side effects of the incomplete polyglossia shorthand definitions appear. I consider them incomplete because one digraph shorthand, e.g. "dz (the one who uses it wants a digraph, a letter, even if it must be mimicked), can be mapped either to ǆ (Unicode) or to dž (consecutive letters which must be treated as a unit).

WYSIWYM: "dz ("I want to typeset a Croatian digraph letter or at least make the output of this behave as one")

These two output possibilities should be as equal as possible for anyone/anything that takes them into processing/consideration (e.g. hyphenation). They are currently not, and they can be equal (at least more equal than they are at the moment). Simple as that.

Again, I am trying to point out that the current output for "dz (and others) is not behaving as a single letter in all cases - which it must.

@ivankokan
Contributor Author

ivankokan commented Apr 23, 2021

#500 (comment)

The Croatian alphabet consists of 30 letters:

  • A B C Č Ć D Ǆ Đ E F G H I J K L Ǉ M N Ǌ O P R S Š T U V Z Ž (the three specific letters are intentionally written here using their Unicode slots)
  • a letter is a unit and must be treated that way
  • of course, the 27 common letters cannot be broken (in any sense)
  • the same holds for the three specific letters: they must be treated as a unit
  • e.g. "dz is a shorthand to typeset the letter dž
    => the output must be one letter or it must mimic one letter
    => the output must have the properties of one letter, one of which is that it cannot be broken in any way
  • "dz (and the others) in the current implementation expands either to ǆ or to dž; the former is "a unit" as such, the latter is not, and that is a problem (how do we see it? in the code itself, and in the MWE's output in #497 (comment)).

Long story short: d and ž in the fallback part must somehow be "glued":

  • using \mbox{} keeps them together but produces bad things (because the hyphenation engine eventually gets dž in one case and a box in the other case)
    image
  • \nobreak, \nolinebreak - they are equivalent in this case, and \nobreak produces good results
  • something else?

So, the question is: how can we guarantee that the fallback part(s) are treated as a unit in later processing?

@ivankokan ivankokan changed the title Suppress hyphenation within Croatian (and any other) digraphs output as separate glyphs Croatian digraphs: make the fallback output (separate glyphs) treated as a unit in later processing Apr 23, 2021
@jspitz
Collaborator

jspitz commented Apr 23, 2021

If (1.) such digraphs are not supposed to be hyphenated, if (2.) entering them as two separate glyphs (alternatively) is orthographically valid and common, and if (1.) also applies to (2.), then this definitely must be addressed in the hyphenation patterns, not in polyglossia.

@ivankokan
Contributor Author

ivankokan commented Apr 24, 2021

> If (1.) such digraphs are not supposed to be hyphenated, if (2.) entering them as two separate glyphs (alternatively) is orthographically valid and common, and if (1.) also applies to (2.), then this definitely must be addressed in the hyphenation patterns, not in polyglossia.

  1. They are letters, cannot be "broken" in any fashion. ✅

  2. a) Entering them as two separate glyphs is indeed the common way (historically conditioned, because there are no special slots on the Croatian QWERTZ layout). ✅
    // Note that the same holds for the Serbian language and the Serbian QWERTZ layout (Latin). However, there are designated slots within the Serbian Cyrillic keyboard layout for the respective three letters, but there is no notion of a digraph in Cyrillic. See #497 (comment).
    b) What do you mean by "entering them as two separate glyphs is orthographically valid"? ❓ (I will try to answer, but I am not sure if I understood the question.)
    All respective letters consist of two glyphs - that is true.
    Hence it is orthographically valid to enter them as two separate glyphs (they are written that way, of course).
    But it is not valid to enter them as two separate letters. 🚫
    In other words: "two glyphs forming a digraph letter dž" is orthographically different than "two consecutive letters d and ž", e.g. džemper (6 letters, 7 glyphs) vs nadžbuka (8 glyphs, 8 letters).
    https://en.wikipedia.org/wiki/D%C5%BE
    Although several other languages (see below) also use the letter combination DŽ, they treat it as a pair of the letters D and Ž, not as a single distinct letter.
    In Croatian (and Serbian) you can have a letter dž in a word, and you can have consecutive letters d and ž (not forming a letter dž).

Considering the hyphenation... If the fallback for "dz is simply dž, there is no way to tell whether the dž within a word __dž_____ represents the letter dž or two consecutive letters d and ž (somewhat similar to contact the verb vs contact the noun mentioned in previous comments). But there is a way to denote that the fallback does represent the letter dž, isn't there?

After reading all previous comments again and again, it seems that we have not agreed on what the fallback should actually be (conceptually). Can we first try to decide on that?

a) it should be the same as what Croatian (and Serbian) editors commonly enter, i.e. two consecutive letters
// two consecutive letters (it looks the same as two consecutive glyphs but those glyphs are not treated together as a unit) <- orthographically incorrect ➖
// hyphenating in between <- orthographically incorrect (but probably mostly discouraged by the patterns) ➖
// after all, it is common 👍 : Unicode representations of the letter are very rarely used in digital media, which tends to favor the corresponding two-character combinations.
b) a mimic of a (digraph) letter => it must behave more or less similar to the Unicode digraph when present in the font
// two consecutive glyphs somehow "glued" not to be "broken" anytime <- orthographically correct 👍
// ...that would lead to suppressed hyphenation in between <- orthographically correct 👍
// Unicode representations of the letter are very rarely used in digital media ➖, which tends to favor the corresponding two-character combinations. - But hey, the digraph shorthand exists for some reason! 🎯

Obviously, my preference is b) with the following reasoning (check the following MWE):

  • one will use "The common way approach" if one does not want Unicode digraphs at all
  • "dz WYSIWYM: "I want the Unicode digraph or at least something that will mimic it, expecting it to have the same properties and the same output"

MWE:

\documentclass{article}

\usepackage{fontspec}
\setmainfont{Arial} % contains digraph dž

\usepackage{polyglossia}
\setmainlanguage[babelshorthands]{croatian}


\begin{document}

\LARGE

\section{Using Unicode slots}
\selectlanguage[disableligatures=false]{croatian}
"dzamija "dzep "dzezva "dzihad fere"dza ho"dza "dzepni ho"dzin bri"dz "dzem "dzemper "dzentlmen "dzip "dzokej
"dzul "dzungla "dzentlmenski "dzokejski jedna"dzba naru"dzba svjedo"dzba bun"dzija bureg"dzija ćevab"dzija galam"dzija

\section{Fallback}
\selectlanguage[disableligatures=true]{croatian}
"dzamija "dzep "dzezva "dzihad fere"dza ho"dza "dzepni ho"dzin bri"dz "dzem "dzemper "dzentlmen "dzip "dzokej
"dzul "dzungla "dzentlmenski "dzokejski jedna"dzba naru"dzba svjedo"dzba bun"dzija bureg"dzija ćevab"dzija galam"dzija

\section{The common way}
džamija džep džezva džihad feredža hodža džepni hodžin bridž džem džemper džentlmen džip džokej
džul džungla džentlmenski džokejski jednadžba narudžba svjedodžba bundžija buregdžija ćevabdžija galamdžija

\end{document}

P.S. Additionally, maybe setting disableligatures=true should be interpreted as "I do not want to have anything to do with digraphs, fall back to the common way, e.g. dž", but if disableligatures=false and the font is missing the Unicode digraphs, the fallback should mimic the digraph? (A mixed solution, more complicated, but the reasoning is valid. We already have three cases in the code itself.) A sketch:

\newcommand*\hr@charifavailable[3]{%
   \ifcroatian@disableligatures
     \bgroup#3\egroup% "I do not want digraphs at all"
   \else
     \charifavailable{#1}{#2}% "I want digraph, either original or a mimic of one"
   \fi%
}

\hr@charifavailable{01C6}{d\nobreak ž}{dž}%

Should Section 2 be like Section 1 or Section 3?
Is the answer the same as when using Latin Modern instead, which does not contain Unicode digraphs?

"What do digraph shorthands represent?" is the real question, not "How do we implement it?" - that comes later.
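
The three cases of the mixed solution can be tried out in isolation with a standalone file such as the one below. It is a sketch under stated assumptions: all \demo... names are hypothetical, the switch \ifdemo@plain stands in for disableligatures, and the e-TeX primitive \iffontchar is used here in place of polyglossia's \charifavailable, whose internals are not shown in this thread.

```latex
% Compile with XeLaTeX or LuaLaTeX.
\documentclass{article}
\usepackage{fontspec}

\makeatletter
% Hypothetical stand-in for polyglossia's disableligatures switch.
\newif\ifdemo@plain
% #1 = code point (hex), #2 = mimic of the digraph, #3 = the common way
\newcommand*\demo@digraph[3]{%
  \ifdemo@plain
    #3% "I do not want digraphs at all"
  \else
    \iffontchar\font"#1\relax
      \char"#1\relax % the current font has the Unicode digraph
    \else
      #2% the font lacks it: use the glued mimic
    \fi
  \fi
}
\newcommand*\demodz{\demo@digraph{01C6}{d\nobreak ž}{dž}}
\newcommand*\demoplain{\demo@plaintrue}
\makeatother

\begin{document}
% Digraph from the font if present, otherwise d\nobreak ž:
\demodz emper

% After switching, plain d followed by ž:
\demoplain
\demodz emper
\end{document}
```

Changing the font with \setmainfont shows the first two cases; the switch selects the third regardless of the font.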

@ivankokan ivankokan changed the title Croatian digraphs: make the fallback output (separate glyphs) treated as a unit in later processing Croatian digraphs: make the fallback output (consecutive glyphs) treated as a unit in later processing Apr 24, 2021
ivankokan added a commit to ivankokan/polyglossia that referenced this issue Apr 24, 2021
@ivankokan ivankokan changed the title Croatian digraphs: make the fallback output (consecutive glyphs) treated as a unit in later processing Croatian digraphs: When the font is missing Unicode digraph, make the fallback output treated as a unit in later processing Apr 25, 2021
@ivankokan ivankokan changed the title Croatian digraphs: When the font is missing Unicode digraph, make the fallback output treated as a unit in later processing Croatian digraphs: When the font is missing Unicode digraph and the one is to be output, mimic it Apr 25, 2021
@ivankokan ivankokan changed the title Croatian digraphs: When the font is missing Unicode digraph and the one is to be output, mimic it Croatian digraphs: When the digraph is to be output and the font is missing one, mimic it Apr 25, 2021
@jspitz
Collaborator

jspitz commented Dec 10, 2021

PR merged and amended with df87338

@jspitz jspitz added the FIXED IN DEV This bug is fixed for the next release label Dec 10, 2021
@jspitz jspitz added this to the 1.54 milestone Feb 17, 2022
@jspitz jspitz closed this as completed Mar 27, 2022
@jspitz jspitz removed the FIXED IN DEV This bug is fixed for the next release label Mar 27, 2022