Croatian digraphs: When the digraph is to be output and the font is missing one, mimic it #497

ivankokan · 2021-04-18T20:55:01Z

Related to #216.

Hyphenation within "fallback" DŽ, Dž, dž, LJ, Lj, lj, NJ, Nj and nj must be suppressed (regardless of what hyphenation patterns provide). I am not sure which among \mbox{}, \nobreak, \nolinebreak etc. would be correct (or the most appropriate, at least).

Other:

https://github.com/reutenauer/polyglossia/blob/master/doc/polyglossia.tex#L2285-L2286 (Also, is such forbidden hyphenation somehow defined within https://github.com/jspitz/babel-german?)
Future analogous support for Serbian Latin: "Support Latin <-> Cyrillic transliteration and Latin digraphs for Serbian" (#483).


jspitz · 2021-04-20T16:44:11Z

In this case, \mbox seems most suitable, as it allows for this one-line fix:

diff --git a/tex/gloss-croatian.ldf b/tex/gloss-croatian.ldf
index 54074f0..001f8ea 100644
--- a/tex/gloss-croatian.ldf
+++ b/tex/gloss-croatian.ldf
@@ -112,7 +112,7 @@
    \ifcroatian@disableligatures
      \bgroup#2\egroup%
    \else
-     \charifavailable{#1}{#2}%
+     \charifavailable{#1}{\mbox{#2}}%
    \fi%
 }

Please test whether this works for you.

ivankokan · 2021-04-21T22:13:58Z

Actually, if disableligatures is set, then the hyphenation must be suppressed for both occurrences of #2.

(What is the purpose of grouping within \ifcroatian@disableligatures true block, can that be omitted?)

jspitz · 2021-04-22T06:18:10Z

One thing that keeps me wondering: If hyphenation occurs at this place, either this is orthographically correct (and we shouldn't suppress it) or the hyphenation patterns would need to be corrected (rather than letting polyglossia block hyphenation points).

ivankokan · 2021-04-22T09:22:05Z

> One thing that keeps me wondering: If hyphenation occurs at this place, either this is orthographically correct (and we shouldn't suppress it) or the hyphenation patterns would need to be corrected (rather than letting polyglossia block hyphenation points).

Consider these two words: poluinjektivnost (semi-injectivity) and nestručnjak (nonprofessional). The first one contains separate letters n and j, the second one contains the letter nj.

Note the way I wrote them here - not using the Unicode digraph ǌ for the latter, but consecutive n and j in both words. That is (fortunately or unfortunately, but mostly for historical reasons) the common way of typing: the Croatian/Serbian QWERTZ layout has no slots for digraphs at all, whereas the Serbian Cyrillic layout does have slots for the corresponding letters (which are not considered digraphs in Cyrillic, though).

When it comes to hyphenation, for the first word it is not forbidden to hyphenate between n and j because they are (etymologically) different letters. On the other hand, it is forbidden to hyphenate between n and j in nestručnjak, simply because those two consecutive n and j denote one letter in Croatian. (Note that I used "forbidden", not "good", "bad" or something similar.)

So, if both cases are typeset the common way (not using Unicode digraphs when applicable), it is up to hyphenation patterns to judge.
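A minimal sketch for inspecting what the patterns currently do with the two words typed the common way. The helper name \showbreaks is hypothetical (not part of polyglossia); it relies on the usual trick of a zero-width column, which makes TeX break at every hyphenation point it considers legal. The result depends on the installed Croatian patterns and on the engine, so no expected output is given.

```latex
% Compile with XeLaTeX or LuaLaTeX.
\documentclass{article}
\usepackage{polyglossia}
\setmainlanguage{croatian}

% Hypothetical helper: a zero-width column makes TeX use every
% hyphenation point it considers legal, one fragment per line.
% \hspace{0pt} lets the first (and only) word be hyphenated.
\newcommand*\showbreaks[1]{%
  \noindent\parbox[t]{0pt}{\hspace{0pt}#1}\hspace{5cm}}

\begin{document}
\showbreaks{poluinjektivnost}% n and j are separate letters here
\showbreaks{nestručnjak}%      nj is a single letter here
\end{document}
```

Overfull-box warnings are expected with this setup; only the positions of the hyphens in the output matter.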

But if somebody typesets, for example, nestruč"njak, they really want to typeset the letter nj. If disableligatures is set or the font lacks the respective ligature, polyglossia will fall back to the separate n and j, between which we must forbid hyphenation - it would be orthographically incorrect to leave any possibility of hyphenating there (regardless of the hyphenation patterns, which might treat consecutive n and j / l and j / d and ž more or less strictly or pragmatically).

> ...or the hyphenation patterns would need to be corrected (rather than letting polyglossia block hyphenation points).

That would be possible and reasonable if all Croatian (and Serbian and other) editors started using Unicode digraphs exclusively instead of consecutive letters, which I do not expect ever to happen. As long as consecutive letters are used in both contexts, hyphenation patterns really cannot judge their nature. But at least we can make sure that those who use Unicode digraphs do not get any incorrect hyphenation.

yannis1962 · 2021-04-22T10:09:43Z

Sorry for this but I join Jürgen in not understanding why you say "regardless of hyphenation patterns"? If you add nestručnjak as a pattern with a high even number between n and j, or as a hyphenation exception, then hyphenation is forbidden. What would make the system overrule the hyphenation patterns so that you worry that hyphenation may occur after all?
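For reference, a minimal sketch of the hyphenation-exception route mentioned above. In pattern files, odd digits between two letters allow a break and even digits inhibit one, with the highest digit winning; an exception list instead spells out the permitted breaks for one exact word form. The break points chosen here (ne-struč-njak) are an assumption made for illustration only.

```latex
% Compile with XeLaTeX or LuaLaTeX.
\documentclass{article}
\usepackage{polyglossia}
\setmainlanguage{croatian}

\begin{document}
% Exceptions are stored for the language that is active when they are
% declared, so this is done after \begin{document}, where Croatian is
% already selected.
% Only the listed break points remain; none of them falls between n and j.
\hyphenation{ne-struč-njak}

\noindent\parbox[t]{0pt}{\hspace{0pt}nestručnjak}
\end{document}
```

An exception covers only the exact word form listed, so inflected forms and compounds would each need their own entry (or a pattern).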


jspitz · 2021-04-22T10:33:14Z

I still think this is a hyphenation pattern issue, not a polyglossia issue. @reutenauer what is your take?

ivankokan · 2021-04-22T10:42:25Z

> Sorry for this but I join Jürgen in not understanding why you say "regardless of hyphenation patterns"? If you add nestručnjak as a pattern with a high even number between n and j, or as a hyphenation exception, then hyphenation is forbidden. What would make the system overrule the hyphenation patterns so that you worry that hyphenation may occur after all?

There are two options in order to suppress wrong hyphenations in this matter:

1. Enrich the hyphenation patterns with all known patterns
   The process is not finite, and it could happen that e.g. the word i(nj)e gets hyphenated as in-je because the patterns contain "in" and "je". (The claim does not hold for this particular example, but it shows what could happen.) Hyphenation patterns yield mostly good but also some bad results; they will never be perfect.
2. Suppress in one place
   This is very finite: it happens at the one place where we really know the context ("this is really a digraph - although it is typeset using two separate letters that must be kept together" - WYSIWYM). With this approach we firstly obey the orthography, and secondly stop the hyphenation process from producing something that is not (contextually) allowed (remember that the same two consecutive letters might represent two letters in some other case).

Additionally, I would prefer \nobreak or \nolinebreak because \mbox{...} would completely hide characters away from the hyphenation engine. For example, having in(\penalty 10000)je vs injekcija:

- injekcija might still be hyphenated like in-je... if there is such a rule/pattern
- a compound containing the word i(nj)e would still be more hyphenable than with an implementation based on \mbox{...}.
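A minimal sketch for comparing the two candidates side by side, assuming a word in which nj is a real digraph. The helper \showbreaks is hypothetical; the zero-width column forces every break TeX is willing to make. How much of the word around the digraph stays hyphenable differs between engines (the classic TeX algorithm ends a word at the first non-letter, so a penalty and a box do not behave alike), which is why no expected output is stated.

```latex
% Compile with XeLaTeX or LuaLaTeX and compare the results per engine.
\documentclass{article}
\usepackage{polyglossia}
\setmainlanguage{croatian}

% Hypothetical helper: zero-width column, one fragment per line.
\newcommand*\showbreaks[1]{%
  \noindent\parbox[t]{0pt}{\hspace{0pt}#1}\hspace{3.5cm}}

\begin{document}
\showbreaks{nestručnjak}%           plain input: the patterns decide
\showbreaks{nestručn\nobreak jak}%  penalty 10000 between n and j
\showbreaks{nestruč\mbox{nj}ak}%    n and j hidden inside a box
\end{document}
```

In both modified variants no line break can occur between n and j; what the sketch is meant to reveal is which of the other break points survive.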

yannis1962 · 2021-04-22T10:53:56Z

On 22 Apr 2021 at 12:42, Ivan Kokan wrote:

> There are two options in order to suppress wrong hyphenations in this matter: Enrich the hyphenation patterns with all known patterns. The process is not finite, and it could happen that e.g. word i(nj)e gets hyphenated like in-je because patterns contain "in" and "je". (The claim does not hold for the particular example, but it shows what could happen.) The hyphenation patterns yield mostly good but also bad results, they will never be perfect.

There are two ways of "not being perfect":

1) The systemic error: for example, hyphenation that depends on semantics, while the patterns have no access to that level, like the English "contact", hyphenated differently depending on whether it is a verb or a noun. This is impossible to deal with in the current TeX hyphenation mechanism.
2) The error for lack of information: if you provide patterns with the exact rules of hyphenation, then these will always be applied. And if a word is discovered that is not covered by the rules, we can always update the patterns. So, asymptotically, the patterns will be perfect.

People will not have the reflex of writing "nj; they expect the system to solve these issues and are ready to take measures if the system behaves incorrectly. You could choose a general solution that privileges inertia (= no hyphenation) between n and j, so that in the worst case a word like poluinjektivnost (or bijou in Flemish) will not be hyphenated between n and j. That is not a big loss; there are many other places to break the word. In a regular text, absence of hyphenation at some point is not a problem. Wrong hyphenations must be avoided at all costs. And with patterns you can do it, provided you update your patterns whenever new, unexpected cases appear.

ivankokan · 2021-04-22T11:11:30Z

Comparing "contact"/"contact" vs "inje"/"injekcija" (or some compounds built of this):

- The one who typesets "inje" and "injekcija" cannot expect the engine to know that it is really i(nj)e and i(n)(j)ekcija, just as the system cannot know whether "contact" is a verb or a noun.
- The one who typesets "i"nje" and "injekcija" intentionally uses a digraph in one place; that digraph will be output either with the digraph from the font or with two consecutive characters. But that person would definitely like a guarantee that "nj will not be hyphenated in any case. (And with a simple \nobreak, \nolinebreak or even \mbox{...} we have an opportunity to guarantee that.)

> People will not have the reflex of writing "nj, they expect the system to solve these issues, and are ready to take measures if the system behaves incorrectly. You could choose a general solution that privileges inertia (= no hyphenation) between n and j, so that in the worst case a word like poluinjektivnost (or bijou in Flemish) will not be hyphenated between n and j, not a big loss, there are many other places to break the word.

Exactly, they expect the system to deal with it because they are intentionally using it.

I mean, polyglossia is the one that introduces "nj and maps it to something for later processing:

polyglossia/tex/gloss-croatian.ldf

Lines 111 to 180 in 09973c8

    
    \newcommand*\hr@charifavailable[2]{%
       \ifcroatian@disableligatures
         \bgroup#2\egroup%
       \else
         \charifavailable{#1}{#2}%
       \fi%
    }
    % Provide croatian ligatures if available in current font
    \def\xpg@hr@lig#1#2{%
     \bgroup%
      % 1. DŽ, Dž and dž
      \ifx#1D%
        \ifx#2Z\relax%
           \hr@charifavailable{01C4}{DŽ}%
        \else%
           \ifx#2z\relax
              \hr@charifavailable{01C5}{Dž}%
           \else
               D#2%
           \fi%
        \fi%
      \fi%
      \ifx#1d%
        \ifx#2z\relax
           \hr@charifavailable{01C6}{dž}%
        \else
           d#2%
        \fi%
      \fi%
      % 2. LJ, Lj and lj
      \ifx#1L%
        \ifx#2J\relax%
           \hr@charifavailable{01C7}{LJ}%
        \else%
           \ifx#2j\relax
              \hr@charifavailable{01C8}{Lj}%
           \else
               L#2%
           \fi%
        \fi%
      \fi%
      \ifx#1l%
        \ifx#2j\relax
           \hr@charifavailable{01C9}{lj}%
        \else
           l#2%
        \fi%
      \fi%
      % 2. NJ, Nj and nj
      \ifx#1N%
        \ifx#2J\relax%
           \hr@charifavailable{01CA}{NJ}%
        \else%
           \ifx#2j\relax
              \hr@charifavailable{01CB}{Nj}%
           \else
               N#2%
           \fi%
        \fi%
      \fi%
      \ifx#1n%
        \ifx#2j\relax
           \hr@charifavailable{01CC}{nj}%
        \else
           n#2%
        \fi%
      \fi%
      \egroup%
    }

We/it know(s) what that "nj represents: a digraph, which is output either by picking the very digraph from the font or with two consecutive letters (with no hyphenation allowed in between). And that mapping is not complete if no penalty on hyphenation is made.

yannis1962 · 2021-04-22T11:23:34Z

in2je.
in3jek1cija

When you type i(nj)e, patterns are not even needed; but when, for some reason, your nj digraph is replaced by the characters n and j, the pattern in2je. steps into action and prevents hyphenation. What am I missing in the above reasoning?


ivankokan · 2021-04-22T20:20:54Z

> in2je. in3jek1cija when you type i(nj)e patterns are not even needed, but when, for some reason, your nj digraph is replaced by characters n and j then the pattern in2je. steps into action and prevents hyphenation. What am I missing in the above reasoning?

Assuming that you are aware that it is only one among infinitely many cases (do not forget the compounds and other words with the same roots) - of course, that infinity is not so big, but still - you are not missing anything.

Let me get back to the beginning instead - to the semantics. I think we have lost the focus.

The shorthands "DZ, "Dz, "dz, "LJ, "Lj, "lj, "NJ, "Nj and "nj were introduced to help Croatian typesetters properly typeset the Croatian digraphs (which are considered to be letters just like the other 27 letters of the Croatian alphabet, 30 in total), either by:

a) picking them from the font if it contains them (U+01C4-U+01CC), or
b) mimicking them using two consecutive "basic" characters if the font is missing them (or the user explicitly decides so).

In any case, those shorthands must be mapped to outputs:

1. which must be treated as a unit, e.g. "whoever gets either a) Ǆ or b) DŽ later in the processing, the glyphs D and Ž must never be separated (because those two glyphs together form a single letter) - this happens automatically for a); the mapping must guarantee that the same holds for b)"
2. which should look more or less the same for a) and b) ("normal" fonts containing our Unicode digraphs will not have any weird kerning within the Unicode digraphs, meaning that the expected widths for a) and b) are equal).

Now consider the following MWE and the respective output with the current implementation:

\documentclass{article}

\usepackage{fontspec}
\setmainfont{Arial} % contains digraph ǆ

\usepackage{polyglossia}
\setmainlanguage[babelshorthands]{croatian}


\begin{document}

\Large

\selectlanguage[disableligatures=false]{croatian}

"dzamija "dzep "dzezva "dzihad fere"dza ho"dza "dzepni ho"dzin bri"dz "dzem "dzemper "dzentlmen "dzip "dzokej
"dzul "dzungla "dzentlmenski "dzokejski jedna"dzba naru"dzba svjedo"dzba bun"dzija bureg"dzija ćevab"dzija galam"dzija

\bigskip

\selectlanguage[disableligatures=true]{croatian}

"dzamija "dzep "dzezva "dzihad fere"dza ho"dza "dzepni ho"dzin bri"dz "dzem "dzemper "dzentlmen "dzip "dzokej
"dzul "dzungla "dzentlmenski "dzokejski jedna"dzba naru"dzba svjedo"dzba bun"dzija bureg"dzija ćevab"dzija galam"dzija

\end{document}

Local intervention at

polyglossia/tex/gloss-croatian.ldf

Line 136 in fd838c4

\hr@charifavailable{01C6}{dž}%

changing it to

       \hr@charifavailable{01C6}{d\nobreak ž}%

produces the following:

So \nobreak addresses point 1, while the new output confirms point 2.
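The same idea could presumably be applied to all nine fallbacks. A minimal standalone sketch, assuming a hypothetical helper \glued that is not part of polyglossia and is not the patch that was eventually merged:

```latex
% Compile with XeLaTeX or LuaLaTeX.
\documentclass{article}
\usepackage{polyglossia}
\setmainlanguage{croatian}

% Hypothetical helper: two glyphs with \nobreak (\penalty10000) in
% between, so that no line break can be inserted between them.
\newcommand*\glued[2]{#1\nobreak #2}

\begin{document}
% Words that contain a digraph letter
\glued{d}{ž}emper, \glued{l}{j}ubav, \glued{n}{j}ega

% The remaining case variants
\glued{D}{Ž}, \glued{D}{ž}, \glued{L}{J}, \glued{L}{j}, \glued{N}{J}, \glued{N}{j}
\end{document}
```

Note that the penalty also sits between the two glyphs as far as the font is concerned, so any kerning between them is not applied in this sketch.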


yannis1962 · 2021-04-22T20:55:04Z

On 22 Apr 2021 at 22:21, Ivan Kokan wrote:

> Assuming that you are aware of the fact that it is only one among infinitely many cases (don't remember the compounds and other words with the same roots) - of course, that infinity is not so big but still - you are not missing anything.

I do not wish to monopolize this thread or spend more time on it than necessary. When you say that "there are infinitely many cases", you in fact mean that we are talking about an open, non-deterministic set of words, non-deterministic in the sense that there is no algorithm to obtain all cases. This may be an occupational hazard (déformation professionnelle), as I work in Natural Language Processing, but your problem can be tackled by processing a large corpus. A corpus-driven approach may give an asymptotic solution to it.

I'm not talking for nothing. If you wish, we can start a project with the goal of obtaining a generative list (generative in the sense of compound creation) of such words by extracting them from a large corpus. The result will not be perfect, but it will be an improvement over the current situation (you don't want people to have to use the double quote when writing Croatian; TeX should be as natural and spontaneous as possible). It will also improve academic knowledge of the Croatian language, and it may be a lot of fun from the computer science point of view. One could apply machine learning methods to build a model classifying words into those with individual n and j and those with digraphs, and then use that predictive model to build hyphenation patterns. A typical NLP project with a lot of fun.

If you are interested in such a project, let us discuss it off-line. If not, I won't bother you anymore.

Cheers, Yannis

ivankokan · 2021-04-22T21:15:12Z


I am not saying that hyphenation patterns cannot or should not be improved, even though I drastically described it as something that is non-solvable. Please do not get my comments that way. I think just the opposite: whenever it is possible to improve / enrich them, great!

In my latest comment I wanted to reset everything and point out this is not about hyphenation; hyphenation is just something where the side effects of incomplete polyglossia shorthands definitions appear. I consider them to be incomplete because one digraph shorthand, e.g. "dz (the one who uses it wants a digraph, a letter, even if it must be mimicked) can be mapped either to ǆ (Unicode) or dž (consecutive letters which must be treated as a unit).

WYSIWYM: "dz ("I want to typeset a Croatian digraph letter or at least make the output of this behave as one")

These two output possibilities should be as equal as possible for anyone/anything that takes them into processing/consideration (e.g. hyphenation). They are currently not, and they can be equal (at least more equal than they are at the moment). Simple as that.

Again, I am trying to point out that the current output for "dz (and others) is not behaving as a single letter in all cases - which it must.

ivankokan · 2021-04-23T08:54:34Z

#500 (comment)

The Croatian alphabet consists of 30 letters:

A B C Č Ć D Ǆ Đ E F G H I J K L Ǉ M N Ǌ O P R S Š T U V Z Ž (the three specific letters are intentionally written here using their Unicode slots)

- a letter is a unit and must be treated that way
- of course, the 27 common letters cannot be broken (in any sense)
- for the three specific letters the same holds - they must be treated as a unit
- e.g. "dz: it is a shorthand to typeset the letter ǆ
  => the output must be one letter or it must mimic one letter
  => the output must have the properties of one letter, among which is that it cannot be broken in any way
- "dz (and others) in the current implementation is expanded either to ǆ or to dž; the former is "a unit" as such, the latter is not - and that is a problem (how do we see it? in the code itself, and in the MWE's output in #497 (comment)).

Long story short: d and ž in the fallback part must somehow be "glued":

- using \mbox{} keeps them together but produces bad things (because the hyphenation engine eventually gets ǆ in one case and a box in the other case)
- \nobreak, \nolinebreak - they are equivalent in this case, and \nobreak produces good results
- something else?

So, the question is: how can we guarantee that the fallback part(s) are treated as a unit in later processing?

jspitz · 2021-04-23T09:11:33Z

If (1.) such digraphs are not supposed to be hyphenated, if (2.) entering them as two separate glyphs (alternatively) is orthographically valid and common, and if (1.) also applies to (2.), then this definitely must be addressed in the hyphenation patterns, not in polyglossia.

ivankokan · 2021-04-24T02:23:56Z

> If (1.) such digraphs are not supposed to be hyphenated, if (2.) entering them as two separate glyphs (alternatively) is orthographically valid and common, and if (1.) also applies to (2.), then this definitely must be addressed in the hyphenation patterns, not in polyglossia.

1. They are letters and cannot be "broken" in any fashion. ✅
2. a) Entering them as two separate glyphs is indeed the common way (for historical reasons, because there are no special slots on the Croatian QWERTZ layout). ✅
   // Note that the same holds for the Serbian language and the Serbian QWERTZ layout (Latin). However, there are designated slots within the Serbian Cyrillic keyboard layout for the respective three letters, but there is no notion of a digraph in Cyrillic (see #497 (comment)).
   b) What do you mean by "entering them as two separate glyphs is orthographically valid"? ❓ (I will try to answer, but I am not sure whether I understood the question.)
   - All respective letters consist of two glyphs - that is true.
   - Hence it is orthographically valid to enter them as two separate glyphs (they are written that way, of course).
   - But it is not valid to enter them as two separate letters. 🚫
   - In other words: "two glyphs forming a digraph letter ǆ" is orthographically different from "two consecutive letters d and ž", e.g. ǆemper (6 letters, 7 glyphs) vs nadžbuka (8 letters, 8 glyphs).
   - https://en.wikipedia.org/wiki/D%C5%BE: "Although several other languages (see below) also use the letter combination DŽ, they treat it as a pair of the letters D and Ž, not as a single distinct letter."
   - In Croatian (and Serbian) you can have the letter ǆ in a word, and you can have consecutive letters d and ž (not forming the letter ǆ).

Considering the hyphenation... If the fallback for "dz is simply dž, there is no way to tell whether the dž within a word __dž_____ represents the letter ǆ or two consecutive letters d and ž (somewhat similar to the contact verb vs contact noun case mentioned in previous comments). But there is a way to denote that the fallback does represent a ǆ - isn't there?

After reading all previous comments again and again, it seems that we have not agreed on what the fallback should actually be (conceptually). Can we first try to decide on that?

a) it should be the same as what Croatian (and Serbian) editors commonly enter, i.e. two consecutive letters
   - two consecutive letters (they look the same as two consecutive glyphs, but those glyphs are not treated together as a unit) <- orthographically incorrect ➖
   - hyphenating in between <- orthographically incorrect (but probably mostly discouraged by the patterns) ➖
   - after all, it is common 👍: "Unicode representations of the letter are very rarely used in digital media, which tends to favor the corresponding two-character combinations."
b) a mimic of a (digraph) letter => it must behave more or less like the Unicode digraph when that is present in the font
   - two consecutive glyphs somehow "glued" so that they are never "broken" <- orthographically correct 👍
   - ...that would lead to suppressed hyphenation in between <- orthographically correct 👍
   - "Unicode representations of the letter are very rarely used in digital media ➖, which tends to favor the corresponding two-character combinations." - But hey, the digraph shorthand exists for some reason! 🎯

Obviously, my preference is b) with the following reasoning (check the following MWE):

- one will use "The common way" approach if one does not want Unicode digraphs at all
- "dz WYSIWYM: "I want the Unicode digraph or at least something that will mimic it, expecting it to have the same properties and the same output"

MWE:

\documentclass{article}

\usepackage{fontspec}
\setmainfont{Arial} % contains digraph ǆ

\usepackage{polyglossia}
\setmainlanguage[babelshorthands]{croatian}


\begin{document}

\LARGE

\section{Using Unicode slots}
\selectlanguage[disableligatures=false]{croatian}
"dzamija "dzep "dzezva "dzihad fere"dza ho"dza "dzepni ho"dzin bri"dz "dzem "dzemper "dzentlmen "dzip "dzokej
"dzul "dzungla "dzentlmenski "dzokejski jedna"dzba naru"dzba svjedo"dzba bun"dzija bureg"dzija ćevab"dzija galam"dzija

\section{Fallback}
\selectlanguage[disableligatures=true]{croatian}
"dzamija "dzep "dzezva "dzihad fere"dza ho"dza "dzepni ho"dzin bri"dz "dzem "dzemper "dzentlmen "dzip "dzokej
"dzul "dzungla "dzentlmenski "dzokejski jedna"dzba naru"dzba svjedo"dzba bun"dzija bureg"dzija ćevab"dzija galam"dzija

\section{The common way}
džamija džep džezva džihad feredža hodža džepni hodžin bridž džem džemper džentlmen džip džokej
džul džungla džentlmenski džokejski jednadžba narudžba svjedodžba bundžija buregdžija ćevabdžija galamdžija

\end{document}

P.S. Additionally, maybe setting disableligatures=true should be interpreted as "I do not want to have anything to do with digraphs, fall back to the common way, e.g. dž", whereas if disableligatures=false and the font is missing the Unicode digraphs, the fallback should mimic the digraph? (A mixed solution, more complicated, but there is valid reasoning behind it. We already have three cases in the code itself.) A sketch:

\newcommand*\hr@charifavailable[3]{%
   \ifcroatian@disableligatures
     \bgroup#3\egroup% "I do not want digraphs at all"
   \else
     \charifavailable{#1}{#2}% "I want digraph, either original or a mimic of one"
   \fi%
}

\hr@charifavailable{01C6}{d\nobreak ž}{dž}%
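
For experimenting with this three-argument idea outside polyglossia, a stand-alone approximation could look like the sketch below. All \demo... names are hypothetical, and \demo@charifavailable is only assumed to behave like polyglossia's \charifavailable (print the glyph if the current font has it, otherwise the fallback). The mimic uses \mbox, as suggested earlier in the thread; d\nobreak ž could be substituted.

```latex
% Compile with XeLaTeX or LuaLaTeX.
\documentclass{article}
\usepackage{fontspec} % default Latin Modern is assumed to lack U+01C6

\makeatletter
% Hypothetical stand-in for polyglossia's \charifavailable:
% #1 = hex code point, #2 = fallback if the current font lacks the glyph.
\newcommand*\demo@charifavailable[2]{%
  \iffontchar\font"#1 \char"#1 \else#2\fi}

% Hypothetical switch mirroring the disableligatures option.
\newif\ifdemo@disableligatures

% #1 = code point, #2 = mimic of the digraph, #3 = plain two-letter spelling.
\newcommand*\demo@digraph[3]{%
  \ifdemo@disableligatures
    #3% "I do not want digraphs at all"
  \else
    \demo@charifavailable{#1}{#2}% digraph glyph, or a mimic of it
  \fi}

\newcommand*\demodz{\demo@digraph{01C6}{\mbox{dž}}{dž}}
\newcommand*\demoplaindigraphs{\demo@disableligaturestrue}
\makeatother

\begin{document}

% Digraph wanted: the glyph if the font has it, else "dž" in an \mbox.
ho\demodz{}a, svjedo\demodz{}ba

% Digraphs not wanted: plain spelling, hyphenation left to the patterns.
\demoplaindigraphs
ho\demodz{}a, svjedo\demodz{}ba

\end{document}
```

One caveat of this design: in classical TeX, a box or penalty in the middle of a word also ends the stretch of letters considered for hyphenation, so the remainder of the word may no longer be hyphenated and should be checked separately.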

Should Section 2 be like Section 1 or Section 3?
Is the answer the same when using Latin Modern instead, which does not contain the Unicode digraphs?

The real question is "What do digraph shorthands represent?", not "How do we implement it?"; that comes later.


jspitz · 2021-12-10T15:25:06Z

PR merged and amended with df87338

ivankokan added a commit to ivankokan/polyglossia that referenced this issue Apr 22, 2021

Update gloss-croatian.ldf

2e6102a

reutenauer#497

ivankokan mentioned this issue Apr 22, 2021

Improvements and clarifications on Croatian digraphs (not ligatures) #500

Merged

ivankokan changed the title ~~Suppress hyphenation within Croatian (and any other) digraphs output as separate glyphs~~ Croatian digraphs: make the fallback output (separate glyphs) treated as a unit in later processing Apr 23, 2021

ivankokan changed the title ~~Croatian digraphs: make the fallback output (separate glyphs) treated as a unit in later processing~~ Croatian digraphs: make the fallback output (consecutive glyphs) treated as a unit in later processing Apr 24, 2021

ivankokan added a commit to ivankokan/polyglossia that referenced this issue Apr 24, 2021

Croatian digraphs: Always fallback to "glued" consecutive glyphs mimi…

90cba87

…cking one digraph letter (reutenauer#497)

ivankokan changed the title ~~Croatian digraphs: make the fallback output (consecutive glyphs) treated as a unit in later processing~~ Croatian digraphs: When the font is missing Unicode digraph, make the fallback output treated as a unit in later processing Apr 25, 2021

ivankokan changed the title ~~Croatian digraphs: When the font is missing Unicode digraph, make the fallback output treated as a unit in later processing~~ Croatian digraphs: When the font is missing Unicode digraph and the one is to be output, mimic it Apr 25, 2021

ivankokan changed the title ~~Croatian digraphs: When the font is missing Unicode digraph and the one is to be output, mimic it~~ Croatian digraphs: When the digraph is to be output and the font is missing one, mimic it Apr 25, 2021

jspitz added the "FIXED IN DEV" label (This bug is fixed for the next release) Dec 10, 2021

ivankokan mentioned this issue Dec 31, 2021

Support for Croatian Unicode digraph letters within (at least) T1 latex3/latex2e#723

Closed

jspitz added this to the 1.54 milestone Feb 17, 2022

jspitz closed this as completed Mar 27, 2022

jspitz removed the "FIXED IN DEV" label (This bug is fixed for the next release) Mar 27, 2022
