Croatian digraphs: When the digraph is to be output and the font is missing one, mimic it #497
In this case,
diff --git a/tex/gloss-croatian.ldf b/tex/gloss-croatian.ldf
index 54074f0..001f8ea 100644
--- a/tex/gloss-croatian.ldf
+++ b/tex/gloss-croatian.ldf
@@ -112,7 +112,7 @@
\ifcroatian@disableligatures
\bgroup#2\egroup%
\else
- \charifavailable{#1}{#2}%
+ \charifavailable{#1}{\mbox{#2}}%
\fi%
}
Please test whether this works for you. |
Actually, if (What is the purpose of grouping within |
One thing that keeps me wondering: If hyphenation occurs at this place, either this is orthographically correct (and we shouldn't suppress it) or the hyphenation patterns would need to be corrected (rather than letting polyglossia block hyphenation points). |
Sorry for this but I join Jürgen in not understanding why you say "regardless of hyphenation patterns"? If you add
nestručnjak as a pattern with a high even number between n and j, or as a hyphenation exception, then hyphenation is forbidden.
What would make the system overrule the hyphenation patterns so that you worry that hyphenation may occur after all?
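A minimal sketch of the hyphenation-exception route mentioned above, assuming XeLaTeX or LuaLaTeX with polyglossia. A word passed to \hyphenation without any hyphen is never broken; a real exception would instead list the permitted break points, which are deliberately not guessed here. The zero-width \parbox is only a common trick for making TeX reveal every break it is willing to use.

```latex
% Compile with XeLaTeX or LuaLaTeX.
\documentclass{article}
\usepackage{fontspec}
\usepackage{polyglossia}
\setmainlanguage{croatian}
\begin{document}
% Declared inside the document body so that the exception is stored
% for the language that is active here (Croatian).
% No hyphen is given, so the word is never hyphenated at all.
\hyphenation{nestručnjak}
% A zero-width box makes TeX take every break it knows about.
\parbox{0pt}{\hspace{0pt}nestručnjak}
\end{document}
```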
On 22 Apr 2021, at 11:22, Ivan Kokan ***@***.***> wrote:
But if somebody, for example, typesets nestruč"njak, they really want to typeset the letter nj. If disableligatures is set or the font is missing the respective ligature, polyglossia will fall back to the separate n and j, for which we must forbid hyphenation in between - regardless of the hyphenation patterns (which might treat consecutive n and j / l and j / d and ž more or less strictly or pragmatically).
<http://www.imt-atlantique.fr/> Yannis HARALAMBOUS
Professor
Computer Science Department
UMR CNRS 6285 Lab-STICC
<http://perso.telecom-bretagne.eu/yannisharalambous/> <https://twitter.com/y_haralambous> <https://www.linkedin.com/in/yannis-haralambous-5529073?trk=hp-identity-name>Technopôle Brest-Iroise CS 83818
29238 Brest Cedex 3, France
A school of IMT <http://www.imt.fr/>
And after a short silence she added:
Every path that leads there was, in the end, the right one. (Michael Ende)
|
I still think this is a hyphenation pattern issue, not a polyglossia issue. @reutenauer what is your take? |
There are two options in order to suppress wrong hyphenations in this matter:
Additionally, I would prefer
|
On 22 Apr 2021, at 12:42, Ivan Kokan ***@***.***> wrote:
There are two options in order to suppress wrong hyphenations in this matter:
Enrich the hyphenation patterns with all known patterns
The process is not finite, and it could happen that e.g. word i(nj)e gets hyphenated like in-je because patterns contain "in" and "je". (The claim does not hold for the particular example, but it shows what could happen.) The hyphenation patterns yield mostly good but also bad results, they will never be perfect.
There are two ways of "not being perfect":
1) the systemic error: for example, hyphenation depending on semantics, with the patterns having no access to this level. Like the English "contact", hyphenated differently depending on whether it is a verb or a noun. This is impossible to deal with using the current TeX hyphenation mechanism.
2) the error for lack of information: if you provide patterns with the exact rules of hyphenation, then these will always be applied. And if a word is discovered that is not covered by the rules, we can always update the patterns. So, asymptotically, the patterns will be perfect.
People will not have the reflex of writing "nj, they expect the system to solve these issues, and are ready to take measures if the system behaves incorrectly. You could choose a general solution that privileges inertia (= no hyphenation) between n and j, so that in the worst case a word like poluinjektivnost (or bijou in Flemish) will not be hyphenated between n and j, not a big loss, there are many other places to break the word.
In a regular text, absence of hyphenation at some point is not a problem. Wrong hyphenations must be avoided at all cost. And with patterns you can do it, provided you update your patterns whenever new, unexpected, cases appear.
|
Comparing "contact"/"contact" vs "inje"/"injekcija" (or some compounds built of this):
I mean, polyglossia is the one that introduces and maps the "nj to something for later processing (polyglossia/tex/gloss-croatian.ldf, lines 111 to 180 in 09973c8).
We/it know(s) what that "nj represents: a digraph, which is output either by picking the very digraph from the font or with two consecutive letters (with no hyphenation allowed in between). And that mapping is not complete if no penalty on hyphenation is made. |
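The mapping described above (take the digraph glyph from the font if it exists, otherwise emit two letters that must stay together) can be sketched as follows. This is only an illustration under stated assumptions, not polyglossia's actual code: \digraphdz is a hypothetical name, and the e-TeX primitive \iffontchar stands in for whatever test \charifavailable performs internally.

```latex
% Compile with XeLaTeX or LuaLaTeX.
\documentclass{article}
\usepackage{fontspec}
\usepackage{polyglossia}
\setmainlanguage{croatian}
% \digraphdz is a hypothetical name used only for this illustration.
\newcommand*\digraphdz{%
  \iffontchar\font"01C6 % does the current font contain U+01C6?
    \char"01C6 % yes: use the single digraph glyph
  \else
    d\nobreak ž% no: two letters with \penalty10000 in between
  \fi
}
\begin{document}
jedna\digraphdz ba
\end{document}
```

\nobreak forbids a line break at that exact position. How the engine hyphenates the remaining parts of the word around the penalty is engine-dependent and worth checking separately.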
in2je.
in3jek1cija
When you type i(nj)e, patterns are not even needed; but when, for some reason, your nj digraph is replaced by the characters n and j, the
pattern in2je. steps in and prevents hyphenation.
What am I missing in the above reasoning?
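A hedged sketch of how the two patterns quoted above could be tried out. It assumes LuaLaTeX, where \patterns is accepted at run time and adds to the current language; with XeTeX or pdfTeX, patterns can only be loaded while the format is built, so there the patterns would have to go into the Croatian pattern files instead. In a pattern, an even digit forbids a break at that position and an odd digit allows one, the highest digit among all matching patterns wins, and the trailing dot anchors the pattern to the end of the word.

```latex
% Compile with LuaLaTeX (run-time \patterns is a LuaTeX feature).
\documentclass{article}
\usepackage{fontspec}
\usepackage{polyglossia}
\setmainlanguage{croatian}
\begin{document}
% Added on top of the Croatian patterns that are already loaded:
%   in2je.       forbids in-je at the end of a word
%   in3jek1cija  allows in-jek-cija
\patterns{in2je. in3jek1cija}
% Zero-width boxes make TeX take every break it is willing to use.
\parbox{0pt}{\hspace{0pt}inje}

\parbox{0pt}{\hspace{0pt}injekcija}
\end{document}
```

Another pattern with a higher digit at the same position would override these, which is why the result should be checked against the full pattern set rather than assumed.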
On 22 Apr 2021, at 13:11, Ivan Kokan ***@***.***> wrote:
Comparing "contact"/"contact" vs "inje"/"injekcija" (or some compounds built of this):
The one who typesets "inje" and "injekcija" cannot expect the engine to know that it is really i(nj)e and i(n)(j)ekcija, the same as the system cannot know whether "contact" is a verb or a noun
The one who typesets "i"nje" and "injekcija" intentionally uses a digraph in one place; that digraph will be output either with the digraph from the font or with two consecutive characters. But they definitely want a guarantee that "nj will not be hyphenated in any case. (And with a simple \nobreak, \nolinebreak or even \mbox{...} we have an opportunity to guarantee that.)
People will not have the reflex of writing "nj, they expect the system to solve these issues, and are ready to take measures if the system behaves incorrectly. You could choose a general solution that privileges inertia (= no hyphenation) between n and j, so that in the worst case a word like poluinjektivnost (or bijou in Flemish) will not be hyphenated between n and j, not a big loss, there are many other places to break the word.
Exactly, they expect the system to deal with it because they are intentionally using it.
I mean, polyglossia is the one who introduces and maps the "nj to something for the later processing:
https://github.com/reutenauer/polyglossia/blob/09973c867b792cb6b1683dd42e183e86b506d5c2/tex/gloss-croatian.ldf#L111-L180
We/it know(s) what that "nj represents: a digraph, which is output either by picking the very digraph from the font or with two consecutive letters (with no hyphenation allowed in between). And that mapping is not complete if no penalty on hyphenation is made.
|
Assuming that you are aware of the fact that it is only one among infinitely many cases (don't remember the compounds and other words with the same roots) - of course, that infinity is not so big but still - you are not missing anything.
Let me get back to the beginning instead - to the semantics. I think we have lost focus.
The shorthands "DZ, "Dz, "dz, "LJ, "Lj, "lj, "NJ, "Nj and "nj were introduced to help Croatian typesetters properly typeset Croatian digraphs (which are considered letters just like the other 27 letters of the Croatian alphabet, 30 in total), either by:
In any case, those shorthands must be mapped to outputs:
Now consider the following MWE and the respective output with the current implementation:
\documentclass{article}
\usepackage{fontspec}
\setmainfont{Arial} % contains digraph dž
\usepackage{polyglossia}
\setmainlanguage[babelshorthands]{croatian}
\begin{document}
\Large
\selectlanguage[disableligatures=false]{croatian}
"dzamija "dzep "dzezva "dzihad fere"dza ho"dza "dzepni ho"dzin bri"dz "dzem "dzemper "dzentlmen "dzip "dzokej
"dzul "dzungla "dzentlmenski "dzokejski jedna"dzba naru"dzba svjedo"dzba bun"dzija bureg"dzija ćevab"dzija galam"dzija
\bigskip
\selectlanguage[disableligatures=true]{croatian}
"dzamija "dzep "dzezva "dzihad fere"dza ho"dza "dzepni ho"dzin bri"dz "dzem "dzemper "dzentlmen "dzip "dzokej
"dzul "dzungla "dzentlmenski "dzokejski jedna"dzba naru"dzba svjedo"dzba bun"dzija bureg"dzija ćevab"dzija galam"dzija
\end{document}
Local intervention at polyglossia/tex/gloss-croatian.ldf, line 136 in fd838c4,
changing it to
\hr@charifavailable{01C6}{d\nobreak ž}%
So, |
On 22 Apr 2021, at 22:21, Ivan Kokan ***@***.***> wrote:
Assuming that you are aware of the fact that it is only one among infinitely many cases (don't remember the compounds and other words with the same roots) - of course, that infinity is not so big but still - you are not missing anything.
I do not wish to monopolize this thread and spend more time on it than necessary.
When you say "that there are infinitely many cases" you in fact mean that we are talking of an open non-deterministic set of words,
non-deterministic in the sense that there is no algorithm to obtain all cases.
This may be an occupational hazard (déformation professionnelle), as I work in Natural Language Processing, but your problem
can be tackled by processing a large corpus. A corpus-driven approach may give an asymptotic solution to it.
I'm not just talking idly. If you wish, we can start a project with the goal of obtaining a generative list (generative in the sense
of compound creation) of such words by extracting them from a large corpus. The result will not be perfect, but it will
be an improvement over the current situation (you don't want people to have to use the double quote when writing Croatian;
TeX should be as natural and spontaneous as possible). And it will improve academic knowledge of the Croatian language.
It may also be a lot of fun from the computer science point of view. One could apply machine learning methods to build a
model classifying words into those with individual nj and those with digraphs, and then use that predictive model to build
hyphenation patterns. A typical NLP project with a lot of fun.
If you are interested in such a project, let us discuss it off-line. If not, I won't bother you anymore.
Cheers,
Yannis
|
I am not saying that hyphenation patterns cannot or should not be improved, even though I drastically described it as something that is non-solvable. Please do not take my comments that way. I think just the opposite: whenever it is possible to improve / enrich them, great!
In my latest comment I wanted to reset everything and point out that this is not about hyphenation; hyphenation is just something where the side effects of incomplete WYSIWYM:
"dz ("I want to typeset a Croatian digraph letter or at least make the output of this behave as one")
These two output possibilities should be as equal as possible for anyone/anything that takes them into processing/consideration (e.g. hyphenation). They currently are not, and they can be equal (at least more equal than they are at the moment). Simple as that.
Again, I am trying to point out that the current output for "dz (and others) does not behave as a single letter in all cases - which it must. |
The Croatian alphabet consists of 30 letters:
Long story short:
So, the question is: how can we guarantee that the fallback part(s) are treated as a unit in later processing? |
If (1.) such digraphs are not supposed to be hyphenated, if (2.) entering them as two separate glyphs (alternatively) is orthographically valid and common, and if (1.) also applies to (2.), then this definitely must be addressed in the hyphenation patterns, not in polyglossia. |
Considering the hyphenation... If the fallback for
After reading all previous comments again and again, it seems that we have not agreed on what the fallback should actually be (conceptually). Can we first try to decide on that?
a) it should be the same as what Croatian (and Serbian) editors commonly enter, i.e. two consecutive letters
Obviously, my preference is b), with the following reasoning (check the following MWE):
MWE:
\documentclass{article}
\usepackage{fontspec}
\setmainfont{Arial} % contains digraph dž
\usepackage{polyglossia}
\setmainlanguage[babelshorthands]{croatian}
\begin{document}
\LARGE
\section{Using Unicode slots}
\selectlanguage[disableligatures=false]{croatian}
"dzamija "dzep "dzezva "dzihad fere"dza ho"dza "dzepni ho"dzin bri"dz "dzem "dzemper "dzentlmen "dzip "dzokej
"dzul "dzungla "dzentlmenski "dzokejski jedna"dzba naru"dzba svjedo"dzba bun"dzija bureg"dzija ćevab"dzija galam"dzija
\section{Fallback}
\selectlanguage[disableligatures=true]{croatian}
"dzamija "dzep "dzezva "dzihad fere"dza ho"dza "dzepni ho"dzin bri"dz "dzem "dzemper "dzentlmen "dzip "dzokej
"dzul "dzungla "dzentlmenski "dzokejski jedna"dzba naru"dzba svjedo"dzba bun"dzija bureg"dzija ćevab"dzija galam"dzija
\section{The common way}
džamija džep džezva džihad feredža hodža džepni hodžin bridž džem džemper džentlmen džip džokej
džul džungla džentlmenski džokejski jednadžba narudžba svjedodžba bundžija buregdžija ćevabdžija galamdžija
\end{document}
P.S. Additionally, maybe setting
\newcommand*\hr@charifavailable[3]{%
\ifcroatian@disableligatures
\bgroup#3\egroup% "I do not want digraphs at all"
\else
\charifavailable{#1}{#2}% "I want digraph, either original or a mimic of one"
\fi%
}
\hr@charifavailable{01C6}{d\nobreak ž}{dž}%
Should Section 2 be like Section 1 or Section 3? "What do digraph shorthands represent?" is the very question, not "How do we implement it?" - that comes later. |
…cking one digraph letter (reutenauer#497)
PR merged and amended with df87338 |
Related to #216.
Hyphenation within "fallback" DŽ, Dž, dž, LJ, Lj, lj, NJ, Nj and nj must be suppressed (regardless of what hyphenation patterns provide). I am not sure which among \mbox{}, \nobreak, \nolinebreak etc. would be correct (or the most appropriate, at least).
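To compare the candidates side by side, a small test file along the following lines may help. It is a sketch, not a recommendation: \trybreaks is a hypothetical helper, and the zero-width \parbox merely forces TeX to use every break point it considers legal, so any difference between the variants becomes visible in the output.

```latex
% Compile with XeLaTeX or LuaLaTeX.
\documentclass{article}
\usepackage{fontspec}
\usepackage{polyglossia}
\setmainlanguage{croatian}
% \trybreaks is a hypothetical helper: it sets its argument in a
% zero-width box, one test word per paragraph.
\newcommand*\trybreaks[1]{%
  \noindent\parbox[t]{0pt}{\hspace{0pt}#1}\par\medskip}
\begin{document}
\trybreaks{jednadžba}% two plain letters: the patterns decide
\trybreaks{jedna\mbox{dž}ba}% an hbox, which cannot be broken
\trybreaks{jednad\nobreak žba}% \penalty10000 between d and ž
\trybreaks{jednad\nolinebreak žba}% LaTeX wrapper around that penalty
\end{document}
```

All three candidates rule out a break between d and ž. They can differ in how the rest of the word is hyphenated and in whether kerning between the two letters survives, and both effects depend on the engine, so the output of such a file is the thing to look at before choosing one.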