<!DOCTYPE html>
<html dir="ltr" lang="en">
<head>
<meta charset="UTF-8">
<title>String Searching</title>
<!-- local styles. Includes the styles from https://www.w3.org/International/i18n-activity/guidelines/editing -->
<link rel="stylesheet" href="local.css" type="text/css">
<script src="https://www.w3.org/Tools/respec/respec-w3c" async class="remove"></script>
<script class="remove">
var respecConfig = {
useExperimentalStyles: true,
// specification status (e.g. WD, LCWD, NOTE, etc.). If in doubt use ED.
specStatus: "ED",
//publishDate: "2020-03-20",
//previousMaturity: "ED",
noRecTrack: true,
shortName: "string-search",
copyrightStart: "2016",
edDraftURI: "https://w3c.github.io/string-search/",
group: "i18n",
github: "w3c/string-search",
xref: ["i18n-glossary"],
// lcEnd: "2009-08-05",
// editors, add as many as you like
// only "name" is required
editors: [
{ name: "Addison Phillips", mailto: "[email protected]", company: "Invited Expert", w3cid: 33573 }
],
// authors, add as many as you like.
//authors: [
// { name: "Your Name", url: "http://example.org/",
// company: "Your Company", companyURL: "http://example.com/" },
//],
};
</script> </head>
<body>
<section id="abstract">
<p>This document describes string searching operations on the Web in order to allow greater interoperability. String searching refers to natural language string matching such as the "find" command in a Web browser. This document builds upon the concepts found in <cite>Character Model for the World Wide Web 1.0: Fundamentals</cite> [[CHARMOD]] and <cite>Character Model for the World Wide Web 1.0: String Matching</cite> [[CHARMOD-NORM]] to provide authors of specifications, software developers, and content developers the information they need to describe and implement search features suitable for global audiences.</p>
</section>
<section id="sotd">
<div class="note">
<p data-lang="en" style="font-weight: bold; font-size: 120%">Sending comments on this document</p>
<p data-lang="en">If you wish to make comments regarding this document, please raise them as <a href="https://github.com/w3c/string-search/issues" style="font-size: 120%;">GitHub issues</a> against the latest <a href="https://w3c.github.io/string-search">editor's copy</a>. Only send comments by email if you are unable to raise issues on GitHub (see links below). All comments are welcome.</p>
<p data-lang="en">To make it easier to track comments, please raise separate issues or emails for each comment, and point to the section you are commenting on using a URL.</p>
</div>
</section>
<section id="intro">
<h2>Introduction</h2>
<section id="goals">
<h3>Goals and Scope</h3>
<p>This document describes the problems, requirements, and considerations for specifications or implementations of string searching operations. A common example of string searching is the "find" command in a Web browser, but there are many other forms of searching that a specification might wish to define.</p>
<p class="note">This document builds on <cite>Character Model for the World Wide Web: Fundamentals</cite> [[CHARMOD]] and <cite>Character Model for the World Wide Web: String Matching</cite> [[CHARMOD-NORM]]. Understanding the concepts in those documents is important to being able to understand and apply this document successfully.</p>
<p>The main target audience of this specification is W3C specification developers who need to define some form of search or find algorithm: the goal is to provide a stable reference to the concepts, terms, and requirements needed.</p>
<p>The concepts described in this document provide authors of specifications, software developers, and content developers with a common reference for consistent, interoperable text searching on the World Wide Web. Working together, these three groups can build a globally accessible Web.</p>
<p>This document contains best practices and requirements for other specifications, as well as recommendations for implementations and content authors. These best practices for specifications (and others) can also be found in the Internationalization Working Group's document <cite>Internationalization Best Practices for Spec Developers</cite> [[INTERNATIONAL-SPECS]], which is intended to serve as a general reference for all Internationalization best practices in W3C specifications.</p>
</section>
<section id="conventions">
<h3>Document Conventions</h3>
<p>In this document [[RFC2119]] keywords in uppercase italics have their usual meaning. We also use these stylistic conventions:</p>
<p class="definition-example"><strong>Definitions</strong> appear with a different background color and decoration like this.</p>
<p class="advisement"><strong>Best practices</strong> appear with a different background color and decoration like this.</p>
<p class="issue-example" id="issue-example"><strong>Issues</strong>, gaps, and recommendations for future work appear with a different background color and decoration like this.</p>
</section>
<section id="terminology">
<h3>Terminology</h3>
<p>This section contains terminology specific to this document.</p>
<p>Much of the terminology needed to understand this document is provided by the <cite>Internationalization Glossary</cite> [[I18N-GLOSSARY]]. Some terms are also defined by [[CHARMOD-NORM]] and can be found in the <a href="https://www.w3.org/TR/charmod-norm/#terminology">Terminology and Notation</a> section of that document.</p>
<p><a>Unicode</a>, also known as the <a>Universal Character Set</a>, allows Web documents to be authored in any of the world's writing systems, scripts, or languages, on any computing platform, and then to be exchanged, read, and searched by the Web's users around the world. The first few chapters of the <cite>Unicode Standard</cite> [[Unicode]] provide useful background reading. Also see the <cite>Unicode Collation Algorithm</cite> [[UTS10]], which contains a chapter on searching.</p>
<p class="definition"><dfn>Corpus</dfn> The natural language text contained by a document or set of documents which the user would like to search.</p>
<p class="definition"><dfn>Segmentation</dfn> The process of breaking natural language text up into distinct words and phrases. This often includes operations such as "named entity recognition" (such as recognizing that the three word sequence <strong>Dr. Jonas Salk</strong> is a person's name).</p>
<p class="definition"><dfn data-lt="stemming|lemmatization">Stemming</dfn> A process or operation that reduces words to their "stem" or root. For example, the words <strong>runs</strong>, <strong>ran</strong>, and <strong>running</strong> all share the stem <strong>run</strong>. This is sometimes called (more formally) <em>lemmatization</em> and the stem is sometimes called the <em>lemma</em>.</p>
<p class="definition"><dfn data-lt="full text search|full-text search|full text searching">Full-Text Search</dfn> refers to searches that process the entire contents of the textual document or set of documents. Full-text queries perform linguistic searches against text data in full-text indexes by operating on words and phrases based on the rules of a particular language such as English or Japanese. Full-text queries can include simple words and phrases or multiple forms of a word or phrase.</p>
<p>Frequently this means that a <a>full-text search</a> employs indexes and natural language processing. When you are using a search engine, you are using a form of full text search. Full text search often breaks natural language text into words or phrases (this is called <a>segmentation</a>) and may apply complex processing to get at the semantic "root" values of words (this is called <a>stemming</a>). These processes are sensitive to language, context, and many other aspects of textual variation.</p>
<p class="definition"><dfn data-lt="natural language processing|NLP">Natural Language Processing</dfn> (<abbr title="natural language processing">NLP</abbr>) refers to the domain of software designed to understand, process, and manipulate human languages (that is, <a>natural language</a>). This is a very wide ranging term. It can cover relatively simple problems, such as word tokenization, or more complex behaviors, such as deriving "meaning" from text, recognizing parts of speech, performing accurate translation, and much else.</p>
</section>
</section>
</section>
<section id="searching">
<h2>Searching Text in Natural Language Content</h2>
<p>Users of the Web often want to search for specific text in a document or collection of documents without having to read line-by-line. Specifications sometimes seek to support this desire by exposing text searching in the Web platform.</p>
<p>There are different types of document searching. One type, called a <a>full text search</a>, is the sort of searching most often found in applications such as a search engine. This type of searching is complex, can be resource intensive, and often depends on processes outside the scope of a given search request.</p>
<p>A more limited form of text search (and the topic of this document) is <q>sub-string matching</q>. One familiar form of sub-string matching is the <q><em>find</em></q> feature of browsers and other types of user-agent. In browsers, this functionality is often accessed via a key combination such as <kbd>Cmd+F</kbd> or <kbd>Ctrl+F</kbd>. Such a feature might be exposed on the Web via the API <code translate=no>window.find</code>, which is currently not fully standardized, or capabilities such as the proposed scroll-to-text-fragment.</p>
<aside class="note">
<p>Textual search is different from the sorts of programmatic matching needed by formal languages, such as markup languages like [[HTML]]; style sheets [[CSS21]]; or data formats such as [[TURTLE]] or [[JSON-LD]]. String matching in formal languages is described by our document <cite>String Matching</cite> [[CHARMOD-NORM]].</p>
</aside>
<p>Find operations can provide optional mechanisms for improving or tailoring the matching behavior. Examples include the ability to add (or remove) <a href="#caseVariation">case sensitivity</a>, support for aspects of a regular expression language such as wildcard characters, or the option to limit matches to <a href="#wordBoundary">whole words</a>.</p>
<p>One way that sub-string matching usually differs from <a>full-text search</a> is that, while it might use various algorithms in an attempt to suppress or ignore textual variations, it usually does not produce matches that contain additional or unspecified character sequences, words, or phrases, such as would result from <a>stemming</a> or other <a>NLP</a> processes.</p>
<p>When attempting to standardize sub-string matching, specification authors often struggle with the complexity that is inherent in the encoding of <a>natural language</a> in computer systems, including the different mechanisms employed to encode characters in the [[Unicode]] standard.</p>
<!-- preserving text for the nonce
<p>When searching text, the concept of "<a>grapheme</a> boundaries" and "user-perceived characters" can be important. See Section 3 of <cite>Character Model for the World Wide Web: Fundamentals</cite> [[CHARMOD]] for a description. For example, if the user has entered a capital "A" into a search box, should the software find the character À (<code class="uname" translate="no">U+00C0 LATIN CAPITAL LETTER A WITH ACCENT GRAVE</code>)? What about the character "A" followed by <code class="uname" translate="no">U+0300 COMBINING ACCENT GRAVE</code>? What about writing systems, such as Devanagari, which use combining marks to suppress or express certain vowels?</p>
<p>In order to describe or implement sub-string matching, it is necessary to understand the types of textual variation that users expect the search feature to pay attention to (or ignore) and the types of features that the implementation will need to consider when building the searching algorithm.</p>
<p>The <cite>Character Model for the World-Wide Web: String Matching</cite> [[CHARMOD-NORM]] describes several textual equivalences which also apply to sub-string matching. These include <a href="https://www.w3.org/TR/charmod-norm/#definitionCaseFolding">case folding</a> and <a href="https://www.w3.org/TR/charmod-norm/#unicodeNormalization">different Unicode normalization forms</a>.
<p>There are other types of equivalence that are interesting when performing sub-string matching. Some forms of equivalence, such as those mentioned above, are based on character properties assigned by Unicode or due to the mapping of legacy character encodings to the Unicode character set. Other "interesting equivalences" go outside of those defined by Unicode. Some of these potential "text normalizations" are application, natural language, or domain specific and should not be overlooked by specifications or implementations.</p>
-->
<section id="otherEquivalences">
<h3>Problems with Determining Equivalence</h3>
<p>Quite often, the user's input doesn't consist of exactly the same sequence of <a>code points</a> as that used in the document being searched, while the user still expects a match to occur. This can happen for a variety of reasons. Sometimes it is because the text being searched varies in ways the user could not have predicted. In other cases it is because the user's keyboard or input method does not provide ready access to the textual variations needed. It can even be because the user cannot be bothered to input the text accurately.</p>
<p>In this section, we examine various common cases known to us which specification authors need to take into consideration when specifying a sub-string match API or mechanism.</p>
<section id="languageVariation">
<h4>Matching variation due to language</h4>
<p>User expectations about whether their search term matches a given part of a document or [=corpus=] sometimes depend on the user's language, the language of the document, or both. They might also involve other factors, such as which keyboards or input methods are available on a given device. This might be because various operations that are part of searching, such as case folding, are locale-affected, or because, given the complexity of human language and culture, expectations about matching or about the use and interpretation of various character sequences differ, even within a given <a data-cite="i18n-glossary#dfn-script">script</a>. Similarly, the handling of accents, alternate scripts, or character encoding (such as variations in the formation of <a>grapheme clusters</a>) is linked to the specific language of the text in question.</p>
<p>It is important to emphasize that we mean <em>language</em> here, and not <a data-cite="i18n-glossary#dfn-script">script</a>. Many different languages that share a script apply different processing or imply different expectations.</p>
<p>Implementations of a "find" feature often have to guess what language the user intended based solely on the user's input or on various "hints" in the runtime environment, such as the operating environment locale, the user agent's localization, or the language of the active keyboard. These hints are, at best, a proxy for the user's intent, particularly when the user is searching a document that doesn't match any of these or when the searched document contains more than one language.</p>
<aside class="example" id="text-frag-lang">
<p>Different languages treat the letter combinations <q>a</q>, <q>ae</q>, and <q>ä</q> differently.
English speakers expect <q>ae</q> to be different from <q>a</q> and <q>ä</q>. Since <q>ä</q> is a foreign letter, they usually expect it to match the unmarked <q>a</q>.
German speakers expect <q>ae</q> and <q>ä</q> to be equivalent (and different from <q>a</q>).
Finnish speakers expect all three to be separate.
</p>
<p>Now suppose you have a sentence in Finnish:
<strong lang="fi" dir="ltr">Haen Han Solon. Hän on salakuljettaja.</strong></p>
<p>(For the curious, this translates to: <em>I’ll go get Han Solo. He is a smuggler.</em>)</p>
<p>The above sentence is tagged as Finnish (<code translate=no>lang="fi"</code>). Notice that the letter "n" attached to the end of Han Solo's name (<em>Han Solon</em>) is a part of Finnish grammar.</p>
<p>Here are some spelling variations that speakers of English, German, and Finnish might enter when performing a "find" operation on the text:</p>
<ul>
<li>Han</li>
<li>Hän</li>
<li>Haen</li>
<li>han</li>
<li>hän</li>
<li>haen</li>
</ul>
<p>Finnish speakers expect that each of the above examples is a different word. They might expect that the case variation between <kbd>Hän</kbd> and <kbd>hän</kbd> might be ignored.
German speakers might expect that <kbd>Hän</kbd> and <kbd>Haen</kbd> are equivalent.
English speakers might expect <kbd>Han</kbd> to match <kbd>Hän</kbd> (but perhaps not the reverse, since <q>ä</q> is not native to English).
However, the language tagging of the document doesn't seem to affect most find operations.
Neither is there usually a way for the user to affect which language is applied to the search term.
</p>
<p>Here is a phrase that we believe means <em>warm marrow</em> in Turkish: <strong lang="tr">ılık ilik</strong>.</p>
<p>Here are some spelling variations that English and Turkish speakers might enter:</p>
<ul>
<li>ILIK</li>
<li>İLİK</li>
<li>ilik</li>
<li>ılık</li>
</ul>
<p>Depending on your browser and runtime locale, you can get anomalous matching with these terms. In some browsers, the first three terms above consistently match <q>ilik</q> (with an ASCII dotted-i) but not the word <q>ılık</q> with <span class="codepoint" translate="no"><bdi lang="tr">ı</bdi><code class="uname">U+0131 LATIN SMALL LETTER DOTLESS I</code></span>.</p>
<p>This is not what Turkish users would expect, since they expect "I"/"ı" and "İ"/"i" to be caseless pairs. A side-effect of this is that the search term "ılık" only matches its lowercase equivalent—and that the uppercase variations do not match that word. Such variation means that both English and Turkish users will notice that the search misses words.</p>
</aside>
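<p>The Turkish case above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names <code>fold_default</code>, <code>fold_turkish</code>, and <code>matches</code> are hypothetical, and the tailoring covers nothing beyond the dotted and dotless <q>i</q> pairs. It contrasts language-neutral case folding with a fold that treats "I"/"ı" and "İ"/"i" as the caseless pairs.</p>

```python
# Hypothetical helpers for illustration; not taken from any specification.
# The translation table maps the two uppercase forms to the lowercase letters
# that Turkish users expect, before ordinary case folding is applied.
_TURKISH_UPPER_I = str.maketrans({"I": "\u0131", "\u0130": "i"})


def fold_default(text: str) -> str:
    """Language-neutral full case folding."""
    return text.casefold()


def fold_turkish(text: str) -> str:
    """Case folding that pairs I with dotless i and dotted capital I with i."""
    return text.translate(_TURKISH_UPPER_I).casefold()


def matches(term: str, word: str, fold) -> bool:
    """Compare a search term and a word after applying the chosen fold."""
    return fold(term) == fold(word)


# Language-neutral folding pairs "ILIK" with "ilik" and misses the dotless word.
print(matches("ILIK", "ilik", fold_default))
print(matches("ILIK", "\u0131l\u0131k", fold_default))
# The Turkish tailoring pairs "ILIK" with the dotless word instead.
print(matches("ILIK", "\u0131l\u0131k", fold_turkish))
```

<p>Which fold to apply still depends on knowing the language of the search term or the document, which is exactly the information a "find" feature often has to guess.</p>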
</section>
<section id="caseVariation">
<h4>Case Folding</h4>
<p>A user might expect a term entered in lowercase to match uppercase equivalents (and perhaps vice-versa). Sub-string matching features, such as the browser "find" command, often offer a user-selectable option for matching (or not) the case of the input to that of the text.</p>
<p>For a survey of case folding, see the discussion <a href="https://www.w3.org/TR/charmod-norm/#definitionCaseFolding">here</a> in [[CHARMOD-NORM]].</p>
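<p>As a minimal sketch of such a user-selectable option, the following Python function (the name <code>contains</code> and the <code>match_case</code> parameter are assumptions made for this example) uses full case folding when case is to be ignored. Full case folding is not the same as lowercasing: it maps <q>ß</q> to <q>ss</q>, so the folded text can differ in length from the original.</p>

```python
def contains(corpus: str, term: str, match_case: bool = False) -> bool:
    """Report whether term occurs in corpus, optionally ignoring case.

    Hypothetical helper: when match_case is False, both strings are passed
    through language-neutral full case folding before comparison.
    """
    if match_case:
        return term in corpus
    return term.casefold() in corpus.casefold()


sentence = "Die Hauptstra\u00dfe ist gesperrt."
# Full case folding lets an uppercase "STRASSE" find the sharp-s spelling.
print(contains(sentence, "STRASSE"))
# Simple lowercasing on both sides does not.
print("STRASSE".lower() in sentence.lower())
```

<p>Because folding can change string length, an implementation that needs to highlight the match has to map offsets in the folded text back to the original text; this sketch only answers whether a match exists.</p>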
</section>
<section id="unicodeNormalization">
<h4>Unicode Normalization and character equivalence</h4>
<p>Unicode defines canonical and compatibility relationships between characters which can impact user perceptions of string searching. For a detailed discussion of Unicode Normalization forms see Section 2.2 of [[CHARMOD-NORM]] as well as the definitions found in <cite>Unicode Normalization Forms</cite> [[UAX15]].</p>
<aside class="example">
<p>For example, consider the letter "K". The characters with a normalization including <code>U+004B LATIN CAPITAL LETTER K</code> include the following, many of which might be expected to match a letter "K" in a sub-string search request by a user because they appear to contain a logical "letter K":</p>
<ul>
<li>Ķ <code>U+0136 LATIN CAPITAL LETTER K WITH CEDILLA</code>
<li>Ǩ <code>U+01E8 LATIN CAPITAL LETTER K WITH CARON</code>
<li>ᴷ <code>U+1D37 MODIFIER LETTER CAPITAL K</code>
<li>Ḱ <code>U+1E30 LATIN CAPITAL LETTER K WITH ACUTE</code>
<li>Ḳ <code>U+1E32 LATIN CAPITAL LETTER K WITH DOT BELOW</code>
<li>Ḵ <code>U+1E34 LATIN CAPITAL LETTER K WITH LINE BELOW</code>
<li>&#x212A; <code>U+212A KELVIN SIGN</code>
<li>Ⓚ <code>U+24C0 CIRCLED LATIN CAPITAL LETTER K</code>
<li>㎅ <code>U+3385 SQUARE KB</code>
<li>㏍ <code>U+33CD SQUARE KK</code>
<li>㏎ <code>U+33CE SQUARE KM CAPITAL</code>
<li>Ｋ <code>U+FF2B FULLWIDTH LATIN CAPITAL LETTER K</code>
<li>𝐊 <code>U+1D40A MATHEMATICAL BOLD CAPITAL K</code>
<li>𝐾 <code>U+1D43E MATHEMATICAL ITALIC CAPITAL K</code>
<li>𝑲 <code>U+1D472 MATHEMATICAL BOLD ITALIC CAPITAL K</code>
<li>𝒦 <code>U+1D4A6 MATHEMATICAL SCRIPT CAPITAL K</code>
<li>𝓚 <code>U+1D4DA MATHEMATICAL BOLD SCRIPT CAPITAL K</code>
<li>𝔎 <code>U+1D50E MATHEMATICAL FRAKTUR CAPITAL K</code>
<li>𝕂 <code>U+1D542 MATHEMATICAL DOUBLE-STRUCK CAPITAL K</code>
<li>𝕶 <code>U+1D576 MATHEMATICAL BOLD FRAKTUR CAPITAL K</code>
<li>𝖪 <code>U+1D5AA MATHEMATICAL SANS-SERIF CAPITAL K</code>
<li>𝗞 <code>U+1D5DE MATHEMATICAL SANS-SERIF BOLD CAPITAL K</code>
<li>𝘒 <code>U+1D612 MATHEMATICAL SANS-SERIF ITALIC CAPITAL K</code>
<li>𝙆 <code>U+1D646 MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL K</code>
<li>𝙺 <code>U+1D67A MATHEMATICAL MONOSPACE CAPITAL K</code>
<li>🄚 <code>U+1F11A PARENTHESIZED LATIN CAPITAL LETTER K</code>
<li>🄺 <code>U+1F13A SQUARED LATIN CAPITAL LETTER K</code>
</ul>
</aside>
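<p>The difference between canonical and compatibility relationships can be seen with Python's standard <code>unicodedata</code> module. The two helper functions below are hypothetical names chosen for this sketch. Each decomposes the text and then drops combining marks, which is a deliberately aggressive choice that not every search feature will want.</p>

```python
import unicodedata


def _strip_marks(text: str) -> str:
    """Drop characters with a non-zero canonical combining class."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


def canonical_skeleton(text: str) -> str:
    """Canonical decomposition (NFD) with combining marks removed."""
    return _strip_marks(unicodedata.normalize("NFD", text))


def compatibility_skeleton(text: str) -> str:
    """Compatibility decomposition (NFKD) with combining marks removed."""
    return _strip_marks(unicodedata.normalize("NFKD", text))


# K WITH CEDILLA and KELVIN SIGN reduce to "K" under canonical decomposition.
print(canonical_skeleton("\u0136"), canonical_skeleton("\u212a"))
# CIRCLED K only reduces to "K" under compatibility decomposition.
print(canonical_skeleton("\u24c0") == "\u24c0", compatibility_skeleton("\u24c0"))
# SQUARE KB expands to two letters under compatibility decomposition.
print(compatibility_skeleton("\u3385"))
```

<p>Whether a search for "K" should match all of these characters is a design decision; the sketch only shows which Unicode relationship each choice relies on.</p>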
<p>In many complex scripts it is possible to encode letters or vowel-signs in more than one way, but the alternatives are canonically equivalent.</p>
</section>
<section id="scriptEquiv">
<h4>Script Equivalence</h4>
<p>Some languages are written in more than one script. A user searching a document might type in text in one script, but wish to find equivalent text in both scripts.</p>
<aside class="example">
<p>Japanese uses two syllabic scripts, <code>hiragana</code> and <code>katakana</code>. These scripts encode the same phonemes; thus the user might expect that typing in a search term in <em>hiragana</em> would find the exact same word spelled out in <em>katakana</em>.</p>
<p>In the example shown here, the word <em translate="no" lang="ja-Latn">nihongo</em> (Japanese for "Japanese") is shown in both hiragana and katakana. Note that this word is usually represented by <em>kanji</em> (Han ideograph) characters: <span class="kw" lang="ja" translate="no">日本語</span>.</p>
<table>
<thead>
<tr>
<th style="width:30%">Description</th>
<th>Example</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2">Hiragana</td>
<td class="exampleChar" lang="ja">にほんご</td>
</tr>
<tr>
<td><span class="codepoint" translate="no">U+306B U+307B U+3093 U+3054</span></td>
</tr>
<tr>
<td rowspan="2">Katakana</td>
<td class="exampleChar" lang="ja">ニホンゴ</td>
</tr>
<tr>
<td><span class="codepoint" translate="no">U+30CB U+30DB U+30F3 U+30B4</span></td>
</tr>
</tbody>
</table>
</aside>
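<p>One way to treat the two kana scripts as equivalent is to map katakana onto hiragana before comparing. The sketch below assumes the fixed offset of 0x60 between the katakana range U+30A1 to U+30F6 and the corresponding hiragana letters; the function name <code>fold_kana</code> is hypothetical. It does not attempt to relate either script to <em>kanji</em>.</p>

```python
# Each katakana letter in U+30A1..U+30F6 sits 0x60 code points above the
# matching hiragana letter, so a translation table covers the whole range.
_KATAKANA_TO_HIRAGANA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}


def fold_kana(text: str) -> str:
    """Map katakana letters to hiragana; leave everything else unchanged."""
    return text.translate(_KATAKANA_TO_HIRAGANA)


katakana = "\u30cb\u30db\u30f3\u30b4"  # nihongo in katakana
hiragana = "\u306b\u307b\u3093\u3054"  # nihongo in hiragana
print(fold_kana(katakana) == fold_kana(hiragana))
```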
</section>
<section id="eastAsianWidthEquiv">
<h4>East Asian Width</h4>
<p>Some compatibility characters were encoded into Unicode to account for single- or multibyte representation in <a>legacy character encodings</a> or for compatibility with certain layout behaviors in East Asian languages.</p>
<aside class="example" title="Examples of East Asian width variations">
<table>
<thead>
<tr>
<th>Description</th>
<th>Example</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan=2>full-width katakana</td>
<td class="exampleChar" lang="ja">ニホンゴ</td>
</tr>
<tr>
<td><span class="codepoint" translate="no">U+30CB U+30DB U+30F3 U+30B4</span></td>
</tr>
<tr>
<td rowspan=2>half-width katakana<br><em>These are compatibility characters</em></td>
<td class="exampleChar" lang="ja">ﾆﾎﾝｺﾞ</td>
</tr>
<tr>
<td><span class="codepoint" translate="no">U+FF86 U+FF8E U+FF9D U+FF7A U+FF9E</span></td>
</tr>
<tr>
<td rowspan=2>half-width Latin letters<br><em>These are ASCII letters!</em></td>
<td class="exampleChar" lang="en">abcXYZ</td>
</tr>
<tr>
<td><span class="codepoint" translate="no">U+0061 U+0062 U+0063 U+0058 U+0059 U+005A</span></td>
</tr>
<tr>
<td rowspan=2>full-width Latin letters<br><em>These are compatibility characters.</em></td>
<td class="exampleChar" lang="en">ａｂｃＸＹＺ</td>
</tr>
<tr>
<td><span class="codepoint" translate="no">U+FF41 U+FF42 U+FF43 U+FF38 U+FF39 U+FF3A</span></td>
</tr>
</tbody>
</table>
</aside>
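<p>A width-insensitive comparison can be sketched as follows. This example assumes that only characters whose East Asian Width property is Fullwidth (<code>F</code>) or Halfwidth (<code>H</code>) should be folded, and it uses the hypothetical function name <code>fold_width</code>. The final NFC step recombines a half-width base letter with its separately encoded voiced sound mark, and as a side effect it also composes any other decomposed sequences in the text.</p>

```python
import unicodedata


def fold_width(text: str) -> str:
    """Replace full-width and half-width forms with their ordinary forms."""
    out = []
    for ch in text:
        if unicodedata.east_asian_width(ch) in ("F", "H"):
            # Width variants carry a compatibility mapping to the ordinary form.
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            out.append(ch)
    # Recompose sequences such as KO + combining voiced sound mark into GO.
    return unicodedata.normalize("NFC", "".join(out))


half_width = "\uff86\uff8e\uff9d\uff7a\uff9e"  # half-width katakana nihongo
print(fold_width(half_width) == "\u30cb\u30db\u30f3\u30b4")
print(fold_width("\uff41\uff42\uff43\uff38\uff39\uff3a"))
```

<p>Applying NFKC to the whole string would also fold width, but it would additionally change other compatibility characters (for example, it expands <code>U+3385 SQUARE KB</code>), which is why this sketch restricts itself to width variants.</p>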
</section>
<section id="digitShaping">
<h4>Digit Shaping</h4>
<p>Many scripts have their own digit characters for the numbers from 0 to 9. In some Web applications, the familiar ASCII digits are replaced for display purposes with the local digit shapes. In other cases, the text actually might contain the Unicode characters for the local digits. Users attempting to search a document might expect that typing one form of digit will find the equivalent digits.</p>
<aside class="example" title="Examples of digit shapes in four scripts">
<p>Here are some selected examples of different digit shapes, from zero to nine, in four scripts. Many scripts have equivalent sets of digits with distinct shapes.</p>
<table style="position:center">
<thead>
<tr>
<th rowspan=2 style="vertical-align:top; width:30%;">Script</th>
<th colspan=10 style="text-align:center">Digits</th>
</tr>
<tr>
<th class="exampleChar">0</th>
<th class="exampleChar">1</th>
<th class="exampleChar">2</th>
<th class="exampleChar">3</th>
<th class="exampleChar">4</th>
<th class="exampleChar">5</th>
<th class="exampleChar">6</th>
<th class="exampleChar">7</th>
<th class="exampleChar">8</th>
<th class="exampleChar">9</th>
</tr>
</thead>
<tbody>
<tr>
<td>Latin</td>
<td class="exampleChar">0</td>
<td class="exampleChar">1</td>
<td class="exampleChar">2</td>
<td class="exampleChar">3</td>
<td class="exampleChar">4</td>
<td class="exampleChar">5</td>
<td class="exampleChar">6</td>
<td class="exampleChar">7</td>
<td class="exampleChar">8</td>
<td class="exampleChar">9</td>
</tr>
<tr>
<td>Gujarati</td>
<td class="exampleChar">૦</td>
<td class="exampleChar">૧</td>
<td class="exampleChar">૨</td>
<td class="exampleChar">૩</td>
<td class="exampleChar">૪</td>
<td class="exampleChar">૫</td>
<td class="exampleChar">૬</td>
<td class="exampleChar">૭</td>
<td class="exampleChar">૮</td>
<td class="exampleChar">૯</td>
</tr>
<tr>
<td>Thai</td>
<td class="exampleChar">๐</td>
<td class="exampleChar">๑</td>
<td class="exampleChar">๒</td>
<td class="exampleChar">๓</td>
<td class="exampleChar">๔</td>
<td class="exampleChar">๕</td>
<td class="exampleChar">๖</td>
<td class="exampleChar">๗</td>
<td class="exampleChar">๘</td>
<td class="exampleChar">๙</td>
</tr>
<tr>
<td>Arabic</td>
<td class="exampleChar">٠</td>
<td class="exampleChar">١</td>
<td class="exampleChar">٢</td>
<td class="exampleChar">٣</td>
<td class="exampleChar">٤</td>
<td class="exampleChar">٥</td>
<td class="exampleChar">٦</td>
<td class="exampleChar">٧</td>
<td class="exampleChar">٨</td>
<td class="exampleChar">٩</td>
</tr>
</tbody>
</table>
</aside>
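<p>When the text really contains script-specific digit characters, one possible approach is to map every decimal digit to its ASCII equivalent before matching. The sketch below relies on the decimal digit value that Unicode assigns to such characters, exposed in Python as <code>unicodedata.decimal</code>; the function name <code>fold_digits</code> is an assumption of this example. It has no effect on digits that are only reshaped at display time, since those are already ASCII in the underlying text.</p>

```python
import unicodedata


def fold_digits(text: str) -> str:
    """Replace any character with a decimal digit value by its ASCII digit."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None when ch is not a decimal digit
        out.append(ch if value is None else str(value))
    return "".join(out)


gujarati = "\u0ae8\u0ae6\u0ae8\u0aeb"  # Gujarati digits two, zero, two, five
print(fold_digits(gujarati))
```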
</section>
<section id="orthoVariation">
<h4>Orthographic or Dialectal Variation</h4>
<p>Some languages have different orthographic traditions that vary by region or dialect or allow different spellings of the same word. Searches and spell-checking may need to know about these variations.</p>
<aside class="example">
<p>US English (language tag <code translate="no">en-US</code>) and UK English (language tag <code translate="no">en-GB</code>) have different spelling traditions, which manifest in different ways. For example, <strong>color</strong> versus <strong>colour</strong> or exchanging the letters <em>s</em> and <em>z</em> as in <em>internationali<span style="font-size:125%">Z</span>ation</em> vs. <em>internationali<span style="font-size:125%">S</span>ation</em>. A few words have even more divergent spellings, such as <strong>jail</strong> vs. <strong>gaol</strong>.</p>
<p>The spelling variants for US vs. UK English are mostly standardised; however, the spelling sometimes comes down to personal preference (or lack of knowledge). For example, the US English word 'through' can be spelled 'thru'.</p>
</aside>
<section id="south-asian-scripts">
<h4>South Asian (Indic script) languages</h4>
<p>Indic script languages have many instances of this kind of problem. Sometimes these are spelling errors, but in other cases multiple spellings are acceptable.</p>
<p>For example, the Bengali language (language tag <code class="kw" translate="no">bn</code>) is notorious for having a wide range of spelling variations permitted by the language: nearly 80% of Bengali words have at least two spellings. Many words have 3, 4, or more variations—with at least one word having 16 different <em>valid</em> spellings.</p>
<aside class="example">
<p>One example is the word which transliterates to the Latin script as <em lang="bn-Latn">rani</em>, but which users may spell with different letters and vowel marks. In modern Bengali <span class="codepoint" translate="no"><bdi lang="bn">ণ</bdi> [<span class="uname">U+09A3 BENGALI LETTER NNA</span>]</span> and <span class="codepoint" translate="no"><bdi lang="bn">ন</bdi> [<span class="uname">U+09A8 BENGALI LETTER NA</span>]</span> are pronounced /n/, and <span class="codepoint" translate="no"><bdi lang="bn">ি</bdi> [<span class="uname">U+09BF BENGALI VOWEL SIGN I </span>]</span> and <span class="codepoint" translate="no"><bdi lang="bn">ী</bdi> [<span class="uname">U+09C0 BENGALI VOWEL SIGN II </span>]</span> are both pronounced /i/. Therefore different users might choose any of the following alternative <a>code point</a> sequences for the same word:</p>
<table>
<thead>
<tr>
<th></th>
<th><span class="uname">U+09A8 BENGALI LETTER NA</span></th>
<th><span class="uname">U+09A3 BENGALI LETTER NNA</span></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="2"><span class="uname">U+09BF BENGALI VOWEL SIGN I</span></th>
<td style="text-align:center">
<span class="exampleChar" translate="no"><bdi lang="bn">রানি</bdi></span>
</td>
<td style="text-align:center">
<span class="exampleChar" translate="no"><bdi lang="bn">রাণি</bdi></span>
</td>
</tr>
<tr>
<td><span class="codepoint" translate="no">U+09B0 U+09BE U+09A8 U+09BF</span></td>
<td><span class="codepoint" translate="no">U+09B0 U+09BE U+09A3 U+09BF</span></td>
</tr>
<tr>
<th rowspan="2"><span class="uname">U+09C0 BENGALI VOWEL SIGN II</span></th>
<td style="text-align:center">
<span class="exampleChar" translate="no"><bdi lang="bn">রানী</bdi></span>
</td>
<td style="text-align:center">
<span class="exampleChar" translate="no"><bdi lang="bn">রাণী</bdi></span>
</td>
</tr>
<tr>
<td><span class="codepoint" translate="no">U+09B0 U+09BE U+09A8 U+09C0</span></td>
<td><span class="codepoint" translate="no">U+09B0 U+09BE U+09A3 U+09C0</span></td>
</tr>
</tbody>
</table>
</aside>
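<p>One way a search implementation might tolerate such variation is to fold the interchangeable characters to a single representative before comparing. The following is a minimal sketch in Python, not a recommended algorithm: the folding table <code>BENGALI_FOLDS</code> and the function <code>bengali_match_key</code> are hypothetical names, and the table covers only the two pairs shown in the example above. Such a folding is application-defined; Unicode normalization does not provide it.</p>

```python
# Hypothetical, application-defined folding table (an assumption for this
# sketch): it maps each variant from the example above to one representative.
BENGALI_FOLDS = str.maketrans({
    "\u09A3": "\u09A8",  # BENGALI LETTER NNA    -> BENGALI LETTER NA
    "\u09C0": "\u09BF",  # BENGALI VOWEL SIGN II -> BENGALI VOWEL SIGN I
})


def bengali_match_key(word):
    """Return a comparison key in which the listed variants are folded together."""
    return word.translate(BENGALI_FOLDS)


# The four spellings of "rani" from the table above.
spellings = [
    "\u09B0\u09BE\u09A8\u09BF",
    "\u09B0\u09BE\u09A3\u09BF",
    "\u09B0\u09BE\u09A8\u09C0",
    "\u09B0\u09BE\u09A3\u09C0",
]
keys = {bengali_match_key(s) for s in spellings}
```

<p>In this sketch all four spellings produce the same key, so a search for any one of them could find the others. A real implementation would need a much larger, language-specific table and would have to avoid folding words that the letters genuinely distinguish.</p>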
<p>Other Indic scripts provide alternative mechanisms for representing particular sounds, and in most cases either representation is considered equally valid. The most common instance of this involves representation of syllable-final nasals.</p>
<p>For example, the <samp translate="no">/n/</samp> sound in the word for <em>snake</em> in Hindi can be written using either <span class="" translate="no"><bdi lang="hi">ँ</bdi></span> [<span class="uname">U+0901 DEVANAGARI SIGN CANDRABINDU</span>] or <span class="" translate="no"><bdi lang="hi">ं</bdi></span> [<span class="uname">U+0902 DEVANAGARI SIGN ANUSVARA</span>]. Both of the following are valid spellings:</p>
<aside class="example">
<table>
<thead>
<tr>
<th>Description</th>
<th>Example</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2">With <span class="" translate="no"><bdi lang="hi">ँ</bdi></span> [<span class="uname">U+0901 DEVANAGARI SIGN CANDRABINDU</span>]</td>
<td class="exampleChar" style="text-align:center">साँप</td>
</tr>
<tr>
<td style="text-align:center">U+0938 U+093E U+0901 U+092A</td>
</tr>
<tr>
<td rowspan="2">With <span class="" translate="no"><bdi lang="hi">ं</bdi></span> [<span class="uname">U+0902 DEVANAGARI SIGN ANUSVARA</span>]</td>
<td class="exampleChar" style="text-align:center">सांप</td>
</tr>
<tr>
<td style="text-align:center">U+0938 U+093E U+0902 U+092A</td>
</tr>
</tbody>
</table>
</aside>
<p>In an additional twist to this story, two diacritics with different code points could be used here. In our previous example we used <span class="codepoint" translate="no"><bdi lang="hi">ं</bdi> [<span class="uname">U+0902 DEVANAGARI SIGN ANUSVARA</span>]</span> to represent the nasal sound because the accompanying vowel-sign rises above the hanging baseline. If the vowel-sign were one that didn't rise above the hanging baseline, we would normally use <span class="codepoint" translate="no"><bdi lang="hi">ँ</bdi> [<span class="uname">U+0901 DEVANAGARI SIGN CANDRABINDU</span>]</span> instead. The function of both of these diacritics is the same, but their code points are different.</p>
<p>The alternative use of either a letter or a diacritic for syllable-final nasals is common to several other Indian languages. In addition to Devanagari, used to write languages such as Hindi (language tag <code translate="no">hi</code>) or Marathi (language tag <code translate="no">mr</code>), scripts such as Malayalam, Gujarati, Odia, and others provide similar spelling options.</p>
<aside class="example" title="Example of another Indic script spelling variation">
<p>Here is an example from Malayalam (<code translate="no">ml</code>) showing alternative spellings of the same word.</p>
<table>
<thead>
<tr>
<th>Description</th>
<th>Example</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2">with <span class="uname" translate="no">U+0D03 MALAYALAM SIGN VISARGA</span></td>
<td class="exampleChar" lang="ml" style="text-align:center">ദുഃഖം</td>
</tr>
<tr>
<td style="text-align:center">U+0D26 U+0D41 U+0D03 U+0D16 U+0D02</td>
</tr>
<tr>
<td rowspan="2">without <span class="uname" translate="no">U+0D03 MALAYALAM SIGN VISARGA</span></td>
<td class="exampleChar" lang="ml" style="text-align:center">ദുഖം</td>
</tr>
<tr>
<td style="text-align:center">U+0D26 U+0D41 U+0D16 U+0D02</td>
</tr>
</tbody>
</table>
</aside>
</section>
</section>
<section id="whitespaceNormalization">
<h4>Whitespace Normalization</h4>
<p>Some languages use whitespace to separate words, sentences, or paragraphs while others do not. When performing sub-string matching, different forms of whitespace found in [[Unicode]] must be normalized so that the match succeeds.</p>
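<p>As a rough illustration, the sketch below collapses every run of whitespace to a single U+0020 SPACE before doing a sub-string comparison. It is written in Python and relies on that language's own definition of whitespace (<code>str.isspace()</code>), which is an assumption of this sketch and not necessarily the set of characters an implementation should choose. The function names are hypothetical.</p>

```python
def normalize_whitespace(text):
    """Collapse each run of whitespace to one U+0020 and trim both ends.

    str.split() with no arguments splits on runs of characters that Python
    treats as whitespace, which includes U+00A0 NO-BREAK SPACE,
    U+2003 EM SPACE and U+3000 IDEOGRAPHIC SPACE.
    """
    return " ".join(text.split())


def whitespace_insensitive_find(needle, haystack):
    """Return True if needle occurs in haystack once whitespace is normalized."""
    return normalize_whitespace(needle) in normalize_whitespace(haystack)
```

<p>With this sketch, the search term <code>string search</code> would match text in which the two words are separated by a no-break space, an ideographic space, or a line break. Characters that are not whitespace, such as U+200B ZERO WIDTH SPACE, are left untouched.</p>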
</section>
<section id="accents">
<h4>Accents and diacritic marks</h4>
<p>When entering search terms in scripts (such as the Latin script) that use various diacritics, users will sometimes omit or vary accents and diacritic marks, even though the text they are searching includes them. This is particularly true on mobile keyboards, where entering these characters can require additional effort. In these cases, users generally expect the search operation to be more "promiscuous", matching the marked text despite the marks missing from their input.</p>
<aside class="example">
<p>Users in languages such as French sometimes omit entering accents when inputting search terms because it is more work to enter the correct character, even though this affects the meaning. For example, they might type <code>cote</code> and might expect to find the variations (which have different meanings) like <code>côte</code> or <code>côté</code>, etc. This is "misspelling".</p>
</aside>
<aside class="example">
<p>German uses several letters that have an <em>umlaut</em> accent, such as <span class="codepoint" translate="no"><bdi lang="de">ö</bdi> [<span class="uname">U+00F6 LATIN SMALL LETTER O WITH DIAERESIS</span>]</span> or <span class="codepoint" translate="no"><bdi lang="de">ü</bdi> [<span class="uname">U+00FC LATIN SMALL LETTER U WITH DIAERESIS</span>]</span>. Users sometimes will enter these accents when searching, but sometimes they replace the umlauts with the letter <code>e</code>. For example, instead of entering <code>Dürst</code> they might enter <code>Duerst</code>. Either spelling is recognizable and has the same meaning. The umlauts are probably "better" than the <code>e</code> spelling, but German speakers are not confused by the difference.</p>
<p class="note">Other languages use these same characters for a different purpose than German does. The formal name of the "umlaut" diacritic in Unicode is <em>diaeresis</em>, which means approximately "break" or "pause". Languages such as French, Spanish, and English occasionally use the diaeresis to indicate the need to pronounce a specific letter, such as the word "<span lang="es">ambigüedad</span>" in Spanish or a name like "Zoë" in English.</p>
</aside>
<p>This effect might vary depending on context as well. For example, a person using a physical keyboard may have direct access to accented letters, while a virtual or on-screen keyboard may require extra effort to access and select the same letters.</p>
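<p>The sketch below shows one way the variations described in this section could be generated for a search term. It is written in Python and is illustrative only: <code>strip_marks</code>, <code>search_variants</code>, and the <code>GERMAN_EXPANSIONS</code> table are hypothetical names, and the German expansion is a language-specific assumption that would be wrong for the French, Spanish, or English uses of the diaeresis.</p>

```python
import unicodedata

# Assumed, German-only convention: an umlauted vowel may be typed as vowel + "e".
GERMAN_EXPANSIONS = {
    "\u00E4": "ae", "\u00F6": "oe", "\u00FC": "ue",  # a, o, u with diaeresis
    "\u00C4": "Ae", "\u00D6": "Oe", "\u00DC": "Ue",  # A, O, U with diaeresis
}


def strip_marks(text):
    """Remove combining marks by decomposing (NFD) and dropping them."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def search_variants(term, german=False):
    """Return the set of spellings a lenient search might treat as the same term."""
    nfc = unicodedata.normalize("NFC", term)
    variants = {nfc, strip_marks(nfc)}
    if german:
        variants.add("".join(GERMAN_EXPANSIONS.get(ch, ch) for ch in nfc))
    return variants
```

<p>Under these assumptions, <code>côté</code> yields the variants <code>côté</code> and <code>cote</code>, while <code>Dürst</code> with the German option yields <code>Dürst</code>, <code>Durst</code>, and <code>Duerst</code>. Removing marks discards distinctions that change meaning, so an implementation might reserve it for cases where the user's input contains no marks at all.</p>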
</section>
<section id="optional-characters">
<h4>Optional characters</h4>
<p>In some orthographies it is necessary to match strings with different numbers of characters.</p>
<p>A prime example of this involves vowel diacritics in <a>abjads</a>. For example, some languages that use the Arabic and Hebrew scripts do not require (but optionally allow) the user to input short vowels. (For some other languages in these scripts, the inclusion of the short vowels is not optional.) The presence or absence of vowels in the text being input or searched might impede a match if the user doesn't enter or know to enter them.</p>
<aside class="example">
<p>Arabic, Persian, and Urdu users generally do not enter short vowels—but some texts do include them. Searching is affected by this, but meaning generally is not. A generalized description of this might be "optional to encode" sequences.</p>
</aside>
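<p>One possible approach to "optional to encode" sequences is to remove the optional marks from both the search term and the text before comparing. The Python sketch below assumes that the optional marks are the Arabic combining marks in the range U+064B to U+0652 (the short vowel signs together with tanwin, shadda, and sukun). That range, and the function names, are assumptions made for illustration; the correct set depends on the language, and for some languages these marks are not optional.</p>

```python
# Assumption for this sketch: treat U+064B..U+0652 as "optional to encode".
# The range covers FATHATAN through SUKUN, including SHADDA.
OPTIONAL_MARKS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_optional_marks(text):
    """Drop the marks listed in OPTIONAL_MARKS, keeping everything else."""
    return "".join(ch for ch in text if ch not in OPTIONAL_MARKS)


def optional_marks_match(query, text):
    """Return True if query occurs in text once optional marks are ignored."""
    return strip_optional_marks(query) in strip_optional_marks(text)
```

<p>With this sketch, an unvowelled search term such as U+0643 U+062A U+0628 would be found in text that encodes the same letters with a fatha (U+064E) after each one.</p>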
</section>
<section id="visually-identical-non-canonical">
<h4>Visually identical text that is not canonically equivalent</h4>
<p>In some cases, visually similar or identical glyph patterns can be made from different sequences of code points. Sometimes this is intentional and the variations can be removed via Unicode normalization. But there are other cases in which similar-appearing [= graphemes =] are not made the same by normalization, and they are not semantically equivalent.</p>
<aside class="example">
<p>For example, here are a number of character sequences that produce the same or similar textual appearance in the Malayalam script. The inappropriate sequences should be avoided because they will cause the meaning of the text to change: searches, matching and other aspects of the text will fail to be understood by the application or the font. In some cases, fonts will indicate that there is a problem by forcing the appearance of a dotted circle or otherwise failing to render the text correctly, but this may not always be the case.</p>
<table>
<thead>
<tr>
<th>Use</th>
<th>Do <em>not</em> use</th>
</tr>
</thead>
<tbody>
<tr>
<td><bdi class="exampleChar">ൈ</bdi></td>
<td><bdi class="exampleChar">െെ</bdi></td>
</tr>
<tr>
<td>[<span class="uname" translate="no">U+0D48 MALAYALAM VOWEL SIGN AI</span>]</td>
<td>[<span class="uname" translate="no">U+0D46 MALAYALAM VOWEL SIGN E + U+0D46 VOWEL SIGN E</span>]</td>
</tr>
<tr>
<td><bdi class="exampleChar">ഈ</bdi></td>
<td><bdi class="exampleChar">ഇൗ</bdi></td>
</tr>
<tr>
<td>[<span class="uname" translate="no">U+0D08 MALAYALAM LETTER II</span>]</td>
<td>[<span class="uname" translate="no">U+0D07 MALAYALAM LETTER I + U+0D57 AU LENGTH MARK</span>]</td>
</tr>
<tr>
<td><bdi class="exampleChar">ഊ</bdi></td>
<td><bdi class="exampleChar">ഉൗ</bdi></td>
</tr>
<tr>
<td>[<span class="uname" translate="no">U+0D0A MALAYALAM LETTER UU</span>]</td>
<td>[<span class="uname" translate="no">U+0D09 MALAYALAM LETTER U + U+0D57 AU LENGTH MARK</span>]</td>
</tr>
<tr>
<td><bdi class="exampleChar">ഓ</bdi></td>
<td><bdi class="exampleChar">ഒാ</bdi></td>
</tr>
<tr>
<td>[<span class="uname" translate="no">U+0D13 MALAYALAM LETTER OO</span>]</td>
<td>[<span class="uname" translate="no">U+0D12 MALAYALAM LETTER O + U+0D3E VOWEL SIGN AA</span>]</td>
</tr>
<tr>
<td><bdi class="exampleChar">ഐ</bdi></td>
<td><bdi class="exampleChar">എെ</bdi></td>
</tr>
<tr>
<td>[<span class="uname" translate="no">U+0D10 MALAYALAM LETTER AI</span>]</td>
<td>[<span class="uname" translate="no">U+0D0E MALAYALAM LETTER E + U+0D46 VOWEL SIGN E</span>]</td>
</tr>
<tr>
<td><bdi class="exampleChar">ഔ</bdi></td>
<td><bdi class="exampleChar">ഒൗ</bdi></td>
</tr>
<tr>
<td>[<span class="uname" translate="no">U+0D14 MALAYALAM LETTER AU</span>]</td>
<td>[<span class="uname" translate="no">U+0D12 MALAYALAM LETTER O + U+0D57 MALAYALAM AU LENGTH MARK</span>]</td>
</tr>
</tbody>
</table>
</aside>
<p>Some languages which use the Arabic script also have [= graphemes =] which can be encoded in more than one way. In some cases, these variations are handled by <a href="#unicodeNormalization">Unicode Normalization</a>, but in other cases they are not considered equivalent by Unicode, even if they appear visually to be identical. Sometimes these variations are considered to be valid spelling variations. In other cases they are the result of a user's mistaken perception.</p>
<aside class="example">
<p>A number of languages are written in the Arabic script but are unrelated to the Arabic language. Some of these languages therefore require character sequences to represent sounds not present in Arabic. A significant problem for some of these languages is that these specially-encoded character sequences can be visually similar (or identical) to character sequences encoded for other uses, and users may experience difficulty entering or knowing how to enter the correct sequence, such as when inputting a search term.</p>
<p>One such language is Kashmiri (language tag <kbd>ks</kbd>). Here are some selected examples one might find in Kashmiri:</p>
<table>
<thead>
<tr>
<th>Description</th>
<th colspan=4 style="text-align:center">Examples</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Canonically equivalent alternatives</strong><br/>(differences resolved by Unicode Normalization)</td>
<td class="exampleChar">إ</td>
<td><code class="uname" translate="no">U+0625 ARABIC LETTER ALEF WITH HAMZA BELOW</code></td>
<td class="exampleChar">إ</td>
<td><code class="uname" translate="no">U+0627 ARABIC LETTER ALEF</code> + <code class="uname" translate="no">U+0655 ARABIC HAMZA BELOW</code></td>
</tr>
<tr>
<td><strong>Not canonically equivalent</strong><br/>(differences that <em>remain</em> after Unicode Normalization) Many of these are linked to user perception of whether the vowel is part of the base letter (<em lang="ar-Latn" translate="no">ijam</em>) vs. separable (<em lang="ar-Latn" translate="no">tashkil</em>)</td>
<td class="exampleChar">ێ</td>
<td><code class="uname" translate="no">U+06CE ARABIC LETTER YEH WITH SMALL V</code></td>
<td class="exampleChar">یٚ</td>
<td><code class="uname" translate="no">U+06CC ARABIC LETTER FARSI YEH</code> + <code class="uname" translate="no">U+065A ARABIC VOWEL SIGN SMALL V ABOVE</code></td>
</tr>
<tr>
<td><strong>Confusables or spelling errors</strong><br/>these can be common in certain kinds of text due to gaps in keyboard support or due to a similarity in appearance</td>
<td class="exampleChar">ئ</td>
<td><code class="uname" translate="no">U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE</code></td>
<td class="exampleChar">یٔ</td>
<td><code class="uname" translate="no">U+06CC ARABIC LETTER FARSI YEH</code> + <code class="uname" translate="no">U+0654 ARABIC HAMZA ABOVE</code></td>
</tr>
</tbody>
</table>
<p>(For more information, see Richard Ishida's doc <a href="https://r12a.github.io/scripts/arabic/ks.html#encoding">here</a>.)</p>
</aside>
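<p>The difference between the rows of the table above can be checked mechanically. The following Python sketch compares two sequences after normalizing both to NFC using the standard <code>unicodedata</code> module; the function name <code>canonically_equivalent</code> is hypothetical. It detects only canonical equivalence, so the visually identical pairs in the second and third rows are reported as different.</p>

```python
import unicodedata


def canonically_equivalent(a, b):
    """Return True if a and b are identical after normalization to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Row 1: ALEF WITH HAMZA BELOW vs. ALEF + HAMZA BELOW
row1 = canonically_equivalent("\u0625", "\u0627\u0655")
# Row 2: YEH WITH SMALL V vs. FARSI YEH + VOWEL SIGN SMALL V ABOVE
row2 = canonically_equivalent("\u06CE", "\u06CC\u065A")
# Row 3: YEH WITH HAMZA ABOVE vs. FARSI YEH + HAMZA ABOVE
row3 = canonically_equivalent("\u0626", "\u06CC\u0654")
```

<p>Only the first pair compares as equal. Matching the other two would need additional, language-specific folding beyond Unicode normalization.</p>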
</section>
</section>
<section id="wordBoundary">
<h3>Word boundaries and "whole word" matching</h3>
<p>Some languages, such as English or Arabic, use spaces between words. Other languages, such as Chinese, Japanese, or Thai, don't. In many non-spacing languages, computing "whole word" matching depends on the ability to determine word boundaries when the boundaries are not themselves encoded into the text.</p>
</section>
</section><!-- end of "additional types of equivalence" -->
<section id="searchingConsiderations">
<h2>Considerations for Searching</h2>
<p class="issue">This section was identified as a new area needing documentation as part of the overall rearchitecting of the document. The text here is incomplete and needs further development. Contributions from the community are invited.</p>
<p>Implementers often need to provide simple "find text" algorithms, and specifications often try to define APIs to support these needs. Find operations on text generate different user expectations, and thus have different requirements, from the absolute identity matching needed by document formats and protocols. It is important to note that domain-specific requirements may impose additional restrictions or alter the considerations presented here.</p>
<p class="advisement">Increasing input effort from the user SHOULD be mirrored by more selective matching.</p>
<p>When the user expends more effort on the input—by using the shift key to produce uppercase or by entering a letter with diacritics instead of just the base letter—they might expect their search results to match (only) their more-specific input.</p>
<aside class="example">
<p>Consider a document containing these strings: "re-resume", "RE-RESUME", "re-résumé", and "RE-RÉSUMÉ".</p>
<p>In the table below, the user's input (on the left) might be considered a match for the above items as follows:</p>
<table class="data">
<tbody>
<tr>
<th scope="col">User Input</th>
<th scope="col">Matched Strings</th>
</tr>
<tr>
<td>e (lowercase 'e')</td>
<td>"re-resume", "RE-RESUME", "re-résumé", and "RE-RÉSUMÉ"</td>
</tr>
<tr>
<td>E (uppercase 'E')</td>
<td>"RE-RESUME" and "RE-RÉSUMÉ"</td>
</tr>
<tr>
<td>é (lowercase 'e' with acute accent)</td>
<td>"re-résumé" and "RE-RÉSUMÉ"</td>
</tr>
<tr>
<td>É (uppercase 'E' with acute accent)</td>
<td>"RE-RÉSUMÉ"</td>
</tr>
</tbody>
</table>
</aside>
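<p>The behavior in the table above could be approximated as follows. This Python sketch is one possible reading of the advisement, not a specified algorithm: it treats the search as accent-insensitive only when the user's input contains no combining marks, and as case-insensitive only when the input contains no uppercase letters. The names <code>strip_marks</code> and <code>asymmetric_find</code> are hypothetical, and the use of <code>str.lower()</code> rather than full Unicode case folding is a simplification.</p>

```python
import unicodedata


def strip_marks(text):
    """Remove combining marks by decomposing (NFD) and dropping them."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def asymmetric_find(query, text):
    """Return True if query matches text, being lenient only where the
    user's input was itself unspecific."""
    q = unicodedata.normalize("NFC", query)
    t = unicodedata.normalize("NFC", text)
    if strip_marks(q) == q:
        # No diacritics typed: ignore diacritics in the text.
        t = strip_marks(t)
    if q == q.lower():
        # No uppercase typed: ignore case in the text.
        t = t.lower()
    return q in t


DOCUMENT = ["re-resume", "RE-RESUME", "re-r\u00E9sum\u00E9", "RE-R\u00C9SUM\u00C9"]


def matches(query):
    """List the strings in DOCUMENT that the query matches."""
    return [s for s in DOCUMENT if asymmetric_find(query, s)]
```

<p>Under these assumptions, <code>matches("e")</code> returns all four strings, while <code>matches("\u00C9")</code> (uppercase 'E' with acute accent) returns only the last one, which agrees with the table.</p>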
<section id="SearchOptions">
<h3>Types of Search Option</h3>
<p>When creating a string search API or algorithm, the following textual options might be useful to users:</p>
<ul>
<li>Case-sensitive vs. case-insensitive</li>
<li>Kana folding</li>
<li>Unicode normalization form</li>
<li>etc.</li>
</ul>
</section>
</section>
<section>
<h2 id="Acknowledgements" class="informative">Acknowledgements</h2>
<p>The W3C Internationalization Working Group and Interest Group, as well as others, provided many comments and suggestions. The Working Group would like to thank all of the contributors to the Character Model series of documents over the many years of their development.</p>
<p>The examples in <a href="#text-frag-lang">this example</a> were taken from a page authored by Henri Sivonen, as were a number of concepts and ideas recorded by him in <a href="https://github.com/WICG/scroll-to-text-fragment/issues/233">this issue</a>.</p>
</section>
</body>
</html>