+This stemmer for Armenian was developed and contributed by Astghik Mkrtchyan.
+
+
+
+The following characters are vowels for the purposes of this algorithm:
+
+
+ ա է ի օ ւ ե ո ը
+
+
+
+R2 is the region after the first non-vowel following a vowel after the
+first non-vowel following a vowel, or the end of the word if there is no such
+non-vowel.
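The doubly-nested wording of this definition is easier to see in code. Below is a minimal Python sketch (not part of the Snowball sources) that applies the vowel/non-vowel scan twice; the second test uses a Latin-alphabet word with an English vowel set purely to illustrate the mechanics:

```python
ARMENIAN_VOWELS = set("աէիօւեոը")

def region_after(word, start, vowels):
    # index just past the first non-vowel that follows a vowel at or
    # after `start`; len(word) if there is no such non-vowel
    for i in range(start + 1, len(word)):
        if word[i] not in vowels and word[i - 1] in vowels:
            return i + 1
    return len(word)

def r2_start(word, vowels=ARMENIAN_VOWELS):
    # R2: apply the scan twice, per the definition above
    return region_after(word, region_after(word, 0, vowels), vowels)
```

For example, with vowels `aeiouy`, `r2_start("beautiful")` is 7, so R2 is `ul`.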
+
+
+The Danish alphabet includes the following additional letters,
+
+
+
+ æ å ø
+
+
+
+The following letters are vowels:
+
+
+
+ a e i o u y æ å ø
+
+
+
+A consonant is defined as a character from ASCII a-z which isn't a vowel
+(originally this was "A consonant is defined as a non-vowel" but since
+2018-11-15 we've changed this definition to avoid the stemmer altering
+alphanumeric codes which end with a repeated digit).
+
+
+
+R2 is not used: R1 is defined in the same way as in the
+German stemmer.
+(See the note on R1 and R2.)
+
+
+
+Define a valid s-ending as one of
+
+
+
+a b c d f g h j k l m n o p r
+ t v y z å
+
+
+
+Do each of steps 1, 2, 3 and 4.
+
+
+
+Step 1:
+
+
+
+ Search for the longest among the following suffixes in R1, and
+ perform the action indicated.
+
+
+
(a)
+ hed ethed ered e erede ende erende ene
+ erne ere en heden eren er heder erer
+ heds es endes erendes enes ernes eres
+ ens hedens erens ers ets erets et eret
+
delete
+
(b)
+ s
+
delete if preceded by a valid s-ending
+
+
+ (Of course the letter of the valid s-ending is
+ not necessarily in R1)
+
+
+
+
+Step 2:
+
+
+
+ Search for one of the following suffixes in R1, and if found
+ delete the last letter.
+
+
+ gd dt gt kt
+
+ (For example, friskt → frisk)
+
+
+Step 3:
+
+
+ If the word ends igst, remove the final st.
+
+
+
+ Search for the longest among the following suffixes in R1, and
+ perform the action indicated.
+
+
+
(a)
+ ig lig elig els
+
delete, and then repeat step 2
+
(b)
+ løst
+
replace with løs
+
+
+
+Step 4: undouble
+
+
+    If the word ends with a double consonant in R1, remove one of the
+    consonants.
+
+
+
+ (For example, bestemmelse → bestemmels (step 1)
+ → bestemm (step 3a)
+ → bestem in this step.)
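The four steps can be sketched in plain Python as follows. This is an illustration only, assuming the German-style R1 adjustment mentioned above (at least three letters before R1); the Snowball source is authoritative:

```python
VOWELS = set("aeiouyæåø")
VALID_S_ENDINGS = set("abcdfghjklmnoprtvyzå")
STEP1_SUFFIXES = sorted(
    """hed ethed ered e erede ende erende ene erne ere en heden eren er
       heder erer heds es endes erendes enes ernes eres ens hedens erens
       ers ets erets et eret""".split(),
    key=len, reverse=True)

def r1_start(word):
    # R1 begins after the first non-vowel that follows a vowel; as in the
    # German stemmer it is adjusted so at least 3 letters precede it.
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return max(i + 1, 3)
    return len(word)

def step2(word, p1):
    # remove the last letter of gd, dt, gt, kt found in R1
    for suf in ("gd", "dt", "gt", "kt"):
        if word.endswith(suf) and len(word) - 2 >= p1:
            return word[:-1]
    return word

def danish_stem(word):
    word = word.lower()
    p1 = r1_start(word)          # regions are positional: p1 is fixed now
    r1 = lambda w: w[p1:]
    # Step 1: delete the longest matching suffix in R1; a bare s only
    # if preceded by a valid s-ending
    for suf in STEP1_SUFFIXES:
        if r1(word).endswith(suf):
            word = word[:-len(suf)]
            break
    else:
        if r1(word).endswith("s") and len(word) > 1 and word[-2] in VALID_S_ENDINGS:
            word = word[:-1]
    # Step 2
    word = step2(word, p1)
    # Step 3
    if word.endswith("igst"):
        word = word[:-2]
    for suf in ("elig", "løst", "lig", "els", "ig"):
        if r1(word).endswith(suf):
            word = word[:-1] if suf == "løst" else step2(word[:-len(suf)], p1)
            break
    # Step 4: undouble a final double consonant in R1
    if len(word) - 2 >= p1 and word[-1] == word[-2] and word[-1] not in VOWELS:
        word = word[:-1]
    return word
```

Tracing `bestemmelse` reproduces the worked example above: step 1 deletes `e`, step 3a deletes `els` (step 2 then finds nothing), and step 4 undoubles `mm`, giving `bestem`.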
+
+
+
+
+
The same algorithm in Snowball
+
+[% highlight_file('danish') %]
+
+[% footer %]
diff --git a/algorithms/danish/stop.txt b/algorithms/danish/stop.txt
new file mode 100644
index 0000000..3705204
--- /dev/null
+++ b/algorithms/danish/stop.txt
@@ -0,0 +1,102 @@
+
+ | A Danish stop word list. Comments begin with vertical bar. Each stop
+ | word is at the start of a line.
+
+ | This is a ranked list (commonest to rarest) of stopwords derived from
+ | a large text sample.
+
+
+og | and
+i | in
+jeg | I
+det | that (dem. pronoun)/it (pers. pronoun)
+at | that (in front of a sentence)/to (with infinitive)
+en | a/an
+den | it (pers. pronoun)/that (dem. pronoun)
+til | to/at/for/until/against/by/of/into, more
+er | present tense of "to be"
+som | who, as
+på | on/upon/in/on/at/to/after/of/with/for, on
+de | they
+med | with/by/in, along
+han | he
+af | of/by/from/off/for/in/with/on, off
+for | at/for/to/from/by/of/ago, in front/before, because
+ikke | not
+der | who/which, there/those
+var | past tense of "to be"
+mig | me/myself
+sig | oneself/himself/herself/itself/themselves
+men | but
+et | a/an/one, one (number), someone/somebody/one
+har | present tense of "to have"
+om | round/about/for/in/a, about/around/down, if
+vi | we
+min | my
+havde | past tense of "to have"
+ham | him
+hun | she
+nu | now
+over | over/above/across/by/beyond/past/on/about, over/past
+da | then, when/as/since
+fra | from/off/since, off, since
+du | you
+ud | out
+sin | his/her/its/one's
+dem | them
+os | us/ourselves
+op | up
+man | you/one
+hans | his
+hvor | where
+eller | or
+hvad | what
+skal | must/shall etc.
+selv | myself/yourself/herself/ourselves etc., even
+her | here
+alle | all/everyone/everybody etc.
+vil | will (verb)
+blev | past tense of "to stay/to remain/to get/to become"
+kunne | could
+ind | in
+når | when
+være | present tense of "to be"
+dog | however/yet/after all
+noget | something
+ville | would
+jo | you know/you see (adv), yes
+deres | their/theirs
+efter | after/behind/according to/for/by/from, later/afterwards
+ned | down
+skulle | should
+denne | this
+end | than
+dette | this
+mit | my/mine
+også | also
+under | under/beneath/below/during, below/underneath
+have | have
+dig | you
+anden | other
+hende | her
+mine | my
+alt | everything
+meget | much/very, plenty of
+sit | his, her, its, one's
+sine | his, her, its, one's
+vor | our
+mod | against
+disse | these
+hvis | if
+din | your/yours
+nogle | some
+hos | by/at
+blive | be/become
+mange | many
+ad | by/through
+bliver | present tense of "to be/to become"
+hendes | her/hers
+været | be
+thi | for (conj)
+jer | you
+sådan | such, like this/like that
diff --git a/algorithms/dutch/stemmer.html b/algorithms/dutch/stemmer.html
new file mode 100644
index 0000000..a5e06c0
--- /dev/null
+++ b/algorithms/dutch/stemmer.html
@@ -0,0 +1,524 @@
+
+
+
+
+
+
+
+
+
+ Dutch stemming algorithm - Snowball
+
+
+
+
+
+
+
+
+
+
+
+ Search for the longest among the following suffixes, and perform the
+ action indicated.
+
+
+
end ing
+
delete if in R2
+
if preceded by ig, delete if in R2 and not preceded by e, otherwise
+ undouble the ending
+
+
ig
+
delete if in R2 and not preceded by e
+
+
lijk
+
delete if in R2, and then repeat step 2
+
+
baar
+
delete if in R2
+
+
bar
+
delete if in R2 and if step 2 actually removed an e
+
+
+Step 4: undouble vowel
+
+    If the word ends CVD, where C is a non-vowel, D is a non-vowel other
+ than I, and V is double a, e, o or u, remove one of the vowels from
+ V (for example, maan → man, brood → brod).
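A sketch of this undoubling rule in Python (illustrative only; the vowel set is assumed from the Dutch stemmer's definition, which lies outside this excerpt):

```python
VOWELS = set("aeiouyè")  # assumed Dutch stemmer vowel set

def undouble_vowel(word):
    # If the word ends CVD -- C a non-vowel, V a doubled a/e/o/u, D a
    # non-vowel other than I -- drop one of the doubled vowels.
    if len(word) >= 4:
        c, v1, v2, d = word[-4:]
        if (c not in VOWELS and d not in VOWELS and d != "I"
                and v1 == v2 and v1 in "aeou"):
            return word[:-3] + v1 + d
    return word
```

So `maan` becomes `man` and `brood` becomes `brod`, while words without the CVD shape pass through unchanged.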
+
+
+Finally,
+
+ Turn I and Y back into lower case.
+
+
+
The same algorithm in Snowball
+
+[% highlight_file('dutch') %]
+
+[% footer %]
diff --git a/algorithms/dutch/stop.txt b/algorithms/dutch/stop.txt
new file mode 100644
index 0000000..d9f38a8
--- /dev/null
+++ b/algorithms/dutch/stop.txt
@@ -0,0 +1,113 @@
+
+
+ | A Dutch stop word list. Comments begin with vertical bar. Each stop
+ | word is at the start of a line.
+
+ | This is a ranked list (commonest to rarest) of stopwords derived from
+ | a large sample of Dutch text.
+
+ | Dutch stop words frequently exhibit homonym clashes. These are indicated
+ | clearly below.
+
+de | the
+en | and
+van | of, from
+ik | I, the ego
+te | (1) chez, at etc, (2) to, (3) too
+dat | that, which
+die | that, those, who, which
+in | in, inside
+een | a, an, one
+hij | he
+het | the, it
+niet | not, nothing, naught
+zijn | (1) to be, being, (2) his, one's, its
+is | is
+was | (1) was, past tense of all persons sing. of 'zijn' (to be) (2) wax, (3) the washing, (4) rise of river
+op | on, upon, at, in, up, used up
+aan | on, upon, to (as dative)
+met | with, by
+als | like, such as, when
+voor | (1) before, in front of, (2) furrow
+had | had, past tense all persons sing. of 'hebben' (have)
+er | there
+maar | but, only
+om | round, about, for etc
+hem | him
+dan | then
+zou | should/would, past tense all persons sing. of 'zullen'
+of | or, whether, if
+wat | what, something, anything
+mijn | possessive and noun 'mine'
+men | people, 'one'
+dit | this
+zo | so, thus, in this way
+door | through, by
+over | over, across
+ze | she, her, they, them
+zich | oneself
+bij | (1) a bee, (2) by, near, at
+ook | also, too
+tot | till, until
+je | you
+mij | me
+uit | out of, from
+der | Old Dutch form of 'van der' still found in surnames
+daar | (1) there, (2) because
+haar | (1) her, their, them, (2) hair
+naar | (1) unpleasant, unwell etc, (2) towards, (3) as
+heb | present first person sing. of 'to have'
+hoe | how, why
+heeft | present third person sing. of 'to have'
+hebben | 'to have' and various parts thereof
+deze | this
+u | you
+want | (1) for, (2) mitten, (3) rigging
+nog | yet, still
+zal | 'shall', first and third person sing. of verb 'zullen' (will)
+me | me
+zij | she, they
+nu | now
+ge | 'thou', still used in Belgium and south Netherlands
+geen | none
+omdat | because
+iets | something, somewhat
+worden | to become, grow, get
+toch | yet, still
+al | all, every, each
+waren | (1) 'were', (2) to wander, (3) wares
+veel | much, many
+meer | (1) more, (2) lake
+doen | to do, to make
+toen | then, when
+moet | noun 'spot/mote' and present form of 'to must'
+ben | (1) am, (2) 'are' in interrogative second person singular of 'to be'
+zonder | without
+kan | noun 'can' and present form of 'to be able'
+hun | their, them
+dus | so, consequently
+alles | all, everything, anything
+onder | under, beneath
+ja | yes, of course
+eens | once, one day
+hier | here
+wie | who
+werd | imperfect third person sing. of 'become'
+altijd | always
+doch | yet, but etc
+wordt | present third person sing. of 'become'
+wezen | (1) to be, (2) 'been' as in 'been fishing', (3) orphans
+kunnen | to be able
+ons | us/our
+zelf | self
+tegen | against, towards, at
+na | after, near
+reeds | already
+wil | (1) present tense of 'want', (2) 'will', noun, (3) fender
+kon | could; past tense of 'to be able'
+niets | nothing
+uw | your
+iemand | somebody
+geweest | been; past participle of 'be'
+andere | other
+
diff --git a/algorithms/english-combining-forms.png b/algorithms/english-combining-forms.png
new file mode 100644
index 0000000..ecac711
Binary files /dev/null and b/algorithms/english-combining-forms.png differ
diff --git a/algorithms/english/stemmer.html b/algorithms/english/stemmer.html
new file mode 100644
index 0000000..16b6714
--- /dev/null
+++ b/algorithms/english/stemmer.html
@@ -0,0 +1,1069 @@
+
+
+
+
+
+
+
+
+
+ The English (Porter2) stemming algorithm - Snowball
+
+
+
+
+
+
+
+
+
+
+
+(Revised slightly, December 2001)
+(Further revised, September 2002)
+
+
+
+I have made more than one attempt to improve the structure of the Porter
+algorithm by making it follow the pattern of ending removal of the Romance
+language stemmers. It is not hard to see why one should want to do this:
+step 1b of the Porter stemmer removes ed and ing, which are
+i-suffixes (*) attached to verbs. If these suffixes are removed, there
+should be no need to remove d-suffixes which are not verbal, although
+it will try to do so. This seems to be a deficiency in the Porter stemmer,
+not shared by the Romance stemmers. Again, the divisions between steps
+2, 3 and 4 seem rather arbitrary, and are not found in the Romance stemmers.
+
+
+
+Nevertheless, these attempts at improvement have been abandoned. They seem
+to lead to a more complicated algorithm with no very obvious improvements.
+A reason for not taking note of the outcome of step 1b may be that
+English endings do not determine word categories quite as strongly as
+endings in the Romance languages. For example, condition and
+position in French have to be nouns, but in English they can be verbs
+as well as nouns,
+
+ We are all conditioned by advertising
+ They are positioning themselves differently today
+
+A possible reason for having separate steps 2, 3 and 4 is that
+d-suffix combinations in English are quite complex, a point which has
+been made
+elsewhere.
+
+
+
+But it is hardly surprising that after twenty years of use of the Porter
+stemmer, certain improvements did suggest themselves, and a new algorithm
+for English is therefore offered here. (It could be called the ‘Porter2’
+stemmer to distinguish it from the Porter stemmer, from which it derives.)
+The changes are not so very extensive: (1) terminating y is changed to
+i rather less often, (2) suffix us does not lose its s, (3) a
+few additional suffixes are included for removal, including (4) suffix
+ly. In addition, a small list of exceptional forms is included. In
+December 2001 there were two further adjustments: (5) Steps 5a and 5b
+of the old Porter stemmer were combined into a single step. This means
+that undoubling final ll is not done with removal of final e. (6)
+In Step 3 ative is removed only when in region R2.
+(7)
+In July
+2005 a small adjustment was made (including a new step 0) to handle
+apostrophe.
+
+
+
+To begin with, here is the basic algorithm without reference to the
+exceptional forms. An exact comparison with the Porter algorithm needs to
+be done quite carefully if done at all. Here we indicate by * points
+of departure, and by + additional features. In the sample vocabulary,
+Porter and Porter2 stem slightly under 5% of words to different forms.
+
+
+
Definition of the English stemmer
+
+
+Define a vowel as one of
+
+ a e i o u y
+
+Define a double as one of
+
+ bb dd ff gg mm nn pp rr tt
+
+Define a valid li-ending as one of
+
+ c d e g h k m n r t
+
+
+R1 is the region after the first non-vowel following a vowel, or the end of
+the word if there is no such non-vowel. (This definition may be modified for certain exceptional
+words — see below.)
+
+
+
+R2 is the region after the first non-vowel following a vowel in R1, or the
+end of the word if there is no such non-vowel.
+(See note on R1 and R2.)
+
+
+
+Define a short syllable in a word as either (a) a vowel followed by a
+non-vowel other than w, x or Y and preceded by a non-vowel, or
+*
+(b) a vowel at the beginning of the word followed by a non-vowel.
+
+
+
+So rap,
+trap, entrap end with a short syllable, and ow, on, at are
+classed as short syllables. But uproot, bestow, disturb do not end with a
+short syllable.
+
+
+
+A word is called short if it ends in a short syllable, and if R1 is null.
+
+
+
+So bed, shed and shred are short words, bead, embed, beds are
+not short words.
+
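These definitions can be sketched in Python. Note one simplification: this sketch treats y as a vowel throughout, whereas the real algorithm first marks consonantal y as Y (see the vowel-marking step below):

```python
VOWELS = set("aeiouy")  # simplified: consonantal y is not marked as Y here

def r1_start(word):
    # R1 begins after the first non-vowel that follows a vowel
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)

def ends_short_syllable(word):
    if len(word) == 2:                      # definition (b)
        return word[0] in VOWELS and word[1] not in VOWELS
    if len(word) >= 3:                      # definition (a)
        a, b, c = word[-3:]
        return (a not in VOWELS and b in VOWELS
                and c not in VOWELS and c not in "wxY")
    return False

def is_short(word):
    # a short word ends in a short syllable and has an empty R1
    return ends_short_syllable(word) and r1_start(word) >= len(word)
```

This reproduces the examples above: `bed` and `shred` are short, `bead` and `beds` are not, and `ow` counts as a short syllable.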
+
+
+An apostrophe (') may be regarded as a letter.
+(See note on apostrophes in English.)
+
+
+
+If the word has two letters or less, leave it as it is.
+
+
+
+Otherwise, do each of the following operations,
+
+
+
+Remove initial ', if present. + Then,
+
+
+
+Set initial y, or y after a vowel, to Y, and then establish the regions
+R1 and R2.
+(See note on vowel marking.)
+
+
+
+Step 0: +
+
+
+ Search for the longest among the suffixes,
+
+
+
'
+
's
+
's'
+
and remove if found.
+
+
+Step 1a:
+
+
+ Search for the longest among the following suffixes, and perform the
+ action indicated.
+
+
+
sses
+
replace by ss
+
ied+ies*
+
replace by i if preceded by more than one letter, otherwise by ie
+ (so ties → tie, cries → cri)
+
s
+
delete if the preceding word part contains a vowel not immediately before the
+s (so gas and this retain the s, gaps and kiwis lose it)
+
us+ss
+
do nothing
+
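The s rule of step 1a can be sketched as follows (illustrative only; `vowels` here ignores the Y-marking of the full algorithm):

```python
def step1a_s(word, vowels=set("aeiouy")):
    # delete a final s when some vowel occurs before the letter that
    # immediately precedes the s
    if word.endswith("s") and any(ch in vowels for ch in word[:-2]):
        return word[:-1]
    return word
```

As in the examples above, `gas` and `this` keep their s (no vowel earlier than the penultimate letter), while `gaps` and `kiwis` lose it.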
+
+
+Step 1b:
+
+
+ Search for the longest among the following suffixes, and perform the
+ action indicated.
+
+
+
eed eedly+
+
replace by ee if in R1
+
ed edly+ing ingly+
+
delete if the preceding word part contains a vowel, and after the deletion:
+
if the word ends at, bl or iz add e (so luxuriat → luxuriate), or
+
if the word ends with a double
+ remove the last letter (so hopp → hop), or
+
if the word is short, add e (so hop → hope)
+
+
+
+Step 1c: *
+
+ replace suffix y or Y by i if preceded by a non-vowel which is not the
+ first letter of the word (so cry → cri, by → by, say → say)
+
+
+Step 2:
+
+
+ Search for the longest among the following suffixes, and, if
+ found and in R1, perform the action indicated.
+
+
+
tional: replace by tion
+
enci: replace by ence
+
anci: replace by ance
+
abli: replace by able
+
entli: replace by ent
+
izer ization: replace by ize
+
ational ation ator: replace by ate
+
alism aliti alli: replace by al
+
fulness: replace by ful
+
ousli ousness: replace by ous
+
iveness iviti: replace by ive
+
biliti bli+: replace by ble
+
ogi+: replace by og if preceded by l
+
fulli+: replace by ful
+
lessli+: replace by less
+
li+: delete if preceded by a valid li-ending
+
+
+
+Step 3:
+
+
+ Search for the longest among the following suffixes, and, if
+ found and in R1, perform the action indicated.
+
+
+
tional+: replace by tion
+
ational+: replace by ate
+
alize: replace by al
+
icate iciti ical: replace by ic
+
ful ness: delete
+
ative*: delete if in R2
+
+
+
+Step 4:
+
+
+ Search for the longest among the following suffixes, and, if
+ found and in R2, perform the action indicated.
+
+
+
al ance ence er ic able ible ant ement
+ ment ent ism ate iti ous ive ize
+
delete
+
ion
+
delete if preceded by s or t
+
+
+
+Step 5: *
+
+
+ Search for the following suffixes, and, if
+ found, perform the action indicated.
+
+
+
e
+
delete if in R2, or in R1 and not preceded by a short
+ syllable
+
l
+
delete if in R2 and preceded by l
+
+
+
+Finally, turn any remaining Y letters in the word back into lower case.
+
+
+
Exceptional forms in general
+
+
+It is quite easy to expand a Snowball script so that certain exceptional
+word forms get special treatment. The standard case is that certain words
+W1, W2 ..., instead of passing through the stemming process, are
+mapped to the forms X1, X2 ... respectively. If the script does
+the stemming by means of the call
+
+
+[% highlight("
+    define stem as C
+") %]
+
+where C is a command, the exceptional cases can be dealt with by extending this to
+
+[% highlight("
+    define stem as ( exception or C )
+") %]
+
+atlimit causes the whole string to be tested for equality with one of
+the Wi, and if a match is found, the string is replaced with
+Xi.
+
+
+
+More precisely we might have a group of words W11, W12 ...
+that need to be mapped to X1, another group W21, W22
+... that need to be mapped to X2, and so on, and a list of words
+V1, V2 ... Vk that are to remain invariant. The
+exception routine may then be written as follows:
+
+
+And indeed the exception1 routine for the English stemmer has just that
+shape:
+
+
+define exception1 as (
+
+    [substring] atlimit among(
+
+        /* special changes: */
+
+        'skis'      (<- 'ski')
+        'skies'     (<- 'sky')
+        'dying'     (<- 'die')
+        'lying'     (<- 'lie')
+        'tying'     (<- 'tie')
+
+        /* special -LY cases */
+
+        'idly'      (<- 'idl')
+        'gently'    (<- 'gentl')
+        'ugly'      (<- 'ugli')
+        'early'     (<- 'earli')
+        'only'      (<- 'onli')
+        'singly'    (<- 'singl')
+
+        // ... extensions possible here ...
+
+        /* invariant forms: */
+
+        'sky'
+        'news'
+        'howe'
+
+        'atlas' 'cosmos' 'bias' 'andes' // not plural forms
+
+        // ... extensions possible here ...
+    )
+)
+
+
+
+
+(More will be said about the words that appear here shortly.)
+
+
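In a general-purpose language, the same pre-stemming dispatch might look like this. It is a hypothetical sketch: the table names are illustrative, and `base_stemmer` stands in for the main algorithm:

```python
# Exceptional forms looked up before the main algorithm runs
EXCEPTIONS = {
    "skis": "ski", "skies": "sky",
    "dying": "die", "lying": "lie", "tying": "tie",
    "idly": "idl", "gently": "gentl", "ugly": "ugli",
    "early": "earli", "only": "onli", "singly": "singl",
}
# Words left invariant
INVARIANT = {"sky", "news", "howe", "atlas", "cosmos", "bias", "andes"}

def stem(word, base_stemmer):
    if word in EXCEPTIONS:          # mapped directly to a fixed form
        return EXCEPTIONS[word]
    if word in INVARIANT:           # passed through unchanged
        return word
    return base_stemmer(word)       # everything else is stemmed normally
```

The dictionary lookup plays the role that `atlimit among(...)` plays in the Snowball code: a whole-string match that short-circuits the normal stemming path.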
+
+Here we see words being treated exceptionally before stemming is done, but equally we could
+treat stems exceptionally after stemming is done, and so, if we wish, map absorpt to
+absorb, reduct to reduc etc., as in the
+Lovins stemmer.
+But more generally, throughout the algorithm, each significant step may have recognised
+exceptions, and a suitably placed among will take care of them. For example, a point made
+at least twice in the literature is that words beginning gener are overstemmed by the
+Porter stemmer:
+
+
+
+
generate
+ generates
+ generated
+ generating
+ general
+ generally
+ generic
+ generically
+ generous
+ generously
→
gener
+
+
+
+
+To fix this over-stemming, we make an exception to the usual setting of p1,
+the left point of R1, and therefore replace
+
+
+
+gopast v  gopast non-v  setmark p1
+
+
+
+
+with
+
+
+
+among (
+    'gener'
+    // ... and other stems may be included here ...
+) or (gopast v  gopast non-v)
+setmark p1
+
+
+
+
+after which the words beginning gener stem as follows:
+
+
+
+
generate
+ generates
+ generated
+ generating
+
→
generat
+
+
general
+ generally
+
→
general
+
generic
+ generically
+
→
generic
+
generous
+ generously
+
→
generous
+
+
+
+Another example is given by the exception2 routine, which is similar to exception1,
+but placed after the call of Step_1a, which may have removed terminal s,
+
+
+
+define exception2 as (
+
+    [substring] atlimit among(
+        'inning' 'outing' 'canning' 'herring'
+        'proceed' 'exceed' 'succeed'
+
+        // ... extensions possible here ...
+
+    )
+)
+
+
+
+
+Snowball makes it easy therefore to add in lists of exceptions. But deciding what the lists of
+exceptions should be is far from easy. Essentially there are two lines of attack, the
+systematic and the piecemeal. One might systematically treat as exceptions the stem changes of
+irregular verbs, for example. The piecemeal approach is to add in exceptions as people notice
+them — like gener above. The problem with the systematic approach is that it should be
+done by investigating the entire language vocabulary, and that is more than most people are
+prepared to do. The problem with the piecemeal approach is that it is arbitrary, and usually
+yields little.
+
+
+
+The exception lists in the English stemmer are meant to be illustrative (‘this is how it is done if you
+want to do it’), and were derived piecemeal.
+
+
+
+a)
+The new stemmer improves on the Porter stemmer in handling short words ending e and
+y. There is however a mishandling of the four forms sky, skies, ski,
+skis, which is easily corrected by treating three of these words as
+special cases.
+
+
+
+b)
+Similarly there is a problem with the ing form of three letter verbs ending ie. There
+are only three such verbs: die, lie and tie, so a special case is made for
+dying, lying and tying.
+
+
+
+c)
+One has to be a little careful of certain ing forms:
+inning, outing, canning, which one does not wish
+to be stemmed to
+in, out, can.
+
+
+
+d)
+The removal of suffix ly, which is not in the Porter stemmer, has a number of exceptions.
+Certain short-word exceptions are idly, gently, ugly, early, only, singly.
+Rarer words (bristly, burly, curly, surly ...) are not included.
+
+
+
+e)
+The remaining words were included following complaints from users of the Porter algorithm.
+news is not the plural of new (noticed when IR systems were being set up for
+Reuters). Howe is a surname, and needs to be separated from how (noticed when
+doing a search for ‘Sir Geoffrey Howe’ in a demonstration at the House of Commons).
+succeed etc are not past participles, so the ed should not be removed (pointed out
+to me in an email from India). herring should not stem to her (another email from
+Russia).
+
+
+
+f)
+Finally, a few non-plural words ending s have been added.
+
+
+
+Incidentally, this illustrates how much feedback to expect from the real users of a stemming
+algorithm: seven or eight words in twenty years!
+
+
+
+The definition of the English stemmer above is therefore supplemented by the following:
+
+
+
Exceptional forms in the English stemmer
+
+
+
+    If the word begins gener, commun or arsen, set R1 to be the remainder of the
+ word.
+
+
+
+ Stem certain special words as follows,
+
+
+
+
skis
→
ski
+
skies
→
sky
+
+
dying lying tying
+
→
+
die lie tie
+
+
+
idly gently ugly early only singly
+
→
+
idl gentl ugli earli onli singl
+
+
+
+
+ If one of the following is found, leave it invariant,
+
+
+
+
sky news howe
+
atlas
cosmos
bias
andes
+
+
+
+ Following step 1a, leave the following invariant,
+
+(Revised slightly, December 2001)
+(Further revised, September 2002)
+
+
+
+I have made more than one attempt to improve the structure of the Porter
+algorithm by making it follow the pattern of ending removal of the Romance
+language stemmers. It is not hard to see why one should want to do this:
+step 1b of the Porter stemmer removes ed and ing, which are
+i-suffixes (*) attached to verbs. If these suffixes are removed, there
+should be no need to remove d-suffixes which are not verbal, although
+it will try to do so. This seems to be a deficiency in the Porter stemmer,
+not shared by the Romance stemmers. Again, the divisions between steps
+2, 3 and 4 seem rather arbitrary, and are not found in the Romance stemmers.
+
+
+
+Nevertheless, these attempts at improvement have been abandoned. They seem
+to lead to a more complicated algorithm with no very obvious improvements.
+A reason for not taking note of the outcome of step 1b may be that
+English endings do not determine word categories quite as strongly as
+endings in the Romance languages. For example, condition and
+position in French have to be nouns, but in English they can be verbs
+as well as nouns,
+
+ We are all conditioned by advertising
+ They are positioning themselves differently today
+
+A possible reason for having separate steps 2, 3 and 4 is that
+d-suffix combinations in English are quite complex, a point which has
+been made
+elsewhere.
+
+
+
+But it is hardly surprising that after twenty years of use of the Porter
+stemmer, certain improvements did suggest themselves, and a new algorithm
+for English is therefore offered here. (It could be called the ‘Porter2’
+stemmer to distinguish it from the Porter stemmer, from which it derives.)
+The changes are not so very extensive: (1) terminating y is changed to
+i rather less often, (2) suffix us does not lose its s, (3) a
+few additional suffixes are included for removal, including (4) suffix
+ly. In addition, a small list of exceptional forms is included. In
+December 2001 there were two further adjustments: (5) Steps 5a and 5b
+of the old Porter stemmer were combined into a single step. This means
+that undoubling final ll is not done with removal of final e. (6)
+In Step 3 ative is removed only when in region R2.
+(7)
+In July
+2005 a small adjustment was made (including a new step 0) to handle
+apostrophe.
+
+
+
+To begin with, here is the basic algorithm without reference to the
+exceptional forms. An exact comparison with the Porter algorithm needs to
+be done quite carefully if done at all. Here we indicate by * points
+of departure, and by + additional features. In the sample vocabulary,
+Porter and Porter2 stem slightly under 5% of words to different forms.
+
+
+
Definition of the English stemmer
+
+
+Define a vowel as one of
+
+ a e i o u y
+
+Define a double as one of
+
+ bb dd ff gg mm nn pp rr tt
+
+Define a valid li-ending as one of
+
+ c d e g h k m n r t
+
+
+R1 is the region after the first non-vowel following a vowel, or the end of
+the word if there is no such non-vowel. (This definition may be modified for certain exceptional
+words — see below.)
+
+
+
+R2 is the region after the first non-vowel following a vowel in R1, or the
+end of the word if there is no such non-vowel.
+(See note on R1 and R2.)
+
+
+
+Define a short syllable in a word as either (a) a vowel followed by a
+non-vowel other than w, x or Y and preceded by a non-vowel, or
+*
+(b) a vowel at the beginning of the word followed by a non-vowel.
+
+
+
+So rap,
+trap, entrap end with a short syllable, and ow, on, at are
+classed as short syllables. But uproot, bestow, disturb do not end with a
+short syllable.
+
+
+
+A word is called short if it ends in a short syllable, and if R1 is null.
+
+
+
+So bed, shed and shred are short words, bead, embed, beds are
+not short words.
+
+
+
+An apostrophe (') may be regarded as a letter.
+(See note on apostrophes in English.)
+
+
+
+If the word has two letters or less, leave it as it is.
+
+
+
+Otherwise, do each of the following operations,
+
+
+
+Remove initial ', if present. + Then,
+
+
+
+Set initial y, or y after a vowel, to Y, and then establish the regions
+R1 and R2.
+(See note on vowel marking.)
+
+
+
+Step 0: +
+
+
+ Search for the longest among the suffixes,
+
+
+
'
+
's
+
's'
+
and remove if found.
+
+
+Step 1a:
+
+
+ Search for the longest among the following suffixes, and perform the
+ action indicated.
+
+
+
sses
+
replace by ss
+
ied+ies*
+
replace by i if preceded by more than one letter, otherwise by ie
+ (so ties → tie, cries → cri)
+
s
+
delete if the preceding word part contains a vowel not immediately before the
+s (so gas and this retain the s, gaps and kiwis lose it)
+
us+ss
+
do nothing
+
+
+
+Step 1b:
+
+
+ Search for the longest among the following suffixes, and perform the
+ action indicated.
+
+
+
eed eedly+
+
replace by ee if in R1
+
ed edly+ing ingly+
+
delete if the preceding word part contains a vowel, and after the deletion:
+
if the word ends at, bl or iz add e (so luxuriat → luxuriate), or
+
if the word ends with a double
+ remove the last letter (so hopp → hop), or
+
if the word is short, add e (so hop → hope)
+
+
+
+Step 1c: *
+
+ replace suffix y or Y by i if preceded by a non-vowel which is not the
+ first letter of the word (so cry → cri, by → by, say → say)
+
+
+Step 2:
+
+
+ Search for the longest among the following suffixes, and, if
+ found and in R1, perform the action indicated.
+
+
+
tional: replace by tion
+
enci: replace by ence
+
anci: replace by ance
+
abli: replace by able
+
entli: replace by ent
+
izer ization: replace by ize
+
ational ation ator: replace by ate
+
alism aliti alli: replace by al
+
fulness: replace by ful
+
ousli ousness: replace by ous
+
iveness iviti: replace by ive
+
biliti bli+: replace by ble
+
ogi+: replace by og if preceded by l
+
fulli+: replace by ful
+
lessli+: replace by less
+
li+: delete if preceded by a valid li-ending
+
+
+
+Step 3:
+
+
+ Search for the longest among the following suffixes, and, if
+ found and in R1, perform the action indicated.
+
+
+
tional+: replace by tion
+
ational+: replace by ate
+
alize: replace by al
+
icate iciti ical: replace by ic
+
ful ness: delete
+
ative*: delete if in R2
+
+
+
+Step 4:
+
+
+ Search for the longest among the following suffixes, and, if
+ found and in R2, perform the action indicated.
+
+
+
al ance ence er ic able ible ant ement
+ ment ent ism ate iti ous ive ize
+
delete
+
ion
+
delete if preceded by s or t
+
+
+
+Step 5: *
+
+
+ Search for the following suffixes, and, if
+ found, perform the action indicated.
+
+
+
e
+
delete if in R2, or in R1 and not preceded by a short
+ syllable
+
l
+
delete if in R2 and preceded by l
+
+
+
+Finally, turn any remaining Y letters in the word back into lower case.
+
+
+
Exceptional forms in general
+
+
+It is quite easy to expand a Snowball script so that certain exceptional
+word forms get special treatment. The standard case is that certain words
+W1, W2 ..., instead of passing through the stemming process, are
+mapped to the forms X1, X2 ... respectively. If the script does
+the stemming by means of the call
+
+[% highlight("
+ define stem as C
+") %]
+
+where C is a command, the exceptional cases can be dealt with by extending this to
+
+[% highlight("
+ define stem as ( exception or C )
+") %]
+
+Here exception is a routine in which atlimit causes the whole string to be tested for equality with one of
+the Wi, and if a match is found, the string is replaced with
+Xi.
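The effect of `define stem as ( exception or C )` can be sketched in Python; the table entries here are hypothetical, and `normal_stemmer` stands for the ordinary stemming command C.

```python
# Hypothetical exception table mapping whole words Wi to stems Xi.
EXCEPTIONS = {"skis": "ski", "skies": "sky", "dying": "die"}

def stem(word: str, normal_stemmer) -> str:
    """'exception or C': try the whole-word table first, then fall back."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]    # the string is replaced with Xi
    return normal_stemmer(word)    # the ordinary stemming command C
```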
+
+
+
+More precisely we might have a group of words W11, W12 ...
+that need to be mapped to X1, another group W21, W22
+... that need to be mapped to X2, and so on, and a list of words
+V1, V2 ... Vk that are to remain invariant. The
+exception routine may then be written as follows:
+
+
+(More will be said about the words that appear here shortly.)
+
+
+
+Here we see words being treated exceptionally before stemming is done, but equally we could
+treat stems exceptionally after stemming is done, and so, if we wish, map absorpt to
+absorb, reduct to reduc etc., as in the
+Lovins stemmer.
+But more generally, throughout the algorithm, each significant step may have recognised
+exceptions, and a suitably placed among will take care of them. For example, a point made
+at least twice in the literature is that words beginning gener are overstemmed by the
+Porter stemmer:
+
+
+
+
generate
+ generates
+ generated
+ generating
+ general
+ generally
+ generic
+ generically
+ generous
+ generously
→
gener
+
+
+
+
+To fix this over-stemming, we make an exception to the usual setting of p1,
+the left point of R1, and therefore replace the code which sets p1 with
+
+
+[% highlight("
+ among (
+ 'gener'
+ // ... and other stems may be included here ...
+ ) or (gopast v gopast non-v)
+ setmark p1
+") %]
+
+
+after which the words beginning gener stem as follows:
+
+
+
+
generate
+ generates
+ generated
+ generating
+
→
generat
+
+
general
+ generally
+
→
general
+
generic
+ generically
+
→
generic
+
generous
+ generously
+
→
generous
+
+
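The exceptional setting of p1 can be sketched in Python. This is a sketch of the marking logic only: if a listed prefix matches, R1 is simply the remainder of the word; otherwise the usual gopast rule applies.

```python
VOWELS = set("aeiouy")
PREFIX_EXCEPTIONS = ("gener", "commun", "arsen")

def gopast(word: str, i: int, pred) -> int:
    """Advance to just past the first character satisfying pred."""
    while i < len(word) and not pred(word[i]):
        i += 1
    return min(i + 1, len(word))

def mark_p1(word: str) -> int:
    """Start of R1, honouring the prefix exceptions."""
    for p in PREFIX_EXCEPTIONS:
        if word.startswith(p):
            return len(p)        # R1 is the remainder of the word
    i = gopast(word, 0, lambda ch: ch in VOWELS)         # gopast v
    return gopast(word, i, lambda ch: ch not in VOWELS)  # gopast non-v
```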
+
+Another example is given by the exception2 routine, which is similar to exception1,
+but placed after the call of Step_1a, which may have removed terminal s.
+
+Snowball makes it easy therefore to add in lists of exceptions. But deciding what the lists of
+exceptions should be is far from easy. Essentially there are two lines of attack, the
+systematic and the piecemeal. One might systematically treat as exceptions the stem changes of
+irregular verbs, for example. The piecemeal approach is to add in exceptions as people notice
+them — like gener above. The problem with the systematic approach is that it should be
+done by investigating the entire language vocabulary, and that is more than most people are
+prepared to do. The problem with the piecemeal approach is that it is arbitrary, and usually
+yields little.
+
+
+
+The exception lists in the English stemmer are meant to be illustrative (‘this is how it is done if you
+want to do it’), and were derived piecemeal.
+
+
+
+a)
+The new stemmer improves on the Porter stemmer in handling short words ending e and
+y. There is however a mishandling of the four forms sky, skies, ski,
+skis, which is easily corrected by treating three of these words as
+special cases.
+
+
+
+b)
+Similarly there is a problem with the ing form of three letter verbs ending ie. There
+are only three such verbs: die, lie and tie, so a special case is made for
+dying, lying and tying.
+
+
+
+c)
+One has to be a little careful of certain ing forms:
+inning, outing, canning, which one does not wish
+to be stemmed to
+in, out, can.
+
+
+
+d)
+The removal of suffix ly, which is not in the Porter stemmer, has a number of exceptions.
+Certain short-word exceptions are idly, gently, ugly, early, only, singly.
+Rarer words (bristly, burly, curly, surly ...) are not included.
+
+
+
+e)
+The remaining words were included following complaints from users of the Porter algorithm.
+news is not the plural of new (noticed when IR systems were being set up for
+Reuters). Howe is a surname, and needs to be separated from how (noticed when
+doing a search for ‘Sir Geoffrey Howe’ in a demonstration at the House of Commons).
+succeed etc are not past participles, so the ed should not be removed (pointed out
+to me in an email from India). herring should not stem to her (another email from
+Russia).
+
+
+
+f)
+Finally, a few non-plural words ending s have been added.
+
+
+
+Incidentally, this illustrates how much feedback to expect from the real users of a stemming
+algorithm: seven or eight words in twenty years!
+
+
+
+The definition of the English stemmer above is therefore supplemented by the following:
+
+
+
Exceptional forms in the English stemmer
+
+
+
+ If the word begins gener, commun or arsen, set R1 to be the remainder of the
+ word.
+
+
+
+ Stem certain special words as follows,
+
+
+
+
skis
→
ski
+
skies
→
sky
+
+
dying lying tying
+
→
+
die lie tie
+
+
+
idly gently ugly early only singly
+
→
+
idl gentl ugli earli onli singl
+
+
+
+
+ If one of the following is found, leave it invariant,
+
+
+
+
sky news howe
+
atlas
cosmos
bias
andes
+
+
+
+ Following step 1a, leave the following invariant,
+
+
+
+
inning
outing
canning
herring
earring
+
proceed
exceed
succeed
+
+
+
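Taken together, these exceptional forms can be sketched as Python tables (the real stemmer expresses them as among lists in Snowball):

```python
SPECIAL = {
    "skis": "ski", "skies": "sky",
    "dying": "die", "lying": "lie", "tying": "tie",
    "idly": "idl", "gently": "gentl", "ugly": "ugli",
    "early": "earli", "only": "onli", "singly": "singl",
}
INVARIANT = {"sky", "news", "howe", "atlas", "cosmos", "bias", "andes"}
# Words left invariant only after step 1a has run.
INVARIANT_AFTER_1A = {"inning", "outing", "canning", "herring", "earring",
                      "proceed", "exceed", "succeed"}

def exception1(word):
    """Return the stem for an exceptional word, or None to stem normally."""
    if word in SPECIAL:
        return SPECIAL[word]
    if word in INVARIANT:
        return word
    return None
```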
+
The full algorithm in Snowball
+
+[% highlight_file('english') %]
+
+[% footer %]
diff --git a/algorithms/english/stop.txt b/algorithms/english/stop.txt
new file mode 100644
index 0000000..aee35c5
--- /dev/null
+++ b/algorithms/english/stop.txt
@@ -0,0 +1,312 @@
+
+ | An English stop word list. Comments begin with vertical bar. Each stop
+ | word is at the start of a line.
+
+ | Many of the forms below are quite rare (e.g. "yourselves") but included for
+ | completeness.
+
+ | PRONOUNS FORMS
+ | 1st person sing
+
+i | subject, always in upper case of course
+
+me | object
+my | possessive adjective
+ | the possessive pronoun `mine' is best suppressed, because of the
+ | sense of coal-mine etc.
+myself | reflexive
+ | 1st person plural
+we | subject
+
+| us | object
+ | care is required here because US = United States. It is usually
+ | safe to remove it if it is in lower case.
+our | possessive adjective
+ours | possessive pronoun
+ourselves | reflexive
+ | second person (archaic `thou' forms not included)
+you | subject and object
+your | possessive adjective
+yours | possessive pronoun
+yourself | reflexive (singular)
+yourselves | reflexive (plural)
+ | third person singular
+he | subject
+him | object
+his | possessive adjective and pronoun
+himself | reflexive
+
+she | subject
+her | object and possessive adjective
+hers | possessive pronoun
+herself | reflexive
+
+it | subject and object
+its | possessive adjective
+itself | reflexive
+ | third person plural
+they | subject
+them | object
+their | possessive adjective
+theirs | possessive pronoun
+themselves | reflexive
+ | other forms (demonstratives, interrogatives)
+what
+which
+who
+whom
+this
+that
+these
+those
+
+ | VERB FORMS (using F.R. Palmer's nomenclature)
+ | BE
+am | 1st person, present
+is | -s form (3rd person, present)
+are | present
+was | 1st person, past
+were | past
+be | infinitive
+been | past participle
+being | -ing form
+ | HAVE
+have | simple
+has | -s form
+had | past
+having | -ing form
+ | DO
+do | simple
+does | -s form
+did | past
+doing | -ing form
+
+ | The forms below are, I believe, best omitted, because of the significant
+ | homonym forms:
+
+ | He made a WILL
+ | old tin CAN
+ | merry month of MAY
+ | a smell of MUST
+ | fight the good fight with all thy MIGHT
+
+ | would, could, should, ought might however be included
+
+ | | AUXILIARIES
+ | | WILL
+ |will
+
+would
+
+ | | SHALL
+ |shall
+
+should
+
+ | | CAN
+ |can
+
+could
+
+ | | MAY
+ |may
+ |might
+ | | MUST
+ |must
+ | | OUGHT
+
+ought
+
+ | COMPOUND FORMS, increasingly encountered nowadays in 'formal' writing
+ | pronoun + verb
+
+i'm
+you're
+he's
+she's
+it's
+we're
+they're
+i've
+you've
+we've
+they've
+i'd
+you'd
+he'd
+she'd
+we'd
+they'd
+i'll
+you'll
+he'll
+she'll
+we'll
+they'll
+
+ | verb + negation
+
+isn't
+aren't
+wasn't
+weren't
+hasn't
+haven't
+hadn't
+doesn't
+don't
+didn't
+
+ | auxiliary + negation
+
+won't
+wouldn't
+shan't
+shouldn't
+can't
+cannot
+couldn't
+mustn't
+
+ | miscellaneous forms
+
+let's
+that's
+who's
+what's
+here's
+there's
+when's
+where's
+why's
+how's
+
+ | rarer forms
+
+ | daren't needn't
+
+ | doubtful forms
+
+ | oughtn't mightn't
+
+ | ARTICLES
+a
+an
+the
+
+ | THE REST (Overlap among prepositions, conjunctions, adverbs etc is so
+ | high, that classification is pointless.)
+and
+but
+if
+or
+because
+as
+until
+while
+
+of
+at
+by
+for
+with
+about
+against
+between
+into
+through
+during
+before
+after
+above
+below
+to
+from
+up
+down
+in
+out
+on
+off
+over
+under
+
+again
+further
+then
+once
+
+here
+there
+when
+where
+why
+how
+
+all
+any
+both
+each
+few
+more
+most
+other
+some
+such
+
+no
+nor
+not
+only
+own
+same
+so
+than
+too
+very
+
+ | Just for the record, the following words are among the commonest in English
+
+ | one
+ | every
+ | least
+ | less
+ | many
+ | now
+ | ever
+ | never
+ | say
+ | says
+ | said
+ | also
+ | get
+ | go
+ | goes
+ | just
+ | made
+ | make
+ | put
+ | see
+ | seen
+ | whether
+ | like
+ | well
+ | back
+ | even
+ | still
+ | way
+ | take
+ | since
+ | another
+ | however
+ | two
+ | three
+ | four
+ | five
+ | first
+ | second
+ | new
+ | old
+ | high
+ | long
+
diff --git a/algorithms/estonian/stemmer.html b/algorithms/estonian/stemmer.html
new file mode 100644
index 0000000..62f5154
--- /dev/null
+++ b/algorithms/estonian/stemmer.html
@@ -0,0 +1,657 @@
+
+
+
+
+
+
+
+
+
+ Estonian stemming algorithm - Snowball
+
+
+
+
+
+
+
+
+
+
+
+This algorithm is written in collaboration with Estonian text analytics enterprise Texta.
+
+
+
+Letters in Estonian include the following accented forms,
+
+
+
+ ä ö õ ü š ž
+
+
+
+The following letters are vowels (V1):
+
+
+
+ a e i o u õ ä ö ü
+
+
+
+RV is defined as one of the following:
+
+
+
+ a e i u o
+
+
+
+KI is defined as one of the following (letters possible before -ki emphasis):
+
+
+
+ k p t g b d s h f š z ž
+
+
+
+GI is defined as one of the following (letters possible before -gi emphasis):
+
+
+
+ c j l m n q r v w x a e i o u õ ä ö ü
+
+
+
+R1 in this algorithm is defined as a region after the first consonant preceded by a vowel (laul[nud], mõt[teid], kar[tuleid], saab[as]). If there’s no such region, then R1 is empty (laul[Ø], saun[Ø]). Limitations in steps (such as "if preceded by RV") are not restricted to the R1 region.
+
+
+
+LONGV is defined as one of the following:
+
+
+
+ aa ee ii oo uu ää öö üü õõ
+
+
+
+Do step 0. If nothing was changed in step 0, continue with the steps, otherwise stop. Do step 1 and step 2. If nothing was changed in step 2, do steps 3, 4, 5, 6, 7, 8 and 9. If something was changed in step 2, do step 9.
+
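This control flow can be sketched as follows, modelling each step as a function that returns the new word together with a flag saying whether it changed anything:

```python
def stem_estonian(word, steps):
    """steps[i] implements step i and returns (new_word, changed)."""
    word, changed = steps[0](word)       # step 0: verb exceptions
    if changed:
        return word                      # exceptional verb: stop here
    word, _ = steps[1](word)             # step 1: emphasis
    word, changed = steps[2](word)       # step 2: verb endings
    if not changed:
        for i in range(3, 9):            # steps 3..8 only if step 2 did nothing
            word, _ = steps[i](word)
    word, _ = steps[9](word)             # step 9 always runs
    return word
```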
+
+
+Step 0: verb_exceptions
+
+
+
+ Search for some frequent irregular short verbs which wouldn’t have been found otherwise and give them a chosen stem.
+
+
+
joon jood joob joote joome joovad
+
replace by joo
+
jõin jõid jõi jõime jõite
+
replace by joo
+
joomata juuakse joodakse juua jooma
+
replace by joo
+
saan saad saab saate saame saavad
+
replace by saa
+
saaksin saaksid saaks saaksite saaksime
+
replace by saa
+
sain said sai saite saime
+
replace by saa
+
saamata saadakse saadi saama saada
+
replace by saa
+
viin viid viib viite viime viivad
+
replace by viima
+
viiksin viiksid viiks viiksite viiksime
+
replace by viima
+
viisin viisite viisime
+
replace by viima
+
viimata viiakse viidi viima viia
+
replace by viima
+
keen keeb keed kees keeme keete keevad
+
replace by keesi
+
keeksin keeks keeksid keeksime keeksite
+
replace by keesi
+
keemata keema keeta keedakse
+
replace by keesi
+
löön lööd lööb lööme lööte löövad
+
replace by löö
+
lööksin lööksid lööks lööksime lööksite
+
replace by löö
+
löömata lüüakse löödakse löödi lööma lüüa
+
replace by löö
+
lõin lõid lõi lõime lõite
+
replace by lõi
+
loon lood loob loome loote loovad
+
replace by loo
+
looksin looksid looks looksime looksite
+
replace by loo
+
loomata luuakse loodi luua looma
+
replace by loo
+
käin käib käid käis käime käite käivad
+
replace by käisi
+
käiksin käiks käiksid käiksime käiksite
+
replace by käisi
+
käimata käiakse käidi käia käima
+
replace by käisi
+
söön sööb sööd sööme sööte söövad
+
replace by söö
+
sööksin sööks sööksid sööksime sööksite
+
replace by söö
+
sõin sõi sõid sõime sõite
+
replace by söö
+
söömata süüakse söödakse söödi sööma süüa
+
replace by söö
+
toon tood toob toote toome toovad
+
replace by too
+
tooksin tooksid tooks tooksite tooksime
+
replace by too
+
tõin tõid tõi tõime tõite
+
replace by too
+
toomata tuuakse toodi tooma tuua
+
replace by too
+
võin võid võib võime võis võite võivad
+
replace by võisi
+
võiksin võiksid võiks võiksime võiksite
+
replace by võisi
+
võimata võidakse võidi võida võima
+
replace by võisi
+
jään jääd jääb jääme jääte jäävad
+
replace by jääma
+
jääksin jääksid jääks jääksime jääksite
+
replace by jääma
+
jäime jäite jäin jäid jäi
+
replace by jääma
+
jäämata jäädakse jääda jääma jäädi
+
replace by jääma
+
müün müüd müüb müüs müüme müüte müüvad
+
replace by müüsi
+
müüksin müüksid müüks müüksime müüksite
+
replace by müüsi
+
müümata müüakse müüdi müüa müüma
+
replace by müüsi
+
loeb loen loed loeme loete loevad
+
replace by luge
+
loeks loeksin loeksid loeksime loeksite
+
replace by luge
+
põen põeb põed põeme põete põevad
+
replace by põde
+
põeksin põeks põeksid põeksime põeksite
+
replace by põde
+
laon laob laod laome laote laovad
+
replace by ladu
+
laoksin laoks laoksid laoksime laoksite
+
replace by ladu
+
teeksin teeks teeksid teeksime teeksite
+
replace by tegi
+
teen teeb teed teeme teete teevad
+
replace by tegi
+
tegemata tehakse tehti tegema teha
+
replace by tegi
+
näen näeb näed näeme näete näevad
+
replace by nägi
+
näeksin näeks näeksid näeksime näeksite
+
replace by nägi
+
nägemata nähakse nähti näha nägema
+
replace by nägi
+
+
+
+
+
+Step 1: emphasis
+
+
+
+ Search for the longest among the following suffixes in R1, and perform the action indicated
+
+
+
Test if there are at least 4 characters before the R1 region. If so, continue with this step
+
gi
+
if preceded by a character from GI which is not the second character of a long vowel as defined by LONGV, delete
+
ki
+
if preceded by KI, delete
+
+
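Step 1 can be sketched in Python; `r1_start` is assumed to be precomputed according to the R1 definition above, and the Estonian example words used in the tests are merely illustrative.

```python
KI = set("kptgbdshfzšž")
GI = set("cjlmnqrvwxaeiouõäöü")
LONGV = {v * 2 for v in "aeiouäöüõ"}

def step1_emphasis(word: str, r1_start: int) -> str:
    if r1_start < 4:          # need at least 4 characters before R1
        return word
    if word.endswith("gi") and len(word) - 2 >= r1_start:
        # preceding character from GI, not the second half of a long vowel
        if word[-3] in GI and word[-4:-2] not in LONGV:
            return word[:-2]
    if word.endswith("ki") and len(word) - 2 >= r1_start and word[-3] in KI:
        return word[:-2]
    return word
```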
+
+
+Step 2: verb
+
+
+ Search for the longest among the following suffixes in R1, and perform the
+ action indicated
+
+
+
+
nuksin nuksime nuksid nuksite
+
delete
+
ksin ksid ksime ksite
+
delete
+
mata
+
delete
+
takse dakse
+
delete
+
taks daks
+
delete
+
akse
+
replace with a
+
sime
+
delete
+
site
+
delete
+
sin
+
delete
+
me
+
if preceded by V1, delete
+
da
+
if preceded by V1, delete
+
n
+
if preceded by V1, delete
+
b
+
if preceded by V1, delete
+
+
+
+
+Step 3: special_noun_endings
+
+
+
+ Search for the longest among the following suffixes in R1, and perform the
+ action indicated
+
+
+
lasse
+
replace by lase
+
last
+
replace by lase
+
lane
+
replace by lase
+
lasi
+
replace by lase
+
misse
+
replace by mise
+
mist
+
replace by mise
+
mine
+
replace by mise
+
misi
+
replace by mise
+
lisse
+
replace by lise
+
list
+
replace by lise
+
line
+
replace by lise
+
lisi
+
replace by lise
+
+
+
+
+Step 4: case_ending
+
+
+
+ Search for the longest among the following suffixes in R1, and perform the
+ action indicated
+
+
+
sse if preceded by RV or LONGV
+
st if preceded by RV or LONGV
+
le if preceded by RV or LONGV
+
lt if preceded by RV or LONGV
+
ga if preceded by RV or LONGV
+
ks if preceded by RV or LONGV
+
ta if preceded by RV or LONGV
+
t if preceded by at least 4 characters
+
s if preceded by RV or LONGV
+
l if preceded by RV or LONGV
+
delete
+
+
+
+
+Step 5: plural_three_first_cases
+
+
+
+ Search for the longest among the following suffixes in R1, and perform the
+ action indicated
+
+
+
ikkude
+
replace by iku
+
ikke
+
replace by iku
+
ike
+
replace by iku
+
sid
+
if it is not preceded by LONGV, delete
+
te
+
if it doesn't have at least 4 characters before it, replace by t.
+
Otherwise:
+
a) if it is preceded by mis, replace with e,
+
b) if it is preceded by las, replace with e,
+
c) if it is preceded by lis, replace with e,
+
if it wasn't replaced with e in steps a)-c) and it isn't preceded by t, delete
+
de if preceded by RV or LONGV
+
delete
+
d if preceded by RV or LONGV
+
delete
+
+
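The te rule is the most intricate part of this step; it can be sketched in isolation as follows (the R1 restriction is omitted, and the Estonian test words are merely illustrative):

```python
def step5_te(word: str) -> str:
    """The 'te' rule of step 5 only (sketch; the R1 check is omitted)."""
    if not word.endswith("te"):
        return word
    stem = word[:-2]
    if len(stem) < 4:
        return stem + "t"                 # fewer than 4 chars before te
    if stem.endswith(("mis", "las", "lis")):
        return stem + "e"                 # mis/las/lis + te -> mise etc.
    if not stem.endswith("t"):
        return stem                       # plain plural te: delete
    return word                           # preceded by t: leave alone
```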
+
+
+Step 6: degrees
+
+
+
+ Search for the longest among the following suffixes in R1, and perform the
+ action indicated
+
+
+
mai if preceded by RV
+
ma
+
m if preceded by RV
+
delete
+
+
+
+
+Step 7: i_plural
+
+
+
+ Search for the longest among the following suffixes in R1, and perform the
+ action indicated
+
+
+
i if preceded by RV
+
delete
+
+
+
+
+Step 8: nu
+
+
+
+ Search for the longest among the following suffixes in R1, and perform the
+ action indicated
+
+
+
nu
+
tu
+
du
+
va
+
delete
+
+
+
+
+Step 9: undouble_kpt
+
+
+
+ Undouble consonant if word ending is kk+V1, tt+V1, pp+V1,
+ provided the vowel is in R1.
+
+Finnish is not an Indo-European language, but belongs to the Finno-Ugric
+group, which again belongs to the Uralic group (*). Distinctions between
+a-, i- and d-suffixes can be made in Finnish, but they are much
+less sharply separated than in an Indo-European language. The system of
+endings is extremely elaborate, but strictly defined, and applies equally to
+all nominals, that is, to nouns, adjectives and pronouns. Verb endings have a
+close similarity to nominal endings, which again makes Finnish very different
+from any Indo-European language.
+
+
+
+More problematical than the endings themselves is the change that can be
+effected in a stem as a result of taking a particular ending. A stem typically
+has two forms, strong and weak, where one class of ending follows the
+strong form and the complementary class the weak. Normalising strong and weak
+forms after ending removal is not generally possible, although the common case
+where strong and weak forms only differ in the single or double form of a
+final consonant can be dealt with.
+
+
+
+Finnish includes the following accented forms,
+
+
+
+ ä ö
+
+
+
+The following letters are vowels:
+
+
+
+ a e i o u y ä ö
+
+
+
+R1 and
+R2 are then defined in the usual way
+(see the note on R1 and R2).
+
+
+
+Do each of steps 1, 2, 3, 4, 5 and 6.
+
+
+
+Step 1: particles etc
+
+
+
+ Search for the longest among the following suffixes in R1, and perform the
+ action indicated
+
+
+
(a) kin kaan kään ko kö han hän pa pä
+
delete if preceded by n, t or a vowel
+
(b) sti
+
delete if in R2
+
+
+
+
+(Of course, the n, t or vowel of 1(a) need not be in R1: only
+the suffix removed must be in R1. And similarly below.)
+
+
+
+Step 2: possessives
+
+
+
+ Search for the longest among the following suffixes in R1, and perform the
+ action indicated
+
+
+
si
+
delete if not preceded by k
+
ni
+
delete
+
if preceded by kse, replace with ksi
+
nsa nsä mme nne
+
delete
+
an
+
delete if preceded by one of ta ssa sta lla lta na
+
än
+
delete if preceded by one of tä ssä stä llä ltä nä
+
en
+
delete if preceded by one of lle ine
+
+
+
+
+The remaining steps require a few definitions.
+
+
+
+Define a v (vowel) as one of a e i o u y ä ö.
+
+Define a V (restricted vowel) as one of a e i o u ä ö.
+
+So Vi means a V followed by letter i.
+
+Define LV (long vowel) as one of aa ee ii oo uu ää öö.
+
+Define a c (consonant) as a character from ASCII a-z which isn't in
+v (originally this was "a character other than a v" but since
+2018-04-11 we've changed this definition to avoid the stemmer altering
+sequences of digits).
+
+So cv means a c followed by a v.
+
+
+
+Step 3: cases
+
+
+ Search for the longest among the following suffixes in R1, and perform the
+ action indicated
+
+
+
+
hXn preceded by X, where X is a V other than u (a/han, e/hen etc)
+
siin den tten preceded by Vi
+
seen preceded by LV
+
a ä preceded by cv
+
tta ttä preceded by e
+
ta tä ssa ssä sta stä lla llä lta ltä lle na nä ksi ine
+
delete
+
n
+
delete, and if preceded by LV or ie, delete the last vowel
+
+
+
+
+So aarteisiin → aartei, the longest matching suffix being siin,
+preceded as it is by Vi. But adressiin → adressi. The longest
+matching suffix is not siin, because there is no preceding Vi, but n,
+and then the last vowel of the preceding LV is removed.
+
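The interplay in this example can be sketched as follows (only the siin and n rules of step 3, with the R1 restriction omitted):

```python
V = set("aeiouäö")                # restricted vowel
LV = {v * 2 for v in "aeiouäö"}   # long vowel

def step3_sketch(word: str) -> str:
    if word.endswith("siin"):
        pre = word[-6:-4]
        if len(pre) == 2 and pre[0] in V and pre[1] == "i":
            return word[:-4]      # siin preceded by Vi
    if word.endswith("n"):
        word = word[:-1]
        if word[-2:] in LV or word.endswith("ie"):
            word = word[:-1]      # also drop the last vowel of LV or ie
        return word
    return word
```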
+
+
+Step 4: other endings
+
+
+
+ Search for the longest among the following suffixes in R2, and perform the
+ action indicated
+
+
+
mpi mpa mpä mmi mma mmä
+
delete if not preceded by po
+
impi impa impä immi imma immä eja ejä
+
delete
+
+
+
+
+Step 5: plurals
+
+
+
+If an ending was removed in step 3, delete a final i or j if in R1;
+otherwise, if an ending was not removed in step 3, delete a final t in
+R1 if it follows a vowel, and, if a t is removed, delete a final mma or
+imma in R2, unless the mma is preceded by po.
+
+
+
+Step 6: tidying up
+
+
+
+Do in turn steps (a), (b), (c), (d), restricting all tests to the region
+R1.
+
+
+
+a) If R1 ends LV delete the last letter
+b) If R1 ends cX, c a consonant and X one of a ä e i, delete the last
+letter
+c) If R1 ends oj or uj delete the last letter
+d) If R1 ends jo delete the last letter
+
+
+
+Do step (e), which is not restricted to R1.
+
+
+
+e) If the word ends with a double consonant followed by zero or more vowels,
+remove the last consonant (so eläkk → eläk, aatonaatto →
+aatonaato)
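Step (e) can be sketched as:

```python
VOWELS = set("aeiouyäö")

def step6e(word: str) -> str:
    """Remove one consonant of a final double consonant that is
    followed only by vowels (possibly none)."""
    i = len(word)
    while i > 0 and word[i - 1] in VOWELS:
        i -= 1                    # skip trailing vowels
    if i >= 2 and word[i - 1] == word[i - 2] and word[i - 1] not in VOWELS:
        return word[: i - 1] + word[i:]
    return word
```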
+
+Finnish is not an Indo-European language, but belongs to the Finno-Ugric
+group, which again belongs to the Uralic group (*). Distinctions between
+a-, i- and d-suffixes can be made in Finnish, but they are much
+less sharply separated than in an Indo-European language. The system of
+endings is extremely elaborate, but strictly defined, and applies equally to
+all nominals, that is, to nouns, adjectives and pronouns. Verb endings have a
+close similarity to nominal endings, which again makes Finnish very different
+from any Indo-European language.
+
+
+
+More problematical than the endings themselves is the change that can be
+effected in a stem as a result of taking a particular ending. A stem typically
+has two forms, strong and weak, where one class of ending follows the
+strong form and the complementary class the weak. Normalising strong and weak
+forms after ending removal is not generally possible, although the common case
+where strong and weak forms only differ in the single or double form of a
+final consonant can be dealt with.
+
+
+
+Finnish includes the following accented forms,
+
+
+
+ ä ö
+
+
+
+The following letters are vowels:
+
+
+
+ a e i o u y ä ö
+
+
+
+R1 and
+R2 are then defined in the usual way
+(see the note on R1 and R2).
+
+
+
+Do each of steps 1, 2, 3, 4, 5 and 6.
+
+
+
+Step 1: particles etc
+
+
+
+ Search for the longest among the following suffixes in R1, and perform the
+ action indicated
+
+
+
(a) kin kaan kään ko kö han hän pa pä
+
delete if preceded by n, t or a vowel
+
(b) sti
+
delete if in R2
+
+
+
+
+(Of course, the n, t or vowel of 1(a) need not be in R1: only
+the suffix removed must be in R1. And similarly below.
+
+
+
+Step 2: possessives
+
+
+
+ Search for the longest among the following suffixes in R1, and perform the
+ action indicated
+
+
+
si
+
delete if not preceded by k
+
ni
+
delete
+
if preceded by kse, replace with ksi
+
nsa nsä mme nne
+
delete
+
an
+
delete if preceded by one of ta ssa sta lla lta na
+
än
+
delete if preceded by one of tä ssä stä llä ltä nä
+
en
+
delete if preceded by one of lle ine
+
+
+
+
+The remaining steps require a few definitions.
+
+
+
+Define a v (vowel) as one of a e i o u y ä ö.
+
+Define a V (restricted vowel) as one of a e i o u ä ö.
+
+So Vi means a V followed by letter i.
+
+Define LV (long vowel) as one of aa ee ii oo uu ää öö.
+
+Define a c (consonant) as a character from ASCII a-z which isn't in
+v (originally this was "a character other than a v but since
+2018-04-11 we've changed this definition to avoid the stemmer from altering
+sequences of digits).
+
+So cv means a c followed by a v.
+
+
+
+Step 3: cases
+
+
+ Search for the longest among the following suffixes in R1, and perform the
+ action indicated
+
+
+
+
hXn preceded by X, where X is a V other than u (a/han, e/hen etc)
+
siin den tten preceded by Vi
+
seen preceded by LV
+
a ä preceded by cv
+
tta ttä preceded by e
+
ta tä ssa ssä sta stä lla llä lta ltä lle na nä ksi ine
+
delete
+
n
+
delete, and if preceded by LV or ie, delete the last vowel
+
+
+
+
+So aarteisiin → aartei, the longest matching suffix being siin,
+preceded as it is by Vi. But adressiin → adressi. The longest
+matching suffix is not siin, because there is no preceding Vi, but n,
+and then the last vowel of the preceding LV is removed.
+
+
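The interplay of these two cases can be sketched as follows (illustrative only; the region checks and the other step 3 suffixes are omitted for brevity):

```python
V_RESTRICTED = set("aeiouäö")
LONG_VOWELS = {"aa", "ee", "ii", "oo", "uu", "ää", "öö"}

def step3_siin_or_n(word):
    """Sketch of two step-3 cases: siin/den/tten after Vi, else n."""
    for suf in ("siin", "den", "tten"):
        if word.endswith(suf):
            stem = word[: -len(suf)]
            if len(stem) >= 2 and stem[-1] == "i" and stem[-2] in V_RESTRICTED:
                return stem
            break  # no preceding Vi, so this is not the matching suffix
    if word.endswith("n"):
        stem = word[:-1]
        # if the n was preceded by LV or ie, delete the last vowel too
        if stem[-2:] in LONG_VOWELS or stem.endswith("ie"):
            stem = stem[:-1]
        return stem
    return word
```

This reproduces both examples: aarteisiin → aartei, but adressiin → adressi.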
+
+Step 4: other endings
+
+
+
+ Search for the longest among the following suffixes in R2, and perform the
+ action indicated
+
+
+
mpi mpa mpä mmi mma mmä
+
delete if not preceded by po
+
impi impa impä immi imma immä eja ejä
+
delete
+
+
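A sketch of step 4 in Python (the `r2_start` index is assumed to be computed beforehand; the examples below pass 0 purely for illustration):

```python
def step4_comparatives(word, r2_start):
    """Sketch of Finnish step 4: comparative endings, tested in R2."""
    def in_r2(suf):
        return word.endswith(suf) and len(word) - len(suf) >= r2_start

    # longest suffixes first: impi also ends in mpi
    for suf in ("impi", "impa", "impä", "immi", "imma", "immä", "eja", "ejä"):
        if in_r2(suf):
            return word[: -len(suf)]
    for suf in ("mpi", "mpa", "mpä", "mmi", "mma", "mmä"):
        if in_r2(suf):
            stem = word[: -len(suf)]
            return stem if not stem.endswith("po") else word
    return word
```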
+
+
+Step 5: plurals
+
+
+
+If an ending was removed in step 3, delete a final i or j if in R1;
+otherwise, if an ending was not removed in step 3, delete a final t in
+R1 if it follows a vowel, and, if a t is removed, delete a final mma or
+imma in R2, unless the mma is preceded by po.
+
+
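Step 5 depends on whether step 3 removed anything, so a sketch needs that flag passed in (again, names and region parameters here are my own, not the Snowball source):

```python
FIN_VOWELS = set("aeiouyäö")

def step5_plurals(word, r1_start, r2_start, step3_removed):
    """Sketch of Finnish step 5: final i/j (after step 3) or final t."""
    if step3_removed:
        if word[-1:] in ("i", "j") and len(word) - 1 >= r1_start:
            return word[:-1]
        return word
    # no ending removed in step 3: delete a final t in R1 after a vowel
    if word.endswith("t") and len(word) - 1 >= r1_start and word[-2:-1] in FIN_VOWELS:
        word = word[:-1]
        # then a final mma/imma in R2 (mma not after po)
        for suf in ("imma", "mma"):
            if word.endswith(suf) and len(word) - len(suf) >= r2_start:
                if suf == "mma" and word[:-3].endswith("po"):
                    break
                return word[: -len(suf)]
    return word
```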
+
+Step 6: tidying up
+
+
+
+Do in turn steps (a), (b), (c), (d), restricting all tests to the region
+R1.
+
+
+
+a) If R1 ends LV delete the last letter
+b) If R1 ends cX, c a consonant and X one of a ä e i, delete the last
+letter
+c) If R1 ends oj or uj delete the last letter
+d) If R1 ends jo delete the last letter
+
+
+
+Do step (e), which is not restricted to R1.
+
+
+
+e) If the word ends with a double consonant followed by zero or more vowels,
+remove the last consonant (so eläkk → eläk, aatonaatto →
+aatonaato)
+
+
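The tidying-up steps can be sketched as one function (a simplification: step (b) approximates "consonant" as "not a vowel", and the region indexes are assumed precomputed):

```python
FIN_VOWELS = set("aeiouyäö")
LONG_VOWELS = {"aa", "ee", "ii", "oo", "uu", "ää", "öö"}

def step6_tidy(word, r1_start):
    """Sketch of Finnish step 6 (a)-(e): final tidying."""
    r1 = word[r1_start:]
    if r1[-2:] in LONG_VOWELS:                                   # (a)
        word = word[:-1]
    r1 = word[r1_start:]
    if len(r1) >= 2 and r1[-1] in "aäei" and r1[-2] not in FIN_VOWELS:  # (b)
        word = word[:-1]
    r1 = word[r1_start:]
    if r1.endswith(("oj", "uj")):                                # (c)
        word = word[:-1]
    r1 = word[r1_start:]
    if r1.endswith("jo"):                                        # (d)
        word = word[:-1]
    # (e), not restricted to R1: undouble a final double consonant
    # followed by zero or more vowels
    i = len(word) - 1
    while i >= 0 and word[i] in FIN_VOWELS:
        i -= 1
    if i >= 1 and word[i] == word[i - 1] and word[i] not in FIN_VOWELS:
        word = word[:i] + word[i + 1:]
    return word
```

This reproduces the two examples given for step (e).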
+
+
The full algorithm in Snowball
+
+[% highlight_file('finnish') %]
+
+[% footer %]
diff --git a/algorithms/finnish/stop.txt b/algorithms/finnish/stop.txt
new file mode 100644
index 0000000..2be66c0
--- /dev/null
+++ b/algorithms/finnish/stop.txt
@@ -0,0 +1,88 @@
+
+| forms of BE
+
+olla
+olen
+olet
+on
+olemme
+olette
+ovat
+ole | negative form
+
+oli
+olisi
+olisit
+olisin
+olisimme
+olisitte
+olisivat
+olit
+olin
+olimme
+olitte
+olivat
+ollut
+olleet
+
+en | negation
+et
+ei
+emme
+ette
+eivät
+
+|Nom Gen Acc Part Iness Elat Illat Adess Ablat Allat Ess Trans
+minä minun minut minua minussa minusta minuun minulla minulta minulle | I
+sinä sinun sinut sinua sinussa sinusta sinuun sinulla sinulta sinulle | you
+hän hänen hänet häntä hänessä hänestä häneen hänellä häneltä hänelle | he she
+me meidän meidät meitä meissä meistä meihin meillä meiltä meille | we
+te teidän teidät teitä teissä teistä teihin teillä teiltä teille | you
+he heidän heidät heitä heissä heistä heihin heillä heiltä heille | they
+
+tämä tämän tätä tässä tästä tähän tällä tältä tälle tänä täksi | this
+tuo tuon tuota tuossa tuosta tuohon tuolla tuolta tuolle tuona tuoksi | that
+se sen sitä siinä siitä siihen sillä siltä sille sinä siksi | it
+nämä näiden näitä näissä näistä näihin näillä näiltä näille näinä näiksi | these
+nuo noiden noita noissa noista noihin noilla noilta noille noina noiksi | those
+ne niiden niitä niissä niistä niihin niillä niiltä niille niinä niiksi | they
+
+kuka kenen kenet ketä kenessä kenestä keneen kenellä keneltä kenelle kenenä keneksi| who
+ketkä keiden ketkä keitä keissä keistä keihin keillä keiltä keille keinä keiksi | (pl)
+mikä minkä minkä mitä missä mistä mihin millä miltä mille minä miksi | which what
+mitkä | (pl)
+
+joka jonka jota jossa josta johon jolla jolta jolle jona joksi | who which
+jotka joiden joita joissa joista joihin joilla joilta joille joina joiksi | (pl)
+
+| conjunctions
+
+että | that
+ja | and
+jos | if
+koska | because
+kuin | than
+mutta | but
+niin | so
+sekä | and
+sillä | for
+tai | or
+vaan | but
+vai | or
+vaikka | although
+
+
+| prepositions
+
+kanssa | with
+mukaan | according to
+noin | about
+poikki | across
+yli | over, across
+
+| other
+
+kun | when
+nyt | now
+itse | self
+
diff --git a/algorithms/french/stemmer.html b/algorithms/french/stemmer.html
new file mode 100644
index 0000000..8b361cc
--- /dev/null
+++ b/algorithms/french/stemmer.html
@@ -0,0 +1,811 @@
+
+
+
+
+
+
+
+
+
+ French stemming algorithm - Snowball
+
+
+
+
+
+
+
+
+
+
+
+Letters in French include the following accented forms,
+
+
+
+ â à ç ë é ê è ï î ô û ù
+
+The following letters are vowels:
+
+ a e i o u y â à ë é ê è ï î ô û ù
+
+Assume the word is in lower case. Then, taking the letters in turn from the
+beginning to end of the word, put u or i into upper
+case when it is both preceded and followed by a vowel; put y into
+upper case when it is either preceded or followed by a vowel; and put u into upper case when it follows q. For example,
+
+
jouer
→
joUer
+
ennuie
→
ennuIe
+
yeux
→
Yeux
+
quand
→
qUand
+
croyiez
→
croYiez
+
+
+
+In the last example, y becomes Y because it is
+between two vowels, but i does not become I because
+it is between Y and e, and Y is not
+defined as a vowel above.
+
+
+
+(The upper case forms are not then classed as vowels — see note on vowel
+marking.)
+
+
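The left-to-right marking pass can be sketched as follows (illustrative Python only). Note that the 'preceded by' test consults the already-marked output, which is exactly what keeps the i of croyiez unmarked:

```python
FR_VOWELS = set("aeiouyâàëéêèïîôûù")

def mark_vowels(word):
    """Sketch of the French pre-pass: mark u/i between vowels, y next to
    a vowel, and u after q, by putting them into upper case."""
    chars = list(word)
    for k, ch in enumerate(chars):
        prev = chars[k - 1] if k > 0 else ""        # already processed
        nxt = word[k + 1] if k + 1 < len(word) else ""  # not yet processed
        if ch in "ui" and prev in FR_VOWELS and nxt in FR_VOWELS:
            chars[k] = ch.upper()
        elif ch == "y" and (prev in FR_VOWELS or nxt in FR_VOWELS):
            chars[k] = "Y"
        elif ch == "u" and prev == "q":
            chars[k] = "U"
    return "".join(chars)
```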
+
+Replace ë and ï with He and Hi. The H
+marks the vowel as having originally had a diaeresis, while the vowel itself, lacking an accent, is able to
+match suffixes beginning in e or i.
+
+
+
+If the word begins with two vowels, RV is the region after the third
+letter, otherwise the region after the first vowel not at the beginning of
+the word, or the end of the word if these positions cannot be found. (Exceptionally,
+par, col or tap, at the beginning of a word is also taken to define
+RV as the region to their right.)
+
+
+
+For example,
+
+
+
+ a i m e r a d o r e r v o l e r t a p i s
+ |...| |.....| |.....| |...|
+
+
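The RV definition might be computed like this (a sketch; the function name is mine, and the upper-case marked letters are ignored here):

```python
FR_VOWELS = set("aeiouyâàëéêèïîôûù")

def rv_start(word):
    """Sketch: index at which the French RV region begins."""
    if word[:3] in ("par", "col", "tap"):   # exceptional prefixes
        return 3
    if len(word) >= 2 and word[0] in FR_VOWELS and word[1] in FR_VOWELS:
        return 3                            # word begins with two vowels
    # otherwise: after the first vowel not at the beginning of the word
    for k in range(1, len(word)):
        if word[k] in FR_VOWELS:
            return k + 1
    return len(word)                        # no such vowel: RV is empty
```

This matches the diagram above: aimer and tapis give index 3, voler gives index 2.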
+
+R1 is the region after the first non-vowel following a vowel, or the end of
+the word if there is no such non-vowel.
+
+
+
+R2 is the region after the first non-vowel following a vowel in R1, or the
+end of the word if there is no such non-vowel.
+(See note on R1 and R2.)
+
+
+
+For example:
+
+
+
+ f a m e u s e m e n t
+ |......R1.......|
+ |...R2....|
+
+
+
+Note that R1 can contain RV (adorer), and RV can contain R1 (voler).
+
+
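The standard R1/R2 computation can be sketched as (illustrative only; the same helper works for the other languages with their own vowel sets):

```python
def r1_r2(word, vowels):
    """Sketch: start indexes of R1 and R2 (region after the first
    non-vowel that follows a vowel)."""
    def region_after(start):
        for k in range(start, len(word) - 1):
            if word[k] in vowels and word[k + 1] not in vowels:
                return k + 2
        return len(word)

    r1 = region_after(0)
    r2 = region_after(r1)   # same scan, restarted inside R1
    return r1, r2
```

For fameusement this yields R1 = eusement and R2 = ement, as in the diagram above.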
+
+Below, ‘delete if in R2’ means that a found suffix should be removed if it
+lies entirely in R2, but not if it overlaps R2 and the rest of the word.
+‘delete if in R1 and preceded by X’ means that X itself does not have to
+come in R1, while ‘delete if preceded by X in R1’ means that X, like the
+suffix, must be entirely in R1.
+
+
+
+Start with step 1
+
+
+
+Step 1: Standard suffix removal
+
+ Search for the longest among the following suffixes, and perform the
+ action indicated.
+
+
ance iqUe isme able iste eux ances iqUes ismes ables istes
+
delete if in R2
+
atrice ateur ation atrices ateurs ations
+
delete if in R2
+
if preceded by ic, delete if in R2, else replace by iqU
+
logie logies
+
replace with log if in R2
+
usion ution usions utions
+
replace with u if in R2
+
ence ences
+
replace with ent if in R2
+
ement ements
+
delete if in RV
+
if preceded by iv, delete if in R2 (and if further preceded by at,
+ delete if in R2), otherwise,
+
if preceded by eus, delete if in R2, else replace by eux
+ if in R1, otherwise,
+
if preceded by abl or iqU, delete if in R2, otherwise,
+
if preceded by ièr or Ièr, replace by i if in RV
+
ité ités
+
delete if in R2
+
if preceded by abil, delete if in R2, else replace by abl,
+ otherwise,
+
if preceded by ic, delete if in R2, else replace by iqU, otherwise,
+
if preceded by iv, delete if in R2
+
if ive ifs ives
+
delete if in R2
+
if preceded by at, delete if in R2 (and if further preceded by ic,
+ delete if in R2, else replace by iqU)
+
eaux
+
replace with eau
+
aux
+
replace with al if in R1
+
euse euses
+
delete if in R2, else replace by eux if in R1
+
issement issements
+
delete if in R1 and preceded by a non-vowel
+
amment
+
replace with ant if in RV
+
emment
+
replace with ent if in RV
+
ment ments
+
delete if preceded by a vowel in RV
+
+
+
+
+In steps 2a and 2b all tests are confined to the RV region.
+
+
+
+Do step 2a if either no ending was removed by step 1, or if one of endings
+amment, emment, ment, ments was found.
+
+
+
+Step 2a: Verb suffixes beginning i
+
+
+
+ Search for the longest among the following suffixes and if found,
+ delete if the preceding character is neither a vowel nor H.
+
+ îmes ît îtes i ie ies ir ira irai iraIent irais irait iras
+ irent irez iriez irions irons iront is issaIent issais issait
+ issant issante issantes issants isse issent isses issez issiez
+ issions issons it
+
+
+ (Note that the preceding character itself must also be in RV.)
+
+
+
+Do step 2b if step 2a was done, but failed to remove a suffix.
+
+
+
+Step 2b: Other verb suffixes
+
+
+
+ Search for the longest among the following suffixes, and perform the
+ action indicated.
+
+
ions
+
delete if in R2
+
é ée ées és èrent er era erai eraIent erais erait eras erez
+ eriez erions erons eront ez iez
+
delete
+
âmes ât âtes a ai aIent ais ait ant ante antes ants as asse
+ assent asses assiez assions
+
delete
+
if preceded by e, delete
+
+
+ (Note that the e that may be deleted in this last step must also be in
+ RV.)
+
+
+
+If the last step to be obeyed — either step 1, 2a or 2b — altered the word,
+do step 3
+
+
+
+Step 3
+
+
+ Replace final Y with i or final ç with c
+
+
+Alternatively, if the last step to be obeyed did not alter the word, do
+step 4
+
+
+
+Step 4: Residual suffix
+
+
+
+
+ If the word ends s, not preceded by a, i (unless itself preceded by H), o, u, è or s, delete it.
+
+
+
+ In the rest of step 4, all tests are confined to the RV region.
+
+
+
+ Search for the longest among the following suffixes, and perform the
+ action indicated.
+
+
+
+
ion
+
delete if in R2 and preceded by s or t
+
ier ière Ier Ière
+
replace with i
+
e
+
delete
+
+
+ (So note that ion is removed only when it is in R2 — as well as being
+ in RV — and preceded by s or t which must be in RV.)
+
+
+
+Always do steps 5 and 6.
+
+
+
+Step 5: Undouble
+
+
+
+ If the word ends enn, onn, ett, ell or eill, delete the last letter
+
+
+
+Step 6: Un-accent
+
+
+
+ If the word ends é or è followed by at least one non-vowel, remove
+ the accent from the e.
+
+
+
+And finally:
+
+
+
+
+ Turn any remaining I, U and Y letters in the word back into lower case.
+
+
+
+ Turn He and Hi back into ë and ï, and remove any
+ remaining H.
+
+
+
+
+
The same algorithm in Snowball
+
+[% highlight_file('french') %]
+
+[% footer %]
diff --git a/algorithms/french/stop.txt b/algorithms/french/stop.txt
new file mode 100644
index 0000000..d525c99
--- /dev/null
+++ b/algorithms/french/stop.txt
@@ -0,0 +1,178 @@
+
+ | A French stop word list. Comments begin with vertical bar. Each stop
+ | word is at the start of a line.
+
+au | a + le
+aux | a + les
+avec | with
+ce | this
+ces | these
+dans | with
+de | of
+des | de + les
+du | de + le
+elle | she
+en | `of them' etc
+et | and
+eux | them
+il | he
+je | I
+la | the
+le | the
+leur | their
+lui | him
+ma | my (fem)
+mais | but
+me | me
+même | same; as in moi-même (myself) etc
+mes | me (pl)
+moi | me
+mon | my (masc)
+ne | not
+nos | our (pl)
+notre | our
+nous | we
+on | one
+ou | where
+par | by
+pas | not
+pour | for
+qu | que before vowel
+que | that
+qui | who
+sa | his, her (fem)
+se | oneself
+ses | his (pl)
+ | son | his, her (masc). Omitted because it is homonym of "sound"
+sur | on
+ta | thy (fem)
+te | thee
+tes | thy (pl)
+toi | thee
+ton | thy (masc)
+tu | thou
+un | a
+une | a
+vos | your (pl)
+votre | your
+vous | you
+
+ | single letter forms
+
+c | c'
+d | d'
+j | j'
+l | l'
+à | to, at
+m | m'
+n | n'
+s | s'
+t | t'
+y | there
+
+ | forms of être (not including the infinitive):
+ | été - Omitted because it is homonym of "summer"
+étée
+étées
+ | étés - Omitted because it is homonym of "summers"
+étant
+suis
+es
+ | est - Omitted because it is homonym of "east"
+ | sommes - Omitted because it is homonym of "sums"
+êtes
+sont
+serai
+seras
+sera
+serons
+serez
+seront
+serais
+serait
+serions
+seriez
+seraient
+étais
+était
+étions
+étiez
+étaient
+fus
+fut
+fûmes
+fûtes
+furent
+sois
+soit
+soyons
+soyez
+soient
+fusse
+fusses
+ | fût - Omitted because it is homonym of "tap", like in "beer on tap"
+fussions
+fussiez
+fussent
+
+ | forms of avoir (not including the infinitive):
+ayant
+eu
+eue
+eues
+eus
+ai
+ | as - Omitted because it is homonym of "ace"
+avons
+avez
+ont
+aurai
+ | auras - Omitted because it is also the name of a kind of wind
+ | aura - Omitted because it is also the name of a kind of wind and homonym of "aura"
+aurons
+aurez
+auront
+aurais
+aurait
+aurions
+auriez
+auraient
+avais
+avait
+ | avions - Omitted because it is homonym of "planes"
+aviez
+avaient
+eut
+eûmes
+eûtes
+eurent
+aie
+aies
+ait
+ayons
+ayez
+aient
+eusse
+eusses
+eût
+eussions
+eussiez
+eussent
+
+ | Later additions (from Jean-Christophe Deschamps)
+ceci | this
+cela | that (added 11 Apr 2012. Omission reported by Adrien Grand)
+celà | that (incorrect, though common)
+cet | this
+cette | this
+ici | here
+ils | they
+les | the (pl)
+leurs | their (pl)
+quel | which
+quels | which
+quelle | which
+quelles | which
+sans | without
+soi | oneself
+
diff --git a/algorithms/german/stemmer.html b/algorithms/german/stemmer.html
new file mode 100644
index 0000000..140d616
--- /dev/null
+++ b/algorithms/german/stemmer.html
@@ -0,0 +1,547 @@
+
+
+
+
+
+
+
+
+
+ German stemming algorithm - Snowball
+
+
+
+
+
+
+
+
+
+
+
+German includes the following accented forms,
+
+
+
+ ä ö ü
+
+
+
+and a special letter, ß, equivalent to double s.
+
+
+
+The following letters are vowels:
+
+
+
+ a e i o u y ä ö ü
+
+
+
+First put u and y between vowels into
+upper case, and then do the following mappings,
+
+
+ (a) replace ß with ss,
+ (b) replace ae with ä,
+ (c) replace oe with ö,
+ (d) replace ue with ü unless preceded by q.
+
+
+
+(The rules here for ae, oe and ue were
+added in Snowball 2.3.0, but were previously present as a variant of the
+algorithm termed "german2". The condition
+on the replacement of ue prevents the unwanted changing of
+quelle. Also note that feuer is not modified because the first
+part of the rule changes it to feUer, so ue is not
+found.)
+
+
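A sketch of this pre-pass in Python (illustrative only; the function name is mine). The u/y marking is done first, which is what protects feuer and frauen from the ue rule:

```python
DE_VOWELS = set("aeiouyäöü")

def de_prelude(word):
    """Sketch of the German pre-pass: protect u/y between vowels, then
    rewrite ß and the ae/oe/ue digraphs."""
    chars = list(word)
    for k in range(1, len(word) - 1):
        if word[k] in "uy" and word[k - 1] in DE_VOWELS and word[k + 1] in DE_VOWELS:
            chars[k] = chars[k].upper()
    s = "".join(chars)
    s = s.replace("ß", "ss")
    s = s.replace("ae", "ä").replace("oe", "ö")
    # ue -> ü, but not when preceded by q
    out = []
    i = 0
    while i < len(s):
        if s.startswith("ue", i) and not (i > 0 and s[i - 1] == "q"):
            out.append("ü")
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)
```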
+
+R1 and R2 are first set up in the standard way
+(see the note on R1 and R2),
+but then R1 is adjusted so that the region before it contains at least 3 letters.
+
+
+
+Define a valid s-ending as one of b, d, f, g, h, k, l, m, n, r or t.
+
+
+
+Define a valid st-ending as the same list, excluding letter r.
+
+
+
+Do each of steps 1, 2 and 3.
+
+
+
+Step 1:
+
+ Search for the longest among the following suffixes,
+
+ (a) em ern er
+ (b) e en es
+ (c) s (preceded by a valid s-ending)
+
+
+ and delete if in R1. (Of course the letter of the valid s-ending is
+ not necessarily in R1.) If an ending of group (b) is deleted, and the ending
+ is preceded by niss, delete the final s.
+
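The step just described can be sketched as follows (illustrative Python, not the Snowball source; `r1_start` is assumed to be computed beforehand, with the three-letter adjustment already applied):

```python
S_ENDINGS = set("bdfghklmnrt")   # valid s-endings

def de_step1(word, r1_start):
    """Sketch of German step 1: suffix removal in R1."""
    def in_r1(suf):
        return word.endswith(suf) and len(word) - len(suf) >= r1_start

    for suf in ("ern", "em", "er"):                 # group (a)
        if in_r1(suf):
            return word[: -len(suf)]
    for suf in ("en", "es", "e"):                   # group (b)
        if in_r1(suf):
            stem = word[: -len(suf)]
            # if the deleted ending followed niss, undouble the s
            return stem[:-1] if stem.endswith("niss") else stem
    if in_r1("s") and len(word) >= 2 and word[-2] in S_ENDINGS:  # group (c)
        return word[:-1]
    return word
```

So katzen → katz, and kenntnissen → kenntnis via the niss rule.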
+Step 2:
+
+
+
+ Search for the longest among the following suffixes,
+
+ (a) en er est
+ (b) st (preceded by a valid st-ending, itself preceded by at least 3
+ letters)
+
+
+ and delete if in R1.
+
+
+
+Step 3: d-suffixes
+
+ Search for the longest among the following suffixes, and perform the
+ action indicated.
+
+
+
end ung
+
delete if in R2
+
if preceded by ig, delete if in R2 and not preceded by e
+
ig ik isch
+
delete if in R2 and not preceded by e
+
lich heit
+
delete if in R2
+
if preceded by er or en, delete if in R1
+
keit
+
delete if in R2
+
if preceded by lich or ig, delete if in R2
+
+
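A sketch of this derivational step (illustrative only; `r1_start`/`r2_start` are assumed precomputed, and the helper name is mine):

```python
def de_step3(word, r1_start, r2_start):
    """Sketch of the German derivational-suffix step."""
    def in_region(w, suf, start):
        return w.endswith(suf) and len(w) - len(suf) >= start

    for suf in ("end", "ung"):
        if in_region(word, suf, r2_start):
            stem = word[: -len(suf)]
            # if preceded by ig (in R2, not after e), delete that too
            if in_region(stem, "ig", r2_start) and not stem[:-2].endswith("e"):
                return stem[:-2]
            return stem
    for suf in ("isch", "ig", "ik"):
        if in_region(word, suf, r2_start) and not word[: -len(suf)].endswith("e"):
            return word[: -len(suf)]
    for suf in ("lich", "heit"):
        if in_region(word, suf, r2_start):
            stem = word[: -len(suf)]
            for pre in ("er", "en"):
                if in_region(stem, pre, r1_start):
                    return stem[:-2]
            return stem
    if in_region(word, "keit", r2_start):
        stem = word[:-4]
        for pre in ("lich", "ig"):
            if in_region(stem, pre, r2_start):
                return stem[: -len(pre)]
        return stem
    return word
```

For example, aufregung → aufreg, and möglichkeit → möglich (the lich is not in R2, so it survives).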
+
+
+Finally,
+
+
+
+ turn U and Y back into lower case, and remove the umlaut accent from a,
+ o and u.
+
+
+
The same algorithm in Snowball
+
+[% highlight_file('german') %]
+
+[% footer %]
diff --git a/algorithms/german/stop.txt b/algorithms/german/stop.txt
new file mode 100644
index 0000000..5c45a51
--- /dev/null
+++ b/algorithms/german/stop.txt
@@ -0,0 +1,286 @@
+
+ | A German stop word list. Comments begin with vertical bar. Each stop
+ | word is at the start of a line.
+
+ | The number of forms in this list is reduced significantly by passing it
+ | through the German stemmer.
+
+
+aber | but
+
+alle | all
+allem
+allen
+aller
+alles
+
+als | than, as
+also | so
+am | an + dem
+an | at
+
+ander | other
+andere
+anderem
+anderen
+anderer
+anderes
+anderm
+andern
+anderr
+anders
+
+auch | also
+auf | on
+aus | out of
+bei | by
+bin | am
+bis | until
+bist | art
+da | there
+damit | with it
+dann | then
+
+der | the
+den
+des
+dem
+die
+das
+
+daß | that
+
+derselbe | the same
+derselben
+denselben
+desselben
+demselben
+dieselbe
+dieselben
+dasselbe
+
+dazu | to that
+
+dein | thy
+deine
+deinem
+deinen
+deiner
+deines
+
+denn | because
+
+derer | of those
+dessen | of him
+
+dich | thee
+dir | to thee
+du | thou
+
+dies | this
+diese
+diesem
+diesen
+dieser
+dieses
+
+
+doch | (several meanings)
+dort | (over) there
+
+
+durch | through
+
+ein | a
+eine
+einem
+einen
+einer
+eines
+
+einig | some
+einige
+einigem
+einigen
+einiger
+einiges
+
+einmal | once
+
+er | he
+ihn | him
+ihm | to him
+
+es | it
+etwas | something
+
+euer | your
+eure
+eurem
+euren
+eurer
+eures
+
+für | for
+gegen | towards
+gewesen | p.p. of sein
+hab | have
+habe | have
+haben | have
+hat | has
+hatte | had
+hatten | had
+hier | here
+hin | there
+hinter | behind
+
+ich | I
+mich | me
+mir | to me
+
+
+ihr | you, to her
+ihre
+ihrem
+ihren
+ihrer
+ihres
+euch | to you
+
+im | in + dem
+in | in
+indem | while
+ins | in + das
+ist | is
+
+jede | each, every
+jedem
+jeden
+jeder
+jedes
+
+jene | that
+jenem
+jenen
+jener
+jenes
+
+jetzt | now
+kann | can
+
+kein | no
+keine
+keinem
+keinen
+keiner
+keines
+
+können | can
+könnte | could
+machen | do
+man | one
+
+manche | some, many a
+manchem
+manchen
+mancher
+manches
+
+mein | my
+meine
+meinem
+meinen
+meiner
+meines
+
+mit | with
+muss | must
+musste | had to
+nach | to(wards)
+nicht | not
+nichts | nothing
+noch | still, yet
+nun | now
+nur | only
+ob | whether
+oder | or
+ohne | without
+sehr | very
+
+sein | his
+seine
+seinem
+seinen
+seiner
+seines
+
+selbst | self
+sich | herself
+
+sie | they, she
+ihnen | to them
+
+sind | are
+so | so
+
+solche | such
+solchem
+solchen
+solcher
+solches
+
+soll | shall
+sollte | should
+sondern | but
+sonst | else
+über | over
+um | about, around
+und | and
+
+uns | us
+unse
+unsem
+unsen
+unser
+unses
+
+unter | under
+viel | much
+vom | von + dem
+von | from
+vor | before
+während | while
+war | was
+waren | were
+warst | wast
+was | what
+weg | away, off
+weil | because
+weiter | further
+
+welche | which
+welchem
+welchen
+welcher
+welches
+
+wenn | when
+werde | will
+werden | will
+wie | how
+wieder | again
+will | want
+wir | we
+wird | will
+wirst | willst
+wo | where
+wollen | want
+wollte | wanted
+würde | would
+würden | would
+zu | to
+zum | zu + dem
+zur | zu + der
+zwar | indeed
+zwischen | between
+
diff --git a/algorithms/german2/stemmer.html b/algorithms/german2/stemmer.html
new file mode 100644
index 0000000..d57bdd5
--- /dev/null
+++ b/algorithms/german2/stemmer.html
@@ -0,0 +1,86 @@
+
+
+
+
+
+
+
+
+
+ German stemming algorithm variant - Snowball
+
+
+
+
+
+
+
+
+
+
+
+We used to present a variant of the main German stemmer, termed "german2", which
+was the same as the German stemmer but adjusted the first step to improve
+handling of input text where the German letters ä,
+ö and ü, were written as ae,
+oe and ue respectively.
+
+
+
+Snowball 2.3.0 added these adjustments to the main German stemmer, so there
+is no longer a "german2" variant - just use the "german" stemmer.
+
+
+Despite its inflexional complexities, German has quite a simple suffix
+structure, so that, if one ignores the almost intractable problems of
+compound words, separable verb prefixes, and prefixed and infixed ge, an
+algorithmic stemmer can be made quite short. (Infixed zu can be removed
+algorithmically, but this minor feature is not shown here.) The umlaut in
+German is a regular feature of plural formation, so its removal is a
+natural feature of stemming, but this leads to certain false conflations
+(for example, schön, beautiful; schon, already).
+
+
+
+By contrast, Dutch is inflexionally simple, but even so, this does not make
+for any great difference between the stemmers. A feature of Dutch that
+makes it markedly different from German is that the grammar of the written
+language has changed, and continues to change, relatively rapidly, and that
+it has assimilated a large and mixed foreign vocabulary with some of the
+accompanying foreign suffixes. Foreign words may, or may not, be
+transliterated into a Dutch style. Naturally these create problems in
+stemming. The stemmer here is intended for native words of contemporary
+Dutch.
+
+
+
+In a Dutch noun, a vowel may double in the singular form (manen = moons, maan
+= moon). We attempt to solve this by undoubling the double vowel (Kraaij
+Pohlman by contrast attempt to double the single vowel). The endings je,
+tje, pje etc., although extremely common, are not stemmed. They are
+diminutives and can significantly alter word meaning.
+
+
+
A note on compound words
+
+
+Famously, German allows for the formation of long compound words, written
+without spaces. For retrieval purposes, it is useful to be able to search
+on the parts of such words, as well as on the complete words
+themselves. This is not just peculiar to German: Dutch, Danish, Norwegian,
+Swedish, Icelandic and Finnish have the same property. To split up
+compound words cannot be done without a dictionary, and the purely
+algorithmic stemmers presented here do not attempt it.
+
+
+
+We would suggest, however, that the need for compound word splitting in
+these languages has been somewhat overstated. In the case of German:
+
+
+
+1) There are many English compounds one would see no advantage in
+splitting,
+
+
+
+
blackberry
blackboard
rainbow
coastguard
....
+
+
+
+Many German compounds are like this,
+
+
+
+
Bleistift (pencil)
=
Blei (lead) + Stift (stick)
+
Eisenbahn (railway)
=
Eisen (iron) + Bahn (road)
+
Unterseeboot (submarine)
=
under + sea + boat
+
+
+
+2) Other compounds correspond to what in English one would want to do by
+phrase searching, so they are ready made for that purpose,
+
+
+
+
Gesundheitspflege
=
‘health care’
+
Fachhochschule
=
‘technical college’
+
Kunstmuseum
=
‘museum of fine art’
+
+
+
+3) In any case, longer compounds, especially involving personal names, are
+frequently hyphenated,
+
+
+
Heinrich-Heine-Universität
+
+
+
+4) It is possible to construct participial adjectives of almost any
+length, but they are little used in contemporary German, and regarded now
+as poor style. As in English, very long words are not always to be taken
+too seriously. On the author's last visit to Germany, the longest word he
+had to struggle with was
+
+
+
Nasenspitzenwurzelentzündung
+
+
+
+It means ‘inflammation of the root of the tip of the nose’, and comes from
+a cautionary tale for children.
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/algorithms/germanic.tt b/algorithms/germanic.tt
new file mode 100644
index 0000000..eb52a74
--- /dev/null
+++ b/algorithms/germanic.tt
@@ -0,0 +1,113 @@
+[% header('Germanic language stemmers') %]
+
+
+Despite its inflexional complexities, German has quite a simple suffix
+structure, so that, if one ignores the almost intractable problems of
+compound words, separable verb prefixes, and prefixed and infixed ge, an
+algorithmic stemmer can be made quite short. (Infixed zu can be removed
+algorithmically, but this minor feature is not shown here.) The umlaut in
+German is a regular feature of plural formation, so its removal is a
+natural feature of stemming, but this leads to certain false conflations
+(for example, schön, beautiful; schon, already).
+
+
+
+By contrast, Dutch is inflexionally simple, but even so, this does not make
+for any great difference between the stemmers. A feature of Dutch that
+makes it markedly different from German is that the grammar of the written
+language has changed, and continues to change, relatively rapidly, and that
+it has assimilated a large and mixed foreign vocabulary with some of the
+accompanying foreign suffixes. Foreign words may, or may not, be
+transliterated into a Dutch style. Naturally these create problems in
+stemming. The stemmer here is intended for native words of contemporary
+Dutch.
+
+
+
+In a Dutch noun, a vowel may double in the singular form (manen = moons, maan
+= moon). We attempt to solve this by undoubling the double vowel (Kraaij
+Pohlman by contrast attempt to double the single vowel). The endings je,
+tje, pje etc., although extremely common, are not stemmed. They are
+diminutives and can significantly alter word meaning.
+
+
+
A note on compound words
+
+
+Famously, German allows for the formation of long compound words, written
+without spaces. For retrieval purposes, it is useful to be able to search
+on the parts of such words, as well as on the complete words
+themselves. This is not just peculiar to German: Dutch, Danish, Norwegian,
+Swedish, Icelandic and Finnish have the same property. To split up
+compound words cannot be done without a dictionary, and the purely
+algorithmic stemmers presented here do not attempt it.
+
+
+
+We would suggest, however, that the need for compound word splitting in
+these languages has been somewhat overstated. In the case of German:
+
+
+
+1) There are many English compounds one would see no advantage in
+splitting,
+
+
+
+
blackberry
blackboard
rainbow
coastguard
....
+
+
+
+Many German compounds are like this,
+
+
+
+
Bleistift (pencil) = Blei (lead) + Stift (stick)
Eisenbahn (railway) = Eisen (iron) + Bahn (road)
Unterseeboot (submarine) = under + sea + boat
+
+
+
+2) Other compounds correspond to what in English one would want to do by
+phrase searching, so they are ready made for that purpose,
+
+
+
+
Gesundheitspflege = ‘health care’
Fachhochschule = ‘technical college’
Kunstmuseum = ‘museum of fine art’
+
+
+
+3) In any case, longer compounds, especially involving personal names, are
+frequently hyphenated,
+
+
+
Heinrich-Heine-Universität
+
+
+
+4) It is possible to construct participial adjectives of almost any
+length, but they are little used in contemporary German, and regarded now
+as poor style. As in English, very long words are not always to be taken
+too seriously. On the author's last visit to Germany, the longest word he
+had to struggle with was
+
+
+
Nasenspitzenwurzelentzündung
+
+
+
+It means ‘inflammation of the root of the tip of the nose’, and comes from
+a cautionary tale for children.
+
+This is an implementation of the "Lightweight Stemmer for Hindi" described in:
+
+
+
+ A. Ramanathan and D. Rao (2003) A Lightweight Stemmer for Hindi
+
+
+
+The major difference in our implementation is that rather than transliterating
+to the Latin alphabet we instead work in the original Devanagari script. We
+have modified the suffixes in the list by converting them back to Devanagari
+like so:
+
+
+
+
within the suffixes, "a" after a consonant is dropped since
+consonants have an implicit "a".
+
within the suffixes, a vowel other than "a" after a consonant
+is a dependent vowel (vowel sign); a vowel (including "a") after a
+non-consonant is an independent vowel.
+
to allow for the vowel at the start of each suffix being dependent or
+independent, we include each suffix twice. For the dependent version, a
+leading "a" is dropped and we check that the suffix is preceded by a
+consonant (which will have an implicit "a").
+
+
+
+The transliterations of our stems would end with "a" when our
+stems end in a consonant, so we also include the character virama in the
+list of suffixes to remove (this affects 222 words from our sample vocabulary).
+
+
+
+Aside from this, our implementation attempts to be faithful to the algorithm
+described in the paper, though in a few places we've had to resolve ambiguities
+in the paper:
+
+
+
+
+
+We assume that the whole word doesn't count as a valid suffix to remove, so we
+remove the longest suffix from the list which leaves at least one character.
+The paper doesn't clearly state which is intended, but producing an empty
+stem seems unhelpful in general. If we instead allowed an empty stem
+to be produced this would result in a different stem for 47 words out of the
+65,140 in our sample vocabulary from Hindi Wikipedia.
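The policy described above can be sketched as follows (a toy suffix list in Latin transliteration for illustration, not the real Devanagari list):

```python
# Sketch of the suffix-removal policy: remove the longest matching
# suffix from the list, but never leave an empty stem.  The suffix
# list used in the assertions is a toy stand-in.
def strip_longest_suffix(word, suffixes):
    # Try longer suffixes first so the longest match wins.
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf):
            return word[:-len(suf)]
    return word
```

A word that consists entirely of a listed suffix is not stemmed to the empty string; at most the shorter suffix that leaves a character is removed.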
+
+
+
+We add a to the list of suffixes to remove in figure 3. This is needed for
+the example given right at the end of section 5 to work (conflating BarawIya
+and BarawIyawA), and §3.1 a.v strongly suggests it should be in the list:
+"Thus, the following suffix deletions (longest possible match) are required
+to reduce inflected forms of masculine nouns to a common stem: a A i [...]"
+Adding a only affects 2 words out of the 65,140 in our sample vocabulary.
+
+
+
+We've also assumed that Mh in the suffix list isn't meant to match
+M followed by h. Only one of the 65,140 words in the
+sample vocabulary stems differently due to this (and that word
+seems to be a typo).
+
+
+
+
+
The full algorithm in Snowball
+
+
// An implementation of "A Lightweight Stemmer for Hindi":
+// http://www.kbcs.in/downloads/papers/StmmerHindi.pdf
+
+externals ( stem )
+
+stringescapes {}
+
+// The transliteration scheme used for our stringdefs matches that used in the
+// paper, as documented in the appendix. It appears to match the WX notation
+// (https://en.wikipedia.org/wiki/WX_notation) except that WX apparently
+// uses 'z' for Anunasika whereas the paper uses Mh.
+//
+// We discriminate dependent vowels by adding a leading "_" to their stringdef
+// names (mnemonic: the _ signifies removing the implicit a from the preceding
+// character).
+
+// Vowels and sonorants:
+stringdef a      '{U+0905}'
+stringdef A      '{U+0906}'
+stringdef i      '{U+0907}'
+stringdef I      '{U+0908}'
+stringdef u      '{U+0909}'
+stringdef U      '{U+090A}'
+stringdef q      '{U+090B}'
+stringdef e      '{U+090F}'
+stringdef E      '{U+0910}'
+stringdef o      '{U+0913}'
+stringdef O      '{U+0914}'
+
+// Vowel signs:
+stringdef _A     '{U+093E}'
+stringdef _i     '{U+093F}'
+stringdef _I     '{U+0940}'
+stringdef _u     '{U+0941}'
+stringdef _U     '{U+0942}'
+stringdef _q     '{U+0943}'
+stringdef _e     '{U+0947}'
+stringdef _E     '{U+0948}'
+stringdef _o     '{U+094B}'
+stringdef _O     '{U+094C}'
+
+// Diacritics:
+stringdef M      '{U+0902}'
+stringdef H      '{U+0903}'
+stringdef Mh     '{U+0901}'
+stringdef Z      '{U+093C}' // Nukta
+stringdef virama '{U+094D}'
+
+// Velar consonants:
+stringdef k      '{U+0915}'
+stringdef K      '{U+0916}'
+stringdef g      '{U+0917}'
+stringdef G      '{U+0918}'
+stringdef f      '{U+0919}'
+
+// Palatal consonants:
+stringdef c      '{U+091A}'
+stringdef C      '{U+091B}'
+stringdef j      '{U+091C}'
+stringdef J      '{U+091D}'
+stringdef F      '{U+091E}'
+
+// Retroflex consonants:
+stringdef t      '{U+091F}'
+stringdef T      '{U+0920}'
+stringdef d      '{U+0921}'
+stringdef D      '{U+0922}'
+stringdef N      '{U+0923}'
+
+// Dental consonants:
+stringdef w      '{U+0924}'
+stringdef W      '{U+0925}'
+stringdef x      '{U+0926}'
+stringdef X      '{U+0927}'
+stringdef n      '{U+0928}'
+
+// Labial consonants:
+stringdef p      '{U+092A}'
+stringdef P      '{U+092B}'
+stringdef b      '{U+092C}'
+stringdef B      '{U+092D}'
+stringdef m      '{U+092E}'
+
+// Semi-vowels:
+stringdef y      '{U+092F}'
+stringdef r      '{U+0930}'
+stringdef l      '{U+0932}'
+stringdef v      '{U+0935}'
+
+// Fricatives:
+stringdef S      '{U+0936}'
+stringdef R      '{U+0937}'
+stringdef s      '{U+0938}'
+stringdef h      '{U+0939}'
+
+stringdef lY     '{U+0933}'
+
+// Precomposed characters - letters + nukta:
+stringdef nZ     '{U+0929}' // ≡ {n}{Z}
+stringdef rZ     '{U+0931}' // ≡ {r}{Z}
+stringdef lYZ    '{U+0934}' // ≡ {lY}{Z}
+stringdef kZ     '{U+0958}' // ≡ {k}{Z}
+stringdef KZ     '{U+0959}' // ≡ {K}{Z}
+stringdef gZ     '{U+095A}' // ≡ {g}{Z}
+stringdef jZ     '{U+095B}' // ≡ {j}{Z}
+stringdef dZ     '{U+095C}' // ≡ {d}{Z}
+stringdef DZ     '{U+095D}' // ≡ {D}{Z}
+stringdef PZ     '{U+095E}' // ≡ {P}{Z}
+stringdef yZ     '{U+095F}' // ≡ {y}{Z}
+
+groupings ( consonant )
+
+routines ( CONSONANT )
+
+define consonant '{k}{K}{g}{G}{f}' +
+                 '{c}{C}{j}{J}{F}' +
+                 '{t}{T}{d}{D}{N}' +
+                 '{w}{W}{x}{X}{n}' +
+                 '{p}{P}{b}{B}{m}' +
+                 '{y}{r}{l}{v}' +
+                 '{S}{R}{s}{h}' +
+                 '{lY}' +
+                 '{Z}' + // Nukta
+                 // Precomposed characters - letter and nukta:
+                 '{nZ}{rZ}{lYZ}{kZ}{KZ}{gZ}{jZ}{dZ}{DZ}{PZ}{yZ}'
+
+backwardmode ( define CONSONANT as ( consonant ) )
+
+define stem as (
+// We assume in this implementation that the whole word doesn't count
+// as a valid suffix to remove, so we remove the longest suffix from
+// the list which leaves at least one character. This change affects
+// 47 words out of the 65,140 in the sample vocabulary from Hindi
+// wikipedia.
+//
+// The trick here is we use `next` in forward mode to advance the cursor
+// to the second character, then `backwards` swaps the cursor and limit.
+next
+backwards (
+[substring] among (
+// The list below is derived from figure 3 in the paper.
+//
+// We perform the stemming on the Devanagari characters rather than
+// transliterating to Latin, so we have adapted the list below to
+// reflect this by converting suffixes back to Devanagari as
+// follows:
+//
+// * within the suffixes, "a" after a consonant is dropped since
+// consonants have an implicit "a".
+//
+// * within the suffixes, a vowel other than "a" after a consonant
+// is a dependent vowel (vowel sign); a vowel (including "a")
+// after a non-consonant is an independent vowel.
+//
+// * to allow the vowel at the start of each suffix being dependent
+// or independent, we include each suffix twice. For the
+// dependent version, a leading "a" is dropped and we check that
+// the suffix is preceded by a consonant (which will have an
+// implicit "a").
+//
+// * we add '{a}', which is needed for the example given right at
+// the end of section 5 to work (conflating BarawIya and
+// BarawIyawA), and which 3.1 a.v strongly suggests should be in
+// the list:
+//
+// Thus, the following suffix deletions (longest possible
+// match) are required to reduce inflected forms of masculine
+// nouns to a common stem:
+// a A i [...]
+//
+// Adding '{a}' only affect 2 words out of the 65,140 in the
+// sample vocabulary.
+//
+// * The transliterations of our stems would end with "a" when our
+// stems end in a consonant, so we also include {virama} in the
+// list of suffixes to remove (this affects 222 words from the
+// sample vocabulary).
+//
+// We've also assumed that Mh in the suffix list always means {Mh}
+// and never {M}{h}{virama}. Only one of the 65,140 words in the
+// sample vocabulary stems differently due to this (and that word
+// seems to be a typo).
+
+'{virama}'
+
+'{a}'
+'{A}'
+'{i}'
+'{I}'
+'{u}'
+'{U}'
+'{e}'
+'{o}'
+'{e}{M}'
+'{o}{M}'
+'{A}{M}'
+'{u}{A}{M}'
+'{u}{e}{M}'
+'{u}{o}{M}'
+'{A}{e}{M}'
+'{A}{o}{M}'
+'{i}{y}{_A}{M}'
+'{i}{y}{_o}{M}'
+'{A}{i}{y}{_A}{M}'
+'{A}{i}{y}{_o}{M}'
+'{A}{Mh}'
+'{i}{y}{_A}{Mh}'
+'{A}{i}{y}{_A}{Mh}'
+'{a}{w}{_A}{e}{M}'
+'{a}{w}{_A}{o}{M}'
+'{a}{n}{_A}{e}{M}'
+'{a}{n}{_A}{o}{M}'
+'{a}{w}{_A}'
+'{a}{w}{_I}'
+'{I}{M}'
+'{a}{w}{_I}{M}'
+'{a}{w}{_e}'
+'{A}{w}{_A}'
+'{A}{w}{_I}'
+'{A}{w}{_I}{M}'
+'{A}{w}{_e}'
+'{a}{n}{_A}'
+'{a}{n}{_I}'
+'{a}{n}{_e}'
+'{A}{n}{_A}'
+'{A}{n}{_e}'
+'{U}{M}{g}{_A}'
+'{U}{M}{g}{_I}'
+'{A}{U}{M}{g}{_A}'
+'{A}{U}{M}{g}{_I}'
+'{e}{M}{g}{_e}'
+'{e}{M}{g}{_I}'
+'{A}{e}{M}{g}{_e}'
+'{A}{e}{M}{g}{_I}'
+'{o}{g}{_e}'
+'{o}{g}{_I}'
+'{A}{o}{g}{_e}'
+'{A}{o}{g}{_I}'
+'{e}{g}{_A}'
+'{e}{g}{_I}'
+'{A}{e}{g}{_A}'
+'{A}{e}{g}{_I}'
+'{A}{y}{_A}'
+'{A}{e}'
+'{A}{I}'
+'{A}{I}{M}'
+'{i}{e}'
+'{A}{o}'
+'{A}{i}{e}'
+'{a}{k}{r}'
+'{A}{k}{r}'
+
+'{_A}'
+'{_i}'
+'{_I}'
+'{_u}'
+'{_U}'
+'{_e}'
+'{_o}'
+'{_e}{M}'
+'{_o}{M}'
+'{_A}{M}'
+'{_u}{A}{M}'
+'{_u}{e}{M}'
+'{_u}{o}{M}'
+'{_A}{e}{M}'
+'{_A}{o}{M}'
+'{_i}{y}{_A}{M}'
+'{_i}{y}{_o}{M}'
+'{_A}{i}{y}{_A}{M}'
+'{_A}{i}{y}{_o}{M}'
+'{_A}{Mh}'
+'{_i}{y}{_A}{Mh}'
+'{_A}{i}{y}{_A}{Mh}'
+'{_I}{M}'
+'{_A}{w}{_A}'
+'{_A}{w}{_I}'
+'{_A}{w}{_I}{M}'
+'{_A}{w}{_e}'
+'{_A}{n}{_A}'
+'{_A}{n}{_e}'
+'{_U}{M}{g}{_A}'
+'{_U}{M}{g}{_I}'
+'{_A}{U}{M}{g}{_A}'
+'{_A}{U}{M}{g}{_I}'
+'{_e}{M}{g}{_e}'
+'{_e}{M}{g}{_I}'
+'{_A}{e}{M}{g}{_e}'
+'{_A}{e}{M}{g}{_I}'
+'{_o}{g}{_e}'
+'{_o}{g}{_I}'
+'{_A}{o}{g}{_e}'
+'{_A}{o}{g}{_I}'
+'{_e}{g}{_A}'
+'{_e}{g}{_I}'
+'{_A}{e}{g}{_A}'
+'{_A}{e}{g}{_I}'
+'{_A}{y}{_A}'
+'{_A}{e}'
+'{_A}{I}'
+'{_A}{I}{M}'
+'{_i}{e}'
+'{_A}{o}'
+'{_A}{i}{e}'
+'{_A}{k}{r}'
+
+/* Suffixes with a leading implicit a: */
+'{w}{_A}{e}{M}' CONSONANT
+'{w}{_A}{o}{M}' CONSONANT
+'{n}{_A}{e}{M}' CONSONANT
+'{n}{_A}{o}{M}' CONSONANT
+'{w}{_A}' CONSONANT
+'{w}{_I}' CONSONANT
+'{w}{_I}{M}' CONSONANT
+'{w}{_e}' CONSONANT
+'{n}{_A}' CONSONANT
+'{n}{_I}' CONSONANT
+'{n}{_e}' CONSONANT
+'{k}{r}' CONSONANT
+)
+delete
+)
+)
+
+This stemming algorithm removes the inflectional suffixes of nouns. Nouns are
+inflected for case, person/possession and number.
+
+
+
+Letters in Hungarian include the following accented forms,
+
+
+
+ á é í ó ö ő ú ü ű
+
+
+
+The following letters are vowels:
+
+
+
+ a á e é i í o ó ö ő u ú
+ ü ű
+
+
+
+The following letters are digraphs:
+
+
+
+ cs dz dzs gy ly ny ty zs
+
+
+
+A double consonant is defined as:
+
+
+
+ bb cc ccs dd ff gg ggy jj kk ll lly mm
+ nn nny pp rr ss ssz tt tty vv zz zzs
+
+
+
+If the word begins with a vowel, R1 is defined as the region after the
+first consonant or digraph in the word. If the word begins with a consonant, it
+is defined as the region after the first vowel in the word. If the word does
+not contain both a vowel and consonant, R1 is the null region at the end of
+the word.
+
+
+
+For example:
+
+
+
+ t ó b a n consonant-vowel
+ |...| R1 is 'b a n'
+
+ a b l a k a n vowel-consonant
+ |.........| R1 is 'l a k a n'
+
+ a c s o n y vowel-digraph
+ |.....| R1 is 'o n y'
+
+ c v s
+ --->|<--- null R1 region
+
+
+
+‘Delete if in R1’ means that the suffix should be removed if it is in
+region R1 but not if it is outside.
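The R1 definition above can be sketched as follows (an illustrative helper using the vowel and digraph lists given earlier; the Snowball program computes this differently):

```python
# Sketch of the Hungarian R1 definition: r1_start returns the index at
# which R1 begins (equal to len(word) for the null region).
VOWELS = set("aáeéiíoóöőuúüű")
DIGRAPHS = ("dzs", "cs", "dz", "gy", "ly", "ny", "ty", "zs")  # longest first

def r1_start(word: str) -> int:
    if word and word[0] in VOWELS:
        # Word begins with a vowel: region after the first consonant
        # or digraph (a digraph counts as a single unit).
        for i, ch in enumerate(word):
            if i > 0 and ch not in VOWELS:
                for d in DIGRAPHS:
                    if word.startswith(d, i):
                        return i + len(d)
                return i + 1
        return len(word)
    # Word begins with a consonant: region after the first vowel.
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return i + 1
    # No vowel at all: null region at the end of the word.
    return len(word)
```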
+
+
+
+Do steps 1 to 9 in turn
+
+
+
+Step 1: Remove instrumental case
+
+
+
+ Search for one of the following suffixes and perform the action indicated.
+
+
al el
+
delete if in R1 and preceded by a double consonant, and
+ remove one of the double consonants. (In the case of consonant plus digraph, such as ccs, remove a c).
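A simplified sketch of this step (the R1 check is deliberately omitted; `step1` is a hypothetical helper, not the Snowball routine):

```python
# Simplified sketch of Step 1: remove "al"/"el" after a double
# consonant and undouble that consonant, e.g. "fallal" -> "fal".
# DOUBLES lists the double consonants given above, longest forms first
# so "lly" is matched before "ll".  The R1 condition is omitted here.
DOUBLES = ("ccs", "ggy", "lly", "nny", "ssz", "tty", "zzs",
           "bb", "cc", "dd", "ff", "gg", "jj", "kk", "ll",
           "mm", "nn", "pp", "rr", "ss", "tt", "vv", "zz")

def step1(word: str) -> str:
    if word.endswith(("al", "el")):
        stem = word[:-2]
        for d in DOUBLES:
            if stem.endswith(d):
                # Drop the first letter of the doubling: ll -> l, ccs -> cs.
                return stem[:-len(d)] + d[1:]
    return word
```

A word such as "asztal" ends in "al" but is not preceded by a double consonant, so it is left alone.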
+
+
+
+
+Step 2: Remove frequent cases
+
+
+
+ Search for the longest among the following suffixes and perform the action indicated.
+
+
ban ben ba be ra re nak nek val vel tól
+ től ról ről ból ből hoz hez höz
+ nál nél ig at et ot öt ért képp
+ képpen kor ul ül vá vé onként enként
+ anként ként en on an ön n t
+
+
+
delete if in R1
+
if the remaining word ends á replace by a
+
if the remaining word ends é replace by e
+
+
+
+
+Step 3: Remove special cases:
+
+
+
+ Search for the longest among the following suffixes and perform the action
+ indicated.
+
+
án ánként
+
replace by a if in R1
+
én
+
replace by e if in R1
+
+
+
+
+Step 4: Remove other cases:
+
+
+
+ Search for the longest among the following suffixes and perform the action indicated
+
+
astul estül stul stül
+
delete if in R1
+
ástul
+
replace with a if in R1
+
éstül
+
replace with e if in R1
+
+
+
+
+Step 5: Remove factive case
+
+
+
+ Search for one of the following suffixes and perform the action indicated.
+
+
á é
+
delete if in R1 and preceded by a double consonant, and
+ remove one of the double consonants (as in step 1).
+
+
+
+
+Step 6: Remove owned
+
+
+
+ Search for the longest among the following suffixes and perform the action
+ indicated.
+
+
oké öké aké eké ké éi é
+
delete if in R1
+
áké áéi
+
replace with a if in R1
+
éké ééi éé
+
replace with e if in R1
+
+
+
+
+Step 7: Remove singular owner suffixes
+
+
+
+ Search for the longest among the following suffixes and perform the action
+ indicated.
+
+
ünk unk nk juk jük uk ük em om am m
+ od ed ad öd d ja je a e o
+
delete if in R1
+
ánk ájuk ám ád á
+
replace with a if in R1
+
énk éjük ém éd é
+
replace with e if in R1
+
+
+
+
+Step 8: Remove plural owner suffixes
+
+
+
+ Search for the longest among the following suffixes and perform the action
+ indicated.
+
+
jaim jeim aim eim im jaid jeid aid eid id
+ jai jei ai ei i jaink jeink eink aink ink
+ jaitok jeitek aitok eitek itek jeik jaik aik eik
+ ik
+
+
delete if in R1
+
áim áid ái áink áitok áik
+
replace with a if in R1
+
éim éid éi éink éitek éik
+
replace with e if in R1
+
+
+
+
+Step 9: Remove plural suffixes
+
+
+
+ Search for the longest among the following suffixes and perform the action
+ indicated.
+
+There are two English stemmers, the original Porter stemmer,
+and an improved stemmer which has been called Porter2. Read the accounts of them to
+learn a bit more about using Snowball.
+
+
+
+Each formal algorithm should be compared with the corresponding Snowball program.
+
+
+
+Surprisingly, among the Indo-European languages (*), the French stemmer turns out to be the most complicated, whereas
+the Russian stemmer, despite its large number of suffixes, is very simple. In
+fact it is interesting that English, with its minimal use of i-suffixes,
+has such a complex stemmer. This is partly due to the delicate nature of
+i-suffix removal (undoubling the p after removing ing from hopping etc),
+and partly to the wealth of forms of d-suffixes, deriving as they do from
+the mixed Romance and Germanic ancestry of the language.
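The undoubling mentioned here can be caricatured as follows (a toy fragment, not the Porter stemmer itself, which imposes further conditions):

```python
# Toy illustration of i-suffix removal with undoubling: strip "ing" and
# reduce a doubled final consonant, so "hopping" -> "hop" but
# "falling" -> "fall" (final l, s, z stay doubled, as in Porter's
# step 1b).  Not the real Porter stemmer.
def strip_ing(word: str) -> str:
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if word[-1] == word[-2] and word[-1] not in "aeioulsz":
            word = word[:-1]
    return word
```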
+
+
+
+Note that by i-suffix we mean inflexional suffix, and by d-suffix,
+derivational suffix (*).
+
+
+
Other Stemming Algorithms
+
+
+We also provide Snowball implementations of some algorithms developed by other parties:
+
+This is an implementation of the "Porter Stemmer for Bahasa Indonesia" described
+in:
+
+
+
+ Tala F Z (2003) A Study of Stemming Effects on Information Retrieval in Bahasa Indonesia. M.S. thesis, University of Amsterdam.
+
+
+
+It would be more accurately described as "Porter-style" or "Porter-inspired"
+since Martin Porter wasn't directly involved in its development.
+
+
+
+Our implementation attempts to be faithful to the algorithm described in the
+paper, but we have had to address some places in the paper which are unclear,
+and a case where an example doesn't match the described algorithm.
+
+
+
+
+
+In table 2.7 on page 9, the additional condition on the remaining stem for
+removing the suffix "i" reads "V|K...c₁c₁, c₁ ≠ s, c₂ ≠ i and
+prefix ∉ {ber, ke, peng}".
+
+
+
+The meaning of this is unclear in several ways, and none of the
+examples given of the stemmer's behaviour in the paper help to
+resolve these issues.
+
+
+
+Notice that c₂ isn't actually used - the most obvious explanation
+seems to be that "c₁c₁" should read "c₁c₂", or maybe "c₂c₁".
+
+
+
+Elsewhere the paper defines V... as meaning "the stem starts with
+a vowel" and K... as meaning "the stem starts with a consonant".
+
+
+
+In other places where it says X|Y... it seems the | binds more
+tightly, so it's (V|K)...cᵢcⱼ not V|(K...cᵢcⱼ). That seems a bit
+odd as the first letter must be either a vowel or a consonant, so
+that really just means "ends cᵢcⱼ". However, the paper nowhere uses
+or defines a notation such as ...X, which may explain this seemingly
+redundant way of specifying it.
+
+
+
+The conditions elsewhere on prefix removal (e.g. V...) are clearly
+on the stem left after the prefix is removed. None of the other
+rules for suffix removal have conditions on the stem, but for
+consistency with the prefix rules we might expect that the cᵢcⱼ
+test is on what's left after removing the "i" suffix.
+
+
+
+However, studying Indonesian wordlists and discussion with a native
+speaker leads us to conclude that the purpose of this check is to
+protect words of foreign origin (e.g. "televisi", "organisasi",
+"komunikasi") from stemming, and the common feature of these is
+that the word ends "-si", so we conclude that the condition here
+should be read as "word does not end -si", and this is what we
+have implemented.
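In other words (a sketch of the condition as we read it; `may_remove_i_suffix` is a hypothetical name, and the stemmer's separate prefix-class condition is omitted here):

```python
# Sketch of the condition as implemented: the derivational suffix "i"
# is only a removal candidate when the word does not end in "-si",
# which protects loanwords like "televisi" and "organisasi".  The real
# stemmer also checks which prefix was removed; that check is omitted.
def may_remove_i_suffix(word: str) -> bool:
    return word.endswith("i") and not word.endswith("si")
```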
+
+
+
+
+
+On page 29, the example "kompas Q.31" says "Both Nazief and Porter stemmer
+converted the word peledakan (blast, explotion) to ledak (to
+blast, to explode)". However, the algorithm as described doesn't behave in
+this way - grammatically the prefix pe- occurs as a variation of both the
+first-order derivational prefix peng- and the second-order derivational prefix
+per-, but table 2.5 doesn't include "pe", only table 2.6 does, so "peledakan"
+is handled (incorrectly) as having prefix "per" not "peng", and so we remove
+derivational suffix "kan" rather than "an" to give stem leda.
+(Porter-style stemmers remove the longest suffix they can amongst those
+available, which this paper notes in the last paragraph on page 15).
+
+
+
+We resolve this by amending the condition on suffix "kan" to "prefix ∉
+{ke, peng, per}", which seems to make the stemmer's behaviour match all the
+examples in the paper except for one: "perbaikan" is shown in table 3.4
+as stemming to "bai", but with this change it now stems to "baik". The
+table notes that "baik" is the actual root so this deviation is an
+improvement. In a sample vocabulary derived from the most common words in
+id.wikipedia.org, this change only affects 0.12% of words (76 out of 64,587,
+including "peledakan" and "perbaikan").
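The amended condition can be stated compactly (hypothetical helper names; the Snowball code tracks prefix classes numerically rather than as strings):

```python
# Sketch of the suffix conditions: "kan" may be removed only when the
# recorded prefix class is not ke, peng or per (the amended rule), and
# "an" only when it is not di, meng or ter (as in the paper).
def suffix_kan_ok(prefix: str) -> bool:
    return prefix not in {"ke", "peng", "per"}

def suffix_an_ok(prefix: str) -> bool:
    return prefix not in {"di", "meng", "ter"}
```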
+
+
+
+
+The paper has the condition on removal of prefix "bel" and "pel" as
+just "ajar" not "ajar..." but it seems that the latter must be what
+is intended so that e.g. "pelajaran" stems to "ajar" not "lajar".
+This change only affects a very small number of words (11 out of
+64,587), and only for the better.
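A sketch of the amended reading (hypothetical helper; the Snowball code expresses this as replacement rules for 'pelajar'/'belajar'):

```python
# Sketch of the amended rule: "pel"/"bel" is removed when the remainder
# *starts with* "ajar" rather than equalling it, so "pelajaran" becomes
# "ajaran" here, and later suffix removal yields "ajar".
def remove_pel_bel(word: str) -> str:
    for pre in ("pel", "bel"):
        if word.startswith(pre) and word[3:].startswith("ajar"):
            return word[3:]
    return word
```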
+
+
+
+
The full algorithm in Snowball
+
+
// An implementation of the "Porter Stemmer for Bahasa Indonesia" from:
+// http://www.illc.uva.nl/Research/Publications/Reports/MoL-2003-02.text.pdf
+
+integers (
+// The paper defines measure as the number of vowels in the word. We
+// count this initially, then adjust the count each time we remove a
+// prefix or suffix.
+measure
+
+// Numeric code for the type of prefix removed:
+//
+// 0 other/none
+// 1 'di' or 'meng' or 'ter'
+// 2 'per'
+// 3 'ke' or 'peng'
+// 4 'ber'
+//
+// Some of these have variant forms, so e.g. "meng" includes "men", "me",
+// "meny", "mem".
+//
+// Note that the value of prefix is only used in remove_suffix (and
+// routines it calls) so we don't need to worry about
+// remove_second_order_prefix overwriting a value of prefix set by
+// remove_first_order_prefix since remove_suffix gets called between
+// the two.
+prefix
+)
+
+groupings ( vowel )
+
+routines (
+remove_particle
+remove_possessive_pronoun
+remove_first_order_prefix
+remove_second_order_prefix
+remove_suffix
+KER
+SUFFIX_KAN_OK
+SUFFIX_AN_OK
+SUFFIX_I_OK
+VOWEL
+)
+
+externals ( stem )
+
+stringescapes {}
+
+backwardmode (
+
+define remove_particle as (
+[substring] among (
+'kah' 'lah' 'pun' (delete $measure -= 1)
+)
+)
+
+define remove_possessive_pronoun as (
+[substring] among (
+'ku' 'mu' 'nya' (delete $measure -= 1)
+)
+)
+
+// prefix not in {ke, peng, per}
+define SUFFIX_KAN_OK as (
+// On page 29, the example "kompas Q.31" says "Both Nazief and Porter
+// stemmer converted the word peledakan (blast, explotion [sic]) to
+// ledak (to blast, to explode)". However, the algorithm as described
+// doesn't behave in this way - grammatically the prefix pe- occurs as a
+// variation of both the first-order derivational prefix peng- and the
+// second-order derivational prefix per-, but table 2.5 doesn't include
+// "pe", only table 2.6 does, so "peledakan" is handled (incorrectly)
+// as having prefix "per" not "peng", and so we remove derivational
+// suffix "kan" rather than "an" to give stem leda. (Porter-style
+// stemmers remove the longest suffix they can amongst those available,
+// which this paper notes in the last paragraph on page 15).
+//
+// We resolve this by amending the condition on suffix "kan" to
+// "prefix ∉ {ke, peng, per}", which seems to make the stemmer's
+// behaviour match all the examples in the paper except for one:
+// "perbaikan" is shown in table 3.4 as stemming to "bai", but with
+// this change it now stems to "baik". The table notes that "baik" is
+// the actual root so this deviation is an improvement. In a sample
+// vocabulary derived from the most common words in id.wikipedia.org,
+// this change only affects 0.12% of words (76 out of 64,587, including
+// "peledakan" and "perbaikan").
+$prefix != 3 and $prefix != 2
+)
+
+// prefix not in {di, meng, ter}
+define SUFFIX_AN_OK as ( $prefix != 1 )
+
+define SUFFIX_I_OK as (
+// prefix not in {ke, peng, ber}
+$prefix <= 2
+
+// The rest of the condition from the paper is:
+// V|K...c₁c₁, c₁ ≠ s, c₂ ≠ i
+//
+// The meaning of this is unclear in several ways, and none of the
+// examples given of the stemmer's behaviour in the paper help to
+// resolve these issues.
+//
+// Notice that c₂ isn't actually used - the most obvious explanation
+// seems to be that "c₁c₁" should read "c₁c₂", or maybe "c₂c₁".
+//
+// Elsewhere the paper defines V... as meaning "the stem starts with
+// a vowel" and K... as meaning "the stem starts with a consonant".
+//
+// In other places where it says X|Y... it seems the | binds more
+// tightly, so it's (V|K)...cᵢcⱼ not V|(K...cᵢcⱼ). That seems a bit
+// odd as the first letter must be either a vowel or a consonant, so
+// that really just means "ends cᵢcⱼ". However, nowhere in the paper
+// uses or defines a notation such as ...X, which may explain this
+// seemingly redundant way of specifying this.
+//
+// The conditions elsewhere on prefix removal (e.g. V...) are clearly
+// on the stem left after the prefix is removed. None of the other
+// rules for suffix removal have conditions on the stem, but for
+// consistency with the prefix rules we might expect that the cᵢcⱼ
+// test is on what's left *after* removing the "i" suffix.
+//
+// However, studying Indonesian wordlists and discussion with a native
+// speaker leads us to conclude that the purpose of this check is to
+// protect words of foreign origin (e.g. "televisi", "organisasi",
+// "komunikasi") from stemming, and the common feature of these is
+// that the word ends "-si", so we conclude that the condition here
+// should be read as "word does not end -si", and this is what we
+// have implemented.
+not 's'
+)
+
+define remove_suffix as (
+    [substring] among (
+        'kan' SUFFIX_KAN_OK 'an' SUFFIX_AN_OK 'i' SUFFIX_I_OK
+            (delete $measure -= 1)
+    )
+)
+)
+
+define vowel 'aeiou'
+
+define VOWEL as ( vowel )
+
+define KER as ( non-vowel 'er' )
+
+define remove_first_order_prefix as (
+    [substring] among (
+        'di' 'meng' 'men' 'me' 'ter' (delete $prefix = 1 $measure -= 1)
+        'ke' 'peng' 'pen' (delete $prefix = 3 $measure -= 1)
+        'meny' VOWEL ($prefix = 1 <- 's' $measure -= 1)
+        'peny' VOWEL ($prefix = 3 <- 's' $measure -= 1)
+        'mem' ($prefix = 1 $measure -= 1 vowel and <- 'p' or delete)
+        'pem' ($prefix = 3 $measure -= 1 vowel and <- 'p' or delete)
+    )
+)
+
+define remove_second_order_prefix as (
+// The paper has the condition on removal of prefix "bel" and "pel" as
+// just "ajar" not "ajar..." but it seems that the latter must be what
+// is intended so that e.g. "pelajaran" stems to "ajar" not "lajar".
+// This change only affects a very small number of words (11 out of
+// 64,587) and only for the better.
+    [substring] among (
+        'per' 'pe' (delete $prefix = 2 $measure -= 1)
+        'pelajar' (<- 'ajar' $measure -= 1)
+        'ber' (delete $prefix = 4 $measure -= 1)
+        'belajar' (<- 'ajar' $prefix = 4 $measure -= 1)
+        'be' KER (delete $prefix = 4 $measure -= 1)
+    )
+)
+
+define stem as (
+    $measure = 0
+    do ( repeat ( gopast vowel $measure += 1 ) )
+    $measure > 2
+    $prefix = 0
+    backwards (
+        do remove_particle
+        $measure > 2
+        do remove_possessive_pronoun
+    )
+    $measure > 2
+    test (
+        remove_first_order_prefix
+        do (
+            test ($measure > 2 backwards remove_suffix)
+            $measure > 2 remove_second_order_prefix
+        )
+    ) or (
+        do remove_second_order_prefix
+        do ($measure > 2 backwards remove_suffix)
+    )
+)
+
+This is an implementation of the "Porter Stemmer for Bahasa Indonesia" described
+in:
+
+
+
+ Tala F Z (2003) A Study of Stemming Effects on Information Retrieval in Bahasa Indonesia. M.S. thesis, University of Amsterdam.
+
+
+
+It would be more accurately described as "Porter-style" or "Porter-inspired"
+since Martin Porter wasn't directly involved in its development.
+
+
+
+Our implementation attempts to be faithful to the algorithm described in the
+paper, but we have had to address some places in the paper which are unclear,
+and a case where an example doesn't match the described algorithm.
+
+
+
+
+
+In table 2.7 on page 9, the additional condition on the remaining stem for
+removing the suffix "i" reads "V|K...c1c1, c1
+≠ s, c2 ≠ i and prefix ∉ {ber, ke, peng}".
+
+
+
+The meaning of this is unclear in several ways, and none of the
+examples given of the stemmer's behaviour in the paper help to
+resolve these issues.
+
+
+
+Notice that c2 isn't actually used - the most obvious explanation
+seems to be that "c1c1" should read
+"c1c2", or maybe "c2c1".
+
+
+
+Elsewhere the paper defines V... as meaning "the stem starts with
+a vowel" and K... as meaning "the stem starts with a consonant".
+
+
+
+In other places where it says X|Y... it seems the | binds more
+tightly, so it's (V|K)...cicj not
+V|(K...cicj). That seems a bit
+odd as the first letter must be either a vowel or a consonant, so
+that really just means "ends cicj". However, the paper nowhere
+uses or defines a notation such as ...X, which may explain this
+seemingly redundant way of specifying the condition.
+
+
+
+The conditions elsewhere on prefix removal (e.g. V...) are clearly
+on the stem left after the prefix is removed. None of the other
+rules for suffix removal have conditions on the stem, but for
+consistency with the prefix rules we might expect that the
+cicj test is on what's left after removing the
+"i" suffix.
+
+
+
+However, studying Indonesian wordlists and discussion with a native
+speaker leads us to conclude that the purpose of this check is to
+protect words of foreign origin (e.g. "televisi", "organisasi",
+"komunikasi") from stemming, and the common feature of these is
+that the word ends "-si", so we conclude that the condition here
+should be read as "word does not end -si", and this is what we
+have implemented.
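Our reading of the condition can be expressed as a simple predicate. The following Python sketch is illustrative only (the function name and the prefix codes, which follow the implementation's convention of 1 = di/meng/ter, 2 = per/pe, 3 = ke/peng, 4 = ber, are ours):

```python
def suffix_i_removable(word, prefix):
    """Conditions for removing derivational suffix -i, as we read them:
    the prefix already removed must not be from the ke/peng (code 3) or
    ber (code 4) groups, and the word must not end -si, which protects
    loanwords such as "televisi" and "organisasi"."""
    return word.endswith("i") and prefix <= 2 and not word.endswith("si")

# Loanwords ending -si are protected:
assert not suffix_i_removable("televisi", 0)
assert not suffix_i_removable("organisasi", 0)
# An ordinary -i suffix (e.g. from "menghargai" after meng- removal) passes:
assert suffix_i_removable("hargai", 1)
```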
+
+
+
+
+
+On page 29, the example "kompas Q.31" says "Both Nazief and Porter stemmer
+converted the word peledakan (blast, explotion) to ledak (to
+blast, to explode)". However, the algorithm as described doesn't behave in
+this way - grammatically the prefix pe- occurs as a variation of both the
+first-order derivational prefix peng- and the second-order derivational prefix
+per-, but table 2.5 doesn't include "pe", only table 2.6 does, so "peledakan"
+is handled (incorrectly) as having prefix "per" not "peng", and so we remove
+derivational suffix "kan" rather than "an" to give stem leda.
+(Porter-style stemmers remove the longest suffix they can amongst those
+available, which this paper notes in the last paragraph on page 15).
+
+
+
+We resolve this by amending the condition on suffix "kan" to "prefix ∉
+{ke, peng, per}", which seems to make the stemmer's behaviour match all the
+examples in the paper except for one: "perbaikan" is shown in table 3.4
+as stemming to "bai", but with this change it now stems to "baik". The
+table notes that "baik" is the actual root so this deviation is an
+improvement. In a sample vocabulary derived from the most common words in
+id.wikipedia.org, this change only affects 0.12% of words (76 out of 64,587,
+including "peledakan" and "perbaikan").
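The amended conditions can be written as plain predicates. This Python sketch uses names of our own choosing (not from the Snowball source), with the prefix group codes the implementation uses: 0 = none, 1 = di/meng/ter, 2 = per/pe, 3 = ke/peng, 4 = ber.

```python
def suffix_kan_ok(prefix):
    # Paper: prefix not in {ke, peng}; amended to also exclude per/pe,
    # so that pe- words like "peledakan" lose -an rather than -kan.
    return prefix != 3 and prefix != 2

def suffix_an_ok(prefix):
    # prefix not in {di, meng, ter}
    return prefix != 1

# With prefix "pe" (code 2) removed from "peledakan", -kan removal is
# blocked but -an removal is allowed, giving "ledak":
assert not suffix_kan_ok(2)
assert suffix_an_ok(2)
```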
+
+
+
+
+The paper has the condition on removal of prefix "bel" and "pel" as
+just "ajar" not "ajar..." but it seems that the latter must be what
+is intended so that e.g. "pelajaran" stems to "ajar" not "lajar".
+This change only affects a very small number of words (11 out of
+64,587), and only for the better.
+
+
+
+
The full algorithm in Snowball
+
+[% highlight_file('indonesian') %]
+
+[% footer %]
diff --git a/algorithms/indonesian/stop.txt b/algorithms/indonesian/stop.txt
new file mode 100644
index 0000000..c433b01
--- /dev/null
+++ b/algorithms/indonesian/stop.txt
@@ -0,0 +1,91 @@
+yang | that
+dan | and
+di | in
+dari | from
+ini | this
+pada kepada | at, to [person]
+ada adalah | there is, is
+dengan | with
+untuk | for
+dalam | in the
+oleh | by
+sebagai | as
+juga | also, too
+ke | to
+atau | or
+tidak | not
+itu | that
+sebuah | a
+tersebut | the
+dapat | can, may
+ia | he/she, yes
+telah | already
+satu | one
+memiliki | have
+mereka | they
+bahwa | that
+lebih | more, more than
+karena | because, since
+seorang | one person, same
+akan | will, about to
+seperti | as, like
+secara | on
+kemudian | later, then
+beberapa | some
+banyak | many
+antara | between
+setelah | after
+yaitu | that is
+hanya | only
+hingga | to
+serta | along with
+sama | same, and
+dia | he/she/it (informal)
+tetapi | but
+namun | however
+melalui | through
+bisa | can
+sehingga | so
+ketika | when
+suatu | a
+sendiri | own (adverb)
+bagi | for
+semua | all
+harus | must
+setiap | each, every
+maka | then
+maupun | as well
+tanpa | without
+saja | only
+jika | if
+bukan | not
+belum | not yet
+sedangkan | while
+yakni | i.e.
+meskipun | although
+hampir | almost
+kita | we/us (inclusive)
+demikian | thereby
+daripada | from/than/instead of
+apa | what/which/or/eh
+ialah | is
+sana | there
+begitu | so
+seseorang | someone
+selain | besides
+terlalu | too
+ataupun | or
+saya | me/I (formal)
+bila | if/when
+bagaimana | how
+tapi | but
+apabila | when/if
+kalau | if
+kami | we/us (exclusive)
+melainkan | but (rather)
+boleh | may,can
+aku | I/me (informal)
+anda | you (formal)
+kamu | you (informal)
+beliau | he/she/it (formal)
+kalian | you (plural)
diff --git a/algorithms/irish/stemmer.html b/algorithms/irish/stemmer.html
new file mode 100644
index 0000000..664838e
--- /dev/null
+++ b/algorithms/irish/stemmer.html
@@ -0,0 +1,457 @@
+
+
+
+
+
+
+
+
+
+ Irish Gaelic stemming algorithm - Snowball
+
+
+
+
+
+
+
+
+
+
+
+This basic stemmer for Irish was developed and contributed by Jim
+O’Regan.
+
+
+
+One thing that should be taken into account with Irish is the initial
+mutation (n-eclipsis and h-prothesis) which causes problems if words
+are simply folded to lowercase before stemming in the way that is
+usually assumed by Snowball stemmers. A Snowball version of an algorithm to
+fold to lowercase while taking this into account would look something like:
+
+
+
+[% highlight_file('tolower_irish') %]
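The highlighted source above is the actual algorithm. As a rough Python illustration of the problem it addresses (our own sketch, and the hyphen treatment is an assumed convention, not necessarily what tolower_irish does): a word-initial lowercase n, t or h directly before an upper-case vowel marks a mutation, so it should not simply be folded into the word.

```python
# Sketch only: detect n-eclipsis, t-prefixation or h-prothesis by a
# lowercase n/t/h immediately before an upper-case vowel, and keep the
# mutation letter visibly separate instead of merging it on lowercasing.
UPPER_VOWELS = set("AEIOUÁÉÍÓÚ")

def fold_irish(word):
    if len(word) > 1 and word[0] in "nth" and word[1] in UPPER_VOWELS:
        return word[0] + "-" + word[1:].lower()
    return word.lower()

assert fold_irish("nAlbain") == "n-albain"    # not "nalbain"
assert fold_irish("hÉireann") == "h-éireann"
assert fold_irish("Gaillimh") == "gaillimh"   # unmutated word: plain fold
```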
+
+
+The following characters are vowels for the purposes of this algorithm:
+
+
+ a e i o u á é í ó ú
+
+
+
+The algorithm first addresses the initial mutation, then regions are determined
+based on the word after this first step:
+
+
+
+
RV is the region after the first vowel, or the end of the word
+if it contains no vowels.
+
R1 is the region after the first non-vowel following a vowel, or the
+end of the word if there is no such non-vowel.
+
R2 is the region after the first non-vowel following a vowel in
+R1, or the end of the word if there is no such non-vowel.
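The region definitions above can be sketched in Python (an illustration of the definitions only, using the vowel list given above):

```python
VOWELS = set("aeiouáéíóú")

def rv(word):
    # region after the first vowel, or "" if the word has no vowel
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[i + 1:]
    return ""

def _after_nonvowel_following_vowel(region):
    seen_vowel = False
    for i, ch in enumerate(region):
        if ch in VOWELS:
            seen_vowel = True
        elif seen_vowel:
            return region[i + 1:]
    return ""

def r1(word):
    return _after_nonvowel_following_vowel(word)

def r2(word):
    return _after_nonvowel_following_vowel(r1(word))

# Classic example from the Snowball documentation:
assert r1("beautiful") == "iful"
assert r2("beautiful") == "ul"
```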
+Italian can include the following accented forms:
+
+
+
+ á é í ó ú à è ì ò ù
+
+
+
+First, replace all acute accents by grave accents. And, as in French, put u after
+q, and u, i between vowels into upper case.
+(See note on vowel marking.)
+
+
+
+The vowels are then
+
+
+
+ a e i o u à è ì ò ù
+
+
+
+R2
+(see the note on R1 and R2)
+and RV have the same definition as in the
+ Spanish stemmer.
+
+
+
+First exceptional cases are checked for. These need to match the whole word, and currently are:
+
+
+
+
divano: replace with divan (to avoid conflating with diva) [Added 2022-11-16]
+
+
+
+If found then handle as described and that's it.
+
+
+
+Otherwise always do steps 0 and 1.
+
+
+
+Step 0: Attached pronoun
+
+
+
+ Search for the longest among the following suffixes
+
+ ci gli la le li lo mi ne si ti vi
+ sene gliela gliele glieli glielo gliene
+ mela mele meli melo mene
+ tela tele teli telo tene
+ cela cele celi celo cene
+ vela vele veli velo vene
+
+
+ following one of
+
+
+ (a) ando endo
+ (b) ar er ir
+
+
+ in RV. In case of (a) the suffix is deleted, in case (b) it is replaced
+ by e (guardandogli → guardando, accomodarci → accomodare)
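Ignoring the RV restriction for brevity, the logic of step 0 can be sketched in Python (our own sketch, not the Snowball implementation):

```python
# Longest-match removal of an attached pronoun, following ando/endo
# (case (a): delete) or ar/er/ir (case (b): replace by e).
PRONOUNS = sorted(
    """ci gli la le li lo mi ne si ti vi
       sene gliela gliele glieli glielo gliene
       mela mele meli melo mene
       tela tele teli telo tene
       cela cele celi celo cene
       vela vele veli velo vene""".split(),
    key=len, reverse=True)

def step0(word):
    for p in PRONOUNS:                      # longest suffix first
        if word.endswith(p):
            stem = word[:-len(p)]
            if stem.endswith(("ando", "endo")):
                return stem                 # case (a): delete the pronoun
            if stem.endswith(("ar", "er", "ir")):
                return stem + "e"           # case (b): replace it by e
            return word                     # matched, but context fails
    return word

assert step0("guardandogli") == "guardando"
assert step0("accomodarci") == "accomodare"
assert step0("salami") == "salami"   # "mi" not preceded by (a) or (b)
```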
+
+
+
+
+Step 1: Standard suffix removal
+
+
+
+ Search for the longest among the following suffixes, and perform the
+ action indicated.
+
+
anza anze ico ici ica ice iche ichi ismo ismi abile abili ibile ibili
+ ista iste isti istà istè istì oso osi osa ose mente
+ atrice atrici ante anti
+
delete if in R2
+
azione azioni atore atori
+ delete if in R2
+
if preceded by ic, delete if in R2
+
logia logie
+
replace with log if in R2
+
uzione uzioni usione usioni
+
replace with u if in R2
+
enza enze
+
replace with ente if in R2
+
amento amenti imento imenti
+
delete if in RV
+
amente
+
delete if in R1
+
if preceded by iv, delete if in R2 (and if further preceded by at,
+ delete if in R2), otherwise,
+
if preceded by os, ic or abil, delete if in R2
+
ità
+
delete if in R2
+
if preceded by abil, ic or iv, delete if in R2
+
ivo ivi iva ive
+
delete if in R2
+
if preceded by at, delete if in R2 (and if further preceded by ic,
+ delete if in R2)
+
+
+
+
+Do step 2 if no ending was removed by step 1.
+
+
+
+Step 2: Verb suffixes
+
+
+
+ Search for the longest among the following suffixes in RV, and if found,
+ delete.
+
+ ammo ando ano are arono
+ asse assero assi assimo ata ate
+ ati ato ava avamo avano avate avi avo emmo
+ enda ende endi endo erà erai eranno ere
+ erebbe erebbero erei eremmo eremo ereste
+ eresti erete erò erono essero ete eva evamo
+ evano evate evi evo Yamo iamo immo irà
+ irai iranno ire irebbe irebbero irei iremmo
+ iremo ireste iresti irete irò irono isca
+ iscano isce isci isco iscono issero ita ite
+ iti ito iva ivamo ivano ivate ivi ivo
+ ono uta ute uti uto ar ir
+
+
+
+Always do steps 3a and 3b.
+
+
+
+
+
+Step 3a
+
+
+
+ Delete a final a, e, i, o, à, è, ì or ò if it is in RV, and a
+ preceding i if it is in RV (crocchi → crocch, crocchio → crocch)
+
+
+
+Step 3b
+
+
+
+ Replace final ch (or gh) with c (or g) if in RV (crocch → crocc)
+
+
+
+
+Finally,
+
+
+
+ turn I and U back into lower case
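Steps 3a and 3b can be sketched in Python (ignoring the RV restriction for brevity; this is an illustration of the rules, not the Snowball code):

```python
FINAL_VOWELS = set("aeioàèìò")

def step3(word):
    # Step 3a: delete a final a/e/i/o/à/è/ì/ò, and a preceding i
    if word and word[-1] in FINAL_VOWELS:
        word = word[:-1]
        if word.endswith("i"):
            word = word[:-1]
    # Step 3b: replace a final ch (or gh) with c (or g)
    if word.endswith(("ch", "gh")):
        word = word[:-1]
    return word

assert step3("crocchio") == "crocc"   # crocchio → crocch → crocc
assert step3("crocchi") == "crocc"    # crocchi → crocch → crocc
```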
+
+
+
The same algorithm in Snowball
+
+[% highlight_file('italian') %]
+
+[% footer %]
diff --git a/algorithms/italian/stop.txt b/algorithms/italian/stop.txt
new file mode 100644
index 0000000..a20bb95
--- /dev/null
+++ b/algorithms/italian/stop.txt
@@ -0,0 +1,295 @@
+
+ | An Italian stop word list. Comments begin with vertical bar. Each stop
+ | word is at the start of a line.
+
+ad | a (to) before vowel
+al | a + il
+allo | a + lo
+ai | a + i
+agli | a + gli
+all | a + l'
+agl | a + gl'
+alla | a + la
+alle | a + le
+con | with
+col | con + il
+coi | con + i (forms collo, cogli etc are now very rare)
+da | from
+dal | da + il
+dallo | da + lo
+dai | da + i
+dagli | da + gli
+dall | da + l'
+dagl | da + gl'
+dalla | da + la
+dalle | da + le
+di | of
+del | di + il
+dello | di + lo
+dei | di + i
+degli | di + gli
+dell | di + l'
+degl | di + gl'
+della | di + la
+delle | di + le
+in | in
+nel | in + il
+nello | in + lo
+nei | in + i
+negli | in + gli
+nell | in + l'
+negl | in + gl'
+nella | in + la
+nelle | in + le
+su | on
+sul | su + il
+sullo | su + lo
+sui | su + i
+sugli | su + gli
+sull | su + l'
+sugl | su + gl'
+sulla | su + la
+sulle | su + le
+per | through, by
+tra | among
+contro | against
+io | I
+tu | thou
+lui | he
+lei | she
+noi | we
+voi | you
+loro | they
+mio | my
+mia |
+miei |
+mie |
+tuo |
+tua |
+tuoi | thy
+tue |
+suo |
+sua |
+suoi | his, her
+sue |
+nostro | our
+nostra |
+nostri |
+nostre |
+vostro | your
+vostra |
+vostri |
+vostre |
+mi | me
+ti | thee
+ci | us, there
+vi | you, there
+lo | him, the
+la | her, the
+li | them
+le | them, the
+gli | to him, the
+ne | from there etc
+il | the
+un | a
+uno | a
+una | a
+ma | but
+ed | and
+se | if
+perché | why, because
+anche | also
+come | how
+dov | where (as dov')
+dove | where
+che | who, that
+chi | who
+cui | whom
+non | not
+più | more
+quale | who, that
+quanto | how much
+quanti |
+quanta |
+quante |
+quello | that
+quelli |
+quella |
+quelle |
+questo | this
+questi |
+questa |
+queste |
+si | yes
+tutto | all
+tutti | all
+
+ | single letter forms:
+
+a | at
+c | as c' for ce or ci
+e | and
+i | the
+l | as l'
+o | or
+
+ | forms of avere, to have (not including the infinitive):
+
+ho
+hai
+ha
+abbiamo
+avete
+hanno
+abbia
+abbiate
+abbiano
+avrò
+avrai
+avrà
+avremo
+avrete
+avranno
+avrei
+avresti
+avrebbe
+avremmo
+avreste
+avrebbero
+avevo
+avevi
+aveva
+avevamo
+avevate
+avevano
+ebbi
+avesti
+ebbe
+avemmo
+aveste
+ebbero
+avessi
+avesse
+avessimo
+avessero
+avendo
+avuto
+avuta
+avuti
+avute
+
+ | forms of essere, to be (not including the infinitive):
+sono
+sei
+è
+siamo
+siete
+sia
+siate
+siano
+sarò
+sarai
+sarà
+saremo
+sarete
+saranno
+sarei
+saresti
+sarebbe
+saremmo
+sareste
+sarebbero
+ero
+eri
+era
+eravamo
+eravate
+erano
+fui
+fosti
+fu
+fummo
+foste
+furono
+fossi
+fosse
+fossimo
+fossero
+essendo
+
+ | forms of fare, to do (not including the infinitive, fa, fat-):
+faccio
+fai
+facciamo
+fanno
+faccia
+facciate
+facciano
+farò
+farai
+farà
+faremo
+farete
+faranno
+farei
+faresti
+farebbe
+faremmo
+fareste
+farebbero
+facevo
+facevi
+faceva
+facevamo
+facevate
+facevano
+feci
+facesti
+fece
+facemmo
+faceste
+fecero
+facessi
+facesse
+facessimo
+facessero
+facendo
+
+ | forms of stare, to be (not including the infinitive):
+sto
+stai
+sta
+stiamo
+stanno
+stia
+stiate
+stiano
+starò
+starai
+starà
+staremo
+starete
+staranno
+starei
+staresti
+starebbe
+staremmo
+stareste
+starebbero
+stavo
+stavi
+stava
+stavamo
+stavate
+stavano
+stetti
+stesti
+stette
+stemmo
+steste
+stettero
+stessi
+stesse
+stessimo
+stessero
+stando
diff --git a/algorithms/kraaij_pohlmann/kraij-pohlmann-uplift-dutch-stemmer.zip b/algorithms/kraaij_pohlmann/kraij-pohlmann-uplift-dutch-stemmer.zip
new file mode 100644
index 0000000..db38905
Binary files /dev/null and b/algorithms/kraaij_pohlmann/kraij-pohlmann-uplift-dutch-stemmer.zip differ
diff --git a/algorithms/kraaij_pohlmann/stemmer.html b/algorithms/kraaij_pohlmann/stemmer.html
new file mode 100644
index 0000000..17ca0d0
--- /dev/null
+++ b/algorithms/kraaij_pohlmann/stemmer.html
@@ -0,0 +1,376 @@
+
+
+
+
+
+
+
+
+
+ The Kraaij-Pohlmann stemming algorithm - Snowball
+
+
+
+
+
+
+
+
+
+
+
+The Kraaij-Pohlmann stemming algorithm is an ANSI C program for stemming in Dutch. Although
+advertised as an algorithm, it is in fact a program without an accompanying
+algorithmic description. It is possible to produce a fairly clean Snowball
+version, but only by sacrificing exact functional equivalence. But that does not
+matter too much, since in the demonstration vocabulary only 32 words out of over
+45,000 stem differently. Here they are:
+
+
+
+
+source
ANSI C stemmer
Snowball stemmer
+
airways
airways
airway
+
algerije
algerije
alrije
+
assays
assays
assay
+
bruys
bruys
bruy
+
cleanaways
cleanaways
cleanaway
+
creys
creys
crey
+
croyden
croyd
croy
+
edele
edel
edeel
+
essays
essays
essay
+
gedijen
gedij
dij
+
geoff
of
off
+
gevrey
gevrey
vrey
+
geysels
ysel
gey
+
grootmeesteres
grootmee
grootmeest
+
gròotmeesteres
gròotmee
gròotmeest
+
hectares
hectaar
hect
+
huys
huys
huy
+
kayen
kayen
kaay
+
lagerwey
lagerwey
larwey
+
mayen
mayen
maay
+
meesteres
meester
meest
+
oppasseres
oppasser
oppas
+
pays
pays
pay
+
royale
royale
royaal
+
schilderes
schilder
schild
+
summerhayes
summerhayes
summerhaye
+
tyumen
tyuum
tyum
+
verheyen
verheyen
verheey
+
verleideres
verleider
verleid
+
ytsen
yts
ytsen
+
yves
yve
yves
+
zangeres
zanger
zang
+
+
+
+The Kraaij-Pohlmann stemmer can make fairly drastic reductions to a word. For
+example, infixed ge is removed, so geluidgevoelige stems to
+luidvoel. Often, therefore, the original word cannot be easily guessed from
+the stemmed form.
+
+
+
+Here then is the Snowball equivalent of the Kraaij-Pohlmann algorithm.
+
+
+This is a revised version of Martin Porter’s paper which was published as part
+of the Karen Sparck Jones Festschrift of 2005.
+
+
+
+Charting a New Course: Progress in Natural Language Processing and
+Information Retrieval: A Festschrift for Professor Karen Sparck Jones, edited
+by John Tait, Amsterdam: Kluwer, 2005.
+
+
+
Lovins Revisited
+
+
+Martin Porter, December 2001 (revised November 2008).
+
+
+
Abstract
+
+ The Lovins stemming algorithm for English is analysed, and compared
+ with the Porter stemming algorithm, using Snowball, a language designed
+ specifically for the development of stemming algorithms. It is shown
+ how the algorithms manage to function in a similar way, while appearing
+ to be quite different. The Porter algorithm is recoded in the style of
+ the Lovins algorithm, which leads to the discovery of a few possible
+ improvements.
+
+
+
Preamble
+
+
+This is a festschrift paper, so I am allowed to begin on a personal note.
+In 1979 I was working with Keith van Rijsbergen and Stephen Robertson on a
+British Library funded IR project to investigate the selection of good
+index terms, and one of the things we found ourselves having to do was to
+establish a document test collection from some raw data that had been sent
+to us on a magnetic tape by Peter Vaswani of the National Physical
+Laboratory. I was the tame programmer in the project, so it was my job to
+set up the test collection.
+
+
+
+On the whole it did not prove too difficult. The data we received was a
+collection of about 11,000 documents (titles and short abstracts), 93
+queries — in a free text form, and relevance judgements. All the text was
+in upper case without punctuation, and there were one or two marker
+characters to act as field terminators. By modern standards the data was
+really very small indeed, but at the time it was considerably larger than
+any of the other test collections we had. What you had to do was to cast it
+into a standard form
+for experimental work. You represented terms and documents by numbers, and
+created flat files in text form corresponding to the queries, relevance
+assessments, and term to document index. One process however was less
+straightforward. On their way to becoming numeric terms, the words of the
+source text were put through a process of linguistic normalisation called
+suffix stripping, in which certain derivational and inflectional suffixes
+attached to the words were removed. There was a standard piece of software
+used in Cambridge at that time to do this, written in 1971 by Keith
+Andrews (Andrews, 1971) as part of a Diploma Project.
+One of the courses in
+Cambridge is the one year post-graduate Diploma in Computer Science. Each
+student on the course is required to do a special project, which includes
+writing a significant piece of software — significant in the sense of being
+both useful and substantial.
+Keith's piece of software was more useful than most, and it continued to be
+used as a suffix stripping program, or stemmer, for many years after it was
+written.
+
+
+
+Now by an odd chance I was privy to much of Keith Andrews’ original
+thinking at the time that he was doing the work. The reason for this was
+that in 1971 I was looking for a house in Cambridge, and the base I was
+operating from was a sleeping bag on the living room floor of an old friend
+called John Dawson, who was Keith’s diploma supervisor. Keith used to come round
+and discuss stemming algorithms with him, while I formed a mute audience. I
+learnt about the Lovins stemming algorithm of 1968 (Lovins, 1968),
+and must I think have
+at least looked at her paper then, since I know it was not new to me when I
+saw it again in 1979. Their view of Lovins’ work was that it did not go far
+enough. There needed to be many more suffixes, and more complex rules to
+determine the criteria for their removal. Much of their discussion was
+about new suffixes to add to the list, and removal rules. It was interesting
+therefore to find myself needing to use Andrews’ work eight years later,
+and questioning some of its assumptions. Did you need that many suffixes?
+Did the rules need to be so complicated? Perhaps one would do better to
+break composite suffixes into smaller units and remove them piecemeal.
+And perhaps syllables would be a better count of stem length than letters.
+So I wrote my own stemmer, which became known as the Porter stemmer, and
+which was published in 1980 (Porter, 1980).
+
+
+
+I must explain where Karen Sparck Jones fits into all of this. Keith
+Andrews’ piece of work was originally suggested by Karen as a Diploma
+student project, and she was able to use the Andrews stemmer in her IR
+experiments throughout the seventies. In 1979 however Karen had moved much
+more into the field of Natural Language Processing and Artificial
+Intelligence, and by then had two or three research students in that field
+just writing up their PhDs (only one of whom I really got to know — John
+Tait, the editor of this volume). So we were in contact, but not working
+together. That again was an odd chance: that Karen had been my research
+supervisor in a topic other than IR, and that when later I was doing IR
+research at Cambridge I was not working with Karen. While I was engaged on
+writing the stemmer, Karen showed some justifiable irritation that I had
+become interested in a topic so very remote from the one for which we had
+received the British Library funding. Nevertheless, she came into my room
+one day, said, ‘Look, if you're getting interested in stemming, you’d
+better read this,’ and handed me the 1968 issue of Mechanical
+Translation that contains the Lovins paper. I still have this issue with
+Karen’s name across the top. (And I hope she didn't expect it back!)
+
+
+
+Another 20 years have gone by, and I have been studying the Lovins stemmer
+again, really because I was looking for examples to code up in Snowball, a
+small string processing language I devised in the latter half of 2001
+particularly adapted for writing stemming algorithms. Lovins’ stemmer
+strikes me now as a fine piece of work, for which she never quite received
+the credit she deserved. It was the first stemmer for English set out as
+an algorithm that described the stemming process exactly. She explained
+how it was intended to be used to improve IR performance, in just the way
+in which stemmers are used today. It is not seriously short of suffixes:
+the outstanding omissions are the plural forms ements and ents
+corresponding to her ement and ent, and it is easy enough to add
+them into the definition. It performs well in practice. In fact it is
+still in use, and can be downloaded in various languages from the net (1).
+The tendency since 1980 has been to attach the name ‘Porter’ to any
+language stemming process that does not use a dictionary, even when it is
+quite dissimilar to the original Porter stemmer (witness the Dutch Porter
+stemmer of Kraaij and Pohlmann (2) (Kraaij, 1994 and Kraaij, 1995), but
+the priority really belongs to Lovins. It also has one clear advantage
+over the Porter algorithm, in that it involves fewer steps. Coded up well,
+it should run a lot faster.
+
+
+
+A number of things intrigued me. Why are the Lovins and Porter stemmers so
+different, when what they do looks so similar? Could the stemmer, in some
+sense, be brought up-to-date? Could the Porter stemmer be cast into the
+Lovins form, and so run faster?
+
+
+
+This paper is about the answers to these questions. In discovering them, I
+have learned a lot more about my own stemmer.
+
+
+
Why stem?
+
+
+It may be worth saying a little on what stemming is all about. We can imagine
+a document with the title,
+
+
+
+ Pre-raphaelitism: A Study of Four Critical Approaches
+
+
+
+and a query, containing the words
+
+
+
+ PRE-RAPHAELITE CRITICISM
+
+
+
+We want to match query against title so that ‘Pre-raphaelitism’ matches
+‘PRE-RAPHAELITE’ and ‘Critical’ matches ‘CRITICISM’. This leads to the
+idea of removing endings from words as part of the process of extracting index
+terms from documents, a similar process of ending removal being applied to
+queries prior to the match. For example, we would like to remove the endings
+from
+
+so that each word is reduced to ‘critic’. This is the stem, from which the
+other words are formed, so the process as a whole is called stemming. It is
+a feature of English morphology that the part of the word we want to remove is
+at the end — the suffix. But the same is broadly true of French, German and other
+languages of the Indo-European group. It is also true of numerous languages
+outside Indo-European, Finnish for example, although there is a
+boundary beyond which it is not true. So Chinese, where words are simple
+units without affixes, and Arabic, where the stem is modified by
+prefixes and infixes as well as suffixes, lie outside the
+boundary. As an IR technique it therefore has wide applicability. In developing
+stemmers two points were recognised quite early on. One is that the
+morphological regularities that you find in English (or other languages) mean
+that you can attempt to do stemming by a purely algorithmic process. Endings
+al, ally, ism etc. occur throughout English vocabulary, and are
+easy to detect and remove: you don’t need access to an on-line dictionary. The
+other is that the morphological irregularities of English set a limit to the
+success of an algorithmic approach. Syntactically, what look like endings may
+not be endings (offspring is not offspr + ing), and the list of
+endings seems to extend indefinitely (trapez-oid, likeli-hood,
+guardian-ship, Tibet-an, juven-ilia, Roman-esque, ox-en
+...) It is difficult to gauge where to set the cut-off for these rarer forms.
+Semantically, the addition of a suffix may alter the meaning of a word a
+little, a lot, or completely, and morphology alone cannot measure the degree of
+change (prove and provable have closely related meanings; probe and
+probable do not.) This meant that stemming, if employed at all, became one
+of the most difficult parts of the indexing process.
+
+
+
+In the seventies, stemming might be applied as part of the process of
+establishing a test collection, and when it was there would not usually be any
+attempt to make the stemming process well-defined, or easily repeatable by
+another researcher. This was really because the basis for experiment replication
+was the normalised data that came out of the stemming process, rather than the
+source data plus a description of stemming procedures. Stemming tended to be
+applied, and then forgotten about. But by the 1980s, stemming itself was being
+investigated. Lennon and others (Lennon, 1981) found no substantial differences
+between the use of different stemmers for English. Harman (Harman, 1991)
+challenged the effectiveness of stemming altogether, when she reported no
+substantial differences between using and not using stemming in a series of
+experiments. But later work has been more positive. Krovetz (Krovetz, 1995), for example,
+reported small but significant improvements with stemming over a range of test
+collections.
+
+
+
+Of course, all these experiments assume some IR model which will use stemming in
+a particular way, and will measure just those features that test collections
+are, notoriously, able to measure. We might imagine an IR system where the users
+have been educated in the advantages and disadvantages to be expected from
+stemming, and are able to flag individual search terms to say whether or not
+they are to be used stemmed or unstemmed. Stemming sometimes improves,
+occasionally degrades, search performance, and this would be the best way of
+using it as an IR facility. Again stemming helps regularise the IR vocabulary,
+which is very useful when preparing a list of terms to present to a user as
+candidates for query expansion. But this advantage too is difficult to quantify.
+
+
+
+An evaluative comparison between the Lovins and later stemmers lies in any case
+outside the scope of this paper, but it is important to
+bear in mind that it is not a straightforward undertaking.
+
+
+
The Lovins Stemmer
+
+
+Structurally, the Lovins stemmer is in four parts, collected together in
+four Appendices A, B, C and D in her paper. Part A is a list of 294
+endings, each with a letter which identifies a condition for whether or
+not the ending should be removed. (I will follow Lovins in using ‘ending’
+rather than ‘suffix’ as a name for the items on the list.)
+Part A therefore looks like this:
+
+
+
+ .11.
+ alistically B
+ arizability A
+ izationally B
+ .10.
+ antialness A
+ arisations A
+ arizations A
+ entialness A
+ .09.
+ allically C
+ antaneous A
+ antiality A
+ . . .
+
+ .01.
+ a A
+ e A
+ i A
+ o A
+ s W
+ y B
+
+
+
+Endings are banked by length, from 11 letters down to 1. Each bank is tried
+in turn until an ending is found which matches the end of the word to be
+stemmed and leaves a stem which satisfies the given condition, when the
+ending is removed. For example condition C says that the stem must have at
+least 4 letters, so bimetallically would lose allically leaving a
+stem bimet of length 5, but metallically would not reduce to
+met, since its length is only 3.
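+
+The banked longest-match search can be sketched in Python. This is an
+illustrative sketch, not Lovins' own code: the ending list and conditions
+below are a tiny sample of the 294 endings and 29 conditions.
+
```python
# Sketch of the Lovins ending-removal step (illustrative subset).
# Conditions map a letter to a test on the candidate stem; every
# condition also carries an implicit minimum stem length of 2.
CONDITIONS = {
    "A": lambda stem: True,            # no restrictions on stem
    "B": lambda stem: len(stem) >= 3,  # minimum stem length = 3
    "C": lambda stem: len(stem) >= 4,  # minimum stem length = 4
}

# (ending, condition letter) -- a small sample of part A
ENDINGS = [("allically", "C"), ("ally", "B"), ("y", "B")]

def remove_ending(word):
    # longer endings are tried first, as with the banks of part A
    for ending, cond in sorted(ENDINGS, key=lambda e: -len(e[0])):
        if word.endswith(ending):
            stem = word[: -len(ending)]
            if len(stem) >= 2 and CONDITIONS[cond](stem):
                return stem
    return word

print(remove_ending("bimetallically"))  # -> bimet (condition C satisfied)
print(remove_ending("metallically"))    # -> metallic ('met' fails C; 'ally'/B applies)
```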
+
+
+
+There are 29 such conditions, called A to Z, AA, BB and CC, and they
+constitute part B of the stemmer. Here they are (* stands for any letter):
+
+
+
+
+
A
No restrictions on stem
+
B
Minimum stem length = 3
+
C
Minimum stem length = 4
+
D
Minimum stem length = 5
+
E
Do not remove ending after e
+
F
Minimum stem length = 3 and do not remove ending after e
+
G
Minimum stem length = 3 and remove ending only after f
+
H
Remove ending only after t or ll
+
I
Do not remove ending after o or e
+
J
Do not remove ending after a or e
+
K
Minimum stem length = 3 and remove ending only after l, i or
+u*e
+
L
Do not remove ending after u, x or s, unless s follows
+o
+
M
Do not remove ending after a, c, e or m
+
N
Minimum stem length = 4 after s**, elsewhere = 3
+
O
Remove ending only after l or i
+
P
Do not remove ending after c
+
Q
Minimum stem length = 3 and do not remove ending after l or
+n
+
R
Remove ending only after n or r
+
S
Remove ending only after dr or t, unless t follows t
+
T
Remove ending only after s or t, unless t follows o
+
U
Remove ending only after l, m, n or r
+
V
Remove ending only after c
+
W
Do not remove ending after s or u
+
X
Remove ending only after l, i or u*e
+
Y
Remove ending only after in
+
Z
Do not remove ending after f
+
AA
Remove ending only after d, f, ph, th, l, er, or, es or t
+
BB
Minimum stem length = 3 and do not remove ending after met or
+ryst
+
CC
Remove ending only after l
+
+
+
+
+There is an implicit assumption in each condition, A included, that the minimum
+stem length is 2.
+
+
+
+This is much less complicated than it seems at first. Conditions A to D
+depend on a simple measure of minimum stem length, and E and F are slight
+variants of A and B. Out of the 294 endings, 259 use one of these
+6 conditions. The remaining 35 endings use the other 23 conditions, so
+conditions G, H ... CC have fewer than two endings each, on average. What is
+happening here is that Lovins is trying to capture a rule which gives a
+good removal criterion for one ending, or a small number of similar
+endings. She does not explain the thinking behind the conditions, but it is
+often not too difficult to reconstruct. Here for example are the last few
+conditions with their endings,
+
+
+
+
+Y (early, ealy, eal, ear). collinearly, multilinear are
+stemmed.
+
+Z (eature). misfeature does not lose eature.
+
+AA (ite). acolouthite, hemimorphite lose ite, ignite and
+requite retain it.
+
+BB (allic, als, al). Words ending metal, crystal retain
+al.
+
+CC (inity). crystallinity → crystall, but affinity,
+infinity are unaltered.
+
+
+
+
+Part C of the Lovins stemmer is a set of 35 transformation rules used to
+adjust the letters at the end of the stem. These rules are invoked after the
+stemming step proper, irrespective of whether an ending was actually
+removed. Here are about half of them, with examples to show the type of
+transformation intended (letters in square brackets indicate the full form
+of the words),
+
+
+
+
+
1)
bb
→
b
rubb[ing] → rub
+
ll
→
l
controll[ed] → control
+
mm
→
m
trimm[ed] → trim
+
rr
→
r
abhorr[ing] → abhor
+
2)
iev
→
ief
believ[e] → belief
+
3)
uct
→
uc
induct[ion] → induc[e]
+
4)
umpt
→
um
consumpt[ion] → consum[e]
+
5)
rpt
→
rb
absorpt[ion] → absorb
+
6)
urs
→
ur
recurs[ive] → recur
+
7a)
metr
→
meter
parametr[ic] → paramet[er]
+
8)
olv
→
olut
dissolv[ed] → dissolut[ion]
+
11)
dex
→
dic
index → indic[es]
+
16)
ix
→
ic
matrix → matric[es]
+
18)
uad
→
uas
persuad[e] → persuas[ion]
+
19)
vad
→
vas
evad[e] → evas[ion]
+
20)
cid
→
cis
decid[e] → decis[ion]
+
21)
lid
→
lis
elid[e] → elis[ion]
+
31)
ert
→
ers
convert[ed] → convers[ion]
+
33)
yt
→
ys
analytic → analysis
+
34)
yz
→
ys
analyzed → analysed
+
+
+
+
+Finally, part D suggests certain relaxed matching rules between query terms
+and index terms when the stemmer has been used to set up an IR system, but
+we can regard that as not being part of the stemmer proper.
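+
+The pattern of part C can be sketched in Python. This is a simplified
+sketch assuming unconditional rewriting (some of the actual Lovins rules
+carry contextual conditions, and the full set has 35 rules); the rules
+below are a sample from the table above.
+
```python
# Sketch of part C: rewrite the end of the stem using the first rule
# that matches. Applied once, after the ending-removal step, whether
# or not an ending was removed.
RULES = [
    ("bb", "b"), ("ll", "l"), ("mm", "m"), ("rr", "r"),  # rule 1: undoubling
    ("iev", "ief"),     # believ -> belief
    ("umpt", "um"),     # consumpt -> consum
    ("rpt", "rb"),      # absorpt -> absorb
    ("metr", "meter"),  # parametr -> parameter
    ("olv", "olut"),    # dissolv -> dissolut
]

def respell(stem):
    for old, new in RULES:
        if stem.endswith(old):
            return stem[: -len(old)] + new
    return stem

print(respell("controll"))  # -> control
print(respell("dissolv"))   # -> dissolut
```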
+
+
+
The Lovins stemmer in Snowball
+
+
+Snowball is a string processing language designed with the idea of making
+the definition of stemming algorithms much more rigorous. The Snowball
+compiler translates a Snowball script into a thread-safe ANSI C module,
+where speed of execution is a major design consideration. The resulting
+stemmers are pleasantly fast, and will process one million or so words a
+second on a high-performance modern PC. The Snowball website (3) gives a
+full description of the language, and also presents stemmers for a range of
+natural languages. Each stemmer is written out as a formal algorithm, with
+the corresponding Snowball script following. The algorithm definition acts
+as program comment for the Snowball script, and the Snowball script gives a
+precise definition to the algorithm. The ANSI C code with the
+same functionality can also be inspected, and sample vocabularies in source
+and stemmed form can be used for test purposes.
+An essential function of
+the Snowball script is therefore comprehensibility — it should be fully understood
+by the reader of the script, and Snowball has been designed with this in mind.
+It contrasts interestingly in this respect with a system like Perl.
+Perl has a very big definition. Writing your own scripts in Perl is easy,
+after the initial learning hurdle, but understanding other scripts can be
+quite hard. The size of the language means that there are many different
+ways of doing the same thing, which gives programmers the opportunity of
+developing highly idiosyncratic styles. Snowball has a small, tight
+definition. Writing Snowball is much less easy than writing Perl, but on
+the other hand once it is written it is fairly easy to understand
+(or at least one hopes that it is). This is
+illustrated by the Lovins stemmer in Snowball, which is given in Appendix
+1. There is a very easy and natural correspondence
+between the different parts of the stemmer definition in Lovins' original
+paper and their Snowball equivalents.
+For example, the Lovins conditions A, B ... CC code up very neatly
+into routines with the same name. Taking condition L,
+
+
+
+ L Do not remove ending after u, x or s, unless s follows
+ o
+
+
+
+corresponds to
+
+
+
define L as ( test hop 2 not 'u' not 'x' not ('s' not 'o') )
+
+
+
+
+When L is called, we are at the right end of the stem, moving left towards the
+front of the word. Each Lovins condition has an implicit test for a stem of
+length 2, and this is done by test hop 2, which sees if it is possible to
+hop two places left. If it is not, the routine immediately returns with a
+false signal, otherwise it carries on. It tests that the character at the
+right hand end is not u, and also not x, and also not s following a letter
+which is not o. This is equivalent to the Lovins condition. Here is not of
+course the place to give the exact semantics, but you can quickly get
+the feel of the language by comparing the 29 Lovins conditions with their
+Snowball definitions.
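+
+For comparison, condition L can also be transcribed directly into Python
+(a sketch for illustration only; the Snowball routine above is the one
+actually used):
+
```python
# Condition L: do not remove the ending after u, x or s, unless the s
# follows an o; plus the implicit minimum stem length of 2.
def condition_L(stem):
    if len(stem) < 2:                        # test hop 2
        return False
    if stem[-1] in ("u", "x"):               # not 'u', not 'x'
        return False
    if stem[-1] == "s" and stem[-2] != "o":  # not ('s' not 'o')
        return False
    return True

print(condition_L("chaos"))  # True: s follows o
print(condition_L("bus"))    # False: s does not follow o
print(condition_L("flux"))   # False: ends in x
```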
+
+
+
+Something must be said about the among feature of Snowball however,
+since this is central to the efficient implementation of stemmers. It is
+also the one part of Snowball that requires just a little effort to
+understand.
+
+
+
+At its simplest, among can be used to test for alternative strings. The
+amongs used in the definition of condition AA and the undouble
+routine have this form. In Snowball you can write
+
+
+
'sh' or 's' or 't'   'o' or 'i'   'p'
+
+
+
+
+which will match the various forms shop, ship, sop, sip, top, tip. The
+order is important, because if 'sh' and 's' are swapped over, the 's'
+would match the first letter of ship, while 'o' or 'i' would fail to
+match with the following 'h' — in other words the pattern matching has
+no backtracking. But it can also be written as
+
+
+
among('sh' 's' 't') among('i' 'o') 'p'
+
+
+
+
+The order of the strings in each among is not important, because the
+match will be with the longest of all the strings that can match. In
+Snowball the implementation of among is based on the binary-chop idea,
+but has been carefully optimised. For example, in the Lovins stemmer, the
+main among in the endings routine has 294 different strings of average
+length 5.2 characters. A search for an ending involves accessing a number
+of characters within these 294 strings. The order is going to be
+K log2 294, or 8.2K, where K is a number that one hopes will
+be small, although one must certainly expect it to be greater than 1. It
+turns out that, for the successive words of a standard test vocabulary,
+K averages to 1.6, so for each word there are about 13 character
+comparisons needed to determine whether it has one of the Lovins endings.
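+
+The longest-match rule can be illustrated in Python (a sketch of the
+semantics only; the real implementation does the optimised binary chop
+described above):
+
```python
# among picks the longest of its strings that matches at the current
# position -- here, matching left to right from the start of the word.
def among_match(word, strings):
    matches = [s for s in strings if word.startswith(s)]
    return max(matches, key=len) if matches else None

# 'sh' wins over 's' for ship, whatever order the strings are given in
print(among_match("ship", ["s", "sh", "t"]))  # -> sh
print(among_match("tip", ["s", "sh", "t"]))   # -> t
```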
+
+
+
+Each string in an among construction can be followed by a routine name. The
+routine returns a true/false signal, and then the among searches for the
+longest substring whose associated routine gives a true signal. A string not
+followed by a routine name can be thought of as a string which is associated
+with a routine that does nothing except give a true signal. This is the way
+that the among in the endings routine works, where indeed every string is
+followed by a routine name.
+
+
+
+More generally, lists of strings in the among construction can be followed
+by bracketed commands, which are obeyed if one of the strings in the list is
+picked out for the longest match. The syntax is then
+
+    among( S11 S12 ... (C1)
+           S21 S22 ... (C2)
+           ...
+           Sn1 Sn2 ... (Cn)
+    )
+
+where the Sij are strings, optionally followed by their routine names,
+and the Ci are Snowball command sequences. The semantics is a bit
+like a switch in C, where the switch is on a string rather than a numerical
+value:
+
+
+
+ switch(...) {
+ case S11: case S12: ... C1; break;
+ case S21: case S22: ... C2; break;
+ ...
+
+ case Sn1: case Sn2: ... Cn; break;
+ }
+
+
+
+The among in the respell routine has this form.
+
+
+
+The full form however is to use among with a preceding substring, with
+substring and among possibly separated by further commands. substring
+triggers the test for the longest matching substring, and the among then
+causes the corresponding bracketed command to be obeyed. At a simple
+level this can be used to cut down the size of the code, in that
+
+More importantly, substring and among can work in different contexts. For
+example, substring could be used to test for the longest string, matching from
+right to left, while the commands in the among could operate in a left to
+right direction. In the Lovins stemmer, substring is used in this style:
+
+
+
[substring] among ( ... )
+
+
+
+
+The two square brackets are in fact individual commands, so before the among
+come three commands. [ sets a lower marker, substring is obeyed, searching
+for the strings in the following among, and then ] sets an upper marker.
+The region between the lower and upper markers is called the slice, and this
+may subsequently be copied, replaced or deleted.
+
+
+
+It was possible to get the Lovins stemmer working in Snowball very quickly.
+The Sourceforge versions (1) could be used to get the long list of endings and
+to help with the debugging. There was however one problem, that rules 24 and
+30 of part C conflicted. They are given as
+
+
+
+ 24) end → ens except following s
+ ...
+ 30) end → ens except following m
+
+
+
+This had not been noticed in the Sourceforge implementations, but
+immediately gave rise to a compilation error in Snowball. Experience
+suggested that I was very unlikely to get this problem resolved. Only a few
+months before, I had hit a point in a stemming algorithm where
+something did not quite make sense. The algorithm had been published just a
+few years before, and contacting at least one of the authors was quite easy.
+But I never sorted it out. The author I traced was not au fait
+with the linguistic background, and the language expert had been swallowed
+up in the wilds of America. So what chance would I have here? Even if I was
+able to contact Lovins, it seemed to me inconceivable that she would have
+any memory of, or even interest in, a tiny problem in a paper which she
+published 33 years ago. But the spirit of academic enquiry forced me to
+venture the attempt. After pursuing a number of red-herrings, email contact
+was finally made.
+
+
+
+Her reply was a most pleasant surprise.
+
+
+
+ ... The explanation is both mundane and exciting. You have just found
+ a typo in the MT article, which I was unaware of all these years, and I
+ suspect has puzzled a lot of other people too. The original paper, an
+ MIT-published memorandum from June 1968, has rule 30 as
+
+
+
+ ent → ens except following m
+
+
+
+ and that is undoubtedly what it should be ...
+
+
+
+
An analysis of the Lovins stemmer
+
+
+It is very important in understanding the Lovins stemmer to know something
+of the IR background of the late sixties. In the first place there was an
+assumption that IR was all, or mainly, about the retrieval of
+technical scientific papers, and research projects were set up accordingly.
+I remember being shown, in about 1968, a graph illustrating the
+‘information explosion’, as it was understood at the time, which showed
+just the rate of growth of publications of scientific papers in various
+different domains over the previous 10 or 20 years. Computing resources
+were very precious, and they could not be wasted by setting up IR systems
+for information that was, by comparison, merely frivolous (articles in
+popular magazines, say). And even in 1980, when I was working in IR, the
+data I was using came from the familiar, and narrow, scientific domain.
+Lovins was working with Project Intrex (Overhage, 1966), where the data came from
+papers in materials science and engineering.
+
+
+
+Secondly, the idea of indexing on every word in a document, or even looking
+at every word before deciding whether or not to put it into an index, would
+have seemed quite impractical, even though it might have been recognised as
+theoretically best. In the first place, the computing resources necessary to
+store and analyse complete documents in machine readable form were absent, and in the
+second, the rigidities of the printing industry almost guaranteed that one
+would never get access to them.
+A stemmer, therefore, would be seen as something not
+applied to general text but to certain special words, and in the case of the
+Lovins stemmer, the plan was to apply it to the subject terms that were used
+to categorize each document. Subsequently it would be used with each word
+in a query, where it
+was hoped that the vocabulary of the queries would match the vocabulary of
+the catalogue of subject terms.
+
+
+
+This accounts for: —
+
+
+
+
The emphasis on the scientific vocabulary. This can be seen in the
+endings, which include oidal, on, oid, ide, for words like colloidal,
+proton, spheroid, nucleotide. It can be seen in the transformation rules,
+with their concern for Greek sis and Latin ix suffixes. And also it can be
+seen in the word samples of the paper (magnesia, magnesite, magnesian,
+magnesium, magnet, magnetic, magneto etc. of Fig. 2).
+
+
+
The slight shortage of plural forms. The subject terms would naturally
+have been mainly in the singular, and one might also expect the same of
+query terms.
+
+
+
The surprising shortness of the allowed minimum stems — usually 2
+letters. A controlled technical vocabulary will contain longish words, and
+the problem of minimum stem lengths only shows up with shorter words.
+
+
+
+
+If we take a fairly ordinary vocabulary of modern English, derived from
+non-scientific writing, it is interesting to see how much of the Lovins
+stemmer does not actually get used. We use vocabulary V, derived from a
+sample of modern texts from Project Gutenberg (4). V can be inspected
+at (5). It contains 29,401 words, and begins
+
+We find that 22,311, or about 76%, of the words in V have one of the
+294 endings removed if passed through the Lovins stemmer. Of this 76%, over a
+half (55%) of the removals are done by just six of the endings, the breakdown
+being,
+
+ s (13%) ed (12%) e (10%) ing (10%) es (6%) y (4%)
+
+If, on the other hand, you look at the least frequent endings, 51% of them
+do only 1.4% of the removals. So half the removals in V are done by just
+2% of the endings in the stemmer, while half the endings in the stemmer
+account for only 1.4% of the removals in V. In fact 62 of the endings
+(about a fifth) do not lead to any ending removals in V at all. These are
+made up of the rarer ‘scientific’ endings, such as aroid and oidal, and
+long endings, such as alistically and entiality.
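+
+The breakdown above can be reproduced with a simple counting loop. This
+sketch assumes a stemmer that reports which ending it removed; the stub
+stemmer below merely stands in for the real Lovins endings routine.
+
```python
from collections import Counter

# Count how often each ending is removed across a vocabulary.
def ending_frequencies(words, removed_ending):
    counts = Counter()
    for w in words:
        e = removed_ending(w)  # the ending removed, or None
        if e is not None:
            counts[e] += 1
    return counts

# stub "stemmer": strips a final ed or s (illustration only)
def stub(word):
    for e in ("ed", "s"):
        if word.endswith(e):
            return e
    return None

print(ending_frequencies(["cats", "dogs", "walked", "fish"], stub))
```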
+
+
+
+This helps explain why the Porter and Lovins stemmers behave in a fairly
+similar way despite the fact that they look completely different — it is
+because most of the work is being done in just a small part of the stemmer,
+and in that part there is a lot of overlap. Porter and Lovins stem 64% of
+the words in V identically, which is quite high. (By contrast, an
+erroneous but plausibly written Perl script
+advertised on the Web as an implementation of the Porter stemmer
+still proves to stem only 86% of the words in V
+to the same forms that are produced by the Porter stemmer.)
+
+
+
+A feature of the Lovins stemmer that is worth looking at in some detail is
+the transformation rules. People who come to the problem of stemming for
+the first time usually devote a lot of mental energy to the issue of
+morphological irregularity which they are trying to address.
+
+
+
+A good starting point is the verbs of English. Although grammatically
+complex, the morphological forms of the English verb are few, and are
+illustrated by the pattern harm, harms, harming, harmed, where the basic
+verb form adds s, ing and ed to make the other three forms. There are
+certain special rules: to add s to a verb ending ss an e is inserted,
+so pass becomes passes, and adding ed or ing replaces a final e of
+the verb (love to loved, loving), and can cause consonant doubling (hop to
+hopped), but
+apart from this all verbs in the language follow the basic pattern with the
+exception of a finite class of irregular verbs.
+In a regular verb, the addition of ed to the basic verb creates both the
+past form (‘I harmed’) and the p.p. (past participle) form (‘I have
+harmed’). An irregular verb, such as ring, forms its past in some other
+way (‘I rang’), and may have a distinct p.p. (‘I have rung’).
+It is easy to think up more examples,
+
+
+
stem
past
p.p.
+
+
ring
rang
rung
+
rise
rose
risen
+
sleep
slept
slept
+
fight
fought
fought
+
come
came
come
+
go
went
gone
+
hit
hit
hit
+
+
+How many of these verbs are there altogether? On 20 Jan 2000, in order to
+test the hypothesis that the number is consistently over-estimated, I asked
+this question in a carefully worded email to a mixed group of
+about 50
+well-educated
+work colleagues (business rather than academic people). Ten of them replied,
+and here are the
+guesses they made:
+
+
+
+The last two numbers mean 10% and 20% of all English verbs.
+My hypothesis was of course wrong. The truth is that most people have no
+idea at all how many irregular verbs there are in English.
+In
+fact there are around 135 (see section 3.3 of Palmer, 1965).
+If a stemming algorithm handles suffix removal
+of all regular verbs correctly, the question arises as to whether it is
+worth making it do the same for the irregular forms. Conflating fought and
+fight, for example, could be useful in IR queries about boxing. It seems
+easy: you make a list of the irregular verbs and create a mapping of the
+past and p.p. forms to the main form. We can call the process
+English verb respelling. But when you try it, numerous problems arise. Are
+forsake, beseech, cleave really verbs of contemporary English? If so, what
+is the p.p. of cleave?
+Or take the verb stride, which is common enough. What is its p.p.? My
+Concise Oxford English Dictionary says it is stridden (6), but have we ever
+heard this word used? (‘I have stridden across the paving.’)
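+
+A first attempt at English verb respelling is just a lookup table. The
+entries here are only the examples from the table above; a realistic list
+would need around 135 verbs, together with decisions about the rare and
+homonymous forms.
+
```python
# Sketch of English verb respelling: map irregular past and p.p. forms
# to the base verb.
IRREGULAR = {
    "rang": "ring", "rung": "ring",
    "rose": "rise", "risen": "rise",
    "slept": "sleep",
    "fought": "fight",
    "came": "come",
    "went": "go", "gone": "go",
    "hit": "hit",
}

def respell_verb(word):
    return IRREGULAR.get(word, word)

print(respell_verb("fought"))  # -> fight
print(respell_verb("harmed"))  # -> harmed (regular verbs untouched)
```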
+
+
+
+To compose a realistic list for English verb respelling we therefore need to
+judge word rarity. But among the commoner verb forms even greater problems
+arise because of their use as homonyms. A rose is a type of flower, so
+is it wise
+to conflate rose and rise? Is it wise to conflate
+saw and see when saw can mean a cutting instrument?
+
+
+
+We suddenly get to
+the edge of what it is useful to include in a stemming algorithm. So long as
+a stemming algorithm is built around general rules, the full impact of the
+stemmer on a vocabulary need not be studied too closely. It is sufficient to
+know that the stemmer, judiciously used, improves retrieval performance. But
+when we look at its effect on individual words these issues can no longer be
+ignored. To build even a short list of words into a stemmer for special
+treatment takes us into the area of the dictionary-based stemmer, and the
+problem of determining, for a pair of related words in the dictionary, a
+measure of semantic similarity which tells us whether or not the words
+should be conflated together.
+
+
+
+About half the transformation rules in the Lovins stemmer deal with a
+problem which is similar to that posed by the irregular verbs of English,
+and which ultimately goes back to the irregular forms of second conjugation
+verbs in Latin. We can call it Latin verb respelling. Verbs like
+induce, consume, commit are perfectly regular in modern English, but
+the adjectival and noun forms induction, consumptive, commission that
+derive from them correspond to p.p. forms in Latin.
+You can see the descendants of these Latin irregularities
+in modern Italian, which has commettere with p.p.
+commesso, like our commit and commission, and scendere with
+p.p. sceso like our ascend and ascension (although scendere
+means ‘to go down’ rather than ‘to go up’).
+
+
+
+Latin verb respelling often seems to be more the territory of a stemmer than
+English verb respelling, presumably because Latin verb irregularities
+correspond to consonantal changes at the end of the stem, where the
+stemmer naturally operates, while English verb irregularities more often
+correspond to vowel changes in the middle. Lovins was no doubt
+particularly interested in Latin verb respelling because so many of the
+words affected have scientific usages.
+
+
+
+We can judge that Latin verb respellings constitute a small set because the
+second conjugation verbs of Latin form a small, fixed set. Again,
+looking at Italian, a modern list of irregular verbs contains 150 basic forms
+(nearly all of them second conjugation), not unlike the number of forms in
+English. Extra verbs are formed with prefixes. Corresponding English words
+that exhibit the Latin verb respelling problem
+will be a subset of this system. In fact we
+can offer a Snowball script that does the Latin verb respelling with more
+care. It should be invoked, in the Porter stemmer, after removal of ive or
+ion endings only,
+
+
+define prefix as (
+
+    among (
+
+        'a'   'ab'  'ad'  'al'  'ap'  'col' 'com' 'con' 'cor' 'de'
+        'di'  'dis' 'e'   'ex'  'in'  'inter' 'o' 'ob'  'oc'  'of'
+        'per' 'pre' 'pro' 're'  'se'  'sub' 'suc' 'trans'
+    ) atlimit
+)
+
+define second_conjugation_form as (
+
+    [substring] prefix among (
+
+        'cept'    (<- 'ceiv')    //-e  con de re
+        'cess'    (<- 'ced')     //-e  con ex inter pre re se suc
+        'cis'     (<- 'cid')     //-e  de (20)
+        'clus'    (<- 'clud')    //-e  con ex in oc (26)
+        'curs'    (<- 'cur')     //    re (6)
+        'dempt'   (<- 'deem')    //    re
+        'duct'    (<- 'duc')     //-e  de in re pro (3)
+        'fens'    (<- 'fend')    //    de of
+        'hes'     (<- 'her')     //-e  ad (28)
+        'lis'     (<- 'lid')     //-e  e col (21)
+        'lus'     (<- 'lud')     //-e  al de e
+        'miss'    (<- 'mit')     //    ad com o per re sub trans (29)
+        'pans'    (<- 'pand')    //    ex (23)
+        'plos'    (<- 'plod')    //-e  ex
+        'prehens' (<- 'prehend') //    ap com
+        'ris'     (<- 'rid')     //-e  de (22)
+        'ros'     (<- 'rod')     //-e  cor e
+        'scens'   (<- 'scend')   //    a
+        'script'  (<- 'scrib')   //-e  de in pro
+        'solut'   (<- 'solv')    //-e  dis re (8)
+        'sorpt'   (<- 'sorb')    //    ab (5)
+        'spons'   (<- 'spond')   //    re (25)
+        'sumpt'   (<- 'sum')     //    con pre re (4)
+        'suas'    (<- 'suad')    //-e  dis per (18)
+        'tens'    (<- 'tend')    //    ex in pre (24)
+        'trus'    (<- 'trud')    //-e  ob (27)
+        'vas'     (<- 'vad')     //-e  e (19)
+        'vers'    (<- 'vert')    //    con in re (31)
+        'vis'     (<- 'vid')     //-e  di pro
+    )
+)
+
+
+
+This means that if suas, for example, is preceded by one of the strings
+in prefix, and there is nothing more before the prefix string (which is
+what the atlimit command tests), it is replaced by suad. So dissuas(ion) goes to
+dissuad(e)
+and persuas(ive) to persuad(e). Of course, asuas(ion), absuas(ion),
+adsuas(ion) and so on would get the same treatment, but not being words of
+English that does not really matter. The corresponding Lovins rules are
+shown in brackets.
+This is not quite the end
+of the story, however, because the Latin forms ex + cedere (‘go
+beyond’) pro + cedere (‘go forth’), and sub + cedere
+(‘go after’) give rise to verbs which,
+by an oddity of English orthography, have an extra letter e: exceed, proceed,
+succeed. They can be sorted out in a final respelling step:
+
+
+
+define final_respell as (
+
+    [substring] atlimit among (
+
+        'exced'  (<- 'exceed')
+        'proced' (<- 'proceed')
+        'succed' (<- 'succeed')
+        /* extra forms here perhaps */
+    )
+)
+
+
+
+
+As you might expect, close inspection of this process creates doubts in
+the same way as for English verb respelling. (Should we really conflate
+commission and commit? etc.)
+
+
+
+The other transformation rules are concerned with unusual plurals, mainly
+of Latin or Greek origin, er and re differences, as in parameter and
+parametric, and the sis/tic connection of certain words of Greek origin:
+analysis/analytic, paralysis/paralytic ... (rule 33), and
+hypothesis/hypothetic, kinesis/kinetic ... (rule 32). Again, these
+irregularities might be tackled by forming explicit word lists. Certainly
+rule 30, given as,
+
+
+
+ ent → ens except following m,
+
+
+
+goes somewhat wild when given a general English vocabulary (dent becomes
+dens for example), although it is the only rule that might be said to
+have a damaging effect.
+
+
+
A Lovins shape for the Porter stemmer
+
+
+The 1980 paper (Porter, 1980) may be said to define the ‘pure’ Porter stemmer.
+The stemmer distributed at (7) can be called the ‘real’ Porter
+stemmer, and differs from the pure stemmer in three small respects, which
+are carefully explained. This disparity does not require much excuse,
+since the oldest traceable encodings of the stemmer have always contained
+these differences. There is also a revised stemmer for English, called
+‘Porter2’ and still subject to slight changes. Unless otherwise stated,
+it is the real Porter stemmer which is being studied below.
+
+
+
+The Porter stemmer differs from the Lovins stemmer in a number of
+respects. In the first place, it only takes account of fairly common
+features of English. So rare suffixes are not included, and there is no
+equivalent of Lovins’ transformation rules, other than her rule (1), the
+undoubling of terminal double letters. Secondly, it removes suffixes only
+when the residual stem is fairly substantial. Some suffixes are removed
+only when at least one syllable is left, and most are removed only when at least two
+syllables are left. (One might say that this is based on a guess about the
+way in which the meaning of a stem is related to its length in syllables (8).)
+The Porter stemmer is therefore ‘conservative’ in its removal
+of suffixes, or at least that is how it has often been described. Thirdly,
+it removes suffixes in a series of steps, often reducing a compound suffix
+to its first part, so a step might reduce ibility to ible, where
+ibility is thought of as being ible + ity. Although the
+description of the whole stemmer is a bit complicated, the total number of
+suffixes is quite small — about 60.
+
+
+
+The Porter stemmer has five basic steps. Step 1 removes an
+inflectional suffix. There are only three of these: ed and ing, which are
+verbal, and s, which is verbal (he sings), plural (the songs) or possessive
+(the horses’ hooves), although the rule for s removal is the same in all
+three cases. Step 1 may also restore an e (hoping → hope), undouble a
+double letter pair (hopping → hop), or change y to i (poppy →
+poppi, to match with poppies → poppi.) Steps 2 to 4 remove derivational
+suffixes. So
+ibility may reduce to ible in step 2, and ible itself may be removed in step
+4. Step 5 is for removing final e, and undoubling ll.
+
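The behaviour of step 1 can be sketched in a few lines of Python. This is a toy, not the real step 1 (which has more rules and guards than are shown here); the helper and the function names are invented for this illustration:

```python
VOWELS = "aeiou"

def ends_short_vowel(stem):
    # consonant, vowel, consonant; final consonant not w, x or y
    if len(stem) < 3:
        return False
    c1, v, c2 = stem[-3], stem[-2], stem[-1]
    return (c1 not in VOWELS and v in VOWELS
            and c2 not in VOWELS and c2 not in "wxy")

def step1_toy(word):
    """Toy rendering of step 1: strip ies/ing/ed, then repair the stem."""
    if word.endswith("ies"):
        return word[:-3] + "i"                  # poppies -> poppi
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if not any(ch in VOWELS for ch in stem):
                return word                     # string is not str + ing
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
                return stem[:-1]                # hopping -> hop
            if ends_short_vowel(stem):
                return stem + "e"               # hoping -> hope
            return stem
    return word
```

The two repairs after ing/ed removal (undoubling, e-restoration) are exactly the adjustments mentioned above for hopping and hoping.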
+
+
+A clear advantage of the Lovins stemmer over the Porter stemmer is speed.
+The Porter stemmer has five steps of suffix removal to the Lovins stemmer’s
+one. It is instructive therefore to try and cast the Porter stemmer into
+the shape of the Lovins stemmer, if only for the promise of certain speed
+advantages. As we will see, we learn a few other things from the exercise
+as well.
+
+
+
+First we need a list of endings. The Lovins endings were built up by hand,
+but we can construct a set of endings for the Porter stemmer by writing an
+ending generator that follows the algorithm definition. From an analysis of
+the suffixes in steps 2 to 4 of the Porter stemmer we can construct
+the following diagram:
+
+
+
+
+
+This is not meant to be a linguistic analysis of the suffix structure of
+English, but is merely intended to show how the system of endings works in
+the stemming algorithm. Suffixes combine if their boxes are connected by
+an arrow. So ful combines with ness to make fulness.
+
+
+ ful + ness → fulness
+
+
+The combination is not always a concatenation of the strings
+however, for we have,
+
+
+ able + ity → ability
+ able + ly → ably
+ ate + ion → ation
+ ible + ity → ibility
+ ible + ly → ibly
+ ize + ate + ion → ization
+
+
+The path from ize to ion goes via ate, so we can form ization, but there is
+no suffix izate. Three of the suffixes, ator, ance and ence, do not connect
+into the rest of the diagram, and ance, ence also appear in the forms
+ancy, ency. The letter to the left of the box is going to be the
+condition for the
+removal of the suffix in the box, so
+
+
+ B +-------+ n
+ | ism |
+ +-------+
+
+
+means that ism will be removed if it follows a stem that satisfies
+condition B. On the right of the box is either n, v or hyphen. n means the
+suffix is of noun type. So if a word ends ism it is a noun. v means verb
+type. hyphen means neither: ly (adverbial) and ful, ous (adjectival) are of
+this type. If a suffix is a noun type it can have a plural form (criticism,
+criticisms), so we have to generate isms as well as ism. Again, the
+combining is not just concatenation,
+
+
+ ity + s → ities
+ ness + s → nesses
+
+
+If a suffix has v type, it has s, ed and ing forms,
+
+
+ ize + s → izes
+ ize + ed → ized
+ ize + ing → izing
+
+
+Type v therefore includes type n, and we should read this type as ‘verb or
+noun’, rather than just ‘verb’. For example, condition, with suffix ion, is
+both verb (‘They have been conditioned to behave like that’) and noun
+(‘It is subject to certain conditions’).
+
+
+
+The diagram is therefore a scheme for generating combined derivational
+suffixes, each combination possibly terminated with an inflectional suffix.
+A problem is that it contains a loop in
+
+
+ ize → ate → ion → al → ize → ...
+
+
+suggesting suffixes of the form izationalizational... We break the loop by
+limiting the number of joined derivational suffixes of diagram 1 to four.
+(Behaviour of the Porter stemmer shows that removal of five combined
+derivational suffixes is never desirable, even supposing five ever combine.)
+We can then generate 181 endings, with their removal codes. But 75 of these
+suffixes do not occur as endings in V, and they can be eliminated as rare
+forms, leaving 106. Alphabetically, the endings begin,
+
+
+
+The eliminated rare forms are shown bracketed.
+
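The generation process itself is easy to sketch. The Python below encodes only the fragment of the diagram quoted in the text (ful → ness, able/ible → ity/ly, and the ize → ate → ion → al loop) together with the join respellings listed earlier; the real generator of Appendix 4 covers the full diagram and also suppresses intermediate forms such as izate, which never stand alone as suffixes:

```python
# Fragment of the suffix diagram: which suffix may follow which.
EDGES = {
    "ize": ["ate"], "ate": ["ion"], "ion": ["al"], "al": ["ize"],
    "able": ["ity", "ly"], "ible": ["ity", "ly"], "ful": ["ness"],
}

def join(base, nxt):
    """Combine suffix strings, applying the join respellings."""
    if nxt == "ity" and base.endswith(("able", "ible")):
        return base[:-2] + "ility"      # able + ity -> ability
    if nxt == "ly" and base.endswith(("able", "ible")):
        return base[:-1] + "y"          # able + ly -> ably
    if nxt == "ion" and base.endswith("ate"):
        return base[:-3] + "ation"      # ate + ion -> ation
    if base.endswith("e") and nxt[0] in "aeiou":
        return base[:-1] + nxt          # ize + ate -> izate
    return base + nxt

def generate(max_suffixes=4):
    """All joined suffixes of at most max_suffixes parts; the limit
    of four breaks the ize -> ate -> ion -> al -> ize loop."""
    out = set()
    def walk(string, node, depth):
        out.add(string)
        if depth < max_suffixes:
            for nxt in EDGES.get(node, []):
                walk(join(string, nxt), nxt, depth + 1)
    for start in EDGES:
        walk(start, start, 1)
    return out
```

With the full diagram, inflectional forms appended, and the rare endings removed, a generator of this shape produces the ending list described in the text.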
+
+
+The 106 endings are arranged in a file as a list of strings followed by
+condition letter,
+
+
+ 'abilities' B
+ 'ability' B
+ 'able' B
+ 'ables' B
+ 'ably' B
+ 'al' B
+ ....
+
+
+This ending list is generated by running the ANSI C program shown in
+Appendix 4, and line-sorting the result into a file,
+and this file is called in by the get directive in the Snowball script of
+Appendix 2, which is the Porter stemming algorithm laid out in the style of
+the Lovins algorithm. In fact, precise equivalence cannot be achieved, but
+in V only 137 words stem differently, which is 0.4% of V. There are 10
+removal conditions, compared with Lovins’ 29, and 11 transformation or
+respelling rules, compared with Lovins’ 35. We can describe the process in
+Lovins style, once we have got over a few preliminaries.
+
+
+
+We have to distinguish y as a vowel from y as a consonant. We treat initial
+y, and y before vowel, as a consonant, and make it upper case. Thereafter
+a, e, i, o, u and y are vowels, and the other lower case letters and Y are
+consonants. If [C] stands for zero or more consonants, C for one or more
+consonants, and V for one or more vowels, then a stem of shape [C]VC has
+length 1s (1 syllable), of shape [C]VCVC length 2s, and so on.
+
+
+
+A stem ends with a short vowel if the ending has the form cvx, where c is a
+consonant, v a vowel, and x a consonant other than w, x or Y.
+(Short vowel endings with ed and ing imply loss of an e from
+the stem, as in removing = remove + ing.)
+
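These definitions translate directly into code. Here is a minimal Python rendering (the function names are mine, and the y-marking implements just the two cases given):

```python
def mark_ys(word):
    # y is a consonant (upper-cased) when initial, or before a vowel
    out = list(word)
    for i, ch in enumerate(out):
        if ch == "y" and (i == 0 or (i + 1 < len(word) and word[i + 1] in "aeiou")):
            out[i] = "Y"
    return "".join(out)

def syllable_length(stem):
    """Length in syllables: [C]VC is 1s, [C]VCVC is 2s, and so on.
    Counts vowel-to-consonant transitions."""
    w = mark_ys(stem)
    is_vowel = [ch in "aeiouy" for ch in w]   # y counts, Y does not
    return sum(1 for i in range(1, len(w))
               if is_vowel[i - 1] and not is_vowel[i])

def ends_short_vowel(stem):
    # cvx: consonant, vowel, consonant other than w, x or Y
    w = mark_ys(stem)
    if len(w) < 3:
        return False
    c, v, x = w[-3], w[-2], w[-1]
    vowels = "aeiouy"
    return (c not in vowels and v in vowels
            and x not in vowels and x not in "wxY")
```

So hop has length 1s and ends with a short vowel, while hope does not (its last letter is a vowel) and snow does not (w is excluded).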
+
+
+Here are the removal conditions,
+
+
+
+
+
A
Minimum stem length = 1s
+
B
Minimum stem length = 2s
+
C
Minimum stem length = 2s and remove ending only after s or t
+
D
Minimum stem length = 2s and do not remove ending after m
+
E
Remove ending only after e or ous after minimum stem length 1s
+
F
Remove ending only after ss or i
+
G
Do not remove ending after s
+
H
Remove ending only if stem contains a vowel
+
I
Remove ending only if stem contains a vowel and does not end in e
+
J
Remove ending only after ee after minimum stem length 1s
+
+
+
+
+In condition J the stem must end ee, and the part of the stem before the
+ee must have minimum length 1s. Condition E is similar.
+
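The conditions that do not involve the syllable-length measure can be written straight down as predicates on the candidate stem. A Python sketch of F to I (A to E and J also need the 1s/2s measure, so they are left out; the dictionary shape is mine):

```python
VOWELS = "aeiouy"

def has_vowel(stem):
    return any(ch in VOWELS for ch in stem)

# Conditions F to I as predicates on the candidate stem.
CONDITIONS = {
    "F": lambda stem: stem.endswith(("ss", "i")),                  # for es
    "G": lambda stem: not stem.endswith("s"),                      # for s
    "H": has_vowel,                                                # for ing(s)
    "I": lambda stem: has_vowel(stem) and not stem.endswith("e"),  # for ed
}
```

For example, witnesses passes F (the stem witness ends ss), while glass keeps its s because the stem glas ends s and G fails.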
+
+
+Here are the respelling rules, defined with the help of the removal
+conditions. In each case, the stem being tested does not include the string
+at the end which has been identified for respelling.
+
+
+
1)
Remove e if A, or if B and the stem does not end with a short vowel
+
2)
Remove l if B and the stem ends with l
+
3)
enci/ency → enc if A, otherwise → enci
+
4)
anci/ancy → anc if A, otherwise → anci
+
5)
ally → al if A, otherwise → alli
+
6)
ently → ent if A, otherwise → entli
+
7)
ator → at if A
+
8)
logi/logy → log if A, otherwise → logi
+
9)
bli/bly → bl if A, otherwise → bli
+
10)
bil → bl if stem ends vowel after A
+
11)
y/Y → i if stem contains a vowel
+
+
+The 106 endings are distributed among conditions A to E as A(5), B(87),
+C(8), D(3) and E(1). F to J deal with the purely inflectional endings: F
+with es, G with s, H with ing and ings, I with ed and J with d.
+There is however one point at which the Lovins structure breaks down, in that
+removal of ed and ing(s) after conditions I and H requires a special
+adjustment that cannot be left to a separate transformation rule. It is to
+undouble the last letter, and to restore a final e if the stem has length 1s
+and ends with a short vowel (so shopping loses a p and becomes shop,
+sloping gains an e and becomes slope.)
+
+
+
+The Porter stemmer cast into this form runs significantly faster than the
+multi-stage stemmer — about twice as fast in tests with Snowball.
+
+
+
+We will call the Porter stemmer P, the Lovins stemmer L, and this Lovins
+version of the Porter stemmer LP. As we have said, P and LP are not identical,
+but stem 137 of the 29,401 words of V differently.
+
+
+
+A major cause of difference is unexpected suffix combinations. These can be
+subdivided into combinations of what seem to be suffixes but are not, and
+rare combinations of valid suffixes.
+
+
+
+The first case is illustrated by the word disenchanted. P stems this to
+disench, first taking off suffix ed, and then removing ant, which is
+a suffix in English, although not a suffix in this word. P also stems
+disenchant to disench, so the two words disenchant and
+disenchanted are conflated by P, even though an error is made in the
+stemming process. But ant is a noun type suffix, and so does not combine
+with ed. anted is therefore omitted from the suffix list of LP, so LP
+stems disenchanted to disenchant, but disenchant to disench.
+
+
+
+This illustrates a frequently encountered problem in stemming. S1
+and S2 are suffixes of a language, but the combination
+S1S2 is
+not. A word has the form xS1, where x is some string, but in
+xS1, S1 is not actually a suffix, but part of the stem.
+S2 is a valid suffix for this word, so xS1S2 is
+another word in the language. An algorithmic stemmer stems xS1 to
+x in error. If presented with xS1S2 it can either
+(a) stem it to xS1, knowing S1 cannot be a suffix in
+this context, or (b) stem it to x, ignoring the knowledge to be
+derived from the presence of S2. (a) gives the correct stemming
+of at least xS1S2, although the stemming of xS1
+will be wrong, while (b) overstems both words, but at least achieves
+their conflation. In other words (a) fails to conflate the two forms, but
+may achieve correct conflations of xS1S2 with similar forms
+xS1S3, xS1S4 etc., while (b) conflates
+the two forms, but at the risk of additional false conflations. Often a study
+of the results of a stemming strategy on a sample vocabulary leads one to
+prefer approach (b) to (a) for certain classes of ending. This is
+true in particular of the inflectional endings of English, which is why the
+removals in step 1 of P are not remembered in some state variable, which
+records whether the ending just removed is verb-type, noun-or-verb-type etc.
+On balance you get better results by throwing that information away, and then
+the many word pairs on the pattern of disenchant / disenchanted will
+conflate together.
+
+
+
+Other examples from V can be given: in misrepresenting, ent is
+not a suffix, and enting not a valid suffix combination; in
+witnessed, ness is not a suffix, and nessed not a valid
+suffix combination.
+
+
+
+This highlights a disadvantage of stemmers that work with a fixed list of
+endings. To get the flexibility of context-free ending removal, we need to
+build in extra endings which are not grammatically correct (like anted =
+ant + ed), and this adds considerably to the burden of constructing
+the list. In fact L does not include anted, but it does include for
+example antic (ant + ic), which may be serving a similar
+purpose.
+
+
+
+For the second case, the rare combinations of valid suffixes, one may instance
+ableness. Here again the multi-step stemmer makes life easier. P removes
+ness in step 3 and able in step 4, but without making any necessary
+connection. L has ableness as an ending, dictionaries contain many
+ableness words, and it is an easy matter to make the connection across from
+able to ness in diagram 1 and generate extra endings. Nevertheless the
+ending is very rare in actual use. For example, Dickens’ Nicholas Nickleby
+contains no examples, Bleak House contains two, in the same sentence:
+
+
+
+ I was sure you would feel it yourself and would excuse the
+ reasonableness of MY feelings when coupled with the known
+ excitableness of my little woman.
+
+
+
+reasonableness is perhaps the commonest word in English of this form, and
+excitableness (instead of excitability) is there for contrast. Thackeray’s
+Vanity Fair, a major source in testing out P and Porter2, contains one
+word of this form, charitableness. One may say of this word that it is
+inevitably rare, because it has no really distinct
+meaning from the simpler charity, but that it has to be formed by adding
+ableness rather than ability, because the repeated ity in charity +
+ability is morphologically unacceptable. Other rare combinations are
+ateness, entness
+and eds (as in intendeds and beloveds).
+fuls is another interesting case. The ful suffix, usually adjectival,
+can sometimes create nouns, giving plurals such as mouthfuls and
+spoonfuls. But in longer words sful is a more ‘elegant’ plural
+(handbagsful, dessertspoonsful).
+
+
+
+These account for most of the differences, but there are a few others.
+
+
+
+One is in forms like bricklayers → bricklai (P), bricklay (LP).
+Terminal y is usefully turned to i to help conflate words where y is changed
+to i and es added to form the plural, but this does not happen when
+y
+follows a vowel. LP improves on P here, but the Porter2 algorithm makes the
+same improvement, so we have nothing to learn.
+There is also a difference in words ending lle or lles,
+quadrille → quadril (P), quadrill (LP). This is because e and
+l
+removal are successive in step 5 of P, and done as alternatives in the
+respelling rules
+of LP. In LP this is not quite correct, since
+Lovins makes it clear that her transformation rules should be
+applied in succession. Even so, LP seems better than P, suggesting
+that step 5b of P (undouble l) should not have been attempted after e removal
+in step 5a. So here is a possible small improvement to Porter2. Another
+small, but quite interesting difference, is the condition attached to the
+ative ending. The ending generator makes B the removal condition by a
+natural process, but in P its removal condition is A. This goes back to step
+3 as originally presented in the paper of 1980:
+
+
+ (m>0) ICATE → IC
+ (m>0) ATIVE →
+ (m>0) ALIZE → AL
+ (m>0) ICITI → IC
+ (m>0) ICAL → IC
+ (m>0) FUL →
+ (m>0) NESS →
+
+(m>0) corresponds to A. With removal condition B, the second line would be
+
+
+ (m>1) ATIVE →
+
+
+which looks slightly incongruous. Nevertheless it is probably correct, because we
+remove a half suffix from icate, alize, icity and ical when the stem
+length is at least 1s, and so we should remove the full ate + ive suffix when the stem
+length is at least 2s. We should not be influenced by ful and ness.
+They are ‘native English’ stems, unlike the other five, which
+have a ‘Romance’ origin, and for these two condition A has been found to
+be more appropriate. In fact putting in this adjustment to Porter2 results in an
+improvement in the small class of words thereby affected.
+
+
+
Conclusion
+
+
+You never learn all there is to know about a computer program, unless the
+program is really very simple. So even after 20 years of regular use,
+we can learn something new about P by creating LP and comparing the
+two. And in the process we learn a lot about L, the Lovins stemmer itself.
+
+
+
+The truth is that the main motivation for studying L was to see how well the
+Snowball system could be used for implementing and analyzing Lovins’
+original work, and the interest in what she had actually achieved in 1968
+only came later. I hope that this short account helps clarify her work, and
+place it in the context of the development of stemmers since then.
+
+
+
Notes
+
+
+The http addresses below have a ‘last visited’ date of December 2001.
+
+
+
+
The Lovins stemmer is available at
+
+
+
+
http://www.cs.waikato.ac.nz/~eibe/stemmers
+
http://sourceforge.net/projects/stemmers
+
+
+
+
See http://www-uilots.let.uu.nl/~uplift/
+
+
See http://snowball.sourceforge.net
+
+
See http://promo.net/pg/
+
+
See http://snowball.sourceforge.net/english/voc.txt
+
+
In looking at verbs with the pattern ride, rode, ridden, Palmer,
+1965, notes that ‘we should perhaps add STRIDE, with past tense strode,
+but without a past participle (there is no *stridden).’
Lovins (1968), p. 25, mentions that a stemming algorithm developed by
+ James L. Dolby in California used a two-syllable minimum stem length as a
+ condition for most of the stemming.
+
+
+
Bibliography
+
+
+Andrews K (1971) The development of a fast conflation algorithm for English.
+Dissertation for the Diploma in Computer Science, Computer Laboratory,
+University of Cambridge.
+
+
+
+Harman D (1991) How effective is suffixing? Journal of the American
+Society for Information Science, 42: 7-15.
+
+
+
+Kraaij W and Pohlmann R (1994) Porter’s stemming algorithm for Dutch. In
+Noordman LGM and de Vroomen WAM, eds. Informatiewetenschap 1994:
+Wetenschappelijke bijdragen aan de derde STINFON Conferentie, Tilburg,
+1994. pp. 167-180.
+
+
+
+Kraaij W and Pohlmann R (1995) Evaluation of a Dutch stemming algorithm.
+Rowley J, ed. The New Review of Document and Text Management, volume 1,
+Taylor Graham, London, 1995. pp. 25-43.
+
+
+
+Krovetz B (1995) Word sense disambiguation for large text databases. PhD
+Thesis. Department of Computer Science, University of Massachusetts
+Amherst.
+
+
+
+Lennon M, Pierce DS, Tarry BD and Willett P (1981) An evaluation of some
+conflation algorithms for information retrieval. Journal of Information
+Science, 3: 177-183.
+
+
+
+Lovins JB (1968) Development of a stemming algorithm. Mechanical
+Translation and Computational Linguistics, 11: 22-31.
+
+The list of 181 endings included by the get directive in the program
+of Appendix 2. The numbers to the right show their frequency of occurrence
+in the sample vocabulary. The 75 rare endings are shown commented out.
+
+An ANSI C program which will generate on stdout the raw ending list
+(endings with condition letters) from which the list of Appendix 3 is
+constructed.
+
+This is a revised version of Martin Porter’s paper which was published as part
+of the Karen Sparck Jones Festschrift of 2005.
+
+
+
+Charting a New Course: Progress in Natural Language Processing and
+Information Retrieval: A Festschrift for Professor Karen Sparck Jones, edited
+by John Tait, Amsterdam: Kluwer, 2005.
+
+
+
Lovins Revisited
+
+
+Martin Porter, December 2001 (revised November 2008).
+
+
+
Abstract
+
+ The Lovins stemming algorithm for English is analysed, and compared
+ with the Porter stemming algorithm, using Snowball, a language designed
+ specifically for the development of stemming algorithms. It is shown
+ how the algorithms manage to function in a similar way, while appearing
+ to be quite different. The Porter algorithm is recoded in the style of
+ the Lovins algorithm, which leads to the discovery of a few possible
+ improvements.
+
+
+
Preamble
+
+
+This is a festschrift paper, so I am allowed to begin on a personal note.
+In 1979 I was working with Keith van Rijsbergen and Stephen Robertson on a
+British Library funded IR project to investigate the selection of good
+index terms, and one of the things we found ourselves having to do was to
+establish a document test collection from some raw data that had been sent
+to us on a magnetic tape by Peter Vaswani of the National Physical
+Laboratory. I was the tame programmer in the project, so it was my job to
+set up the test collection.
+
+
+
+On the whole it did not prove too difficult. The data we received was a
+collection of about 11,000 documents (titles and short abstracts), 93
+queries — in a free text form, and relevance judgements. All the text was
+in upper case without punctuation, and there were one or two marker
+characters to act as field terminators. By modern standards the data was
+really very small indeed, but at the time it was considerably larger than
+any of the other test collections we had. What you had to do was to cast it
+into a standard form
+for experimental work. You represented terms and documents by numbers, and
+created flat files in text form corresponding to the queries, relevance
+assessments, and term to document index. One process however was less
+straightforward. On their way to becoming numeric terms, the words of the
+source text were put through a process of linguistic normalisation called
+suffix stripping, in which certain derivational and inflectional suffixes
+attached to the words were removed. There was a standard piece of software
+used in Cambridge at that time to do this, written in 1971 by Keith
+Andrews (Andrews, 1971) as part of a Diploma Project.
+One of the courses in
+Cambridge is the one year post-graduate Diploma in Computer Science. Each
+student on the course is required to do a special project, which includes
+writing a significant piece of software — significant in the sense of being
+both useful and substantial.
+Keith's piece of software was more useful than most, and it continued to be
+used as a suffix stripping program, or stemmer, for many years after it was
+written.
+
+
+
+Now by an odd chance I was privy to much of Keith Andrews’ original
+thinking at the time that he was doing the work. The reason for this was
+that in 1971 I was looking for a house in Cambridge, and the base I was
+operating from was a sleeping bag on the living room floor of an old friend
+called John Dawson, who was Keith’s diploma supervisor. Keith used to come round
+and discuss stemming algorithms with him, while I formed a mute audience. I
+learnt about the Lovins stemming algorithm of 1968 (Lovins, 1968),
+and must I think have
+at least looked at her paper then, since I know it was not new to me when I
+saw it again in 1979. Andrews and Dawson’s view of Lovins’ work was that it did not go far
+enough. There needed to be many more suffixes, and more complex rules to
+determine the criteria for their removal. Much of their discussion was
+about new suffixes to add to the list, and removal rules. It was interesting
+therefore to find myself needing to use Andrews’ work eight years later,
+and questioning some of its assumptions. Did you need that many suffixes?
+Did the rules need to be so complicated? Perhaps one would do better to
+break composite suffixes into smaller units and remove them piecemeal.
+And perhaps syllables would be a better count of stem length than letters.
+So I wrote my own stemmer, which became known as the Porter stemmer, and
+which was published in 1980 (Porter, 1980).
+
+
+
+I must explain where Karen Sparck Jones fits into all of this. Keith
+Andrews’ piece of work was originally suggested by Karen as a Diploma
+student project, and she was able to use the Andrews stemmer in her IR
+experiments throughout the seventies. In 1979 however Karen had moved much
+more into the field of Natural Language Processing and Artificial
+Intelligence, and by then had two or three research students in that field
+just writing up their PhDs (only one of whom I really got to know — John
+Tait, the editor of this volume). So we were in contact, but not working
+together. That again was an odd chance: that Karen had been my research
+supervisor in a topic other than IR, and that when later I was doing IR
+research at Cambridge I was not working with Karen. While I was engaged on
+writing the stemmer, Karen showed some justifiable irritation that I had
+become interested in a topic so very remote from the one for which we had
+received the British Library funding. Nevertheless, she came into my room
+one day, said, ‘Look, if you're getting interested in stemming, you’d
+better read this,’ and handed me the 1968 issue of Mechanical
+Translation that contains the Lovins paper. I still have this issue with
+Karen’s name across the top. (And I hope she didn't expect it back!)
+
+
+
+Another 20 years have gone by, and I have been studying the Lovins stemmer
+again, really because I was looking for examples to code up in Snowball, a
+small string processing language I devised in the latter half of 2001
+particularly adapted for writing stemming algorithms. Lovins’ stemmer
+strikes me now as a fine piece of work, for which she never quite received
+the credit she deserved. It was the first stemmer for English set out as
+an algorithm that described the stemming process exactly. She explained
+how it was intended to be used to improve IR performance, in just the way
+in which stemmers are used today. It is not seriously short of suffixes:
+the outstanding omissions are the plural forms ements and ents
+corresponding to her ement and ent, and it is easy enough to add
+them into the definition. It performs well in practice. In fact it is
+still in use, and can be downloaded in various languages from the net (1).
+The tendency since 1980 has been to attach the name ‘Porter’ to any
+language stemming process that does not use a dictionary, even when it is
+quite dissimilar to the original Porter stemmer (witness the Dutch Porter
+stemmer of Kraaij and Pohlmann (2) (Kraaij, 1994 and Kraaij, 1995)), but
+the priority really belongs to Lovins. It also has one clear advantage
+over the Porter algorithm, in that it involves fewer steps. Coded up well,
+it should run a lot faster.
+
+
+
+A number of things intrigued me. Why are the Lovins and Porter stemmers so
+different, when what they do looks so similar? Could the stemmer, in some
+sense, be brought up-to-date? Could the Porter stemmer be cast into the
+Lovins form, and so run faster?
+
+
+
+This paper is about the answers to these questions. In discovering them, I
+have learned a lot more about my own stemmer.
+
+
+
Why stem?
+
+
+It may be worth saying a little on what stemming is all about. We can imagine
+a document with the title,
+
+
+
+ Pre-raphaelitism: A Study of Four Critical Approaches
+
+
+
+and a query, containing the words
+
+
+
+ PRE-RAPHAELITE CRITICISM
+
+
+
+We want to match query against title so that ‘Pre-raphaelitism’ matches
+‘PRE-RAPHAELITE’ and ‘Critical’ matches ‘CRITICISM’. This leads to the
+idea of removing endings from words as part of the process of extracting index
+terms from documents, a similar process of ending removal being applied to
+queries prior to the match. For example, we would like to remove the endings
+from
+
+so that each word is reduced to ‘critic’. This is the stem, from which the
+other words are formed, so the process as a whole is called stemming. It is
+a feature of English morphology that the part of the word we want to remove is
+at the end — the suffix. But the same is broadly true of French, German and other
+languages of the Indo-European group. It is also true of numerous languages
+outside Indo-European, Finnish for example, although there is a
+boundary beyond which it is not true. So Chinese, where words are simple
+units without affixes, and Arabic, where the stem is modified by
+prefixes and infixes as well as suffixes, lie outside the
+boundary. As an IR technique it therefore has wide applicability. In developing
+stemmers two points were recognised quite early on. One is that the
+morphological regularities that you find in English (or other languages) mean
+that you can attempt to do stemming by a purely algorithmic process. Endings
+al, ally, ism etc. occur throughout English vocabulary, and are
+easy to detect and remove: you don’t need access to an on-line dictionary. The
+other is that the morphological irregularities of English set a limit to the
+success of an algorithmic approach. Syntactically, what look like endings may
+not be endings (offspring is not offspr + ing), and the list of
+endings seems to extend indefinitely (trapez-oid, likeli-hood,
+guardian-ship, Tibet-an, juven-ilia, Roman-esque, ox-en
+...) It is difficult to gauge where to set the cut-off for these rarer forms.
+Semantically, the addition of a suffix may alter the meaning of a word a
+little, a lot, or completely, and morphology alone cannot measure the degree of
+change (prove and provable have closely related meanings; probe and
+probable do not.) This meant that stemming, if employed at all, became the
+most challenging, and the most difficult part of the indexing process.
+
+
+
+In the seventies, stemming might be applied as part of the process of
+establishing a test collection, and when it was there would not usually be any
+attempt to make the stemming process well-defined, or easily repeatable by
+another researcher. This was really because the basis for experiment replication
+was the normalised data that came out of the stemming process, rather than the
+source data plus a description of stemming procedures. Stemming tended to be
+applied, and then forgotten about. But by the 1980s, stemming itself was being
+investigated. Lennon and others (Lennon, 1981) found no substantial differences
+between the use of different stemmers for English. Harman (Harman, 1991)
+challenged the effectiveness of stemming altogether, when she reported no
+substantial differences between using and not using stemming in a series of
+experiments. But later work has been more positive. Krovetz (Krovetz, 1995), for example,
+reported small but significant improvements with stemming over a range of test
+collections.
+
+
+
+Of course, all these experiments assume some IR model which will use stemming in
+a particular way, and will measure just those features that test collections
+are, notoriously, able to measure. We might imagine an IR system where the users
+have been educated in the advantages and disadvantages to be expected from
+stemming, and are able to flag individual search terms to say whether
+they are to be used stemmed or unstemmed. Stemming sometimes improves,
+occasionally degrades, search performance, and this would be the best way of
+using it as an IR facility. Again stemming helps regularise the IR vocabulary,
+which is very useful when preparing a list of terms to present to a user as
+candidates for query expansion. But this advantage too is difficult to quantify.
+
+
+
+An evaluative comparison between the Lovins and later stemmers lies in any case
+outside the scope of this paper, but it is important to
+bear in mind that it is not a straightforward undertaking.
+
+
+
The Lovins Stemmer
+
+
+Structurally, the Lovins stemmer is in four parts, collected together in
+four Appendices A, B, C and D in her paper. Part A is a list of 294
+endings, each with a letter which identifies a condition for whether or
+not the ending should be removed. (I will follow Lovins in using ‘ending’
+rather than ‘suffix’ as a name for the items on the list.)
+Part A therefore looks like this:
+
+
+
+ .11.
+ alistically B
+ arizability A
+ izationally B
+ .10.
+ antialness A
+ arisations A
+ arizations A
+ entialness A
+ .09.
+ allically C
+ antaneous A
+ antiality A
+ . . .
+
+ .01.
+ a A
+ e A
+ i A
+ o A
+ s W
+ y B
+
+
+
+Endings are banked by length, from 11 letters down to 1. Each bank is tried
+in turn until an ending is found which matches the end of the word to be
+stemmed and leaves a stem which satisfies the given condition, when the
+ending is removed. For example condition C says that the stem must have at
+least 4 letters, so bimetallically would lose allically leaving a
+stem bimet of length 5, but metallically would not reduce to
+met, since its length is only 3.
+
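The search can be sketched in Python with a tiny subset of part A (three endings and conditions A to C only; the real stemmer has 294 endings, 29 conditions, and works down the length banks from 11 letters to 1):

```python
# A tiny subset of part A: ending -> condition letter.
ENDINGS = {"alistically": "B", "izationally": "B", "allically": "C"}

# Conditions A to C; every condition also implies a minimum stem length of 2.
CONDITIONS = {
    "A": lambda stem: len(stem) >= 2,
    "B": lambda stem: len(stem) >= 3,
    "C": lambda stem: len(stem) >= 4,
}

def remove_ending(word):
    """Longest matching ending whose removal leaves an acceptable stem."""
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending):
            stem = word[:-len(ending)]
            if CONDITIONS[ENDINGS[ending]](stem):
                return stem
    return word
```

On this subset, bimetallically loses allically to leave bimet, while metallically is left alone, as in the example above.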
+
+
+There are 29 such conditions, called A to Z, AA, BB and CC, and they
+constitute part B of the stemmer. Here they are (* stands for any letter):
+
+
+
+
+
A
No restrictions on stem
+
B
Minimum stem length = 3
+
C
Minimum stem length = 4
+
D
Minimum stem length = 5
+
E
Do not remove ending after e
+
F
Minimum stem length = 3 and do not remove ending after e
+
G
Minimum stem length = 3 and remove ending only after f
+
H
Remove ending only after t or ll
+
I
Do not remove ending after o or e
+
J
Do not remove ending after a or e
+
K
Minimum stem length = 3 and remove ending only after l, i or
+u*e
+
L
Do not remove ending after u, x or s, unless s follows
+o
+
M
Do not remove ending after a, c, e or m
+
N
Minimum stem length = 4 after s**, elsewhere = 3
+
O
Remove ending only after l or i
+
P
Do not remove ending after c
+
Q
Minimum stem length = 3 and do not remove ending after l or
+n
+
R
Remove ending only after n or r
+
S
Remove ending only after dr or t, unless t follows t
+
T
Remove ending only after s or t, unless t follows o
+
U
Remove ending only after l, m, n or r
+
V
Remove ending only after c
+
W
Do not remove ending after s or u
+
X
Remove ending only after l, i or u*e
+
Y
Remove ending only after in
+
Z
Do not remove ending after f
+
AA
Remove ending only after d, f, ph, th, l, er, or, es or t
+
BB
Minimum stem length = 3 and do not remove ending after met or
+ryst
+
CC
Remove ending only after l
+
+
+
+
+There is an implicit assumption in each condition, A included, that the minimum
+stem length is 2.
+
+
+
+This is much less complicated than it seems at first. Conditions A to D
+depend on a simple measure of minimum stem length, and E and F are slight
+variants of A and B. Out of the 294 endings, 259 use one of these
+6 conditions. The remaining 35 endings use the other 23 conditions, so
+conditions G, H ... CC have fewer than 2 suffixes each, on average. What is
+happening here is that Lovins is trying to capture a rule which gives a
+good removal criterion for one ending, or a small number of similar
+endings. She does not explain the thinking behind the conditions, but it is
+often not too difficult to reconstruct. Here for example are the last few
+conditions with their endings,
+
+
+
+
+Y (early, ealy, eal, ear). collinearly, multilinear are
+stemmed.
+
+Z (eature). misfeature does not lose eature.
+
+AA (ite). acolouthite, hemimorphite lose ite, ignite and
+requite retain it.
+
+BB (allic, als, al). Words ending metal, crystal retain
+al.
+
+CC (inity). crystallinity → crystall, but affinity,
+infinity are unaltered.
+
+
+
+
+Part C of the Lovins stemmer is a set of 35 transformation rules used to
+adjust the letters at the end of the stem. These rules are invoked after the
+stemming step proper, irrespective of whether an ending was actually
+removed. Here are about half of them, with examples to show the type of
+transformation intended (letters in square brackets indicate the full form
+of the words),
+
+
+
+
+
1)
bb
→
b
rubb[ing] → rub
+
ll
→
l
controll[ed] → control
+
mm
→
m
trimm[ed] → trim
+
rr
→
r
abhorr[ing] → abhor
+
2)
iev
→
ief
believ[e] → belief
+
3)
uct
→
uc
induct[ion] → induc[e]
+
4)
umpt
→
um
consumpt[ion] → consum[e]
+
5)
rpt
→
rb
absorpt[ion] → absorb
+
6)
urs
→
ur
recurs[ive] → recur
+
7a)
metr
→
meter
parametr[ic] → paramet[er]
+
8)
olv
→
olut
dissolv[ed] → dissolut[ion]
+
11)
dex
→
dic
index → indic[es]
+
16)
ix
→
ic
matrix → matric[es]
+
18)
uad
→
uas
persuad[e] → persuas[ion]
+
19)
vad
→
vas
evad[e] → evas[ion]
+
20)
cid
→
cis
decid[e] → decis[ion]
+
21)
lid
→
lis
elid[e] → elis[ion]
+
31)
ert
→
ers
convert[ed] → convers[ion]
+
33)
yt
→
ys
analytic → analysis
+
34)
yz
→
ys
analyzed → analysed
+
+
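Rules of this kind are straightforward to simulate. The following Python sketch uses a handful of the rules above; treating them as an ordered list in which the first matching pattern rewrites the end of the stem is an assumption of the sketch, not a claim about Lovins' own control flow.

```python
# Sketch of part C respelling: try each (pattern, replacement) pair in
# turn and rewrite the end of the stem on the first match. Only a few
# of the 35 rules are shown.
RULES = [
    ("bb", "b"), ("ll", "l"), ("mm", "m"), ("rr", "r"),   # rule 1
    ("iev", "ief"),                                        # rule 2
    ("umpt", "um"),                                        # rule 4
    ("rpt", "rb"),                                         # rule 5
]

def respell(stem):
    for old, new in RULES:
        if stem.endswith(old):
            return stem[:-len(old)] + new
    return stem

print(respell("believ"))    # belief
print(respell("controll"))  # control
print(respell("absorpt"))   # absorb
```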
+
+
+Finally, part D suggests certain relaxed matching rules between query terms
+and index terms when the stemmer has been used to set up an IR system, but
+we can regard that as not being part of the stemmer proper.
+
+
+
The Lovins stemmer in Snowball
+
+
+Snowball is a string processing language designed with the idea of making
+the definition of stemming algorithms much more rigorous. The Snowball
+compiler translates a Snowball script into a thread-safe ANSI C module,
+where speed of execution is a major design consideration. The resulting
+stemmers are pleasantly fast, and will process one million or so words a
+second on a high-performance modern PC. The Snowball website (3) gives a
+full description of the language, and also presents stemmers for a range of
+natural languages. Each stemmer is written out as a formal algorithm, with
+the corresponding Snowball script following. The algorithm definition acts
+as program comment for the Snowball script, and the Snowball script gives a
+precise definition to the algorithm. The ANSI C code with the
+same functionality can also be inspected, and sample vocabularies in source
+and stemmed form can be used for test purposes.
+An essential function of
+the Snowball script is therefore comprehensibility — it should be fully understood
+by the reader of the script, and Snowball has been designed with this in mind.
+It contrasts interestingly in this respect with a system like Perl.
+Perl has a very big definition. Writing your own scripts in Perl is easy,
+after the initial learning hurdle, but understanding other scripts can be
+quite hard. The size of the language means that there are many different
+ways of doing the same thing, which gives programmers the opportunity of
+developing highly idiosyncratic styles. Snowball has a small, tight
+definition. Writing Snowball is much less easy than writing Perl, but on
+the other hand once it is written it is fairly easy to understand
+(or at least one hopes that it is). This is
+illustrated by the Lovins stemmer in Snowball, which is given in Appendix
+1. There is a very easy and natural correspondence
+between the different parts of the stemmer definition in Lovins' original
+paper and their Snowball equivalents.
+For example, the Lovins conditions A, B ... CC code up very neatly
+into routines with the same name. Taking condition L,
+
+
+
+ L Do not remove ending after u, x or s, unless s follows
+ o
+
+
+
+corresponds to
+
+
+[% highlight("
+ define L as ( test hop 2 not 'u' not 'x' not ('s' not 'o') )
+") %]
+
+
+When L is called, we are at the right end of the stem, moving left towards the
+front of the word. Each Lovins condition has an implicit test for a stem of
+length 2, and this is done by [% highlight_inline('test hop 2') %], which sees if it is possible to
+hop two places left. If it is not, the routine immediately returns with a
+false signal, otherwise it carries on. It tests that the character at the
+right hand end is not u, and also not x, and also not s following a letter
+which is not o. This is equivalent to the Lovins condition. Here is not of
+course the place to give the exact semantics, but you can quickly get
+the feel of the language by comparing the 29 Lovins conditions with their
+Snowball definitions.
+
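For comparison, the same condition can be written out directly in Python. This is an illustrative sketch which inspects the stem from the right, just as the Snowball routine does.

```python
# Sketch of Lovins condition L: implicit minimum stem length of 2, then
# do not remove the ending after u, after x, or after s unless the s
# follows o.
def condition_L(stem):
    if len(stem) < 2:                  # the 'test hop 2' check
        return False
    if stem[-1] in "ux":
        return False
    if stem[-1] == "s" and stem[-2] != "o":
        return False
    return True

print(condition_L("chaos"))   # True  (s follows o)
print(condition_L("basis"))   # False (s follows i)
```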
+
+
+Something must be said about the [% highlight_inline('among') %] feature of Snowball however,
+since this is central to the efficient implementation of stemmers. It is
+also the one part of Snowball that requires just a little effort to
+understand.
+
+
+
+At its simplest, [% highlight_inline('among') %] can be used to test for alternative strings. The
+[% highlight_inline('among') %]s used in the definition of condition AA and the undouble
+routine have this form. In Snowball you can write
+
+
+[% highlight("
+ ( 'sh' or 's' or 't' )  ( 'o' or 'i' )  'p'
+") %]
+
+
+which will match the various forms shop, ship, sop, sip, top, tip. The
+order is important, because if [% highlight_inline("'sh'") %] and [% highlight_inline("'s'") %] are swapped over, the
+[% highlight_inline("'s'") %] would match the first letter of ship, while [% highlight_inline("'o'") %] or [% highlight_inline("'i'") %]
+would fail to match with the following [% highlight_inline("'h'") %] — in other words the pattern
+matching has no backtracking. But it can also be written as
+
+
+[% highlight("
+ among ( 'sh' 's' 't' )  among ( 'o' 'i' )  'p'
+") %]
+
+The order of the strings in each [% highlight_inline('among') %] is not important, because the
+match will be with the longest of all the strings that can match. In
+Snowball the implementation of [% highlight_inline('among') %] is based on the binary-chop idea,
+but has been carefully optimised. For example, in the Lovins stemmer, the
+main [% highlight_inline('among') %] in the endings routine has 294 different strings of average
+length 5.2 characters. A search for an ending involves accessing a number
+of characters within these 294 strings. The order is going to be
+K log2(294), or 8.2K, where K is a number that one hopes will
+be small, although one must certainly expect it to be greater than 1. It
+turns out that, for the successive words of a standard test vocabulary,
+K averages to 1.6, so for each word there are about 13 character
+comparisons needed to determine whether it has one of the Lovins endings.
+
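Both points, longest-match selection and the log2 estimate, can be checked with a naive Python sketch. This assumes nothing about the real optimised binary-chop implementation; it simply shows the observable behaviour.

```python
import math

# 'among' picks the longest of its strings that matches the end of the
# word, so the order in which the strings are listed does not matter.
def among_longest(strings, word):
    best = ""
    for s in strings:
        if word.endswith(s) and len(s) > len(best):
            best = s
    return best

print(among_longest(["s", "es", "eries"], "groceries"))  # eries

# the binary-chop order of magnitude for 294 endings:
print(round(math.log2(294), 1))  # 8.2
```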
+
+
+Each string in an [% highlight_inline('among') %] construction can be followed by a routine name. The
+routine returns a true/false signal, and then the [% highlight_inline('among') %] searches for the
+longest substring whose associated routine gives a true signal. A string not
+followed by a routine name can be thought of as a string which is associated
+with a routine that does nothing except give a true signal. This is the way
+that the [% highlight_inline('among') %] in the endings routine works, where indeed every string is
+followed by a routine name.
+
+
+
+More generally, lists of strings in the [% highlight_inline('among') %] construction can be followed
+by bracketed commands, which are obeyed if one of the strings in the list is
+picked out for the longest match. The syntax is then
+
+
+[% highlight("
+ among( S11 S12 ... (C1)
+        S21 S22 ... (C2)
+        ...
+        Sn1 Sn2 ... (Cn)
+ )
+") %]
+
+
+where the Sij are strings, optionally followed by their routine names,
+where the Sij are strings, optionally followed by their routine names,
+and the Ci are Snowball command sequences. The semantics is a bit
+like a switch in C, where the switch is on a string rather than a numerical
+value:
+
+
+
+ switch(...) {
+ case S11: case S12: ... C1; break;
+ case S21: case S22: ... C2; break;
+ ...
+
+ case Sn1: case Sn2: ... Cn; break;
+ }
+
+
+
+The [% highlight_inline('among') %] in the respell routine has this form.
+
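In Python one might imitate this string-switch with a dictionary of actions keyed on suffixes. This is an illustrative sketch only; the suffix table is invented, and Snowball's among additionally folds the longest-match search into the same construct.

```python
# Sketch: dispatch on the longest matching suffix, like a C switch whose
# cases are strings. Each suffix maps to the command to run on the
# remaining stem.
def dispatch(word, table):
    matches = [s for s in table if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)            # longest match wins
    return table[suffix](word[:-len(suffix)])

table = {
    "ies": lambda stem: stem + "y",   # e.g. ponies -> pony
    "ing": lambda stem: stem,         # e.g. hopping -> hopp (undoubled later)
    "s":   lambda stem: stem,
}
print(dispatch("ponies", table))   # pony
```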
+
+
+The full form however is to use [% highlight_inline('among') %] with a preceding [% highlight_inline('substring') %], with
+[% highlight_inline('substring') %] and [% highlight_inline('among') %] possibly separated by further commands.
+[% highlight_inline('substring') %]
+triggers the test for the longest matching substring, and the [% highlight_inline('among') %] then
+causes the corresponding bracketed command to be obeyed. At a simple
+level this can be used to cut down the size of the code, in that
+
+More importantly, [% highlight_inline('substring') %] and [% highlight_inline('among') %] can work in different contexts. For
+example, [% highlight_inline('substring') %] could be used to test for the longest string, matching from
+right to left, while the commands in the [% highlight_inline('among') %] could operate in a left to
+right direction. In the Lovins stemmer, [% highlight_inline('substring') %] is used in this style:
+
+
+[% highlight("
+ [substring] among ( ... )
+") %]
+
+The two square brackets are in fact individual commands, so before the [% highlight_inline('among') %]
+come three commands. [% highlight_inline('[') %] sets a lower marker, [% highlight_inline('substring') %] is obeyed, searching
+for the strings in the following among, and then [% highlight_inline(']') %] sets an upper marker.
+The region between the lower and upper markers is called the slice, and this
+may subsequently be copied, replaced or deleted.
+
+
+
+It was possible to get the Lovins stemmer working in Snowball very quickly.
+The Sourceforge versions (1) could be used to get the long list of endings and
+to help with the debugging. There was however one problem, that rules 24 and
+30 of part C conflicted. They are given as
+
+
+
+ 24) end → ens except following s
+ ...
+ 30) end → ens except following m
+
+
+
+This had not been noticed in the Sourceforge implementations, but
+immediately gave rise to a compilation error in Snowball. Experience
+suggested that I was very unlikely to get this problem resolved. Only a few
+months before, I had hit a point in a stemming algorithm where
+something did not quite make sense. The algorithm had been published just a
+few years before, and contacting at least one of the authors was quite easy.
+But I never sorted it out. The author I traced was not au fait
+with the linguistic background, and the language expert had been swallowed
+up in the wilds of America. So what chance would I have here? Even if I was
+able to contact Lovins, it seemed to me inconceivable that she would have
+any memory of, or even interest in, a tiny problem in a paper which she
+published 33 years ago. But the spirit of academic enquiry forced me to
+venture the attempt. After pursuing a number of red herrings, email contact
+was finally made.
+
+
+
+Her reply was a most pleasant surprise.
+
+
+
+ ... The explanation is both mundane and exciting. You have just found
+ a typo in the MT article, which I was unaware of all these years, and I
+ suspect has puzzled a lot of other people too. The original paper, an
+ MIT-published memorandum from June 1968, has rule 30 as
+
+
+
+ ent → ens except following m
+
+
+
+ and that is undoubtedly what it should be ...
+
+
+
+
An analysis of the Lovins stemmer
+
+
+It is very important in understanding the Lovins stemmer to know something
+of the IR background of the late sixties. In the first place there was an
+assumption that IR was all, or mainly, about the retrieval of
+technical scientific papers, and research projects were set up accordingly.
+I remember being shown, in about 1968, a graph illustrating the
+‘information explosion’, as it was understood at the time, which showed
+just the rate of growth of publications of scientific papers in various
+different domains over the previous 10 or 20 years. Computing resources
+were very precious, and they could not be wasted by setting up IR systems
+for information that was, by comparison, merely frivolous (articles in
+popular magazines, say). And even in 1980, when I was working in IR, the
+data I was using came from the familiar, and narrow, scientific domain.
+Lovins was working with Project Intrex (Overhage, 1966), where the data came from
+papers in materials science and engineering.
+
+
+
+Secondly, the idea of indexing on every word in a document, or even looking
+at every word before deciding whether or not to put it into an index, would
+have seemed quite impractical, even though it might have been recognised as
+theoretically best. In the first place, the computing resources necessary to
+store and analyse complete documents in machine readable form were absent, and in the
+second, the rigidities of the printing industry almost guaranteed that one
+would never get access to them.
+A stemmer, therefore, would be seen as something not
+applied to general text but to certain special words, and in the case of the
+Lovins stemmer, the plan was to apply it to the subject terms that were used
+to categorize each document. Subsequently it would be used with each word
+in a query, where it
+was hoped that the vocabulary of the queries would match the vocabulary of
+the catalogue of subject terms.
+
+
+
+This accounts for: —
+
+
+
+
The emphasis on the scientific vocabulary. This can be seen in the
+endings, which include oidal, on, oid, ide, for words like colloidal,
+proton, spheroid, nucleotide. It can be seen in the transformation rules,
+with their concern for Greek sis and Latin ix suffixes. And also it can be
+seen in the word samples of the paper (magnesia, magnesite, magnesian,
+magnesium, magnet, magnetic, magneto etc. of Fig. 2).
+
+
+
The slight shortage of plural forms. The subject terms would naturally
+have been mainly in the singular, and one might also expect the same of
+query terms.
+
+
+
The surprising shortness of the allowed minimum stems — usually 2
+letters. A controlled technical vocabulary will contain longish words, and
+the problem of minimum stem lengths only shows up with shorter words.
+
+
+
+
+If we take a fairly ordinary vocabulary of modern English, derived from
+non-scientific writing, it is interesting to see how much of the Lovins
+stemmer does not actually get used. We use vocabulary V, derived from a
+sample of modern texts from Project Gutenberg (4). V can be inspected
+at (5). It contains 29,401 words, and begins
+
+We find that 22,311, or about 76%, of the words in V have one of the
+294 endings removed if passed through the Lovins stemmer. Of this 76%, over
+half (55%) of the removals are done by just six of the endings, the breakdown
+being,
+
+ s (13%) ed (12%) e (10%) ing (10%) es (6%) y (4%)
+
+If, on the other hand, you look at the least frequent endings, 51% of them
+do only 1.4% of the removals. So half the removals in V are caused by just
+2% of the endings in the stemmer, and 1.4% of the removals in V are caused
+by half the endings in the stemmer. In fact 62 of the endings
+(about a fifth) do not lead to any ending removals in V at all. These are
+made up of the rarer ‘scientific’ endings, such as aroid and oidal, and
+long endings, such as alistically and entiality.
+
+
+
+This helps explain why the Porter and Lovins stemmers behave in a fairly
+similar way despite the fact that they look completely different — it is
+because most of the work is being done in just a small part of the stemmer,
+and in that part there is a lot of overlap. Porter and Lovins stem 64% of
+the words in V identically, which is quite high. (By contrast, an
+erroneous but plausibly written Perl script
+advertised on the Web as an implementation of the Porter stemmer
+still proves to stem only 86% of the words in V
+to the same forms that are produced by the Porter stemmer.)
+
+
+
+A feature of the Lovins stemmer that is worth looking at in some detail is
+the transformation rules. People who come to the problem of stemming for
+the first time usually devote a lot of mental energy to the issue of
+morphological irregularity which they are trying to address.
+
+
+
+A good starting point is the verbs of English. Although grammatically
+complex, the morphological forms of the English verb are few, and are
+illustrated by the pattern harm, harms, harming, harmed, where the basic
+verb form adds s, ing and ed to make the other three forms. There are
+certain special rules: to add s to a verb ending ss an e is inserted,
+so pass becomes passes, and adding ed and ing replaces a final e of
+the verb (love to loved, loving), and can cause consonant doubling (hop to
+hopped), but
+apart from this all verbs in the language follow the basic pattern with the
+exception of a finite class of irregular verbs.
+In a regular verb, the addition of ed to the basic verb creates both the
+past form (‘I harmed’) and the p.p. (past participle) form (‘I have
+harmed’). An irregular verb, such as ring, forms its past in some other
+way (‘I rang’), and may have a distinct p.p. (‘I have rung’).
+It is easy to think up more examples,
+
+
+
stem
past
p.p.
+
+
ring
rang
rung
+
rise
rose
risen
+
sleep
slept
slept
+
fight
fought
fought
+
come
came
come
+
go
went
gone
+
hit
hit
hit
+
+
+How many of these verbs are there altogether? On 20 Jan 2000, in order to
+test the hypothesis that the number is consistently over-estimated, I asked
+this question in a carefully worded email to a mixed group of
+about 50
+well-educated
+work colleagues (business rather than academic people). Ten of them replied,
+and here are the
+guesses they made:
+
+
+
+The last two numbers mean 10% and 20% of all English verbs.
+My hypothesis was of course wrong. The truth is that most people have no
+idea at all how many irregular verbs there are in English.
+In
+fact there are around 135 (see section 3.3 of Palmer, 1965).
+If a stemming algorithm handles suffix removal
+of all regular verbs correctly, the question arises as to whether it is
+worth making it do the same for the irregular forms. Conflating fought and
+fight, for example, could be useful in IR queries about boxing. It seems
+easy: you make a list of the irregular verbs and create a mapping of the
+past and p.p. forms to the main form. We can call the process
+English verb respelling. But when you try it, numerous problems arise. Are
+forsake, beseech, cleave really verbs of contemporary English? If so, what
+is the p.p. of cleave?
+Or take the verb stride, which is common enough. What is its p.p.? My
+Concise Oxford English Dictionary says it is stridden (6), but have we ever
+heard this word used? (‘I have stridden across the paving.’)
+
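The list-and-map approach described above can be sketched as follows. The mapping shown is a tiny hand-picked fragment, not a complete list of the roughly 135 irregular verbs, and, as the surrounding discussion notes, entries like rose are debatable because of homonyms.

```python
# Sketch of "English verb respelling": a fixed table mapping past and
# p.p. forms of irregular verbs to the base form. A hand-picked
# fragment only.
IRREGULAR = {
    "rang": "ring", "rung": "ring",
    "rose": "rise", "risen": "rise",
    "slept": "sleep",
    "fought": "fight",
    "came": "come",
    "went": "go", "gone": "go",
}

def respell_verb(word):
    return IRREGULAR.get(word, word)

print(respell_verb("fought"))  # fight
print(respell_verb("harmed"))  # harmed (regular verbs are left alone)
```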
+
+
+To compose a realistic list for English verb respelling we therefore need to
+judge word rarity. But among the commoner verb forms even greater problems
+arise because of their use as homonyms. A rose is a type of flower, so
+is it wise
+to conflate rose and rise? Is it wise to conflate
+saw and see when saw can mean a cutting instrument?
+
+
+
+We suddenly get to
+the edge of what it is useful to include in a stemming algorithm. So long as
+a stemming algorithm is built around general rules, the full impact of the
+stemmer on a vocabulary need not be studied too closely. It is sufficient to
+know that the stemmer, judiciously used, improves retrieval performance. But
+when we look at its effect on individual words these issues can no longer be
+ignored. To build even a short list of words into a stemmer for special
+treatment takes us into the area of the dictionary-based stemmer, and the
+problem of determining, for a pair of related words in the dictionary, a
+measure of semantic similarity which tells us whether or not the words
+should be conflated together.
+
+
+
+About half the transformation rules in the Lovins stemmer deal with a
+problem which is similar to that posed by the irregular verbs of English,
+and which ultimately goes back to the irregular forms of second conjugation
+verbs in Latin. We can call it Latin verb respelling. Verbs like
+induce, consume, commit are perfectly regular in modern English, but
+the adjectival and noun forms induction, consumptive, commission that
+derive from them correspond to p.p. forms in Latin.
+You can see the descendants of these Latin irregularities
+in modern Italian, which has commettere with p.p.
+commesso, like our commit and commission, and scendere with
+p.p. sceso like our ascend and ascension (although scendere
+means ‘to go down’ rather than ‘to go up’).
+
+
+
+Latin verb respelling often seems to be more the territory of a stemmer than
+English verb respelling, presumably because Latin verb irregularities
+correspond to consonantal changes at the end of the stem, where the
+stemmer naturally operates, while English verb irregularities more often
+correspond to vowel changes in the middle. Lovins was no doubt
+particularly interested in Latin verb respelling because so many of the
+words affected have scientific usages.
+
+
+
+We can judge that Latin verb respellings constitute a small set because the
+second conjugation verbs of Latin form a small, fixed set. Again,
+looking at Italian, a modern list of irregular verbs contains 150 basic forms
+(nearly all of them second conjugation), not unlike the number of forms in
+English. Extra verbs are formed with prefixes. Corresponding English words
+that exhibit the Latin verb respelling problem
+will be a subset of this system. In fact we
+can offer a Snowball script that does the Latin verb respelling with more
+care. It should be invoked, in the Porter stemmer, after removal of ive or
+ion endings only,
+
+[% highlight("
+define prefix as (
+
+ among (
+
+ 'a' 'ab' 'ad' 'al' 'ap' 'col' 'com' 'con' 'cor' 'de'
+ 'di' 'dis' 'e' 'ex' 'in' 'inter' 'o' 'ob' 'oc' 'of'
+ 'per' 'pre' 'pro' 're' 'se' 'sub' 'suc' 'trans'
+ ) atlimit
+)
+
+define second_conjugation_form as (
+
+ [substring] prefix among (
+
+ 'cept' (<-'ceiv') //-e con de re
+ 'cess' (<-'ced') //-e con ex inter pre re se suc
+ 'cis' (<-'cid') //-e de (20)
+ 'clus' (<-'clud') //-e con ex in oc (26)
+ 'curs' (<-'cur') // re (6)
+ 'dempt' (<-'deem') // re
+ 'duct' (<-'duc') //-e de in re pro (3)
+ 'fens' (<-'fend') // de of
+ 'hes' (<-'her') //-e ad (28)
+ 'lis' (<-'lid') //-e e col (21)
+ 'lus' (<-'lud') //-e al de e
+ 'miss' (<-'mit') // ad com o per re sub trans (29)
+ 'pans' (<-'pand') // ex (23)
+ 'plos' (<-'plod') //-e ex
+ 'prehens' (<-'prehend') // ap com
+ 'ris' (<-'rid') //-e de (22)
+ 'ros' (<-'rod') //-e cor e
+ 'scens' (<-'scend') // a
+ 'script' (<-'scrib') //-e de in pro
+ 'solut' (<-'solv') //-e dis re (8)
+ 'sorpt' (<-'sorb') // ab (5)
+ 'spons' (<-'spond') // re (25)
+ 'sumpt' (<-'sum') // con pre re (4)
+ 'suas' (<-'suad') //-e dis per (18)
+ 'tens' (<-'tend') // ex in pre (24)
+ 'trus' (<-'trud') //-e ob (27)
+ 'vas' (<-'vad') //-e e (19)
+ 'vers' (<-'vert') // con in re (31)
+ 'vis' (<-'vid') //-e di pro
+ )
+)
+") %]
+
+This means that if suas, for example, is preceded by one of the strings
+in [% highlight_inline('prefix') %], and there is nothing more before the prefix string (which is
+what the
+[% highlight_inline('atlimit') %]
+command tests), it is replaced by suad. So dissuas(ion) goes to
+dissuad(e)
+and persuas(ive) to persuad(e). Of course, asuas(ion), absuas(ion),
+adsuas(ion) and so on would get the same treatment, but not being words of
+English that does not really matter. The corresponding Lovins rules are
+shown in brackets.
+This is not quite the end
+of the story, however, because the Latin forms ex + cedere (‘go
+beyond’) pro + cedere (‘go forth’), and sub + cedere
+(‘go after’) give rise to verbs which,
+by an oddity of English orthography, have an extra letter e: exceed, proceed,
+succeed. They can be sorted out in a final respelling step:
+
+
+[% highlight("
+define final_respell as (
+
+ [substring] atlimit among(
+
+ 'exced' (<-'exceed')
+ 'proced' (<-'proceed')
+ 'succed' (<-'succeed')
+ /* extra forms here perhaps */
+ )
+)
+") %]
+
+
+As you might expect, close inspection of this process creates doubts in
+the same way as for English verb respelling. (Should we really conflate
+commission and commit? etc.)
+
+
+
+The other transformation rules are concerned with unusual plurals, mainly
+of Latin or Greek origin, er and re differences, as in parameter and
+parametric, and the sis/tic connection of certain words of Greek origin:
+analysis/analytic, paralysis/paralytic ... (rule 33), and
+hypothesis/hypothetic, kinesis/kinetic ... (rule 32). Again, these
+irregularities might be tackled by forming explicit word lists. Certainly
+rule 30, given as,
+
+
+
+ ent → ens except following m,
+
+
+
+goes somewhat wild when given a general English vocabulary (dent becomes
+dens for example), although it is the only rule that might be said to
+have a damaging effect.
+
+
+
A Lovins shape for the Porter stemmer
+
+
+The 1980 paper (Porter, 1980) may be said to define the ‘pure’ Porter stemmer.
+The stemmer distributed at (7) can be called the ‘real’ Porter
+stemmer, and differs from the pure stemmer in three small respects, which
+are carefully explained. This disparity does not require much excuse,
+since the oldest traceable encodings of the stemmer have always contained
+these differences. There is also a revised stemmer for English, called
+‘Porter2’ and still subject to slight changes. Unless otherwise stated,
+it is the real Porter stemmer which is being studied below.
+
+
+
+The Porter stemmer differs from the Lovins stemmer in a number of
+respects. In the first place, it only takes account of fairly common
+features of English. So rare suffixes are not included, and there is no
+equivalent of Lovins’ transformation rules, other than her rule (1), the
+undoubling of terminal double letters. Secondly, it removes suffixes only
+when the residual stem is fairly substantial. Some suffixes are removed
+only when at least one syllable is left, and most are removed only when at least two
+syllables are left. (One might say that this is based on a guess about the
+way in which the meaning of a stem is related to its length in syllables (8).)
+The Porter stemmer is therefore ‘conservative’ in its removal
+of suffixes, or at least that is how it has often been described. Thirdly,
+it removes suffixes in a series of steps, often reducing a compound suffix
+to its first part, so a step might reduce ibility to ible, where
+ibility is thought of as being ible + ity. Although the
+description of the whole stemmer is a bit complicated, the total number of
+suffixes is quite small — about 60.
+
+
+
+The Porter stemmer has five basic steps. Step 1 removes an
+inflectional suffix. There are only three of these: ed and ing, which are
+verbal, and s, which is verbal (he sings), plural (the songs) or possessive
+(the horses’ hooves), although the rule for s removal is the same in all
+three cases. Step 1 may also restore an e (hoping → hope), undouble a
+double letter pair (hopping → hop), or change y to i (poppy →
+poppi, to match with poppies → poppi.) Steps 2 to 4 remove derivational
+suffixes. So
+ibility may reduce to ible in step 2, and ible itself may be removed in step
+4. Step 5 is for removing final e, and undoubling ll.
+
+
+
+A clear advantage of the Lovins stemmer over the Porter stemmer is speed.
+The Porter stemmer has five steps of suffix removal to the Lovins stemmer’s
+one. It is instructive therefore to try and cast the Porter stemmer into
+the shape of the Lovins stemmer, if only for the promise of certain speed
+advantages. As we will see, we learn a few other things from the exercise
+as well.
+
+
+
+First we need a list of endings. The Lovins endings were built up by hand,
+but we can construct a set of endings for the Porter stemmer by writing an
+ending generator that follows the algorithm definition. From an analysis of
+the suffixes in steps 2 to 4 of the Porter stemmer we can construct
+the following diagram:
+
+
+
+ [Diagram 1: scheme showing how the suffixes of steps 2 to 4 combine]
+
+This is not meant to be a linguistic analysis of the suffix structure of
+English, but is merely intended to show how the system of endings works in
+the stemming algorithm. Suffixes combine if their boxes are connected by
+an arrow. So ful combines with ness to make fulness.
+
+
+ ful + ness → fulness
+
+
+The combination is not always a concatenation of the strings
+however, for we have,
+
+
+ able + ity → ability
+ able + ly → ably
+ ate + ion → ation
+ ible + ity → ibility
+ ible + ly → ibly
+ ize + ate + ion → ization
+
+
+The path from ize to ion goes via ate, so we can form ization, but there is
+no suffix izate. Three of the suffixes, ator, ance and ence, do not connect
+into the rest of the diagram, and ance, ence also appear in the forms
+ancy, ency. The letter to the left of the box is going to be the
+condition for the
+removal of the suffix in the box, so
+
+
+ B +-------+ n
+ | ism |
+ +-------+
+
+
+means that ism will be removed if it follows a stem that satisfies
+condition B. On the right of the box is either n, v or hyphen. n means the
+suffix is of noun type. So if a word ends ism it is a noun. v means verb
+type. hyphen means neither: ly (adverbial) and ful, ous (adjectival) are of
+this type. If a suffix is a noun type it can have a plural form (criticism,
+criticisms), so we have to generate isms as well as ism. Again, the
+combining is not just concatenation,
+
+
+ ity + s → ities
+ ness + s → nesses
+
+
+If a suffix has v type, it has s, ed and ing forms,
+
+
+ ize + s → izes
+ ize + ed → ized
+ ize + ing → izing
+
+
+Type v therefore includes type n, and we should read this type as ‘verb or
+noun’, rather than just ‘verb’. For example, condition, with suffix ion, is
+both verb (‘They have been conditioned to behave like that’) and noun
+(‘It is subject to certain conditions’).
+
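An ending generator therefore needs a join operation which is plain concatenation except in the listed special cases. A Python sketch, with the special-case table taken from the combinations above and an assumed e-dropping rule for joins like ize + ed:

```python
# Sketch of suffix joining for the ending generator: concatenation,
# with the special combinations listed in the text handled explicitly,
# and a final e dropped before a suffix starting with a vowel.
SPECIAL = {
    ("able", "ity"): "ability", ("able", "ly"): "ably",
    ("ate", "ion"): "ation",
    ("ible", "ity"): "ibility", ("ible", "ly"): "ibly",
    ("ity", "s"): "ities", ("ness", "s"): "nesses",
}

def join(a, b):
    if (a, b) in SPECIAL:
        return SPECIAL[(a, b)]
    if a.endswith("e") and b[0] in "aeiou":
        return a[:-1] + b              # ize + ed -> ized, ize + ing -> izing
    return a + b

print(join("ful", "ness"))              # fulness
print(join("ize", "ing"))               # izing
print(join(join("ize", "ate"), "ion"))  # ization (via ate, no suffix izate)
```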
+
+
+The diagram is therefore a scheme for generating combined derivational
+suffixes, each combination possibly terminated with an inflectional suffix.
+A problem is that it contains a loop in
+
+
+ ize → ate → ion → al → ize → ...
+
+
+suggesting suffixes of the form izationalizational... We break the loop by
+limiting the number of joined derivational suffixes of diagram 1 to four.
+(Behaviour of the Porter stemmer shows that removal of five combined
+derivational suffixes is never desirable, even supposing five ever combine.)
+We can then generate 181 endings, with their removal codes. But 75 of these
+suffixes do not occur as endings in V, and they can be eliminated as rare
+forms, leaving 106. Alphabetically, the endings begin,
+
+
+
+The eliminated rare forms are shown bracketed.
+
+
+
+The 106 endings are arranged in a file as a list of strings followed by
+condition letter,
+
+
+ 'abilities' B
+ 'ability' B
+ 'able' B
+ 'ables' B
+ 'ably' B
+ 'al' B
+ ....
+
+
+This ending list is generated by running the ANSI C program shown in
+Appendix 4, and line-sorting the result into a file,
+and this file is called in by the [% highlight_inline('get') %] directive in the Snowball script of
+Appendix 2, which is the Porter stemming algorithm laid out in the style of
+the Lovins algorithm. In fact, precise equivalence cannot be achieved, but
+in V only 137 words stem differently, which is 0.4% of V. There are 10
+removal conditions, compared with Lovins’ 29, and 11 transformation or
+respelling rules, compared with Lovins’ 35. We can describe the process in
+Lovins style, once we have got over a few preliminaries.
+
+
+
+We have to distinguish y as a vowel from y as a consonant. We treat initial
+y, and y before vowel, as a consonant, and make it upper case. Thereafter
+a, e, i, o, u and y are vowels, and the other lower case letters and Y are
+consonants. If [C] stands for zero or more consonants, C for one or more
+consonants, and V for one or more vowels, then a stem of shape [C]VC has
+length 1s (1 syllable), of shape [C]VCVC length 2s, and so on.
+
+
+
+A stem ends with a short vowel if the ending has the form cvx, where c is a
+consonant, v a vowel, and x a consonant other than w, x or Y.
+(Short vowel endings with ed and ing imply loss of an e from
+the stem, as in removing = remove + ing.)
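+
+
+The two definitions above translate directly into code. Here is a minimal
+sketch in C (the function names are illustrative, and the input is assumed
+to be lower case with consonantal y already marked as Y):
+
+
```c
#include <string.h>

/* Vowel test on the marked-up word: y (but not the consonant marker Y)
   counts as a vowel. */
static int is_vowel(char ch) {
    return strchr("aeiouy", ch) != NULL;
}

/* Length in syllables: the number of vowel-to-consonant transitions,
   so [C]VC has length 1s, [C]VCVC length 2s, and so on. */
static int syllable_length(const char *stem) {
    int s = 0, prev_vowel = 0;
    for (const char *p = stem; *p; p++) {
        int v = is_vowel(*p);
        if (prev_vowel && !v) s++;      /* a VC boundary completes a syllable */
        prev_vowel = v;
    }
    return s;
}

/* A stem ends with a short vowel if it ends cvx: a consonant, a vowel,
   then a consonant other than w, x or Y. */
static int ends_short_vowel(const char *stem) {
    size_t n = strlen(stem);
    if (n < 3) return 0;
    return !is_vowel(stem[n-3]) && is_vowel(stem[n-2]) && !is_vowel(stem[n-1])
        && stem[n-1] != 'w' && stem[n-1] != 'x' && stem[n-1] != 'Y';
}
```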
+
+
+
+Here are the removal conditions,
+
+
+
+
+
A
Minimum stem length = 1s
+
B
Minimum stem length = 2s
+
C
Minimum stem length = 2s and remove ending only after s or t
+
D
Minimum stem length = 2s and do not remove ending after m
+
E
Remove ending only after e or ous after minimum stem length 1s
+
F
Remove ending only after ss or i
+
G
Do not remove ending after s
+
H
Remove ending only if stem contains a vowel
+
I
Remove ending only if stem contains a vowel and does not end in e
+
J
Remove ending only after ee after minimum stem length 1s
+
+
+
+
+In condition J the stem must end ee, and the part of the stem before the
+ee must have minimum length 1s. Condition E is similar.
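+
+
+Read this way, condition J might be checked as follows. This is a sketch
+only, with illustrative function names, measuring 1s as a vowel-to-consonant
+transition as defined earlier:
+
+
```c
#include <string.h>

static int vowel_letter(char ch) { return strchr("aeiouy", ch) != NULL; }

/* 1s/2s measure of the first n characters: count vowel-to-consonant
   transitions. */
static int syllable_count(const char *s, size_t n) {
    int count = 0, prev = 0;
    for (size_t i = 0; i < n; i++) {
        int v = vowel_letter(s[i]);
        if (prev && !v) count++;
        prev = v;
    }
    return count;
}

/* Condition J: the stem must end ee, and the part of the stem before the
   ee must have length at least 1s. */
static int cond_J(const char *stem) {
    size_t n = strlen(stem);
    return n >= 2 && strcmp(stem + n - 2, "ee") == 0
        && syllable_count(stem, n - 2) >= 1;
}
```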
+
+
+
+Here are the respelling rules, defined with the help of the removal
+conditions. In each case, the stem being tested does not include the string
+at the end which has been identified for respelling.
+
+
+
1)
Remove e if A, or if B and the stem does not end with a short vowel
+
2)
Remove l if B and the stem ends with l
+
3)
enci/ency → enc if A, otherwise → enci
+
4)
anci/ancy → anc if A, otherwise → anci
+
5)
ally → al if A, otherwise → alli
+
6)
ently → ent if A, otherwise → entli
+
7)
ator → at if A
+
8)
logi/logy → log if A, otherwise → logi
+
9)
bli/bly → bl if A, otherwise → bli
+
10)
bil → bl if the stem ends with a vowel after A
+
11)
y/Y → i if stem contains a vowel
+
+
+The 106 endings are distributed among conditions A to E as A(5), B(87),
+C(8), D(3) and E(1). F to J deal with the purely inflectional endings: F
+with es, G with s, H with ing and ings, I with ed and J with d.
+There is however one point at which the Lovins structure breaks down, in that
+removal of ed and ing(s) after conditions I and H requires a special
+adjustment that cannot be left to a separate transformation rule. It is to
+undouble the last letter, and to restore a final e if the stem has length 1s
+and ends with a short vowel (so shopping loses a p and becomes shop,
+sloping gains an e and becomes slope.)
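+
+
+The special adjustment can be sketched as follows. This is an illustrative
+simplification in C: Porter's exceptions (for example, not undoubling ll,
+ss or zz) are omitted, and the helper names are assumptions:
+
+
```c
#include <string.h>

static int vowel_c(char ch) { return strchr("aeiouy", ch) != NULL; }

/* The 1s/2s measure: number of vowel-to-consonant transitions. */
static int length_s(const char *s) {
    int count = 0, prev = 0;
    for (; *s; s++) {
        int v = vowel_c(*s);
        if (prev && !v) count++;
        prev = v;
    }
    return count;
}

/* cvx test: consonant, vowel, consonant other than w, x or Y. */
static int short_vowel_end(const char *s) {
    size_t n = strlen(s);
    if (n < 3) return 0;
    return !vowel_c(s[n-3]) && vowel_c(s[n-2]) && !vowel_c(s[n-1])
        && s[n-1] != 'w' && s[n-1] != 'x' && s[n-1] != 'Y';
}

/* The adjustment after removing ed or ing(s): undouble a final double
   consonant, otherwise restore a final e when the stem has length 1s and
   ends with a short vowel. The buffer must have room for the added e. */
static void adjust(char *stem) {
    size_t n = strlen(stem);
    if (n >= 2 && stem[n-1] == stem[n-2] && !vowel_c(stem[n-1]))
        stem[n-1] = '\0';                   /* shopp -> shop */
    else if (length_s(stem) == 1 && short_vowel_end(stem))
        strcat(stem, "e");                  /* slop -> slope */
}
```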
+
+
+
+The Porter stemmer cast into this form runs significantly faster than the
+multi-stage stemmer — about twice as fast in tests with Snowball.
+
+
+
+We will call the Porter stemmer P, the Lovins stemmer L, and this Lovins
+version of the Porter stemmer LP. As we have said, P and LP are not identical,
+but stem 137 of the 29,401 words of V differently.
+
+
+
+A major cause of difference is unexpected suffix combinations. These can be
+subdivided into combinations of what seem to be suffixes but are not, and
+rare combinations of valid suffixes.
+
+
+
+The first case is illustrated by the word disenchanted. P stems this to
+disench, first taking off suffix ed, and then removing ant, which is
+a suffix in English, although not a suffix in this word. P also stems
+disenchant to disench, so the two words disenchant and
+disenchanted are conflated by P, even though an error is made in the
+stemming process. But ant is a noun type suffix, and so does not combine
+with ed. anted is therefore omitted from the suffix list of LP, so LP
+stems disenchanted to disenchant, but disenchant to disench.
+
+
+
+This illustrates a frequently encountered problem in stemming. S1
+and S2 are suffixes of a language, but the combination
+S1S2 is
+not. A word has the form xS1, where x is some string, but in
+xS1, S1 is not actually a suffix, but part of the stem.
+S2 is a valid suffix for this word, so xS1S2 is
+another word in the language. An algorithmic stemmer stems xS1 to
+x in error. If presented with xS1S2 it can either
+(a) stem it to xS1, knowing S1 cannot be a suffix in
+this context, or (b) stem it to x, ignoring the knowledge to be
+derived from the presence of S2. (a) gives the correct stemming
+of at least xS1S2, although the stemming of xS1
+will be wrong, while (b) overstems both words, but at least achieves
+their conflation. In other words (a) fails to conflate the two forms, but
+may achieve correct conflations of xS1S2 with similar forms
+xS1S3, xS1S4 etc., while (b) conflates
+the two forms, but at the risk of additional false conflations. Often a study
+of the results of a stemming strategy on a sample vocabulary leads one to
+prefer approach (b) to (a) for certain classes of ending. This is
+true in particular of the inflectional endings of English, which is why the
+removals in step 1 of P are not remembered in some state variable that
+records whether the ending just removed is verb-type, noun-or-verb-type etc.
+On balance you get better results by throwing that information away, and then
+the many word pairs on the pattern of disenchant / disenchanted will
+conflate together.
+
+
+
+Other examples from V can be given: in misrepresenting, ent is
+not a suffix, and enting not a valid suffix combination; in
+witnessed, ness is not a suffix, and nessed not a valid
+suffix combination.
+
+
+
+This highlights a disadvantage of stemmers that work with a fixed list of
+endings. To get the flexibility of context-free ending removal, we need to
+build in extra endings which are not grammatically correct (like anted =
+ant + ed), and this adds considerably to the burden of constructing
+the list. In fact L does not include anted, but it does include for
+example antic (ant + ic), which may be serving a similar
+purpose.
+
+
+
+For the second case, the rare combinations of valid suffixes, one may instance
+ableness. Here again the multi-step stemmer makes life easier. P removes
+ness in step 3 and able in step 4, but without making any necessary
+connection. L has ableness as an ending, dictionaries contain many
+ableness words, and it is an easy matter to make the connection across from
+able to ness in diagram 1 and generate extra endings. Nevertheless the
+ending is very rare in actual use. For example, Dickens’ Nicholas Nickleby
+contains no examples, Bleak House contains two, in the same sentence:
+
+
+
+ I was sure you would feel it yourself and would excuse the
+ reasonableness of MY feelings when coupled with the known
+ excitableness of my little woman.
+
+
+
+reasonableness is perhaps the commonest word in English of this form, and
+excitableness (instead of excitability) is there for contrast. Thackeray’s
+Vanity Fair, a major source in testing out P and Porter2, contains one
+word of this form, charitableness. One may say of this word that it is
+inevitably rare, because it has no really distinct
+meaning from the simpler charity, but that it has to be formed by adding
+ableness rather than ability, because the repeated ity in charity +
+ability is morphologically unacceptable. Other rare combinations are
+ateness, entness
+and eds (as in intendeds and beloveds).
+fuls is another interesting case. The ful suffix, usually adjectival,
+can sometimes create nouns, giving plurals such as mouthfuls and
+spoonfuls. But in longer words sful is a more ‘elegant’ plural
+(handbagsful, dessertspoonsful).
+
+
+
+These account for most of the differences, but there are a few others.
+
+
+
+One is in forms like bricklayers → bricklai (P), bricklay (LP).
+Terminal y is usefully turned to i to help conflate words where y is changed
+to i and es added to form the plural, but this does not happen when
+y
+follows a vowel. LP improves on P here, but the Porter2 algorithm makes the
+same improvement, so we have nothing to learn.
+There is also a difference in words ending lle or lles,
+quadrille → quadril (P), quadrill (LP). This is because e and
+l
+removal are successive in step 5 of P, and done as alternatives in the
+respelling rules
+of LP. In LP this is not quite correct, since
+Lovins makes it clear that her transformation rules should be
+applied in succession. Even so, LP seems better than P, suggesting
+that step 5b of P (undouble l) should not have been attempted after e removal
+in step 5a. So here is a possible small improvement to Porter2. Another
+small, but quite interesting difference, is the condition attached to the
+ative ending. The ending generator makes B the removal condition by a
+natural process, but in P its removal condition is A. This goes back to step
+3 as originally presented in the paper of 1980:
+
+
+ (m>0) ICATE → IC
+ (m>0) ATIVE →
+ (m>0) ALIZE → AL
+ (m>0) ICITI → IC
+ (m>0) ICAL → IC
+ (m>0) FUL →
+ (m>0) NESS →
+
+(m>0) corresponds to A. With removal condition B, the second line would be
+
+
+ (m>1) ATIVE →
+
+
+which looks slightly incongruous. Nevertheless it is probably correct, because we
+remove a half suffix from icate, alize, icity and ical when the stem
+length is at least 1s, and so we should remove the full ate + ive suffix when the stem
+length is at least 2s. We should not be influenced by ful and ness.
+They are ‘native English’ stems, unlike the other five, which
+have a ‘Romance’ origin, and for these two condition A has been found to
+be more appropriate. In fact putting in this adjustment to Porter2 results in an
+improvement in the small class of words thereby affected.
+
+
+
Conclusion
+
+
+You never learn all there is to know about a computer program, unless the
+program is really very simple. So even after 20 years of regular use,
+we can learn something new about P by creating LP and comparing the
+two. And in the process we learn a lot about L, the Lovins stemmer itself.
+
+
+
+The truth is that the main motivation for studying L was to see how well the
+Snowball system could be used for implementing and analyzing Lovins’
+original work, and the interest in what she had actually achieved in 1968
+only came later. I hope that this short account helps clarify her work, and
+place it in the context of the development of stemmers since then.
+
+
+
Notes
+
+
+The http addresses below have a ‘last visited’ date of December 2001.
+
+
+
+
The Lovins stemmer is available at
+
+
+
+
http://www.cs.waikato.ac.nz/~eibe/stemmers
+
http://sourceforge.net/projects/stemmers
+
+
+
+
See http://www-uilots.let.uu.nl/~uplift/
+
+
See http://snowball.sourceforge.net
+
+
See http://promo.net/pg/
+
+
See http://snowball.sourceforge.net/english/voc.txt
+
+
In looking at verbs with the pattern ride, rode, ridden, Palmer,
+1965, notes that ‘we should perhaps add STRIDE, with past tense strode,
+but without a past participle (there is no *stridden).’
Lovins (1968), p. 25, mentions that a stemming algorithm developed by
+ James L. Dolby in California used a two-syllable minimum stem length as a
+ condition for most of the stemming.
+
+
+
Bibliography
+
+
+Andrews K (1971) The development of a fast conflation algorithm for English.
+Dissertation for the Diploma in Computer Science, Computer Laboratory,
+University of Cambridge.
+
+
+
+Harman D (1991) How effective is suffixing? Journal of the American
+Society for Information Science, 42: 7-15.
+
+
+
+Kraaij W and Pohlmann R (1994) Porter’s stemming algorithm for Dutch. In
+Noordman LGM and de Vroomen WAM, eds. Informatiewetenschap 1994:
+Wetenschappelijke bijdragen aan de derde STINFON Conferentie, Tilburg,
+1994. pp. 167-180.
+
+
+
+Kraaij W and Pohlmann R (1995) Evaluation of a Dutch stemming algorithm.
+Rowley J, ed. The New Review of Document and Text Management, volume 1,
+Taylor Graham, London, 1995. pp. 25-43.
+
+
+
+Krovetz B (1995) Word sense disambiguation for large text databases. PhD
+Thesis. Department of Computer Science, University of Massachusetts
+Amherst.
+
+
+
+Lennon M, Pierce DS, Tarry BD and Willett P (1981) An evaluation of some
+conflation algorithms for information retrieval. Journal of Information
+Science, 3: 177-183.
+
+
+
+Lovins JB (1968) Development of a stemming algorithm. Mechanical
+Translation and Computational Linguistics, 11: 22-31.
+
Appendix 3
+
+
+The list of 181 endings included by the [% highlight_inline('get') %] directive in the program
+of Appendix 2. The numbers to the right show their frequency of occurrence
+in the sample vocabulary. The 75 rare endings are shown commented out.
+
+
+[% highlight("
+ 'abilities' B /* (3) */
+ 'ability' B /* (14) */
+ 'able' B /* (293) */
+ 'ables' B /* (4) */
+ 'ably' B /* (68) */
+ 'al' B /* (285) */
+ 'alism' B /* (5) */
+// 'alisms' B /* (-) */
+ 'alities' B /* (7) */
+ 'ality' B /* (24) */
+ 'alization' B /* (1) */
+// 'alizationed' B /* (-) */
+// 'alizationing' B /* (-) */
+// 'alizations' B /* (-) */
+ 'alize' B /* (2) */
+ 'alized' B /* (4) */
+// 'alizer' B /* (-) */
+// 'alizered' B /* (-) */
+// 'alizering' B /* (-) */
+// 'alizers' B /* (-) */
+// 'alizes' B /* (-) */
+// 'alizing' B /* (-) */
+ 'ally' B /* (78) */
+ 'alness' B /* (2) */
+// 'alnesses' B /* (-) */
+ 'als' B /* (46) */
+ 'ance' B /* (93) */
+ 'ances' B /* (30) */
+ 'ancies' B /* (2) */
+ 'ancy' B /* (18) */
+ 'ant' B /* (92) */
+ 'ants' B /* (29) */
+ 'ate' B /* (261) */
+ 'ated' B /* (208) */
+ 'ately' B /* (38) */
+ 'ates' B /* (73) */
+ 'ating' B /* (119) */
+ 'ation' B /* (356) */
+ 'ational' B /* (4) */
+// 'ationalism' B /* (-) */
+// 'ationalisms' B /* (-) */
+// 'ationalities' B /* (-) */
+// 'ationality' B /* (-) */
+// 'ationalize' B /* (-) */
+// 'ationalized' B /* (-) */
+// 'ationalizes' B /* (-) */
+// 'ationalizing' B /* (-) */
+ 'ationally' B /* (2) */
+// 'ationalness' B /* (-) */
+// 'ationalnesses' B /* (-) */
+// 'ationals' B /* (-) */
+// 'ationed' B /* (-) */
+// 'ationing' B /* (-) */
+ 'ations' B /* (139) */
+ 'ative' B /* (40) */
+ 'atively' B /* (4) */
+// 'ativeness' B /* (-) */
+// 'ativenesses' B /* (-) */
+ 'atives' B /* (7) */
+// 'ativities' B /* (-) */
+// 'ativity' B /* (-) */
+ 'ator' B /* (25) */
+ 'ators' B /* (10) */
+ 'ement' B /* (70) */
+// 'emently' B /* (-) */
+ 'ements' B /* (31) */
+ 'ence' B /* (100) */
+ 'ences' B /* (25) */
+ 'encies' B /* (9) */
+ 'ency' B /* (41) */
+ 'ent' D /* (154) */
+ 'ently' D /* (53) */
+ 'ents' D /* (25) */
+ 'er' B /* (613) */
+ 'ered' B /* (44) */
+ 'ering' B /* (31) */
+ 'ers' B /* (281) */
+ 'ful' A /* (163) */
+ 'fulness' A /* (31) */
+// 'fulnesses' A /* (-) */
+ 'fuls' A /* (5) */
+ 'ibilities' B /* (2) */
+ 'ibility' B /* (10) */
+ 'ible' B /* (53) */
+ 'ibles' B /* (2) */
+ 'ibly' B /* (14) */
+ 'ic' B /* (142) */
+ 'ical' B /* (91) */
+// 'icalism' B /* (-) */
+// 'icalisms' B /* (-) */
+// 'icalities' B /* (-) */
+ 'icality' B /* (1) */
+// 'icalize' B /* (-) */
+// 'icalized' B /* (-) */
+// 'icalizer' B /* (-) */
+// 'icalizered' B /* (-) */
+// 'icalizering' B /* (-) */
+// 'icalizers' B /* (-) */
+// 'icalizes' B /* (-) */
+// 'icalizing' B /* (-) */
+ 'ically' B /* (59) */
+// 'icalness' B /* (-) */
+// 'icalnesses' B /* (-) */
+ 'icals' B /* (2) */
+ 'icate' B /* (9) */
+ 'icated' B /* (7) */
+// 'icately' B /* (-) */
+ 'icates' B /* (4) */
+ 'icating' B /* (3) */
+ 'ication' B /* (23) */
+// 'icational' B /* (-) */
+// 'icationals' B /* (-) */
+// 'icationed' B /* (-) */
+// 'icationing' B /* (-) */
+ 'ications' B /* (8) */
+ 'icative' B /* (2) */
+// 'icatively' B /* (-) */
+// 'icativeness' B /* (-) */
+// 'icativenesses' B /* (-) */
+// 'icatives' B /* (-) */
+// 'icativities' B /* (-) */
+// 'icativity' B /* (-) */
+ 'icities' B /* (1) */
+ 'icity' B /* (5) */
+ 'ics' B /* (21) */
+ 'ion' C /* (383) */
+ 'ional' C /* (18) */
+// 'ionalism' C /* (-) */
+// 'ionalisms' C /* (-) */
+ 'ionalities' C /* (1) */
+ 'ionality' C /* (1) */
+// 'ionalize' C /* (-) */
+// 'ionalized' C /* (-) */
+// 'ionalizer' C /* (-) */
+// 'ionalizered' C /* (-) */
+// 'ionalizering' C /* (-) */
+// 'ionalizers' C /* (-) */
+// 'ionalizes' C /* (-) */
+// 'ionalizing' C /* (-) */
+ 'ionally' C /* (12) */
+ 'ionalness' C /* (1) */
+// 'ionalnesses' C /* (-) */
+ 'ionals' C /* (1) */
+ 'ioned' C /* (13) */
+ 'ioning' C /* (3) */
+ 'ions' C /* (192) */
+ 'ism' B /* (33) */
+ 'isms' B /* (5) */
+ 'ities' B /* (62) */
+ 'ity' B /* (236) */
+ 'ive' B /* (132) */
+ 'ively' B /* (34) */
+ 'iveness' B /* (14) */
+// 'ivenesses' B /* (-) */
+ 'ives' B /* (12) */
+// 'ivities' B /* (-) */
+ 'ivity' B /* (1) */
+ 'ization' B /* (4) */
+// 'izational' B /* (-) */
+// 'izationals' B /* (-) */
+// 'izationed' B /* (-) */
+// 'izationing' B /* (-) */
+ 'izations' B /* (1) */
+ 'ize' B /* (32) */
+ 'ized' B /* (32) */
+ 'izer' B /* (3) */
+// 'izered' B /* (-) */
+// 'izering' B /* (-) */
+ 'izers' B /* (1) */
+ 'izes' B /* (6) */
+ 'izing' B /* (30) */
+ 'ly' E /* (135) */
+ 'ment' B /* (105) */
+// 'mently' B /* (-) */
+ 'ments' B /* (50) */
+ 'ness' A /* (428) */
+ 'nesses' A /* (21) */
+ 'ous' B /* (340) */
+ 'ously' B /* (130) */
+ 'ousness' B /* (22) */
+// 'ousnesses' B /* (-) */
+") %]
+
+
Appendix 4
+
+
+An ANSI C program which will generate on stdout the raw ending list
+(endings with condition letters) from which the list of Appendix 3 is
+constructed.
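+
+
+The core idea of such a generator — a depth-limited walk over the suffix
+diagram — can be illustrated with a small sketch. This is not the Appendix 4
+program: the node set is just the four suffixes on the loop noted earlier,
+chains are joined with '+' instead of the real combining rules, and the
+output format is an assumption:
+
+
```c
#include <stdio.h>
#include <string.h>

/* The four suffixes lie on the loop ize -> ate -> ion -> al -> ize;
   succ[i] gives the successor of name[i] in that cycle. */
static const char *name[] = { "ize", "ate", "ion", "al" };
static const int   succ[] = { 1, 2, 3, 0 };

/* Emit every chain of one to four joined suffixes from every starting
   point, cutting each chain off at the four-suffix limit that breaks the
   loop; return the number of chains emitted. */
static int generate(void) {
    int count = 0;
    for (int start = 0; start < 4; start++) {
        char buf[64] = "";
        int cur = start;
        for (int depth = 1; depth <= 4; depth++) {
            if (depth > 1) strcat(buf, "+");
            strcat(buf, name[cur]);
            printf("%s\n", buf);        /* e.g. ize+ate+ion at depth 3 */
            count++;
            cur = succ[cur];
        }
    }
    return count;
}
```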
+
+The first ever published stemming algorithm was: Lovins JB (1968) Development of
+a stemming algorithm. Mechanical Translation and Computational Linguistics,
+11: 22-31. Julie Beth Lovins’ paper was remarkable for the early date at which
+it was done, and for its seminal influence on later work in
+this area.
+
+
+
+The design of the algorithm was much influenced by the technical vocabulary
+with which Lovins found herself working (subject term keywords attached to
+documents in the materials science and engineering field). The subject term
+list may also have been slightly limiting in that certain common endings
+are not represented (ements and ents for example, corresponding to
+the singular forms ement and ent), and also in that the algorithm's
+treatment of short words, or words with short stems, can be rather
+destructive.
+
+
+
+The Lovins algorithm is noticeably bigger than the Porter algorithm,
+because of its very extensive endings list. But in one way that is used to
+advantage: it is faster. It has effectively traded space for time, and with
+its large suffix set it needs just two major steps to remove a suffix,
+compared with the eight of the Porter algorithm.
+
+
+
+The stemmer uses a list of 294 endings, 29 conditions and 35
+transformation rules. Each ending is associated with one of the
+conditions. In the first step the longest ending is found which satisfies
+its associated condition, and is removed. In the second step the 35 rules
+are applied to transform the ending. The second step is done whether or not
+an ending is removed in the first step.
+
+
+
+For example, nationally has the ending ationally, with associated
+condition, B, ‘minimum stem length = 3’. Since removing ationally
+would leave a stem of length 1 this is rejected. But it also has ending
+ionally with associated condition A. Condition A is ‘no restriction on
+stem length’, so ionally is removed, leaving nat.
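+
+
+This first step — longest ending first, subject to a condition on the
+remaining stem — can be sketched in C. Only the two endings of the example
+are included; the condition functions are hypothetical, with the implicit
+minimum stem length of 2 (noted under Appendix B below) folded into A:
+
+
```c
#include <string.h>

/* B is 'minimum stem length = 3'; A has no explicit restriction, but the
   implicit minimum stem length of 2 is applied. */
static int cond_A(size_t stemlen) { return stemlen >= 2; }
static int cond_B(size_t stemlen) { return stemlen >= 3; }

struct ending { const char *suffix; int (*cond)(size_t); };

/* Just the two endings from the example, longest first. */
static const struct ending endings[] = {
    { "ationally", cond_B },
    { "ionally",   cond_A },
};

/* Remove the longest ending whose associated condition is satisfied by
   the remaining stem; returns 1 if an ending was removed. */
static int remove_ending(char *word) {
    size_t n = strlen(word);
    for (size_t i = 0; i < sizeof endings / sizeof endings[0]; i++) {
        size_t m = strlen(endings[i].suffix);
        if (n > m && strcmp(word + n - m, endings[i].suffix) == 0
                  && endings[i].cond(n - m)) {
            word[n - m] = '\0';
            return 1;
        }
    }
    return 0;
}
```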
+
+
+
+The transformation rules handle features like letter undoubling (sitting
+→ sitt → sit), irregular plurals (matrix and matrices),
+and English morphological oddities ultimately caused by the behaviour of
+Latin verbs of the second conjugation (assume / assumption,
+commit / commission etc). Although they are described as being
+applied in turn, they can be broken into two stages, rule 1 being done in
+stage 1, and either zero or one of rules 2 to 35 being done in stage 2.
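+
+
+The two stages can be sketched as follows, with only a handful of the 35
+rules included for illustration and buffer management simplified:
+
+
```c
#include <string.h>

struct rule { const char *from, *to; };

/* A small selection of Lovins' rules 2-35 (none of the rules with
   'except following' contexts are included here). */
static const struct rule rules[] = {
    { "iev",  "ief"  },   /* believ -> belief */
    { "umpt", "um"   },   /* consumpt -> consum */
    { "olv",  "olut" },   /* dissolv -> dissolut */
    { "mit",  "mis"  },   /* remit -> remis */
};

/* Stage 1 applies rule 1 (undoubling); stage 2 applies at most one of the
   remaining rules. The caller's buffer must have room for replacements
   that grow the word, such as olv -> olut. */
static void recode(char *w) {
    size_t n = strlen(w);
    /* stage 1: undouble one of bb, dd, gg, ll, mm, nn, pp, rr, ss, tt */
    if (n >= 2 && w[n-1] == w[n-2] && strchr("bdglmnprst", w[n-1]))
        w[--n] = '\0';
    /* stage 2: zero or one of the remaining rules */
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t m = strlen(rules[i].from);
        if (n >= m && strcmp(w + n - m, rules[i].from) == 0) {
            strcpy(w + n - m, rules[i].to);
            break;
        }
    }
}
```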
+
+
+
+Here is the list of endings as given in Appendix A of Lovins’ paper. They
+are grouped by length, from 11 characters down to 1. Each ending is
+followed by its condition code.
+
+
+
+
Appendix A. The list of endings
+
+
+
+
.11.
+
alistically B
arizability A
izationally B
+
+
.10.
+
antialness A
arisations A
arizations A
entialness A
+
+
.09.
+
allically C
antaneous A
antiality A
arisation A
+
arization A
ationally B
ativeness A
eableness E
+
entations A
entiality A
entialize A
entiation A
+
ionalness A
istically A
itousness A
izability A
+
izational A
+
+
.08.
+
ableness A
arizable A
entation A
entially A
+
eousness A
ibleness A
icalness A
ionalism A
+
ionality A
ionalize A
iousness A
izations A
+
lessness A
+
+
.07.
+
ability A
aically A
alistic B
alities A
+
ariness E
aristic A
arizing A
ateness A
+
atingly A
ational B
atively A
ativism A
+
elihood E
encible A
entally A
entials A
+
entiate A
entness A
fulness A
ibility A
+
icalism A
icalist A
icality A
icalize A
+
ication G
icianry A
ination A
ingness A
+
ionally A
isation A
ishness A
istical A
+
iteness A
iveness A
ivistic A
ivities A
+
ization F
izement A
oidally A
ousness A
+
+
.06.
+
aceous A
acious B
action G
alness A
+
ancial A
ancies A
ancing B
ariser A
+
arized A
arizer A
atable A
ations B
+
atives A
eature Z
efully A
encies A
+
encing A
ential A
enting C
entist A
+
eously A
ialist A
iality A
ialize A
+
ically A
icance A
icians A
icists A
+
ifully A
ionals A
ionate D
ioning A
+
ionist A
iously A
istics A
izable E
+
lessly A
nesses A
oidism A
+
+
.05.
+
acies A
acity A
aging B
aical A
+
alist A
alism B
ality A
alize A
+
allic BB
anced B
ances B
antic C
+
arial A
aries A
arily A
arity B
+
arize A
aroid A
ately A
ating I
+
ation B
ative A
ators A
atory A
+
ature E
early Y
ehood A
eless A
+
elity A
ement A
enced A
ences A
+
eness E
ening E
ental A
ented C
+
ently A
fully A
ially A
icant A
+
ician A
icide A
icism A
icist A
+
icity A
idine I
iedly A
ihood A
+
inate A
iness A
ingly B
inism J
+
inity CC
ional A
ioned A
ished A
+
istic A
ities A
itous A
ively A
+
ivity A
izers F
izing F
oidal A
+
oides A
otide A
ously A
+
+
.04.
+
able A
ably A
ages B
ally B
+
ance B
ancy B
ants B
aric A
+
arly K
ated I
ates A
atic B
+
ator A
ealy Y
edly E
eful A
+
eity A
ence A
ency A
ened E
+
enly E
eous A
hood A
ials A
+
ians A
ible A
ibly A
ical A
+
ides L
iers A
iful A
ines M
+
ings N
ions B
ious A
isms B
+
ists A
itic H
ized F
izer F
+
less A
lily A
ness A
ogen A
+
ward A
wise A
ying B
yish A
+
+
.03.
+
acy A
age B
aic A
als BB
+
ant B
ars O
ary F
ata A
+
ate A
eal Y
ear Y
ely E
+
ene E
ent C
ery E
ese A
+
ful A
ial A
ian A
ics A
+
ide L
ied A
ier A
ies P
+
ily A
ine M
ing N
ion Q
+
ish C
ism B
ist A
ite AA
+
ity A
ium A
ive A
ize F
+
oid A
one R
ous A
+
+
.02.
+
ae A
al BB
ar X
as B
+
ed E
en F
es E
ia A
+
ic A
is A
ly B
on S
+
or T
um U
us V
yl R
+
s' A
's A
+
+
.01.
+
a A
e A
i A
o A
+
s W
y B
+
+
+
+
+Here are the 29 conditions, called A to Z, AA, BB and CC (* stands for any letter):
+
+
+
+
Appendix B. Codes for context-sensitive rules associated with
+certain endings
+
+
+
+
A
No restrictions on stem
+
B
Minimum stem length = 3
+
C
Minimum stem length = 4
+
D
Minimum stem length = 5
+
E
Do not remove ending after e
+
F
Minimum stem length = 3 and do not remove ending after e
+
G
Minimum stem length = 3 and remove ending only after f
+
H
Remove ending only after t or ll
+
I
Do not remove ending after o or e
+
J
Do not remove ending after a or e
+
K
Minimum stem length = 3 and remove ending only after l, i or u*e
+
L
Do not remove ending after u, x or s, unless s follows o
+
M
Do not remove ending after a, c, e or m
+
N
Minimum stem length = 4 after s**, elsewhere = 3
+
O
Remove ending only after l or i
+
P
Do not remove ending after c
+
Q
Minimum stem length = 3 and do not remove ending after l or n
+
R
Remove ending only after n or r
+
S
Remove ending only after dr or t, unless t follows t
+
T
Remove ending only after s or t, unless t follows o
+
U
Remove ending only after l, m, n or r
+
V
Remove ending only after c
+
W
Do not remove ending after s or u
+
X
Remove ending only after l, i or u*e
+
Y
Remove ending only after in
+
Z
Do not remove ending after f
+
AA
Remove ending only after d, f, ph, th, l, er, or, es or t
+
BB
Minimum stem length = 3 and do not remove ending after met or ryst
+
CC
Remove ending only after l
+
+
+
+
+There is an implicit assumption in each condition, A included, that the minimum
+stem length is 2.
+
+
+
+Finally, here are the 35 transformation rules.
+
+
+
+
Appendix C. Transformation rules used in recoding stem terminations
+
+
+
+
1
remove one of double b, d, g, l, m, n, p, r, s, t
+
2
iev → ief
+
3
uct → uc
+
4
umpt → um
+
5
rpt → rb
+
6
urs → ur
+
7
istr → ister
+
7a
metr → meter
+
8
olv → olut
+
9
ul → l except following a, o, i
+
10
bex → bic
+
11
dex → dic
+
12
pex → pic
+
13
tex → tic
+
14
ax → ac
+
15
ex → ec
+
16
ix → ic
+
17
lux → luc
+
18
uad → uas
+
19
vad → vas
+
20
cid → cis
+
21
lid → lis
+
22
erid → eris
+
23
pand → pans
+
24
end → ens except following s
+
25
ond → ons
+
26
lud → lus
+
27
rud → rus
+
28
her → hes except following p, t
+
29
mit → mis
+
30
ent → ens except following m
+
31
ert → ers
+
32
et → es except following n
+
33
yt → ys
+
34
yz → ys
+
+
+
+
+(Rule 30 as given here corrects a typographical error in the published
+paper of 1968.)
+
+
+
+The following examples show the intentions behind these rules.
+
+
+
+
+
1
rubb[ing] → rub, embedd[ed] → embed etc
+
2
believ[e] → belief
+
3
induct[ion] → induc[e]
+
4
consumpt[ion] → consum[e]
+
5
absorpt[ion] → absorb
+
6
recurs[ive] → recur
+
7
administr[ate] → administ[er]
+
7a
parametr[ic] → paramet[er]
+
8
dissolv[ed] → dissolut[ion]
+
9
angul[ar] → angl[e]
+
10
vibex → vibic[es]
+
11
index → indic[es]
+
12
apex → apic[es]
+
13
cortex → cortic[al]
+
14
anthrax → anthrac[ite]
+
15
?
+
16
matrix → matric[es]
+
17
?
+
18
persuad[e] → persuas[ion]
+
19
evad[e] → evas[ion]
+
20
decid[e] → decis[ion]
+
21
elid[e] → elis[ion]
+
22
derid[e] → deris[ion]
+
23
expand → expans[ion]
+
24
defend → defens[ive]
+
25
respond → respons[ive]
+
26
collud[e] → collus[ion]
+
27
obtrud[e] → obtrus[ion]
+
28
adher[e] → adhes[ion]
+
29
remit → remis[s][ion]
+
30
extent → extens[ion]
+
31
convert[ed] → convers[ion]
+
32
parenthet[ic] → parenthes[is]
+
33
analyt[ic] → analys[is]
+
34
analyz[ed] → analys[ed]
+
+
+
+
The Lovins algorithm in Snowball
+
+
+And here is the Lovins algorithm in Snowball. The natural representation
+of the Lovins endings, conditions and rules in Snowball, is, I believe, a
+vindication of the appropriateness of Snowball for stemming work. Once the
+tables had been established, getting the Snowball version running was the
+work of a few minutes.
+
+
+
stringescapes {}
+
+routines (
+   A B C D E F G H I J K L M N O P Q R S T U V W X Y Z AA BB CC
+
+   endings
+
+   undouble respell
+)
+
+externals ( stem )
+
+backwardmode (
+
+/* Lovins' conditions A, B ... CC, as given in her Appendix B, where
+   a test for a two letter prefix ('test hop 2') is implicitly
+   assumed. Note that 'e' next 'u' corresponds to her u*e because
+   Snowball is scanning backwards. */
+
+define A  as ( hop 2 )
+define B  as ( hop 3 )
+define C  as ( hop 4 )
+define D  as ( hop 5 )
+define E  as ( test hop 2 not 'e' )
+define F  as ( test hop 3 not 'e' )
+define G  as ( test hop 3 'f' )
+define H  as ( test hop 2 't' or 'll' )
+define I  as ( test hop 2 not 'o' not 'e' )
+define J  as ( test hop 2 not 'a' not 'e' )
+define K  as ( test hop 3 'l' or 'i' or ('e' next 'u') )
+define L  as ( test hop 2 not 'u' not 'x' not ('s' not 'o') )
+define M  as ( test hop 2 not 'a' not 'c' not 'e' not 'm' )
+define N  as ( test hop 3 ( hop 2 not 's' or hop 2 ) )
+define O  as ( test hop 2 'l' or 'i' )
+define P  as ( test hop 2 not 'c' )
+define Q  as ( test hop 2 test hop 3 not 'l' not 'n' )
+define R  as ( test hop 2 'n' or 'r' )
+define S  as ( test hop 2 'dr' or ('t' not 't') )
+define T  as ( test hop 2 's' or ('t' not 'o') )
+define U  as ( test hop 2 'l' or 'm' or 'n' or 'r' )
+define V  as ( test hop 2 'c' )
+define W  as ( test hop 2 not 's' not 'u' )
+define X  as ( test hop 2 'l' or 'i' or ('e' next 'u') )
+define Y  as ( test hop 2 'in' )
+define Z  as ( test hop 2 not 'f' )
+define AA as ( test hop 2 among ( 'd' 'f' 'ph' 'th' 'l' 'er' 'or'
+                                  'es' 't' ) )
+define BB as ( test hop 3 not 'met' not 'ryst' )
+define CC as ( test hop 2 'l' )
+
+
+/* The system of endings, as given in Appendix A. */
+
+define endings as (
+   [substring] among (
+   'alistically' B  'arizability' A  'izationally' B
+
+   'antialness' A  'arisations' A  'arizations' A  'entialness' A
+
+   'allically' C  'antaneous' A  'antiality' A  'arisation' A
+   'arization' A  'ationally' B  'ativeness' A  'eableness' E
+   'entations' A  'entiality' A  'entialize' A  'entiation' A
+   'ionalness' A  'istically' A  'itousness' A  'izability' A
+   'izational' A
+
+   'ableness' A  'arizable' A  'entation' A  'entially' A
+   'eousness' A  'ibleness' A  'icalness' A  'ionalism' A
+   'ionality' A  'ionalize' A  'iousness' A  'izations' A
+   'lessness' A
+
+   'ability' A  'aically' A  'alistic' B  'alities' A
+   'ariness' E  'aristic' A  'arizing' A  'ateness' A
+   'atingly' A  'ational' B  'atively' A  'ativism' A
+   'elihood' E  'encible' A  'entally' A  'entials' A
+   'entiate' A  'entness' A  'fulness' A  'ibility' A
+   'icalism' A  'icalist' A  'icality' A  'icalize' A
+   'ication' G  'icianry' A  'ination' A  'ingness' A
+   'ionally' A  'isation' A  'ishness' A  'istical' A
+   'iteness' A  'iveness' A  'ivistic' A  'ivities' A
+   'ization' F  'izement' A  'oidally' A  'ousness' A
+
+   'aceous' A  'acious' B  'action' G  'alness' A
+   'ancial' A  'ancies' A  'ancing' B  'ariser' A
+   'arized' A  'arizer' A  'atable' A  'ations' B
+   'atives' A  'eature' Z  'efully' A  'encies' A
+   'encing' A  'ential' A  'enting' C  'entist' A
+   'eously' A  'ialist' A  'iality' A  'ialize' A
+   'ically' A  'icance' A  'icians' A  'icists' A
+   'ifully' A  'ionals' A  'ionate' D  'ioning' A
+   'ionist' A  'iously' A  'istics' A  'izable' E
+   'lessly' A  'nesses' A  'oidism' A
+
+   'acies' A  'acity' A  'aging' B  'aical' A
+   'alist' A  'alism' B  'ality' A  'alize' A
+   'allic' BB 'anced' B  'ances' B  'antic' C
+   'arial' A  'aries' A  'arily' A  'arity' B
+   'arize' A  'aroid' A  'ately' A  'ating' I
+   'ation' B  'ative' A  'ators' A  'atory' A
+   'ature' E  'early' Y  'ehood' A  'eless' A
+   'elity' A  'ement' A  'enced' A  'ences' A
+   'eness' E  'ening' E  'ental' A  'ented' C
+   'ently' A  'fully' A  'ially' A  'icant' A
+   'ician' A  'icide' A  'icism' A  'icist' A
+   'icity' A  'idine' I  'iedly' A  'ihood' A
+   'inate' A  'iness' A  'ingly' B  'inism' J
+   'inity' CC 'ional' A  'ioned' A  'ished' A
+   'istic' A  'ities' A  'itous' A  'ively' A
+   'ivity' A  'izers' F  'izing' F  'oidal' A
+   'oides' A  'otide' A  'ously' A
+
+   'able' A  'ably' A  'ages' B  'ally' B
+   'ance' B  'ancy' B  'ants' B  'aric' A
+   'arly' K  'ated' I  'ates' A  'atic' B
+   'ator' A  'ealy' Y  'edly' E  'eful' A
+   'eity' A  'ence' A  'ency' A  'ened' E
+   'enly' E  'eous' A  'hood' A  'ials' A
+   'ians' A  'ible' A  'ibly' A  'ical' A
+   'ides' L  'iers' A  'iful' A  'ines' M
+   'ings' N  'ions' B  'ious' A  'isms' B
+   'ists' A  'itic' H  'ized' F  'izer' F
+   'less' A  'lily' A  'ness' A  'ogen' A
+   'ward' A  'wise' A  'ying' B  'yish' A
+
+   'acy' A  'age' B  'aic' A  'als' BB
+   'ant' B  'ars' O  'ary' F  'ata' A
+   'ate' A  'eal' Y  'ear' Y  'ely' E
+   'ene' E  'ent' C  'ery' E  'ese' A
+   'ful' A  'ial' A  'ian' A  'ics' A
+   'ide' L  'ied' A  'ier' A  'ies' P
+   'ily' A  'ine' M  'ing' N  'ion' Q
+   'ish' C  'ism' B  'ist' A  'ite' AA
+   'ity' A  'ium' A  'ive' A  'ize' F
+   'oid' A  'one' R  'ous' A
+
+   'ae' A  'al' BB 'ar' X  'as' B
+   'ed' E  'en' F  'es' E  'ia' A
+   'ic' A  'is' A  'ly' B  'on' S
+   'or' T  'um' U  'us' V  'yl' R
+   '{'}s' A  's{'}' A
+
+   'a' A  'e' A  'i' A  'o' A
+   's' W  'y' B
+
+      (delete)
+   )
+)
+
+/* Undoubling is rule 1 of appendix C. */
+
+define undouble as (
+   test substring among ( 'bb' 'dd' 'gg' 'll' 'mm' 'nn' 'pp' 'rr' 'ss'
+                          'tt' )
+   [next] delete
+)
+
+/* The other appendix C rules can be done together. */
+
+define respell as (
+   [substring] among (
+      'iev'  ( <- 'ief' )
+      'uct'  ( <- 'uc' )
+      'umpt' ( <- 'um' )
+      'rpt'  ( <- 'rb' )
+      'urs'  ( <- 'ur' )
+      'istr' ( <- 'ister' )
+      'metr' ( <- 'meter' )
+      'olv'  ( <- 'olut' )
+      'ul'   ( not 'a' not 'i' not 'o' <- 'l' )
+      'bex'  ( <- 'bic' )
+      'dex'  ( <- 'dic' )
+      'pex'  ( <- 'pic' )
+      'tex'  ( <- 'tic' )
+      'ax'   ( <- 'ac' )
+      'ex'   ( <- 'ec' )
+      'ix'   ( <- 'ic' )
+      'lux'  ( <- 'luc' )
+      'uad'  ( <- 'uas' )
+      'vad'  ( <- 'vas' )
+      'cid'  ( <- 'cis' )
+      'lid'  ( <- 'lis' )
+      'erid' ( <- 'eris' )
+      'pand' ( <- 'pans' )
+      'end'  ( not 's' <- 'ens' )
+      'ond'  ( <- 'ons' )
+      'lud'  ( <- 'lus' )
+      'rud'  ( <- 'rus' )
+      'her'  ( not 'p' not 't' <- 'hes' )
+      'mit'  ( <- 'mis' )
+      'ent'  ( not 'm' <- 'ens' )
+             /* 'ent' was 'end' in the 1968 paper - a typo. */
+      'ert'  ( <- 'ers' )
+      'et'   ( not 'n' <- 'es' )
+      'yt'   ( <- 'ys' )
+      'yz'   ( <- 'ys' )
+   )
+)
+)
+
+define stem as (
+
+   backwards (
+      do endings
+      do undouble
+      do respell
+   )
+)
+
+The first ever published stemming algorithm was: Lovins JB (1968) Development of
+a stemming algorithm. Mechanical Translation and Computational Linguistics,
+11: 22-31. Julie Beth Lovins’ paper was remarkable for the early date at which
+it was done, and for its seminal influence on later work in
+this area.
+
+
+
+The design of the algorithm was much influenced by the technical vocabulary
+with which Lovins found herself working (subject term keywords attached to
+documents in the materials science and engineering field). The subject term
+list may also have been slightly limiting in that certain common endings
+are not represented (ements and ents for example, corresponding to
+the singular forms ement and ent), and also in that the algorithm's
+treatment of short words, or words with short stems, can be rather
+destructive.
+
+
+
+The Lovins algorithm is noticeably bigger than the Porter algorithm,
+because of its very extensive endings list. But in one way that is used to
+advantage: it is faster. It has effectively traded space for time, and with
+its large suffix set it needs just two major steps to remove a suffix,
+compared with the eight of the Porter algorithm.
+
+
+
+The algorithm comprises a list of endings, a set of 29 conditions, and 35
+transformation rules. Each ending is associated with one of the
+conditions. In the first step the longest ending is found which satisfies
+its associated condition, and is removed. In the second step the 35 rules
+are applied to transform the ending. The second step is done whether or not
+an ending is removed in the first step.
+
+
+
+For example, nationally has the ending ationally, with associated
+condition, B, ‘minimum stem length = 3’. Since removing ationally
+would leave a stem of length 1 this is rejected. But it also has ending
+ionally with associated condition A. Condition A is ‘no restriction on
+stem length’, so ionally is removed, leaving nat.
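The longest-match-with-condition behaviour can be sketched in Python. This is a toy illustration, not the real table: only the two endings from the example and conditions A and B are modelled, with the implicit minimum stem length of 2 for condition A.

```python
ENDINGS = {"ationally": "B", "ionally": "A"}  # ending -> condition code

def condition_holds(code, stem):
    # A: implicit minimum stem length 2; B: minimum stem length 3.
    minimum = {"A": 2, "B": 3}[code]
    return len(stem) >= minimum

def remove_ending(word):
    # The longest listed ending whose condition accepts the stem wins.
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending):
            stem = word[: -len(ending)]
            if condition_holds(ENDINGS[ending], stem):
                return stem
    return word

print(remove_ending("nationally"))  # -> "nat"
```

Note that the longer ending ationally is tried first and rejected, exactly as in the worked example above.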
+
+
+
+The transformation rules handle features like letter undoubling (sitting
+→ sitt → sit), irregular plurals (matrix and matrices),
+and English morphological oddities ultimately caused by the behaviour of
+Latin verbs of the second conjugation (assume / assumption,
+commit / commission etc). Although they are described as being
+applied in turn, they can be broken into two stages, rule 1 being done in
+stage 1, and either zero or one of rules 2 to 35 being done in stage 2.
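The two-stage view of the recoding can be sketched in Python. The rule set here is abbreviated to three of the 35 rules, and the helper names are illustrative, not Lovins' own.

```python
UNDOUBLE = set("bdglmnprst")          # rule 1: consonants that undouble

RESPELL = {"iev": "ief", "mit": "mis", "yz": "ys"}  # excerpt of rules 2-35

def recode(stem):
    # Stage 1: undouble a final double consonant from the listed set.
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] in UNDOUBLE:
        stem = stem[:-1]
    # Stage 2: apply at most one of the remaining respelling rules.
    for old, new in RESPELL.items():
        if stem.endswith(old):
            return stem[: -len(old)] + new
    return stem

print(recode("sitt"))    # -> "sit"
print(recode("analyz"))  # -> "analys"
```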
+
+
+
+Here is the list of endings as given in Appendix A of Lovins’ paper. They
+are grouped by length, from 11 characters down to 1. Each ending is
+followed by its condition code.
+
+
+
+
Appendix A. The list of endings
+
+
+
+
.11.
+
alistically B
arizability A
izationally B
+
+
.10.
+
antialness A
arisations A
arizations A
entialness A
+
+
.09.
+
allically C
antaneous A
antiality A
arisation A
+
arization A
ationally B
ativeness A
eableness E
+
entations A
entiality A
entialize A
entiation A
+
ionalness A
istically A
itousness A
izability A
+
izational A
+
+
.08.
+
ableness A
arizable A
entation A
entially A
+
eousness A
ibleness A
icalness A
ionalism A
+
ionality A
ionalize A
iousness A
izations A
+
lessness A
+
+
.07.
+
ability A
aically A
alistic B
alities A
+
ariness E
aristic A
arizing A
ateness A
+
atingly A
ational B
atively A
ativism A
+
elihood E
encible A
entally A
entials A
+
entiate A
entness A
fulness A
ibility A
+
icalism A
icalist A
icality A
icalize A
+
ication G
icianry A
ination A
ingness A
+
ionally A
isation A
ishness A
istical A
+
iteness A
iveness A
ivistic A
ivities A
+
ization F
izement A
oidally A
ousness A
+
+
.06.
+
aceous A
acious B
action G
alness A
+
ancial A
ancies A
ancing B
ariser A
+
arized A
arizer A
atable A
ations B
+
atives A
eature Z
efully A
encies A
+
encing A
ential A
enting C
entist A
+
eously A
ialist A
iality A
ialize A
+
ically A
icance A
icians A
icists A
+
ifully A
ionals A
ionate D
ioning A
+
ionist A
iously A
istics A
izable E
+
lessly A
nesses A
oidism A
+
+
.05.
+
acies A
acity A
aging B
aical A
+
alist A
alism B
ality A
alize A
+
allic BB
anced B
ances B
antic C
+
arial A
aries A
arily A
arity B
+
arize A
aroid A
ately A
ating I
+
ation B
ative A
ators A
atory A
+
ature E
early Y
ehood A
eless A
+
elity A
ement A
enced A
ences A
+
eness E
ening E
ental A
ented C
+
ently A
fully A
ially A
icant A
+
ician A
icide A
icism A
icist A
+
icity A
idine I
iedly A
ihood A
+
inate A
iness A
ingly B
inism J
+
inity CC
ional A
ioned A
ished A
+
istic A
ities A
itous A
ively A
+
ivity A
izers F
izing F
oidal A
+
oides A
otide A
ously A
+
+
.04.
+
able A
ably A
ages B
ally B
+
ance B
ancy B
ants B
aric A
+
arly K
ated I
ates A
atic B
+
ator A
ealy Y
edly E
eful A
+
eity A
ence A
ency A
ened E
+
enly E
eous A
hood A
ials A
+
ians A
ible A
ibly A
ical A
+
ides L
iers A
iful A
ines M
+
ings N
ions B
ious A
isms B
+
ists A
itic H
ized F
izer F
+
less A
lily A
ness A
ogen A
+
ward A
wise A
ying B
yish A
+
+
.03.
+
acy A
age B
aic A
als BB
+
ant B
ars O
ary F
ata A
+
ate A
eal Y
ear Y
ely E
+
ene E
ent C
ery E
ese A
+
ful A
ial A
ian A
ics A
+
ide L
ied A
ier A
ies P
+
ily A
ine M
ing N
ion Q
+
ish C
ism B
ist A
ite AA
+
ity A
ium A
ive A
ize F
+
oid A
one R
ous A
+
+
.02.
+
ae A
al BB
ar X
as B
+
ed E
en F
es E
ia A
+
ic A
is A
ly B
on S
+
or T
um U
us V
yl R
+
s' A
's A
+
+
.01.
+
a A
e A
i A
o A
+
s W
y B
+
+
+
+
+Here are the 29 conditions, called A to Z, AA, BB and CC (* stands for any letter):
+
+
+
+
Appendix B. Codes for context-sensitive rules associated with
+certain endings
+
+
+
+
A
No restrictions on stem
+
B
Minimum stem length = 3
+
C
Minimum stem length = 4
+
D
Minimum stem length = 5
+
E
Do not remove ending after e
+
F
Minimum stem length = 3 and do not remove ending after e
+
G
Minimum stem length = 3 and remove ending only after f
+
H
Remove ending only after t or ll
+
I
Do not remove ending after o or e
+
J
Do not remove ending after a or e
+
K
Minimum stem length = 3 and remove ending only after l, i or u*e
+
L
Do not remove ending after u, x or s, unless s follows o
+
M
Do not remove ending after a, c, e or m
+
N
Minimum stem length = 4 after s**, elsewhere = 3
+
O
Remove ending only after l or i
+
P
Do not remove ending after c
+
Q
Minimum stem length = 3 and do not remove ending after l or n
+
R
Remove ending only after n or r
+
S
Remove ending only after dr or t, unless t follows t
+
T
Remove ending only after s or t, unless t follows o
+
U
Remove ending only after l, m, n or r
+
V
Remove ending only after c
+
W
Do not remove ending after s or u
+
X
Remove ending only after l, i or u*e
+
Y
Remove ending only after in
+
Z
Do not remove ending after f
+
AA
Remove ending only after d, f, ph, th, l, er, or, es or t
+
BB
Minimum stem length = 3 and do not remove ending after met or ryst
+
CC
Remove ending only after l
+
+
+
+
+There is an implicit assumption in each condition, A included, that the minimum
+stem length is 2.
+
+
+
+Finally, here are the 35 transformation rules.
+
+
+
+
Appendix C. Transformation rules used in recoding stem terminations
+
+
+
+
1
remove one of double b, d, g, l, m, n, p, r, s, t
+
2
iev → ief
+
3
uct → uc
+
4
umpt → um
+
5
rpt → rb
+
6
urs → ur
+
7
istr → ister
+
7a
metr → meter
+
8
olv → olut
+
9
ul → l except following a, o, i
+
10
bex → bic
+
11
dex → dic
+
12
pex → pic
+
13
tex → tic
+
14
ax → ac
+
15
ex → ec
+
16
ix → ic
+
17
lux → luc
+
18
uad → uas
+
19
vad → vas
+
20
cid → cis
+
21
lid → lis
+
22
erid → eris
+
23
pand → pans
+
24
end → ens except following s
+
25
ond → ons
+
26
lud → lus
+
27
rud → rus
+
28
her → hes except following p, t
+
29
mit → mis
+
30
ent → ens except following m
+
31
ert → ers
+
32
et → es except following n
+
33
yt → ys
+
34
yz → ys
+
+
+
+
+(Rule 30 as given here corrects a typographical error in the published
+paper of 1968.)
+
+
+
+The following examples show the intentions behind these rules.
+
+
+
+
+
1
rubb[ing] → rub, embedd[ed] → embed etc
+
2
believ[e] → belief
+
3
induct[ion] → induc[e]
+
4
consumpt[ion] → consum[e]
+
5
absorpt[ion] → absorb
+
6
recurs[ive] → recur
+
7
administr[ate] → administ[er]
+
7a
parametr[ic] → paramet[er]
+
8
dissolv[ed] → dissolut[ion]
+
9
angul[ar] → angl[e]
+
10
vibex → vibic[es]
+
11
index → indic[es]
+
12
apex → apic[es]
+
13
cortex → cortic[al]
+
14
anthrax → anthrac[ite]
+
15
?
+
16
matrix → matric[es]
+
17
?
+
18
persuad[e] → persuas[ion]
+
19
evad[e] → evas[ion]
+
20
decid[e] → decis[ion]
+
21
elid[e] → elis[ion]
+
22
derid[e] → deris[ion]
+
23
expand → expans[ion]
+
24
defend → defens[ive]
+
25
respond → respons[ive]
+
26
collud[e] → collus[ion]
+
27
obtrud[e] → obtrus[ion]
+
28
adher[e] → adhes[ion]
+
29
remit → remis[s][ion]
+
30
extent → extens[ion]
+
31
convert[ed] → convers[ion]
+
32
parenthet[ic] → parenthes[is]
+
33
analyt[ic] → analys[is]
+
34
analyz[ed] → analys[ed]
+
+
+
+
The Lovins algorithm in Snowball
+
+
+And here is the Lovins algorithm in Snowball. The natural representation
+of the Lovins endings, conditions and rules in Snowball is, I believe, a
+vindication of the appropriateness of Snowball for stemming work. Once the
+tables had been established, getting the Snowball version running was the
+work of a few minutes.
+
+The Norwegian alphabet includes the following additional letters,
+
+
+
+ æ å ø
+
+
+
+The following letters are vowels:
+
+
+
+ a e i o u y æ å ø
+
+
+
+R2 is not used: R1 is defined in the same way as in the
+German stemmer.
+(See the note on R1 and R2.)
+
+
+
+Define a valid s-ending as one of
+
+
+
+b c d f g h j
+l m n o p r t v
+y z,
+ or k not preceded by a vowel.
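The s-ending test can be sketched in Python; the function name and structure are mine, not part of the algorithm's own notation.

```python
S_ENDINGS = set("bcdfghjlmnoprtvyz")  # note: k is handled separately
VOWELS = set("aeiouyæåø")

def has_valid_s_ending(word):
    # 'word' still carries the final 's' that step 1(b) may delete.
    if len(word) < 3 or not word.endswith("s"):
        return False
    prev = word[-2]
    if prev in S_ENDINGS:
        return True
    # k counts only when it is not preceded by a vowel.
    return prev == "k" and word[-3] not in VOWELS

print(has_valid_s_ending("verks"))  # -> True  (k after a consonant)
print(has_valid_s_ending("vaks"))   # -> False (k after a vowel)
```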
+
+
+
+Do each of steps 1, 2 and 3.
+
+
+
+Step 1:
+
+
+
+ Search for the longest among the following suffixes in R1, and
+ perform the action indicated.
+
+
(a)
+ a e ede ande ende ane ene hetene en
+ heten ar er heter as es edes endes
+ enes hetenes ens hetens ers ets et het
+ ast
+
delete
+
(b)
+ s
+
delete if preceded by a valid s-ending
+
(c)
+ erte ert
+
replace with er
+
+
+ (Of course the letter of the valid s-ending is
+ not necessarily in R1)
+
+
+
+
+Step 2:
+
+
+
+
+ If the word ends dt or vt in R1, delete the t.
+
+
+
+ (For example, meldt → meld, operativt → operativ)
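Step 2 is small enough to sketch directly in Python. The `r1` helper below is an assumption on my part: it marks R1 as the region after the first non-vowel following a vowel, with the German-style adjustment that R1 never starts before position 3.

```python
VOWELS = "aeiouyæåø"

def r1(word):
    # Start of R1: after the first non-vowel that follows a vowel,
    # but never before position 3 (the German-stemmer adjustment).
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return max(i + 1, 3)
    return len(word)

def step2(word):
    # If the word ends dt or vt in R1, delete the t.
    if word[r1(word):].endswith(("dt", "vt")):
        return word[:-1]
    return word

print(step2("meldt"))      # -> "meld"
print(step2("operativt"))  # -> "operativ"
```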
+
+
+
+
+Step 3:
+
+
+
+ Search for the longest among the following suffixes in R1, and if found,
+ delete.
+
+ leg eleg ig eig lig elig els
+ lov elov slov hetslov
+
+
+
+
The same algorithm in Snowball
+
+[% highlight_file('norwegian') %]
+
+[% footer %]
diff --git a/algorithms/norwegian/stop.txt b/algorithms/norwegian/stop.txt
new file mode 100644
index 0000000..df1c509
--- /dev/null
+++ b/algorithms/norwegian/stop.txt
@@ -0,0 +1,182 @@
+
+ | A Norwegian stop word list. Comments begin with vertical bar. Each stop
+ | word is at the start of a line.
+
+ | This stop word list is for the dominant bokmål dialect. Words unique
+ | to nynorsk are marked *.
+
+ | Revised by Jan Bruusgaard, Jan 2005
+
+og | and
+i | in
+jeg | I
+det | it/this/that
+at | to (w. inf.)
+en | a/an
+et | a/an
+den | it/this/that
+til | to
+er | is/am/are
+som | who/which/that
+på | on
+de | they / you(formal)
+med | with
+han | he
+av | of
+ikke | not
+ikkje | not *
+der | there
+så | so
+var | was/were
+meg | me
+seg | himself/herself/themselves (reflexive)
+men | but
+ett | one
+har | have
+om | about
+vi | we
+min | my
+mitt | my
+ha | have
+hadde | had
+hun | she
+nå | now
+over | over
+da | when/as
+ved | by/know
+fra | from
+du | you
+ut | out
+sin | your
+dem | them
+oss | us
+opp | up
+man | you/one
+kan | can
+hans | his
+hvor | where
+eller | or
+hva | what
+skal | shall/must
+selv | self (reflexive)
+sjøl | self (reflexive)
+her | here
+alle | all
+vil | will
+bli | become
+ble | became
+blei | became *
+blitt | have become
+kunne | could
+inn | in
+når | when
+være | be
+kom | come
+noen | some
+noe | some
+ville | would
+dere | you
+deres | their/theirs
+kun | only/just
+ja | yes
+etter | after
+ned | down
+skulle | should
+denne | this
+for | for/because
+deg | you
+si | hers/his
+sine | hers/his
+sitt | hers/his
+mot | against
+å | to
+meget | much
+hvorfor | why
+dette | this
+disse | these/those
+uten | without
+hvordan | how
+ingen | none
+din | your
+ditt | your
+blir | become
+samme | same
+hvilken | which
+hvilke | which (plural)
+sånn | such a
+inni | inside/within
+mellom | between
+vår | our
+hver | each
+hvem | who
+vors | us/ours
+hvis | whose
+både | both
+bare | only/just
+enn | than
+fordi | as/because
+før | before
+mange | many
+også | also
+slik | just
+vært | been
+båe | both *
+begge | both
+siden | since
+dykk | your *
+dykkar | yours *
+dei | they *
+deira | them *
+deires | theirs *
+deim | them *
+di | your (fem.) *
+då | as/when *
+eg | I *
+ein | a/an *
+eit | a/an *
+eitt | a/an *
+elles | or *
+honom | he *
+hjå | at *
+ho | she *
+hoe | she *
+henne | her
+hennar | her/hers
+hennes | hers
+hoss | how *
+hossen | how *
+ingi | noone *
+inkje | noone *
+korleis | how *
+korso | how *
+kva | what/which *
+kvar | where *
+kvarhelst | where *
+kven | who/whom *
+kvi | why *
+kvifor | why *
+me | we *
+medan | while *
+mi | my *
+mine | my *
+mykje | much *
+no | now *
+nokon | some (masc./neut.) *
+noka | some (fem.) *
+nokor | some *
+noko | some *
+nokre | some *
+sia | since *
+sidan | since *
+so | so *
+somt | some *
+somme | some *
+um | about *
+upp | up *
+vere | be *
+vore | was *
+verte | become *
+vort | become *
+varte | became *
+vart | became *
+
diff --git a/algorithms/porter/stemmer.html b/algorithms/porter/stemmer.html
new file mode 100644
index 0000000..f68dbc7
--- /dev/null
+++ b/algorithms/porter/stemmer.html
@@ -0,0 +1,863 @@
+
+
+
+
+
+
+
+
+
+ The Porter stemming algorithm - Snowball
+
+
+
+
+
+
+
+
+
+
+
+Here is a case study on how to code up a stemming algorithm in Snowball. First,
+the definition of the Porter stemmer, as it appeared in Program, Vol 14 no. 3 pp
+130-137, July 1980.
+
+
+
+
THE ALGORITHM
+
+
+A consonant in a word is a letter other than A, E, I, O or U, and other
+than Y preceded by a consonant. (The fact that the term ‘consonant’ is
+defined to some extent in terms of itself does not make it ambiguous.) So in
+TOY the consonants are T and Y, and in SYZYGY they are S, Z and G. If a
+letter is not a consonant it is a vowel.
+
+
+
+A consonant will be denoted by c, a vowel by v. A list ccc... of length
+greater than 0 will be denoted by C, and a list vvv... of length greater
+than 0 will be denoted by V. Any word, or part of a word, therefore has one
+of the four forms:
+
+
+
+
CVCV ... C
+
CVCV ... V
+
VCVC ... C
+
VCVC ... V
+
+
+
+These may all be represented by the single form
+
+
+
+ [C]VCVC ... [V]
+
+
+
+where the square brackets denote arbitrary presence of their contents.
+Using (VC)m to denote VC repeated m times, this may again be written as
+
+
+
+ [C](VC)m[V].
+
+
+
+m will be called the measure of any word or word part when represented in
+this form. The case m = 0 covers the null word. Here are some examples:
+
+
+
+
m=0
TR, EE, TREE, Y, BY.
+
m=1
TROUBLE, OATS, TREES, IVY.
+
m=2
TROUBLES, PRIVATE, OATEN, ORRERY.
+
+
+
+The rules for removing a suffix will be given in the form
+
+
+
+ (condition) S1 → S2
+
+
+
+This means that if a word ends with the suffix S1, and the stem before S1
+satisfies the given condition, S1 is replaced by S2. The condition is
+usually given in terms of m, e.g.
+
+
+
+ (m > 1) EMENT →
+
+
+
+Here S1 is ‘EMENT’ and S2 is null. This would map REPLACEMENT to REPLAC,
+since REPLAC is a word part for which m = 2.
+
+
+
+The ‘condition’ part may also contain the following:
+
+
+
+
*S
-
the stem ends with S (and similarly for the other letters).
+
+
*v*
-
the stem contains a vowel.
+
+
*d
-
the stem ends with a double consonant (e.g. -TT, -SS).
+
+
*o
-
the stem ends cvc, where the second c is not W, X or Y (e.g.
+ -WIL, -HOP).
+
+
+
+And the condition part may also contain expressions with and, or and
+not, so that
+
+
+
+ (m>1 and (*S or *T))
+
+
+
+tests for a stem with m>1 ending in S or T, while
+
+
+
+ (*d and not (*L or *S or *Z))
+
+
+
+tests for a stem ending with a double consonant other than L, S or Z.
+Elaborate conditions like this are required only rarely.
+
+
+
+In a set of rules written beneath each other, only one is obeyed, and this
+will be the one with the longest matching S1 for the given word. For
+example, with
+
+
+
+
SSES
→
SS
+
IES
→
I
+
SS
→
SS
+
S
→
+
+
+
+(here the conditions are all null) CARESSES maps to CARESS since SSES is
+the longest match for S1. Equally CARESS maps to CARESS (S1=‘SS’) and CARES
+to CARE (S1=‘S’).
+
+
+
+In the rules below, examples of their application, successful or otherwise,
+are given on the right in lower case. The algorithm now follows:
+
+
+
+Step 1a
+
+
+
+
SSES
→
SS
caresses
→
caress
+
IES
→
I
ponies
→
poni
+
ties
→
ti
+
SS
→
SS
caress
→
caress
+
S
→
cats
→
cat
+
+
+
+Step 1b
+
+
+
+
(m>0) EED
→
EE
feed
→
feed
+
agreed
→
agree
+
(*v*) ED
→
plastered
→
plaster
+
bled
→
bled
+
(*v*) ING
→
motoring
→
motor
+
sing
→
sing
+
+
+
+If the second or third of the rules in Step 1b is successful, the following
+is done:
+
+
+
+
AT
→
ATE
conflat(ed)
→
conflate
+
BL
→
BLE
troubl(ed)
→
trouble
+
IZ
→
IZE
siz(ed)
→
size
+
(*d and not (*L or *S or *Z))
+
→
single letter
hopp(ing)
→
hop
+
tann(ed)
→
tan
+
fall(ing)
→
fall
+
hiss(ing)
→
hiss
+
fizz(ed)
→
fizz
+
(m=1 and *o)
+
→
E
fail(ing)
→
fail
+
fil(ing)
→
file
+
+
+
+The rule to map to a single letter causes the removal of one of the double
+letter pair. The -E is put back on -AT, -BL and -IZ, so that the suffixes
+-ATE, -BLE and -IZE can be recognised later. This E may be removed in step
+4.
+
+
+
+Step 1c
+
+
+
+
(*v*) Y
→
I
happy
→
happi
+
sky
→
sky
+
+
+
+Step 1 deals with plurals and past participles. The subsequent steps are
+much more straightforward.
+
+
+
+Step 2
+
+
+
+
(m>0) ATIONAL
→
ATE
relational
→
relate
+
(m>0) TIONAL
→
TION
conditional
→
condition
+
rational
→
rational
+
(m>0) ENCI
→
ENCE
valenci
→
valence
+
(m>0) ANCI
→
ANCE
hesitanci
→
hesitance
+
(m>0) IZER
→
IZE
digitizer
→
digitize
+
(m>0) ABLI
→
ABLE
conformabli
→
conformable
+
(m>0) ALLI
→
AL
radicalli
→
radical
+
(m>0) ENTLI
→
ENT
differentli
→
different
+
(m>0) ELI
→
E
vileli
→
vile
+
(m>0) OUSLI
→
OUS
analogousli
→
analogous
+
(m>0) IZATION
→
IZE
vietnamization
→
vietnamize
+
(m>0) ATION
→
ATE
predication
→
predicate
+
(m>0) ATOR
→
ATE
operator
→
operate
+
(m>0) ALISM
→
AL
feudalism
→
feudal
+
(m>0) IVENESS
→
IVE
decisiveness
→
decisive
+
(m>0) FULNESS
→
FUL
hopefulness
→
hopeful
+
(m>0) OUSNESS
→
OUS
callousness
→
callous
+
(m>0) ALITI
→
AL
formaliti
→
formal
+
(m>0) IVITI
→
IVE
sensitiviti
→
sensitive
+
(m>0) BILITI
→
BLE
sensibiliti
→
sensible
+
+
+
+The test for the string S1 can be made fast by doing a program switch on
+the penultimate letter of the word being tested. This gives a fairly even
+breakdown of the possible values of the string S1. It will be seen in fact
+that the S1-strings in step 2 are presented here in the alphabetical order
+of their penultimate letter. Similar techniques may be applied in the other
+steps.
+
+
+
+Step 3
+
+
+
+
(m>0) ICATE
→
IC
triplicate
→
triplic
+
(m>0) ATIVE
→
formative
→
form
+
(m>0) ALIZE
→
AL
formalize
→
formal
+
(m>0) ICITI
→
IC
electriciti
→
electric
+
(m>0) ICAL
→
IC
electrical
→
electric
+
(m>0) FUL
→
hopeful
→
hope
+
(m>0) NESS
→
goodness
→
good
+
+
+
+Step 4
+
+
+
+
(m>1) AL
→
revival
→
reviv
+
(m>1) ANCE
→
allowance
→
allow
+
(m>1) ENCE
→
inference
→
infer
+
(m>1) ER
→
airliner
→
airlin
+
(m>1) IC
→
gyroscopic
→
gyroscop
+
(m>1) ABLE
→
adjustable
→
adjust
+
(m>1) IBLE
→
defensible
→
defens
+
(m>1) ANT
→
irritant
→
irrit
+
(m>1) EMENT
→
replacement
→
replac
+
(m>1) MENT
→
adjustment
→
adjust
+
(m>1) ENT
→
dependent
→
depend
+
(m>1 and (*S or *T)) ION
+
→
adoption
→
adopt
+
(m>1) OU
→
homologou
→
homolog
+
(m>1) ISM
→
communism
→
commun
+
(m>1) ATE
→
activate
→
activ
+
(m>1) ITI
→
angulariti
→
angular
+
(m>1) OUS
→
homologous
→
homolog
+
(m>1) IVE
→
effective
→
effect
+
(m>1) IZE
→
bowdlerize
→
bowdler
+
+
+
+The suffixes are now removed. All that remains is a little tidying up.
+
+
+
+Step 5a
+
+
+
+
(m>1) E
→
probate
→
probat
+
rate
→
rate
+
(m=1 and not *o) E
+
→
cease
→
ceas
+
+
+
+Step 5b
+
+
+
+
(m > 1 and *d and *L)
+
→
single letter
controll
→
control
+
roll
→
roll
+
+
+
+
+Now, turning it into Snowball.
+
+
+
+The Porter stemmer makes use of a measure, m, of the length of a word or
+word part. If C is a sequence of one or more consonants, and V a sequence
+of one or more vowels, any word part has the form
+
+
+
+ [C](VC)m[V],
+
+
+
+which is to be read as an optional C, followed by m repetitions of VC,
+followed by an optional V. This defines m. So for crepuscular the
+measure would be 4.
+
+
+
+ c r e p u s c u l a r
+ | | | | |
+ [C] V C V C V C V C
+ 1 2 3 4
+
+
+
+Most of the rules for suffix removal involve leaving behind a stem whose
+measure exceeds some value, for example,
+
+
+
+ (m > 0) eed → ee
+
+
+
+means ‘replace eed with ee if the stem before eed has measure
+m > 0’. Implementations of the Porter stemmer usually have a routine that
+computes m each time there is a possible candidate for removal.
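Such a routine might look like this in Python (an illustrative sketch, not the distributed implementation; y is counted as a vowel when preceded by a consonant, as the paper defines).

```python
def is_vowel(word, i):
    if word[i] in "aeiou":
        return True
    if word[i] == "y":
        return i > 0 and not is_vowel(word, i - 1)
    return False

def measure(word):
    # m counts vowel-to-consonant boundaries: each VC in [C](VC)m[V].
    m = 0
    prev_vowel = False
    for i in range(len(word)):
        v = is_vowel(word, i)
        if prev_vowel and not v:
            m += 1
        prev_vowel = v
    return m

print(measure("crepuscular"))  # -> 4
print(measure("troubles"))     # -> 2
```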
+
+
+
+In fact the only tests on m in the Porter stemmer are m > 0, m > 1, and,
+at two interesting points, m = 1. This suggests that there are two
+critical positions in a word: the point at which, going from left to
+right, m > 0 becomes true, and then the point at which m > 1 becomes true.
+It turns out that m > 0 becomes true at the point after the first consonant
+following a vowel, and m > 1 becomes true at the point after the first
+consonant following a vowel following a consonant following a vowel.
+Calling these positions p1 and p2, we can determine them quite simply in
+Snowball:
+
+    gopast v  gopast non-v  setmark p1
+    gopast v  gopast non-v  setmark p2
+
+The region to the right of p1 will be denoted by R1, the region to the
+right of p2 by R2:
+
+
+
+ c r e p u s c u l a r
+ | |
+ p1 p2
+ <--- R1 --->
+ <-- R2 -->
+
+
+
+We can test for being in these regions with calls to R1 and R2, defined by,
+
+
+
define R1 as $p1 <= cursor
+define R2 as $p2 <= cursor
+
+
+
+
+and using these tests instead of computing m is acceptable, so long as the
+stemming process never alters the p1 and p2 positions, which is indeed true
+in the Porter stemmer.
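The marking of p1 and p2 can be mirrored in Python (function names are my own; y is treated uniformly as a vowel here, a simplification of the paper's context-dependent rule).

```python
VOWELS = "aeiouy"

def next_mark(word, start):
    # Skip leading consonants, then the vowel run, and return the position
    # just after the first non-vowel that follows the vowels.
    i = start
    while i < len(word) and word[i] not in VOWELS:
        i += 1
    while i < len(word) and word[i] in VOWELS:
        i += 1
    return i + 1 if i < len(word) else len(word)

def mark_regions(word):
    p1 = next_mark(word, 0)          # m > 0 becomes true here
    p2 = next_mark(word, p1)         # m > 1 becomes true here
    return p1, p2

print(mark_regions("crepuscular"))   # -> (4, 6)
```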
+
+
+
+A particularly interesting feature of the stemmers presented here is the
+common use they make of the positions p1 and p2. The details of marking
+p1
+and p2 vary between the languages because the definitions of vowel and
+consonant vary. For example, French i preceded and followed by vowel
+should be treated as a consonant (inquiétude); Portuguese ã and õ
+should be treated as vowel-consonant pairs (São João). A third
+important position is pV, which tries to mark the position of the shortest
+acceptable verb stem. Its definition varies somewhat between languages.
+The Porter stemmer does not use a pV explicitly, but the idea appears when
+the verb endings ing and ed are removed only when preceded by a vowel.
+In English therefore pV would be defined as the position after the first
+vowel.
+
+
+
+The Porter stemmer is divided into five steps, step 1 is divided further
+into steps 1a, 1b and 1c, and step 5 into steps 5a and 5b. Step 1 removes
+the i-suffixes, and steps 2 to 4 the d-suffixes (*). Composite d-suffixes are
+reduced to single d-suffixes one at a time. So for example if a word ends
+icational, step 2 reduces it to icate and step 3 to ic. Three steps are
+sufficient for this process in English. Step 5 does some tidying up.
+
+
+
+One can see how easily the stemming rules translate into Snowball by
+comparing the definition of Step 1a from the 1980 paper,
+
+
+
+ Step 1a:
+ SSES → SS caresses → caress
+ IES → I ponies → poni
+ ties → ti
+ SS → SS caress → caress
+ S → cats → cat
+
+The word to be stemmed is being scanned right to left from the end. The
+longest of 'sses', 'ies', 'ss' or 's' is searched for and defined as the
+slice. (If none are found, Step_1a signals f.) If 'sses' is found, it is
+replaced by 'ss', and so on. Of course, replacing 'ss' by 'ss' is a dummy
+action, so we can write
+
+
+
'ss' ()
+
+
+
+
+instead of
+
+
+
'ss' (<- 'ss')
+
+
+
+
+Remember that delete just means <- ''.
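For comparison, the same longest-match logic for Step 1a is equally direct in Python (an illustrative translation, not the distributed implementation).

```python
def step_1a(word):
    # Longest match first; 'ss' must come before 's' so that 'caress'
    # is not stripped to 'cares'.
    for suffix, replacement in (("sses", "ss"), ("ies", "i"),
                                ("ss", "ss"), ("s", "")):
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

print(step_1a("caresses"))  # -> "caress"
print(step_1a("ponies"))    # -> "poni"
print(step_1a("cats"))      # -> "cat"
```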
+
+
+
+The really tricky part of the whole algorithm is step 1b,
+which may be worth looking at in detail. Here it is, without the
+example words on the far right,
+
+
+
+ Step 1b:
+ (m > 0) EED → EE
+ (*v*) ED →
+ (*v*) ING →
+
+ If the second or third of the rules in Step 1b is successful, the
+ following is done:
+
+ AT → ATE
+ BL → BLE
+ IZ → IZE
+ (*d and not (*L or *S or *Z)) → single letter
+ (m = 1 and *o) → E
+
+
+
+The first part of the rule means that eed maps to ee if eed is in R1
+(which is equivalent to m > 0), or ed and ing are removed if they are
+preceded by a vowel. In Snowball this is simply,
+
+    [substring] among (
+        'eed' (R1 <- 'ee')
+        'ed'
+        'ing' (test gopast v delete)
+    )
+
+But this must be modified by the second part of the rule. *d indicates a
+test for double letter consonant — bb, dd etc. *L, *S, *Z are tests
+for l, s, z. *o is a short vowel test — it is matched by
+consonant-vowel-consonant, where the consonant on the right is not w, x
+or y. If the short vowel test is satisfied, m = 1 is equivalent to the
+cursor being at p1. So the second part of the rule means, map at, bl, iz
+to ate, ble, ize; map certain double letters to single letters; and
+add e after a short vowel in words of one syllable.
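That second part of the rule can be paraphrased in Python. This sketch assumes the ed/ing suffix has already been removed and that the measure m of the remaining stem is supplied by the caller; all names are illustrative.

```python
def is_cons(stem, i):
    c = stem[i]
    if c in "aeiou":
        return False
    if c == "y":
        return i == 0 or not is_cons(stem, i - 1)
    return True

def ends_cvc(stem):
    # *o: ends consonant-vowel-consonant, final consonant not w, x or y.
    if len(stem) < 3:
        return False
    return (is_cons(stem, len(stem) - 3)
            and not is_cons(stem, len(stem) - 2)
            and is_cons(stem, len(stem) - 1)
            and stem[-1] not in "wxy")

def adjust(stem, m):
    # Map at/bl/iz to ate/ble/ize; undouble most double consonants;
    # add 'e' after a short stem of measure one.
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    if (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_cons(stem, len(stem) - 1) and stem[-1] not in "lsz"):
        return stem[:-1]
    if m == 1 and ends_cvc(stem):
        return stem + "e"
    return stem

print(adjust("conflat", 2))  # -> "conflate"
print(adjust("hopp", 1))     # -> "hop"
print(adjust("fil", 1))      # -> "file"
```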
+
+
+
+We first need two extra groupings,
+
+
+
define v        'aeiouy'
+define v_WXY    v + 'wxY'  // v with 'w', 'x' and 'y'-consonant
+define v_LSZ    v + 'lsz'  // v with 'l', 's', 'z'
+
+
+
+
+and a test for a short vowel,
+
+
+
define shortv as ( non-v_WXY v non-v )
+
+
+
+
+(The v_WXY test comes first because we are scanning backwards, from right to
+left.)
+
+
+
+The double to single letter map can be done as follows: first define the
+slice as the next non-v_LSZ and copy it to a string, ch, as a single
+character,
+
+
+
strings ( ch )
+
+/* ... */
+
+[non-v_LSZ] -> ch
+
+
+
+
+A further test, ch, tests that the next letter of the string is the same
+as the one in ch, and if this gives signal t, delete deletes the slice,
+
+But we can improve the appearance, and speed, of this by turning the
+second part of the rule into another among command, noting that the only
+letters that need undoubling are b, d, f, g, m, n, p, r
+and t,
+
+
+
define Step_1b as (
+    [substring] among (
+        'eed' (R1 <- 'ee')
+        'ed'
+        'ing' (
+            test gopast v  delete
+            test substring among(
+                'at' 'bl' 'iz'
+                     (<+ 'e')
+                'bb' 'dd' 'ff' 'gg' 'mm' 'nn' 'pp' 'rr' 'tt'
+                // ignoring double c, h, j, k, q, v, w, and x
+                     ([next] delete)
+                ''   (atmark p1  test shortv  <+ 'e')
+            )
+        )
+    )
+)
+
+
+
+
+Note the null string in the second among, which acts as a default case.
+
+
+
+The Porter stemmer in Snowball is given below. This is an exact
+implementation of the algorithm described in the 1980 paper, unlike the
+other implementations distributed by the author, which have, and have
+always had, three small points of difference (clearly indicated) from the
+original algorithm. Since all other implementations of the algorithm seen
+by the author are in some degree inexact, this may well be the first ever
+correct implementation.
+
+Here is a case study on how to code up a stemming algorithm in Snowball. First,
+the definition of the Porter stemmer, as it appeared in Program, Vol 14 no. 3 pp
+130-137, July 1980.
+
+
+
+
THE ALGORITHM
+
+
+A consonant in a word is a letter other than A, E, I, O or U, and other
+than Y preceded by a consonant. (The fact that the term ‘consonant’ is
+defined to some extent in terms of itself does not make it ambiguous.) So in
+TOY the consonants are T and Y, and in SYZYGY they are S, Z and G. If a
+letter is not a consonant it is a vowel.
+
+
+
+A consonant will be denoted by c, a vowel by v. A list ccc... of length
+greater than 0 will be denoted by C, and a list vvv... of length greater
+than 0 will be denoted by V. Any word, or part of a word, therefore has one
+of the four forms:
+
+
+
+
CVCV ... C
+
CVCV ... V
+
VCVC ... C
+
VCVC ... V
+
+
+
+These may all be represented by the single form
+
+
+
+ [C]VCVC ... [V]
+
+
+
+where the square brackets denote arbitrary presence of their contents.
+Using (VC)m to denote VC repeated m times, this may again be written as
+
+
+
+ [C](VC)m[V].
+
+
+
+m will be called the measure of any word or word part when represented in
+this form. The case m = 0 covers the null word. Here are some examples:
+
+
+
+
m=0
TR, EE, TREE, Y, BY.
+
m=1
TROUBLE, OATS, TREES, IVY.
+
m=2
TROUBLES, PRIVATE, OATEN, ORRERY.
+
+
+
+The rules for removing a suffix will be given in the form
+
+
+
+ (condition) S1 → S2
+
+
+
+This means that if a word ends with the suffix S1, and the stem before S1
+satisfies the given condition, S1 is replaced by S2. The condition is
+usually given in terms of m, e.g.
+
+
+
+ (m > 1) EMENT →
+
+
+
+Here S1 is ‘EMENT’ and S2 is null. This would map REPLACEMENT to REPLAC,
+since REPLAC is a word part for which m = 2.
+
+
+
+The ‘condition’ part may also contain the following:
+
+
+
+
*S
-
the stem ends with S (and similarly for the other letters).
+
+
*v*
-
the stem contains a vowel.
+
+
*d
-
the stem ends with a double consonant (e.g. -TT, -SS).
+
+
*o
-
the stem ends cvc, where the second c is not W, X or Y (e.g.
+ -WIL, -HOP).
+
+
+
+And the condition part may also contain expressions with and, or and
+not, so that
+
+
+
+ (m>1 and (*S or *T))
+
+
+
+tests for a stem with m>1 ending in S or T, while
+
+
+
+ (*d and not (*L or *S or *Z))
+
+
+
+tests for a stem ending with a double consonant other than L, S or Z.
+Elaborate conditions like this are required only rarely.
+
+
+
+In a set of rules written beneath each other, only one is obeyed, and this
+will be the one with the longest matching S1 for the given word. For
+example, with
+
+
+
+
SSES
→
SS
+
IES
→
I
+
SS
→
SS
+
S
→
+
+
+
+(here the conditions are all null) CARESSES maps to CARESS since SSES is
+the longest match for S1. Equally CARESS maps to CARESS (S1=‘SS’) and CARES
+to CARE (S1=‘S’).
+
+
+
+In the rules below, examples of their application, successful or otherwise,
+are given on the right in lower case. The algorithm now follows:
+
+
+
+Step 1a
+
+
+
+
SSES
→
SS
caresses
→
caress
+
IES
→
I
ponies
→
poni
+
ties
→
ti
+
SS
→
SS
caress
→
caress
+
S
→
cats
→
cat
+
+
+
+Step 1b
+
+
+
+
(m>0) EED
→
EE
feed
→
feed
+
agreed
→
agree
+
(*v*) ED
→
plastered
→
plaster
+
bled
→
bled
+
(*v*) ING
→
motoring
→
motor
+
sing
→
sing
+
+
+
+If the second or third of the rules in Step 1b is successful, the following
+is done:
+
+
+
+
AT
→
ATE
conflat(ed)
→
conflate
+
BL
→
BLE
troubl(ed)
→
trouble
+
IZ
→
IZE
siz(ed)
→
size
+
(*d and not (*L or *S or *Z))
+
→
single letter
hopp(ing)
→
hop
+
tann(ed)
→
tan
+
fall(ing)
→
fall
+
hiss(ing)
→
hiss
+
fizz(ed)
→
fizz
+
(m=1 and *o)
+
→
E
fail(ing)
→
fail
+
fil(ing)
→
file
+
+
+
+The rule to map to a single letter causes the removal of one of the double
+letter pair. The -E is put back on -AT, -BL and -IZ, so that the suffixes
+-ATE, -BLE and -IZE can be recognised later. This E may be removed in step
+4.
+
+
+
+Step 1c
+
+
+
+
(*v*) Y
→
I
happy
→
happi
+
sky
→
sky
+
+
+
+Step 1 deals with plurals and past participles. The subsequent steps are
+much more straightforward.
+
+
+
+Step 2
+
+
+
+
(m>0) ATIONAL
→
ATE
relational
→
relate
+
(m>0) TIONAL
→
TION
conditional
→
condition
+
rational
→
rational
+
(m>0) ENCI
→
ENCE
valenci
→
valence
+
(m>0) ANCI
→
ANCE
hesitanci
→
hesitance
+
(m>0) IZER
→
IZE
digitizer
→
digitize
+
(m>0) ABLI
→
ABLE
conformabli
→
conformable
+
(m>0) ALLI
→
AL
radicalli
→
radical
+
(m>0) ENTLI
→
ENT
differentli
→
different
+
(m>0) ELI
→
E
vileli
→
vile
+
(m>0) OUSLI
→
OUS
analogousli
→
analogous
+
(m>0) IZATION
→
IZE
vietnamization
→
vietnamize
+
(m>0) ATION
→
ATE
predication
→
predicate
+
(m>0) ATOR
→
ATE
operator
→
operate
+
(m>0) ALISM
→
AL
feudalism
→
feudal
+
(m>0) IVENESS
→
IVE
decisiveness
→
decisive
+
(m>0) FULNESS
→
FUL
hopefulness
→
hopeful
+
(m>0) OUSNESS
→
OUS
callousness
→
callous
+
(m>0) ALITI
→
AL
formaliti
→
formal
+
(m>0) IVITI
→
IVE
sensitiviti
→
sensitive
+
(m>0) BILITI
→
BLE
sensibiliti
→
sensible
+
+
+
+The test for the string S1 can be made fast by doing a program switch on
+the penultimate letter of the word being tested. This gives a fairly even
+breakdown of the possible values of the string S1. It will be seen in fact
+that the S1-strings in step 2 are presented here in the alphabetical order
+of their penultimate letter. Similar techniques may be applied in the other
+steps.
+
+
+
+Step 3
+
+
+
+
    (m>0) ICATE →  IC               triplicate     →  triplic
    (m>0) ATIVE →                   formative      →  form
    (m>0) ALIZE →  AL               formalize      →  formal
    (m>0) ICITI →  IC               electriciti    →  electric
    (m>0) ICAL  →  IC               electrical     →  electric
    (m>0) FUL   →                   hopeful        →  hope
    (m>0) NESS  →                   goodness       →  good
+
+
+
+Step 4
+
+
+
+
    (m>1) AL    →                   revival        →  reviv
    (m>1) ANCE  →                   allowance      →  allow
    (m>1) ENCE  →                   inference      →  infer
    (m>1) ER    →                   airliner       →  airlin
    (m>1) IC    →                   gyroscopic     →  gyroscop
    (m>1) ABLE  →                   adjustable     →  adjust
    (m>1) IBLE  →                   defensible     →  defens
    (m>1) ANT   →                   irritant       →  irrit
    (m>1) EMENT →                   replacement    →  replac
    (m>1) MENT  →                   adjustment     →  adjust
    (m>1) ENT   →                   dependent      →  depend
    (m>1 and (*S or *T)) ION →      adoption       →  adopt
    (m>1) OU    →                   homologou      →  homolog
    (m>1) ISM   →                   communism      →  commun
    (m>1) ATE   →                   activate       →  activ
    (m>1) ITI   →                   angulariti     →  angular
    (m>1) OUS   →                   homologous     →  homolog
    (m>1) IVE   →                   effective      →  effect
    (m>1) IZE   →                   bowdlerize     →  bowdler
+
+
+
+The suffixes are now removed. All that remains is a little tidying up.
+
+
+
+Step 5a
+
+
+
+
    (m>1) E →                       probate        →  probat
                                    rate           →  rate
    (m=1 and not *o) E →            cease          →  ceas
+
+
+
+Step 5b
+
+
+
+
    (m > 1 and *d and *L) →  single letter
                                    controll       →  control
                                    roll           →  roll
+
+
+
+
+Now, turning it into Snowball.
+
+
+
+The Porter stemmer makes use of a measure, m, of the length of a word or
+word part. If C is a sequence of one or more consonants, and V a sequence
+of one or more vowels, any word part has the form
+
+
+
+ [C](VC)m[V],
+
+
+
+which is to be read as an optional C, followed by m repetitions of VC,
+followed by an optional V. This defines m. So for crepuscular the
+measure would be 4.
+
+
+
+ c r e p u s c u l a r
+ | | | | |
+ [C] V C V C V C V C
+ 1 2 3 4
+
+
+
+Most of the rules for suffix removal involve leaving behind a stem whose
+measure exceeds some value, for example,
+
+
+
+ (m > 0) eed → ee
+
+
+
+means ‘replace eed with ee if the stem before eed has measure
+m > 0’. Implementations of the Porter stemmer usually have a routine that
+computes m each time there is a possible candidate for removal.
+
+
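Such a routine can be sketched as follows (a Python illustration, not taken from the paper). It classifies each letter and counts vowel-to-consonant transitions, with y counting as a vowel only when it follows a consonant, as in Porter's definition:

```python
VOWELS = set("aeiou")

def is_vowel(word, i):
    # Porter: a, e, i, o, u are vowels; y is a vowel when preceded
    # by a consonant (the y of "sky" is a vowel, the y of "toy" is not).
    ch = word[i]
    if ch in VOWELS:
        return True
    return ch == "y" and i > 0 and not is_vowel(word, i - 1)

def measure(word):
    # m = number of VC blocks in [C](VC)^m[V]: count each place where
    # a run of vowels is followed by a consonant.
    types = [is_vowel(word, i) for i in range(len(word))]
    return sum(1 for i in range(1, len(types))
               if types[i - 1] and not types[i])
```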
+
+In fact the only tests on m in the Porter stemmer are m > 0, m > 1, and,
+at two interesting points, m = 1. This suggests that there are two
+critical positions in a word: the point at which, going from left to
+right, m > 0 becomes true, and then the point at which m > 1 becomes true.
+It turns out that m > 0 becomes true at the point after the first consonant
+following a vowel, and m > 1 becomes true at the point after the first
+consonant following a vowel following a consonant following a vowel.
+Calling these positions p1 and p2, we can determine them quite simply in
+Snowball:
+
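In outline, each mark is made by two cursor movements: go past a vowel, then past a non-vowel, and record the position. The same marking sketched in Python (an illustration, not the Snowball code; the letter classification is passed in as a function):

```python
def mark_positions(word, is_vowel):
    """Return (p1, p2). p1 is the offset just past the first non-vowel
    that follows a vowel; p2 repeats the search starting from p1.
    is_vowel(word, i) classifies position i; len(word) means the
    position was never reached."""
    n = len(word)
    def gopast(i, vowel):
        # advance past the first position classified as `vowel`
        while i < n and is_vowel(word, i) != vowel:
            i += 1
        return i + 1 if i < n else n
    p1 = gopast(gopast(0, True), False)
    p2 = gopast(gopast(p1, True), False)
    return p1, p2
```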
+and using these tests instead of computing m is acceptable, so long as the
+stemming process never alters the p1 and p2 positions, which is indeed true
+in the Porter stemmer.
+
+
+
+A particularly interesting feature of the stemmers presented here is the
+common use they make of the positions p1 and p2. The details of marking
+p1
+and p2 vary between the languages because the definitions of vowel and
+consonant vary. For example, French i preceded and followed by vowel
+should be treated as a consonant (inquiétude); Portuguese (ã and õ
+should be treated as a vowel-consonant pair (São João). A third
+important position is pV, which tries to mark the position of the shortest
+acceptable verb stem. Its definition varies somewhat between languages.
+The Porter stemmer does not use a pV explicitly, but the idea appears when
+the verb endings ing and ed are removed only when preceded by a vowel.
+In English therefore pV would be defined as the position after the first
+vowel.
+
+
+
+The Porter stemmer is divided into five steps, step 1 is divided further
+into steps 1a, 1b and 1c, and step 5 into steps 5a and 5b. Step 1 removes
+the i-suffixes, and steps 2 to 4 the d-suffixes (*). Composite d-suffixes are
+reduced to single d-suffixes one at a time. So for example if a word ends
+icational, step 2 reduces it to icate and step 3 to ic. Three steps are
+sufficient for this process in English. Step 5 does some tidying up.
+
+
+
+One can see how easily the stemming rules translate into Snowball by
+comparing the definition of Step 1a from the 1980 paper,
+
+
+
+ Step 1a:
+ SSES → SS caresses → caress
+ IES → I ponies → poni
+ ties → ti
+ SS → SS caress → caress
+ S → cats → cat
+
+The word to be stemmed is being scanned right to left from the end. The
+longest of 'sses', 'ies', 'ss' or 's' is searched for and defined as the
+slice. (If none are found, Step_1a signals f.) If 'sses' is found, it is
+replaced by 'ss', and so on. Of course, replacing 'ss' by 'ss' is a dummy
+action, so we can write
+
+
+[% highlight("
+ 'ss' ()
+") %]
+
+
+instead of
+
+
+[% highlight("
+ 'ss' (<-'ss')
+") %]
+
+
+Remember that delete just means <- ''.
+
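For comparison, step 1a can be sketched in ordinary Python (an illustration only); testing the suffixes longest first is what implements the longest-match search:

```python
def step_1a(word):
    # Longest of the four suffixes wins, so test longest first.
    if word.endswith("sses"):
        return word[:-2]        # sses -> ss    caresses -> caress
    if word.endswith("ies"):
        return word[:-2]        # ies  -> i     ponies   -> poni
    if word.endswith("ss"):
        return word             # ss   -> ss    (the dummy action)
    if word.endswith("s"):
        return word[:-1]        # s    ->       cats     -> cat
    return word
```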
+
+
+The really tricky part of the whole algorithm is step 1b,
+which may be worth looking at in detail. Here it is, without the
+example words on the far right,
+
+
+
+ Step 1b:
+ (m > 0) EED → EE
+ (*v*) ED →
+ (*v*) ING →
+
+ If the second or third of the rules in Step 1b is successful, the
+ following is done:
+
+ AT → ATE
+ BL → BLE
+ IZ → IZE
+ (*d and not (*L or *S or *Z)) → single letter
+ (m = 1 and *o) → E
+
+
+
+The first part of the rule means that eed maps to ee if eed is in R1
+(which is equivalent to m > 0), or ed and ing are removed if they are
+preceded by a vowel. In Snowball this is simply,
+
+But this must be modified by the second part of the rule. *d indicates a
+test for double letter consonant — bb, dd etc. *L, *S, *Z are tests
+for l, s, z. *o is a short vowel test — it is matched by
+consonant-vowel-consonant, where the consonant on the right is not w, x
+or y. If the short vowel test is satisfied, m = 1 is equivalent to the
+cursor being at p1. So the second part of the rule means, map at, bl, iz
+to ate, ble, ize; map certain double letters to single letters; and
+add e after a short vowel in words of one syllable.
+
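Before turning to the Snowball version, the second part of the rule can be sketched in Python (illustrative only; the *d and *o tests below use a plain a e i o u vowel set, ignoring Porter's special treatment of y):

```python
VOWELS = "aeiou"

def ends_cvc(stem):
    # *o : consonant-vowel-consonant, where the final consonant
    # is not w, x or y.
    if len(stem) < 3:
        return False
    c1, v, c2 = stem[-3:]
    return (c1 not in VOWELS and v in VOWELS
            and c2 not in VOWELS and c2 not in "wxy")

def step_1b_fixup(stem, m):
    """Applied only after -ed or -ing has just been removed."""
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"              # conflat(ed) -> conflate
    if (len(stem) >= 2 and stem[-1] == stem[-2]
            and stem[-1] not in VOWELS and stem[-1] not in "lsz"):
        return stem[:-1]               # hopp(ing) -> hop, but fizz stays
    if m == 1 and ends_cvc(stem):
        return stem + "e"              # fil(ing) -> file
    return stem
```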
+
+
+We first need two extra groupings,
+
+
+[% highlight("
+ define v 'aeiouy'
+ define v_WXY v + 'wxY' // v with 'w', 'x' and 'y'-consonant
+ define v_LSZ v + 'lsz' // v with 'l', 's', 'z'
+") %]
+
+
+and a test for a short vowel,
+
+
+[% highlight("
+ define shortv as ( non-v_WXY v non-v )
+") %]
+
+
+(The v_WXY test comes first because we are scanning backwards, from right to
+left.)
+
+
+
+The double to single letter map can be done as follows: first define the
+slice as the next non-v_LSZ and copy it to a string, ch, as a single
+character,
+
+
+[% highlight("
+ define Step_1b as (
+ [substring] among (
+ 'eed' (R1 <-'ee')
+ 'ed'
+ 'ing' (
+ test gopast v delete
+ (test among('at' 'bl' 'iz') <+ 'e')
+ or
+ ([non-v_LSZ]->ch ch delete)
+ or
+ (atmark p1 test shortv <+ 'e')
+ )
+ )
+ )
+") %]
+
+
+But we can improve the appearance, and speed, of this by turning the
+second part of the rule into another among command, noting that the only
+letters that need undoubling are b, d, f, g, m, n, p, r
+and t,
+
+
+[% highlight("
+ define Step_1b as (
+ [substring] among (
+ 'eed' (R1 <-'ee')
+ 'ed'
+ 'ing' (
+ test gopast v delete
+ test substring among(
+ 'at' 'bl' 'iz'
+ (<+ 'e')
+ 'bb' 'dd' 'ff' 'gg' 'mm' 'nn' 'pp' 'rr' 'tt'
+ // ignoring double c, h, j, k, q, v, w, and x
+ ([next] delete)
+ '' (atmark p1 test shortv <+ 'e')
+ )
+ )
+ )
+ )
+") %]
+
+
+Note the null string in the second among, which acts as a default case.
+
+
+
+The Porter stemmer in Snowball is given below. This is an exact
+implementation of the algorithm described in the 1980 paper, unlike the
+other implementations distributed by the author, which have, and have
+always had, three small points of difference (clearly indicated) from the
+original algorithm. Since all other implementations of the algorithm seen
+by the author are in some degree inexact, this may well be the first ever
+correct implementation.
+
+Letters in Portuguese include the following accented forms,
+
+
+
+ á é í ó ú â ê ô ç ã õ ü
+
+The following letters are vowels:
+
+ a e i o u á é í ó ú â ê ô
+
+And the two nasalised vowel forms,
+
+ ã õ
+
+
+
+should be treated as a vowel followed by a consonant.
+
+
+
+ã and õ are therefore replaced by a~ and o~ in the word, where ~ is a
+separate character to be treated as a consonant. And then —
+
+
+
+R2
+(see the note on R1 and R2)
+and RV have the same definition as in the
+ Spanish stemmer.
+
+
+
+Always do step 1.
+
+
+
+Step 1: Standard suffix removal
+
+
+
+ Search for the longest among the following suffixes, and perform the
+ action indicated.
+
+
eza ezas ico ica icos icas ismo ismos
+ ável ível ista istas oso osa
+ osos osas amento amentos imento imentos
+ adora ador aça~o adoras adores aço~es
+ ante antes ância
+
delete if in R2
+
logia logias
+
replace with log if in R2
+
ução uções
+
replace with u if in R2
+
ência ências
+
replace with ente if in R2
+
amente
+
delete if in R1
+
if preceded by iv, delete if in R2 (and if further preceded by at,
+ delete if in R2), otherwise,
+
if preceded by os, ic or ad, delete if in R2
+
mente
+
delete if in R2
+
if preceded by ante, avel or ível, delete if in R2
+
idade idades
+
delete if in R2
+
if preceded by abil, ic or iv, delete if in R2
+
iva ivo ivas ivos
+
delete if in R2
+
if preceded by at, delete if in R2
+
ira iras
+
replace with ir if in RV and preceded by e
+
+
+
+
+Do step 2 if no ending was removed by step 1.
+
+
+
+Step 2: Verb suffixes
+
+
+
+ Search for the longest among the following suffixes in RV, and if found,
+ delete.
+
+ ada ida ia aria eria iria ará ara erá era irá ava asse esse
+ isse aste este iste ei arei erei irei am iam ariam eriam iriam
+ aram eram iram avam em arem erem irem assem essem issem ado ido
+ ando endo indo ara~o era~o ira~o ar er ir as adas idas ias arias
+ erias irias arás aras erás eras irás avas es ardes erdes
+ irdes ares eres ires asses esses isses astes estes istes is ais
+ eis íeis aríeis eríeis iríeis áreis areis éreis ereis
+ íreis ireis ásseis ésseis ísseis áveis ados idos ámos
+ amos íamos aríamos eríamos iríamos áramos éramos
+ íramos ávamos emos aremos eremos iremos ássemos êssemos
+ íssemos imos armos ermos irmos eu iu ou ira
+ iras
+
+If the last step to be obeyed — either step 1 or 2 — altered the word,
+do step 3
+
+Step 3
+
+ Delete suffix i if in RV and preceded by c
+
+
+
+Alternatively, if neither steps 1 nor 2 altered the word, do step 4
+
+
+
+Step 4: Residual suffix
+
+
+
+ If the word ends with one of the suffixes
+
+ os a i o á í ó
+
+ in RV, delete it
+
+
+
+Always do step 5
+
+
+
+Step 5:
+
+
+
+
+ If the word ends with one of
+
+
+ e é ê
+
+
+ in RV, delete it, and if preceded by gu (or ci) with the u (or i) in RV,
+ delete the u (or i).
+
+
+
+
+ Or if the word ends ç remove the cedilla
+
+
+
+
+And finally:
+
+
+
+ Turn a~, o~ back into ã, õ
+
+
+
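The sequencing of the steps above can be outlined as follows (a Python sketch in which the individual steps are assumed to be supplied as functions returning the rewritten word and a changed flag; it illustrates only the control flow and the a~/o~ mapping, not the suffix rules themselves):

```python
def stem_portuguese(word, step1, step2, step3, step4, step5):
    # ã and õ become a~ and o~, with ~ acting as a consonant.
    word = word.replace("ã", "a~").replace("õ", "o~")
    word, changed = step1(word)        # always do step 1
    if not changed:
        word, changed = step2(word)    # step 2 only if step 1 removed nothing
    if changed:
        word, _ = step3(word)          # step 3 if step 1 or 2 altered the word
    else:
        word, _ = step4(word)          # otherwise step 4
    word, _ = step5(word)              # always do step 5
    return word.replace("a~", "ã").replace("o~", "õ")
```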
The same algorithm in Snowball
+
+[% highlight_file('portuguese') %]
+
+[% footer %]
diff --git a/algorithms/portuguese/stop.txt b/algorithms/portuguese/stop.txt
new file mode 100644
index 0000000..9c3c9ac
--- /dev/null
+++ b/algorithms/portuguese/stop.txt
@@ -0,0 +1,245 @@
+
+ | A Portuguese stop word list. Comments begin with vertical bar. Each stop
+ | word is at the start of a line.
+
+
+ | The following is a ranked list (commonest to rarest) of stopwords
+ | deriving from a large sample of text.
+
+ | Extra words have been added at the end.
+
+de | of, from
+a | the; to, at; her
+o | the; him
+que | who, that
+e | and
+do | de + o
+da | de + a
+em | in
+um | a
+para | for
+ | é from SER
+com | with
+não | not, no
+uma | a
+os | the; them
+no | em + o
+se | himself etc
+na | em + a
+por | for
+mais | more
+as | the; them
+dos | de + os
+como | as, like
+mas | but
+ | foi from SER
+ao | a + o
+ele | he
+das | de + as
+ | tem from TER
+à | a + a
+seu | his
+sua | her
+ou | or
+ | ser from SER
+quando | when
+muito | much
+ | há from HAV
+nos | em + os; us
+já | already, now
+ | está from EST
+eu | I
+também | also
+só | only, just
+pelo | per + o
+pela | per + a
+até | up to
+isso | that
+ela | she
+entre | between
+ | era from SER
+depois | after
+sem | without
+mesmo | same
+aos | a + os
+ | ter from TER
+seus | his
+quem | whom
+nas | em + as
+me | me
+esse | that
+eles | they
+ | estão from EST
+você | you
+ | tinha from TER
+ | foram from SER
+essa | that
+num | em + um
+nem | nor
+suas | her
+meu | my
+às | a + as
+minha | my
+ | têm from TER
+numa | em + uma
+pelos | per + os
+elas | they
+ | havia from HAV
+ | seja from SER
+qual | which
+ | será from SER
+nós | we
+ | tenho from TER
+lhe | to him, her
+deles | of them
+essas | those
+esses | those
+pelas | per + as
+este | this
+ | fosse from SER
+dele | of him
+
+ | other words. There are many contractions such as naquele = em+aquele,
+ | mo = me+o, but they are rare.
+ | Indefinite article plural forms are also rare.
+
+tu | thou
+te | thee
+vocês | you (plural)
+vos | you
+lhes | to them
+meus | my
+minhas
+teu | thy
+tua
+teus
+tuas
+nosso | our
+nossa
+nossos
+nossas
+
+dela | of her
+delas | of them
+
+esta | this
+estes | these
+estas | these
+aquele | that
+aquela | that
+aqueles | those
+aquelas | those
+isto | this
+aquilo | that
+
+ | forms of estar, to be (not including the infinitive):
+estou
+está
+estamos
+estão
+estive
+esteve
+estivemos
+estiveram
+estava
+estávamos
+estavam
+estivera
+estivéramos
+esteja
+estejamos
+estejam
+estivesse
+estivéssemos
+estivessem
+estiver
+estivermos
+estiverem
+
+ | forms of haver, to have (not including the infinitive):
+hei
+há
+havemos
+hão
+houve
+houvemos
+houveram
+houvera
+houvéramos
+haja
+hajamos
+hajam
+houvesse
+houvéssemos
+houvessem
+houver
+houvermos
+houverem
+houverei
+houverá
+houveremos
+houverão
+houveria
+houveríamos
+houveriam
+
+ | forms of ser, to be (not including the infinitive):
+sou
+somos
+são
+era
+éramos
+eram
+fui
+foi
+fomos
+foram
+fora
+fôramos
+seja
+sejamos
+sejam
+fosse
+fôssemos
+fossem
+for
+formos
+forem
+serei
+será
+seremos
+serão
+seria
+seríamos
+seriam
+
+ | forms of ter, to have (not including the infinitive):
+tenho
+tem
+temos
+têm
+tinha
+tínhamos
+tinham
+tive
+teve
+tivemos
+tiveram
+tivera
+tivéramos
+tenha
+tenhamos
+tenham
+tivesse
+tivéssemos
+tivessem
+tiver
+tivermos
+tiverem
+terei
+terá
+teremos
+terão
+teria
+teríamos
+teriam
diff --git a/algorithms/romance.tt b/algorithms/romance.tt
new file mode 100644
index 0000000..2c1d703
--- /dev/null
+++ b/algorithms/romance.tt
@@ -0,0 +1,106 @@
+[% header('Romance language stemmers') %]
+
+
+The Romance languages have a wealth of different i-suffixes (*) among the verb
+forms, and relatively few for the other parts of speech. In addition to
+this, many verbs exhibit irregularities. Many also have short stems,
+leading to dangers of over-stemming. The verb, therefore, tends to
+dominate initial thinking about stemming in these languages.
+
+
+
+An algorithmic stemmer can usually reduce the multiple forms of a verb to at
+most two or three, and often just one. This is probably
+adequate for standard IR use, where the verb is used rather less than other
+parts of speech in short queries.
+
+
+
+In French the verb endings ent and ons cannot be removed without
+unacceptable overstemming. The ons form is rarer, but ent forms
+are quite common, and will appear regularly throughout a stemmed vocabulary.
+
+
+
+In Italian, the final vowel of nouns and adjectives indicates number and
+gender (amico is male friend, amica is female friend) and its removal is a
+necessary part of stemming, but the final vowel sometimes separates words
+of different meanings (banco is bench, banca is bank), which leads to some
+over-stemming.
+
+
+
+The d-suffixes of all four languages follow a similar pattern. They can be
+tabulated as follows,
+
+
+
+
+
                        French     Spanish    Portug.    Italian

  noun       ANCE       ance       anza       eza        anza
  adjective  IC         ique       ico        ico        ico
  noun       ISM        isme       ismo       ismo       ismo
  adjective  ABLE       able       able       ável       abile
  adjective  IBLE       -          ible       ível       ibile
  noun       IST        iste       ista       ista       ista
  adjective  OUS        eux        oso        oso        oso
  noun       MENT       ment       amiento    amento     mente
  noun       ATOR       ateur      ador       ador       attore
  noun       ATRESS     atrice     -          -          atrice
  noun       ATION      ation      ación      ação       azione
  noun       LOGY       logie      logía      logía      logia
  noun       USION      usion      ución      ución      uzione
  noun       ENCE       ence       encia      ência      enza
  adjective  ENT        ent        ente       ente       ente

  noun       ANCE       ance       ancia      ância      anza
  noun       ANT        ant        ante       ante       ante

  adverb     LY         (e)ment    (a)mente   (a)mente   (a)mente
  noun       ITY        ité        idad       idade      ità
  adjective  IVE        if         ive        ivo        ivo
  verb       ATE        at         at         at         at
+
+
+
+Equivalent English forms are shown in upper case. In English, ATE is a valid ending, but
+in the Romance languages it only exists in combinations. The endings can appear in a
+number of styles: in Italian, oso can also be osa, osi or ose; in French,
+ique becomes ic in combinations.
+
+
+
+The important combining forms are summarised in the following picture:
+
+
+
+
+
+In English, ABLE combines with LY to form ABLY. So in French, for example,
+able combines with (e)ment to form ablement.
+In some languages particular combinations are rare. In Italian, for example,
+ANT + LY, which would be the ending antemente, is so rare that it does not
+figure in the stemming algorithm.
+According to the picture, we
+should encounter the forms ICATIVELY and ICATIVITY, and dictionaries
+instance a few English words with these endings (communicatively for
+example).
+But in practice three is the maximum number of derivational
+suffixes that one need consider in combination.
+
+(For the background to this work, see the
+credits page. Following earlier misgivings on the wisdom
+of removing IST/ISM endings, in this stemmer they are now conflated to a single
+form. It can easily be modified to bring it in line with the other Romance
+stemmers: see the internal comments marked ‘IST’.
+
+
+
+It is assumed that hyphenated forms are split into separate words prior to
+stemming.)
+
+
+
The stemming algorithm
+
+
+Letters in Romanian include the following accented forms,
+
+
+
+ ă â î ș ț
+
+
+
+The following letters are vowels:
+
+
+
+ a ă â e i î o u
+
+
+
+Before full Unicode support was widespread it was common to use ş and
+ţ (cedilla instead of comma-below) in Romanian text as these characters
+were more readily available in 8-bit character sets. The original version of
+this algorithm only recognised the cedilla forms, but the current version
+instead normalises the old forms as a first step: replace ş by
+ș and ţ by ț.
+
+
+
+Then, i and u between vowels are put into upper case
+(so that they are treated as consonants).
+
+
+
+R1, R2
+(see the note on R1 and R2)
+and RV then have the same definition as in the
+ Spanish stemmer.
+
+
+
+Always do steps 0, 1, 2 and 4. (Step 3 is conditional on steps 1 and 2.)
+
+
+
+Step 0: Removal of plurals (and other simplifications)
+
+
+
+ Search for the longest among the following suffixes, and, if
+ it is in R1, perform the
+ action indicated.
+
+
ul ului
+
delete
+
aua
+
replace with a
+
ea ele elor
+
replace with e
+
ii iua iei iile iilor ilor
+
replace with i
+
ile
+
replace with i if not preceded by ab
+
atei
+
replace with at
+
ație ația
+
replace with ați
+
+
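Step 0's pattern (longest matching suffix, with the match required to lie in R1) recurs throughout the algorithm. A generic helper can be sketched in Python (illustrative; p1 is the offset at which R1 starts, and the rule set shown is only a subset of the step 0 table above, without the "not preceded by ab" condition on ile):

```python
def apply_in_r1(word, p1, rules):
    """Apply the longest suffix rule whose matched suffix lies
    entirely within R1 (the region from offset p1 onwards)."""
    for suffix in sorted(rules, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= p1:
            return word[:-len(suffix)] + rules[suffix], True
    return word, False

# A subset of the step 0 rules above.
STEP0 = {"ul": "", "ului": "", "aua": "a",
         "ea": "e", "ele": "e", "elor": "e",
         "ii": "i", "ile": "i", "ilor": "i"}
```

For example, for copilului the region R1 starts at offset 3, the longest matching suffix in R1 is ului, and the rule deletes it.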
+
+
+Step 1: Reduction of combining suffixes
+
+
+
+ Search for the longest among the following suffixes, and, if
+ it is in R1, perform the replacement action indicated.
+ Then repeat this step until no replacement occurs.
+
+
+
+Step 2: Removal of ‘standard’ suffixes
+
+
+
+ Search for the longest among the following suffixes, and, if
+ it is in R2, perform the action indicated.
+
+
at ata ată ati ate
+ ut uta ută uti ute
+ it ita ită iti ite
+ ic ica ice ici ică
+ abil abila abile abili abilă
+ ibil ibila ibile ibili ibilă
+ oasa oasă oase os osi oși
+ ant anta ante anti antă
+ ator atori
+ itate itati ităi ități
+ iv iva ive ivi ivă
+
delete
+
iune iuni
+
delete if preceded by ț, and replace the ț by t.
+
ism isme
+ ist ista iste isti istă iști
+
replace with ist
+
+
+
+
+Do step 3 if no suffix was removed either by step 1 or step 2.
+
+
+
+Step 3: Removal of verb suffixes
+
+
+
+ Search for the longest suffix in region RV among the following,
+ and perform the action indicated.
+
+
are ere ire âre
+ ind ând
+ indu ându
+ eze
+ ească
+ ez ezi ează esc ești
+ ește
+ ăsc ăști
+ ăște
+ am ai au
+ eam eai ea eați eau
+ iam iai ia iați iau
+ ui
+ ași arăm arăți ară
+ uși urăm urăți ură
+ iși irăm irăți iră
+ âi âși ârăm ârăți âră
+ asem aseși ase aserăm aserăți aseră
+ isem iseși ise iserăm iserăți iseră
+ âsem âseși âse âserăm âserăți âseră
+ usem useși use userăm userăți useră
+
+
delete if preceded in RV by a consonant or u
+
ăm ați
+ em eți
+ im iți
+ âm âți
+ seși serăm serăți seră
+ sei se
+ sesem seseși sese seserăm seserăți seseră
+
 delete
+
+i-suffixes (*) of Russian tend to be quite regular, with irregularities of
+declension involving a change to the stem. Irregular forms therefore
+usually just generate two or more possible stems. Stems in Russian can
+be very short, and many of the suffixes are also particle words that make
+‘natural stopwords’, so a tempting way of running the stemmer is to set a
+minimum stem length of zero, and thereby reduce to null all words which
+are made up entirely of suffix parts. We have been a little more cautious,
+and have insisted that a minimum stem contains one vowel.
+
+
+
+The 32 letters of the Russian alphabet are as follows, with the
+transliterated forms that we will use here shown in brackets:
+
+
+
+
   а (a)    б (b)    в (v)    г (g)    д (d)    е (e)    ж (zh)   з (z)
   и (i)    й (ì)    к (k)    л (l)    м (m)    н (n)    о (o)    п (p)
   р (r)    с (s)    т (t)    у (u)    ф (f)    х (kh)   ц (ts)   ч (ch)
   ш (sh)   щ (shch) ъ (")    ы (y)    ь (')    э (è)    ю (iu)   я (ia)
+
+
+
+
+There is a 33rd letter, ё (e"), but it is rarely used and often
+replaced by е in informal writing. The original algorithm here assumed it
+had already been mapped to е (e); since 2018-03-16 the Snowball
+implementation we provide performs this mapping for you.
+
+
+
+The following are vowels:
+
+
+
+ а (a) е (e) и (i) о (o) у (u) ы (y)
+ э (è) ю (iu) я (ia)
+
+
+
+In any word, RV is the region after the first vowel, or the end of the word
+if it contains no vowel.
+
+
+
+R1 is the region after the first non-vowel following a vowel, or the end of
+the word if there is no such non-vowel.
+
+
+
+R2 is the region after the first non-vowel following a vowel in R1, or the
+end of the word if there is no such non-vowel.
+
+
+
+For example:
+
+
+
+ p r o t i v o e s t e s t v e n n o m
+      |<------------ RV ------------->|
+        |<----------- R1 ------------>|
+            |<--------- R2 ---------->|
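The three regions can be computed with the same cursor-walking idiom used for the other stemmers. A Python sketch (illustrative; the vowel set is the one listed above, and an offset equal to the word length means the region is empty):

```python
RU_VOWELS = set("аеиоуыэюя")

def regions(word):
    """Return (rv, r1, r2) as start offsets into the word."""
    n = len(word)
    def past(i, vowel):
        # move past the first position that is (or is not) a vowel
        while i < n and (word[i] in RU_VOWELS) != vowel:
            i += 1
        return i + 1 if i < n else n
    rv = past(0, True)                  # after the first vowel
    r1 = past(rv, False)                # after first non-vowel after a vowel
    r2 = past(past(r1, True), False)    # the same again, inside R1
    return rv, r1, r2
```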
+
+ADJECTIVE:
+
+ ее (ee) ие (ie) ые (ye) ое (oe) ими (imi) ыми
+ (ymi) ей (eì) ий (iì) ый (yì) ой (oì) ем
+ (em) им (im) ым (ym) ом (om) его (ego) ого (ogo)
+ ему (emu) ому (omu) их (ikh) ых (ykh) ую (uiu)
+ юю (iuiu) ая (aia) яя (iaia)
+ ою (oiu)
+ ею (eiu)
+
+VERB:
+
+ group 1: ла (la) на (na) ете (ete) йте (ìte) ли (li)
+ й (ì) л (l) ем (em) н (n) ло (lo) но (no) ет
+ (et) ют (iut) ны (ny) ть (t') ешь (esh') нно (nno)
+
+
+
+ group 2: ила (ila) ыла (yla) ена (ena) ейте (eìte)
+ уйте (uìte) ите (ite) или (ili) ыли
+ (yli) ей (eì) уй (uì) ил (il) ыл (yl) им (im)
+ ым (ym) ен (en) ило (ilo) ыло (ylo) ено (eno) ят
+ (iat) ует (uet) уют (uiut) ит (it) ыт (yt) ены
+ (eny) ить (it') ыть (yt') ишь (ish')
+ ую (uiu) ю (iu)
+
+
+
+
+group 1 endings must follow а (a) or я (ia)
+
+
+
+NOUN:
+
+
+
+
+а (a) ев (ev) ов (ov) ие (ie) ье ('e) е (e) иями
+(iiami) ями (iami) ами (ami) еи (ei) ии (ii) и (i)
+ией (ieì) ей (eì) ой (oì) ий (iì) й (ì)
+иям (iiam) ям (iam) ием (iem) ем (em) ам (am) ом
+(om) о (o) у (u) ах (akh) иях (iiakh) ях (iakh) ы
+(y) ь (') ию (iiu) ью ('iu) ю (iu) ия (iia) ья
+('ia) я (ia)
+
+
+
+
+SUPERLATIVE:
+
+
+
+
+ ейш (eìsh) ейше (eìshe)
+
+
+
+
+These are all i-suffixes. The list of d-suffixes is very short,
+
+
+
+DERIVATIONAL:
+
+
+
+
+ ост (ost) ость (ost')
+
+
+
+
+Define an ADJECTIVAL ending as an ADJECTIVE ending optionally preceded
+by a PARTICIPLE ending.
+
+
+
+ For example, in
+
+
бегавшая
=
бега
+
вш
+
ая
+
(begavshaia
=
bega
+
vsh
+
aia)
+
+ ая (aia) is an adjective ending, and вш (vsh) a participle ending of group 1
+ (preceded by the final а (a) of бега (bega)), so вшая (vshaia) is an
+ adjectival ending.
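The ADJECTIVAL definition can be sketched as follows (a hedged illustration: the suffix tuples are small, hypothetical subsets of the full ADJECTIVE and group 1 PARTICIPLE classes, and the RV restriction is omitted):

```python
# Sketch of the ADJECTIVAL rule: an ADJECTIVE ending optionally preceded by
# a PARTICIPLE ending.  Abbreviated suffix subsets, for illustration only.
ADJECTIVE_SUBSET = ("ими", "ая", "ое", "ый")
PARTICIPLE_1 = ("вш", "ющ", "ем", "нн", "щ")  # group 1: must follow а or я

def strip_adjectival(word):
    for adj in ADJECTIVE_SUBSET:
        if word.endswith(adj):
            word = word[:-len(adj)]
            for part in PARTICIPLE_1:
                if word.endswith(part) and word[:-len(part)].endswith(("а", "я")):
                    return word[:-len(part)]
            return word
    return word
```

So бегавшая → бега, exactly as in the example above.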
+
+
+
+In searching for an ending in a class, always choose the longest one
+from the class.
+
+
+
+    So in searching for a NOUN ending for величие (velichie), choose ие (ie) rather than
+ е (e).
+
+
+
+Undouble н (n) means, if the word ends нн (nn), remove the last letter.
+
+
+
+Here now are the stemming rules.
+
+
+
+All tests take place in the RV part of the word.
+
+
+
+ So in the test for perfective gerund, the а (a) or я (ia) which the group 1
+ endings must follow must itself be in RV. In other words the letters
+ before the RV region are never examined in the stemming process.
+
+
+
+Do each of steps 1, 2, 3 and 4.
+
+
+
+Step 1:
+Search for a PERFECTIVE GERUND ending. If one is found remove it, and that
+is then the end of step 1. Otherwise try to remove a REFLEXIVE ending,
+and then search in turn for (1) an ADJECTIVAL, (2) a VERB or (3) a
+NOUN ending. As soon as one of the endings (1) to (3) is found remove it,
+and terminate step 1.
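The control flow of step 1 can be sketched like this (again an illustration: the suffix tuples are small hypothetical subsets of the real classes, matching is not restricted to RV, and the group 1 conditions are omitted):

```python
# Simplified sketch of the step 1 control flow, with abbreviated suffix lists.
PERFECTIVE_GERUND = ("вшись", "вши", "в")
REFLEXIVE = ("ся", "сь")
ADJECTIVAL_SUBSET = ("ыми", "ая", "ое", "ый")
VERB_SUBSET = ("ешь", "ет", "ла", "л", "ть")
NOUN_SUBSET = ("иями", "ями", "ами", "ом", "ах", "а", "ы")

def longest_match(word, suffixes):
    # always choose the longest ending that matches
    return max((s for s in suffixes if word.endswith(s)), key=len, default="")

def step1(word):
    suf = longest_match(word, PERFECTIVE_GERUND)
    if suf:                               # found: remove it, step 1 ends
        return word[:-len(suf)]
    suf = longest_match(word, REFLEXIVE)  # otherwise try to remove a reflexive
    if suf:
        word = word[:-len(suf)]
    for cls in (ADJECTIVAL_SUBSET, VERB_SUBSET, NOUN_SUBSET):
        suf = longest_match(word, cls)    # (1) adjectival, (2) verb, (3) noun
        if suf:
            return word[:-len(suf)]
    return word
```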
+
+
+
+Step 2: If the word ends with и (i), remove it.
+
+
+
+Step 3: Search for a DERIVATIONAL ending in R2 (i.e. the entire ending
+must lie in R2), and if one is found, remove it.
+
+
+
+Step 4: (1) Undouble н (n), or, (2) if the word ends with a SUPERLATIVE ending,
+remove it and undouble н (n), or (3) if the word ends ь (') (soft sign) remove it.
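Steps 2 to 4 are simpler and can be sketched directly (illustrative only: the real tests are restricted to RV, and here the R2 check of step 3 takes the R2 substring as an explicit argument):

```python
# Sketch of steps 2-4 of the Russian stemmer.
def step2(word):
    # remove a final и
    return word[:-1] if word.endswith("и") else word

def step3(word, r2):
    # the DERIVATIONAL ending must lie entirely within R2
    for suf in ("ость", "ост"):
        if word.endswith(suf) and r2.endswith(suf):
            return word[:-len(suf)]
    return word

def undouble_n(word):
    return word[:-1] if word.endswith("нн") else word

def step4(word):
    if word.endswith("нн"):               # (1) undouble н
        return undouble_n(word)
    for suf in ("ейше", "ейш"):           # (2) superlative, then undouble н
        if word.endswith(suf):
            return undouble_n(word[:-len(suf)])
    if word.endswith("ь"):                # (3) soft sign
        return word[:-1]
    return word
```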
+
+
+
The same algorithm in Snowball
+
+
+stringescapes {}
+
+/* the 33 Cyrillic letters represented in ASCII characters following the
+ * conventions of the standard Library of Congress transliteration: */
+
+stringdef a    '{U+0430}'
+stringdef b    '{U+0431}'
+stringdef v    '{U+0432}'
+stringdef g    '{U+0433}'
+stringdef d    '{U+0434}'
+stringdef e    '{U+0435}'
+stringdef e"   '{U+0451}'
+stringdef zh   '{U+0436}'
+stringdef z    '{U+0437}'
+stringdef i    '{U+0438}'
+stringdef i`   '{U+0439}'
+stringdef k    '{U+043A}'
+stringdef l    '{U+043B}'
+stringdef m    '{U+043C}'
+stringdef n    '{U+043D}'
+stringdef o    '{U+043E}'
+stringdef p    '{U+043F}'
+stringdef r    '{U+0440}'
+stringdef s    '{U+0441}'
+stringdef t    '{U+0442}'
+stringdef u    '{U+0443}'
+stringdef f    '{U+0444}'
+stringdef kh   '{U+0445}'
+stringdef ts   '{U+0446}'
+stringdef ch   '{U+0447}'
+stringdef sh   '{U+0448}'
+stringdef shch '{U+0449}'
+stringdef "    '{U+044A}'
+stringdef y    '{U+044B}'
+stringdef '    '{U+044C}'
+stringdef e`   '{U+044D}'
+stringdef iu   '{U+044E}'
+stringdef ia   '{U+044F}'
+
+routines ( mark_regions R2
+           perfective_gerund
+           adjective
+           adjectival
+           reflexive
+           verb
+           noun
+           derivational
+           tidy_up
+)
+
+externals ( stem )
+
+integers ( pV p2 )
+
+groupings ( v )
+
+define v '{a}{e}{i}{o}{u}{y}{e`}{iu}{ia}'
+
+define mark_regions as (
+
+    $pV = limit
+    $p2 = limit
+    do (
+        gopast v  setmark pV  gopast non-v
+        gopast v  gopast non-v  setmark p2
+    )
+)
+
+backwardmode (
+
+    define R2 as $p2 <= cursor
+
+    define perfective_gerund as (
+        [substring] among (
+            '{v}'
+            '{v}{sh}{i}'
+            '{v}{sh}{i}{s}{'}'
+                ('{a}' or '{ia}' delete)
+            '{i}{v}'
+            '{i}{v}{sh}{i}'
+            '{i}{v}{sh}{i}{s}{'}'
+            '{y}{v}'
+            '{y}{v}{sh}{i}'
+            '{y}{v}{sh}{i}{s}{'}'
+                (delete)
+        )
+    )
+
+    define adjective as (
+        [substring] among (
+            '{e}{e}' '{i}{e}' '{y}{e}' '{o}{e}' '{i}{m}{i}' '{y}{m}{i}'
+            '{e}{i`}' '{i}{i`}' '{y}{i`}' '{o}{i`}' '{e}{m}' '{i}{m}'
+            '{y}{m}' '{o}{m}' '{e}{g}{o}' '{o}{g}{o}' '{e}{m}{u}'
+            '{o}{m}{u}' '{i}{kh}' '{y}{kh}' '{u}{iu}' '{iu}{iu}' '{a}{ia}'
+            '{ia}{ia}'
+            // and -
+            '{o}{iu}'   // - which is somewhat archaic
+            '{e}{iu}'   // - soft form of {o}{iu}
+                (delete)
+        )
+    )
+
+    define adjectival as (
+        adjective
+
+        /* of the participle forms, em, vsh, ivsh, yvsh are readily removable.
+           nn, {iu}shch, shch, u{iu}shch can be removed, with a small proportion of
+           errors. Removing im, uem, enn creates too many errors.
+        */
+
+        try (
+            [substring] among (
+                '{e}{m}'              // present passive participle
+                '{n}{n}'              // adjective from past passive participle
+                '{v}{sh}'             // past active participle
+                '{iu}{shch}' '{shch}' // present active participle
+                    ('{a}' or '{ia}' delete)
+
+                //but not  '{i}{m}' '{u}{e}{m}'  // present passive participle
+                //or       '{e}{n}{n}'           // adjective from past passive participle
+
+                '{i}{v}{sh}' '{y}{v}{sh}' // past active participle
+                '{u}{iu}{shch}'           // present active participle
+                    (delete)
+            )
+        )
+
+    )
+
+    define reflexive as (
+        [substring] among (
+            '{s}{ia}'
+            '{s}{'}'
+                (delete)
+        )
+    )
+
+    define verb as (
+        [substring] among (
+            '{l}{a}' '{n}{a}' '{e}{t}{e}' '{i`}{t}{e}' '{l}{i}' '{i`}'
+            '{l}' '{e}{m}' '{n}' '{l}{o}' '{n}{o}' '{e}{t}' '{iu}{t}'
+            '{n}{y}' '{t}{'}' '{e}{sh}{'}'
+
+            '{n}{n}{o}'
+                ('{a}' or '{ia}' delete)
+
+            '{i}{l}{a}' '{y}{l}{a}' '{e}{n}{a}' '{e}{i`}{t}{e}'
+            '{u}{i`}{t}{e}' '{i}{t}{e}' '{i}{l}{i}' '{y}{l}{i}' '{e}{i`}'
+            '{u}{i`}' '{i}{l}' '{y}{l}' '{i}{m}' '{y}{m}' '{e}{n}'
+            '{i}{l}{o}' '{y}{l}{o}' '{e}{n}{o}' '{ia}{t}' '{u}{e}{t}'
+            '{u}{iu}{t}' '{i}{t}' '{y}{t}' '{e}{n}{y}' '{i}{t}{'}'
+            '{y}{t}{'}' '{i}{sh}{'}' '{u}{iu}' '{iu}'
+                (delete)
+                /* note the short passive participle tests:
+                   '{n}{a}' '{n}' '{n}{o}' '{n}{y}'
+                   '{e}{n}{a}' '{e}{n}' '{e}{n}{o}' '{e}{n}{y}'
+                */
+        )
+    )
+
+    define noun as (
+        [substring] among (
+            '{a}' '{e}{v}' '{o}{v}' '{i}{e}' '{'}{e}' '{e}'
+            '{i}{ia}{m}{i}' '{ia}{m}{i}' '{a}{m}{i}' '{e}{i}' '{i}{i}'
+            '{i}' '{i}{e}{i`}' '{e}{i`}' '{o}{i`}' '{i}{i`}' '{i`}'
+            '{i}{ia}{m}' '{ia}{m}' '{i}{e}{m}' '{e}{m}' '{a}{m}' '{o}{m}'
+            '{o}' '{u}' '{a}{kh}' '{i}{ia}{kh}' '{ia}{kh}' '{y}' '{'}'
+            '{i}{iu}' '{'}{iu}' '{iu}' '{i}{ia}' '{'}{ia}' '{ia}'
+                (delete)
+                /* the small class of neuter forms '{e}{n}{i}' '{e}{n}{e}{m}'
+                   '{e}{n}{a}' '{e}{n}' '{e}{n}{a}{m}' '{e}{n}{a}{m}{i}' '{e}{n}{a}{x}'
+                   omitted - they only occur on 12 words.
+                */
+        )
+    )
+
+    define derivational as (
+        [substring] R2 among (
+            '{o}{s}{t}'
+            '{o}{s}{t}{'}'
+                (delete)
+        )
+    )
+
+    define tidy_up as (
+        [substring] among (
+
+            '{e}{i`}{sh}'
+            '{e}{i`}{sh}{e}'   // superlative forms
+                (delete
+                 ['{n}'] '{n}' delete
+                )
+            '{n}'
+                ('{n}' delete)  // e.g. -nno endings
+            '{'}'
+                (delete)        // with some slight false conflations
+        )
+    )
+)
+
+define stem as (
+
+    // Normalise {e"} to {e}.  The documentation has long suggested the user
+    // should do this before calling the stemmer - we now do it for them.
+    do repeat ( goto (['{e"}']) <- '{e}' )
+
+    do mark_regions
+    backwards setlimit tomark pV for (
+        do (
+             perfective_gerund or
+             ( try reflexive
+               adjectival or verb or noun
+             )
+        )
+        try ( ['{i}'] delete )
+        // because noun ending -i{iu} is being treated as verb ending -{iu}
+
+        do derivational
+        do tidy_up
+    )
+)
+
+The Snowball stemmer represents the Cyrillic alphabet with ASCII characters,
+following the standard Library of Congress transliteration scheme.
+
+
+
+[% algorithm_vocab([60, 'в', 'п']) %]
+
+
The stemming algorithm
+
+
+i-suffixes (*) of Russian tend to be quite regular, with irregularities of
+declension involving a change to the stem. Irregular forms therefore
+usually just generate two or more possible stems. Stems in Russian can
+be very short, and many of the suffixes are also particle words that make
+‘natural stopwords’, so a tempting way of running the stemmer is to set a
+minimum stem length of zero, and thereby reduce to null all words which
+are made up entirely of suffix parts. We have been a little more cautious,
+and have insisted that a minimum stem contains one vowel.
+
+
+
+The 32 letters of the Russian alphabet are as follows, with the
+transliterated forms that we will use here shown in brackets:
+
+
+
+
а (a)
+
б (b)
+
в (v)
+
г (g)
+
д (d)
+
е (e)
+
ж (zh)
+
з (z)
+
+
и (i)
+
й (ì)
+
к (k)
+
л (l)
+
м (m)
+
н (n)
+
о (o)
+
п (p)
+
+
р (r)
+
с (s)
+
т (t)
+
у (u)
+
ф (f)
+
х (kh)
+
ц (ts)
+
ч (ch)
+
+
ш (sh)
+
щ (shch)
+
ъ (")
+
ы (y)
+
ь (')
+
э (è)
+
ю (iu)
+
я (ia)
+
+
+
+
+There is a 33rd letter, ё (e"), but it is rarely used and often
+replaced by е in informal writing. The original algorithm here assumed it
+had already been mapped to е (e); since 2018-03-16 the Snowball
+implementation we provide performs this mapping for you.
+
+
+
+The following are vowels:
+
+
+
+ а (a) е (e) и (i) о (o) у (u) ы (y)
+ э (è) ю (iu) я (ia)
+
+
+
+In any word, RV is the region after the first vowel, or the end of the word
+if it contains no vowel.
+
+
+
+R1 is the region after the first non-vowel following a vowel, or the end of
+the word if there is no such non-vowel.
+
+
+
+R2 is the region after the first non-vowel following a vowel in R1, or the
+end of the word if there is no such non-vowel.
+
+
+
+For example:
+
+
+
+ p r o t i v o e s t e s t v e n n o m
+ |<------ RV ------>|
+ |<----- R1 ------>|
+ |<----- R2 ------>|
+
+ADJECTIVE:
+
+    ее (ee) ие (ie) ые (ye) ое (oe) ими (imi) ыми
+ (ymi) ей (eì) ий (iì) ый (yì) ой (oì) ем
+ (em) им (im) ым (ym) ом (om) его (ego) ого (ogo)
+ ему (emu) ому (omu) их (ikh) ых (ykh) ую (uiu)
+ юю (iuiu) ая (aia) яя (iaia)
+ ою (oiu)
+ ею (eiu)
+
+VERB:
+
+    group 1: ла (la) на (na) ете (ete) йте (ìte) ли (li)
+ й (ì) л (l) ем (em) н (n) ло (lo) но (no) ет
+ (et) ют (iut) ны (ny) ть (t') ешь (esh') нно (nno)
+
+
+
+ group 2: ила (ila) ыла (yla) ена (ena) ейте (eìte)
+ уйте (uìte) ите (ite) или (ili) ыли
+ (yli) ей (eì) уй (uì) ил (il) ыл (yl) им (im)
+ ым (ym) ен (en) ило (ilo) ыло (ylo) ено (eno) ят
+ (iat) ует (uet) уют (uiut) ит (it) ыт (yt) ены
+ (eny) ить (it') ыть (yt') ишь (ish')
+ ую (uiu) ю (iu)
+
+
+
+
+group 1 endings must follow а (a) or я (ia)
+
+
+
+NOUN:
+
+
+
+
+а (a) ев (ev) ов (ov) ие (ie) ье ('e) е (e) иями
+(iiami) ями (iami) ами (ami) еи (ei) ии (ii) и (i)
+ией (ieì) ей (eì) ой (oì) ий (iì) й (ì)
+иям (iiam) ям (iam) ием (iem) ем (em) ам (am) ом
+(om) о (o) у (u) ах (akh) иях (iiakh) ях (iakh) ы
+(y) ь (') ию (iiu) ью ('iu) ю (iu) ия (iia) ья
+('ia) я (ia)
+
+
+
+
+SUPERLATIVE:
+
+
+
+
+ ейш (eìsh) ейше (eìshe)
+
+
+
+
+These are all i-suffixes. The list of d-suffixes is very short,
+
+
+
+DERIVATIONAL:
+
+
+
+
+ ост (ost) ость (ost')
+
+
+
+
+Define an ADJECTIVAL ending as an ADJECTIVE ending optionally preceded
+by a PARTICIPLE ending.
+
+
+
+ For example, in
+
+
бегавшая
=
бега
+
вш
+
ая
+
(begavshaia
=
bega
+
vsh
+
aia)
+
+ ая (aia) is an adjective ending, and вш (vsh) a participle ending of group 1
+ (preceded by the final а (a) of бега (bega)), so вшая (vshaia) is an
+ adjectival ending.
+
+
+
+In searching for an ending in a class, always choose the longest one
+from the class.
+
+
+
+    So in searching for a NOUN ending for величие (velichie), choose ие (ie) rather than
+ е (e).
+
+
+
+Undouble н (n) means, if the word ends нн (nn), remove the last letter.
+
+
+
+Here now are the stemming rules.
+
+
+
+All tests take place in the RV part of the word.
+
+
+
+ So in the test for perfective gerund, the а (a) or я (ia) which the group 1
+ endings must follow must itself be in RV. In other words the letters
+ before the RV region are never examined in the stemming process.
+
+
+
+Do each of steps 1, 2, 3 and 4.
+
+
+
+Step 1:
+Search for a PERFECTIVE GERUND ending. If one is found remove it, and that
+is then the end of step 1. Otherwise try to remove a REFLEXIVE ending,
+and then search in turn for (1) an ADJECTIVAL, (2) a VERB or (3) a
+NOUN ending. As soon as one of the endings (1) to (3) is found remove it,
+and terminate step 1.
+
+
+
+Step 2: If the word ends with и (i), remove it.
+
+
+
+Step 3: Search for a DERIVATIONAL ending in R2 (i.e. the entire ending
+must lie in R2), and if one is found, remove it.
+
+
+
+Step 4: (1) Undouble н (n), or, (2) if the word ends with a SUPERLATIVE ending,
+remove it and undouble н (n), or (3) if the word ends ь (') (soft sign) remove it.
+
+
+
The same algorithm in Snowball
+
+[% highlight_file('russian') %]
+
+[% footer %]
diff --git a/algorithms/russian/stop.txt b/algorithms/russian/stop.txt
new file mode 100644
index 0000000..54fcc3d
--- /dev/null
+++ b/algorithms/russian/stop.txt
@@ -0,0 +1,236 @@
+
+
+ | a russian stop word list. comments begin with vertical bar. each stop
+ | word is at the start of a line.
+
+ | this is a ranked list (commonest to rarest) of stopwords derived from
+ | a large text sample.
+
+ | letter `ё' is translated to `е'.
+
+и | and
+в | in/into
+во | alternative form
+не | not
+что | what/that
+он | he
+на | on/onto
+я | i
+с | from
+со | alternative form
+как | how
+а | milder form of `no' (but)
+то | conjunction and form of `that'
+все | all
+она | she
+так | so, thus
+его | him
+но | but
+да | yes/and
+ты | thou
+к | towards, by
+у | around, chez
+же | intensifier particle
+вы | you
+за | beyond, behind
+бы | conditional/subj. particle
+по | up to, along
+только | only
+ее | her
+мне | to me
+было | it was
+вот | here is/are, particle
+от | away from
+меня | me
+еще | still, yet, more
+нет | no, there isn't/aren't
+о | about
+из | out of
+ему | to him
+теперь | now
+когда | when
+даже | even
+ну | so, well
+вдруг | suddenly
+ли | interrogative particle
+если | if
+уже | already, but homonym of `narrower'
+или | or
+ни | neither
+быть | to be
+был | he was
+него | prepositional form of его
+до | up to
+вас | you accusative
+нибудь | indef. suffix preceded by hyphen
+опять | again
+уж | already, but homonym of `adder'
+вам | to you
+сказал | he said
+ведь | particle `after all'
+там | there
+потом | then
+себя | oneself
+ничего | nothing
+ей | to her
+может | usually with `быть' as `maybe'
+они | they
+тут | here
+где | where
+есть | there is/are
+надо | got to, must
+ней | prepositional form of ей
+для | for
+мы | we
+тебя | thee
+их | them, their
+чем | than
+была | she was
+сам | self
+чтоб | in order to
+без | without
+будто | as if
+человек | man, person, one
+чего | genitive form of `what'
+раз | once
+тоже | also
+себе | to oneself
+под | beneath
+жизнь | life
+будет | will be
+ж | short form of intensifier particle `же'
+тогда | then
+кто | who
+этот | this
+говорил | was saying
+того | genitive form of `that'
+потому | for that reason
+этого | genitive form of `this'
+какой | which
+совсем | altogether
+ним | prepositional form of `его', `они'
+здесь | here
+этом | prepositional form of `этот'
+один | one
+почти | almost
+мой | my
+тем | instrumental/dative plural of `тот', `то'
+чтобы | full form of `in order that'
+нее | her (acc.)
+кажется | it seems
+сейчас | now
+были | they were
+куда | where to
+зачем | why
+сказать | to say
+всех | all (acc., gen. preposn. plural)
+никогда | never
+сегодня | today
+можно | possible, one can
+при | by
+наконец | finally
+два | two
+об | alternative form of `о', about
+другой | another
+хоть | even
+после | after
+над | above
+больше | more
+тот | that one (masc.)
+через | across, in
+эти | these
+нас | us
+про | about
+всего | in all, only, of all
+них | prepositional form of `они' (they)
+какая | which, feminine
+много | lots
+разве | interrogative particle
+сказала | she said
+три | three
+эту | this, acc. fem. sing.
+моя | my, feminine
+впрочем | moreover, besides
+хорошо | good
+свою | ones own, acc. fem. sing.
+этой | oblique form of `эта', fem. `this'
+перед | in front of
+иногда | sometimes
+лучше | better
+чуть | a little
+том | preposn. form of `that one'
+нельзя | one must not
+такой | such a one
+им | to them
+более | more
+всегда | always
+конечно | of course
+всю | acc. fem. sing of `all'
+между | between
+
+
+ | b: some paradigms
+ |
+ | personal pronouns
+ |
+ | я меня мне мной [мною]
+ | ты тебя тебе тобой [тобою]
+ | он его ему им [него, нему, ним]
+ | она ее эи ею [нее, нэи, нею]
+ | оно его ему им [него, нему, ним]
+ |
+ | мы нас нам нами
+ | вы вас вам вами
+ | они их им ими [них, ним, ними]
+ |
+ | себя себе собой [собою]
+ |
+ | demonstrative pronouns: этот (this), тот (that)
+ |
+ | этот эта это эти
+ | этого эты это эти
+ | этого этой этого этих
+ | этому этой этому этим
+ | этим этой этим [этою] этими
+ | этом этой этом этих
+ |
+ | тот та то те
+ | того ту то те
+ | того той того тех
+ | тому той тому тем
+ | тем той тем [тою] теми
+ | том той том тех
+ |
+ | determinative pronouns
+ |
+ | (a) весь (all)
+ |
+ | весь вся все все
+ | всего всю все все
+ | всего всей всего всех
+ | всему всей всему всем
+ | всем всей всем [всею] всеми
+ | всем всей всем всех
+ |
+ | (b) сам (himself etc)
+ |
+ | сам сама само сами
+ | самого саму само самих
+ | самого самой самого самих
+ | самому самой самому самим
+ | самим самой самим [самою] самими
+ | самом самой самом самих
+ |
+ | stems of verbs `to be', `to have', `to do' and modal
+ |
+ | быть бы буд быв есть суть
+ | име
+ | дел
+ | мог мож мочь
+ | уме
+ | хоч хот
+ | долж
+ | можн
+ | нужн
+ | нельзя
+
diff --git a/algorithms/scandinavian.html b/algorithms/scandinavian.html
new file mode 100644
index 0000000..2ec48b9
--- /dev/null
+++ b/algorithms/scandinavian.html
@@ -0,0 +1,98 @@
+
+
+
+
+
+
+
+
+
+ Scandinavian language stemmers - Snowball
+
+
+
+
+
+
+
+
+
+
+
+The stemmers for these three Scandinavian languages are all very simple,
+and quite similar to each other. But between the languages there is a difference
+in which endings can be removed without difficulty, even though the endings
+are very similar. For example, in Norwegian
+the ending ede can be removed safely, but not in Danish.
+
+
+
+To the definite article (the in English, der etc in German) there
+corresponds
+a noun ending in the Scandinavian languages. This ending cannot always be removed
+with certainty. In Swedish, for example, the en form is removed, but not the
+t or n form,
+
+
+
+
+   husen → hus
+   flickan → flickan
+   äpplet → äpplet
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/algorithms/scandinavian.tt b/algorithms/scandinavian.tt
new file mode 100644
index 0000000..94f6d13
--- /dev/null
+++ b/algorithms/scandinavian.tt
@@ -0,0 +1,34 @@
+[% header('Scandinavian language stemmers') %]
+
+
+The stemmers for these three Scandinavian languages are all very simple,
+and quite similar to each other. But between the languages there is a difference
+in which endings can be removed without difficulty, even though the endings
+are very similar. For example, in Norwegian
+the ending ede can be removed safely, but not in Danish.
+
+
+
+To the definite article (the in English, der etc in German) there
+corresponds
+a noun ending in the Scandinavian languages. This ending cannot always be removed
+with certainty. In Swedish, for example, the en form is removed, but not the
+t or n form,
+
+The Serbian language is a Slavic language (Indo-European) of the South Slavic
+subgroup. It is highly inflected and uses similar rules for morphological
+derivation and flexion as other Slavic languages, especially ones derived from
+the Serbo-Croatian language used in the former Yugoslavia. Because of this
+highly inflected characteristic, a stemmer for the Serbian language will have
+many more rules than stemmers for less inflected languages.
+
+
+
+The Serbian Stemmer described in this document is based on the Croatian
+Stemmer which is published under the GNU Lesser General Public License.
+Mark Regions, Morphological Changes (Step_1) and Stemming
+(Step_2) routines are based on the Croatian Stemming Algorithm. In
+addition, some of the existing rules for Morphological Changes and Stemming
+(Step_1 and Step_2 among lists) have been modified and new rules have
+been added for the needs of the Serbian Stemmer.
+
+
+
+Latin alphabet in Serbian includes the following letters with diacritics:
+
+
+
+ č ć đ š ž
+
+
+
+The following letters are vowels:
+
+
+
+ a e i o u
+
+
+
+There is also the letter r, which isn't a vowel but is sometimes used for syllabification.
+
+
+
Main Routines of the Serbian Stemming Algorithm are:
+
+
+
Conversion of Cyrillic alphabet to Latin alphabet
+
+The Serbian language uses both Cyrillic and Latin alphabets, but
+these days most people use the Latin alphabet on their PCs, phones, etc. This
+algorithm was developed mostly for the purposes of Information Retrieval,
+so the first thing it does is to convert Cyrillic letters to
+Latin.
+
+
+
+
Prelude
+
+In the Serbian language there are two dialects: Ekavian and
+Ijekavian. For example, the words:
+
+
senka (Ekavian)
+
sjenka (Ijekavian)
+
+have the same meaning (Shadow), as do the words:
+
+
mleko (Ekavian)
+
mlijeko (Ijekavian)
+
+have the same meaning (Milk) but are spelled differently. Because the most
+widely used dialect in Serbia is Ekavian, the next thing the algorithm does is
+to replace Ijekavian forms with Ekavian ones.
+
+
+
+These days it is also common, although not valid, to use the combination of
+letters "d" and "j" instead of the single letter "đ". For example,
+people will more often write "Novak Djoković" instead of "Novak
+Đoković", and because this algorithm is developed with Information Retrieval
+in mind they should be treated as the same term.
+
+
+
Mark Regions
+
+R1 is either:
+
+
a region after the first vowel if there are at least two letters outside
+of it, otherwise it is a region after the first non-vowel following a vowel,
+
a region after the first "r" if there are at least two letters
+outside of it, otherwise it is a region after the first non-"r"
+following an "r".
+
+
+Note that every suffix which the stemmer can remove contains at least one
+vowel, so in the degenerate case of an input which contains no vowels there
+is nothing to be done. The Snowball implementation of this stemmer sets
+R1 to be a zero length region at the end of the word if the input
+contains no vowels and no "r".
+
+
+In the Serbian language there are some words in which the letter "r" is used
+for syllabification, and in such words vowels can appear at the very end - for
+example the word "grmlje".
+
+
+
+So before the algorithm decides what R1 will be, it needs to check if and
+where the letter "r" occurs and where the first vowel is. If it finds an "r"
+that occurs before the first vowel with at least one letter between
+them, this means that "r" is used for syllabification and R1 is defined by
+2); otherwise R1 is defined by 1).
+
+
+
+For example:
+
+
"tr|go|va|čki" - in this word "tr" is the first syllable
+ which means that "r" is used for syllabification and R1 =
+ "govački"
+
+
"tre|ne|rka" - in this word there is a letter "r" before the
+ first vowel but there aren't any letters between them which means that
+ "r" isn't used for syllabification and R1 = "nerka".
+
+
"r|ta|njski" - in this word "r" is the first syllable but if
+    we use "tanjski" as R1 it won't leave enough letters outside
+ of it, so we need to shrink it down to a region after the first
+ non-"r" following an "r" which is in this case =
+ "anjski".
+
+
"a|vi|on" - similar to the previous case but with a vowel instead
+ of an "r".
+
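The R1 marking described above can be sketched as follows (a hedged illustration; `serbian_r1` is a hypothetical helper, and the real routine also handles diacritics and the no-vowel, no-"r" case more carefully):

```python
# Sketch of the Serbian R1 marking: rule 2) when "r" is syllabific,
# rule 1) otherwise, each shrunk when fewer than two letters lie outside.
VOWELS = set("aeiou")

def serbian_r1(word):
    first_vowel = next((i for i, c in enumerate(word) if c in VOWELS),
                       len(word))
    first_r = word.find("r")
    if 0 <= first_r < first_vowel and first_vowel - first_r > 1:
        # "r" is used for syllabification: region after the first "r"
        start = first_r + 1
        if start < 2:   # fewer than two letters outside the region: shrink
            start = next((i + 1 for i in range(first_r + 1, len(word))
                          if word[i] != "r"), len(word))
    else:
        # region after the first vowel
        start = first_vowel + 1
        if start < 2:   # fewer than two letters outside the region: shrink
            start = next((i + 1 for i in range(1, len(word))
                          if word[i] not in VOWELS and word[i - 1] in VOWELS),
                         len(word))
    return word[start:]
```

This reproduces the four examples above: "trgovački" → "govački", "trenerka" → "nerka", "rtanjski" → "anjski", "avion" → "ion".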
+
+Inside the Mark Regions routine there is a test routine that is used to
+check for letters with diacritics; it is used later to apply certain rules in
+stemming. The result of this test routine is stored in the no_diacritics flag.
+This test routine is needed because people these days tend to write letters
+without diacritics (instead of the proper ones with diacritics) and we need to
+take this into account as well.
+
+
+
+
Morphological Changes
+
+The very last thing to do, before any stemming is done, is to apply
+morphological changes. These changes are applied so that we get the same stems
+for different forms of a word.
+
+
+
+For example words:
+
+
"pravilan" (Masculine, Singular)
+
"pravilna" (Feminine, Singular)
+
"pravilno" (Neuter, Singular)
+
+should have the same stem. To get that result the algorithm will first change
+the word "pravilan" (Masculine, Singular) to "pravilni" (Masculine,
+Plural) and after that the word will be stemmed.
+
+
+
+
Stemming
+
+There are two steps for stemming. The first contains most of the rules and is
+the primary stemming routine; the second will try to stem the word only
+if the first failed to do so, whether because there were no rules
+that could be applied or because the rule overlapped the R1 region. The second
+step contains a few rules that will do proper stemming for most words that
+couldn't be stemmed using the rules from the first step.
+
+The Serbian language is a Slavic language (Indo-European) of the South Slavic
+subgroup. It is highly inflected and uses similar rules for morphological
+derivation and flexion as other Slavic languages, especially ones derived from
+the Serbo-Croatian language used in the former Yugoslavia. Because of this
+highly inflected characteristic, a stemmer for the Serbian language will have
+many more rules than stemmers for less inflected languages.
+
+
+
+The Serbian Stemmer described in this document is based on the Croatian
+Stemmer which is published under the GNU Lesser General Public License.
+Mark Regions, Morphological Changes (Step_1) and Stemming
+(Step_2) routines are based on the Croatian Stemming Algorithm. In
+addition, some of the existing rules for Morphological Changes and Stemming
+(Step_1 and Step_2 among lists) have been modified and new rules have
+been added for the needs of the Serbian Stemmer.
+
+
+
+Latin alphabet in Serbian includes the following letters with diacritics:
+
+
+
+ č ć đ š ž
+
+
+
+The following letters are vowels:
+
+
+
+ a e i o u
+
+
+
+There is also the letter r, which isn't a vowel but is sometimes used for syllabification.
+
+
+
Main Routines of the Serbian Stemming Algorithm are:
+
+
+
Conversion of Cyrillic alphabet to Latin alphabet
+
+The Serbian language uses both Cyrillic and Latin alphabets, but
+these days most people use the Latin alphabet on their PCs, phones, etc. This
+algorithm was developed mostly for the purposes of Information Retrieval,
+so the first thing it does is to convert Cyrillic letters to
+Latin.
+
+
+
+
Prelude
+
+In the Serbian language there are two dialects: Ekavian and
+Ijekavian. For example, the words:
+
+
senka (Ekavian)
+
sjenka (Ijekavian)
+
+have the same meaning (Shadow), as do the words:
+
+
mleko (Ekavian)
+
mlijeko (Ijekavian)
+
+have the same meaning (Milk) but are spelled differently. Because the most
+widely used dialect in Serbia is Ekavian, the next thing the algorithm does is
+to replace Ijekavian forms with Ekavian ones.
+
+
+
+These days it is also common, although not valid, to use the combination of
+letters "d" and "j" instead of the single letter "đ". For example,
+people will more often write "Novak Djoković" instead of "Novak
+Đoković", and because this algorithm is developed with Information Retrieval
+in mind they should be treated as the same term.
+
+
+
Mark Regions
+
+R1 is either:
+
+
a region after the first vowel if there are at least two letters outside
+of it, otherwise it is a region after the first non-vowel following a vowel,
+
a region after the first "r" if there are at least two letters
+outside of it, otherwise it is a region after the first non-"r"
+following an "r".
+
+
+Note that every suffix which the stemmer can remove contains at least one
+vowel, so in the degenerate case of an input which contains no vowels there
+is nothing to be done. The Snowball implementation of this stemmer sets
+R1 to be a zero length region at the end of the word if the input
+contains no vowels and no "r".
+
+
+In the Serbian language there are some words in which the letter "r" is used
+for syllabification, and in such words vowels can appear at the very end - for
+example the word "grmlje".
+
+
+
+So before the algorithm decides what R1 will be, it needs to check if and
+where the letter "r" occurs and where the first vowel is. If it finds an "r"
+that occurs before the first vowel with at least one letter between
+them, this means that "r" is used for syllabification and R1 is defined by
+2); otherwise R1 is defined by 1).
+
+
+
+For example:
+
+
"tr|go|va|čki" - in this word "tr" is the first syllable
+ which means that "r" is used for syllabification and R1 =
+ "govački"
+
+
"tre|ne|rka" - in this word there is a letter "r" before the
+ first vowel but there aren't any letters between them which means that
+ "r" isn't used for syllabification and R1 = "nerka".
+
+
"r|ta|njski" - in this word "r" is the first syllable but if
+    we use "tanjski" as R1 it won't leave enough letters outside
+ of it, so we need to shrink it down to a region after the first
+ non-"r" following an "r" which is in this case =
+ "anjski".
+
+
"a|vi|on" - similar to the previous case but with a vowel instead
+ of an "r".
+
+
+Inside the Mark Regions routine there is a test routine that is used to
+check for letters with diacritics; it is used later to apply certain rules in
+stemming. The result of this test routine is stored in the no_diacritics flag.
+This test routine is needed because people these days tend to write letters
+without diacritics (instead of the proper ones with diacritics) and we need to
+take this into account as well.
+
+
+
+
Morphological Changes
+
+The very last thing to do, before any stemming is done, is to apply
+morphological changes. These changes are applied so that we get the same stems
+for different forms of a word.
+
+
+
+For example words:
+
+
"pravilan" (Masculine, Singular)
+
"pravilna" (Feminine, Singular)
+
"pravilno" (Neuter, Singular)
+
+should have the same stem. To get that result the algorithm will first change
+the word "pravilan" (Masculine, Singular) to "pravilni" (Masculine,
+Plural) and after that the word will be stemmed.
+
+
+
+
Stemming
+
+There are two steps for stemming. The first contains most of the rules and is
+the primary stemming routine; the second will try to stem the word only
+if the first failed to do so, whether because there were no rules
+that could be applied or because the rule overlapped the R1 region. The second
+step contains a few rules that will do proper stemming for most words that
+couldn't be stemmed using the rules from the first step.
+
+Letters in Spanish include the following accented forms,
+
+
+
+ á é í ó ú ü ñ
+
+
+
+The following letters are vowels:
+
+
+
+ a e i o u á é í ó ú ü
+
+
+
+R2 is defined in the usual way —
+see the note on R1 and R2.
+
+
+
+RV is defined as follows (and this is not the same as the
+ French stemmer
+definition):
+
+
+
+If the second letter is a consonant, RV is the region after the next
+following vowel, or if the first two letters are vowels, RV is the region
+after the next consonant, and otherwise (consonant-vowel case) RV is the
+region after the third letter. But RV is the end of the word if these
+positions cannot be found.
+
+
+
+For example,
+
+
+
+ m a c h o o l i v a t r a b a j o á u r e o
+ |...| |...| |.......| |...|
+
+
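The RV definition can be sketched in Python (`spanish_rv` is an illustrative helper using the vowel list above, not the Snowball implementation):

```python
# Sketch of the Spanish RV definition.
VOWELS = set("aeiouáéíóúü")

def spanish_rv(word):
    if len(word) < 3:
        return ""                 # the positions cannot be found
    if word[1] not in VOWELS:
        # second letter is a consonant: region after the next following vowel
        for i in range(2, len(word)):
            if word[i] in VOWELS:
                return word[i + 1:]
        return ""
    if word[0] in VOWELS:
        # first two letters are vowels: region after the next consonant
        for i in range(2, len(word)):
            if word[i] not in VOWELS:
                return word[i + 1:]
        return ""
    # consonant-vowel case: region after the third letter
    return word[3:]
```

This reproduces the four examples above: macho → ho, oliva → va, trabajo → bajo, áureo → eo.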
+
+Always do steps 0 and 1.
+
+
+
+Step 0: Attached pronoun
+
+
+
+ Search for the longest among the following suffixes
+
+ me se sela selo selas selos la le lo las les los nos
+
+    and delete it, if it comes after one of
+
+ (a) iéndo ándo ár ér ír
+ (b) ando iendo ar er ir
+ (c) yendo following u
+
+
+ in RV. In the case of (c), yendo must lie in RV, but the preceding
+ u can be outside it.
+
+
+
+ In the case of (a), deletion is followed by removing the acute accent
+ (for example, haciéndola → haciendo).
+
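Step 0 can be sketched like this (a hedged illustration: the RV restrictions are omitted, and case (c) is reduced to a simple check that yendo follows u):

```python
# Sketch of step 0: attached pronoun removal, with longest-first matching.
PRONOUNS = ("selas", "selos", "sela", "selo", "las", "les", "los",
            "nos", "me", "se", "la", "le", "lo")          # longest first
ACCENTED = ("iéndo", "ándo", "ár", "ér", "ír")            # case (a)
PLAIN = ("iendo", "ando", "ar", "er", "ir")               # case (b)

def remove_acute(s):
    return (s.replace("á", "a").replace("é", "e").replace("í", "i")
             .replace("ó", "o").replace("ú", "u"))

def step0(word):
    for p in PRONOUNS:
        if word.endswith(p):
            stem = word[:-len(p)]
            if any(stem.endswith(s) for s in ACCENTED):
                return remove_acute(stem)    # (a): also remove the accent
            if any(stem.endswith(s) for s in PLAIN):
                return stem                  # (b)
            if stem.endswith("uyendo"):
                return stem                  # (c): yendo preceded by u
            break
    return word
```

So haciéndola → haciendo, as in the example above.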
+
+
+
+Step 1: Standard suffix removal
+
+
+
+ Search for the longest among the following suffixes, and perform the
+ action indicated.
+
+
anza anzas ico ica icos icas ismo ismos able ables ible ibles ista
+ istas oso osa osos osas amiento amientos imiento
+ imientos
+
delete if in R2
+
adora ador ación adoras adores aciones ante antes ancia ancias
+
delete if in R2
+
if preceded by ic, delete if in R2
+
logía logías
+
replace with log if in R2
+
ución uciones
+
replace with u if in R2
+
encia encias
+
replace with ente if in R2
+
amente
+
delete if in R1
+
if preceded by iv, delete if in R2 (and if further preceded by at,
+ delete if in R2), otherwise,
+
if preceded by os, ic or ad, delete if in R2
+
mente
+
delete if in R2
+
if preceded by ante, able or ible, delete if in R2
+
idad idades
+
delete if in R2
+
if preceded by abil, ic or iv, delete if in R2
+
iva ivo ivas ivos
+
delete if in R2
+
if preceded by at, delete if in R2
+
+
+
+
+Do step 2a if no ending was removed by step 1.
+
+
+
+Step 2a: Verb suffixes beginning y
+
+
+
+ Search for the longest among the following suffixes in RV, and if found,
+ delete if preceded by u.
+
+ ya ye yan yen yeron yendo yo yó yas yes yais
+ yamos
+
+ (Note that the preceding u need not be in RV.)
+
+
+
+Do Step 2b if step 2a was done, but failed to remove a suffix.
+
+
+
+Step 2b: Other verb suffixes
+
+
+
+ Search for the longest among the following suffixes in RV, and perform the
+ action indicated.
+
+
en es éis emos
+
delete, and if preceded by gu delete the u (the gu need not be in
+ RV)
+
arían arías arán arás aríais aría aréis aríamos aremos
+ ará aré
+ erían erías erán erás eríais ería eréis eríamos eremos
+ erá eré
+ irían irías irán irás iríais iría iréis iríamos iremos
+ irá iré
+ aba ada ida ía ara iera ad ed id ase iese aste iste an aban ían
+ aran ieran asen iesen aron ieron ado ido ando iendo ió ar er ir as
+ abas adas idas ías aras ieras ases ieses ís áis abais íais
+ arais ierais aseis ieseis asteis isteis ados idos amos ábamos
+ íamos imos áramos iéramos iésemos ásemos
+
delete
+
+
+
+
+Always do step 3.
+
+
+
+Step 3: residual suffix
+
+
+
+ Search for the longest among the following suffixes in RV, and perform the
+ action indicated.
+
+
os a o á í ó
+
delete if in RV
+
e é
+
delete if in RV, and if preceded by gu with the u in RV delete the u
+
+
+
+
+
+And finally:
+
+
+
+ Remove acute accents
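This final step amounts to a simple transliteration; in Python (an illustrative sketch):

```python
def remove_acutes(word):
    # Map each acutely accented vowel back to its unaccented form
    return word.translate(str.maketrans("áéíóú", "aeiou"))
```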
+
+
+
The same algorithm in Snowball
+
+[% highlight_file('spanish') %]
+
+[% footer %]
diff --git a/algorithms/spanish/stop.txt b/algorithms/spanish/stop.txt
new file mode 100644
index 0000000..fd323a4
--- /dev/null
+++ b/algorithms/spanish/stop.txt
@@ -0,0 +1,348 @@
+
+ | A Spanish stop word list. Comments begin with vertical bar. Each stop
+ | word is at the start of a line.
+
+
+ | The following is a ranked list (commonest to rarest) of stopwords
+ | deriving from a large sample of text.
+
+ | Extra words have been added at the end.
+
+de | from, of
+la | the, her
+que | who, that
+el | the
+en | in
+y | and
+a | to
+los | the, them
+del | de + el
+se | himself, from him etc
+las | the, them
+por | for, by, etc
+un | a
+para | for
+con | with
+no | no
+una | a
+su | his, her
+al | a + el
+ | es from SER
+lo | him
+como | how
+más | more
+pero | pero
+sus | su plural
+le | to him, her
+ya | already
+o | or
+ | fue from SER
+este | this
+ | ha from HABER
+sí | himself etc
+porque | because
+esta | this
+ | son from SER
+entre | between
+ | está from ESTAR
+cuando | when
+muy | very
+sin | without
+sobre | on
+ | ser from SER
+ | tiene from TENER
+también | also
+me | me
+hasta | until
+hay | there is/are
+donde | where
+ | han from HABER
+quien | whom, that
+ | están from ESTAR
+ | estado from ESTAR
+desde | from
+todo | all
+nos | us
+durante | during
+ | estados from ESTAR
+todos | all
+uno | a
+les | to them
+ni | nor
+contra | against
+otros | other
+ | fueron from SER
+ese | that
+eso | that
+ | había from HABER
+ante | before
+ellos | they
+e | and (variant of y)
+esto | this
+mí | me
+antes | before
+algunos | some
+qué | what?
+unos | a
+yo | I
+otro | other
+otras | other
+otra | other
+él | he
+tanto | so much, many
+esa | that
+estos | these
+mucho | much, many
+quienes | who
+nada | nothing
+muchos | many
+cual | who
+ | sea from SER
+poco | few
+ella | she
+estar | to be
+ | haber from HABER
+estas | these
+ | estaba from ESTAR
+ | estamos from ESTAR
+algunas | some
+algo | something
+nosotros | we
+
+ | other forms
+
+mi | me
+mis | mi plural
+tú | thou
+te | thee
+ti | thee
+tu | thy
+tus | tu plural
+ellas | they
+nosotras | we
+vosotros | you
+vosotras | you
+os | you
+mío | mine
+mía |
+míos |
+mías |
+tuyo | thine
+tuya |
+tuyos |
+tuyas |
+suyo | his, hers, theirs
+suya |
+suyos |
+suyas |
+nuestro | ours
+nuestra |
+nuestros |
+nuestras |
+vuestro | yours
+vuestra |
+vuestros |
+vuestras |
+esos | those
+esas | those
+
+ | forms of estar, to be (not including the infinitive):
+estoy
+estás
+está
+estamos
+estáis
+están
+esté
+estés
+estemos
+estéis
+estén
+estaré
+estarás
+estará
+estaremos
+estaréis
+estarán
+estaría
+estarías
+estaríamos
+estaríais
+estarían
+estaba
+estabas
+estábamos
+estabais
+estaban
+estuve
+estuviste
+estuvo
+estuvimos
+estuvisteis
+estuvieron
+estuviera
+estuvieras
+estuviéramos
+estuvierais
+estuvieran
+estuviese
+estuvieses
+estuviésemos
+estuvieseis
+estuviesen
+estando
+estado
+estada
+estados
+estadas
+estad
+
+ | forms of haber, to have (not including the infinitive):
+he
+has
+ha
+hemos
+habéis
+han
+haya
+hayas
+hayamos
+hayáis
+hayan
+habré
+habrás
+habrá
+habremos
+habréis
+habrán
+habría
+habrías
+habríamos
+habríais
+habrían
+había
+habías
+habíamos
+habíais
+habían
+hube
+hubiste
+hubo
+hubimos
+hubisteis
+hubieron
+hubiera
+hubieras
+hubiéramos
+hubierais
+hubieran
+hubiese
+hubieses
+hubiésemos
+hubieseis
+hubiesen
+habiendo
+habido
+habida
+habidos
+habidas
+
+ | forms of ser, to be (not including the infinitive):
+soy
+eres
+es
+somos
+sois
+son
+sea
+seas
+seamos
+seáis
+sean
+seré
+serás
+será
+seremos
+seréis
+serán
+sería
+serías
+seríamos
+seríais
+serían
+era
+eras
+éramos
+erais
+eran
+fui
+fuiste
+fue
+fuimos
+fuisteis
+fueron
+fuera
+fueras
+fuéramos
+fuerais
+fueran
+fuese
+fueses
+fuésemos
+fueseis
+fuesen
+siendo
+sido
+ | sed also means 'thirst'
+
+ | forms of tener, to have (not including the infinitive):
+tengo
+tienes
+tiene
+tenemos
+tenéis
+tienen
+tenga
+tengas
+tengamos
+tengáis
+tengan
+tendré
+tendrás
+tendrá
+tendremos
+tendréis
+tendrán
+tendría
+tendrías
+tendríamos
+tendríais
+tendrían
+tenía
+tenías
+teníamos
+teníais
+tenían
+tuve
+tuviste
+tuvo
+tuvimos
+tuvisteis
+tuvieron
+tuviera
+tuvieras
+tuviéramos
+tuvierais
+tuvieran
+tuviese
+tuvieses
+tuviésemos
+tuvieseis
+tuviesen
+teniendo
+tenido
+tenida
+tenidos
+tenidas
+tened
+
diff --git a/algorithms/swedish/stemmer.html b/algorithms/swedish/stemmer.html
new file mode 100644
index 0000000..0ca315e
--- /dev/null
+++ b/algorithms/swedish/stemmer.html
@@ -0,0 +1,436 @@
+
+
+
+
+
+
+
+
+
+ Swedish stemming algorithm - Snowball
+
+
+
+
+
+
+
+
+
+
+
+The Swedish alphabet includes the following additional letters,
+
+
+
+ ä å ö
+
+
+
+The following letters are vowels:
+
+
+
+ a e i o u y ä å ö
+
+
+
+R2 is not used: R1 is defined in the same way as in the
+German stemmer.
+(See the note on R1 and R2.)
+
+
+
+Define a valid s-ending as one of
+
+
+
+   b c d f g h j k l m n o p r t v y
+
+
+
+Do each of steps 1, 2 and 3.
+
+
+
+Step 1:
+
+
+
+ Search for the longest among the following suffixes in R1, and
+ perform the action indicated.
+
+
(a)
+ a arna erna heterna orna ad e ade
+ ande arne are aste en anden aren heten
+ ern ar er heter or as arnas ernas
+ ornas es ades andes ens arens hetens erns
+ at andet het ast
+
delete
+
(b)
+ s
+
delete if preceded by a valid s-ending
+
+ (Of course the letter of the valid s-ending is
+ not necessarily in R1)
+
+
+
+Step 2:
+
+
+
+ Search for one of the following suffixes in R1, and if found
+ delete the last letter.
+
+ dd gd nn dt gt kt tt
+
+ (For example, friskt → frisk, fröknarnn → fröknarn)
+
+
+
+Step 3:
+
+
+
+ Search for the longest among the following suffixes in R1, and
+ perform the action indicated.
+
+
lig ig els
+
delete
+
löst
+
replace with lös
+
fullt
+
replace with full
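Steps 1–3 above can be sketched compactly in Python (a hedged, simplified sketch: `swedish_stem` and `r1` are illustrative names, longest-match handling is approximated, and only the rules stated above are implemented):

```python
VOWELS = set("aeiouyäåö")
S_ENDINGS = set("bcdfghjklmnoprtvy")

STEP1_SUFFIXES = sorted(
    """a arna erna heterna orna ad e ade ande arne are aste en anden aren
       heten ern ar er heter or as arnas ernas ornas es ades andes ens
       arens hetens erns at andet het ast""".split(),
    key=len, reverse=True)                     # longest match first

def r1(word):
    # Region after the first non-vowel following a vowel, adjusted (as in
    # the German stemmer) so the region before it is at least 3 letters.
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return max(i + 1, 3)
    return len(word)

def swedish_stem(word):
    p1 = r1(word)
    # Step 1: longest (a)-suffix in R1, else s preceded by a valid s-ending
    for suf in STEP1_SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= p1:
            word = word[:-len(suf)]
            break
    else:
        if (word.endswith("s") and len(word) - 1 >= p1
                and len(word) >= 2 and word[-2] in S_ENDINGS):
            word = word[:-1]
    # Step 2: if one of these pairs ends the word in R1, drop the last letter
    if len(word) - 2 >= p1 and word[-2:] in ("dd", "gd", "nn", "dt", "gt", "kt", "tt"):
        word = word[:-1]
    # Step 3: residual suffixes in R1
    if word.endswith("löst") and len(word) - 4 >= p1:
        word = word[:-1]                       # löst -> lös
    elif word.endswith("fullt") and len(word) - 5 >= p1:
        word = word[:-1]                       # fullt -> full
    else:
        for suf in ("els", "lig", "ig"):
            if word.endswith(suf) and len(word) - len(suf) >= p1:
                word = word[:-len(suf)]
                break
    return word
```

This reproduces the worked examples from step 2: `friskt → frisk` and `fröknarnn → fröknarn`.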
+
+
+
+
The same algorithm in Snowball
+
+[% highlight_file('swedish') %]
+
+[% footer %]
diff --git a/algorithms/swedish/stop.txt b/algorithms/swedish/stop.txt
new file mode 100644
index 0000000..493b76a
--- /dev/null
+++ b/algorithms/swedish/stop.txt
@@ -0,0 +1,125 @@
+
+ | A Swedish stop word list. Comments begin with vertical bar. Each stop
+ | word is at the start of a line.
+
+ | This is a ranked list (commonest to rarest) of stopwords derived from
+ | a large text sample.
+
+ | Swedish stop words occasionally exhibit homonym clashes. For example
+ | så = so, but also seed. These are indicated clearly below.
+
+och | and
+det | it, this/that
+att | to (with infinitive)
+i | in, at
+en | a
+jag | I
+hon | she
+som | who, that
+han | he
+på | on
+den | it, this/that
+med | with
+var | where, each
+sig | him(self) etc
+för | for
+så | so (also: seed)
+till | to
+är | is
+men | but
+ett | a
+om | if; around, about
+hade | had
+de | they, these/those
+av | of
+icke | not, no
+mig | me
+du | you
+henne | her
+då | then, when
+sin | his
+nu | now
+har | have
+inte | inte någon = no one
+hans | his
+honom | him
+skulle | 'sake'
+hennes | her
+där | there
+min | my
+man | one (pronoun)
+ej | nor
+vid | at, by, on (also: vast)
+kunde | could
+något | some etc
+från | from, off
+ut | out
+när | when
+efter | after, behind
+upp | up
+vi | we
+dem | them
+vara | be
+vad | what
+över | over
+än | than
+dig | you
+kan | can
+sina | his
+här | here
+ha | have
+mot | towards
+alla | all
+under | under (also: wonder)
+någon | some etc
+eller | or (else)
+allt | all
+mycket | much
+sedan | since
+ju | why
+denna | this/that
+själv | myself, yourself etc
+detta | this/that
+åt | to
+utan | without
+varit | was
+hur | how
+ingen | no
+mitt | my
+ni | you
+bli | to be, become
+blev | from bli
+oss | us
+din | thy
+dessa | these/those
+några | some etc
+deras | their
+blir | from bli
+mina | my
+samma | (the) same
+vilken | who, that
+er | you, your
+sådan | such a
+vår | our
+blivit | from bli
+dess | its
+inom | within
+mellan | between
+sådant | such a
+varför | why
+varje | each
+vilka | who, that
+ditt | thy
+vem | who
+vilket | who, that
+sitt | his
+sådana | such a
+vart | each
+dina | thy
+vars | whose
+vårt | our
+våra | our
+ert | your
+era | your
+vilkas | whose
+
diff --git a/algorithms/turkish/accompanying_paper.doc b/algorithms/turkish/accompanying_paper.doc
new file mode 100644
index 0000000..f0b325a
Binary files /dev/null and b/algorithms/turkish/accompanying_paper.doc differ
diff --git a/algorithms/turkish/stemmer.html b/algorithms/turkish/stemmer.html
new file mode 100644
index 0000000..3a736ef
--- /dev/null
+++ b/algorithms/turkish/stemmer.html
@@ -0,0 +1,585 @@
+
+
+
+
+
+
+
+
+
+ Turkish stemming algorithm - Snowball
+
+
+
+
+
+
+
+
+
+
+
+The Turkish stemming algorithm was provided by Evren Kapusuz Cilden. It stems
+only noun and nominal verb suffixes because noun stems are more important for
+information retrieval, and only handling these simplifies the algorithm
+significantly.
+
+
+
+In her paper (linked above) Evren explains
+
+
+
+
+The stemmer can be enhanced to stem all kinds of verb suffixes. In Turkish,
+there are over fifty suffixes that can be affixed to verbs [2]. The
+morphological structure of verb suffixes is more complicated than noun
+suffixes. Despite this, one can use the methodology presented in this paper to
+enhance the stemmer to find stems of all kinds of Turkish words.
+
+
+
+
where [2] is a reference to the following paper:
+
+
+
+Gulsen Eryigit and Esref Adali.
+An Affix Stripping Morphological Analyzer for Turkish
+Proceedings of the IAESTED International
+Conference
+ARTIFICIAL INTELLIGENCE AND APPLICATIONS, February 16-18, 2004, Innsbruck, Austria.
+
+
+
+
The algorithm in Snowball
+
+
+/* Stemmer for Turkish
+ * author: Evren (Kapusuz) Çilden
+ * email: evren.kapusuz at gmail.com
+ * version: 1.0 (15.01.2007)
+
+ * stems nominal verb suffixes
+ * stems nominal inflections
+ * more than one syllable word check
+ * (y,n,s,U) context check
+ * vowel harmony check
+ * last consonant check and conversion (b, c, d, ğ to p, ç, t, k)
+
+ * The stemming algorithm is based on the paper "An Affix Stripping
+ * Morphological Analyzer for Turkish" by Gülşen Eryiğit and
+ * Eşref Adalı (Proceedings of the IAESTED International Conference
+ * ARTIFICIAL INTELLIGENCE AND APPLICATIONS, February 16-18, 2004,
+ * Innsbruck, Austria)
+
+ * Turkish is an agglutinative language and has a very rich morphological
+ * structure. In Turkish, you can form many different words from a single stem
+ * by appending a sequence of suffixes. Eg. The word "doktoruymuşsunuz" means
+ * "You had been the doctor of him". The stem of the word is "doktor" and it
+ * takes three different suffixes -sU, -ymUs, and -sUnUz. The rules about
+ * the append order of suffixes can be clearly described as FSMs.
+ * The paper referenced above defines some FSMs for right to left
+ * morphological analysis. I generated a method for constructing snowball
+ * expressions from right to left FSMs for stemming suffixes.
+*/
+
+routines (
+  append_U_to_stems_ending_with_d_or_g  // for preventing some overstemmings
+  check_vowel_harmony        // tests vowel harmony for suffixes
+  is_reserved_word           // tests whether current string is a reserved word ('ad', 'soyad')
+  mark_cAsInA                // nominal verb suffix
+  mark_DA                    // noun suffix
+  mark_DAn                   // noun suffix
+  mark_DUr                   // nominal verb suffix
+  mark_ki                    // noun suffix
+  mark_lAr                   // noun suffix, nominal verb suffix
+  mark_lArI                  // noun suffix
+  mark_nA                    // noun suffix
+  mark_ncA                   // noun suffix
+  mark_ndA                   // noun suffix
+  mark_ndAn                  // noun suffix
+  mark_nU                    // noun suffix
+  mark_nUn                   // noun suffix
+  mark_nUz                   // nominal verb suffix
+  mark_sU                    // noun suffix
+  mark_sUn                   // nominal verb suffix
+  mark_sUnUz                 // nominal verb suffix
+  mark_possessives           // -(U)m, -(U)n, -(U)mUz, -(U)nUz
+  mark_yA                    // noun suffix
+  mark_ylA                   // noun suffix
+  mark_yU                    // noun suffix
+  mark_yUm                   // nominal verb suffix
+  mark_yUz                   // nominal verb suffix
+  mark_yDU                   // nominal verb suffix
+  mark_yken                  // nominal verb suffix
+  mark_ymUs_                 // nominal verb suffix
+  mark_ysA                   // nominal verb suffix
+
+  mark_suffix_with_optional_y_consonant
+  mark_suffix_with_optional_U_vowel
+  mark_suffix_with_optional_n_consonant
+  mark_suffix_with_optional_s_consonant
+
+  more_than_one_syllable_word
+
+  post_process_last_consonants
+  postlude
+
+  stem_nominal_verb_suffixes
+  stem_noun_suffixes
+  stem_suffix_chain_before_ki
+)
+
+stringescapes {}
+
+/* Special characters in Unicode Latin-1 and Latin Extended-A */
+stringdef cc   '{U+00E7}'   // LATIN SMALL LETTER C WITH CEDILLA
+stringdef g~   '{U+011F}'   // LATIN SMALL LETTER G WITH BREVE
+stringdef i'   '{U+0131}'   // LATIN SMALL LETTER I WITHOUT DOT
+stringdef o"   '{U+00F6}'   // LATIN SMALL LETTER O WITH DIAERESIS
+stringdef s,   '{U+015F}'   // LATIN SMALL LETTER S WITH CEDILLA
+stringdef u"   '{U+00FC}'   // LATIN SMALL LETTER U WITH DIAERESIS
+
+booleans ( continue_stemming_noun_suffixes )
+
+groupings ( vowel U vowel1 vowel2 vowel3 vowel4 vowel5 vowel6 )
+
+define vowel  'ae{i'}io{o"}u{u"}'
+define U      '{i'}iu{u"}'
+
+// the vowel grouping definitions below are used for checking vowel harmony
+define vowel1 'a{i'}ou'     // vowels that can end with suffixes containing 'a'
+define vowel2 'ei{o"}{u"}'  // vowels that can end with suffixes containing 'e'
+define vowel3 'a{i'}'       // vowels that can end with suffixes containing 'i''
+define vowel4 'ei'          // vowels that can end with suffixes containing 'i'
+define vowel5 'ou'          // vowels that can end with suffixes containing 'o' or 'u'
+define vowel6 '{o"}{u"}'    // vowels that can end with suffixes containing 'o"' or 'u"'
+
+externals ( stem )
+
+backwardmode (
+  // checks vowel harmony for possible suffixes,
+  // helps to detect whether the candidate for suffix applies to vowel harmony
+  // this rule is added to prevent over stemming
+  define check_vowel_harmony as (
+    test
+    (
+      (goto vowel)   // if there is a vowel
+      (
+        ('a' goto vowel1) or
+        ('e' goto vowel2) or
+        ('{i'}' goto vowel3) or
+        ('i' goto vowel4) or
+        ('o' goto vowel5) or
+        ('{o"}' goto vowel6) or
+        ('u' goto vowel5) or
+        ('{u"}' goto vowel6)
+      )
+    )
+  )
+
+  // if the last consonant before suffix is vowel and n then advance and delete
+  // if the last consonant before suffix is non vowel and n do nothing
+  // if the last consonant before suffix is not n then only delete the suffix
+  // assumption: slice beginning is set correctly
+  define mark_suffix_with_optional_n_consonant as (
+    ('n' (test vowel))
+    or
+    ((not(test 'n')) test(next vowel))
+  )
+
+  // if the last consonant before suffix is vowel and s then advance and delete
+  // if the last consonant before suffix is non vowel and s do nothing
+  // if the last consonant before suffix is not s then only delete the suffix
+  // assumption: slice beginning is set correctly
+  define mark_suffix_with_optional_s_consonant as (
+    ('s' (test vowel))
+    or
+    ((not(test 's')) test(next vowel))
+  )
+
+  // if the last consonant before suffix is vowel and y then advance and delete
+  // if the last consonant before suffix is non vowel and y do nothing
+  // if the last consonant before suffix is not y then only delete the suffix
+  // assumption: slice beginning is set correctly
+  define mark_suffix_with_optional_y_consonant as (
+    ('y' (test vowel))
+    or
+    ((not(test 'y')) test(next vowel))
+  )
+
+  define mark_suffix_with_optional_U_vowel as (
+    (U (test non-vowel))
+    or
+    ((not(test U)) test(next non-vowel))
+  )
+
+  define mark_possessives as (
+    among('m{i'}z' 'miz' 'muz' 'm{u"}z'
+          'n{i'}z' 'niz' 'nuz' 'n{u"}z' 'm' 'n')
+    (mark_suffix_with_optional_U_vowel)
+  )
+
+  define mark_sU as (
+    check_vowel_harmony
+    U
+    (mark_suffix_with_optional_s_consonant)
+  )
+
+  define mark_lArI as (
+    among('leri' 'lar{i'}')
+  )
+
+  define mark_yU as (
+    check_vowel_harmony
+    U
+    (mark_suffix_with_optional_y_consonant)
+  )
+
+  define mark_nU as (
+    check_vowel_harmony
+    among('n{i'}' 'ni' 'nu' 'n{u"}')
+  )
+
+  define mark_nUn as (
+    check_vowel_harmony
+    among('{i'}n' 'in' 'un' '{u"}n')
+    (mark_suffix_with_optional_n_consonant)
+  )
+
+  define mark_yA as (
+    check_vowel_harmony
+    among('a' 'e')
+    (mark_suffix_with_optional_y_consonant)
+  )
+
+  define mark_nA as (
+    check_vowel_harmony
+    among('na' 'ne')
+  )
+
+  define mark_DA as (
+    check_vowel_harmony
+    among('da' 'de' 'ta' 'te')
+  )
+
+  define mark_ndA as (
+    check_vowel_harmony
+    among('nda' 'nde')
+  )
+
+  define mark_DAn as (
+    check_vowel_harmony
+    among('dan' 'den' 'tan' 'ten')
+  )
+
+  define mark_ndAn as (
+    check_vowel_harmony
+    among('ndan' 'nden')
+  )
+
+  define mark_ylA as (
+    check_vowel_harmony
+    among('la' 'le')
+    (mark_suffix_with_optional_y_consonant)
+  )
+
+  define mark_ki as (
+    'ki'
+  )
+
+  define mark_ncA as (
+    check_vowel_harmony
+    among('ca' 'ce')
+    (mark_suffix_with_optional_n_consonant)
+  )
+
+  define mark_yUm as (
+    check_vowel_harmony
+    among('{i'}m' 'im' 'um' '{u"}m')
+    (mark_suffix_with_optional_y_consonant)
+  )
+
+  define mark_sUn as (
+    check_vowel_harmony
+    among('s{i'}n' 'sin' 'sun' 's{u"}n')
+  )
+
+  define mark_yUz as (
+    check_vowel_harmony
+    among('{i'}z' 'iz' 'uz' '{u"}z')
+    (mark_suffix_with_optional_y_consonant)
+  )
+
+  define mark_sUnUz as (
+    among('s{i'}n{i'}z' 'siniz' 'sunuz' 's{u"}n{u"}z')
+  )
+
+  define mark_lAr as (
+    check_vowel_harmony
+    among('ler' 'lar')
+  )
+
+  define mark_nUz as (
+    check_vowel_harmony
+    among('n{i'}z' 'niz' 'nuz' 'n{u"}z')
+  )
+
+  define mark_DUr as (
+    check_vowel_harmony
+    among('t{i'}r' 'tir' 'tur' 't{u"}r' 'd{i'}r' 'dir' 'dur' 'd{u"}r')
+  )
+
+  define mark_cAsInA as (
+    among('cas{i'}na' 'cesine')
+  )
+
+  define mark_yDU as (
+    check_vowel_harmony
+    among('t{i'}m' 'tim' 'tum' 't{u"}m' 'd{i'}m' 'dim' 'dum' 'd{u"}m'
+          't{i'}n' 'tin' 'tun' 't{u"}n' 'd{i'}n' 'din' 'dun' 'd{u"}n'
+          't{i'}k' 'tik' 'tuk' 't{u"}k' 'd{i'}k' 'dik' 'duk' 'd{u"}k'
+          't{i'}' 'ti' 'tu' 't{u"}' 'd{i'}' 'di' 'du' 'd{u"}')
+    (mark_suffix_with_optional_y_consonant)
+  )
+
+  // does not fully obey vowel harmony
+  define mark_ysA as (
+    among('sam' 'san' 'sak' 'sem' 'sen' 'sek' 'sa' 'se')
+    (mark_suffix_with_optional_y_consonant)
+  )
+
+  define mark_ymUs_ as (
+    check_vowel_harmony
+    among('m{i'}{s,}' 'mi{s,}' 'mu{s,}' 'm{u"}{s,}')
+    (mark_suffix_with_optional_y_consonant)
+  )
+
+  define mark_yken as (
+    'ken' (mark_suffix_with_optional_y_consonant)
+  )
+
+  define stem_nominal_verb_suffixes as (
+    [
+    set continue_stemming_noun_suffixes
+    (mark_ymUs_ or mark_yDU or mark_ysA or mark_yken)
+    or
+    (mark_cAsInA (mark_sUnUz or mark_lAr or mark_yUm or mark_sUn or mark_yUz or true) mark_ymUs_)
+    or
+    (
+      mark_lAr ] delete try([(mark_DUr or mark_yDU or mark_ysA or mark_ymUs_))
+      unset continue_stemming_noun_suffixes
+    )
+    or
+    (mark_nUz (mark_yDU or mark_ysA))
+    or
+    ((mark_sUnUz or mark_yUz or mark_sUn or mark_yUm) ] delete try([ mark_ymUs_))
+    or
+    (mark_DUr ] delete try([ (mark_sUnUz or mark_lAr or mark_yUm or mark_sUn or mark_yUz or true) mark_ymUs_))
+    ] delete
+  )
+
+  // stems noun suffix chains ending with -ki
+  define stem_suffix_chain_before_ki as (
+    [
+    mark_ki
+    (
+      (mark_DA ] delete try([
+        (mark_lAr ] delete try(stem_suffix_chain_before_ki))
+        or
+        (mark_possessives ] delete try([ mark_lAr ] delete stem_suffix_chain_before_ki))
+      ))
+      or
+      (mark_nUn ] delete try([
+        (mark_lArI ] delete)
+        or
+        ([ mark_possessives or mark_sU ] delete try([ mark_lAr ] delete stem_suffix_chain_before_ki))
+        or
+        (stem_suffix_chain_before_ki)
+      ))
+      or
+      (mark_ndA (
+        (mark_lArI ] delete)
+        or
+        ((mark_sU ] delete try([ mark_lAr ] delete stem_suffix_chain_before_ki)))
+        or
+        (stem_suffix_chain_before_ki)
+      ))
+    )
+  )
+
+  define stem_noun_suffixes as (
+    ([ mark_lAr ] delete try(stem_suffix_chain_before_ki))
+    or
+    ([ mark_ncA ] delete
+      try(
+        ([ mark_lArI ] delete)
+        or
+        ([ mark_possessives or mark_sU ] delete try([ mark_lAr ] delete stem_suffix_chain_before_ki))
+        or
+        ([ mark_lAr ] delete stem_suffix_chain_before_ki)
+      )
+    )
+    or
+    ([ (mark_ndA or mark_nA)
+      (
+        (mark_lArI ] delete)
+        or
+        (mark_sU ] delete try([ mark_lAr ] delete stem_suffix_chain_before_ki))
+        or
+        (stem_suffix_chain_before_ki)
+      )
+    )
+    or
+    ([ (mark_ndAn or mark_nU) ((mark_sU ] delete try([ mark_lAr ] delete stem_suffix_chain_before_ki)) or (mark_lArI)))
+    or
+    ([ mark_DAn ] delete try([
+      (
+        (mark_possessives ] delete try([ mark_lAr ] delete stem_suffix_chain_before_ki))
+        or
+        (mark_lAr ] delete try(stem_suffix_chain_before_ki))
+        or
+        (stem_suffix_chain_before_ki)
+      ))
+    )
+    or
+    ([ mark_nUn or mark_ylA ] delete
+      try(
+        ([ mark_lAr ] delete stem_suffix_chain_before_ki)
+        or
+        ([ mark_possessives or mark_sU ] delete try([ mark_lAr ] delete stem_suffix_chain_before_ki))
+        or
+        stem_suffix_chain_before_ki
+      )
+    )
+    or
+    ([ mark_lArI ] delete)
+    or
+    (stem_suffix_chain_before_ki)
+    or
+    ([ mark_DA or mark_yU or mark_yA ] delete try([ ((mark_possessives ] delete try([ mark_lAr)) or mark_lAr) ] delete [ stem_suffix_chain_before_ki))
+    or
+    ([ mark_possessives or mark_sU ] delete try([ mark_lAr ] delete stem_suffix_chain_before_ki))
+  )
+
+  define post_process_last_consonants as (
+    [substring] among (
+      'b' (<- 'p')
+      'c' (<- '{cc}')
+      'd' (<- 't')
+      '{g~}' (<- 'k')
+    )
+  )
+
+  // after stemming if the word ends with 'd' or 'g' most probably last U is overstemmed
+  // like in 'kedim' -> 'ked'
+  // Turkish words don't usually end with 'd' or 'g'
+  // some very well known words are ignored (like 'ad' 'soyad')
+  // appends U to stems ending with d or g, decides which vowel to add
+  // based on the last vowel in the stem
+  define append_U_to_stems_ending_with_d_or_g as (
+    test('d' or 'g')
+    (test((goto vowel) 'a' or '{i'}') <+ '{i'}')
+    or
+    (test((goto vowel) 'e' or 'i') <+ 'i')
+    or
+    (test((goto vowel) 'o' or 'u') <+ 'u')
+    or
+    (test((goto vowel) '{o"}' or '{u"}') <+ '{u"}')
+  )
+
+  define is_reserved_word as (
+    'ad' try 'soy' atlimit
+  )
+)
+
+// Tests if there are more than one syllables
+// In Turkish each vowel indicates a distinct syllable
+define more_than_one_syllable_word as (
+  test(loop 2 gopast vowel)
+)
+
+define postlude as (
+  backwards (
+    not(is_reserved_word)
+    do append_U_to_stems_ending_with_d_or_g
+    do post_process_last_consonants
+  )
+)
+
+define stem as (
+  (more_than_one_syllable_word)
+  (
+    backwards (
+      do stem_nominal_verb_suffixes
+      continue_stemming_noun_suffixes
+      do stem_noun_suffixes
+    )
+    postlude
+  )
+)
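The groupings `vowel1`–`vowel6` in the source above encode Turkish vowel harmony: a suffix containing a given vowel may only attach to a stem whose last vowel falls in the matching class. The same table can be sketched in Python (an illustrative sketch; `HARMONY` and `harmonic` are assumed names, not part of the Snowball source):

```python
# Suffix vowel -> set of permitted final stem vowels, mirroring vowel1..vowel6
HARMONY = {
    "a": set("aıou"),   # vowel1
    "e": set("eiöü"),   # vowel2
    "ı": set("aı"),     # vowel3
    "i": set("ei"),     # vowel4
    "o": set("ou"),     # vowel5
    "u": set("ou"),     # vowel5
    "ö": set("öü"),     # vowel6
    "ü": set("öü"),     # vowel6
}

def harmonic(stem_last_vowel, suffix_vowel):
    """True if a suffix containing suffix_vowel may follow a stem
    whose last vowel is stem_last_vowel."""
    return stem_last_vowel in HARMONY[suffix_vowel]
```

For example, `doktor` (last vowel `o`) accepts the suffix `-um`, while `ev` (last vowel `e`) does not, so a candidate `-um` suffix on `ev` would be rejected rather than overstemmed.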
+
+ All actual letters in the Hebrew alphabet, including:
+
+
The alphabet itself: א ב ג ד ה ו ז ח ט י כ ל מ נ ס ע פ צ ק ר ש ת
+
Final consonants: ך ם ן ף ץ
+
Ligatures: װ ױ ײ
+
+
+
Vowel
+
א ו י ע ױ ײ
+
Consonant
+
 AlefBeys - Vowel (i.e. any letter of the alphabet that is not a vowel)
+
+
+
Pre-processing
+
+
We replace two ו, where the second one is not וּ, with װ.
+
We replace ו י, where the י is not a יִ, with ױ.
+
We replace two י, where the second one is not a יִ, with ײ.
+
We replace final forms (e.g. ץ) with their normal form (e.g. צ).
+
We remove all niked.
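The pre-processing pass can be sketched with regular expressions (a hedged sketch: the function name and the exact range of nikud code points are assumptions, not taken from the Snowball source):

```python
import re

def yiddish_preprocess(word):
    # two vov -> tsvey vovn ligature, unless the second vov is vov + dagesh (וּ)
    word = re.sub("\u05D5\u05D5(?!\u05BC)", "\u05F0", word)
    # vov + yud -> vov-yud ligature, unless the yud is yud + hiriq (יִ)
    word = re.sub("\u05D5\u05D9(?!\u05B4)", "\u05F1", word)
    # two yud -> tsvey yudn ligature, unless the second yud is yud + hiriq
    word = re.sub("\u05D9\u05D9(?!\u05B4)", "\u05F2", word)
    # final forms -> normal forms (ך ם ן ף ץ -> כ מ נ פ צ)
    word = word.translate(str.maketrans("\u05DA\u05DD\u05DF\u05E3\u05E5",
                                        "\u05DB\u05DE\u05E0\u05E4\u05E6"))
    # strip remaining niked (Hebrew points, assumed here to be U+05B0..U+05C2)
    return re.sub("[\u05B0-\u05C2]", "", word)
```

Note the ordering: the ligature substitutions run before the niked is stripped, so a distinguishing dagesh or hiriq still blocks the corresponding ligature.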
+
+
+
Marking regions
+
+ Only a single marker is used: P1.
+ To begin with, this is set at the end of the word.
+
+
+
+
If the word begins with גע (except for געלט and געבן) it is replaced with "GE" and the cursor is advanced.
+
+ Next, if the word begins with any verbal prefix, the cursor is advanced past this prefix.
+ Prefixes include (niked added for clarity, not included in algorithm):
+
If the verbal prefix is followed by גע (except for געבן), it is replaced with "GE" and the cursor is advanced (e.g. אַװעקגעגאַנגען).
+
If the verbal prefix is followed by צו (except for צוגן, צוקט or צוקן with nothing afterwards), it is replaced with "TSU" and the cursor is advanced (e.g. אַרומצוגײן).
+
+
+
We are now at the start of the main portion of the word (past any verbal prefix and past participle marker).
+
+
+
The following valid Yiddish three-consonant sequences are skipped: שפר, שטר, שטש, דזש.
+
If there is a sequence of three consonants, the cursor is advanced past them, and P1 is marked.
+
Otherwise, the cursor is advanced to the first vowel, and then up to the first non-vowel, minus 1, and P1 is marked.
+
If P1 is not at least 3 letters beyond the main portion, it is advanced past the 3rd letter.
+
+
+
Backwards mode
+
+
Unless otherwise stated, all deletes ensure we are beyond P1.
+
In each pass, at the first level of bullets, the longest matching suffix always wins.
+Snowball (since version 2.0) supports specifying non-ASCII characters using
+the standard Unicode notation U+XXXX where XXXX is a string of
+hex digits. However, this doesn't make for very readable source code, so the
+Snowball scripts on this site define more mnemonic representations of the
+non-ASCII characters which they use - for example, the German stemmer includes
+the lines
+
+
+
/* special characters */
+
+stringdef a"   '{U+00E4}'
+stringdef o"   '{U+00F6}'
+stringdef u"   '{U+00FC}'
+stringdef ss   '{U+00DF}'
+
+
+
+
+(In Unicode, hex values E4, F6, FC and DF are the numeric values
+of characters ä, ö, ü and ß respectively.)
+
+
+
+Then the code which follows uses '{a"}' when it wants
+ä, etc.
+
+
+
+Using literal Unicode characters in strings in the source file may work in some
+cases, but isn't really supported - the snowball compiler doesn't (currently at
+least) have the concept of "source character set", so at best you'll limit
+which programming languages your stemmer can be used with.
+
+
+
+If you wish to describe other Latin-alphabet based codesets for use in stemmers
+we recommend using the following conventions:
+
+
+
+
  accent          ASCII form      example

  acute           single quote    e' for é
  grave           grave           a` for à
  umlaut          double quote    u" for ü
  circumflex      circumflex      i^ for î
  cedilla         letter c        cc for ç
  tilde           tilde           n~ for ñ
  ring            letter o        ao for å
  line through    solidus         o/ for ø
  breve           plus            a+ for ă
  double acute    letter q        oq for ő
  comma below     comma           t, for ț
+
+
+
+And, should they ever arise, use r for left and right
+hook (as in Polish), and v for hacek (as in Czech).
+
+
+
+The ‘line-through’ accent covers a number of miscellaneous cases: the
+Scandinavian o/, Icelandic d/ and Polish l/.
+
+
+
+Use ae and ss for æ ligature and the German
+ß, with
+upper case forms AE and SS. Use th for Icelandic thorn.
+
+
+
+We used to recommend , for cedilla, but we need a way to
+represent comma-below for Romanian, so we've repurposed ,
+for that and now recommend c for cedilla instead.
+
+
+
+If you're writing a new stemmer, see below for a file of suitable
+stringdef lines you can cut and paste into your code.
+
+Snowball is a small string-handling language, and its name was chosen as a
+tribute to SNOBOL (Farber 1964, Griswold 1968 —
+see the references at the end of the
+introduction),
+with which it shares the
+concept of string patterns delivering signals that are used to control the
+flow of the program.
+
+
+
1 Data types
+
+
+The basic data types handled by Snowball are strings of characters, signed
+integers, and boolean truth values, or more simply strings, integers
+and booleans. Snowball supports Unicode characters, which may be represented
+as UTF-8, 8-bit characters, or 16-bit wide characters (depending on the
+programming language code is being generated for - for C, all these options are
+supported).
+
+
+
2 Names
+
+
+A name in Snowball starts with an ASCII letter, followed by zero or more ASCII
+letters, digits and underscores. A name can be of type string,
+integer, boolean, routine, external or
+grouping. All names must be declared. A declaration has the form
+
+
+
+ Ts ( ... )
+
+
+
+where symbol T is one of string, integer etc, and the region in
+brackets contains a list of names separated by whitespace. For example,
+
+p1 and p2 are integers, Y_found is boolean, and so on. Snowball is quite
+strict about the declarations, so all the names go in the same name space,
+no name may be declared twice, all used names must be declared, no two
+routine definitions can have the same name, etc. Names declared and
+subsequently not used are merely reported in a warning message.
+
+
+A name may not be one of the reserved words of Snowball. Additionally, names
+for externals must be valid function/method names in the language being
+generated in most cases, which generally means they can't be reserved words
+in that language (e.g. externals ( null ) will generate
+invalid Java code containing a method public boolean null().)
+For internal symbols we add a prefix to avoid this issue, but an external
+has to provide an external interface. When generating C code, the
+-eprefix option provides a potential solution to this problem.
+
+
+
+Names in Snowball are case-sensitive, but external names which differ only in
+case will cause a problem for languages with case-insensitive identifiers (such
+as Pascal). This issue is avoided for internal symbols in such languages by
+encoding case difference via an added prefix.
+
+
+
+So for portability a little care is needed when choosing names for externals.
+The convention when using Snowball to implement stemming algorithms is to have
+a single external named stem, which should be safe.
+
+
+
3 Literals
+
+
3.1 Integer Literals
+
+
+A literal integer is an ASCII digit sequence, and is always interpreted as
+decimal.
+
+
+
3.2 String Literals
+
+
+A literal string is written between single quotes, for example,
+
+
+
'aeiouy'
+
+
+
+
+Two special insert characters for use in literal strings are defined by
+the directive stringescapes AB, for example,
+
+
+
stringescapes {}
+
+
+
+
+Conventionally { and } are used as the insert
+characters, and we would recommend following this convention unless you want to
+use these as literal characters in your strings a lot. However,
+ A and B can be any printing
+characters, except that A can't be a single quote.
+(If A and B are the same then
+ A itself can never be escaped.)
+
+
+
+A subsequent occurrence of the stringescapes directive redefines
+the insert characters (but any string macros already defined with
+stringdef remain defined).
+
+
+
+Within insert characters, the following sequences are understood:
+
+
+
+
+User-defined string macros which can be specified using
+stringdef. Macro m is defined in the
+form stringdef m 'S', where 'S' is a
+string, and m a sequence of one or more printing
+characters. Thereafter, {m} inside a string causes
+ S to be substituted in place of m.
+
+
+
+New in Snowball 2.0: Unicode codepoints can be specified using the syntax
+U+ followed by one or more hex digits - for example, '{U+FFFD}'.
+These are automatically handled
+appropriately in all cases except if you want to generate C code to handle a
+single byte character set other than ISO-8859-1. Such cases are handled by
+defining string macros for the U+ codes in the character set,
+after which the same Snowball source can be used. You can't mix use of
+U+ codes defined as string macros and with their default
+meanings in the same compilation. When U+ codes are defined
+as string macros, snowball will upper case the characters after the
++ if there's no macro defined with the case as given.
+
+
+
+By default {'} will substitute ' and
+{{} will substitute {, although macros ' and { may subsequently be
+redefined.
+
+
+
+A further feature is that {W} inside
+a string, where W is a
+sequence of whitespace characters including one or more newlines, is
+ignored. This enables long strings to be written over a number of lines.
+
+
+
+
+For example,
+
+
+
stringescapes {}
+
+/* Spanish diacritics */
+
+stringdef a'   '{U+00E1}'  // a-acute
+stringdef e'   '{U+00E9}'  // e-acute
+stringdef i'   '{U+00ED}'  // i-acute
+stringdef o'   '{U+00F3}'  // o-acute
+stringdef u'   '{U+00FA}'  // u-acute
+stringdef u"   '{U+00FC}'  // u-diaeresis
+stringdef n~   '{U+00F1}'  // n-tilde
+
+/* All the characters in Spanish used to represent vowels */
+
+define v 'aeiou{a'}{e'}{i'}{o'}{u'}{u"}'
+
+
+
+
4 Routines
+
+
+A routine definition has the form
+
+
+
define R as C
+
+
+
+
+where R is the routine name and C is a command, or bracketed group of
+commands. So a routine is defined as a sequence of zero or more commands.
+Snowball routines do not (at present) take parameters. For example,
+
+
+
define Step_5b as (    // this defines Step_5b
+    ['l']              // three commands here: [, 'l' and ]
+    R2 'l'             // two commands, R2 and 'l'
+    delete             // delete is one command
+)
+
+define R1 as $p1 <= cursor
+/* R1 is defined as the single command "$p1 <= cursor" */
+
+
+
+
+A routine is called simply by using its name, R, as a command.
+
+
+
5 Commands and signals
+
+
+The flow of control in Snowball is arranged by the implicit use of
+signals, rather than the explicit use of constructs like the if,
+else, break of C. The scheme is designed for handling strings, but is
+perhaps easier to introduce using integers. Suppose x, y, z ... are
+integers. The command
+
+
+
$x = 1
+
+
+
+
+sets x to 1. The command
+
+
+
$x > 0
+
+
+
+
+tests if x is greater than zero. Both commands give a signal t or f,
+(true or false), but while the second command gives t if x is greater
+than zero and f otherwise, the first command always gives t. In Snowball,
+every command gives a t or f signal. A sequence of commands can be turned
+into a single command by putting them in a list surrounded by round
+brackets:
+
+
+
+ ( C1 C2 C3 ... Ci Ci+1 ... )
+
+
+
+When this is obeyed, Ci+1 will be obeyed if each of the preceding C1 ...
+Ci give t, but as soon as a Ci gives f, the subsequent Ci+1 Ci+2 ...
+are ignored, and the whole sequence gives signal f. If all the Ci give t,
+however, the bracketed command sequence also gives t. So,
+
+
+
$x > 0  $y = 1
+
+
+
+
+sets y to 1 if x is greater than zero. If x is less than or equal to zero
+the two commands give f.
+
+
+
+If C1 and C2 are commands, we can build up the larger commands,
+
+
+
+
C1 or C2
+
— Do C1. If it gives t ignore C2, otherwise do C2. The resulting
+ signal is t if and only if C1 or C2 gave t.
+
C1 and C2
+
— Do C1. If it gives f ignore C2, otherwise do C2. The resulting
+ signal is t if and only if C1 and C2 gave t.
+
not C
+
— Do C. The resulting signal is t if C gave f, otherwise f.
+
try C
+
— Do C. The resulting signal is t whatever the signal of C.
+
fail C
+
— Do C. The resulting signal is f whatever the signal of C.
+
+
+
+So for example,
+
+
+
+
( $x > 0 $y = 1 ) or ( $y = 0 )
+
+
+
— sets y to 1 if x is greater than zero, otherwise to zero.
+
+
try ( ( $x > 0 ) and ( $z > 0 ) $y = 1 )
+
+
+
— sets y to 1 if both x and z are greater than 0, and gives t.
+
+
+
+This last example is the same as
+
+
+
try ( $x > 0 $z > 0 $y = 1 )
+
+
+
+
+so that and seems unnecessary here. But we will see that and has a
+particular significance in string commands.
+
+
+
+When a ‘monadic’ construct like not, try or fail is not followed by a
+round bracket, the construct applies to the shortest following valid command.
+So for example
+
+
+
try not $x < 1 $z > 0
+
+
+
+
+would mean
+
+
+
try ( not ( $x < 1 ) ) $z > 0
+
+
+
+
+because $x < 1 is the shortest valid command following not, and then
+not $x < 1 is the shortest valid command following try.
+
+
+
+The ‘dyadic’ constructs like and and or must sit in a bracketed list
+of commands anyway, for example,
+
+
+
+ ( C1 C2 and C3 C4 or C5 )
+
+
+
+And then in this case C2 and C3 are connected by the and; C4 and C5 are
+connected by the or. So
+
+
+
$x > 0  not $y > 0 or not $z > 0  $t > 0
+
+
+
+
+means
+
+
+
$x > 0 ( ( not ( $y > 0 ) ) or ( not ( $z > 0 ) ) ) $t > 0
+
+
+
+
+and and or are equally binding, and bind from left to right,
+so C1 or C2 and C3 means (C1 or C2) and C3 etc.
+
+
+
6 Integer commands
+
+
+There are two sorts of integer commands - assignments and comparisons. Both
+are built from Arithmetic Expressions (AEs).
+
+
+
Arithmetic Expressions (AEs)
+
+
+An AE consists of integer names, literal numbers and a few other things
+connected by dyadic +, -, * and /, and monadic -, with the same
+binding powers and semantics as C. As well as integer names and literal
+numbers, the following may be used in AEs:
+
+
+
+
minint
— the minimum negative number
+
maxint
— the maximum positive number
+
cursor
— the current value of the string cursor
+
limit
— the current value of the string limit
+
size
— the size of the string, in "slots"
+
sizeof s
— the number of "slots" in s, where s is the name of a string or (since Snowball 2.1) a literal string
+
New in Snowball 2.0:
+
len
— the length of the string, in Unicode characters
+
lenof s
— the number of Unicode characters in s, where s is the name of a string or (since Snowball 2.1) a literal string
+
+
+
+size and sizeof count in
+"slots" - see the "Character representation" section below for details.
+
+
+
+The cursor and limit concepts are explained below.
+
+
+
Integer assignments
+
+
+An integer assignment has the form
+
+
+
+ $X assign_op AE
+
+
+
+where X is an integer name and assign_op is one of the five assignments
+ =, +=, -=, *=, or /=.
+The meanings are the same as in C.
+
+
+
+For example,
+
+
+
$p1 = limit    // set p1 to the string limit
+
+
+
+
+Integer assignments always give the signal t.
+
+
+
Integer comparisons
+
+
+An integer comparison has the form
+
+
+
+ $X rel_op AE
+
+
+
+or (since Snowball 2.0):
+
+
+
 $( AE1 rel_op AE2 )
+
+
+
+where X is an integer name and rel_op is one of the six tests
+ ==, !=, >=,
+ >, <=, or <.
+Again, the meanings are the same as in C.
+
+
+
+Examples of integer comparisons are,
+
+
+
$p1 <= cursor    // signal is f if the cursor is before position p1
+$( len >= 3 )    // signal is f unless the string is at least 3 characters long
+
+
+
+
+The second form is more general since an integer name is a valid AE, but it
+also allows comparisons which don't involve integer variables. Before support
+for this was added the second example could only be achieved by assigning
+len to a variable and then testing that variable instead.
+
+
+
7 String commands
+
+
+If s is a string name, a string command has the form
+
+
+
$s C
+
+
+
+
+where C is a command that operates on the string. Strings can be processed
+left-to-right or right-to-left, but we will describe only the
+left-to-right case for now. The string has a cursor, which we will
+denote by c, and a limit point, or limit, which we will denote by l. c
+advances towards l in the course of a string command, but the various
+constructs and, or, not etc have side-effects which keep moving it
+backwards. Initially c is at the start and l the end of the string. For
+example,
+
+
+
+ 'a|n|i|m|a|d|v|e|r|s|i|o|n'
+ | |
+ c l
+
+
+
+c, and l, mark the boundaries between characters, and not
+characters themselves. The characters between c and l will be denoted by
+c:l.
+
+
+
+If C gives t, the cursor c will have a new, well-defined value. But if C
+gives f, c is undefined. Its later value will in fact be determined by the
+outer context of commands in which C came to be obeyed, not by C itself.
+
+
+
+Here is a list of the commands that can be used to operate on strings.
+
+
+
a) Setting a value
+
+
+
= S
+
where S is the name of a string or a literal string. c:l is set equal
+ to S, and l is adjusted to point to the end of the copied string. The
+ signal is t. For example,
+
+
$x = 'animadversion'    /* literal string */
+$y = x                  /* string name */
+
+
+
+
+
+
b) Basic tests
+
+
+
S
+
here and below, S is the name of a string or a literal string. If c:l
+ begins with the substring S, c is repositioned to the end of this
+ substring, and the signal is t. Otherwise the signal is f. For example,
+
+
$x 'anim'    /* gives t, assuming the string is 'animadversion' */
+$x ( 'anim' 'ad' 'vers' )
+             /* ditto */
+
+$t = 'anim'
+$x t         /* ditto */
+
+
+
+
true, false
+
true is a dummy command that generates signal t. false generates
+ signal f. They are sometimes useful for emphasis,
+
+
define start_off as true          // nothing to do
+define exception_list as false    // put in among(...) list later
+
+
+
+ true is equivalent to ()
+
C1 or C2
+
This is like the case for integers described above, but the extra
+ touch is that if C1 gives f, c is set back to its old position after
+ C1 has given f and before C2 is tried, so that the test takes place on
+ the same point in the string. So we have
+
+
$x ( 'anim'     /* signal t */
+     'ation'    /* signal f */
+   ) or
+   ( 'an'       /* signal t - from the beginning */
+   )
+
+
+
+
C1 and C2
+
And similarly c is set back to its old position after C1 has given t
+ and before C2 is tried. So,
+
+
$x 'anim' and 'an'    /* signal t */
+$x ( 'anim' 'an' )    /* signal f, since 'an' and 'ad' mis-match */
+
+
+
+
not C
+
try C
+
These are like the integer tests, with the added feature that c is set
+ back to its old position after an f signal is turned into t. So,
+
+
$x ( not 'animation' not 'immersion' )
+/* both tests are done at the start of the string */
+
+$x ( try 'animus' try 'an'
+     'imad' )
+/* - gives t */
+
+
+
+
+
try C
is equivalent to
C or true
+
+
test C
+
This does command C but without advancing c. Its signal is the same as
+ the signal of C, but following signal t, c is set back to its old
+ value.
+
+
test C
is equivalent to
not not C
+
test C1 C2
is equivalent to
C1 and C2
+
+
fail C
+
This does C and gives signal f. It is equivalent to C false. Like
+ false it is useful, but only rarely.
+
+
do C
+
This does C, puts c back to its old value and gives signal t. It is
+ very useful as a way of suppressing the side effect of f signals and
+ cursor movement.
+
+
do C
is equivalent to
try test C
+
or
test try C
+
+
goto C
+
c is moved right until obeying C gives t. But if c cannot be moved
+ right because it is at l the signal is f. c is set back to the position
+ it had before the last obeying of C, so the effect is to leave c before
+ the pattern which matched against C.
+
+
$x goto 'ad'    /* positions c after 'anim' */
+$x goto 'ax'    /* signal f */
+
+
+
+
gopast C
+
Like goto, but c is not set back, so the effect is to leave c after
+ the pattern which matched against C.
+
+
$x gopast 'ad'    /* positions c after 'animad' */
+
+
+
+
repeat C
+
C is repeated until it gives f. When this happens c is set back to the
+ position it had before the last repetition of C, and repeat C gives
+ signal t. For example,
+
+
$x repeat gopast 'a'    /* position c after the last 'a' */
+
+
+
+
loop AE C
+
This is like C C ... C written out AE times, where AE is an arithmetic
+ expression. For example,
+
+
$x loop 2 gopast ( 'a' or 'e' or 'i' or 'o' or 'u' )
+/* position c after the second vowel */
+
+
+
+ The equivalent expression in C has the shape,
+
+
{   int i;
+    int limit = AE;
+    for (i = 0; i < limit; i++) C;
+}
+
+
+
atleast AE C
+
This is equivalent to loop AE C repeat C.
+
+
hop AE
+
moves c AE character positions towards l, but if AE is negative, or if
+ there are fewer than AE characters between c and l, the signal is f.
+ For example,
+
+
test hop 3
+
+
+
+ tests that c:l contains more than 2 characters.
+
+
next
+
is equivalent to hop 1.
+
+
+
c) Moving text about
+
+
+We have seen in (a) that $x = y, when x and y are strings, sets c:l of x
+to the value of y. Conversely
+
+
+
$x => y
+
+
+
+
+sets the value of y to the c:l region of x.
+
+
+
+A more delicate mechanism for pushing text around is to define a substring,
+or slice of the string being tested. Then
+
+
+
+
[
+
+
+
sets the left-end of the slice to c,
+
]
+
+
+
sets the right-end of the slice to c,
+
->s
+
+
+
moves the slice to variable s,
+
<-S
+
+
+
replaces the slice with variable (or literal) S.
+
+
+
+For example
+
+
+
/* assume x holds 'animadversion' */
+$x ( [                  // '[animadversion' - [ set as indicated
+     loop 2 gopast 'a'
+                        // '[anima|dversion' - c is marked by '|'
+     ]                  // '[anima]dversion' - ] set as indicated
+     -> y               // y is 'anima'
+   )
+
+
+
+
+For any string, the slice ends should be assumed to be unset until they are
+set with the two commands [, ]. Thereafter the slice ends will retain
+the same values until altered.
+
+As this example shows, the slice markers [ and ] often appear as
+pairs in a bracketed style, which makes for easy reading of the Snowball
+scripts. But it must be remembered that, unusually in a computer
+programming language, they are not true brackets.
+
+
+
+More simply, text can be inserted at c.
+
+
+
+
insert S
+
+
+
insert variable or literal S before c, moving c to the right of the
+ insert. <+ is a synonym for insert.
+
+
attach S
+
+
+
the same, but leave c at the left of the insert.
+
+
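For example (assuming, as in the earlier examples, that x holds 'animadversion'):

```snowball
$x ( 'anim' insert 'X' )   /* x is now 'animXadversion', c just after the X */
```

With attach in place of insert the resulting string is the same, but c is left just before the inserted 'X'.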
+
d) Marks
+
+
+The cursor, c, (and the limit, l) can be thought of as having a numeric
+value, from zero upwards:
+
+
+
+ | a | n | i | m | a | d | v | e | r | s | i | o | n |
+ 0 1 2 3 4 5 6 7 8 9 10 11 12 13
+
+
+
+It is these numeric values of c and l which are accessible through
+cursor and limit in arithmetic expressions.
+
+
+
+
setmark X
+
+
+
sets X to the current value of c, where X is an integer variable.
+ It's equivalent to: $X = cursor
+
+
+
+
tomark AE
+
+
+
moves c forward to the position given by AE,
+
+
atmark AE
+
+
+
tests if c is at position AE (t or f signal).
+ It's equivalent to: $( cursor == AE )
+
+
+
+
+
+In the case of tomark AE, a similar fail condition occurs as with hop AE.
+If c is already beyond AE, or if position l is before position AE, the
+signal is f.
+
+
+
+In the stemming algorithms, certain regions of the word are defined by
+setting marks, and later the failure condition of tomark is used to see if
+c is inside a particular region.
+
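As an illustrative sketch of that idiom (v is assumed to be a grouping of vowels, as in the stemmers on this site):

```snowball
define mark_regions as (
    $p1 = limit                    // default: region R1 is empty
    do ( gopast v  gopast non-v    // past the first vowel, then past the
         setmark p1 )              // following non-vowel; remember it in p1
)

backwardmode (
    define R1 as $p1 <= cursor     // signal t only if c is inside R1
)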
+
+
+Two other commands put c at l, and test if c is at l,
+
+
+
+
tolimit
+
+
+
moves c forward to l (signal t always),
+
+
atlimit
+
+
+
tests if c is at l (t or f signal).
+
+
+
e) Changing l
+
+
+In this account of string commands we see c moving right towards l, while
+l stays fixed at the end. In fact l can be reset to a new position between
+c and its old position, to act as a shorter barrier for the movement of c.
+
+
+
+
setlimit C1 for C2
+
C1 is obeyed, and if it gives f the signal from setlimit
+ is f with no further action.
+
+
+
+ Otherwise, the final value of c becomes the new
+ position of l. c is then set back to its old value before C1 was
+ obeyed, and C2 is obeyed. Finally l is set back to its old position,
+ and the signal of C2 becomes the signal of setlimit.
+
+
+
+ So the signal is f if either C1 or C2 gives f, otherwise t.
+ For example,
+
+
+
$x ( setlimit goto 's'  // 'animadver}sion' - new l as marked by '}'
+     for               // below, '|' marks c after each goto
+     ( goto 'a' and    // '|animadver}sion'
+       goto 'e' and    // 'animadv|er}sion'
+       goto 'i'        // 'an|imadver}sion'
+     )
+   )
+
+
+
+
+ This checks that x has characters ‘a’, ‘e’ and ‘i’ before the first
+ ‘s’.
+
+
+
+
+
f) Backward processing
+
+
+String commands have been described with c to the left of l and moving
+right. But the process can be reversed.
+
+
+
+
backwards C
+
+
+
c and l are swapped over, and c moves left towards l. C is obeyed, the
+ signal given by C becomes the signal of backwards C, and c and l are
+ swapped back to their old values (except that l may have been adjusted
+ because of deletions and insertions). C cannot contain another
+ backwards command.
+
+
reverse C
+
+
+
A similar idea, but here c simply moves left instead of moving right,
+ with the beginning of the string as the limit, l. C can contain other
+ reverse commands, but it cannot contain commands to do deletions or
+ insertions — it must be used for testing only. (Without this
+ restriction Snowball's semantics would become very untidy.)
+
+
+
+Forward and backward processing are entirely symmetric, except that forward
+processing is the default direction, and literal strings are always
+written out forwards, even when they are being tested backwards. So the
+following are equivalent,
+
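The pair of equivalent examples from the original text is not reproduced here; as a minimal sketch of the convention:

```snowball
backwards ( 'sion' )   // tests that c:l ends with 'sion' - the literal is
                       // written forwards, not as 'nois'
```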
+If a routine is defined for backwards mode processing, it must be included
+inside a backwardmode(...) declaration.
+
+
+
g) substring and among
+
+
+The use of substring and among is central to the implementation of the
+stemming algorithms. It is like a case switch on strings. In its simpler
+form,
+
+
+
+ substring among('S1' 'S2' 'S3' ...)
+
+
+
+searches for the longest matching substring 'S1' or 'S2' or 'S3' ... from
+position c. (The 'Si' must all be different.) So this has the same
+semantics as
+
+
+
+ ('S1' or 'S2' or 'S3' ...)
+
+
+
+— so long as the 'Si' are written out in decreasing order of length.
+
+
+
+substring may be omitted, in which case it is attached to its following
+among, so
+
+
+
among ( /*...*/ )
+
+
+
+
+without a preceding substring is equivalent to
+
+
+
( substring among ( /*...*/ ) )
+
+
+
+
+substring may also be detached from its among, although it must
+precede it textually in the same routine in which the among appears.
+The more general form of substring /* ... */ among is,
+
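In outline, the general form looks like this (the 'Sij' are literal strings, the Ci are bracketed commands, and C is an optional command between the two keywords):

```snowball
substring

C   /* optional - run after the search, and only if a match was found */

among ( 'S11' 'S12' /* ... */ ( C1 )
        'S21' 'S22' /* ... */ ( C2 )
        /* ... */
)
```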
+Obeying substring searches for a longest match among the 'Sij'. The
+signal from substring is t if a match is found, otherwise f.
+Any commands C between the substring and among will be run after this
+search and only if the search finds a match (it would be equivalent to remove C and replace each
+Ci with C Ci). When the
+among comes to be obeyed, the Ci corresponding to the matched 'Sij' is
+obeyed, and its signal becomes the signal of the among command.
+
+
+
+substring/among pairs must match up textually inside each routine
+definition. But there is no problem with an among containing other
+substring/among pairs, and substring is optional before among anyway.
+The essential constraint is that two substrings must be separated by an
+among, and each substring must be followed by an among.
+
+
+
+The effect of obeying among when the preceding substring is not obeyed
+is undefined. This would happen for example here,
+
+
+
try ( $x != 617 substring )
+among (...)    // 'substring' is bypassed in the exceptional case where x == 617
+
+
+
+
+The significance of separating the substring from the among is to allow
+them to work in different contexts. For example,
+
+Here the test for the longest 'Sij' is constrained to the region between c
+and the mark point given by integer L. But the commands Ci operate outside
+this limit. Another example is
+
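in outline, an among in which each group of strings is followed by a routine name rather than a bracketed command:

```snowball
among ( 'S11' 'S12' /* ... */ R1
        'S21' 'S22' /* ... */ R2
        /* ... */
)
```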
+If a routine name is not specified, it is equivalent
+to a routine which simply returns signal t,
+
+
+
define null as true
+
+
+
+
+— so we can imagine each 'Sij' having its associated routine
+Rij. Then obeying the among causes a search for the longest
+'Sij' whose corresponding routine
+Rij gives t.
+
+
+
+The routines Rij should be written without any
+side-effects, other than the inevitable cursor movement. (c is in
+any case set back to its old value following a call of
+Rij.)
+
+
+
8 Booleans
+
+
+set B and unset B set B to true and false respectively, where B is a
+boolean name. B as a command gives a signal t if it is set true, f
+otherwise. For example,
+
+
+
booleans ( Y_found )    // declare the boolean
+
+/* ... */
+
+unset Y_found           // unset it
+do ( ['y'] <- 'Y' set Y_found )
+/* if c:l begins 'y' replace it by 'Y' and set Y_found */
+
+do repeat ( goto ( v ['y'] ) <- 'Y' set Y_found )
+/* repeatedly move down the string looking for v 'y' and
+   replacing 'y' with 'Y'. Whenever the replacement takes
+   place set Y_found. v is a test for a vowel, defined as
+   a grouping (see below). */
+
+
+/* Y_found means there are some letters Y in the string.
+   Later we can use this to trigger a conversion back to
+   lower case y. */
+
+/* ... */
+
+do ( Y_found repeat ( goto ( ['Y'] ) <- 'y' ) )
+
+
+
+
9 Groupings
+
+
+A grouping brings characters together and enables them to be looked for
+with a single test.
+
+
+
+If G is declared as a grouping, it can be defined by
+
+
+
 define G G1 op G2 op G3 ...
+
+
+
+where op is + or -, and G1, G2, G3 are literal strings, or groupings that
+have already been defined. (There can be zero or more of these additional
+op components). For example,
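For instance, a sketch in the style used by the stemmers on this site (the literals are illustrative):

```snowball
groupings ( v v_WXY )

define v     'aeiouy'
define v_WXY v + 'WXY'    // v with 'W', 'X' and 'Y' added
```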
+
+Once G is defined, it can be used as a command, and is equivalent to a test
+
+
+
+ 'ch1' or 'ch2' or ...
+
+
+
+where ch1, ch2 ... list all the characters in the grouping.
+
+
+
+non G is the converse test, and matches any character except the
+characters of G. Note that non G is not the same as not G; in fact
+
+
+
+non G is equivalent to ( not G next )
+
+
+
+
+
+non may be optionally followed by a hyphen, for example:
+
+
+
non-vowel
+non-digit
+
+
+
+
+Bear in mind that non-vowel doesn't only match a
+consonant - it'll match any character which isn't in the vowel
+grouping. Failing to consider this has led to bugs in stemming algorithms -
+for example, here we intended to undouble a consonant:
+
+
+
[non-vowel] -> ch
+ch
+delete
+
+
+
+
+The problem with this code is it will also mangle numbers with repeated digits,
+for example 1900 would become 190. A good rule of
+thumb here seems to be to use an inclusive grouping check if the code goes on
+to delete the character matched:
+
+
+
[consonant] -> ch
+ch
+delete
+
+
+
+
10 A Snowball program
+
+
+A complete program consists of a sequence of declarations followed by a
+sequence of definitions of groupings and routines. Routines which are
+implicitly defined as operating on c:l from right to left must be included
+in a backwardmode(...) declaration.
+
+
+
+A Snowball program is called up via a simple
+API
+through its defined externals. For example,
+
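a minimal sketch, following the convention (noted earlier) of a single external named stem:

```snowball
externals ( stem )

define stem as (
    /* ... commands operating on the current string ... */
)
```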
+The API also allows a current string to be defined, and this becomes the
+c:l string for the external routine to work on. Its final value is the
+result handed back through the API.
+
+
+
+The strings, integers and booleans are accessible from any point in the
+program, and exist throughout the running of the Snowball program. They are
+therefore like static declarations in C.
+
+
+
11 Comments, and other whitespace fillers
+
+
+At a deeper level, a program is a sequence of tokens, interspersed with
+whitespace. Names, reserved words, literal numbers and strings are all
+tokens. Various symbols, made up of non-alphanumerics, are also tokens.
+
+
+
+A name, reserved word or number is terminated by the first character that
+cannot form part of it. A symbol is recognised as the longest sequence of
+characters that forms a valid symbol. So +=- is two symbols, += and
+-, because += is a valid symbol in the language while +=- is not.
+Whitespace separates tokens but is otherwise ignored. This of course is
+like C.
+
+
+
+Occasionally a newer version of Snowball may add a new token. So as not to
+break existing programs, any such tokens declared as a name (via
+integers, routines, etc) will lose their token status for the rest of
+the program. This applies to the tokens len and lenof.
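+For example (a sketch): after the declaration below, len names an
+ordinary integer variable and is no longer recognised as the built-in
+token:
+
+
+[% highlight("
+    integers ( len )
+") %]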
+
+
+
+Anywhere that whitespace can occur, there may also occur:
+
+
+
+(a) Comments, in the usual multi-line /* .... */
+
+
+ or single line
+// ...
+
+
+ format.
+
+
+
+(b) Get directives. These are like #include commands in C, and have the form
+get 'S'
+
+
+, where 'S' is a literal string. For example,
+
+
+
get '/home/martin/snowball/main-hdr'    // include the file contents
+
+
+
+
+(c) stringescapes XY
+
+
+ where X and Y are any two printing characters.
+
+
+
+(d) stringdef m 'S', where m is a sequence of characters not including
+whitespace and terminated with whitespace, and 'S' is a literal string.
+
+
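+For example, with the conventional insert characters:
+
+
+[% highlight("
+    stringescapes {}
+    stringdef a' '{U+00E1}'  // {a'} in a literal string now gives a-acute
+") %]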
+
12 Character representation
+
+
+In this description of Snowball, it is assumed that strings are composed of
+characters, and that characters can be defined numerically, but the numeric range
+of these characters is not defined. As implemented, three different schemes
+are supported. Characters can either be (a) bytes in the range 0 to 255,
+as in traditional C strings, or (b) byte pairs in the range 0 to 65535,
+as in Java strings, or (c) UTF-8 encoded byte sequences in the range 0
+to 65535, so that a character may occupy 1, 2 or 3 bytes.
+
+
+
+For case (c), we need to make a slight separation of the concept of
+characters into symbols, the units of text being represented, and
+slots, the units of space into which they map. (So in case (a), all
+slots are one byte; in case (b) all slots are two bytes.)
+c and l have numeric values that can be used in AEs (arithmetic
+expressions). These values count the number of slots. Similarly
+setmark, tomark and atmark are remembering and then using slot
+counts. size and sizeof measure string size
+in slots, not symbols. However, hop N moves c over N symbols,
+not N slots, and next is equivalent to hop 1.
+
+
+
+Snowball 2.0 adds len and lenof, which measure string length in symbols
+(so they're the same as size and sizeof in cases (a) and (b), but
+different in case (c)).
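+As a sketch of the difference (this assumes a string variable s and a
+program compiled with -utf8, i.e. case (c)):
+
+
+[% highlight("
+    stringescapes {}
+    strings ( s )
+
+    /* ... then, inside a routine: */
+    $s = '{U+00E9}t{U+00E9}'  // e-acute, t, e-acute: three symbols
+    $(lenof s == 3)           // t whichever scheme is used
+    $(sizeof s == 5)          // t under UTF-8: each e-acute fills two slots
+") %]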
+
+
+
+So long as these simple distinctions are recognised, the same Snowball
+script can be compiled to work with any of the three encoding schemes.
+
+
+
13 Legacy Features
+
+
13.1 hex and decimal
+
+
+This section documents features of Snowball for which there's a strongly
+preferred alternative. They're still supported for compatibility with
+existing code which uses them, but you shouldn't use them in new code.
+We document them here so that their meaning in existing code can be
+understood, and especially to aid updating to the preferred alternatives.
+
+
+
+In a stringdef, the string may be preceded by the word hex,
+or the word decimal. This was how non-ASCII characters
+were specified before support for specifying Unicode codepoints using the
+U+ notation was added.
+
+
+
+hex and decimal mean that the contents of the string
+are interpreted as character values written out in hexadecimal, or decimal,
+notation. The characters should be separated by spaces. For example,
+
+
+
hex 'DA'         /* is character hex DA */
+hex 'D A'        /* is the two characters, hex D and A (carriage
+                    return, and line feed) */
+decimal '10'     /* character 10 (line feed) */
+decimal '13 10'  /* characters 13 and 10 (carriage return, and
+                    line feed) */
+
+
+
+
+The following forms are equivalent,
+
+
+
hex 'd a'        /* lower case also allowed */
+hex '0D 000A'    /* leading zeroes ignored */
+hex ' D A '      /* extra spacing is harmless */
+
+
+
+
+The interpretation of the values is as Unicode codepoints if command
+line option -utf8 or -widechars is specified, and as
+character values in an unspecified single byte character set otherwise. For
+ASCII and ISO-8859-1 the character values match Unicode codepoints, but to
+handle other single byte character sets (e.g. ISO-8859-2 or KOI8-R) you needed
+a special version of a Snowball source with different character values
+specified via stringdef. The U+ notation allows
+you to use a single Snowball source in this situation.
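+For example, a build for KOI8-R might define macros for the codepoints it
+uses (the mapping below is given for illustration; 0xC1 is Cyrillic а in
+KOI8-R):
+
+
+[% highlight("
+    stringescapes {}
+    stringdef U+0430 hex 'C1'  // {U+0430} now yields byte 0xC1
+") %]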
+
+
+
13.2 among starter command
+
+
+The among command supports a "starter" command, C
+in this example:
+
+The equivalent form without a starter requires an explicit substring but
+seems clearer, so we recommend using it in new code and have designated the
+use of a starter as a legacy feature.
+
+
+
+A starter is also allowed with an explicit substring, for example:
+
+In the grammar which follows, || is used for alternatives,
+ [X] means that X is
+optional, and [X]* means that X is repeated zero or more
+times. meta-symbols are defined on the left. <char> means any
+character.
+
+
+
+The definition of literal string does not allow for the escaping
+conventions established by the stringescapes directive. The command
+? is a debugging aid.
+
+
+
+<letter> ::= a || b || ... || z || A || B || ... || Z
+<digit> ::= 0 || 1 || ... || 9
+<name> ::= <letter> [ <letter> || <digit> || _ ]*
+<s_name> ::= <name>
+<i_name> ::= <name>
+<b_name> ::= <name>
+<r_name> ::= <name>
+<g_name> ::= <name>
+<literal string>::= '[<char>]*'
+<number> ::= <digit> [ <digit> ]*
+
+S ::= <s_name> || <literal string>
+G ::= <g_name> || <literal string>
+
+<declaration> ::= strings ( [<s_name>]* ) ||
+ integers ( [<i_name>]* ) ||
+ booleans ( [<b_name>]* ) ||
+ routines ( [<r_name>]* ) ||
+ externals ( [<r_name>]* ) ||
+ groupings ( [<g_name>]* )
+
+<r_definition> ::= define <r_name> as C
+<plus_or_minus> ::= + || -
+<g_definition> ::= define <g_name> G [ <plus_or_minus> G ]*
+
+AE ::= (AE) ||
+ AE + AE || AE - AE || AE * AE || AE / AE || - AE ||
+ maxint || minint || cursor || limit ||
+ size || sizeof S ||
+ len || lenof S ||
+ <i_name> || <number>
+
+<i_assign> ::= $ <i_name> = AE ||
+ $ <i_name> += AE || $ <i_name> -= AE ||
+ $ <i_name> *= AE || $ <i_name> /= AE
+
+<i_test_op> ::= == || != || > || >= || < || <=
+
+<i_test> ::= $ ( AE <i_test_op> AE ) ||
+ $ <i_name> <i_test_op> AE
+
+<s_command> ::= $ <s_name> C
+
+C ::= ( [C]* ) ||
+ <i_assign> || <i_test> || <s_command> || C or C || C and C ||
+ not C || test C || try C || do C || fail C ||
+ goto C || gopast C || repeat C || loop AE C ||
+ atleast AE C || S || = S || insert S || attach S ||
+ <- S || delete || hop AE || next ||
+ => <s_name> || [ || ] || -> <s_name> ||
+ setmark <i_name> || tomark AE || atmark AE ||
+ tolimit || atlimit || setlimit C for C ||
+ backwards C || reverse C || substring ||
+ among ( [<literal string> [<r_name>] || (C)]* ) ||
+ set <b_name> || unset <b_name> || <b_name> ||
+ <r_name> || <g_name> || non [-] <g_name> ||
+ true || false || ?
+
+P ::= [P]* || <declaration> ||
+ <r_definition> || <g_definition> ||
+ backwardmode ( P )
+
+<program> ::= P
+
+
+
+synonyms: <+ for insert
+
+Snowball is a small string-handling language, and its name was chosen as a
+tribute to SNOBOL (Farber 1964, Griswold 1968 —
+see the references at the end of the
+introduction),
+with which it shares the
+concept of string patterns delivering signals that are used to control the
+flow of the program.
+
+
+
1 Data types
+
+
+The basic data types handled by Snowball are strings of characters, signed
+integers, and boolean truth values, or more simply strings, integers
+and booleans. Snowball supports Unicode characters, which may be represented
+as UTF-8, 8-bit characters, or 16-bit wide characters (depending on the
+programming language code is being generated for - for C, all these options are
+supported).
+
+
+
2 Names
+
+
+A name in Snowball starts with an ASCII letter, followed by zero or more ASCII
+letters, digits and underscores. A name can be of type string,
+integer, boolean, routine, external or
+grouping. All names must be declared. A declaration has the form
+
+
+
+ Ts ( ... )
+
+
+
+where symbol T is one of string, integer etc, and the region in
+brackets contains a list of names separated by whitespace. For example,
+
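+A typical set of declarations (a reconstruction consistent with the
+discussion below) is:
+
+
+[% highlight("
+    integers ( p1 p2 )
+    booleans ( Y_found )
+") %]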
+p1 and p2 are integers, Y_found is boolean, and so on. Snowball is quite
+strict about the declarations, so all the names go in the same name space,
+no name may be declared twice, all used names must be declared, no two
+routine definitions can have the same name, etc. Names declared and
+subsequently not used are merely reported in a warning message.
+
+
+A name may not be one of the reserved words of Snowball. Additionally, names
+for externals must in most cases be valid function/method names in the
+language being generated, which generally means they can't be reserved words
+in that language (e.g. [% highlight_inline("externals (null)") %] will generate
+invalid Java code containing a method public boolean null().)
+For internal symbols we add a prefix to avoid this issue, but an external
+has to provide an external interface. When generating C code, the
+-eprefix option provides a potential solution to this problem.
+
+
+
+Names in Snowball are case-sensitive, but external names which differ only in
+case will cause a problem for languages with case-insensitive identifiers (such
+as Pascal). This issue is avoided for internal symbols in such languages by
+encoding case difference via an added prefix.
+
+
+
+So for portability a little care is needed when choosing names for externals.
+The convention when using Snowball to implement stemming algorithms is to have
+a single external named stem, which should be safe.
+
+
+
3 Literals
+
+
3.1 Integer Literals
+
+
+A literal integer is an ASCII digit sequence, and is always interpreted as
+decimal.
+
+
+
3.2 String Literals
+
+
+A literal string is written between single quotes, for example,
+
+
+[% highlight("
+ 'aeiouy'
+") %]
+
+
+Two special insert characters for use in literal strings are defined by
+the directive [% highlight_inline("stringescapes AB") %], for example,
+
+
+[% highlight("
+ stringescapes {}
+") %]
+
+
+Conventionally { and } are used as the insert
+characters, and we would recommend following this convention unless you want to
+use these as literal characters in your strings a lot. However,
+ A and B can be any printing
+characters, except that A can't be a single quote.
+(If A and B are the same then
+ A itself can never be escaped.)
+
+
+
+A subsequent occurrence of the stringescapes directive redefines
+the insert characters (but any string macros already defined with
+stringdef remain defined).
+
+
+
+Within insert characters, the following sequences are understood:
+
+
+
+
+User-defined string macros which can be specified using
+stringdef. Macro m is defined in the
+form stringdef m 'S', where 'S' is a
+string, and m a sequence of one or more printing
+characters. Thereafter, {m} inside a string causes
+ S to be substituted in place of m.
+
+
+
+New in Snowball 2.0: Unicode codepoints can be specified using the syntax
+U+ followed by one or more hex digits - for example,
+[% highlight_inline("'{U+FFFD}'") %]. These are automatically handled
+appropriately in all cases except if you want to generate C code to handle a
+single byte character set other than ISO-8859-1. Such cases are handled by
+defining string macros for the U+ codes in the character set,
+after which the same Snowball source can be used. You can't mix use of
+U+ codes defined as string macros and with their default
+meanings in the same compilation. When U+ codes are defined
+as string macros, snowball will upper case the characters after the
++ if there's no macro defined with the case as given.
+
+
+
+By default {'} will substitute ' and
+{{} will substitute {, although macros ' and { may subsequently be
+redefined.
+
+
+
+A further feature is that {W} inside
+a string, where W is a
+sequence of whitespace characters including one or more newlines, is
+ignored. This enables long strings to be written over a number of lines.
+
+
+
+
+For example,
+
+
+[% highlight("
+ stringescapes {}
+
+ /* Spanish diacritics */
+
+ stringdef a' '{U+00E1}' // a-acute
+ stringdef e' '{U+00E9}' // e-acute
+ stringdef i' '{U+00ED}' // i-acute
+ stringdef o' '{U+00F3}' // o-acute
+ stringdef u' '{U+00FA}' // u-acute
+ stringdef u\" '{U+00FC}' // u-diaeresis
+ stringdef n~ '{U+00F1}' // n-tilde
+
+ /* All the characters in Spanish used to represent vowels */
+
+ define v 'aeiou{a'}{e'}{i'}{o'}{u'}{u\"}'
+") %]
+
+
4 Routines
+
+
+A routine definition has the form
+
+
+[% highlight("
+ define R as C
+") %]
+
+
+where R is the routine name and C is a command, or bracketed group of
+commands. So a routine is defined as a sequence of zero or more commands.
+Snowball routines do not (at present) take parameters. For example,
+
+
+[% highlight("
+ define Step_5b as ( // this defines Step_5b
+ ['l'] // three commands here: [, 'l' and ]
+ R2 'l' // two commands, R2 and 'l'
+ delete // delete is one command
+ )
+" _ '
+ define R1 as $p1 <= cursor
+ /* R1 is defined as the single command "$p1 <= cursor" */
+') %]
+
+
+A routine is called simply by using its name, R, as a command.
+
+
+
5 Commands and signals
+
+
+The flow of control in Snowball is arranged by the implicit use of
+signals, rather than the explicit use of constructs like the if,
+else, break of C. The scheme is designed for handling strings, but is
+perhaps easier to introduce using integers. Suppose x, y, z ... are
+integers. The command
+
+
+[% highlight('
+ $x = 1
+') %]
+
+
+sets x to 1. The command
+
+
+[% highlight('
+ $x > 0
+') %]
+
+
+tests if x is greater than zero. Both commands give a signal t or f,
+(true or false), but while the second command gives t if x is greater
+than zero and f otherwise, the first command always gives t. In Snowball,
+every command gives a t or f signal. A sequence of commands can be turned
+into a single command by putting them in a list surrounded by round
+brackets:
+
+
+
+ ( C1 C2 C3 ... Ci Ci+1 ... )
+
+
+
+When this is obeyed, Ci+1 will be obeyed if each of the preceding C1 ...
+Ci give t, but as soon as a Ci gives f, the subsequent Ci+1 Ci+2 ...
+are ignored, and the whole sequence gives signal f. If all the Ci give t,
+however, the bracketed command sequence also gives t. So,
+
+
+[% highlight('
+ $x > 0 $y = 1
+') %]
+
+
+sets y to 1 if x is greater than zero. If x is less than or equal to zero
+the two commands give f.
+
+
+
+If C1 and C2 are commands, we can build up the larger commands,
+
+
+
+
C1 or C2
+
— Do C1. If it gives t ignore C2, otherwise do C2. The resulting
+ signal is t if and only if C1 or C2 gave t.
+
C1 and C2
+
— Do C1. If it gives f ignore C2, otherwise do C2. The resulting
+ signal is t if and only if C1 and C2 gave t.
+
not C
+
— Do C. The resulting signal is t if C gave f, otherwise f.
+
try C
+
— Do C. The resulting signal is t whatever the signal of C.
+
fail C
+
— Do C. The resulting signal is f whatever the signal of C.
+
+A sequence of commands already behaves like the same commands connected by
+and, so that and seems unnecessary here. But we will see that and has a
+particular significance in string commands.
+
+
+
+When a ‘monadic’ construct like not, try or fail is not followed by a
+round bracket, the construct applies to the shortest following valid command.
+So for example
+
+try not $x < 1 means try (not ($x < 1)),
+because [% highlight_inline('$x < 1') %] is the shortest valid command following not, and then
+not $x < 1 is the shortest valid command following try.
+
+
+
+The ‘dyadic’ constructs like and and or must sit in a bracketed list
+of commands anyway, for example,
+
+
+
+ ( C1 C2 and C3 C4 or C5 )
+
+
+
+And then in this case C2 and C3 are connected by the and; C4 and C5 are
+connected by the or. So
+
+
+[% highlight('
+ $x > 0 not $y > 0 or not $z > 0 $t > 0
+') %]
+
+
+and and or are equally binding, and bind from left to right,
+so C1 or C2 and C3 means (C1 or C2) and C3 etc.
+
+
+
6 Integer commands
+
+
+There are two sorts of integer commands - assignments and comparisons. Both
+are built from Arithmetic Expressions (AEs).
+
+
+
Arithmetic Expressions (AEs)
+
+
+An AE consists of integer names, literal numbers and a few other things
+connected by dyadic +, -, * and /, and monadic -, with the same
+binding powers and semantics as C. As well as integer names and literal
+numbers, the following may be used in AEs:
+
+
+
+
minint
— the minimum negative number
+
maxint
— the maximum positive number
+
cursor
— the current value of the string cursor
+
limit
— the current value of the string limit
+
size
— the size of the string, in "slots"
+
sizeof s
— the number of "slots" in s, where s is the name of a string or (since Snowball 2.1) a literal string
+
New in Snowball 2.0:
+
len
— the length of the string, in Unicode characters
+
lenof s
— the number of Unicode characters in s, where s is the name of a string or (since Snowball 2.1) a literal string
+
+
+
+[% highlight_inline('size') %] and [% highlight_inline('sizeof') %] count in
+"slots" - see the "Character representation" section below for details.
+
+
+
+The cursor and limit concepts are explained below.
+
+
+
Integer assignments
+
+
+An integer assignment has the form
+
+
+
+ $X assign_op AE
+
+
+
+where X is an integer name and assign_op is one of the five assignments
+ =, +=, -=, *=, or /=.
+The meanings are the same as in C.
+
+
+
+For example,
+
+
+[% highlight('
+ $p1 = limit // set p1 to the string limit
+') %]
+
+
+Integer assignments always give the signal t.
+
+
+
Integer comparisons
+
+
+An integer comparison has the form
+
+
+
+ $X rel_op AE
+
+
+
+or (since Snowball 2.0):
+
+
+
+ $(AE1 rel_op AE2)
+
+
+
+where X is an integer name and rel_op is one of the six tests
+ ==, !=, >=,
+ >, <=, or <.
+Again, the meanings are the same as in C.
+
+
+
+Examples of integer comparisons are,
+
+
+[% highlight('
+ $p1 <= cursor // signal is f if the cursor is before position p1
+ $(len >= 3) // signal is f unless the string is at least 3 characters long
+') %]
+
+
+The second form is more general since an integer name is a valid AE, but it
+also allows comparisons which don't involve integer variables. Before support
+for this was added the second example could only be achieved by assigning
+len to a variable and then testing that variable instead.
+
+
+
7 String commands
+
+
+If s is a string name, a string command has the form
+
+
+[% highlight('
+ $s C
+') %]
+
+
+where C is a command that operates on the string. Strings can be processed
+left-to-right or right-to-left, but we will describe only the
+left-to-right case for now. The string has a cursor, which we will
+denote by c, and a limit point, or limit, which we will denote by l. c
+advances towards l in the course of a string command, but the various
+constructs and, or, not etc have side-effects which keep moving it
+backwards. Initially c is at the start and l the end of the string. For
+example,
+
+
+
+ 'a|n|i|m|a|d|v|e|r|s|i|o|n'
+ | |
+ c l
+
+
+
+c, and l, mark the boundaries between characters, and not
+characters themselves. The characters between c and l will be denoted by
+c:l.
+
+
+
+If C gives t, the cursor c will have a new, well-defined value. But if C
+gives f, c is undefined. Its later value will in fact be determined by the
+outer context of commands in which C came to be obeyed, not by C itself.
+
+
+
+Here is a list of the commands that can be used to operate on strings.
+
+
+
a) Setting a value
+
+
+
= S
+
where S is the name of a string or a literal string. c:l is set equal
+ to S, and l is adjusted to point to the end of the copied string. The
+ signal is t. For example,
+
+[% highlight('
+ $x ' _ " = 'animadversion' /* literal string */" _ '
+ $y = x /* string name */
+') %]
+
+
+
+
b) Basic tests
+
+
+
S
+
here and below, S is the name of a string or a literal string. If c:l
+ begins with the substring S, c is repositioned to the end of this
+ substring, and the signal is t. Otherwise the signal is f. For example,
+
+[% highlight('
+ $x ' _ "'anim' /* gives t, assuming the string is 'animadversion' */" _ '
+ $x ' _ "('anim' 'ad' 'vers')" _ '
+ /* ditto */
+
+ $t ' _ "= 'anim'" _ '
+ $x t /* ditto */
+') %]
+
+
true, false
+
true is a dummy command that generates signal t. false generates
+ signal f. They are sometimes useful for emphasis,
+
+[% highlight("
+ define start_off as true // nothing to do
+ define exception_list as false // put in among(...) list later
+") %]
+
+ true is equivalent to ()
+
C1 or C2
+
This is like the case for integers described above, but the extra
+ touch is that if C1 gives f, c is set back to its old position after
+ C1 has given f and before C2 is tried, so that the test takes place on
+ the same point in the string. So we have
+
+[% highlight('
+ $x ' _ "('anim' /* signal t */
+ 'ation' /* signal f */
+ ) or
+ ( 'an' /* signal t - from the beginning */
+ )
+") %]
+
+
C1 and C2
+
And similarly c is set back to its old position after C1 has given t
+ and before C2 is tried. So,
+
+[% highlight('
+ $x ' _ "'anim' and 'an' /* signal t */" _ '
+ $x ' _ "('anim' 'an') /* signal f, since 'an' and 'ad' mis-match */
+") %]
+
+
not C
+
try C
+
These are like the integer tests, with the added feature that c is set
+ back to its old position after an f signal is turned into t. So,
+
+[% highlight('
+ $x ' _ "(not 'animation' not 'immersion')
+ /* both tests are done at the start of the string */
+" _ '
+ $x ' _ "(try 'animus' try 'an'
+ 'imad')
+ /* - gives t */
+") %]
+
+
+
try C
is equivalent to
C or true
+
+
test C
+
This does command C but without advancing c. Its signal is the same as
+ the signal of C, but following signal t, c is set back to its old
+ value.
+
+
test C
is equivalent to
not not C
+
test C1 C2
is equivalent to
C1 and C2
+
+
fail C
+
This does C and gives signal f. It is equivalent to C false. Like
+ false it is useful, but only rarely.
+
+
do C
+
This does C, puts c back to its old value and gives signal t. It is
+ very useful as a way of suppressing the side effect of f signals and
+ cursor movement.
+
+
do C
is equivalent to
try test C
+
or
test try C
+
+
goto C
+
c is moved right until obeying C gives t. But if c cannot be moved
+ right because it is at l the signal is f. c is set back to the position
+ it had before the last obeying of C, so the effect is to leave c before
+ the pattern which matched against C.
+
+[% highlight('
+ $x goto' _ " 'ad' /* positions c after 'anim' */" _ '
+ $x goto' _ " 'ax' /* signal f */
+") %]
+
+
gopast C
+
Like goto, but c is not set back, so the effect is to leave c after
+ the pattern which matched against C.
+
+[% highlight('
+ $x gopast' _ " 'ad' /* positions c after 'animad' */
+") %]
+
+
repeat C
+
C is repeated until it gives f. When this happens c is set back to the
+ position it had before the last repetition of C, and repeat C gives
+ signal t. For example,
+
+[% highlight('
+ $x repeat gopast' _ " 'a' /* position c after the last 'a' */
+") %]
+
+
loop AE C
+
This is like C C ... C written out AE times, where AE is an arithmetic
+ expression. For example,
+
+[% highlight('
+ $x loop 2 gopast' _ " ('a' or 'e' or 'i' or 'o' or 'u')
+ /* position c after the second vowel */
+") %]
+
+ The equivalent expression in C has the shape,
+
+[% highlight("
+ { int i;
+ int limit = AE;
+ for (i = 0; i < limit; i++) C;
+ }
+", "c") %]
+
+
atleast AE C
+
This is equivalent to loop AE C repeat C.
+
+
hop AE
+
moves c AE character positions towards l, but if AE is negative, or if
+ there are fewer than AE characters between c and l, the signal is f.
+ For example,
+
+[% highlight("
+ test hop 3
+") %]
+
+ tests that c:l contains more than 2 characters.
+
+
next
+
is equivalent to hop 1.
+
+
+
c) Moving text about
+
+
+We have seen in (a) that $x = y, when x and y are strings, sets c:l of x
+to the value of y. Conversely
+
+
+[% highlight('
+ $x => y
+') %]
+
+
+sets the value of y to the c:l region of x.
+
+
+
+A more delicate mechanism for pushing text around is to define a substring,
+or slice of the string being tested. Then
+
+
+
+
[% highlight_inline('[') %]
+
sets the left-end of the slice to c,
+
[% highlight_inline(']') %]
+
sets the right-end of the slice to c,
+
[% highlight_inline("-> s") %]
+
moves the slice to variable s,
+
[% highlight_inline("<- S") %]
+
replaces the slice with variable (or literal) S.
+
+
+
+For example
+
+
+[% highlight("
+ /* assume x holds 'animadversion' */" _ '
+ $x ( [ ' _ " // '[animadversion' - [ set as indicated
+ loop 2 gopast 'a'
+ // '[anima|dversion' - c is marked by '|'
+ ] // '[anima]dversion' - ] set as indicated
+ -> y // y is 'anima'
+ )
+") %]
+
+
+For any string, the slice ends should be assumed to be unset until they are
+set with the two commands [, ]. Thereafter the slice ends will retain
+the same values until altered.
+
+
+
+
[% highlight_inline("delete") %]
+
is equivalent to [% highlight_inline("<- ''") %]
+
+
+
+This next example deletes all vowels in x,
+
+
+[% highlight("
+ define vowel as ('a' or 'e' or 'i' or 'o' or 'u')
+ /* ... */" _ '
+ $ x repeat ( gopast([vowel]) delete )
+') %]
+
+
+As this example shows, the slice markers [ and ] often appear as
+pairs in a bracketed style, which makes for easy reading of the Snowball
+scripts. But it must be remembered that, unusually in a computer
+programming language, they are not true brackets.
+
+
+
+More simply, text can be inserted at c.
+
+
+
+
[% highlight_inline("insert S") %]
+
insert variable or literal S before c, moving c to the right of the
+ insert. <+ is a synonym for insert.
+
+
[% highlight_inline("attach S") %]
+
the same, but leave c at the left of the insert.
+
+
+
d) Marks
+
+
+The cursor, c, (and the limit, l) can be thought of as having a numeric
+value, from zero upwards:
+
+
+
+ | a | n | i | m | a | d | v | e | r | s | i | o | n |
+ 0 1 2 3 4 5 6 7 8 9 10 11 12 13
+
+
+
+It is these numeric values of c and l which are accessible through
+cursor and limit in arithmetic expressions.
+
+
+
+
[% highlight_inline("setmark X") %]
+
sets X to the current value of c, where X is an integer variable.
+ It's equivalent to: [% highlight_inline("$X = cursor") %]
+
+
[% highlight_inline("tomark AE") %]
+
moves c forward to the position given by AE,
+
+
[% highlight_inline("atmark AE") %]
+
tests if c is at position AE (t or f signal).
+ It's equivalent to: [% highlight_inline("$(cursor == AE)") %]
+
+
+
+In the case of [% highlight_inline("tomark AE") %], a similar fail condition occurs as with [% highlight_inline("hop AE") %].
+If c is already beyond AE, or if position l is before position AE, the
+signal is f.
+
+
+
+In the stemming algorithms, certain regions of the word are defined by
+setting marks, and later the failure condition of [% highlight_inline("tomark") %] is used to see if
+c is inside a particular region.
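+For example, the published stemmers typically mark the start of region R1
+along these lines (a sketch based on the English stemmer; it assumes a
+grouping v of vowels and a routines declaration of mark_regions):
+
+
+[% highlight("
+    integers ( p1 )
+
+    define mark_regions as (
+        $p1 = limit
+        do ( gopast v  gopast non-v  setmark p1 )
+    )
+") %]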
+
+
+
+Two other commands put c at l, and test if c is at l,
+
+
+
+
[% highlight_inline("tolimit") %]
+
moves c forward to l (signal t always),
+
+
[% highlight_inline("atlimit") %]
+
tests if c is at l (t or f signal).
+
+
+
e) Changing l
+
+
+In this account of string commands we see c moving right towards l, while
+l stays fixed at the end. In fact l can be reset to a new position between
+c and its old position, to act as a shorter barrier for the movement of c.
+
+
+
+
setlimit C1 for C2
+
C1 is obeyed, and if it gives f the signal from setlimit
+ is f with no further action.
+
+
+
+ Otherwise, the final value of c becomes the new
+ position of l. c is then set back to its old value before C1 was
+ obeyed, and C2 is obeyed. Finally l is set back to its old position,
+ and the signal of C2 becomes the signal of setlimit.
+
+
+
+ So the signal is f if either C1 or C2 gives f, otherwise t.
+ For example,
+
+
+[% highlight('
+ $x ( setlimit goto' _ " 's' // 'animadver}sion' new l as marked '}'
+ for // below, '|' marks c after each goto
+ ( goto 'a' and // '|animadver}sion'
+ goto 'e' and // 'animadv|er}sion'
+ goto 'i' and // 'an|imadver}sion'
+ )
+ )
+") %]
+
+
+ This checks that x has characters ‘a’, ‘e’ and ‘i’ before the first
+ ‘s’.
+
+
+
+
+
f) Backward processing
+
+
+String commands have been described with c to the left of l and moving
+right. But the process can be reversed.
+
+
+
+
[% highlight_inline("backwards C") %]
+
c and l are swapped over, and c moves left towards l. C is obeyed, the
+ signal given by C becomes the signal of backwards C, and c and l are
+ swapped back to their old values (except that l may have been adjusted
+ because of deletions and insertions). C cannot contain another
+ [% highlight_inline("backwards") %] command.
+
+
[% highlight_inline("reverse C") %]
+
A similar idea, but here c simply moves left instead of moving right,
+ with the beginning of the string as the limit, l. C can contain other
+ [% highlight_inline("reverse") %] commands, but it cannot contain commands to do deletions or
+ insertions — it must be used for testing only. (Without this
+ restriction Snowball's semantics would become very untidy.)
+
+
+
+Forward and backward processing are entirely symmetric, except that forward
+processing is the default direction, and literal strings are always
+written out forwards, even when they are being tested backwards.
+
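+For instance (an illustrative sketch, not the example originally given
+here):
+
+
+[% highlight("
+    /* assume x holds 'animadversion' */
+    $x backwards 'sion'  // gives t: the test runs at the right-hand end,
+                         // but the string is written 'sion', not 'nois'
+") %]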
+If a routine is defined for backwards mode processing, it must be included
+inside a backwardmode(...) declaration.
+
+
+
g) substring and among
+
+
+The use of [% highlight_inline("substring") %] and [% highlight_inline("among") %] is central to the implementation of the
+stemming algorithms. It is like a case switch on strings. In its simpler
+form,
+
+
+
+ substring among('S1' 'S2' 'S3' ...)
+
+
+
+searches for the longest matching substring 'S1' or 'S2' or 'S3' ... from
+position c. (The 'Si' must all be different.) So this has the same
+semantics as
+
+
+
+ ('S1' or 'S2' or 'S3' ...)
+
+
+
+— so long as the 'Si' are written out in decreasing order of length.
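+For instance (an illustrative sketch),
+
+
+[% highlight("
+    /* assume x holds 'animadversion', with c at the start */
+    $x among('a' 'an' 'anim')  // matches 'anim', the longest candidate
+") %]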
+
+
+
+substring may be omitted, in which case it is attached to its following
+among, so
+
+
+[% highlight("
+ among(/*...*/)
+") %]
+
+
+without a preceding [% highlight_inline("substring") %] is equivalent to the same among with [% highlight_inline("substring") %] placed immediately before it.
+
+[% highlight_inline("substring") %] may also be detached from its [% highlight_inline("among") %], although it must
+precede it textually in the same routine in which the [% highlight_inline("among") %] appears.
+The more general form of [% highlight_inline("substring /* ... */ among") %] is,
+
+Obeying substring searches for a longest match among the 'Sij'. The
+signal from substring is t if a match is found, otherwise f.
+Any commands C between the substring and among will be run after this
+search and only if the search finds a match (it would be equivalent to remove C and replace each
+Ci with C Ci). When the
+among comes to be obeyed, the Ci corresponding to the matched 'Sij' is
+obeyed, and its signal becomes the signal of the among command.
+
+
+
+substring/among pairs must match up textually inside each routine
+definition. But there is no problem with an among containing other
+substring/among pairs, and substring is optional before among anyway.
+The essential constraint is that two substrings must be separated by an
+among, and each substring must be followed by an among.
+
+
+
+The effect of obeying among when the preceding substring is not obeyed
+is undefined. This would happen for example here,
+
+
+[% highlight('
+ try($x != 617 substring)' _ "
+ among(...) // 'substring' is bypassed in the exceptional case where x == 617
+") %]
+
+
+The significance of separating the substring from the among is to allow
+them to work in different contexts. For example,
+
+Here the test for the longest 'Sij' is constrained to the region between c
+and the mark point given by integer L. But the commands Ci operate outside
+this limit. Another example is
+
+— so we can imagine each 'Sij' having its associated routine
+Rij. Then obeying the among causes a search for the longest
+'Sij' whose corresponding routine
+Rij gives t.
+
+
+
+The routines Rij should be written without any
+side-effects, other than the inevitable cursor movement. (c is in
+any case set back to its old value following a call of
+Rij.)
+
+
+
8 Booleans
+
+
+[% highlight_inline("set B") %] and [% highlight_inline("unset B") %] set B to true and false respectively, where B is a
+boolean name. [% highlight_inline("B") %] as a command gives a signal t if it is set true, f
+otherwise. For example,
+
+
+[% highlight("
+ booleans ( Y_found ) // declare the boolean
+
+ /* ... */
+
+ unset Y_found // unset it
+ do ( ['y'] <-'Y' set Y_found )
+ /* if c:l begins 'y' replace it by 'Y' and set Y_found */
+
+ do repeat(goto (v ['y']) <-'Y' set Y_found)
+ /* repeatedly move down the string looking for v 'y' and
+ replacing 'y' with 'Y'. Whenever the replacement takes
+ place set Y_found. v is a test for a vowel, defined as
+ a grouping (see below). */
+
+
+ /* Y_found means there are some letters Y in the string.
+ Later we can use this to trigger a conversion back to
+ lower case y. */
+
+ /* ... */
+
+ do (Y_found repeat(goto (['Y']) <- 'y'))
+") %]
+
+
9 Groupings
+
+
+A grouping brings characters together and enables them to be looked for
+with a single test.
+
+
+
+If G is declared as a grouping, it can be defined by
+
+
+
+ define G G1 op G2 op G3 ...
+
+
+
+where op is + or -, and G1, G2, G3 are literal strings, or groupings that
+have already been defined. (There can be zero or more of these additional
+op components). For example,
+
+Once G is defined, it can be used as a command, and is equivalent to a test
+
+
+
+ 'ch1' or 'ch2' or ...
+
+
+
+where ch1, ch2 ... list all the characters in the grouping.
+
+
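Semantically a grouping is just a set of characters built with union (+) and difference (-). A Python sketch of this (the grouping names v, v_WXY and non_v here are illustrative):

```python
# Groupings as character sets: '+' is set union, '-' is set difference.
v = set("aeiouy")                  # like: define v 'aeiouy'
v_WXY = v | set("WXY")             # like: define v_WXY v + 'WXY'
non_v = set("abcdefghijklmnopqrstuvwxyz") - v   # ASCII letters minus vowels

# Testing a character against a grouping is a membership test,
# equivalent to 'ch1' or 'ch2' or ... over its characters.
print('y' in v)       # True
print('b' in non_v)   # True
```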
+
+[% highlight_inline("non G") %] is the converse test, and matches any character except the
+characters of G. Note that [% highlight_inline("non G") %] is not the same as [% highlight_inline("not G") %], in fact
+
+
+
+[% highlight_inline("non G") %] is equivalent to [% highlight_inline("(not G next)") %]
+
+
+
+[% highlight_inline("non") %] may optionally be followed by a hyphen, for example:
+
+Bear in mind that [% highlight_inline("non-vowel") %] doesn't only match a
+consonant - it'll match any character which isn't in the vowel
+grouping. Failing to consider this has led to bugs in stemming algorithms -
+for example, here we intended to undouble a consonant:
+
+The problem with this code is that it will also mangle numbers with repeated digits,
+for example 1900 would become 190. A good rule of
+thumb here seems to be to use an inclusive grouping check if the code goes on
+to delete the character matched:
+
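The pitfall can be reproduced outside Snowball. In this Python sketch (the groupings and helper names are illustrative, not taken from any actual stemmer), the "not a vowel" check mangles 1900, while an inclusive consonant grouping leaves it alone:

```python
vowels = set("aeiouy")
consonants = set("bcdfghjklmnpqrstvwxz")

def undouble_nonvowel(word):
    # Buggy: 'not a vowel' also matches digits, so codes like '1900' lose a 0.
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in vowels:
        return word[:-1]
    return word

def undouble_consonant(word):
    # Fixed: an inclusive grouping names exactly the characters we mean.
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] in consonants:
        return word[:-1]
    return word

print(undouble_nonvowel("1900"))   # '190'  -- mangled
print(undouble_consonant("1900"))  # '1900' -- left alone
print(undouble_consonant("fall"))  # 'fal'
```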
+A complete program consists of a sequence of declarations followed by a
+sequence of definitions of groupings and routines. Routines which are
+implicitly defined as operating on c:l from right to left must be included
+in a backwardmode(...) declaration.
+
+
+
+A Snowball program is called up via a simple
+API
+through its defined externals. For example,
+
+The API also allows a current string to be defined, and this becomes the
+c:l string for the external routine to work on. Its final value is the
+result handed back through the API.
+
+
+
+The strings, integers and booleans are accessible from any point in the
+program, and exist throughout the running of the Snowball program. They are
+therefore like static declarations in C.
+
+
+
11 Comments, and other whitespace fillers
+
+
+At a deeper level, a program is a sequence of tokens, interspersed with
+whitespace. Names, reserved words, literal numbers and strings are all
+tokens. Various symbols, made up of non-alphanumerics, are also tokens.
+
+
+
+A name, reserved word or number is terminated by the first character that
+cannot form part of it. A symbol is recognised as the longest sequence of
+characters that forms a valid symbol. So +=- is two symbols, += and
+-, because += is a valid symbol in the language while +=- is not.
+Whitespace separates tokens but is otherwise ignored. This of course is
+like C.
+
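The longest-sequence rule for symbols can be sketched as a greedy lexer. The symbol list below is a small illustrative subset, not Snowball's complete set:

```python
# Greedy longest-match tokenisation of symbols: '+=-' splits as '+=' then
# '-', because '+=' is a valid symbol while '+=-' is not.
SYMBOLS = {"+", "-", "+=", "-=", "*=", "/=", "==", "!=",
           "<", "<=", ">", ">=", "$", "(", ")"}

def lex_symbols(text):
    tokens, i = [], 0
    while i < len(text):
        # Take the longest prefix starting at i that is a valid symbol.
        for n in range(len(text) - i, 0, -1):
            if text[i:i + n] in SYMBOLS:
                tokens.append(text[i:i + n])
                i += n
                break
        else:
            raise ValueError("not a symbol: " + text[i])
    return tokens

print(lex_symbols("+=-"))   # ['+=', '-']
```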
+
+
+Occasionally a newer version of Snowball may add a new token. So as not to
+break existing programs, any such token declared as a name (via
+[% highlight_inline('integers') %], [% highlight_inline('routines') %], etc)
+will lose its token status for the rest of the program. This applies
+to the tokens
+[% highlight_inline('len') %]
+and
+[% highlight_inline('lenof') %].
+
+
+
+Anywhere that whitespace can occur, there may also occur:
+
+
+
+(a) Comments, in the usual multi-line [% highlight_inline('/* .... */') %] or single line
+[% highlight_inline('// ...') %] format.
+
+
+
+(b) Get directives. These are like #include commands in C, and have the form
+[% highlight_inline("get 'S'") %], where 'S' is a literal string. For example,
+
+
+[% highlight("
+ get '/home/martin/snowball/main-hdr' // include the file contents
+") %]
+
+
+(c) [% highlight_inline("stringescapes XY") %] where X and Y are any two printing characters.
+
+
+
+(d) [% highlight_inline("stringdef m 'S'") %] where m is a sequence of characters not including
+whitespace and terminated with whitespace, and 'S' is a literal string.
+
+
+
12 Character representation
+
+
+In this description of Snowball, it is assumed that strings are composed of
+characters, and that characters can be defined numerically, but the numeric range
+of these characters is not defined. As implemented, three different schemes
+are supported. Characters can either be (a) bytes in the range 0 to 255,
+as in traditional C strings, or (b) byte pairs in the range 0 to 65535,
+as in Java strings, or (c) UTF-8 encoded byte sequences representing
+characters in the range 0 to 65535, so that a character may occupy 1, 2 or 3 bytes.
+
+
+
+For case (c), we need to make a slight separation of the concept of
+characters into symbols, the units of text being represented, and
+slots, the units of space into which they map. (So in case (a), all
+slots are one byte; in case (b) all slots are two bytes.)
+c and l have numeric values that can be used in AEs (arithmetic
+expressions). These values count the number of slots. Similarly
+setmark, tomark and atmark are remembering and then using slot
+counts. size and sizeof measure string size
+in slots, not symbols. However, hop N moves c over N symbols,
+not N slots, and next is equivalent to hop 1.
+
+
+
+Snowball 2.0 adds len and lenof, which measure string length in symbols
+(so they're the same as size and sizeof in cases (a) and (b), but
+different in case (c)).
+
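The slot/symbol distinction in case (c) can be illustrated with Python's own separation of characters and UTF-8 bytes (Python here is just a convenient model, not how the Snowball runtimes are implemented):

```python
s = "año"                    # 3 symbols; 'ñ' needs 2 bytes in UTF-8
encoded = s.encode("utf-8")  # the slot representation in case (c)

print(len(s))        # 3 -- symbols, what len and lenof count
print(len(encoded))  # 4 -- slots, what size, sizeof, c and l count
```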
+
+
+So long as these simple distinctions are recognised, the same Snowball
+script can be compiled to work with any of the three encoding schemes.
+
+
+
13 Legacy Features
+
+
13.1 hex and decimal
+
+
+This section documents features of Snowball for which there's a strongly
+preferred alternative. They're still supported for compatibility with
+existing code which uses them, but you shouldn't use them in new code.
+We document them here so that their meaning in existing code can be
+understood, and especially to aid updating to the preferred alternatives.
+
+
+
+In a stringdef, the string may be preceded by the word hex,
+or the word decimal. This was how non-ASCII characters
+were specified before support for specifying Unicode codepoints using the
+U+ notation was added.
+
+
+
+hex and decimal mean that the contents of the string
+are interpreted as character values written out in hexadecimal, or decimal,
+notation. The characters should be separated by spaces. For example,
+
+
+[% highlight("
+ hex 'DA' /* is character hex DA */
+ hex 'D A' /* is the two characters, hex D and A (carriage
+ return, and line feed) */
+ decimal '10' /* character 10 (line feed) */
+ decimal '13 10' /* characters 13 and 10 (carriage return, and
+ line feed) */
+") %]
+
+
+The following forms are equivalent,
+
+
+[% highlight("
+ hex 'd a' /* lower case also allowed */
+ hex '0D 000A' /* leading zeroes ignored */
+ hex ' D A ' /* extra spacing is harmless */
+") %]
+
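The interpretation can be sketched as follows (a hypothetical helper, not part of the Snowball compiler): split the string on whitespace and read each token as a number in the given base:

```python
def decode(values, base):
    # Each whitespace-separated token is one character value; leading
    # zeroes and extra spacing are harmless.
    return "".join(chr(int(tok, base)) for tok in values.split())

print(decode("D A", 16) == "\r\n")      # hex 'D A': CR, LF
print(decode("0D 000A", 16) == "\r\n")  # leading zeroes ignored
print(decode("13 10", 10) == "\r\n")    # decimal form
```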
+
+The interpretation of the values is as Unicode codepoints if command
+line option -utf8 or -widechars is specified, and as
+character values in an unspecified single byte character set otherwise. For
+ASCII and ISO-8859-1 the character values match Unicode codepoints, but to
+handle other single byte character sets (e.g. ISO-8859-2 or KOI8-R) you needed
+a special version of the Snowball source with different character values
+specified via stringdef. The U+ notation allows
+you to use a single Snowball source in this situation.
+
+
+
13.2 among starter command
+
+
+The among command supports a "starter" command, C
+in this example:
+
+The alternative requires an explicit substring but seems clearer, so
+we recommend using it in new code and have designated the use of a starter as
+a legacy feature.
+
+
+
+A starter is also allowed with an explicit substring, for example:
+
+In the grammar which follows, || is used for alternatives,
+ [X] means that X is
+optional, and [X]* means that X is repeated zero or more
+times. Meta-symbols are defined on the left. <char> means any
+character.
+
+
+
+The definition of literal string does not allow for the escaping
+conventions established by the stringescapes directive. The command
+? is a debugging aid.
+
+
+
+<letter> ::= a || b || ... || z || A || B || ... || Z
+<digit> ::= 0 || 1 || ... || 9
+<name> ::= <letter> [ <letter> || <digit> || _ ]*
+<s_name> ::= <name>
+<i_name> ::= <name>
+<b_name> ::= <name>
+<r_name> ::= <name>
+<g_name> ::= <name>
+<literal string>::= '[<char>]*'
+<number> ::= <digit> [ <digit> ]*
+
+S ::= <s_name> || <literal string>
+G ::= <g_name> || <literal string>
+
+<declaration> ::= strings ( [<s_name>]* ) ||
+ integers ( [<i_name>]* ) ||
+ booleans ( [<b_name>]* ) ||
+ routines ( [<r_name>]* ) ||
+ externals ( [<r_name>]* ) ||
+ groupings ( [<g_name>]* )
+
+<r_definition> ::= define <r_name> as C
+<plus_or_minus> ::= + || -
+<g_definition> ::= define <g_name> G [ <plus_or_minus> G ]*
+
+AE ::= (AE) ||
+ AE + AE || AE - AE || AE * AE || AE / AE || - AE ||
+ maxint || minint || cursor || limit ||
+ size || sizeof S ||
+ len || lenof S ||
+ <i_name> || <number>
+
+<i_assign> ::= $ <i_name> = AE ||
+ $ <i_name> += AE || $ <i_name> -= AE ||
+ $ <i_name> *= AE || $ <i_name> /= AE
+
+<i_test_op> ::= == || != || > || >= || < || <=
+
+<i_test> ::= $ ( AE <i_test_op> AE ) ||
+ $ <i_name> <i_test_op> AE
+
+<s_command> ::= $ <s_name> C
+
+C ::= ( [C]* ) ||
+ <i_assign> || <i_test> || <s_command> || C or C || C and C ||
+ not C || test C || try C || do C || fail C ||
+ goto C || gopast C || repeat C || loop AE C ||
+ atleast AE C || S || = S || insert S || attach S ||
+ <- S || delete || hop AE || next ||
+ => <s_name> || [ || ] || -> <s_name> ||
+ setmark <i_name> || tomark AE || atmark AE ||
+ tolimit || atlimit || setlimit C for C ||
+ backwards C || reverse C || substring ||
+ among ( [<literal string> [<r_name>] || (C)]* ) ||
+ set <b_name> || unset <b_name> || <b_name> ||
+ <r_name> || <g_name> || non [-] <g_name> ||
+ true || false || ?
+
+P ::= [P]* || <declaration> ||
+ <r_definition> || <g_definition> ||
+ backwardmode ( P )
+
+<program> ::= P
+
+
+
+synonyms: <+ for insert
+
+Snowball, and most of the current stemming algorithms were written by
+Dr Martin Porter, who also prepared the material for the Website.
+The Snowball to Java code generator, and supporting Java libraries, were
+contributed by Richard Boulton.
+Dr Andrew MacFarlane, of City University, London, gave much
+initial encouragement and proofreading
+assistance.
+
+
+
+Richard Boulton established the original Snowball website, from which this
+website has evolved.
+
+
+
+Linguistic assistance for Russian, German and Dutch has been provided by
+
+Patrick Miles
+(of the
+Patrick Miles Translation Agency, Cambridge, UK). Pat is a distinguished
+translator, whose English versions of Chekhov have appeared on the London
+stage.
+
+
+
+Various emailers have helped improve the stemmers with their many suggestions
+and comments. We must especially mention Andrei Aksyonoff and
+Oleg Bartunov (Russian), Steve Tolkin and Wendy Reetz (English), and Fred Brault (French).
+Blake Madden found a number of elusive errors in the stemmer descriptions.
+
+
+
+Anna Tordai has provided the Hungarian stemming algorithm.
+
+
+
+Evren (Kapusuz) Cilden has provided the Turkish stemming algorithm.
+
+
+
+Olly Betts has made a significant performance improvement to the C
+code generator.
+
+
+
+The Snowball mailing lists are hosted for us free by James Aylett, who
+owns and runs the machine that hosts the
+tartarus website.
+
+
+
+We received two Romanian stemming algorithms in 2006, from Erwin
+Glockner, Doina Gliga and Marina Stegarescu, working at Heidelberg,
+and from Irina Tirdea in Bucharest. After some experimentation,
+the Snowball Romanian stemmer has been rewritten from scratch, but the
+basic list of verb endings with their separation into two groups with
+different removal criteria is taken from Irina Tirdea's stemmer.
+
Several tarballs of the Snowball sources are available:
+
+
+
+The C version of the libstemmer library.
+This contains all you need to include the snowball stemming algorithms into a
+C project of your own. If you download this, you don't need to use the snowball
+compiler, or worry about the internals of the stemmers in any way.
+
+
+The C# version of the libstemmer library.
+This contains all you need to include the snowball stemming algorithms into a
+C# project of your own. If you download this, you don't need to use the snowball
+compiler, or worry about the internals of the stemmers in any way.
+
+
+The Java version of the libstemmer library.
+This contains all you need to include the snowball stemming algorithms into a
+Java project of your own. If you download this, you don't need to use the snowball
+compiler, or worry about the internals of the stemmers in any way.
+
+
+Snowball, algorithms, and libstemmer library.
+This contains all the source code for snowball (but not the generated source files).
+This is useful mainly if you are wanting to work on the algorithms (tweaking them,
+or producing new algorithms).
+
+
+
+
+We do not make binary (i.e. compiled) distributions of snowball available -
+there are simply too many different platforms, architectures and languages to
+support. If you are willing to make such binaries available for others, and
+can provide at least some measure of support for ensuring that they work, feel
+free to contact us and we will add a link to your work from this site.
+
+
+
Python
+
+
+We provide and support python wrappers for Snowball. The latest code can
+be downloaded from the PyStemmer repo.
+
+
+
Git
+
+
+Developers may wish to access the latest source using the command:
+
+Snowball is a small string processing language for creating
+stemming algorithms for use in Information Retrieval, plus a collection of
+stemming algorithms implemented using it.
+
+
+
+It was originally designed
+and built by Martin
+Porter. Martin retired from development in 2014 and Snowball is now
+maintained as a community project. Martin originally chose the name Snowball as
+a tribute to SNOBOL, the
+excellent string handling language from the 1960s. It now also serves as a
+metaphor for how the project grows by gathering contributions over time.
+
+
+
+The Snowball compiler translates a Snowball program into source code in another
+language - currently Ada, ISO C, C#, Go, Java, Javascript, Object Pascal,
+Python and Rust are supported.
+
+
+
What is Stemming?
+
+
+Stemming maps different forms of the same word to a common "stem" - for
+example, the English stemmer maps connection, connections,
+connective, connected, and connecting to connect.
+So a search for connected would also find documents which only
+have the other forms.
+
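As a toy illustration only (nothing like the real English stemmer, which uses regions and many more rules), conflation by suffix stripping can look like this:

```python
SUFFIXES = ["ions", "ion", "ive", "ing", "ed"]  # longest first

def toy_stem(word):
    # Remove the first (hence longest) suffix that matches.
    for suf in SUFFIXES:
        if word.endswith(suf):
            return word[:-len(suf)]
    return word

forms = ["connection", "connections", "connective", "connected", "connecting"]
print({toy_stem(w) for w in forms})   # all five conflate to {'connect'}
```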
+
+
+This stem form is often a word itself, but not always, since being a word
+is not a requirement for text search systems, which are the intended
+field of use. We also aim to conflate words with the same meaning, rather
+than all words with a common linguistic root (so awe and awful
+don't have the same stem), and over-stemming is more problematic than
+under-stemming so we tend not to stem in cases that are hard to resolve. If
+you want to always reduce words to a root form and/or get a root form which is
+itself a word then Snowball's stemming algorithms likely aren't the right
+answer.
+
+Any such mail sent directly to individual developers may be answered less
+speedily, and in any case they reserve the right to post their answers on snowball-discuss.
+
+
+
Major events
+
+
+
+
+ Sep 2023 - Estonian stemming algorithm contributed by Linda Freienthal.
+
+ Jan 2007 - Turkish stemmer. Contributed by Evren (Kapusuz) Cilden.
+
+
+ Sep 2006 - Hungarian stemmer. Contributed by Anna Tordai.
+
+
+ Jun 2006 - Supported and updated Python bindings.
+
+
+ May 2005 - UTF-8 Unicode support.
+
+
+ Sep 2002 - Finnish stemmer.
+
+
+ Jul 2002 - ISO Latin I as default
+ The use of MS DOS Latin I is now history, but the old versions of the
+ Snowball stemmers are still accessible on the site.
+
+
+ May 2002 - Unicode support
+
+
+ Feb 2002 - Java support
+ Richard has modified the snowball code generator to produce Java output as
+ well as ANSI C output. This means that pure Java systems can now use the
+ snowball stemmers.
+
+Except where explicitly noted, all the software given out on this Snowball site
+is covered by the 3-clause BSD License:
+
+
+
+Copyright (c) 2001, Dr Martin Porter,
+Copyright (c) 2002, Richard Boulton.
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
+
+1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
+
+3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+
+
+Essentially, all this means is that you can do what you like with the code,
+except claim another Copyright for it, or claim that it is issued under a different
+license. The software is also issued without warranties, which means that if anyone
+suffers through its use, they cannot come back and sue you.
+You also have to alert anyone to whom you give the Snowball
+software to the fact that it is covered by the BSD license.
+
+There's one active mailing list related to Snowball:
+
+
+
+
Snowball-discuss is a list for general discussion of anything related to Snowball.
+Release announcements will also be posted to this list.
+
+Subscribe |
+Archives
+
+
+
+Note that this mailing list will reject postings from non-subscribers (due to
+the immense amount of spam received otherwise). The list is fairly
+low-traffic, but if you don't wish to receive messages (but wish to be able to
+post), you can disable sending of messages in the mailing list options after
+subscribing.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/otherapps/romanian/index.tt b/otherapps/romanian/index.tt
new file mode 100644
index 0000000..95e66ea
--- /dev/null
+++ b/otherapps/romanian/index.tt
@@ -0,0 +1,180 @@
+[% header('Two Romanian stemmers') %]
+
+
+In swift succession, we received in 2006 two stemmers for Romanian
+written in Snowball.
+Here is the original correspondence,
+
+
+
+From: Erwin Glockner <eglockne@ix.urz.uni-heidelberg.de>
+To: snowball-discuss
+Date: Wed, 07 Jun 2006 00:06:30 +0200
+Subject: [Snowball-discuss] romanian stemmer
+
+Hello everyone,
+
+my name is Erwn Glockner, I'm a student of computational linguistics in
+Heidelberg, Germany. Together with my fellow students Doina Gliga and
+Marina Stegarescu we started to write a romanian stemmer in Snowball.
+We planned to finish the stemmer until end of this month. We would be
+happy if the stemmer would be accepted as part of the Snowball-distribution.
+There is still some work to do, e.g. evaluating the stemmer, making a
+stopwords-list, unicode support, etc. After finishing this we will send
+you our stemmer with the corresponding files, but I couldn't find any
+email address to whom the stemmer should be sent to.
+Could please someone tell me the address(es)?
+
+With kind regards,
+E. Glockner, D. Gliga, M. Stegarescu.
+
+
+
+From: Erwin Glockner <eglockne@ix.urz.uni-heidelberg.de>
+To: richard@lemurconsulting.com,
+ martin.porter@grapeshot.co.uk
+Date: Tue Jul 18 19:43:39 2006
+Subject: romanian stemmer
+
+Dear Mr. Porter, dear Mr. Boulton,
+
+we finally finished the Romanian stemmer. Unfortunately evaluation took
+more time than expected.
+However, it was an interesting experience creating the stemmer, and we
+are happy to send you the result of our work.
+The attachment-file is a Tarball-zipped file with (hopefully) all files
+needed. The files and the stemmer as well are encoded in UTF-8. Please
+inform us if something is missing.
+
+We would be happy if the Romanian stemmer would be accepted and
+integrated into the official Snowball distribution. We agree of course
+to license the stemmer under the same terms as the existing snowball
+software.
+
+We're looking forward to hear from you soon.
+
+
+With kind regards,
+
+Marina, Doina and Erwin.
+
+Attachment: [romanian1.tgz]
+
+
+
+From: Irina Tirdea <irina.tirdea@gmail.com>
+To: richard@lemurconsulting.com,
+ martin.porter@grapeshot.co.uk
+Date: Mon Jul 31 10:19:51 2006
+Subject: Romanian stemmer
+
+Hello,
+
+My name is Irina Tirdea and I have developed a Romanian stemmer in Snowball
+as part of my bachelor thesis, in Bucharest, Romania. I am sending you the
+code attached (with vocabulary and stop word list files) and I hope you will
+accept and integrate it as a part of the Snowball project. I am ready to
+release the stemmer under the BSD license, just as the Snowball software.
+The files have been written in UTF-8 encoding (on a Linux system).
+
+Looking forward to hear from you.
+
+Kind regards,
+Irina Tirdea
+
+Attachment: [romanian2.tgz]
+
+
+
+From: martin.porter@grapeshot.co.uk (Martin Porter)
+To: snowball-discuss
+Cc: atordai@science.uva.nl,
+ eglockne@ix.urz.uni-heidelberg.de,
+ irina.tirdea@gmail.com
+Date: Mon Jul 31 10:43:05 BST 2006
+Subject: Tardy response to submissions to Snowball
+
+I am sending this general email as a kind of apology, for having done nothing
+so far on the following generously sent Snowball submissions:
+
+7 June, from E. Glockner: a Romanian stemmer
+8 June, from A. Tordai: a Hungarian stemmer
+
+and this morning another Romanian stemmer arrived,
+
+31 July, from I. Tirdea, a Romanian stemmer
+
+After the first submission I promised to look at it "next week", so Mr Glockner
+has probably been wondering what has happened. [. . .] I will make a point of
+looking at these submissions this week,
+
+More soon,
+
+Martin
+
+
+
+From: martin.porter@grapeshot.co.uk (Martin Porter)
+To: snowball-discuss
+Cc: irina.tirdea@gmail.com,
+ eglockne@ix.urz.uni-heidelberg.de,
+ mstegare@hotmail.com,
+ doina_gliga@yahoo.co.uk,
+ eglockner@hotmail.com
+Date: Wed Sep 06 12:39:16 BST 2006
+Subject: Romanian stemmer
+
+To the originators of the Romanian stemmers,
+
+I have now found time to do some preliminary work on the Romanian stemmer. I
+should explain that part of the complication has been the receipt, no more
+than ten days apart, of two Romanian stemmers in Snowball, the first
+(romanian1) from [Glockner, Gliga, and Stegarescu] in Heidelberg, the second
+(romanian2) from Tirdea in Bucharest.
+
+[. . . .]
+
+I have put together a vocabulary by combining the vocabularies provided with
+romanian1 and romanian2. This appears in column 1. Column 2 is the stemmed
+form produced by romanian1, and column 3 the stemmed form produced by
+romanian2. If the entry in column 3 is blank, both stemmers are producing the
+same result.
+
+You might care to compare the two approaches.
+
+My own feeling is that romanian1 does a more thorough job of ending removal,
+but unlike romanian2 has a habit of discarding too much from short words.
+aberant->ab, abatere->ab, aburi->ab are examples of this. In romanian1 the R2
+test is rarely used (it seems to me that 'R1 or R2' is equivalent to 'R1',
+since p2 is never to the left of p1.)
+
+I might have a go at making some modifications here. Needless to say, I am
+not familiar with Romanian, but the similarity to the other Romance
+languages, especially Italian, enables one to grasp the essential features of
+the morphology.
+
+What we would like to do is to have a single stemmer for release from the
+snowball site, if that is possible, and giving all necessary credits, along
+the lines of the recent addition,
+
+http://snowballstem.org/algorithms/hungarian/stemmer.html
+
+Hope to hear from you,
+
+Martin Porter
+
+
+
+Finally we decided to produce our own Romanian stemmer as described on the
+Romanian stemmer page. The submitted stemmers both contain stop word lists,
+available inside the tarballs.
+
+
+[% footer %]
diff --git a/otherapps/romanian/romanian1.tgz b/otherapps/romanian/romanian1.tgz
new file mode 100755
index 0000000..136dbc5
Binary files /dev/null and b/otherapps/romanian/romanian1.tgz differ
diff --git a/otherapps/romanian/romanian2.tgz b/otherapps/romanian/romanian2.tgz
new file mode 100755
index 0000000..c96661e
Binary files /dev/null and b/otherapps/romanian/romanian2.tgz differ
diff --git a/otherapps/schinke/index.html b/otherapps/schinke/index.html
new file mode 100644
index 0000000..77703ae
--- /dev/null
+++ b/otherapps/schinke/index.html
@@ -0,0 +1,308 @@
+
+
+
+
+
+
+
+
+
+ The Schinke Latin stemming algorithm - Snowball
+
+
+
+
+
+
+
+
+
+
+
+The Schinke Latin stemming algorithm is described in,
+
+
+
+ Schinke R, Greengrass M, Robertson AM and Willett P (1996) A stemming algorithm for Latin text
+ databases. Journal of Documentation, 52: 172-187.
+
+
+
+It has the feature that it stems each word to two forms, noun and verb. For example,
+
+
+
+ NOUN VERB
+ ---- ----
+ aquila aquil aquila
+ portat portat porta
+ portis port por
+
+
+
+Here (slightly reformatted) are the rules of the stemmer,
+
+
+
+1. (start)
+
+2. Convert all occurrences of the letters 'j' or 'v' to 'i' or 'u',
+ respectively.
+
+3. If the word ends in '-que' then
+ if the word is on the list shown in Figure 4, then
+ write the original word to both the noun-based and verb-based
+ stem dictionaries and go to 8.
+ else remove '-que'
+
+ [Figure 4 was
+
+ atque quoque neque itaque absque apsque abusque adaeque adusque denique
+ deque susque oblique peraeque plenisque quandoque quisque quaeque
+ cuiusque cuique quemque quamque quaque quique quorumque quarumque
+ quibusque quosque quasque quotusquisque quousque ubique undique usque
+ uterque utique utroque utribique torque coque concoque contorque
+ detorque decoque excoque extorque obtorque optorque retorque recoque
+ attorque incoque intorque praetorque]
+
+4. Match the end of the word against the suffix list shown in Figure 6(a),
+   removing the longest matching suffix (if any).
+
+ [Figure 6(a) was
+
+ -ibus -ius -ae -am -as -em -es -ia
+ -is -nt -os -ud -um -us -a -e
+ -i -o -u]
+
+5. If the resulting stem contains at least two characters then write this stem
+ to the noun-based stem dictionary.
+
+6. Match the end of the word against the suffix list shown in Figure 6(b),
+   identifying the longest matching suffix (if any).
+
+ [Figure 6(b) was
+
+ -iuntur -beris -erunt -untur -iunt -mini -ntur -stis
+ -bor -ero -mur -mus -ris -sti -tis -tur
+ -unt -bo -ns -nt -ri -m -r -s
+ -t]
+
+ If any of the following suffixes are found then convert them as shown:
+
+ '-iuntur', '-erunt', '-untur', '-iunt', and '-unt', to '-i';
+ '-beris', '-bor', and '-bo' to '-bi';
+ '-ero' to '-eri'
+
+ else remove the suffix in the normal way.
+
+7. If the resulting stem contains at least two characters then write this stem
+ to the verb-based stem dictionary.
+
+8. (end)
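
The numbered rules above can be sketched directly in Python. This is an illustrative reading of the rules as quoted, not the Snowball implementation given below: the names (`schinke_stem` and the suffix tables) are mine, lower-casing of the input is an assumption, and the two-character tests in steps 5 and 7 are taken literally (a too-short stem yields `None`).

```python
QUE_WORDS = {
    "atque", "quoque", "neque", "itaque", "absque", "apsque", "abusque",
    "adaeque", "adusque", "denique", "deque", "susque", "oblique",
    "peraeque", "plenisque", "quandoque", "quisque", "quaeque",
    "cuiusque", "cuique", "quemque", "quamque", "quaque", "quique",
    "quorumque", "quarumque", "quibusque", "quosque", "quasque",
    "quotusquisque", "quousque", "ubique", "undique", "usque", "uterque",
    "utique", "utroque", "utribique", "torque", "coque", "concoque",
    "contorque", "detorque", "decoque", "excoque", "extorque", "obtorque",
    "optorque", "retorque", "recoque", "attorque", "incoque", "intorque",
    "praetorque",
}

# Figure 6(a): noun suffixes; Figure 6(b): verb suffixes.
NOUN_SUFFIXES = ["ibus", "ius", "ae", "am", "as", "em", "es", "ia",
                 "is", "nt", "os", "ud", "um", "us", "a", "e", "i", "o", "u"]
VERB_SUFFIXES = ["iuntur", "beris", "erunt", "untur", "iunt", "mini",
                 "ntur", "stis", "bor", "ero", "mur", "mus", "ris", "sti",
                 "tis", "tur", "unt", "bo", "ns", "nt", "ri",
                 "m", "r", "s", "t"]
# Step 6: some verb suffixes are converted rather than removed.
VERB_CONVERSIONS = {
    "iuntur": "i", "erunt": "i", "untur": "i", "iunt": "i", "unt": "i",
    "beris": "bi", "bor": "bi", "bo": "bi",
    "ero": "eri",
}

def schinke_stem(word):
    """Return (noun_stem, verb_stem); either may be None (rules 5 and 7)."""
    # Rule 2: convert j to i and v to u (lower-casing assumed).
    word = word.lower().replace("j", "i").replace("v", "u")
    # Rule 3: -que handling; listed words go straight to both dictionaries.
    if word.endswith("que"):
        if word in QUE_WORDS:
            return word, word
        word = word[:-3]
    # Rules 4-5: remove the longest matching noun suffix, keep if >= 2 chars.
    noun = word
    for suf in sorted(NOUN_SUFFIXES, key=len, reverse=True):
        if word.endswith(suf):
            noun = word[:-len(suf)]
            break
    noun = noun if len(noun) >= 2 else None
    # Rules 6-7: longest verb suffix, converting where rule 6 says to.
    verb = word
    for suf in sorted(VERB_SUFFIXES, key=len, reverse=True):
        if word.endswith(suf):
            verb = word[:-len(suf)] + VERB_CONVERSIONS.get(suf, "")
            break
    verb = verb if len(verb) >= 2 else None
    return noun, verb
```

Run against the example table above, this gives aquila → (aquil, aquila), portat → (portat, porta) and portis → (port, por).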
+
+
+
+Unfortunately I was not able to make the rules match the examples given, which
+led to the following email correspondence,
+
+
+
+From: Martin Porter
+To: Peter Willett
+Date: Mon Sep 10 15:11:51 2001
+Subject: Re: Stemming algorithms
+
+> ... I'm no longer working in the IR area,
+>spending all of my time on computational chemistry/drug discovery
+>research but I guess that Mark Sanderson would be interested in
+>Snowball - do you mind if I pass your email onto him?
+
+Peter,
+
+Well, actually, I do have a question, if you can cast your mind back. I've
+implemented the Latin Stemmer in Snowball (see below: you'll have to guess the
+semantics, but I'm sure you'll agree the syntax looks nice), and find that Fig
+5 of the 1996 Schinke paper doesn't correspond to the algorithm of fig 7, but to
+the algorithm with the extra rules concerning -ba-, -bi-, -sse- mentioned on
+page 182. Which is the "correct" algorithm - with or without those rules? If
+with, what is the exact criterion for their removal? A bigger problem is why
+the -nt is not removed from 'Apparebunt', given -nt as an ending in 6(a). Is
+-nt a misprint?
+
+Sorry to bother you with this, but the paper says you are the one "to whom all
+correspondence should be addressed" :-)
+
+Martin
+
+
+ Here is your algorithm in Snowball. The generated code will do about 1 million
+ Latin words in 5 seconds:
+
+ -------
+
+
+
+strings ( noun_form verb_form )
+
+routines (
+    map_letters
+    que_word
+)
+
+externals ( stem )
+
+define map_letters as (
+    do repeat ( goto (['j']) <- 'i' )
+    do repeat ( goto (['v']) <- 'u' )
+)
+
+backwardmode (
+
+    define que_word as (
+        ['que'] (
+            among (
+                'at' 'quo' 'ne' 'ita' 'abs' 'aps' 'abus' 'adae' 'adus'
+                'deni' 'de' 'sus' 'obli' 'perae' 'plenis' 'quando' 'quis'
+                'quae' 'cuius' 'cui' 'quem' 'quam' 'qua' 'qui'
+                'quorum' 'quarum' 'quibus' 'quos' 'quas' 'quotusquis'
+                'quous' 'ubi' 'undi' 'us' 'uter' 'uti' 'utro' 'utribi'
+                'tor' 'co' 'conco' 'contor' 'detor' 'deco' 'exco' 'extor'
+                'obtor' 'optor' 'retor' 'reco' 'attor' 'inco' 'intor'
+                'praetor'
+            ) atlimit ]
+            => noun_form
+            => verb_form
+        ) or fail ( delete )
+    )
+)
+
+define stem as (
+
+    map_letters
+
+    backwards (
+        que_word or (
+            => noun_form
+            => verb_form
+
+            $noun_form backwards try (
+                [substring] hop 2
+                among (
+                    'ibus' 'ius' 'ae' 'am' 'as' 'em' 'es' 'ia' 'is' 'nt'
+                    'os' 'ud' 'um' 'us' 'a' 'e' 'i' 'o' 'u'
+                        ( delete )
+                )
+            )
+
+            $verb_form backwards try (
+                [substring] hop 2
+                among (
+                    'iuntur' 'erunt' 'untur' 'iunt' 'unt'
+                        ( <- 'i' )
+                    'beris' 'bor' 'bo'
+                        ( <- 'bi' )
+                    'ero'
+                        ( <- 'eri' )
+                    'mini' 'ntur' 'stis' 'mur' 'mus' 'ris' 'sti' 'tis'
+                    'tur' 'ns' 'nt' 'ri' 'm' 'r' 's' 't'
+                        ( delete )
+                )
+            )
+        )
+    )
+
+    /* the stemmed words are left in noun-form and verb-form, and can
+       be picked up as C strings at z->S[0] and z->S[1] through the API. */
+)
+
+
+
+
+
+From: Peter Willett
+To: Martin Porter
+Date: Mon Sep 10 20:25:24 2001
+Subject: Re: Stemming algorithms
+
+Martin
+
+Sorry - I just cannot answer. Robertson has retired to Dorset while
+Schinke is now in - I think - Canada
+
+Peter
+
+
+
+Following this, I was unable to contact Schinke, and so the problems have
+remained unresolved.
+
+
+
+The linked zip file contains the stemmer,
+generated C version, and sample data.
+(The stemmer differs slightly from the version in the email above in that
+it assembles the noun- and verb-forms of the stem in a single string with
+space separation.)
+voc.txt is a sample vocabulary, and joined.txt the vocabulary
+joined with the two stemmed forms as three column output.
+
+The Schinke Latin stemming algorithm is described in,
+
+
+
+ Schinke R, Greengrass M, Robertson AM and Willett P (1996) A stemming algorithm for Latin text
+ databases. Journal of Documentation, 52: 172-187.
+
+
+
+It has the feature that it stems each word to two forms, noun and verb. For example,
+
+
+
+ NOUN VERB
+ ---- ----
+ aquila aquil aquila
+ portat portat porta
+ portis port por
+
+
+
+Here (slightly reformatted) are the rules of the stemmer,
+
+
+
+1. (start)
+
+2. Convert all occurrences of the letters 'j' or 'v' to 'i' or 'u',
+ respectively.
+
+3. If the word ends in '-que' then
+ if the word is on the list shown in Figure 4, then
+ write the original word to both the noun-based and verb-based
+ stem dictionaries and go to 8.
+ else remove '-que'
+
+ [Figure 4 was
+
+ atque quoque neque itaque absque apsque abusque adaeque adusque denique
+ deque susque oblique peraeque plenisque quandoque quisque quaeque
+ cuiusque cuique quemque quamque quaque quique quorumque quarumque
+ quibusque quosque quasque quotusquisque quousque ubique undique usque
+ uterque utique utroque utribique torque coque concoque contorque
+ detorque decoque excoque extorque obtorque optorque retorque recoque
+ attorque incoque intorque praetorque]
+
+4. Match the end of the word against the suffix list shown in Figure 6(a),
+   removing the longest matching suffix (if any).
+
+ [Figure 6(a) was
+
+ -ibus -ius -ae -am -as -em -es -ia
+ -is -nt -os -ud -um -us -a -e
+ -i -o -u]
+
+5. If the resulting stem contains at least two characters then write this stem
+ to the noun-based stem dictionary.
+
+6. Match the end of the word against the suffix list shown in Figure 6(b),
+   identifying the longest matching suffix (if any).
+
+ [Figure 6(b) was
+
+ -iuntur -beris -erunt -untur -iunt -mini -ntur -stis
+ -bor -ero -mur -mus -ris -sti -tis -tur
+ -unt -bo -ns -nt -ri -m -r -s
+ -t]
+
+ If any of the following suffixes are found then convert them as shown:
+
+ '-iuntur', '-erunt', '-untur', '-iunt', and '-unt', to '-i';
+ '-beris', '-bor', and '-bo' to '-bi';
+ '-ero' to '-eri'
+
+ else remove the suffix in the normal way.
+
+7. If the resulting stem contains at least two characters then write this stem
+ to the verb-based stem dictionary.
+
+8. (end)
+
+
+
+Unfortunately I was not able to make the rules match the examples given, which
+led to the following email correspondence,
+
+
+
+From: Martin Porter
+To: Peter Willett
+Date: Mon Sep 10 15:11:51 2001
+Subject: Re: Stemming algorithms
+
+> ... I'm no longer working in the IR area,
+>spending all of my time on computational chemistry/drug discovery
+>research but I guess that Mark Sanderson would be interested in
+>Snowball - do you mind if I pass your email onto him?
+
+Peter,
+
+Well, actually, I do have a question, if you can cast your mind back. I've
+implemented the Latin Stemmer in Snowball (see below: you'll have to guess the
+semantics, but I'm sure you'll agree the syntax looks nice), and find that Fig
+5 of the 1996 Schinke paper doesn't correspond to the algorithm of fig 7, but to
+the algorithm with the extra rules concerning -ba-, -bi-, -sse- mentioned on
+page 182. Which is the "correct" algorithm - with or without those rules? If
+with, what is the exact criterion for their removal? A bigger problem is why
+the -nt is not removed from 'Apparebunt', given -nt as an ending in 6(a). Is
+-nt a misprint?
+
+Sorry to bother you with this, but the paper says you are the one "to whom all
+correspondence should be addressed" :-)
+
+Martin
+
+
+ Here is your algorithm in Snowball. The generated code will do about 1 million
+ Latin words in 5 seconds:
+
+ -------
+
+
+[% highlight_file('schinke') %]
+
+
+
+From: Peter Willett
+To: Martin Porter
+Date: Mon Sep 10 20:25:24 2001
+Subject: Re: Stemming algorithms
+
+Martin
+
+Sorry - I just cannot answer. Robertson has retired to Dorset while
+Schinke is now in - I think - Canada
+
+Peter
+
+
+
+Following this, I was unable to contact Schinke, and so the problems have
+remained unresolved.
+
+
+
+The linked zip file contains the stemmer,
+generated C version, and sample data.
+(The stemmer differs slightly from the version in the email above in that
+it assembles the noun- and verb-forms of the stem in a single string with
+space separation.)
+voc.txt is a sample vocabulary, and joined.txt the vocabulary
+joined with the two stemmed forms as three column output.
+
+This page lists projects which are related to Snowball in some way.
+
+
+
+
+
Wrappers
+
+
+These projects allow Snowball-generated stemmers to be used from other
+languages.
+
+
+
+Aside from PyStemmer, we've not tested them to see if they successfully wrap
+the Snowball-generated code, are well implemented, etc. These projects aren't
+endorsed or recommended as such, but we hope they may be of interest.
+
+Richard Boulton put together some new Python bindings for snowball, inspired by
+Andreas Jung's initial implementation of PyStemmer from 2001, but with a
+different API. PyStemmer's current home is as part of the snowballstem github
+project.
+
+Lingua::Stem::Snowball is an XS module which provides a Perl interface to the
+C versions of the Snowball stemmers. The Snowball stopwords lists are also
+wrapped by Lingua::StopWords.
+
+A Node.js interface to the Snowball stemming algorithms, written by Andrea Maccis and largely inspired by Richard Boulton's PyStemmer.
+
+
+
+
+
Reimplementations of the Stemming Algorithms
+
+
+These projects reimplement the Snowball algorithms, either in hand-written
+code, or in code manually translated from the generated output for another
+language.
+
+
+
+We've not tested them to see if they correctly implement the stemming
+algorithms, are well implemented, etc. These projects aren't endorsed
+or recommended as such, but we hope they may be of interest.
+
+
+
+If you want to use one of these stemmers, we suggest you take the sample
+vocabulary for the corresponding natural language, and check that the
+stemmer produces the corresponding stemmed output.
+
+(added Sep 2010) Developed by Oleg Mazko, Urim is a standalone,
+offline tag-cloud builder engine, fully written in JavaScript and so
+capable of integration into all Internet browsers. It is available as
+a Firefox add-on. A JavaScript port of the Snowball stemmers (danish,
+dutch, english, finnish, french, german, hungarian, italian,
+norwegian, portuguese, russian, spanish, swedish, romanian, turkish)
+is also available as a separate library ready for developers.
+
+When you download Snowball,
+it already contains a make file to allow you to build it, like so:
+
+
+
+ make
+
+
+
+You can confirm it's working with a simple test like so:
+
+
+
+ echo "running" | ./stemwords -l en
+
+
+
+which should output: run
+
+
+
+There's no built-in way to install snowball currently - you can either copy
+the snowball binary to somewhere that's on your PATH
+(e.g. on a typical Linux machine: sudo cp snowball /usr/local/bin)
+or just run it from the source tree with ./snowball.
+
+
+
Running Snowball
+
+
+The snowball compiler has the following command line syntax,
+
+
+
+Usage: snowball SOURCE_FILE... [OPTIONS]
+
+Supported options:
+ -o, -output OUTPUT_BASE
+ -s, -syntax
+ -comments
+ -j, -java
+ -cs, -csharp
+ -c++
+ -pascal
+ -py, -python
+ -js[=TYPE] generate Javascript (TYPE values:
+ esm global, default: global)
+ -rust
+ -go
+ -ada
+ -w, -widechars
+ -u, -utf8
+ -n, -name CLASS_NAME
+ -ep, -eprefix EXTERNAL_PREFIX
+ -vp, -vprefix VARIABLE_PREFIX
+ -i, -include DIRECTORY
+ -r, -runtime DIRECTORY
+ -p, -parentclassname CLASS_NAME fully qualified parent class name
+ -P, -Package PACKAGE_NAME package name for stemmers
+ -S, -Stringclass STRING_CLASS StringBuffer-compatible class
+ -a, -amongclass AMONG_CLASS fully qualified name of the Among class
+ -gop, -gopackage PACKAGE_NAME Go package name for stemmers
+ -gor, -goruntime PACKAGE_NAME Go snowball runtime package
+ --help display this help and exit
+ --version output version information and exit
+
+The first argument, SOURCE_FILE, is the name of the Snowball file to be compiled. Unless you specify a different programming language to
+generate code for, the default is to generate ISO C which results in two output
+files, a C source in OUTPUT_BASE.c and a corresponding header file in OUTPUT_BASE.h. This is similar for other
+programming languages, e.g. if option -java is
+present, Java output is produced in OUTPUT_BASE.java.
+
+
+
+Some options are only valid when generating code for particular programming
+languages. For example, the -widechars,
+ -utf8, -eprefix and
+ -vprefix options are specific to C and C++.
+
+
+
ISO C generation
+
+
+In the absence of the -eprefix and -vprefix options, the list of
+declared externals in the Snowball program, for example,
+
+The -utf8 and -widechars options affect how
+the generated C/C++ code expects strings to be represented - UTF-8 or
+wide-character Unicode (stored using 2 bytes per codepoint), or if neither is
+specified, one byte per codepoint using either ISO-8859-1 or another encoding.
+
+
+
+For other programming languages, one of these three options is effectively
+implicitly hard-coded (except wide-characters may be wider) - e.g. C#, Java,
+Javascript and Python use wide characters; Ada, Go and Rust use UTF-8; Pascal
+uses ISO-8859-1. Since Snowball 2.0 it's possible with a little care to write
+Snowball code that works regardless of how characters are represented. See
+section 12 of the Snowball manual for
+more details.
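
The practical difference between the three representations can be seen from Python, whose str type counts codepoints while encoded bytes objects count bytes (the example word is my own choice; any Latin-1 word with a non-ASCII letter behaves the same way):

```python
# Danish 'åre' ("oar"): 3 codepoints, but its byte length depends on encoding.
word = "åre"

print(len(word))                       # wide-character view: 3 codepoints
print(len(word.encode("utf-8")))       # UTF-8 view: 4 bytes ('å' takes two)
print(len(word.encode("iso-8859-1")))  # single-byte view: 3 bytes
```

This is why a stemmer generated for one representation miscounts character positions if fed strings in another.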
+
+
+
+The -runtime option is used to prepend a path to any #include
+lines in the generated code, and is useful when the runtime header files (i.e.
+those files in the runtime directory in the standard distribution) are not
+in the same location as the generated source files. It is used when
+building the libstemmer library, and may be useful for other projects.
+
+
+
+
+
Other options
+
+
+If -syntax is used the other options are ignored, and the syntax tree
+of the Snowball program is directed to stdout. This can be a handy way
+of checking that you have got the bracketing right in the program you have
+written.
+
+
+
+Any number of -include options may be present, for example,
+
+They give a 1 or 0 result, corresponding to the t or f result of
+the Snowball routine.
+
+
+
+And later,
+
+
+
Khotanese_close_env(z);
+
+
+
+To release the space raised by z back to the system. You can do this for a
+number of Snowball modules at the same time: you will need a separate
+struct SN_env * z; for each module.
+
+
+
+The current string is given by the z->l bytes of data starting at z->p.
+The string is not zero-terminated, but you can zero terminate it yourself with
+
+
+
z->p[z->l]=0;
+
+
+
+(There is always room for this last zero byte.) For example,
+
+
+
SN_set_current(z,strlen(s),s);
+Khotanese_stem_1(z);
+z->p[z->l]=0;
+printf("Khotanese-1 stems '%s' to '%s'\n",s,z->p);
+
+
+
+The values of the other variables can be accessed via the #define
+settings that result from the -vprefix option, although this should not
+usually be necessary:
+
+
+
printf("p1 is %d\n",z->Khotanese_variable_p1);
+
+
+
+The stemming scripts on this Web site use Snowball very simply.
+-vprefix is left unset, and -eprefix is set to the name of the
+script (usually the language the script is for).
+
+
+
+
+
Debugging snowball scripts
+
+
+In the rare event that your Snowball script does not run perfectly the first time:
+
+
+
+Remember that the option -syntax prints out the syntax tree. A question
+mark can be included in Snowball as a command, and it will generate a call
+debug(...). The defined debug in runtime/utilities.c (usually
+commented out) can then be used. It causes the
+current string to be sent to stdout, with square brackets marking the
+slice and vertical bar the position of c. Curly brackets mark the
+end-limits of the string, which may be less than the whole string because
+of the action of setlimit.
+
+
+
+At present there is no way of reporting the value of an integer or boolean.
+
+
+
+If desperate, you can put debugging lines into the generated C program.
+You can pass -comments to the snowball compiler to get it to
+generate comments showing the correspondence with the Snowball source which
+makes it easier to find where to add such debugging code.
+
+
+
Compiler bugs
+
+
+If you hit a snowball compiler bug, try to
+capture it in a small script before notifying us.
+
+
+
Known problems in Snowball
+
+
+The main one is that it is possible to ‘pull the rug from under your own feet’ in
+constructions like this:
+
+Suppose C1 gives t, the delete removes the slice established on the first
+line, and C2 gives f, so C3 is done with c set back to the value it had
+before C1 was obeyed — but this old value does not take account of the byte shift
+caused by the delete. This problem was foreseen from the beginning when designing
+Snowball, and recognised as a minor issue because it is an unnatural thing to want to
+do. (C3 should not be an alternative to something which has deletion as an
+occasional side-effect.) It may be addressed in the future.
+
+When you download Snowball,
+it already contains a make file to allow you to build it, like so:
+
+
+
+ make
+
+
+
+You can confirm it's working with a simple test like so:
+
+
+
+ echo "running" | ./stemwords -l en
+
+
+
+which should output: run
+
+
+
+There's no built-in way to install snowball currently - you can either copy
+the snowball binary to somewhere that's on your PATH
+(e.g. on a typical Linux machine: sudo cp snowball /usr/local/bin)
+or just run it from the source tree with ./snowball.
+
+
+
Running Snowball
+
+
+The snowball compiler has the following command line syntax,
+
+The first argument, SOURCE_FILE, is the name of the Snowball file to be compiled. Unless you specify a different programming language to
+generate code for, the default is to generate ISO C which results in two output
+files, a C source in OUTPUT_BASE.c and a corresponding header file in OUTPUT_BASE.h. This is similar for other
+programming languages, e.g. if option -java is
+present, Java output is produced in OUTPUT_BASE.java.
+
+
+
+Some options are only valid when generating code for particular programming
+languages. For example, the -widechars,
+ -utf8, -eprefix and
+ -vprefix options are specific to C and C++.
+
+
+
ISO C generation
+
+
+In the absence of the -eprefix and -vprefix options, the list of
+declared externals in the Snowball program, for example,
+
+The -utf8 and -widechars options affect how
+the generated C/C++ code expects strings to be represented - UTF-8 or
+wide-character Unicode (stored using 2 bytes per codepoint), or if neither is
+specified, one byte per codepoint using either ISO-8859-1 or another encoding.
+
+
+
+For other programming languages, one of these three options is effectively
+implicitly hard-coded (except wide-characters may be wider) - e.g. C#, Java,
+Javascript and Python use wide characters; Ada, Go and Rust use UTF-8; Pascal
+uses ISO-8859-1. Since Snowball 2.0 it's possible with a little care to write
+Snowball code that works regardless of how characters are represented. See
+section 12 of the Snowball manual for
+more details.
+
+
+
+The -runtime option is used to prepend a path to any #include
+lines in the generated code, and is useful when the runtime header files (i.e.
+those files in the runtime directory in the standard distribution) are not
+in the same location as the generated source files. It is used when
+building the libstemmer library, and may be useful for other projects.
+
+
+
+
+
Other options
+
+
+If -syntax is used the other options are ignored, and the syntax tree
+of the Snowball program is directed to stdout. This can be a handy way
+of checking that you have got the bracketing right in the program you have
+written.
+
+
+
+Any number of -include options may be present, for example,
+
+To release the space raised by z back to the system. You can do this for a
+number of Snowball modules at the same time: you will need a separate
+struct SN_env * z; for each module.
+
+
+
+The current string is given by the z->l bytes of data starting at z->p.
+The string is not zero-terminated, but you can zero terminate it yourself with
+
+The values of the other variables can be accessed via the #define
+settings that result from the -vprefix option, although this should not
+usually be necessary:
+
+The stemming scripts on this Web site use Snowball very simply.
+-vprefix is left unset, and -eprefix is set to the name of the
+script (usually the language the script is for).
+
+
+
+
+
Debugging snowball scripts
+
+
+In the rare event that your Snowball script does not run perfectly the first time:
+
+
+
+Remember that the option -syntax prints out the syntax tree. A question
+mark can be included in Snowball as a command, and it will generate a call
+debug(...). The defined debug in runtime/utilities.c (usually
+commented out) can then be used. It causes the
+current string to be sent to stdout, with square brackets marking the
+slice and vertical bar the position of c. Curly brackets mark the
+end-limits of the string, which may be less than the whole string because
+of the action of setlimit.
+
+
+
+At present there is no way of reporting the value of an integer or boolean.
+
+
+
+If desperate, you can put debugging lines into the generated C program.
+You can pass -comments to the snowball compiler to get it to
+generate comments showing the correspondence with the Snowball source which
+makes it easier to find where to add such debugging code.
+
+
+
Compiler bugs
+
+
+If you hit a snowball compiler bug, try to
+capture it in a small script before notifying us.
+
+
+
Known problems in Snowball
+
+
+The main one is that it is possible to ‘pull the rug from under your own feet’ in
+constructions like this:
+
+
+[% highlight('
+ [ do something ]
+ do something_else
+ ( C1 delete C2 ) or ( C3 )
+') %]
+
+
+Suppose C1 gives t, the delete removes the slice established on the first
+line, and C2 gives f, so C3 is done with c set back to the value it had
+before C1 was obeyed — but this old value does not take account of the byte shift
+caused by the delete. This problem was foreseen from the beginning when designing
+Snowball, and recognised as a minor issue because it is an unnatural thing to want to
+do. (C3 should not be an alternative to something which has deletion as an
+occasional side-effect.) It may be addressed in the future.
+
Although conceptually different from an apostrophe, a single closing
+quote is also represented by character U+2019.
+
+
+
Character U+0027 is used for apostrophe, single closing quote and
+single opening quote (U+2018).
+
+
+
A fourth character, U+201B, like U+2018 but with the tail ‘rising’
+instead of ‘descending’, is also sometimes used as apostrophe (in the
+house style of certain publishers, for surnames like M’Coy and so on.)
+
+
+
+
+In the English stemming algorithm, it is assumed that apostrophe is
+represented by U+0027. This makes it ASCII compatible. Clearly other codes
+for apostrophe can be mapped to this code prior to stemming.
+
+
+
+In English orthography, apostrophe has one of three functions.
+
+
+
+
It indicates a contraction in what is now accepted as a single word:
+o’clock, O’Reilly, M’Coy. Except in proper names such forms
+are rare: the apostrophe in Hallowe’en is disappearing, and in
+’bus has disappeared.
+
+
+
It indicates a standard contraction with auxiliary or modal verbs:
+you’re, isn’t, we’d. There are about forty of these forms in
+contemporary English, and their use is increasing as they displace the full
+forms that were at one time used in formal documents. Although they can be
+reduced to word pairs, it is more convenient to treat them as single items
+(usually stopwords) in IR work. And then preserving the apostrophe is
+important, so that he’ll, she’ll, we’ll are not equated with
+hell, shell, well etc.
+
+
+
It is used to form the ‘English genitive’, John’s book, the horses’
+hooves etc. This is a development of (1), where historically the apostrophe
+stood for an elided e. (Similarly the printed form ’d for ed was
+very common before the nineteenth century.) Although in decline (witness pigs
+trotters, Girls School Trust), its use continues in contemporary
+English, where it is fiercely promoted as correct grammar, despite (or it might
+be closer to the truth to say because of) its complete semantic redundancy.
+
+
+
+
+For these reasons, the English stemmer treats apostrophe as a letter, removing
+it from the beginning of a word, where it might have stood for an opening
+quote, from the end of the word, where it might have stood for a closing quote,
+or been an apostrophe following s. The form ’s is also treated as an ending.
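
A rough Python sketch of this treatment (the function names are mine, and the suffix test approximates Step 0 of the English stemmer rather than reproducing it exactly):

```python
# Characters sometimes used for apostrophe: U+2019, U+2018, U+201B.
APOSTROPHE_LIKE = "\u2019\u2018\u201b"

def normalize_apostrophes(word):
    """Map other apostrophe codes to U+0027, as the text suggests."""
    for ch in APOSTROPHE_LIKE:
        word = word.replace(ch, "'")
    return word

def strip_apostrophes(word):
    word = normalize_apostrophes(word)
    # A leading apostrophe might have been an opening quote: remove it.
    if word.startswith("'"):
        word = word[1:]
    # Remove the longest of 's', 's, ' at the end (genitive or quote).
    for suf in ("'s'", "'s", "'"):
        if word.endswith(suf):
            word = word[:-len(suf)]
            break
    return word
```

Note that an internal apostrophe is kept, so he’ll is not conflated with hell.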
+
Although conceptually different from an apostrophe, a single closing
+quote is also represented by character U+2019.
+
+
+
Character U+0027 is used for apostrophe, single closing quote and
+single opening quote (U+2018).
+
+
+
A fourth character, U+201B, like U+2018 but with the tail ‘rising’
+instead of ‘descending’, is also sometimes used as apostrophe (in the
+house style of certain publishers, for surnames like M’Coy and so on.)
+
+
+
+
+In the English stemming algorithm, it is assumed that apostrophe is
+represented by U+0027. This makes it ASCII compatible. Clearly other codes
+for apostrophe can be mapped to this code prior to stemming.
+
+
+
+In English orthography, apostrophe has one of three functions.
+
+
+
+
It indicates a contraction in what is now accepted as a single word:
+o’clock, O’Reilly, M’Coy. Except in proper names such forms
+are rare: the apostrophe in Hallowe’en is disappearing, and in
+’bus has disappeared.
+
+
+
It indicates a standard contraction with auxiliary or modal verbs:
+you’re, isn’t, we’d. There are about forty of these forms in
+contemporary English, and their use is increasing as they displace the full
+forms that were at one time used in formal documents. Although they can be
+reduced to word pairs, it is more convenient to treat them as single items
+(usually stopwords) in IR work. And then preserving the apostrophe is
+important, so that he’ll, she’ll, we’ll are not equated with
+hell, shell, well etc.
+
+
+
It is used to form the ‘English genitive’, John’s book, the horses’
+hooves etc. This is a development of (1), where historically the apostrophe
+stood for an elided e. (Similarly the printed form ’d for ed was
+very common before the nineteenth century.) Although in decline (witness pigs
+trotters, Girls School Trust), its use continues in contemporary
+English, where it is fiercely promoted as correct grammar, despite (or it might
+be closer to the truth to say because of) its complete semantic redundancy.
+
+
+
+
+For these reasons, the English stemmer treats apostrophe as a letter, removing
+it from the beginning of a word, where it might have stood for an opening
+quote, from the end of the word, where it might have stood for a closing quote,
+or been an apostrophe following s. The form ’s is also treated as an ending.
+
+The question occasionally arises of how far the English (or earlier Porter)
+stemming algorithm can be adapted to handle older forms of the English
+language.
+
+
+
+Historically, English is usually divided into three periods of development,
+
+
+
+
Old English (or Anglo-Saxon), the language of Beowulf,
+
Middle English, the language of Chaucer,
+
Modern English, the language of Shakespeare, Dickens, and people today.
+
+
+
+Old English is so different from Modern English that it may be regarded as a
+distinct language.
+
+
+
+Middle English is problematical for a number of reasons. There is no standard
+spelling in the original texts, and the grammatical differences between Middle
+and Modern English prevent the spelling from being simply ‘modernised’. It is
+however possible to normalise the spelling according to some modern scheme, but
+again there is no standard modern scheme. Middle
+English itself had great regional variations, so that for example the
+English of Chaucer and his contemporary the Gawain poet (both late 14th century)
+are strikingly different. Finally, grammar was fluid even for one writer, so
+Chaucer might use they love or they loven, he
+sitteth or he sit.
+
+
+
+We may take Modern English to mean English which can be cast into a modern
+spelling form without too much damage being done to the original. From this
+point of view Shakespeare and the Authorised Version of the Bible are in Modern
+English. The ending structure of words in early Modern English differ from
+contemporary English in the est and eth endings of verbs in the present
+indicative,
+
+
+
+ I bring
+ thou bringest
+ he bringeth
+ we bring
+ you bring
+ they bring
+
+
+
+Both of these endings underwent rapid decline. The eth form occurs in
+Shakespeare, but is much rarer than the modern s form. The language of the
+Authorised Version,
+in which both forms abound,
+seemed archaic even on its first publication. Consequently
+the eth form survives now only in the language of the traditional Bible and
+Book of Common Prayer. The est form disappeared more slowly, as the use of
+thou became displaced by you in conversation.
+
+
+
+To put the endings into the
+Porter stemmer,
+the rules
+
+
+
+Step 1b
+
+
(m>0) EED
→
EE
+
(*v*) ED
→
+
(*v*) ING
→
+
+
+
+
+should be extended to
+
+
+
+Step 1b
+
+
(m>0) EED
→
EE
+
(*v*) ED
→
+
(*v*) ING
→
+
(*v*) EST
→
+
(*v*) ETH
→
+
+
+
+
+And to put the endings into the
+English stemmer,
+the list
+
+
+
+ed edly ing ingly
+
+of Step 1b should be extended to
+
+ed edly ing ingly est eth
+
+
+
+As far as the Snowball scripts are concerned, the endings 'est' and 'eth' must
+be added alongside the ending 'ing'.
+
+
+
+The inclusion of these endings does produce certain ‘side effects’. est is
+the ending of adjectival superlatives (greatest, unkindest), where it
+will also be removed. Words like brandreth, deforest will be mis-stemmed.
+Nevertheless, for the vocabulary of the Bible, the inclusion of these extra
+endings is not harmful (see
+this demonstration —
+for example, search for the text love in 1000 verses).
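
The two extra rules can be sketched in isolation in Python (the names are mine, and the (*v*) condition is simplified here to "the remaining stem contains a vowel, counting y as a vowel", which is close to but not identical with Porter's definition):

```python
VOWELS = set("aeiouy")

def contains_vowel(stem):
    # Simplified (*v*) condition: the stem left after removal has a vowel.
    return any(c in VOWELS for c in stem)

def strip_early_english(word):
    """Remove a final est or eth if the remaining stem contains a vowel."""
    for suffix in ("est", "eth"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if contains_vowel(stem):
                return stem
    return word
```

So bringest and bringeth both reduce to bring, and, as noted above, the superlative greatest is also reduced, to great.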
+
+
diff --git a/texts/earlyenglish.tt b/texts/earlyenglish.tt
new file mode 100644
index 0000000..827a911
--- /dev/null
+++ b/texts/earlyenglish.tt
@@ -0,0 +1,135 @@
+[% header('Stemming early English') %]
+
+
+The question occasionally arises of how far the English (or earlier Porter)
+stemming algorithm can be adapted to handle older forms of the English
+language.
+
+
+
+Historically, English is usually divided into three periods of development,
+
+
+
+
Old English (or Anglo-Saxon), the language of Beowulf,
+
Middle English, the language of Chaucer,
+
Modern English, the language of Shakespeare, Dickens, and people today.
+
+
+
+Old English is so different from Modern English that it may be regarded as a
+distinct language.
+
+
+
+Middle English is problematical for a number of reasons. There is no standard
+spelling in the original texts, and the grammatical differences between Middle
+and Modern English prevent the spelling from being simply ‘modernised’. It is
+however possible to normalise the spelling according to some modern scheme, but
+again there is no standard modern scheme. Middle
+English itself had great regional variations, so that for example the
+English of Chaucer and his contemporary the Gawain poet (both late 14th century)
+are strikingly different. Finally, grammar was fluid even for one writer, so
+Chaucer might use they love or they loven, he
+sitteth or he sit.
+
+
+
+We may take Modern English to mean English which can be cast into a modern
+spelling form without too much damage being done to the original. From this
+point of view Shakespeare and the Authorised Version of the Bible are in Modern
+English. The ending structure of words in early Modern English differs from
+contemporary English in the est and eth endings of verbs in the present
+indicative,
+
+
+
+ I bring
+ thou bringest
+ he bringeth
+ we bring
+ you bring
+ they bring
+
+
+
+Both of these endings underwent rapid decline. The eth form occurs in
+Shakespeare, but is much rarer than the modern s form. The language of the
+Authorised Version,
+in which both forms abound,
+seemed archaic even on its first publication. Consequently
+the eth form survives now only in the language of the traditional Bible and
+Book of Common Prayer. The est form disappeared more slowly, as the use of
+thou became displaced by you in conversation.
+
+
+
+To put the endings into the
+Porter stemmer,
+the rules
+
+
+
+Step 1b
+
+
(m>0) EED  →  EE
(*v*) ED   →
(*v*) ING  →
+
+
+
+
+should be extended to
+
+
+
+Step 1b
+
+
(m>0) EED  →  EE
(*v*) ED   →
(*v*) ING  →
(*v*) EST  →
(*v*) ETH  →
+
+
+
+
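The extended rules can be illustrated with a minimal Python sketch. This is not part of the Snowball distribution: it shows only the added (*v*) EST and (*v*) ETH rules, where the suffix is removed only if the remaining stem contains a vowel, and it omits the clean-up steps the full Step 1b applies after removing ED and ING.

```python
# Minimal sketch of the extended Step 1b condition: remove "est" or
# "eth" only if the remaining stem contains a vowel -- the (*v*) test.
# Illustration only, not a full Porter stemmer.

VOWELS = set("aeiou")

def contains_vowel(stem):
    # Porter's *v* condition: the stem contains at least one vowel.
    # (The treatment of "y" after a consonant as a vowel is omitted.)
    return any(c in VOWELS for c in stem)

def step1b_early_english(word):
    # Try the early Modern English endings alongside "ing" and "ed".
    for suffix in ("est", "eth", "ing", "ed"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if contains_vowel(stem):
                return stem
    return word

print(step1b_early_english("bringest"))  # bring
print(step1b_early_english("bringeth"))  # bring
print(step1b_early_english("rest"))      # rest ("r" has no vowel)
```

Note how the (*v*) condition protects short words such as rest, whose apparent est ending would otherwise be stripped.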
+And to put the endings into the
+English stemmer,
+the list
+
+
+
+ed edly ing ingly
+
+of Step 1b should be extended to
+
+ed edly ing ingly est eth
+
+
+
+As far as the Snowball scripts are concerned, the endings 'est' and 'eth' must
+be added in the same way as the ending 'ing'.
+
+
+
+The inclusion of these endings does produce certain ‘side effects’. est is
+the ending of adjectival superlatives (greatest, unkindest), where it
+will also be removed. Words like brandreth, deforest will be mis-stemmed.
+Nevertheless, for the vocabulary of the Bible, the inclusion of these extra
+endings is not harmful (see
+this demonstration —
+for example, search for the text love in 1000 verses).
+
+[% footer %]
diff --git a/texts/glossary.html b/texts/glossary.html
new file mode 100644
index 0000000..9f1fc31
--- /dev/null
+++ b/texts/glossary.html
@@ -0,0 +1,170 @@
+
+
+
+An a-suffix, or attached suffix, is a particle word attached to another
+word. (In the stemming literature they sometimes get referred to as
+‘enclitics’.) In Italian, for example, personal pronouns attach to
+certain verb forms:
+
+
+
+
mandargli = mandare + gli = to send + to him
mandarglielo = mandare + gli + lo = to send + it + to him
+
+
+
+a-suffixes appear in Italian and Spanish, and also in Portuguese, although
+in Portuguese they are separated by hyphen from the preceding word, which
+makes them easy to eliminate.
+
+
+
+
i-suffix
+
+
+An i-suffix, or inflectional suffix, forms part of the basic grammar of a
+language, and is applicable to all words of a certain grammatical type,
+with perhaps a small number of exceptions. In English for example, the past
+of a verb is formed by adding ed. Certain modifications may be required
+in the stem:
+
+
+
+
fit + ed → fitted (double t)
love + ed → loved (drop the final e of love)
+
+
+
+
d-suffix
+
+
+
+A d-suffix, or derivational suffix, enables a new word, often with a
+different grammatical category, or with a different sense, to be built from
+another word. Whether a d-suffix can be attached is discovered not from
+the rules of grammar, but by referring to a dictionary. So in English,
+ness can be added to certain adjectives to form corresponding nouns
+(littleness, kindness, foolishness ...) but not to all adjectives (not for
+example, to big, cruel, wise ...). d-suffixes can be used to change
+meaning, often in rather exotic ways. So in Italian astro means a sham
+form of something else:
+
+
+
+
medico + astro = medicastro = quack doctor
poeta + astro = poetastro = poetaster
+
+
+
+
Indo-European languages
+
+
+
+Most European and many Asian languages belong to the Indo-European language
+group. Historically, it includes the Latin, Greek, Persian and Sanskrit of
+the ancient world, and with the rise of the European empires, languages of
+this group are now dominant in the Americas, Australia and large parts of
+Africa. Indo-European languages are therefore the main languages of modern
+Western culture, and they are all similarly amenable to stemming.
+
+
+
+The Indo-European group has many recognisable sub-groups, for example
+Romance (Italian, French, Spanish ...), Slavonic (Russian, Polish,
+Czech ...), Celtic (Irish Gaelic, Scottish Gaelic, Welsh ...). The
+Germanic sub-group includes German and Dutch, and the Scandinavian
+languages are also usually classed as Germanic, although for convenience we
+have made a separate grouping of them on the Snowball site. English is also
+classed as Germanic, although it has been classed separately by us. This is
+not for reasons of narrow chauvinism, but because the suffix structure of
+English clearly lies mid-way between the Germanic and Romance groups, and it
+therefore requires separate treatment.
+
+
+
+
Uralic languages
+
+
+
+The Uralic languages are spoken mainly in Northern Russia and Europe. They
+are divided into Samoyed, spoken mainly in the Siberian region, and
+Finno-Ugric, spoken mainly in Europe. Although the number of languages in
+the group is substantial, the total number of speakers is relatively small.
+The best known Uralic languages are perhaps Hungarian, Finnish and
+Estonian. Finnish and Estonian are in fact fairly similar. On the other
+hand Hungarian and Finnish are as different as are, say, French and Persian
+in the Indo-European group.
+
+
+
+Like the Indo-European languages, the Uralic languages are amenable to
+stemming.
+
+
+
+[% footer %]
diff --git a/texts/howtohelp.html b/texts/howtohelp.html
new file mode 100644
index 0000000..e8f7448
--- /dev/null
+++ b/texts/howtohelp.html
@@ -0,0 +1,129 @@
+
+
+
+
+
+
+
+
+
+ Snowball: How You Can Help - Snowball
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Snowball: How You Can Help
+
+
+
+For the work on this site there are two possible lines of development, one is
+Snowball itself — the language and compiler — and the other is the
+stemmers which are written in Snowball. At the moment it is the latter that
+is the real area of interest.
+
+
+
+It is useful to have suggestions about improvements to the existing
+stemmers, especially for the ones which are not English. However, the
+process of piecemeal improvement can be taken too far, and it is important
+in making these suggestions to recognise the inevitable limitations of
+accuracy of algorithmic stemmers. But more importantly: —
+
+
+
+Stemming algorithms have a well-understood place in IR (Information
+Retrieval), and as language-specific tools in an IR system, they have an
+extremely useful part to play. It is therefore something of a scandal that
+there are so very few stemming algorithms which are readily available, so
+if you want to make a contribution to Snowball, the best thing you can do
+is to create a good quality stemmer for a new language. This must
+include an algorithmic description of the stemmer, an implementation in
+Snowball, and a representative language vocabulary of about 30,000 words
+that can be used as part of a standard test.
+
+
+
+Alternatively, you might come up with the algorithm and be able to provide
+representative texts from which to derive the vocabulary, but hesitate
+about the Snowball implementation. If so, get in touch, and we might be
+able to complete the work collaboratively.
+
+
+
+We are also interested in:
+
+
+
+
Significant applications developed with the Snowball stemmers
+
+
Stemmers held on other sites that derive from Snowball work
+
+
Other useful stemming resources
+
+
+
+It may seem like stating the obvious, but if you do hit a technical
+problem, please, please send in a full notice of the system being used,
+the activity you were engaged on, and the errors that you encounter.
+
+
+
+Finally, if you want to contribute to this site, you must be prepared to
+release under the BSD license (i.e. to make your work free).
+
+
+
+Martin Porter
+Richard Boulton
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/texts/howtohelp.tt b/texts/howtohelp.tt
new file mode 100644
index 0000000..5231d72
--- /dev/null
+++ b/texts/howtohelp.tt
@@ -0,0 +1,65 @@
+[% header('Snowball: How You Can Help') %]
+
+
+For the work on this site there are two possible lines of development, one is
+Snowball itself — the language and compiler — and the other is the
+stemmers which are written in Snowball. At the moment it is the latter that
+is the real area of interest.
+
+
+
+It is useful to have suggestions about improvements to the existing
+stemmers, especially for the ones which are not English. However, the
+process of piecemeal improvement can be taken too far, and it is important
+in making these suggestions to recognise the inevitable limitations of
+accuracy of algorithmic stemmers. But more importantly: —
+
+
+
+Stemming algorithms have a well-understood place in IR (Information
+Retrieval), and as language-specific tools in an IR system, they have an
+extremely useful part to play. It is therefore something of a scandal that
+there are so very few stemming algorithms which are readily available, so
+if you want to make a contribution to Snowball, the best thing you can do
+is to create a good quality stemmer for a new language. This must
+include an algorithmic description of the stemmer, an implementation in
+Snowball, and a representative language vocabulary of about 30,000 words
+that can be used as part of a standard test.
+
+
+
+Alternatively, you might come up with the algorithm and be able to provide
+representative texts from which to derive the vocabulary, but hesitate
+about the Snowball implementation. If so, get in touch, and we might be
+able to complete the work collaboratively.
+
+
+
+We are also interested in:
+
+
+
+
Significant applications developed with the Snowball stemmers
+
+
Stemmers held on other sites that derive from Snowball work
+
+
Other useful stemming resources
+
+
+
+It may seem like stating the obvious, but if you do hit a technical
+problem, please, please send in a full notice of the system being used,
+the activity you were engaged on, and the errors that you encounter.
+
+
+
+Finally, if you want to contribute to this site, you must be prepared to
+release under the BSD license (i.e. to make your work free).
+
+ Algorithmic stemmers continue to have great utility in IR, despite the
+ promise of out-performance by dictionary-based stemmers. Nevertheless,
+ there are few algorithmic descriptions of stemmers, and even when they
+ exist they are liable to misinterpretation. Here we look at the ideas
+ underlying stemming, and on this website define a language, Snowball,
+ in which stemmers can be exactly defined, and from which fast stemmer
+ programs in ANSI C or Java can be generated. A range of stemmers is presented
+ in parallel algorithmic and Snowball form, including the original
+ Porter stemmer for English.
+
+
+
1 Introduction
+
+
+There are two main reasons for creating Snowball. One is the lack of
+readily available stemming algorithms for languages
+other than English. The other is the consciousness of a certain failure on
+my part in promoting exact implementations of the stemming
+algorithm described in (Porter 1980), which has come to be called the
+Porter stemming algorithm. The first point needs some qualification: a
+great deal of work has been done on stemmers in a wide range of natural
+languages, both in their development and evaluation (a complete
+bibliography cannot be attempted here). But it is rare to see a stemmer
+laid out in an unambiguous algorithmic form from which encodings in C,
+Java, Perl etc might easily be made. When exact descriptions are
+attempted, it is often with approaches to stemming that are
+relatively simple, for example the Latin stemmer of Schinke (Schinke 1996),
+or the Slovene stemmer of Popovic (Popovic 1990). A more complex, and
+therefore more characteristic stemmer is the Kraaij-Pohlmann stemmer for
+Dutch (Kraaij 1994), which is presented as open source code in ANSI C. To
+extract an algorithmic description of their stemmer from the source code
+proves to be quite hard.
+
+
+
+The disparity between the Porter stemmer definition and many of its
+purported implementations is much wider than is generally realised in the
+IR community. Three problems seem to compound: one is a misunderstanding
+of the meaning of the original algorithm, another is bugs in the
+encodings, and a third is the almost irresistible urge of programmers
+to add improvements.
+
+
+
+For example, a Perl script advertised on the Web as an
+implementation of the Porter algorithm was tested in October 2001, and it was
+found that 14 percent of words were stemmed incorrectly when given a large sample
+vocabulary. Most words of English have
+very simple endings, so this means that it was effectively getting everything
+wrong. At certain points on the Web are demonstrations of the Porter stemmer.
+You type some English into a box and the stemmed words are displayed. These
+are frequently faulty. (A good test is to type in agreement. It should stem
+to agreement — the same word. If it stems to agreem there is an
+error.) Researchers frequently pick up faulty versions of the stemmer and
+report that they have applied ‘Porter stemming’, with the result that their
+experiments are not quite repeatable. Researchers who work on stemming will
+sometimes give incorrect examples of the behaviour of the Porter stemmer in
+their published works.
+
+
+
+To address all these problems I have tried to develop a rigorous system
+for defining stemming algorithms. A language, Snowball, has been invented,
+in which the rules of stemming algorithms can be expressed in a natural
+way. Snowball is quite small, and can be learned by an experienced
+programmer in an hour or so. On this website a number of foreign language
+stemmers is presented (a) in Snowball, and (b) in a less formal
+English-language description. (b) can be thought of as the program
+comments for (a). A Snowball compiler translates each Snowball
+definition into (c) an equivalent program in ANSI C or Java. Finally (d)
+standard vocabularies of words and their stemmed equivalents are provided
+for each stemmer. The combination of (a), (b), (c) and (d)
+can be used to pin down the definition of a stemmer exactly, and it is
+hoped that Snowball itself will be a useful resource in creating stemmers
+in the future.
+
+
+
2 Some ideas underlying stemming
+
+
+Work in stemming has produced a number of different approaches, albeit tied
+together by a number of common assumptions. It is worthwhile looking at some
+of them to see exactly where Snowball fits into the whole picture.
+
+
+
+A point tacitly assumed in almost all of the stemming literature is that
+stemmers are based upon the written, and not the spoken, form of the
+language. This is also the assumption here. Historically,
+grammarians often regarded the written language as the real language and
+the spoken as a mere derivative form. Almost in reaction, many modern
+linguists have taken a precisely opposite view (Palmer, 1965 pp 2-3). A
+more balanced position is that the two languages are distinct though
+connected, and require separate treatment. One can in fact imagine parallel
+stemming algorithms for the spoken language, or rather for the phoneme
+sequence into which the spoken language is transformed. Stress and
+intonation could be used as clues for an indexing process in the same way
+that punctuation and capitalisation are used as clues in the written
+language. But currently stemmers work on the written language for the good
+reason that there is so much of it available in machine readable form from
+which to build our IR systems. Inevitably therefore the stemmers get
+caught up in accidental details of orthography. In English, removing the
+ing from rotting should be followed by undoubling the tt,
+whereas in rolling we do not undouble the ll. In French, removing
+the er from ennuyer should be followed by changing the y to
+i, so that the resulting word conflates with ennui, and so on.
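The English undoubling case can be sketched in a few lines of Python. This is a simplified illustration of the orthographic clean-up, assuming the Porter stemmer's exception list, under which a final double "l", "s" or "z" is left doubled:

```python
# Sketch of the orthographic clean-up described above: after removing
# "ing", undouble a final double letter -- except for vowels and for
# "l", "s", "z", which the Porter stemmer leaves doubled (so "rolling"
# keeps its "ll"). A simplified illustration, not the full rule set.

def strip_ing(word):
    if not word.endswith("ing"):
        return word
    stem = word[:-3]
    if (len(stem) >= 2 and stem[-1] == stem[-2]
            and stem[-1] not in "aeioulsz"):
        stem = stem[:-1]          # rotting -> rott -> rot
    return stem                   # rolling -> roll (ll kept)

print(strip_ing("rotting"))  # rot
print(strip_ing("rolling"))  # roll
print(strip_ing("seeing"))   # see (double vowel untouched)
```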
+
+
+
+The idea of stemming is to improve IR performance generally by bringing
+under one heading variant forms of a word which share a common meaning.
+Harman (1991) was first to present compelling evidence that it may not do
+so, when her experiments discovered no significant improvement with the
+use of stemming.
+Similarly Lennon (1981) discovered no appreciable difference between different
+stemmers running on a constant collection.
+Later work has modified this position however. Krovetz
+(1995) found significant, although sometimes small, improvements across a
+range of test collections. What he did discover is that the degree of
+improvement varies considerably between different collections.
+These tests were however done on collections in
+English, and the reasonable assumption of IR researchers has always been that for
+languages that are more highly inflected than English (and nearly all
+are), greater improvements will be observed when stemming is applied. My
+own view is that stemming helps regularise the
+vocabulary of an IR system, and this leads to advantages that are not
+easily quantifiable through standard IR experiments. For example, it helps
+in presenting lists of terms associated with the query back to the IR user
+in a relevance feedback cycle, which is one of the underlying ideas of the
+probabilistic model. More will be said on the use of a stemmed vocabulary
+in section 5.
+
+
+
+Stemming is not a concept applicable to all languages. It is not, for
+example, applicable in Chinese. But to languages of the Indo-European (*)
+group (and most of the stemmers on this site are for Indo-European
+languages), a common
+pattern of word structure does emerge. Assuming words are written left to
+right, the stem, or root of a word is on the left, and zero or more
+suffixes may be added on the right. If the root is modified by this
+process it will normally be at its right hand end. And also prefixes may
+be added on the left. So unhappiness has a prefix un, a suffix
+ness, and the y of happy has become i with the addition of
+the suffix. Usually, prefixes alter meaning radically, so they are best
+left in place (German and Dutch ge is an exception here). But suffixes
+can, in certain circumstances, be removed. So for example happy and
+happiness have closely related meanings, and we may wish to stem both
+forms to happy, or happi. Infixes can occur, although rarely:
+ge in German and Dutch, and zu in German.
+
+
+
+One can make some distinction between root and stem. Lovins (1968)
+sees the root as the stem minus any prefixes. But here we will
+think of the stem as the residue of the stemming process, and the root as the
+inner word from which the stemmed word derives, so we think of root to
+some extent in an etymological way. It must be admitted that when you
+start thinking hard about these concepts root, stem, suffix,
+prefix ... they turn out to be very difficult indeed to define.
+Nor do definitions, even if we arrive at them, help us much. After all, suffix
+stripping is a practical aid in IR, not an exercise in linguistics or
+etymology. This is especially true of the central concept of root. We
+think of the etymological root of a word as something we can discover with
+certainty from a dictionary, forgetting that etymology itself is a subject
+with its own doubts and controversies (Jesperson 1922, Chapter XVI).
+Indeed, Jesperson goes so far as to say that
+
+
+
+
+ ‘It is of course impossible to say how great a proportion of the
+ etymologies given in dictionaries should strictly be classed under
+ each of the following heads: (1) certain, (2) probable, (3)
+ possible, (4) improbable, (5) impossible — but I am afraid the
+ first two classes would be the least numerous.’
+
+
+
+
+Here we will simply assume a common sense understanding of
+the basic idea of stem and suffix, and hope that this proves sufficient
+for designing and discussing stemming algorithms.
+
+
+
+We can separate suffixes out into three basic classes, which will be
+called d-, i- and a-suffixes.
+
+
+
+An a-suffix, or attached suffix, is a particle word attached to another
+word. (In the stemming literature they sometimes get referred to as
+‘enclitics’.) In Italian, for example, personal pronouns attach to
+certain verb forms:
+
+
+
+
mandargli = mandare + gli = to send + to him
mandarglielo = mandare + gli + lo = to send + it + to him
+
+
+
+a-suffixes appear in Italian and Spanish, and also in Portuguese, although
+in Portuguese they are separated by hyphen from the preceding word, which
+makes them easy to eliminate.
+
+
+
+An i-suffix, or inflectional suffix, forms part of the basic grammar of a
+language, and is applicable to all words of a certain grammatical type,
+with perhaps a small number of exceptions. In English for example, the past
+of a verb is formed by adding ed. Certain modifications may be required
+in the stem:
+
+
+
+
fit + ed → fitted (double t)
love + ed → loved (drop the final e of love)
+
+
+
+but otherwise the rule applies in a regular way to all verbs in
+contemporary English, with about 150 (Palmer, 1965) exceptional forms,
+
+
+
+
bear   beat   become   begin   bend   ....
bore   beat   became   began   bent
+
+
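The two stem modifications shown earlier (fit → fitted, love → loved) can be sketched as follows. This is an illustrative rule of thumb for the regular cases only: the consonant-vowel-consonant doubling test is a simplification of real English spelling, which has further conditions.

```python
# Sketch of the two stem modifications for the regular "ed" ending:
# a final "e" merges with the ending, and a short stem ending in
# consonant-vowel-consonant doubles its last letter. Illustrative
# only -- "w", "x" and "y" are excluded from doubling as in ordinary
# English spelling (snowed, boxed, played).

VOWELS = set("aeiou")

def add_ed(stem):
    if stem.endswith("e"):
        return stem + "d"                      # love -> loved
    if (len(stem) >= 3
            and stem[-1] not in VOWELS and stem[-1] not in "wxy"
            and stem[-2] in VOWELS
            and stem[-3] not in VOWELS):
        return stem + stem[-1] + "ed"          # fit -> fitted
    return stem + "ed"                         # walk -> walked

print(add_ed("love"))  # loved
print(add_ed("fit"))   # fitted
print(add_ed("walk"))  # walked
```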
+
+A d-suffix, or derivational suffix, enables a new word, often with a
+different grammatical category, or with a different sense, to be built from
+another word. Whether a d-suffix can be attached is discovered not from
+the rules of grammar, but by referring to a dictionary. So in English,
+ness can be added to certain adjectives to form corresponding nouns
+(littleness, kindness, foolishness ...) but not to all adjectives (not for
+example, to big, cruel, wise ...). d-suffixes can be used to change
+meaning, often in rather exotic ways. So in Italian astro means a sham
+form of something else:
+
+
+
+
medico + astro = medicastro = quack doctor
poeta + astro = poetastro = poetaster
+
+
+
+Generally i-suffixes follow d-suffixes. i-suffixes can precede d-suffixes,
+for example lovingly, devotedness, but such cases are exceptional. To
+be a little more precise, d-suffixes can sometimes be added to
+participles. devoted, used adjectivally, is a participle derived from the
+verb devote, and ly can be added to turn the adjective into an adverb,
+or ness to turn it into a noun. The same feature occurs in other
+Indo-European languages.
+
+
+
+Sometimes it is hard to say whether a suffix is a d-suffix or i-suffix,
+the comparative and superlative endings er, est of English for example.
+
+
+
+A d-suffix can serve more than one function. In English, for example,
+ly standardly turns an adjective into an adverb (greatly), but it
+can also turn a noun into an adjective (kingly). In French, ement
+also standardly turns an adjective into an adverb (grandement), but it
+can also turn a verb into a noun (rapprochement). (Referring to the
+French stemmer, this double use is ultimately why ement is tested for
+being in the RV rather than the R2 region of the word being
+stemmed.)
+
+
+
+It is quite common for an i-suffix to serve more than one function.
+In English, s can either be (1) a verb ending attached to third person
+singular forms (runs, sings), (2) a noun ending indicating the plural
+(dogs, cats) or (3) a noun ending indicating the possessive
+(boy’s, girls’). By an orthographic convention now several hundred
+years old, the possessive is written with an apostrophe, but
+nowadays this is
+frequently omitted in familiar phrases (a girls school). (Usage (3) is
+relatively rare compared with (1) and (2): there are only nine uses of
+’s in this document.)
+
+
+
+Since the normal order of suffixes is d, i and a, we
+can expect them to be removed
+from the right in the order a, i and d. Usually we want to remove
+all a- and i-suffixes, and some of the d-suffixes.
+
+
+
+If the stemming process reduces two words to the same stem, they are said
+to be conflated.
+
+
+
3 Stemming errors, and the use of dictionaries
+
+
+One way of thinking of the relation between terms and documents in an IR
+system is to see the documents as being about concepts, and the terms as
+words that describe the concepts. Then, of course, one word can cover many
+concepts, so pound can mean a unit of currency, a weight, an enclosure,
+or a beating. Pound is a homonym. And one concept can be described by
+many words, as with money, capital, cash, currency. These words
+are synonyms. There is a many-many mapping therefore between the set of
+terms and the set of concepts. Stemming is a process that transforms this
+mapping to advantage, on the whole reducing the number of synonyms, but
+occasionally creating new homonyms. It is worth remembering that what are
+called stemming errors are usually just the introduction of new homonyms into
+vocabularies that already contain very large numbers of homonyms.
+
+
+
+Words which have no place in this term-concept mapping are those which
+describe no concepts. The particle words of grammar, the, of,
+and
+..., known in IR as stopwords, fall into this category. Stopwords can be
+useful for retrieval but only in searching for phrases, ‘to be or not to
+be’, ‘do as you would be done by’ etc. This suggests that stemming
+stopwords is not useful. More will be said on stopwords in section 7.
+
+
+
+In the literature, a distinction is often made between
+under-stemming, which is the error of taking off too small a suffix, and
+over-stemming, which is the error of taking off too much. In French, for
+example, croûtons is the plural of croûton, ‘a crust’, so to remove
+ons would be over-stemming, while croulons is a verb form of crouler,
+‘to totter’, so to remove s would be under-stemming. We would like to
+introduce a further distinction between mis-stemming and over-stemming.
+Mis-stemming is taking off what looks like an ending, but is really part
+of the stem. Over-stemming is taking off a true ending which results in
+the conflation of words of different meanings.
+
+
+
+So for example ly can be removed from cheaply, but not from reply,
+because in reply, ly is not a suffix. If it were removed, reply would
+conflate with rep (the commonly used short form of representative).
+Here we have a case of mis-stemming.
+
+
+
+To illustrate over-stemming, look at these four words,
+
+
+
+
              verb    adjective
First pair:   prove   provable
Second pair:  probe   probable
+
+
+
+Morphologically, the two pairs are exactly parallel (in the written, if not
+the spoken language). They also have a common etymology. All four words
+derive from the Latin probare, ‘to prove or to test’, and the idea of
+testing connects the meanings of the words. But the meanings are not parallel.
+provable means ‘able to be proved’; probable does not mean ‘able to be
+probed’. Most people would judge conflation of the first pair as correct,
+and of the second pair, incorrect. In other words, to remove able from
+probable is a case of over-stemming.
+
+
+
+We can try to avoid mis-stemming and over-stemming by using a dictionary.
+The dictionary can tell us that reply does not derive from rep, and
+that the meanings of probe and probable are well separated in modern
+English. It is important to realise however that a dictionary does not give
+a complete solution here, but can be a tool to improve the conflation
+process.
+
+
+
+In Krovetz’s dictionary experiments (Krovetz 1995), he noted that in
+looking up a past participle like suited, one is led either to suit or
+to suite as plausible infinitive forms. suite can be rejected,
+however, because the dictionary tells us that
+although it is a word of English
+it is not a verb form. Cases
+like this (and Krovetz found about 60) had to be treated as exceptions. But
+the form routed could
+either derive from the verb rout or the verb route:
+
+
+
+ At Waterloo Napoleon’s forces were routed
+ The cars were routed off the motorway
+
+
+
+Such cases in English are extremely rare, but they are commoner in more
+highly inflected languages. In French for example, affiliez can either be
+the verb affiler, to sharpen, with imperfect ending iez, or the verb
+affilier, to affiliate, with present indicative ending ez:
+
+
+
+
+ vous affiliez = vous affil-iez = you sharpened
+ vous affiliez = vous affili-ez = you affiliate
+
+
+
+If the second is intended, removal of iez is mis-stemming.
+
+
+
+With over-stemming we must rely upon the dictionary to separate meanings.
+There are different ways of doing this, but all involve some degree of
+reliance upon the lexicographers. Krovetz’s methods are no doubt best,
+because the most objective: he uses several measures, but they are based on
+the idea of measuring the similarity in
+meaning of two words by the degree of overlap among the words used to define
+them, and this is at a good remove from a lexicographer’s subjective
+judgement about semantic similarity.
+
+
+
+There is an interesting difference between mis-stemming and over-stemming
+to do with language history. The morphology of a language changes less
+rapidly than the meanings of the words in it. When extended to include a
+few archaic endings, such as ick as an alternative to ic, a stemmer for
+contemporary English can be applied to the English of 300 years ago.
+Mis-stemmings will be roughly the same, but the pattern of over-stemming will
+be different because of the changing meaning of words in the language. For
+example, relativity in the 19th century merely meant ‘the condition of
+being relative to’. With that meaning, it is acceptable to conflate it
+with relative.
+But with the 20th century meaning brought to it by
+Einstein, stemming to relativ is over-stemming.
+Here we see the word with the suffix changing its meaning, but it can happen
+the other way round. transpire has come to mean ‘happen’, and its old
+meaning of ‘exhalation’ or ‘breathing out’ is now effectively lost.
+(That is the bitter reality, although dictionaries still try to persuade us
+otherwise). But transpiration still carries the earlier meaning.
+So what was formerly an acceptable stemming may be judged now as
+an over-stemming, not because the word being stemmed has changed its meaning,
+but because some cognate word has changed its meaning.
+
+
+
+In these examples we are presenting words as if they had single meanings, but
+the true picture is more complicated. Krovetz uses a model of word
+meanings which is extremely helpful here. He makes a distinction between
+homonyms and polysemes. The meanings of homonyms are quite unrelated.
+For example, ground in the sense of ‘earth’, and ‘ground’ as the past
+participle of ‘grind’ are homonyms. Etymologically homonyms have different
+stories, and they usually have separate entries in a dictionary. But each
+homonym form can have a range of polysemic forms, corresponding to different
+shades of meaning. So ground can mean the earth’s surface, or the bottom
+of the sea, or soil, or any base, and so the basis of an argument, and so on.
+Over time new polysemes appear and old ones die. At any moment, the use of a
+word will be common in some polysemic forms and rare in others. If a suffix is
+attached to a word the new word will get a different set of polysemes. For
+example, grounds = ground + s acquires the sense of ‘dregs’ and
+‘estate lands’, loses the sense of ‘earth’, and shares the sense of
+‘basis’.
+
+
+
+Consider the conflation of mobility with mobile. mobile has
+acquired two new polysemes not shared with mobility. One is the ‘mobile
+art object’, common in the nursery. This arrived in the 1960s, and is
+still in use. The other is the ‘mobile phone’ which is now very dominant,
+although it may decline in the future when it has been replaced by some new
+gadget with a different name. We might draw a graph of the degree of
+separation of the meanings of mobility and mobile against time,
+which would depend upon the number of polysemes and the intensity of their
+use. What seemed like a valid conflation of the two words in 1940 may seem
+to be invalid today.
+
+
+
+In general therefore one can say that judgements about whether words are
+over-stemmed change with time as the meanings of words in the language
+change.
+
+
+
+The use of a dictionary should reduce errors of mis-stemming and errors of
+over-stemming. And, for English at least, the mis-stemming errors should
+be much reduced, even if there are problems with over-stemming errors. Of
+course, it depends on the quality of the dictionary. A dictionary will need
+to be very comprehensive, fully up-to-date, and with good word definitions
+to achieve the best results.
+
+
+
+Historically, stemmers have often been thought of as either
+dictionary-based or algorithmic. The presentation of studies of stemming
+in the literature has perhaps helped to create this division. In the
+Lovins stemmer the algorithmic description is central. In accounts of
+dictionary-based stemmers the emphasis tends to be on dictionary content
+and structure, and IR effectiveness. Savoy’s French stemmer (Savoy, 1993)
+is a good example of this. But the two approaches are not really distinct.
+An algorithmic stemmer can include long exception lists that are
+effectively mini-dictionaries, and a dictionary-based stemmer usually
+needs a process for removing at least i-suffixes to make the look-up
+in the dictionary possible. In fact in a language in which proper names
+are inflected (Latin, Finnish, Russian ...), a dictionary-based stemmer
+will need to remove i-suffixes independently of dictionary look-up,
+because the proper names will not of course be in the dictionary.
+
+
+
+The stemmers available on the Snowball website are all purely
+algorithmic. They can be extended to include built-in exception lists, they
+could be used in combination with a full dictionary, but they are still
+presented here in their simplest possible form. Being purely algorithmic,
+they are, or ought to be, inferior to the performance of well-constructed
+dictionary-based stemmers. But they are still very useful, for the
+following reasons:
+
+
+
+
Algorithmic stemmers are (or can be made) very lean and very fast. The
+stemmers presented here generate code that will process about a million
+words in six seconds on a conventional 500MHz PC. Nowadays we can generate
+very large IR systems with quite modest resources, and tools that assist in
+this have value.
+
+
+
Despite the errors they can be seen to make, algorithmic stemmers still
+give good practical results. As Krovetz (1995) asks in surprise of the
+algorithmic stemmer, ‘Why does it do so well?’ (page 89).
+
+
+
Dictionary-based stemmers require dictionary maintenance, to keep up
+with an ever-changing language, and this is actually quite a problem. It
+is not just that a dictionary created to assist stemming today will
+probably require major updating in a few years’ time, but that a dictionary
+in use for this purpose today may already be several years out of date.
+
+
+
+
+We can hazard an answer to Krovetz’s question, as to why algorithmic
+stemmers perform as well as they do, when they reveal so many cases of
+under-, over- and mis-stemming. Under-stemming is a fault, but by itself
+it will not degrade the performance of an IR system. Because of
+under-stemming, words that ought to conflate may fail
+to do so, but you are, in a sense, no
+worse off than you were before. Mis-stemming is more serious, but again
+mis-stemming does not really matter unless it leads to false conflations,
+and that frequently does not happen. For example, removing the ate
+ending in English can result in useful conflations (luxury,
+luxuriate; affection, affectionate), but very often produces
+stems that are not English words
+(enerv-ate, accommod-ate,
+deliber-ate etc). In the literature, these are normally
+classed as stemming errors — overstemming — although in our nomenclature
+they are examples of mis-stemming.
+However these residual stems,
+enerv, accommod,
+deliber ... do not conflate with other word forms, and so behave in
+an IR system in the same way as if they still retained their ate
+ending. No false conflations arise, and so there is no over-stemming here.
+
+
+
+To summarise, one can say that just as a word can be over-stemmed
+but not mis-stemmed (relativity → relative), so it can be
+mis-stemmed but not over-stemmed (enervate → enerv). And, of
+course, even over-stemming does not matter, if the over-stemmed word falsely
+conflates with other words that exist in the language, but are not
+encountered in the IR
+system which is being used.
+
+
+
+Of the three types of error,
+over-stemming is the most important, and
+using a dictionary does not eliminate all over-stemmings, but does reduce their
+incidence.
+
+
+
4 Stemming as part of an indexing process
+
+
+Stemming is part of a composite process of extracting words from text and
+turning them into index terms in an IR system. Because stemming is somewhat
+complex and specialised, it is usually studied in isolation. Even so, it
+cannot really be separated from other aspects of the indexing process:
+
+
+
+
What is a word? For indexing purposes, a word in a European language is
+a sequence of letters bounded by non-letters. But in English, an internal
+apostrophe does not split a word, although it is not classed as a letter.
+The treatment of these word boundary characters affects the stemmer. For
+example, the Kraaij-Pohlmann stemmer for Dutch (Kraaij, 1994, 1995) removes hyphens and
+treats apostrophe as part of the alphabet (so ’s, ’tje and ’je are three
+of their endings). The Dutch stemmer presented here assumes hyphen and
+apostrophe have already been removed from the word to be stemmed.
+
+
+
What is a letter? Clearly letters define words, but different languages
+use different letters, much confusion coming from the varied use of
+accented Roman letters.
+
+
+
+English speakers, perhaps influenced by the ASCII character set, typically regard
+their alphabet of a to z as the norm, and other forms (for example, Danish
+å and ø, or German ß) as somewhat abnormal. But this is
+an insular point of view. In Italian, for example, the letters
+j, k, w, x and y are not part of the alphabet, and are
+only seen in foreign words. We also tend to regard other alphabets as only
+used for isolated languages, and that is not strictly true. Cyrillic is
+used for a range of languages other than Russian, among which additional
+letters and accented forms abound.
+
+
+
+In English, a broad definition of letter would be anything that could be
+accepted as a pronounceable element of a word. This would include
+accented Roman letters (naïve, Fauré), and certain ligature
+forms (encyclopædia). It would exclude letters
+of foreign alphabets, such as Greek and Cyrillic.
+The a to z alphabet is one of those where letters come in
+two styles, upper and lower case, which historically correspond (very roughly) to the
+shapes you get if you use a chisel or a pen. Across all languages, the
+exact relation of upper to lower case is not so easy to define. In Italian,
+for example, an accented lower case letter is sometimes represented in
+upper case by the unaccented letter followed by an apostrophe. (I have
+seen this convention used in modern Italian news stories in machine
+readable form.)
+
+
+
+In fact the Porter stemmer (which is for English) assumes the word being stemmed is
+unaccented and in lower case. More exactly, a, e, i, o,
+u,
+and sometimes y, are
+treated as vowels, and any other character gets treated as a consonant.
+Each stemmer presented here assumes some degree of normalisation before it
+receives the word, which is roughly (a) put all letters into lower case,
+and (b) remove accents from letter-accent combinations that do not form
+part of the alphabet of the language. Each stemmer declares the
+letter-accent combinations for its language, and this can be used as a
+guide for the normalisation, but even so, we can see from
+the discussion above that (a) and (b) are not trivial
+operations, and need to be done with care.
+
+
+
+(Incidentally, because the stemmers work on lower case words, turning
+letters to upper case is sometimes used internally for flagging purposes.)
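The normalisation steps (a) and (b) above can be sketched for the English case, where no letter-accent combination belongs to the alphabet (the function name `normalise` is ours, not part of any stemmer):

```python
import unicodedata

def normalise(word: str) -> str:
    """A rough sketch of pre-stemming normalisation for English:
    (a) lower-case the word, then (b) strip accents by Unicode
    decomposition. For a language whose alphabet declares some
    letter-accent combinations, those would have to be preserved."""
    lowered = word.lower()
    # NFD splits a character like é into e plus a combining accent mark.
    decomposed = unicodedata.normalize("NFD", lowered)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# e.g. normalise("naïve") == "naive", normalise("Fauré") == "faure"
```

As the text warns, a real normaliser must be more careful than this, for example with ligatures such as æ, which are not accent combinations at all.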
+
+
+
Identifying stopwords. Invariant stopwords are more easily found before
+stemming is applied, but inflecting stopwords (for example, German kein, keine, keinem,
+keinen ... ) may be easier to find after — because there are fewer forms.
+There is a case for building stopword identification into the stemming
+process. See section 7.
+
+
+
Conflating irregular forms. More will be said on this in section 6.
+
+
+
+
5 The use of stemmed words
+
+
+The idea of how stemmed words might be employed in an IR system has
+evolved slightly over the years. The Lovins stemmer (Lovins 1968) was
+developed not for indexing document texts, but the subject terms attached
+to them. With queries stemmed in the same way, the user needed no special
+knowledge of the form of the subject terms. Rijsbergen (1979, Chapter 2)
+assumes document text analysis: stopwords are removed, the remaining words
+are stemmed, and the resulting set of stemmed words constitutes the IR index
+(and this style of use is widespread today). More flexibility however is
+obtained by indexing all words in a text in an unstemmed form, and
+keeping a separate two-column relation which connects the words to their
+stemmed equivalents. The relation can be denoted by R(s, w), which means
+that s is the stemmed form of word w. From the relation we can get, for
+any word w, its unique stemmed form, stem(w), and for any stem s, the set
+of words, words(s), that stem to s.
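The relation R(s, w) can be held as two simple mappings. A minimal sketch, using a stand-in stemmer that merely strips a final s (any real stemmer could be substituted; all names here are illustrative):

```python
from collections import defaultdict

def stem_word(w: str) -> str:
    # Stand-in stemmer for illustration only.
    return w[:-1] if w.endswith("s") else w

def build_relation(vocabulary):
    """Index every word unstemmed, but keep the two-column relation
    R(s, w) linking each stem s to the set words(s)."""
    stem_of = {}                  # stem(w): word -> its unique stem
    words_of = defaultdict(set)   # words(s): stem -> words stemming to it
    for w in vocabulary:
        s = stem_word(w)
        stem_of[w] = s
        words_of[s].add(w)
    return stem_of, words_of

stem_of, words_of = build_relation(["cat", "cats", "dog", "dogs"])
# stem_of["cats"] == "cat"; words_of["cat"] == {"cat", "cats"}
```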
+
+
+
+The user should not have to see the stemmed form of a word. If a list of
+stems is to be presented back for query expansion, in place of
+a stem, s, the user should be shown a single representative from the set
+words(s), the one of highest frequency perhaps. The user should also
+be able to choose for the whole query, or at a lower level for each word
+in a query, whether or not it should be stemmed. In the absence of such
+choices, the system can make its own
+decisions.
+Perhaps single word queries would not undergo
+stemming; long queries would; stopwords would be removed
+except in phrases. In query expansion, the system would work with stemmed
+forms, ignoring stopwords.
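Picking the representative to display in place of a stem can be sketched as follows; `words_of` (the relation words(s)) and `freq` (a word-frequency table) are assumed inputs, and the names are ours:

```python
def representative(s, words_of, freq):
    """Show the user the highest-frequency member of words(s)
    rather than the stem s itself."""
    return max(words_of[s], key=lambda w: freq.get(w, 0))

words_of = {"connect": {"connect", "connected", "connection"}}
freq = {"connect": 10, "connected": 25, "connection": 40}
# representative("connect", words_of, freq) == "connection"
```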
+
+
+
+Query expansion with stemming results in a much cleaner vocabulary list
+than without, and this is a main strength of using a stemming process.
+
+
+
+A question arises: if the user never sees the stemmed form, does its
+appearance matter? The answer must be no, although
+the Porter stemmer tries to make the unstemmed forms guessable from the stemmed
+forms. For example, from appropri you can guess appropriate. At least,
+trying to achieve this effect acts as a useful control. Similarly with the
+other stemmers presented here, an attempt has been made to keep the
+appearance of the stemmed forms as familiar as possible.
+
+
+
6 Irregular grammatical forms
+
+
+All languages contain irregularities, but to what extent should they be
+accommodated in a stemming algorithm? An English stemmer, for example, can
+convert regular plurals to singular form without difficulty (boys, girls,
+hands ...). Should it do the same with irregular plurals (men, children,
+feet, ...)? Here we have irregular cases with i-suffixes, but there are
+irregularities with d-suffixes, which Lovins calls ‘spelling exceptions’.
+absorb/absorption and conceive/conception are examples of this.
+Etymologically, the explanation of the first is that the Latin root,
+sorbere, is an irregular verb, and of the second that the word
+conceive comes to us from the French rather than straight from the Latin.
+It is interesting that, even with no knowledge of the etymology, we do
+recognise the connection between the words.
+
+
+
+Lovins tries to solve spelling exceptions by formulating general respelling
+rules (turn rpt into rb for example), but it might be easier to have
+simply a list of exceptional stems.
+
+
+
+The Porter stemmer does not handle irregularities at all, but from the
+author’s own experience, this has never been an area of complaint.
+Complaints in fact are always about false conflations, for example new
+and news.
+
+
+
+Possibly Lovins was right in wanting to resolve d-suffix irregularities,
+and not being concerned about i-suffix irregularities. i-suffix
+irregularities in English go with short, old words, that are either in very
+common use (man/men, woman/women, see/saw ...) or are used only rarely
+(ox/oxen, louse/lice, forsake/forsook ...). The latter class can be
+ignored, and the former has its own problems which are not always solved
+by stemming. For example man is a verb, and saw can mean a cutting
+instrument, or, as a verb, can mean to use such an instrument. Conflation
+of these forms therefore frequently leads to errors akin to mis-stemming.
+
+
+
+An algorithmic stemmer really needs holes where the irregular forms can be
+plugged in as necessary. This is more serviceable than attempting to
+embed special lists of these irregular forms into software.
+
+
+
7 Stopwords
+
+
+We have suggested that stemming stopwords is not useful. There is a
+grammatical connection between being and be, but conflation of the two
+forms has little use in IR because they have no shared meaning that would
+entitle us to think of them as synonyms. being and be have a
+morphological connection as well, but that is not true of am and was,
+although they have a grammatical connection. Generally speaking,
+inflectional stopwords exhibit many irregularities, which means that
+stemming is not only not useful, but not possible, unless one builds into
+the stemmer tables of exceptions.
+
+
+
+Switching from English to French, consider être, the equivalent form
+of be. It has about 40 different forms, including,
+
+
+
+ suis es sommes serez étaient fus furent sois été
+
+
+
+(and suis incidentally is a homonym, as part of the verb suivre.)
+Passing all forms through a rule-based stemmer creates something of a
+mess. An alternative approach is to recognise this group of words, and
+other groups, and take special action. The recognition could take place
+inside the stemmer, or be done before the stemmer is called. One special
+action would be to stem (perhaps one should say ‘map’) all the forms to a
+standard form, ETRE, to indicate that they are parts of the verb être.
+Deciding what to do with the term ETRE, and it would probably be to
+discard it, would be done outside the stemming process. Another special
+action would be to recognize a whole class of stopwords and simply discard
+them.
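The first special action can be sketched like this; the form list is illustrative, not the full forty, and the function name is ours:

```python
# Map every recognised form of French "être" to the standard marker ETRE
# before the rule-based stemmer runs. Illustrative subset of forms only.
ETRE_FORMS = {"suis", "es", "est", "sommes", "serez",
              "étaient", "fus", "furent", "sois", "été"}

def stem_with_stopword_groups(word, rule_stemmer):
    if word in ETRE_FORMS:
        return "ETRE"   # downstream code decides what to do with it,
                        # probably discarding it
    return rule_stemmer(word)

# e.g. stem_with_stopword_groups("furent", lambda w: w) == "ETRE"
```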
+
+
+
+The strategy adopted will depend upon the underlying IR model, so what one
+needs is the flexibility to create modified forms of a standard stemmer.
+Usually we present Snowball stemmers in their unadorned form. Thereafter,
+the addition of stopword tables is quite easy.
+
+
+
8 Rare forms
+
+
+Stemmers do not need to handle linguistic forms that turn up only very
+rarely, but in practice it is hard to design a stemmer with all rare forms
+eliminated without there appearing to be some gaps in the thinking. For
+this reason one should not worry too much about their occasional presence.
+For example, in contemporary Portuguese, use of the second person plural
+form of verbs has almost completely disappeared. Even so, endings for
+those forms are included in the Portuguese stemmer. They appear in all the
+grammar books, and will in any case be found in older texts. The habit of
+putting in rare forms to ‘complete the picture’ is well established, and
+usually passes unnoticed. An example is the list of English stopwords in
+van Rijsbergen (1979). This includes yourselves, by analogy with
+himself, herself etc., although yourselves is actually quite a rare
+word in English.
+
+
+
References
+
+
+Farber DJ, Griswold RE and Polonsky IP (1964) SNOBOL, a string manipulation
+language. Journal of the Association for Computing Machinery, 11: 21-30.
+
+
+
+Griswold RE, Poage JF and Polonsky IP (1968) The SNOBOL4 programming
+language. Prentice-Hall, New Jersey.
+
+
+
+Harman D (1991) How effective is suffixing? Journal of the American
+Society for Information Science, 42: 7-15.
+
+
+
+Jesperson O (1922) Language, its nature, origin and development. George
+Allen & Unwin, London.
+
+
+
+Kraaij W and Pohlmann R (1994) Porter’s stemming algorithm for Dutch. In
+Noordman LGM and de Vroomen WAM, eds. Informatiewetenschap 1994:
+Wetenschappelijke bijdragen aan de derde STINFON Conferentie, Tilburg,
+1994. pp. 167-180.
+
+
+
+Kraaij W and Pohlmann R (1995) Evaluation of a Dutch stemming algorithm.
+In Rowley J, ed. The New Review of Document and Text Management, volume 1,
+Taylor Graham, London, 1995. pp. 25-43.
+
+
+
+Krovetz R (1995) Word sense disambiguation for large text databases. PhD
+Thesis. Department of Computer Science, University of Massachusetts
+Amherst.
+
+
+
+Lennon M, Pierce DS, Tarry BD and Willett P (1981) An evaluation of some
+conflation algorithms for information retrieval. Journal of Information
+Science, 3: 177-183.
+
+
+
+Lovins JB (1968) Development of a stemming algorithm. Mechanical
+Translation and Computational Linguistics, 11: 22-31.
+
+
+
+Palmer FR (1965) A linguistic study of the English verb. Longmans, London.
+
+
+
+Popovic M and Willett P (1990) Processing of documents and queries in a
+Slovene language free text retrieval system. Literary and Linguistic
+Computing, 5: 182-190.
+
+
+
+Porter MF (1980) An algorithm for suffix stripping. Program, 14: 130-137.
+
+
+
+Rijsbergen CJ (1979) Information retrieval. Second edition. Butterworths,
+London.
+
+
+
+Savoy J (1993) Stemming of French words based on grammatical categories.
+Journal of the American Society for Information Science, 44: 1-9.
+
+
+
+Schinke R, Greengrass M, Robertson AM and Willett P (1996) A stemming
+algorithm for Latin text databases. Journal of Documentation, 52:
+172-187.
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/texts/introduction.tt b/texts/introduction.tt
new file mode 100644
index 0000000..16b7727
--- /dev/null
+++ b/texts/introduction.tt
@@ -0,0 +1,923 @@
+[% header('Snowball: A language for stemming algorithms') %]
+
+
+ Algorithmic stemmers continue to have great utility in IR, despite the
+ promise of out-performance by dictionary-based stemmers. Nevertheless,
+ there are few algorithmic descriptions of stemmers, and even when they
+ exist they are liable to misinterpretation. Here we look at the ideas
+ underlying stemming, and on this website define a language, Snowball,
+ in which stemmers can be exactly defined, and from which fast stemmer
+ programs in ANSI C or Java can be generated. A range of stemmers is presented
+ in parallel algorithmic and Snowball form, including the original
+ Porter stemmer for English.
+
+
+
1 Introduction
+
+
+There are two main reasons for creating Snowball. One is the lack of
+readily available stemming algorithms for languages
+other than English. The other is the consciousness of a certain failure on
+my part in promoting exact implementations of the stemming
+algorithm described in (Porter 1980), which has come to be called the
+Porter stemming algorithm. The first point needs some qualification: a
+great deal of work has been done on stemmers in a wide range of natural
+languages, both in their development and evaluation (a complete
+bibliography cannot be attempted here). But it is rare to see a stemmer
+laid out in an unambiguous algorithmic form from which encodings in C,
+Java, Perl etc might easily be made. When exact descriptions are
+attempted, it is often with approaches to stemming that are
+relatively simple, for example the Latin stemmer of Schinke (Schinke 1996),
+or the Slovene stemmer of Popovic (Popovic 1990). A more complex, and
+therefore more characteristic stemmer is the Kraaij-Pohlmann stemmer for
+Dutch (Kraaij 1994), which is presented as open source code in ANSI C. To
+extract an algorithmic description of their stemmer from the source code
+proves to be quite hard.
+
+
+
+The disparity between the Porter stemmer definition and many of its
+purported implementations is much wider than is generally realised in the
+IR community. Three problems seem to compound: one is a misunderstanding
+of the meaning of the original algorithm, another is bugs in the
+encodings, and a third is the almost irresistible urge of programmers
+to add improvements.
+
+
+
+For example, a Perl script advertised on the Web as an
+implementation of the Porter algorithm was tested in October 2001, and it was
+found that 14 percent of words were stemmed incorrectly when given a large sample
+vocabulary. Most words of English have
+very simple endings, so this means that it was effectively getting everything
+wrong. At certain points on the Web are demonstrations of the Porter stemmer.
+You type some English into a box and the stemmed words are displayed. These
+are frequently faulty. (A good test is to type in agreement. It should stem
+to agreement — the same word. If it stems to agreem there is an
+error.) Researchers frequently pick up faulty versions of the stemmer and
+report that they have applied ‘Porter stemming’, with the result that their
+experiments are not quite repeatable. Researchers who work on stemming will
+sometimes give incorrect examples of the behaviour of the Porter stemmer in
+their published works.
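The agreement test works because, in the published Porter (1980) algorithm, the ment ending is removed only when the measure m of the preceding stem exceeds 1. A sketch of computing m, the number of vowel-to-consonant transitions in the [C](VC)^m[V] decomposition (the function name is ours, and this is only the measure, not the whole stemmer):

```python
def measure(stem: str) -> int:
    """Loosely following Porter (1980): count VC crossings, treating y
    as a vowel when it follows a consonant (but not word-initially)."""
    m = 0
    prev_vowel = False
    for i, ch in enumerate(stem):
        is_v = ch in "aeiou" or (ch == "y" and i > 0 and not prev_vowel)
        if prev_vowel and not is_v:
            m += 1
        prev_vowel = is_v
    return m

# "agree" has m == 1, so the (m > 1) condition for removing -ment fails,
# and agreement correctly stems to agreement.
```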
+
+
+
+To address all these problems I have tried to develop a rigorous system
+for defining stemming algorithms. A language, Snowball, has been invented,
+in which the rules of stemming algorithms can be expressed in a natural
+way. Snowball is quite small, and can be learned by an experienced
+programmer in an hour or so. On this website a number of foreign language
+stemmers are presented (a) in Snowball, and (b) in a less formal
+English-language description. (b) can be thought of as the program
+comments for (a). A Snowball compiler translates each Snowball
+definition into (c) an equivalent program in ANSI C or Java. Finally (d)
+standard vocabularies of words and their stemmed equivalents are provided
+for each stemmer. The combination of (a), (b), (c) and (d)
+can be used to pin down the definition of a stemmer exactly, and it is
+hoped that Snowball itself will be a useful resource in creating stemmers
+in the future.
+
+
+
2 Some ideas underlying stemming
+
+
+Work in stemming has produced a number of different approaches, albeit tied
+together by a number of common assumptions. It is worthwhile looking at some
+of them to see exactly where Snowball fits into the whole picture.
+
+
+
+A point tacitly assumed in almost all of the stemming literature is that
+stemmers are based upon the written, and not the spoken, form of the
+language. This is also the assumption here. Historically,
+grammarians often regarded the written language as the real language and
+the spoken as a mere derivative form. Almost in reaction, many modern
+linguists have taken a precisely opposite view (Palmer, 1965 pp 2-3). A
+more balanced position is that the two languages are distinct though
+connected, and require separate treatment. One can in fact imagine parallel
+stemming algorithms for the spoken language, or rather for the phoneme
+sequence into which the spoken language is transformed. Stress and
+intonation could be used as clues for an indexing process in the same way
+that punctuation and capitalisation are used as clues in the written
+language. But currently stemmers work on the written language for the good
+reason that there is so much of it available in machine readable form from
+which to build our IR systems. Inevitably therefore the stemmers get
+caught up in accidental details of orthography. In English, removing the
+ing from rotting should be followed by undoubling the tt,
+whereas in rolling we do not undouble the ll. In French, removing
+the er from ennuyer should be followed by changing the y to
+i, so that the resulting word conflates with ennui, and so on.
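The English detail can be sketched as follows, loosely following the undoubling condition of the Porter stemmer's step 1b (undouble a final double consonant unless it is ll, ss or zz); this is a toy fragment, not the full rule set, and the function name is ours:

```python
def strip_ing(word: str) -> str:
    """Remove -ing, then undouble a final double consonant
    except when it is ll, ss or zz."""
    if not word.endswith("ing"):
        return word
    stem = word[:-3]
    # Porter requires a vowel in the remaining stem (so "sing" is left alone).
    if not any(c in "aeiouy" for c in stem):
        return word
    if (len(stem) >= 2 and stem[-1] == stem[-2]
            and stem[-1] not in "aeioulsz"):
        stem = stem[:-1]
    return stem

# strip_ing("rotting") == "rot", but strip_ing("rolling") == "roll"
```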
+
+
+
+The idea of stemming is to improve IR performance generally by bringing
+under one heading variant forms of a word which share a common meaning.
+Harman (1991) was the first to present compelling evidence that it may not do
+so, when her experiments discovered no significant improvement with the
+use of stemming.
+Similarly Lennon (1981) discovered no appreciable difference between different
+stemmers running on a constant collection.
+Later work has modified this position however. Krovetz
+(1995) found significant, although sometimes small, improvements across a
+range of test collections. What he did discover is that the degree of
+improvement varies considerably between different collections.
+These tests were however done on collections in
+English, and the reasonable assumption of IR researchers has always been that for
+languages that are more highly inflected than English (and nearly all
+are), greater improvements will be observed when stemming is applied. My
+own view is that stemming helps regularise the
+vocabulary of an IR system, and this leads to advantages that are not
+easily quantifiable through standard IR experiments. For example, it helps
+in presenting lists of terms associated with the query back to the IR user
+in a relevance feedback cycle, which is one of the underlying ideas of the
+probabilistic model. More will be said on the use of a stemmed vocabulary
+in section 5.
+
+
+
+Stemming is not a concept applicable to all languages. It is not, for
+example, applicable in Chinese. But to languages of the Indo-European (*)
+group (and most of the stemmers on this site are for Indo-European
+languages), a common
+pattern of word structure does emerge. Assuming words are written left to
+right, the stem, or root of a word is on the left, and zero or more
+suffixes may be added on the right. If the root is modified by this
+process it will normally be at its right hand end. And also prefixes may
+be added on the left. So unhappiness has a prefix un, a suffix
+ness, and the y of happy has become i with the addition of
+the suffix. Usually, prefixes alter meaning radically, so they are best
+left in place (German and Dutch ge is an exception here). But suffixes
+can, in certain circumstances, be removed. So for example happy and
+happiness have closely related meanings, and we may wish to stem both
+forms to happy, or happi. Infixes can occur, although rarely:
+ge in German and Dutch, and zu in German.
+
+
+
+One can make some distinction between root and stem. Lovins (1968)
+sees the root as the stem minus any prefixes. But here we will
+think of the stem as the residue of the stemming process, and the root as the
+inner word from which the stemmed word derives, so we think of root to
+some extent in an etymological way. It must be admitted that when you
+start thinking hard about these concepts root, stem, suffix,
+prefix ... they turn out to be very difficult indeed to define.
+Nor do definitions, even if we arrive at them, help us much. After all, suffix
+stripping is a practical aid in IR, not an exercise in linguistics or
+etymology. This is especially true of the central concept of root. We
+think of the etymological root of a word as something we can discover with
+certainty from a dictionary, forgetting that etymology itself is a subject
+with its own doubts and controversies (Jesperson 1922, Chapter XVI).
+Indeed, Jesperson goes so far as to say that
+
+
+
+
+ ‘It is of course impossible to say how great a proportion of the
+ etymologies given in dictionaries should strictly be classed under
+ each of the following heads: (1) certain, (2) probable, (3)
+ possible, (4) improbable, (5) impossible — but I am afraid the
+ first two classes would be the least numerous.’
+
+
+
+
+Here we will simply assume a common sense understanding of
+the basic idea of stem and suffix, and hope that this proves sufficient
+for designing and discussing stemming algorithms.
+
+
+
+We can separate suffixes out into three basic classes, which will be
+called d-, i- and a-suffixes.
+
+
+
+An a-suffix, or attached suffix, is a particle word attached to another
+word. (In the stemming literature they sometimes get referred to as
+‘enclitics’.) In Italian, for example, personal pronouns attach to
+certain verb forms:
+
+
+
+
  mandargli     =  mandare + gli       =  to send + to him
  mandarglielo  =  mandare + gli + lo  =  to send + it + to him
+
+
+
+a-suffixes appear in Italian and Spanish, and also in Portuguese, although
+in Portuguese they are separated by hyphen from the preceding word, which
+makes them easy to eliminate.
+
+
+
+An i-suffix, or inflectional suffix, forms part of the basic grammar of a
+language, and is applicable to all words of a certain grammatical type,
+with perhaps a small number of exceptions. In English for example, the past
+of a verb is formed by adding ed. Certain modifications may be required
+in the stem:
+
+
+
+
  fit + ed    →  fitted   (double t)
  love + ed   →  loved    (drop the final e of love)
+
+
+
+but otherwise the rule applies in a regular way to all verbs in
+contemporary English, with about 150 (Palmer, 1965) exceptional forms,
+
+
+
+
  bear    →  bore
  beat    →  beat
  become  →  became
  begin   →  began
  bend    →  bent
  ....
+
+
+
+A d-suffix, or derivational suffix, enables a new word, often with a
+different grammatical category, or with a different sense, to be built from
+another word. Whether a d-suffix can be attached is discovered not from
+the rules of grammar, but by referring to a dictionary. So in English,
+ness can be added to certain adjectives to form corresponding nouns
+(littleness, kindness, foolishness ...) but not to all adjectives (not for
+example, to big, cruel, wise ...). d-suffixes can be used to change
+meaning, often in rather exotic ways. So in Italian astro means a sham
+form of something else:
+
+
+
+
  medico + astro  =  medicastro  =  quack doctor
  poeta + astro   =  poetastro   =  poetaster
+
+
+
+Generally i-suffixes follow d-suffixes. i-suffixes can precede d-suffixes,
+for example lovingly, devotedness, but such cases are exceptional. To
+be a little more precise, d-suffixes can sometimes be added to
+participles. devoted, used adjectivally, is a participle derived from the
+verb devote, and ly can be added to turn the adjective into an adverb,
+or ness to turn it into a noun. The same feature occurs in other
+Indo-European languages.
+
+
+
+Sometimes it is hard to say whether a suffix is a d-suffix or i-suffix,
+the comparative and superlative endings er, est of English for example.
+
+
+
+A d-suffix can serve more than one function. In English, for example,
+ly standardly turns an adjective into an adverb (greatly), but it
+can also turn a noun into an adjective (kingly). In French, ement
+also standardly turns an adjective into an adverb (grandement), but it
+can also turn a verb into a noun (rapprochement). (Referring to the
+French stemmer, this double use is ultimately why ement is tested for
+being in the RV rather than the R2 region of the word being
+stemmed.)
+
+
+
+It is quite common for an i-suffix to serve more than one function.
+In English, s can either be (1) a verb ending attached to third person
+singular forms (runs, sings), (2) a noun ending indicating the plural
+(dogs, cats) or (3) a noun ending indicating the possessive
+(boy’s, girls’). By an orthographic convention now several hundred
+years old, the possessive is written with an apostrophe, but
+nowadays this is
+frequently omitted in familiar phrases (a girls school). (Usage (3) is
+relatively rare compared with (1) and (2): there are only nine uses of
+’s in this document.)
+
+
+
+Since the normal order of suffixes is d, i and a, we
+can expect them to be removed
+from the right in the order a, i and d. Usually we want to remove
+all a- and i-suffixes, and some of the d-suffixes.
+
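The removal order just described (a-suffixes, then i-suffixes, then d-suffixes, working from the right) can be sketched as a toy pipeline. The suffix lists below are tiny invented samples for illustration, not those of any real stemmer:

```python
# Toy illustration of stripping suffixes from the right in the order
# a, i, d. The suffix lists are invented samples, not a real stemmer.
A_SUFFIXES = ["gli", "lo"]        # attached particles (Italian-style)
I_SUFFIXES = ["ed", "ing", "s"]   # inflectional endings (English-style)
D_SUFFIXES = ["ness", "ly"]       # derivational endings (English-style)

def strip_longest(word, suffixes):
    """Remove the longest matching suffix, leaving a plausible stem."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[:-len(suf)]
    return word

def stem(word):
    # Suffixes come off from the right: first a-, then i-, then d-suffixes.
    for group in (A_SUFFIXES, I_SUFFIXES, D_SUFFIXES):
        word = strip_longest(word, group)
    return word

print(stem("mandargli"))  # mandar
print(stem("greatly"))    # great
```

Note that at most one suffix is removed from each class, in keeping with the usual d-i-a ordering of attachment.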
+
+
+If the stemming process reduces two words to the same stem, they are said
+to be conflated.
+
+
+
3 Stemming errors, and the use of dictionaries
+
+
+One way of thinking of the relation between terms and documents in an IR
+system is to see the documents as being about concepts, and the terms as
+words that describe the concepts. Then, of course, one word can cover many
+concepts, so pound can mean a unit of currency, a weight, an enclosure,
+or a beating. Pound is a homonym. And one concept can be described by
+many words, as with money, capital, cash, currency. These words
+are synonyms. There is a many-many mapping therefore between the set of
+terms and the set of concepts. Stemming is a process that transforms this
+mapping to advantage, on the whole reducing the number of synonyms, but
+occasionally creating new homonyms. It is worth remembering that what are
+called stemming errors are usually just the introduction of new homonyms into
+vocabularies that already contain very large numbers of homonyms.
+
+
+
+Words which have no place in this term-concept mapping are those which
+describe no concepts. The particle words of grammar, the, of,
+and
+..., known in IR as stopwords, fall into this category. Stopwords can be
+useful for retrieval but only in searching for phrases, ‘to be or not to
+be’, ‘do as you would be done by’ etc. This suggests that stemming
+stopwords is not useful. More will be said on stopwords in section 7.
+
+
+
+In the literature, a distinction is often made between
+under-stemming, which is the error of taking off too small a suffix, and
+over-stemming, which is the error of taking off too much. In French, for
+example, croûtons is the plural of croûton, ‘a crust’, so to remove
+ons would be over-stemming, while croulons is a verb form of crouler,
+‘to totter’, so to remove s would be under-stemming. We would like to
+introduce a further distinction between mis-stemming and over-stemming.
+Mis-stemming is taking off what looks like an ending, but is really part
+of the stem. Over-stemming is taking off a true ending which results in
+the conflation of words of different meanings.
+
+
+
+So for example ly can be removed from cheaply, but not from reply,
+because in reply the ly is not a suffix. If it were removed, reply would
+conflate with rep (the commonly used short form of representative).
+Here we have a case of mis-stemming.
+
+
+
+To illustrate over-stemming, look at these four words,
+
+
+
+
                verb    adjective
  First pair:   prove   provable
  Second pair:  probe   probable
+
+
+
+Morphologically, the two pairs are exactly parallel (in the written, if not
+the spoken language). They also have a common etymology. All four words
+derive from the Latin probare, ‘to prove or to test’, and the idea of
+testing connects the meanings of the words. But the meanings are not parallel.
+provable means ‘able to be proved’; probable does not mean ‘able to be
+probed’. Most people would judge conflation of the first pair as correct,
+and of the second pair, incorrect. In other words, to remove able from
+probable is a case of over-stemming.
+
+
+
+We can try to avoid mis-stemming and over-stemming by using a dictionary.
+The dictionary can tell us that reply does not derive from rep, and
+that the meanings of probe and probable are well separated in modern
+English. It is important to realise however that a dictionary does not give
+a complete solution here, but can be a tool to improve the conflation
+process.
+
+
+
+In Krovetz’s dictionary experiments (Krovetz 1995), he noted that in
+looking up a past participle like suited, one is led either to suit or
+to suite as plausible infinitive forms. suite can be rejected,
+however, because the dictionary tells us that
+although it is a word of English
+it is not a verb form. Cases
+like this (and Krovetz found about 60) had to be treated as exceptions. But
+the form routed could
+either derive from the verb rout or the verb route:
+
+
+
+ At Waterloo Napoleon’s forces were routed
+ The cars were routed off the motorway
+
+
+
+Such cases in English are extremely rare, but they are commoner in more
+highly inflected languages. In French for example, affiliez can either be
+the verb affiler, to sharpen, with imperfect ending iez, or the verb
+affilier, to affiliate, with present indicative ending ez:
+
+
+
+
  vous affiliez  =  vous affil-iez  =  you sharpened
  vous affiliez  =  vous affili-ez  =  you affiliate
+
+
+
+If the second is intended, removal of iez is mis-stemming.
+
+
+
+With over-stemming we must rely upon the dictionary to separate meanings.
+There are different ways of doing this, but all involve some degree of
+reliance upon the lexicographers. Krovetz’s methods are no doubt best,
+because the most objective: he uses several measures, but they are based on
+the idea of measuring the similarity in
+meaning of two words by the degree of overlap among the words used to define
+them, and this is at a good remove from a lexicographer’s subjective
+judgement about semantic similarity.
+
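Krovetz's overlap idea can be sketched as below. The ‘definitions’ are invented placeholders standing in for real dictionary text, and the measure is a plain Jaccard coefficient rather than his exact formulation:

```python
# Judge the relatedness of two words by the overlap between the words
# used in their dictionary definitions (definitions invented here).
def overlap_similarity(def_a, def_b):
    a, b = set(def_a.split()), set(def_b.split())
    return len(a & b) / len(a | b)   # Jaccard coefficient

prove_def    = "establish the truth of something by evidence or argument"
provable_def = "able to be established as truth by evidence or argument"
probable_def = "likely to happen or to be the case"

# The related pair shares far more defining words than the unrelated one.
assert overlap_similarity(prove_def, provable_def) > \
       overlap_similarity(prove_def, probable_def)
```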
+
+
+There is an interesting difference between mis-stemming and over-stemming
+to do with language history. The morphology of a language changes less
+rapidly than the meanings of the words in it. When extended to include a
+few archaic endings, such as ick as an alternative to ic, a stemmer for
+contemporary English can be applied to the English of 300 years ago.
+Mis-stemmings will be roughly the same, but the pattern of over-stemming will
+be different because of the changing meaning of words in the language. For
+example, relativity in the 19th century merely meant ‘the condition of
+being relative to’. With that meaning, it is acceptable to conflate it
+with relative.
+But with the 20th century meaning brought to it by
+Einstein, stemming to relativ is over-stemming.
+Here we see the word with the suffix changing its meaning, but it can happen
+the other way round. transpire has come to mean ‘happen’, and its old
+meaning of ‘exhalation’ or ‘breathing out’ is now effectively lost.
+(That is the bitter reality, although dictionaries still try to persuade us
+otherwise). But transpiration still carries the earlier meaning.
+So what was formerly an acceptable stemming may be judged now as
+an over-stemming, not because the word being stemmed has changed its meaning,
+but because some cognate word has changed its meaning.
+
+
+
+In these examples we are presenting words as if they had single meanings, but
+the true picture is more complicated. Krovetz uses a model of word
+meanings which is extremely helpful here. He makes a distinction between
+homonyms and polysemes. The meanings of homonyms are quite unrelated.
+For example, ground in the sense of ‘earth’, and ‘ground’ as the past
+participle of ‘grind’ are homonyms. Etymologically homonyms have different
+stories, and they usually have separate entries in a dictionary. But each
+homonym form can have a range of polysemic forms, corresponding to different
+shades of meaning. So ground can mean the earth’s surface, or the bottom
+of the sea, or soil, or any base, and so the basis of an argument, and so on.
+Over time new polysemes appear and old ones die. At any moment, the use of a
+word will be common in some polysemic forms and rare in others. If a suffix is
+attached to a word the new word will get a different set of polysemes. For
+example, grounds = ground + s acquires the sense of ‘dregs’ and
+‘estate lands’, loses the sense of ‘earth’, and shares the sense of
+‘basis’.
+
+
+
+Consider the conflation of mobility with mobile. mobile has
+acquired two new polysemes not shared with mobility. One is the ‘mobile
+art object’, common in the nursery. This arrived in the 1960s, and is
+still in use. The other is the ‘mobile phone’ which is now very dominant,
+although it may decline in the future when it has been replaced by some new
+gadget with a different name. We might draw a graph of the degree of
+separation of the meanings of mobility and mobile against time,
+which would depend upon the number of polysemes and the intensity of their
+use. What seemed like a valid conflation of the two words in 1940 may seem
+to be invalid today.
+
+
+
+In general therefore one can say that judgements about whether words are
+over-stemmed change with time as the meanings of words in the language
+change.
+
+
+
+The use of a dictionary should reduce errors of mis-stemming and errors of
+over-stemming. And, for English at least, the mis-stemming errors should
+reduce well, even if there are problems with over-stemming errors. Of
+course, it depends on the quality of the dictionary. A dictionary will need
+to be very comprehensive, fully up-to-date, and with good word definitions
+to achieve the best results.
+
+
+
+Historically, stemmers have often been thought of as either
+dictionary-based or algorithmic. The presentation of studies of stemming
+in the literature has perhaps helped to create this division. In the
+Lovins’ stemmer the algorithmic description is central. In accounts of
+dictionary-based stemmers the emphasis tends to be on dictionary content
+and structure, and IR effectiveness. Savoy’s French stemmer (Savoy, 1993)
+is a good example of this. But the two approaches are not really distinct.
+An algorithmic stemmer can include long exception lists that are
+effectively mini-dictionaries, and a dictionary-based stemmer usually
+needs a process for removing at least i-suffixes to make the look-up
+in the dictionary possible. In fact in a language in which proper names
+are inflected (Latin, Finnish, Russian ...), a dictionary-based stemmer
+will need to remove i-suffixes independently of dictionary look-up,
+because the proper names will not of course be in the dictionary.
+
+
+
+The stemmers available on the Snowball website are all purely
+algorithmic. They can be extended to include built-in exception lists, they
+could be used in combination with a full dictionary, but they are still
+presented here in their simplest possible form. Being purely algorithmic,
+they are, or ought to be, inferior to the performance of well-constructed
+dictionary-based stemmers. But they are still very useful, for the
+following reasons:
+
+
+
+
Algorithmic stemmers are (or can be made) very lean and very fast. The
+stemmers presented here generate code that will process about a million
+words in six seconds on a conventional 500MHz PC. Nowadays we can generate
+very large IR systems with quite modest resources, and tools that assist in
+this have value.
+
+
+
Despite the errors they can be seen to make, algorithmic stemmers still
+give good practical results. As Krovetz (1995) says in surprise of the
+algorithmic stemmer, ‘Why does it do so well?’ (page 89).
+
+
+
Dictionary-based stemmers require dictionary maintenance, to keep up
+with an ever-changing language, and this is actually quite a problem. It
+is not just that a dictionary created to assist stemming today will
+probably require major updating in a few years’ time, but that a dictionary
+in use for this purpose today may already be several years out of date.
+
+
+
+
+We can hazard an answer to Krovetz’s question, as to why algorithmic
+stemmers perform as well as they do, when they reveal so many cases of
+under-, over- and mis-stemming. Under-stemming is a fault, but by itself
+it will not degrade the performance of an IR system. Because of
+under-stemming words may fail
+to conflate that ought to have conflated, but you are, in a sense, no
+worse off than you were before. Mis-stemming is more serious, but again
+mis-stemming does not really matter unless it leads to false conflations,
+and that frequently does not happen. For example, removing the ate
+ending in English, can result in useful conflations (luxury,
+luxuriate; affection, affectionate), but very often produces
+stems that are not English words
+(enerv-ate, accommod-ate,
+deliber-ate etc). In the literature, these are normally
+classed as stemming errors — overstemming — although in our nomenclature
+they are examples of mis-stemming.
+However these residual stems,
+enerv, accommod,
+deliber ... do not conflate with other word forms, and so behave in
+an IR system in the same way as if they still retained their ate
+ending. No false conflations arise, and so there is no over-stemming here.
+
+
+
+To summarise, one can say that just as a word can be over-stemmed
+but not mis-stemmed (relativity → relative), so it can be
+mis-stemmed but not over-stemmed (enervate → enerv). And, of
+course, even over-stemming does not matter, if the over-stemmed word falsely
+conflates with other words that exist in the language, but are not
+encountered in the IR
+system which is being used.
+
+
+
+Of the three types of error,
+over-stemming is the most important, and
+using a dictionary does not eliminate all over-stemmings, but does reduce their
+incidence.
+
+
+
4 Stemming as part of an indexing process
+
+
+Stemming is part of a composite process of extracting words from text and
+turning them into index terms in an IR system. Because stemming is somewhat
+complex and specialised, it is usually studied in isolation. Even so, it
+cannot really be separated from other aspects of the indexing process:
+
+
+
+
What is a word? For indexing purposes, a word in a European language is
+a sequence of letters bounded by non-letters. But in English, an internal
+apostrophe does not split a word, although it is not classed as a letter.
+The treatment of these word boundary characters affects the stemmer. For
+example, the Kraaij Pohlmann stemmer for Dutch (Kraaij, 1994, 1995) removes hyphen and
+treats apostrophe as part of the alphabet (so ’s, ’tje and ’je are three
+of their endings). The Dutch stemmer presented here assumes hyphen and
+apostrophe have already been removed from the word to be stemmed.
+
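The rule just given — a word is a run of letters, but an internal apostrophe does not split an English word — can be sketched with a regular expression (a sketch only; a real system must also handle typographic apostrophes and accented letters):

```python
import re

# Runs of letters, optionally joined by internal apostrophes.
WORD = re.compile(r"[a-z]+(?:'[a-z]+)*", re.IGNORECASE)

print(WORD.findall("the boy's dog isn't here"))
# ['the', "boy's", 'dog', "isn't", 'here']
```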
+
+
What is a letter? Clearly letters define words, but different languages
+use different letters, much confusion coming from the varied use of
+accented Roman letters.
+
+
+
+English speakers, perhaps influenced by the ASCII character set, typically regard
+their alphabet of a to z as the norm, and other forms (for example, Danish
+å and ø, or German ß) as somewhat abnormal. But this is
+an insular point of view. In Italian, for example, the letters
+j, k, w, x and y are not part of the alphabet, and are
+only seen in foreign words. We also tend to regard other alphabets as only
+used for isolated languages, and that is not strictly true. Cyrillic is
+used for a range of languages other than Russian, among which additional
+letters and accented forms abound.
+
+
+
+In English, a broad definition of letter would be anything that could be
+accepted as a pronounceable element of a word. This would include
+accented Roman letters (naïve, Fauré), and certain ligature
+forms (encyclopædia). It would exclude letters
+of foreign alphabets, such as Greek and Cyrillic.
+The a to z alphabet is one of those where letters come in
+two styles, upper and lower case, which historically correspond (very roughly) to the
+shapes you get if you use a chisel or a pen. Across all languages, the
+exact relation of upper to lower case is not so easy to define. In Italian,
+for example, an accented lower case letter is sometimes represented in
+upper case by the unaccented letter followed by an apostrophe. (I have
+seen this convention used in modern Italian news stories in machine
+readable form.)
+
+
+
+In fact the Porter stemmer (which is for English) assumes the word being stemmed is
+unaccented and in lower case. More exactly, a, e, i, o,
+u,
+and sometimes y, are
+treated as vowels, and any other character gets treated as a consonant.
+Each stemmer presented here assumes some degree of normalisation before it
+receives the word, which is roughly (a) put all letters into lower case,
+and (b) remove accents from letter-accent combinations that do not form
+part of the alphabet of the language. Each stemmer declares the
+letter-accent combinations for its language, and this can be used as a
+guide for the normalisation, but even so, we can see from
+the discussion above that (a) and (b) are not trivial
+operations, and need to be done with care.
+
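The normalisation steps (a) and (b) can be sketched as follows, assuming a hypothetical per-language declaration of the letter-accent combinations to keep (German is used for illustration):

```python
import unicodedata

# Letter-accent combinations declared part of the alphabet (German here).
KEPT = {"ä", "ö", "ü", "ß"}

def normalise(word):
    word = word.lower()                      # (a) put all letters into lower case
    out = []
    for ch in word:
        if ch.isascii() or ch in KEPT:
            out.append(ch)
        else:
            # (b) strip accents from combinations outside the alphabet
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)

print(normalise("Fauré"))   # faure  (é is not part of the German alphabet)
print(normalise("GRÜN"))    # grün   (ü is kept)
```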
+
+
+(Incidentally, because the stemmers work on lower case words, turning
+letters to upper case is sometimes used internally for flagging purposes.)
+
+
+
Identifying stopwords. Invariant stopwords are more easily found before
+stemming is applied, but inflecting stopwords (for example, German kein, keine, keinem,
+keinen ... ) may be easier to find after — because there are fewer forms.
+There is a case for building stopword identification into the stemming
+process. See section 7.
+
+
+
Conflating irregular forms. More will be said on this in section 6.
+
+
+
+
5 The use of stemmed words
+
+
+The idea of how stemmed words might be employed in an IR system has
+evolved slightly over the years. The Lovins stemmer (Lovins 1968) was
+developed not for indexing document texts, but the subject terms attached
+to them. With queries stemmed in the same way, the user needed no special
+knowledge of the form of the subject terms. Rijsbergen (1979, Chapter 2)
+assumes document text analysis: stopwords are removed, the remaining words
+are stemmed, and the resulting set of stemmed words constitutes the IR index
+(and this style of use is widespread today). More flexibility however is
+obtained by indexing all words in a text in an unstemmed form, and
+keeping a separate two-column relation which connects the words to their
+stemmed equivalents. The relation can be denoted by R(s, w), which means
+that s is the stemmed form of word w. From the relation we can get, for
+any word w, its unique stemmed form, stem(w), and for any stem s, the set
+of words, words(s), that stem to s.
+
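The relation R(s, w) and the two functions derived from it can be sketched directly (the (stem, word) pairs below are illustrative):

```python
from collections import defaultdict

# The two-column relation R(s, w): s is the stemmed form of word w.
pairs = [
    ("connect", "connected"),
    ("connect", "connection"),
    ("connect", "connects"),
    ("happi",   "happy"),
    ("happi",   "happiness"),
]

stem_of = {w: s for s, w in pairs}   # stem(w): the unique stem of word w
words_of = defaultdict(set)          # words(s): all words that stem to s
for s, w in pairs:
    words_of[s].add(w)

print(stem_of["connection"])         # connect
print(sorted(words_of["happi"]))     # ['happiness', 'happy']
```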
+
+
+The user should not have to see the stemmed form of a word. If a list of
+stems is to be presented back for query expansion, in place of
+a stem, s, the user should be shown a single representative from the set
+words(s), the one of highest frequency perhaps. The user should also
+be able to choose for the whole query, or at a lower level for each word
+in a query, whether or not it should be stemmed. In the absence of such
+choices, the system can make its own
+decisions.
+Perhaps single word queries would not undergo
+stemming; long queries would; stopwords would be removed
+except in phrases. In query expansion, the system would work with stemmed
+forms, ignoring stopwords.
+
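Picking the representative to display for a stem s — here the highest-frequency member of words(s) — might look like this (frequencies invented):

```python
# Show the user the commonest member of words(s), not the stem itself.
freq = {"happy": 120, "happiness": 30, "happier": 8}   # invented counts
words_s = ["happy", "happiness", "happier"]            # words(s) for some stem s

representative = max(words_s, key=lambda w: freq.get(w, 0))
print(representative)   # happy
```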
+
+
+Query expansion with stemming results in a much cleaner vocabulary list
+than without, and this is a main strength of using a stemming process.
+
+
+
+A question arises: if the user never sees the stemmed form, does its
+appearance matter? The answer must be no, although
+the Porter stemmer tries to make the unstemmed forms guessable from the stemmed
+forms. For example, from appropri you can guess appropriate. At least,
+trying to achieve this effect acts as a useful control. Similarly with the
+other stemmers presented here, an attempt has been made to keep the
+appearance of the stemmed forms as familiar as possible.
+
+
+
6 Irregular grammatical forms
+
+
+All languages contain irregularities, but to what extent should they be
+accommodated in a stemming algorithm? An English stemmer, for example, can
+convert regular plurals to singular form without difficulty (boys, girls,
+hands ...). Should it do the same with irregular plurals (men, children,
+feet, ...)? Here we have irregular cases with i-suffixes, but there are
+irregularities with d-suffixes, which Lovins calls ‘spelling exceptions’.
+absorb/absorption and conceive/conception are examples of this.
+Etymologically, the explanation of the first is that the Latin root,
+sorbere, is an irregular verb, and of the second that the word
+conceive comes to us from the French rather than straight from the Latin.
+It is interesting that, even with no knowledge of the etymology, we do
+recognise the connection between the words.
+
+
+
+Lovins tries to solve spelling exceptions by formulating general respelling
+rules (turn rpt into rb for example), but it might be easier to have
+simply a list of exceptional stems.
+
+
+
+The Porter stemmer does not handle irregularities at all, but from the
+author’s own experience, this has never been an area of complaint.
+Complaints in fact are always about false conflations, for example new
+and news.
+
+
+
+Possibly Lovins was right in wanting to resolve d-suffix irregularities,
+and not being concerned about i-suffix irregularities. i-suffix
+irregularities in English go with short, old words, that are either in very
+common use (man/men, woman/women, see/saw ...) or are used only rarely
+(ox/oxen, louse/lice, forsake/forsook ...). The latter class can be
+ignored, and the former has its own problems which are not always solved
+by stemming. For example man is a verb, and saw can mean a cutting
+instrument, or, as a verb, can mean to use such an instrument. Conflation
+of these forms therefore frequently leads to errors akin to mis-stemming.
+
+
+
+An algorithmic stemmer really needs holes where the irregular forms can be
+plugged in as necessary. This is more serviceable than attempting to
+embed special lists of these irregular forms into software.
+
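Such a hole might be sketched as an exception table consulted before the algorithmic rules fire; both the table contents and the single toy rule below are illustrative only:

```python
# Irregular forms plug into the exception table; everything else falls
# through to the (here drastically simplified) algorithmic rules.
EXCEPTIONS = {"men": "man", "children": "child", "feet": "foot"}

def stem(word, exceptions=EXCEPTIONS):
    if word in exceptions:            # the plug-in point for irregular forms
        return exceptions[word]
    if word.endswith("s") and len(word) > 3:
        return word[:-1]              # toy regular-plural rule
    return word

print(stem("children"))   # child
print(stem("boys"))       # boy
```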
+
+
7 Stopwords
+
+
+We have suggested that stemming stopwords is not useful. There is a
+grammatical connection between being and be, but conflation of the two
+forms has little use in IR because they have no shared meaning that would
+entitle us to think of them as synonyms. being and be have a
+morphological connection as well, but that is not true of am and was,
+although they have a grammatical connection. Generally speaking,
+inflectional stopwords exhibit many irregularities, which means that
+stemming is not only not useful, but not possible, unless one builds into
+the stemmer tables of exceptions.
+
+
+
+Switching from English to French, consider être, the equivalent form
+of be. It has about 40 different forms, including,
+
+
+
+ suis es sommes serez étaient fus furent sois été
+
+
+
+(and suis incidentally is a homonym, as part of the verb suivre.)
+Passing all forms through a rule-based stemmer creates something of a
+mess. An alternative approach is to recognise this group of words, and
+other groups, and take special action. The recognition could take place
+inside the stemmer, or be done before the stemmer is called. One special
+action would be to stem (perhaps one should say ‘map’) all the forms to a
+standard form, ETRE, to indicate that they are parts of the verb être.
+Deciding what to do with the term ETRE, and it would probably be to
+discard it, would be done outside the stemming process. Another special
+action would be to recognize a whole class of stopwords and simply discard
+them.
+
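The first special action — mapping every recognised form of the verb to a standard marker such as ETRE before (or inside) stemming — can be sketched like this; the form list is partial:

```python
# Map all recognised forms of être to the marker ETRE; what to do with
# the marker (probably discard it) is decided outside the stemmer.
ETRE_FORMS = {"être", "suis", "es", "est", "sommes", "serez",
              "étaient", "fus", "furent", "sois", "été"}

def normalise(word):
    if word in ETRE_FORMS:
        return "ETRE"
    return word   # anything else passes through to the stemmer

print(normalise("furent"))   # ETRE
print(normalise("maison"))   # maison
```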
+
+
+The strategy adopted will depend upon the underlying IR model, so what one
+needs is the flexibility to create modified forms of a standard stemmer.
+Usually we present Snowball stemmers in their unadorned form. Thereafter,
+the addition of stopword tables is quite easy.
+
+
+
8 Rare forms
+
+
+Stemmers do not need to handle linguistic forms that turn up only very
+rarely, but in practice it is hard to design a stemmer with all rare forms
+eliminated without there appearing to be some gaps in the thinking. For
+this reason one should not worry too much about their occasional presence.
+For example, in contemporary Portuguese, use of the second person plural
+form of verbs has almost completely disappeared. Even so, endings for
+those forms are included in the Portuguese stemmer. They appear in all the
+grammar books, and will in any case be found in older texts. The habit of
+putting in rare forms to ‘complete the picture’ is well established, and
+usually passes unnoticed. An example is the list of English stopwords in
+van Rijsbergen (1979). This includes yourselves, by analogy with
+himself, herself etc., although yourselves is actually quite a rare
+word in English.
+
+
+
References
+
+
+Farber DJ, Griswold RE and Polonsky IP (1964) SNOBOL, a string manipulation
+language. Journal of the Association for Computing Machinery, 11: 21-30.
+
+
+
+Griswold RE, Poage JF and Polonsky IP (1968) The SNOBOL4 programming
+language. Prentice-Hall, New Jersey.
+
+
+
+Harman D (1991) How effective is suffixing? Journal of the American
+Society for Information Science, 42: 7-15.
+
+
+
+Jesperson O (1922) Language, its nature, origin and development. George
+Allen & Unwin, London.
+
+
+
+Kraaij W and Pohlmann R. (1994) Porter’s stemming algorithm for Dutch. In
+Noordman LGM and de Vroomen WAM, eds. Informatiewetenschap 1994:
+Wetenschappelijke bijdragen aan de derde STINFON Conferentie, Tilburg,
+1994. pp. 167-180.
+
+
+
+Kraaij W and Pohlmann R (1995) Evaluation of a Dutch stemming algorithm.
+Rowley J, ed. The New Review of Document and Text Management, volume 1,
+Taylor Graham, London, 1995. pp. 25-43.
+
+
+
+Krovetz B (1995) Word sense disambiguation for large text databases. PhD
+Thesis. Department of Computer Science, University of Massachusetts
+Amherst.
+
+
+
+Lennon M, Pierce DS, Tarry BD and Willett P (1981) An evaluation of some
+conflation algorithms for information retrieval. Journal of Information
+Science, 3: 177-183.
+
+
+
+Lovins JB (1968) Development of a stemming algorithm. Mechanical
+Translation and Computational Linguistics, 11: 22-31.
+
+
+
+Palmer FR (1965) A linguistic study of the English verb. Longmans, London.
+
+
+
+Popovic M and Willett P (1990) Processing of documents and queries in a
+Slovene language free text retrieval system. Literary and Linguistic
+Computing, 5: 182-190.
+
+
+
+Porter MF (1980) An algorithm for suffix stripping. Program, 14: 130-137.
+
+
+
+Rijsbergen CJ (1979) Information retrieval. Second edition. Butterworths,
+London.
+
+
+
+Savoy J (1993) Stemming of French words based on grammatical categories.
+Journal of the American Society for Information Science, 44: 1-9.
+
+
+
+Schinke R, Greengrass M, Robertson AM and Willett P (1996) A stemming
+algorithm for Latin text databases. Journal of Documentation, 52:
+172-187.
+
diff --git a/texts/r1r2.tt b/texts/r1r2.tt
new file mode 100644
index 0000000..6b2df04
--- /dev/null
+++ b/texts/r1r2.tt
@@ -0,0 +1,78 @@
+[% header('Defining R1 and R2') %]
+
+
+Most of the stemmers make use of at least one of the region definitions R1 and
+R2. They are defined as follows:
+
+
+
+R1 is the region after the first non-vowel following a vowel, or is the null
+region at the end of the word if there is no such non-vowel.
+
+
+
+R2 is the region after the first non-vowel following a vowel in R1, or is
+the null region at the end of the word if there is no such non-vowel.
+
+
+
+The definition of vowel varies from language to language. In French, for
+example, é is a vowel, and in Italian i between two other vowels is not a
+vowel. The class of letters that constitute vowels is made clear in each stemmer.
+
+
+
+Below, R1 and R2 are shown for a number of English words,
+
+
+
+ b e a u t i f u l
+ |<------------->| R1
+ |<----->| R2
+
+
+
+Letter t is the first non-vowel following a vowel in beautiful, so R1
+is iful. In iful, the letter f is the first non-vowel following a
+vowel, so R2 is ul.
+
+
+
+ b e a u t y
+ |<->| R1
+ ->|<- R2
+
+
+
+In beauty, the last letter y is classed as a vowel. Again, letter t is
+the first non-vowel following a vowel, so R1 is just the last letter, y.
+R1 contains no non-vowel, so R2 is the null region at the end of the word.
+
+
+
+ b e a u
+ ->|<- R1
+ ->|<- R2
+
+In beau, R1 and R2 are both null.
+
+
+
+Other examples:
+
+
+
+ a n i m a d v e r s i o n
+ |<----------------------------------------->| R1
+ |<--------------------------------->| R2
+
+ s p r i n k l e d
+ |<------------->| R1
+ ->|<- R2
+
+ e u c h a r i s t
+ |<--------------------->| R1
+ |<--------->| R2
+
diff --git a/texts/vowelmarking.tt b/texts/vowelmarking.tt
new file mode 100644
index 0000000..13a24d2
--- /dev/null
+++ b/texts/vowelmarking.tt
@@ -0,0 +1,74 @@
+[% header('Marking vowels as consonants') %]
+
+
+Some of the algorithms begin with a step which puts letters which are
+normally classed as vowels into upper case to indicate that they are to be
+treated as consonants (the assumption being that the words are presented to
+the stemmers in lower case). Upper case therefore acts as a flag indicating a
+consonant.
+
+
+
+For example, the English stemmer begins with the step
+
+ Set initial y, or y after a vowel, to Y,
+
+giving rise to the following changes,
+
+
+
+
+   youth    →   Youth
+   boy      →   boY
+   boyish   →   boYish
+   fly      →   fly
+   flying   →   flying
+   syzygy   →   syzygy
+
+
+
+This process works from left to right, and
+if a word contains Vyy, where V is a vowel, the first y is put
+into upper case, but the second y is left alone, since it is preceded by
+upper case Y which is a consonant. A sequence Vyyyyy... would be
+changed to VYyYyY....
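As a rough illustration of this left-to-right behaviour (an informal sketch, not the actual Snowball code), the marking step can be written as:

```python
VOWELS = set("aeiouy")  # lower-case y counts as a vowel; upper-case Y does not

def mark_ys(word):
    out = []
    for i, ch in enumerate(word):
        # Set initial y, or y after a (still lower-case) vowel, to Y.
        if ch == "y" and (i == 0 or out[i - 1] in VOWELS):
            out.append("Y")
        else:
            out.append(ch)
    return "".join(out)
```

Because each test looks at the already-processed output, a y following a freshly written Y is left alone, which is exactly what turns Vyyyyy... into VYyYyY....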
+
+
+
+The combination yy never occurs in English, although it might appear in
+foreign words:
+
+
+
+
+   sayyid   →   saYyid
+
+
+
+(A sayyid, my dictionary tells me, is a descendant of Mohammed's daughter
+Fatima.) But the left-to-right process is significant in other languages, for
+example French. In French the rule for marking vowels as consonants is,
+
+
+
+ Put into upper case u or i preceded and followed by a vowel, and
+ y preceded or followed by a vowel. Put u after q into upper
+ case.
+
+
+
+which gives rise to,
+
+
+
+
+   ennuie       →   ennuIe
+   inquiétude   →   inqUiétude
+
+
+
+In the first word, i is put into upper case since it has a vowel on both
+sides of it.
+In the second word, u after q is put into upper case, and again the
+following i is left alone, since it is preceded by upper case U which
+is a consonant.
+