From 48e014b850aaf1edfa87cb861ba9c955d4501fc6 Mon Sep 17 00:00:00 2001
From: Olly Betts
Date: Thu, 19 Oct 2023 18:48:59 +1300
Subject: [PATCH] Add missing differences from Porter stemmer

---
 algorithms/english/stemmer.tt | 36 ++++++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/algorithms/english/stemmer.tt b/algorithms/english/stemmer.tt
index 9eedab8..b50e69f 100644
--- a/algorithms/english/stemmer.tt
+++ b/algorithms/english/stemmer.tt
@@ -50,19 +50,33 @@
 But it is hardly surprising that after twenty years of use of the Porter stemmer, certain improvements did suggest themselves, and a new algorithm for English is therefore offered here. (It could be called the ‘Porter2’ stemmer to distinguish it from the Porter stemmer, from which it derives.)
-The changes are not so very extensive: (1) terminating y is changed to
-i rather less often, (2) suffix us does not lose its s, (3) a
-few additional suffixes are included for removal, including (4) suffix
-ly. In addition, a small list of exceptional forms is included. In
-December 2001 there were two further adjustments: (5) Steps 5a and 5b
+The changes are not so very extensive:
+

+ +
    +
  1. [In C Porter stemmer but not in paper] +Extra rule in Step 2: logi -> log +
  2. [In C Porter stemmer but not in paper] +Step 2 rule: abli -> able replaced by bli -> ble
  3. [In C Porter stemmer but not in paper] +The algorithm leaves alone strings of length 2 (so +as and is no longer lose s).
  4. Terminating y is changed to i rather less often +
  5. Suffix us does not lose its s +
  6. A few additional suffixes are included for removal, including suffix +ly +
  7. A small list of exceptional forms is included +
  8. [December 2001] Steps 5a and 5b of the old Porter stemmer were combined into a single step. This means
-that undoubling final ll is not done with removal of final e. (6)
-In Step 3 ative is removed only when in region R2.
-(7)
-In July
-2005 a small adjustment was made (including a new step 0) to handle
+that undoubling final ll is not done with removal of final e
  9. [December 2001] In Step 3 ative is removed only when in region R2. +
  10. [May 2005] commun added to exceptional forms +
  11. [July 2005] A small adjustment was made (including a new step 0) to handle apostrophe. -

    +
  12. [January 2006] "Words" ied and ies now stem to ie rather than i. +
  13. [January 2006] The implementation was fixed to follow the algorithm as documented here and now always treats an initial y as a consonant. +
  14. [November 2006] arsen added to exceptional forms +

To begin with, here is the basic algorithm without reference to the