Releases: lex-lingo/lingo
v1.10.2
v1.10.1
v1.10.0
- Dropped support for Ruby 2.0.
- Updated dependency versions.
v1.9.0
- Dropped support for Ruby 1.9.
- Removed support for deprecated options and attendee names (old → new):
  - Lingo::Language::Grammar: `compositum` → `compound`
  - Lingo::Attendee::TextReader: `lir-record-pattern` → `records`
  - Lingo::Config: `multiworder` → `multi_worder`, `objectfilter` → `object_filter`, `textreader` → `text_reader`, `textwriter` → `text_writer`, `vectorfilter` → `vector_filter`, `wordsearcher` → `word_searcher`
- Lingo::Attendee::TextWriter learned format directives for `ext` option (currently supported are: `%c` = config name, `%l` = language name, `%d` = current date, `%t` = current time).
- Lingo::Attendee::Sequencer remembers word form of sequences.
- Updated and extended English system dictionary and suffix list.
- Fixed errors with XML input (issue #15 by Thomas Berger).
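
The `ext` format directives listed for Lingo::Attendee::TextWriter can be pictured as a plain string substitution. The following is a minimal sketch, not Lingo's actual implementation: `expand_ext` is a hypothetical helper, and the date and time layouts are assumptions, since the changelog does not specify them.

```ruby
# Hypothetical helper illustrating directive expansion; not Lingo's code.
def expand_ext(ext, config:, lang:, now: Time.now)
  directives = {
    '%c' => config,                    # config name
    '%l' => lang,                      # language name
    '%d' => now.strftime('%Y-%m-%d'),  # current date (layout is an assumption)
    '%t' => now.strftime('%H%M%S')     # current time (layout is an assumption)
  }
  # Replace each known directive; all other text is left untouched.
  ext.gsub(/%[cldt]/) { |directive| directives[directive] }
end

puts expand_ext('%l.%c.vec', config: 'lingo', lang: 'en')  # prints "en.lingo.vec"
```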
v1.8.7
- Added Lingo::Attendee::LsiFilter to correlate semantically related terms (LSI) over the "corpus" of all files processed during a single program invocation; requires lsi4r which in turn requires rb-gsl. [EXPERIMENTAL: Interface may be changed or removed in next release.]
- Added Lingo::Attendee::HalFilter to correlate semantically related terms (HAL) over individual documents; requires hal4r which in turn requires rb-gsl. [EXPERIMENTAL: Interface may be changed or removed in next release.]
- Added Lingo::Attendee::AnalysisFilter and associated `lingoctl` tooling.
- Multiword dictionaries can now identify hyphenated variants (e.g. `automatic data-processing`); set `hyphenate: true` in the dictionary config.
- Lingo::Attendee::Tokenizer no longer considers hyphens at word edges as part of the word. As a consequence, Lingo::Attendee::Dehyphenizer has been dropped.
- Dropped Lingo::Attendee::NonewordFilter; use Lingo::Attendee::VectorFilter with option `lexicals: '\?'` instead.
- Lingo::Attendee::TextReader and Lingo::Attendee::TextWriter learned `encoding` option to read/write text that is not UTF-8 encoded; configuration files and dictionaries still need to be UTF-8, though.
- Lingo::Attendee::TextReader and Lingo::Attendee::TextWriter learned to read/write Gzip-compressed files (file extension `.gz` or `.gzip`).
- Lingo::Attendee::Sequencer learned to recognize `0` in the pattern to match number tokens.
- Fixed Lingo::Attendee::TextReader to recognize BOM in input files; does not apply to input read from `STDIN`.
- Fixed regression introduced in 1.8.6 where Lingo::Attendee::Debugger would no longer work immediately behind Lingo::Attendee::TextReader.
- Fixed `lingoctl` copy commands when overwriting existing files.
- Refactored Lingo::Database::Crypter into a module.
- JRuby 9000 compatibility.
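
Three of the TextReader changes in this release (the `encoding` option, Gzip support by file extension, and BOM recognition) combine naturally when reading a file. The sketch below shows one way this could look using only Ruby's standard library. `read_text` is a hypothetical helper and does not reflect how Lingo implements these features internally.

```ruby
require 'zlib'

# Hypothetical helper; a sketch of the idea, not Lingo's code.
# Reads a file as UTF-8 text, transparently handling Gzip compression
# (by extension), a non-UTF-8 source encoding, and a leading BOM.
def read_text(path, encoding: 'UTF-8')
  raw =
    if path.match?(/\.gz(ip)?\z/i)
      Zlib::GzipReader.open(path) { |gz| gz.read }  # decompress on the fly
    else
      File.binread(path)                            # raw bytes, no decoding
    end

  # Interpret the bytes in the declared encoding, then transcode to UTF-8.
  text = raw.force_encoding(encoding).encode('UTF-8')

  # Drop a leading byte order mark if one is present.
  text.delete_prefix("\uFEFF")
end
```

A call such as `read_text('input.txt.gz', encoding: 'ISO-8859-1')` would then return the decompressed content as a UTF-8 string; the file name here is only an illustration.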
v1.8.6
- Lingo::Attendee::VectorFilter learned `pos` option to print position and byte offset with each word.
- Lingo::Attendee::VectorFilter learned `tfidf` option to sort results based on their tf–idf score; the document frequencies are calculated over the "corpus" of all files processed during a single program invocation.
- Lingo::Attendee::VectorFilter learned `tokens` option to filter on Lingo::Language::Token in addition to Lingo::Language::Word.
- Lingo::Attendee::VectorFilter no longer supports `debug` (as well as `prompt` and `preamble`); use Lingo::Attendee::DebugFilter instead.
- Lingo::Attendee::TextReader no longer removes line endings; option `chomp` is obsolete.
- Lingo::Attendee::TextReader passes byte offset to the following attendee.
- Lingo::Attendee::Tokenizer records token's byte offset.
- Lingo::Attendee::Tokenizer records token's sequence position.
- Lingo::Attendee::Tokenizer learned `skip-tags` option to skip over specified tags' contents.
- Lingo::Attendee subclasses warn when invalid or obsolete options or names are used.
- Changed German infix substitution `/en` to `ch/chen` in order to prevent overly aggressive identifications.
- Internal refactoring and API changes.
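
To illustrate what sorting by tf-idf over a per-invocation "corpus" means, here is a small sketch using the textbook weighting (term frequency times the logarithm of inverse document frequency). Which exact variant Lingo uses is not stated in this changelog, so the formula, the helper name `tfidf_ranking`, and the input shape are all assumptions.

```ruby
# Hypothetical helper; textbook tf-idf, not necessarily Lingo's variant.
# docs maps a file name to the list of terms found in that file.
def tfidf_ranking(docs)
  total = docs.size.to_f

  # Document frequency: in how many files does each term occur?
  doc_freq = Hash.new(0)
  docs.each_value { |terms| terms.uniq.each { |term| doc_freq[term] += 1 } }

  docs.transform_values do |terms|
    counts = terms.each_with_object(Hash.new(0)) { |term, h| h[term] += 1 }
    scored = counts.map do |term, count|
      tf  = count.to_f / terms.size
      idf = Math.log(total / doc_freq[term])
      [term, tf * idf]
    end
    # Highest score first; ties are broken alphabetically.
    scored.sort_by { |term, score| [-score, term] }
  end
end

ranking = tfidf_ranking(
  'a.txt' => %w[bank bank river],
  'b.txt' => %w[bank money]
)
p ranking['a.txt'].map(&:first)  # prints ["river", "bank"]
```

Note that a term occurring in every file (here `bank`) scores zero, which is why document frequencies must be collected across all files before any single file can be ranked.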
v1.8.5
- Dictionary values (projections) are no longer sorted; hence, order of definition affects processing.
- Lexicals in Lingo::Language::Word are no longer sorted; in particular, compound parts keep their original order.
- Lexicals in Lingo::Language::Word are no longer cleaned from duplicates.
- Compiled dictionaries are updated whenever the Lingo version or their configuration changes, not only when the source file's size or modification time changes.
- Lingo::Attendee::Synonymer learned `compound-parts` option to also generate synonyms for compound parts when set to `true`.
- Lingo::Attendee::TextReader learned better PDF-to-text conversion using the `pdftotext` command; specify `filter: pdftotext` in the config.
- Lingo::Attendee::VectorFilter learned `dict` option to print words in dictionary format (viz. Lingo::Database::Source::WordClass).
- Lingo::Attendee::VectorFilter learned `preamble` option to print current configuration to the beginning of the log file (`debug: 'true'`); set `preamble: false` to disable.
- Multiword dictionaries compiled from base forms can now generate inflected adjectives based on the gender of the head noun; set `inflect: true` in the dictionary config.
- Lingo::Database::Source::WordClass supports gender information being encoded in the dictionary as well as shorthand notation for multiple word classes/genders.
- Lingo::Database::Source::WordClass supports compounds being encoded in the dictionary (appending `+` to their parts' word classes is recommended).
- Lingo::Database::Source removes leading and trailing whitespace from dictionary lines.
- Lingo::Database::Crypter uses OpenSSL to encrypt/decrypt dictionaries. Note: Can't decrypt dictionaries encrypted with the old scheme anymore.
- Lingo::Attendee::Tokenizer learned subset of MediaWiki syntax.
- Eliminated pathological behaviour of the `URLS` rule in Lingo::Attendee::Tokenizer.
- Fixed regression introduced in 1.8.2 where `combine: all` would no longer work in Lingo::Attendee::MultiWorder.
- Updated and extended Russian dictionaries. (Yulia Dorokhova, Thomas Müller)
- `lingoctl` no longer overwrites existing files without confirmation.
- `lingoctl` learned `archive` command.
- Dictionary cleanup.
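
The change to when compiled dictionaries are rebuilt (Lingo version and configuration in addition to source size and modification time) amounts to a broader staleness check. One common way to express such a check is a fingerprint over all relevant inputs. The sketch below assumes that approach; the helper names and the use of SHA-256 are illustrative and not taken from Lingo.

```ruby
require 'digest'

# Hypothetical helpers; a sketch of a staleness check, not Lingo's code.
# The fingerprint covers everything that should trigger a recompile.
def dictionary_fingerprint(source_path, version:, config:)
  stat  = File.stat(source_path)
  parts = [version, config.sort.inspect, stat.size, stat.mtime.to_f]
  Digest::SHA256.hexdigest(parts.join("\0"))
end

# A compiled dictionary is stale when its stored fingerprint no longer
# matches the one computed from the current inputs.
def stale?(stored_fingerprint, source_path, version:, config:)
  stored_fingerprint != dictionary_fingerprint(source_path, version: version, config: config)
end
```

With a check like this, upgrading the version string or changing a single config value would mark the compiled dictionary as stale even if the source file itself was untouched.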
v1.8.4
- Lingo::Attendee::Sequencer accepts regular expression patterns.
- Lingo::Attendee::Sequencer substitutes `0` in the format string for the matched pattern.
- Lingo::Attendee::NonewordFilter learned `dict` option to print nonewords in dictionary format.
- Added progress reporting to Lingo::Attendee::TextReader for `STDIN`.
- `lingoctl demo` reports successful initialization.
- Russian localization for Lingo::Web. (Yulia Dorokhova, Thomas Müller)
- Lingo::Web learned parameter `hl` to set UI language.
- Lingo::Web displays the configuration in use.
- Lingo::Srv accepts array of query strings in addition to single query string.
- Meeting config takes precedence over language config.
- When dictionary entries are rejected during conversion, the location of the reject file will be shown.
- LIR record number defaults to match string in absence of capture group.
- Optionally prevent Lingo from sorting any results by setting the `LINGO_NO_SORT` environment variable.
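
The LIR record number fallback described in this release can be expressed with Ruby's `MatchData`: use the first capture group when the pattern has one, otherwise the whole match. The following sketch assumes that reading of the changelog entry. `record_id` and the sample patterns are hypothetical.

```ruby
# Hypothetical helper; a sketch of the fallback, not Lingo's code.
# Returns the record number for a line, or nil if the line does not
# start a new record.
def record_id(line, pattern)
  match = pattern.match(line)
  return nil unless match

  # match[1] is nil when the pattern has no capture group,
  # in which case the whole matched string is used instead.
  match[1] || match[0]
end

p record_id('[00237.]', /^\[(\d+)\.\]/)  # prints "00237"
p record_id('[00237.]', /^\[\d+\.\]/)    # prints "[00237.]"
```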
v1.8.3
- Fixed regression introduced in 1.8.2 where reading input from `STDIN` was no longer possible.
- Fixed regression introduced in 1.8.2 where Lingo would no longer run on Ruby 1.9.2.
- Fixed length limit handling for multibyte characters in SDBM store.
- Fixed encoding issue in SDBM store.
- Fixed issue with BOM in config files.
- Modified character handling to accept any Unicode letter (Alphabetic) and digit (Decimal Number).
- Modified Lingo::Attendee::Tokenizer to use only hard-coded tokenization rules.
- Modified Lingo::Attendee::VectorFilter option `lexicals` to be case-sensitive.
- Improved overall performance and memory usage; Lingo::Attendee::Sequencer changed the order sequences are inserted into the stream.
- Eliminated performance penalty caused by Lingo::Attendee::Abbreviator.
- Added Russian language support. (Yulia Dorokhova, Thomas Müller)
- Added `fields` option to Lingo::Attendee::TextReader to cut off field labels; defaults to `true` in record (LIR) mode.
- Added `skip` option to Lingo::Attendee::TextReader to skip lines matching the given pattern.
- Added `src` option to Lingo::Attendee::VectorFilter to print "source" part of compounds.
- Added `lingosrv` and `lingoweb` executables. The former provides a simple HTTP endpoint with JSON output; the latter serves a demo web interface.
- Refactored internal caching.
- Made dependency on Ruby version >= 1.9.2 explicit.
- Removed reporting facility (options `--perfmon` and `--status`).
- Learned `--profile` option to collect profiling information while running.
- Deprecated Lingo::Language::Grammar option `compositum` (now `compound`), Lingo::Config option `textreader` (now `text_reader`), and Lingo::Attendee::TextReader option `lir-record-pattern` (now `records`); they will be removed in Lingo 1.9.
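
Accepting any Unicode letter and decimal digit maps directly onto Unicode property classes in Ruby regular expressions. The sketch below shows the general idea with `\p{Alpha}` (alphabetic characters) and `\p{Nd}` (decimal numbers). It is not the tokenizer's actual rule set, and `word_tokens` is a hypothetical helper.

```ruby
# Hypothetical helper; illustrates Unicode-aware character classes only.
# \p{Alpha} matches alphabetic characters, \p{Nd} matches decimal digits.
WORD_CHARS = /[\p{Alpha}\p{Nd}]+/

def word_tokens(text)
  text.scan(WORD_CHARS)
end

p word_tokens('Привет, мир 2024!')  # prints ["Привет", "мир", "2024"]
```

An ASCII-only class such as `[A-Za-z0-9]+` would return only `["2024"]` for the same input, which is the kind of limitation this change removes.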
v1.8.2
- Performance improvements regarding Lingo::Attendee::VectorFilter (as well as Lingo::Attendee::NonewordFilter) memory usage; set `sort: false` in the config.
- Added Lingo::Attendee::Stemmer (implementing Porter's algorithm for suffix stripping).
- Added progress reporting to Lingo::Attendee::TextReader; set `progress: true` in the config.
- Added directory and glob processing to Lingo::Attendee::TextReader (new options `glob` and `recursive`).
- Renamed Lingo::Attendee::TextReader option `lir-record-pattern` to `records`.
- Fixed Lingo::Attendee::Debugger to forward all objects so it can be inserted between any two attendees in the config.
- Fixed regression introduced in 1.8.0 where Lingo would not use existing compiled dictionary when source file is not present.
- Fixed "invalid byte sequence in UTF-8" on Windows for SDBM store.
- Enabled pluggable (compiled) dictionaries and storage backends.
- Extensive internal refactoring and cleanup. (Finished for now.)
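
Directory and glob processing of the kind added to TextReader can be sketched with `Dir.glob`. The helper below is hypothetical: the option names `glob` and `recursive` come from the changelog, while the default pattern `*.txt` and the exact semantics are assumptions.

```ruby
# Hypothetical helper; a sketch of directory expansion, not Lingo's code.
# Plain files are returned as-is; directories are expanded with the glob,
# optionally descending into subdirectories.
def expand_inputs(path, glob: '*.txt', recursive: false)
  return [path] unless File.directory?(path)

  pattern = recursive ? File.join(path, '**', glob) : File.join(path, glob)
  Dir.glob(pattern).select { |file| File.file?(file) }.sort
end
```

For example, `expand_inputs('corpus', recursive: true)` would list every `.txt` file below a directory named `corpus` (an illustrative name), sorted by path.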