Several optimizations, refactoring #61

kno10 · 2017-09-08T15:19:39Z

As I'm repeatedly running this on the entire Wikipedia(s) for experiments, it was worth the effort to optimize this thoroughly.
Sorry, I cannot easily break this apart into smaller pull requests, because the optimizations and refactorings depend on each other, and I mostly did all of this because I needed it to become a lot faster. I would appreciate it if you could test and benchmark this, as I believe it improves both the code and the runtime for everybody, not just for me. ;-)

kno10 · 2017-09-11T15:52:58Z

Some (preliminary) results from a benchmark that is still running, involving a pipeline with CoreNLP, a huge entity dictionary to match named entities, and HeidelTime.
Results after 1,000,000 Wikipedia documents:

with HeidelTime 2.2.1: 7 hours 13 minutes
with above patches: 4 hours 55 minutes

When the process has finished I can also compare result sizes etc. (as e.g. writing much less to the database will also be faster). But from these numbers, I do think my branch is faster. ;-)

I've just added code to (hopefully) break down the runtime into CoreNLP, Entity Dictionary, and HeidelTime in more detail on the next run, because right now I cannot tell how much (or little) of the remaining 4:55 is due to HeidelTime.
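
A per-stage breakdown of this kind can be collected with a small accumulator wrapped around each pipeline step. The following is only a minimal sketch under stated assumptions: `StageTimer` and the stage names are hypothetical, it is not the code added in this branch, and it assumes single-threaded use.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical helper: accumulates wall-clock time per named pipeline stage. */
final class StageTimer {
    // Insertion-ordered so a report lists stages in pipeline order.
    private final Map<String, Long> nanos = new LinkedHashMap<>();

    /** Runs the step, adds its elapsed time to the stage total, and returns its result. */
    <T> T time(String stage, Supplier<T> step) {
        long start = System.nanoTime();
        try {
            return step.get();
        } finally {
            // Recorded even if the step throws, so failures still count as time spent.
            nanos.merge(stage, System.nanoTime() - start, Long::sum);
        }
    }

    /** Total time recorded for a stage in nanoseconds; 0 if the stage never ran. */
    long totalNanos(String stage) {
        return nanos.getOrDefault(stage, 0L);
    }

    /** Share of all recorded time that was spent in this stage, in [0, 1]. */
    double share(String stage) {
        long sum = nanos.values().stream().mapToLong(Long::longValue).sum();
        return sum == 0 ? 0.0 : (double) totalNanos(stage) / sum;
    }
}
```

Each document would then pass through calls such as `timer.time("heideltime", () -> annotate(doc))`, where `annotate` stands in for whatever the pipeline actually invokes.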

kno10 · 2017-09-12T15:34:05Z

Final runtime:
with HeidelTime 2.2.1: 28 hours 19 minutes
with above patches: 25 hours 23 minutes
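
For reference, the two pairs of runtimes quoted in this thread correspond to roughly 32% less wall-clock time after the first 1,000,000 documents and roughly 10% less for the full run. These are whole-pipeline numbers, so they do not say how much of the saving is due to HeidelTime alone. The arithmetic can be checked with a short sketch (`RuntimeComparison` is a hypothetical name used only for illustration):

```java
import java.time.Duration;

/** Worked arithmetic for the runtimes quoted in this thread; not a new measurement. */
final class RuntimeComparison {
    /** Fraction of runtime saved, e.g. 0.10 means 10% less wall-clock time. */
    static double reduction(Duration before, Duration after) {
        return 1.0 - (double) after.toMinutes() / before.toMinutes();
    }

    public static void main(String[] args) {
        Duration partialOld = Duration.ofHours(7).plusMinutes(13);  // 433 minutes
        Duration partialNew = Duration.ofHours(4).plusMinutes(55);  // 295 minutes
        Duration fullOld = Duration.ofHours(28).plusMinutes(19);    // 1699 minutes
        Duration fullNew = Duration.ofHours(25).plusMinutes(23);    // 1523 minutes

        System.out.printf("first 1,000,000 documents: %.1f%% less%n",
                100 * reduction(partialOld, partialNew));
        System.out.printf("full run: %.1f%% less%n",
                100 * reduction(fullOld, fullNew));
    }
}
```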

longliveenduro · 2018-03-27T13:50:41Z

I created a fork and merged your PR, but unfortunately I then get a lot of "dates" like this: XXXX-XX-XX. This doesn't happen with the original version from the master branch. My fork: https://github.com/longliveenduro/heideltime

kno10 · 2018-03-27T14:08:39Z

Do you have an example document?
Such dates are often relative dates ("three days earlier") without a reference. Relative information should still be available, even when the date cannot be formatted as a calendar date.
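
To illustrate why a relative expression without a reference ends up as XXXX-XX-XX, here is a minimal sketch. It is a hypothetical illustration built on `java.time`, not HeidelTime's resolution code; `RelativeDateSketch` and its method are assumed names.

```java
import java.time.LocalDate;
import java.util.Optional;

/** Hypothetical illustration of anchoring a relative date; not HeidelTime's code. */
final class RelativeDateSketch {
    /** Placeholder for a day-granularity value that could not be resolved. */
    static final String UNRESOLVED_DAY = "XXXX-XX-XX";

    /**
     * Resolves an offset in days (e.g. -3 for "three days earlier") against a
     * reference date. Without a reference, only the placeholder can be returned.
     */
    static String resolve(int offsetDays, Optional<LocalDate> reference) {
        return reference
                .map(ref -> ref.plusDays(offsetDays).toString()) // ISO-8601, yyyy-MM-dd
                .orElse(UNRESOLVED_DAY);
    }
}
```

With a reference of 2018-03-27, `resolve(-3, ...)` yields a concrete calendar date; with `Optional.empty()` it yields the placeholder, while the offset itself remains known.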

longliveenduro · 2018-03-27T14:18:15Z

@kno10 yes, the test document I've used is

"Anfang 2018: Gestern war ein schöner Tag. Nächste Woche wird es besser werden. Am kommenden Montag kommt ein weiteres Tief auf uns zu. Das Sturmtief wird gegen Montag Nachmittag um ca. 15.00 Uhr in Bayern ankommen. Mehr in 60 Minuten! Zweimal im Jahr!"

Works pretty well with the master version of HeidelTime when choosing "News" as the document type and supplying a document date.

FYI: I use HeidelTime Standalone with TreeTagger:

val heidelTimeStandalone = new HeidelTimeStandalone(Language.GERMAN, DocumentType.NEWS, OutputType.TIMEML)
val timeMlResult = heidelTimeStandalone.process(doc.text, doc.publishingDate.toDate)

kno10 · 2018-03-27T14:30:23Z

Document type and publishing date certainly will have an effect.
The XXXX-XX-XX annotations look as if the publishing date is no longer used here. This may be because my code considers "Anfang 2018" to be a stronger temporal anchor than the document's publishing date. But if you try to anchor "Gestern" with "Anfang 2018", you cannot turn this into a date, so in this particular case it probably behaves more like HeidelTime's narrative mode.

longliveenduro · 2018-03-27T14:41:34Z

I think I had the same issue with a simpler version without the "Anfang 2018".

"Gestern war ein schöner Tag. Nächste Woche wird es besser werden. Am kommenden Montag kommt ein weiteres Tief auf uns zu. Das Sturmtief wird gegen Montag Nachmittag um ca. 15.00 Uhr in Bayern ankommen."

OK, unfortunately I think this makes your performance improvements quite unusable for our news articles, because the publishing date should be a very strong anchor for news. The project owner should also be aware of this when considering whether to merge this PR.

kno10 · 2018-03-27T14:56:34Z

IIRC I had too many false anchorings with the old logic.
But yes, this needs full regression testing; unfortunately there is no test suite with well-defined behavior yet. Many of my changes are meant to make such behavior more cleanly controllable, and I could not retain undocumented behavior in all cases.

kno10 · 2018-04-23T15:31:45Z

Resolving using the document creation time is trivially disabled in this line:

https://github.com/kno10/heideltime/blob/799e611b20d3bcae3be9cb1bd8127901379f921c/src/de/unihd/dbs/uima/annotator/heideltime/ResolveAmbiguousValues.java#L131

so that is as easy as it can be to "fix". I think the original motivation was to always use the document creation time if provided - if you do not want to use it for resolving, just do not provide it. So the line should probably be ... = ParsedDct.read(jcas);.

But that is a design decision. I am not a fan of the except-narrative hidden rule.
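
The two policies under discussion can be contrasted in a short sketch. This is an assumption-laden illustration, not the code in `ResolveAmbiguousValues`: `AnchorPolicySketch`, `DocType`, and both methods are hypothetical, and the "except-narrative" variant reflects one reading of the rule mentioned above, namely that a provided document creation time is ignored for narrative documents.

```java
import java.time.LocalDate;
import java.util.Optional;

/** Hypothetical contrast of two anchoring policies; not HeidelTime's code. */
final class AnchorPolicySketch {
    enum DocType { NEWS, NARRATIVE }

    /**
     * "Always use the document creation time if provided": callers who do not
     * want it used for resolving simply do not provide one.
     */
    static Optional<LocalDate> anchorAlways(Optional<LocalDate> dct, DocType type) {
        return dct;
    }

    /**
     * Assumed "except-narrative" variant: a provided document creation time is
     * silently dropped for narrative documents.
     */
    static Optional<LocalDate> anchorExceptNarrative(Optional<LocalDate> dct, DocType type) {
        return type == DocType.NARRATIVE ? Optional.empty() : dct;
    }
}
```

The first variant keeps the behavior visible at the call site, which is the argument made above for not hiding the rule inside the resolver.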

kno10 force-pushed the master branch from f001d81 to 799e611 on December 1, 2017 15:22

kno10 force-pushed the master branch from a54c272 to b970e6f on April 23, 2018 15:38

kno10 added 12 commits April 23, 2018 17:49

Use a -SNAPSHOT version, to distinguish from the released version.

1c5210d

Log exceptions rather than print them.

0dbbc07

Switch to slf4j, cleanups and optimizations.

c5ee166

Replace Logger with slf4j, reduce unnecessary log string generation.

c1afb54

Switch logging in standalone to slf4j, breaks -v option for now.

2a01912

Use a primitive int rather than an object.

fbfb99d

Use precompiled matchers, for performance.

04d769e

Indentation fixes.

3f2f123

Reduce object allocations when not matched.

3997595

Optimizations and cleanup.

77d7103

More simplifications.

3663444

Optimize, by avoiding recompiling patterns.

81fa449

kno10 force-pushed the master branch from b970e6f to 67756e1 on April 23, 2018 16:17

kno10 added 6 commits April 23, 2018 18:39

More optimization.

438e4eb

More optimization

91f740f

More optimizations.

010391c

More low-level optimizations.

e56898c

More fixes. Bottleneck now appears to be Matcher.find.

bbf6f1b

Improve date handling, in particular error reporting.

bf30132

kno10 added 29 commits April 23, 2018 18:47

More code cleanups.

cf0dc35

Refactor: extract smaller methods from monster function.

77a6300

Cleanups, and further refactoring.

522f907

Refactor handling of undef-year

3fc9688

Further refactoring of undef handling.

bbd1cf3

This causes some output changes that may be worth discussing. E.g. does "a week earlier" without context refer to a day, or a week? I.e. is XXXX-XX-XX or XXXX-WXX correct?

Fix regression: use only current sentence when determining tense.

6a53800

Allow testing with dct

8d0fd55

Improve dct handling.

1a764e3

Boundary matching, re-enable a rule for testing, don't match 2201

6c7765c

Tweaks and fixes.

af1a9fa

Replace "finalize" with a cache for re patterns

4f1e2ff

Because the compiled patterns are only used in tense matching.

Throw exceptions, rather than using System.exit

0b68e15

Add %rePattern1%|rePattern2 syntax | for more efficient regexps.

c83e81c

Use %|re syntax for higher performance.

d72e331

Optimize German rules by using %| notation.

7cb3a96

Tweak patterns a little bit to match more.

3fee6f3

Improve matching of 1940/1941 etc.

4d2665b

Cleanup and simplification.

043bc1f

Have TreeTagger provide sentence numbers, use in disambiguation.

6918ea6

Try to resolve handling of e.g. "next weekend"

8358b70

Negation typo.

31ca504

Normalization fixes.

741ce1b

Robustness.

59e05ae

Change combination syntax: %(reA|reB|reC) instead of %reA%|reB%|reC

8eabeb7

Trivial code cleanup. Bug fix, use equals() not == for strings.

63d6c2e

Reduce code duplication.

cef6b25

Match both centuries in "19th and early 20th century".

70b350d

re-enable the use of document creation time

b166d1f

add back proxy constants, for improved compatibility

42c1d60

kno10 force-pushed the master branch from 67756e1 to 42c1d60 on April 23, 2018 16:47
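
Several of the commits listed above mention avoiding pattern recompilation and a cache for regular expression patterns. As a generic illustration of that technique, and explicitly not the code from this branch, a cache keyed by the regex string could look like the sketch below; `PatternCacheSketch` and its methods are hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

/** Hypothetical regex cache: each distinct expression is compiled only once. */
final class PatternCacheSketch {
    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    /** Returns the cached Pattern for this regex, compiling it on first use. */
    static Pattern get(String regex) {
        return CACHE.computeIfAbsent(regex, Pattern::compile);
    }

    /** True if the regex matches anywhere in the text. */
    static boolean contains(String regex, CharSequence text) {
        // A Pattern is immutable and shareable; the Matcher is created per call.
        return get(regex).matcher(text).find();
    }
}
```

Compiling inside a per-sentence or per-rule loop repeats the same parsing work on every call, whereas a lookup in the map costs only a hash of the regex string.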
