Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Several optimizations, refactoring #61

Open
wants to merge 113 commits into
base: master
Choose a base branch
from
Open

Conversation

kno10
Copy link

@kno10 kno10 commented Sep 8, 2017

As I'm repeatedly running this on the entire Wikipedia(s) for experiments, it was worth the effort to optimize this thoroughly.
Sorry, I cannot easily break this apart into smaller pull requests, because optimizations and code refactorings depend on each other; and I mostly did all of this because I needed it to become a lot faster... but I would appreciate if you test and benchmark this, as I believe this improves both the code and the runtime for everybody and not just for me. ;-)

@kno10
Copy link
Author

kno10 commented Sep 11, 2017

Some (preliminary) results from a benchmark that is still running,
involving a pipeline with CoreNLP, a huge entity dictionary to match named entities, and HeidelTime:
Results after 1.000.000 Wikipedia documents.

with HeidelTime 2.2.1: 7 hours 13 minutes
with above patches: 4 hours 55 minutes

When the process has finished I can also compare result sizes etc. (as e.g. writing much less to the database will also be faster). But from these numbers, I do think my branch is faster. ;-)

I've just added code to (hopefully) be able to break down runtime into CoreNLP, Entity Dictionary, HeidelTime with more detail on the next run. Because right now I cannot tell how much (or little) of the remaining 4:55 is due to HeidelTime.

@kno10
Copy link
Author

kno10 commented Sep 12, 2017

Final runtime:
with HeidelTime 2.2.1: 28 hours 19 minutes
with above patches: 25 hours 23 minutes

@longliveenduro
Copy link

I created a fork and merged your PR but unfortunately then I get a lot of "dates" like this XXXX-XX-XX. This doesn't happen with the original version from master branch. My fork: https://github.com/longliveenduro/heideltime

@kno10
Copy link
Author

kno10 commented Mar 27, 2018

Do you have an example document?
Such dates often are relative dates ("three days earlier") without a reference. There should be relative information available, even when the date cannot be formated as a calendar date.

@longliveenduro
Copy link

longliveenduro commented Mar 27, 2018

@kno10 yes, the test document I've used is

"Anfang 2018: Gestern war ein schöner Tag. Nächste Woche wird es besser werden. Am kommenden Montag kommt ein weiteres Tief auf uns zu. Das Sturmtief wird gegen Montag Nachmittag um ca. 15.00 Uhr in Bayern ankommen. Mehr in 60 Minuten! Zweimal im Jahr!"

Works pretty good with the master version of heideltime when choosing "News" as Document Type and supplying a document date.

FYI: I do use Heideltime Standalone with Treetagger:

val heidelTimeStandalone = new HeidelTimeStandalone(Language.GERMAN, DocumentType.NEWS, OutputType.TIMEML)
val timeMlResult = heidelTimeStandalone.process(doc.text, doc.publishingDate.toDate)

@kno10
Copy link
Author

kno10 commented Mar 27, 2018

Document type and publishing date certainly will have an effect.
The XXXX-XX-XX annotations look as if the publishing date is not used here anymore. This may be that my code considers "Anfang 2018" to be a stronger temporal anchor than the documents publishing date. But if you try to anchor "Gestern" with "Anfang 2018", you cannot make this a date, so it probably behaves a bit more similar to Heideltime narrative mode in this particular case.

@longliveenduro
Copy link

I think I had the same issue with a simpler version without the "Anfang 2018".

"Gestern war ein schöner Tag. Nächste Woche wird es besser werden. Am kommenden Montag kommt ein weiteres Tief auf uns zu. Das Sturmtief wird gegen Montag Nachmittag um ca. 15.00 Uhr in Bayern ankommen."

Ok unfortunately I think this makes your performance improvements for our news articles quite unusable, because the publishing date should be a very strong anchor for news. Also the project owner should be aware of this when he might consider merging this PR.

@kno10
Copy link
Author

kno10 commented Mar 27, 2018

IIRC I had too many false anchorings with the old logic.
But yes, this needs a full regression testing, but unfortunately there is no testsuite with well-defined behavior yet. Many of my changes are to make such behavior can be more cleanly controlled, and I could not retain undocumented behavior in all cases.

@kno10
Copy link
Author

kno10 commented Apr 23, 2018

Resolving using the document creation time is trivially disabled in this line:

https://github.com/kno10/heideltime/blob/799e611b20d3bcae3be9cb1bd8127901379f921c/src/de/unihd/dbs/uima/annotator/heideltime/ResolveAmbiguousValues.java#L131

so that is as easy as it can be to "fix". I think the original motivation was to always use the document creation time if provided - if you do not want to use it for resolving, just do not provide it. So the line should probably be ... = ParsedDct.read(jcas);.

But that is a design decision. I am not a fan of the except-narrative hidden rule.

kno10 added 29 commits April 23, 2018 18:47
This causes some output changes that may be worth discussing.
E.g. does "a week earlier" without context refer to a day, or a week?
I.e. is XXXX-XX-XX or XXXX-WXX correct?
Because the compiled patterns are only used in tense matching.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants