-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Several optimizations, refactoring #61
base: master
Are you sure you want to change the base?
Conversation
Some (preliminary) results from a benchmark that is still running, with HeidelTime 2.2.1: 7 hours 13 minutes When the process has finished I can also compare result sizes etc. (as e.g. writing much less to the database will also be faster). But from these numbers, I do think my branch is faster. ;-) I've just added code to (hopefully) be able to break down runtime into CoreNLP, Entity Dictionary, HeidelTime with more detail on the next run. Because right now I cannot tell how much (or little) of the remaining 4:55 is due to HeidelTime. |
Final runtime: |
I created a fork and merged your PR but unfortunately then I get a lot of "dates" like this XXXX-XX-XX. This doesn't happen with the original version from master branch. My fork: https://github.com/longliveenduro/heideltime |
Do you have an example document? |
@kno10 yes, the test document I've used is "Anfang 2018: Gestern war ein schöner Tag. Nächste Woche wird es besser werden. Am kommenden Montag kommt ein weiteres Tief auf uns zu. Das Sturmtief wird gegen Montag Nachmittag um ca. 15.00 Uhr in Bayern ankommen. Mehr in 60 Minuten! Zweimal im Jahr!" Works pretty good with the master version of heideltime when choosing "News" as Document Type and supplying a document date. FYI: I do use Heideltime Standalone with Treetagger:
|
Document type and publishing date certainly will have an effect. |
I think I had the same issue with a simpler version without the "Anfang 2018". "Gestern war ein schöner Tag. Nächste Woche wird es besser werden. Am kommenden Montag kommt ein weiteres Tief auf uns zu. Das Sturmtief wird gegen Montag Nachmittag um ca. 15.00 Uhr in Bayern ankommen." Ok unfortunately I think this makes your performance improvements for our news articles quite unusable, because the publishing date should be a very strong anchor for news. Also the project owner should be aware of this when he might consider merging this PR. |
IIRC I had too many false anchorings with the old logic. |
Resolving using the document creation time is trivially disabled in this line: so that is as easy as it can be to "fix". I think the original motivation was to always use the document creation time if provided - if you do not want to use it for resolving, just do not provide it. So the line should probably be But that is a design decision. I am not a fan of the except-narrative hidden rule. |
This causes some output changes that may be worth discussing. E.g. does "a week earlier" without context refer to a day, or a week? I.e. is XXXX-XX-XX or XXXX-WXX correct?
Because the compiled patterns are only used in tense matching.
As I'm repeatedly running this on the entire Wikipedia(s) for experiments, it was worth the effort to optimize this thoroughly.
Sorry, I cannot easily break this apart into smaller pull requests, because optimizations and code refactorings depend on each other; and I mostly did all of this because I needed it to become a lot faster... but I would appreciate if you test and benchmark this, as I believe this improves both the code and the runtime for everybody and not just for me. ;-)