Several optimizations, refactoring #61

Open
wants to merge 113 commits into
base: master
113 commits
1c5210d
Use a -SNAPSHOT version, to distinguish from the released version.
kno10 Dec 30, 2016
0dbbc07
Log exceptions rather than print them.
kno10 Dec 16, 2016
c5ee166
Switch to slf4j, cleanups and optimizations.
kno10 Dec 30, 2016
c1afb54
Replace Logger with slf4j, reduce unnecessary log string generation.
kno10 Dec 30, 2016
2a01912
Switch logging in standalone to slf4j, breaks -v option for now.
kno10 Jan 12, 2017
fbfb99d
Use a primitive int rather than an object.
kno10 Dec 16, 2016
04d769e
Use precompiled matchers, for performance.
kno10 Dec 20, 2016
3f2f123
Indentation fixes.
kno10 Dec 30, 2016
3997595
Reduce object allocations when not matched.
kno10 Dec 30, 2016
77d7103
Optimizations and cleanup.
kno10 Dec 30, 2016
3663444
More simplifications.
kno10 Dec 30, 2016
81fa449
Optimize, by avoiding recompiling patterns.
kno10 Dec 30, 2016
438e4eb
More optimization.
kno10 Dec 30, 2016
91f740f
More optimization
kno10 Dec 30, 2016
010391c
More optimizations.
kno10 Dec 30, 2016
e56898c
More low-level optimizations.
kno10 Dec 30, 2016
bbf6f1b
More fixes. Bottleneck now appears to be `Matcher.find`.
kno10 Dec 30, 2016
bf30132
Improve date handling, in particular error reporting.
kno10 Dec 30, 2016
d36d904
Code cleanups and optimizations.
kno10 Jan 2, 2017
7f505c5
Another copy and paste error.
kno10 Jan 2, 2017
9813bbf
Optimizations and cleanups.
kno10 Jan 2, 2017
b9ec99e
Add profiling information, and improve error reporting for bad rules.
kno10 Jan 2, 2017
121a87a
Optimizations and refactoring (duration simplification, Chinese numbers)
kno10 Jan 2, 2017
90da90e
Improve error handling and logging.
kno10 Jan 2, 2017
656e081
Improve rule performance by optimizing regular expressions.
kno10 Jan 2, 2017
d9addee
Use FAST_CHECK to speed up some of the more expensive rules.
kno10 Jan 2, 2017
d66d030
Fix time_r8d, optimize various patterns.
kno10 Jan 3, 2017
bcae16c
Anchor rules, for better performance.
kno10 Jan 11, 2017
32efad7
Refactor rules into a Rule class, which greatly simplifies code.
kno10 Jan 11, 2017
fef5662
More cleanups and optimization, retire Toolbox.
kno10 Jan 11, 2017
7b07ef2
More smallish cleanups.
kno10 Jan 11, 2017
0fee9cf
Switch DateCalculator to Java 8 Date API (much better). Untested.
kno10 Jan 11, 2017
43c25a5
Improve robustness of date parsing.
kno10 Jan 19, 2017
a78b509
Update dependencies to latest versions. Simplify UIMA context (untested)
kno10 Jan 12, 2017
33e478f
Decrease logging priority debug -> trace for variable expansion.
kno10 Jan 12, 2017
1543bbf
Much improved whitespace handling. Move more logging to TRACE level.
kno10 Jan 12, 2017
e191bef
Minor pattern improvements
kno10 Jan 12, 2017
913a77a
Don't complain on empty lines.
kno10 Jan 12, 2017
8b26cf2
Move more logging from DEBUG to TRACE level.
kno10 Jan 12, 2017
ac57dc8
Add (unfinished) unit test for English.
kno10 Jan 12, 2017
d2af95a
Code formatting and cleanups.
kno10 Jan 12, 2017
202cbfa
Simplify ContextAnalyzer
kno10 Jan 12, 2017
ec891dc
Testing improvements and avoid "last week of"
kno10 Jan 12, 2017
7c75269
BIG optimization of reused RePatterns for performance (14->7 CPU days)
kno10 Jan 13, 2017
8975da6
Move rule expansion logic to its own class.
kno10 Jan 13, 2017
05d5c32
Refactoring of unit tests. Some fail (needs more cleanup)
kno10 Jan 16, 2017
c1229cb
Some code simplification, some methods public now.
kno10 Jan 16, 2017
6806d43
Rule improvements and fixes found by unit testing.
kno10 Jan 16, 2017
c83a68b
More pattern improvements.
kno10 Jan 18, 2017
5693e00
Add a regexp optimization step (TODO: make optional)
kno10 Jan 18, 2017
5ea526e
Rule improvements.
kno10 Jan 18, 2017
a12aac3
Add @Ignore to tests that require POS tags for now.
kno10 Jan 18, 2017
65012dd
Improve handling for overlapping rule matches.
kno10 Jan 18, 2017
b1de32c
Reduce debug verbosity to trace.
kno10 Jan 18, 2017
7b5b681
Improve rule error reporting.
kno10 Jan 18, 2017
43624d0
Add dotted version, too.
kno10 Jan 18, 2017
aac3682
Adjustments because of latest Wikipedia test (runtime now down to 5h)
kno10 Jan 19, 2017
07385b3
Make tokenizer alignment more robust
kno10 Jan 19, 2017
be5cdef
Undo manual optimizations, let the new automatic optimizer do its job.
kno10 Jan 19, 2017
e26215a
Improve tests, by usually using colloquial (except for historic)
kno10 Jan 19, 2017
652b674
Add "dawn of", move historic after positive patterns
kno10 Jan 19, 2017
6c1134c
Retain file order rather than sorting alphabetically.
kno10 Jan 19, 2017
cc099da
make allTokIds optional; cleanups; undo 'longer result wins' tie breaker
kno10 Jan 19, 2017
7e19c35
Simplify whitespace & improve token matching procedure.
kno10 Jan 20, 2017
da59a07
Because of the improved token matching, we do not need anchors anymore.
kno10 Jan 20, 2017
2569ff7
Simplify patterns (no capturing groups, simplify `[\.]`)
kno10 Mar 1, 2017
4ecde08
Don't fail on optimizing an empty set of patterns.
kno10 Mar 1, 2017
4727076
Fix Regexp optimizer corner case present in German patterns
kno10 Mar 1, 2017
b219aac
Some fixes for German dates.
kno10 Mar 12, 2017
3f41f7d
Normalization: allow leading 0 on 01st to 09th
kno10 Mar 14, 2017
a7a4a80
Fix matching of 24:00 times.
kno10 May 31, 2017
0db0715
Tweak to negative matches, for Wikipedia.
kno10 May 31, 2017
6ce5e1b
Reduce string usage, refactor Dct parsing.
kno10 May 31, 2017
757b241
Remove language/normalization roundtrip for next-week via Java 8.
kno10 May 31, 2017
f21179d
Disable tests and signing by default, override via command line.
kno10 May 31, 2017
ccbfe81
Split heideltime class in two: refactor ambiguous values resolving.
kno10 May 31, 2017
0924f41
Larger code cleanups of disambiguation logic.
kno10 May 31, 2017
46016b5
Bug fix for issue HeidelTime/heideltime#53
kno10 May 31, 2017
d5c9e53
Use a DocumentType enum not only in standalone.
kno10 Jun 1, 2017
5340a2a
Change season to an enum.
kno10 Jun 6, 2017
01c0dce
Use enum for tense.
kno10 Jun 6, 2017
5a6f59f
Further improve Season handling.
kno10 Jun 6, 2017
46da9a7
Improve efficiency of integer parsing.
kno10 Jun 6, 2017
de2a368
Code cleanups.
kno10 Jun 6, 2017
cf0dc35
More code cleanups.
kno10 Jun 6, 2017
77a6300
Refactor: extract smaller methods from monster function.
kno10 Jun 6, 2017
522f907
Cleanups, and further refactoring.
kno10 Jun 6, 2017
3fc9688
Refactor handling of undef-year
kno10 Jun 7, 2017
bbd1cf3
Further refactoring of undef handling.
kno10 Jun 7, 2017
6a53800
Fix regression: use only current sentence when determining tense.
kno10 Jun 8, 2017
8d0fd55
Allow testing with dct
kno10 Jun 8, 2017
1a764e3
Improve dct handling.
kno10 Jun 8, 2017
6c7765c
Boundary matching, re-enable a rule for testing, don't match 2201
kno10 Jun 8, 2017
af1a9fa
Tweaks and fixes.
kno10 Jun 8, 2017
4f1e2ff
Replace "finalize" with a cache for re patterns
kno10 Jun 9, 2017
0b68e15
Throw exceptions, rather than using System.exit
kno10 Jun 9, 2017
c83e81c
Add `%rePattern1%|rePattern2` syntax `|` for more efficient regexps.
kno10 Jun 9, 2017
d72e331
Use `%|re` syntax for higher performance.
kno10 Jun 9, 2017
7cb3a96
Optimize German rules by using `%|` notation.
kno10 Jun 9, 2017
3fee6f3
Tweak patterns a little bit to match more.
kno10 Jun 9, 2017
4d2665b
Improve matching of `1940/1941` etc.
kno10 Jun 9, 2017
043bc1f
Cleanup and simplification.
kno10 Jun 9, 2017
6918ea6
Have TreeTagger provide sentence numbers, use in disambiguation.
kno10 Jun 12, 2017
8358b70
Try to resolve handling of e.g. "next weekend"
kno10 Jun 19, 2017
31ca504
Negation typo.
kno10 Jun 19, 2017
741ce1b
Normalization fixes.
kno10 Jun 19, 2017
59e05ae
Robustness.
kno10 Jun 20, 2017
8eabeb7
Change combination syntax: %(reA|reB|reC) instead of %reA%|reB%|reC
kno10 Aug 1, 2017
63d6c2e
Trivial code cleanup. Bug fix, use equals() not == for strings.
kno10 Aug 1, 2017
cef6b25
Reduce code duplication.
kno10 Aug 1, 2017
70b350d
Match both centuries in "19th and early 20th century".
kno10 Dec 1, 2017
b166d1f
re-enable the use of document creation time
kno10 Apr 23, 2018
42c1d60
add back proxy constants, for improved compatibility
kno10 Apr 23, 2018
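Several of the commits above ("Use precompiled matchers", "Optimize, by avoiding recompiling patterns", "Replace "finalize" with a cache for re patterns") describe one idea: compile each regular expression once and reuse it. The following is a minimal sketch of that idea, not the code from this pull request; the class `PatternCacheSketch` and its methods are hypothetical names chosen for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not taken from the pull request. */
final class PatternCacheSketch {
    // Compiled patterns keyed by their regex source, so each regex is compiled only once.
    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    /** Returns the compiled pattern for a regex, compiling it on first use only. */
    static Pattern compiled(String regex) {
        return CACHE.computeIfAbsent(regex, Pattern::compile);
    }

    /** Counts non-overlapping matches of the regex in the text, using the cached pattern. */
    static int countMatches(String regex, CharSequence text) {
        Matcher m = compiled(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String yearRegex = "\\b(?:19|20)\\d\\d\\b";
        System.out.println(countMatches(yearRegex, "Between 1998 and 2004, and again in 2017."));
        // A second lookup returns the same Pattern instance instead of recompiling.
        System.out.println(compiled(yearRegex) == compiled(yearRegex));
    }
}
```

The cache only removes repeated `Pattern.compile` calls; per the commit "Bottleneck now appears to be `Matcher.find`", the matching itself still has to be made cheaper by other means.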
52 changes: 46 additions & 6 deletions pom.xml
@@ -4,7 +4,7 @@

<groupId>com.github.heideltime</groupId>
<artifactId>heideltime</artifactId>
-<version>2.2.1</version>
+<version>2.2.2-SNAPSHOT</version>

<name>HeidelTime</name>
<description>HeidelTime is a multilingual cross-domain temporal tagger that extracts temporal expressions from documents and normalizes them according to the TIMEX3 annotation standard.</description>
@@ -24,6 +24,8 @@

<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
+<maven.test.skip>true</maven.test.skip>
+<gpg.skip>true</gpg.skip>
</properties>

<developers>
@@ -50,6 +52,8 @@
<build>
<sourceDirectory>src</sourceDirectory>
<outputDirectory>${basedir}/class</outputDirectory>
+<testSourceDirectory>test</testSourceDirectory>
+<testOutputDirectory>${basedir}/testclass</testOutputDirectory>
<resources>
<resource>
<directory>${basedir}</directory>
@@ -70,8 +74,17 @@
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
-<source>1.7</source>
-<target>1.7</target>
+<source>1.8</source>
+<target>1.8</target>
</configuration>
</plugin>
+<plugin>
+<groupId>org.apache.maven.plugins</groupId>
+<artifactId>maven-dependency-plugin</artifactId>
+<configuration>
+<outputDirectory>
+${project.build.directory}/lib
+</outputDirectory>
+</configuration>
+</plugin>
<plugin>
@@ -183,21 +196,21 @@
<dependency>
<groupId>org.apache.uima</groupId>
<artifactId>uimaj-core</artifactId>
-<version>2.8.1</version>
+<version>2.10.2</version>
<scope>provided</scope>
</dependency>
<!-- for the StanfordPOSTaggerWrapper -->
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
-<version>3.3.1</version>
+<version>3.8.0</version>
<scope>provided</scope>
</dependency>
<!-- these are for JVnTextPro -->
<dependency>
<groupId>args4j</groupId>
<artifactId>args4j</artifactId>
-<version>2.32</version>
+<version>2.33</version>
<scope>provided</scope>
</dependency>
<dependency>
@@ -206,5 +219,32 @@
<version>0.1</version>
<scope>provided</scope>
</dependency>
+<!-- Logging facade -->
+<dependency>
+<groupId>org.slf4j</groupId>
+<artifactId>slf4j-api</artifactId>
+<version>1.7.25</version>
+<scope>provided</scope>
+</dependency>
+<!-- Default logging - you can mask these -->
+<dependency>
+<groupId>ch.qos.logback</groupId>
+<artifactId>logback-core</artifactId>
+<version>[1.2.3,)</version>
+<scope>provided</scope>
+</dependency>
+<dependency>
+<groupId>ch.qos.logback</groupId>
+<artifactId>logback-classic</artifactId>
+<version>[1.2.3,)</version>
+<scope>provided</scope>
+</dependency>
+<!-- JUnit4 testing -->
+<dependency>
+<groupId>junit</groupId>
+<artifactId>junit</artifactId>
+<version>[4.12,5)</version>
+<scope>test</scope>
+</dependency>
</dependencies>
</project>
96 changes: 43 additions & 53 deletions resources/english/normalization/resources_normalization_normDay.txt
@@ -1,52 +1,42 @@
// author: Jannik Strötgen
-// email: stroetgen@uni-hd.de
+// email: stroetgen@uni-hd\.de
// date: 2011-06-10
// This file contains "day words" and their normalized expressions
-// according to TIMEX3 format.
+// according to TIMEX3 format\.
// For example, the normalized value of "first" is "01"
// FORMAT: "day-word","normalized-day-word"
-"0","00"
-"00","00"
-"1","01"
-"01","01"
-"2","02"
-"02","02"
-"3","03"
-"03","03"
-"4","04"
-"04","04"
-"5","05"
-"05","05"
-"6","06"
-"06","06"
-"7","07"
-"07","07"
-"8","08"
-"08","08"
-"9","09"
-"09","09"
-"10","10"
-"11","11"
-"12","12"
-"13","13"
-"14","14"
-"15","15"
-"16","16"
-"17","17"
-"18","18"
-"19","19"
-"20","20"
-"21","21"
-"22","22"
-"23","23"
-"24","24"
-"25","25"
-"26","26"
-"27","27"
-"28","28"
-"29","29"
-"30","30"
-"31","31"
+"00?\.?","00"
+"0?1\.?","01"
+"0?2\.?","02"
+"0?3\.?","03"
+"0?4\.?","04"
+"0?5\.?","05"
+"0?6\.?","06"
+"0?7\.?","07"
+"0?8\.?","08"
+"0?9\.?","09"
+"10\.?","10"
+"11\.?","11"
+"12\.?","12"
+"13\.?","13"
+"14\.?","14"
+"15\.?","15"
+"16\.?","16"
+"17\.?","17"
+"18\.?","18"
+"19\.?","19"
+"20\.?","20"
+"21\.?","21"
+"22\.?","22"
+"23\.?","23"
+"24\.?","24"
+"25\.?","25"
+"26\.?","26"
+"27\.?","27"
+"28\.?","28"
+"29\.?","29"
+"30\.?","30"
+"31\.?","31"
//
"first","01"
"second","02"
@@ -115,15 +105,15 @@
"Thirtieth","30"
"Thirty-first","31"
//
-"1st","01"
-"2nd","02"
-"3rd","03"
-"4th","04"
-"5th","05"
-"6th","06"
-"7th","07"
-"8th","08"
-"9th","09"
+"0?1st","01"
+"0?2nd","02"
+"0?3rd","03"
+"0?4th","04"
+"0?5th","05"
+"0?6th","06"
+"0?7th","07"
+"0?8th","08"
+"0?9th","09"
"10th","10"
"11th","11"
"12th","12"
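The rewritten day entries read as regular expressions: `0?1\.?` would cover "1", "01", "1." and "01.", and `0?1st` would cover "1st" and "01st", replacing several literal keys each. Below is a small sketch of how such keys could be applied. It assumes that each key is matched against the whole token and that the first matching entry wins; the class `NormDaySketch` is a hypothetical name and this is not HeidelTime's actual lookup code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical lookup for illustration; not HeidelTime's normalization code. */
final class NormDaySketch {
    // A few entries copied from the diff. Treating the keys as regexes is an assumption.
    private static final Map<Pattern, String> NORM_DAY = new LinkedHashMap<>();
    static {
        NORM_DAY.put(Pattern.compile("0?1\\.?"), "01");
        NORM_DAY.put(Pattern.compile("0?9\\.?"), "09");
        NORM_DAY.put(Pattern.compile("10\\.?"), "10");
        NORM_DAY.put(Pattern.compile("0?1st"), "01");
        NORM_DAY.put(Pattern.compile("0?3rd"), "03");
    }

    /** Returns the normalized day, or null if no key matches the whole token. */
    static String normalize(String token) {
        for (Map.Entry<Pattern, String> e : NORM_DAY.entrySet()) {
            if (e.getKey().matcher(token).matches()) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        for (String token : new String[] {"1", "01.", "03rd", "10.", "001"}) {
            System.out.println(token + " -> " + normalize(token));
        }
    }
}
```

With whole-token matching, "001" stays unmatched, so the optional leading zero does not widen the keys beyond two digits.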
@@ -19,3 +19,5 @@
"Friday","5"
"Saturday","6"
"Sunday","7"
+// Popular spelling mistakes
+"[Ww]e[dn][nd]e?sday","3"
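The misspelling pattern can be checked in isolation. The sketch below assumes the key is applied as a whole-token regular expression; `WeekdayTypoSketch` is a hypothetical name used only for this illustration.

```java
import java.util.regex.Pattern;

/** Hypothetical check for illustration; only the regex itself comes from the diff. */
final class WeekdayTypoSketch {
    static final Pattern WEDNESDAY = Pattern.compile("[Ww]e[dn][nd]e?sday");

    /** True if the whole token is accepted by the misspelling pattern. */
    static boolean isWednesday(String token) {
        return WEDNESDAY.matcher(token).matches();
    }

    public static void main(String[] args) {
        String[] tokens = {"Wednesday", "wednesday", "Wendesday", "Wednsday", "Wensday", "Weddesday"};
        for (String token : tokens) {
            System.out.println(token + " -> " + isWednesday(token));
        }
    }
}
```

Because both character classes are `[dn]` / `[nd]`, the pattern accepts the swapped "Wendesday" and the shortened "Wednsday", but also doubled forms such as "Weddesday"; "Wensday", which drops a letter, is not covered.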