feat: round numbers to reduce undeterministic behavior #3740

badGarnet · 2024-10-19T20:09:29Z

This PR rounds the floating-point coordinates in pdfminer_processing.py. This helps eliminate nondeterminism in bounding box overlap detection caused by machine-precision noise. Currently the rounding is set to the machine precision reported by np.finfo(float), which describes the 64-bit float type (not np.float32) and yields resolution = 1e-15.
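As a rough illustration of the idea (not the actual change in pdfminer_processing.py), the sketch below rounds coordinates to the number of decimal digits that np.finfo(float) reports. The names `DECIMALS` and `round_coords` are hypothetical and exist only for this example.

```python
import numpy as np

# np.finfo(float) describes float64; its `precision` is 15 decimal digits,
# which corresponds to a resolution of 1e-15.
DECIMALS = np.finfo(float).precision


def round_coords(coords, decimals=DECIMALS):
    """Round an array-like of coordinates to a fixed number of decimals.

    Hypothetical helper for illustration; the PR's real code may differ.
    """
    return np.round(np.asarray(coords, dtype=float), decimals)


# Values that differ only by floating-point noise compare equal after rounding.
noisy = round_coords([0.1 + 0.2])
clean = round_coords([0.3])
print(noisy[0] == clean[0])
```

Rounding before comparison means two bounding box edges that differ only in the last bits no longer flip an overlap check from one run to the next.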

future work

We should reduce the rounding to only 6 digits after the decimal point, since the float32 data type has a resolution of only 1e-6. However, doing so would break tests. A follow-up is required to tune the threshold values in pdfminer_processing.py so that they work with 1e-6 resolution.
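To make the difference between the two resolutions concrete, here is a small sketch using made-up coordinate values (they are assumptions, not taken from the test suite). It shows that values kept distinct at 15 decimals collapse to the same number at 6 decimals, which is one plausible reason existing thresholds would need retuning.

```python
import numpy as np

# Decimal digits each type can represent reliably.
print(np.finfo(np.float64).precision)  # 15
print(np.finfo(np.float32).precision)  # 6

# Two hypothetical coordinates that differ by far less than 1e-6.
a, b = 100.00000001, 100.00000002

# At 15 decimals the two values stay distinct; at 6 decimals they merge.
same_at_15 = bool(np.round(a, 15) == np.round(b, 15))
same_at_6 = bool(np.round(a, 6) == np.round(b, 6))
print(same_at_15, same_at_6)
```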

pawel-kmiecik

LGTM

badGarnet requested review from scanny and pawel-kmiecik October 20, 2024 21:46

badGarnet marked this pull request as ready for review October 20, 2024 21:46

pawel-kmiecik approved these changes Oct 21, 2024

badGarnet added this pull request to the merge queue Oct 21, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 21, 2024

scanny added this pull request to the merge queue Oct 21, 2024

github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Oct 21, 2024

badGarnet added 3 commits October 21, 2024 10:30

feat: round numbers to reduce undeterministic behavior

7343af4

feat: set round to nearest machine precision

33a0c4f

chore: update changelog and bump version

669d717

scanny force-pushed the feat/round-floating-point-number-before-computation branch from 826186b to 669d717 Compare October 21, 2024 17:34

scanny enabled auto-merge October 21, 2024 17:35

scanny added this pull request to the merge queue Oct 21, 2024

Merged via the queue into main with commit e764bc5 Oct 21, 2024
41 checks passed

scanny deleted the feat/round-floating-point-number-before-computation branch October 21, 2024 18:56
