Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Klaijan/fix: isalnum reference before assignment #1586

Merged
merged 11 commits into from
Oct 3, 2023
Merged

Conversation

Klaijan
Copy link
Contributor

@Klaijan Klaijan commented Sep 29, 2023

Executive Summary
Fix bug on the get_word_bounding_box_from_element function that prevent partition_pdf to run.

Technical Details

  • The function originally first define isalnum on the first index. Now switched to conditional on flag value.

Comment on lines 1005 to 1113
if index == 0:
if not set_alnum:
isalnum = char.isalnum()
set_alnum = True

if char.isalnum() != isalnum:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe think about this problem from a different angle? What does isalnum 's intention? It seems that if a char's check is not the same as the last check we add a word? So on first pass we always expect the next condition to pass at line 1113 and word would be not empty after that.
So in other words if word is empty we know we don't need to check at line 1113 so we can replace the highlighted lines with

if not word:
    isalnum = char.isalnum()
if word and char.isalnum() != isalnum:

Copy link
Collaborator

@badGarnet badGarnet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commented on the code; also can we add a test that would fail without the change please (that triggers the bug)

@Klaijan
Copy link
Contributor Author

Klaijan commented Sep 29, 2023

commented on the code; also can we add a test that would fail without the change please (that triggers the bug)

@badGarnet will require us to find a document with url and first element of the object is not LTChar. I will ask for the document sample from community Slack. But otherwise, I can come back to find the document later late night today or tomorrow.

@badGarnet
Copy link
Collaborator

commented on the code; also can we add a test that would fail without the change please (that triggers the bug)

@badGarnet will require us to find a document with url and first element of the object is not LTChar. I will ask for the document sample from community Slack. But otherwise, I can come back to find the document later late night today or tomorrow.

we can mock a doc to trigger this condition

isalnum = char.isalnum()

if char.isalnum() != isalnum:
if word and char.isalnum() != isalnum:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if i'm following this logic correctly, can this be simplified to:

isalnum = char.isalnum()
if word and isalnum == False:
    ....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition here is to use isalnum flag to help us split the word by alphanumeric and non-alphanumeric property in the text sentence. For example, Lorem ipsum dolor sit amet, consectetur "adipiscing" elit. would be split as

Lorem 
ipsum 
dolor 
sit 
amet
, 
consectetur
"
adipiscing
" 
elit
.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cragwolfe
Copy link
Contributor

@Klaijan Did you got a test PDF to test the function? if not https://documentation.nokia.com/sr/23-7-2/pdf/Interface_Configuration_Guide_23.7.R1.pdf

LGTM assuming problematic .pdf passes.

Also, it would be worth extracting the problematic page and adding it is a test. The page could be extracted with a command line:

qpdf --empty --pages orignal.pdf 89 -- ~/tmp/docs/single-page-p89.pdf

@Rmaram07
Copy link

Rmaram07 commented Oct 3, 2023

Hi, I'm facing the same issue. When will this be merged to main branch? I see updated code in your repo but not in the main branch.

@Klaijan Klaijan merged commit d6efd52 into main Oct 3, 2023
39 checks passed
@Klaijan Klaijan deleted the klaijan/pdf-isalnum-bug branch October 3, 2023 15:25
@vprzybylo
Copy link

I am still experiencing this issue. Are any updates or reinstallations needed on our end? Unfortunately, I can't share the document due to corporate policy.

File "C:\Users\przybv\Anaconda3\envs\sentiment\Lib\site-packages\unstructured\partition\pdf.py", line 1004, in get_word_bounding_box_from_element
if char.isalnum() != isalnum:

@Klaijan
Copy link
Contributor Author

Klaijan commented Oct 3, 2023

Hi @vprzybylo please try to pull the latest version from main and rerun your document again. Do not hesitate to reach out if the problem still persist for you!

@anjanikant06
Copy link

Hi @Klaijan , I am using AWS Glue to run our code, hence need to do pip install unstructured. but still it shows the same issue for me. local variable 'isalnum' referenced before assignment. can you please help what to do ? Thanks

@Klaijan
Copy link
Contributor Author

Klaijan commented Oct 5, 2023

@anjanikant06 Then please stay tuned for our new pypi release to update the package which will include the bug fix for isalnum issue. Thank you!

@Klaijan
Copy link
Contributor Author

Klaijan commented Oct 5, 2023

@anjanikant06 actually we just released new version 0.10.19. Please try to upgrade the package :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants