-
Notifications
You must be signed in to change notification settings - Fork 743
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Klaijan/fix: isalnum reference before assignment #1586
Conversation
unstructured/partition/pdf.py
Outdated
if index == 0: | ||
if not set_alnum: | ||
isalnum = char.isalnum() | ||
set_alnum = True | ||
|
||
if char.isalnum() != isalnum: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe think about this problem from a different angle? What does isalnum
's intention? It seems that if a char
's check is not the same as the last check we add a word? So on first pass we always expect the next condition to pass at line 1113 and word would be not empty after that.
So in other words if word is empty we know we don't need to check at line 1113 so we can replace the highlighted lines with
if not word:
isalnum = char.isalnum()
if word and char.isalnum() != isalnum:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commented on the code; also can we add a test that would fail without the change please (that triggers the bug)
@badGarnet will require us to find a document with url and first element of the object is not |
we can mock a doc to trigger this condition |
isalnum = char.isalnum() | ||
|
||
if char.isalnum() != isalnum: | ||
if word and char.isalnum() != isalnum: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if i'm following this logic correctly, can this be simplified to:
isalnum = char.isalnum()
if word and isalnum == False:
....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This condition here is to use isalnum
flag to help us split the word
by alphanumeric and non-alphanumeric property in the text sentence. For example, Lorem ipsum dolor sit amet, consectetur "adipiscing" elit.
would be split as
Lorem
ipsum
dolor
sit
amet
,
consectetur
"
adipiscing
"
elit
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Klaijan Did you got a test PDF to test the function? if not https://documentation.nokia.com/sr/23-7-2/pdf/Interface_Configuration_Guide_23.7.R1.pdf
LGTM assuming problematic .pdf passes. Also, it would be worth extracting the problematic page and adding it is a test. The page could be extracted with a command line:
|
Hi, I'm facing the same issue. When will this be merged to main branch? I see updated code in your repo but not in the main branch. |
I am still experiencing this issue. Are any updates or reinstallations needed on our end? Unfortunately, I can't share the document due to corporate policy. File "C:\Users\przybv\Anaconda3\envs\sentiment\Lib\site-packages\unstructured\partition\pdf.py", line 1004, in get_word_bounding_box_from_element |
Hi @vprzybylo please try to pull the latest version from main and rerun your document again. Do not hesitate to reach out if the problem still persist for you! |
Hi @Klaijan , I am using AWS Glue to run our code, hence need to do pip install unstructured. but still it shows the same issue for me. local variable 'isalnum' referenced before assignment. can you please help what to do ? Thanks |
@anjanikant06 Then please stay tuned for our new pypi release to update the package which will include the bug fix for |
@anjanikant06 actually we just released new version 0.10.19. Please try to upgrade the package :) |
Executive Summary
Fix bug on the
get_word_bounding_box_from_element
function that preventpartition_pdf
to run.Technical Details
isalnum
on the first index. Now switched to conditional on flag value.