feat: enhance word extraction from PDFMiner objects
christinestraub committed Oct 30, 2024
1 parent d2332ca commit 6fa0c09
Showing 1 changed file with 5 additions and 0 deletions.
5 changes: 5 additions & 0 deletions unstructured/partition/pdf_image/pdfminer_processing.py
@@ -575,6 +575,11 @@ def get_words_from_obj(
                     y2 = height - character.y0

                 word += char
+            else:
+                words.append(
+                    {"text": word, "bbox": (x1, y1, x2, y2), "start_index": start_index},
+                )
+                word = ""
         text_len += len(text_line)
     return characters, words
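
The hunk above appears to close the current word whenever the item in a text line is not a character object. Below is a minimal, self-contained sketch of that idea, not the library's implementation. `FakeChar`, `FakeAnno`, and `split_words` are hypothetical names introduced only for illustration (stand-ins for PDFMiner layout items), and the `elif word:` guard is an added assumption: the hunk itself appends unconditionally.

```python
from dataclasses import dataclass


@dataclass
class FakeChar:
    """Hypothetical stand-in for a glyph with a bounding box (PDF coordinates)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class FakeAnno:
    """Hypothetical stand-in for a non-character item (e.g. an inserted space)."""
    text: str


def split_words(text_line, height):
    """Collect words from one line, closing a word at any non-character item."""
    words = []
    word = ""
    x1 = y1 = x2 = y2 = None
    start_index = 0
    for index, item in enumerate(text_line):
        if isinstance(item, FakeChar):
            if not word:
                # The first character of a word fixes its left and top edges.
                start_index = index
                x1 = item.x0
                y1 = height - item.y1
            # Every character extends the right and bottom edges.
            x2 = item.x1
            y2 = height - item.y0
            word += item.text
        elif word:
            # Assumption: skip empty words; the hunk has a plain `else:` here.
            words.append(
                {"text": word, "bbox": (x1, y1, x2, y2), "start_index": start_index},
            )
            word = ""
    return words


line = [
    FakeChar("H", 0, 0, 5, 10),
    FakeChar("i", 5, 0, 8, 10),
    FakeAnno(" "),
    FakeChar("x", 12, 0, 17, 10),
    FakeAnno("\n"),
]
for entry in split_words(line, height=100):
    print(entry)
```

Without the non-character branch, "Hi" and "x" in this sketch would be merged into one word and the trailing word would never be appended, which seems to be the case the commit addresses.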
