bump unstructured-inference (#3711)
This PR bumps `unstructured-inference` to `0.8.0`, which introduces a
vectorized data structure for layout elements and text regions.
This PR also cleans up a few places in CI that have repeated definitions
of env variables or are missing installation of testing dependencies in
the cache.

A few document ingest results have changed:
- two changes for `biomed-api` (actually processed locally on the runner) are
due to very small changes in the numerical results of the bounding box
areas: one results in a duplicated page number/header and the other
results in the deduplication of a word in a sentence that starts on a new
line (yes, the two cases go in opposite directions).
- the layout parser paper now outputs the code lines with the page number
inside the code box as list items
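
The sensitivity described in the first bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the `unstructured` or `unstructured-inference` implementation: the `Box` type, the `is_mostly_inside` helper, the coordinates, and the 0.5 threshold are all hypothetical. It only shows how a tiny change in a computed area ratio can flip a thresholded decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box (hypothetical type, for illustration only)."""

    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def intersection_area(a: Box, b: Box) -> float:
    """Area shared by two boxes, or 0.0 when they do not overlap."""
    width = min(a.x2, b.x2) - max(a.x1, b.x1)
    height = min(a.y2, b.y2) - max(a.y1, b.y1)
    return max(0.0, width) * max(0.0, height)


def is_mostly_inside(inner: Box, outer: Box, threshold: float = 0.5) -> bool:
    """True when more than `threshold` of `inner` lies within `outer`."""
    if inner.area == 0.0:
        return False
    return intersection_area(inner, outer) / inner.area > threshold


outer = Box(0.0, 0.0, 100.0, 100.0)
# Two nearly identical text boxes straddling the right edge of `outer`;
# they differ only by two millionths of a unit in their right coordinate.
box_a = Box(90.0, 10.0, 110.0 - 1e-6, 20.0)
box_b = Box(90.0, 10.0, 110.0 + 1e-6, 20.0)

decision_a = is_mostly_inside(box_a, outer)  # ratio just above the threshold
decision_b = is_mostly_inside(box_b, outer)  # ratio just below the threshold
```

The two decisions differ although the inputs are almost identical, which is consistent with one ingest result gaining a duplicate while another loses one.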

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: badGarnet <[email protected]>
Co-authored-by: christinestraub <[email protected]>
4 people authored Oct 21, 2024
1 parent e764bc5 commit a11ad22
Showing 23 changed files with 184 additions and 109 deletions.
5 changes: 4 additions & 1 deletion .github/actions/base-cache/action.yml
@@ -30,14 +30,17 @@ runs:
shell: bash
run: |
python${{ inputs.python-version }} -m pip install --upgrade virtualenv
python${{ inputs.python-version }} -m venv .venv
if [ ! -d ".venv" ]; then
python${{ inputs.python-version }} -m venv .venv
fi
source .venv/bin/activate
[ ! -d "$NLTK_DATA" ] && mkdir "$NLTK_DATA"
if [ "${{ inputs.python-version == '3.12' }}" == "true" ]; then
python -m ensurepip --upgrade
python -m pip install --upgrade setuptools
fi
make install-ci
make install-nltk-models
- name: Save Cache
if: steps.virtualenv-cache-restore.outputs.cache-hit != 'true'
id: virtualenv-cache-save
6 changes: 4 additions & 2 deletions .github/actions/base-ingest-cache/action.yml
@@ -18,7 +18,7 @@ runs:
path: |
.venv
nltk_data
key: unstructured-ingest-${{ runner.os }}-${{ inputs.python-version }}-${{ hashFiles('requirements/ingest/*.txt') }}-${{ hashFiles('requirements/*.txt') }}
key: unstructured-ingest-${{ runner.os }}-${{ inputs.python-version }}-${{ hashFiles('requirements/ingest/*.txt', 'requirements/*.txt') }}
lookup-only: ${{ inputs.check-only }}
- name: Set up Python ${{ inputs.python-version }}
if: steps.ingest-virtualenv-cache-restore.outputs.cache-hit != 'true'
@@ -39,6 +39,8 @@ runs:
python -m pip install --upgrade setuptools
fi
make install-ci
make install-nltk-models
make install-all-docs
make install-ingest
- name: Save Ingest Cache
if: steps.ingest-virtualenv-cache-restore.outputs.cache-hit != 'true'
@@ -48,5 +50,5 @@ runs:
path: |
.venv
nltk_data
key: unstructured-ingest-${{ runner.os }}-${{ inputs.python-version }}-${{ hashFiles('requirements/ingest/*.txt') }}-${{ hashFiles('requirements/*.txt') }}
key: unstructured-ingest-${{ runner.os }}-${{ inputs.python-version }}-${{ hashFiles('requirements/ingest/*.txt', 'requirements/*.txt') }}

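The cache-key change above replaces two `hashFiles(...)` calls with a single call that takes both patterns. GitHub documents `hashFiles` as hashing each matched file with SHA-256 and then hashing those hashes into one final digest. The sketch below imitates that idea in Python. The `hash_files` helper, its ordering rule, and the temporary file contents are assumptions made for illustration, so its digests will not match the ones GitHub produces.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_files(root: Path, *patterns: str) -> str:
    """Return one SHA-256 hex digest for all files matching any pattern."""
    # De-duplicate and sort so the result does not depend on pattern order.
    matched = sorted(
        {p for pattern in patterns for p in root.glob(pattern) if p.is_file()}
    )
    if not matched:
        return ""  # no matches: empty string, as the workflow expression does
    combined = hashlib.sha256()
    for path in matched:
        # Hash each file on its own, then fold that digest into the final one.
        combined.update(hashlib.sha256(path.read_bytes()).digest())
    return combined.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "requirements" / "ingest").mkdir(parents=True)
    # Made-up file contents, only to have something to hash.
    (root / "requirements" / "base.txt").write_text("anyio==4.6.0\n")
    (root / "requirements" / "ingest" / "example.txt").write_text("example\n")

    patterns = ("requirements/ingest/*.txt", "requirements/*.txt")
    key_before = hash_files(root, *patterns)

    # Changing any matched file changes the single combined key.
    (root / "requirements" / "base.txt").write_text("anyio==4.6.2.post1\n")
    key_after = hash_files(root, *patterns)
```

With one combined digest, the key has a single hash segment that changes whenever any file under either pattern changes.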
17 changes: 6 additions & 11 deletions .github/workflows/ci.yml
@@ -12,14 +12,15 @@ permissions:
id-token: write
contents: read

env:
NLTK_DATA: ${{ github.workspace }}/nltk_data

jobs:
setup:
strategy:
matrix:
python-version: ["3.9","3.10","3.11", "3.12"]
runs-on: ubuntu-latest
env:
NLTK_DATA: ${{ github.workspace }}/nltk_data
steps:
- uses: actions/checkout@v4
- uses: ./.github/actions/base-cache
@@ -78,8 +79,6 @@ jobs:
strategy:
matrix:
python-version: ["3.9","3.10","3.11"]
env:
NLTK_DATA: ${{ github.workspace }}/nltk_data
runs-on: ubuntu-latest
needs: [setup, changelog]
steps:
@@ -185,8 +184,6 @@ jobs:
python-version: ["3.10"]
extra: ["csv", "docx", "odt", "markdown", "pypandoc", "pdf-image", "pptx", "xlsx"]
runs-on: ubuntu-latest
env:
NLTK_DATA: ${{ github.workspace }}/nltk_data
needs: [setup, lint, test_unit_no_extras]
steps:
- uses: actions/checkout@v4
@@ -220,15 +217,14 @@ jobs:
sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-kor
tesseract --version
make install-${{ matrix.extra }}
make test-extra-${{ matrix.extra }} CI=true
setup_ingest:
strategy:
matrix:
python-version: [ "3.9","3.10" ]
runs-on: ubuntu-latest
env:
NLTK_DATA: ${{ github.workspace }}/nltk_data
needs: [setup]
steps:
- uses: actions/checkout@v4
@@ -307,7 +303,6 @@ jobs:
MXBAI_API_KEY: ${{secrets.MXBAI_API_KEY}}
OCR_AGENT: "unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract"
CI: "true"
NLTK_DATA: ${{ github.workspace }}/nltk_data
PYTHON: python${{ matrix.python-version }}
run: |
source .venv/bin/activate
@@ -320,6 +315,8 @@ jobs:
sudo apt-get install -y tesseract-ocr-kor
sudo apt-get install diffstat
tesseract --version
make install-all-docs
make install-ingest
./test_unstructured_ingest/test-ingest-src.sh
@@ -329,8 +326,6 @@ jobs:
# NOTE(yuming): Unstructured API only use Python 3.10
python-version: ["3.10"]
runs-on: ubuntu-latest
env:
NLTK_DATA: ${{ github.workspace }}/nltk_data
needs: [setup, lint]
steps:
- uses: actions/checkout@v4
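
The `ci.yml` hunks above define `NLTK_DATA` once at the workflow level instead of repeating it in each job. GitHub Actions documents that when the same variable is defined at several levels, the most specific one wins (step over job over workflow). The sketch below models that resolution order. `resolve_env` and the placeholder path are hypothetical, and this is not how the runner itself is implemented.

```python
from typing import Mapping, Optional


def resolve_env(
    workflow_env: Mapping[str, str],
    job_env: Optional[Mapping[str, str]] = None,
    step_env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Merge env mappings; more specific scopes override broader ones."""
    resolved = dict(workflow_env)
    resolved.update(job_env or {})
    resolved.update(step_env or {})
    return resolved


workflow_env = {"NLTK_DATA": "/workspace/nltk_data"}  # placeholder path

# A job with no `env:` block of its own still sees the workflow-level value.
job_without_env = resolve_env(workflow_env)

# A step can add its own variables without repeating the shared one.
ingest_step_env = resolve_env(workflow_env, step_env={"CI": "true"})

# If a narrower scope redefines a name, the narrower value is used.
overridden = resolve_env(workflow_env, job_env={"NLTK_DATA": "/tmp/nltk_data"})
```

Under this model, the per-job `NLTK_DATA` entries were redundant because they repeated the value every job would inherit from the workflow level.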
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,7 +1,8 @@
-## 0.16.1-dev5
+## 0.16.1-dev6

### Enhancements

+* **Bump `unstructured-inference` to 0.7.39** and upgrade other dependencies
* **Round coordinates** Round coordinates when computing bounding box overlaps in `pdfminer_processing.py` to nearest machine precision. This can help reduce underterministic behavior from machine precision that affects which bounding boxes to combine.

### Features
10 changes: 5 additions & 5 deletions requirements/base.txt
@@ -4,7 +4,7 @@
#
# pip-compile ./base.in
#
-anyio==4.6.0
+anyio==4.6.2.post1
# via httpx
backoff==2.2.1
# via -r ./base.in
@@ -20,15 +20,15 @@ cffi==1.17.1
# via cryptography
chardet==5.2.0
# via -r ./base.in
-charset-normalizer==3.3.2
+charset-normalizer==3.4.0
# via
# requests
# unstructured-client
click==8.1.7
# via
# nltk
# python-oxmsg
-cryptography==43.0.1
+cryptography==43.0.3
# via unstructured-client
dataclasses-json==0.6.7
# via
@@ -62,7 +62,7 @@ langdetect==1.0.9
# via -r ./base.in
lxml==5.3.0
# via -r ./base.in
-marshmallow==3.22.0
+marshmallow==3.23.0
# via
# dataclasses-json
# unstructured-client
@@ -84,7 +84,7 @@ packaging==24.1
# via
# marshmallow
# unstructured-client
-psutil==6.0.0
+psutil==6.1.0
# via -r ./base.in
pycparser==2.22
# via cffi
8 changes: 4 additions & 4 deletions requirements/dev.txt
@@ -4,7 +4,7 @@
#
# pip-compile ./dev.in
#
-build==1.2.2
+build==1.2.2.post1
# via pip-tools
cfgv==3.4.0
# via pre-commit
@@ -13,7 +13,7 @@ click==8.1.7
# -c ./base.txt
# -c ./test.txt
# pip-tools
-distlib==0.3.8
+distlib==0.3.9
# via virtualenv
filelock==3.16.1
# via virtualenv
@@ -36,7 +36,7 @@ platformdirs==4.3.6
# via
# -c ./test.txt
# virtualenv
-pre-commit==3.8.0
+pre-commit==4.0.1
# via -r ./dev.in
pyproject-hooks==1.2.0
# via
@@ -51,7 +51,7 @@ tomli==2.0.2
# -c ./test.txt
# build
# pip-tools
-virtualenv==20.26.6
+virtualenv==20.27.0
# via pre-commit
wheel==0.44.0
# via pip-tools
2 changes: 1 addition & 1 deletion requirements/extra-epub.txt
@@ -4,5 +4,5 @@
#
# pip-compile ./extra-epub.in
#
-pypandoc==1.13
+pypandoc==1.14
# via -r ./extra-epub.in
2 changes: 1 addition & 1 deletion requirements/extra-odt.txt
@@ -8,7 +8,7 @@ lxml==5.3.0
# via
# -c ./base.txt
# python-docx
-pypandoc==1.13
+pypandoc==1.14
# via -r ./extra-odt.in
python-docx==1.1.2
# via -r ./extra-odt.in
12 changes: 6 additions & 6 deletions requirements/extra-paddleocr.txt
@@ -4,7 +4,7 @@
#
# pip-compile ./extra-paddleocr.in
#
-anyio==4.6.0
+anyio==4.6.2.post1
# via
# -c ./base.txt
# httpx
@@ -16,7 +16,7 @@ certifi==2024.8.30
# httpcore
# httpx
# requests
-charset-normalizer==3.3.2
+charset-normalizer==3.4.0
# via
# -c ./base.txt
# requests
@@ -52,7 +52,7 @@ idna==3.10
# anyio
# httpx
# requests
-imageio==2.35.1
+imageio==2.36.0
# via
# imgaug
# scikit-image
@@ -104,7 +104,7 @@ paddlepaddle==3.0.0b1
# via -r ./extra-paddleocr.in
pdf2image==1.17.0
# via unstructured-paddleocr
-pillow==10.4.0
+pillow==11.0.0
# via
# imageio
# imgaug
@@ -117,9 +117,9 @@ protobuf==4.25.5
# via
# -c ././deps/constraints.txt
# paddlepaddle
-pyclipper==1.3.0.post5
+pyclipper==1.3.0.post6
# via unstructured-paddleocr
-pyparsing==3.1.4
+pyparsing==3.2.0
# via matplotlib
python-dateutil==2.9.0.post0
# via
2 changes: 1 addition & 1 deletion requirements/extra-pandoc.txt
@@ -4,5 +4,5 @@
#
# pip-compile ./extra-pandoc.in
#
-pypandoc==1.13
+pypandoc==1.14
# via -r ./extra-pandoc.in
2 changes: 1 addition & 1 deletion requirements/extra-pdf-image.in
@@ -11,5 +11,5 @@ google-cloud-vision
effdet
# Do not move to constraints.in, otherwise unstructured-inference will not be upgraded
# when unstructured library is.
-unstructured-inference==0.7.36
+unstructured-inference==0.8.0
unstructured.pytesseract>=0.3.12