Merge branch 'main' into sebastian/fix-title-depth
LaverdeS authored Oct 4, 2023
2 parents cdced8e + 0a65fc2 commit 40ea340
Showing 39 changed files with 1,030 additions and 298 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -2,6 +2,7 @@

### Enhancements

* **Adds XLSX document level language detection** Building on the language detection functionality in the previous release, we now support language detection within `.xlsx` files at the Element level.
* **bump `unstructured-inference` to `0.6.6`** The updated version of `unstructured-inference` makes table extraction in `hi_res` mode configurable to fine-tune table extraction performance; it also improves element detection by adding a deduplication post-processing step in the `hi_res` partitioning of PDFs and images.
* **Detect text in HTML Heading Tags as Titles** This will increase the accuracy of hierarchies in HTML documents and provide more accurate element categorization. If text is in an HTML heading tag and is not a list item, address, or narrative text, categorize it as a title.
* **Update python-based docs** Refactor docs to use the actual unstructured code rather than using the subprocess library to run the CLI command itself.
@@ -10,7 +11,7 @@
* **Expose endpoint url for s3 connectors** Allowing the endpoint URL to be explicitly overridden makes it possible to support any non-AWS data provider that implements the S3 protocol (e.g. MinIO).
* **change default `hi_res` model for pdf/image partition to `yolox`** Partitioning a pdf/image with the `hi_res` strategy now uses the `yolox_quantized` model instead of the `detectron2_onnx` model. This new default model has better recall for tables and produces more detailed categories for elements.

### Features
* **XLSX can now read subtables within one sheet** Problem: Many .xlsx files are not created to be read as one full table per sheet. A sheet may contain subtables, text, and headers, along with other information to extract. Feature: `partition_xlsx` can now read subtable(s) within one .xlsx sheet, and also extracts titles and narrative text. Importance: This extends .xlsx reading beyond one table per sheet, allowing users to capture more data tables from a file when they exist.

### Fixes

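The Features entry above describes reading several subtables and loose text from a single sheet. As a rough illustration of that idea only, the sketch below splits an in-memory grid of cells into blocks separated by fully empty rows and labels single-cell blocks as text. This is not the `partition_xlsx` implementation; the helper names (`is_blank`, `split_blocks`, `classify`) and the blank-row heuristic are assumptions made for the example.

```python
from typing import List, Optional

Cell = Optional[str]
Row = List[Cell]


def is_blank(cell: Cell) -> bool:
    """A cell counts as blank if it is None or only whitespace."""
    return cell is None or str(cell).strip() == ""


def split_blocks(rows: List[Row]) -> List[List[Row]]:
    """Group consecutive non-empty rows; a fully blank row ends a block."""
    blocks: List[List[Row]] = []
    current: List[Row] = []
    for row in rows:
        if all(is_blank(cell) for cell in row):
            if current:
                blocks.append(current)
                current = []
        else:
            current.append(row)
    if current:
        blocks.append(current)
    return blocks


def classify(block: List[Row]) -> str:
    """Hypothetical rule: one filled cell is text, anything more is a table."""
    filled = [cell for row in block for cell in row if not is_blank(cell)]
    return "text" if len(filled) == 1 else "table"


# A toy sheet: a title line followed by two separate subtables.
sheet: List[Row] = [
    ["Quarterly report", None, None],
    [None, None, None],
    ["Region", "Q1", "Q2"],
    ["North", "10", "12"],
    [None, None, None],
    ["Product", "Units", None],
    ["Widget", "7", None],
]

blocks = split_blocks(sheet)
kinds = [classify(block) for block in blocks]
print(kinds)
```

A real sheet would also need column-wise gaps and merged cells handled, which this sketch ignores.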
4 changes: 2 additions & 2 deletions docs/requirements.txt
@@ -18,7 +18,7 @@ certifi==2023.7.22
# -c requirements/constraints.in
# -r requirements/build.in
# requests
charset-normalizer==3.2.0
charset-normalizer==3.3.0
# via
# -c requirements/base.txt
# requests
@@ -54,7 +54,7 @@ mdurl==0.1.2
# via markdown-it-py
myst-parser==2.0.0
# via -r requirements/build.in
packaging==23.1
packaging==23.2
# via
# -c requirements/base.txt
# sphinx
Binary file not shown.
Binary file added example-docs/vodafone.xlsx
Binary file not shown.
4 changes: 2 additions & 2 deletions requirements/base.txt
@@ -12,7 +12,7 @@ certifi==2023.7.22
# requests
chardet==5.2.0
# via -r requirements/base.in
charset-normalizer==3.2.0
charset-normalizer==3.3.0
# via requests
click==8.1.7
# via nltk
@@ -40,7 +40,7 @@ numpy==1.24.4
# via
# -c requirements/constraints.in
# -r requirements/base.in
packaging==23.1
packaging==23.2
# via marshmallow
python-iso639==2023.6.15
# via -r requirements/base.in
4 changes: 2 additions & 2 deletions requirements/build.txt
@@ -18,7 +18,7 @@ certifi==2023.7.22
# -c requirements/constraints.in
# -r requirements/build.in
# requests
charset-normalizer==3.2.0
charset-normalizer==3.3.0
# via
# -c requirements/base.txt
# requests
@@ -54,7 +54,7 @@ mdurl==0.1.2
# via markdown-it-py
myst-parser==2.0.0
# via -r requirements/build.in
packaging==23.1
packaging==23.2
# via
# -c requirements/base.txt
# sphinx
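The same two pins (`charset-normalizer` and `packaging`) are bumped in each requirements file in this commit. One way to check such bumps mechanically is sketched below, using only the standard library. The helper names `parse_pins` and `changed_pins` are hypothetical, and the parser only understands simple `name==version` lines, not the full requirements file syntax.

```python
from typing import Dict, Tuple


def parse_pins(text: str) -> Dict[str, str]:
    """Collect `name==version` pins, skipping blank and comment lines."""
    pins: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def changed_pins(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {name: (old_version, new_version)} for pins present in both."""
    return {
        name: (old[name], new[name])
        for name in sorted(old.keys() & new.keys())
        if old[name] != new[name]
    }


# Excerpts modelled on the before/after state of the requirements files above.
old_text = """
certifi==2023.7.22
charset-normalizer==3.2.0
    # via requests
packaging==23.1
    # via marshmallow
"""
new_text = """
certifi==2023.7.22
charset-normalizer==3.3.0
    # via requests
packaging==23.2
    # via marshmallow
"""

changes = changed_pins(parse_pins(old_text), parse_pins(new_text))
for name, (before, after) in changes.items():
    print(f"{name}: {before} -> {after}")
```

Running the same comparison against each requirements file would confirm that the bumps stay consistent across them.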