Commit

Merge branch 'main' into metadata-docs
ron-unstructured authored Oct 4, 2023
2 parents 12eac85 + 9960ce5 commit a786631
Showing 42 changed files with 1,081 additions and 301 deletions.
6 changes: 4 additions & 2 deletions CHANGELOG.md
@@ -1,7 +1,8 @@
-## 0.10.19-dev10
+## 0.10.19-dev11

### Enhancements

* **Adds XLSX document level language detection** Building on the language detection functionality in the previous release, we now support language detection within `.xlsx` files at the Element level.
* **bump `unstructured-inference` to `0.6.6`** The updated version of `unstructured-inference` makes table extraction in `hi_res` mode configurable to fine tune table extraction performance; it also improves element detection by adding a deduplication post processing step in the `hi_res` partitioning of pdfs and images.
* **Detect text in HTML Heading Tags as Titles** This will increase the accuracy of hierarchies in HTML documents and provide more accurate element categorization. If text is in an HTML heading tag and is not a list item, address, or narrative text, categorize it as a title.
* **Update python-based docs** Refactor docs to use the actual unstructured code rather than using the subprocess library to run the cli command itself.
@@ -10,7 +11,7 @@
* **Expose endpoint url for s3 connectors** Allowing the endpoint url to be explicitly overridden lets non-AWS data providers that support the s3 protocol (e.g. minio) be used.
* **change default `hi_res` model for pdf/image partition to `yolox`** Partitioning pdf/image with the `hi_res` strategy now uses the `yolox_quantized` model instead of the `detectron2_onnx` model. This new default model has better recall for tables and produces more detailed categories for elements.
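
The heading-tag rule in the Enhancements list above reduces to a small decision function. The sketch below is only an illustration, not the library's implementation: the name `categorize_heading_text` and its boolean flags are hypothetical stand-ins for whatever checks `unstructured` actually performs to recognise list items, addresses, and narrative text.

```python
from typing import Optional

# HTML heading tags covered by the rule.
HEADING_TAGS = frozenset({"h1", "h2", "h3", "h4", "h5", "h6"})


def categorize_heading_text(
    tag: str,
    *,
    is_list_item: bool = False,   # hypothetical result of a list-item check
    is_address: bool = False,     # hypothetical result of an address check
    is_narrative: bool = False,   # hypothetical result of a narrative-text check
) -> Optional[str]:
    """Return "Title" when the heading rule applies, otherwise None.

    None means "this rule does not decide"; other categorization
    logic would take over in that case.
    """
    if tag.lower() not in HEADING_TAGS:
        return None
    if is_list_item or is_address or is_narrative:
        return None
    return "Title"
```

Returning `None` rather than a fallback category keeps the sketch limited to the one rule the changelog entry states.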

### Features
* **XLSX can now read subtables within one sheet** Problem: Many .xlsx files are not created to be read as one full table per sheet. There are subtables, text, and headers, along with more information to extract from each sheet. Feature: `partition_xlsx` can now read subtable(s) within one .xlsx sheet, along with extracting other title and narrative texts. Importance: This extends .xlsx reading beyond one table per sheet, allowing users to capture more data tables from the file, if they exist.

### Fixes

@@ -19,6 +20,7 @@
Problem: Under certain circumstances, text immediately after some HTML tags will be missing from the partition result.
Fix: Updated code to deal with these cases.
Importance: This will ensure the correctness when partitioning HTML and Markdown documents.
* **Fixes chunking when `detection_class_prob` appears in Element metadata** Problem: When `detection_class_prob` appears in Element metadata, Elements will only be combined by `chunk_by_title` if they have the same `detection_class_prob` value (which is rare). This is unlikely to be a case we ever need to support and most often results in no chunking. Fix: `detection_class_prob` is added to the list of metadata keys that chunking excludes from similarity comparison. Importance: This change allows `chunk_by_title` to operate as intended for documents which include `detection_class_prob` metadata in their Elements.
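
The `detection_class_prob` fix above comes down to ignoring certain keys when deciding whether two Elements have "the same" metadata. The following is a minimal sketch of that idea using plain dictionaries. The names `EXCLUDED_KEYS` and `metadata_matches` are hypothetical, and the real exclusion list in `unstructured` may contain other keys; only `detection_class_prob` is named in the changelog.

```python
from typing import Any, Dict, FrozenSet

# Hypothetical exclusion list: only `detection_class_prob` comes from the changelog.
EXCLUDED_KEYS: FrozenSet[str] = frozenset({"detection_class_prob"})


def metadata_matches(
    a: Dict[str, Any],
    b: Dict[str, Any],
    excluded: FrozenSet[str] = EXCLUDED_KEYS,
) -> bool:
    """Compare two metadata dicts while ignoring the excluded keys."""

    def comparable(metadata: Dict[str, Any]) -> Dict[str, Any]:
        # Drop keys that should not influence the similarity decision.
        return {k: v for k, v in metadata.items() if k not in excluded}

    return comparable(a) == comparable(b)


first = {"filename": "report.pdf", "page_number": 1, "detection_class_prob": 0.91}
second = {"filename": "report.pdf", "page_number": 1, "detection_class_prob": 0.47}

# With the key excluded, the two Elements count as similar and could be combined.
print(metadata_matches(first, second))
# Without the exclusion, the differing probabilities block the match.
print(metadata_matches(first, second, excluded=frozenset()))
```

Because model confidence scores almost never coincide exactly, comparing them would keep nearly every pair of Elements apart, which matches the "no chunking" symptom described in the entry.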

## 0.10.18

4 changes: 2 additions & 2 deletions docs/requirements.txt
@@ -18,7 +18,7 @@ certifi==2023.7.22
# -c requirements/constraints.in
# -r requirements/build.in
# requests
-charset-normalizer==3.2.0
+charset-normalizer==3.3.0
# via
# -c requirements/base.txt
# requests
@@ -54,7 +54,7 @@ mdurl==0.1.2
# via markdown-it-py
myst-parser==2.0.0
# via -r requirements/build.in
-packaging==23.1
+packaging==23.2
# via
# -c requirements/base.txt
# sphinx
Binary file not shown.
Binary file added example-docs/vodafone.xlsx
Binary file not shown.
4 changes: 2 additions & 2 deletions requirements/base.txt
@@ -12,7 +12,7 @@ certifi==2023.7.22
# requests
chardet==5.2.0
# via -r requirements/base.in
-charset-normalizer==3.2.0
+charset-normalizer==3.3.0
# via requests
click==8.1.7
# via nltk
@@ -40,7 +40,7 @@ numpy==1.24.4
# via
# -c requirements/constraints.in
# -r requirements/base.in
-packaging==23.1
+packaging==23.2
# via marshmallow
python-iso639==2023.6.15
# via -r requirements/base.in
4 changes: 2 additions & 2 deletions requirements/build.txt
@@ -18,7 +18,7 @@ certifi==2023.7.22
# -c requirements/constraints.in
# -r requirements/build.in
# requests
-charset-normalizer==3.2.0
+charset-normalizer==3.3.0
# via
# -c requirements/base.txt
# requests
@@ -54,7 +54,7 @@ mdurl==0.1.2
# via markdown-it-py
myst-parser==2.0.0
# via -r requirements/build.in
-packaging==23.1
+packaging==23.2
# via
# -c requirements/base.txt
# sphinx