Merge branch 'main' into ahmet/sharepoint-rbac

Unstructured-IO · Oct 4, 2023 · b557094
2 parents adeb806 + 9960ce5
Showing 41 changed files with 1,079 additions and 299 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,7 @@
 
 ### Enhancements
 
+* **Adds XLSX document level language detection** Building on the language detection functionality from the previous release, language detection is now supported for the `.xlsx` file type at the Element level.
 * **bump `unstructured-inference` to `0.6.6`** The updated version of `unstructured-inference` makes table extraction in `hi_res` mode configurable to fine-tune table extraction performance; it also improves element detection by adding a deduplication post-processing step in the `hi_res` partitioning of PDFs and images.
 * **Detect text in HTML Heading Tags as Titles** This will increase the accuracy of hierarchies in HTML documents and provide more accurate element categorization. If text is in an HTML heading tag and is not a list item, address, or narrative text, categorize it as a title.
 * **Update python-based docs** Refactor docs to use the actual unstructured code rather than using the subprocess library to run the cli command itself.
@@ -10,7 +11,7 @@
 * **Expose endpoint url for s3 connectors** Allowing the endpoint url to be explicitly overwritten makes it possible to support any non-AWS data provider that implements the s3 protocol (e.g. minio).
 * **change default `hi_res` model for pdf/image partition to `yolox`** Partitioning pdf/image with the `hi_res` strategy now uses the `yolox_quantized` model instead of the `detectron2_onnx` model. This new default model has better recall for tables and produces more detailed categories for elements.
 
-### Features
+* **XLSX can now read subtables within one sheet** Problem: Many .xlsx files are not created to be read as one full table per sheet. A sheet may contain subtables, text, and headers, along with other information to extract. Feature: `partition_xlsx` can now read subtable(s) within one .xlsx sheet and also extracts other title and narrative text. Importance: This extends .xlsx reading beyond one table per sheet, allowing users to capture more data tables from the file where they exist.
 
 * **Adds permissions (RBAC) data ingestion functionality for the Sharepoint connector.** Problem: Role-based access control is an important component in many data storage systems. Users may need to pass permissions (RBAC) data to downstream systems when ingesting data. Feature: Added permissions data ingestion functionality to the Sharepoint connector.
 
@@ -21,6 +22,7 @@
   Problem: Under certain circumstances, text immediately after some HTML tags will be missing from the partition result.
   Fix: Updated code to deal with these cases.
   Importance: This will ensure the correctness when partitioning HTML and Markdown documents.
+* **Fixes chunking when `detection_class_prob` appears in Element metadata** Problem: When `detection_class_prob` appears in Element metadata, Elements will only be combined by `chunk_by_title` if they have the same `detection_class_prob` value (which is rare). This is unlikely to be a case we ever need to support, and it most often results in no chunking. Fix: `detection_class_prob` is now included in the list of metadata keys that chunking excludes from similarity comparison. Importance: This change allows `chunk_by_title` to operate as intended for documents which include `detection_class_prob` metadata in their Elements.
 
 ## 0.10.18
 

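Note on the `detection_class_prob` chunking fix in the CHANGELOG hunk above: the idea of excluding a key from the metadata similarity comparison can be illustrated with a minimal sketch. This is not the library's code. The names `IGNORED_KEYS` and `metadata_matches` are hypothetical, and metadata is modeled as a plain dict purely for illustration.

```python
from typing import Any, Dict, FrozenSet

# Hypothetical constant: keys left out of the similarity comparison.
IGNORED_KEYS: FrozenSet[str] = frozenset({"detection_class_prob"})


def metadata_matches(
    a: Dict[str, Any],
    b: Dict[str, Any],
    ignored: FrozenSet[str] = IGNORED_KEYS,
) -> bool:
    """Return True when two metadata dicts agree on every key outside `ignored`."""
    a_kept = {k: v for k, v in a.items() if k not in ignored}
    b_kept = {k: v for k, v in b.items() if k not in ignored}
    return a_kept == b_kept


# Two elements that differ only in their detection probability.
first = {"filename": "report.pdf", "page_number": 1, "detection_class_prob": 0.91}
second = {"filename": "report.pdf", "page_number": 1, "detection_class_prob": 0.47}

print(metadata_matches(first, second))               # True: the probability is ignored
print(metadata_matches(first, second, frozenset()))  # False: nothing is ignored
```

With an empty ignore set the two elements never compare as similar, which mirrors the "no chunking" symptom described in the changelog entry.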
diff --git a/docs/requirements.txt b/docs/requirements.txt
@@ -18,7 +18,7 @@ certifi==2023.7.22
     #   -c requirements/constraints.in
     #   -r requirements/build.in
     #   requests
-charset-normalizer==3.2.0
+charset-normalizer==3.3.0
     # via
     #   -c requirements/base.txt
     #   requests
@@ -54,7 +54,7 @@ mdurl==0.1.2
     # via markdown-it-py
 myst-parser==2.0.0
     # via -r requirements/build.in
-packaging==23.1
+packaging==23.2
     # via
     #   -c requirements/base.txt
     #   sphinx

diff --git a/example-docs/2023-half-year-analyses-by-segment.xlsx b/example-docs/2023-half-year-analyses-by-segment.xlsx
diff --git a/example-docs/vodafone.xlsx b/example-docs/vodafone.xlsx
diff --git a/requirements/base.txt b/requirements/base.txt
@@ -12,7 +12,7 @@ certifi==2023.7.22
     #   requests
 chardet==5.2.0
     # via -r requirements/base.in
-charset-normalizer==3.2.0
+charset-normalizer==3.3.0
     # via requests
 click==8.1.7
     # via nltk
@@ -40,7 +40,7 @@ numpy==1.24.4
     # via
     #   -c requirements/constraints.in
     #   -r requirements/base.in
-packaging==23.1
+packaging==23.2
     # via marshmallow
 python-iso639==2023.6.15
     # via -r requirements/base.in

diff --git a/requirements/build.txt b/requirements/build.txt
@@ -18,7 +18,7 @@ certifi==2023.7.22
     #   -c requirements/constraints.in
     #   -r requirements/build.in
     #   requests
-charset-normalizer==3.2.0
+charset-normalizer==3.3.0
     # via
     #   -c requirements/base.txt
     #   requests
@@ -54,7 +54,7 @@ mdurl==0.1.2
     # via markdown-it-py
 myst-parser==2.0.0
     # via -r requirements/build.in
-packaging==23.1
+packaging==23.2
     # via
     #   -c requirements/base.txt
     #   sphinx