
Commit

Merge branch 'ahmet/sharepoint-rbac' of https://github.com/Unstructured-IO/unstructured into ahmet/sharepoint-rbac
ahmetmeleq committed Oct 4, 2023
2 parents c9b1691 + bba81db commit f3895c4
Showing 28 changed files with 3,755 additions and 5,562 deletions.
9 changes: 5 additions & 4 deletions CHANGELOG.md
@@ -1,15 +1,16 @@
-## 0.10.19-dev7
+## 0.10.19-dev10

### Enhancements

* **bump `unstructured-inference` to `0.6.6`** The updated version of `unstructured-inference` makes table extraction in `hi_res` mode configurable to fine-tune table extraction performance; it also improves element detection by adding a deduplication post-processing step in the `hi_res` partitioning of pdfs and images.
* **Detect text in HTML Heading Tags as Titles** This will increase the accuracy of hierarchies in HTML documents and provide more accurate element categorization. If text is in an HTML heading tag and is not a list item, address, or narrative text, categorize it as a title.
* **Update python-based docs** Refactor docs to use the actual unstructured code rather than using the subprocess library to run the cli command itself.
-* * **Adds data source properties to SharePoint, Outlook, Onedrive, Reddit, and Slack connectors** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
+* **Adds data source properties to SharePoint, Outlook, Onedrive, Reddit, and Slack connectors** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
* **Adds Table support for the `add_chunking_strategy` decorator to partition functions.** In addition to combining elements under Title elements, users can now specify the `max_characters=<n>` argument to chunk Table elements into TableChunk elements with `text` and `text_as_html` of length <n> characters. This means partitioned Table results are ready for use in downstream applications without any post-processing (see the sketch after this changelog excerpt).
* **Expose endpoint url for s3 connectors** Allowing the endpoint url to be explicitly overridden means any non-AWS data provider that supports the s3 protocol can be used (e.g. minio).
* **change default `hi_res` model for pdf/image partition to `yolox`** Partitioning pdf/image with the `hi_res` strategy now utilizes the `yolox_quantized` model instead of the `detectron2_onnx` model. This new default model has better recall for tables and produces more detailed categories for elements.


### Features

### Fixes

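For context on the Table chunking entry above, a minimal sketch of the `chunking_strategy` and `max_characters` kwargs that the `add_chunking_strategy` decorator adds to partition functions; the file path and character limit are illustrative, not taken from this commit.

from unstructured.partition.auto import partition

# Oversized Table elements are split into TableChunk elements that keep
# both `text` and `text_as_html`, per the changelog entry above.
elements = partition(
    filename="example-docs/table-doc.docx",  # illustrative path
    chunking_strategy="by_title",
    max_characters=500,
)
for element in elements:
    print(type(element).__name__, len(element.text))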
20 changes: 10 additions & 10 deletions docs/source/api.rst
@@ -108,7 +108,7 @@ When elements are extracted from PDFs or images, it may be useful to get their b
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -155,7 +155,7 @@ You can specify the encoding to use to decode the text input. If no value is pro
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -204,7 +204,7 @@ You can also specify what languages to use for OCR with the ``ocr_languages`` kw
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -250,7 +250,7 @@ By default the result will be in ``json``, but it can be set to ``text/csv`` to
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -296,7 +296,7 @@ Pass the `include_page_breaks` parameter to `true` to include `PageBreak` elemen
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -345,7 +345,7 @@ On the other hand, ``hi_res`` is the better choice for PDFs that may have text w
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -398,7 +398,7 @@ To use the ``hi_res`` strategy with **Chipper** model, pass the argument for ``h
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -451,7 +451,7 @@ To extract the table structure from PDF files using the ``hi_res`` strategy, ens
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -499,7 +499,7 @@ We also provide support for enabling and disabling table extraction for file typ
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
@@ -545,7 +545,7 @@ When processing XML documents, set the ``xml_keep_tags`` parameter to ``true`` t
file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
-response = requests.post(url, headers=headers, files=files, data=data)
+response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
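Since every hunk above applies the same one-line fix, here is a self-contained version of the corrected snippet for reference; the endpoint, headers, and form fields are placeholders, assuming only the multipart upload shape shown in the diff.

import requests

url = "https://api.unstructured.io/general/v0/general"  # placeholder endpoint
headers = {"Accept": "application/json", "unstructured-api-key": "<API_KEY>"}  # illustrative
data = {"strategy": "hi_res"}  # example form field

file_path = "/Path/To/File"
file_data = {'files': open(file_path, 'rb')}
# The fix in this diff: pass the file_data dict, not the undefined name `files`.
response = requests.post(url, headers=headers, files=file_data, data=data)
file_data['files'].close()
print(response.status_code)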
25 changes: 25 additions & 0 deletions scripts/minio-test-helpers/create-and-check-minio.sh
@@ -0,0 +1,25 @@
#!/usr/bin/env bash

SCRIPT_DIR=$(dirname "$(realpath "$0")")

secret_key=minioadmin
access_key=minioadmin
region=us-east-2
endpoint_url=http://localhost:9000
bucket_name=utic-dev-tech-fixtures

function upload() {
  echo "Uploading test content to new bucket in minio"
  AWS_REGION=$region AWS_SECRET_ACCESS_KEY=$secret_key AWS_ACCESS_KEY_ID=$access_key \
    aws --output json --endpoint-url $endpoint_url s3api create-bucket --bucket $bucket_name | jq
  AWS_REGION=$region AWS_SECRET_ACCESS_KEY=$secret_key AWS_ACCESS_KEY_ID=$access_key \
    aws --endpoint-url $endpoint_url s3 cp "$SCRIPT_DIR"/wiki_movie_plots_small.csv s3://$bucket_name/
}

# Create Minio single server
docker compose version
docker compose -f "$SCRIPT_DIR"/docker-compose.yaml up --wait
docker compose -f "$SCRIPT_DIR"/docker-compose.yaml ps

echo "Cluster is live."
upload
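A quick cross-check that the bucket and fixture landed can be sketched with boto3, reusing the script's endpoint and static credentials (assumption: boto3 is installed in the test environment).

import boto3

# Same endpoint and credentials as the helper script above.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
    region_name="us-east-2",
)

# List the uploaded fixture to verify the bucket was created and populated.
for obj in s3.list_objects_v2(Bucket="utic-dev-tech-fixtures").get("Contents", []):
    print(obj["Key"], obj["Size"])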
13 changes: 13 additions & 0 deletions scripts/minio-test-helpers/docker-compose.yaml
@@ -0,0 +1,13 @@
services:
  minio:
    image: quay.io/minio/minio
    container_name: minio-test
    ports:
      - 9000:9000
      - 9001:9001
    command: server --console-address ":9001" /data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 5s
      timeout: 20s
      retries: 3
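The healthcheck above curls MinIO's liveness endpoint from inside the container; the same probe, sketched from the host in Python with the compose file's cadence (5s interval, 3 retries):

import time
import requests

# Mirrors the compose healthcheck: up to 3 attempts against the liveness endpoint.
for attempt in range(3):
    try:
        resp = requests.get("http://localhost:9000/minio/health/live", timeout=20)
        if resp.ok:
            print("minio is live")
            break
    except requests.ConnectionError:
        pass
    time.sleep(5)
else:
    raise SystemExit("minio did not become healthy")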