chore: Add encoding param to ingest (#955)
* Add encoding param to ingest
Jason Scheirer authored Jul 24, 2023
1 parent 676c50a commit 196efa0
Showing 8 changed files with 234 additions and 2 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -4,6 +4,7 @@
* Add min_partition kwarg to that combines elements below a specified threshold and modifies splitting of strings longer than max partition so words are not split.
* set the file's current position to the beginning after reading the file in `convert_to_bytes`
* Add slide notes to pptx
* Add `--encoding` directive to ingest

### Features

19 changes: 19 additions & 0 deletions example-docs/fake-html-cp1252.html
@@ -0,0 +1,19 @@
<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>
<p>Some CP1252-specific characters:</p>

<pre>
¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯
° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
à á â ã ä å æ ç è é ê ë ì í î ï
ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
</pre>

</body>
</html>
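A minimal sketch of why this fixture needs an explicit encoding, assuming the file is saved as cp1252 bytes (the diff does not show the raw bytes). The `sample` string and the `try_decode` helper are illustrative only and are not part of the commit.

```python
from typing import Optional

# A few characters taken from the <pre> table in the fixture above.
sample = "£ © é ÿ"

# In cp1252 each of these characters is a single byte in the 0xA0-0xFF range.
raw = sample.encode("cp1252")


def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Hypothetical helper: return decoded text, or None if the bytes are invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# The same bytes are not valid UTF-8, so the default decoding fails,
# while decoding with the matching encoding round-trips the text.
as_utf8 = try_decode(raw, "utf-8")
as_cp1252 = try_decode(raw, "cp1252")
```

Here `as_utf8` ends up as `None` and `as_cp1252` equals `sample`, which is the situation the `--encoding cp1252` flag in the test script is meant to cover.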
@@ -151,13 +151,13 @@
},
{
"type": "NarrativeText",
"element_id": "7480a79a5bad8a36f3f7e5d622f0b5f3",
"element_id": "f3be9748ecd68b20d706548129baa22d",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "First, take steps to better prepare for the seasonal hazards weather can throw at you.\r\nThis could include a spring cleaning of your storm shelter or ensuring your emergency kit is fully stocked. Take a look at our infographics and social media posts to help you become “weather-ready.”"
"text": "First, take steps to better prepare for the seasonal hazards weather can throw at you.\nThis could include a spring cleaning of your storm shelter or ensuring your emergency kit is fully stocked. Take a look at our infographics and social media posts to help you become “weather-ready.”"
},
{
"type": "NarrativeText",
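The hunk above changes `\r\n` to `\n` in one expected text value, alongside a new `element_id`. The commit does not say why. One plausible explanation, offered as an assumption that the diff does not confirm, is that the document is now read in Python's text mode with an explicit encoding, which applies universal-newline translation. The sketch below only demonstrates that standard-library behaviour; the file name and contents are made up.

```python
import os
import tempfile

# Made-up content with a Windows-style line ending.
raw = b"First line.\r\nSecond line."

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.html")
    with open(path, "wb") as f:
        f.write(raw)

    # Reading bytes and decoding them keeps the line ending untouched.
    with open(path, "rb") as f:
        from_bytes = f.read().decode("utf-8")

    # Reading in text mode (default newline=None) translates "\r\n" to "\n".
    with open(path, encoding="utf-8") as f:
        from_text = f.read()
```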
@@ -0,0 +1,92 @@
[
{
"type": "Title",
"element_id": "0540311f6c077fe8f797080918b8d74b",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "My First Heading"
},
{
"type": "Title",
"element_id": "399af454cb1368b8257ed406b430de84",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "My first paragraph."
},
{
"type": "Title",
"element_id": "b4cf0d13edfa976816649971bd640a66",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "Some CP1252-specific characters:"
},
{
"type": "UncategorizedText",
"element_id": "ada7c3084f437d31d297f85da3941a55",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 2
},
"text": "¡\t¢\t£\t¤\t¥\t¦\t§\t¨\t©\tª\t«\t¬\tSHY\t®\t¯"
},
{
"type": "UncategorizedText",
"element_id": "dda5e8c4d245c1954ecb64e5dfea598d",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 3
},
"text": "°\t±\t²\t³\t´\tµ\t\t·\t¸\t¹\tº\t»\t¼\t½\t¾\t¿"
},
{
"type": "Title",
"element_id": "85df09b375e5813aefa3b5f30c8ddff8",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 4
},
"text": "À\tÁ\tÂ\tÃ\tÄ\tÅ\tÆ\tÇ\tÈ\tÉ\tÊ\tË\tÌ\tÍ\tÎ\tÏ"
},
{
"type": "Title",
"element_id": "2726d2569cd7a6cecb79a6e46bb0b2b3",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 5
},
"text": "Ð\tÑ\tÒ\tÓ\tÔ\tÕ\tÖ\t×\tØ\tÙ\tÚ\tÛ\tÜ\tÝ\tÞ\tß"
},
{
"type": "Title",
"element_id": "2b01f3e428520f6e47d8513292688cf6",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 6
},
"text": "à\tá\tâ\tã\tä\tå\tæ\tç\tè\té\tê\të\tì\tí\tî\tï"
},
{
"type": "Title",
"element_id": "5ed256e41bfb169af5f50524b9593a16",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 7
},
"text": "ð\tñ\tò\tó\tô\tõ\tö\t÷\tø\tù\tú\tû\tü\tý\tþ\tÿ"
}
]
@@ -0,0 +1,92 @@
[
{
"type": "Title",
"element_id": "0540311f6c077fe8f797080918b8d74b",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "My First Heading"
},
{
"type": "Title",
"element_id": "399af454cb1368b8257ed406b430de84",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "My first paragraph."
},
{
"type": "Title",
"element_id": "b4cf0d13edfa976816649971bd640a66",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "Some CP1252-specific characters:"
},
{
"type": "UncategorizedText",
"element_id": "ada7c3084f437d31d297f85da3941a55",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 2
},
"text": "¡\t¢\t£\t¤\t¥\t¦\t§\t¨\t©\tª\t«\t¬\tSHY\t®\t¯"
},
{
"type": "UncategorizedText",
"element_id": "dda5e8c4d245c1954ecb64e5dfea598d",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 3
},
"text": "°\t±\t²\t³\t´\tµ\t\t·\t¸\t¹\tº\t»\t¼\t½\t¾\t¿"
},
{
"type": "Title",
"element_id": "85df09b375e5813aefa3b5f30c8ddff8",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 4
},
"text": "À\tÁ\tÂ\tÃ\tÄ\tÅ\tÆ\tÇ\tÈ\tÉ\tÊ\tË\tÌ\tÍ\tÎ\tÏ"
},
{
"type": "Title",
"element_id": "2726d2569cd7a6cecb79a6e46bb0b2b3",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 5
},
"text": "Ð\tÑ\tÒ\tÓ\tÔ\tÕ\tÖ\t×\tØ\tÙ\tÚ\tÛ\tÜ\tÝ\tÞ\tß"
},
{
"type": "Title",
"element_id": "2b01f3e428520f6e47d8513292688cf6",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 6
},
"text": "à\tá\tâ\tã\tä\tå\tæ\tç\tè\té\tê\të\tì\tí\tî\tï"
},
{
"type": "Title",
"element_id": "5ed256e41bfb169af5f50524b9593a16",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 7
},
"text": "ð\tñ\tò\tó\tô\tõ\tö\t÷\tø\tù\tú\tû\tü\tý\tþ\tÿ"
}
]
@@ -0,0 +1,20 @@
#!/usr/bin/env bash

set -e

SCRIPT_DIR=$(dirname "$(realpath "$0")")
cd "$SCRIPT_DIR"/.. || exit 1
OUTPUT_FOLDER_NAME=local-single-file-with-encoding
OUTPUT_DIR=$SCRIPT_DIR/structured-output/$OUTPUT_FOLDER_NAME

PYTHONPATH=. ./unstructured/ingest/main.py \
--metadata-exclude filename,file_directory,metadata.data_source.date_processed \
--local-input-path example-docs/fake-html-cp1252.html \
--structured-output-dir "$OUTPUT_DIR" \
--encoding cp1252 \
--verbose \
--reprocess

set +e

sh "$SCRIPT_DIR"/check-diff-expected-output.sh $OUTPUT_FOLDER_NAME
1 change: 1 addition & 0 deletions test_unstructured_ingest/test-ingest.sh
@@ -26,5 +26,6 @@ export OMP_THREAD_LIMIT=1
./test_unstructured_ingest/test-ingest-confluence-diff.sh
./test_unstructured_ingest/test-ingest-confluence-large.sh
./test_unstructured_ingest/test-ingest-local-single-file.sh
./test_unstructured_ingest/test-ingest-local-single-file-with-encoding.sh
# NOTE(yuming): The following test should be put after any tests with --preserve-downloads option
./test_unstructured_ingest/test-ingest-pdf-fast-reprocess.sh
7 changes: 7 additions & 0 deletions unstructured/ingest/main.py
@@ -160,6 +160,11 @@ def run(self):
"language pack needs to be installed."
"Default: eng",
)
@click.option(
"--encoding",
default="utf-8",
help="Text encoding to use when reading documents. Default: utf-8",
)
@click.option(
"--api-key",
default="",
@@ -588,6 +593,7 @@ def main(
partition_endpoint,
partition_strategy,
partition_ocr_languages,
encoding,
api_key,
local_input_path,
local_file_glob,
@@ -979,6 +985,7 @@ def main(
process_document,
strategy=partition_strategy,
ocr_languages=partition_ocr_languages,
encoding=encoding,
)

MainProcess(
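The three `main.py` hunks follow one pattern: declare an option with a `utf-8` default, accept it as a parameter of `main`, and bind it as a keyword argument for `process_document`. The sketch below illustrates that flow under stated assumptions: it uses the standard library's `argparse` in place of click so it runs without dependencies, it assumes the binding is done with something like `functools.partial` (the diff does not show the enclosing call), and its `process_document` is a hypothetical stand-in, not the real function.

```python
import argparse
from functools import partial


def process_document(path, strategy="auto", ocr_languages="eng", encoding="utf-8"):
    """Hypothetical stand-in for the real worker: read one file as text."""
    with open(path, encoding=encoding) as f:
        return f.read()


def build_parser():
    """Declare the option with the same default and help text as the commit."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--encoding",
        default="utf-8",
        help="Text encoding to use when reading documents. Default: utf-8",
    )
    return parser


def make_worker(argv):
    """Parse the command line and bind the encoding for every later call."""
    args = build_parser().parse_args(argv)
    return partial(process_document, encoding=args.encoding)


worker = make_worker(["--encoding", "cp1252"])
print(worker.keywords)  # {'encoding': 'cp1252'}
```

Binding the value once means each per-document call only needs the path, and omitting the flag leaves the UTF-8 default in place.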