detect document language across all partitioners #1627

Merged
merged 95 commits into from
Oct 11, 2023
61fae5c
add docs with multiple languages for testing
Coniferish Sep 30, 2023
5224d49
fix metadata languages test
Coniferish Sep 26, 2023
022a057
add language detection to partition_csv
Coniferish Oct 1, 2023
edce758
add language detection to partition_email--FAILING
Coniferish Oct 1, 2023
946ee84
add language detection to partition_docx and linting
Coniferish Oct 1, 2023
5e906b9
add language detection to partition_doc
Coniferish Oct 1, 2023
e333301
add language detection to partition_epub
Coniferish Oct 1, 2023
216032e
alter apply_lang_metadata to accept lists and remove debugging code f…
Coniferish Oct 1, 2023
d71f4bc
fix test name for epub
Coniferish Oct 1, 2023
376adcc
add language detection to partition_html
Coniferish Oct 1, 2023
1c83758
add language detection to partition_md
Coniferish Oct 1, 2023
8a0d5f7
Fix partition_json docstring to clarify it only partitions serialized…
Coniferish Oct 1, 2023
7c5ea7f
add language detection to partition_msg: FAILING
Coniferish Oct 1, 2023
0849cf7
add language detection to partition_odt
Coniferish Oct 1, 2023
f743c6d
add language detection to partition_org
Coniferish Oct 1, 2023
5ae2e5c
add language detection to partition_rst
Coniferish Oct 1, 2023
89d4afe
add language detection to partition_rtf
Coniferish Oct 1, 2023
f4858de
add language detection to partition_tsv
Coniferish Oct 1, 2023
77d5030
add language detection to partition_xlsx
Coniferish Oct 1, 2023
26dac7e
add language detection to partition_xml
Coniferish Oct 1, 2023
22bbd6b
fix lang detection to partition_msg
Coniferish Oct 2, 2023
b517b11
fix lang detection to partition_email
Coniferish Oct 2, 2023
8550d41
add type checking for to missed partitioners
Coniferish Oct 2, 2023
5cf7f35
add language detection to partition_pptx
Coniferish Oct 2, 2023
383dbdf
add language detection to partition_ppt
Coniferish Oct 2, 2023
84dc8f1
add test for partition_text
Coniferish Oct 2, 2023
e31e6a7
remove debugging breakpoint
Coniferish Oct 2, 2023
3db0fa1
fix test asserting on all metadata
Coniferish Oct 2, 2023
08676e4
add language detection test to partition_text
Coniferish Oct 2, 2023
493516a
Merge branch 'jj/1536-lang-test-docs' into jj/1534-doc-lvl-lang
Coniferish Oct 2, 2023
bebb898
add tests for multiple languages
Coniferish Oct 2, 2023
926b21b
update docstring info and fix skipping language detection
Coniferish Oct 2, 2023
34c4ea3
add kwarg to detect lang by element
Coniferish Oct 3, 2023
a0a7d9f
fix multilanguages tests
Coniferish Oct 3, 2023
38a2012
Merge branch 'main' into jj/1534-doc-lvl-lang
Coniferish Oct 3, 2023
b108e5a
Merge branch 'main' into jj/1534-doc-lvl-lang
Coniferish Oct 3, 2023
c36d434
add additional docs and tests for multilanguage detection in docx, ep…
Coniferish Oct 3, 2023
70b6082
linting
Coniferish Oct 4, 2023
a8b78b6
add additional docs and tests for multilanguage detection in doc, htm…
Coniferish Oct 4, 2023
d3c4a20
Merge branch 'main' into jj/1534-doc-lvl-lang
Coniferish Oct 4, 2023
2879f21
remove unnecessary continue statement
Coniferish Oct 4, 2023
b1c4048
improve docstrings and remove detect_per_element kwarg from partition…
Coniferish Oct 4, 2023
262e342
remove comment
Coniferish Oct 4, 2023
6209eba
add languages doc and test for partition_xml
Coniferish Oct 4, 2023
f17cd45
Merge branch 'main' into jj/1534-doc-lvl-lang
Coniferish Oct 4, 2023
d66b7a3
attempt to fix linting error
Coniferish Oct 4, 2023
58f2a2f
fix linting error referring to duplicate arguments
Coniferish Oct 4, 2023
808a1fe
changelog and version
Coniferish Oct 4, 2023
2561928
resolve PR review comments
Coniferish Oct 4, 2023
a87efd2
fix type hinting for linting
Coniferish Oct 4, 2023
496e2a8
fix type hinting errors
Coniferish Oct 4, 2023
2e57296
add length check and default to 'eng' for language detection
Coniferish Oct 4, 2023
6f3202e
update comment in tests to be more accurate
Coniferish Oct 4, 2023
cd23a01
remove unnecessary metadata creation
Coniferish Oct 4, 2023
7c33eda
improve conditions for defaulting to 'eng' language detection
Coniferish Oct 4, 2023
6dedeb5
remove incorrectly detected lang from test
Coniferish Oct 5, 2023
75f47b1
Merge branch 'main' into jj/1534-doc-lvl-lang
Coniferish Oct 5, 2023
f80459f
add test to increase test coverage
Coniferish Oct 5, 2023
7db585b
move type check to detect_languages and out from partitioners
Coniferish Oct 5, 2023
5ff8764
Merge branch 'main' into jj/1534-doc-lvl-lang
Coniferish Oct 5, 2023
e37f76c
Merge branch 'main' into jj/1534-doc-lvl-lang
Coniferish Oct 5, 2023
d8b3a9c
undo changes to tests
Coniferish Oct 5, 2023
9f05d0a
Update CHANGELOG.md
Coniferish Oct 5, 2023
02bfb75
detect document language across all partitioners <- Ingest test fixtu…
ryannikolaidis Oct 6, 2023
f7a8f11
Merge branch 'main' into jj/1534-doc-lvl-lang
Coniferish Oct 6, 2023
0f2c251
Merge branch 'main' into jj/1534-doc-lvl-lang
Coniferish Oct 6, 2023
0d9f135
add comment about why detect_languages_per_element is not included in…
Coniferish Oct 6, 2023
ba79d8a
add tests for languages arg and raising TypeError
Coniferish Oct 6, 2023
d3faf31
linting
Coniferish Oct 6, 2023
dd51858
move apply_lang_metadata to lang.py to avoid adding unnecessary depen…
Coniferish Oct 6, 2023
6ce99bf
remove extra test
Coniferish Oct 6, 2023
c9bfd65
Merge branch 'main' into jj/1534-doc-lvl-lang
Coniferish Oct 6, 2023
e0d2394
Merge branch 'main' into jj/1534-doc-lvl-lang
Coniferish Oct 7, 2023
f55c4e9
Merge branch 'main' into jj/1534-doc-lvl-lang
Coniferish Oct 7, 2023
9f4b350
Merge branch 'main' into jj/1534-doc-lvl-lang
Coniferish Oct 7, 2023
d16c6d5
Merge branch 'main' into jj/1534-doc-lvl-lang
Coniferish Oct 9, 2023
01e29df
update error message
Coniferish Oct 9, 2023
c2c2c79
Merge branch 'main' into jj/1534-doc-lvl-lang
Coniferish Oct 9, 2023
1937832
add tests for incorrect arg type for 'languages'
Coniferish Oct 9, 2023
1275130
add languages param to auto partition
Coniferish Oct 10, 2023
c2e7d67
add tests
Coniferish Oct 10, 2023
096f69d
fix typo and update changelog/version
Coniferish Oct 10, 2023
9556188
add tests to auto partition and fix defaults for 'languages'
Coniferish Oct 10, 2023
5522e47
Merge branch 'main' into jj/1534-doc-lvl-lang
Coniferish Oct 10, 2023
8a50216
fix last type hint
Coniferish Oct 10, 2023
0d28632
standardize test names
Coniferish Oct 10, 2023
ff5c3f9
standardize test names(2)
Coniferish Oct 10, 2023
9494bff
standardize test names(3)
Coniferish Oct 10, 2023
f95f6ab
Merge branch 'main' into jj/1534-doc-lvl-lang
Coniferish Oct 10, 2023
de676ea
changelog and version
Coniferish Oct 10, 2023
3a78281
linting
Coniferish Oct 10, 2023
059a8a8
Merge branch 'main' into jj/1534-doc-lvl-lang
cragwolfe Oct 11, 2023
f3554d4
cut a release
awalker4 Oct 11, 2023
1c13dd6
detect document language across all partitioners <- Ingest test fixtu…
ryannikolaidis Oct 11, 2023
421e148
fix version file
awalker4 Oct 11, 2023
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,7 +1,8 @@
-## 0.10.20-dev9
+## 0.10.20

 ### Enhancements

+* **Add document level language detection functionality.** Adds the "auto" default for the languages param to all partitioners. The primary language present in the document is detected using the `langdetect` package. Additional param `detect_language_per_element` is also added for partitioners that return multiple elements. Defaults to `False`.
 * **Refactor OCR code** The OCR code for entire page is moved from unstructured-inference to unstructured. On top of continuing support for OCR language parameter, we also support two OCR processing modes, "entire_page" or "individual_blocks".
 * **Align to top left when shrinking bounding boxes for `xy-cut` sorting:** Update `shrink_bbox()` to keep top left rather than center.
 * **Add visualization script to annotate elements** This script is often used to analyze/visualize elements with coordinates (e.g. partition_pdf()).
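
The language detection entry above distinguishes document-level detection (the default) from per-element detection. The following is a minimal sketch of that distinction, not the code in this PR: `tag_languages` and `toy_detect` are hypothetical names, and the stopword counter is a toy stand-in for `langdetect` so the example runs with the standard library only.

```python
from collections import Counter
from typing import Callable, List, Optional

# Toy stopword lists: a stand-in for langdetect so the sketch runs without it.
_STOPWORDS = {
    "eng": {"the", "and", "are", "in", "of", "with", "should", "all"},
    "spa": {"los", "y", "en", "de", "con", "como", "todos", "e"},
}


def toy_detect(text: str) -> str:
    """Pick the language whose stopwords occur most often; 'eng' if none match."""
    words = Counter(text.lower().split())
    scores = {lang: sum(words[w] for w in stop) for lang, stop in _STOPWORDS.items()}
    best = max(scores, key=lambda lang: scores[lang])
    return best if scores[best] > 0 else "eng"


def tag_languages(
    texts: List[str],
    languages: Optional[List[str]] = None,
    detect_language_per_element: bool = False,
    detector: Callable[[str], str] = toy_detect,
) -> List[List[str]]:
    """Return one language list per element text (hypothetical helper)."""
    if languages is None:
        languages = ["auto"]
    if not isinstance(languages, list):
        raise TypeError("languages must be a list of language codes, e.g. ['eng']")
    if languages != ["auto"]:
        # Caller supplied the languages, so no detection is run.
        return [list(languages) for _ in texts]
    if detect_language_per_element:
        return [[detector(text)] for text in texts]
    # Document level: detect once over all text, then apply to every element.
    document_language = detector(" ".join(texts))
    return [[document_language] for _ in texts]


eng = "All human beings are born free and equal in dignity and rights."
spa = "Todos los seres humanos nacen libres e iguales en dignidad y derechos."
print(tag_languages([eng, eng, spa]))
print(tag_languages([eng, eng, spa], detect_language_per_element=True))
```

With the default settings every element receives the dominant language of the whole document (`eng` here), while `detect_language_per_element=True` tags the Spanish paragraph as `spa`. The mixed English and Spanish fixtures added under `example-docs/language-docs/` exercise this kind of difference.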
Binary file added example-docs/language-docs/eng_spa.xlsx
Binary file not shown.
Binary file added example-docs/language-docs/eng_spa_mult.doc
Binary file not shown.
Binary file added example-docs/language-docs/eng_spa_mult.docx
Binary file not shown.
113 changes: 113 additions & 0 deletions example-docs/language-docs/eng_spa_mult.eml
@@ -0,0 +1,113 @@
MIME-Version: 1.0
Date: Wed, 4 Oct 2023 09:27:45 -0500
Message-ID: <CABDvgF2Wpt9+eSO7zgMJZ2fQb=QZ__CS6N_Y+msnGwpKeg1a+A@mail.gmail.com>
Subject: Test email with multiple languages
From: John <[email protected]>
To: John <[email protected]>
Content-Type: multipart/alternative; boundary="000000000000a0666d0606e4cfb7"

--000000000000a0666d0606e4cfb7
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

All human beings are born free and equal in dignity and rights. They are
endowed with reason and conscience and should act towards one another in a
spirit of brotherhood. All human beings are born free and equal in dignity
and rights. They are endowed with reason and conscience and should act
towards one another in a spirit of brotherhood. All human beings are born
free and equal in dignity and rights. They are endowed with reason and
conscience and should act towards one another in a spirit of brotherhood.
All human beings are born free and equal in dignity and rights. They are
endowed with reason and conscience and should act towards one another in a
spirit of brotherhood. All human beings are born free and equal in dignity
and rights. They are endowed with reason and conscience and should act
towards one another in a spirit of brotherhood. All human beings are born
free and equal in dignity and rights. They are endowed with reason and
conscience and should act towards one another in a spirit of brotherhood.

All human beings are born free and equal in dignity and rights. They are
endowed with reason and conscience and should act towards one another in a
spirit of brotherhood. "Todos los seres humanos nacen libres e iguales en
dignidad y derechos y, dotados como est=C3=A1n de raz=C3=B3n y conciencia, =
deben
comportarse fraternalmente los unos con los otros. Todos los seres humanos
nacen libres e iguales en dignidad y derechos y, dotados como est=C3=A1n de
raz=C3=B3n y conciencia, deben comportarse fraternalmente los unos con los
otros."

All human beings are born free and equal in dignity and rights. They are
endowed with reason and conscience and should act towards one another in a
spirit of brotherhood. All human beings are born free and equal in dignity
and rights. They are endowed with reason and conscience and should act
towards one another in a spirit of brotherhood. All human beings are born
free and equal in dignity and rights. They are endowed with reason and
conscience and should act towards one another in a spirit of brotherhood.
All human beings are born free and equal in dignity and rights. They are
endowed with reason and conscience and should act towards one another in a
spirit of brotherhood.

All human beings are born free and equal in dignity and rights. They are
endowed with reason and conscience and should act towards one another in a
spirit of brotherhood. All human beings are born free and equal in dignity
and rights. They are endowed with reason and conscience and should act
towards one another in a spirit of brotherhood. All human beings are born
free and equal in dignity and rights. They are endowed with reason and
conscience and should act towards one another in a spirit of brotherhood.

"Todos los seres humanos nacen libres e iguales en dignidad y derechos y,
dotados como est=C3=A1n de raz=C3=B3n y conciencia, deben comportarse frate=
rnalmente
los unos con los otros. Todos los seres humanos nacen libres e iguales en
dignidad y derechos y, dotados como est=C3=A1n de raz=C3=B3n y conciencia, =
deben
comportarse fraternalmente los unos con los otros."

--000000000000a0666d0606e4cfb7
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">All human beings are born free and equal in dignity and ri=
ghts. They are endowed with reason and conscience and should act towards on=
e another in a spirit of brotherhood. All human beings are born free and eq=
ual in dignity and rights. They are endowed with reason and conscience and =
should act towards one another in a spirit of brotherhood. All human beings=
are born free and equal in dignity and rights. They are endowed with reaso=
n and conscience and should act towards one another in a spirit of brotherh=
ood. All human beings are born free and equal in dignity and rights. They a=
re endowed with reason and conscience and should act towards one another in=
a spirit of brotherhood. All human beings are born free and equal in digni=
ty and rights. They are endowed with reason and conscience and should act t=
owards one another in a spirit of brotherhood. All human beings are born fr=
ee and equal in dignity and rights. They are endowed with reason and consci=
ence and should act towards one another in a spirit of brotherhood.<br><br>=
All human beings are born free and equal in dignity and rights. They are en=
dowed with reason and conscience and should act towards one another in a sp=
irit of brotherhood. &quot;Todos los seres humanos nacen libres e iguales e=
n dignidad y derechos y, dotados como est=C3=A1n de raz=C3=B3n y conciencia=
, deben comportarse fraternalmente los unos con los otros. Todos los seres =
humanos nacen libres e iguales en dignidad y derechos y, dotados como est=
=C3=A1n de raz=C3=B3n y conciencia, deben comportarse fraternalmente los un=
os con los otros.&quot;<br><br>All human beings are born free and equal in =
dignity and rights. They are endowed with reason and conscience and should =
act towards one another in a spirit of brotherhood. All human beings are bo=
rn free and equal in dignity and rights. They are endowed with reason and c=
onscience and should act towards one another in a spirit of brotherhood. Al=
l human beings are born free and equal in dignity and rights. They are endo=
wed with reason and conscience and should act towards one another in a spir=
it of brotherhood. All human beings are born free and equal in dignity and =
rights. They are endowed with reason and conscience and should act towards =
one another in a spirit of brotherhood.<br><br>All human beings are born fr=
ee and equal in dignity and rights. They are endowed with reason and consci=
ence and should act towards one another in a spirit of brotherhood. All hum=
an beings are born free and equal in dignity and rights. They are endowed w=
ith reason and conscience and should act towards one another in a spirit of=
brotherhood. All human beings are born free and equal in dignity and right=
s. They are endowed with reason and conscience and should act towards one a=
nother in a spirit of brotherhood.<br><br>&quot;Todos los seres humanos nac=
en libres e iguales en dignidad y derechos y, dotados como est=C3=A1n de ra=
z=C3=B3n y conciencia, deben comportarse fraternalmente los unos con los ot=
ros. Todos los seres humanos nacen libres e iguales en dignidad y derechos =
y, dotados como est=C3=A1n de raz=C3=B3n y conciencia, deben comportarse fr=
aternalmente los unos con los otros.&quot;</div>

--000000000000a0666d0606e4cfb7--
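
The fixture above stores its text as quoted-printable, so sequences such as `=C3=A1` and the trailing `=` soft line breaks have to be decoded before any language detection sees the Spanish text. The sketch below shows one way to do that with Python's standard `email` package. The shortened message and the boundary `b1` are made up for illustration, and this is not necessarily how `partition_email` reads the file.

```python
from email import message_from_string, policy

# Shortened, made-up message in the same shape as the fixture:
# multipart/alternative with quoted-printable plain and HTML parts.
RAW = """\
MIME-Version: 1.0
Subject: Test email with multiple languages
Content-Type: multipart/alternative; boundary="b1"

--b1
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

dotados como est=C3=A1n de raz=C3=B3n y conciencia, =
deben comportarse fraternalmente.

--b1
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">dotados como est=C3=A1n de raz=C3=B3n</div>

--b1--
"""

# policy.default yields an EmailMessage, which provides get_body/get_content.
msg = message_from_string(RAW, policy=policy.default)

# get_content() undoes the transfer encoding and decodes the UTF-8 charset.
plain = msg.get_body(preferencelist=("plain",)).get_content()
html = msg.get_body(preferencelist=("html",)).get_content()

print(plain)
print(html)
```

After decoding, the soft line break is gone and the accented characters are restored, which is the form of the text a language detector needs to work with.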
Binary file added example-docs/language-docs/eng_spa_mult.epub
Binary file not shown.
20 changes: 20 additions & 0 deletions example-docs/language-docs/eng_spa_mult.html
@@ -0,0 +1,20 @@
<!DOCTYPE html>
<html>
<body>
<p>
All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
</p>
<p>
All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. "Todos los seres humanos nacen libres e iguales en dignidad y derechos y, dotados como están de razón y conciencia, deben comportarse fraternalmente los unos con los otros. Todos los seres humanos nacen libres e iguales en dignidad y derechos y, dotados como están de razón y conciencia, deben comportarse fraternalmente los unos con los otros."
</p>
<p>
All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
</p>
<p>
All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
</p>
<p>
"Todos los seres humanos nacen libres e iguales en dignidad y derechos y, dotados como están de razón y conciencia, deben comportarse fraternalmente los unos con los otros. Todos los seres humanos nacen libres e iguales en dignidad y derechos y, dotados como están de razón y conciencia, deben comportarse fraternalmente los unos con los otros."
</p>
</body>
</html>
9 changes: 9 additions & 0 deletions example-docs/language-docs/eng_spa_mult.md
@@ -0,0 +1,9 @@
All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.

All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. "Todos los seres humanos nacen libres e iguales en dignidad y derechos y, dotados como están de razón y conciencia, deben comportarse fraternalmente los unos con los otros. Todos los seres humanos nacen libres e iguales en dignidad y derechos y, dotados como están de razón y conciencia, deben comportarse fraternalmente los unos con los otros."

All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.

All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.

"Todos los seres humanos nacen libres e iguales en dignidad y derechos y, dotados como están de razón y conciencia, deben comportarse fraternalmente los unos con los otros. Todos los seres humanos nacen libres e iguales en dignidad y derechos y, dotados como están de razón y conciencia, deben comportarse fraternalmente los unos con los otros."
Binary file added example-docs/language-docs/eng_spa_mult.odt
Binary file not shown.
9 changes: 9 additions & 0 deletions example-docs/language-docs/eng_spa_mult.org
@@ -0,0 +1,9 @@
All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.

All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. "Todos los seres humanos nacen libres e iguales en dignidad y derechos y, dotados como están de razón y conciencia, deben comportarse fraternalmente los unos con los otros. Todos los seres humanos nacen libres e iguales en dignidad y derechos y, dotados como están de razón y conciencia, deben comportarse fraternalmente los unos con los otros."

All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.

All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.

"Todos los seres humanos nacen libres e iguales en dignidad y derechos y, dotados como están de razón y conciencia, deben comportarse fraternalmente los unos con los otros. Todos los seres humanos nacen libres e iguales en dignidad y derechos y, dotados como están de razón y conciencia, deben comportarse fraternalmente los unos con los otros."
Binary file added example-docs/language-docs/eng_spa_mult.ppt
Binary file not shown.
Binary file added example-docs/language-docs/eng_spa_mult.pptx
Binary file not shown.