fix: separate elements by <br> tag in partition_html (#1314)
### Summary

Closes #1230. Updates `partition_html` to split on `<br>` tags that
appear within text elements.
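
To make the intended behavior concrete, here is a minimal standard-library sketch of splitting the text of an HTML fragment at `<br>` tags. It is an illustration only and not the implementation used by `partition_html`; the names `BrSplitter` and `split_on_br` are hypothetical and do not exist in `unstructured`.

```python
from html.parser import HTMLParser


class BrSplitter(HTMLParser):
    """Hypothetical helper: collect text, starting a new chunk at each <br>."""

    def __init__(self):
        super().__init__()
        self.chunks = [""]

    def handle_starttag(self, tag, attrs):
        # <br>, <br/> and <br /> all arrive here with tag == "br".
        if tag == "br":
            self.chunks.append("")

    def handle_data(self, data):
        # Text between tags is appended to the chunk currently being built.
        self.chunks[-1] += data


def split_on_br(html):
    """Return the non-empty text chunks of `html`, separated at <br> tags."""
    parser = BrSplitter()
    parser.feed(html)
    parser.close()  # flush any buffered text
    # Collapse internal whitespace and drop chunks that are empty.
    return [" ".join(chunk.split()) for chunk in parser.chunks if chunk.strip()]


sample = "<p>January 2023<br><br>First paragraph.<br /><br />Second\nparagraph.</p>"
for chunk in split_on_br(sample):
    print(chunk)
```

Consecutive `<br>` tags produce empty chunks, which the sketch discards, so a run of line breaks acts as a single separator.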


### Testing

The following code previously produced one giant element on `main`.

```python
from unstructured.partition.html import partition_html

filename = "example-docs/ideas-page.html"
elements = partition_html(filename=filename)

len(elements)  # Should be 4
print("\n\n".join([str(el) for el in elements]))
```

The output should be:

```text
January 2023

(Someone fed my essays into GPT to make something that could answer
questions based on them, then asked it where good ideas come from.  The
answer was ok, but not what I would have said. This is what I would have said.)

The way to get new ideas is to notice anomalies: what seems strange,
or missing, or broken? You can see anomalies in everyday life (much
of standup comedy is based on this), but the best place to look for
them is at the frontiers of knowledge.

Knowledge grows fractally.
From a distance its edges look smooth, but when you learn enough
to get close to one, you'll notice it's full of gaps. These gaps
will seem obvious; it will seem inexplicable that no one has tried
x or wondered about y. In the best case, exploring such gaps yields
whole new fractal buds.
```
MthwRobinson authored Sep 7, 2023
1 parent 09cc4bf commit 22974f6
Showing 16 changed files with 426 additions and 108 deletions.
5 changes: 3 additions & 2 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
## 0.10.13-dev2
## 0.10.13-dev3

### Enhancements

@@ -10,6 +10,7 @@

### Fixes

* `partition_html` breaks on `<br>` elements.
* Ingest error handling to properly raise errors when wrapped

## 0.10.12
@@ -31,7 +32,7 @@

* Bump unstructured-inference
* Avoid divide-by-zero errors with `safe_division` (0.5.21)

## 0.10.11

### Enhancements
26 changes: 21 additions & 5 deletions test_unstructured/partition/test_html_partition.py
Original file line number Diff line number Diff line change
@@ -265,12 +265,28 @@ def test_partition_html_raises_with_too_many_specified():
partition_html(filename=filename, text=text)


def test_partition_html_on_ideas_page():
filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "ideas-page.html")
def test_partition_html_on_ideas_page(filename="example-docs/ideas-page.html"):
elements = partition_html(filename=filename)
document_text = "\n\n".join([str(el) for el in elements])
assert document_text.startswith("January 2023(Someone fed my essays into GPT")
assert document_text.endswith("whole new fractal buds.")
assert len(elements) == 4

assert elements[0] == Title("January 2023")
assert elements[0].metadata.emphasized_text_contents is None
assert elements[0].metadata.link_urls is None

assert elements[1].text.startswith("(Someone fed my essays")
assert elements[1].text.endswith("I would have said.)")
assert len(elements[1].metadata.emphasized_text_contents) == 1
assert len(elements[1].metadata.link_urls) == 1

assert elements[2].text.startswith("The way to get new ideas")
assert elements[2].text.endswith("the frontiers of knowledge.")
assert elements[2].metadata.emphasized_text_contents is None
assert elements[2].metadata.link_urls is None

assert elements[3].text.startswith("Knowledge grows fractally")
assert elements[3].text.endswith("whole new fractal buds.")
assert elements[3].metadata.emphasized_text_contents is None
assert elements[3].metadata.link_urls is None


def test_user_without_file_write_permission_can_partition_html(tmp_path, monkeypatch):
Original file line number Diff line number Diff line change
@@ -1,18 +1,27 @@
[
{
"type": "Title",
"element_id": "17c1a6701c263407d0fcf7c3ebfb2986",
"metadata": {
"data_source": {},
"filename": "ideas-page.html",
"filetype": "text/html",
"page_number": 1
},
"text": "January 2023"
},
{
"type": "NarrativeText",
"element_id": "c08fcabe68ba13b7a7cc6592bd5513a8",
"element_id": "6ea0e510b7ea64f87b55c1fe388cba7f",
"metadata": {
"data_source": {},
"filename": "ideas-page.html",
"filetype": "text/html",
"page_number": 1,
"link_urls": [
"index.html",
"https://twitter.com/stef/status/1617222428727586816"
],
"link_texts": [
null,
null
],
"emphasized_text_contents": [
@@ -22,6 +31,28 @@
"i"
]
},
"text": "January 2023(Someone fed my essays into GPT to make something that could answer\nquestions based on them, then asked it where good ideas come from. The\nanswer was ok, but not what I would have said. This is what I would have said.)The way to get new ideas is to notice anomalies: what seems strange,\nor missing, or broken? You can see anomalies in everyday life (much\nof standup comedy is based on this), but the best place to look for\nthem is at the frontiers of knowledge.Knowledge grows fractally.\nFrom a distance its edges look smooth, but when you learn enough\nto get close to one, you'll notice it's full of gaps. These gaps\nwill seem obvious; it will seem inexplicable that no one has tried\nx or wondered about y. In the best case, exploring such gaps yields\nwhole new fractal buds."
"text": "(Someone fed my essays into GPT to make something that could answer\nquestions based on them, then asked it where good ideas come from. The\nanswer was ok, but not what I would have said. This is what I would have said.)"
},
{
"type": "NarrativeText",
"element_id": "a8ce0a2e7d66af2000e6c3bd36994411",
"metadata": {
"data_source": {},
"filename": "ideas-page.html",
"filetype": "text/html",
"page_number": 1
},
"text": "The way to get new ideas is to notice anomalies: what seems strange,\nor missing, or broken? You can see anomalies in everyday life (much\nof standup comedy is based on this), but the best place to look for\nthem is at the frontiers of knowledge."
},
{
"type": "NarrativeText",
"element_id": "4eafbff98b81999dfbf3572440d22393",
"metadata": {
"data_source": {},
"filename": "ideas-page.html",
"filetype": "text/html",
"page_number": 1
},
"text": "Knowledge grows fractally.\nFrom a distance its edges look smooth, but when you learn enough\nto get close to one, you'll notice it's full of gaps. These gaps\nwill seem obvious; it will seem inexplicable that no one has tried\nx or wondered about y. In the best case, exploring such gaps yields\nwhole new fractal buds."
}
]
Original file line number Diff line number Diff line change
@@ -353,7 +353,7 @@
},
{
"type": "NarrativeText",
"element_id": "7480a79a5bad8a36f3f7e5d622f0b5f3",
"element_id": "073a8fd4fe21204eff8c0ca133f6993f",
"metadata": {
"data_source": {},
"filetype": "text/html",
@@ -365,7 +365,17 @@
"strong"
]
},
"text": "First, take steps to better prepare for the seasonal hazards weather can throw at you.\r\nThis could include a spring cleaning of your storm shelter or ensuring your emergency kit is fully stocked. Take a look at our infographics and social media posts to help you become “weather-ready.”"
"text": "First, take steps to better prepare for the seasonal hazards weather can throw at you."
},
{
"type": "NarrativeText",
"element_id": "d97aee85f18639e200b29757e5783dad",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "This could include a spring cleaning of your storm shelter or ensuring your emergency kit is fully stocked. Take a look at our infographics and social media posts to help you become “weather-ready.”"
},
{
"type": "NarrativeText",
@@ -416,26 +426,98 @@
"text": "Stay safe this spring, and every season, by being informed, prepared, and Weather-Ready."
},
{
"type": "NarrativeText",
"element_id": "47d5d0d27a35a36d7467dfc8b6e089b3",
"type": "Title",
"element_id": "c9b4b8b324383371034a3682d0d712d2",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1,
"link_urls": [
"http://www.commerce.gov"
],
"link_texts": [
"US Dept of Commerce"
]
},
"text": "US Dept of Commerce"
},
{
"type": "Title",
"element_id": "668c4fe04cbbc45c7e91b0b675dd48a3",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1,
"link_urls": [
"http://www.noaa.gov"
],
"link_texts": [
"National Oceanic and Atmospheric Administration"
]
},
"text": "National Oceanic and Atmospheric Administration"
},
{
"type": "Title",
"element_id": "a5c0620dc25afae7e2761c210037b45c",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1,
"link_urls": [
"https://www.weather.gov"
],
"link_texts": [
"National Weather Service"
]
},
"text": "National Weather Service"
},
{
"type": "Title",
"element_id": "41f6e17bf5e9a407fcca74e902f802a0",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "News Around NOAA"
},
{
"type": "Title",
"element_id": "d27040ad6074797db8e535d1fba3b5d8",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "1325 East West Highway"
},
{
"type": "Address",
"element_id": "7ab3e0275d15e2c26b18983db0685ddb",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "Silver Spring, MD 20910"
},
{
"type": "Title",
"element_id": "1b0316a06a8f4d5b672669bb9f5b2877",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1,
"link_urls": [
"http://www.commerce.gov",
"http://www.noaa.gov",
"https://www.weather.gov",
"https://www.weather.gov/news/contact"
],
"link_texts": [
"US Dept of Commerce",
"National Oceanic and Atmospheric Administration",
"National Weather Service",
"Comments? Questions? Please Contact Us."
]
},
"text": "US Dept of Commerce\n National Oceanic and Atmospheric Administration\n National Weather Service\n News Around NOAA1325 East West HighwaySilver Spring, MD 20910Comments? Questions? Please Contact Us."
"text": "Comments? Questions? Please Contact Us."
},
{
"type": "Title",
Original file line number Diff line number Diff line change
@@ -1,17 +1,25 @@
[
{
"type": "Title",
"element_id": "17c1a6701c263407d0fcf7c3ebfb2986",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "January 2023"
},
{
"type": "NarrativeText",
"element_id": "c08fcabe68ba13b7a7cc6592bd5513a8",
"element_id": "6ea0e510b7ea64f87b55c1fe388cba7f",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1,
"link_urls": [
"index.html",
"https://twitter.com/stef/status/1617222428727586816"
],
"link_texts": [
null,
null
],
"emphasized_text_contents": [
@@ -21,6 +29,26 @@
"i"
]
},
"text": "January 2023(Someone fed my essays into GPT to make something that could answer\nquestions based on them, then asked it where good ideas come from. The\nanswer was ok, but not what I would have said. This is what I would have said.)The way to get new ideas is to notice anomalies: what seems strange,\nor missing, or broken? You can see anomalies in everyday life (much\nof standup comedy is based on this), but the best place to look for\nthem is at the frontiers of knowledge.Knowledge grows fractally.\nFrom a distance its edges look smooth, but when you learn enough\nto get close to one, you'll notice it's full of gaps. These gaps\nwill seem obvious; it will seem inexplicable that no one has tried\nx or wondered about y. In the best case, exploring such gaps yields\nwhole new fractal buds."
"text": "(Someone fed my essays into GPT to make something that could answer\nquestions based on them, then asked it where good ideas come from. The\nanswer was ok, but not what I would have said. This is what I would have said.)"
},
{
"type": "NarrativeText",
"element_id": "a8ce0a2e7d66af2000e6c3bd36994411",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "The way to get new ideas is to notice anomalies: what seems strange,\nor missing, or broken? You can see anomalies in everyday life (much\nof standup comedy is based on this), but the best place to look for\nthem is at the frontiers of knowledge."
},
{
"type": "NarrativeText",
"element_id": "4eafbff98b81999dfbf3572440d22393",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "Knowledge grows fractally.\nFrom a distance its edges look smooth, but when you learn enough\nto get close to one, you'll notice it's full of gaps. These gaps\nwill seem obvious; it will seem inexplicable that no one has tried\nx or wondered about y. In the best case, exploring such gaps yields\nwhole new fractal buds."
}
]
Original file line number Diff line number Diff line change
@@ -1,17 +1,25 @@
[
{
"type": "Title",
"element_id": "17c1a6701c263407d0fcf7c3ebfb2986",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "January 2023"
},
{
"type": "NarrativeText",
"element_id": "c08fcabe68ba13b7a7cc6592bd5513a8",
"element_id": "6ea0e510b7ea64f87b55c1fe388cba7f",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1,
"link_urls": [
"index.html",
"https://twitter.com/stef/status/1617222428727586816"
],
"link_texts": [
null,
null
],
"emphasized_text_contents": [
@@ -21,6 +29,26 @@
"i"
]
},
"text": "January 2023(Someone fed my essays into GPT to make something that could answer\nquestions based on them, then asked it where good ideas come from. The\nanswer was ok, but not what I would have said. This is what I would have said.)The way to get new ideas is to notice anomalies: what seems strange,\nor missing, or broken? You can see anomalies in everyday life (much\nof standup comedy is based on this), but the best place to look for\nthem is at the frontiers of knowledge.Knowledge grows fractally.\nFrom a distance its edges look smooth, but when you learn enough\nto get close to one, you'll notice it's full of gaps. These gaps\nwill seem obvious; it will seem inexplicable that no one has tried\nx or wondered about y. In the best case, exploring such gaps yields\nwhole new fractal buds."
"text": "(Someone fed my essays into GPT to make something that could answer\nquestions based on them, then asked it where good ideas come from. The\nanswer was ok, but not what I would have said. This is what I would have said.)"
},
{
"type": "NarrativeText",
"element_id": "a8ce0a2e7d66af2000e6c3bd36994411",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "The way to get new ideas is to notice anomalies: what seems strange,\nor missing, or broken? You can see anomalies in everyday life (much\nof standup comedy is based on this), but the best place to look for\nthem is at the frontiers of knowledge."
},
{
"type": "NarrativeText",
"element_id": "4eafbff98b81999dfbf3572440d22393",
"metadata": {
"data_source": {},
"filetype": "text/html",
"page_number": 1
},
"text": "Knowledge grows fractally.\nFrom a distance its edges look smooth, but when you learn enough\nto get close to one, you'll notice it's full of gaps. These gaps\nwill seem obvious; it will seem inexplicable that no one has tried\nx or wondered about y. In the best case, exploring such gaps yields\nwhole new fractal buds."
}
]