fix: GH issue 1057 etree parser error (csv) (#1112)
Addresses #1057 for CSV. Related to PR #1077.

* update partition_csv to always use soupparser_fromstring to parse html text
christinestraub authored Aug 14, 2023
1 parent 612f9da commit 8026646
Showing 7 changed files with 12 additions and 14 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,7 +1,8 @@
-## 0.9.3-dev2
+## 0.9.3-dev3

 ### Enhancements

+* Update `partition_csv` to always use `soupparser_fromstring` to parse `html text`
 * Update `partition_tsv` to always use `soupparser_fromstring` to parse `html text`
 * Add `metadata.section` to capture epub table of contents data
 * Add `unique_element_ids` kwarg to partition functions. If `True`, will use a UUID
Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.9.3-dev2" # pragma: no cover
+__version__ = "0.9.3-dev3" # pragma: no cover
5 changes: 1 addition & 4 deletions unstructured/partition/csv.py
@@ -2,7 +2,6 @@
 from typing import IO, BinaryIO, List, Optional, Union, cast

 import pandas as pd
-from lxml.html import document_fromstring
 from lxml.html.soupparser import fromstring as soupparser_fromstring

 from unstructured.documents.elements import (
@@ -13,7 +12,6 @@
 )
 from unstructured.file_utils.filetype import FileType, add_metadata_with_filetype
 from unstructured.partition.common import (
-    contains_emoji,
     exactly_one,
     get_last_modified_date,
     get_last_modified_date_from_file,
@@ -60,8 +58,7 @@ def partition_csv(
         table = pd.read_csv(f)

     html_text = table.to_html(index=False, header=False, na_rep="")
-    html_string_parser = soupparser_fromstring if contains_emoji(html_text) else document_fromstring
-    text = html_string_parser(html_text).text_content()
+    text = soupparser_fromstring(html_text).text_content()

     if include_metadata:
         metadata = ElementMetadata(
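
For readers who want to try the changed lines in isolation, here is a minimal sketch of the new code path. It assumes `pandas`, `lxml`, and `beautifulsoup4` are installed; the helper name `csv_to_text` and the sample data are hypothetical and not part of `unstructured`. It only mirrors the two statements touched by this commit (render the table to HTML, then always parse with `soupparser_fromstring`), not the rest of `partition_csv`.

```python
import io

import pandas as pd
from lxml.html.soupparser import fromstring as soupparser_fromstring


def csv_to_text(csv_source: str) -> str:
    """Hypothetical helper: render CSV text as an HTML table and extract its text."""
    table = pd.read_csv(io.StringIO(csv_source))
    # Same rendering options as in the diff: no index, no header row, NaN as "".
    html_text = table.to_html(index=False, header=False, na_rep="")
    # Always use the BeautifulSoup-backed parser, regardless of the content.
    return soupparser_fromstring(html_text).text_content()


# Illustrative sample only: one cell contains an emoji, one cell is empty.
sample_csv = "name,mood\nAda,happy 😀\nGrace,\n"
text = csv_to_text(sample_csv)
print(text.split())
```

Using one parser unconditionally removes the `contains_emoji` branch, so every input goes through the same code path.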
