Keep minimal structure of tables in text #13

Open
ivsanro1 opened this issue Jun 23, 2024 · 2 comments
Labels
good first issue Good for newcomers

Comments

@ivsanro1

I think it'd be great to keep some basic separators so we don't lose too much structural information from tables:

>>> import html_text
>>> from lxml.html import fromstring

>>> tree = fromstring("""
... <table>
...   <tr>
...     <th>Company</th>
...     <th>Contact</th>
...     <th>Country</th>
...   </tr>
...   <tr>
...     <td>Alfreds Futterkiste</td>
...     <td>Maria Anders</td>
...     <td>Germany</td>
...   </tr>
...   <tr>
...     <td>Centro comercial Moctezuma</td>
...     <td>Francisco Chang</td>
...     <td>Mexico</td>
...   </tr>
... </table> 
... """)

>>> print(html_text.extract_text(tree, guess_layout=True))
Company Contact Country
Alfreds Futterkiste Maria Anders Germany
Centro comercial Moctezuma Francisco Chang Mexico

A better output would be:

Company | Contact | Country
Alfreds Futterkiste | Maria Anders | Germany
Centro comercial Moctezuma | Francisco Chang | Mexico

@lopuhin do you think this would be relevant for this library?
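
The pipe-separated output proposed above could be produced along these lines. This is only a minimal sketch, not how html_text works internally: it uses the standard library's `html.parser` so that it runs on its own, and `TableTextParser` and `table_to_text` are hypothetical names. It assumes explicitly closed `<tr>`, `<th>` and `<td>` tags and ignores nested tables.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect table cells row by row (hypothetical helper, not part of html_text)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text chunks of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside of a cell (e.g. indentation between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # Collapse runs of whitespace inside the cell to single spaces.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_text(html, sep=" | "):
    """Return one line per table row, with cells joined by `sep`."""
    parser = TableTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(sep.join(row) for row in parser.rows)


html = """
<table>
  <tr><th>Company</th><th>Contact</th><th>Country</th></tr>
  <tr><td>Alfreds Futterkiste</td><td>Maria Anders</td><td>Germany</td></tr>
</table>
"""
print(table_to_text(html))
```

In a real implementation the separator would more likely be emitted during the library's own tree traversal rather than by a second parsing pass.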

ivsanro1 added the "good first issue" label Jun 23, 2024
@lopuhin
Contributor

lopuhin commented Jun 24, 2024

@ivsanro1 that makes a lot of sense. Thinking about other options here, one more possibility could be using tabs \t instead of | as a separator. That would still follow the approach of not adding new non-blank characters to the original text, while preserving the same amount of information as the |. It is also how tables are represented if you copy them and paste them into a text field.

@ivsanro1
Author

ivsanro1 commented Jun 24, 2024

> Thinking about other options here, one more possibility could be using tabs \t instead of | as a separator

Makes sense @lopuhin, thanks for your input on this. Originally I was thinking of | rather than tabs because the vocabularies of recent LLMs (e.g. llama3) tend to contain combinations of spaces and tabs, which makes the resulting tokens less consistent, especially if there are cells in the table without text. I was wondering whether that would affect how an LLM interprets this text, semantically speaking.

I find that the | separator tokenizes more consistently:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.encode("\t", add_special_tokens=False)
[197]
>>> tokenizer.encode("\t\t", add_special_tokens=False)
[298]
>>> tokenizer.encode("\t\t\t", add_special_tokens=False)
[573]
>>> tokenizer.encode(" | ", add_special_tokens=False)
[765, 220]
>>> tokenizer.encode(" |  | ", add_special_tokens=False)
[765, 220, 765, 220]
>>> tokenizer.encode("| ", add_special_tokens=False)
[91, 220]
>>> tokenizer.encode("|  |", add_special_tokens=False)
[91, 220, 765]
>>> tokenizer.encode(" \t  \t ", add_special_tokens=False)
[7163, 79199]
>>> tokenizer.encode(" \t  \t  \t", add_special_tokens=False)
[7163, 256, 63472]
>>> tokenizer.encode(" \t  \t  \t ", add_special_tokens=False)
[7163, 256, 8860, 3762]
>>> tokenizer.encode(" |  |  |", add_special_tokens=False)
[765, 220, 765, 220, 765]
>>> tokenizer.encode(" |  |  | ", add_special_tokens=False)
[765, 220, 765, 220, 765, 220]

But I also like the option of not adding non-spacing characters. I think the best option would be to make the separator customizable.
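
To make the customizable idea concrete, here is a small sketch of what a configurable separator could look like once the rows have been extracted. The function name `join_table_rows` and the `cell_separator` parameter are assumptions for illustration only; nothing in this thread says html_text exposes such an option.

```python
def join_table_rows(rows, cell_separator="\t"):
    """Render rows (lists of cell strings) as text, one row per line.

    `cell_separator` is a hypothetical option name used for illustration.
    """
    return "\n".join(cell_separator.join(row) for row in rows)


rows = [
    ["Company", "Contact", "Country"],
    ["Alfreds Futterkiste", "", "Germany"],  # empty middle cell
]

# Default: tabs, so no non-blank characters are added to the text.
print(repr(join_table_rows(rows)))

# Opt-in: a visible separator, which tokenized more regularly above.
print(repr(join_table_rows(rows, cell_separator=" | ")))
```

With an empty cell, the tab variant produces two consecutive tabs and the pipe variant produces " |  | ", which are the kinds of sequences compared in the tokenizer session above.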
