Commit
Add clean_ligatures to core cleaners (#1326)
# Background

[Ligatures](https://en.wikipedia.org/wiki/Ligature_(writing)#Ligatures_in_Unicode_(Latin_alphabets)) can sometimes show up during text extraction when they should not. Very common examples are the Latin `f`-related ligatures, which can be **very subtle** to spot by eye (see the example below) but can wreak havoc later.

```python
"ﬀ": "ff",
"ﬁ": "fi",
"ﬂ": "fl",
"ﬃ": "ffi",
"ﬄ": "ffl",
```

Several libraries already do something like this. Most recently, `pdfplumber` added this capability as part of its text extraction process; see jsvine/pdfplumber#598.

Instead of introducing a breaking change to the PDF text processing in `unstructured`, it is best to add this as another cleaner and let users opt in. To that end, this PR adds the `clean_ligatures` function, with accompanying tests.

# Example

Here is an example PDF that causes the issue. For example, it yields `Beneﬁts` (with a single `ﬁ` ligature character), which should be `Benefits`.

[example.pdf](https://github.com/Unstructured-IO/unstructured/files/12544344/example.pdf)

```bash
# Double quotes around the API key header so the shell expands the variable.
curl -X 'POST' \
  'https://api.unstructured.io/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "unstructured-api-key: ${UNSTRUCTURED_API_KEY}" \
  -F 'files=@example.pdf' \
  -s | jq -C .
```

# Notes

An initial list of mappings was added with the most common ligatures. There is some subjectivity to this, but it should be a relatively safe starting set. It can always be expanded as needed.
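As a rough illustration of the opt-in cleaner idea, here is a minimal sketch of a ligature-replacing function. The names `LIGATURE_MAP_SKETCH` and `replace_ligatures_sketch` are hypothetical and are not the code added in this PR; the mapping covers only the five `f`-related ligatures listed under Background, written as Unicode escapes so they survive copy and paste.

```python
# Hypothetical sketch, not the implementation shipped in this PR.
# Keys are the single-codepoint Latin "f" ligatures (U+FB00 to U+FB04).
LIGATURE_MAP_SKETCH = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def replace_ligatures_sketch(text: str) -> str:
    """Return `text` with each mapped ligature expanded to plain letters."""
    for ligature, expansion in LIGATURE_MAP_SKETCH.items():
        text = text.replace(ligature, expansion)
    return text


if __name__ == "__main__":
    extracted = "Bene\ufb01ts"  # "Benefits" as extracted, with the fi ligature
    print(replace_ligatures_sketch(extracted))
```

Because each key is a single code point and no replacement contains a ligature, the order of the replacements does not matter. Text without ligatures passes through unchanged, which is what makes this safe to offer as an opt-in step.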