Fix the file-format error raised when docx loads files #2061

Merged (4 commits) on Oct 17, 2024

iCanDoAllThingszz
Contributor

Description

When a doc file is uploaded to the knowledge base, docx raises an error while loading files that have a specific format. The cause is that docx does not parse one of the file's elements correctly and ends up with a null value.

The error message is:

2024-10-12 15:00:56 | ERROR | dbgpt.serve.rag.service.service | document embedding, failed:海外市场相关研究-补充材料.docx, "There is no item named 'NULL' in the archive"

Sample file:
海外市场相关研究-补充材料.docx

Sample code:

import docx

# Opening the sample file is enough to reproduce the error shown above.
doc = docx.Document('/Users/zhaoyu/Desktop/海外市场相关研究-补充材料.docx')

How To Fix

This problem has already been addressed in the docx issue tracker.
We only need to rewrite the load_from_xml_v2 method so that, when the target element being parsed is NULL, it skips that element and continues parsing instead of throwing an exception.
With this change, users can upload doc files in this specific format without running into the error.
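
The skip-on-NULL idea can be illustrated without touching docx internals. The following is a minimal standard-library sketch, not the code merged in this PR. It assumes that the offending entries are relationship records inside the archive's .rels parts whose Target is literally NULL or ../NULL (an inference from the error message), and the names split_relationships, find_null_targets, and NULL_TARGETS are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by relationship parts (*.rels) inside a .docx archive.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
# Assumption: these target values are the ones treated as null references.
NULL_TARGETS = {"NULL", "../NULL"}


def split_relationships(rels_xml):
    """Return (kept, skipped) lists of (Id, Target) pairs from one .rels part."""
    root = ET.fromstring(rels_xml)
    kept, skipped = [], []
    for rel in root.findall(f"{{{RELS_NS}}}Relationship"):
        pair = (rel.get("Id"), rel.get("Target"))
        if pair[1] in NULL_TARGETS:
            # Skip the null reference and keep parsing instead of failing.
            skipped.append(pair)
        else:
            kept.append(pair)
    return kept, skipped


def find_null_targets(docx_file):
    """Map each .rels part in the archive to its relationships that point at NULL."""
    report = {}
    with zipfile.ZipFile(docx_file) as archive:
        for name in archive.namelist():
            if name.endswith(".rels"):
                _, skipped = split_relationships(archive.read(name))
                if skipped:
                    report[name] = skipped
    return report
```

Calling find_null_targets on a path to a suspect file gives a quick way to check whether it contains such entries before uploading it; an empty dictionary means none were found under the assumptions above.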

Aries-ckt
Aries-ckt previously approved these changes Oct 16, 2024
Collaborator

@Aries-ckt Aries-ckt left a comment

LGTM.

Collaborator

@Aries-ckt Aries-ckt left a comment

LGTM.

Collaborator

@csunny csunny left a comment

r+

@csunny csunny merged commit 630d644 into eosphoros-ai:main Oct 17, 2024
4 checks passed
3 participants