patch unstructured embeddings gen example #520

Merged
merged 2 commits into from
Oct 18, 2024

Conversation

mattseddon
Member

@mattseddon mattseddon commented Oct 18, 2024

This is just a patch for the example/test. We should revert it once Unstructured-IO/unstructured#3730 is merged and released.

cloudflare-workers-and-pages bot commented Oct 18, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: fe93b13
Status: ✅  Deploy successful!
Preview URL: https://93a1d263.datachain-documentation.pages.dev
Branch Preview URL: https://patch-example.datachain-documentation.pages.dev

codecov bot commented Oct 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.43%. Comparing base (f6445e2) to head (fe93b13).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #520   +/-   ##
=======================================
  Coverage   87.43%   87.43%           
=======================================
  Files          97       97           
  Lines       10069    10069           
  Branches     1374     1374           
=======================================
  Hits         8804     8804           
  Misses        908      908           
  Partials      357      357           
Flag Coverage Δ
datachain 87.40% <ø> (ø)

@@ -12,11 +12,11 @@
     group_broken_paragraphs,
     replace_unicode_quotes,
 )
-from unstructured.embed.huggingface import (
+from unstructured.partition.pdf import partition_pdf
+from unstructured_ingest.embed.huggingface import (
Member Author

[F] From what I can see, `unstructured_ingest` is the new package for doing things. There is a new API for ingesting data, but I haven't been able to grok it completely.

What I can tell you is that you can no longer instantiate the old `HuggingFaceEmbeddingEncoder`, because it is missing abstract methods, and the new `embed_documents` expects `list[dict]` instead of `list[Element]`. The two are incompatible without these small changes.
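
The two failure modes described above can be reproduced in isolation. The following is a minimal sketch using only the standard library: `BaseEncoder`, `OldEncoder`, `NewEncoder`, `Element` and `embed_elements` are hypothetical stand-ins written for illustration, not the actual classes from `unstructured` or `unstructured_ingest`, and the real set of missing abstract methods may differ.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class BaseEncoder(ABC):
    """Hypothetical base class whose interface gained abstract methods."""

    @abstractmethod
    def embed_documents(self, elements: list[dict]) -> list[dict]: ...

    @abstractmethod
    def embed_query(self, query: str) -> list[float]: ...


class OldEncoder(BaseEncoder):
    """Written against an older base: one abstract method is not implemented."""

    def embed_query(self, query: str) -> list[float]:
        return [float(len(query))]


class NewEncoder(BaseEncoder):
    """Implements the full interface and works on plain dicts."""

    def embed_documents(self, elements: list[dict]) -> list[dict]:
        # Toy "embedding": the text length, just to keep the sketch runnable.
        return [{**e, "embeddings": [float(len(e["text"]))]} for e in elements]

    def embed_query(self, query: str) -> list[float]:
        return [float(len(query))]


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""

    text: str

    def to_dict(self) -> dict:
        return {"text": self.text}


def try_instantiate(cls):
    """Return an instance, or the TypeError raised for missing abstract methods."""
    try:
        return cls()
    except TypeError as exc:
        return exc


def embed_elements(encoder: BaseEncoder, elements: list[Element]) -> list[dict]:
    """Bridge list[Element] to the list[dict] input the new method expects."""
    return encoder.embed_documents([e.to_dict() for e in elements])
```

Here `try_instantiate(OldEncoder)` returns a `TypeError` because Python refuses to instantiate a class with unimplemented abstract methods, while `embed_elements(NewEncoder(), [Element("abc")])` shows the kind of element-to-dict conversion a caller would need.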

Member Author

I raised Unstructured-IO/unstructured#3731 / Unstructured-IO/unstructured#3730 to fix the issue on their end properly.

Member Author

It seems simpler to just set the upper limit of unstructured to the version before 0.16.0.
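
In requirement-specifier terms, such an upper limit would look like `unstructured<0.16.0` (the extras and lower bound actually used by the project are not shown here). As a rough illustration of what that cap admits, the sketch below compares release numbers with plain tuples. `release_tuple` and `satisfies_cap` are hypothetical helpers, and the comparison deliberately ignores pre-release and dev suffixes, which a real resolver handles according to PEP 440.

```python
import re

CAP = (0, 16, 0)  # first release excluded by the proposed upper limit


def release_tuple(version: str, width: int = 3) -> tuple[int, ...]:
    """Parse the leading numeric release segment, e.g. '0.15.14' -> (0, 15, 14)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    parts = tuple(int(p) for p in match.group().split("."))
    # Pad short versions so that '0.16' compares equal to '0.16.0'.
    return parts + (0,) * max(0, width - len(parts))


def satisfies_cap(version: str, cap: tuple[int, ...] = CAP) -> bool:
    """True if the release number is strictly below the cap."""
    return release_tuple(version) < cap
```

With these helpers, `satisfies_cap("0.15.14")` is true, while `satisfies_cap("0.16.0")` and `satisfies_cap("0.16")` are both false.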

@mattseddon mattseddon marked this pull request as ready for review October 18, 2024 04:31
@mattseddon mattseddon requested a review from a team October 18, 2024 04:31
@mattseddon mattseddon self-assigned this Oct 18, 2024
@mattseddon
Member Author

cc @tibor-mach

Contributor

@dtulga dtulga left a comment

LGTM, thanks for investigating this!

@shcheklein shcheklein merged commit e699c1e into main Oct 18, 2024
38 checks passed
@shcheklein shcheklein deleted the patch-example branch October 18, 2024 18:11
3 participants