Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added --slice-queryset argument #284

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

iJohnMaged
Copy link

Came across your library and wanted to integrate it for a client, great work!

However when I was deploying on a relatively big database (2M rows, big model with lots of text data), the process was always getting killed on PythonAnywhere while using all the CPU and ram available, without creating a single index in watson_searchentry.

So I tinkered a bit and found that .iterator() is the issue in my case (limited resources, MySQL database too), buildwatson doesn't get to create any index, eventually changed the code to slice instead of .iterator and it got through.

I add an argument to buildwatson called --slice-queryset to slice it instead of iterate, if that works for others in some cases.

@etianen
Copy link
Owner

etianen commented Aug 18, 2021

Thanks for this, and for the tests.

I'm really surprised this works! Performing a count on a large dataset can be very slow, and chunking through a dataset gets slower each chunk due to how databases perform offsetting.

But, pragmatically, for databases that don't support server side cursors, it's better than nothing.

I would suggest the following changes:

  1. Lose the count. Just keep slicing until you get no results.
  2. Rather than using offset, order the queryset by pk (asc), and after each batch, filter the next batch by pk__gt=final_pk_of_prev_batch

(2) is really important. It means this will scale to larger datasets, and it means, for auto-increment pks, no models will be missed if the data is being edited.

@iJohnMaged
Copy link
Author

I'll fix the build checks errors and implement those changes, you're completely right about those and I ended up implementing that in the view anyway for search!

@danihodovic
Copy link
Contributor

Running into a similar issue on 100k rows :/ Any chance of merging this John?

@etianen
Copy link
Owner

etianen commented Aug 21, 2022

I can't merge this without the suggested changes, and broken builds, being fixed. I'm happy to consider another PR, or updates to this one.

@danihodovic
Copy link
Contributor

@etianen Did you publicize the library mentioned in #26 (comment) :) ?

@etianen
Copy link
Owner

etianen commented Aug 21, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants