Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

squatter: do not hold mailbox locks while calling the attachment extract service #4480

Merged
merged 23 commits into from
Jul 26, 2023

Conversation

rsto
Copy link
Member

@rsto rsto commented Apr 17, 2023

This updates squatter to not lock a mailbox while extracting text from attachments. To do so, the attachment extraction module now supports an optional file-system based cache. By default, squatter uses a temporary cache. The newly added experimental squatter argument options v--attachextract-cache-dir and --attachextract-cache-only allow to use user-defined cache directory, as well as instruct squatter to only lookup extracted attachment text in the cache.

In addition, handling of timeouts when calling the extractor service has changed: rather than attempting to connect up to 4 times, an error now is returned after the first timeout.

This pull request best is read commit by commit, where the first two commits are refactoring only.

@rsto rsto changed the title Do not hold mailbox locks while calling the attachment extract service squatter: do not hold mailbox locks while calling the attachment extract service Apr 17, 2023
Copy link
Contributor

@ksmurchison ksmurchison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know a ton about the extractor, but most of this appears to be sane.
I still don't understand with the port_waitevent is used for.

@rsto
Copy link
Member Author

rsto commented Apr 18, 2023

@ksmurchison the comment in this commit indicates that the prot_waitevent timeout is used to close network connections to the backend if it hasn't been used recently. See here: https://github.com/cyrusimap/cyrus-imapd/blame/f0f8a130b889381da1beb89962727f855370a301/imap/index.c#L5590
I guess we could achieve the same by adding the waitevent on the http client out stream, rather than clientin?

@ksmurchison
Copy link
Contributor

@ksmurchison the comment in this commit indicates that the prot_waitevent timeout is used to close network connections to the backend if it hasn't been used recently. See here: https://github.com/cyrusimap/cyrus-imapd/blame/f0f8a130b889381da1beb89962727f855370a301/imap/index.c#L5590
I guess we could achieve the same by adding the waitevent on the http client out stream, rather than clientin?

No, this is fine. After thinking about it, this is he same thing that we do with the backend IMAP connections in a Murder.

Copy link
Contributor

@elliefm elliefm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bunch of comments and suggestions, some just nits, some more important.

cassandane/Cassandane/Cyrus/SearchFuzzy.pm Outdated Show resolved Hide resolved
cassandane/Cassandane/Cyrus/SearchFuzzy.pm Outdated Show resolved Hide resolved
cassandane/Cassandane/Cyrus/SearchFuzzy.pm Outdated Show resolved Hide resolved
cassandane/Cassandane/Cyrus/SearchFuzzy.pm Outdated Show resolved Hide resolved
cassandane/Cassandane/Cyrus/SearchFuzzy.pm Outdated Show resolved Hide resolved
imap/attachextract.c Show resolved Hide resolved
imap/attachextract.c Outdated Show resolved Hide resolved
imap/search_engines.c Outdated Show resolved Hide resolved
imap/search_engines.c Outdated Show resolved Hide resolved
cassandane/Cassandane/Cyrus/SearchFuzzy.pm Outdated Show resolved Hide resolved
@rsto rsto force-pushed the squatter_nolock_attachment_text_extraction branch from 72ae24b to f731de2 Compare April 20, 2023 11:55
@rsto
Copy link
Member Author

rsto commented Apr 20, 2023

@elliefm thanks, I have updated the code accordingly

@rsto rsto requested a review from elliefm April 20, 2023 11:55
Copy link
Contributor

@elliefm elliefm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird stuff still going on with that one chmod test... I've attached a patch with some extra debug logging that identifies exactly when things get weird, but I have no idea why they get weird.

Also got a failure from SearchFuzzy.squatter_attachextract_nolock_slow but I haven't been able to reproduce it, it started passing as soon as I started looking at it.

imap/index.c Show resolved Hide resolved
cassandane/Cassandane/Cyrus/SearchFuzzy.pm Outdated Show resolved Hide resolved
cassandane/Cassandane/Cyrus/SearchFuzzy.pm Outdated Show resolved Hide resolved
@rsto rsto requested a review from elliefm April 24, 2023 16:15
@rsto rsto force-pushed the squatter_nolock_attachment_text_extraction branch from f731de2 to a20be29 Compare April 26, 2023 06:03
@rsto
Copy link
Member Author

rsto commented May 2, 2023

Tagging this Do-Not-Merge for now: I first want to get #4495 into the codebase.

@rsto rsto force-pushed the squatter_nolock_attachment_text_extraction branch from a20be29 to 5dde1ee Compare May 15, 2023 13:49
@rsto rsto force-pushed the squatter_nolock_attachment_text_extraction branch from 5dde1ee to 6e84d70 Compare May 15, 2023 16:18
@rsto rsto force-pushed the squatter_nolock_attachment_text_extraction branch from 6e84d70 to 39ce3dc Compare May 19, 2023 09:18
@rsto rsto force-pushed the squatter_nolock_attachment_text_extraction branch from 39ce3dc to 8cd385b Compare May 23, 2023 08:13
@rsto rsto force-pushed the squatter_nolock_attachment_text_extraction branch from f8d81fb to 5110853 Compare June 7, 2023 13:46
@rsto rsto force-pushed the squatter_nolock_attachment_text_extraction branch from 121516b to db2b787 Compare June 16, 2023 11:28
@rsto rsto removed the Do Not Merge label Jun 19, 2023
rsto added 23 commits July 26, 2023 13:23
This fixes a regression where the HTTP 422 response code of
the attachment extractor service was inadvertently handled as
an error. As a result, the message was tagged as partially
indexed.

Signed-off-by: Robert Stepanek <[email protected]>
@rsto rsto force-pushed the squatter_nolock_attachment_text_extraction branch from db2b787 to 406fa26 Compare July 26, 2023 11:26
@rsto rsto merged commit 789d955 into master Jul 26, 2023
2 checks passed
@rsto rsto deleted the squatter_nolock_attachment_text_extraction branch July 26, 2023 12:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants