Optional Use of Body_Safe #110

Open
HurricanePete opened this issue Sep 10, 2019 · 5 comments

Comments


HurricanePete commented Sep 10, 2019

Hello @Jerska, I wanted to ask about the possibility of customizing or deactivating the character limit imposed on `body_safe`. Something like passing a `characterLimit` option to `algoliasearchZendeskHC`, or setting it to `false`, in order to store the entire article body and override the default here:

```ruby
def truncate str, max = 5_000
```

It was changed here as a bug fix: https://github.com/algolia/algoliasearch-zendesk/blob/master/CHANGELOG.md#2173-2017-10-17
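
For illustration, here's a minimal sketch of what a configurable limit could look like on the crawler side; the `false`/`nil` opt-out is hypothetical, not an existing option:

```ruby
# Hypothetical variant of the crawler's truncation helper: passing
# false (or nil) as the limit would skip truncation entirely and
# keep the full article body.
def truncate(str, max = 5_000)
  return str unless max              # max = false/nil => no truncation
  str.length > max ? str[0...max] : str
end

truncate(article_body)               # default 5,000-character cap
truncate(article_body, false)        # hypothetical opt-out
```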

It does look like the crawler options come from Algolia (not the Zendesk frontend), but I'd love some extra context on why the change was made and what our options are.

We use search and InstantSearch for a small knowledge base in our application. Since we don't have many documents to index, we'd like to include the entire document body in searches; currently, a lot of each article is left out of them.

I'd be happy to open a PR for this if it makes sense and there's no other reason not to have it.

Jerska (Member) commented Sep 16, 2019

Hi @HurricanePete. Thanks for raising the issue.

The reason for this character limit is that Algolia enforces a size limit on records.
It used to be 100 KB, but it has changed over time and is now 10 KB, and we need to leave some room for the article's other attributes.
https://www.algolia.com/doc/faq/basics/is-there-a-size-limit-for-my-index-records/

Our suggestion, in case some articles don't show up for a search query because of the size limit, is to add relevant keywords to the article's tags.

The long-term solution would be #54.
The idea is to split each article into one record per paragraph instead of one record per article, and to use `distinct` at query time so only one result is returned per article. This consumes more records, but it scales to really long documents (a sketch of such splitting follows below).
As the creation date of that issue shows, this is a topic we haven't tackled in a really long time.
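
For concreteness, a minimal sketch of that per-paragraph splitting; the field names, and `article` as a hash of id, title, and body, are assumptions for illustration:

```ruby
# One record per paragraph. `article_id` is shared across the
# paragraphs of an article, so that distinct-on-article_id returns
# at most one hit per article at query time.
def paragraph_records(article)
  article[:body].split(/\n{2,}/).map.with_index do |paragraph, i|
    {
      objectID:   "#{article[:id]}_#{i}",
      article_id: article[:id],
      title:      article[:title],
      body_safe:  paragraph
    }
  end
end
```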

Our Zendesk integration is currently in maintenance-only mode, and we do not plan to add any new features (which this would be).
If you'd be interested in creating a PR for this, I'd be happy to review it, but it requires a number of non-trivial changes.

HurricanePete (Author) commented

Hello @Jerska, thank you for the reply. Our articles average about 3 KB each, so the limit shouldn't be a problem. Would you see any potential issues if we disabled the integration's indexing and uploaded the full article bodies from our end? We already (effectively) have a crawler in place; this would just handle the reindexing on the Algolia side.
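
For illustration, a rough sketch of that external indexing with the pre-v2 `algoliasearch` Ruby gem; the index name and the `articles` array are placeholders:

```ruby
require 'algoliasearch'

Algolia.init(application_id: 'YourApplicationID', api_key: 'YourAdminAPIKey')
index = Algolia::Index.new('zendesk_yoursubdomain_articles')

records = articles.map do |article|
  {
    objectID:  article['id'],
    title:     article['title'],
    body_safe: article['body']   # full body, no 5,000-character truncation
  }
end

index.save_objects(records)
```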

Jerska (Member) commented Sep 19, 2019

If you're able to do the indexing on your end, by all means feel free to. The requirement is to match the extracted JSON that our system indexes.

What I'm not sure I understand is how that would fix the issue. You'd be facing the same limit, and if some records are truncated today by our script, it means you already have articles above 5 KB. While there is some room between 5 KB and 10 KB, I think we can safely assume some of them will exceed the limit and fail to be indexed.
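
As a quick sanity check, a minimal sketch that flags records above the limit before pushing them (`records` being whatever you build on your end; 10 KB as per the FAQ above):

```ruby
require 'json'

LIMIT = 10_000  # bytes; the current per-record limit mentioned above

oversized = records.select { |r| r.to_json.bytesize > LIMIT }
puts "#{oversized.size} record(s) exceed #{LIMIT} bytes" if oversized.any?
```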

HurricanePete (Author) commented Sep 25, 2019

Ah, a bit of a mix-up there. I'm planning to split by paragraph and then use the `distinct` feature within Algolia. This seems like the best option at this point since, as I think you said, the ability to do that through the algoliasearch-zendesk integration hasn't been developed.
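
For reference, a minimal sketch of the index settings that pairing needs, again with the pre-v2 Ruby gem (`article_id` matching the hypothetical per-paragraph records sketched above):

```ruby
# Deduplicate hits so each article appears at most once per query.
index.set_settings(
  attributeForDistinct: 'article_id',
  distinct: true
)
```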

Jerska (Member) commented Sep 26, 2019

That makes sense.

You are correct that the integration doesn't support this at this point in time.
We're open to pull requests, so if you want to use our code as a base for the script and submit one, it could be integrated directly into the connector.
