
Performance indexing and searching #559

Open
Ruitjes opened this issue Sep 29, 2021 · 6 comments

Comments

@Ruitjes

Ruitjes commented Sep 29, 2021

Hello,

I made two proofs of concept, one with search-index and one with itemsjs, in Next.js. I compared the indexing time and the search time of the two libraries, using a dataset of 28k movies containing 4 fields.

Results:

  • Itemsjs indexing time = ~700 msec
  • Itemsjs search time = ~300 msec
  • Search-Index indexing time = ~90000 msec (1.5 minutes)
  • Search-Index search time = ~2000 msec

Even with only 500 rows, search-index had a higher indexing time (~1200 msec) than itemsjs with 28k rows.

Indexing code search-index:

// `si` is imported from 'search-index'; `movies` and `setInitTime`
// come from the surrounding component.
async function initSearchIndex() {
  // initialize an index
  const searchIndex = await si();

  // Without this the index will keep filling after every refresh
  await searchIndex.FLUSH();

  const t0 = window.performance.now();
  // add documents to the index
  await searchIndex.PUT(movies);
  const t1 = window.performance.now();
  console.log('Indexing took', t1 - t0, 'msec');
  setInitTime(t1 - t0);

  // Approximate size of the raw documents (stringify first so the Blob
  // measures the JSON bytes rather than "[object Object]" strings)
  const numBytes = new Blob([JSON.stringify(movies)]).size;
  console.log(numBytes, 'bytes (on disk)');

  return searchIndex;
}

Is this common behaviour for search-index?

Thanks in advance.

@fergiemcdowall
Owner

Hi!

Probably the main difference is that search-index inserts into a persistent database, so you can index, stop the program, start the program again, and your index is still there. I think that itemsjs is an "in memory" index, so you have to reindex every time you start up.
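
Roughly like this (a simplified sketch, not code from your project; 'movies' is just an example index name and 'godfather' an example search term):

// first run: build the index once (persisted via the default backend)
const idx = await si({ name: 'movies' })
await idx.PUT(movies)

// a later run: open the same named store; the documents are still
// searchable without re-indexing
const idxAgain = await si({ name: 'movies' })
const results = await idxAgain.QUERY({ SEARCH: ['godfather'] })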

@fergiemcdowall
Owner

BTW- I don't understand the 2000 ms for a search. That seems very high- what are you searching for?

@Ruitjes
Author

Ruitjes commented Sep 30, 2021

Hi Fergie,

Thanks for your fast response!
That indeed seemed to be the cause of the longer indexing time. I have changed it to memdown and indexing is now much faster: ~5 seconds for 28k rows!
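
For reference, the change amounts to something like this (a rough sketch; I am assuming the version in use accepts a leveldown-compatible store such as memdown through the db option, so check the README for the exact form):

import si from 'search-index'
import memdown from 'memdown'

async function initSearchIndex() {
  // in-memory backend: nothing is persisted to IndexedDB/LevelDB,
  // so indexing is fast but the index is lost on every refresh
  const searchIndex = await si({ db: memdown })
  await searchIndex.PUT(movies)
  return searchIndex
}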

The searching still takes quite some time though. I might have it configured incorrectly?

Code that runs on each change:

useEffect(() => {
  if (!searchIndex) {
    return;
  }

  async function search() {
    const t2 = window.performance.now();

    // text search on query.search, restricted to documents whose
    // 'year' field is >= the slider value
    const q = {
      SEARCH: [query.search, {
        FIELD: ['year'],
        VALUE: {
          GTE: query.sliderGTE.toString(),
        },
      }],
    };

    const results = await searchIndex.QUERY(q, {
      DOCUMENTS: true,
      SORT: {
        TYPE: 'NUMERIC',              // can be 'NUMERIC' or 'ALPHABETIC'
        DIRECTION: query.sort,        // can be 'ASCENDING' or 'DESCENDING'
        FIELD: '_match.year'          // field to sort on
      },
      PAGE: {
        NUMBER: 0,
        SIZE: 25
      }
    });

    const t3 = window.performance.now();
    console.log('Searching took', t3 - t2, 'msec');
    setSearchTime(t3 - t2);

    console.log('searchIndex', searchIndex);
    setDocuments(results);
    setFirstSearch(true);
  }

  search();
}, [searchIndex, query.search, query.sort, query.sliderGTE, query.sliderLTE]);

Sandbox as demonstration: https://codesandbox.io/s/adoring-wozniak-ujip3

@leeoniya

leeoniya commented Oct 1, 2023

@fergiemcdowall

unfortunately, i can confirm that this library is extremely slow to index, to the point that i cannot include it in my uFuzzy benchmark :(

when trying to add 161k documents each containing a single string from testdata.json, i get a callstack explosion:

[screenshot of the call stack overflow error]

reducing the document count by 10x to 16k brings the index build time to 33266ms. by comparison, the next-slowest library in the benchmark takes 5620ms to index the full 161k dataset. fast fulltext libs take ~500ms to index the entire dataset.

@fergiemcdowall
Owner

fergiemcdowall commented Oct 3, 2023

Hehe- I can see that you are perhaps not completely impartial @leeoniya , but thanks for the test case 👍🙂

Indexing speed is a secondary consideration since search-index allows persistence and import. This means search-index can be switched off and on, and the data will still be there. It also means that implementers can pregenerate indices and then quickly import them.
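
For example, roughly along these lines (a sketch only; the file name is made up, and the exact EXPORT/IMPORT usage is described in the README):

// at build time (Node.js, with fs and si imported): index once, dump the index
const builder = await si({ name: 'movies' })
await builder.PUT(movies)
fs.writeFileSync('movies-index.json', JSON.stringify(await builder.EXPORT()))

// at runtime: import the pregenerated index instead of re-indexing
const idx = await si({ name: 'movies' })
await idx.IMPORT(JSON.parse(fs.readFileSync('movies-index.json', 'utf8')))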

That said- it seems that you have managed to break search-index in a way that shouldn't be possible- I will take a look at your test data when I have time and either introduce a fix, or suggest a set of recommended options that might alleviate the performance issues.

@leeoniya

leeoniya commented Oct 3, 2023

Hehe- I can see that you are perhaps not completely impartial @leeoniya , but thanks for the test case 👍🙂

not impartial, but trying hard to stay objective. it's basically impossible to do a broad apples-to-apples comparison of so many libs. search-index appears to be an outlier among outliers, though.

Indexing speed is a secondary consideration

strange take, i gotta say. all search libs i've encountered take at least some care to ensure their indexing performance isn't accidentally quadratic.

since search-index allows persistence and import.

i think there are several libs that offer serializable indices. Pagefind being one, MiniSearch another.

I will take a look at your test data when I have time and either introduce a fix, or suggest a set of recommended options that might alleviate the performance issues.

👍
