All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
- Update pyarrow
- Better Windows support
- Improve the temporary small-indices path by randomizing it
- Update dependencies
- Support Python 3.11 (#167) (#159)
- Fix issue where creating forced memory-mapped indices with faiss-cpu>1.7.2 was failing (#164)
- Support Python 3.10 in CI (#165)
- Loosen pyarrow version constraints
- Methods create_small_index and create_big_index now accept a list of input embedding directories and a final output directory used to store indexes and metrics
- build_partitioned_indexes now accepts an optional path to a (pre-trained) index. If provided, this index will be used to build all partitioned indexes
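For illustration, a call might look like the following sketch; the keyword names `partitions`, `output_root_dir`, and `index_path` are assumptions based on this entry, not a confirmed signature.

```python
from autofaiss import build_partitioned_indexes

# Hypothetical keyword names based on this entry: a list of partition
# directories, an output root, and an optional pre-trained index that is
# reused for every partition instead of training one per partition.
build_partitioned_indexes(
    partitions=["embeddings/part_a", "embeddings/part_b"],
    output_root_dir="indexes",
    index_path="pretrained/knn.index",  # optional pre-trained index
)
```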
- The dataclasses backport is now pulled in only for Python < 3.7, since dataclasses is part of the standard library from Python 3.7 onward
- Fix in distributed.py: the number of indices to keep after merging small indexes cannot exceed the number of small indexes
- Fix the computation in distributed.py of the number of batches to split vectors into before adding them to the index
- `autofaiss build_partitioned_indexes` accepts an index key that, if defined, will be used to create all indexes
- Autofaiss supports the creation of multiple indices from a partitioning column using the command `autofaiss build_partitioned_indexes`
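A minimal sketch of the same entry point with an explicit index key; `index_key` follows the faiss index_factory syntax, and the other keyword names are the same assumptions as above.

```python
from autofaiss import build_partitioned_indexes

# When index_key is defined, it is used to create every partitioned index.
build_partitioned_indexes(
    partitions=["embeddings/part_a", "embeddings/part_b"],
    output_root_dir="indexes",
    index_key="OPQ16_64,IVF4096,PQ16x8",  # faiss index_factory string
)
```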
- Changed the training logic to use smaller functions.
- Fix the estimated number of batches
Fix "Fix the number of output index files
- Fix the number of output index files
- Add the possibility to tune the index to return at least k nearest neighbors
- Do not save the dataframe index for ids
- Fix get_index_size by using NamedTemporaryFile
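For context, the NamedTemporaryFile approach looks roughly like this sketch (not necessarily the exact autofaiss implementation):

```python
import os
import tempfile

import faiss

def get_index_size(index: faiss.Index) -> int:
    # Serialize the index to a named temporary file and measure it on
    # disk; the file is removed when the context exits.
    with tempfile.NamedTemporaryFile(suffix=".index") as tmp:
        faiss.write_index(index, tmp.name)
        return os.path.getsize(tmp.name)
```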
- Use index.add_with_ids to have consecutive ids in N indices mode (#106)
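A minimal sketch of the idea, assuming flat id-mapped shards: each shard adds its vectors with an explicit, globally increasing id range so that ids stay consecutive across the N indices.

```python
import faiss
import numpy as np

dim, shard_sizes = 64, [1000, 1000]
offset = 0
for n in shard_sizes:
    xb = np.random.rand(n, dim).astype("float32")
    shard = faiss.IndexIDMap(faiss.IndexFlatIP(dim))  # flat index + id map
    ids = np.arange(offset, offset + n, dtype="int64")
    shard.add_with_ids(xb, ids)  # explicit ids instead of per-index 0..n-1
    offset += n
```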
- Check if the folder exists before removing it
- Move optimization into executors in distributed mode
- Fix the number of batches when merging N indices
- Add a guide for distributed autofaiss
- Replace the embedding iterator with the embedding reader
- Produce fewer indices in distributed mode
- Make max_nb_threads in #94 less than or equal to the number of CPU cores
- Read indices in small batches, one after the other, to avoid out-of-disk errors
- Fix the indices naming for N indices in the README
- Save temporary indices directly to the specified temporary files instead of doing a copy first
- Fix the memory estimation for adding and apply a quick fix for training
- Fix the docstring of read_total_nb_vectors_and_dim to match its return value
- Improve the estimation of memory available for adding
- Improve the training memory estimation
- Option to produce N indices in distributed mode
- Fix _yield_embeddings_batch to avoid the case where slice_start equals slice_end
- Fix the order of indices when merging
- Fix pex publishing
- Pex building for Python 3.6 and 3.8
- Add pex building
- Better dependency ranges
- Fix/Complete some documents
- Disable IVF, Flat index_key for large numbers of vectors on CPU
- Fix "Filter empty files"
- Empty ids path and temporary small indices folder at the beginning
- Use a central logger instead of print functions
- Add a verbosity flag to control the log level
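For callers, the level can also be adjusted through Python's standard logging module; the logger name used here is an assumption.

```python
import logging

# Assumes autofaiss registers its central logger under the package name;
# this silences everything below WARNING.
logging.getLogger("autofaiss").setLevel(logging.WARNING)
```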
- Filter empty files
- Implement two-stage merging in the distributed module
- Make paths absolute so that using fsspec is safer
- Fix memory estimation for inverted list
- Add support for multiple embeddings folders
- Optional distributed indexing support using pyspark
- Add support for memory-mapped indices generation
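On the consumption side, a memory-mapped index can be opened with faiss's mmap flag so data is read from disk on demand; the path is illustrative.

```python
import faiss

# IO_FLAG_MMAP maps the index file instead of loading it fully into RAM.
index = faiss.read_index("knn.index", faiss.IO_FLAG_MMAP)
```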
- Add make_direct_map to the arguments of index metadata when estimating index size
- Add make_direct_map to memory estimation
- Clean ids_batch and ids_batch_df between batches
- Add make_direct_map option in build_index
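A minimal call, assuming the keyword mirrors the option name in this entry; paths are illustrative.

```python
from autofaiss import build_index

# make_direct_map lets an IVF index reconstruct vectors by id, which some
# re-ranking and memory-mapped use cases require.
build_index(
    embeddings="embeddings",
    index_path="knn.index",
    index_infos_path="index_infos.json",
    make_direct_map=True,
)
```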
- Fix shape reading for the numpy format
- Add support for Vector Id columns
- Use fsspec to write the index; make tune index and score index optional
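One common pattern for the fsspec part, sketched here rather than taken from autofaiss's code; it works the same for local paths, s3://, or hdfs:// destinations.

```python
import faiss
import fsspec

def save_index(index: faiss.Index, path: str) -> None:
    # serialize_index returns a uint8 numpy array; fsspec resolves the
    # destination filesystem from the path scheme.
    with fsspec.open(path, "wb") as f:
        f.write(faiss.serialize_index(index).tobytes())
```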
- Decrease memory usage by using a lazy ndarray reader
- Add in-memory autofaiss support
- Improve the API by removing the Quantizer class
- Rename quantize to build_index
- Make index_path explicit
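After the rename, basic usage follows the README pattern; paths and memory limits here are illustrative.

```python
from autofaiss import build_index

build_index(
    embeddings="embeddings",  # folder of embedding files
    index_path="my_index_folder/knn.index",
    index_infos_path="my_index_folder/index_infos.json",
    max_index_memory_usage="4G",
    current_memory_available="4G",
)
```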
- Add missing fsspec dep in setup.py
- Make index creation agnostic of filesystem using fsspec
- Check if the index needs to be trained
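The guard boils down to faiss's is_trained attribute, as in this sketch:

```python
import faiss
import numpy as np

dim = 64
index = faiss.index_factory(dim, "IVF256,Flat")
xt = np.random.rand(20_000, dim).astype("float32")
if not index.is_trained:  # flat indexes skip this; IVF ones train once
    index.train(xt)
```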
- Improve the estimation of the memory needed for training
- Add pq 256 to the list used to create large indices
- Add pq 128 to the list used to create large indices
- Improve the memory usage of the score command
- Reserve more space for training
- Fix small typo: train -> build
Improve memory capping and core estimation
- Use multiprocessing.cpu_count() to find the correct number of cores
- Better estimate memory usage and use it to adapt the training and adding batch sizes
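The batch-size adaptation reduces to arithmetic of this kind (an illustrative simplification, not the actual estimator):

```python
def adding_batch_size(available_bytes: int, dim: int) -> int:
    # float32 embeddings cost 4 * dim bytes per vector; keep at least
    # one vector per batch.
    return max(1, available_bytes // (4 * dim))

print(adding_batch_size(1024**3, 256))  # 1 GiB, dim 256 -> 1048576
```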
- Add an explicit example in the score index doc and in the README
- Fix the batch size of scoring
Make the local numpy iterator memory efficient
Speed improvements:
- When estimating the shape of numpy files, use memory mapping; this reduces estimation time from hours to seconds
- Do not keep the training vectors in memory after training, reducing memory requirements by a factor of 2
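The memory-mapping trick is plain numpy: opening with mmap_mode reads only the file header, so the shape is available without loading the data (path illustrative).

```python
import numpy as np

shape = np.load("embeddings/emb_0.npy", mmap_mode="r").shape
```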
- Fix the score function for embeddings loaded from files
- Convert embeddings to float32 at loading time if needed
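Since faiss expects float32, the load-time conversion amounts to a sketch like:

```python
import numpy as np

def to_float32(emb: np.ndarray) -> np.ndarray:
    # No-op for float32 inputs; converts e.g. float16 or float64 files.
    return emb if emb.dtype == np.float32 else emb.astype(np.float32)
```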
- Indices descriptions function
- Indices size estimator
- Enhance indices selection function (switch to get_optimal_index_keys_v2 + improvements)
- Update slider notebook
- Add doc notebooks
- Create the output folder if missing to avoid error
- Use 128 instead of 101 (mostly equivalent) in index parameter selection for IVF
- Add embedding_column_name to download
- Fix import in download.py
- First release
- Initial commit