Extended Cluster-Pruning

The extended cluster-pruning algorithm is implemented in C++ and provides a python wrapper for usage within the ANN-benchmarking tool to compare with other state of the art ANN algorithms.

Initially proposed as a bachelors project at the IT-University of Copenhagen in spring 2020. This repository continues the work by building a dynamic extension of the algorithm ontop of the original implementation.

This project is currently a work in progress. See CONTRIBUTING.md on how to contribute.

Project dependencies

gcc, c++, cmake, python3.6, swig, hdf5 (libhdf5-dev), docker

How-to's

The following descriptions assumes some sort of Linux distribution even if the eCP source code will compile on e.g. Windows. It is also assumed that ./ is the root of this repository.

How to compile eCP

Install dependencies outlined above.
Afterwards, to configure and compile project load the CMakeLists.txt file in the root dir using your favorite IDE, OR go to ./eCP/scripts and execute:

configure.sh
build.sh

This will build the project in Release mode. To change this see configure.sh script. Currently only GNU/GCC is known to be supported.

How to run eCP in ANN-Benchmarks

Running it manually

Go to ./ann_benchmarks/install/Dockerfile.ecp and change the line describing what repository is cloned when the eCP docker container is build by Ann-Benchmarks (AB). Make it point to the current working branch e.g. git clone -b <branch_name> <url_to_repo> eCP, where 'eCP will be local name of repo dir to help build scripts.
Push your work to the designated branch.
Execute the script ./scipts/ecp_install.sh which will pull down AB repository and copy the needed files into the newly cloned AB repository. Here the Python requirements will be installed and finally an AB install script will be called to setup the eCP algorithm with AB, i.e. build a Docker image containing a compiled executable of the code from the designated branch from earlier.
Goto AB directory and follow the guide to run tests found at https://github.com/erikbern/ann-benchmarks.

Using the provided run script

From within the eCP/scripts folder run the ann-benchmark.sh script, it takes arbitrarily many arguments in a array-style format depending on the amount of experiments the user wants to run. An example configuration could be sudo ./ann-benchmark.sh [abc123,foo,random-xs-20-euclidean,false] [def456,bar,random-xs-20-euclidean,true] corresponding to running the code at commit id abc123 on the random-xs-20-euclidean dataset without early halting and naming it foo whereafter commit id def456 would be run on the same dataset with early halting enabled and under the name bar. Observe that there cannot be any spaces after each comma.

It supports the following flags:

-d (--delete) delete the current local ann-benchmarks repository if it exists
-l (--local) run ann-benchmarks in local mode (with local source code) (https://github.com/erikbern/ann-benchmarks)

Documentation

Wrapper interface

SWIG is used to construct the Python-based wrapper around the C++ code. SWIG is handled by CMake which will recompile the wrapper every time the code base is recompiled. This will also happen inside the docker container, when the code is send to the Ann-Benchmarks framework for testing.

The wrapper interface is described in ./eCP/swig/eCP.i and exposes two functions from the C++ source code as outlined below. The interface file is used to describe exactly what part of the C++ code is exposed through the Python API:

eCP_Index(S, L, m)

Builds an index with the provided dataset. Accepts three arguments:

A dataset (nested list of data points)
Integer determining how many levels the index should have
Metric for comparing distance (1 - Angular distance, 0 - Euclidean distance)

query(I, q, k, b)

Queries the given index. Accepts four arguments:

Index to be queried
Query point
Amount of nearest neighbors to return
Amount of clusters to search

Python code example of using the wrapper

import eCP_wrapper as e
index = e.eCP_Index([[41, 67, 34, 0, 69, 24],[78, 58, 62, 64, 5, 45],[81, 27, 61, 91, 95, 42],[27, 36, 91, 4, 2, 53],[92, 82, 21, 16, 18, 95],[47, 26, 71, 38, 69, 12],[67, 99, 35, 94, 3, 11],[22, 33, 73, 64, 41, 11], [53, 68, 47, 44, 62, 57], [37, 59, 23, 41, 29, 78], [16, 35, 90, 42, 88, 6], [40, 42, 64, 48, 46, 5], [90, 29, 70, 50, 6, 1], [93, 48, 29, 23, 84, 54], [56, 40, 66, 76, 31, 8]], 2, 1)
result = e.query(index, [5,5,5], 5, 1)

VTUNE

Flags for running the main.cpp in VTune and Code Editor

-f HDF5 file (Sometimes the HDF5 is locked, use HDF5_USE_FILE_LOCKIN false as an environment variable)
-k number of nearest numbers for KNN search
-b clusters to search through
-m distance metric
-l levels of the index
-r query repetitions

Datasets

A little bit of information about the datasets that have been used for testing.

Dataset	Dimensions	Train size	Test size	Neighbors	Distance
random-xs-20-euclidean	20	9000	1000	10	Euclidean
random-xs-20-angular	20	9000	1000	10	Angular
random-s-100-euclidean	100	9000	1000	10	Euclidean
MNIST	784	60,000	10,000	100	Euclidean
Fashion-MNIST	784	60,000	10,000	100	Euclidean
SIFT	128	1,000,000	10,000	100	Euclidean
GloVe	25	1,183,514	10,000	100	Angular

Name		Name	Last commit message	Last commit date
Latest commit History 192 Commits
benchmarks-files		benchmarks-files
eCP		eCP
papers		papers
scripts		scripts
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Extended Cluster-Pruning

Project dependencies

How-to's

How to compile eCP

How to run eCP in ANN-Benchmarks

Running it manually

Using the provided run script

Documentation

Wrapper interface

eCP_Index(S, L, m)

query(I, q, k, b)

Python code example of using the wrapper

VTUNE

Datasets

About

Releases 3

Packages

Languages

fremartini/eCP

Folders and files

Latest commit

History

Repository files navigation

Extended Cluster-Pruning

Project dependencies

How-to's

How to compile eCP

How to run eCP in ANN-Benchmarks

Running it manually

Using the provided run script

Documentation

Wrapper interface

eCP_Index(S, L, m)

query(I, q, k, b)

Python code example of using the wrapper

VTUNE

Datasets

About

Resources

Stars

Watchers

Forks

Releases 3

Packages 0

Languages

Packages