📚 Word shingling for near duplicate document detection

unhammer/wshiml

Wshiml

Implementation of http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html
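For readers unfamiliar with the technique, here is a minimal sketch of word shingling and Jaccard similarity as described in the linked chapter. It is written in Python purely for illustration and is not this library's API; the function names `shingles` and `jaccard` and the default `k=4` are assumptions made for the example.

```python
def shingles(text, k=4):
    """Return the set of k-word shingles of `text` (hypothetical helper)."""
    words = text.lower().split()
    # Slide a window of k consecutive words over the document.
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two shingle sets."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


doc1 = "a rose is a rose is a rose"
doc2 = "a rose is a flower which is a rose"

# doc1 has 3 distinct 4-shingles, doc2 has 6, and they share exactly one.
print(len(shingles(doc1)))                      # 3
print(jaccard(shingles(doc1), shingles(doc2)))  # 0.125
```

Two documents are then treated as near duplicates when their Jaccard coefficient exceeds a chosen threshold.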

Building requires oasis. Run:

./configure    # optionally with --prefix
make
make install

To build the example command-line program, run:

./configure --enable-cli
make
make install
find-similar-docs --help

The command-line program requires cmdliner. The rest of the software has no dependencies apart from Oasis for building from git.

On Debian/Ubuntu, you can install all build dependencies with

sudo apt install oasis libcmdliner-ocaml-dev

So far the code is fairly unoptimised beyond the techniques described in http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html. It takes 7 s (4 s with super-shingling) to cluster 1100 documents totalling 766,937 words on an old 2.8 GHz AMD machine.
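The main optimisation in the linked chapter is to compare short min-hash sketches instead of full shingle sets. The following Python sketch illustrates that idea only; it is not how this library is implemented. The names `word_shingles`, `make_hash_params`, `sketch` and `estimate_jaccard`, the affine hash family, the use of CRC32 as shingle identifier, and the sketch size `n=200` are all assumptions made for the example.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # Mersenne prime, larger than any 32-bit CRC value


def word_shingles(text, k=4):
    """Set of k-word shingles, each joined into one string."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def make_hash_params(n=200, seed=0):
    """n random (a, b) pairs; each defines a hash h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]


def sketch(shingle_set, params):
    """Min-hash sketch: the minimum hashed shingle id under each hash function.

    Assumes shingle_set is non-empty.
    """
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def estimate_jaccard(sketch1, sketch2):
    """Fraction of positions where two sketches agree; estimates the Jaccard coefficient."""
    matches = sum(x == y for x, y in zip(sketch1, sketch2))
    return matches / len(sketch1)


params = make_hash_params()
s1 = sketch(word_shingles("the quick brown fox jumps over the lazy dog"), params)
s2 = sketch(word_shingles("the quick brown fox jumps over the sleepy cat"), params)
print(estimate_jaccard(s1, s2))  # an estimate between 0.0 and 1.0
```

Every document must be sketched with the same `params`, otherwise the sketch positions are not comparable.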

API documentation

is here.