RAG Apps using ColBERT DPR on wikipedia datasets

ColBERT Dense Passage Retrieval (DPR) with ready-to-go pre-vectorised wikipedia datasets for RAG applications.

Datasets contain both ColBERT and all-MiniLM-L6-v2 embeddings, computed over 1024-character chunks of every article in the wikipedia set. Datasets from different wikipedia sites and languages can be combined and searched in a single table and index.

These datasets are intended for

  • RAG Apps using ColBERT and/or all-MiniLM-L6-v2 embeddings for DPR on wikipedia data,
  • comparing and benchmarking ColBERT vs all-MiniLM-L6-v2 performance and relevancy,
  • production use.

ColBERT demonstrates improved performance and accuracy for RAG applications while using a smaller model. Although it creates more embeddings (and therefore a larger ANN vector index) and runs multiple searches per request, the improved relevancy and lower system resource cost make ColBERT the more attractive solution.
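For context, ColBERT stores one small embedding per token rather than a single vector per chunk, and ranks a chunk by "MaxSim": each query-token embedding is matched with its most similar chunk-token embedding and those maxima are summed. The toy numpy sketch below only illustrates the scoring; the dimensions are made up, and the real embeddings come from the colbertv2.0 checkpoint.

import numpy as np

# toy per-token embeddings (real ones come from the colbertv2.0 checkpoint)
query_tokens = np.random.randn(32, 128)    # 32 query tokens, 128 dims each
chunk_tokens = np.random.randn(180, 128)   # 180 chunk tokens, 128 dims each

# normalise so a dot product behaves like cosine similarity
query_tokens /= np.linalg.norm(query_tokens, axis=1, keepdims=True)
chunk_tokens /= np.linalg.norm(chunk_tokens, axis=1, keepdims=True)

# MaxSim: best-matching chunk token per query token, summed over the query
score = (query_tokens @ chunk_tokens.T).max(axis=1).sum()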

The datasets can be found here: https://s.apache.org/vectorized-wiki-sstables

Setup

  • Python setup
cd colbert-wikipedia-data
virtualenv -p python3.11 .venv
source .venv/bin/activate
pip install -r requirements.txt
  • Database setup

You can skip this step if you already have Apache Cassandra >=5.0-beta2 running.

# 5.0-beta2 not yet released. Use the latest 5.0 nightly build for now.
#wget https://www.apache.org/dyn/closer.lua/cassandra/5.0-beta2/apache-cassandra-5.0-beta2-bin.tar.gz

# nightly builds only last for two weeks; update the build number "219" to find the latest
wget https://nightlies.apache.org/cassandra/Cassandra-5.0/219/artifacts/jdk11/amd64/apache-cassandra-5.0-beta2-SNAPSHOT-bin.tar.gz

tar -xzf  apache-cassandra-5.0*-bin.tar.gz
rm apache-cassandra-5.0*-bin.tar.gz
cp apache-cassandra-5.0*/conf/cassandra_latest.yaml apache-cassandra-5.0*/conf/cassandra.yaml
export PATH="$(echo $(pwd)/apache-cassandra-5.0*)/bin/:$PATH"
export CASSANDRA_DATA="$(echo $(pwd)/apache-cassandra-5.0*)/data"
cassandra -f

All following steps assume C* is listening on localhost:9042
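To sanity-check that the node is reachable before loading data, a quick connectivity test with the Python cassandra-driver (install it separately if it is not already in requirements.txt) looks like this:

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"], port=9042)
session = cluster.connect()
print(session.execute("SELECT release_version FROM system.local").one())
cluster.shutdown()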

 

  • Load the schema and the prepared simple-english wikipedia dataset
cqlsh -f schema.cql

# Download (from a browser) https://s.apache.org/simplewiki-sstable-tar
# these files are very big, tens/hundreds of GBs

# move the downloaded file to the current directory, renaming it to simplewiki-sstable.tar
# for example:
mv ~/Downloads/simplewiki-20240304-sstable.tar simplewiki-sstable.tar

# note: if you already have data in this table, check that the tarball's files won't clobber any existing sstable files
tar -xf simplewiki-sstable.tar -C ${CASSANDRA_DATA}/data/wikidata/articles-*/

# an alternative to the import below is to simply restart the node (any failed indexes will be rebuilt automatically)
nodetool import wikidata articles ${CASSANDRA_DATA}/data/wikidata/articles-*/

The data model is a single table, wikidata.articles. Separate ANN SAI indexes exist for the all-MiniLM-L6-v2 and ColBERT embeddings.
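For illustration, a similarity search against the all-MiniLM-L6-v2 index can be issued straight from Python with the cassandra-driver. This is only a sketch: the title and chunk columns below are placeholder names, so check schema.cql for the columns the table actually defines.

from cassandra.cluster import Cluster
from sentence_transformers import SentenceTransformer

# embed the question with the same model the dataset was built with
model = SentenceTransformer("all-MiniLM-L6-v2")
query_vector = model.encode("Who wrote The Hobbit?").tolist()

session = Cluster(["127.0.0.1"], port=9042).connect("wikidata")

# ANN search served by the all_minilm_l6_v2_ann SAI index
# ("title" and "chunk" are placeholder column names -- see schema.cql)
stmt = session.prepare(
    "SELECT title, chunk FROM articles "
    "ORDER BY all_minilm_l6_v2_embedding ANN OF ? LIMIT 5"
)
for row in session.execute(stmt, [query_vector]):
    print(row.title)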

Serve ColBERT and all-MiniLM-L6-v2 DPR

wget https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/colbertv2.0.tar.gz
mkdir checkpoints
tar -xvf colbertv2.0.tar.gz -C checkpoints/
  • Run command line
python serve.py
  • Run webserver
python serve_httpy.py

# open http://localhost:5000

The Retriever is simple code over this data model, fetching similarity search results from both the ColBERT and all-MiniLM-L6-v2 embeddings.
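The ColBERT half of that retrieval typically runs in two stages: one ANN query per query-token embedding against the colbert_ann index to gather candidate chunks, then MaxSim scoring of those candidates as sketched earlier. The outline below is a rough sketch, not the actual serve.py code; chunk_id is a placeholder column name.

from cassandra.cluster import Cluster
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

# encode the question into one embedding per query token
ckpt = Checkpoint("checkpoints/colbertv2.0", colbert_config=ColBERTConfig())
query_embeddings = ckpt.queryFromText(["Who wrote The Hobbit?"])[0]

session = Cluster(["127.0.0.1"], port=9042).connect("wikidata")
# chunk_id is a placeholder column name -- see schema.cql for the real layout
ann_stmt = session.prepare(
    "SELECT chunk_id FROM articles "
    "ORDER BY bert_embedding ANN OF ? LIMIT 20"
)

# one ANN probe per query token; the union of hits becomes the candidate set
candidates = set()
for token_embedding in query_embeddings.tolist():
    for row in session.execute(ann_stmt, [token_embedding]):
        candidates.add(row.chunk_id)
# candidates would then be re-scored with MaxSim and the best chunks passed to the LLM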

FAQ

I only want the ColBERT embeddings

Just drop the all_minilm_l6_v2_ann index and then the all_minilm_l6_v2_embedding column.

cqlsh

DROP INDEX wikidata.all_minilm_l6_v2_ann ;
ALTER TABLE wikidata.articles DROP all_minilm_l6_v2_embedding ;

If you only want the all-MiniLM-L6-v2 embeddings, the procedure is the same but for the colbert_ann index and the bert_embedding column. Note this will leave all the bert_embedding_no rows behind, but they will be empty.

Manual extraction of wikipedia datasets

If you want to extract the wikipedia data yourself (instead of downloading the ready-prepared sstable data above):

cqlsh -e 'DROP INDEX wikidata.all_minilm_l6_v2_ann ; DROP INDEX wikidata.colbert_ann ;'
nodetool disableautocompaction

python extract-wikidump.py -q simplewiki-20240304-cirrussearch-content.json

nodetool compact
# to watch progress (ctrl-c when complete)
watch nodetool compactionstats

cqlsh -e "CREATE CUSTOM INDEX all_minilm_l6_v2_ann ON articles(all_minilm_l6_v2_embedding) USING 'StorageAttachedIndex' WITH OPTIONS = { 'similarity_function': 'COSINE' };"
cqlsh -e "CREATE CUSTOM INDEX colbert_ann ON articles(bert_embedding) USING 'StorageAttachedIndex' WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };"

# to watch progress (ctrl-c when complete)
watch nodetool compactionstats

Extraction (extract-wikidump.py) works with wikipedia cirrus dumps found at https://dumps.wikimedia.org/other/cirrussearch/

The extraction defaults to 1024-character chunks with a 256-character overlap, using langchain's RecursiveCharacterTextSplitter.
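Those defaults correspond roughly to the snippet below (illustrative only; extract-wikidump.py is the source of truth for the exact parameters):

from langchain.text_splitter import RecursiveCharacterTextSplitter

article_text = "..."  # plain text of one article from the cirrus dump

splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=256)
chunks = splitter.split_text(article_text)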
