Commit: update-doc
raphaelsty committed Sep 11, 2024
1 parent d2db110 commit c8d1b49
Showing 27 changed files with 389 additions and 429 deletions.
21 changes: 21 additions & 0 deletions .gitignore
@@ -8,6 +8,7 @@ __pycache__/
*.ipynb
*.duckdb
duckdb_tmp/
*.block

evaluation_datasets/

@@ -168,3 +169,23 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
arguana
benchmark_bm25s.py
benchmark_crud.py
climate-fever
fever.tmp/duckdb_temp_block-4611686018432402649.block
fever.tmp/duckdb_temp_block-4611686018432404521.block
fever.tmp/duckdb_temp_block-4611686018432404963.block
fever.tmp/duckdb_temp_storage-4.tmp
metrics.json
metrics_20K.json
metrics_bm25s.json
mmetrics_30K.json
msmarco
nfcorpus
nq
quora
scidocs
scifact
trec-covid
webis-touche2020
22 changes: 8 additions & 14 deletions Makefile
@@ -10,28 +10,22 @@ tests:
@echo "Removing test.duckdb if it exists..."
rm -rf test.duckdb
rm -rf test.duckdb.wal
pytest ducksearch/tables/create.py
pytest ducksearch/tables/insert.py
pytest ducksearch/tables/select.py
pytest ducksearch/tables/create.py --disable-warnings
pytest ducksearch/tables/insert.py --disable-warnings
pytest ducksearch/tables/select.py --disable-warnings
rm -rf test.duckdb
rm -rf test.duckdb.wal
pytest ducksearch/hf/insert.py
pytest ducksearch/hf/insert.py --disable-warnings
rm -rf test.duckdb
rm -rf test.duckdb.wal
pytest ducksearch/delete/documents.py
pytest ducksearch/evaluation/evaluation.py --disable-warnings
rm -rf test.duckdb
rm -rf test.duckdb.wal
pytest ducksearch/evaluation/evaluation.py
pytest ducksearch/search/create.py --disable-warnings
pytest ducksearch/search/select.py --disable-warnings
rm -rf test.duckdb
rm -rf test.duckdb.wal
pytest ducksearch/upload/upload.py
rm -rf test.duckdb
rm -rf test.duckdb.wal
pytest ducksearch/search/create.py
pytest ducksearch/search/select.py
rm -rf test.duckdb
rm -rf test.duckdb.wal
pytest ducksearch/search/graphs.py
pytest ducksearch/search/graphs.py --disable-warnings
rm -rf test.duckdb
rm -rf test.duckdb.wal

129 changes: 41 additions & 88 deletions README.md
@@ -13,30 +13,26 @@
</div>

<p align="justify">
DuckSearch is a lightweight and easy-to-use library that lets you index and search documents. DuckSearch is built on top of DuckDB, a high-performance analytical database. DuckDB is designed to execute analytical SQL queries fast, and DuckSearch leverages this to provide efficient and scalable search / filtering capabilities.
DuckSearch is a lightweight and easy-to-use library for searching documents. DuckSearch is built on top of DuckDB, a high-performance analytical database. DuckDB is designed to execute analytical SQL queries fast, and DuckSearch leverages this to provide efficient search and filtering features. The DuckSearch index can be updated with new documents, and documents can be deleted as well.

DuckSearch also supports HuggingFace datasets, allowing you to index datasets directly from the HuggingFace Hub.
</p>

## Installation

We can install DuckSearch using pip:
Install DuckSearch using pip:

```bash
pip install ducksearch
```

For evaluation dependencies, we can install DuckSearch with the `eval` extra:

```bash
pip install "ducksearch[eval]"
```

## Documentation

The complete documentation is available [here](https://lightonai.github.io/ducksearch/), which includes in-depth guides, examples, and API references.

### Upload

We can upload documents to DuckDB using the `upload.documents` function. The documents are stored in a DuckDB database, and the fields are indexed with BM25.
We can upload documents to DuckDB using the `upload.documents` function. The documents are stored in a DuckDB database, and the `fields` are indexed with BM25.

```python
from ducksearch import upload
@@ -79,7 +75,7 @@ upload.documents(

## Search

We can search documents using the `search.documents` function. The function returns the documents that match the query, sorted by the BM25 score. The `top_k` parameter controls the number of documents to return. We can also filter the results using SQL syntax, which is evaluated by DuckDB, so all DuckDB functions are available.
`search.documents` returns a list of lists of documents ordered by relevance. We can control the number of documents to return using the `top_k` parameter. The following example demonstrates how to search for documents with the queries "punk" and "california" while filtering the results to include only documents with a date after 1970 and a popularity score greater than 8.

```python
from ducksearch import search
@@ -117,7 +113,22 @@ search.documents(
]
```

List of DuckDB functions such as date functions can be found [here](https://duckdb.org/docs/sql/functions/date).
Filters are SQL expressions that are applied to the search results. We can use any filtering function DuckDB provides, such as [date functions](https://duckdb.org/docs/sql/functions/date).
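
For instance, here is a minimal sketch (reusing the `ducksearch.duckdb` database built in the upload example above, with its `date` and `popularity` fields) that combines a DuckDB date function with a numeric comparison:

```python
from ducksearch import search

# The filter string is passed to DuckDB as-is, so any SQL expression
# over the indexed fields is valid here.
search.documents(
    database="ducksearch.duckdb",
    queries="rock",
    top_k=10,
    filters="year(date) >= 1970 AND popularity > 8",
)
```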

## Delete and update index

We can delete documents and update the BM25 weights accordingly using the `delete.documents` function.

```python
from ducksearch import delete

delete.documents(
database="ducksearch.duckdb",
ids=[0, 1],
)
```

To update the index, we should first delete the documents and then upload the updated documents.
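
For example, a minimal sketch (reusing the `ducksearch.duckdb` database from the examples above; the updated document content is illustrative) that replaces document `0`:

```python
from ducksearch import delete, upload

# Remove the stale version of the document and its BM25 statistics.
delete.documents(
    database="ducksearch.duckdb",
    ids=[0],
)

# Re-upload the updated version so it gets indexed again.
upload.documents(
    database="ducksearch.duckdb",
    key="id",
    fields=["title", "style", "date", "popularity"],
    documents=[
        {
            "id": 0,
            "title": "Hotel California (Remastered)",
            "style": "rock",
            "date": "1977-02-22",
            "popularity": 9,
        },
    ],
    dtypes={"date": "DATE", "popularity": "INT"},
)
```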

## Extra features

@@ -152,7 +163,6 @@ search.documents(
database="fineweb.duckdb",
queries="earth science",
top_k=2,
filters="token_count > 200",
)
```

@@ -180,82 +190,25 @@ search.documents(
]
```

### Graphs

The `search.graphs` function can be used to search documents with a graph query. This function is useful if we have paired documents and queries. The search will retrieve the set of documents and queries that match the input query. Then it will build a graph and compute the weight of each document using a graph-based scoring function.

```python
from ducksearch import search, upload

documents = [
{
"id": 0,
"title": "Hotel California",
"style": "rock",
"date": "1977-02-22",
"popularity": 9,
},
{
"id": 1,
"title": "Here Comes the Sun",
"style": "rock",
"date": "1969-06-10",
"popularity": 10,
},
{
"id": 2,
"title": "Alive",
"style": "electro, punk",
"date": "2007-11-19",
"popularity": 9,
},
]

upload.documents(
database="ducksearch.duckdb",
key="id",
fields=["title", "style", "date", "popularity"],
documents=documents,
dtypes={
"date": "DATE",
"popularity": "INT",
},
)

# Mapping between document ids and queries
documents_queries = {
0: ["the beatles", "rock band"],
1: ["rock band", "california"],
2: ["daft"],
}

upload.queries(
database="ducksearch.duckdb",
documents_queries=documents_queries,
)

search.graphs(
database="ducksearch.duckdb",
queries="daft punk",
top_k=10,
)
```

```python
[
{
"id": "2",
"title": "Alive",
"style": "electro, punk",
"date": Timestamp("2007-11-19 00:00:00"),
"popularity": 9,
"score": 2.877532958984375,
}
]
```

## Lightning fast

## Benchmark


| Dataset | ndcg@10 | hits@1 | hits@10 | mrr@10 | map@10 | r-precision | qps | Indexation Time (s) | Number of Documents and Queries |
|-------------------|-----------|---------|----------|----------|---------|-------------|----------------|---------------------|--------------------------------|
| arguana | 0.3779 | 0.0 | 0.8267 | 0.2491 | 0.2528 | 0.0108 | 117.80 | 1.42 | 1,406 queries, 8.67K documents |
| climate-fever | 0.1184 | 0.1068 | 0.3648 | 0.1644 | 0.0803 | 0.0758 | 5.88 | 302.39 | 1,535 queries, 5.42M documents |
| dbpedia-entity | 0.6046 | 0.7669 | 5.6241 | 0.8311 | 0.0649 | 0.0741 | 113.20 | 181.42 | 400 queries, 4.63M documents |
| fever | 0.3861 | 0.2583 | 0.5826 | 0.3525 | 0.3329 | 0.2497 | 74.40 | 329.70 | 6,666 queries, 5.42M documents |
| fiqa | 0.2445 | 0.2207 | 0.6790 | 0.3002 | 0.1848 | 0.1594 | 545.77 | 6.04 | 648 queries, 57K documents |
| hotpotqa | 0.4487 | 0.5059 | 0.9699 | 0.5846 | 0.3642 | 0.3388 | 48.15 | 163.14 | 7,405 queries, 5.23M documents |
| msmarco | 0.8951 | 1.0 | 8.6279 | 1.0 | 0.0459 | 0.0473 | 35.11 | 202.37 | 6,980 queries, 8.84M documents |
| nfcorpus | 0.3301 | 0.4396 | 2.4087 | 0.5292 | 0.1233 | 0.1383 | 3464.66 | 0.99 | 323 queries, 3.6K documents |
| nq | 0.2451 | 0.1272 | 0.4574 | 0.2099 | 0.1934 | 0.1240 | 150.23 | 71.43 | 3,452 queries, 2.68M documents |
| quora | 0.7705 | 0.6783 | 1.1749 | 0.7606 | 0.7206 | 0.6502 | 741.13 | 3.78 | 10,000 queries, 523K documents |
| scidocs | 0.1025 | 0.1790 | 0.8240 | 0.2754 | 0.0154 | 0.0275 | 879.11 | 4.46 | 1,000 queries, 25K documents |
| scifact | 0.6908 | 0.5533 | 0.9133 | 0.6527 | 0.6416 | 0.5468 | 2153.64 | 1.22 | 300 queries, 5K documents |
| trec-covid | 0.9533 | 1.0 | 9.4800 | 1.0 | 0.0074 | 0.0077 | 112.38 | 22.15 | 50 queries, 171K documents |
| webis-touche2020 | 0.4130 | 0.5510 | 3.7347 | 0.7114 | 0.0564 | 0.0827 | 104.65 | 44.14 | 49 queries, 382K documents |

## License

31 changes: 16 additions & 15 deletions benchmark.py
@@ -1,39 +1,40 @@
import time
from nltk import download
from nltk.corpus import stopwords

from ducksearch import evaluation, search, upload

download("stopwords")

stopword = list(stopwords.words("english"))

dataset_name = "quora"

documents, queries, qrels = evaluation.load_beir(
dataset_name=dataset_name, split="test"
dataset_name=dataset_name,
split="test",
)

upload.documents(
database=dataset_name, documents=documents, key="id", fields=["title", "text"]
database=dataset_name,
documents=documents,
key="id",
fields=["title", "text"],
stopwords=stopword,
)

upload.indexes(database=dataset_name)


start = time.time()

scores = search.documents(
database=dataset_name,
queries=queries,
top_k=10,
top_k_token=10_000,
batch_size=30,
top_k_token=30_000,
batch_size=32,
)

end = time.time()

print(f"Search took {end - start:.2f} seconds, QPS: {len(queries) / (end - start):.2f}")

evaluation_scores = evaluation.evaluate(
scores=scores,
qrels=qrels,
queries=queries,
metrics=["ndcg@10", "hits@1", "hits@2", "hits@3", "hits@4", "hits@5", "hits@10"],
metrics=["ndcg@10", "hits@1", "hits@10", "mrr@10", "map@10", "r-precision"],
)

print(evaluation_scores)
11 changes: 0 additions & 11 deletions docs/api/evaluation/evaluate.md
@@ -47,16 +47,5 @@ Evaluate the performance of document retrieval using relevance judgments.
... queries=queries,
... top_k=10,
... )

>>> evaluation_scores = evaluation.evaluate(
... scores=scores,
... qrels=qrels,
... queries=queries,
... metrics=["ndcg@10", "hits@1", "hits@2", "hits@3", "hits@4", "hits@5", "hits@10"],
... )

>>> assert evaluation_scores["ndcg@10"] > 0.68
>>> assert evaluation_scores["hits@1"] > 0.54
>>> assert evaluation_scores["hits@10"] > 0.90
```

21 changes: 13 additions & 8 deletions docs/api/search/documents.md
@@ -14,15 +14,15 @@ Search for documents in the documents table using specified queries.

A string or list of query strings to search for.

- **batch_size** (*int*) – defaults to `30`
- **batch_size** (*int*) – defaults to `32`

The batch size for query processing.

- **top_k** (*int*) – defaults to `10`

The number of top documents to retrieve for each query.

- **top_k_token** (*int*) – defaults to `10000`
- **top_k_token** (*int*) – defaults to `30000`

The number of documents to score per token.

@@ -38,17 +38,22 @@ Search for documents in the documents table using specified queries.

Optional SQL filters to apply during the search.

- **kwargs**



## Examples

```python
>>> from ducksearch import evaluation, upload, search
>>> documents, queries, qrels = evaluation.load_beir("scifact", split="test")
>>> scores = search.documents(database="test.duckdb", queries=queries, top_k_token=1000)
>>> evaluation_scores = evaluation.evaluate(scores=scores, qrels=qrels, queries=queries)
>>> assert evaluation_scores["ndcg@10"] > 0.68

>>> documents, queries, qrels = evaluation.load_beir(
... "scifact",
... split="test",
... )

>>> scores = search.documents(
... database="test.duckdb",
... queries=queries,
... top_k_token=1000,
... )
```

13 changes: 1 addition & 12 deletions docs/api/search/graphs.md
@@ -22,7 +22,7 @@ Search for graphs in DuckDB using the provided queries.

The number of top documents to retrieve for each query.

- **top_k_token** (*int*) – defaults to `10000`
- **top_k_token** (*int*) – defaults to `30000`

The number of top tokens to retrieve.

@@ -65,16 +65,5 @@ Search for graphs in DuckDB using the provided queries.
... queries=queries,
... top_k=10,
... )

>>> assert len(scores) > 0

>>> evaluation_scores = evaluation.evaluate(
... scores=scores,
... qrels=qrels,
... queries=queries,
... metrics=["ndcg@10", "hits@1", "hits@10"]
... )

>>> assert evaluation_scores["ndcg@10"] > 0.74
```
