Add vector join table macros #25

Merged
merged 3 commits into from
Aug 12, 2024
Maxxen (Member) commented Aug 12, 2024

Now that duckdb/duckdb#12834 and duckdb/duckdb#13342 have landed in core DuckDB, we can make vector search a bit easier.
This PR adds two table functions, implemented as table macros, to enable easy fuzzy joining of two vector datasets:

  • vss_join(left_table, right_table, left_col, right_col, k, metric := 'l2sq')

    Joins left_table and right_table by matching up to k vectors in right_col for each vector in left_col, using the distance metric named by the metric argument (default 'l2sq').

  • vss_match(right_table, left_col, right_col, k, metric := 'l2sq')

    Returns up to k rows of right_table, matched on right_col, for each input vector in left_col. Because this function takes no left_table, left_col can be any scalar expression, so the function can be used either in a lateral join or with a constant vector.
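To make the "up to k matches per left row" semantics concrete, here is a minimal brute-force sketch in plain Python. It is not how the macros are implemented; the helper name brute_force_vss_join and its parameters are hypothetical, and Euclidean distance (math.dist) is assumed as the scoring function purely for illustration.

```python
import heapq
import math
from itertools import product


def brute_force_vss_join(left_rows, right_rows, left_key, right_key, k, distance=math.dist):
    """Hypothetical reference: yield (score, left_row, right_row) triples,
    with at most k nearest right rows for every left row."""
    for left in left_rows:
        # Score every right row against the current left vector.
        scored = ((distance(left[left_key], right[right_key]), right) for right in right_rows)
        # nsmallest returns fewer than k items when right_rows has fewer than k rows.
        for score, right in heapq.nsmallest(k, scored, key=lambda pair: pair[0]):
            yield score, left, right


# Illustrative data: two source rows and a 9 x 9 x 9 grid of destination vectors.
source = [
    {"name": "alice", "s_vec": (5.0, 5.0, 5.0)},
    {"name": "bob", "s_vec": (100.0, 120.0, 90.0)},
]
dest = [
    {"d_vec": (float(a), float(b), float(c))}
    for a, b, c in product(range(1, 10), repeat=3)
]

joined = list(brute_force_vss_join(source, dest, "s_vec", "d_vec", 3))
for score, left, right in joined:
    print(round(score, 5), left["name"], right["d_vec"])
```

Each left row contributes at most k output rows, and every output row carries the score together with both matched rows, which is the general shape a top-k vector join produces.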

Examples:

CREATE TABLE source (name VARCHAR, s_vec FLOAT[3]);
INSERT INTO source VALUES ('alice', [5,5,5]::FLOAT[3]), ('bob', [100,120,90]::FLOAT[3]);

CREATE TABLE dest (id INTEGER, d_vec FLOAT[3]);
INSERT INTO dest SELECT row_number() over (), array_value(a,b,c) FROM range(1,10) ra(a), range(1,10) rb(b), range(1,10) rc(c);

-- Join the two tables
SELECT * FROM vss_join(source, dest, s_vec, d_vec, 3);
----
┌───────────┬──────────────────────────────────────────────┬───────────────────────────────────────┐
│   score   │                   left_tbl                   │               right_tbl               │
│   float   │    struct("name" varchar, s_vec float[3])    │  struct(id integer, d_vec float[3])   │
├───────────┼──────────────────────────────────────────────┼───────────────────────────────────────┤
│ 164.81201 │ {'name': bob, 's_vec': [100.0, 120.0, 90.0]} │ {'id': 729, 'd_vec': [9.0, 9.0, 9.0]} │
│ 165.30577 │ {'name': bob, 's_vec': [100.0, 120.0, 90.0]} │ {'id': 648, 'd_vec': [9.0, 9.0, 8.0]} │
│ 165.36626 │ {'name': bob, 's_vec': [100.0, 120.0, 90.0]} │ {'id': 720, 'd_vec': [8.0, 9.0, 9.0]} │
│       0.0 │ {'name': alice, 's_vec': [5.0, 5.0, 5.0]}    │ {'id': 365, 'd_vec': [5.0, 5.0, 5.0]} │
│       1.0 │ {'name': alice, 's_vec': [5.0, 5.0, 5.0]}    │ {'id': 364, 'd_vec': [5.0, 4.0, 5.0]} │
│       1.0 │ {'name': alice, 's_vec': [5.0, 5.0, 5.0]}    │ {'id': 356, 'd_vec': [4.0, 5.0, 5.0]} │
└───────────┴──────────────────────────────────────────────┴───────────────────────────────────────┘

-- Rank the source table rows against a constant vector
SELECT * FROM vss_match(source, [10, 10, 10]::FLOAT[3], s_vec, 5);
----
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                       matches                                                                       │
│                                         struct(score float, "row" struct("name" varchar, s_vec float[3]))[]                                         │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ [{'score': 8.6602545, 'row': {'name': alice, 's_vec': [5.0, 5.0, 5.0]}}, {'score': 163.09506, 'row': {'name': bob, 's_vec': [100.0, 120.0, 90.0]}}] │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘


-- Perform a lateral join
SELECT * FROM source, vss_match(dest, s_vec, d_vec, 2);
----
┌─────────┬──────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│  name   │        s_vec         │                                                                 matches                                                                  │
│ varchar │       float[3]       │                                     struct(score float, "row" struct(id integer, d_vec float[3]))[]                                      │
├─────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ bob     │ [100.0, 120.0, 90.0] │ [{'score': 164.81201, 'row': {'id': 729, 'd_vec': [9.0, 9.0, 9.0]}}, {'score': 165.30577, 'row': {'id': 648, 'd_vec': [9.0, 9.0, 8.0]}}] │
│ alice   │ [5.0, 5.0, 5.0]      │ [{'score': 0.0, 'row': {'id': 365, 'd_vec': [5.0, 5.0, 5.0]}}, {'score': 1.0, 'row': {'id': 356, 'd_vec': [4.0, 5.0, 5.0]}}]             │
└─────────┴──────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
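The shape of the vss_match results above (one list of {score, row} entries per input vector) can be mimicked with a small brute-force sketch in plain Python. The helper brute_force_vss_match is hypothetical. The scores printed above (for example 8.6602545 and 163.09506 for the constant vector) agree with plain Euclidean distance, so the sketch uses math.dist; that is an observation from the output, not a statement about how the macro is implemented.

```python
import math
from itertools import product


def brute_force_vss_match(right_rows, query_vec, right_key, k):
    """Hypothetical reference: return up to k {score, row} entries, nearest first."""
    scored = [
        {"score": math.dist(query_vec, row[right_key]), "row": row}
        for row in right_rows
    ]
    scored.sort(key=lambda match: match["score"])
    return scored[:k]  # slicing yields fewer than k entries if there are fewer rows


source = [
    {"name": "alice", "s_vec": (5.0, 5.0, 5.0)},
    {"name": "bob", "s_vec": (100.0, 120.0, 90.0)},
]
dest = [
    {"d_vec": (float(a), float(b), float(c))}
    for a, b, c in product(range(1, 10), repeat=3)
]

# Constant query vector: k is 5, but only two rows exist, so two matches come back.
matches = brute_force_vss_match(source, (10.0, 10.0, 10.0), "s_vec", 5)

# Lateral-join analogue: the query vector comes from each source row in turn.
lateral = [
    (row["name"], brute_force_vss_match(dest, row["s_vec"], "d_vec", 2))
    for row in source
]
```

Passing a constant yields a single list of matches, while driving the query vector from another table yields one list per outer row, mirroring the two usage patterns shown in the examples.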

Maxxen merged commit a5da2f9 into duckdb:main Aug 12, 2024
22 checks passed