I encountered an issue regarding the LIMIT return. #36322

adol001 · 2024-09-18T03:04:19Z

There are 847 records in the Milvus collection, with nlist set to 4 and nprobe set to 4. When the limit is 32, only 7 results are returned; when the limit is 320, about 20 results are returned; and when the limit is 16,384, about 667 results are returned.

I'm confused as to why, when the limit is 32, it doesn't return more results, given that nprobe is already the same as nlist.

Here is the script that creates the collection and the index.

from pymilvus import MilvusClient, DataType

# Authentication not enabled
client = MilvusClient("http://192.168.31.161:19530")
print(client.list_databases())

table_name = "m1"
index_name = 'm1v'

schema = MilvusClient.create_schema(
    auto_id=False,
    enable_dynamic_field=True,
)

# 2.2. Add fields to schema
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="vector", datatype=DataType.FLOAT_VECTOR, dim=1024)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=2048)


if client.has_collection(collection_name=table_name):
    client.release_collection(collection_name=table_name)
    client.drop_index(collection_name=table_name, index_name=index_name)
    client.drop_collection(collection_name=table_name)


# 3. Create collection
client.create_collection(
    collection_name=table_name,
    schema=schema,
)

index_params = MilvusClient.prepare_index_params()

# 4.2. Add an index on the vector field.
index_params.add_index(
    field_name="vector",
    metric_type="COSINE",
    index_type="IVF_FLAT",
    index_name=index_name,
    params={ "nlist": 4 }
)

client.create_index(
    collection_name=table_name,
    index_params=index_params,
    sync=False # Whether to wait for index creation to complete before returning. Defaults to True.
)


res = client.list_indexes(
    collection_name=table_name
)

print(res)

Here is the search call.

res = client.search(
    collection_name=table_name,  # Replace with the actual name of your collection
    data=embeddings,  # Replace with your query vector
    limit=32,  # Max. number of search results to return
    search_params={"params": {"nprobe": 4}},  # Search parameters
)

I hope to get some help.

yhmo · 2024-09-18T04:03:00Z
Collaborator

I could not reproduce the problem with random vectors:

import random
from pymilvus import MilvusClient, DataType

# Authentication not enabled
client = MilvusClient("http://localhost:19530")
print(client.list_databases())

table_name = "m1"
index_name = 'm1v'

schema = MilvusClient.create_schema(
    auto_id=False,
    enable_dynamic_field=True,
)

# 2.2. Add fields to schema
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="vector", datatype=DataType.FLOAT_VECTOR, dim=1024)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=2048)


if client.has_collection(collection_name=table_name):
    client.release_collection(collection_name=table_name)
    client.drop_index(collection_name=table_name, index_name=index_name)
    client.drop_collection(collection_name=table_name)


# 3. Create collection
client.create_collection(
    collection_name=table_name,
    schema=schema,
)

index_params = MilvusClient.prepare_index_params()

# 4.2. Add an index on the vector field.
index_params.add_index(
    field_name="vector",
    metric_type="COSINE",
    index_type="IVF_FLAT",
    index_name=index_name,
    params={ "nlist": 4 }
)

client.create_index(
    collection_name=table_name,
    index_params=index_params,
    sync=False # Whether to wait for index creation to complete before returning. Defaults to True.
)


res = client.list_indexes(
    collection_name=table_name
)
print(res)
client.load_collection(collection_name=table_name)

for i in range(847):
    client.insert(collection_name=table_name, data={"id": i, "vector": [random.random() for _ in range(1024)], "text": f"text_{i}"})

print(client.query(collection_name=table_name, filter="", output_fields=["count(*)"], consistency_level="Strong"))

def search(limit: int):
    embeddings = [[random.random() for _ in range(1024)]]
    res = client.search(
            collection_name=table_name,  # Replace with the actual name of your collection
            # Replace with your query vector
            data=embeddings,
            limit=limit,  # Max. number of search results to return
            search_params={"params": {"nprobe": 4}}  # Search parameters
        )
    # print("Search result:")
    # for item in res[0]:
    #     print(item)
    print("Result count:", len(res[0]))


search(32)
search(320)
search(16384)

The number of returned results is as expected:

['default']
['m1v']
data: ["{'count(*)': 847}"] 
Result count: 32
Result count: 320
Result count: 847

4 replies

adol001 Sep 18, 2024
Author

target = []
for i in range(100):
    item = {"id": i, "vector": [random.random() for _ in range(1024)], "text": f"text_{i}"}
    target.append(item)
    client.insert(collection_name=table_name, data=target)

My program is similar to this. You can reproduce the problem by changing the insert code as shown above. Why does this happen?

Result count: 1
Result count: 8
Result count: 100

yhmo Sep 19, 2024
Collaborator

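So why is "target = []" outside the loop? Because the list is never reset while insert() runs on every iteration, each call re-sends every row accumulated so far: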

i = 0, target = [{"id": 0, "vector": [], "text": "text_0"}]
i = 1, target = [{"id": 0, "vector": [], "text": "text_0"}, {"id": 1, "vector": [], "text": "text_1"}]
i = 2, target = [{"id": 0, "vector": [], "text": "text_0"}, {"id": 1, "vector": [], "text": "text_1"}, {"id": 2, "vector": [], "text": "text_2"}]
.....
i = 99, target = [{"id": 0, "vector": [], "text": "text_0"}, {"id": 1, "vector": [], "text": "text_1"}, ...... , {"id": 99, "vector": [], "text": "text_99"}]

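In the end, the collection has received 5,050 rows (1 + 2 + ... + 100) but only 100 unique ids.
Search returns at most one item per unique id, so duplicate hits are collapsed and the result count can fall far below the limit. This is why you got the odd result counts.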
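To make the arithmetic concrete, the buggy loop can be simulated without a Milvus server. This is only an illustrative sketch: the `inserted_ids` list is a stand-in (an assumption, not part of pymilvus) for the rows the collection would receive, and vectors are omitted.

```python
from collections import Counter

# `inserted_ids` stands in for the rows received by the collection.
inserted_ids = []

target = []  # never reset, exactly as in the buggy script
for i in range(100):
    target.append({"id": i})
    # Mimics client.insert(collection_name=table_name, data=target):
    # every row accumulated so far is sent again.
    inserted_ids.extend(row["id"] for row in target)

copies_per_id = Counter(inserted_ids)
total_rows = len(inserted_ids)
unique_ids = len(copies_per_id)

print(total_rows, unique_ids, copies_per_id[0], copies_per_id[99])
# prints: 5050 100 100 1
```

Id 0 ends up with 100 copies while id 99 has only one, so the copies of a few ids can easily fill a small top-k.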

You can modify your script like this so that each unique id is inserted only once:

for i in range(100):
    item = {"id": i, "vector": [random.random() for _ in range(1024)], "text": f"text_{i}"}
    client.insert(collection_name=table_name, data=[item])
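If the intent of the original script was to batch the rows, another option is to build the full list first and call insert() once, after the loop. The sketch below assumes the same schema as in this thread; `build_rows` and `insert_once` are hypothetical helper names, and `client` is expected to be a connected MilvusClient.

```python
import random


def build_rows(count, dim=1024):
    """Create `count` rows with unique ids, matching the schema in this thread."""
    return [
        {"id": i, "vector": [random.random() for _ in range(dim)], "text": f"text_{i}"}
        for i in range(count)
    ]


def insert_once(client, collection_name, rows):
    """Send all rows in a single insert() call, after the list is fully built."""
    return client.insert(collection_name=collection_name, data=rows)


rows = build_rows(100)
# With a running server: insert_once(client, table_name, rows)
```

This keeps the batching while ensuring each id is sent exactly once.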

adol001 Sep 19, 2024
Author

I know this is a bug, and in this case I should use 'upsert' instead of 'insert'. However, the client reported no error, and the number of results returned silently fell below the limit. This makes for a poor user experience.
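Since the client reports no error here, one possible workaround is a client-side guard that fails loudly when a primary key is sent twice. This is a minimal sketch, not a Milvus feature: `DuplicateKeyGuard` is a hypothetical helper, it assumes rows are dicts with an "id" key, and it only knows about keys passed through it in the current process, not keys already stored in the collection.

```python
class DuplicateKeyGuard:
    """Remembers every primary key passed through it and rejects repeats."""

    def __init__(self, key="id"):
        self.key = key
        self.seen = set()

    def check(self, rows):
        """Return `rows` unchanged, or raise if any key was seen before."""
        batch = set()
        for row in rows:
            pk = row[self.key]
            if pk in self.seen or pk in batch:
                raise ValueError(f"primary key {pk!r} was already inserted")
            batch.add(pk)
        # Commit the batch only after every row has passed the check.
        self.seen |= batch
        return rows


guard = DuplicateKeyGuard()
# With a running server:
# client.insert(collection_name=table_name, data=guard.check(target))
```

With the accumulating `target` list from the script above, the second loop iteration would raise a ValueError for id 0 instead of silently inserting a duplicate.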

yhmo Sep 19, 2024
Collaborator

Milvus's current behavior is confusing when a collection contains duplicate primary keys. It exists for historical reasons and is not easy to change.

The key points:

insert() doesn't check for duplicate primary keys because that check is time-consuming, especially for huge datasets.
upsert() can avoid duplicate primary keys, but it is also a heavy operation for huge datasets.
search()/query() return only one item per duplicate primary key, because a top-k result like this would not make sense:

No.1  ID = 1, distance=0.01
No.2  ID = 1, distance=0.03
No.3  ID = 2, distance=0.06
No.4  ID = 2, distance=0.1
No.5  ID = 1, distance=0.2
......
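The effect on the result count can be illustrated with the five hits listed above. The sketch below is not Milvus's internal implementation; `dedupe_by_id` is a hypothetical function that simply keeps the closest hit per id, which is enough to show how a top-5 shrinks to two results.

```python
def dedupe_by_id(hits):
    """Keep the first (closest) hit for each id; `hits` must be sorted by distance."""
    seen = set()
    unique = []
    for pk, distance in hits:
        if pk not in seen:
            seen.add(pk)
            unique.append((pk, distance))
    return unique


# (id, distance) pairs taken from the example above.
raw_top5 = [(1, 0.01), (1, 0.03), (2, 0.06), (2, 0.1), (1, 0.2)]

print(dedupe_by_id(raw_top5))
# prints: [(1, 0.01), (2, 0.06)]
```

Here a limit of 5 yields only 2 results, the same pattern as a limit of 32 returning 7 results in the original report.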

Answer selected by adol001
