Very slow reindexing #28452
Replies: 7 comments 15 replies
-
How many CPU cores are there in your cluster?
-
But if you flush frequently, it may trigger recurring compaction and index builds.
-
If that's the code you are using, the use case looks fine. Could you provide detailed logs so we can investigate the cause? To get the logs, you will need to run the export-log script.
-
This seems to be a standalone deployment.
-
[2023/11/16 04:45:37.808 +00:00] [INFO] [indexnode/indexnode_service.go:209] ["Get Index Job Stats"] [traceID=714b518fc69f9d63] [Unissued=0] [Active=1] [Slot=0]
-
@ducanh997 Hi. Based on your configuration, I would roughly estimate that "pending_index_rows" should decrease by about 300,000 rows per minute. The collection you mentioned earlier, which had 8,313,118 rows, now shows "pending_index_rows" at 0.
It seems that other collections are still building their indexes.
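As a rough sanity check on that estimate, the arithmetic can be sketched as below. It assumes the rate stays constant at 300,000 rows per minute, which is only an approximation, and uses the 5,063,118 pending rows reported in the original post. The function name `eta_minutes` is made up for illustration.

```python
def eta_minutes(pending_rows: int, rows_per_minute: float) -> float:
    """Minutes until pending_rows reaches zero at a constant rate."""
    if rows_per_minute <= 0:
        raise ValueError("rows_per_minute must be positive")
    return pending_rows / rows_per_minute


# Figures quoted in this thread: 5,063,118 pending rows, ~300,000 rows/minute.
print(round(eta_minutes(5_063_118, 300_000), 1))  # about 16.9 minutes
```

If the observed drain takes hours rather than minutes, the actual rate is well below the estimate, or other collections are competing for the same index node.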
-
Hello @ducanh997, could you share your code that shows how to monitor the indexing process in Milvus?
from pymilvus import (
    connections,
    Collection,
    CollectionSchema,
    FieldSchema,
    DataType,
    utility,
    db,
    MilvusClient,
)
import random

db_name = "test_dv"
collection_name = "dev_collection"
connections.connect(host="127.0.0.1", port=19530, alias="test")
if db_name not in db.list_database("test"):
    db.create_database(db_name, using="test")
milvus_vb = MilvusClient(uri="http://localhost:19530", db_name=db_name)
if milvus_vb.has_collection(collection_name):
    milvus_vb.drop_collection(collection_name)
    print("dropped")
milvus_vb.create_collection(
    collection_name=collection_name,
    schema=CollectionSchema([
        FieldSchema("id", DataType.INT64, is_primary=True),
        FieldSchema("vector", DataType.FLOAT_VECTOR, dim=128),
    ]),
)
index_params = milvus_vb.prepare_index_params()
index_params.add_index(
    field_name="vector",
    index_type="IVF_FLAT",
    metric_type="IP",
    params={"nlist": 128},
)
milvus_vb.create_index(
    collection_name=collection_name,
    index_params=index_params,
)
print("Loading...")
milvus_vb.load_collection(
    collection_name=collection_name,
    replica_number=1,  # Number of replicas to create on query nodes. Max value is 1 for Milvus Standalone, and no greater than `queryNode.replicas` for Milvus Cluster.
)
print("Loaded")
conn = milvus_vb._get_connection()
conn.get_index_build_progress(collection_name=collection_name, index_name="vector")
dummy_data = [{"id": i, "vector": [random.random() for _ in range(128)]} for i in range(10000)]
batch = 1000
for i in range(0, len(dummy_data), batch):
    print("Inserting batch", i)
    milvus_vb.insert(collection_name=collection_name, data=dummy_data[i:i + batch])
    res = conn.get_index_build_progress(collection_name=collection_name, index_name="vector")
    print(res)
# Query
query_vector = [random.random() for _ in range(128)]
res = conn.get_index_build_progress(collection_name=collection_name, index_name="vector")
print(res)
results = milvus_vb.search(collection_name=collection_name, data=[query_vector], limit=5, output_fields=["id"])
res = conn.get_index_build_progress(collection_name=collection_name, index_name="vector")
print(res)
print(results)
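The script above reads progress through the private `_get_connection()` handle. A possible alternative, sketched here under assumptions, is to poll the public `utility.index_building_progress` call from the PyMilvus ORM API until `pending_index_rows` reaches 0. The helper names `wait_for_index` and `milvus_progress`, the polling interval, and the poll limit are illustrative and not part of PyMilvus. The `using` alias must refer to a connection opened against the database that holds the collection.

```python
import time
from typing import Callable, Dict


def pending_rows(progress: Dict[str, int]) -> int:
    """Rows still waiting for an index, per the progress dict."""
    # Fall back to total - indexed if the server omits pending_index_rows.
    fallback = progress.get("total_rows", 0) - progress.get("indexed_rows", 0)
    return progress.get("pending_index_rows", fallback)


def wait_for_index(fetch: Callable[[], Dict[str, int]],
                   interval_s: float = 5.0,
                   max_polls: int = 720) -> Dict[str, int]:
    """Call fetch() until no rows are pending or max_polls is reached."""
    progress = fetch()
    for _ in range(max_polls - 1):
        if pending_rows(progress) == 0:
            break
        time.sleep(interval_s)
        progress = fetch()
    return progress


def milvus_progress(collection_name: str, index_name: str = "",
                    using: str = "default") -> Dict[str, int]:
    """Fetch index progress from Milvus via the public utility API."""
    # Imported lazily so wait_for_index works without pymilvus installed.
    from pymilvus import utility
    return utility.index_building_progress(
        collection_name, index_name=index_name, using=using
    )
```

Usage would look like `wait_for_index(lambda: milvus_progress("dev_collection", "vector"))`. Passing the fetch function as a parameter keeps the loop independent of the client library, so it can be exercised with a stub.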
-
Milvus Version: v2.2.14 Standalone
I've added 8 million float vectors to Milvus, each with 768 dimensions. To add them, I split the data into chunks of 10,000 records and called collection.insert() for each chunk using PyMilvus. After completing the data insertion, I created an IVF_SQ8 index with nlist = 2048. While monitoring the indexing process, I observed:
{'total_rows': 8313118, 'indexed_rows': 8313118, 'pending_index_rows': 5063118}
Since indexed_rows equals total_rows, I assumed I could begin searching on Milvus. However, pending_index_rows was decreasing very slowly, and CPU usage stayed at 100% for many hours. After reading this GitHub issue, I speculated that Milvus might be reindexing for optimization. Is there any way to disable this feature, or to schedule a specific time for Milvus to reindex?
I would appreciate any insights or suggestions to enhance performance in this context.
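For reference, the chunked insert pattern described above can be sketched roughly as follows. This is not the exact code used for the 8 million vectors. The helper names `chunks` and `insert_in_chunks` are hypothetical, and the insert and flush operations are passed in as callables (for example `collection.insert` and `collection.flush` from the PyMilvus ORM API). The sketch flushes once at the end rather than after each chunk, since frequent flushes were raised in this thread as a possible trigger for extra compaction and indexing.

```python
from typing import Callable, Iterator, Sequence


def chunks(rows: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of rows with at most `size` items each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def insert_in_chunks(rows: Sequence,
                     insert_fn: Callable[[Sequence], object],
                     flush_fn: Callable[[], object],
                     size: int = 10_000) -> int:
    """Insert rows chunk by chunk, flush once at the end, return chunk count."""
    count = 0
    for chunk in chunks(rows, size):
        insert_fn(chunk)
        count += 1
    flush_fn()  # single flush after all inserts, not one per chunk
    return count
```

With a PyMilvus `Collection` object this could be called as `insert_in_chunks(rows, collection.insert, collection.flush)`, assuming `rows` is already in a format that `insert` accepts.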