Support hdf5 files in bulk operation #620

finnroblin · 2024-08-17T00:48:48Z

Description

Adds hdf5 file support for bulk ingestion. hdf5 files store vector datasets in a non-JSON format, so @VijayanB wrote separate parameter operations to send vectors to the bulk API. This PR adds vector support directly to OSB's bulk operation. This benefits vector search benchmarking because the bulk operation supports additional features, and it reduces the number of vector search-specific features in OSB.
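
As a rough illustration of the idea (not the code in this PR), here is a minimal sketch of how vectors taken from an hdf5 dataset could be turned into a bulk request body. The function name `build_bulk_body` is hypothetical, and the sketch assumes the vectors have already been loaded into plain Python lists; the index and field names mirror the ones used in the manual verification.

```python
import json


def build_bulk_body(vectors, index, field, start_id=0):
    """Build an NDJSON bulk body: one action line plus one document line per vector."""
    lines = []
    for offset, vector in enumerate(vectors):
        # Action line: index the document under an increasing integer id (assumed scheme).
        lines.append(json.dumps({"index": {"_index": index, "_id": str(start_id + offset)}}))
        # Source line: the vector stored under the target field.
        lines.append(json.dumps({field: [float(x) for x in vector]}))
    # The bulk API expects every line, including the last one, to end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body([[0.1, 0.2], [0.3, 0.4]], "target_index", "target_field")
print(body)
```

Each vector becomes two lines in the body, so a batch of N vectors yields 2N newline-terminated lines.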

Testing

New functionality includes testing

Unit tests and manual verification. For the manual run, I modified the cohere 1k-document corpus definition to include the information needed by the bulk operation.

Steps taken for manual verification:
Parameter file:

{
    "target_index_name": "target_index",
    "target_field_name": "target_field",
    "target_index_body": "indices/faiss-index.json",
    "target_index_primary_shards": 1,
    "target_index_dimension": 768,
    "target_index_space_type": "l2",
    
    "target_index_bulk_size": 5,
    "target_index_bulk_index_data_set_format": "hdf5",
    "target_index_bulk_indexing_clients": 10,
    "target_index_bulk_index_data_set_corpus": "cohere",
    
    "target_index_max_num_segments": 1,
    "target_index_force_merge_timeout": 300,
    "hnsw_ef_search": 100,
    "hnsw_ef_construction": 100,

    "query_k": 100,
    "query_body": {
         "docvalue_fields" : ["_id"],
         "stored_fields" : "_none_"
    },

    "query_data_set_format": "hdf5",
    "query_data_set_corpus": "cohere",
    "query_count": 100
}

Bulk schedule:

{
    "operation": {
        "name": "delete-target-index",
        "operation-type": "delete-index",
        "only-if-exists": true,
        "index": "{{ target_index_name | default('target_index') }}"
    }
},
{
    "operation": {
        "name": "create-target-index",
        "operation-type": "create-index",
        "index": "{{ target_index_name | default('target_index') }}"
    }
},
{
    "operation": {
        "name": "bulk",
        "operation-type": "bulk",
        "bulk-size": 5,
        "data_set_format": "{{ target_index_bulk_index_data_set_format | default('hdf5') }}",
        "source_format": "hdf5",
        "index": "target_index",
        "field": "target_field",
        "vector_dataset_context": "index",
        "corpora": ["cohere"]
    },
    "clients": {{ target_index_bulk_indexing_clients | default(1)}}
},
{
    "name" : "refresh-target-index",
    "operation" : "refresh-target-index"
}
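
For a sense of scale, a bulk-size of 5 over the 1,000-document corpus implies 200 bulk requests in total. The sketch below only illustrates that arithmetic; `batched` is a hypothetical helper, and how OSB actually partitions the corpus across the indexing clients is not shown here.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


doc_ids = list(range(1000))          # stand-in for the 1,000 vectors in the corpus
batches = list(batched(doc_ids, 5))  # bulk-size of 5, as in the schedule
print(len(batches))                  # number of bulk requests
```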

Corpus changes:

"corpora": [
    {
      "name": "cohere",
      "base-url": "https://dbyiw3u3rf9yr.cloudfront.net/corpora/vectorsearch/cohere-wikipedia-22-12-en-embeddings",
      "target-index": "{{ target_index_name }}",
      "documents": [
        {
          "source-file": "documents-1k.hdf5.bz2",
          "source-format": "hdf5",
          "document-count": 1000,
          "generate-increasing-vector-ids": true,
          "id-field-name": "_id",
          "vector-field-name": "target_field"
        }
      ]
    },

bulk-procedure:

{
    "name": "bulk-procedure",
    "default": false,
    "schedule": [
       {{ benchmark.collect(parts="common/bulk-schedule.json") }},
       {{ benchmark.collect(parts="common/search-only-schedule.json") }}
    ]
},

Result:

(.venv) finnrobl@80a9970f4597 opensearch-benchmark % export PARAMS=/Users/finnrobl/Code/opensearch-benchmark-workloads/vectorsearch/params/bulk-params.json
(.venv) finnrobl@80a9970f4597 opensearch-benchmark % opensearch-benchmark execute-test --target-hosts $ENDPOINT \
    --workload-path /Users/finnrobl/Code/opensearch-benchmark-workloads/vectorsearch --workload-params $PARAMS \
    --pipeline benchmark-only \
    --kill-running-processes \
    --test-procedure bulk-procedure

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] [Test Execution ID]: e8307702-7dda-4a30-8b87-6f2fc1834ecb
[INFO] Executing test with workload [vectorsearch], test_procedure [bulk-procedure] and provision_config_instance ['external'] with version [3.0.0-SNAPSHOT].

[WARNING] merges_total_time is 16 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] indexing_total_time is 7 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 63 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] flush_total_time is 120 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running delete-target-index                                                    [100% done]
Running create-target-index                                                    [100% done]
Running bulk                                                                   [100% done]
Running refresh-target-index                                                   [100% done]
Running warmup-indices                                                         [100% done]
Running prod-queries                                                           [100% done]

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
            
|                                                         Metric |           Task |       Value |   Unit |
|---------------------------------------------------------------:|---------------:|------------:|-------:|
|                     Cumulative indexing time of primary shards |                |   0.0371833 |    min |
|             Min cumulative indexing time across primary shards |                |           0 |    min |
|          Median cumulative indexing time across primary shards |                | 0.000116667 |    min |
|             Max cumulative indexing time across primary shards |                |   0.0370667 |    min |
|            Cumulative indexing throttle time of primary shards |                |           0 |    min |
|    Min cumulative indexing throttle time across primary shards |                |           0 |    min |
| Median cumulative indexing throttle time across primary shards |                |           0 |    min |
|    Max cumulative indexing throttle time across primary shards |                |           0 |    min |
|                        Cumulative merge time of primary shards |                | 0.000266667 |    min |
|                       Cumulative merge count of primary shards |                |           1 |        |
|                Min cumulative merge time across primary shards |                |           0 |    min |
|             Median cumulative merge time across primary shards |                |           0 |    min |
|                Max cumulative merge time across primary shards |                | 0.000266667 |    min |
|               Cumulative merge throttle time of primary shards |                |           0 |    min |
|       Min cumulative merge throttle time across primary shards |                |           0 |    min |
|    Median cumulative merge throttle time across primary shards |                |           0 |    min |
|       Max cumulative merge throttle time across primary shards |                |           0 |    min |
|                      Cumulative refresh time of primary shards |                |  0.00468333 |    min |
|                     Cumulative refresh count of primary shards |                |          12 |        |
|              Min cumulative refresh time across primary shards |                |           0 |    min |
|           Median cumulative refresh time across primary shards |                |     0.00105 |    min |
|              Max cumulative refresh time across primary shards |                |  0.00363333 |    min |
|                        Cumulative flush time of primary shards |                |       0.002 |    min |
|                       Cumulative flush count of primary shards |                |           2 |        |
|                Min cumulative flush time across primary shards |                |           0 |    min |
|             Median cumulative flush time across primary shards |                |           0 |    min |
|                Max cumulative flush time across primary shards |                |       0.002 |    min |
|                                        Total Young Gen GC time |                |        0.01 |      s |
|                                       Total Young Gen GC count |                |           1 |        |
|                                          Total Old Gen GC time |                |           0 |      s |
|                                         Total Old Gen GC count |                |           0 |        |
|                                                     Store size |                |   0.0173898 |     GB |
|                                                  Translog size |                |   0.0150675 |     GB |
|                                         Heap used for segments |                |           0 |     MB |
|                                       Heap used for doc values |                |           0 |     MB |
|                                            Heap used for terms |                |           0 |     MB |
|                                            Heap used for norms |                |           0 |     MB |
|                                           Heap used for points |                |           0 |     MB |
|                                    Heap used for stored fields |                |           0 |     MB |
|                                                  Segment count |                |          10 |        |
|                                                 Min Throughput |           bulk |     1640.19 | docs/s |
|                                                Mean Throughput |           bulk |     1640.19 | docs/s |
|                                              Median Throughput |           bulk |     1640.19 | docs/s |
|                                                 Max Throughput |           bulk |     1640.19 | docs/s |
|                                        50th percentile latency |           bulk |     17.3579 |     ms |
|                                        90th percentile latency |           bulk |     45.1002 |     ms |
|                                        99th percentile latency |           bulk |     83.7313 |     ms |
|                                       100th percentile latency |           bulk |      88.521 |     ms |
|                                   50th percentile service time |           bulk |     17.3579 |     ms |
|                                   90th percentile service time |           bulk |     45.1002 |     ms |
|                                   99th percentile service time |           bulk |     83.7313 |     ms |
|                                  100th percentile service time |           bulk |      88.521 |     ms |
|                                                     error rate |           bulk |           0 |      % |
|                                                 Min Throughput | warmup-indices |       36.24 |  ops/s |
|                                                Mean Throughput | warmup-indices |       36.24 |  ops/s |
|                                              Median Throughput | warmup-indices |       36.24 |  ops/s |
|                                                 Max Throughput | warmup-indices |       36.24 |  ops/s |
|                                       100th percentile latency | warmup-indices |     27.4253 |     ms |
|                                  100th percentile service time | warmup-indices |     27.4253 |     ms |
|                                                     error rate | warmup-indices |           0 |      % |
|                                                 Min Throughput |   prod-queries |       149.9 |  ops/s |
|                                                Mean Throughput |   prod-queries |       149.9 |  ops/s |
|                                              Median Throughput |   prod-queries |       149.9 |  ops/s |
|                                                 Max Throughput |   prod-queries |       149.9 |  ops/s |
|                                        50th percentile latency |   prod-queries |     3.36225 |     ms |
|                                        90th percentile latency |   prod-queries |      4.6824 |     ms |
|                                        99th percentile latency |   prod-queries |     58.3903 |     ms |
|                                       100th percentile latency |   prod-queries |     109.023 |     ms |
|                                   50th percentile service time |   prod-queries |     3.36225 |     ms |
|                                   90th percentile service time |   prod-queries |      4.6824 |     ms |
|                                   99th percentile service time |   prod-queries |     58.3903 |     ms |
|                                  100th percentile service time |   prod-queries |     109.023 |     ms |
|                                                     error rate |   prod-queries |           0 |      % |
|                                                  Mean recall@k |   prod-queries |        0.37 |        |
|                                                  Mean recall@1 |   prod-queries |        0.07 |        |


--------------------------------
[INFO] SUCCESS (took 63 seconds)
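
The last two rows of the table report recall. As a rough illustration of what such a metric measures (a sketch, not OSB's implementation), recall@k can be taken as the fraction of the true k nearest neighbors that appear among the k returned results. The function name and the sample IDs below are hypothetical.

```python
def recall_at_k(returned_ids, true_neighbor_ids, k):
    """Fraction of the true top-k neighbors that appear in the top-k returned results."""
    if k <= 0:
        raise ValueError("k must be positive")
    returned = set(returned_ids[:k])
    truth = set(true_neighbor_ids[:k])
    return len(returned & truth) / k


# Hypothetical query: 2 of the 4 true neighbors were returned.
score = recall_at_k(["7", "3", "9", "1"], ["3", "7", "5", "8"], k=4)
print(score)
```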

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

IanHoang

@finnroblin Can you address merge conflicts?

Signed-off-by: Finn Roblin <[email protected]>

finnroblin marked this pull request as ready for review August 19, 2024 20:46

finnroblin requested review from IanHoang, gkamat, beaioun, cgchinmay, rishabh6788 and VijayanB as code owners August 19, 2024 20:46

finnroblin changed the title ~~[Draft] Initial vector bulk hdf5 implementation~~ Support hdf5 files in bulk operation Aug 27, 2024

IanHoang requested changes Sep 5, 2024

Initial vector bulk hdf5 implementation (fix conflicts)

d2cfd72

Signed-off-by: Finn Roblin <[email protected]>

finnroblin force-pushed the vectors-in-bulk-op branch from 3281b13 to d2cfd72 Compare September 11, 2024 22:35

finnroblin requested a review from IanHoang September 17, 2024 19:45
