Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Treat . as a nested field in field_map of text embedding processor #488

Conversation

Sanjana679
Copy link

@Sanjana679 Sanjana679 commented Nov 9, 2023

Description

These changes will allow embeddings to be computed when using a nested source field for a text embedding processor as shown in the configuration below:

PUT /_ingest/pipeline/nlp-ingest-pipeline
{
  "description": "An NLP ingest pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "",
        "field_map": {
          “message.text": “message_embedding”
        }
      }
    }
  ]
}

PUT /my-nlp-index
{
    "settings": {
        "index.knn": true,
        "default_pipeline": “nlp-ingest-pipeline”
    },
    "mappings": {
        "properties": {
            "message_embedding": {
                "type": "knn_vector",
                "dimension": 384,
                "method": {
                    "name": "hnsw",
                    "engine": "lucene"
                }
            },
            "message.text": { 
                "type": "text"            
            },
            "color": {
                "type": "text"
            }
        }
    }
}

PUT /my-nlp-index/_doc/1
{
  "message: {"text": "Blueberry smoothie"},
  "color": "blue"
}

GET /my-nlp-index/_doc/1

Issues Resolved

#110

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@Sanjana679 Sanjana679 force-pushed the feature/text-embedding-processor-nested-fields branch from a428d30 to 2c7f491 Compare November 9, 2023 09:37
@martin-gaievski
Copy link
Member

how many nested levels are we going to support? We need to state that if it's more than 2 levels, e.g.
my_object.some_property_group.some_field. Currently with mapping we can do multiple levels. @navneet1v what do you think?

Copy link

codecov bot commented Nov 10, 2023

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (b3c73bd) 84.65% compared to head (2c7f491) 84.10%.

❗ Current head 2c7f491 differs from pull request most recent head 2177ba2. Consider uploading reports for the commit 2177ba2 to get more accurate results

Files Patch % Lines
...rch/neuralsearch/processor/InferenceProcessor.java 16.66% 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #488      +/-   ##
============================================
- Coverage     84.65%   84.10%   -0.55%     
+ Complexity      508      498      -10     
============================================
  Files            40       40              
  Lines          1505     1497       -8     
  Branches        234      229       -5     
============================================
- Hits           1274     1259      -15     
- Misses          128      137       +9     
+ Partials        103      101       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -154,6 +154,16 @@ Map<String, Object> buildMapWithProcessorKeyAndOriginalValue(IngestDocument inge
for (Map.Entry<String, Object> fieldMapEntry : fieldMap.entrySet()) {
String originalKey = fieldMapEntry.getKey();
Object targetKey = fieldMapEntry.getValue();

int nestedDotIndex = originalKey.indexOf('.');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a user-provided info, can we add basic validation if it's not already done as part of the processor/pipeline definition.

if multiple levels of nested fields are needed this code may need a rework

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some basic validation done in validateEmbeddingConfiguration which is run on fieldMap in the constructor in this file. However, I will add some extra validation for nested fields.


int nestedDotIndex = originalKey.indexOf('.');
if (nestedDotIndex != -1) {
Map<String, Object> temp = new LinkedHashMap<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this map object? Can you please use more meaningful name for map variable

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a more meaningful name for the map variable.

@navneet1v
Copy link
Collaborator

navneet1v commented Nov 13, 2023

@Sanjana679 thanks for raising the PR. Can you please add output for this case.

PUT /_ingest/pipeline/nlp-ingest-pipeline
{
  "description": "An NLP ingest pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "",
        "field_map": {
          “message.text": “message.message_embedding”
        }
      }
    }
  ]
}

Also I don't see any integration test added, can we add the integration tests.

@navneet1v
Copy link
Collaborator

@Sanjana679 can you also run ./gradlew spotlessJavaCheck before raising the PR. It will fix the formatting issues.

@navneet1v
Copy link
Collaborator

@Sanjana679 I can within the diff there are lines which are coming because of the different formatter you might have in your local. Please fix that too and make sure that the lines which are actually modified are coming in the diff.

Comment on lines +169 to +175
if (nestedDotIndex != -1) {
Map<String, Object> newTargetKey = new LinkedHashMap<>();
newTargetKey.put(originalKey.substring(nestedDotIndex + 1), targetKey);
targetKey = newTargetKey;

originalKey = originalKey.substring(0, nestedDotIndex);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sanjana679 can you please provide details how we are handling multiple level of nesting here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, I'm not currently handling multiple levels of nesting, as I initially thought it was only for one level. However, I will work on handling multiple levels of nesting.

@navneet1v
Copy link
Collaborator

@Sanjana679 can you also see with your changes this issue will be resolved or not: https://forum.opensearch.org/t/need-neural-search-plugin-to-support-nested-field-type-array-of-objects/16760 ?

@Sanjana679
Copy link
Author

@Sanjana679 can you also see with your changes this issue will be resolved or not: https://forum.opensearch.org/t/need-neural-search-plugin-to-support-nested-field-type-array-of-objects/16760 ?

At the moment, my changes don't resolve this issue but if I can easily resolve this issue I will work on it.

@navneet1v
Copy link
Collaborator

@Sanjana679 can you also see with your changes this issue will be resolved or not: https://forum.opensearch.org/t/need-neural-search-plugin-to-support-nested-field-type-array-of-objects/16760 ?

At the moment, my changes don't resolve this issue but if I can easily resolve this issue I will work on it.

I checked the question the way user is using is not correct.

can we check if the processor is created like this below:

{
	“description”: “MLsearchtestpipeline”,
	“processors”: [
		{
			“text_embedding”: {
				“model_id”: “8-S0XYsBFQTip4T18k-U”,
				“field_map”: {
					“question_embeddings.text”: "question_embeddings.text_embedding"
				}
			}
		}
	]
}

will it work or not.

Copy link

@samuel-oci samuel-oci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add tests to cover the new logic described here for nested fields ingestion?

super(tag, description);
this.type = type;
if (StringUtils.isBlank(modelId)) throw new IllegalArgumentException("model_id is null or empty, cannot process it");
if (StringUtils.isBlank(modelId))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did the formatting change? did you run ./gradlew :spotlessApply prior?

@@ -154,9 +164,20 @@ Map<String, Object> buildMapWithProcessorKeyAndOriginalValue(IngestDocument inge
for (Map.Entry<String, Object> fieldMapEntry : fieldMap.entrySet()) {
String originalKey = fieldMapEntry.getKey();
Object targetKey = fieldMapEntry.getValue();

int nestedDotIndex = originalKey.indexOf('.');

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use a static constant and avoid magic characters in code.
e.g.

private static final char FIELD_SEPARATOR = '.';

Comment on lines +190 to +193
String parentKey,
Object processorKey,
Map<String, Object> sourceAndMetadataMap,
Map<String, Object> treeRes) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove these indents.

nestedFieldMapEntry.getKey(),
nestedFieldMapEntry.getValue(),
(Map<String, Object>) sourceAndMetadataMap.get(parentKey),
next);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep ) consistent

@@ -211,18 +233,21 @@ private void validateEmbeddingFieldsValue(IngestDocument ingestDocument) {
private void validateNestedTypeValue(String sourceKey, Object sourceValue, Supplier<Integer> maxDepthSupplier) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we replace Supplier<Integer> with a simple int?

Comment on lines +242 to +244
.stream()
.filter(Objects::nonNull)
.forEach(x -> validateNestedTypeValue(sourceKey, x, () -> maxDepth + 1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove these indents

@martin-gaievski
Copy link
Member

implemented same logic in #841, we can close this PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants