[FEATURE] Add ignore missing field to text chunking processor #906

IanMenendez · 2024-09-13T22:41:57Z

What solution would you like?

Currently, if a document is ingested by a text chunking processor and the input field is null, the processor outputs an empty list. There is no way to skip the text chunking processor when the field does not exist.

The proposed solution is to add the ignore_missing field to text chunking processors.

If ignore_missing == true, then fields that should be chunked but do not exist will be skipped instead of producing an empty list.

Example:

Processor:

    {
      "text_chunking": {
        "ignore_missing": true,
        "field_map": {
          "body": "body_chunk"
        },
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "tokenizer": "letter"
          }
        }
      }
    }

Input:

    {
      "name": "OpenSearch"
    }

Output:

    {
      "name": "OpenSearch"
    }

If ignore_missing == false, then it will continue to work as it currently does: fields that do not exist will have an empty list as output.

Processor:

    {
      "text_chunking": {
        "ignore_missing": false,
        "field_map": {
          "body": "body_chunk"
        },
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "tokenizer": "letter"
          }
        }
      }
    }

Input:

    {
      "name": "OpenSearch"
    }

Output:

    {
      "name": "OpenSearch",
      "body_chunk": []
    }

The default value would be ignore_missing = false, so existing pipelines keep their current behavior.
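The proposed semantics can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the plugin's implementation: the function names `apply_text_chunking` and `chunk_by_characters` are hypothetical, and the chunker is a toy that splits by character count rather than the real fixed_token_length algorithm with the letter tokenizer. It also assumes that a null value and an absent field are treated the same way.

```python
from typing import Any, Dict, List


def chunk_by_characters(text: str, limit: int) -> List[str]:
    """Toy stand-in for a chunking algorithm: slices of at most `limit` characters."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]


def apply_text_chunking(
    doc: Dict[str, Any],
    field_map: Dict[str, str],
    limit: int = 10,
    ignore_missing: bool = False,
) -> Dict[str, Any]:
    """Hypothetical model of the proposed option, not the processor's real code."""
    result = dict(doc)  # leave the input document untouched
    for source_field, target_field in field_map.items():
        value = doc.get(source_field)  # None for both absent and null fields
        if value is None:
            if ignore_missing:
                continue  # proposed behavior: do not write the output field
            result[target_field] = []  # current behavior: empty list
            continue
        result[target_field] = chunk_by_characters(value, limit)
    return result


doc = {"name": "OpenSearch"}
field_map = {"body": "body_chunk"}

print(apply_text_chunking(doc, field_map, ignore_missing=True))
print(apply_text_chunking(doc, field_map, ignore_missing=False))
```

With ignore_missing set to true the document comes back unchanged; with false it gains an empty body_chunk list, matching the two examples above.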

What alternatives have you considered?

To my knowledge, there is no alternative to this

yuye-aws · 2024-09-14T01:11:28Z

Left a few review comments in #907

martin-gaievski · 2024-09-18T02:55:56Z

Can we change the field name to "skip_if_absent" or something of this sort? The problem with "ignore" is that it is ambiguous: it does not specify what will happen when the text is empty.

vibrantvarun · 2024-09-18T03:11:33Z

+1 to @martin-gaievski

IanMenendez · 2024-09-18T03:40:34Z

@martin-gaievski @vibrantvarun I do not think the field name "skip_if_absent" makes sense.

Many OpenSearch ingest processors already use the ignore_missing field name.

Examples:
https://opensearch.org/docs/latest/ingest-pipelines/processors/split/#configuration-parameters
https://opensearch.org/docs/latest/ingest-pipelines/processors/lowercase/#configuration-parameters
https://opensearch.org/docs/latest/ingest-pipelines/processors/dissect/#configuration-parameters

I prefer ignore_missing, to stay consistent with other ingest processors.

martin-gaievski · 2024-09-28T01:16:01Z

If other processors have a field with similar functionality then I agree, this name makes sense, although semantically it's not the best. Thanks for checking the config of other processors.

IanMenendez added enhancement untriaged labels Sep 13, 2024

This was referenced Sep 13, 2024

[Feature]: add ignore missing field to text chunking processors #907

Open

[Feature]: add ignore missing to text chunking processor opensearch-project/documentation-website#8266

Open

naveentatikonda removed the untriaged label Sep 18, 2024
