[FEATURE] Add ignore missing field to text chunking processor #906

Open
IanMenendez opened this issue Sep 13, 2024 · 5 comments

Comments

@IanMenendez
Contributor

What solution would you like?

Currently, if a document is ingested through a text chunking processor and the input field is null, the processor outputs an empty list. There is no way to skip the text chunking processor when the field does not exist.

The proposed solution is to add the ignore_missing field to text chunking processors.

If ignore_missing == true, fields that should be chunked but do not exist are skipped instead of producing an empty list.

Example:

Processor:

    {
      "text_chunking": {
        "ignore_missing": true,
        "field_map": {
          "body": "body_chunk"
        },
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "tokenizer": "letter"
          }
        }
      }
    }

Input:

    {
      "name": "OpenSearch"
    }

Output:

    {
      "name": "OpenSearch"
    }

If ignore_missing == false, the processor continues to work as it does today: fields that do not exist produce an empty list as output.

Processor:

    {
      "text_chunking": {
        "ignore_missing": false,
        "field_map": {
          "body": "body_chunk"
        },
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "tokenizer": "letter"
          }
        }
      }
    }

Input:

    {
      "name": "OpenSearch"
    }

Output:

    {
      "name": "OpenSearch",
      "body_chunk": []
    }

The default value would be ignore_missing = false
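
The two behaviors described above can be sketched in a few lines of Python. This is only an illustration of the proposed semantics, not the processor's actual implementation: `apply_text_chunking` and `fixed_length_chunks` are hypothetical names, and the toy chunker splits on whitespace rather than using the "letter" tokenizer from the example.

```python
from typing import Any


def fixed_length_chunks(text: str, token_limit: int) -> list[str]:
    """Toy stand-in for the real chunker: whitespace tokens, no overlap."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + token_limit])
        for i in range(0, len(tokens), token_limit)
    ]


def apply_text_chunking(
    doc: dict[str, Any],
    field_map: dict[str, str],
    token_limit: int = 10,
    ignore_missing: bool = False,  # proposed default
) -> dict[str, Any]:
    """Return a copy of doc with chunked output fields added."""
    result = dict(doc)
    for source, target in field_map.items():
        value = doc.get(source)
        if value is None:
            # Missing or null input field.
            if not ignore_missing:
                # Current behavior: write an empty list.
                result[target] = []
            # Proposed behavior with ignore_missing: leave the doc untouched.
            continue
        result[target] = fixed_length_chunks(value, token_limit)
    return result


doc = {"name": "OpenSearch"}
field_map = {"body": "body_chunk"}
skipped = apply_text_chunking(doc, field_map, ignore_missing=True)
emptied = apply_text_chunking(doc, field_map, ignore_missing=False)
```

Under these assumptions, `skipped` matches the first Output example and `emptied` matches the second.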

What alternatives have you considered?

To my knowledge, there is no alternative to this

@yuye-aws
Member

Left a few review comments in #907

@martin-gaievski
Member

Can we change the field name to "skip_if_absent" or something similar? The problem with "ignore" is that it is ambiguous: it does not specify what happens when the text is empty.

@vibrantvarun
Member

> Can we change the field name to "skip_if_absent" or something similar? The problem with "ignore" is that it is ambiguous: it does not specify what happens when the text is empty.

+1 to @martin-gaievski

@IanMenendez
Contributor Author

IanMenendez commented Sep 18, 2024

@martin-gaievski @vibrantvarun I do not think the field name "skip_if_absent" makes sense.

Many OpenSearch ingest processors already use the ignore_missing field name.

Examples:
https://opensearch.org/docs/latest/ingest-pipelines/processors/split/#configuration-parameters
https://opensearch.org/docs/latest/ingest-pipelines/processors/lowercase/#configuration-parameters
https://opensearch.org/docs/latest/ingest-pipelines/processors/dissect/#configuration-parameters

I prefer ignore_missing, to stay consistent with the other ingest processors.

@martin-gaievski
Member

If other processors have a field with similar functionality, then I agree that this name makes sense, although semantically it's not the best. Thanks for checking the config of other processors.

Projects
Status: Now(This Quarter)