
[BUG] ISM force_merge on datastream index #1255

Open
disaster37 opened this issue Sep 16, 2024 · 4 comments
Labels
bug (Something isn't working)

Comments


disaster37 commented Sep 16, 2024

What is the bug?

On OpenSearch 2.16.0

I created an ISM policy with a force_merge step that forces the index down to one segment after the datastream index has rolled over and moved to a warm node. The step always ends in a timeout.
After setting the ISM log level to DEBUG, I get the following log:

{"type": "json_logger", "timestamp": "2024-09-16T14:04:13,248Z", "level": "DEBUG", "component": "o.o.i.i.s.f.WaitForForceMergeStep", "cluster.name": "logmanagement2-rec", "node.name": "opensearch-data-os-2", "message": "Force merge still running on [.ds-logs-log-default-000617] with [2] shards containing unmerged segments", "cluster.uuid": "ZbghcuYqTtWRmCHMd4tbyw", "node.id": "cYyrcay5QPS7_zi0HxvyJg"  }

How can one reproduce the bug?

  1. Create a new OpenSearch cluster with hot and warm tiers
  2. Create an index template that allows creating datastream indices (a request sketch follows the JSON below):
{
  "index_patterns": [
    "logs-*"
  ],
  "priority": "500",
  "data_stream": {
    "timestamp_field": {
      "name": "@timestamp"
    }
  },
  "name": "template_log",
  "template": {}
}
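The template dump above looks like API output (the name field belongs in the URL path, not the body). A hypothetical request to create it, assuming a local unsecured endpoint:

# Create the composable index template; the template name goes in the URL
curl -s -X PUT "http://localhost:9200/_index_template/template_log" \
  -H 'Content-Type: application/json' \
  -d '{
    "index_patterns": ["logs-*"],
    "priority": 500,
    "data_stream": {
      "timestamp_field": { "name": "@timestamp" }
    },
    "template": {}
  }'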
  3. Create the datastream index logs-log-default (a request sketch follows the policy JSON below)
  4. Create the ISM policy:
{
    "id": "policy-log",
    "seqNo": 2848481,
    "primaryTerm": 23,
    "policy": {
        "policy_id": "policy-log",
        "description": "Policy for logs index",
        "last_updated_time": 1725961147454,
        "schema_version": 21,
        "error_notification": null,
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "rollover": {
                            "min_index_age": "1d",
                            "min_primary_shard_size": "5gb",
                            "copy_alias": false
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "warm",
                        "conditions": {
                            "min_index_age": "1d"
                        }
                    }
                ]
            },
            {
                "name": "warm",
                "actions": [
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "read_only": {}
                    },
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "allocation": {
                            "require": {
                                "temp": "warm"
                            },
                            "include": {},
                            "exclude": {},
                            "wait_for": false
                        }
                    },
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "index_priority": {
                            "priority": 50
                        }
                    },
                    {
                        "timeout": "1d",
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "force_merge": {
                            "max_num_segments": 1
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "delete",
                        "conditions": {
                            "min_index_age": "2d"
                        }
                    }
                ]
            },
            {
                "name": "delete",
                "actions": [
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "delete": {}
                    }
                ],
                "transitions": []
            }
        ],
        "ism_template": [
            {
                "index_patterns": [
                    "logs-log-*"
                ],
                "priority": 100,
                "last_updated_time": 1725961147454
            }
        ]
    }
}
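The policy dump above also looks like API output: id, seqNo, and primaryTerm are response metadata, and only the inner policy object is sent when creating the policy via PUT _plugins/_ism/policies/policy-log. With the ism_template in place, policy-log is auto-attached to new backing indices matching logs-log-*. Step 3 as a request, under the same endpoint assumption:

# Create the data stream explicitly (it is also created implicitly by
# indexing a first document into logs-log-default)
curl -s -X PUT "http://localhost:9200/_data_stream/logs-log-default"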

Wait for the force_merge step. The force_merge step always ends in a timeout.

What is the expected behavior?
Force merge runs successfully and results in one segment per shard.

What is your host/environment?

OpenSearch 2.16.0

disaster37 added the bug and untriaged labels Sep 16, 2024
disaster37 (Author) commented

I finally found the relevant log on the data node that hosts the last shard with unmerged segments:
"Caused by: java.io.IOException: No space left on device"

disaster37 (Author) commented

I think the force_merge step should estimate the target size and check whether there is sufficient space on the node.
And in any case, the step should fail with the "no space left on device" error instead of failing with an action timeout.
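One way to approximate such a check by hand: compare each shard's store size against the free space on its node, since a force merge rewrites segments and can temporarily need roughly the shard's size in extra headroom (a sketch, same endpoint assumption):

# Per-shard store size and host node for the stuck backing index
curl -s "http://localhost:9200/_cat/shards/.ds-logs-log-default-000617?v&h=index,shard,prirep,store,node"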

bharath-techie commented

@disaster37 did you try the explain API to get information about the policy failure?
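For reference, the ISM explain call for the backing index from the logs would look something like this (endpoint assumed as above):

# Show the managed index's current state, action, step status, and failure info
curl -s "http://localhost:9200/_plugins/_ism/explain/.ds-logs-log-default-000617?pretty"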

dblock removed the untriaged label Oct 7, 2024
dblock (Member) commented Oct 7, 2024

[Catch All Triage - 1, 2, 3, 4]
