Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Negative Search Stats #16598

Open
dzane17 opened this issue Nov 8, 2024 · 2 comments
Open

[BUG] Negative Search Stats #16598

dzane17 opened this issue Nov 8, 2024 · 2 comments
Assignees
Labels
bug Something isn't working Search Search query, autocomplete ...etc untriaged

Comments

@dzane17
Copy link
Contributor

dzane17 commented Nov 8, 2024

Describe the bug

We have identified a bug where search statistics are being incorrectly reported as negative, which is causing failures when accessing the nodes/stats API. This issue is specifically related to the handling of negative values in the SearchStats class. The exception occurs when the API attempts to serialize the statistics, causing a java.lang.IllegalStateException due to the use of VLong encoding, which does not support negative values.

Example Response from nodes/stats API:

{
    "_nodes": {
        "total": 6,
        "successful": 5,
        "failed": 1,
        "failures": [
            {
                "type": "failed_node_exception",
                "reason": "Failed node [gxu1I2AFSo2Co-gL3Cq14A]",
                "node_id": "gxu1I2AFSo2Co-gL3Cq14A",
                "caused_by": {
                    "type": "illegal_state_exception",
                    "reason": "Negative longs unsupported, use writeLong or writeZLong for negative numbers [-3]"
                }
            }
        ]
    },
...

Exception in elasticsearch.log:

[2024-10-16T19:55:29,039][WARN ][o.o.t.OutboundHandler    ] [8da3ab95536131cd3252613bab350bd7] send message failed [channel: Netty4TcpChannel{localAddress=/172.16.29.155:9300, remoteAddress=/172.16.29.122:40652}]
java.lang.IllegalStateException: Negative longs unsupported, use writeLong or writeZLong for negative numbers [-140]
    at org.opensearch.common.io.stream.StreamOutput.writeVLong(StreamOutput.java:307)
    at org.opensearch.index.search.stats.SearchStats$PhaseStatsLongHolder.writeTo(SearchStats.java:88)
    at org.opensearch.index.search.stats.SearchStats$Stats.lambda$writeTo$0(SearchStats.java:393)
    at org.opensearch.common.io.stream.StreamOutput.writeMap(StreamOutput.java:637)
    at org.opensearch.index.search.stats.SearchStats$Stats.writeTo(SearchStats.java:390)
    at org.opensearch.index.search.stats.SearchStats.writeTo(SearchStats.java:588)
    at org.opensearch.common.io.stream.StreamOutput.writeOptionalWriteable(StreamOutput.java:959)
    at org.opensearch.action.admin.indices.stats.CommonStats.writeTo(CommonStats.java:273)
    at org.opensearch.indices.NodeIndicesStats.writeTo(NodeIndicesStats.java:207)
    at org.opensearch.action.admin.cluster.node.stats.NodeStats.writeTo(NodeStats.java:429)
    at org.opensearch.transport.OutboundMessage.writeMessage(OutboundMessage.java:104)
    at org.opensearch.transport.OutboundMessage.serialize(OutboundMessage.java:81)
    at org.opensearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:234)
    at org.opensearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:220)

Root Cause

The issue occurs due to the use of StreamOutput.writeVLong() in the SearchStats class, which does not support negative long values. The stack trace shows that when a negative value is encountered, it triggers the exception in elasticsearch.log shown above.

The problematic line in the code is:

out.writeVLong(current);

Source: SearchStats.java#L89

Related component

Search

To Reproduce

This bug does not occur consistently across all domains or search queries. We are still investigating the specific conditions or types of searches that trigger this issue.

Expected behavior

Search stats should not be negative.

Additional Details

Plugins
Please list all plugins currently enabled.

Screenshots
If applicable, add screenshots to help explain your problem.

Host/Environment (please complete the following information):

  • OS: [e.g. iOS]
  • Version [e.g. 22]

Additional context
Add any other context about the problem here.

@dzane17 dzane17 added bug Something isn't working untriaged labels Nov 8, 2024
@dzane17 dzane17 self-assigned this Nov 8, 2024
@github-actions github-actions bot added the Search Search query, autocomplete ...etc label Nov 8, 2024
@dbwiddis
Copy link
Member

dbwiddis commented Nov 8, 2024

Probably introduced in #8386

Similar issue #13343

@dbwiddis
Copy link
Member

dbwiddis commented Nov 8, 2024

The current stat can be decremented in two places, on phase end (success):

and on phase failure:

phaseStatsMap.get(context.getCurrentPhase().getSearchPhaseName()).current.dec();

I looked through and didn't find any cases where they might both be called.

However, I did notice that current stat is stored in an unsynchronized map:

Map<String, PhaseStatsLongHolder> requestStatsHolder = new HashMap<>();

And the key is just the phase name, which are common strings:

public enum SearchPhaseName {
DFS_PRE_QUERY("dfs_pre_query"),
QUERY("query"),
FETCH("fetch"),
DFS_QUERY("dfs_query"),
EXPAND("expand"),
CAN_MATCH("can_match");

But digging in to the increments and decrements they use atomic operations so that shouldn't be an issue...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Search Search query, autocomplete ...etc untriaged
Projects
Status: 🆕 New
Development

No branches or pull requests

2 participants