[RFC] Log pattern support in OpenSearch #16627

Open
songkant-aws opened this issue Nov 13, 2024 · 1 comment
Labels: enhancement, Libraries, untriaged

songkant-aws commented Nov 13, 2024

Is your feature request related to a problem? Please describe

Today, OpenSearch supports the Grok and Patterns operators in PPL, which use a regex to strip characters from a log message and treat what remains as the message's pattern. By default, a very simple rule is applied: every character matching [a-zA-Z\d] (letters and digits) is removed. For example, [email protected] and [email protected] may be grouped together because, after processing, both have the same pattern, @. This simple approach has low grouping accuracy, because different log statements can contain the same combination of punctuation, and the generated pattern is hard for humans to read. Achieving better grouping accuracy requires an expert with domain knowledge to craft a suitable regex case by case.
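To make the collision concrete, here is a minimal sketch of the default rule as described above. It assumes the rule is nothing more than deleting every `[a-zA-Z\d]` character; the function name and the sample strings are hypothetical and are not taken from the PPL implementation.

```python
import re

# Assumption: the default rule only deletes characters matching [a-zA-Z\d].
ALNUM = re.compile(r"[a-zA-Z\d]")


def simple_pattern(message: str) -> str:
    """Return what is left of a message after removing letters and digits."""
    return ALNUM.sub("", message)


# Hypothetical inputs, used only for illustration.
print(simple_pattern("alice@example.com"))  # @.
print(simple_pattern("bob42@test.org"))     # @.

# Two unrelated statements collapse to the same punctuation-only pattern.
print(simple_pattern("user=bob id=7") == simple_pattern("size=10 mode=fast"))  # True
```

The last line shows the accuracy problem: the two messages come from different statements, yet only `= =` survives in both.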

Automatic log pattern extraction is a popular trend in industrial log analysis. For example, Sumo Logic has a logreduce operator that groups log messages based on string and pattern similarity. Ideally, log pattern functionality should process a stream of semi-structured log messages and identify, for each message, which tokens are constants and which are variables. For example, the log messages [proxy.cse.cuhk.edu.hk:5070 open through proxy proxy.cse.cuhk.edy.hk:5070 HTTPS, proxy.cse.cuhk.edu.hk:5171 open through p3p.sogou.com:80 SOCKS, proxy.cse.cuhk.edu.hk:5070 open through proxy 182.254.114.110:80 HTTPS, proxy.cse.cuhk.edu.hk:5172 open through proxy socks.cse.cuhl.edu.hk:5070 SOCKS] could yield two common patterns: <*> open through proxy <*> HTTPS and <*> open through proxy <*> SOCKS.

I propose creating a new module in OpenSearch that hosts several log parsing algorithms for extracting common patterns from a stream of log messages, so that other components such as DSL or the SQL/PPL plugin can build their own operators on top of them. Please share your thoughts on whether this is a good idea.

Describe the solution you'd like

The proposal is to first create a new module, such as org.opensearch.patterns, in milestone 1, similar to org.opensearch.grok. The goal of this module is to act as a library of multiple log parsing algorithms.

In milestone 2, import the algorithms into other plugins such as opensearch-skills, migrating their existing simple log pattern logic to the advanced algorithms.

In milestone 3, implement a new operator in the SQL/PPL plugin based on suitable algorithms.

In milestone 4, since grouping log patterns can be treated as a special kind of aggregation, support a log pattern aggregator (the reduce part) in OpenSearch DSL or pipelines.

Related component

Libraries

Describe alternatives you've considered

Today, for performance reasons, DSL or PPL may return only up to 10,000 results by default (MAX_RESULT_WINDOW_SETTING). For this volume of data, it is probably enough to extract common log patterns on the Coordinator Node first.

Instead of applying the algorithms only in the aggregator (reduce) part, we could support partial log pattern aggregation at the DataNode level over all filtered documents, which could amount to millions of log messages. Given the heavy effort involved, we want to prioritize grouping log patterns on the Coordinator Node.

Additional context

Assumptions

  1. Based on industrial experience, IP addresses, URLs, numbers, and special software ids such as process ids are known variable tokens. In the preprocessing step, every algorithm applies a default regex to mask those known variable tokens and default delimiters to split messages into tokens. Users with deep domain knowledge can also pass a customized regex and delimiters to improve this preprocessing.
  2. Log messages generated by the same log statement usually have the same number of tokens after splitting.
  3. Constant tokens have high frequencies at the same token position when the same log statement is logged many times.
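The preprocessing in assumption 1 could look roughly like the sketch below. The regexes, the default delimiter, and the `preprocess` signature are all hypothetical choices made for illustration; the RFC does not specify the actual defaults.

```python
import re

WILDCARD = "<*>"

# Hypothetical default rules; a real module would likely ship a longer list.
DEFAULT_VARIABLE_PATTERNS = [
    r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b",  # IPv4 address with optional port
    r"\b0x[0-9a-fA-F]+\b",                    # hexadecimal ids
    r"\b\d+\b",                               # plain numbers
]
DEFAULT_DELIMITERS = r"\s+"  # hypothetical default: split on whitespace


def preprocess(message, variable_patterns=None, delimiters=DEFAULT_DELIMITERS):
    """Mask known variable tokens, then split the message into tokens."""
    patterns = DEFAULT_VARIABLE_PATTERNS if variable_patterns is None else variable_patterns
    for pattern in patterns:
        message = re.sub(pattern, WILDCARD, message)
    # Drop empty strings produced by leading or trailing delimiters.
    return [token for token in re.split(delimiters, message) if token]


print(preprocess("Connected to 10.0.0.5:9200 in 35 ms"))
# ['Connected', 'to', '<*>', 'in', '<*>', 'ms']

# A user with domain knowledge can override the delimiters.
print(preprocess("a=1,b=2", delimiters=r"[,=]"))
# ['a', '<*>', 'b', '<*>']
```

Masking before splitting keeps a value such as an address with a port as a single token, which helps assumption 2 (same token count per statement) hold.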

Design Considerations

We ran several algorithms, along with the existing OpenSearch simple log pattern algorithm, on the open-source benchmark logparser to compare their grouping accuracy. The benchmark uses 16 industrial software datasets from loghub. We also compared time and space cost across different volumes of log data, running 10 iterations per measurement to compute the mean completion time in seconds and the average memory cost in MB.

Grouping Accuracy

The following box plot shows the distribution of each algorithm's grouping accuracy across the 16 industrial log datasets. The OpenSearch simple log pattern approach is not as competitive as the others. The three most accurate algorithms, in order, are Brain > Drain > AEL.
[image: box plot of grouping accuracy per algorithm across the 16 datasets]

Time Complexity

Overall, the time complexity of all three top algorithms is bounded by O(n), where n is the number of log lines. Brain is the fastest algorithm on the 4 selected datasets.
[4 images: running time charts]

Space Complexity

Overall, the space complexity of all three top algorithms is bounded by O(n * L), where n is the number of log lines and L is the average number of tokens per log message. Brain uses up to twice the memory of the other two algorithms.
[4 images: memory cost charts]

Preferred Algorithm

Although the Brain algorithm has a larger memory cost, it processes different volumes of log data very efficiently and has the highest grouping accuracy with the lowest variance. It will be the first algorithm to be implemented.

Algorithm Introduction

  1. After the preprocessing step, the algorithm's input is a stream of token lists such as [[token01, token02, token03, ..], [token11, token02, token03, ..], ...].
  2. The algorithm computes token frequencies per column over the entire input, like a histogram of words at each column position. Each token is then represented as a frequency vector of the form <frequency, token, position>. For example, with the sample input from step 1, token02 and token03 are represented as (2, token02, 1) and (2, token03, 2).
  3. An initial log pattern is formed from the tokens that share the highest frequency within a log message. For example, <*> token02 token03 ... is the initial log pattern given the result of step 2.
  4. The algorithm maintains a bidirectional tree and applies heuristic rules to decide how the remaining tokens in the same log message contribute to the final log pattern.
  5. The final log patterns are generated by traversing the tree.
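Steps 1 to 3 can be sketched in a few lines. This is a simplified illustration under the assumption that "highest frequency" means the maximum column frequency among a message's tokens; it deliberately omits the bidirectional tree and heuristics of steps 4 and 5, so it is not the full Brain algorithm. The function names are hypothetical.

```python
from collections import Counter

WILDCARD = "<*>"


def column_frequencies(token_lists):
    """Step 2: count how often each token occurs at each column position."""
    return Counter(
        (position, token)
        for tokens in token_lists
        for position, token in enumerate(tokens)
    )


def initial_pattern(tokens, frequencies):
    """Step 3: keep tokens sharing the message's highest frequency, mask the rest."""
    if not tokens:
        return ""
    counts = [frequencies[(position, token)] for position, token in enumerate(tokens)]
    highest = max(counts)
    return " ".join(
        token if count == highest else WILDCARD
        for token, count in zip(tokens, counts)
    )


# Step 1: the sample input from the list above.
logs = [
    ["token01", "token02", "token03"],
    ["token11", "token02", "token03"],
]
frequencies = column_frequencies(logs)
print(frequencies[(1, "token02")], frequencies[(2, "token03")])  # 2 2
print(initial_pattern(logs[0], frequencies))  # <*> token02 token03
```

On this input the sketch reproduces the vectors (2, token02, 1) and (2, token03, 2) and the initial pattern given in step 3.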

Implementation Proposal

The benefit of a separate module is that it provides general algorithm implementations and interfaces that can run on any type of compute resource, whether a DataNode, a CoordinatorNode, or an ML Node. It is also agnostic to the query language.
This section briefly discusses how grouping log patterns could be implemented in the OpenSearch SearchService.

Phase 1

In phase 1, we will group log patterns only on the Coordinator node, based on the results returned from DataNodes. This is reasonable because the MAX_RESULT_WINDOW limit caps the number of search results returned from DataNodes, and the algorithm handles a data volume of 10,000 at low cost. An example component layout is shown below:

[image: Phase 1 component diagram]

Resource Isolation and Circuit Breakers

Grouping log patterns on a single Coordinator node adds memory and CPU pressure. Although this is not a frequent type of query, it is still better to apply a lightweight circuit breaker that checks memory usage and cancels the search request early.
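The idea of an early memory check can be illustrated with the sketch below. It does not use OpenSearch's circuit breaker service; the `SimpleBreaker` class, the byte estimate, and the limit are hypothetical stand-ins that only show where such a check would sit in the tokenizing loop.

```python
class MemoryLimitExceeded(RuntimeError):
    """Raised when the estimated memory for pattern grouping passes the limit."""


class SimpleBreaker:
    """Hypothetical breaker that tracks a running estimate of bytes in use."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def add_estimate(self, n_bytes):
        # Check before accounting so the request fails before allocating more.
        if self.used_bytes + n_bytes > self.limit_bytes:
            raise MemoryLimitExceeded(
                f"estimated {self.used_bytes + n_bytes} bytes exceeds limit {self.limit_bytes}"
            )
        self.used_bytes += n_bytes


def tokenize_with_breaker(messages, breaker):
    """Tokenize messages, charging a rough per-message estimate to the breaker."""
    token_lists = []
    for message in messages:
        # Rough assumption: cost is proportional to the UTF-8 size of the message.
        breaker.add_estimate(len(message.encode("utf-8")))
        token_lists.append(message.split())
    return token_lists


breaker = SimpleBreaker(limit_bytes=10)
try:
    tokenize_with_breaker(["abcde", "fghij", "k"], breaker)
except MemoryLimitExceeded as error:
    print("request cancelled:", error)
```

Charging the estimate before each message is tokenized lets the request be cancelled as soon as the budget would be exceeded, rather than after the full result set has been processed.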

Phase 2

In phase 2, we could push log pattern computation down to the query phase so that each DataNode computes a partial result over a larger volume of data. Since the Brain algorithm requires a global histogram, distributed computation needs two map-reduce passes. The global histogram generated in the first pass needs to be dispatched to the Data Nodes for the second pass. The initial idea is illustrated in the following diagram:
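The two passes can be sketched as plain functions over in-memory "shards". This is an illustration only: the function names are hypothetical, the pattern rule is the simplified highest-frequency rule rather than the full Brain algorithm, and no OpenSearch transport or search API is involved.

```python
from collections import Counter

WILDCARD = "<*>"


def local_histogram(shard_logs):
    """Pass 1, map: per-shard token frequencies by column position."""
    return Counter(
        (position, token)
        for tokens in shard_logs
        for position, token in enumerate(tokens)
    )


def merge_counters(counters):
    """Reduce step used by both passes: sum the partial counters."""
    total = Counter()
    for counter in counters:
        total.update(counter)
    return total


def local_patterns(shard_logs, histogram):
    """Pass 2, map: per-shard pattern counts computed against a histogram."""
    patterns = Counter()
    for tokens in shard_logs:
        if not tokens:
            continue
        counts = [histogram[(position, token)] for position, token in enumerate(tokens)]
        highest = max(counts)
        pattern = " ".join(
            token if count == highest else WILDCARD
            for token, count in zip(tokens, counts)
        )
        patterns[pattern] += 1
    return patterns


# Two hypothetical shards holding already tokenized log messages.
shards = [
    [["a", "open", "HTTPS"], ["b", "open", "HTTPS"]],
    [["c", "open", "SOCKS"]],
]

global_histogram = merge_counters(local_histogram(shard) for shard in shards)
merged = merge_counters(local_patterns(shard, global_histogram) for shard in shards)
print(merged)  # Counter({'<*> open <*>': 3})

# Without the global histogram, the single-message shard masks nothing.
print(local_patterns(shards[1], local_histogram(shards[1])))  # Counter({'c open SOCKS': 1})
```

The last line shows why the first pass is needed: a shard that sees only one message cannot tell constants from variables using its local frequencies alone.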

[image: Phase 2 two-pass map-reduce diagram]

songkant-aws added the enhancement and untriaged labels Nov 13, 2024
github-actions bot added the Libraries label Nov 13, 2024
dbwiddis (Member) commented

I absolutely love this proposal.

I recently added a split search response processor and thought that the power of regex could really improve that capability. I envisioned the possibilities of regex and realized the Grok processor existed in ingest, and filed a feature request for a Grok Search Response Processor that I thought I would eventually get to.

But this proposal is better.

Yes, yes, yes, let's optimize common log patterns.
