[RFC] Log pattern support in OpenSearch #16627
Labels
enhancement, Libraries, untriaged
Is your feature request related to a problem? Please describe
Today, OpenSearch supports the Grok and Patterns operators in PPL, which use regex to strip characters and generate a log message's pattern. By default, the Patterns operator applies a very simple rule that just removes characters matching `[a-zA-Z\d]` (letters and digits). For example, [email protected] and [email protected] are potentially grouped as the same pattern, because after processing their patterns are both `@.`.
This simple approach has low grouping accuracy, because different log statements can share the same combination of punctuation, and the generated pattern is not friendly to human readers. To achieve better grouping accuracy, an expert with domain knowledge has to manually apply a suitable regex case by case.
Automatically extracting log patterns is a popular trend in industrial log analysis. Industrial products like Sumo Logic offer a LogReduce operator that groups log messages based on string and pattern similarity. Ideally, a good log pattern feature should process a stream of semi-structured log messages and identify, for each message, which tokens are constant words and which are variables. For example, the list of log messages
[proxy.cse.cuhk.edu.hk:5070 open through proxy proxy.cse.cuhk.edy.hk:5070 HTTPS, proxy.cse.cuhk.edu.hk:5171 open through p3p.sogou.com:80 SOCKS, proxy.cse.cuhk.edu.hk:5070 open through proxy 182.254.114.110:80 HTTPS, proxy.cse.cuhk.edu.hk:5172 open through proxy socks.cse.cuhl.edu.hk:5070 SOCKS]
could have two common patterns: `<*> open through proxy <*> HTTPS` and `<*> open through proxy <*> SOCKS`.
I propose creating a new module in OpenSearch that adds several log parsing algorithms for extracting common log patterns from a stream of log messages, so that other components such as DSL or the SQL/PPL plugin can build their own operators on top of them. Please share your thoughts on whether this is a good or a bad idea.
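As a minimal, hypothetical illustration of the current simple rule described above (the email addresses below are made-up placeholders, and this is not the actual Grok/Patterns implementation), stripping letters and digits collapses unrelated messages into the same punctuation-only pattern:

```java
public class SimplePatternDemo {

    /** Simplified stand-in for the current rule: drop characters matching [a-zA-Z\d], keep the rest. */
    static String simplePattern(String message) {
        return message.replaceAll("[a-zA-Z\\d]", "");
    }

    public static void main(String[] args) {
        // Two unrelated addresses reduce to the same punctuation-only pattern "@."
        System.out.println(simplePattern("alice@example.com")); // @.
        System.out.println(simplePattern("bob@example.org"));   // @.
    }
}
```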
Describe the solution you'd like
The proposal is split into milestones:
Milestone 1: Create a new module such as `org.opensearch.patterns`, similar to `org.opensearch.grok`. The goal of this module is to act as a library of multiple log parsing algorithms (a rough interface sketch is given after this list).
Milestone 2: Import the algorithms into other plugins like opensearch-skills to migrate the existing simple log pattern implementation to the advanced algorithms.
Milestone 3: Implement a new operator in the SQL/PPL plugin based on suitable algorithms.
Milestone 4: Since grouping log patterns can be treated as a special aggregation, support a log pattern aggregator (reduce) in OpenSearch DSL or pipelines.
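To make milestone 1 concrete, a rough sketch of what the module's entry point could look like is shown below. This is only an illustration under assumed names (`LogParser`, `Algorithm`), not an agreed API:

```java
package org.opensearch.patterns;

import java.util.List;
import java.util.Map;

/**
 * Hypothetical entry point for the proposed log pattern library.
 * Implementations could wrap Brain, Drain, AEL, or the existing simple rule.
 */
public interface LogParser {

    /** Identifiers for the candidate algorithms; callers (PPL, DSL, other plugins) pick one. */
    enum Algorithm { SIMPLE, AEL, DRAIN, BRAIN }

    /**
     * Groups raw log messages by their extracted template,
     * e.g. "<*> open through proxy <*> HTTPS" -> [matching messages].
     */
    Map<String, List<String>> parse(List<String> logMessages);
}
```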
Related component
Libraries
Describe alternatives you've considered
Today, for performance reasons, DSL and PPL return at most 10,000 results by default (MAX_RESULT_WINDOW_SETTING). For this volume of data, it is probably enough to extract common log patterns on the coordinator node first.
Instead of only applying the algorithms in the aggregator (reduce) part, we could support partial log pattern aggregation at the data node level over all filtered documents, which could be millions of log messages. Considering the heavy work effort involved, we want to prioritize grouping log patterns on the coordinator node.
Additional context
Assumptions
Design Considerations
We ran several algorithms, together with the existing OpenSearch simple log pattern algorithm, on an open-source benchmark called logparser to compare their log grouping accuracy. The benchmark contains 16 industrial software datasets from loghub. We also compared time and space complexity across different volumes of log data, using 10 iterations to compute the mean finish time in seconds and the average memory cost in MB.
Grouping Accuracy
The following graph shows each algorithm's grouping accuracy percentiles as a box plot across the 16 industrial log datasets. We observed that the OpenSearch simple log pattern approach is not as competitive as the others. The three most accurate algorithms are Brain > Drain > AEL.
Time Complexity
Overall, the time complexity of the top 3 algorithms is bounded by O(n), where n is the number of log lines. Brain is the fastest algorithm on the 4 selected datasets.
Space Complexity
Overall, the space complexity of the top 3 algorithms is bounded by O(n * L), where n is the number of log lines and L is the average number of tokens per log message. Brain has up to twice the memory cost of the other two algorithms.
Preferred Algorithm
Although the Brain algorithm has a larger memory cost, it has excellent time efficiency across different volumes of log data and the highest grouping accuracy with the lowest variance. It will be the first priority to implement.
Algorithm Introduction
`<*> token02 token03 ...` is the initial log pattern based on the second step's result.
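As a simplified, hypothetical sketch of the general idea (not the actual Brain implementation): tokens that are identical across all messages in a group are kept as constants, and everything else becomes the `<*>` wildcard.

```java
import java.util.Arrays;
import java.util.List;

public class InitialPatternSketch {

    /**
     * Simplified illustration only (not the full Brain algorithm): assumes all messages
     * in the group have the same token count; a token is kept when it is identical
     * across every message at that position, otherwise it is replaced with "<*>".
     */
    static String initialPattern(List<String> messages) {
        String[][] tokenized = messages.stream().map(m -> m.split("\\s+")).toArray(String[][]::new);
        String[] template = tokenized[0].clone();
        for (String[] tokens : tokenized) {
            for (int i = 0; i < template.length; i++) {
                if (!template[i].equals(tokens[i])) {
                    template[i] = "<*>";
                }
            }
        }
        return String.join(" ", template);
    }

    public static void main(String[] args) {
        // Inputs adapted from the proxy example earlier in this issue.
        List<String> group = Arrays.asList(
            "proxy.cse.cuhk.edu.hk:5070 open through proxy 182.254.114.110:80 HTTPS",
            "proxy.cse.cuhk.edu.hk:5171 open through proxy p3p.sogou.com:80 HTTPS"
        );
        System.out.println(initialPattern(group)); // <*> open through proxy <*> HTTPS
    }
}
```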
Implementation Proposal
The benefit of creating a separate module is that it provides general algorithm implementations and interfaces that can run on any type of compute resource, whether a data node, coordinator node, or ML node, and it is agnostic to the declarative language.
This section briefly discusses how we could implement log pattern grouping in the OpenSearch SearchService.
Phase 1
In phase 1, we will prioritize grouping log patterns only on the coordinator node, based on the results returned from the data nodes, since the MAX_RESULT_WINDOW limit bounds the number of search results returned from data nodes and the algorithm has a low cost for a data volume of 10,000. An example component is shown as follows:
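A minimal sketch of the phase 1 flow, assuming the hypothetical `LogParser` interface sketched earlier is applied on the coordinator node over the already-fetched hits; the wiring into SearchService is not shown and all names are placeholders:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CoordinatorPatternGrouping {

    /**
     * Phase 1 idea: grouping runs only on the coordinator node, over the log messages
     * already returned by the data nodes, so no query-phase changes are required.
     * LogParser is the hypothetical interface from the proposed org.opensearch.patterns module.
     */
    static Map<String, Long> groupFetchedHits(List<String> fetchedLogMessages,
                                              org.opensearch.patterns.LogParser parser) {
        return parser.parse(fetchedLogMessages).entrySet().stream()
            .collect(Collectors.toMap(Map.Entry::getKey, e -> (long) e.getValue().size()));
    }
}
```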
Resource Isolation and Circuit Breakers
Applying log pattern grouping on a single coordinator node adds extra memory and CPU pressure. Although it is not a frequent query, it is still better to apply a quick circuit breaker check on memory usage so that the search request can be cancelled early.
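A rough sketch of the circuit breaker idea, assuming the existing request circuit breaker is reused; the package location of `CircuitBreaker` differs across OpenSearch versions, and the size estimate below is a naive placeholder rather than a measured value:

```java
// The CircuitBreaker package path varies by OpenSearch version
// (org.opensearch.common.breaker vs. org.opensearch.core.common.breaker).
import org.opensearch.core.common.breaker.CircuitBreaker;

import java.util.List;

public class PatternGroupingBreaker {

    /**
     * Before grouping, reserve a rough memory estimate on the request breaker so an
     * oversized request trips early instead of pressuring the coordinator node's heap.
     * The 2x factor is an assumed guess for tokenization overhead.
     */
    static void checkMemoryBudget(CircuitBreaker requestBreaker, List<String> messages) {
        long estimatedBytes = messages.stream().mapToLong(m -> 2L * m.length()).sum();
        requestBreaker.addEstimateBytesAndMaybeBreak(estimatedBytes, "log_pattern_grouping");
        // Remember to release the reservation with addWithoutBreaking(-estimatedBytes)
        // once grouping completes or fails.
    }
}
```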
Phase 2
In phase 2, we could push log pattern extraction down to the query phase so that data nodes can compute partial results over larger volumes of data. Since the Brain algorithm requires a global histogram, it needs two passes of map-reduce for distributed computation: the global histogram generated in the first pass needs to be dispatched to the data nodes for the second-pass query. The initial idea is illustrated in the following graph:
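A conceptual sketch of the two passes; all type and method names are placeholders to illustrate the data flow only, not an existing OpenSearch API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Conceptual two-pass flow for pushing Brain-style pattern extraction down to data nodes.
 */
public class TwoPassPatternAggregation {

    /** Pass 1 (map): each data node builds a local token-frequency histogram over its shard's logs. */
    static Map<String, Long> localHistogram(List<String> shardLogs) {
        Map<String, Long> counts = new HashMap<>();
        for (String log : shardLogs) {
            for (String token : log.split("\\s+")) {
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    /** Pass 1 (reduce): the coordinator merges shard histograms into the global histogram. */
    static Map<String, Long> mergeHistograms(List<Map<String, Long>> shardHistograms) {
        Map<String, Long> global = new HashMap<>();
        shardHistograms.forEach(h -> h.forEach((token, count) -> global.merge(token, count, Long::sum)));
        return global;
    }

    // Pass 2: the merged global histogram is dispatched back to the data nodes, which extract
    // partial pattern groups with it; the coordinator then merges those partial groups.
}
```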