---
layout: post
title: "Announcing Data Prepper 2.0.0"
authors:
- dlv
- oeyh
date: 2022-10-10 15:00:00 -0500
categories:
- technical-post
---

The Data Prepper maintainers are proud to announce the release of Data Prepper 2.0. Based on feedback from our users,
this release makes Data Prepper easier to use and helps you improve your observability stack. Data Prepper 2.0 retains
compatibility with all current versions of OpenSearch.

Here are some of the major changes and enhancements made for Data Prepper 2.0.

## Conditional routing

Data Prepper 2.0 now supports conditional routing to help pipeline authors send different logs to specific OpenSearch clusters.

One common use case this supports is reducing the volume of data going to some clusters.
When you want info logs that produce large volumes of data to go to a cluster or index with more frequent rollovers or
deletions to clear out these large volumes, you can now configure pipelines to route your data.
Simply pick a name appropriate for the domain and a Data Prepper expression. Then, for any sink that should only have
some data coming through, define one or more routes to apply. Data Prepper will evaluate
these expressions for each event to determine which sinks to route that event to. Any sink that has no routes defined will accept all events.

For example, consider an application that produces log data. A typical Java application log might look like the following.

```
2022-10-10T10:10:10,421 [main] INFO org.example.Application - Saving 10 records to SQL table "orders"
```

The text that reads `INFO` indicates that this is an INFO-level log. Data Prepper pipeline authors can now route logs with this level to only certain OpenSearch clusters.

The following example pipeline takes application logs from the `http` source. This source
accepts log data from external sources such as Fluent Bit.

The pipeline then uses the `grok` processor to split the log line into multiple fields. The `grok` processor adds a
field named `loglevel` to the event. Pipeline authors can use that field in routes.

This pipeline has two OpenSearch sinks. The first sink only receives logs with a log level of `WARN` or `ERROR`.
Data Prepper will route all events to the second sink.

```
application-log-pipeline:
  workers: 4
  delay: "50"
  source:
    http:
  processor:
    - grok:
        match:
          log: [ "%{NOTSPACE:time} %{NOTSPACE:thread} %{NOTSPACE:loglevel} %{NOTSPACE:class} - %{GREEDYDATA:message}" ]

  route:
    - warn_and_above: '/loglevel == "WARN" or /loglevel == "ERROR"'
  sink:
    - opensearch:
        routes:
          - warn_and_above
        hosts: ["https://opensearch:9200"]
        insecure: true
        username: "admin"
        password: "admin"
        index: warn-and-above-logs
    - opensearch:
        hosts: ["https://opensearch:9200"]
        insecure: true
        username: "admin"
        password: "admin"
        index: all-logs
```

There are many other use cases that conditional routing can support. If there are other conditional expressions
you’d like to see support for, please create an issue in GitHub.

## Peer forwarding

Data Prepper 2.0 introduces peer forwarding as a core feature.

Before Data Prepper 2.0, performing stateful trace aggregations required using the peer-forwarder processor plugin.
But this plugin only worked for traces and would send data back to the source. Also, log aggregations only worked on a
single node.

With peer forwarding as a core feature, pipeline authors can perform stateful
aggregations on multiple Data Prepper nodes. When performing stateful aggregations, Data Prepper uses a hash ring to determine
which nodes are responsible for processing different events based on the values of certain fields. Peer forwarder
routes events to the node responsible for processing the event. That node then holds all the state necessary for performing the aggregation.
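
To give a concrete sense of the kind of stateful work this enables, here is a minimal sketch of an `aggregate` processor whose identification keys determine which node handles each event. The key names and surrounding pipeline are illustrative, not part of the release example.

```
processor:
  - aggregate:
      # Events with the same sourceIp/destinationIp pair are forwarded to
      # and aggregated on the same Data Prepper node.
      identification_keys: ["sourceIp", "destinationIp"]
      action:
        remove_duplicates:
```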

To use peer forwarding, configure how Data Prepper discovers other nodes and the security for connections in your
`data-prepper-config.yaml` file.

In the following example, Data Prepper discovers other peers by using a DNS query on the `my-data-prepper-cluster.production` domain.
When using peer forwarder with DNS, the DNS record should be an A record with a list of IP addresses for peers. The example also uses a custom certificate and private key.
For host verification, it checks the fingerprint of the certificate. Lastly, it configures each server to authenticate requests using
mutual TLS (mTLS) to prevent data tampering.


```
peer_forwarder:
  discovery_mode: dns
  domain_name: "my-data-prepper-cluster.production"
  ssl_certificate_file: /usr/share/data-prepper/config/my-certificate.crt
  ssl_key_file: /usr/share/data-prepper/config/my-certificate.key
  ssl_fingerprint_verification_only: true
  authentication:
    mutual_tls:
```


## Directory structure

Before the release of Data Prepper 2.0, we distributed Data Prepper as a single executable JAR file. While convenient,
it made it difficult for us to include custom plugins.

We now distribute Data Prepper 2.0 in a bundled directory structure. This structure features a shell script to launch
Data Prepper and dedicated subdirectories for JAR files, configurations, pipelines, logs, and more.

```
data-prepper-2.0.0/
  bin/
    data-prepper                  # Shell script to run Data Prepper
  config/
    data-prepper-config.yaml      # The Data Prepper configuration file
    log4j.properties              # Logging configuration
  pipelines/                      # New directory for pipelines
    trace-analytics.yaml
    log-ingest.yaml
  lib/
    data-prepper-core.jar
    ... any other jar files
  logs/
```

You can now launch Data Prepper by running `bin/data-prepper`, with no need for additional command line arguments or Java system
property definitions. Instead, the application loads configurations from the `config/` subdirectory.

Data Prepper 2.0 reads pipeline configurations from the `pipelines/` subdirectory. You can now define pipelines across
multiple YAML files in the subdirectory, where each file contains the definition for one or more pipelines. The directory
also helps keep pipeline definitions distinct and, therefore, more compact and focused.
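
For example, a file such as `pipelines/log-ingest.yaml` from the directory listing above might hold one small pipeline on its own. The pipeline contents below are only an illustrative sketch, not part of the release.

```
log-ingest-pipeline:
  source:
    http:
  sink:
    - opensearch:
        # Connection details (credentials, TLS) omitted for brevity.
        hosts: ["https://opensearch:9200"]
        index: all-logs
```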

## JSON & CSV parsing

Many of our users have incoming data with embedded JSON or CSV fields. To help with these use cases, Data Prepper 2.0
supports parsing JSON and CSV.

For example, when one large object includes a serialized JSON string, you can use the `parse_json` processor to extract
the fields from the JSON string into your event.
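
A minimal sketch of how this might look in a pipeline, assuming the serialized JSON arrives in a field named `message` (the field name and sample values are illustrative):

```
processor:
  # Parses the serialized JSON in the "message" field and promotes its
  # keys (for example "status" and "path") to top-level event fields.
  - parse_json:
      source: "message"
```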

Data Prepper can now import CSV- or TSV-formatted files from Amazon S3 sources. This is useful for systems, such as Amazon CloudFront,
that write their access logs as TSV files. Now you can parse these logs using Data Prepper.

Additionally, if your events have
CSV or TSV fields, Data Prepper 2.0 now contains a `csv` processor which can create fields from your incoming CSV data.
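
As a rough sketch, assuming the delimited text arrives in a field named `message` and you want to name the columns yourself (the column names here are hypothetical):

```
processor:
  - csv:
      source: "message"
      # Creates one field per column for each incoming event.
      column_names: ["timestamp", "edge_location", "bytes_sent"]
```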

## Other improvements

Data Prepper 2.0 includes a number of other improvements. We want to highlight a few of them.

* The OpenSearch sink now supports `create` actions for OpenSearch when writing documents. Pipeline authors can configure their pipelines to only create new documents and not update existing ones (see the example after this list).
* The HTTP source now supports loading TLS/SSL credentials from either Amazon S3 or AWS Certificate Manager. The OTel Trace Source already supported these options; pipeline authors can now configure them for their log ingestion use cases.
* Data Prepper now requires Java 11 or higher, and the Docker image deploys with JDK 17.
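
As a rough sketch, an OpenSearch sink configured to only create new documents might look like the following; the index name is illustrative and connection settings follow the earlier examples:

```
sink:
  - opensearch:
      hosts: ["https://opensearch:9200"]
      index: application-logs
      # "create" writes new documents only; existing documents are not updated.
      action: create
```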

Please see our [release notes](https://github.com/opensearch-project/data-prepper/releases/tag/2.0.0) for a complete list.