In this new blog post (or github repo), we build on everything we have seen so far in this Kafka 101 Tutorial series, namely:
and we will see how to ingest our streaming data into a real-time database: Apache Druid! In addition, we will see how to visualize the data that we produce (either from our raw stream of events or our Flink-aggregated stream) using real-time dashboards supported by Apache Superset. Notice that we are using only Apache technologies so far; that is because we root for the open-source community π€.
The two new services we introduce in this article are a bit less known than what we have used so far, so I am just going to introduce them briefly.
- Druid is a column-oriented data store, which means that it stores data by column rather than by row. This allows for efficient compression and faster query performance.
- Druid is optimized for OLAP (Online Analytical Processing) queries, which means it's designed to handle complex queries on large data sets with low-latency response times.
- Druid supports both batch and real-time data ingestion, which means that it can handle both historical and streaming data.
- Druid includes a SQL-like query language called Druid Query Language (DSL), which allows users to write complex queries against their data.
- Druid integrates with a number of other data processing and analysis tools, including Apache Kafka, Apache Spark, and Apache Superset.
Apache Superset is a modern, open-source business intelligence (BI) platform that allows users to visualize and explore their data in real time.
- Superset was originally developed by Airbnb, and was later open-sourced and donated to the Apache Software Foundation.
- Superset is designed to connect to a wide variety of data sources, including databases, data warehouses, and big data platforms.
- Superset includes a web-based GUI that allows users to create charts, dashboards, and data visualizations using a drag-and-drop interface.
- Superset includes a wide variety of visualization options, including bar charts, line charts, scatterplots, heatmaps, and geographic maps.
- Superset includes a number of built-in features for data exploration and analysis, including SQL editors, data profiling tools, and interactive pivot tables.
To get this project running, you will just need minimal requirements: having Docker and Docker Compose installed on your computer.
The versions I used to build the project are
## Docker
docker --version
> Docker version 23.0.6, build ef23cbc431
## Docker Compose
docker-compose --version
> Docker Compose version 2.17.3
If your versions are different, it should not be a big problem. Though, some of the following might raise warnings or errors that you should debug on your own.
To have everything up and running, you will need to have the whole producer part running (the repo from the previous article) and Druid and Superset. To do this, do the following:
## Clone the repo
git clone https://github.com/theodorecurtil/flink_sql_job.git
## cd into the repo
cd flink_sql_job
## docker-compose the infra
docker-compose up -d
## go back to previous level
cd ..
## Clone the new repo
git clone [email protected]:theodorecurtil/real_time_dashboarding.git
## cd into the repo
cd real_time_dashboarding
## start the infrastructure
./start.sh
β οΈ You might have to make thestart.sh
script executable before being allowed to execute it. If this happens, simply type the following command
chmod +x start.sh
The Druid docker-compose file assumes that the Kafka cluster is running on the Docker network flink_sql_job_default
. It should be the case if you cloned the flink_sql_job
repo and started the infra using the commands listed before. Otherwise, simply adjust the references to the flink_sql_job_default
Docker network in the docker-compose
file.
To check that all the services are up and running (you will see that a lot of Docker containers are running by now), visit the following urls and check that all the UIs load properly:
- Kafka: http://localhost:9021
- Flink: http://localhost:18081
- Druid: username is
druid_system
and password ispassword2
http://localhost:8888 - Superset: username is
admin
and password isadmin
http://localhost:8088
You should see something like
The relationship between all the services is illustrated with the flow diagram below.
In the following sections, we will see how to setup the stream ingestion by Apache Druid from our Kafka cluster; and then how to link Druid to Superset to create real-time dashboards!
The Kafka producer we started produces messages in the SALES
topic. These are fake sales events produced every second with the following schema
{
"type": "record",
"name": "StoreSale",
"namespace": "com.acosom",
"fields": [
{
"name": "store_id",
"type": {
"type": "string",
"avro.java.string": "String"
},
"doc": "ID of the store where the sale was made."
},
{
"name": "sale_amount",
"type": "int",
"doc": "Amount of the sale in local currency. This value is an integer."
},
{
"name": "sale_ts",
"type": "long",
"doc": "Epoch timestamp when the sale happened."
}
]
}
First thing we will do is connect Druid to our Kafka cluster to allow the streaming ingestion of this topic and store it in our real-time database.
We will do this using Druid's UI at http://localhost:8888. Below we show how to create the streaming ingestion spec step-by-step, but note that the ingestion spec can simply be posted using Druid's rest API as a json.
The first step is to click the Load data
button and select the Streaming
option.
Then select Start a new streaming spec
to start writing our custom streaming spec.
Once this is done, select Apache Kafka
as the source.
Once this is done, write the address of the Kafka broker and the topic name and click to apply the configuration. You should then see on the left sample data points from our SALES
topic. Note that the data is not deserialized as this is the raw data coming from Kafka. We will need to include the Avro deserializer. To do this, click the Edit spec
button.
In the spec config, add the following (thank you Hellmar):
"inputFormat": {
"type": "avro_stream",
"binaryAsString": false,
"avroBytesDecoder": {
"type": "schema_registry",
"url": "http://schema-registry:8081"
}
}
and then click the Parse data
button to go back to editing the spec.
Now you should see our data being properly deserialized! Click the Parse time
button to continue.
From now on, there is not much more to edit, simply click through the panels until the streaming spec is submitted. Druid's UI will then automatically redirect to the Ingestion
tab where you should see the following, indicating that Druid is successfully consuming from Kafka.
Finally, to check that data is properly loaded in Druid, click the Query
tab and try a mock query. See example below.
π The stream ingestion is setup and now Druid will keep listening to the SALES
Kafka topic.
Now that the data in successfully being ingested in our streaming database, it is time to create our dashboard to report sales to management!
Navigate to the Superset UI at localhost:8088.
First thing we need to do is to connect our Druid database to Superset. To do this, click the +
button in the upper-right corner and select Data
and Connect database
. Then, follow the steps shown below. Note that the connection string contains the following:
- database type:
druid
- database credentials:
druid_system
andpassword2
- database sql endpoint:
172.18.0.1:8888/druid/v2/sql
Because Superset is not running on the same Docker network as the other services, we have to access it using the host. This will change depending on your OS. See Superset documentation.
Basically, the following should apply:
host.docker.internal
for Mac or Ubuntu users172.18.0.1
for Linux users
In Superset, a Dashboard
is made of Charts
; so the first thing is to create a Chart
. To create a Chart
we need to create a Dataset
object. Click the Dataset
button and follow the steps below to create a Dataset
from the SALES
topic.
Create a time-series area chart from the SALES
dataset and add a couple of configurations to the chart, as shown below. Then save the Chart
and add it to a new Dashboard
.
You will note that the Superset Chart
that we created does the 1-Minute aggregation that our Flink job also does. Let us directly consume the aggregated data from Apache Flink now, and see if both aggregations report the same figures.
The steps to follow:
- Start the Flink job
- Write the streaming ingestion spec into Druid from the new topic
- Create new chart in Superset and add it to the dashboard
To start the aggregating Flink job that computes aggregated sales per store per window of 1 minute, type the following commands:
## Docker exec in the container
docker exec -it sql-client bash
## Start the Flink job
sql-client.sh -f sales-aggregate.sql
## Exit container
exit
You will see in Kafka that a new topic SALES_AGGREGATE
is created by the Flink job.
Now repeat the operations related to Druid and Superset to ingest the data in Druid and create the Chart
in Superset.
See the dashboard below
As expected, the two charts are identical, as the same aggregation is performed; but in one case the aggregation in performed in Superset, and in the other case it is performed via Apache Flink, and Superset simply reads the data from the database without having to perform any work. In addition we created a pie-chart that shows the share of each store in the total amount of sales. As expected, this is close to a uniform distribution of the sales as each store generates sales from the same uniform distribution.
When you are done playing with Superset, follow these steps to stop the whole infrastructure. We will first bring Druid and Superset down, and then shut Flink, Kafka and the Avro records producer. The commands are shown below.
## In the real_time_dashboarding directory
./stop.sh
## Change to the flink_sql_job directory
cd ../flink_sql_job
## docker-compose down
docker-compose down
Again, make sure that the stop.sh
script in the real_time_dashboarding
folder is executable.
In this blog post, we covered the full data pipeline from end to end, from raw data creation with the Kafka producer, producing high quality data records to Kafka using Avro schema and the schema registry. Then we have the streaming analytics pipeline with the Flink cluster performing time aggregation on the raw data and sinking back to Kafka. We have also covered how to get started with Apache Druid as a real-time OLAP database that allows persiting the records from Kafka, and fast-querying. Lastly, we saw how to connect our streaming database to Superset to produce insightful visualizations and dashboards. Superset acts here as our BI tool.