diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..82ff49f --- /dev/null +++ b/.gitignore @@ -0,0 +1,10 @@ +.idea/ +.env + +# ascii doc generated files +docs/**/*.html +docs/**/*.css +docs/**/*.svg +docs/**/*.pdf +docs/.asciidoctor + diff --git a/README.html b/README.html new file mode 100644 index 0000000..f77234c --- /dev/null +++ b/README.html @@ -0,0 +1,1868 @@ + + + + + + + +Kafka Cost Control + + + + + + +
+
+
+
+
+
+
+

1. User Manual

+
+
+
+

1.1. Introduction

+
+

This user manual will help you understand Kafka Cost Control and how to use it properly. This document assumes that you already have a running application. If not, please see the Installation section.

+
+
+

At this point you should have access to the Kafka Cost Control UI and to the Grafana Dashboard.

+
+
+

1.1.1. Graphql

+
+

Kafka Cost Control provides a GraphQL endpoint at: <your-host>/graphql

+
+
+

In addition, there is a ready-to-use GraphQL UI. You can access it at the following URL: <your-host>/graphql-ui
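As a quick check, the getAllRules query from the Pricing rules section below can be sent straight to that endpoint with curl. This is only a sketch: the host is a placeholder, and the exact endpoint path and the absence of authentication are assumptions to adapt to your deployment.

curl -s -X POST "https://<your-host>/graphql" \
  -H "Content-Type: application/json" \
  -d '{"query":"query getAllRules { pricingRules { creationTime metricName baseCost costFactor } }"}'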

+
+
+
+
+
+

1.2. Pricing rules

+
+

Pricing rules are a way to put a price on each metric. The price will be applied to the hourly aggregates. Also, it’s common for metrics to be reported in bytes and not in megabytes or gigabytes; keep that in mind when setting the price. For example, if you want a price of $1.00 per GB, you will need to set the price to 1.0/1024³ ≈ 9.31e-10 $ per byte.

+
+
+

Pricing rules are stored in kafka in a compacted topic. The key should be the metric name.

+
+
+

1.2.1. Listing pricing rules

+
+
From the UI
+
+

Simply go to the pricing rules tab of the UI. You should see each metric name and its cost.

+
+
+
+
Using Graphql
+
+
+
query getAllRules {
+  pricingRules {
+    creationTime
+    metricName
+    baseCost
+    costFactor
+  }
+}
+
+
+
+
+
+

1.2.2. Setting a pricing rule

+
+
From the UI
+
+

Not available yet.

+
+
+
+
Using Graphql
+
+
+
mutation saveRule {
+  savePricingRule(
+    request: {metricName: "whatever", baseCost: 0.12, costFactor: 0.0001}
+  ) {
+    creationTime
+    metricName
+    baseCost
+    costFactor
+  }
+}
+
+
+
+
+
+

1.2.3. Removing a pricing rule

+
+
From the UI
+
+

Not available yet.

+
+
+
+
Using Graphql
+
+
+
mutation deleteRule {
+  deletePricingRule(request: {metricName: "whatever"}) {
+    creationTime
+    metricName
+    baseCost
+    costFactor
+  }
+}
+
+
+
+
+
+
+
+

1.3. Context data

+
+

Context data are a way to attach a context (basically a set of attributes) to a Kafka item (topic, principal, …​). You define a set of key/values for the items that match a regex. It is possible for one item to match multiple regexes (and thus multiple contexts), but in this case you have to be careful not to have conflicting key/values.

+
+
+

You can have as many key/values as you want. They will be used to sum up prices in the dashboard. It is therefore important that you have at least one key/value that defines the cost unit or organization unit. For example: organization_unit=department1.

+
+
+

The context data are stored in kafka in a compacted topic. The key is free for the user to choose.

+
+
+

1.3.1. Listing existing context data

+
+
From the UI
+
+

Simply go to the context tab of the UI. You should see all the contexts with their type, regex, validity time and key/values.

+
+
+
+
Using Graphql
+
+
+
query getContextData {
+  contextData {
+    id
+    creationTime
+    validFrom
+    validUntil
+    entityType
+    regex
+    context {
+      key
+      value
+    }
+  }
+}
+
+
+
+
+
+

1.3.2. Setting context data

+
+

When creating a new context, you can omit the id. If no id is set, the API will generate one for you using a UUID. If you use an id that is not yet in the system, a new context item is created.

+
+
+
From the UI
+
+

Not available yet.

+
+
+
+
Using Graphql
+
+
+
mutation saveContextData {
+  saveContextData(
+    request: {id: "323b603d-5b5f-48d2-84fc-4e784e942289", entityType: TOPIC, regex: ".*collaboration", context: [{key: "app", value: "agoora"}, {key: "cost-unit", value: "spoud"}, {key: "domain", value: "collaboration"}]}
+  ) {
+    id
+    creationTime
+    entityType
+    regex
+    context {
+      key
+      value
+    }
+  }
+}
+
+
+
+
+
+

1.3.3. Removing context data

+
+
From the UI
+
+

Not available yet.

+
+
+
+
Using Graphql
+
+
+
mutation deleteContextData {
+  deleteContextData(request: {id: "323b603d-5b5f-48d2-84fc-4e784e942289"}) {
+    id
+    creationTime
+    entityType
+    regex
+    context {
+      key
+      value
+    }
+  }
+}
+
+
+
+
+
+
+
+

1.4. Reprocess

+
+

Reprocessing should only be used when you made a mistake, fixed it, and want to reprocess the raw data. Reprocessing will induce a lag, meaning data will not be live for a little while. Depending on how much data you want to reprocess, this can take minutes or hours, so be sure you know what you are doing. After the reprocessing is done, the data will be live again. Reprocessing will NOT lose data; it will just take a bit of time for the data to appear live again.

+
+
+

Be aware that the reprocessing action itself may take a while to complete (usually about one minute), so be patient with the request.

+
+
+

The process is as follows:

+
+
+
    +
  • +

    Request reprocessing (using the UI or the reprocess GraphQL mutation)

    +
  • +
  • +

    The KafkaCostControl MetricProcess Kafka Streams application will stop

    +
  • +
  • +

    Wait for all consumers to stop and for kafka to release the consumer group (this may take time)

    +
  • +
  • +

    KafkaCostControl will look up the offset matching the timestamp requested for the reprocessing (if no timestamp is requested, it will just seek to zero)

    +
  • +
  • +

    KafkaCostControl will self-destruct so that Kubernetes restarts it (you may see the restart count increase)

    +
  • +
  • +

    The KafkaCostControl Kafka Streams application will resume from the offset defined by the timestamp you gave

    +
  • +
+
+
+

The metric database sink should be able to accept updates to already stored records (idempotent writes). Otherwise, you will need to clean the database yourself before reprocessing.

+
+
+

1.4.1. Using the UI

+
+
    +
  • +

    Go to the Others tab.

    +
  • +
  • +

    Choose a date for the start time of the reprocessing (empty means from the beginning of time). You can use the quick buttons at the top to help.

    +
  • +
  • +

    Click on reprocess

    +
  • +
  • +

    Confirm the reprocessing

    +
  • +
+
+
+
Using Graphql
+
+
+
mutation reprocess {
+  reprocess(areYouSure: "no", startTime:"2024-01-01T00:00:00Z")
+}
+
+
+
+
+
+
+
+
+
+

2. Installation

+
+
+
+

2.1. Prerequisites

+
+

This installation manual assumes that

+
+
+
    +
  1. +

    You have a kafka cluster

    +
  2. +
  3. +

    You have a schema registry

    +
  4. +
  5. +

    You have a Kubernetes cluster

    +
  6. +
+
+
+
+
+

2.2. Topics and AVRO schemas

+
+

Kafka Cost Control uses internal topics to compute pricing. You will have to create those topics before deploying the application. The documentation shows the default names; you can change them, but don’t forget to adapt the aggregator configuration accordingly. Example creation commands are sketched right after the table below.

+
+
+

2.2.1. Reference AVRO schemas

+
+

Some schemas will reference EntityType. Please add it to your schema registry and reference it when needed.

+
+
+
+

2.2.2. Topics

Topic name | Clean up policy | Key | Value
context-data | compact | String | ContextData
pricing-rule | compact | String | PricingRule
aggregated | delete | AggregatedDataKey | AggregatedDataWindowed
aggregated-table-friendly | delete | AggregatedDataKey | AggregatedDataTableFriendly
metrics-raw-telegraf-dev | delete | None | String
+
+
Context data
+
+

This topic will contain the additional information you wish to attach to the metrics; see the Context data section of the user manual for more information. This topic is compacted, and it is important that you take care of the key yourself. If you wish to delete a context-data entry, you can send null as the payload (with the key you want to delete).

+
+
+
+
Pricing rule
+
+

This topic will contain the price of each metric. Be aware that most metrics will be in bytes, so if you want, for example, a price of $1.00 per GB, you will need to set the price to 1.0/1024³ ≈ 9.31e-10 $ per byte. The key should be the metric name. If you wish to remove a price value, send a null payload with the key you want to delete. See the Pricing rules section of the user manual on how to use the API or the UI to set the price.

+
+
+
+
Aggregated
+
+

This topic will contain the enriched data. This is the result topic of the aggregator.

+
+
+
+
Aggregated table friendly
+
+

This is exactly the same data as aggregated, except that there are no hashmaps or other nested fields; everything has been flattened. This topic makes it easy to use Kafka Connect with a relational (tabular) database.

+
+
+
+
Metrics raw telegraf
+
+

You can have multiple raw topics, for example one per environment or one per Kafka cluster. The topic name is up to you; just don’t forget to configure it properly when you deploy Telegraf (see the Kubernetes section).

+
+
+
+
+
+
+

2.3. Kubernetes

+
+

You can find all the deployment files in the deployment folder. This folder uses Kustomize to simplify the deployment of multiple instances with some variations.

+
+
+

2.3.1. Create namespace

+
+

Create a namespace for the application

+
+
+
+
kubectl create namespace kafka-cost-control
+
+
+
+
+

2.3.2. Kafka metric scrapper

+
+

This part is responsible for scraping Kafka for relevant metrics. You will have to deploy one scraper per Kafka cluster. Depending on which metrics you want to provide, you will need a user with read access to the Kafka metrics and to the Kafka admin client. Read permission is enough; you don’t need a user with write permission.

+
+
+

This documentation assumes that you use the dev/ folder, but you can configure as many Kustomize folders as you want.

+
+
+

Copy the environment sample file:

+
+
+
+
cd deployment/kafka-metric-scrapper/dev
+cp .env.sample .env
+vi .env
+
+
+
+

Edit the environment file with the correct output topic, endpoints and credentials.

+
+
+

Be sure to edit the namespace in the kustomization.yaml file.

+
+
+

Deploy the dev environment using kubectl

+
+
+
+
cd deployment/kafka-metric-scrapper
+kubectl apply -k dev
+
+
+
+

Wait for the deployment to finish and check the output topic for metrics. You should receive new data every minute.
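One simple way to verify this is to tail the raw topic with the console consumer. This is a sketch: the broker address is a placeholder, the topic name assumes the default used in this documentation, and you may need to pass a --consumer.config file if your cluster requires authentication.

kafka-console-consumer.sh --bootstrap-server <broker:9092> \
  --topic metrics-raw-telegraf-dev --from-beginning --max-messages 10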

+
+
+
+

2.3.3. Kafka cost control

+
+

For this part we will deploy the Kafka Streams application that is responsible for enriching the metrics, TimescaleDB for storing the metrics, a Kafka Connect instance to sink the metrics into the database, a Grafana dashboard, and a simple UI to define prices and contexts.

+
+
+

This documentation assumes that you use the dev/ folder, but you can configure as many Kustomize folders as you want.

+
+
+

Copy the environment sample file:

+
+
+
+
cd deployment/kafka-cost-control/dev
+cp .env.sample .env
+vi .env
+
+
+
+

Edit the environment file with the correct credentials. The database password can be randomly generated; it will be used by Kafka Connect and Grafana.

+
+
+

Be sure to edit the namespace in the kustomization.yaml file.

+
+
+

You may also want to adapt the ingress files to use proper hosts. You will need two hosts: one for Grafana and one for the Kafka Cost Control application.

+
+
+

Deploy the application using kubectl

+
+
+
+
cd deployment/kafka-cost-control
+kubectl apply -k dev
+
+
+
+
+
+
+

2.4. Metric database

+
+

In order to store the metrics, we recommend using a time-series database. Feel free to choose one that suits your needs, but be careful to pick one that is compatible with Kafka Connect so you can easily transfer metrics from Kafka to your database. In this example we will assume that you’re using TimescaleDB, because it’s the one we provide Kubernetes manifests for.

+
+
+

2.4.1. Database Schema

+
+

Feel free to adapt the partition size to fit your needs. In this example we use 1 day, but please follow the TimescaleDB documentation to choose the right partition size for your use case.

+
+
+
+
CREATE TABLE "kafka_aggregated-table-friendly"
+(
+    "startTime"         TIMESTAMP        NOT NULL,
+    "endTime"           TIMESTAMP        NOT NULL,
+    "entityType"        VARCHAR          NOT NULL,
+    "initialMetricName" VARCHAR          NOT NULL,
+    "name"              VARCHAR          NOT NULL,
+    "value"             DOUBLE PRECISION NOT NULL,
+    "cost"              DOUBLE PRECISION NULL,
+    "tags"              JSONB            NOT NULL,
+    "context"           JSONB            NOT NULL,
+    PRIMARY KEY ("startTime", "endTime", "entityType", "initialMetricName", "name")
+);
+
+SELECT create_hypertable('kafka_aggregated-table-friendly', by_range('startTime', INTERVAL '1 day'));
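One way to apply this schema is shown below, assuming the TimescaleDB pod deployed by the provided manifests and the default postgres user; the pod name is a placeholder, and the statements above are saved as schema.sql.

kubectl -n kafka-cost-control exec -i <timescaledb-pod> -- \
  psql -U postgres -d postgres < schema.sql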
+
+
+
+
+
+
+

2.5. Kafka connect

+
+

To write data from the Kafka metric topic to the time-series database we will use Kafka Connect.

+
+
+

Please refer to the Kubernetes manifests to deploy a Kafka Connect cluster.

+
+
+

2.5.1. Configuration of the connectors

+
+

Don’t forget to adapt the hosts, users and passwords.

+
+
+
+
{
+  "name": "kafka-cost-control-aggregated",
+  "config": {
+
+    "tasks.max": "1",
+    "topics": "aggregated-table-friendly",
+    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
+    "connection.url": "jdbc:postgresql://timescaledb-service:5432/postgres?sslmode=disable",
+    "connection.user": "postgres",
+    "connection.password": "password",
+    "insert.mode": "upsert",
+    "auto.create": "false",
+    "table.name.format": "kafka_${topic}",
+    "pk.mode": "record_value",
+    "pk.fields": "startTime,endTime,entityType,initialMetricName,name",
+    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
+    "value.converter": "io.confluent.connect.avro.AvroConverter",
+    "value.converter.schema.registry.url": "https://schema-registry-host",
+    "value.converter.basic.auth.credentials.source": "USER_INFO",
+    "value.converter.basic.auth.user.info": "schema-registry-user:schema-registry-password",
+    "transforms": "flatten",
+    "transforms.flatten.type": "org.apache.kafka.connect.transforms.Flatten$Value",
+    "transforms.flatten.delimiter": "_"
+  }
+}
+
+
+
+

TODO curl command to create the connector
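In the meantime, a typical request against the Kafka Connect REST API looks like the following; the Connect host and port are assumptions, and connector.json contains the configuration shown above.

curl -s -X POST "http://<kafka-connect-host>:8083/connectors" \
  -H "Content-Type: application/json" \
  --data @connector.json

A GET request to /connectors on the same host lists the registered connectors and lets you verify the creation.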

+
+
+
+
+
+

2.6. Grafana

+
+

TODO

+
+
+
+
+

2.7. Telegraf

+
+

TODO

+
+
+
+
+
+
+

3. Architecture

+
+
+
+

3.1. Introduction and Goals

+
+

Many organizations have introduced Kafka either on premise or in the cloud in recent years.
+Kafka platforms are often used as a shared service for multiple teams. +Having all costs centralized in a single cost center means that there is no incentive to save costs for individual users or projects.

+
+
+

Kafka Cost Control gives organizations transparency into the costs caused by applications and allows them to distribute platform costs in a fair way to their users by providing a solution that

+
+
+
    +
  • +

    shows usage statistics per application and organizational unit

    +
  • +
  • +

    allows defining rules for platform cost distribution over organizational units or applications

    +
  • +
  • +

    works for most organizations, no matter if they use Confluent Cloud, Kubernetes or on-prem installations

    +
  • +
+
+
+

3.1.1. Requirements Overview

+
+
    +
  1. +

    Collection and aggregation of usage metrics and statistics from one or multiple Kafka clusters. Aggregation by time:

    +
    +
      +
    • +

      hourly (for debugging or as a metric to understand costs in near real-time)

      +
    • +
    • +

      daily

      +
    • +
    • +

      weekly

      +
    • +
    • +

      monthly

      +
    • +
    +
    +
  2. +
  3. +

    Management of associations between client applications, projects and organizational units (OU)

    +
    +
      +
    • +

      automatic recognition of running consumer groups

      +
    • +
    • +

      automatic detection of principals/clients

      +
    • +
    • +

      creation, modification and deletion of contexts (projects and OUs)

      +
    • +
    • +

      interface to hook in custom logic for automatic assignment of clients to projects and OUs

      +
    • +
    • +

      manual assignment of auto-detected principals or consumer groups to projects and OUs

      +
    • +
    • +

      contexts can change over time; each item should have an optional start and end date. This means that an item (e.g. a topic) can switch ownership at any point in time

      +
    • +
    +
    +
  4. +
  5. +

    Visualization of usage statistics

    +
    +
      +
    • +

      Costs and usage statistics can be broken down interactively

      +
      +
        +
      • +

        Summary view: total costs for timespan (day, week, month) per OU

        +
      • +
      • +

        Detail View OU by category: costs by category (produce, consume, storage) for the selected OU in the selected timespan

        +
      • +
      • +

        Detail View OU by application/principal/consumer-group/topic

        +
      • +
      +
      +
    • +
    • +

      Data must be made available in a format that can be used to display it with standard software (e.g. Kibana, Grafana, PowerBI), so that organizations can integrate it into an existing application landscape

      +
    • +
    • +

      provisioning of a lightweight default dashboard e.g. as a simple SPA, so that extra tooling is not mandatory to view the cost breakdown

      +
    • +
    • +

      Items not yet classified should be easily identifiable, so we know what configuration is missing (for example a topic has no OU yet)

      +
    • +
    +
    +
  6. +
  7. +

    Management of rules, that describe how costs are calculated (aka pricing rules)

    +
  8. +
  9. +

    Management of rules, that describe how costs are calculated, e.g.

    +
    +
      +
    • +

      fixed rates for available metrics, e.g. CHF 0.15 per consumed GB

      +
    • +
    • +

      base charge, e.g. CHF 0.5 per principal per hour

      +
    • +
    • +

      rules can be changed at any time, but take effect at a specified start time

      +
    • +
    • +

      optional: backtesting of rules using historical data

      +
    • +
    +
    +
  10. +
  11. +

    Access Control

    +
    +
      +
    • +

      only authorized users can modify rules, OUs and projects

      +
    • +
    • +

      unauthenticated users should be able to see statistics

      +
    • +
    +
    +
  12. +
  13. +

    Observability

    +
    +
      +
    • +

      expose metrics so that the cost control app can be monitored

      +
    • +
    • +

      proper logging

      +
    • +
    +
    +
  14. +
  15. +

    Export of end-of-month reports as CSV or Excel for further manual processing

    +
  16. +
  17. +

    Ability to reprocess raw data in case a mistake was made. For example we see at the end of the month that an item was +wrongly attributed to an OU. We should be able to correct this and reprocess the data.

    +
  18. +
+
+
+
+

3.1.2. Quality Goals

+
+
    +
  1. +

    Transferability / Extensibility: Kafka Cost Control should be modular, so that company-specific extensions can be added.
    + A core layer should contain common base functionality. +Company specific terms or features should be separated into dedicated modules.

    +
  2. +
  3. +

    Maintainability: Reacting to changing requirements and implementing bug fixes should be possible within weeks.

    +
  4. +
+
+
+
+Categories of Quality Requirements +
+
+
+
+

3.1.3. Stakeholders

+ ++++ + + + + + + + + + + + + + + + + +
Role/Name | Expectations
Kafka user | Should be able to see their usage. Should take ownership of resources.
Management | Should have an overview of the costs and usage of Kafka.

+
+
+
+
+

3.2. Architecture Constraints

+ ++++ + + + + + + + + + + + + + + + + +
Constraint | Explanation
JVM based | Use a language that is common at SPOUD and many clients, to make sure many people can contribute
Hosting On-Site (not SaaS only) | Companies may not want to expose usage data to a SaaS provider

+
+
+
+

3.3. System Scope and Context

+
+

Kafka Cost Control is a standalone application that needs to integrate into an existing IT landscape.

+
+
+
+context diagram +
+
+
+
+
+

3.4. Solution Strategy

+
+

3.4.1. Used Technologies

+ ++++ + + + + + + + + + + + + + + + + +
Technology | Reason

Telegraf

+
    +
  • +

    Used for scraping metrics from data sources like Prometheus agents or Confluent Cloud API.

    +
  • +
  • +

    Versatile and lightweight tool that can be run in all environments.

    +
  • +
  • +

    supports Kafka

    +
  • +
+

Kafka

for storing metrics, context info and pricing rules, reduces number of solution dependencies

+
+
+
+
+

3.5. Building Block View

+
+

3.5.1. Whitebox Overall System

+
+
+whitebox +
+
+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
Building block | Description

PricingRules

Stores rules for turning usage information into costs

ContextProvider

+

Manages contextual information that can be used to enrich metrics with company-specific information. E.g. relations between clientIds, applications, projects, cost centers, …​

+

MetricProcessor

+
    +
  • +

    Defines interfaces for metrics, that must be used by MetricsScraper

    +
  • +
  • +

    Aggregates metrics into time buckets

    +
  • +
  • +

    Produces enriched data streams which include contextual information

    +
  • +
+

MetricsScraper

+
    +
  • +

    uses a metric source, such as JMX or the Confluent Cloud Metrics API, to collect usage metrics

    +
  • +
  • +

    transforms the collected metrics into a format that is defined by MetricProcessor

    +
  • +
+
+
+
+

3.5.2. PricingRules

+
+
+pricingrules +
+
+
+
+

3.5.3. ContextProvider

+
+
+contextprovider +
+
+
+
Context format
+
+
    +
  • +

    metrics are defined in the core

    +
  • +
  • +

    a metric belongs to at least one of the dimensions

    +
    +
      +
    • +

      topic

      +
    • +
    • +

      consumer group

      +
    • +
    • +

      principal

      +
    • +
    +
    +
  • +
  • +

    a context object can be attached to existing dimensions as an AVRO key-value pair to provide the needed flexibility

    +
  • +
+
+
+
topic context as JSON record in a topic, record key="car-claims"
+
+
{
+  "creationTime": "2024-01-01T00:00:00Z",
+  "validFrom": "2024-01-01T00:00:00Z",
+  "validUntil": null,
+  "entityType": "TOPIC",
+  "regex": "car-claims",
+  "context": {
+    "project": "claims-processing",
+    "organization_unit": "non-life-insurance",
+    "sap_psp_element": "1234.234.abc"
+  }
+}
+
+
+
+
topic context rule as JSON record in a topic, record key="default-rule-since-2020"
+
+
{
+  "creationTime": "2024-01-01T00:00:00Z",
+  "validFrom": "2024-01-01T00:00:00Z",
+  "validUntil": null,
+  "entityType": "TOPIC",
+  "regex": "^([a-z0-9-]+)\\.([a-z0-9-]+)\\.([a-z0-9-]+)-.*$",
+  "context": {
+    "tenant": "$1",
+    "app_id": "$2",
+    "component_id": "$3"
+  }
+}
+
+
+
+

If naming conventions are very clear they could also be provided as a file / configuration.

+
+
+
principal context as JSON record in a topic, record key="cluster-id-principal-default-ctxt"
+
+
{
+  "creationTime": "2024-01-01T00:00:00Z",
+  "validFrom": "2024-01-01T00:00:00Z",
+  "validUntil": null,
+  "entityType": "PRINCIPAL",
+  "regex": "u-4j9my2",
+  "context": {
+    "project": "claims-processing",
+    "organization_unit": "non-life-insurance",
+    "sap_psp_element": "1234.234.abc"
+  }
+}
+
+
+
+
+
INFO
+
+

Context objects will be stored as AVRO messages. We use JSON as a representation here for simplicity.

+
+
+
+
+
+
Context Lookup
+
+

State stores in Kafka Streams will be used to construct lookup tables for the context.

+
+
+

The key is a string and is a free value that can be set by the user. If no key is provided, the API should create a random unique key. The topic is compacted, meaning that if we want to delete an item we can send a null payload with its key.

+
+ + ++++ + + + + + + + + + + + + + + + + + + + + +
Table 1. context lookup table
Key | Value
<type>_<cluster-id>_<principal_id> | <context-object>
PRINCIPAL_lx1dfsg_u-4j9my2_2024-01-01 | {…​, "regex": "u-4j9my2", "context": {…​}}
b0bd9c9a-08e6-46c7-9f71-9eafe370da6c | <context-object>

+
+

Once the table has been loaded, aggregated metrics can be enriched with a KTable - Streams join.

+
+
+
+
+
+
+

3.6. Runtime View

+
+

3.6.1. Metrics Ingestion from Confluent Cloud

+
+

Process to gather and aggregate metrics from Confluent Cloud.

+
+
+

The Confluent Metrics Scraper calls the endpoint api.telemetry.confluent.cloud/v2/metrics/cloud/export?resource.kafka.id={CLUSTER-ID} with Basic Auth at an interval of one minute to obtain all metrics in Prometheus format.
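A single scrape of that endpoint can be reproduced manually with curl; the API key/secret pair and the cluster id are placeholders.

curl -s -u "<API_KEY>:<API_SECRET>" \
  "https://api.telemetry.confluent.cloud/v2/metrics/cloud/export?resource.kafka.id=<CLUSTER-ID>"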

+
+
+
+runtime scraping +
+
+
+

Telegraf is used to poll data from the Confluent Prometheus endpoint.

+
+
+
+runtime confluent telegraf +
+
+
+
+

3.6.2. Metrics using Kafka Admin API

+
+

Some information can be gathered from the Kafka Admin API. We will develop a simple application that connects to the Kafka Admin API and exposes metrics as a Prometheus endpoint. We can then reuse Telegraf to publish those metrics to Kafka.

+
+
+
+runtime kafka admin api +
+
+
+
+

3.6.3. Other sources of metrics

+
+

Anyone can publish to the raw metrics topics. The metrics should follow the Telegraf format. Recommendation: use one topic per source of metrics; the MetricEnricher application will consume multiple raw metric topics anyway.
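As a sketch, a metric in the Telegraf JSON format (see the examples in the Metrics Enrichment section below) could be published manually to a raw topic like this; the broker, topic name and tag values are assumptions.

echo '{"fields":{"gauge":40920},"name":"confluent_kafka_server_sent_bytes","tags":{"kafka_id":"lkc-x5zqx","topic":"my-topic"},"timestamp":1704805140}' |
  kafka-console-producer.sh --bootstrap-server <broker:9092> --topic metrics-raw-telegraf-dev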

+
+
+
+

3.6.4. Metrics Enrichment

+
+
+runtime enrich +
+
+
+
    +
  1. +

    Metrics are consumed from all the raw data topics.

    +
  2. +
  3. +

    Metrics are aggregated by the MetricsProcessor. +Here we:

    +
    +
      +
    • +

      aggregate by hours

      +
    • +
    • +

      attach context

      +
    • +
    • +

      attach pricing rule

      +
    • +
    +
    +
  4. +
  5. +

    The aggregates are stored in the aggregated-metrics topic.

    +
  6. +
  7. +

    The aggregated metrics are stored into the query database.

    +
  8. +
+
+
+

The storage procedure into the query database must be idempotent so that the enrichment can be rerun in case of reprocessing.

+
+
+
Enrichment for topics
+
+
metric with topic name from confluent cloud
+
+
{
+  "fields": {
+    "gauge": 40920
+  },
+  "name": "confluent_kafka_server_sent_bytes",
+  "tags": {
+    "env": "sdm",
+    "host": "confluent.cloud",
+    "kafka_id": "lkc-x5zqx",
+    "topic": "mobiliar-agoora-state-global",
+    "url": "https://api.telemetry.confluent.cloud/v2/metrics/cloud/export?resource.kafka.id=lkc-x5zqx"
+  },
+  "timestamp": 1704805140
+}
+
+
+
+
+
Enrichment for principals
+
+
metric with principal id from confluent cloud
+
+
{
+  "fields": {
+    "gauge": 0
+  },
+  "name": "confluent_kafka_server_request_bytes",
+  "tags": {
+    "env": "sdm",
+    "host": "confluent.cloud",
+    "kafka_id": "lkc-x5zqx",
+    "principal_id": "u-4j9my2",
+    "type": "ApiVersions",
+    "url": "https://api.telemetry.confluent.cloud/v2/metrics/cloud/export?resource.kafka.id=lkc-x5zqx"
+  },
+  "timestamp": 1704805200
+}
+
+
+
+
+
+

3.6.5. Metrics Grouping

+
+
    +
  • +

    confluent_kafka_server_request_bytes by kafka_id (Cluster) and principal_id (User) for the type Produce as sum stored in produced_bytes

    +
  • +
  • +

    confluent_kafka_server_response_bytes by kafka_id (Cluster) and principal_id (User) for the type Fetch as sum stored in fetched_bytes

    +
  • +
  • +

    confluent_kafka_server_retained_bytes by kafka_id (Cluster) and topic as min and max stored in retained_bytes_min and retained_bytes_max

    +
  • +
  • +

    confluent_kafka_server_consumer_lag_offsets by kafka_id (Cluster) and topic as list of consumer_group_id stored in consumergroups

    +
  • +
+
+
+

maybe more are possible.

+
+
+
+
+
+

3.7. Deployment View

+
+
+deployment view +
+
+
+
+
+

3.8. Risks and Technical Debts

+
+
    +
  • +

    Difficulty to get context data

    +
    +
      +
    • +

      Will the customer be willing to make the effort to provide the necessary data?

      +
    • +
    +
    +
  • +
  • +

    Difficulty to put a set price on each kafka item

    +
  • +
  • +

    How to integrate general cost like operation, etc. (not linked to a particular kafka item)

    +
  • +
  • +

    Difficulty of integrating with a company’s cost dashboards

    +
  • +
+
+
+
+
+

3.9. Glossary

+ ++++ + + + + + + + + + + + + +
Term | Definition
OU | Organization Unit

+
+
+
+
+ + + + + + + + + \ No newline at end of file diff --git a/architecture/index.html b/architecture/index.html new file mode 100644 index 0000000..8d0357a --- /dev/null +++ b/architecture/index.html @@ -0,0 +1,1162 @@ + + + + + + + +Architecture + + + + + +
+
+

Architecture

+
+
+
+

Introduction and Goals

+
+

Many organizations have introduced Kafka either on premise or in the cloud in recent years.
+Kafka platforms are often used as a shared service for multiple teams. +Having all costs centralized in a single cost center means that there is no incentive to save costs for individual users or projects.

+
+
+

Kafka Cost Control gives organizations transparency into the costs caused by applications and allows them to distribute platform costs in a fair way to their users by providing a solution that

+
+
+
    +
  • +

    shows usage statistics per application and organizational unit

    +
  • +
  • +

    allows defining rules for platform cost distribution over organizational units or applications

    +
  • +
  • +

    works for most organizations, no matter if they use Confluent Cloud, Kubernetes or on-prem installations

    +
  • +
+
+
+

Requirements Overview

+
+
    +
  1. +

    Collection and aggregation of usage metrics and statistics from one or multiple Kafka clusters. Aggregation by time:

    +
    +
      +
    • +

      hourly (for debugging or as a metric to understand costs in near real-time)

      +
    • +
    • +

      daily

      +
    • +
    • +

      weekly

      +
    • +
    • +

      monthly

      +
    • +
    +
    +
  2. +
  3. +

    Management of associations between client applications, projects and organizational units (OU)

    +
    +
      +
    • +

      automatic recognition of running consumer groups

      +
    • +
    • +

      automatic detection of principals/clients

      +
    • +
    • +

      creation, modification and deletion of contexts (projects and OUs)

      +
    • +
    • +

      interface to hook in custom logic for automatic assignment of clients to projects and OUs

      +
    • +
    • +

      manual assignment of auto-detected principals or consumer groups to projects and OUs

      +
    • +
    • +

      context can change in time, each item should have a start and end date (optional). This means that an item (ex a topic) can switch ownership at any point in time

      +
    • +
    +
    +
  4. +
  5. +

    Visualization of usage statistics

    +
    +
      +
    • +

      Costs and usage statistics can be broken down interactively

      +
      +
        +
      • +

        Summary view: total costs for timespan (day, week, month) per OU

        +
      • +
      • +

        Detail View OU by category: costs by category (produce, consume, storage) for the selected OU in the selected timespan

        +
      • +
      • +

        Detail View OU by application/principal/consumer-group/topic

        +
      • +
      +
      +
    • +
    • +

      Data must be made available in a format that can be used to display it with standard software (e.g. Kibana, Grafana, PowerBI), so that organizations can integrate it into an existing application landscape

      +
    • +
    • +

      provisioning of a lightweight default dashboard e.g. as a simple SPA, so that extra tooling is not mandatory to view the cost breakdown

      +
    • +
    • +

      Items not yet classified should be easily identifiable, so we know what configuration is missing (for example a topic has no OU yet)

      +
    • +
    +
    +
  6. +
  7. +

    Management of rules, that describe how costs are calculated (aka pricing rules)

    +
  8. +
  9. +

    Management of rules, that describe how costs are calculated, e.g.

    +
    +
      +
    • +

      fixed rates for available metrics, i.e. CHF 0.15 per consumed GB

      +
    • +
    • +

      base charge, i.e. CHF 0.5 per principal per hour

      +
    • +
    • +

      rules can be changed at any time, but take effect at a specified start time

      +
    • +
    • +

      optional: backtesting of rules using historical data

      +
    • +
    +
    +
  10. +
  11. +

    Access Control

    +
    +
      +
    • +

      only authorized users can modify rules, OUs and projects

      +
    • +
    • +

      unauthenticated users should be able to see statistics

      +
    • +
    +
    +
  12. +
  13. +

    Observability

    +
    +
      +
    • +

      expose metrics so that the cost control app can be monitored

      +
    • +
    • +

      proper logging

      +
    • +
    +
    +
  14. +
  15. +

    Export of end-of-month reports as CSV or Excel for further manual processing

    +
  16. +
  17. +

    Ability to reprocess raw data in case a mistake was made. For example we see at the end of the month that an item was +wrongly attributed to an OU. We should be able to correct this and reprocess the data.

    +
  18. +
+
+
+
+

Quality Goals

+
+
    +
  1. +

    Transferability / Extensibility: Kafka Cost Control should be modular, so that company-specific extensions can be added.
    + A core layer should contain common base functionality. +Company specific terms or features should be separated into dedicated modules.

    +
  2. +
  3. +

    Maintainability: Reacting to changing requirements and implementing bug fixes should be possible within weeks.

    +
  4. +
+
+
+
+Categories of Quality Requirements +
+
+
+
+

Stakeholders

+ ++++ + + + + + + + + + + + + + + + + +
Role/Name | Expectations

Kafka user

Should be able to see their usage. Should take ownership of resources.

Management

Should have an overview of the costs and usage of Kafka.

+
+
+
+
+

Architecture Constraints

+ ++++ + + + + + + + + + + + + + + + + +
Constraint | Explanation

JVM based

use common language at SPOUD and many clients to make sure many can contribute

Hosting On-Site (not SaaS only)

Companies may not want to expose usage data to a SaaS provider

+
+
+
+

System Scope and Context

+
+

Kafka Cost Control is a standalone application that needs to integrate into an existing IT landscape.

+
+
+
+context diagram +
+
+
+
+
+

Solution Strategy

+
+

Used Technologies

+ ++++ + + + + + + + + + + + + + + + + +
Technology | Reason

Telegraf

+
    +
  • +

    Used for scraping metrics from data sources like Prometheus agents or Confluent Cloud API.

    +
  • +
  • +

    Versatile and lightweight tool that can be run in all environments.

    +
  • +
  • +

    supports Kafka

    +
  • +
+

Kafka

for storing metrics, context info and pricing rules, reduces number of solution dependencies

+
+
+
+
+

Building Block View

+
+

Whitebox Overall System

+
+
+whitebox +
+
+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
Building block | Description

PricingRules

Stores rules for turning usage information into costs

ContextProvider

+

Manages contextual information that can be used to enrich metrics with company-specific information. E.g. relations between clientIds, applications, projects, cost centers, …​

+

MetricProcessor

+
    +
  • +

    Defines interfaces for metrics, that must be used by MetricsScraper

    +
  • +
  • +

    Aggregates metrics into time buckets

    +
  • +
  • +

    Produces enriched data streams which includes contextual information

    +
  • +
+

MetricsScraper

+
    +
  • +

    uses a metric source, such as JMX or the confluent cloud metrics API to collect usage metrics

    +
  • +
  • +

    transforms the collected metrics into a format that is defined by MetricProcessor

    +
  • +
+
+
+
+

PricingRules

+
+
+pricingrules +
+
+
+
+

ContextProvider

+
+
+contextprovider +
+
+
+
Context format
+
+
    +
  • +

    metrics are defined in the core

    +
  • +
  • +

    a metric belongs to at least one of the dimensions

    +
    +
      +
    • +

      topic

      +
    • +
    • +

      consumer group

      +
    • +
    • +

      principal

      +
    • +
    +
    +
  • +
  • +

    a context object can be attached to existing dimensions as a AVRO key-value pair to provide the needed flexibility

    +
  • +
+
+
+
topic context as JSON record in a topic, record key="car-claims"
+
+
{
+  "creationTime": "2024-01-01T00:00:00Z",
+  "validFrom": "2024-01-01T00:00:00Z",
+  "validUntil": null,
+  "entityType": "TOPIC",
+  "regex": "car-claims",
+  "context": {
+    "project": "claims-processing",
+    "organization_unit": "non-life-insurance",
+    "sap_psp_element": "1234.234.abc"
+  }
+}
+
+
+
+
topic context rule as JSON record in a topic, record key="default-rule-since-2020"
+
+
{
+  "creationTime": "2024-01-01T00:00:00Z",
+  "validFrom": "2024-01-01T00:00:00Z",
+  "validUntil": null,
+  "entityType": "TOPIC",
+  "regex": "^([a-z0-9-]+)\\.([a-z0-9-]+)\\.([a-z0-9-]+)-.*$",
+  "context": {
+    "tenant": "$1",
+    "app_id": "$2",
+    "component_id": "$3"
+  }
+}
+
+
+
+

If naming conventions are very clear they could also be provided as a file / configuration.

+
+
+
principal context as JSON record in a topic, record key="cluster-id-principal-default-ctxt"
+
+
{
+  "creationTime": "2024-01-01T00:00:00Z",
+  "validFrom": "2024-01-01T00:00:00Z",
+  "validUntil": null,
+  "entityType": "PRINCIPAL",
+  "regex": "u-4j9my2",
+  "context": {
+    "project": "claims-processing",
+    "organization_unit": "non-life-insurance",
+    "sap_psp_element": "1234.234.abc"
+  }
+}
+
+
+
+
+
INFO
+
+

Context objects will be stored as AVRO messages. We use JSON as a representation here for simplicity.

+
+
+
+
+
+
Context Lookup
+
+

State stores in Kafka Streams will be used to construct lookup tables for the context.

+
+
+

The key is a string and is a free value that can be set by the user. If no key is provided the API should create random unique key. The topic is compacted, meaning if we want to delete an item we can send a null payload with its key.

+
+ + ++++ + + + + + + + + + + + + + + + + + + + + +
Table 1. context lookup table
Key | Value

<type>_<cluster-id>_<principal_id>

<context-object>

PRINCIPAL_lx1dfsg_u-4j9my2_2024-01-01

{…​, "regex": "u-4j9my2","context": {…​}}

b0bd9c9a-08e6-46c7-9f71-9eafe370da6c

<context-object>

+
+

Once the table has been loaded, aggregated metrics can be enriched with a KTable - Streams join.

+
+
+
+
+
+
+

Runtime View

+
+

Metrics Ingestion from Confluent Cloud

+
+

Process to gather and aggregate metrics from Confluent Cloud.

+
+
+

The Confluent Metrics Scraper calls the endpoint +api.telemetry.confluent.cloud/v2/metrics/cloud/export?resource.kafka.id={CLUSTER-ID} +with Basic Auth in an interval of 1 Minute to obtain all metrics in Prometheus format.

+
+
+
+runtime scraping +
+
+
+

Telegraf is used to poll data using Confluent prometheus endpoint.

+
+
+
+runtime confluent telegraf +
+
+
+
+

Metrics using Kafka Admin API

+
+

Some information can be gathered from the Kafka Admin API. We will develop a simple application that connects to the Kafka Admin API and exposes metrics as a Prometheus endpoint. We can then reuse Telegraf to publish those metrics to Kafka.

+
+
+
+runtime kafka admin api +
+
+
+
+

Other sources of metrics

+
+

Anyone can publish to the raw metrics topic. The metrics should follow the telegraf format. +Recommendation: use one topic per source of metrics. The MetricEnricher application will anyway consume multiple raw metric topics.

+
+
+
+

Metrics Enrichment

+
+
+runtime enrich +
+
+
+
    +
  1. +

    Metrics are consumed from all the raw data topics.

    +
  2. +
  3. +

    Metrics are aggregated by the MetricsProcessor. +Here we:

    +
    +
      +
    • +

      aggregate by hours

      +
    • +
    • +

      attach context

      +
    • +
    • +

      attach pricing rule

      +
    • +
    +
    +
  4. +
  5. +

    The aggregates are stored in the aggregated-metrics topic.

    +
  6. +
  7. +

    The aggregated metrics are stored into the query database.

    +
  8. +
+
+
+

The storage procedure into the query database must be idempotent in order to reprocess the enrichment in case of reprocessing.

+
+
+
Enrichment for topics
+
+
metric with topic name from confluent cloud
+
+
{
+  "fields": {
+    "gauge": 40920
+  },
+  "name": "confluent_kafka_server_sent_bytes",
+  "tags": {
+    "env": "sdm",
+    "host": "confluent.cloud",
+    "kafka_id": "lkc-x5zqx",
+    "topic": "mobiliar-agoora-state-global",
+    "url": "https://api.telemetry.confluent.cloud/v2/metrics/cloud/export?resource.kafka.id=lkc-x5zqx"
+  },
+  "timestamp": 1704805140
+}
+
+
+
+
+
Enrichment for principals
+
+
metric with principal id from confluent cloud
+
+
{
+  "fields": {
+    "gauge": 0
+  },
+  "name": "confluent_kafka_server_request_bytes",
+  "tags": {
+    "env": "sdm",
+    "host": "confluent.cloud",
+    "kafka_id": "lkc-x5zqx",
+    "principal_id": "u-4j9my2",
+    "type": "ApiVersions",
+    "url": "https://api.telemetry.confluent.cloud/v2/metrics/cloud/export?resource.kafka.id=lkc-x5zqx"
+  },
+  "timestamp": 1704805200
+}
+
+
+
+
+
+

Metrics Grouping

+
+
    +
  • +

    confluent_kafka_server_request_bytes by kafka_id (Cluster) and principal_id (User) for the type Produce as sum stored in produced_bytes

    +
  • +
  • +

    confluent_kafka_server_response_bytes by kafka_id (Cluster) and principal_id (User) for the type Fetch as sum stored in fetched_bytes

    +
  • +
  • +

    confluent_kafka_server_retained_bytes by kafka_id (Cluster) and topic as min and max stored in retained_bytes_min and retained_bytes_max

    +
  • +
  • +

    confluent_kafka_server_consumer_lag_offsets by kafka_id (Cluster) and topic as list of consumer_group_id stored in consumergroups

    +
  • +
+
+
+

maybe more are possible.

+
+
+
+
+
+

Deployment View

+
+
+deployment view +
+
+
+
+
+

Risks and Technical Debts

+
+
    +
  • +

    Difficulty to get context data

    +
    +
      +
    • +

      Will the customer be willing to make the effort to provide the necessary data?

      +
    • +
    +
    +
  • +
  • +

    Difficulty to put a set price on each kafka item

    +
  • +
  • +

    How to integrate general cost like operation, etc. (not linked to a particular kafka item)

    +
  • +
  • +

    Difficulty of integration with companies cost dashboard

    +
  • +
+
+
+
+
+

Glossary

+ ++++ + + + + + + + + + + + + +
Term | Definition

OU

Organization Unit

+
+
+
+
+ + + \ No newline at end of file diff --git a/architecture/src/01_introduction_and_goals.html b/architecture/src/01_introduction_and_goals.html new file mode 100644 index 0000000..8822673 --- /dev/null +++ b/architecture/src/01_introduction_and_goals.html @@ -0,0 +1,654 @@ + + + + + + + +Introduction and Goals + + + + + +
+
+

Introduction and Goals

+
+

Many organizations have introduced Kafka either on premise or in the cloud in recent years.
+Kafka platforms are often used as a shared service for multiple teams. +Having all costs centralized in a single cost center means that there is no incentive to save costs for individual users or projects.

+
+
+

Kafka Cost Control gives organizations transparency into the costs caused by applications and allows them to distribute platform costs in a fair way to their users by providing a solution that

+
+
+
    +
  • +

    shows usage statistics per application and organizational unit

    +
  • +
  • +

    allows defining rules for platform cost distribution over organizational units or applications

    +
  • +
  • +

    works for most organizations, no matter if they use Confluent Cloud, Kubernetes or on-prem installations

    +
  • +
+
+
+

Requirements Overview

+
+
    +
  1. +

    Collection and aggregation of usage metrics and statistics from one or multiple Kafka clusters. Aggregation by time:

    +
    +
      +
    • +

      hourly (for debugging or as a metric to understand costs in near real-time)

      +
    • +
    • +

      daily

      +
    • +
    • +

      weekly

      +
    • +
    • +

      monthly

      +
    • +
    +
    +
  2. +
  3. +

    Management of associations between client applications, projects and organizational units (OU)

    +
    +
      +
    • +

      automatic recognition of running consumer groups

      +
    • +
    • +

      automatic detection of principals/clients

      +
    • +
    • +

      creation, modification and deletion of contexts (projects and OUs)

      +
    • +
    • +

      interface to hook in custom logic for automatic assignment of clients to projects and OUs

      +
    • +
    • +

      manual assignment of auto-detected principals or consumer groups to projects and OUs

      +
    • +
    • +

      context can change in time, each item should have a start and end date (optional). This means that an item (ex a topic) can switch ownership at any point in time

      +
    • +
    +
    +
  4. +
  5. +

    Visualization of usage statistics

    +
    +
      +
    • +

      Costs and usage statistics can be broken down interactively

      +
      +
        +
      • +

        Summary view: total costs for timespan (day, week, month) per OU

        +
      • +
      • +

        Detail View OU by category: costs by category (produce, consume, storage) for the selected OU in the selected timespan

        +
      • +
      • +

        Detail View OU by application/principal/consumer-group/topic

        +
      • +
      +
      +
    • +
    • +

      Data must be made available in a format that can be used to display it with standard software (e.g. Kibana, Grafana, PowerBI), so that organizations can integrate it into an existing application landscape

      +
    • +
    • +

      provisioning of a lightweight default dashboard e.g. as a simple SPA, so that extra tooling is not mandatory to view the cost breakdown

      +
    • +
    • +

      Items not yet classified should be easily identifiable, so we know what configuration is missing (for example a topic has no OU yet)

      +
    • +
    +
    +
  6. +
  7. +

    Management of rules, that describe how costs are calculated (aka pricing rules)

    +
  8. +
  9. +

    Management of rules, that describe how costs are calculated, e.g.

    +
    +
      +
    • +

      fixed rates for available metrics, i.e. CHF 0.15 per consumed GB

      +
    • +
    • +

      base charge, i.e. CHF 0.5 per principal per hour

      +
    • +
    • +

      rules can be changed at any time, but take effect at a specified start time

      +
    • +
    • +

      optional: backtesting of rules using historical data

      +
    • +
    +
    +
  10. +
  11. +

    Access Control

    +
    +
      +
    • +

      only authorized users can modify rules, OUs and projects

      +
    • +
    • +

      unauthenticated users should be able to see statistics

      +
    • +
    +
    +
  12. +
  13. +

    Observability

    +
    +
      +
    • +

      expose metrics so that the cost control app can be monitored

      +
    • +
    • +

      proper logging

      +
    • +
    +
    +
  14. +
  15. +

    Export of end-of-month reports as CSV or Excel for further manual processing

    +
  16. +
  17. +

    Ability to reprocess raw data in case a mistake was made. For example we see at the end of the month that an item was +wrongly attributed to an OU. We should be able to correct this and reprocess the data.

    +
  18. +
+
+
+
+

Quality Goals

+
+
    +
  1. +

    Transferability / Extensibility: Kafka Cost Control should be modular, so that company-specific extensions can be added.
    + A core layer should contain common base functionality. +Company specific terms or features should be separated into dedicated modules.

    +
  2. +
  3. +

    Maintainability: Reacting to changing requirements and implementing bug fixes should be possible within weeks.

    +
  4. +
+
+
+
+Categories of Quality Requirements +
+
+
+
+

Stakeholders

+ ++++ + + + + + + + + + + + + + + + + +
Role/Name | Expectations

Kafka user

Should be able to see their usage. Should take ownership of resources.

Management

Should have an overview of the costs and usage of Kafka.

+
+
+
+ + + \ No newline at end of file diff --git a/architecture/src/02_architecture_constraints.html b/architecture/src/02_architecture_constraints.html new file mode 100644 index 0000000..295cfa5 --- /dev/null +++ b/architecture/src/02_architecture_constraints.html @@ -0,0 +1,472 @@ + + + + + + + +Architecture Constraints + + + + + +
+
+

Architecture Constraints

+ ++++ + + + + + + + + + + + + + + + + +
Constraint | Explanation

JVM based

use common language at SPOUD and many clients to make sure many can contribute

Hosting On-Site (not SaaS only)

Companies may not want to expose usage data to a SaaS provider

+
+
+ + + \ No newline at end of file diff --git a/architecture/src/03_system_scope_and_context.html b/architecture/src/03_system_scope_and_context.html new file mode 100644 index 0000000..a75910f --- /dev/null +++ b/architecture/src/03_system_scope_and_context.html @@ -0,0 +1,458 @@ + + + + + + + +System Scope and Context + + + + + +
+
+

System Scope and Context

+
+

Kafka Cost Control is a standalone application that needs to integrate into an existing IT landscape.

+
+
+
+context diagram +
+
+
+
+ + + \ No newline at end of file diff --git a/architecture/src/04_solution_strategy.html b/architecture/src/04_solution_strategy.html new file mode 100644 index 0000000..9673843 --- /dev/null +++ b/architecture/src/04_solution_strategy.html @@ -0,0 +1,487 @@ + + + + + + + +Solution Strategy + + + + + +
+
+

Solution Strategy

+
+

Used Technologies

+ ++++ + + + + + + + + + + + + + + + + +
Technology | Reason

Telegraf

+
    +
  • +

    Used for scraping metrics from data sources like Prometheus agents or Confluent Cloud API.

    +
  • +
  • +

    Versatile and lightweight tool that can be run in all environments.

    +
  • +
  • +

    supports Kafka

    +
  • +
+

Kafka

for storing metrics, context info and pricing rules, reduces number of solution dependencies

+
+
+
+ + + \ No newline at end of file diff --git a/architecture/src/05_building_block_view.html b/architecture/src/05_building_block_view.html new file mode 100644 index 0000000..bc6c22f --- /dev/null +++ b/architecture/src/05_building_block_view.html @@ -0,0 +1,657 @@ + + + + + + + +Building Block View + + + + + +
+
+

Building Block View

+
+

Whitebox Overall System

+
+
+whitebox +
+
+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
Building block | Description

PricingRules

Stores rules for turning usage information into costs

ContextProvider

+

Manages contextual information that can be used to enrich metrics with company-specific information. E.g. relations between clientIds, applications, projects, cost centers, …​

+

MetricProcessor

+
    +
  • +

    Defines interfaces for metrics, that must be used by MetricsScraper

    +
  • +
  • +

    Aggregates metrics into time buckets

    +
  • +
  • +

    Produces enriched data streams which includes contextual information

    +
  • +
+

MetricsScraper

+
    +
  • +

    uses a metric source, such as JMX or the confluent cloud metrics API to collect usage metrics

    +
  • +
  • +

    transforms the collected metrics into a format that is defined by MetricProcessor

    +
  • +
+
+
+
+

PricingRules

+
+
+pricingrules +
+
+
+
+

ContextProvider

+
+
+contextprovider +
+
+
+
Context format
+
+
    +
  • +

    metrics are defined in the core

    +
  • +
  • +

    a metric belongs to at least one of the dimensions

    +
    +
      +
    • +

      topic

      +
    • +
    • +

      consumer group

      +
    • +
    • +

      principal

      +
    • +
    +
    +
  • +
  • +

    a context object can be attached to existing dimensions as a AVRO key-value pair to provide the needed flexibility

    +
  • +
+
+
+
topic context as JSON record in a topic, record key="car-claims"
+
+
{
+  "creationTime": "2024-01-01T00:00:00Z",
+  "validFrom": "2024-01-01T00:00:00Z",
+  "validUntil": null,
+  "entityType": "TOPIC",
+  "regex": "car-claims",
+  "context": {
+    "project": "claims-processing",
+    "organization_unit": "non-life-insurance",
+    "sap_psp_element": "1234.234.abc"
+  }
+}
+
+
+
+
topic context rule as JSON record in a topic, record key="default-rule-since-2020"
+
+
{
+  "creationTime": "2024-01-01T00:00:00Z",
+  "validFrom": "2024-01-01T00:00:00Z",
+  "validUntil": null,
+  "entityType": "TOPIC",
+  "regex": "^([a-z0-9-]+)\\.([a-z0-9-]+)\\.([a-z0-9-]+)-.*$",
+  "context": {
+    "tenant": "$1",
+    "app_id": "$2",
+    "component_id": "$3"
+  }
+}
+
+
+
+

If naming conventions are very clear they could also be provided as a file / configuration.

+
+
+
principal context as JSON record in a topic, record key="cluster-id-principal-default-ctxt"
+
+
{
+  "creationTime": "2024-01-01T00:00:00Z",
+  "validFrom": "2024-01-01T00:00:00Z",
+  "validUntil": null,
+  "entityType": "PRINCIPAL",
+  "regex": "u-4j9my2",
+  "context": {
+    "project": "claims-processing",
+    "organization_unit": "non-life-insurance",
+    "sap_psp_element": "1234.234.abc"
+  }
+}
+
+
+
+
+
INFO
+
+

Context objects will be stored as AVRO messages. We use JSON as a representation here for simplicity.

+
+
+
+
+
+
Context Lookup
+
+

State stores in Kafka Streams will be used to construct lookup tables for the context.

+
+
+

The key is a string and is a free value that can be set by the user. If no key is provided the API should create random unique key. The topic is compacted, meaning if we want to delete an item we can send a null payload with its key.

+
+ + ++++ + + + + + + + + + + + + + + + + + + + + +
Table 1. context lookup table
Key | Value

<type>_<cluster-id>_<principal_id>

<context-object>

PRINCIPAL_lx1dfsg_u-4j9my2_2024-01-01

{…​, "regex": "u-4j9my2","context": {…​}}

b0bd9c9a-08e6-46c7-9f71-9eafe370da6c

<context-object>

+
+

Once the table has been loaded, aggregated metrics can be enriched with a KTable - Streams join.

+
+
+
+
+
+ + + \ No newline at end of file diff --git a/architecture/src/06_runtime_view.html b/architecture/src/06_runtime_view.html new file mode 100644 index 0000000..134d701 --- /dev/null +++ b/architecture/src/06_runtime_view.html @@ -0,0 +1,600 @@ + + + + + + + +Runtime View + + + + + +
+
+

Runtime View

+
+

Metrics Ingestion from Confluent Cloud

+
+

Process to gather and aggregate metrics from Confluent Cloud.

+
+
+

The Confluent Metrics Scraper calls the endpoint +api.telemetry.confluent.cloud/v2/metrics/cloud/export?resource.kafka.id={CLUSTER-ID} +with Basic Auth in an interval of 1 Minute to obtain all metrics in Prometheus format.

+
+
+
+runtime scraping +
+
+
+

Telegraf is used to poll data using Confluent prometheus endpoint.

+
+
+
+runtime confluent telegraf +
+
+
+
+

Metrics using Kafka Admin API

+
+

Some information can be gathered from the Kafka Admin API. We will develop a simple application that connects to the Kafka Admin API and exposes metrics as a Prometheus endpoint. We can then reuse Telegraf to publish those metrics to Kafka.

+
+
+
+runtime kafka admin api +
+
+
+
+

Other sources of metrics

+
+

Anyone can publish to the raw metrics topic. The metrics should follow the telegraf format. +Recommendation: use one topic per source of metrics. The MetricEnricher application will anyway consume multiple raw metric topics.

+
+
+
+

Metrics Enrichment

+
+
+runtime enrich +
+
+
+
    +
  1. +

    Metrics are consumed from all the raw data topics.

    +
  2. +
  3. +

    Metrics are aggregated by the MetricsProcessor. +Here we:

    +
    +
      +
    • +

      aggregate by hours

      +
    • +
    • +

      attach context

      +
    • +
    • +

      attach pricing rule

      +
    • +
    +
    +
  4. +
  5. +

    The aggregates are stored in the aggregated-metrics topic.

    +
  6. +
  7. +

    The aggregated metrics are stored into the query database.

    +
  8. +
+
+
+

The storage procedure into the query database must be idempotent in order to reprocess the enrichment in case of reprocessing.

+
+
+
Enrichment for topics
+
+
metric with topic name from confluent cloud
+
+
{
+  "fields": {
+    "gauge": 40920
+  },
+  "name": "confluent_kafka_server_sent_bytes",
+  "tags": {
+    "env": "sdm",
+    "host": "confluent.cloud",
+    "kafka_id": "lkc-x5zqx",
+    "topic": "mobiliar-agoora-state-global",
+    "url": "https://api.telemetry.confluent.cloud/v2/metrics/cloud/export?resource.kafka.id=lkc-x5zqx"
+  },
+  "timestamp": 1704805140
+}
+
+
+
+
+
Enrichment for principals
+
+
metric with principal id from confluent cloud
+
+
{
+  "fields": {
+    "gauge": 0
+  },
+  "name": "confluent_kafka_server_request_bytes",
+  "tags": {
+    "env": "sdm",
+    "host": "confluent.cloud",
+    "kafka_id": "lkc-x5zqx",
+    "principal_id": "u-4j9my2",
+    "type": "ApiVersions",
+    "url": "https://api.telemetry.confluent.cloud/v2/metrics/cloud/export?resource.kafka.id=lkc-x5zqx"
+  },
+  "timestamp": 1704805200
+}
+
+
+
+
+
+

Metrics Grouping

+
+
  • confluent_kafka_server_request_bytes by kafka_id (Cluster) and principal_id (User) for the type Produce as sum, stored in produced_bytes
  • confluent_kafka_server_response_bytes by kafka_id (Cluster) and principal_id (User) for the type Fetch as sum, stored in fetched_bytes
  • confluent_kafka_server_retained_bytes by kafka_id (Cluster) and topic as min and max, stored in retained_bytes_min and retained_bytes_max
  • confluent_kafka_server_consumer_lag_offsets by kafka_id (Cluster) and topic as a list of consumer_group_id, stored in consumergroups
+
+
+

More groupings may be possible.

+
+
+
+
+ + + \ No newline at end of file diff --git a/architecture/src/07_deployment_view.html b/architecture/src/07_deployment_view.html new file mode 100644 index 0000000..fef5f92 --- /dev/null +++ b/architecture/src/07_deployment_view.html @@ -0,0 +1,455 @@ + + + + + + + +Deployment View + + + + + +
+
+

Deployment View

+
+
+deployment view +
+
+
+
+ + + \ No newline at end of file diff --git a/architecture/src/08_concepts.html b/architecture/src/08_concepts.html new file mode 100644 index 0000000..c17b05b --- /dev/null +++ b/architecture/src/08_concepts.html @@ -0,0 +1,471 @@ + + + + + + + +Cross-cutting Concepts + + + + + +
+
+

Cross-cutting Concepts

+
+

<Concept 1>

+
+

<explanation>

+
+
+
+

<Concept 2>

+
+

<explanation>

+
+
+

…​

+
+
+
+

<Concept n>

+
+

<explanation>

+
+
+
+
+ + + \ No newline at end of file diff --git a/architecture/src/09_architecture_decisions.html b/architecture/src/09_architecture_decisions.html new file mode 100644 index 0000000..561eebb --- /dev/null +++ b/architecture/src/09_architecture_decisions.html @@ -0,0 +1,451 @@ + + + + + + + +Architecture Decisions + + + + + +
+
+

Architecture Decisions

+ +
+
+ + + \ No newline at end of file diff --git a/architecture/src/10_quality_requirements.html b/architecture/src/10_quality_requirements.html new file mode 100644 index 0000000..f5b4302 --- /dev/null +++ b/architecture/src/10_quality_requirements.html @@ -0,0 +1,458 @@ + + + + + + + +Quality Requirements + + + + + +
+
+

Quality Requirements

+
+

Quality Tree

+ +
+
+

Quality Scenarios

+ +
+
+
+ + + \ No newline at end of file diff --git a/architecture/src/11_technical_risks.html b/architecture/src/11_technical_risks.html new file mode 100644 index 0000000..f359e54 --- /dev/null +++ b/architecture/src/11_technical_risks.html @@ -0,0 +1,473 @@ + + + + + + + +Risks and Technical Debts + + + + + +
+
+

Risks and Technical Debts

+
+
  • Difficulty to get context data
     • Will the customer be willing to make the effort to provide the necessary data?
  • Difficulty to put a set price on each Kafka item
  • How to integrate general costs like operations, etc. (not linked to a particular Kafka item)
  • Difficulty of integration with the company's cost dashboard
+
+
+
+ + + \ No newline at end of file diff --git a/architecture/src/12_glossary.html b/architecture/src/12_glossary.html new file mode 100644 index 0000000..3d9470d --- /dev/null +++ b/architecture/src/12_glossary.html @@ -0,0 +1,468 @@ + + + + + + + +Glossary + + + + + +
+
+

Glossary

+ ++++ + + + + + + + + + + + + +
Term | Definition
OU   | Organization Unit

+
+
+ + + \ No newline at end of file diff --git a/build.sh b/build.sh new file mode 100755 index 0000000..1cc52ad --- /dev/null +++ b/build.sh @@ -0,0 +1,5 @@ +#!/bin/bash + + +asciidoctor -r asciidoctor-diagram README.adoc + diff --git a/config.html b/config.html new file mode 100644 index 0000000..f2a2921 --- /dev/null +++ b/config.html @@ -0,0 +1,460 @@ + + + + + + + +Untitled + + + + + + +
+ +
+ + + + + + + + + \ No newline at end of file diff --git a/ebook.pdf b/ebook.pdf new file mode 100644 index 0000000..bb98ce1 Binary files /dev/null and b/ebook.pdf differ diff --git a/images/01_2_iso-25010-topics-EN.drawio.png b/images/01_2_iso-25010-topics-EN.drawio.png new file mode 100644 index 0000000..548f6fa Binary files /dev/null and b/images/01_2_iso-25010-topics-EN.drawio.png differ diff --git a/images/05_building_blocks-EN.png b/images/05_building_blocks-EN.png new file mode 100644 index 0000000..0862b64 Binary files /dev/null and b/images/05_building_blocks-EN.png differ diff --git a/images/08-Crosscutting-Concepts-Structure-EN.png b/images/08-Crosscutting-Concepts-Structure-EN.png new file mode 100644 index 0000000..5598a0b Binary files /dev/null and b/images/08-Crosscutting-Concepts-Structure-EN.png differ diff --git a/images/arc42-logo.png b/images/arc42-logo.png new file mode 100644 index 0000000..88c76d0 Binary files /dev/null and b/images/arc42-logo.png differ diff --git a/index.html b/index.html new file mode 120000 index 0000000..b70e3e8 --- /dev/null +++ b/index.html @@ -0,0 +1 @@ +README.html \ No newline at end of file diff --git a/installation/index.html b/installation/index.html new file mode 100644 index 0000000..c4d8549 --- /dev/null +++ b/installation/index.html @@ -0,0 +1,752 @@ + + + + + + + +Installation + + + + + +
+
+

Installation

+
+
+
+

Prerequisites

+
+

This installation manual assumes that

+
+
+
  1. You have a Kafka cluster
  2. You have a schema registry
  3. You have a Kubernetes cluster
+
+
+
+
+

Topics and AVRO schemas

+
+

Kafka Cost Control uses internal topics to compute pricing. You will have to create those topics before deploying the application. The documentation shows the default names; you can change them, but don’t forget to adapt the aggregator configuration.

+
+
+

Reference AVRO schemas

+
+

Some schemas will reference EntityType. Please add it to your schema registry and reference it when needed.

+
+
+
+

Topics

+ ++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Topic name                | Clean up policy | Key               | Value
context-data              | compact         | String            | ContextData
pricing-rule              | compact         | String            | PricingRule
aggregated                | delete          | AggregatedDataKey | AggregatedDataWindowed
aggregated-table-friendly | delete          | AggregatedDataKey | AggregatedDataTableFriendly
metrics-raw-telegraf-dev  | delete          | None              | String

+
+
Context data
+
+

This topic will contain the additional information you wish to attach to the metrics. SEE TODO for more information. This topic is compacted, and it is important that you take care of the key yourself. If you wish to delete a context-data entry, you can send null as the payload (and provide the key you want to delete).

+
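As a sketch, such a tombstone can be produced with kcat; the broker address and the key are placeholders, -K: sets the key delimiter, and -Z turns the empty value into null.

# Delete the context entry stored under the key "my-context-key" by sending a null payload.
echo 'my-context-key:' | kcat -P -b my-broker:9092 -t context-data -K: -Z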
+
+
+
Pricing rule
+
+

This topic will contain the price of each metric. Be aware that most of the metrics will be in bytes. So if you want, for example, a price of 1.0$ per GB, you will need to set the price to 1.0/1024^3 ≈ 0.00000000093$ per byte. The key should be the metric name. If you wish to remove a price value, send a null payload with the key you want to delete. See TODO on how to use the API or the UI to set the price.

+
+
+
+
Aggregated
+
+

This topic will contain the enriched data. This is the result topic of the aggregator.

+
+
+
+
Aggregated table friendly
+
+

This is the exact same thing as aggregated, except there are no hashmaps or other nested fields. Everything has been flattened. This topic makes it easy to use Kafka Connect with a table database.

+
+
+
+
Metrics raw telegraf
+
+

You can have multiple raw topics, for example one per environment or one per Kafka cluster. The topic name is up to you; just don’t forget to configure it properly when you deploy Telegraf (see the Kubernetes section).

+
+
+
+
+
+
+

Kubernetes

+
+

You can find all the deployment files in the deployment folder. This folder uses Kustomize to simplify the deployment of multiple instances with some variations.

+
+
+

Create namespace

+
+

Create a namespace for the application

+
+
+
+
kubectl create namespace kafka-cost-control
+
+
+
+
+

Kafka metric scrapper

+
+

This part is responsible for scraping Kafka for relevant metrics. You will have to deploy one scraper per Kafka cluster. Depending on which metrics you want to provide, you will need a user with read access to the Kafka metrics and also to the Kafka admin client. Read permission is enough! You don’t need a user with write permission.

+
+
+

This documentation will assume that you use the dev/ folder, but you can configure as many Kustomize folders as you want.

+
+
+

Copy the environment sample file:

+
+
+
+
cd deployment/kafka-metric-scrapper/dev
+cp .env.sample .env
+vi .env
+
+
+
+

Edit the environment file with the correct output topic, endpoints and credentials.

+
+
+

Be sure to edit the namespace in the kustomization.yaml file.

+
+
+

Deploy the dev environment using kubectl

+
+
+
+
cd deployment/kafka-metric-scrapper
+kubectl apply -k dev
+
+
+
+

Wait for the deployment to finish and check the output topic for metrics. You should receive new data every minute.

+
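One way to check is to consume the last few records from the raw topic; this is a sketch, and the broker address, credentials and topic name depend on your setup.

# Read the 5 most recent records from the raw metrics topic and exit.
kcat -C -b my-broker:9092 -t metrics-raw-telegraf-dev -o -5 -e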
+
+
+

Kafka cost control

+
+

For this part we will deploy the Kafka Streams application that is responsible for enriching the metrics, TimescaleDB for storing the metrics, a Kafka Connect instance to sink the metrics into the database, a Grafana dashboard, and a simple UI to define prices and contexts.

+
+
+

This documentation will assume that you use the dev/ folder, but you can configure as many Kustomize folders as you want.

+
+
+

Copy the environment sample file:

+
+
+
+
cd deployment/kafka-cost-control/dev
+cp .env.sample .env
+vi .env
+
+
+
+

Edit the environment file with the correct credentials. The database password can be randomly generated. It will be used by kafka connect and grafana.

+
+
+

Be sure to edit the namespace in the kustomization.yaml file.

+
+
+

You may also want to adapt the ingress files to use proper hosts. You will need two hosts: one for Grafana and one for the Kafka Cost Control application.

+
+
+

Deploy the application using kubectl

+
+
+
+
cd deployment/kafka-cost-control
+kubectl apply -k dev
+
+
+
+
+
+
+

Metric database

+
+

In order to store the metrics, we recommend using a time-series database. Feel free to choose one that suits your needs. Be careful to choose one that is compatible with Kafka Connect so you can easily transfer metrics from Kafka to your database. In this example we will assume that you’re using TimescaleDB because it’s the one we provide Kubernetes manifests for.

+
+
+

Database Schema

+
+

Feel free to adapt the partition size to fit your needs. In this example we use 1 day, but please follow the TimescaleDB documentation to choose the right partition size for your use case.

+
+
+
+
CREATE TABLE "kafka_aggregated-table-friendly"
+(
+    "startTime"         TIMESTAMP        NOT NULL,
+    "endTime"           TIMESTAMP        NOT NULL,
+    "entityType"        VARCHAR          NOT NULL,
+    "initialMetricName" VARCHAR          NOT NULL,
+    "name"              VARCHAR          NOT NULL,
+    "value"             DOUBLE PRECISION NOT NULL,
+    "cost"              DOUBLE PRECISION NULL,
+    "tags"              JSONB            NOT NULL,
+    "context"           JSONB            NOT NULL,
+    PRIMARY KEY ("startTime", "endTime", "entityType", "initialMetricName", "name")
+);
+
+SELECT create_hypertable('kafka_aggregated-table-friendly', by_range('startTime', INTERVAL '1 day'));
+
+
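For example, assuming the TimescaleDB service is reachable as timescaledb-service in the kafka-cost-control namespace (as in the connector configuration below) and the statements above are saved in a local schema.sql file, the schema could be applied like this:

# Forward the database port locally, then apply the schema with psql.
kubectl -n kafka-cost-control port-forward svc/timescaledb-service 5432:5432 &
psql "postgresql://postgres:<password>@localhost:5432/postgres" -f schema.sql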
+
+
+
+
+

Kafka connect

+
+

To write data from the Kafka metric topic to the time-series database we will use Kafka Connect.

+
+
+

Please refer to the Kubernetes manifests to deploy a Kafka Connect cluster.

+
+
+

Configuration of the connectors

+
+

Don’t forget to adapt the hosts, users, and passwords.

+
+
+
+
{
+  "name": "kafka-cost-control-aggregated",
+  "config": {
+
+    "tasks.max": "1",
+    "topics": "aggregated-table-friendly",
+    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
+    "connection.url": "jdbc:postgresql://timescaledb-service:5432/postgres?sslmode=disable",
+    "connection.user": "postgres",
+    "connection.password": "password",
+    "insert.mode": "upsert",
+    "auto.create": "false",
+    "table.name.format": "kafka_${topic}",
+    "pk.mode": "record_value",
+    "pk.fields": "startTime,endTime,entityType,initialMetricName,name",
+    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
+    "value.converter": "io.confluent.connect.avro.AvroConverter",
+    "value.converter.schema.registry.url": "https://schema-registry-host",
+    "value.converter.basic.auth.credentials.source": "USER_INFO",
+    "value.converter.basic.auth.user.info": "schema-registry-user:schema-registry-password",
+    "transforms": "flatten",
+    "transforms.flatten.type": "org.apache.kafka.connect.transforms.Flatten$Value",
+    "transforms.flatten.delimiter": "_"
+  }
+}
+
+
+
+

TODO curl command to create the connector

+
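In the meantime, a typical call against the Kafka Connect REST API might look like the following sketch, where the Connect host kafka-connect:8083 is an assumption and connector.json contains the configuration shown above:

# Register the sink connector through the Kafka Connect REST API.
curl -X POST -H "Content-Type: application/json" \
  --data @connector.json \
  http://kafka-connect:8083/connectors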
+
+
+
+
+

Grafana

+
+

TODO

+
+
+
+
+

Telegraf

+
+

TODO

+
+
+
+
+
+ + + \ No newline at end of file diff --git a/installation/src/01_prerequisites.html b/installation/src/01_prerequisites.html new file mode 100644 index 0000000..c4609e4 --- /dev/null +++ b/installation/src/01_prerequisites.html @@ -0,0 +1,466 @@ + + + + + + + +Prerequisites + + + + + +
+
+

Prerequisites

+
+

This installation manual assumes that

+
+
+
  1. You have a Kafka cluster
  2. You have a schema registry
  3. You have a Kubernetes cluster
+
+
+
+ + + \ No newline at end of file diff --git a/installation/src/02_kafka_setup.html b/installation/src/02_kafka_setup.html new file mode 100644 index 0000000..e8e0af7 --- /dev/null +++ b/installation/src/02_kafka_setup.html @@ -0,0 +1,540 @@ + + + + + + + +Topics and AVRO schemas + + + + + +
+
+

Topics and AVRO schemas

+
+

Kafka Cost Control uses internal topics to compute pricing. You will have to create those topics before deploying the application. The documentation shows the default names; you can change them, but don’t forget to adapt the aggregator configuration.

+
+
+

Reference AVRO schemas

+
+

Some schemas will reference EntityType. Please add it to your schema registry and reference it when needed.

+
+
+
+

Topics

+ ++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Topic name                | Clean up policy | Key               | Value
context-data              | compact         | String            | ContextData
pricing-rule              | compact         | String            | PricingRule
aggregated                | delete          | AggregatedDataKey | AggregatedDataWindowed
aggregated-table-friendly | delete          | AggregatedDataKey | AggregatedDataTableFriendly
metrics-raw-telegraf-dev  | delete          | None              | String

+
+
Context data
+
+

This topic will contain the additional information you wish to attach to the metrics. SEE TODO for more information. This topic is compacted, and it is important that you take care of the key yourself. If you wish to delete a context-data entry, you can send null as the payload (and provide the key you want to delete).

+
+
+
+
Pricing rule
+
+

This topic will contain the price of each metric. Be aware that most of the metrics will be in bytes. So if you want, for example, a price of 1.0$ per GB, you will need to set the price to 1.0/1024^3 ≈ 0.00000000093$ per byte. The key should be the metric name. If you wish to remove a price value, send a null payload with the key you want to delete. See TODO on how to use the API or the UI to set the price.

+
+
+
+
Aggregated
+
+

This topic will contain the enriched data. This is the result topic of the aggregator.

+
+
+
+
Aggregated table friendly
+
+

This is the exact same thing as aggregated, except there are no hashmaps or other nested fields. Everything has been flattened. This topic makes it easy to use Kafka Connect with a table database.

+
+
+
+
Metrics raw telegraf
+
+

You can have multiple raw topics, for example one per environment or one per Kafka cluster. The topic name is up to you; just don’t forget to configure it properly when you deploy Telegraf (see the Kubernetes section).

+
+
+
+
+
+ + + \ No newline at end of file diff --git a/installation/src/03_kubernetes.html b/installation/src/03_kubernetes.html new file mode 100644 index 0000000..63fa73c --- /dev/null +++ b/installation/src/03_kubernetes.html @@ -0,0 +1,538 @@ + + + + + + + +Kubernetes + + + + + +
+
+

Kubernetes

+
+

You can find all the deployment files in the deployment folder. This folder uses Kustomize to simplify the deployment of multiple instances with some variations.

+
+
+

Create namespace

+
+

Create a namespace for the application

+
+
+
+
kubectl create namespace kafka-cost-control
+
+
+
+
+

Kafka metric scrapper

+
+

This part is responsible for scraping Kafka for relevant metrics. You will have to deploy one scraper per Kafka cluster. Depending on which metrics you want to provide, you will need a user with read access to the Kafka metrics and also to the Kafka admin client. Read permission is enough! You don’t need a user with write permission.

+
+
+

This documentation will assume that you use the dev/ folder, but you can configure as many Kustomize folders as you want.

+
+
+

Copy the environment sample file:

+
+
+
+
cd deployment/kafka-metric-scrapper/dev
+cp .env.sample .env
+vi .env
+
+
+
+

Edit the environment file with the correct output topic, endpoints and credentials.

+
+
+

Be sure to edit the namespace in the kustomization.yaml file.

+
+
+

Deploy the dev environment using kubectl

+
+
+
+
cd deployment/kafka-metric-scrapper
+kubectl apply -k dev
+
+
+
+

Wait for the deployment to finish and check the output topic for metrics. You should receive new data every minute.

+
+
+
+

Kafka cost control

+
+

For this part we will deploy the Kafka Streams application that is responsible for enriching the metrics, TimescaleDB for storing the metrics, a Kafka Connect instance to sink the metrics into the database, a Grafana dashboard, and a simple UI to define prices and contexts.

+
+
+

This documentation will assume that you use the dev/ folder, but you can configure as many Kustomize folders as you want.

+
+
+

Copy the environment sample file:

+
+
+
+
cd deployment/kafka-cost-control/dev
+cp .env.sample .env
+vi .env
+
+
+
+

Edit the environment file with the correct credentials. The database password can be randomly generated. It will be used by kafka connect and grafana.

+
+
+

Be sure to edit the namespace in the kustomization.yaml file.

+
+
+

You may also want to adapt the ingress files to use proper hosts. You will need two hosts: one for Grafana and one for the Kafka Cost Control application.

+
+
+

Deploy the application using kubectl

+
+
+
+
cd deployment/kafka-cost-control
+kubectl apply -k dev
+
+
+
+
+
+ + + \ No newline at end of file diff --git a/installation/src/04_metric_database.html b/installation/src/04_metric_database.html new file mode 100644 index 0000000..2d8334b --- /dev/null +++ b/installation/src/04_metric_database.html @@ -0,0 +1,478 @@ + + + + + + + +Metric database + + + + + +
+
+

Metric database

+
+

In order to store the metrics, we recommend using a time-series database. Feel free to choose one that suits your needs. Be careful to choose one that is compatible with Kafka Connect so you can easily transfer metrics from Kafka to your database. In this example we will assume that you’re using TimescaleDB because it’s the one we provide Kubernetes manifests for.

+
+
+

Database Schema

+
+

Feel free to adapt the partition size to fit your needs. In this example we use 1 day, but please follow the TimescaleDB documentation to choose the right partition size for your use case.

+
+
+
+
CREATE TABLE "kafka_aggregated-table-friendly"
+(
+    "startTime"         TIMESTAMP        NOT NULL,
+    "endTime"           TIMESTAMP        NOT NULL,
+    "entityType"        VARCHAR          NOT NULL,
+    "initialMetricName" VARCHAR          NOT NULL,
+    "name"              VARCHAR          NOT NULL,
+    "value"             DOUBLE PRECISION NOT NULL,
+    "cost"              DOUBLE PRECISION NULL,
+    "tags"              JSONB            NOT NULL,
+    "context"           JSONB            NOT NULL,
+    PRIMARY KEY ("startTime", "endTime", "entityType", "initialMetricName", "name")
+);
+
+SELECT create_hypertable('kafka_aggregated-table-friendly', by_range('startTime', INTERVAL '1 day'));
+
+
+
+
+
+ + + \ No newline at end of file diff --git a/installation/src/05_kafka_connect.html b/installation/src/05_kafka_connect.html new file mode 100644 index 0000000..3ccebfa --- /dev/null +++ b/installation/src/05_kafka_connect.html @@ -0,0 +1,494 @@ + + + + + + + +Kafka connect + + + + + +
+
+

Kafka connect

+
+

To write data from the Kafka metric topic to the time-series database we will use Kafka Connect.

+
+
+

Please refer to the Kubernetes manifests to deploy a Kafka Connect cluster.

+
+
+

Configuration of the connectors

+
+

Don’t forget to adapt the hosts, users, and passwords.

+
+
+
+
{
+  "name": "kafka-cost-control-aggregated",
+  "config": {
+
+    "tasks.max": "1",
+    "topics": "aggregated-table-friendly",
+    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
+    "connection.url": "jdbc:postgresql://timescaledb-service:5432/postgres?sslmode=disable",
+    "connection.user": "postgres",
+    "connection.password": "password",
+    "insert.mode": "upsert",
+    "auto.create": "false",
+    "table.name.format": "kafka_${topic}",
+    "pk.mode": "record_value",
+    "pk.fields": "startTime,endTime,entityType,initialMetricName,name",
+    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
+    "value.converter": "io.confluent.connect.avro.AvroConverter",
+    "value.converter.schema.registry.url": "https://schema-registry-host",
+    "value.converter.basic.auth.credentials.source": "USER_INFO",
+    "value.converter.basic.auth.user.info": "schema-registry-user:schema-registry-password",
+    "transforms": "flatten",
+    "transforms.flatten.type": "org.apache.kafka.connect.transforms.Flatten$Value",
+    "transforms.flatten.delimiter": "_"
+  }
+}
+
+
+
+

TODO curl command to create the connector

+
+
+
+
+ + + \ No newline at end of file diff --git a/installation/src/06_grafana.html b/installation/src/06_grafana.html new file mode 100644 index 0000000..f5251e4 --- /dev/null +++ b/installation/src/06_grafana.html @@ -0,0 +1,453 @@ + + + + + + + +Grafana + + + + + +
+
+

Grafana

+
+

TODO

+
+
+
+ + + \ No newline at end of file diff --git a/installation/src/07_telegraf.html b/installation/src/07_telegraf.html new file mode 100644 index 0000000..fdedf4c --- /dev/null +++ b/installation/src/07_telegraf.html @@ -0,0 +1,453 @@ + + + + + + + +Telegraf + + + + + +
+
+

Telegraf

+
+

TODO

+
+
+
+ + + \ No newline at end of file diff --git a/manual/index.html b/manual/index.html new file mode 100644 index 0000000..44ce1a5 --- /dev/null +++ b/manual/index.html @@ -0,0 +1,728 @@ + + + + + + + +User Manual + + + + + +
+
+

User Manual

+
+
+
+

Introduction

+
+

This user manual will help you understand Kafka Cost Control and how to use it properly. This document assumes that you already have a running application. If not, please see the Installation section.

+
+
+

At this point you should have access to the Kafka Cost Control UI and to the Grafana Dashboard.

+
+
+

Graphql

+
+

Kafka Cost Control provides a GraphQL endpoint at: <your-host>/graphql

+
+
+

In addition, there is a ready to use GraphQL UI. You can access it by going to the following URL: <your-host>/graphql-ui

+
+
+
+
+
+

Pricing rules

+
+

Pricing rules are a way to put a price on each metric. The price will be applied to the hourly aggregate. Also, it’s common for metrics to be in bytes and not megabytes or gigabytes; keep that in mind when setting the price. For example, if you want to have a price of 1.0$ per GB, you will need to set the price to 1.0/1024^3 ≈ 0.00000000093$ per byte.

+
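As a quick sanity check for that conversion, the per-byte price can be computed on the command line:

# 1 $/GiB expressed as $/byte
awk 'BEGIN { printf "%.12f\n", 1.0 / (1024 * 1024 * 1024) }'
# prints 0.000000000931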
+
+

Pricing rules are stored in kafka in a compacted topic. The key should be the metric name.

+
+
+

Listing pricing rules

+
+
From the UI
+
+

Simply go to the pricing rules tab of the UI. You should see the metric name and its cost.

+
+
+
+
Using Graphql
+
+
+
query getAllRules {
+  pricingRules {
+    creationTime
+    metricName
+    baseCost
+    costFactor
+  }
+}
+
+
+
+
+
+

Setting a pricing rule

+
+
From the UI
+
+

Not available yet.

+
+
+
+
Using Graphql
+
+
+
mutation saveRule {
+  savePricingRule(
+    request: {metricName: "whatever", baseCost: 0.12, costFactor: 0.0001}
+  ) {
+    creationTime
+    metricName
+    baseCost
+    costFactor
+  }
+}
+
+
+
+
+
+

Removing a pricing rule

+
+
From the UI
+
+

Not available yet.

+
+
+
+
Using Graphql
+
+
+
mutation deleteRule {
+  deletePricingRule(request: {metricName: "whatever"}) {
+    creationTime
+    metricName
+    baseCost
+    costFactor
+  }
+}
+
+
+
+
+
+
+
+

Context data

+
+

Context data are a way to attach a context (basically attributes) to a Kafka item (topic, principal, …​). You define a set of key/values for every item that matches a regex. It is possible that one item matches multiple regexes (and thus multiple contexts), but in this case you have to be careful not to have conflicting key/values.

+
+
+

You can have as many key/values as you want. They will be used to sum up prices in the dashboard. It is therefore important that you have at least one key/value that defines the cost unit or organization unit. For example: organization_unit=department1.

+
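For example, with the TimescaleDB sink described in the installation manual, the total cost per organization unit could be summed up like this. This is a sketch: the connection string and the exact column layout depend on your sink configuration.

# Sum the cost per organization_unit over the flattened aggregate table.
psql "$DATABASE_URL" -c "
  SELECT context->>'organization_unit' AS organization_unit,
         sum(cost)                      AS total_cost
  FROM   \"kafka_aggregated-table-friendly\"
  GROUP  BY 1
  ORDER  BY total_cost DESC;"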
+
+

The context data are stored in kafka in a compacted topic. The key is free for the user to choose.

+
+
+

Listing existing context data

+
+
From the UI
+
+

Simply go to the context tab of the UI. You should see all the context with their type, regex, validity time and key/values.

+
+
+
+
Using Graphql
+
+
+
query getContextData {
+  contextData {
+    id
+    creationTime
+    validFrom
+    validUntil
+    entityType
+    regex
+    context {
+      key
+      value
+    }
+  }
+}
+
+
+
+
+
+

Setting context data

+
+

If you want to create a new context, you can omit the id. If no id is set, the API will generate one for you using a UUID. If you use an id that is not yet in the system, you are creating a new context item.

+
+
+
From the UI
+
+

Not available yet.

+
+
+
+
Using Graphql
+
+
+
mutation saveContextData {
+  saveContextData(
+    request: {id: "323b603d-5b5f-48d2-84fc-4e784e942289", entityType: TOPIC, regex: ".*collaboration", context: [{key: "app", value: "agoora"}, {key: "cost-unit", value: "spoud"}, {key: "domain", value: "collaboration"}]}
+  ) {
+    id
+    creationTime
+    entityType
+    regex
+    context {
+      key
+      value
+    }
+  }
+}
+
+
+
+
+
+

Removing context data

+
+
From the UI
+
+

Not available yet.

+
+
+
+
Using Graphql
+
+
+
mutation deleteContextData {
+  deleteContextData(request: {id: "323b603d-5b5f-48d2-84fc-4e784e942289"}) {
+    id
+    creationTime
+    entityType
+    regex
+    context {
+      key
+      value
+    }
+  }
+}
+
+
+
+
+
+
+
+

Reprocess

+
+

Reprocessing should only be used when you made a mistake, fixed it, and want to reprocess the raw data. Reprocessing will induce a lag, meaning data will not be live for a little while. Depending on how much data you want to reprocess, this can take minutes or hours, so be sure you know what you are doing. After the reprocessing is done, the data will be live again. Reprocessing will NOT lose data; it will just take a bit of time for it to appear live again.

+
+
+

Be aware that the reprocessing action may take a while to complete (usually about 1 minute), so be patient with the request.

+
+
+

The process is as follows:

+
+
+
  • The user requests reprocessing
  • The KafkaCostControl MetricProcess Kafka Streams application will stop
  • Wait for all consumers to stop and for Kafka to release the consumer group (this may take time)
  • KafkaCostControl will look for the offset of the timestamp requested for the reprocessing (if no timestamp is requested, it will just seek to zero)
  • KafkaCostControl will self-destruct in order for Kubernetes to restart it (you may see a restart count increasing)
  • The KafkaCostControl Kafka Streams application will resume from the offset defined by the timestamp you gave
+
+
+

The metric database should be idempotent, meaning it should be able to accept updates. Otherwise, you will need to clean the database yourself before a reprocessing.

+
+
+

Using the UI

+
+
  • Go to the Others tab.
  • Choose a date for the start time of the reprocessing (empty means from the beginning of time). You can help yourself with the quick button on top.
  • Click on reprocess.
  • Confirm the reprocessing.
+
+
+
Using Graphql
+
+
+
mutation reprocess {
+  reprocess(areYouSure: "no", startTime:"2024-01-01T00:00:00Z")
+}
+
+
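The same mutation can also be sent over plain HTTP, for example with curl; this is a sketch assuming the GraphQL API is exposed under /graphql on your host.

# Trigger reprocessing from 2024-01-01 via the GraphQL HTTP endpoint.
curl -X POST -H "Content-Type: application/json" \
  --data '{"query":"mutation { reprocess(areYouSure: \"no\", startTime: \"2024-01-01T00:00:00Z\") }"}' \
  "https://<your-host>/graphql"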
+
+
+
+
+
+
+ + + \ No newline at end of file diff --git a/manual/src/context_data.html b/manual/src/context_data.html new file mode 100644 index 0000000..3aa939b --- /dev/null +++ b/manual/src/context_data.html @@ -0,0 +1,551 @@ + + + + + + + +Context data + + + + + +
+
+

Context data

+
+

Context data are a way to attach a context (basically attributes) to a Kafka item (topic, principal, …​). You define a set of key/values for every item that matches a regex. It is possible that one item matches multiple regexes (and thus multiple contexts), but in this case you have to be careful not to have conflicting key/values.

+
+
+

You can have as many key/values as you want. They will be used to sum up prices in the dashboard. It is therefore important that you have at least one key/value that defines the cost unit or organization unit. For example: organization_unit=department1.

+
+
+

The context data are stored in kafka in a compacted topic. The key is free for the user to choose.

+
+
+

Listing existing context data

+
+
From the UI
+
+

Simply go to the context tab of the UI. You should see all the context with their type, regex, validity time and key/values.

+
+
+
+
Using Graphql
+
+
+
query getContextData {
+  contextData {
+    id
+    creationTime
+    validFrom
+    validUntil
+    entityType
+    regex
+    context {
+      key
+      value
+    }
+  }
+}
+
+
+
+
+
+

Setting context data

+
+

If you want to create a new context, you can omit the id. If no id is set, the API will generate one for you using a UUID. If you use an id that is not yet in the system, you are creating a new context item.

+
+
+
From the UI
+
+

Not available yet.

+
+
+
+
Using Graphql
+
+
+
mutation saveContextData {
+  saveContextData(
+    request: {id: "323b603d-5b5f-48d2-84fc-4e784e942289", entityType: TOPIC, regex: ".*collaboration", context: [{key: "app", value: "agoora"}, {key: "cost-unit", value: "spoud"}, {key: "domain", value: "collaboration"}]}
+  ) {
+    id
+    creationTime
+    entityType
+    regex
+    context {
+      key
+      value
+    }
+  }
+}
+
+
+
+
+
+

Removing context data

+
+
From the UI
+
+

Not available yet.

+
+
+
+
Using Graphql
+
+
+
mutation deleteContextData {
+  deleteContextData(request: {id: "323b603d-5b5f-48d2-84fc-4e784e942289"}) {
+    id
+    creationTime
+    entityType
+    regex
+    context {
+      key
+      value
+    }
+  }
+}
+
+
+
+
+
+
+ + + \ No newline at end of file diff --git a/manual/src/dashboard.html b/manual/src/dashboard.html new file mode 100644 index 0000000..4ca8f08 --- /dev/null +++ b/manual/src/dashboard.html @@ -0,0 +1,456 @@ + + + + + + + +Dashboard + + + + + +
+
+

Dashboard

+
+

The dashboard allows you to consult aggregated data. You can easily group them by context item.

+
+
+

TODO

+
+
+
+ + + \ No newline at end of file diff --git a/manual/src/introduction.html b/manual/src/introduction.html new file mode 100644 index 0000000..bb9a2d3 --- /dev/null +++ b/manual/src/introduction.html @@ -0,0 +1,465 @@ + + + + + + + +Introduction + + + + + +
+
+

Introduction

+
+

This user manual will help you understand Kafka Cost Control and how to use it properly. This document assumes that you already have a running application. If not, please see the Installation section.

+
+
+

At this point you should have access to the Kafka Cost Control UI and to the Grafana Dashboard.

+
+
+

Graphql

+
+

Kafka Cost Control provides a GraphQL endpoint at: <your-host>/graphql

+
+
+

In addition, there is a ready to use GraphQL UI. You can access it by going to the following URL: <your-host>/graphql-ui

+
+
+
+
+ + + \ No newline at end of file diff --git a/manual/src/pricing_rules.html b/manual/src/pricing_rules.html new file mode 100644 index 0000000..c12d451 --- /dev/null +++ b/manual/src/pricing_rules.html @@ -0,0 +1,531 @@ + + + + + + + +Pricing rules + + + + + +
+
+

Pricing rules

+
+

Pricing rules are a way to put a price on each metric. The price will be applied to the hourly aggregate. Also, it’s common for metrics to be in bytes and not megabytes or gigabytes; keep that in mind when setting the price. For example, if you want to have a price of 1.0$ per GB, you will need to set the price to 1.0/1024^3 ≈ 0.00000000093$ per byte.

+
+
+

Pricing rules are stored in kafka in a compacted topic. The key should be the metric name.

+
+
+

Listing pricing rules

+
+
From the UI
+
+

Simply go to the pricing rules tab of the UI. You should see the metric name and its cost.

+
+
+
+
Using Graphql
+
+
+
query getAllRules {
+  pricingRules {
+    creationTime
+    metricName
+    baseCost
+    costFactor
+  }
+}
+
+
+
+
+
+

Setting a pricing rule

+
+
From the UI
+
+

Not available yet.

+
+
+
+
Using Graphql
+
+
+
mutation saveRule {
+  savePricingRule(
+    request: {metricName: "whatever", baseCost: 0.12, costFactor: 0.0001}
+  ) {
+    creationTime
+    metricName
+    baseCost
+    costFactor
+  }
+}
+
+
+
+
+
+

Removing a pricing rule

+
+
From the UI
+
+

Not available yet.

+
+
+
+
Using Graphql
+
+
+
mutation deleteRule {
+  deletePricingRule(request: {metricName: "whatever"}) {
+    creationTime
+    metricName
+    baseCost
+    costFactor
+  }
+}
+
+
+
+
+
+
+ + + \ No newline at end of file diff --git a/manual/src/reprocess.html b/manual/src/reprocess.html new file mode 100644 index 0000000..8ab66fd --- /dev/null +++ b/manual/src/reprocess.html @@ -0,0 +1,513 @@ + + + + + + + +Reprocess + + + + + +
+
+

Reprocess

+
+

Reprocessing should only be used when you made a mistake, fixed it, and want to reprocess the raw data. Reprocessing will induce a lag, meaning data will not be live for a little while. Depending on how much data you want to reprocess, this can take minutes or hours, so be sure you know what you are doing. After the reprocessing is done, the data will be live again. Reprocessing will NOT lose data; it will just take a bit of time for it to appear live again.

+
+
+

Be aware that the reprocessing action may take a while to complete (usually about 1 minute), so be patient with the request.

+
+
+

The process is as follows:

+
+
+
    +
  • +

    use request reprocessing

    +
  • +
  • +

    KafkaCostControl MetricProcess kafka stream application will stop

    +
  • +
  • +

    Wait for all consumers to stop and for kafka to release the consumer group (this may take time)

    +
  • +
  • +

    KafkaCostControl will look for the offset of the timestamp requested for the reprocessing (if no timestamp is requested, it will just seek to zero)

    +
  • +
  • +

    KafkaCostControl will self-destruct in order for kubernetes to restart it (you may see a restart count increasing)

    +
  • +
  • +

    KafkaCostControl kafka stream application will resume from the offset defined by the timestamp you gave

    +
  • +
+
+
+

The metric database should be idempotent, meaning it should be able to accept updates. Otherwise, you will need to clean the database yourself before a reprocessing.

+
+
+

Using the UI

+
+
    +
  • +

    Go to the Others tab.

    +
  • +
  • +

    Choose a date for the start time of the reprocessing (empty means from the beginning of time). You can help yourself with the quick button on top.

    +
  • +
  • +

    Click on reprocess

    +
  • +
  • +

    Confirm the reprocessing

    +
  • +
+
+
+
Using Graphql
+
+
+
mutation reprocess {
+  reprocess(areYouSure: "no", startTime:"2024-01-01T00:00:00Z")
+}
+
+
+
+
+
+
+ + + \ No newline at end of file