Skip to content

TimeSeries

Roy Brondgeest edited this page Aug 6, 2018 · 9 revisions

Introduction

TimeSeries is a part of our Scala DSL that allows one to group timestamped records by periods of time such as weeks, months, quarters etc., by transforming the UTC timestamp columns of each record.

Motivation

Clickhouse is fast enough that we are able to store our raw unprocessed data directly, while we are still able to use it to generate complex reports "on-the-fly". Clickhouse however, does not offer a simple way to apply deterministic interval grouping. By having a simple operator to group our data by its timestamp on any interval for any timezone allows for flexibility in queries (that clickhouse otherwise does not offer) as well as to create aggregations / views on our data.

Explaining MultiInterval: An extension of joda.time.Interval with deterministic sub intervals.

For offering this functionality, we introduced a new class MultiInterval. By understanding what it does, the clickhouse DSL feature becomes more obvious.

Take a look at below example:

val jan = new DateTime("2018-01-01T00:00:00+01:00")
val dec = new DateTime("2018-12-31T23:59:59+01:00")

//This represents an interval of the entire year of 2018, with subintervals of 2 months, in Amsterdam/Europe timezone.
val yearBiMonthlyAmsterdam = new MultiInterval(jan, dec, MultiDuration(2,TimeUnit.Month)); 

If we now do yearBiMonthlyAmsterdam.subIntervals() we will receive a Seq[Interval] with joda.time.Interval in the same timezone (jan/feb, mar/apr, jun/jul etc. etc.)

Automatic expansion of MultiInterval to its SubIntervals

If we would provide 2nd of January as the start date, some days within the MultiInterval belong to a subinterval which doesn't fit. MultiInterval automatically expands its interval to fit the entirety of its subintervals ranges.

Therefore we will see the below behaviour:

val midJan = new DateTime("2018-01-21T00:00:00+01:00")
val janToDec = MultiInterval(midJan, dec, MultiDuration(2,TimeUnit.Month)); 

janTwoToDec.getStart();
//org.joda.time.DateTime = 2018-01-01T00:00:00+02:00

Using MultiInterval in Clickhouse DSL

ActionID ActionType Timestamp UserID
1 Click 1330684623 1
2 Click 1331348034 2
3 Buy 1323523246 2

For the above table, which we will call ActionTable we could do:

select(
  uniq(ActionTable.ActionID) as action_count,
  uniq(ActionTable.ActionID) as user_count,
  ActionTable.ActionType,
  timeSeries(ActionTable.TimeStamp, yearBiMonthlyAmsterdam) as interval_start_time
)
.from(
  ActionTable
)
.groupBy(
  interval_start_time, 
  ActionType
)

In this example, we provide the MultiDuration of 2 month subintervals for 2018, in amsterdam timezone (see previous code examples for the definition of this interval). The timeSeries operator will bring back the timestamp to the start of a bi-monthly interval, also taking into account the timezone set for this MultiInterval. Important! It assumes that the timestamp of the record is UTC timezone.

Deterministic subIntervals

The subIntervals of a multi interval are both in clickhouse as well as in scala/java deterministic. That means that if you would take a "5 month" interval, the first interval within 2018 might actually start in november of 2017, as 5 months do not align with 12 months a year.

The reference point for these deterministic "chunks" is decided upon by the clickhouse getRelativeXXXNum functions.

See their documentation for more information on this: https://clickhouse.yandex/docs/en/query_language/functions/date_time_functions/#torelativeyearnum

Todo pages:

Clone this wiki locally