Skip to content

Latest commit

 

History

History
42 lines (26 loc) · 2.12 KB

spark-streaming-windoweddstreams.adoc

File metadata and controls

42 lines (26 loc) · 2.12 KB

WindowedDStreams

WindowedDStream (aka windowed stream) is an internal DStream with dependency on the parent stream.

Note
It is the result of window operators.

windowDuration has to be a multiple of the parent stream’s slide duration.

slideDuration has to be a multiple of the parent stream’s slide duration.

Note
When windowDuration or slideDuration are not multiples of the parent stream’s slide duration, Exception is thrown.

The parent’s RDDs are automatically changed to be persisted at MEMORY_ONLY_SER storage level (since they need to last longer than the parent’s slide duration for this stream to generate its own RDDs).

Obviously, slide duration of the stream is given explicitly (and must be a multiple of the parent’s slide duration).

parentRememberDuration is extended to cover the parent’s rememberDuration and the window duration.

compute method always returns a RDD, either PartitionerAwareUnionRDD or UnionRDD, depending on the number of the Partitioner defined by the RDDs in the window. It uses slice operator on the parent stream (using the slice window of [now - windowDuration + parent.slideDuration, now]).

If only one partitioner is used across the RDDs in window, PartitionerAwareUnionRDD is created and you should see the following DEBUG message in the logs:

DEBUG WindowedDStream: Using partition aware union for windowing at [time]

Otherwise, when there are multiple different partitioners in use, UnionRDD is created and you should see the following DEBUG message in the logs:

DEBUG WindowedDStream: Using normal union for windowing at [time]
Tip

Enable DEBUG logging level for org.apache.spark.streaming.dstream.WindowedDStream logger to see what happens inside WindowedDStream.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.streaming.dstream.WindowedDStream=DEBUG