- Problem: This project aims to provide a solution for handling sudden spikes in workload in Apache Flink applications. This is an important problem because Apache Flink is a distributed stream processing framework designed to handle large amounts of data in real time. However, when the input rate exceeds the processing capacity, the system can become overwhelmed, leading to increased latency, degraded performance, and even failures.
- Technique: Cloud bursting is an alternative to back-pressure, the mechanism that slows down the input rate when the system is overwhelmed. Instead of throttling the input, cloud bursting offloads the excess workload to a cloud provider, such as AWS, where it can be processed in parallel. This helps the system continue to operate at a high level of performance even when there is a sudden spike in workload.
- Target audience: People who will benefit from solving this problem include organizations that use Apache Flink for real-time data processing, particularly where the workload can fluctuate rapidly. This solution can improve the reliability and performance of these systems and make them more scalable and cost-effective. It can also benefit developers and data scientists who use Apache Flink by providing a more robust and flexible tool for processing large amounts of data in real time.
- Source
  - Nexmark
  - Yahoo Streaming Benchmark
- Operator
  - Flatmap in Nexmark (Java)
  - Flatmap in Nexmark (Python)
- Cloud
  - AWS Lambda Function
  - Azure Functions
- Sink
  - Any operator or sink
The data is generated from one of the sources and processed in the operator; the result is then emitted to the sink. Flink metrics can be used to detect the traffic of the input event stream. We might use metrics such as JVM heap usage and event input rate to detect the traffic in the operator, and based on these metrics we would determine a suitable offloading portion from our experiments. After the computation is done in the lambda function, it sends the result back to the operator. Finally, the operator sends the data to the sink or to another operator.
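As a rough illustration of the detection side, the Java sketch below registers a Flink `Meter` for the operator's input rate and exposes a hypothetical offloading fraction derived from it. The class name, metric names, and the threshold value are our assumptions, not part of Flink or Nexmark.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Gauge;
import org.apache.flink.metrics.Meter;
import org.apache.flink.metrics.MeterView;
import org.apache.flink.util.Collector;

// Sketch: a tokenizing flatmap that tracks its own input rate through Flink's
// metric system and derives an offloading fraction from it.
public class RateAwareTokenizer extends RichFlatMapFunction<String, String> {

    // Hypothetical threshold (events/second); the real value would come from experiments.
    private static final double RATE_THRESHOLD = 10_000.0;

    private transient Meter inputRate;

    @Override
    public void open(Configuration parameters) {
        // Per-second input rate, averaged over the last 60 seconds.
        inputRate = getRuntimeContext().getMetricGroup()
                .meter("eventsInRate", new MeterView(60));

        // Expose the current offloading decision as a gauge so it can be observed
        // alongside built-in metrics such as JVM heap usage and backpressure.
        getRuntimeContext().getMetricGroup()
                .gauge("offloadFraction", (Gauge<Double>) this::currentOffloadFraction);
    }

    // Fraction of events that should be sent to the cloud function;
    // 0 means everything is processed locally.
    private double currentOffloadFraction() {
        double rate = Math.max(inputRate.getRate(), 1.0);
        return Math.max(0.0, 1.0 - RATE_THRESHOLD / rate);
    }

    @Override
    public void flatMap(String line, Collector<String> out) {
        inputRate.markEvent();
        // Local processing only; the offloading path itself is sketched under Step 1.
        for (String token : line.toLowerCase().split("\\W+")) {
            out.collect(token);
        }
    }
}
```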
- We choose flatmap as a starting point because it is stateless: we do not need to consider aggregation operations or the order of events.
- We choose to send events from the operator to the cloud and to send results from the cloud back to the operator, because this is the simplest integration point.
- We choose to offload events to the cloud so that we can directly observe the difference in workload between using only the operator and using the cloud bursting technique.
- When a bursting workload happens, we expect our solution to detect the bursting input stream and locate the bottleneck node.
- Our solution will then decide whether cloud bursting is necessary and spawn lambda functions to handle the excess load.
- We expect our solution to relieve the bottleneck node and achieve better performance than backpressure in terms of latency, throughput, and processing time of the node.
- Therefore, our solution will provide a more efficient and scalable approach compared to backpressure mechanisms.
> Assumptions:
- The system is fault tolerant.
- The system is secure and does not need any additional security features.

We do plan to incorporate all of the above in our future work.
Step 1
We will start with a subsystem of the problem. For the above figure, we have the following steps:
- We consider a flatmap that just tokenizes text as our operator
- Find out the metrics affecting the streaming pipeline
- Register our lambda function on aws
- Connect our lambda function to our operator
- Detect when the queue fills up for the tokenizer (flatmap) and direct half of the load to our lambda function
- Merge the results produced by the operator and the lambda function
- Measure the latency of offloading the work to the lambda function (a minimal offloading sketch follows this list)
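To make the detect/offload/merge steps concrete, here is a minimal operator-side sketch. It assumes the AWS SDK for Java (v1) and a hypothetical Lambda function named `tokenizer-fn`; a fixed 50/50 split stands in for the metric-driven decision sketched earlier.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

import com.amazonaws.services.lambda.AWSLambda;
import com.amazonaws.services.lambda.AWSLambdaClientBuilder;
import com.amazonaws.services.lambda.model.InvokeRequest;
import com.amazonaws.services.lambda.model.InvokeResult;

// Sketch: when overloaded, every other line is tokenized by the (hypothetical)
// "tokenizer-fn" Lambda instead of locally, and the results are merged back
// into the same output stream.
public class BurstingTokenizer extends RichFlatMapFunction<String, String> {

    private transient AWSLambda lambda;
    private transient long seen;
    // In the prototype this flag would be driven by the metrics sketched earlier.
    private transient boolean overloaded;

    @Override
    public void open(Configuration parameters) {
        lambda = AWSLambdaClientBuilder.defaultClient();
        overloaded = true; // force the offloading path for the experiment
    }

    @Override
    public void flatMap(String line, Collector<String> out) {
        seen++;
        if (overloaded && seen % 2 == 0) {
            // Offload: synchronous invocation; a real version would batch events
            // and invoke asynchronously to hide part of the round-trip latency.
            String jsonPayload = "\"" + line.replace("\"", "\\\"") + "\""; // simplified JSON escaping
            InvokeRequest request = new InvokeRequest()
                    .withFunctionName("tokenizer-fn")
                    .withPayload(ByteBuffer.wrap(jsonPayload.getBytes(StandardCharsets.UTF_8)));

            long start = System.nanoTime();
            InvokeResult result = lambda.invoke(request);
            long offloadMillis = (System.nanoTime() - start) / 1_000_000; // could be reported as a metric

            // Merge: the Lambda is assumed to return the tokens as one
            // space-separated JSON string (see the handler sketch in Step 2).
            String payload = StandardCharsets.UTF_8.decode(result.getPayload()).toString();
            for (String token : payload.replace("\"", "").trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    out.collect(token);
                }
            }
        } else {
            // Local path: tokenize in the operator itself.
            for (String token : line.toLowerCase().split("\\W+")) {
                out.collect(token);
            }
        }
    }
}
```

The `offloadMillis` value corresponds to the last step above: it measures the round-trip latency of offloading a single line to the lambda function.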
Step 2
After implementing the first prototype in Step 1, we plan to evaluate our streaming pipeline by:
- Connecting the lambda function to the source or to the sink instead of the operator
- Considering different language implementations on the lambda function (like Python, Java); a Java handler sketch follows this list
- Working with more complicated dataflow graphs, including stateful and stateless operators
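For the Java variant of the lambda function, the cloud side could be as small as the handler below. The handler name and the line-in/space-separated-tokens-out contract are our assumptions and simply mirror the operator-side sketch above.

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

// Sketch of the cloud-side tokenizer: it receives one line of text and returns
// the tokens as a single space-separated string.
public class TokenizerHandler implements RequestHandler<String, String> {

    @Override
    public String handleRequest(String line, Context context) {
        return String.join(" ", line.toLowerCase().split("\\W+"));
    }
}
```

Deploying this handler corresponds to the "Register our lambda function on aws" step in Step 1; a Python implementation with the same contract would let us compare the two language variants.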
- The dataflow pipeline in Nexmark works as intended, without data/packet loss.
- The cloud bursting technique should improve the performance of the entire dataflow with respect to latency and act as an alternative to back-pressure.
- The architecture is tuned well enough to make good decisions on when to offload data packets to the cloud function, without incurring more processing delay than the original architecture.
- We devise a method to handle data merging and data propagation for both stateful and stateless operators without loss of order or data.
- Cloud function instances are created/destroyed based on requirement, reducing cost and resource utilization.
Intermediate milestones:
- Setting up the basic Flink pipeline with one stateless operator and connecting the operator to the lambda function to experiment with performance and scenarios (a minimal pipeline sketch follows this list).
- Extending the Flink pipeline setup to test it against one stateful operator.
- Experimenting with cloud bursting on a combination of stateful and stateless operators (operator chaining). Identifying the operator causing the bottleneck before attaching the lambda function would be an additional step.
- Experimenting with more complicated dataflow graphs to estimate throughput improvement, latency improvement, and the overall performance impact of the cloud bursting technique.
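As a concrete picture of the first milestone, a minimal pipeline with one stateless operator might look like the sketch below; the socket source and print sink are placeholders for the Nexmark source and the actual sink.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Sketch of the first milestone: one source, one stateless flatmap, one sink.
public class BasicPipeline {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)   // placeholder for the Nexmark source
           .flatMap(new BurstingTokenizer())      // the stateless operator under test
           .print();                              // placeholder sink ("any operator or sink")

        env.execute("Cloud bursting prototype");
    }
}
```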
| Task | Subtask | Assigned To |
|---|---|---|
| Prerequisites: src -> tokenizer -> fmap | 1) how to generate generic data type for flatmap | All Team Members |
| | 2) read through nexmark documentation | |
| | 3) get input events in flatmap irrespective of source | |
| Bash script | 1) generate events at specific interval based on need | Sakshi |
| Process Function | 1) how to build the function | All Team Members |
| | 2) restrictions of the function | |
| | 3) similarity to flatmap -> are they interchangeable? | |
| Flink Pipeline | 1) setup Flink pipeline -> jar required for step | Yujie Yan |
| | 2) test the initial pipeline | Ye Tian |
| Metrics | 1) get metrics from Flink documentation | Sakshi Sharma |
| | 2) gather metric data under normal conditions | Sanath Bhimsen |
| | 3) generate back pressure in pipeline | |
| | 4) gather second set of metrics | |
| Cloud Integration | 1) setup cloud resources | Showndarya Madhavan |
| | 2) setup integration with Flink -> cross platform | Anwesha Saha |
| | 3) connections to 3rd party systems | |
| | 4) expose lambda to outside environment | |
| Final Integration: Create -> Test -> Overload -> Test | 1) stop cloud and verify data goes to nexmark | All Team Members |
| | 2) stop nexmark and verify data goes to cloud | |
| | 3) ensure no loss of data | |
| Experiment -> identify success indicators are met | 1) model performs load balancing by offloading to aws lambda when overloaded | All Team Members |
The task distribution was made keeping in mind the strengths and weaknesses of each team member. All of us collectively have some Java coding experience, whereas Flink and stream processing is a new area we are delving into. On this basis, we decided to perform some of the tasks as a group. The tasks assigned individually are the ones we believe can be performed in parallel, where multiple group members carry out separate tasks that can then be integrated. Having said this, we have considered that any of us might face difficulty in an assigned area and will redistribute or discuss accordingly whenever required.