MapReduce design pattern #2927

iluwatar · 2024-04-20T12:16:51Z

The MapReduce design pattern processes large volumes of data in a distributed, parallel manner, improving scalability and performance by spreading the work across multiple processing nodes. It has its roots in the map and reduce operations of functional programming and was popularized by Google as a way to process huge datasets across many servers. Here’s a breakdown of its intent, main components, and data flow:

Intent

The main intent of the MapReduce design pattern is to process large data sets with a distributed algorithm, reducing overall computation time by spreading the work across many parallel computing nodes. The pattern shields the developer from the complexity of concurrency and hides the details of data distribution, fault tolerance, and load balancing, making it an effective model for processing vast amounts of data.

Main Components

The MapReduce design pattern primarily consists of three components:

Map Function: This component takes an input pair and produces a set of intermediate key/value pairs. The Map tasks are distributed across different nodes so that each node processes a subset of the data independently of others.
Reduce Function: This component processes all the intermediate values associated with the same intermediate key. It merges these values to form a possibly smaller set of values. Typically, a single Reduce call consumes values that were produced by several different Map tasks.
Master Node: The Master node orchestrates the process by dividing the input data into smaller sub-problems and assigning them to worker nodes. After the workers complete their tasks, the Master node collects the answers to form the output dataset.
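
To make the Map and Reduce components above concrete, here is a minimal sketch in Java. It assumes word counting as the workload, which is only an illustration and not part of this issue's requirements. The class and method names (WordCountFunctions, map, reduce) are hypothetical, and the Master node is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical names; word counting is only an illustrative workload.
class WordCountFunctions {

  // Map: one input record (a line of text) -> intermediate (word, 1) pairs.
  static List<Map.Entry<String, Integer>> map(String line) {
    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
    for (String word : line.toLowerCase(Locale.ROOT).split("\\W+")) {
      if (!word.isEmpty()) { // split can yield an empty leading token
        pairs.add(Map.entry(word, 1));
      }
    }
    return pairs;
  }

  // Reduce: one key plus all of its intermediate values -> one merged value.
  static int reduce(String key, List<Integer> values) {
    int sum = 0;
    for (int value : values) {
      sum += value;
    }
    return sum;
  }

  public static void main(String[] args) {
    System.out.println(map("to be or not to be"));
    System.out.println(reduce("to", List.of(1, 1))); // prints 2
  }
}
```

Keeping both functions free of shared state is what would let a framework run many Map calls, and many Reduce calls, independently of each other.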

Typical Data Flow

The typical data flow in a MapReduce operation involves several key steps:

Input Slicing: The input data is divided into smaller chunks, which are then assigned to different worker nodes for processing. This is usually handled by the Master node.
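Map Phase: Each worker node processes its assigned chunk of data, applying the Map function to each element. The results are intermediate key/value pairs, which are buffered in memory and, in large-scale implementations such as Google's, periodically written to the worker's local disk.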
Shuffling: After the Map phase, the system reorganizes the intermediate data so that all data belonging to one key is sent to the same reducer. This involves sorting and transferring data across nodes.
Reduce Phase: Each reducer node processes the intermediate data for the keys assigned to it. The Reduce function is applied once per key to merge that key's values into a smaller set of values or a single output value.
Output Generation: The final output of the Reduce functions is collected and often stored in a file system or returned to the application.
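
The Shuffling step in the list above is often implemented by hashing each key to pick a reducer. The following is a small sketch under that assumption. ShuffleSketch, partitionFor, and shuffle are hypothetical names, and a real system would move the data between nodes rather than between in-memory maps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical names; in-memory maps stand in for the data sent to reducers.
class ShuffleSketch {

  // Hash partitioning: every occurrence of a key maps to the same reducer index.
  static int partitionFor(String key, int numReducers) {
    return Math.floorMod(key.hashCode(), numReducers);
  }

  // Groups intermediate pairs into one sorted key -> values map per reducer.
  static List<Map<String, List<Integer>>> shuffle(
      List<Map.Entry<String, Integer>> pairs, int numReducers) {
    List<Map<String, List<Integer>>> partitions = new ArrayList<>();
    for (int i = 0; i < numReducers; i++) {
      partitions.add(new TreeMap<>()); // TreeMap keeps keys sorted
    }
    for (Map.Entry<String, Integer> pair : pairs) {
      partitions.get(partitionFor(pair.getKey(), numReducers))
          .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
          .add(pair.getValue());
    }
    return partitions;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Integer>> pairs =
        List.of(Map.entry("to", 1), Map.entry("be", 1), Map.entry("to", 1));
    System.out.println(shuffle(pairs, 2));
  }
}
```

Math.floorMod is used instead of % because hashCode() can be negative, and a negative index would not identify a valid reducer.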

By breaking down data into smaller pieces that can be processed in parallel, and by organizing the processing so that each stage builds appropriately on the last, MapReduce can efficiently handle tasks that are too large for a single processing unit. This model is well-suited for tasks like large-scale text processing, data mining, and log analysis.
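
The five steps of the data flow can be sketched end to end in a single JVM. This is only an illustration under loud assumptions: threads in a fixed pool stand in for worker nodes, the calling thread plays the Master, word counting is the workload, and there is no fault tolerance or network transfer. InProcessMapReduce, run, and await are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: threads stand in for worker nodes.
class InProcessMapReduce {

  static List<Map.Entry<String, Integer>> map(String line) {
    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
    for (String word : line.toLowerCase(Locale.ROOT).split("\\W+")) {
      if (!word.isEmpty()) {
        pairs.add(Map.entry(word, 1));
      }
    }
    return pairs;
  }

  static int reduce(String key, List<Integer> values) {
    return values.stream().mapToInt(Integer::intValue).sum();
  }

  // Waits for a task and converts checked exceptions into unchecked ones.
  static <T> T await(Future<T> task) {
    try {
      return task.get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    }
  }

  static Map<String, Integer> run(List<String> input, int workers) {
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    try {
      // 1. Input slicing: split the input into roughly equal chunks.
      int chunkSize = Math.max(1, (input.size() + workers - 1) / workers);
      List<Future<List<Map.Entry<String, Integer>>>> mapTasks = new ArrayList<>();
      for (int start = 0; start < input.size(); start += chunkSize) {
        List<String> chunk = input.subList(start, Math.min(start + chunkSize, input.size()));
        // 2. Map phase: each chunk is mapped independently on a worker thread.
        mapTasks.add(pool.submit(() -> {
          List<Map.Entry<String, Integer>> out = new ArrayList<>();
          for (String line : chunk) {
            out.addAll(map(line));
          }
          return out;
        }));
      }
      // 3. Shuffling: group all values by key, with keys kept sorted.
      Map<String, List<Integer>> grouped = new TreeMap<>();
      for (Future<List<Map.Entry<String, Integer>>> task : mapTasks) {
        for (Map.Entry<String, Integer> pair : await(task)) {
          grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
      }
      // 4. Reduce phase: one reduce call per key, run on the pool.
      Map<String, Future<Integer>> reduceTasks = new TreeMap<>();
      for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
        reduceTasks.put(group.getKey(),
            pool.submit(() -> reduce(group.getKey(), group.getValue())));
      }
      // 5. Output generation: collect the reduced values.
      Map<String, Integer> output = new TreeMap<>();
      for (Map.Entry<String, Future<Integer>> task : reduceTasks.entrySet()) {
        output.put(task.getKey(), await(task.getValue()));
      }
      return output;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    // prints {be=3, do=1, is=1, not=1, or=1, to=4}
    System.out.println(run(List.of("to be or not to be", "to do is to be"), 2));
  }
}
```

The shuffle here runs on the calling thread for simplicity. In a distributed setting that step is where data crosses the network, so it is usually the most expensive part of the job.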

Acceptance Criteria:

The implementation must clearly define and separate the 'Map' function for mapping input data into intermediate key/value pairs, and the 'Reduce' function for merging all intermediate values associated with the same intermediate key.
Include comprehensive unit tests to verify both the map and reduce functions operate as expected on test datasets.
Ensure the code adheres to the coding conventions and documentation requirements outlined in the project's contribution guidelines.

rankans · 2024-08-03T11:50:07Z

@iluwatar Can I start working on this? This is my first time contributing so might need some help as well.

stale · 2024-10-05T19:14:11Z

This issue has been automatically marked as stale because it has not had recent activity. The issue will be unassigned if no further activity occurs. Thank you for your contributions.

wizzac · 2024-10-09T16:23:12Z

Hello, I would like to work on this one, if it's still not taken

iluwatar added the info: help wanted, epic: pattern, and type: feature labels Apr 20, 2024

iluwatar assigned rankans Aug 6, 2024

iluwatar removed the info: help wanted label Aug 6, 2024

stale bot added the status: stale label Oct 5, 2024

stale bot removed the status: stale label Oct 9, 2024

wizzac linked a pull request Oct 10, 2024 that will close this issue

Add map reduce pattern issue 2927 #3057

Open

iluwatar assigned wizzac and unassigned rankans Oct 20, 2024
