
A simple explanation of MapReduce


What is MapReduce, and where can it be used?

MapReduce is a programming model widely used for parallel and distributed computation. The model has two parts: Map() and Reduce(). Between Map and Reduce there is usually a shuffle-and-sort phase. During that phase, a combiner() is sometimes used to cut the Reduce() workload by pre-aggregating values that share the same key.
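
Before diving into an example, here is a minimal in-memory sketch of the model on toy data (a single-process illustration, not a distributed implementation): map emits key-value pairs, the shuffle/sort step groups them by key, and reduce aggregates each group.

from itertools import groupby
from operator import itemgetter

records = ['a,1', 'b,2', 'a,3', 'b,4']  # toy input, one record per line

# Map: emit one (key, value) pair per input record
pairs = []
for record in records:
    key, value = record.split(',')
    pairs.append((key, int(value)))

# Shuffle and sort: bring identical keys together
pairs.sort(key=itemgetter(0))

# Reduce: aggregate all values that share a key
for key, group in groupby(pairs, key=itemgetter(0)):
    print(key, sum(value for _, value in group))  # prints: a 4, then b 6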

MapReduce is broadly used in most parallel computing systems, such as Hadoop, and in cloud data warehouses (Azure and AWS both offer MapReduce options to optimize large computation volumes).

To understand MapReduce simply, let's use a very common real-life example:

A company keeps ledger data storing every employee's salary information. This data is updated every month, since most employees are paid monthly. At the end of the financial year, the CFO wants to know the top 100 highest-paid employees, and you need to work this out for more than 5,000 employees without using the MYOB payroll software (which does this perfectly in seconds).

Well, now it is time to surprise your CTO: you decide to use the MapReduce method to solve this problem. Let's look at the data first:

Year and month,Name,Gender,Salary,WorkID,Division,Position ......  
2017 / Jan,Jason,M,9000   ......   
2017 / Feb,Chris,M,8500   ......   
2017 / Jan,Kate,F,9200    ......  

We can see that for every month, each person's salary is stored as a row. So for Map(), what we need to do is pick the key and the value out of each record and form a key-value pair. For this example, the key-value pair will be (Name, Salary): each row gives a pair like (Jason, 9000), and we call this pair [k1, v1].

So in Python, the pseudocode will look like this (assuming the ledger is stored as a CSV file):

import numpy as np

# Map(): emit a (Name, Salary) key-value pair for every row of the ledger
Names = []
Salarylist = []
with open('ledger.csv') as Ledger:  # assumed input file name
    for line in Ledger:
        columns = line.split(',')
        Name = columns[1]    # Name is the second column, so index 1
        Salary = columns[3]  # Salary is the fourth column, so index 3
        Names.append(Name)
        Salarylist.append(Salary)
Names = np.array(Names)
Salarylist = np.array(Salarylist)
Outcome_map = np.stack([Names, Salarylist], axis=1)  # shape (Kn, 2)

After Map(), we get an array like [[Jason,9000],[Chris,8500],[Kate,9200],[Jason,9200],[Lily,8000],[Chris,9300]...]. The shape of this array is (Kn, 2), where Kn = count(rows).

Then we need to sort these key-value pairs by their key, because we want each reducer to process only a few people. So first we build a DataFrame from the (Kn, 2) array.

import pandas as pd

Salary_info = pd.DataFrame(Outcome_map, columns=['Name', 'Salary'])
Salary_info = Salary_info.sort_values('Name')  # group identical keys together

Then our data looks like this:

Jason 9000
Jason 9200
Jason 9020
...
Chris 8500
Chris 9300
Chris 9100
...
Kate  9000
Kate  8000
...
Lily ...
...

As you can see, the data is now grouped by its key, and the reducer just needs to aggregate the salary amounts for each key.
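
In a real cluster, the sorted pairs would also be partitioned across several reducers so that each reducer handles only a subset of the names. A common scheme is hash partitioning; a minimal sketch (the reducer count of 4 is just an assumption for illustration):

NUM_REDUCERS = 4  # assumed number of reducers, for illustration only

def partition(key, num_reducers=NUM_REDUCERS):
    # within one run, every pair with the same key lands on the same reducer
    return hash(key) % num_reducers

print(partition('Jason'), partition('Chris'), partition('Kate'))

Because all records for the same name go to the same reducer, the per-key aggregation below stays correct even when the work is split across machines.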

Now we move to the last step, the Reduce():

# Reduce(): walk the sorted rows and sum the salaries for each distinct name
Name_employee = None
salary_amount = 0
Name_list = []
Salary_T_list = []
for row in Salary_info.itertuples(index=False):
    if Name_employee == row.Name:
        # same key as the previous row: keep aggregating
        salary_amount += float(row.Salary)
    else:
        # new key: flush the previous (name, total) pair, then start a new one
        Name_list.append(Name_employee)
        Salary_T_list.append(salary_amount)
        Name_employee = row.Name
        salary_amount = float(row.Salary)
# flush the final pair, which the loop itself never reaches
Name_list.append(Name_employee)
Salary_T_list.append(salary_amount)
Name_list = np.array(Name_list)
Salary_T_list = np.array(Salary_T_list)
Outcome_reduce = np.stack([Name_list, Salary_T_list], axis=1)

The idea of the reduce code above is:

1. Since the data has been sorted, if a row's name is not new, we keep the name from the previous row in memory and keep aggregating the salary amount in memory as well.

2. If a row gives a new name, we first store the previous name and its total in the output lists, then use the new name and new salary amount as the current pair, until we find the next new key.

3. Because we scan the data row by row, the final pair is never flushed inside the loop, so we append it once more after the loop ends.

4. This approach also puts the pre-defined pair [None, 0] into the outcome, since the very first record always counts as a new key, so we just need to drop that header pair when we output the final results.

5. The result of reduce() contains exactly one entry per unique key, like [[K1,v],[K2,v],[K3,v]...].
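
To finish the CFO's actual request, one more small step ranks the reduced totals and takes the top 100. A minimal sketch, assuming the Outcome_reduce array built above:

import pandas as pd

# Drop the pre-defined [None, 0] header pair, then rank by total salary
Totals = pd.DataFrame(Outcome_reduce, columns=['Name', 'Salary'])
Totals = Totals.iloc[1:]  # remove the [None, 0] header row (point 4 above)
Totals['Salary'] = Totals['Salary'].astype(float)
Top_100 = Totals.sort_values('Salary', ascending=False).head(100)
print(Top_100)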

Now we have finished the task using MapReduce!

On the next page we will look at how Hadoop runs MapReduce and how the parallel computation works across the master and worker nodes. If possible, we will also look at using a combiner in MapReduce and at using two keys (composite keys) for MapReduce.
