
Run MapReduce on Hadoop 2: Developing MapReduce using Adelaide Crows player data


Find the maximum kick_score for each Adelaide Crows player over the last 15 seasons

The raw data is listed in the https://github.com/chriszhangpodo/AFL_adelaide_crows_data repository. I used some further data processing to combine all the CSV files into one big file containing 15 seasons of player data. Now the question is: how do we find each player's maximum kick_score across all 15 seasons?

If we used SQL to solve this problem, it would be:

select player,max(kick_score) from player
group by player;

What this does is read the player and kick_score columns from all rows, group the data by player, and then output max(kick_score) for each group.

In MapReduce, we can scope this task like this:

1. Read each line and take its playername and kick_score as a key-value pair.
2. Output this key-value pair as the result of map().
3. In the reduce() phase, for each player, keep only the largest kick_score and output it as a name-kick_score pair.
4. Write the reduce() output to a file, and we are done.

But what if I want a condition on finding the maximum score? Say I only want to output each player's maximum score in each year.

Well, it is almost the same: we just need to change the key from "playername" to something like "playername" + "year"; everything else stays the same.

That doesn't sound hard, so let's write it:

public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // each input value is one CSV line of the combined player data
    String line = value.toString();
    String[] lineSplit = line.split(",");
    String name = lineSplit[0];                      // player name
    long score = Long.parseLong(lineSplit[1]);       // kick_score
    long year = Long.parseLong(lineSplit[23]);       // season year
    // composite key "playername-year" so the maximum is found per player per year
    String name_y = name + "-" + year;
    context.write(new Text(name_y), new LongWritable(score));
  }

From this code you can see that the map() method only needs three parameters: key, value, and context. key and value hold each input line's key and contents, and context is used to pass the new key and new value to the reducer. The initial key type is LongWritable because the default mapper input key is a LongWritable (the line's byte offset), while the input value is the line itself in Text format.
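To make the parsing concrete, here is a small standalone sketch of what the mapper does to one line. The column positions (name at index 0, kick_score at index 1, year at index 23) are taken from the split indices above; the player name and values are made up purely for illustration:

import java.util.Arrays;

public class MapKeySketch {
    public static void main(String[] args) {
        // build a dummy 24-column CSV line: name, kick_score, ..., year
        String[] cols = new String[24];
        Arrays.fill(cols, "0");
        cols[0] = "PlayerA";   // hypothetical player name
        cols[1] = "35";        // hypothetical kick_score
        cols[23] = "2017";     // season year
        String line = String.join(",", cols);

        // same logic as the map() method above
        String[] lineSplit = line.split(",");
        String name_y = lineSplit[0] + "-" + Long.parseLong(lineSplit[23]);
        long score = Long.parseLong(lineSplit[1]);

        System.out.println(name_y + "\t" + score);  // prints: PlayerA-2017	35
    }
}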

You need to make sure that the output of the map() method matches the mapper class's generics, which look like the code below:

public class playermapper extends Mapper<LongWritable, Text, Text, LongWritable> {
/*
 Mapper is the superclass of playermapper. Its generics are declared as
 LongWritable, Text, Text, LongWritable, which means the input key/value types
 are LongWritable and Text, and the output key/value types are Text and LongWritable.
*/
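The reducer class declaration would mirror these generics: its input key/value types must match the mapper's output types. A minimal sketch (the class name playerreducer is an assumption, not taken from the original code):

public class playerreducer extends Reducer<Text, LongWritable, Text, LongWritable> {
/*
 Input key/value types (Text, LongWritable) match the mapper's output types;
 the output key/value types are also Text and LongWritable here.
*/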

Then we need to write the reduce() part; the code is listed below:

public void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    // values holds all kick_scores collected for one "playername-year" key
    long max = 0L;
    for (LongWritable a : values) {
        if (a.get() > max) {
            max = a.get();
        }
    }
    // emit the key unchanged together with its maximum kick_score
    context.write(key, new LongWritable(max));
  }

As you can see, in the reduce() method, for each key we just need to find the maximum of the LongWritable values, and in Java we use LongWritable.get() to read the value. The key is not changed. The context records the output of the job, ready to be written to the file.
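To actually run the job on Hadoop, the mapper and reducer need to be wired together in a driver class. This is a minimal sketch, not the original code: the class names MaxKickScoreDriver and playerreducer are assumptions, and the input/output paths are read from the command-line arguments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxKickScoreDriver {
    public static void main(String[] args) throws Exception {
        // args[0] = input path, args[1] = output path (hypothetical)
        Job job = Job.getInstance(new Configuration(), "max kick_score per player per year");
        job.setJarByClass(MaxKickScoreDriver.class);
        job.setMapperClass(playermapper.class);
        job.setReducerClass(playerreducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Once packaged into a jar, the job could be launched with something like hadoop jar crows.jar MaxKickScoreDriver /input /output (the jar name and paths here are hypothetical).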

Then we use Eclipse to build the Java files and compile them into a jar, which we can run on Hadoop. The final result looks like the output below:

However, what if I want to select the kick_score and kick_avg for each player in 2017 only, and then group and sum the score and average? That's a little bit harder, but don't worry, we will solve it on the next page!