test

Ruby-Spark

More informations

Wiki
ruby-doc

Installation

Add this line to your application's Gemfile:

gem 'ruby-spark'

And then execute:

$ bundle

Or install it yourself as:

$ gem install ruby-spark

Install Apache Spark

To install latest supported Spark. Project is build by SBT.

$ ruby-spark build

Usage

You can use Ruby Spark via interactive shell

$ ruby-spark pry

Or on existing project

require 'ruby-spark'
Spark.start

Spark.sc # => context

If you want configure Spark first. See configurations for more details.

require 'ruby-spark'

Spark.load_lib(spark_home)
Spark.config do
   set_app_name "RubySpark"
   set 'spark.ruby.batch_size', 100
   set 'spark.ruby.serializer', 'oj'
end
Spark.start

Spark.sc # => context

Uploading a data

Single file

$sc.text_file(FILE, workers_num, custom_options)

All files on directory

$sc.whole_text_files(DIRECTORY, workers_num, custom_options)

Direct

$sc.parallelize([1,2,3,4,5], workers_num, custom_options)
$sc.parallelize(1..5, workers_num, custom_options)

Options

workers_num: Min count of works computing this task.
(This value can be overwriten by spark)
custom_options: serializer: name of serializator used for this RDD
batch_size: see configuration

(Available only for parallelize)
use: direct (upload direct to java), file (upload throught a file)

Examples

Sum of numbers

$sc.parallelize(0..10).sum
# => 55

Words count using methods

def split_line(line)
  line.split
end

def map(x)
  [x,1]
end

def reduce(x,y)
  x+y
end

rdd = $sc.text_file("spec/inputs/lorem_300.txt")
rdd = rdd.flat_map(:split_line)
rdd = rdd.map(:map)
rdd = rdd.reduce_by_key(:reduce)
rdd.collect_as_hash

# => {word: count}

Estimating pi with a custom serializer

slices = 2
n = 100000 * slices

def map(_)
  x = rand * 2 - 1
  y = rand * 2 - 1

  if x**2 + y**2 < 1
    return 1
  else
    return 0
  end
end

rdd = $sc.parallelize(1..n, slices, serializer: "oj")
rdd = rdd.map(:map)

puts "Pi is roughly %f" % (4.0 * rdd.sum / n)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly