-
Notifications
You must be signed in to change notification settings - Fork 29
test
Ondřej Moravčík edited this page Mar 14, 2015
·
6 revisions
More informations
- Wiki
- ruby-doc
Add this line to your application's Gemfile:
gem 'ruby-spark'
And then execute:
$ bundle
Or install it yourself as:
$ gem install ruby-spark
To install latest supported Spark. Project is build by SBT.
$ ruby-spark build
You can use Ruby Spark via interactive shell
$ ruby-spark pry
Or on existing project
require 'ruby-spark'
Spark.start
Spark.sc # => context
If you want configure Spark first. See configurations for more details.
require 'ruby-spark'
Spark.load_lib(spark_home)
Spark.config do
set_app_name "RubySpark"
set 'spark.ruby.batch_size', 100
set 'spark.ruby.serializer', 'oj'
end
Spark.start
Spark.sc # => context
Single file
$sc.text_file(FILE, workers_num, custom_options)
All files on directory
$sc.whole_text_files(DIRECTORY, workers_num, custom_options)
Direct
$sc.parallelize([1,2,3,4,5], workers_num, custom_options)
$sc.parallelize(1..5, workers_num, custom_options)
- workers_num
-
Min count of works computing this task.
(This value can be overwriten by spark) - custom_options
-
serializer: name of serializator used for this RDD
batch_size: see configuration
(Available only for parallelize)
use: direct (upload direct to java), file (upload throught a file)
Sum of numbers
$sc.parallelize(0..10).sum
# => 55
Words count using methods
def split_line(line)
line.split
end
def map(x)
[x,1]
end
def reduce(x,y)
x+y
end
rdd = $sc.text_file("spec/inputs/lorem_300.txt")
rdd = rdd.flat_map(:split_line)
rdd = rdd.map(:map)
rdd = rdd.reduce_by_key(:reduce)
rdd.collect_as_hash
# => {word: count}
Estimating pi with a custom serializer
slices = 2
n = 100000 * slices
def map(_)
x = rand * 2 - 1
y = rand * 2 - 1
if x**2 + y**2 < 1
return 1
else
return 0
end
end
rdd = $sc.parallelize(1..n, slices, serializer: "oj")
rdd = rdd.map(:map)
puts "Pi is roughly %f" % (4.0 * rdd.sum / n)