We have a plan to create a docker environment of Spark test use, but current situation, you need to prepare Spark firstly.
Get Yosegi's jar from Maven repository.It can be easily obtained from the following command.
$ ./bin/get_jar.sh get
It is in the following path.
./jars/yosegi/latest/yosegi.jar
Specify yosegi's jars for Spark's startup option "- jar".
$ ./bin/spark-shell --jars ./jars/yosegi/latest/yosegi.jar,./target/scala-2.11/yosegi-spark_2.11-1.0.jar
In this example, CSV data is assumed to be input data. With this data as input, we create a table in Yosegi format and read data from the table.
Create the following csv file.In this example, to illustrate writing to and reading from the table, the data is a simple example. If you already have a table, you still have the input.
{"id":"X_0001","name":"AAA","age":20}
{"id":"X_0002","name":"BBB","age":30}
{"id":"X_0003","name":"CCC","age":32}
{"id":"X_0004","name":"DDD","age":21}
{"id":"X_0005","name":"EEE","age":28}
{"id":"X_0006","name":"FFF","age":21}
Run Spark. Load json file and create yosegi file.
val sqlContext = new SQLContext(sc);
val df = sqlContext.read.json("/tmp/example.json")
JSON output looks like this.
scala> df.show()
+---+------+----+
|age| id|name|
+---+------+----+
| 20|X_0001| AAA|
| 30|X_0002| BBB|
| 32|X_0003| CCC|
| 21|X_0004| DDD|
| 28|X_0005| EEE|
| 21|X_0006| FFF|
+---+------+----+
Save the data frame created from JSON as a yosegi file.
df.write.format("jp.co.yahoo.yosegi.spark.YosegiFileFormat").save("/tmp/example.yosegi")
Read as a data frame from yosegi file.
val sqlContext = new SQLContext(sc);
val df = sqlContext.read.format("jp.co.yahoo.yosegi.spark.YosegiFileFormat" ).load( "/tmp/example.yosegi")
df.show
The output looks like this.
scala> df.show()
+----+------+---+
|name| id|age|
+----+------+---+
| AAA|X_0001| 20|
| BBB|X_0002| 30|
| CCC|X_0003| 32|
| DDD|X_0004| 21|
| EEE|X_0005| 28|
| FFF|X_0006| 21|
+----+------+---+