This repository has been archived by the owner on Aug 19, 2020. It is now read-only.

SequenceFile Format

okram edited this page Jan 13, 2013 · 3 revisions

  • InputFormat: org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
  • OutputFormat: org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

Hadoop’s native binary data file is the SequenceFile. Every Writable object implements methods that enable it to both read itself from and write itself to a SequenceFile. Because both FaunusVertex and FaunusEdge implement Writable, they can be stored in a SequenceFile. Moreover, because a SequenceFile is a binary format, it supports a more compact representation than text-based formats such as GraphSON.
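The read/write contract described above can be sketched without a Hadoop dependency. In the sketch below, SimpleWritable is a local stand-in that mirrors the two methods of org.apache.hadoop.io.Writable, and ToyVertex is a hypothetical, heavily simplified class invented for illustration. It is not FaunusVertex, and its field layout is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in for org.apache.hadoop.io.Writable so the sketch compiles without Hadoop.
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical toy vertex (an id and a name); not the real FaunusVertex.
final class ToyVertex implements SimpleWritable {
    long id;
    String name = "";

    ToyVertex() {}

    ToyVertex(long id, String name) {
        this.id = id;
        this.name = name;
    }

    // Serialize the fields in a fixed order.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(id);
        out.writeUTF(name);
    }

    // Deserialize the fields in the same order they were written.
    @Override
    public void readFields(DataInput in) throws IOException {
        id = in.readLong();
        name = in.readUTF();
    }

    // Convenience wrapper: serialize a vertex to an in-memory byte array.
    static byte[] toBytes(ToyVertex vertex) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            vertex.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper: rebuild a vertex from bytes produced by toBytes.
    static ToyVertex fromBytes(byte[] bytes) {
        try {
            ToyVertex vertex = new ToyVertex();
            vertex.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return vertex;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes(new ToyVertex(1L, "marko"));
        ToyVertex copy = fromBytes(bytes);
        System.out.println(copy.id + " " + copy.name + " (" + bytes.length + " bytes)");
    }
}
```

The toy vertex serializes to 15 bytes: 8 for the fixed-width long, 2 for the length prefix that writeUTF emits, and 5 for the name itself. A real Writable would be handed the DataOutput and DataInput by Hadoop rather than by in-memory streams.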

Faunus-Specific Compression

The following is a list of compression techniques used by Faunus within a SequenceFile.

  • Variable-width encoding of all ints and longs.
  • Edges are sorted by direction to reduce the number of direction encodings.
  • Edges are sorted by label to reduce the number of label encodings.
  • Only the adjacent vertex id is stored, because the root vertex’s id can be inferred.
  • Element property types are encoded as a single byte.
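To make the first four techniques concrete, here is a minimal sketch in plain Java. It is not Faunus's actual byte layout, which this page does not specify. The variable-width scheme is a generic 7-bits-per-byte encoding chosen for illustration, and the class CompactEdges, its methods, and the sample edges are all hypothetical. Edges are keyed by direction (0 for OUT, 1 for IN, an assumed convention), then by label, and only the adjacent vertex ids are kept.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CompactEdges {

    // Writes 7 bits per byte, low bits first; a set high bit means more bytes follow.
    static void writeVarLong(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Reads a value written by writeVarLong.
    static long readVarLong(ByteArrayInputStream in) {
        long result = 0;
        int shift = 0;
        int b;
        do {
            b = in.read();
            if (b < 0) throw new IllegalStateException("truncated input");
            result |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }

    static byte[] encodeVarLong(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarLong(out, value);
        return out.toByteArray();
    }

    static long decodeVarLong(byte[] bytes) {
        return readVarLong(new ByteArrayInputStream(bytes));
    }

    // Labels are written as a variable-width length followed by UTF-8 bytes.
    static void writeLabel(ByteArrayOutputStream out, String label) {
        byte[] utf8 = label.getBytes(StandardCharsets.UTF_8);
        writeVarLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Grouped layout: each direction and each label is written once per group.
    static byte[] encodeGrouped(Map<Integer, Map<String, List<Long>>> edges) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarLong(out, edges.size());
        for (Map.Entry<Integer, Map<String, List<Long>>> byDirection : edges.entrySet()) {
            int direction = byDirection.getKey();
            out.write(direction);
            writeVarLong(out, byDirection.getValue().size());
            for (Map.Entry<String, List<Long>> byLabel : byDirection.getValue().entrySet()) {
                writeLabel(out, byLabel.getKey());
                writeVarLong(out, byLabel.getValue().size());
                for (long adjacentId : byLabel.getValue()) {
                    writeVarLong(out, adjacentId); // root vertex id is implied, so it is omitted
                }
            }
        }
        return out.toByteArray();
    }

    // Flat layout for comparison: every edge repeats its direction and label.
    static byte[] encodeFlat(Map<Integer, Map<String, List<Long>>> edges) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map.Entry<Integer, Map<String, List<Long>>> byDirection : edges.entrySet()) {
            int direction = byDirection.getKey();
            for (Map.Entry<String, List<Long>> byLabel : byDirection.getValue().entrySet()) {
                for (long adjacentId : byLabel.getValue()) {
                    out.write(direction);
                    writeLabel(out, byLabel.getKey());
                    writeVarLong(out, adjacentId);
                }
            }
        }
        return out.toByteArray();
    }

    // Hypothetical sample: four outgoing edges and one incoming edge of a single root vertex.
    static Map<Integer, Map<String, List<Long>>> sample() {
        Map<String, List<Long>> outgoing = new TreeMap<>();
        outgoing.put("created", Arrays.asList(3L));
        outgoing.put("knows", Arrays.asList(2L, 4L, 6L));
        Map<String, List<Long>> incoming = new TreeMap<>();
        incoming.put("knows", Arrays.asList(5L));
        Map<Integer, Map<String, List<Long>>> edges = new TreeMap<>();
        edges.put(0, outgoing);
        edges.put(1, incoming);
        return edges;
    }

    public static void main(String[] args) {
        System.out.println("grouped: " + encodeGrouped(sample()).length + " bytes");
        System.out.println("flat:    " + encodeFlat(sample()).length + " bytes");
    }
}
```

For this five-edge sample the grouped layout takes 33 bytes against 42 for the flat one, and the gap widens as more edges share a direction and label. Small ids occupy a single byte each instead of the eight a fixed-width long would need.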

Intermediate Format

Because a SequenceFile is compact, splittable, and a native Hadoop format, Faunus uses it as the intermediate representation between consecutive Faunus jobs. In other words, when a Faunus computation requires more than one MapReduce phase, a SequenceFile representing the output of the first MapReduce job is temporarily persisted in HDFS and fed as the input to the second MapReduce job.