S3 User Guide

Apache Hadoop provides an AWS module that allows the Hadoop file system to access S3, which in turn lets the connector use S3 as its staging area.

Required Dependencies

What you will need:

  • Spark 3.x
  • An appropriate hadoop-aws version for your Hadoop installation. Note that Spark is distributed either pre-built with Hadoop or as a package for user-provided Hadoop.
    • Importantly, the hadoop-aws version must exactly match the version of your Hadoop installation.
    • For example, for an sbt project using Hadoop 3.3.0, add to your build.sbt: libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.3.0" (a fuller build.sbt sketch follows this list)
  • An S3 bucket configured to authenticate with either A) an access key ID and secret access key or B) IAM roles

Some features may work with older versions of hadoop-aws, but we currently only test against the hadoop-aws version compatible with the latest Spark 3.
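For reference, the sketch below shows how these dependencies might fit together in build.sbt. This is a minimal example assuming Spark 3.0.2, Scala 2.12, and Hadoop 3.3.0; the spark-sql artifact and the "provided" scope are assumptions for a typical setup where the Spark installation supplies Spark at runtime, so adjust the versions and artifacts to match your environment.

// build.sbt -- minimal sketch, assuming Spark 3.0.2 and Hadoop 3.3.0
scalaVersion := "2.12.15"

libraryDependencies ++= Seq(
  // Spark is marked "provided": the Spark installation supplies it at runtime.
  "org.apache.spark" %% "spark-sql"  % "3.0.2" % "provided",
  // hadoop-aws must exactly match the Hadoop version on the classpath.
  "org.apache.hadoop" % "hadoop-aws" % "3.3.0"
)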

Spark with User-Provided Hadoop

The following example sets up Spark with a user-provided Apache Hadoop installation. To download Spark, go here. Be sure to select the package type "Pre-built with user-provided Apache Hadoop".

You can download Hadoop 3.3 here. Make sure to download the binary release, not the source.

Setting up Spark with Hadoop

Note: All instructions here are for macOS or Linux users.

First, decompress the Spark and Hadoop tar files:

tar xvf spark-3.0.2-bin-without-hadoop.tgz
tar xvf hadoop-3.3.0.tar.gz

Move the extracted Spark folder to /opt/spark: mv spark-3.0.2-bin-without-hadoop/ /opt/spark

Go to the Spark configuration directory: cd /opt/spark/conf

There should be a spark-env.sh.template file. Spark only reads a file named spark-env.sh, so rename the template: mv spark-env.sh.template spark-env.sh

Next, set the JAVA_HOME environment variable, adjusting the path to match your Java installation: export JAVA_HOME=/usr/lib/jvm/jre-11-openjdk

Now, edit spark-env.sh and set SPARK_DIST_CLASSPATH to the output of the hadoop classpath command from the Hadoop folder you extracted earlier. For example, if you extracted it to /myhadoop, add the following line: export SPARK_DIST_CLASSPATH=$(/myhadoop/hadoop-3.3.0/bin/hadoop classpath)

See Spark's documentation for more information.

Finally, set the SPARK_HOME environment variable:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Example Application Using S3

See here for an example of how to connect to an S3 bucket with the Spark Connector.
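For reference, below is a minimal Scala sketch of connecting Spark to an S3 bucket over the S3A file system. The bucket name, paths, and environment variable names are placeholders, and the application structure is an assumption for illustration; the fs.s3a.access.key and fs.s3a.secret.key settings are standard hadoop-aws configuration keys. If your bucket uses IAM roles (option B above), the two credential lines can be omitted.

import org.apache.spark.sql.SparkSession

object S3Example {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("S3 example")
      .getOrCreate()

    // Option A: access key ID + secret access key, read from the environment here.
    // With IAM roles (option B), these two lines can be omitted.
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Read and write through the S3A file system; "my-bucket" is a placeholder.
    val df = spark.read.parquet("s3a://my-bucket/input/")
    df.write.mode("overwrite").parquet("s3a://my-bucket/output/")

    spark.stop()
  }
}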

Troubleshooting

If you see this error: java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities, it is most likely a version mismatch: the Hadoop version on Spark's classpath is older than the hadoop-aws version you added. Make sure you are using Spark with Hadoop 3.3.0 and hadoop-aws 3.3.0, as described above.
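One quick way to check which Hadoop version is actually on Spark's classpath is to print it from the Spark shell. org.apache.hadoop.util.VersionInfo is part of Hadoop itself, so this reflects the version Spark is running against; it should match the hadoop-aws version in your build (3.3.0 in this guide).

// Run inside spark-shell: print the Hadoop version on the classpath.
println(org.apache.hadoop.util.VersionInfo.getVersion)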