This guide will take you through the steps involved in running dblink on a small test data set. To make the guide accessible, we assume that you're running dblink on your local machine. Of course, in practical applications you'll likely want to run dblink on a cluster. We'll provide some pointers for this option as we go along.
The following two steps require that Java 8+ is installed on your system. To check whether it is installed on a macOS or Linux system, run the command
$ java -version
You should see a version number of the form 8.x (or equivalently 1.8.x). Installation instructions for Oracle JDK on Windows, macOS and Linux are available here.
Note: As of April 2019, the licensing terms of the Oracle JDK have changed. We recommend using an open source alternative such as the OpenJDK. Packages are available in many Linux distributions. Instructions for macOS are available here.
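For example, on a Debian or Ubuntu-based system, OpenJDK 8 can typically be installed from the distribution's package repositories (package names may differ on other distributions):
$ sudo apt-get install openjdk-8-jdk
$ java -version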
Since dblink is implemented as a Spark application, you'll need access to a Spark cluster in order to run it. Setting up a Spark cluster from scratch can be quite involved and is beyond the scope of this guide. We refer interested readers to the Spark documentation, which discusses various deployment options. An easier route for most users is to use a preconfigured Spark cluster available through public cloud providers, such as Amazon EMR, Azure HDInsight, and Google Cloud Dataproc. In this guide, we take an even simpler approach: we'll run Spark in pseudocluster mode on your local machine. This is fine for testing purposes or for small data sets.
We'll now take you through detailed instructions for setting up Spark in pseudocluster mode on a macOS or Linux system.
First, download the prebuilt Spark 2.4.5 release from the Spark release archive.
$ wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
then extract the archive.
$ tar -xvf spark-2.4.5-bin-hadoop2.7.tgz
Move the Spark folder to /opt and create a symbolic link so that you can easily switch to another Spark version in the future.
$ sudo mv spark-2.4.5-bin-hadoop2.7 /opt
$ sudo ln -s /opt/spark-2.4.5-bin-hadoop2.7/ /opt/spark
Define the SPARK_HOME variable and add the Spark binaries to your PATH.
How this is done depends on your operating system and/or shell. Assuming environment variables are defined in ~/.profile, you can run the following commands:
$ echo 'export SPARK_HOME=/opt/spark' >> ~/.profile
$ echo 'export PATH=$PATH:$SPARK_HOME/bin' >> ~/.profile
After appending these two lines, run the following command to update your path for the current session.
$ source ~/.profile
Notes:
- If using Bash on Debian, Fedora or RHEL derivatives, environment variables are typically defined in ~/.bash_profile rather than ~/.profile
- If using Zsh, environment variables are typically defined in ~/.zprofile
- You can check which shell you're using by running echo $SHELL
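Once your environment is set up, you can check that the Spark binaries are on your PATH by printing the Spark version (it should report 2.4.5):
$ spark-submit --version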
In this step you'll obtain the dblink fat JAR, which has the file name dblink-assembly-0.2.0.jar. It contains all of the class files and resources for dblink, packaged together with its dependencies.
There are two options:
- (Recommended) Download a prebuilt JAR from here. This has been built against Spark 2.4.5 and is not guaranteed to work with other versions of Spark.
- Build the fat JAR file from source as explained in the section below.
The build tool used for dblink is called sbt, which you'll need to install on your system. Installation instructions for Windows, macOS and Linux are available in the sbt documentation. For those using Bash on macOS, an alternative installation route is sketched below.
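For example, on macOS with Homebrew installed, sbt can typically be installed with:
$ brew install sbt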
On macOS or Linux, you can verify that sbt is installed correctly by running:
$ sbt about
Once you've successfully installed sbt, get the dblink source code from GitHub:
$ git clone https://github.com/cleanzr/dblink.git
then change into the dblink directory and build the package
$ cd dblink
$ sbt assembly
This should produce a fat JAR at ./target/scala-2.11/dblink-assembly-0.2.0.jar.
Note: IntelliJ IDEA can also be used to build the fat JAR. It is arguably more user-friendly as it has a GUI and users can avoid installing sbt.
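You can confirm that the JAR was built, for example by listing it:
$ ls -lh ./target/scala-2.11/dblink-assembly-0.2.0.jar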
Having completed the above two steps, you're now ready to launch dblink.
This is done using the spark-submit interface, which supports all types of Spark deployments.
As a test, let's try running the RLdata500 example provided with the source
code on your local machine.
From within the dblink directory, run the following command:
$SPARK_HOME/bin/spark-submit \
--master "local[1]" \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
--conf "spark.driver.extraClassPath=./target/scala-2.11/dblink-assembly-0.2.0.jar" \
./target/scala-2.11/dblink-assembly-0.2.0.jar \
./examples/RLdata500.conf
This will run Spark in pseudocluster (local) mode with 1 core. You can increase the number of cores available by changing local[1] to local[n], where n is the number of cores, or to local[*] to use all available cores.
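For example, to use four cores you would change that option to:
--master "local[4]" \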
To run dblink on other data sets you will need to edit the config file (called RLdata500.conf above).
Instructions for doing this are provided here.
dblink saves output into a specified directory. In the RLdata500 example from above, the output is written to ./examples/RLdata500_results/.
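You can list the contents of this directory to see what was produced:
$ ls ./examples/RLdata500_results/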
Below we provide a brief description of the files:
- run.txt: contains details about the job (MCMC run). This includes the data files, the attributes used, parameter settings etc.
- partitions-state.parquet and driver-state: store the final state of the Markov chain, so that MCMC can be resumed (e.g. you can run the Markov chain for longer without starting from scratch).
- diagnostics.csv: contains summary statistics along the chain which can be used to assess convergence/mixing.
- linkage-chain.parquet: contains posterior samples of the linkage structure in Parquet format (see the spark-shell sketch at the end of this section for one way to load them).
Optional files:
- evaluation-results.txt: contains output from an "evaluate" step (e.g. precision, recall, other measures). Requires ground truth entity identifiers in the data files.
- cluster-size-distribution.csv: contains the cluster size distribution along the chain (rows are iterations, columns contain counts for each cluster/entity size). Only appears if requested in a "summarize" step.
- partition-sizes.csv: contains the partition sizes along the chain (rows are iterations, columns are counts of the number of entities residing in each partition). Only appears if requested in a "summarize" step.
- shared-most-probable-clusters.csv: a point estimate of the linkage structure computed from the posterior samples. Each line in the file contains a comma-separated list of record identifiers which are assigned to the same cluster/entity.
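As a rough illustration (a sketch, not part of the official dblink workflow), you could inspect the output from the RLdata500 example using the Spark shell included with your Spark installation; the exact columns you see will depend on the dblink version. From within the dblink directory:
$ spark-shell --master "local[*]"
scala> val linkage = spark.read.parquet("examples/RLdata500_results/linkage-chain.parquet")
scala> linkage.printSchema()   // posterior samples of the linkage structure
scala> linkage.show(5)
scala> val diagnostics = spark.read.option("header", "true").option("inferSchema", "true").csv("examples/RLdata500_results/diagnostics.csv")
scala> diagnostics.show(5)     // summary statistics along the chain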