[Insert Project Desc here]
- Basic Java, Maven and Git knowledge.
- Basic knowledge of Databricks.
- Databricks Cluster: You need a running Databricks cluster with the appropriate Spark runtime version and libraries installed.
- Get set up here
- Input Data: The input data should be loaded into a Delta Lake table in the specified catalog and schema.
- Download from the DRR Distribution
- Application JAR: The compiled JAR file of the project should be uploaded to a Databricks volume.
- See the Build the project section below.
- Check out the project:

  ```shell
  git clone https://github.com/REGnosys/opensource-reg-reporting.git
  cd opensource-reg-reporting
  ```
- Build the project:

  ```shell
  mvn clean install
  ```

  This will compile the code, run the tests, and package the application into a JAR file named `opensource-reg-reporting-1.0-SNAPSHOT.jar` in the `target` directory.

  Note: The project uses the Maven Shade plugin to create a "fat JAR" that includes all the necessary DRR (Digital Regulatory Reporting) dependencies. The Apache Spark dependencies are marked as "provided" in the `pom.xml` file, as these will be available in the Databricks environment where the application will be deployed.
- Applications in this Project (an illustrative task snippet follows this list)
  - `DataLoader.java`: Spark application to load data from a directory into a Spark table
    - `output-table`: Name of the output table.
    - `input-path`: The path to the directory containing the data files.
    - `catalog.schema`: Name of the database/schema.
  - `SparkReportRunner.java`: Spark application to run a Report (e.g. ESMA EMIR) and save the results to a Spark table
    - `output-table`: Name of the output table.
    - `input-table`: Name of the input table.
    - `function-name`: Fully qualified name of the Rosetta report function.
    - `catalog.schema`: Name of the database/schema.
  - `SparkProjectRunner.java`: Spark application to run a Projection (e.g. ISO20022 XML) and save the results to a Spark table
    - `output-table`: Name of the output table.
    - `input-table`: Name of the input table.
    - `function-name`: Fully qualified name of the Rosetta projection function.
    - `xml-config-path`: Path to the XML configuration file.
    - `catalog.schema`: Name of the database/schema.
  - `SparkValidationRunner.java`: Spark application to run validation against either one or two Spark tables (e.g. the outputs of the applications above) and save the results to a Spark table
    - `output-table`: Name of the output table.
    - `input-table`: Name of the input table.
    - `function-name`: Fully qualified name of the Rosetta function to validate against.
    - `catalog.schema`: Name of the database/schema.
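For example, a `SparkReportRunner` task in the Databricks job YAML (the full template appears later in this document) could pass its arguments in the order listed above. This is a sketch only: the table names are placeholder examples rather than values shipped with the project, and the exact argument order should be confirmed against the application's Javadoc.

```yaml
# Illustrative sketch only: the table names and report function are placeholders.
spark_jar_task:
  main_class_name: org.finos.orr.SparkReportRunner
  parameters:
    - emir_report_output                  # output-table (example name)
    - trades_input                        # input-table (example name)
    - <fully-qualified-report-function>   # function-name
    - opensource_reg_reporting.orr        # catalog.schema (created in the SQL steps below)
```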
- Open Databricks SQL:
  - Navigate to the Databricks workspace.
  - Click on the "SQL" icon on the sidebar.
- Create the Catalog:
  - Run the following SQL command to create the catalog:

    ```sql
    %sql
    CREATE CATALOG opensource_reg_reporting;
    ```

- Create the Schema:
  - Run the following SQL command to create the schema within the catalog:

    ```sql
    %sql
    CREATE SCHEMA opensource_reg_reporting.orr;
    ```

- Create the Volumes:
  - Run the following SQL commands to create the volumes:

    ```sql
    %sql
    CREATE VOLUME opensource_reg_reporting.orr.cdm_trades;
    CREATE VOLUME opensource_reg_reporting.orr.jar;
    ```
- Navigate to Compute:
  - Click on the "Compute" icon on the sidebar.
- Create a New Cluster:
- Upload the applications JAR file to the cluster:
  - Select the compute resource
  - Click on the `Libraries` tab
  - Click on the `Install New` button at the top right
  - Select `Volumes`
  - Navigate to `/Volumes/opensource_reg_reporting/orr/jar`
  - Upload the JAR file produced in the Build the project section
  - The page should look like this:
- Restart the cluster
Each application can be run as a Spark job on Databricks. You can configure the job using the Databricks UI or a YAML configuration file. Here's a general outline of the steps involved:
This project includes four main Spark applications: `DataLoader`, `SparkProjectRunner`, `SparkReportRunner`, and `SparkValidationRunner`. These applications are designed to be run on a Databricks cluster and interact with data stored in a Delta Lake table.
This is the quickest and easiest way to get set up. See the sections below for detailed setup.
Use the following YAML configuration to create the job and tasks:
- Click on `Workflows` in the left menu
- Click `Create Job`
- Click on the three-dot menu at the top right-hand side and switch to the code version (YAML)
- Save the job
- Spark Jar Task
  - Specify the `main_class_name` as one of the following:
    - org.finos.orr.DataLoader
    - org.finos.orr.SparkProjectRunner
    - org.finos.orr.SparkReportRunner
    - org.finos.orr.SparkValidationRunner
  - Set the JAR to the location of the application JAR file in the Databricks volume.
- Parameters
  - Each application requires specific parameters to be passed as arguments. Refer to the application's Javadoc for details on the required parameters.
- Configure Dependencies
  - Add the application JAR as a library to the Spark job.
- Execute the Job
  - Run the Spark job on the Databricks cluster.
Note: Replace the placeholders (e.g. `<task-key>`, `<jar-uri>`, `<main-class-name>`, `<cluster-id>`, etc.) with the actual values for your environment. A filled-in example is shown after the template.
```yaml
- task_key: <task-key>
  spark_jar_task:
    jar_uri: "<jar-uri>"
    main_class_name: <main-class-name>
    parameters:
      - <output-table>
      - <input-table>
      - <function-name>
      - <xml-config-path>   # (for SparkProjectRunner)
      - <catalog.schema>
    run_as_repl: true
  existing_cluster_id: <cluster-id>
  libraries:
    - jar: "<jar-uri>"
```
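As a worked illustration, a filled-in `DataLoader` task might look like the sketch below. The task key and output table name are assumed example values; the volume paths, catalog, and schema follow the setup steps earlier in this document, and the JAR path assumes the file built earlier was uploaded to the `jar` volume.

```yaml
# Illustrative sketch only: task_key, the output table name and the cluster id are placeholders.
- task_key: load_cdm_trades
  spark_jar_task:
    jar_uri: "/Volumes/opensource_reg_reporting/orr/jar/opensource-reg-reporting-1.0-SNAPSHOT.jar"
    main_class_name: org.finos.orr.DataLoader
    parameters:
      - cdm_trades                                         # output-table (example name)
      - /Volumes/opensource_reg_reporting/orr/cdm_trades   # input-path (volume created above)
      - opensource_reg_reporting.orr                       # catalog.schema
    run_as_repl: true
  existing_cluster_id: <cluster-id>
  libraries:
    - jar: "/Volumes/opensource_reg_reporting/orr/jar/opensource-reg-reporting-1.0-SNAPSHOT.jar"
```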
TODO - Add section here