This section describes how you can get started developing DataFusion.
For information on developing with Ballista, see the Ballista developer documentation.
DataFusion is written in Rust and uses a standard Rust toolkit:

- `cargo build`
- `cargo fmt` to format the code
- `cargo test` to test
- etc.
Testing setup:

- `git submodule init`
- `git submodule update`
- `export PARQUET_TEST_DATA=$(pwd)/parquet-testing/data/`
- `export ARROW_TEST_DATA=$(pwd)/testing/data/`
Below is a checklist of what you need to do to add a new scalar function to DataFusion (see the sketch after this checklist):

- Add the actual implementation of the function.
- In `src/physical_plan/functions`, add:
  - a new variant to `BuiltinScalarFunction`
  - a new entry to `FromStr` with the name of the function as called by SQL
  - a new line in `return_type` with the expected return type of the function, given an incoming type
  - a new line in `signature` with the signature of the function (number and types of its arguments)
  - a new line in `create_physical_expr`/`create_physical_fun` mapping the built-in to the implementation
  - tests to the function.
- In `tests/sql.rs`, add a new test where the function is called through SQL against well known data and returns the expected result.
- In `src/logical_plan/expr`, add:
  - a new entry of the `unary_scalar_expr!` macro for the new function.
- In `src/logical_plan/mod`, add:
  - a new entry in the `pub use expr::{}` set.
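To make the checklist concrete, here is a minimal, self-contained sketch of the enum-variant / `FromStr` / `return_type` pattern it describes. The `MyFunc` variant and `my_func` SQL name are hypothetical, and the real code in `src/physical_plan/functions` works with Arrow `DataType`s and DataFusion's own error types rather than the simplified strings used here.

```rust
use std::str::FromStr;

// Simplified stand-in for the real enum in src/physical_plan/functions.
#[derive(Debug, PartialEq)]
enum BuiltinScalarFunction {
    Sqrt,
    // new variant for the function being added (hypothetical)
    MyFunc,
}

impl FromStr for BuiltinScalarFunction {
    type Err = String;

    // map the SQL name of the function to the new variant
    fn from_str(name: &str) -> Result<Self, Self::Err> {
        match name {
            "sqrt" => Ok(BuiltinScalarFunction::Sqrt),
            "my_func" => Ok(BuiltinScalarFunction::MyFunc),
            other => Err(format!("unknown scalar function: {}", other)),
        }
    }
}

// declare the return type of the function, given the incoming type
// (the real return_type works with arrow::datatypes::DataType)
fn return_type(fun: &BuiltinScalarFunction) -> &'static str {
    match fun {
        BuiltinScalarFunction::Sqrt => "Float64",
        BuiltinScalarFunction::MyFunc => "Utf8",
    }
}

fn main() {
    let fun = BuiltinScalarFunction::from_str("my_func").unwrap();
    assert_eq!(fun, BuiltinScalarFunction::MyFunc);
    println!("my_func returns {}", return_type(&fun));
}
```

The `signature` and `create_physical_expr`/`create_physical_fun` entries follow the same one-match-arm-per-function pattern.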
Below is a checklist of what you need to do to add a new aggregate function to DataFusion (a minimal accumulator sketch follows the checklist):

- Add the actual implementation of an `Accumulator` and `AggregateExpr`.
- In `src/physical_plan/aggregates`, add:
  - a new variant to `BuiltinAggregateFunction`
  - a new entry to `FromStr` with the name of the function as called by SQL
  - a new line in `return_type` with the expected return type of the function, given an incoming type
  - a new line in `signature` with the signature of the function (number and types of its arguments)
  - a new line in `create_aggregate_expr` mapping the built-in to the implementation
  - tests to the function.
- In `tests/sql.rs`, add a new test where the function is called through SQL against well known data and returns the expected result.
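To illustrate the accumulator half of that checklist, here is a toy, self-contained sketch of the update/merge/evaluate lifecycle. It is not the real `Accumulator` trait from `src/physical_plan/aggregates`, which operates on Arrow data and whose exact method signatures differ between versions; the `SumAccumulator` type and the `f64`-based methods are purely illustrative.

```rust
// Toy version of the accumulator pattern; the real trait works on Arrow data.
trait Accumulator {
    fn update(&mut self, value: f64); // fold one input value into the state
    fn merge(&mut self, other: &Self); // combine two partial aggregates
    fn evaluate(&self) -> f64; // produce the final aggregate value
}

#[derive(Default)]
struct SumAccumulator {
    sum: f64,
}

impl Accumulator for SumAccumulator {
    fn update(&mut self, value: f64) {
        self.sum += value;
    }

    fn merge(&mut self, other: &Self) {
        self.sum += other.sum;
    }

    fn evaluate(&self) -> f64 {
        self.sum
    }
}

fn main() {
    // simulate two partitions aggregated independently, then merged
    let mut left = SumAccumulator::default();
    let mut right = SumAccumulator::default();
    for v in &[1.0, 2.0] {
        left.update(*v);
    }
    for v in &[3.0, 4.0] {
        right.update(*v);
    }
    left.merge(&right);
    assert_eq!(left.evaluate(), 10.0);
}
```

The `create_aggregate_expr` step in the checklist is where the new enum variant gets mapped to the concrete `AggregateExpr`/`Accumulator` implementation.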
The query plans represented by `LogicalPlan` nodes can be graphically rendered using Graphviz.

To do so, save the output of the `display_graphviz` function to a file:
use std::fs::File;
use std::io::Write;

// Create plan somehow...
let mut output = File::create("/tmp/plan.dot")?;
write!(output, "{}", plan.display_graphviz())?;
Then, use the `dot` command line tool to render it into a file that can be displayed. For example, the following command creates a `/tmp/plan.pdf` file:
dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
We formalize DataFusion semantics and behaviors through specification documents. These specifications serve as references to help resolve ambiguities during development or code reviews.

You are also welcome to propose changes to existing specifications or create new specifications as you see fit.

Here is the list of currently active specifications:

All specifications are stored in the `docs/source/specification` folder.
We use `prettier` to format `.md` files.

You can either use `npm i -g prettier` to install it globally or use `npx` to run it as a standalone binary. Using `npx` requires a working node environment. Upgrading to the latest prettier is recommended (by adding `--upgrade` to the `npm` command).
$ prettier --version
2.3.0
After you've confirmed your prettier version, you can format all the `.md` files:
prettier -w {ballista,datafusion,datafusion-examples,dev,docs,python}/**/*.md