feat: expression plugins #26

Merged
merged 9 commits into from
Sep 18, 2023

ritchie46 (Member) commented Sep 15, 2023

This adds support for polars plugins. These are expressions exposed in a separate shared library and dynamically linked into the main polars library.

This means we or third parties can create their own expressions, and they will run on our engine without Python interference, so they are not blocked by the GIL.

We can therefore keep polars leaner and maybe add support for a polars-distance, polars-geo, polars-ml, etc. Those can then have specialized expressions without worrying as much about code bloat, as they can be installed optionally.

The idea is that you define an expression in another Rust crate with a proc_macro polars_expr.

That macro can have the following attributes:

  • output_type -> to define the output type of that expression
  • type_func -> to define a function that computes the output type based on input types.
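To make the difference between the two attributes concrete, here is a toy, std-only model. It does not use the polars types or the real macro; `DType`, `fixed_utf8` and `to_float` are hypothetical names chosen for illustration. A fixed output type ignores the inputs, while a type function derives the output type from them.

```rust
/// Toy stand-in for a column dtype (hypothetical; not the polars `DataType`).
#[derive(Debug, Clone, Copy, PartialEq)]
enum DType {
    Int64,
    Float32,
    Float64,
    Utf8,
}

/// Fixed output type: what an `output_type`-style declaration expresses.
fn fixed_utf8(_inputs: &[DType]) -> DType {
    DType::Utf8
}

/// Computed output type: what a `type_func`-style function expresses.
/// Float inputs keep their width; integers are promoted to Float64.
fn to_float(inputs: &[DType]) -> Result<DType, String> {
    match inputs.first() {
        Some(DType::Float32) => Ok(DType::Float32),
        Some(DType::Float64) | Some(DType::Int64) => Ok(DType::Float64),
        Some(other) => Err(format!("cannot map {:?} to a float type", other)),
        None => Err("expected at least one input".to_string()),
    }
}
```

In this model `to_float(&[DType::Int64])` yields `Float64`, whereas `fixed_utf8` returns `Utf8` for any input.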

Here is an example of a String conversion expression that converts any string to pig latin:

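use std::fmt::Write;

use polars::prelude::*;
use pyo3_polars::derive::polars_expr;

// Moves the first character to the end and appends "ay"; empty input writes nothing.
fn pig_latin_str(value: &str, output: &mut String) {
    if let Some(first_char) = value.chars().next() {
        // Slice at the character boundary so a multi-byte first character does not panic.
        write!(output, "{}{}ay", &value[first_char.len_utf8()..], first_char).unwrap()
    }
}

// `output_type` declares the dtype of the Series this expression returns.
#[polars_expr(output_type=Utf8)]
fn pig_latinnify(inputs: &[Series]) -> PolarsResult<Series> {
    let ca = inputs[0].utf8()?;
    let out: Utf8Chunked = ca.apply_to_buffer(pig_latin_str);
    Ok(out.into_series())
}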
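To check the per-string logic without building the plugin, a std-only sketch may help. The function name `pig_latin` is hypothetical and nothing here depends on polars. It takes the remainder after the first character via the iterator, so non-ASCII input is handled at character boundaries.

```rust
/// Hypothetical std-only helper mirroring the per-string logic of the plugin.
fn pig_latin(value: &str) -> String {
    let mut chars = value.chars();
    match chars.next() {
        // After `next()`, `chars.as_str()` is the rest of the string,
        // starting at the second character.
        Some(first) => format!("{}{}ay", chars.as_str(), first),
        // Empty input stays empty.
        None => String::new(),
    }
}
```

For instance, `pig_latin("Bob")` gives `"obBay"` and `pig_latin("")` gives an empty string.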

On the Python side, this expression can then be registered under a namespace:

import polars as pl
from polars.utils.udfs import _get_shared_lib_location

lib = _get_shared_lib_location(__file__)


@pl.api.register_expr_namespace("language")
class Language:
    def __init__(self, expr: pl.Expr):
        self._expr = expr

    def pig_latinnify(self) -> pl.Expr:
        return self._expr._register_plugin(
            lib=lib,
            symbol="pig_latinnify",
            is_elementwise=True,
        )

Compile/ship and then it is ready to use:

import polars as pl
from expression_lib import Language

df = pl.DataFrame({
    "names": ["Richard", "Alice", "Bob"],
})


out = df.with_columns(
    pig_latin=pl.col("names").language.pig_latinnify(),
)

See the full example here: https://github.com/pola-rs/pyo3-polars/tree/plugin/example/derive_expression

polars-core = { workspace = true, default-features = false }
polars-ffi = { workspace = true, optional = true }
polars-plan = { workspace = true, optional = true }
polars-lazy = { workspace = true, optional = true }
pyo3 = "0.19.0"

Should all external deps be pinned to micro version here? E.g. any chance of pyo3 being pinned to 0.19 only?

(exact pinning probably makes sense for polars libraries themselves though...)

ritchie46 (Member, Author) replied:

As I understand it, in Cargo 0.19.0 is equivalent to the range 0.19.0..0.19.n (any 0.19.x at or above 0.19.0), though this is not super clear. To pin it exactly, we should write =0.19.0.
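The rule behind this can be sketched in a few lines. This is a simplified std-only model of Cargo's default (caret) requirement for full `major.minor.patch` versions; it ignores pre-release tags and partial requirements such as `0.19`, and the names `caret_range` and `caret_matches` are hypothetical.

```rust
type Version = (u64, u64, u64);

/// Half-open range `[lower, upper)` accepted by a bare requirement
/// `major.minor.patch` (simplified model of Cargo's caret rule).
fn caret_range(major: u64, minor: u64, patch: u64) -> (Version, Version) {
    let lower = (major, minor, patch);
    // The upper bound bumps the left-most non-zero component.
    let upper = if major > 0 {
        (major + 1, 0, 0)
    } else if minor > 0 {
        (0, minor + 1, 0)
    } else {
        (0, 0, patch + 1)
    };
    (lower, upper)
}

/// True when `version` satisfies the bare requirement `req`.
fn caret_matches(req: Version, version: Version) -> bool {
    let (lower, upper) = caret_range(req.0, req.1, req.2);
    // Tuples compare lexicographically, which matches version ordering here.
    lower <= version && version < upper
}
```

Under this model `0.19.0` accepts 0.19.2 but rejects 0.20.0, whereas `=0.19.0` would accept only 0.19.0.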

aldanor commented Sep 17, 2023

Pretty exciting!

Here is an example of a String conversion expression that converts any string to pig latin

Question 1: (apologies in advance if the questions are stupid) it's just not immediately clear from the diff... how would you pass extra arguments to those functions? E.g. you have a compiled expression multiply_by which you'd want to call like .multiply_by(1) or .multiply_by(2).

(You'd think many expressions, including prebuilt ones, often have various non-series parameters.)

Question 2: is it the plan to only allow this on series level, or frame level as well? (since some computations you can do much faster internally, handling parallelization yourself if it's problem-specific, as opposed to having polars runtime sort it out).

ritchie46 (Member, Author) commented Sep 18, 2023

Question 1: (apologies in advance if the questions are stupid) it's just not immediately clear from the diff... how would you pass extra arguments to those functions? E.g. you have a compiled expression multiply_by which you'd want to call like .multiply_by(1) or .multiply_by(2).

Indeed. Currently it only works for multiple expression arguments. Though many types can be represented as single element series. I'd welcome the ability to make non-series arguments easier, but I would still have to think a little bit about that.

Question 2: is it the plan to only allow this on series level, or frame level as well? (since some computations you can do much faster internally, handling parallelization yourself if it's problem-specific, as opposed to having polars runtime sort it out).

Only on series level. You could of course always accept a series of type struct to deal with dataframe-like inputs. Handling the parallelism yourself will lead to contention with the default polars runtime. To solve this, we would also have to make the rayon threadpool work over FFI, which is currently out of scope for this functionality.

Ideally, I think we would have arguments that allow you to influence the parallelism strategy.

@ritchie46 ritchie46 merged commit 7d5e384 into main Sep 18, 2023
2 checks passed
@ritchie46 ritchie46 deleted the plugin branch September 18, 2023 09:27
aldanor commented Sep 18, 2023

Indeed. Currently it only works for multiple expression arguments. Though many types can be represented as single element series. I'd welcome the ability to make non-series arguments easier, but I would still have to think a little bit about that.

Just a random thought, one way would be to come up with a restricted set of allowed argument types safe to send across ffi and thread boundaries, like a json::Value-style enum of all valid scalars (iirc polars already has something similar) plus lists, dicts, nullable stuff etc, and then provide conversions on both pyo3 and rust sides. Along the lines of

// args: Arc<[Arg]>

enum Arg {
   Null,
   String(String),
   List(Arc<[Arg]>),
   ...
}

It could be implemented completely differently, of course.
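A slightly fuller, std-only sketch of that idea follows. The extra variants, the `From` conversions and the `multiply_by` function are all assumptions made for illustration; none of this is an existing polars or pyo3-polars API, and dicts are left out for brevity.

```rust
use std::sync::Arc;

/// Restricted set of argument values (hypothetical variants).
#[derive(Debug, Clone, PartialEq)]
enum Arg {
    Null,
    Bool(bool),
    Int(i64),
    Float(f64),
    String(String),
    List(Arc<[Arg]>),
}

/// Compile-time check that a type can cross thread boundaries.
fn assert_send_sync<T: Send + Sync>() {}

impl From<bool> for Arg {
    fn from(v: bool) -> Self { Arg::Bool(v) }
}
impl From<i64> for Arg {
    fn from(v: i64) -> Self { Arg::Int(v) }
}
impl From<f64> for Arg {
    fn from(v: f64) -> Self { Arg::Float(v) }
}
impl From<&str> for Arg {
    fn from(v: &str) -> Self { Arg::String(v.to_string()) }
}
/// Nullable values: `None` becomes `Arg::Null`.
impl<T: Into<Arg>> From<Option<T>> for Arg {
    fn from(v: Option<T>) -> Self {
        v.map_or(Arg::Null, Into::into)
    }
}
/// Lists convert element by element.
impl<T: Into<Arg>> From<Vec<T>> for Arg {
    fn from(v: Vec<T>) -> Self {
        Arg::List(v.into_iter().map(Into::into).collect())
    }
}

/// Hypothetical expression body: multiplies each value by the integer in `args[0]`.
fn multiply_by(values: &[i64], args: &[Arg]) -> Result<Vec<i64>, String> {
    match args.first() {
        Some(Arg::Int(factor)) => Ok(values.iter().map(|v| v * factor).collect()),
        other => Err(format!("expected one integer argument, got {:?}", other)),
    }
}
```

With this shape, a call like `.multiply_by(2)` would reach the compiled function as `&[Arg::Int(2)]`, and a wrong argument type surfaces as an error rather than a panic.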
