This repository has been archived by the owner on Nov 10, 2023. It is now read-only.

Weighted aggregation. #8

Open

koaning opened this issue Nov 28, 2021 · 10 comments

Comments

@koaning
Collaborator

koaning commented Nov 28, 2021

Things like weighted mean/sum/std might be good to support.
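For concreteness, here is a rough sketch of what these could compute, using numpy as a reference (np.average gives the weighted mean; the weighted std shown here follows the frequency-weight convention, which is only one of several possible definitions):

import numpy as np

x = np.array([10.0, 5.0])   # values
w = np.array([0.5, 1.5])    # weights

weighted_sum = np.sum(w * x)                       # Σ wᵢ·xᵢ
weighted_mean = np.average(x, weights=w)           # Σ wᵢ·xᵢ / Σ wᵢ
weighted_var = np.average((x - weighted_mean) ** 2, weights=w)
weighted_std = np.sqrt(weighted_var)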

@ritchie46
Member

You mean like rolling functions? We have those.

@koaning
Collaborator Author

koaning commented Nov 28, 2021

Ah, I meant in just plain old aggregation.

@ritchie46
Member

We also have those. Or do you mean as an example here?

@koaning
Collaborator Author

koaning commented Nov 29, 2021

Really? Interesting, in pandas I've seen people resort to apply to calculate those.

@ritchie46
Member

Really? Interesting, in pandas I've seen people resort to apply to calculate those.

Can you show me a pandas snippet, so I understand?

@koaning
Collaborator Author

koaning commented Nov 29, 2021

This is how you'd calculate a mean in pandas.

import pandas as pd 

data = [
    {"group": "a", "rating": 10, "weight": 0.5},
    {"group": "a", "rating":  5, "weight": 1.5},
    {"group": "b", "rating":  5, "weight": 1.5}
]

pd.DataFrame(data).groupby("group").agg(weighted_mean=("rating", "mean"))

But that's not a weighted mean. Instead you'd like to do something like:

import pandas as pd
import numpy as np

data = [
    {"group": "a", "rating": 10, "weight": 0.5},
    {"group": "a", "rating":  5, "weight": 1.5},
    {"group": "b", "rating":  5, "weight": 1.5}
]

(pd.DataFrame(data)
 .groupby("group")
 .agg(weighted_mean=("rating", lambda d: np.sum(d['rating'] * d['weight']) / np.sum(d['weight']))))

But this doesn't work with a named aggregation in pandas, because the d in lambda d refers to the ratings array, not the entire dataframe. That's why some folks resort to:

(pd.DataFrame(data)
 .groupby("group")
 .apply(lambda d: np.average(d['rating'], weights=d['weight'])))

But this is using apply, which comes with overhead.

That's why I think it'd be nice to have a method that can attach a weighted-mean column, but computed in a performant way.
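For context, a vectorized workaround does exist in pandas, but it is more verbose; roughly (a sketch, where the intermediate weighted column is just scratch space introduced for illustration):

import pandas as pd

data = [
    {"group": "a", "rating": 10, "weight": 0.5},
    {"group": "a", "rating":  5, "weight": 1.5},
    {"group": "b", "rating":  5, "weight": 1.5}
]

# Pre-compute the rating*weight product, then use plain vectorized groupby sums.
df = pd.DataFrame(data).assign(weighted=lambda d: d["rating"] * d["weight"])
sums = df.groupby("group")[["weighted", "weight"]].sum()
weighted_mean = sums["weighted"] / sums["weight"]
# group a: (10*0.5 + 5*1.5) / (0.5 + 1.5) = 6.25
# group b: (5*1.5) / 1.5 = 5.0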

@ritchie46
Member

ritchie46 commented Nov 29, 2021

Yeap, that's why the expressions are awesome! 😄

import polars as pl

(pl.DataFrame(data)
    .groupby("group")
    .agg([
        # weighted mean per group: Σ(rating·weight) / Σ(weight)
        (pl.col("rating") * pl.col("weight")).sum() / pl.sum("weight")
    ])
)
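Checking that by hand against the toy data above (just arithmetic, not polars output):

group "a": (10 * 0.5 + 5 * 1.5) / (0.5 + 1.5) = 12.5 / 2.0 = 6.25
group "b": (5 * 1.5) / 1.5 = 5.0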

@koaning
Collaborator Author

koaning commented Nov 29, 2021

Yep! But that's why this may also be a nice example to add to this repository. Not 100% sure though. It feels like something that's only worth the effort in pandas-land. Less so in polars-country.

@ritchie46
Member

Yes, it shows the power of expressions. One of the arguments I often make is that the expression API reduces the need to run Python bytecode, which this example shows. So yeah, I think it fits.

@kjyv

kjyv commented Jul 21, 2023

The above example using expressions is great, yet it does not address the many cases where some values are NaN or the weights sum to 0. I'm currently trying to reproduce, in polars, the behaviour we get from a much slower pandas aggregation.
The polars expression is becoming quite complex, and I think this would be much nicer if it were supported natively.
See this thread where someone has similar requirements: https://stackoverflow.com/questions/74714338/how-to-compute-a-group-weighted-average-controlling-for-null-values-in-polars
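To illustrate the kind of expression this ends up requiring, here is a rough sketch (my own, not a polars built-in) that drops null ratings from both the numerator and the denominator and returns null when a group's weights sum to 0; newer polars releases spell groupby as group_by:

import polars as pl

df = pl.DataFrame({
    "group":  ["a", "a", "b", "b"],
    "rating": [10.0, None, 5.0, None],
    "weight": [0.5, 1.5, 0.0, 0.0],
})

# Only count weights of rows whose rating is not null, so a null rating drops
# out of both the numerator and the denominator.
valid_weight = pl.col("weight").filter(pl.col("rating").is_not_null())

out = (
    df.groupby("group")
    .agg([
        # null ratings produce null products, which sum() skips by default
        (pl.col("rating") * pl.col("weight")).sum().alias("num"),
        valid_weight.sum().alias("den"),
    ])
    .with_columns(
        # guard against an all-null group or a zero weight sum
        pl.when(pl.col("den") != 0)
        .then(pl.col("num") / pl.col("den"))
        .otherwise(None)
        .alias("weighted_mean")
    )
)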
