This repository has been archived by the owner on Nov 10, 2023. It is now read-only.

Weighted aggregation. #8

Open

koaning opened this issue Nov 28, 2021 · 10 comments

Comments

@koaning
Collaborator

koaning commented Nov 28, 2021

Things like weighted mean/sum/std might be good to support.
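For concreteness, here is a rough sketch of what these could compute, using numpy as a reference (np.average gives the weighted mean; the weighted std shown here follows the frequency-weight convention, which is only one of several possible definitions):

import numpy as np

x = np.array([10.0, 5.0])   # values
w = np.array([0.5, 1.5])    # weights

weighted_sum = np.sum(w * x)                       # Σ wᵢ·xᵢ
weighted_mean = np.average(x, weights=w)           # Σ wᵢ·xᵢ / Σ wᵢ
weighted_var = np.average((x - weighted_mean) ** 2, weights=w)
weighted_std = np.sqrt(weighted_var)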

@ritchie46
Member

You mean like rolling functions? We have those.

@koaning
Collaborator Author

koaning commented Nov 28, 2021

Ah, I meant in just plain old aggregation.

@ritchie46
Member

We also have those. Or do you mean as an example here?

@koaning
Collaborator Author

koaning commented Nov 29, 2021

Really? Interesting, in pandas I've seen people resort to apply to calculate those.

@ritchie46
Member

Really? Interesting, in pandas I've seen people resort to apply to calculate those.

Can you show me a pandas snippet, so I understand?

@koaning
Collaborator Author

koaning commented Nov 29, 2021

This is how you'd calculate a mean in pandas.

import pandas as pd 

data = [
    {"group": "a", "rating": 10, "weight": 0.5},
    {"group": "a", "rating":  5, "weight": 1.5},
    {"group": "b", "rating":  5, "weight": 1.5}
]

pd.DataFrame(data).groupby("group").agg(weighted_mean=("rating", "mean"))

But that's not a weighted mean. Instead you'd like to do something like:

import pandas as pd
import numpy as np

data = [
    {"group": "a", "rating": 10, "weight": 0.5},
    {"group": "a", "rating":  5, "weight": 1.5},
    {"group": "b", "rating":  5, "weight": 1.5}
]

(pd.DataFrame(data)
 .groupby("group")
 .agg(weighted_mean=("rating", lambda d: np.sum(d['rating'] * d['weight']) / np.sum(d['weight']))))

But this doesn't work with a named aggregation in pandas, because the d in lambda d refers to the ratings array, not the entire dataframe. That's why some folks resort to:

(pd.DataFrame(data)
 .groupby("group")
 .apply(lambda d: np.average(d['rating'], weights=d['weight'])))

But this is using apply, which comes with overhead.

That's why I think it'd be nice to have a method that can attach a weighted-mean column, but computed in a performant way.
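For context, a vectorized workaround does exist in pandas, but it is more verbose; roughly (a sketch, where the intermediate weighted column is just scratch space introduced for illustration):

import pandas as pd

data = [
    {"group": "a", "rating": 10, "weight": 0.5},
    {"group": "a", "rating":  5, "weight": 1.5},
    {"group": "b", "rating":  5, "weight": 1.5}
]

# Pre-compute the rating*weight product, then use plain vectorized groupby sums.
df = pd.DataFrame(data).assign(weighted=lambda d: d["rating"] * d["weight"])
sums = df.groupby("group")[["weighted", "weight"]].sum()
weighted_mean = sums["weighted"] / sums["weight"]
# group a: (10*0.5 + 5*1.5) / (0.5 + 1.5) = 6.25
# group b: (5*1.5) / 1.5 = 5.0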

@ritchie46
Member

ritchie46 commented Nov 29, 2021

Yeap, that's why the expressions are awesome! 😄

import polars as pl

(pl.DataFrame(data)
    .groupby("group")
    .agg([
        # weighted mean per group: Σ(rating·weight) / Σ(weight)
        (pl.col("rating") * pl.col("weight")).sum() / pl.sum("weight")
    ])
)
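Checking that by hand against the toy data above (just arithmetic, not polars output):

group "a": (10 * 0.5 + 5 * 1.5) / (0.5 + 1.5) = 12.5 / 2.0 = 6.25
group "b": (5 * 1.5) / 1.5 = 5.0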

@koaning
Collaborator Author

koaning commented Nov 29, 2021

Yep! But that's why this may also be a nice example to add to this repository. Not 100% sure though. It feels like something that's only worth the effort in pandas-land. Less so in polars-country.

@ritchie46
Member

Yes, it shows the power of expressions. One of the arguments I often make is that the expression API reduces the need to run Python bytecode, which this example shows. So yeah, I think it fits.

@kjyv

kjyv commented Jul 21, 2023

The above example using expressions is great, yet it does not address the many cases where some values are NaN or the weights sum to 0. I'm currently trying to reproduce, in polars, the behaviour we get from a much slower pandas aggregation.
The polars expression is becoming quite complex, and I think this would be much nicer if it were supported natively.
See this thread where someone has similar requirements: https://stackoverflow.com/questions/74714338/how-to-compute-a-group-weighted-average-controlling-for-null-values-in-polars
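To illustrate the kind of expression this ends up requiring, here is a rough sketch (my own, not a polars built-in) that drops null ratings from both the numerator and the denominator and returns null when a group's weights sum to 0; newer polars releases spell groupby as group_by:

import polars as pl

df = pl.DataFrame({
    "group":  ["a", "a", "b", "b"],
    "rating": [10.0, None, 5.0, None],
    "weight": [0.5, 1.5, 0.0, 0.0],
})

# Only count weights of rows whose rating is not null, so a null rating drops
# out of both the numerator and the denominator.
valid_weight = pl.col("weight").filter(pl.col("rating").is_not_null())

out = (
    df.groupby("group")
    .agg([
        # null ratings produce null products, which sum() skips by default
        (pl.col("rating") * pl.col("weight")).sum().alias("num"),
        valid_weight.sum().alias("den"),
    ])
    .with_columns(
        # guard against an all-null group or a zero weight sum
        pl.when(pl.col("den") != 0)
        .then(pl.col("num") / pl.col("den"))
        .otherwise(None)
        .alias("weighted_mean")
    )
)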
