Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add group_by method to AgentSet for attribute-based grouping #2092

Closed

Conversation

ai-naymul
Copy link

@ai-naymul ai-naymul commented Mar 28, 2024

Discussion(#2074)

In this pr I have added a group_by method in the AgentSet class, allowing agents to grouped based on some specific attributes.

Copy link

Performance benchmarks:

Model Size Init time [95% CI] Run time [95% CI]
Schelling small 🔵 -0.0% [-0.3%, +0.3%] 🔵 -1.0% [-1.3%, -0.8%]
Schelling large 🔵 +0.2% [-0.7%, +1.1%] 🔵 -3.9% [-6.6%, -1.5%]
WolfSheep small 🔵 -0.4% [-1.3%, +0.5%] 🔵 -0.5% [-0.7%, -0.3%]
WolfSheep large 🔵 +0.6% [+0.2%, +0.9%] 🔵 +0.9% [-2.0%, +3.6%]
BoidFlockers small 🔵 +1.8% [+1.1%, +2.5%] 🔵 +1.0% [+0.3%, +1.6%]
BoidFlockers large 🔵 -3.2% [-6.6%, -0.0%] 🔵 -3.8% [-5.8%, -1.6%]

@EwoutH
Copy link
Member

EwoutH commented Mar 28, 2024

Thanks for this PR! It looks like it could fill a nice use case, especially for attributes with discrete values.

On thing I'm thinking about is if we can do something smart to select ranges or thresholds for attributes with continuous values. For example divide agents in to bins based on some attribute.

Curious what the performance of this is compared to (multiple) select() cases.

@EwoutH EwoutH requested review from quaquel, rht and Corvince March 28, 2024 10:02
@EwoutH EwoutH added the enhancement Release notes label label Mar 28, 2024
@quaquel
Copy link
Member

quaquel commented Mar 28, 2024

I also had a quick look. I like the direction. Some quick points:

  1. how to handle non-discrete attributes? The current code is not explicit on this. The obvious direction is to have some kind of binning. Alternatively, there could be some check of the type of the value of the attribute. If this is, for example, a float, an exception or warning might be raised. Either way, this issue need to be handled explicitly.
  2. I would like to see unit tests for this code to ensure we maintain proper coverage.

@ai-naymul
Copy link
Author

Thanks for this PR! It looks like it could fill a nice use case, especially for attributes with discrete values.

On thing I'm thinking about is if we can do something smart to select ranges or thresholds for attributes with continuous values. For example divide agents in to bins based on some attribute.

Curious what the performance of this is compared to (multiple) select() cases.

Hi,
Thank you for the feedback!

Integrating range or threshold-based grouping for continuous attributes is a great idea. We could potentially extend group_by to accept a binning function or range specifications to categorize these attributes. Regarding performance, initial tests suggest group_by is more efficient for discrete attributes compared to multiple select() calls, especially as the number of attributes increases. I'll explore the binning functionality and conduct further performance comparisons.

@ai-naymul
Copy link
Author

I also had a quick look. I like the direction. Some quick points:

  1. how to handle non-discrete attributes? The current code is not explicit on this. The obvious direction is to have some kind of binning. Alternatively, there could be some check of the type of the value of the attribute. If this is, for example, a float, an exception or warning might be raised. Either way, this issue need to be handled explicitly.
  2. I would like to see unit tests for this code to ensure we maintain proper coverage.

Addressing non-discrete attributes through binning or type checks is a valid concern. I propose adding an optional binning_function parameter to the method to handle continuous attributes. This function can categorize attributes into discrete bins. If not provided for continuous attributes, we'll issue a warning. I'll also add the necessary unit tests.

@rht
Copy link
Contributor

rht commented Mar 29, 2024

I tried pandas' groupby on float values:

In [4]: data = {"a": np.random.random(4), "b": np.random.random(4)}

In [5]: df = pd.DataFrame(data=data)

In [6]: df
Out[6]:
          a         b
0  0.437550  0.389805
1  0.180405  0.032264
2  0.509254  0.147376
3  0.181636  0.102611
In [9]: for k, v in df.groupby("b"):
   ...:     print(df.groupby("b").get_group(k))
   ...:
          a         b
1  0.180405  0.032264
          a         b
3  0.181636  0.102611
          a         b
2  0.509254  0.147376
         a         b
0  0.43755  0.389805

i.e. they seem to just let you do it even if the group has 1 element each.

@quaquel
Copy link
Member

quaquel commented Apr 7, 2024

I tried pandas' groupby on float values:

I like to get other people's perspective on this as well. Personally, I am in favor of mimicking pandas as much as possible, so just group on the float as normal. This keeps the code nice and simple. If you want to do binning, you can already easily code this up through AgentSet.do and a callable. So

my_agents.do(some_binning_function).group_by("my_newly assigned attribute")

Returns:
dict: A dictionary where keys are attribute values and values are lists of agents with those attribute values.
"""
grouped_agents = defaultdict(list)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this not be a defaultdict(AgentSet)? Similar to the binning discussion, this would mimic pandas in that groupby returns a group identifier and an AgentSet with the agents in that group.

Copy link
Author

@ai-naymul ai-naymul Apr 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You raise a valid point regarding the return type of the group_by method. Changing the return type to defaultdict(AgentSet) would indeed align more closely with pandas' groupby functionality.

grouped_agents = defaultdict(AgentSet)

when adding agents to the groups:

for agent in self: attr_value = getattr(agent, attr_name, None) if attr_value not in grouped_agents: grouped_agents[attr_value] = AgentSet(model=self.model) grouped_agents[attr_value].add(agent)

Let me know whats your thought on these changes. Looking forward :)

@quaquel
Copy link
Member

quaquel commented Apr 7, 2024

Thanks for this! Just one thought on the binning conversation and a question regarding the appropriate return type. The code itself, and the tests look fine.

@ai-naymul
Copy link
Author

Thanks for this! Just one thought on the binning conversation and a question regarding the appropriate return type. The code itself, and the tests look fine.

Hi sorry for the late reply.

I think we don't need to change the It group_by method to handle the binning specifically, it will handle continuous attributes directly without enforcing binning.

Here is example usage for binning and grouping :

def binning_function(agent):
agent.binned_attribute = agent.continuous_attribute // 10 * 10

my_agents.do(binning_function)

grouped_agents = my_agents.group_by("binned_attribute")

@EwoutH
Copy link
Member

EwoutH commented Apr 17, 2024

@ai-naymul is this PR ready for (final) review? If so, can you mark it as such (by pressing the "Ready to review") button?

Please also update the PR message with a bit of context (including which problem this solves and how it's implemented on a high level).

@ai-naymul
Copy link
Author

@ai-naymul is this PR ready for (final) review? If so, can you mark it as such (by pressing the "Ready to review") button?

Please also update the PR message with a bit of context (including which problem this solves and how it's implemented on a high level).

Hi sorry for the inconvenience, I was having some health issues and wasn't able to work on this properly as well.

I think need some improvement like need the feedback regarding the return type of the group_by method from @quaquel .

I think it's final for review although. I will add some description about this pr as well.

@ai-naymul ai-naymul marked this pull request as ready for review April 20, 2024 14:55
@EwoutH
Copy link
Member

EwoutH commented Jul 3, 2024

@ai-naymul are you still interested on working on this? If so I would like to have an explanation of the feature and an example in the PR message. A bit like: #1890 or #1916 (but it can be shorter since it's a smaller feature).

@quaquel quaquel mentioned this pull request Aug 18, 2024
@quaquel
Copy link
Member

quaquel commented Aug 22, 2024

This has been superseded by #2220. @ai-naymul thanks for the hard work on this. I started from your code in #2220.

@quaquel quaquel closed this Aug 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Release notes label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants