Add aggregate_function argument to utils.concat() #401

Merged: 24 commits, Oct 26, 2023


@hagenw
Member

hagenw commented Oct 25, 2023

This adds the aggregate_function argument to audformat.utils.concat().
If overwrite=False and aggregate_function is not None, it is called to combine the values of every entry that has more than one data point.
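
To make the intended semantics concrete, here is a minimal pure-Python sketch of the rule. It is not the audformat implementation: the helper name `aggregate_overlap`, the use of dicts in place of pandas objects, and `None` as the missing-value marker are all assumptions made for illustration.

```python
import statistics


def aggregate_overlap(objs, aggregate_function):
    """Combine dicts that map an index entry to a value.

    Hypothetical helper for illustration, not part of audformat.
    """
    collected = {}
    for obj in objs:
        for key, value in obj.items():
            values = collected.setdefault(key, [])
            if value is not None:  # None stands in for a missing value
                values.append(value)

    result = {}
    for key, values in collected.items():
        if len(values) > 1:
            # More than one data point: let the callable combine them
            result[key] = aggregate_function(values)
        elif values:
            # Exactly one data point: keep it unchanged
            result[key] = values[0]
        else:
            # Only missing values: the entry stays missing
            result[key] = None
    return result


s1 = {"f1": 1.0, "f2": 2.0}
s2 = {"f2": 4.0, "f3": 3.0}
combined = aggregate_overlap([s1, s2], statistics.mean)
print(combined)
```

Only `f2` has two data points here, so it is the only entry passed to the callable; `f1` and `f3` are kept as they are.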

NOTE: In test_utils_concat.py, only the tests test_concat_aggregate_function() and test_concat_overwrite_aggregate_function() are new; the rest was moved from test_util.py.

@codecov

codecov bot commented Oct 25, 2023

Codecov Report

Merging #401 (2dc437e) into main (d461614) will not change coverage.
The diff coverage is 100.0%.

Files Coverage Δ
audformat/core/utils.py 100.0% <100.0%> (ø)

@frankenjoe
Collaborator

import audformat
import numpy as np
import pandas as pd

index = audformat.filewise_index(['f1', 'f2', 'f3', 'f4'])
df1 = pd.DataFrame(
    {
        'a': [1, 1, 1],
        'b': [1, 1, 1],
    },
    index=index[:3],
)
df2 = pd.DataFrame(
    {
        'a': [2, 2, 2],
        'b': [2, 2, 2],
    },
    index=index[1:],
)

audformat.utils.concat([df1, df2], aggregate_function=np.sum)
         a     b
file            
f1    <NA>  <NA>
f2       3     3
f3       3     3
f4       2     2

But I was expecting:

         a     b
file            
f1       1     1
f2       3     3
f3       3     3
f4       2     2

@hagenw
Member Author

hagenw commented Oct 25, 2023

Thanks for spotting this, I fixed it (and a related issue) and now we get:

      a  b
file      
f1    1  1
f2    3  3
f3    3  3
f4    2  2

@frankenjoe
Collaborator

> Thanks for spotting this, I fixed it (and a related issue) and now we get:

I can confirm that example is now working with two frames, but when I add a third one I get:

index = audformat.filewise_index(['f1', 'f2', 'f3', 'f4'])
df1 = pd.DataFrame(
    {
        'a': [1, 1, 1],
        'b': [1, 1, 1],
    },
    index=index[:3],
)
df2 = pd.DataFrame(
    {
        'a': [2, 2, 2],
        'b': [2, 2, 2],
    },
    index=index[1:],
)
df3 = pd.DataFrame(
    {
        'a': [3],
        'b': [3],
    },
    index=index[:1],
)

audformat.utils.concat([df1, df2, df3], aggregate_function=np.sum)
      a  b
file      
f1    1  1
f2    3  3
f3    3  3
f4    2  2

but what I would expect is:

      a  b
file      
f1    4  4
f2    3  3
f3    3  3
f4    2  2

@hagenw
Member Author

hagenw commented Oct 26, 2023

Thanks for finding another example that didn't work. There was indeed a bigger error in how the overlapping values were collected: for the very first column, we need to collect all values that overlap with any of the other columns, whereas for all other columns it is fine to collect only the values that overlap with the first column. I changed the code accordingly and added a few more tests (most likely still not enough ;) ).

Now we get:

import audformat
import numpy as np
import pandas as pd

index = audformat.filewise_index(['f1', 'f2', 'f3', 'f4'])
df1 = pd.DataFrame(
    {
        'a': [1, 1, 1],
        'b': [1, 1, 1],
    },
    index=index[:3],
)
df2 = pd.DataFrame(
    {
        'a': [2, 2, 2],
        'b': [2, 2, 2],
    },
    index=index[1:],
)
df3 = pd.DataFrame(
    {
        'a': [3],
        'b': [3],
    },
    index=index[:1],
)

audformat.utils.concat([df1, df2, df3], aggregate_function=np.sum)
      a  b
file      
f1    4  4
f2    3  3
f3    3  3
f4    2  2
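
The collection rule from the comment above (the first object gathers entries that overlap with any other object, every later object only those that overlap with the first) can be illustrated with plain Python sets. This is only a sketch of the idea, not the code in audformat/core/utils.py; the function name `overlapping_entries` and the use of sets instead of pandas indices are assumptions made for illustration.

```python
def overlapping_entries(indices):
    """Return, for each object, the index entries whose values get collected.

    Hypothetical helper for illustration, not part of audformat.
    """
    first = set(indices[0])
    others = [set(index) for index in indices[1:]]
    # First object: entries shared with ANY of the other objects
    overlap_first = set().union(*(first & other for other in others))
    # Every other object: entries shared with the first object
    return [overlap_first] + [first & other for other in others]


overlaps = overlapping_entries(
    [
        ["f1", "f2", "f3"],  # index of df1
        ["f2", "f3", "f4"],  # index of df2
        ["f1"],              # index of df3
    ]
)
for name, entries in zip(["df1", "df2", "df3"], overlaps):
    print(name, sorted(entries))
```

With these sets, f1 receives two data points (from df1 and df3), which matches the sum of 4 for f1 in the output above.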

@frankenjoe
Collaborator

Looks good, can't find another failing example. Nice extension, strange we didn't come up with this aggregate solution earlier.

@frankenjoe frankenjoe merged commit 77f69bf into main Oct 26, 2023
10 checks passed
@frankenjoe frankenjoe deleted the concat-aggregate branch October 26, 2023 13:53
@hagenw
Member Author

hagenw commented Oct 26, 2023

I guess there was no urgent need to have it earlier.
