FIX-#4522: Correct multiindex metadata with groupby #4523

devin-petersohn · 2022-06-02T19:26:34Z

Signed-off-by: Devin Petersohn [email protected]

What do these changes do?

commit message follows format outlined here
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves reset_index followed by groupby causes exception in some cases #4522
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date
added (Issue Number: PR title (PR Number)) and github username to release notes for next major release

Signed-off-by: Devin Petersohn <[email protected]>

codecov · 2022-06-02T19:42:21Z

Codecov Report

Merging #4523 (2e9a056) into master (8679052) will decrease coverage by 24.76%.
The diff coverage is 88.23%.

@@             Coverage Diff             @@
##           master    #4523       +/-   ##
===========================================
- Coverage   86.34%   61.57%   -24.77%     
===========================================
  Files         228      228               
  Lines       18438    18454       +16     
===========================================
- Hits        15920    11363     -4557     
- Misses       2518     7091     +4573

Impacted Files	Coverage Δ
modin/core/dataframe/algebra/groupby.py	`85.57% <50.00%> (-9.53%)`	⬇️
modin/pandas/dataframe.py	`91.28% <92.85%> (-0.20%)`	⬇️
modin/pandas/base.py	`94.63% <100.00%> (-0.09%)`	⬇️
modin/experimental/core/__init__.py	`0.00% <0.00%> (-100.00%)`	⬇️
modin/core/execution/ray/__init__.py	`0.00% <0.00%> (-100.00%)`	⬇️
modin/experimental/pandas/__init__.py	`0.00% <0.00%> (-100.00%)`	⬇️
modin/experimental/sklearn/__init__.py	`0.00% <0.00%> (-100.00%)`	⬇️
modin/experimental/core/execution/__init__.py	`0.00% <0.00%> (-100.00%)`	⬇️
modin/core/dataframe/base/exchange/__init__.py	`0.00% <0.00%> (-100.00%)`	⬇️
modin/experimental/xgboost/test/test_dmatrix.py	`0.00% <0.00%> (-100.00%)`	⬇️
... and 100 more

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

modin-bot · 2022-06-02T20:12:30Z

TeamCity Python test results bot

Tests FAILed

Tests Logs

============================= test session starts ==============================
platform linux -- Python 3.8.13, pytest-7.1.2, pluggy-1.0.0
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /modin, configfile: setup.cfg
plugins: benchmark-3.4.1, cov-2.11.0, forked-1.4.0, xdist-2.5.0
collected 2464 items

modin/pandas/test/test_io.py ........................................... [  1%]
........................................................................ [  4%]
.............................................ssssssssssssssssssss.ss.ss. [  7%]
ss.ss.ss.ss.ss.ss.ss.ss.ss.ssssssssssssssssssss.ss.ss.ss.ss.ss.ss.ss.ss. [ 10%]
ss.ss.ss.ssssssssssssssssssss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.sssssssss [ 13%]
sssssssssss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ssssssssssssssssssss.ss.ss. [ 16%]
ss.ss.ss.ss.ss.ss.ss.ss.ss.ssssssssssssssssssss.ss.ss.ss.ss.ss.ss.ss.ss. [ 19%]
ss.ss.ss.ssssssssssssssssssss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.sssssssss [ 22%]
sssssssssss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss............................ [ 25%]
........................................................................ [ 28%]
........................................................................ [ 30%]
........................................................................ [ 33%]
........................................................................ [ 36%]
........................................................................ [ 39%]
........................................................................ [ 42%]
........................................................................ [ 45%]
........................................................................ [ 48%]
.................................s...................................... [ 51%]
........................................................................ [ 54%]
........................................................................ [ 57%]
........................................................................ [ 60%]
........................................................................ [ 63%]
........................................................................ [ 66%]
........................................................................ [ 68%]
........................................................................ [ 71%]
........................................................................ [ 74%]
........................................................................ [ 77%]
........................................................................ [ 80%]
................................xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx [ 83%]
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx [ 86%]
xxxxxxxx................................................................ [ 89%]
...........................................................F............ [ 92%]
........................................................................ [ 95%]
.............................................................X.......ss. [ 98%]
...x....................xx.........ss........                            [100%]

=================================== FAILURES ===================================
________________________ TestCsv.test_read_csv_s3[None] ________________________
Traceback (most recent call last):
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/httpsession.py", line 178, in send
    response = await self._session.request(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiohttp/client.py", line 559, in _request
    await resp.start(conn)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 898, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiohttp/streams.py", line 616, in read
    await self._waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/modin/modin/pandas/test/utils.py", line 728, in execute_callable
    pd_result = fn(pandas_df, **pd_kwargs)
  File "/modin/modin/pandas/test/utils.py", line 812, in applyier
    result = getattr(module, fn_name)(*args, **kwargs)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 933, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1217, in _make_engine
    self.handles = get_handle(  # type: ignore[call-overload]
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/common.py", line 670, in get_handle
    ioargs = _get_filepath_or_buffer(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/common.py", line 385, in _get_filepath_or_buffer
    file_obj = fsspec.open(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/core.py", line 141, in open
    out = self.__enter__()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/core.py", line 104, in __enter__
    f = self.fs.open(self.path, mode=mode)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/spec.py", line 1037, in open
    f = self._open(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/s3fs/core.py", line 605, in _open
    return S3File(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/s3fs/core.py", line 1911, in __init__
    super().__init__(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/spec.py", line 1385, in __init__
    self.size = self.details["size"]
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/spec.py", line 1398, in details
    self._details = self.fs.info(self.path)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/asyn.py", line 86, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/asyn.py", line 66, in sync
    raise return_result
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/asyn.py", line 26, in _runner
    result[0] = await coro
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/s3fs/core.py", line 1140, in _info
    out = await self._call_s3(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/s3fs/core.py", line 325, in _call_s3
    await self.set_session()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/s3fs/core.py", line 473, in set_session
    self._s3 = await s3creator.__aenter__()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/session.py", line 22, in __aenter__
    self._client = await self._coro
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/session.py", line 102, in _create_client
    credentials = await self.get_credentials()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/session.py", line 133, in get_credentials
    self._credentials = await (self._components.get_component(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/credentials.py", line 814, in load_credentials
    creds = await provider.load()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/credentials.py", line 486, in load
    metadata = await fetcher.retrieve_iam_role_credentials()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/utils.py", line 176, in retrieve_iam_role_credentials
    role_name = await self._get_iam_role(token)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/utils.py", line 200, in _get_iam_role
    return await (await self._get_request(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/utils.py", line 127, in _get_request
    response = await session.send(request.prepare())
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/httpsession.py", line 212, in send
    raise ReadTimeoutError(endpoint_url=request.url, error=e)
botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: "http://169.254.169.254/latest/meta-data/iam/security-credentials/"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/modin/modin/pandas/test/test_io.py", line 936, in test_read_csv_s3
    eval_io(
  File "/modin/modin/pandas/test/utils.py", line 835, in eval_io
    call_eval_general()
  File "/modin/modin/pandas/test/utils.py", line 818, in call_eval_general
    eval_general(
  File "/modin/modin/pandas/test/utils.py", line 767, in eval_general
    values = execute_callable(
  File "/modin/modin/pandas/test/utils.py", line 734, in execute_callable
    repr(fn(modin_df, **md_kwargs))
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/_pytest/python_api.py", line 966, in __exit__
    fail(self.message)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/_pytest/outcomes.py", line 196, in fail
    raise Failed(msg=reason, pytrace=pytrace)
Failed: DID NOT RAISE <class 'Exception'>

---------- coverage: platform linux, python 3.8.13-final-0 -----------
Coverage XML written to file coverage.xml

=========================== short test summary info ============================
FAILED modin/pandas/test/test_io.py::TestCsv::test_read_csv_s3[None] - Failed...
= 1 failed, 1998 passed, 341 skipped, 123 xfailed, 1 xpassed, 3914 warnings in 135.28s (0:02:15) =
============================= test session starts ==============================
platform linux -- Python 3.8.13, pytest-7.1.2, pluggy-1.0.0
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /modin, configfile: setup.cfg
plugins: benchmark-3.4.1, cov-2.11.0, forked-1.4.0, xdist-2.5.0
collected 2464 items

modin/pandas/test/test_io.py ........................................... [  1%]
........................................................................ [  4%]
.............................................ssssssssssssssssssss.ss.ss. [  7%]
ss.ss.ss.ss.ss.ss.ss.ss.ss.ssssssssssssssssssss.ss.ss.ss.ss.ss.ss.ss.ss. [ 10%]
ss.ss.ss.ssssssssssssssssssss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.sssssssss [ 13%]
sssssssssss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ssssssssssssssssssss.ss.ss. [ 16%]
ss.ss.ss.ss.ss.ss.ss.ss.ss.ssssssssssssssssssss.ss.ss.ss.ss.ss.ss.ss.ss. [ 19%]
ss.ss.ss.ssssssssssssssssssss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.sssssssss [ 22%]
sssssssssss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss............................ [ 25%]
........................................................................ [ 28%]
........................................................................ [ 30%]
........................................................................ [ 33%]
........................................................................ [ 36%]
........................................................................ [ 39%]
........................................................................ [ 42%]
........................................................................ [ 45%]
........................................................................ [ 48%]
.................................s...................................... [ 51%]
........................................................................ [ 54%]
........................................................................ [ 57%]
........................................................................ [ 60%]
........................................................................ [ 63%]
........................................................................ [ 66%]
........................................................................ [ 68%]
........................................................................ [ 71%]
........................................................................ [ 74%]
........................................................................ [ 77%]
........................................................................ [ 80%]
................................xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx [ 83%]
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx [ 86%]
xxxxxxxx................................................................ [ 89%]
...........................................................F............ [ 92%]
........................................................................ [ 95%]
.............................................................X.......ss. [ 98%]
...x....................xx.........ss........                            [100%]

=================================== FAILURES ===================================
________________________ TestCsv.test_read_csv_s3[None] ________________________
Traceback (most recent call last):
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/httpsession.py", line 178, in send
    response = await self._session.request(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiohttp/client.py", line 559, in _request
    await resp.start(conn)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 898, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiohttp/streams.py", line 616, in read
    await self._waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/modin/modin/pandas/test/utils.py", line 728, in execute_callable
    pd_result = fn(pandas_df, **pd_kwargs)
  File "/modin/modin/pandas/test/utils.py", line 812, in applyier
    result = getattr(module, fn_name)(*args, **kwargs)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 933, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1217, in _make_engine
    self.handles = get_handle(  # type: ignore[call-overload]
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/common.py", line 670, in get_handle
    ioargs = _get_filepath_or_buffer(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/common.py", line 385, in _get_filepath_or_buffer
    file_obj = fsspec.open(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/core.py", line 141, in open
    out = self.__enter__()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/core.py", line 104, in __enter__
    f = self.fs.open(self.path, mode=mode)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/spec.py", line 1037, in open
    f = self._open(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/s3fs/core.py", line 605, in _open
    return S3File(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/s3fs/core.py", line 1911, in __init__
    super().__init__(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/spec.py", line 1385, in __init__
    self.size = self.details["size"]
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/spec.py", line 1398, in details
    self._details = self.fs.info(self.path)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/asyn.py", line 86, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/asyn.py", line 66, in sync
    raise return_result
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/asyn.py", line 26, in _runner
    result[0] = await coro
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/s3fs/core.py", line 1140, in _info
    out = await self._call_s3(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/s3fs/core.py", line 325, in _call_s3
    await self.set_session()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/s3fs/core.py", line 473, in set_session
    self._s3 = await s3creator.__aenter__()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/session.py", line 22, in __aenter__
    self._client = await self._coro
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/session.py", line 102, in _create_client
    credentials = await self.get_credentials()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/session.py", line 133, in get_credentials
    self._credentials = await (self._components.get_component(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/credentials.py", line 814, in load_credentials
    creds = await provider.load()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/credentials.py", line 486, in load
    metadata = await fetcher.retrieve_iam_role_credentials()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/utils.py", line 176, in retrieve_iam_role_credentials
    role_name = await self._get_iam_role(token)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/utils.py", line 200, in _get_iam_role
    return await (await self._get_request(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/utils.py", line 127, in _get_request
    response = await session.send(request.prepare())
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/httpsession.py", line 212, in send
    raise ReadTimeoutError(endpoint_url=request.url, error=e)
botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: "http://169.254.169.254/latest/meta-data/iam/security-credentials/"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/modin/modin/pandas/test/test_io.py", line 936, in test_read_csv_s3
    eval_io(
  File "/modin/modin/pandas/test/utils.py", line 835, in eval_io
    call_eval_general()
  File "/modin/modin/pandas/test/utils.py", line 818, in call_eval_general
    eval_general(
  File "/modin/modin/pandas/test/utils.py", line 767, in eval_general
    values = execute_callable(
  File "/modin/modin/pandas/test/utils.py", line 734, in execute_callable
    repr(fn(modin_df, **md_kwargs))
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/_pytest/python_api.py", line 966, in __exit__
    fail(self.message)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/_pytest/outcomes.py", line 196, in fail
    raise Failed(msg=reason, pytrace=pytrace)
Failed: DID NOT RAISE <class 'Exception'>

---------- coverage: platform linux, python 3.8.13-final-0 -----------
Coverage XML written to file coverage.xml

=========================== short test summary info ============================
FAILED modin/pandas/test/test_io.py::TestCsv::test_read_csv_s3[None] - Failed...
= 1 failed, 1998 passed, 341 skipped, 123 xfailed, 1 xpassed, 3914 warnings in 135.28s (0:02:15) =

RehanSD · 2022-06-03T00:43:12Z

Fix looks good, but introduces a bug when we have a column that shares the name of the index. In pandas, that fails like so:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([[1, 2, 3]], index=pd.Index([0], name="so"), columns=['so', 'b', 'c'])

In [3]: df.groupby(by=['so', 'b']).count()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 df.groupby(by=['so', 'b']).count()

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/frame.py:7712, in DataFrame.groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
   7707 axis = self._get_axis_number(axis)
   7709 # https://github.com/python/mypy/issues/7642
   7710 # error: Argument "squeeze" to "DataFrameGroupBy" has incompatible type
   7711 # "Union[bool, NoDefault]"; expected "bool"
-> 7712 return DataFrameGroupBy(
   7713     obj=self,
   7714     keys=by,
   7715     axis=axis,
   7716     level=level,
   7717     as_index=as_index,
   7718     sort=sort,
   7719     group_keys=group_keys,
   7720     squeeze=squeeze,  # type: ignore[arg-type]
   7721     observed=observed,
   7722     dropna=dropna,
   7723 )

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:882, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
    879 if grouper is None:
    880     from pandas.core.groupby.grouper import get_grouper
--> 882     grouper, exclusions, obj = get_grouper(
    883         obj,
    884         keys,
    885         axis=axis,
    886         level=level,
    887         sort=sort,
    888         observed=observed,
    889         mutated=self.mutated,
    890         dropna=self.dropna,
    891     )
    893 self.obj = obj
    894 self.axis = obj._get_axis_number(axis)

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py:872, in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
    870 if gpr in obj:
    871     if validate:
--> 872         obj._check_label_or_level_ambiguity(gpr, axis=axis)
    873     in_axis, name, gpr = True, gpr, obj[gpr]
    874     if gpr.ndim != 1:
    875         # non-unique columns; raise here to get the name in the
    876         # exception message

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/generic.py:1794, in NDFrame._check_label_or_level_ambiguity(self, key, axis)
   1786 label_article, label_type = (
   1787     ("a", "column") if axis == 0 else ("an", "index")
   1788 )
   1790 msg = (
   1791     f"'{key}' is both {level_article} {level_type} level and "
   1792     f"{label_article} {label_type} label, which is ambiguous."
   1793 )
-> 1794 raise ValueError(msg)

ValueError: 'so' is both an index level and a column label, which is ambiguous.

while it now succeeds in this pr:

In [1]: import modin.pandas as pd

In [2]: df = pd.DataFrame([[1, 2, 3]], index=pd.Index([0], name="so"), columns=['so', 'b', 'c'])
UserWarning: Ray execution environment not yet initialized. Initializing...
To remove this warning, run the following python code before doing dataframe operations:

    import ray
    ray.init()

UserWarning: On Macs, Ray's performance is known to degrade with object store size greater than 2.0 GiB. Ray by default does not allow setting an object store size greater than that. Modin is overriding that default limit because it would rather have a larger, slower object store than spill to disk more often. To override Modin's behavior, you can initialize Ray yourself.
UserWarning: Distributing <class 'list'> object. This may take some time.

In [3]: df.groupby(by=['so', 'b']).count()
Out[3]:
      c
so b
1  2  1

this is probably a simple fix though - we can just check if this is happening in the API level so that lower layers know that if the index and column share the same name, its due to a deferred index propagation.

Signed-off-by: Devin Petersohn <[email protected]>

Signed-off-by: Rehan Durrani <[email protected]>

RehanSD · 2022-06-07T18:34:40Z

modin/pandas/test/test_groupby.py

@@ -1570,7 +1570,7 @@ def test_agg_exceptions(operation):
        },
    ],
 )
-def test_to_pandas_convertion(kwargs):
+def test_to_pandas_conversion(kwargs):


Quick fix for a typo I noticed when updating the testing suite!

naren-ponder · 2022-06-07T18:35:57Z

modin/pandas/dataframe.py

@@ -415,7 +415,6 @@ def groupby(
            )
        else:
            squeeze = False
-


Nit: why are we removing the empty line?

Will remove!

modin/pandas/dataframe.py

modin/pandas/test/test_groupby.py

Signed-off-by: Rehan Durrani <[email protected]>

…into bugs/4522 Signed-off-by: Rehan Durrani <[email protected]>

Signed-off-by: Rehan Durrani <[email protected]>

RehanSD

LGTM!

jeffreykennethli · 2022-06-07T18:56:03Z

modin/pandas/dataframe.py

+                level_name, index_name = "an index", "a column"
+                if axis == 1:
+                    level_name, index_name = index_name, level_name
+                raise ValueError(


nit: it's possible for multiple k to be invalid and cause us to raise a ValueError, however currently only one such k will be raised. Is it worth collecting all invalid k and raising a ValueError if any k are invalid?

I checked, and in pandas, only the first invalid k is raised!

jeffreykennethli · 2022-06-07T18:59:33Z

modin/pandas/test/test_groupby.py

@@ -2032,3 +2032,70 @@ def test_sum_with_level():
    }
    modin_df, pandas_df = pd.DataFrame(data), pandas.DataFrame(data)
    eval_general(modin_df, pandas_df, lambda df: df.set_index("C").groupby("C").sum())
+
+
+def test_reset_index_groupby():


without context, this test is unclear. Could you add a brief description of what error condition this is testing or link to the gh issue?

Sure, can do!

jeffreykennethli · 2022-06-07T19:00:06Z

modin/pandas/test/test_groupby.py

+        ),
+    ).add_prefix("col")
+    pandas_df.index.names = [f"index_{i}" for i in range(len(pandas_df.index.names))]
+    # Convert every other column to string


nit: convert every even column to string

Will change!

jeffreykennethli · 2022-06-07T19:01:49Z

modin/pandas/test/test_groupby.py

+    for col in pandas_df.iloc[
+        :, [i for i in range(len(pandas_df.columns)) if i % 2 == 0]
+    ]:
+        pandas_df[col] = [str(chr(i)) for i in pandas_df[col]]


would be helpful if you could describe the schema for pandas_df here, to make the results of the above computation clearer.

Makes sense!

Signed-off-by: Rehan Durrani <[email protected]>

RehanSD

LGTM

Signed-off-by: Rehan Durrani <[email protected]>

RehanSD · 2022-06-07T21:10:15Z

There was another bug, where if we were passed a pandas.Series or Index to groupby's by argument, we were treating it the same as a list (i.e. looking at its values and doing a groupby by them). I've added a fix that handles these cases correctly, as well as some tests for both of these cases!

Signed-off-by: Rehan Durrani <[email protected]>

naren-ponder · 2022-06-08T01:39:17Z

modin/pandas/dataframe.py

+        # We don't need to check if `by` is a Series or Index, since those
+        # won't be referencing labels
+        if not isinstance(by, (pandas.Series, Series, pandas.Index)):
+            _by_check = by if is_list_like(by) else [by]


Is it okay to rename this from _by_check to _by_list?

Yeah, I think that makes sense to me!

Signed-off-by: Rehan Durrani <[email protected]>

RehanSD · 2022-06-08T04:16:50Z

We've decided to revert back the fix that makes us handle pandas.Index as the by argument to groupby the same as pandas does, and move that to a different PR.

FIX-modin-project#4522: Correct multiindex metadata with groupby

13395f0

Signed-off-by: Devin Petersohn <[email protected]>

devin-petersohn requested a review from a team as a code owner June 2, 2022 19:26

devin-petersohn added 2 commits June 2, 2022 14:32

Add test

8040915

Signed-off-by: Devin Petersohn <[email protected]>

Add release note

6e8466e

Signed-off-by: Devin Petersohn <[email protected]>

devin-petersohn and others added 4 commits June 3, 2022 08:50

Fix error checking for groupby

95f5510

Signed-off-by: Devin Petersohn <[email protected]>

Fix error checking

d2e8c5f

Signed-off-by: Devin Petersohn <[email protected]>

Add test for when by is in intersection of index names and columns

a5bb81e

Signed-off-by: Rehan Durrani <[email protected]>

Merge branch 'master' into bugs/4522

5a11af0

RehanSD reviewed Jun 7, 2022

View reviewed changes

naren-ponder reviewed Jun 7, 2022

View reviewed changes

RehanSD added 3 commits June 7, 2022 11:41

lint

75c6089

Signed-off-by: Rehan Durrani <[email protected]>

Merge branch 'bugs/4522' of https://github.com/devin-petersohn/modin …

177bfe2

…into bugs/4522 Signed-off-by: Rehan Durrani <[email protected]>

remove unused line

a547d32

Signed-off-by: Rehan Durrani <[email protected]>

RehanSD requested a review from naren-ponder June 7, 2022 18:48

RehanSD previously approved these changes Jun 7, 2022

View reviewed changes

jeffreykennethli reviewed Jun 7, 2022

View reviewed changes

naren-ponder previously approved these changes Jun 7, 2022

View reviewed changes

jeffreykennethli reviewed Jun 7, 2022

View reviewed changes

add comments to test

3ce19ea

Signed-off-by: Rehan Durrani <[email protected]>

RehanSD dismissed stale reviews from naren-ponder and themself via 3ce19ea June 7, 2022 19:15

RehanSD requested review from naren-ponder, jeffreykennethli and RehanSD June 7, 2022 19:16

RehanSD previously approved these changes Jun 7, 2022

View reviewed changes

jeffreykennethli previously approved these changes Jun 7, 2022

View reviewed changes

fix typo and add solution for series + index

0e4a521

Signed-off-by: Rehan Durrani <[email protected]>

RehanSD dismissed stale reviews from jeffreykennethli and themself via 0e4a521 June 7, 2022 21:09

RehanSD requested review from jeffreykennethli and RehanSD June 7, 2022 21:09

RehanSD added 2 commits June 7, 2022 14:12

update test

639a201

Signed-off-by: Rehan Durrani <[email protected]>

Fix value counts to call groupby with appropriate arguments

d2c7295

Signed-off-by: Rehan Durrani <[email protected]>

naren-ponder previously approved these changes Jun 8, 2022

View reviewed changes

Fix naming

55ec3a1

Signed-off-by: Rehan Durrani <[email protected]>

RehanSD dismissed naren-ponder’s stale review via 55ec3a1 June 8, 2022 01:41

Merge branch 'master' into bugs/4522

2e9a056

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FIX-#4522: Correct multiindex metadata with groupby #4523

FIX-#4522: Correct multiindex metadata with groupby #4523

devin-petersohn commented Jun 2, 2022 •

edited

Loading

codecov bot commented Jun 2, 2022 •

edited

Loading

modin-bot commented Jun 2, 2022 •

edited

Loading

RehanSD commented Jun 3, 2022 •

edited

Loading

RehanSD Jun 7, 2022

naren-ponder Jun 7, 2022

RehanSD Jun 7, 2022

RehanSD left a comment

jeffreykennethli Jun 7, 2022

RehanSD Jun 7, 2022

jeffreykennethli Jun 7, 2022

RehanSD Jun 7, 2022

jeffreykennethli Jun 7, 2022

RehanSD Jun 7, 2022

jeffreykennethli Jun 7, 2022

RehanSD Jun 7, 2022

RehanSD left a comment

RehanSD commented Jun 7, 2022

naren-ponder Jun 8, 2022

RehanSD Jun 8, 2022

RehanSD commented Jun 8, 2022

FIX-#4522: Correct multiindex metadata with groupby #4523

Are you sure you want to change the base?

FIX-#4522: Correct multiindex metadata with groupby #4523

Conversation

devin-petersohn commented Jun 2, 2022 • edited Loading

What do these changes do?

codecov bot commented Jun 2, 2022 • edited Loading

Codecov Report

modin-bot commented Jun 2, 2022 • edited Loading

TeamCity Python test results bot

Tests FAILed

RehanSD commented Jun 3, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RehanSD left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RehanSD left a comment

Choose a reason for hiding this comment

RehanSD commented Jun 7, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RehanSD commented Jun 8, 2022

devin-petersohn commented Jun 2, 2022 •

edited

Loading

codecov bot commented Jun 2, 2022 •

edited

Loading

modin-bot commented Jun 2, 2022 •

edited

Loading

RehanSD commented Jun 3, 2022 •

edited

Loading