Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX-#4522: Correct multiindex metadata with groupby #4523

Open
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

devin-petersohn
Copy link
Collaborator

@devin-petersohn devin-petersohn commented Jun 2, 2022

Signed-off-by: Devin Petersohn [email protected]

What do these changes do?

  • commit message follows format outlined here
  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves reset_index followed by groupby causes exception in some cases #4522
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date
  • added (Issue Number: PR title (PR Number)) and github username to release notes for next major release

@devin-petersohn devin-petersohn requested a review from a team as a code owner June 2, 2022 19:26
Signed-off-by: Devin Petersohn <[email protected]>
Signed-off-by: Devin Petersohn <[email protected]>
@codecov
Copy link

codecov bot commented Jun 2, 2022

Codecov Report

Merging #4523 (2e9a056) into master (8679052) will decrease coverage by 24.76%.
The diff coverage is 88.23%.

@@             Coverage Diff             @@
##           master    #4523       +/-   ##
===========================================
- Coverage   86.34%   61.57%   -24.77%     
===========================================
  Files         228      228               
  Lines       18438    18454       +16     
===========================================
- Hits        15920    11363     -4557     
- Misses       2518     7091     +4573     
Impacted Files Coverage Δ
modin/core/dataframe/algebra/groupby.py 85.57% <50.00%> (-9.53%) ⬇️
modin/pandas/dataframe.py 91.28% <92.85%> (-0.20%) ⬇️
modin/pandas/base.py 94.63% <100.00%> (-0.09%) ⬇️
modin/experimental/core/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
modin/core/execution/ray/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
modin/experimental/pandas/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
modin/experimental/sklearn/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
modin/experimental/core/execution/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
modin/core/dataframe/base/exchange/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
modin/experimental/xgboost/test/test_dmatrix.py 0.00% <0.00%> (-100.00%) ⬇️
... and 100 more

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

@modin-bot
Copy link

modin-bot commented Jun 2, 2022

TeamCity Python test results bot

Tests FAILed

Tests Logs
============================= test session starts ==============================
platform linux -- Python 3.8.13, pytest-7.1.2, pluggy-1.0.0
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /modin, configfile: setup.cfg
plugins: benchmark-3.4.1, cov-2.11.0, forked-1.4.0, xdist-2.5.0
collected 2464 items

modin/pandas/test/test_io.py ........................................... [  1%]
........................................................................ [  4%]
.............................................ssssssssssssssssssss.ss.ss. [  7%]
ss.ss.ss.ss.ss.ss.ss.ss.ss.ssssssssssssssssssss.ss.ss.ss.ss.ss.ss.ss.ss. [ 10%]
ss.ss.ss.ssssssssssssssssssss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.sssssssss [ 13%]
sssssssssss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ssssssssssssssssssss.ss.ss. [ 16%]
ss.ss.ss.ss.ss.ss.ss.ss.ss.ssssssssssssssssssss.ss.ss.ss.ss.ss.ss.ss.ss. [ 19%]
ss.ss.ss.ssssssssssssssssssss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.sssssssss [ 22%]
sssssssssss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss............................ [ 25%]
........................................................................ [ 28%]
........................................................................ [ 30%]
........................................................................ [ 33%]
........................................................................ [ 36%]
........................................................................ [ 39%]
........................................................................ [ 42%]
........................................................................ [ 45%]
........................................................................ [ 48%]
.................................s...................................... [ 51%]
........................................................................ [ 54%]
........................................................................ [ 57%]
........................................................................ [ 60%]
........................................................................ [ 63%]
........................................................................ [ 66%]
........................................................................ [ 68%]
........................................................................ [ 71%]
........................................................................ [ 74%]
........................................................................ [ 77%]
........................................................................ [ 80%]
................................xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx [ 83%]
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx [ 86%]
xxxxxxxx................................................................ [ 89%]
...........................................................F............ [ 92%]
........................................................................ [ 95%]
.............................................................X.......ss. [ 98%]
...x....................xx.........ss........                            [100%]

=================================== FAILURES ===================================
________________________ TestCsv.test_read_csv_s3[None] ________________________
Traceback (most recent call last):
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/httpsession.py", line 178, in send
    response = await self._session.request(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiohttp/client.py", line 559, in _request
    await resp.start(conn)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 898, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiohttp/streams.py", line 616, in read
    await self._waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/modin/modin/pandas/test/utils.py", line 728, in execute_callable
    pd_result = fn(pandas_df, **pd_kwargs)
  File "/modin/modin/pandas/test/utils.py", line 812, in applyier
    result = getattr(module, fn_name)(*args, **kwargs)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 933, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1217, in _make_engine
    self.handles = get_handle(  # type: ignore[call-overload]
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/common.py", line 670, in get_handle
    ioargs = _get_filepath_or_buffer(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/common.py", line 385, in _get_filepath_or_buffer
    file_obj = fsspec.open(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/core.py", line 141, in open
    out = self.__enter__()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/core.py", line 104, in __enter__
    f = self.fs.open(self.path, mode=mode)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/spec.py", line 1037, in open
    f = self._open(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/s3fs/core.py", line 605, in _open
    return S3File(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/s3fs/core.py", line 1911, in __init__
    super().__init__(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/spec.py", line 1385, in __init__
    self.size = self.details["size"]
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/spec.py", line 1398, in details
    self._details = self.fs.info(self.path)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/asyn.py", line 86, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/asyn.py", line 66, in sync
    raise return_result
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/asyn.py", line 26, in _runner
    result[0] = await coro
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/s3fs/core.py", line 1140, in _info
    out = await self._call_s3(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/s3fs/core.py", line 325, in _call_s3
    await self.set_session()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/s3fs/core.py", line 473, in set_session
    self._s3 = await s3creator.__aenter__()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/session.py", line 22, in __aenter__
    self._client = await self._coro
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/session.py", line 102, in _create_client
    credentials = await self.get_credentials()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/session.py", line 133, in get_credentials
    self._credentials = await (self._components.get_component(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/credentials.py", line 814, in load_credentials
    creds = await provider.load()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/credentials.py", line 486, in load
    metadata = await fetcher.retrieve_iam_role_credentials()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/utils.py", line 176, in retrieve_iam_role_credentials
    role_name = await self._get_iam_role(token)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/utils.py", line 200, in _get_iam_role
    return await (await self._get_request(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/utils.py", line 127, in _get_request
    response = await session.send(request.prepare())
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/httpsession.py", line 212, in send
    raise ReadTimeoutError(endpoint_url=request.url, error=e)
botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: "http://169.254.169.254/latest/meta-data/iam/security-credentials/"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/modin/modin/pandas/test/test_io.py", line 936, in test_read_csv_s3
    eval_io(
  File "/modin/modin/pandas/test/utils.py", line 835, in eval_io
    call_eval_general()
  File "/modin/modin/pandas/test/utils.py", line 818, in call_eval_general
    eval_general(
  File "/modin/modin/pandas/test/utils.py", line 767, in eval_general
    values = execute_callable(
  File "/modin/modin/pandas/test/utils.py", line 734, in execute_callable
    repr(fn(modin_df, **md_kwargs))
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/_pytest/python_api.py", line 966, in __exit__
    fail(self.message)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/_pytest/outcomes.py", line 196, in fail
    raise Failed(msg=reason, pytrace=pytrace)
Failed: DID NOT RAISE <class 'Exception'>

---------- coverage: platform linux, python 3.8.13-final-0 -----------
Coverage XML written to file coverage.xml

=========================== short test summary info ============================
FAILED modin/pandas/test/test_io.py::TestCsv::test_read_csv_s3[None] - Failed...
= 1 failed, 1998 passed, 341 skipped, 123 xfailed, 1 xpassed, 3914 warnings in 135.28s (0:02:15) =
============================= test session starts ==============================
platform linux -- Python 3.8.13, pytest-7.1.2, pluggy-1.0.0
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /modin, configfile: setup.cfg
plugins: benchmark-3.4.1, cov-2.11.0, forked-1.4.0, xdist-2.5.0
collected 2464 items

modin/pandas/test/test_io.py ........................................... [  1%]
........................................................................ [  4%]
.............................................ssssssssssssssssssss.ss.ss. [  7%]
ss.ss.ss.ss.ss.ss.ss.ss.ss.ssssssssssssssssssss.ss.ss.ss.ss.ss.ss.ss.ss. [ 10%]
ss.ss.ss.ssssssssssssssssssss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.sssssssss [ 13%]
sssssssssss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ssssssssssssssssssss.ss.ss. [ 16%]
ss.ss.ss.ss.ss.ss.ss.ss.ss.ssssssssssssssssssss.ss.ss.ss.ss.ss.ss.ss.ss. [ 19%]
ss.ss.ss.ssssssssssssssssssss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.sssssssss [ 22%]
sssssssssss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss.ss............................ [ 25%]
........................................................................ [ 28%]
........................................................................ [ 30%]
........................................................................ [ 33%]
........................................................................ [ 36%]
........................................................................ [ 39%]
........................................................................ [ 42%]
........................................................................ [ 45%]
........................................................................ [ 48%]
.................................s...................................... [ 51%]
........................................................................ [ 54%]
........................................................................ [ 57%]
........................................................................ [ 60%]
........................................................................ [ 63%]
........................................................................ [ 66%]
........................................................................ [ 68%]
........................................................................ [ 71%]
........................................................................ [ 74%]
........................................................................ [ 77%]
........................................................................ [ 80%]
................................xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx [ 83%]
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx [ 86%]
xxxxxxxx................................................................ [ 89%]
...........................................................F............ [ 92%]
........................................................................ [ 95%]
.............................................................X.......ss. [ 98%]
...x....................xx.........ss........                            [100%]

=================================== FAILURES ===================================
________________________ TestCsv.test_read_csv_s3[None] ________________________
Traceback (most recent call last):
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/httpsession.py", line 178, in send
    response = await self._session.request(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiohttp/client.py", line 559, in _request
    await resp.start(conn)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 898, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiohttp/streams.py", line 616, in read
    await self._waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/modin/modin/pandas/test/utils.py", line 728, in execute_callable
    pd_result = fn(pandas_df, **pd_kwargs)
  File "/modin/modin/pandas/test/utils.py", line 812, in applyier
    result = getattr(module, fn_name)(*args, **kwargs)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 933, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1217, in _make_engine
    self.handles = get_handle(  # type: ignore[call-overload]
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/common.py", line 670, in get_handle
    ioargs = _get_filepath_or_buffer(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/pandas/io/common.py", line 385, in _get_filepath_or_buffer
    file_obj = fsspec.open(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/core.py", line 141, in open
    out = self.__enter__()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/core.py", line 104, in __enter__
    f = self.fs.open(self.path, mode=mode)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/spec.py", line 1037, in open
    f = self._open(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/s3fs/core.py", line 605, in _open
    return S3File(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/s3fs/core.py", line 1911, in __init__
    super().__init__(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/spec.py", line 1385, in __init__
    self.size = self.details["size"]
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/spec.py", line 1398, in details
    self._details = self.fs.info(self.path)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/asyn.py", line 86, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/asyn.py", line 66, in sync
    raise return_result
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/fsspec/asyn.py", line 26, in _runner
    result[0] = await coro
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/s3fs/core.py", line 1140, in _info
    out = await self._call_s3(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/s3fs/core.py", line 325, in _call_s3
    await self.set_session()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/s3fs/core.py", line 473, in set_session
    self._s3 = await s3creator.__aenter__()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/session.py", line 22, in __aenter__
    self._client = await self._coro
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/session.py", line 102, in _create_client
    credentials = await self.get_credentials()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/session.py", line 133, in get_credentials
    self._credentials = await (self._components.get_component(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/credentials.py", line 814, in load_credentials
    creds = await provider.load()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/credentials.py", line 486, in load
    metadata = await fetcher.retrieve_iam_role_credentials()
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/utils.py", line 176, in retrieve_iam_role_credentials
    role_name = await self._get_iam_role(token)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/utils.py", line 200, in _get_iam_role
    return await (await self._get_request(
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/utils.py", line 127, in _get_request
    response = await session.send(request.prepare())
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/aiobotocore/httpsession.py", line 212, in send
    raise ReadTimeoutError(endpoint_url=request.url, error=e)
botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: "http://169.254.169.254/latest/meta-data/iam/security-credentials/"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/modin/modin/pandas/test/test_io.py", line 936, in test_read_csv_s3
    eval_io(
  File "/modin/modin/pandas/test/utils.py", line 835, in eval_io
    call_eval_general()
  File "/modin/modin/pandas/test/utils.py", line 818, in call_eval_general
    eval_general(
  File "/modin/modin/pandas/test/utils.py", line 767, in eval_general
    values = execute_callable(
  File "/modin/modin/pandas/test/utils.py", line 734, in execute_callable
    repr(fn(modin_df, **md_kwargs))
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/_pytest/python_api.py", line 966, in __exit__
    fail(self.message)
  File "/home/ray/anaconda3/envs/modin/lib/python3.8/site-packages/_pytest/outcomes.py", line 196, in fail
    raise Failed(msg=reason, pytrace=pytrace)
Failed: DID NOT RAISE <class 'Exception'>

---------- coverage: platform linux, python 3.8.13-final-0 -----------
Coverage XML written to file coverage.xml

=========================== short test summary info ============================
FAILED modin/pandas/test/test_io.py::TestCsv::test_read_csv_s3[None] - Failed...
= 1 failed, 1998 passed, 341 skipped, 123 xfailed, 1 xpassed, 3914 warnings in 135.28s (0:02:15) =

@RehanSD
Copy link
Collaborator

RehanSD commented Jun 3, 2022

Fix looks good, but introduces a bug when we have a column that shares the name of the index. In pandas, that fails like so:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([[1, 2, 3]], index=pd.Index([0], name="so"), columns=['so', 'b', 'c'])

In [3]: df.groupby(by=['so', 'b']).count()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 df.groupby(by=['so', 'b']).count()

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/frame.py:7712, in DataFrame.groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
   7707 axis = self._get_axis_number(axis)
   7709 # https://github.com/python/mypy/issues/7642
   7710 # error: Argument "squeeze" to "DataFrameGroupBy" has incompatible type
   7711 # "Union[bool, NoDefault]"; expected "bool"
-> 7712 return DataFrameGroupBy(
   7713     obj=self,
   7714     keys=by,
   7715     axis=axis,
   7716     level=level,
   7717     as_index=as_index,
   7718     sort=sort,
   7719     group_keys=group_keys,
   7720     squeeze=squeeze,  # type: ignore[arg-type]
   7721     observed=observed,
   7722     dropna=dropna,
   7723 )

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:882, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
    879 if grouper is None:
    880     from pandas.core.groupby.grouper import get_grouper
--> 882     grouper, exclusions, obj = get_grouper(
    883         obj,
    884         keys,
    885         axis=axis,
    886         level=level,
    887         sort=sort,
    888         observed=observed,
    889         mutated=self.mutated,
    890         dropna=self.dropna,
    891     )
    893 self.obj = obj
    894 self.axis = obj._get_axis_number(axis)

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py:872, in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
    870 if gpr in obj:
    871     if validate:
--> 872         obj._check_label_or_level_ambiguity(gpr, axis=axis)
    873     in_axis, name, gpr = True, gpr, obj[gpr]
    874     if gpr.ndim != 1:
    875         # non-unique columns; raise here to get the name in the
    876         # exception message

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/generic.py:1794, in NDFrame._check_label_or_level_ambiguity(self, key, axis)
   1786 label_article, label_type = (
   1787     ("a", "column") if axis == 0 else ("an", "index")
   1788 )
   1790 msg = (
   1791     f"'{key}' is both {level_article} {level_type} level and "
   1792     f"{label_article} {label_type} label, which is ambiguous."
   1793 )
-> 1794 raise ValueError(msg)

ValueError: 'so' is both an index level and a column label, which is ambiguous.

while it now succeeds in this pr:

In [1]: import modin.pandas as pd

In [2]: df = pd.DataFrame([[1, 2, 3]], index=pd.Index([0], name="so"), columns=['so', 'b', 'c'])
UserWarning: Ray execution environment not yet initialized. Initializing...
To remove this warning, run the following python code before doing dataframe operations:

    import ray
    ray.init()

UserWarning: On Macs, Ray's performance is known to degrade with object store size greater than 2.0 GiB. Ray by default does not allow setting an object store size greater than that. Modin is overriding that default limit because it would rather have a larger, slower object store than spill to disk more often. To override Modin's behavior, you can initialize Ray yourself.
UserWarning: Distributing <class 'list'> object. This may take some time.

In [3]: df.groupby(by=['so', 'b']).count()
Out[3]:
      c
so b
1  2  1

this is probably a simple fix though - we can just check if this is happening in the API level so that lower layers know that if the index and column share the same name, its due to a deferred index propagation.

@@ -1570,7 +1570,7 @@ def test_agg_exceptions(operation):
},
],
)
def test_to_pandas_convertion(kwargs):
def test_to_pandas_conversion(kwargs):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick fix for a typo I noticed when updating the testing suite!

@@ -415,7 +415,6 @@ def groupby(
)
else:
squeeze = False

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: why are we removing the empty line?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove!

modin/pandas/dataframe.py Outdated Show resolved Hide resolved
modin/pandas/test/test_groupby.py Show resolved Hide resolved
Signed-off-by: Rehan Durrani <[email protected]>
Signed-off-by: Rehan Durrani <[email protected]>
@RehanSD RehanSD requested a review from naren-ponder June 7, 2022 18:48
RehanSD
RehanSD previously approved these changes Jun 7, 2022
Copy link
Collaborator

@RehanSD RehanSD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

level_name, index_name = "an index", "a column"
if axis == 1:
level_name, index_name = index_name, level_name
raise ValueError(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it's possible for multiple k to be invalid and cause us to raise a ValueError, however currently only one such k will be raised. Is it worth collecting all invalid k and raising a ValueError if any k are invalid?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked, and in pandas, only the first invalid k is raised!

naren-ponder
naren-ponder previously approved these changes Jun 7, 2022
@@ -2032,3 +2032,70 @@ def test_sum_with_level():
}
modin_df, pandas_df = pd.DataFrame(data), pandas.DataFrame(data)
eval_general(modin_df, pandas_df, lambda df: df.set_index("C").groupby("C").sum())


def test_reset_index_groupby():

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without context, this test is unclear. Could you add a brief description of what error condition this is testing or link to the gh issue?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, can do!

),
).add_prefix("col")
pandas_df.index.names = [f"index_{i}" for i in range(len(pandas_df.index.names))]
# Convert every other column to string

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: convert every even column to string

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change!

for col in pandas_df.iloc[
:, [i for i in range(len(pandas_df.columns)) if i % 2 == 0]
]:
pandas_df[col] = [str(chr(i)) for i in pandas_df[col]]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be helpful if you could describe the schema for pandas_df here, to make the results of the above computation clearer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense!

Signed-off-by: Rehan Durrani <[email protected]>
@RehanSD RehanSD dismissed stale reviews from naren-ponder and themself via 3ce19ea June 7, 2022 19:15
RehanSD
RehanSD previously approved these changes Jun 7, 2022
Copy link
Collaborator

@RehanSD RehanSD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@RehanSD RehanSD dismissed stale reviews from jeffreykennethli and themself via 0e4a521 June 7, 2022 21:09
@RehanSD
Copy link
Collaborator

RehanSD commented Jun 7, 2022

There was another bug, where if we were passed a pandas.Series or Index to groupby's by argument, we were treating it the same as a list (i.e. looking at its values and doing a groupby by them). I've added a fix that handles these cases correctly, as well as some tests for both of these cases!

naren-ponder
naren-ponder previously approved these changes Jun 8, 2022
# We don't need to check if `by` is a Series or Index, since those
# won't be referencing labels
if not isinstance(by, (pandas.Series, Series, pandas.Index)):
_by_check = by if is_list_like(by) else [by]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it okay to rename this from _by_check to _by_list?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that makes sense to me!

Signed-off-by: Rehan Durrani <[email protected]>
@RehanSD
Copy link
Collaborator

RehanSD commented Jun 8, 2022

We've decided to revert back the fix that makes us handle pandas.Index as the by argument to groupby the same as pandas does, and move that to a different PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

reset_index followed by groupby causes exception in some cases
5 participants