BUG: Merge on Ray Engine does not produce the column "_merge" when indicator is True #7384

Open
3 tasks done
castelojb opened this issue Sep 5, 2024 · 0 comments
Labels
bug 🦗 Something isn't working Triage 🩹 Issues that need triage

castelojb commented Sep 5, 2024

Modin version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest released version of Modin.

  • I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

I get a KeyError for '_merge' from the following code:


import os
os.environ["MODIN_ENGINE"] = "ray"

import modin.pandas as pd


data1 = {
    'id': [1, 2, 3, 4, 5],
    'name': ['Ana', 'Bruno', 'Carlos', 'Daniela', 'Eduardo'],
    'age': [23, 35, 45, 22, 39]
}

df1 = pd.DataFrame(data1)

data2 = {
    'id': [3, 4, 5, 6, 7],
    'name': ['Carlos', 'Daniela', 'Eduardo', 'Fernanda', 'Gustavo'],
    'age': [5000, 6000, 7000, 8000, 9000]
}

df2 = pd.DataFrame(data2)

df_merged = df1.merge(df2, on=['id', 'name'], how='left', indicator=True)

# Raises KeyError: '_merge' because the indicator column is missing
df_merged[df_merged['_merge'] == 'both']

Issue Description

In this reproduction, Modin's merge does not honor the indicator=True option, which is available in pandas: the merged result has no _merge column. In pandas, this option adds a _merge column that records the source of each row after the merge (that is, whether the row's key was found only in the left DataFrame, only in the right DataFrame, or in both).
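To make concrete what the missing column encodes, here is a minimal sketch that derives the same labels by hand using marker columns. It is written against stock pandas; the helper name `merge_with_manual_indicator` and the marker column names `_in_left` and `_in_right` are hypothetical, chosen only for illustration. Whether the same approach works as a stopgap under Modin on Ray is an assumption that has not been tested here.

```python
import numpy as np
import pandas as pd  # stock pandas; swapping in modin.pandas is untested


def merge_with_manual_indicator(left, right, **merge_kwargs):
    """Merge two frames and rebuild a '_merge'-style column by hand.

    Hypothetical helper: the marker column names must not clash with
    existing columns in either frame.
    """
    # Tag every row with its side of origin before merging.
    left_marked = left.assign(_in_left=True)
    right_marked = right.assign(_in_right=True)
    merged = left_marked.merge(right_marked, **merge_kwargs)

    # A marker is missing (NaN) wherever that side had no matching row.
    in_left = merged.pop("_in_left").notna()
    in_right = merged.pop("_in_right").notna()

    labels = np.where(
        in_left & in_right,
        "both",
        np.where(in_left, "left_only", "right_only"),
    )
    # pandas stores the indicator as a categorical with these three categories.
    merged["_merge"] = pd.Categorical(
        labels, categories=["left_only", "right_only", "both"]
    )
    return merged


left = pd.DataFrame({"id": [1, 2], "age": [23, 35]})
right = pd.DataFrame({"id": [2, 3], "age": [5000, 6000]})

manual = merge_with_manual_indicator(left, right, on="id", how="outer")
reference = left.merge(right, on="id", how="outer", indicator=True)
print(manual)
print(reference)
```

An outer merge is used here only so that all three labels show up; the reproduction above uses how='left', where right_only cannot occur.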

Expected Behavior

When performing a merge with indicator=True, Modin should add a column _merge that shows whether a row is from the left DataFrame (left_only), the right DataFrame (right_only), or from both (both).
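For comparison, the same data can be run through stock pandas, which shows the expected result. This is only a sketch of the expected behavior: it imports `pandas` directly instead of `modin.pandas` and does not start Ray. Because the merge uses how='left', only left_only and both can appear; right_only would need a right or outer merge.

```python
import pandas as pd  # stock pandas, not modin.pandas

df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["Ana", "Bruno", "Carlos", "Daniela", "Eduardo"],
    "age": [23, 35, 45, 22, 39],
})
df2 = pd.DataFrame({
    "id": [3, 4, 5, 6, 7],
    "name": ["Carlos", "Daniela", "Eduardo", "Fernanda", "Gustavo"],
    "age": [5000, 6000, 7000, 8000, 9000],
})

expected = df1.merge(df2, on=["id", "name"], how="left", indicator=True)

# The indicator column is present, so filtering on it does not raise.
both_rows = expected[expected["_merge"] == "both"]

print(expected["_merge"].tolist())
# ['left_only', 'left_only', 'both', 'both', 'both']
print(both_rows["id"].tolist())
# [3, 4, 5]
```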

Error Logs

2024-09-05 16:44:04,679	INFO worker.py:1783 -- Started a local Ray instance.
Traceback (most recent call last):
  File "C:\Users\Pichau\WorkSpaceIdeos\partech\aux.py", line 27, in <module>
    df_merged[df_merged['_merge'] == 'both']
  File "C:\Users\Pichau\miniconda3\envs\partech_gloe_update\lib\site-packages\modin\logging\logger_decorator.py", line 144, in run_and_log
    return obj(*args, **kwargs)
  File "C:\Users\Pichau\miniconda3\envs\partech_gloe_update\lib\site-packages\modin\pandas\base.py", line 3948, in __getitem__
    return self._getitem(key)
  File "C:\Users\Pichau\miniconda3\envs\partech_gloe_update\lib\site-packages\modin\logging\logger_decorator.py", line 144, in run_and_log
    return obj(*args, **kwargs)
  File "C:\Users\Pichau\miniconda3\envs\partech_gloe_update\lib\site-packages\modin\pandas\dataframe.py", line 3247, in _getitem
    return self._getitem_column(key)
  File "C:\Users\Pichau\miniconda3\envs\partech_gloe_update\lib\site-packages\modin\logging\logger_decorator.py", line 144, in run_and_log
    return obj(*args, **kwargs)
  File "C:\Users\Pichau\miniconda3\envs\partech_gloe_update\lib\site-packages\modin\pandas\dataframe.py", line 2581, in _getitem_column
    raise KeyError("{}".format(key))
KeyError: '_merge'

Installed Versions

INSTALLED VERSIONS

commit : c8bbca8
python : 3.10.14.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22621
machine : AMD64
processor : AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : pt_BR.cp1252

Modin dependencies

modin : 0.31.0
ray : 2.35.0
dask : 2023.3.2
distributed : 2023.3.2.1

pandas dependencies

pandas : 2.2.0
numpy : 1.24.3
pytz : 2024.1
dateutil : 2.8.2
setuptools : 69.5.1
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : None
pymysql : None
psycopg2 : 2.9.6
jinja2 : 3.1.4
IPython : 8.27.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : 2024.2.0
fsspec : 2024.3.1
gcsfs : None
matplotlib : 3.9.2
numba : 0.57.0
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 17.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 2.0.34
tables : None
tabulate : 0.9.0
xarray : 2024.7.0
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

@castelojb castelojb added bug 🦗 Something isn't working Triage 🩹 Issues that need triage labels Sep 5, 2024