Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] pd.DataFrames modified during ingest/outgest round-trips #2829

Open
ryan-williams opened this issue Aug 2, 2024 · 4 comments
Open

Comments

@ryan-williams
Copy link
Member

ryan-williams commented Aug 2, 2024

(Factored out of #2804)

Describe the bug
Certain pd.DataFrame column/index names can be shuffled or dropped during an ingest/outgest round-trip (e.g. {from,to}_anndata). In some cases, a column or df.index can be dropped.

To Reproduce
See test_dataframe_io_roundtrips.py.

Versions (please complete the following information):

  • TileDB-SOMA version: 1.12.3, seems like ingest/outgest have always worked this way though
  • Language and language version: Python, all supported versions
  • OS: all

Examples

(see test_dataframe_io_roundtrips.py; obs-roundtrip.py does similar and is used in the code snippets below)

1. df.index is named "index":
  • df.index.name is dropped (index becomes unnamed)
python obs-roundtrip.py -i index=xx,yy,zz -c col0=aa,bb,cc -c col1=AA,BB,CC 2>/dev/null
# Before:
#       col0 col1
# index
# xx      aa   AA
# yy      bb   BB
# zz      cc   CC
#
# SOMA DataFrame schema:
# soma_joinid: int64 not null
# obs_id: large_string
# col0: large_string
# col1: large_string
# metadata[soma_dataframe_original_index_name]:
# null
#
# After:
#    col0 col1
# xx   aa   AA
# yy   bb   BB
# zz   cc   CC
2. DataFrame has a column named obs_id:
  • obs_id column becomes unnamed index
  • Original df.index dropped
python obs-roundtrip.py -i xx,yy,zz -c col0=aa,bb,cc -c obs_id=AA,BB,CC 2>/dev/null
# Before:
#    col0 obs_id
# xx   aa     AA
# yy   bb     BB
# zz   cc     CC
#
# SOMA DataFrame schema:
# soma_joinid: int64 not null
# col0: large_string
# obs_id: large_string
# metadata[soma_dataframe_original_index_name]:
# null
#
# After:
#    col0
# AA   aa
# BB   bb
# CC   cc

DataFrame has a column named "index"

3. df.index is unnamed:
  • "index" column is promoted to df.index (unnamed)
  • Original (unnamed) index becomes a column named level_0
python obs-roundtrip.py -i xx,yy,zz -c col0=aa,bb,cc -c index=AA,BB,CC 2>/dev/null
# Before:
#    col0 index
# xx   aa    AA
# yy   bb    BB
# zz   cc    CC
#
# SOMA DataFrame schema:
# soma_joinid: int64 not null
# level_0: large_string
# col0: large_string
# obs_id: large_string
# metadata[soma_dataframe_original_index_name]:
# null
#
# After:
#    level_0 col0
# AA      xx   aa
# BB      yy   bb
# CC      zz   cc
4. df.index is named id_column_name (default obs_id):
  • "index" column is dropped
python obs-roundtrip.py -i obs_id=xx,yy,zz -c col0=aa,bb,cc -c index=AA,BB,CC 2>/dev/null
# Before:
#        col0 index
# obs_id
# xx       aa    AA
# yy       bb    BB
# zz       cc    CC
#
# SOMA DataFrame schema:
# soma_joinid: int64 not null
# obs_id: large_string
# col0: large_string
# metadata[soma_dataframe_original_index_name]:
# "obs_id"
#
# After:
#        col0
# obs_id
# xx       aa
# yy       bb
# zz       cc
5. df.index has another name:
  • "index" column renamed to obs_id
python obs-roundtrip.py -i idx=xx,yy,zz -c col0=aa,bb,cc -c index=AA,BB,CC 2>/dev/null
# Before:
#     col0 index
# idx
# xx    aa    AA
# yy    bb    BB
# zz    cc    CC
#
# SOMA DataFrame schema:
# soma_joinid: int64 not null
# idx: large_string
# col0: large_string
# obs_id: large_string
# metadata[soma_dataframe_original_index_name]:
# "idx"
#
# After:
#     col0 obs_id
# idx
# xx    aa     AA
# yy    bb     BB
# zz    cc     CC

DataFrame also has a column named id_column_name (default: obs_id)

6. df.index is unnamed:
  • unnamed index → column named level_0
  • obs_id column → unnamed index
  • "index" column dropped
python obs-roundtrip.py -i xx,yy,zz -c obs_id=aa,bb,cc -c index=AA,BB,CC 2>/dev/null
# Before:
#    obs_id index
# xx     aa    AA
# yy     bb    BB
# zz     cc    CC
#
# SOMA DataFrame schema:
# soma_joinid: int64 not null
# level_0: large_string
# obs_id: large_string
# metadata[soma_dataframe_original_index_name]:
# null
#
# After:
#    level_0
# aa      xx
# bb      yy
# cc      zz
7. df.index has a name:
  • "index" column is dropped
python obs-roundtrip.py -i idx=xx,yy,zz -c obs_id=aa,bb,cc -c index=AA,BB,CC 2>/dev/null
# Before:
#     obs_id index
# idx
# xx      aa    AA
# yy      bb    BB
# zz      cc    CC
#
# SOMA DataFrame schema:
# soma_joinid: int64 not null
# idx: large_string
# obs_id: large_string
# metadata[soma_dataframe_original_index_name]:
# "idx"
#
# After:
#     obs_id
# idx
# xx      aa
# yy      bb
# zz      cc

It's not clear what backwards-compatibility concerns might exist from fixing each case.

@johnkerl
Copy link
Member

johnkerl commented Aug 5, 2024

It's not clear what backwards-compatibility concerns might exist from fixing each case.

Our existing unit-test suite is our bulwark against all bugs in this area that have been reported and fixed. As long as we can add new unit-test cases for new fixes, without any old unit-test cases breaking, we should be good.

@ryan-williams ryan-williams changed the title [Bug] Lossy ingest/outgest round-trips [Bug] pd.DataFrames modified during ingest/outgest round-trips Aug 5, 2024
@johnkerl
Copy link
Member

@ryan-williams given the merged PRs here are we good to close this, or are there more PRs upcoming?

@ryan-williams
Copy link
Member Author

None of the "examples" above have been addressed. They apply to DataFrames in obs, var, and uns, but only when certain column-/index-name conditions are met.

#2874 fixed two separate issues that were specific to uns DataFrames:

  • Users' uns DataFrames were mutated on ingest (soma_joinid index added)
  • uns DataFrames were modified by I/O round trips (soma_joinid column added)

These applied to all uns DataFrames, regardless of index-/column-names. There was never an issue filed about them, just some Slack discussion.

@johnkerl
Copy link
Member

Thanks @ryan-williams !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants