Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python] Ingest/outgest round-trip improvements #2804

Merged
merged 3 commits into from
Aug 6, 2024
Merged

[python] Ingest/outgest round-trip improvements #2804

merged 3 commits into from
Aug 6, 2024

Conversation

ryan-williams
Copy link
Member

@ryan-williams ryan-williams commented Jul 22, 2024

  1. The first commit in this PR just adds new unit tests, which verify several categories of ingest/outgest "round-trip mutations."

  2. The second commit factors DF ingest/outgest logic currently ≈duplicated in 2 places each:

    • ingest: ingest.py/signatures.py
    • outgest: obs/var

    In the near future we'll want to use both for uns as well.

  3. The third commit adds a 2nd set of tests, functionally identical to those added in 1., except they directly round-trip an "obs" DataFrame (using _write_dataframe and _read_dataframe; the latter is added in 2.).

    • The tests from 1. round-trip using {from,to}_anndata), and are still present.

Copy link

codecov bot commented Jul 22, 2024

Codecov Report

Attention: Patch coverage is 97.05882% with 1 line in your changes missing coverage. Please review.

Project coverage is 90.03%. Comparing base (edabcb6) to head (a238e7d).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2804      +/-   ##
==========================================
- Coverage   90.06%   90.03%   -0.04%     
==========================================
  Files          37       37              
  Lines        3945     3942       -3     
==========================================
- Hits         3553     3549       -4     
- Misses        392      393       +1     
Flag Coverage Δ
python 90.03% <97.05%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api 90.03% <97.05%> (-0.04%) ⬇️
libtiledbsoma ∅ <ø> (∅)

@ryan-williams ryan-williams changed the title [python]: fix ingest/outgest round-tripping [python]: ingest/outgest round-trip improvements Jul 26, 2024
@ryan-williams ryan-williams marked this pull request as ready for review August 2, 2024 02:25
Copy link
Member

@johnkerl johnkerl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for splitting out new unit-test cases from the ones you're modifying on #2824. This (a) makes it easier to verify you're not re-breaking old bugfixes that were completed in the past, and (b) makes it much easier to verify/highlight new bugfixes which need to be made in the future.

apis/python/src/tiledbsoma/io/_registration/signatures.py Outdated Show resolved Hide resolved
arrow_table = df_to_arrow(df)
arrow_schema = arrow_table.schema.remove_metadata()
return _string_dict_from_arrow_schema(arrow_schema)


# Metadata indicating a DataFrame's original index column name, serialized as a JSON string or `null`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Metadata indicating a DataFrame's original index column name, serialized as a JSON string or `null`.
# Metadata indicating a DataFrame's original H5AD/AnnData index-column name, serialized as a JSON string or `null`.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reworked this comment, but am -1 on "H5AD/AnnData index-column name."

H5AD/AnnData don't have an index-column; this is about how we ingest and outgest DataFrames, regardless of their provenance.

apis/python/src/tiledbsoma/io/_registration/signatures.py Outdated Show resolved Hide resolved
apis/python/src/tiledbsoma/io/_registration/signatures.py Outdated Show resolved Hide resolved
# the column named "index" here.
df.drop(columns=["index"], inplace=True)
else:
# If `id_column_name` was passed, and is not already a column in the DataFrame, we assume the original index
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your careful analysis here -- I really appreciate it.

The enumeration of problems here should moved from code comments here, to a tracking issue I'm asking you to create. Then that issue's link should be placed here.

Copy link
Member Author

@ryan-williams ryan-williams Aug 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a link to #2829, couldn't tell if you wanted me to remove the rest of the comment here.

I've left it for now, since the explanation here is specific to the nested if/else context, and the issues the rename below specifically introduces given that context.

Copy link
Member

@johnkerl johnkerl Aug 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. What we have now is a bug-report listing as an extended code comment. That is more appropriately written as an issue we can track and schedule (#2829) -- and I asked you to move this bug-listing from here to there. But if you feel it's crucial for understanding of the code to have this list-out duplicated here, I won't reject this PR on that basis.

apis/python/tests/test_lossy_ingest_outgest_roundtrips.py Outdated Show resolved Hide resolved
apis/python/tests/test_lossy_ingest_outgest_roundtrips.py Outdated Show resolved Hide resolved
apis/python/tests/test_lossy_ingest_outgest_roundtrips.py Outdated Show resolved Hide resolved
apis/python/tests/test_lossy_ingest_outgest_roundtrips.py Outdated Show resolved Hide resolved
apis/python/tests/test_lossy_ingest_outgest_roundtrips.py Outdated Show resolved Hide resolved
Copy link
Member Author

@ryan-williams ryan-williams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I addressed everything

apis/python/src/tiledbsoma/io/_registration/signatures.py Outdated Show resolved Hide resolved
arrow_table = df_to_arrow(df)
arrow_schema = arrow_table.schema.remove_metadata()
return _string_dict_from_arrow_schema(arrow_schema)


# Metadata indicating a DataFrame's original index column name, serialized as a JSON string or `null`.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reworked this comment, but am -1 on "H5AD/AnnData index-column name."

H5AD/AnnData don't have an index-column; this is about how we ingest and outgest DataFrames, regardless of their provenance.

apis/python/src/tiledbsoma/io/_registration/signatures.py Outdated Show resolved Hide resolved
apis/python/src/tiledbsoma/io/_registration/signatures.py Outdated Show resolved Hide resolved
# the column named "index" here.
df.drop(columns=["index"], inplace=True)
else:
# If `id_column_name` was passed, and is not already a column in the DataFrame, we assume the original index
Copy link
Member Author

@ryan-williams ryan-williams Aug 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a link to #2829, couldn't tell if you wanted me to remove the rest of the comment here.

I've left it for now, since the explanation here is specific to the nested if/else context, and the issues the rename below specifically introduces given that context.

apis/python/src/tiledbsoma/io/outgest.py Show resolved Hide resolved
apis/python/tests/test_lossy_ingest_outgest_roundtrips.py Outdated Show resolved Hide resolved
apis/python/tests/test_lossy_ingest_outgest_roundtrips.py Outdated Show resolved Hide resolved
apis/python/tests/test_lossy_ingest_outgest_roundtrips.py Outdated Show resolved Hide resolved
apis/python/tests/test_lossy_ingest_outgest_roundtrips.py Outdated Show resolved Hide resolved
@johnkerl
Copy link
Member

johnkerl commented Aug 5, 2024

I've reworked this comment, but am -1 on "H5AD/AnnData index-column name."

H5AD/AnnData don't have an index-column; this is about how we ingest and outgest DataFrames, regardless of their provenance.

You're right, and thanks -- I meant index-column names on DataFrames withih H5AD/AnnData

# the column named "index" here.
df.drop(columns=["index"], inplace=True)
else:
# If `id_column_name` was passed, and is not already a column in the DataFrame, we assume the original index
Copy link
Member

@johnkerl johnkerl Aug 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. What we have now is a bug-report listing as an extended code comment. That is more appropriately written as an issue we can track and schedule (#2829) -- and I asked you to move this bug-listing from here to there. But if you feel it's crucial for understanding of the code to have this list-out duplicated here, I won't reject this PR on that basis.

@johnkerl johnkerl changed the title [python]: ingest/outgest round-trip improvements [python] Ingest/outgest round-trip improvements Aug 5, 2024
@ryan-williams ryan-williams merged commit 72fdfb3 into main Aug 6, 2024
11 checks passed
@ryan-williams ryan-williams deleted the rw/uns branch August 6, 2024 02:11
github-actions bot pushed a commit that referenced this pull request Aug 6, 2024
* anndata-based dataframe round-trip tests

* factor+document {in,out}gest DF logic

* anndata- and df-based dataframe round-trip tests
johnkerl pushed a commit that referenced this pull request Aug 6, 2024
* anndata-based dataframe round-trip tests

* factor+document {in,out}gest DF logic

* anndata- and df-based dataframe round-trip tests

Co-authored-by: Ryan Williams <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants