[python] Ingest/outgest round-trip improvements #2804

ryan-williams · 2024-07-22T20:42:58Z

The first commit in this PR just adds new unit tests, which verify several categories of ingest/outgest "round-trip mutations."
- See [Bug] pd.DataFrames modified during ingest/outgest round-trips #2829 for more details.
The second commit factors out DF ingest/outgest logic that is currently ≈duplicated in 2 places each:
- ingest: ingest.py/signatures.py
- outgest: obs/var
In the near future we'll want to use both for uns as well.
The third commit adds a 2nd set of tests, functionally identical to those added in 1., except they directly round-trip an "obs" DataFrame (using _write_dataframe and _read_dataframe; the latter is added in 2.).
- The tests from 1. (which round-trip using {from,to}_anndata) are still present.
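
To make "round-trip mutation" concrete, here is a minimal pandas-only sketch. It does not call tiledbsoma: `toy_ingest` and `toy_outgest` are hypothetical stand-ins that only move the index into a column and back, which is enough to show one way a DataFrame can come back different from how it went in.

```python
import pandas as pd


def toy_ingest(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for ingest: move the index into a regular column."""
    # pandas names the new column "index" when the index itself has no name.
    return df.reset_index()


def toy_outgest(stored: pd.DataFrame, index_column: str = "index") -> pd.DataFrame:
    """Hypothetical stand-in for outgest: restore the index from a column."""
    return stored.set_index(index_column)


def round_trip_is_lossless(df: pd.DataFrame) -> bool:
    """Return True if the toy ingest/outgest pair reproduces `df` exactly."""
    try:
        # assert_frame_equal also compares index names by default.
        pd.testing.assert_frame_equal(df, toy_outgest(toy_ingest(df)))
        return True
    except AssertionError:
        return False


unnamed = pd.DataFrame({"n_genes": [10, 20]}, index=["cell_a", "cell_b"])
named = unnamed.rename_axis("index")

unnamed_ok = round_trip_is_lossless(unnamed)  # index name None comes back as "index"
named_ok = round_trip_is_lossless(named)      # index name "index" survives unchanged
```

In this toy setup the unnamed index is mutated (it gains the name "index"), while an index that was already named "index" survives. After the column has been written, the two cases cannot be told apart without extra metadata.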

codecov · 2024-07-22T20:55:18Z

Codecov Report

Attention: Patch coverage is 97.05882% with 1 line in your changes missing coverage. Please review.

Project coverage is 90.03%. Comparing base (edabcb6) to head (a238e7d).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2804      +/-   ##
==========================================
- Coverage   90.06%   90.03%   -0.04%     
==========================================
  Files          37       37              
  Lines        3945     3942       -3     
==========================================
- Hits         3553     3549       -4     
- Misses        392      393       +1

Flag	Coverage Δ
python	`90.03% <97.05%> (-0.04%)`	⬇️

Flags with carried forward coverage won't be shown.

Components	Coverage Δ
python_api	`90.03% <97.05%> (-0.04%)`	⬇️
libtiledbsoma	`∅ <ø> (∅)`

johnkerl

Thank you for splitting out new unit-test cases from the ones you're modifying on #2824. This (a) makes it easier to verify you're not re-breaking old bugfixes that were completed in the past, and (b) makes it much easier to verify/highlight new bugfixes which need to be made in the future.

apis/python/src/tiledbsoma/io/_registration/signatures.py

johnkerl · 2024-08-02T20:03:45Z

apis/python/src/tiledbsoma/io/_registration/signatures.py

    arrow_table = df_to_arrow(df)
    arrow_schema = arrow_table.schema.remove_metadata()
    return _string_dict_from_arrow_schema(arrow_schema)


+# Metadata indicating a DataFrame's original index column name, serialized as a JSON string or `null`.


Suggested change

-# Metadata indicating a DataFrame's original index column name, serialized as a JSON string or `null`.
+# Metadata indicating a DataFrame's original H5AD/AnnData index-column name, serialized as a JSON string or `null`.

I've reworked this comment, but am -1 on "H5AD/AnnData index-column name."

H5AD/AnnData don't have an index-column; this is about how we ingest and outgest DataFrames, regardless of their provenance.
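
For readers following this thread, the "JSON string or `null`" encoding can be illustrated with the standard library alone. This is a sketch, not the code in signatures.py: the metadata key `original_index_name` and both helper functions are hypothetical names chosen for illustration.

```python
import json
from typing import Dict, Optional

# Hypothetical metadata key; the real constant is not shown in this thread.
INDEX_NAME_KEY = "original_index_name"


def encode_index_name(name: Optional[str]) -> Dict[str, str]:
    """Store the index name as JSON so that None and any string stay distinguishable."""
    return {INDEX_NAME_KEY: json.dumps(name)}


def decode_index_name(metadata: Dict[str, str]) -> Optional[str]:
    """Recover the index name; JSON `null` decodes back to None."""
    return json.loads(metadata[INDEX_NAME_KEY])


unnamed = encode_index_name(None)    # value is the JSON literal null
literal = encode_index_name("null")  # value is a quoted JSON string, not null
```

Using JSON rather than a bare string means an unnamed index (`None`) and an index literally named "null" produce different stored values, so the decoder can recover either one.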

apis/python/src/tiledbsoma/io/_registration/signatures.py

johnkerl · 2024-08-02T20:07:19Z

apis/python/src/tiledbsoma/io/_registration/signatures.py

+                # the column named "index" here.
+                df.drop(columns=["index"], inplace=True)
+        else:
+            # If `id_column_name` was passed, and is not already a column in the DataFrame, we assume the original index


Thanks for your careful analysis here -- I really appreciate it.

The enumeration of problems here should be moved from the code comments to a tracking issue I'm asking you to create. Then that issue's link should be placed here.

Added a link to #2829, couldn't tell if you wanted me to remove the rest of the comment here.

I've left it for now, since the explanation here is specific to the nested if/else context, and the issues the rename below specifically introduces given that context.

OK. What we have now is a bug-report listing as an extended code comment. That is more appropriately written as an issue we can track and schedule (#2829) -- and I asked you to move this bug-listing from here to there. But if you feel it's crucial for understanding of the code to have this list-out duplicated here, I won't reject this PR on that basis.

apis/python/tests/test_lossy_ingest_outgest_roundtrips.py

ryan-williams

I believe I addressed everything

apis/python/src/tiledbsoma/io/_registration/signatures.py

ryan-williams · 2024-08-02T21:57:52Z

apis/python/src/tiledbsoma/io/_registration/signatures.py

    arrow_table = df_to_arrow(df)
    arrow_schema = arrow_table.schema.remove_metadata()
    return _string_dict_from_arrow_schema(arrow_schema)


+# Metadata indicating a DataFrame's original index column name, serialized as a JSON string or `null`.


I've reworked this comment, but am -1 on "H5AD/AnnData index-column name."

H5AD/AnnData don't have an index-column; this is about how we ingest and outgest DataFrames, regardless of their provenance.

apis/python/src/tiledbsoma/io/_registration/signatures.py

ryan-williams · 2024-08-02T22:03:29Z

apis/python/src/tiledbsoma/io/_registration/signatures.py

+                # the column named "index" here.
+                df.drop(columns=["index"], inplace=True)
+        else:
+            # If `id_column_name` was passed, and is not already a column in the DataFrame, we assume the original index


Added a link to #2829, couldn't tell if you wanted me to remove the rest of the comment here.

I've left it for now, since the explanation here is specific to the nested if/else context, and the issues the rename below specifically introduces given that context.

apis/python/src/tiledbsoma/io/outgest.py

apis/python/tests/test_lossy_ingest_outgest_roundtrips.py

johnkerl · 2024-08-05T13:03:51Z

I've reworked this comment, but am -1 on "H5AD/AnnData index-column name."

H5AD/AnnData don't have an index-column; this is about how we ingest and outgest DataFrames, regardless of their provenance.

You're right, and thanks -- I meant index-column names on DataFrames within H5AD/AnnData.

johnkerl · 2024-08-05T13:08:30Z

apis/python/src/tiledbsoma/io/_registration/signatures.py

+                # the column named "index" here.
+                df.drop(columns=["index"], inplace=True)
+        else:
+            # If `id_column_name` was passed, and is not already a column in the DataFrame, we assume the original index


OK. What we have now is a bug-report listing as an extended code comment. That is more appropriately written as an issue we can track and schedule (#2829) -- and I asked you to move this bug-listing from here to there. But if you feel it's crucial for understanding of the code to have this list-out duplicated here, I won't reject this PR on that basis.

* anndata-based dataframe round-trip tests
* factor+document {in,out}gest DF logic
* anndata- and df-based dataframe round-trip tests

Co-authored-by: Ryan Williams <[email protected]>

ryan-williams force-pushed the rw/uns branch from fc7fdd0 to 3f0f7ad Compare July 22, 2024 20:43

ryan-williams force-pushed the rw/uns branch 8 times, most recently from 24d1acb to bf65845 Compare July 26, 2024 16:11

ryan-williams changed the title ~~[python]: fix ingest/outgest round-tripping~~ [python]: ingest/outgest round-trip improvements Jul 26, 2024

ryan-williams force-pushed the rw/uns branch 4 times, most recently from d38e11f to ef5c1ac Compare August 1, 2024 13:58

ryan-williams mentioned this pull request Aug 1, 2024

[python] Misc. ingest/outgest code-neaten improvements #2824

Merged

ryan-williams force-pushed the rw/uns branch 6 times, most recently from 6fc4709 to ab46b40 Compare August 1, 2024 22:44

ryan-williams marked this pull request as ready for review August 2, 2024 02:25

ryan-williams requested a review from johnkerl August 2, 2024 02:25

ryan-williams force-pushed the rw/uns branch from ab46b40 to dd22e52 Compare August 2, 2024 16:56

johnkerl requested changes Aug 2, 2024

ryan-williams mentioned this pull request Aug 2, 2024

[Bug] pd.DataFrames modified during ingest/outgest round-trips #2829

Open

ryan-williams force-pushed the rw/uns branch 2 times, most recently from 9ad3c6f to c8f2bfa Compare August 5, 2024 12:34

ryan-williams commented Aug 5, 2024

johnkerl approved these changes Aug 5, 2024

johnkerl changed the title ~~[python]: ingest/outgest round-trip improvements~~ [python] Ingest/outgest round-trip improvements Aug 5, 2024

anndata-based dataframe round-trip tests

c218a96

ryan-williams force-pushed the rw/uns branch from c8f2bfa to c218a96 Compare August 5, 2024 23:39

ryan-williams added 2 commits August 5, 2024 19:47

factor+document {in,out}gest DF logic

6fe2b23

anndata- and df-based dataframe round-trip tests

a238e7d

ryan-williams force-pushed the rw/uns branch from b967eda to a238e7d Compare August 5, 2024 23:47

ryan-williams merged commit 72fdfb3 into main Aug 6, 2024
11 checks passed

ryan-williams deleted the rw/uns branch August 6, 2024 02:11

ryan-williams mentioned this pull request Aug 6, 2024

[python] Debug CI mypy failures related to # type: ignore[misc] #2838

Merged

johnkerl added the backport release-1.13 label Aug 6, 2024

github-actions bot pushed a commit that referenced this pull request Aug 6, 2024

[python] Ingest/outgest round-trip improvements (#2804)

602ff61

* anndata-based dataframe round-trip tests
* factor+document {in,out}gest DF logic
* anndata- and df-based dataframe round-trip tests

github-actions bot mentioned this pull request Aug 6, 2024

[Backport release-1.13] [python] Ingest/outgest round-trip improvements #2850

Merged

ryan-williams mentioned this pull request Aug 9, 2024

[python] Update comments, remove unused test params #2873

Merged
