-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[python] Ingest/outgest round-trip improvements #2804
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2804 +/- ##
==========================================
- Coverage 90.06% 90.03% -0.04%
==========================================
Files 37 37
Lines 3945 3942 -3
==========================================
- Hits 3553 3549 -4
- Misses 392 393 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
|
24d1acb
to
bf65845
Compare
d38e11f
to
ef5c1ac
Compare
6fc4709
to
ab46b40
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for splitting out new unit-test cases from the ones you're modifying on #2824. This (a) makes it easier to verify you're not re-breaking old bugfixes that were completed in the past, and (b) makes it much easier to verify/highlight new bugfixes which need to be made in the future.
arrow_table = df_to_arrow(df) | ||
arrow_schema = arrow_table.schema.remove_metadata() | ||
return _string_dict_from_arrow_schema(arrow_schema) | ||
|
||
|
||
# Metadata indicating a DataFrame's original index column name, serialized as a JSON string or `null`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Metadata indicating a DataFrame's original index column name, serialized as a JSON string or `null`. | |
# Metadata indicating a DataFrame's original H5AD/AnnData index-column name, serialized as a JSON string or `null`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reworked this comment, but am -1 on "H5AD/AnnData index-column name."
H5AD/AnnData don't have an index-column; this is about how we ingest and outgest DataFrames, regardless of their provenance.
# the column named "index" here. | ||
df.drop(columns=["index"], inplace=True) | ||
else: | ||
# If `id_column_name` was passed, and is not already a column in the DataFrame, we assume the original index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your careful analysis here -- I really appreciate it.
The enumeration of problems here should moved from code comments here, to a tracking issue I'm asking you to create. Then that issue's link should be placed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a link to #2829, couldn't tell if you wanted me to remove the rest of the comment here.
I've left it for now, since the explanation here is specific to the nested if
/else
context, and the issues the rename
below specifically introduces given that context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. What we have now is a bug-report listing as an extended code comment. That is more appropriately written as an issue we can track and schedule (#2829) -- and I asked you to move this bug-listing from here to there. But if you feel it's crucial for understanding of the code to have this list-out duplicated here, I won't reject this PR on that basis.
9ad3c6f
to
c8f2bfa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe I addressed everything
arrow_table = df_to_arrow(df) | ||
arrow_schema = arrow_table.schema.remove_metadata() | ||
return _string_dict_from_arrow_schema(arrow_schema) | ||
|
||
|
||
# Metadata indicating a DataFrame's original index column name, serialized as a JSON string or `null`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reworked this comment, but am -1 on "H5AD/AnnData index-column name."
H5AD/AnnData don't have an index-column; this is about how we ingest and outgest DataFrames, regardless of their provenance.
# the column named "index" here. | ||
df.drop(columns=["index"], inplace=True) | ||
else: | ||
# If `id_column_name` was passed, and is not already a column in the DataFrame, we assume the original index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a link to #2829, couldn't tell if you wanted me to remove the rest of the comment here.
I've left it for now, since the explanation here is specific to the nested if
/else
context, and the issues the rename
below specifically introduces given that context.
You're right, and thanks -- I meant index-column names on DataFrames withih H5AD/AnnData |
# the column named "index" here. | ||
df.drop(columns=["index"], inplace=True) | ||
else: | ||
# If `id_column_name` was passed, and is not already a column in the DataFrame, we assume the original index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. What we have now is a bug-report listing as an extended code comment. That is more appropriately written as an issue we can track and schedule (#2829) -- and I asked you to move this bug-listing from here to there. But if you feel it's crucial for understanding of the code to have this list-out duplicated here, I won't reject this PR on that basis.
* anndata-based dataframe round-trip tests * factor+document {in,out}gest DF logic * anndata- and df-based dataframe round-trip tests
* anndata-based dataframe round-trip tests * factor+document {in,out}gest DF logic * anndata- and df-based dataframe round-trip tests Co-authored-by: Ryan Williams <[email protected]>
The first commit in this PR just adds new unit tests, which verify several categories of ingest/outgest "round-trip mutations."
pd.DataFrame
s modified during ingest/outgest round-trips #2829 for more details.The second commit factors DF ingest/outgest logic currently ≈duplicated in 2 places each:
ingest.py
/signatures.py
obs
/var
In the near future we'll want to use both for
uns
as well.The third commit adds a 2nd set of tests, functionally identical to those added in 1., except they directly round-trip an "obs" DataFrame (using
_write_dataframe
and_read_dataframe
; the latter is added in 2.).{from,to}_anndata
), and are still present.