[c++] Fix ManagedQuery usage in SOMAArray write path #2989

Merged
merged 10 commits into from
Sep 13, 2024

@nguyenv nguyenv commented Sep 12, 2024

Issue and/or context:

#2928

Changes:

Commits should be viewed individually. Although much of this code already existed prior to this PR, there were enough modifications that the changes were too noisy to read clearly.

  1. First commit labels all the old code as legacy
  2. ManagedQuery::setup_write_column takes in a column name, the number of elements, and pointers to the buffers that contain the passed-in data. This used to be three separate calls to ManagedQuery::setup_column_data, ColumnBuffer::set_data, and ManagedQuery::set_column_data, but is now simplified into a single call
  3. The old SOMAArray::set_array_data made a copy of the passed-in ArrowTable and then passed the casted buffers to ColumnBuffer::set_data. We now iterate through each column calling SOMAArray::_cast_column, which casts the values into temporary vectors and then passes the std::vector::data() pointer into ManagedQuery::setup_write_column
  4. SOMAArray::_cast_column handles all permutations of attributes with or without enumerations and columns with or without dictionaries.
    1. In the case where the attribute has an enumeration but the passed-in column does not have a dictionary, error out.
    2. In the case where the attribute is not enumerated but the passed column does have a dictionary, call SOMAArray::_promote_indexes_to_values which maps the dictionary's indexes to the associated dictionary values and passes that dictionary value mapping vector into ManagedQuery::setup_write_column
    3. In all other cases, call SOMAArray::_cast_column_aux for further processing
  5. SOMAArray::_promote_indexes_to_values calls SOMAArray::_cast_dictionary_values. It retrieves the dictionary's indexes via SOMAArray::_get_index_vector and then iterates through the index vector and maps and casts to the correct dictionary value
  6. SOMAArray::_cast_column_aux calls SOMAArray::_set_column.
    1. If the attribute is enumerated and the column has a dictionary, we need to write out the dictionary indexes and potentially extend the enumeration if there are any new values
    2. In all other cases, place the values into a temporary vector, iterate and cast these values to the type on disk, and pass the casted vector into ManagedQuery::setup_write_buffer
  7. SOMAArray::_extend_enumeration calls SOMAArray::_extend_and_evolve_schema, which checks if any new enumeration values have been introduced
    1. If so, then we need to call SOMAArray::_remap_indexes which calls SOMAArray::_remap_indexes_aux to iterate through the original indexes and shift them to match the new extended enumeration on-disk. SOMAArray::_cast_shifted_indexes casts the shifted index values to the type on disk and passes this into ManagedQuery::setup_write_buffer
    2. If not, then pass the dictionary's index as-is into ManagedQuery::setup_write_column
  8. Return a vector<uint8_t> of Boolean mappings instead of casting in place
  9. Correct unit test to pass in offset buffer for variable length string columns
  10. Last commit removes all dead code
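
Steps 4.2 and 7 above can be sketched roughly as follows. This is only an illustration under assumptions, not the code from this PR: `promote_indexes_to_values`, `extend_enumeration`, and `remap_indexes` are hypothetical free functions standing in for the SOMAArray methods, indexes are assumed to be `int32_t`, and dictionary values are assumed to be strings. The schema evolution and the final cast to the on-disk index type are left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (step 4.2): the attribute is not enumerated but the
// column has a dictionary, so replace each index with the value it refers to.
std::vector<std::string> promote_indexes_to_values(
    const std::vector<int32_t>& indexes,
    const std::vector<std::string>& dictionary) {
    std::vector<std::string> values;
    values.reserve(indexes.size());
    for (int32_t idx : indexes) {
        // A negative index wraps to a huge size_t and is rejected here too.
        if (static_cast<std::size_t>(idx) >= dictionary.size()) {
            throw std::out_of_range("dictionary index out of range");
        }
        values.push_back(dictionary[static_cast<std::size_t>(idx)]);
    }
    return values;
}

// Hypothetical helper (step 7): append dictionary values that are missing
// from the on-disk enumeration. Returns, for each dictionary position, its
// position in the (possibly extended) enumeration.
std::vector<int32_t> extend_enumeration(
    std::vector<std::string>& on_disk,
    const std::vector<std::string>& dictionary) {
    std::unordered_map<std::string, int32_t> position;
    for (std::size_t i = 0; i < on_disk.size(); ++i) {
        position.emplace(on_disk[i], static_cast<int32_t>(i));
    }
    std::vector<int32_t> remap;
    remap.reserve(dictionary.size());
    for (const std::string& value : dictionary) {
        // emplace only inserts when the value is new to the enumeration.
        auto result = position.emplace(
            value, static_cast<int32_t>(on_disk.size()));
        if (result.second) {
            on_disk.push_back(value);
        }
        remap.push_back(result.first->second);
    }
    return remap;
}

// Hypothetical helper (step 7.1): shift the column's original indexes so
// that they point into the extended on-disk enumeration.
std::vector<int32_t> remap_indexes(
    const std::vector<int32_t>& indexes,
    const std::vector<int32_t>& remap) {
    std::vector<int32_t> shifted;
    shifted.reserve(indexes.size());
    for (int32_t idx : indexes) {
        // at() throws std::out_of_range on an invalid index.
        shifted.push_back(remap.at(static_cast<std::size_t>(idx)));
    }
    return shifted;
}
```

In this sketch, when the dictionary adds nothing new and already matches the on-disk order, `remap` is the identity and the indexes pass through unchanged, which corresponds to step 7.2.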

codecov bot commented Sep 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.79%. Comparing base (d9bd18d) to head (37dadf5).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2989      +/-   ##
==========================================
+ Coverage   89.64%   89.79%   +0.14%     
==========================================
  Files          39       39              
  Lines        4096     4096              
==========================================
+ Hits         3672     3678       +6     
+ Misses        424      418       -6     
Flag Coverage Δ
python 89.79% <ø> (+0.14%) ⬆️

Components Coverage Δ
python_api 89.79% <ø> (+0.14%) ⬆️
libtiledbsoma ∅ <ø> (∅)

@nguyenv nguyenv force-pushed the add-new-managedquery-code branch 4 times, most recently from a8705ef to ea4dec5 Compare September 13, 2024 00:34
@nguyenv nguyenv marked this pull request as ready for review September 13, 2024 15:41
@johnkerl johnkerl changed the title [c++] Replace malloc with ManagedQuery in write path [c++] Replace malloc with ManagedQuery in write path Sep 13, 2024
@nguyenv nguyenv changed the title [c++] Replace malloc with ManagedQuery in write path [c++] Fix ManagedQuery usage in SOMAArray write path Sep 13, 2024
@johnkerl johnkerl left a comment

This is a well-organized PR, easy to review and easy for everyone to understand later. Thank you!!

There are a couple items here, both of which can be follow-on PRs if you like.

libtiledbsoma/src/soma/soma_array.cc (resolved)
libtiledbsoma/src/soma/soma_array.cc (resolved)
libtiledbsoma/test/unit_soma_dataframe.cc (resolved)
nguyenv commented Sep 13, 2024

I will move the bit-to-uint8_t conversion method and do the extend-enumeration fix in two separate follow-up PRs.
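
For readers unfamiliar with the bit-to-uint8_t conversion mentioned here, a minimal sketch is given below. It assumes an Arrow-style packed bitmap with least-significant-bit-first ordering; `bitmap_to_bytes` is a hypothetical name and not the method in this PR. It returns a new vector with one byte per element rather than modifying the input.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: expand a packed bitmap (least-significant bit first,
// as in Arrow validity and Boolean buffers) into one uint8_t per element.
// `offset` is the position of the first logical element, in bits.
std::vector<uint8_t> bitmap_to_bytes(
    const uint8_t* bitmap, std::size_t length, std::size_t offset = 0) {
    std::vector<uint8_t> bytes(length);
    for (std::size_t i = 0; i < length; ++i) {
        const std::size_t bit = offset + i;
        bytes[i] = static_cast<uint8_t>((bitmap[bit / 8] >> (bit % 8)) & 1);
    }
    return bytes;
}
```

Returning a fresh vector keeps the caller's buffer untouched, at the cost of one allocation per column.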

@nguyenv nguyenv merged commit 0efe51a into main Sep 13, 2024
15 checks passed
@nguyenv nguyenv deleted the add-new-managedquery-code branch September 13, 2024 17:05
@nguyenv nguyenv linked an issue Sep 13, 2024 that may be closed by this pull request
@johnkerl johnkerl mentioned this pull request Sep 16, 2024
11 tasks
Development

Successfully merging this pull request may close these issues.

[c++/python] Memory leak in Python sparse-array write
2 participants