
Optimize TensorList resizing. #5638

Merged 2 commits into NVIDIA:main on Sep 19, 2024
Conversation

@mzient (Contributor) commented Sep 18, 2024

Category:

Refactoring (Redesign of existing code that doesn't affect functionality)

Description:

This change optimizes the performance of TensorList::Resize:

  • simple inline functions are moved to the header
  • shared_ptr in ShareData is now passed by value, allowing move semantics and reducing the number of atomic operations
  • some code motion to improve inlining (e.g. wrapping frequent calls to DLL_PUBLIC functions into a trampoline function)

Additional information:

Many of the changes were tuned experimentally. Don't hesitate to ask if you see something not obvious or outright weird.

Affected modules and functionalities:

Buffer, Tensor, TensorList

Key points relevant for the review:

Tests:

No new functionality or functional changes - all existing tests apply

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

mzient and others added 2 commits September 18, 2024 17:02
Signed-off-by: Michal Zientkiewicz <[email protected]>
Signed-off-by: Michał Zientkiewicz <[email protected]>
@dali-automaton (Collaborator)

CI MESSAGE: [18510898]: BUILD STARTED

@JanuszL JanuszL self-assigned this Sep 18, 2024
@dali-automaton (Collaborator)

CI MESSAGE: [18510898]: BUILD PASSED

@mzient (Contributor, Author) commented on the diff:

    template <typename Backend>
    void TensorList<Backend>::recreate_views() {
      // precondition: type, shape are configured
      uint8_t *sample_ptr = static_cast<uint8_t *>(contiguous_buffer_.raw_mutable_data());
      int64_t num_samples = shape().num_samples();
      auto &data_ptr = contiguous_buffer_.get_data_ptr();

Hoisting this line (fetching data_ptr once, before the loop) was perhaps the biggest saving here.

      for (int64_t i = 0; i < num_samples; i++) {
        // or any other way
        auto tensor_size = shape().tensor_size(i);

    -   std::shared_ptr<void> sample_alias(contiguous_buffer_.get_data_ptr(), sample_ptr);
    -   tensors_[i].ShareData(sample_alias, tensor_size * type_info().size(), is_pinned(), shape()[i],
    +   tensors_[i].ShareData(std::shared_ptr<void>(data_ptr, sample_ptr),

@mzient (Contributor, Author) commented: Having an intermediate variable and moving it was noticeably slower (but still noticeably faster than passing by const-ref and copying).

@mzient mzient merged commit f34a227 into NVIDIA:main Sep 19, 2024
6 checks passed
4 participants