-
Notifications
You must be signed in to change notification settings - Fork 122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use connected lists for tape data structure #954
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
There were too many comments to post at once. Showing the first 10 out of 16. Check the log or trigger a new build to see more.
// | ||
|
||
#ifndef CLAD_NEWTAPE_H | ||
#define CLAD_NEWTAPE_H |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: header guard does not follow preferred style [llvm-header-guard]
#define CLAD_NEWTAPE_H | |
#ifndef CLAD_DIFFERENTIATOR_NEWTAPE_H | |
#define CLAD_DIFFERENTIATOR_NEWTAPE_H |
include/clad/Differentiator/NewTape.h:87:
- #endif // CLAD_NEWTAPE_H
+ #endif // CLAD_DIFFERENTIATOR_NEWTAPE_H
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
There were too many comments to post at once. Showing the first 10 out of 12. Check the log or trigger a new build to see more.
@@ -0,0 +1,84 @@ | |||
#ifndef CLAD_NEWTAPE_H | |||
#define CLAD_NEWTAPE_H |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: header guard does not follow preferred style [llvm-header-guard]
#define CLAD_NEWTAPE_H | |
#ifndef CLAD_DIFFERENTIATOR_NEWTAPE_H | |
#define CLAD_DIFFERENTIATOR_NEWTAPE_H |
include/clad/Differentiator/NewTape.h:83:
- #endif // CLAD_NEWTAPE_H
+ #endif // CLAD_DIFFERENTIATOR_NEWTAPE_H
void emplace_back(ArgsT&&... args) { | ||
if (!cur_block || _size >= _capacity) { | ||
Block<T>* prev_block = cur_block; | ||
cur_block = new Block<T>(_capacity); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: assigning newly created 'gsl::owner<>' to non-owner 'Block *' [cppcoreguidelines-owning-memory]
cur_block = new Block<T>(_capacity);
^
_size = 0; | ||
} | ||
_size += 1; | ||
::new (const_cast<void*>(static_cast<const volatile void*>(block_end()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: do not use const_cast [cppcoreguidelines-pro-type-const-cast]
::new (const_cast<void*>(static_cast<const volatile void*>(block_end())))
^
@rohanjulka19, do you mind addressing the clang-tidy complaints where relevant? |
hi, yeah will address these clang-tidy issues, actually i got stuck on the loop condition differentiation issue (once again) #818. But now I am fixing that one, once I push that fix will address these then. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
@@ -0,0 +1,87 @@ | |||
#ifndef CLAD_NEWTAPE_H | |||
#define CLAD_NEWTAPE_H |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: header guard does not follow preferred style [llvm-header-guard]
#define CLAD_NEWTAPE_H | |
#ifndef CLAD_DIFFERENTIATOR_NEWTAPE_H | |
#define CLAD_DIFFERENTIATOR_NEWTAPE_H |
include/clad/Differentiator/NewTape.h:86:
- #endif // CLAD_NEWTAPE_H
+ #endif // CLAD_DIFFERENTIATOR_NEWTAPE_H
void emplace_back(ArgsT&&... args) { | ||
if (!m_cur_block || m_size >= _capacity) { | ||
Block<T>* prev_block = m_cur_block; | ||
m_cur_block = new Block<T>(_capacity); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: assigning newly created 'gsl::owner<>' to non-owner 'Block *' [cppcoreguidelines-owning-memory]
m_cur_block = new Block<T>(_capacity);
^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
~Block() { | ||
destroy(block_begin(), block_end()); | ||
::operator delete( | ||
const_cast<void*>(static_cast<const volatile void*>(data))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: do not use const_cast [cppcoreguidelines-pro-type-const-cast]
const_cast<void*>(static_cast<const volatile void*>(data)));
^
void emplace_back(ArgsT&&... args) { | ||
if (!m_cur_block || m_size >= m_capacity) { | ||
Block<T>* prev_block = m_cur_block; | ||
m_cur_block = new Block<T>(m_capacity); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: assigning newly created 'gsl::owner<>' to non-owner 'Block *' [cppcoreguidelines-owning-memory]
m_cur_block = new Block<T>(m_capacity);
^
m_size = 0; | ||
} | ||
m_size += 1; | ||
::new (const_cast<void*>(static_cast<const volatile void*>(end()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: do not use const_cast [cppcoreguidelines-pro-type-const-cast]
::new (const_cast<void*>(static_cast<const volatile void*>(end())))
^
@vgvassilev I have fixed most of the suggestion, the above four I don't think needs to be addressed. |
The include protector suggestions seem good. Looks like the new tape has a problem with cuda. |
452b55d
to
38fdfb3
Compare
Have changed the include protector to align with the suggestion. The cuda test is passing but I am not sure if it will work when this runs on more than one thread in parallel |
I am a little puzzled since the new tape implementation is supposed to be faster than the existing one but it seems consistently slower: https://github.com/vgvassilev/clad/actions/runs/9798455926/job/27056951548?pr=954#step:31:596 Do you have any clue what might be going wrong? |
I thought it was due to the small block size leading to more memory allocation but increasing the block size also didn't worked so will investigate this further |
@rohanjulka19, do we have a clue what might be still wrong for the benchmarks? Maybe we should profile it with something like heaptrack, kcachegrind or similar... |
oh yeah I tried with timer but i couldn't figure out but I had a intution that because I am assigning memory twice once for the block and once for the array inside the block maybe that's whats causing the issue. So I am trying to just assign memory once by taking a fixed array of size 32 that should allocate all the memory once, but I am not very good with cpp so there is some issue with either memory allocation or something else. I will investigate the code this weekend, what do you think can this be the performance issue ? Should I use heaptrack to check if this is true ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
|
||
template <typename T> class Block { | ||
public: | ||
T data[capacity]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: do not declare C-style arrays, use std::array<> instead [cppcoreguidelines-avoid-c-arrays]
T data[capacity];
^
CUDA_HOST_DEVICE void emplace_back(ArgsT&&... args) { | ||
if (!m_cur_block || m_size >= capacity) { | ||
Block<NonConstT>* prev_block = m_cur_block; | ||
m_cur_block = new Block<NonConstT>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: assigning newly created 'gsl::owner<>' to non-owner 'Block *' (aka 'Block<typename std::remove_cv::type> *') [cppcoreguidelines-owning-memory]
m_cur_block = new Block<NonConstT>();
^
|
||
~new_tape_impl() { | ||
if (m_cur_block) { | ||
// destroy(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @vgvassilev so when I comment this line the time reduces. This function calls the destructors for all the blocks. This should never be called as we are destructing the block when we pop the last element from it, but this is to be safe in case tape is terminated without popping all the values due to some error. Now even though I have put this function inside the if stmt still this function is affecting the performance even when this is not being called. I believe this is because of function inlining. Is there way to ensure this doesn't happen ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably that's right. If you compile the code with clang you can add -Rpass-missed=.*
to get more information about the missed optimization opportunities.
Additionally on my local for BM_ReverseModeProductExecute I am getting this time BM_ReverseModeProductExecute 0.151 ns 0.150 ns 1000000000 But the benchmark returns this result BM_ReverseModeProductExecute 114 ns 114 ns 5956940 What must be the reason for this ? Also I am on M2 Mac (ARM) so like what tool should I use to actually understand what's the issue ? |
if (m_size == 0) { | ||
Block<NonConstT>* temp = m_cur_block; | ||
m_cur_block = m_cur_block->prev; | ||
temp->~Block<NonConstT>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we do not delete
the m_cur_block
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we are deleting the current block at line 107.
I added the destroy function in the tape destructor to ensure that incase tape terminates without popping all the values, it should then call the destructor for all the pending blocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
~new_tape_impl() {
if (m_cur_block) {
// destroy();
In this line I don't want the compiler optimization and instead keep the code of destroy function separate because if it is added to the destructor it will slow down the destructor even though the destroy function won't likely be called. If we are sure that all the values will always be popped before the tape terminates then I can just remove this function from the tape destructor
|
||
~new_tape_impl() { | ||
if (m_cur_block) { | ||
// destroy(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably that's right. If you compile the code with clang you can add -Rpass-missed=.*
to get more information about the missed optimization opportunities.
Do you mean that the bots give different benchmark results? If so, that's expected because they are different machines with different properties. |
Yeah but in my laptop the new tape is perfoming better than old tape and according to the benchmarks exectued by the pipeline the old tape is doing better than the new tape |
Do you run your benchmarks on release builds? Could you log in the bots and run some profiling there? You can do that by following: https://clad.readthedocs.io/en/latest/user/DevelopersDocumentation.html#debugging-github-runners |
3fe7769
to
1cb9c4b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
using iterator = pointer; | ||
|
||
CUDA_HOST_DEVICE Block() { | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: use '= default' to define a trivial default constructor [modernize-use-equals-default]
} | |
CUDA_HOST_DEVICE Block() = default; |
public: | ||
new_tape_impl() = default; | ||
|
||
~new_tape_impl() { } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: use '= default' to define a trivial destructor [modernize-use-equals-default]
~new_tape_impl() { } | |
~new_tape_impl() = default; |
if (m_size == 0) { | ||
Block<NonConstT>* temp = m_cur_block; | ||
m_cur_block = m_cur_block->prev; | ||
delete temp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: deleting a pointer through a type that is not marked 'gsl::owner<>'; consider using a smart pointer instead [cppcoreguidelines-owning-memory]
delete temp;
^
Additional context
include/clad/Differentiator/NewTape.h:101: variable declared here
Block<NonConstT>* temp = m_cur_block;
^
1cb9c4b
to
2495073
Compare
2495073
to
0a96b90
Compare
Instead of a single list use mutliple connected lists to store elements. This allows to dynamically increase the size of tape without the need of relocating elements.
0a96b90
to
26fcee4
Compare
Currently I have added a basic implementation of tape data structure with connected lists/slabs. It is more or less similar to current tape data strucutre I copied a lot of code just added blocks and only added the functionality which I feel is needed for reverse mode.
Right now NumericalDiff also uses tape so I defined two types one is "tape" which points to the new tape implementation and "old_tape" which points to current tape data strcuture being used.
Tests are running fine for the new implementation.
These are the benchmarks for old_tape and new_tape.
LastTest-new-tape.log
LastTest-tape.log
Fixes - #793