
Inverse temperature #10

Open
DavidMrd opened this issue Dec 21, 2022 · 5 comments

@DavidMrd

Hello, in the original article they say: "In order to avoid very soft decisions in the tree, we introduced an inverse temperature β to the filter activations prior to calculating the sigmoid." I am not sure, but I think you did not implement this temperature, did you? Thanks!

@xuyxu
Owner

xuyxu commented Dec 22, 2022

Hi @DavidMrd, beta is not implemented in the current version; it may suffice to add a discount factor here.
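
For reference, a minimal sketch of what that could look like, assuming the inner nodes are a single linear layer followed by a sigmoid (`InnerNodes` and `beta` are illustrative names, not the repository's code):

```python
import torch
import torch.nn as nn

class InnerNodes(nn.Module):
    """Routing probabilities for all inner nodes of the soft tree."""

    def __init__(self, input_dim, inner_node_num, beta=1.0):
        super().__init__()
        self.fc = nn.Linear(input_dim, inner_node_num)
        # Inverse temperature from the paper: scales the filter
        # activations before the sigmoid so decisions are less soft.
        self.beta = beta

    def forward(self, X):
        # (batch_size, inner_node_num) routing probabilities
        return torch.sigmoid(self.beta * self.fc(X))
```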

@YaserGholizade

YaserGholizade commented Aug 11, 2023

Many thanks for your implementation. It works well.
But I am a little bit confused.
According to the original paper, "Distilling a Neural Network Into a Soft Decision Tree", and its equation 2, each leaf should hold N values that sum to one (N: number of classes), i.e. each leaf contains a probability vector.
To make a prediction, the model uses the maximum path probability to select a leaf and then outputs that leaf's probability vector.
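
As a minimal sketch of that rule (with hypothetical names: `mu` holds each leaf's path probability, `Q` each leaf's probability vector):

```python
import torch

def predict_paper_style(mu, Q):
    """Select the leaf with the maximum path probability and return
    that leaf's class distribution, as described in the paper.

    mu: (batch_size, n_leaf) path probabilities
    Q:  (n_leaf, n_classes) per-leaf probability vectors (rows sum to 1)
    """
    best_leaf = mu.argmax(dim=1)  # (batch_size,)
    return Q[best_leaf]           # (batch_size, n_classes)
```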
But in your code, you treat the path probability as the final value for each leaf and then feed these values through a fully connected layer to compute the model's final prediction.

Am I right?
Thanks

@xuyxu
Owner

xuyxu commented Aug 12, 2023

Hi @YaserGholizade, the fully connected layer holds the final values of all leaf nodes. You can see that its dimension (code here) is (n_leaf_node, n_classes) for classification.
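
A quick way to see this (a sketch with made-up sizes, not code from the repository):

```python
import torch.nn as nn

n_leaf, n_classes = 8, 10  # e.g. a depth-3 tree on MNIST
leaf_nodes = nn.Linear(n_leaf, n_classes, bias=False)

# The layer's weight stores one learned output vector per leaf:
# leaf_nodes.weight has shape (n_classes, n_leaf), so column j holds
# the n_classes output values of leaf j.
print(leaf_nodes.weight.shape)  # torch.Size([10, 8])
```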

@YaserGholizade

Thanks @xuyxu
I completely understand how your code works, and it works very well.
But my question was about the difference between your code and the algorithm in the original paper.

According to the paper, for MNIST classification each leaf should have ten values (a probability vector); there is no fully connected layer.
After training, to predict the class of an image, the model selects the leaf with the maximum path probability and then makes the final decision according to that leaf's probability vector.

But in your code, you assign the path probability to each leaf (`_mu = _mu * _path_prob`) and then feed these values into a fully connected layer, `self.leaf_nodes = nn.Linear(self.leaf_node_num_, self.output_dim, bias=False)`.

@xuyxu
Owner

xuyxu commented Aug 13, 2023

The algorithm here is the same as the one in the original paper, except that we use matrices for faster computation. Instead of computing path probabilities one by one, which is very slow, `_mu = _mu * _path_prob` lets us compute the path probabilities of all nodes in one layer at the same time. Furthermore, using the fully connected layer to simulate all leaf nodes also lets us compute the weighted sum of the leaf node outputs (weights determined by the path probabilities) more quickly.
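
A self-contained sketch of both tricks (dimensions and names are illustrative, not the repository's exact code):

```python
import torch
import torch.nn as nn

depth, input_dim, n_classes = 3, 784, 10
n_inner, n_leaf = 2 ** depth - 1, 2 ** depth

inner_nodes = nn.Linear(input_dim, n_inner)          # all inner-node filters at once
leaf_nodes = nn.Linear(n_leaf, n_classes, bias=False)

X = torch.randn(4, input_dim)
p = torch.sigmoid(inner_nodes(X))                    # (batch, n_inner) routing probs

mu = torch.ones(X.size(0), 1)                        # path probability of the root
start = 0
for layer in range(depth):
    n_nodes = 2 ** layer
    path_prob = p[:, start:start + n_nodes]          # this layer's routing probs
    start += n_nodes
    # One broadcasted multiply updates the path probability of every
    # node in the layer: the left child gets p, the right child 1 - p.
    children = torch.stack((path_prob, 1 - path_prob), dim=2)
    mu = (mu.unsqueeze(2) * children).flatten(1)     # (batch, 2 * n_nodes)

# mu is now (batch, n_leaf); the linear layer returns the weighted sum
# of all leaf outputs in a single matrix multiplication.
y = leaf_nodes(mu)                                   # (batch, n_classes)
```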
