Update knowledge-distillation.mdx #318

Open · wants to merge 22 commits into base: stage

Conversation

ghassen-fatnassi

Just some knowledge I learned; thought it was good to share. Thanks HF!

jungnerd and others added 5 commits May 30, 2024 18:51

- added video processing part
- just some knowledge i learned , though it was good to share , thanks HF
@mmhamdy (Collaborator) left a comment


Thanks for the suggestions! Just a couple of comments.


### Why Does Knowledge Distillation Work?

1. **Entropy Gain**: Distillation helps in transferring the "dark knowledge" (i.e., soft probabilities) from the teacher model to the student model, which contains more information than hard labels. This extra information helps the student model generalize better.
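To make "soft probabilities" concrete, here is a minimal sketch, assuming PyTorch and made-up logits for a single 3-class example, of how temperature-scaled softmax turns a teacher's raw outputs into the soft distribution referred to above:

```python
import torch
import torch.nn.functional as F

# Made-up teacher logits for one image over 3 classes: [cat, dog, truck].
teacher_logits = torch.tensor([4.0, 2.5, -1.0])

# The hard label only says "cat" and carries no information about the other classes.
hard_label = torch.tensor([1.0, 0.0, 0.0])

# Dividing by a temperature T > 1 before the softmax "softens" the distribution,
# exposing the teacher's dark knowledge (e.g. "dog" is far more plausible than "truck").
T = 4.0
soft_probs = F.softmax(teacher_logits / T, dim=0)
print(soft_probs)  # roughly [0.51, 0.35, 0.15] instead of the one-hot [1, 0, 0]
```

The student is then trained to match this full distribution rather than only the single correct class.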
@mmhamdy (Collaborator)

What do you mean by "Entropy Gain"?

@ghassen-fatnassi (Author)

Thanks a lot for your fast reply. You're right, "entropy gain" doesn't make sense here; the more appropriate term is probably "information gain", since by minimizing the cross-entropy loss between the two distributions (the student model's output and the teacher model's output) the student gains more information than it does by minimizing the cross-entropy loss between the student's output and the ground truth.
I got confused between the two terms.
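
For readers following along, a hedged sketch of the kind of loss being described here, assuming PyTorch and placeholder tensors for the student logits, teacher logits, and ground-truth labels:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Weighted sum of (a) KL divergence between the student's and teacher's
    temperature-softened distributions and (b) ordinary cross entropy
    against the ground-truth labels."""
    # Soft-target term: the student learns from the teacher's full distribution.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps gradient magnitudes comparable across temperatures
    # Hard-label term: the standard supervised signal.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Toy usage with random tensors (batch of 8, 10 classes).
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```

The `alpha` weight and temperature `T` are illustrative hyperparameters; the notebook linked later in the thread contains the course's worked example.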

@mmhamdy (Collaborator)

Thanks! That makes sense. But may I suggest something like "Information Transfer" rather than "gain", in order not to confuse it with the metric.

@johko (Owner) left a comment

Thanks for the contribution and I really like the facts you added 👍

I left some suggestions to make it even more informative for people going through the course.


To see this loss function implemented and a fully worked-out example in Python, let's check out the [notebook for this section](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/KnowledgeDistillation.ipynb).

<a target="_blank" href="https://colab.research.google.com/github/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/KnowledgeDistillation.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## So How Can We Use This to Our Advantage?

Knowledge distillation became very useful when the machine learning community realized there was great potential in deploying AI models on edge devices. However, we can't just deploy a model that's 1 GB in size and has a 1-second latency for users to use—it’s too big and too slow. These drawbacks are mainly caused by the large size of the model. From that moment, people started distilling models and reducing their parameter count by more than 90% with negligible performance drops.
@johko (Owner)

Suggested change:
- Knowledge distillation became very useful when the machine learning community realized there was great potential in deploying AI models on edge devices. However, we can't just deploy a model that's 1 GB in size and has a 1-second latency for users to use—it’s too big and too slow. These drawbacks are mainly caused by the large size of the model. From that moment, people started distilling models and reducing their parameter count by more than 90% with negligible performance drops.
+ Knowledge distillation became very useful when the machine learning community realized there was great potential in deploying AI models on edge devices. However, we can't just deploy a model that's 1 GB in size and has a 1-second latency for users to use—it’s too big and too slow. These drawbacks are mainly caused by the big number of weights of the model. From that moment, people started distilling models and reducing their parameter count by more than 90% with negligible performance drops.


### Why Does Knowledge Distillation Work?

1. **Entropy Gain**: One way distillation works is by transferring the "dark knowledge" (i.e., soft probabilities) from the teacher model to the student model, which contains more information than hard labels. This extra information helps the student model generalize better.
@johko (Owner)

This is a really interesting fact. Could you maybe explain soft probabilities a bit more? Our target audience is mainly beginners, so I think it would help to give some more details about that.


1. **Entropy Gain**: One way distillation works is by transferring the "dark knowledge" (i.e., soft probabilities) from the teacher model to the student model, which contains more information than hard labels. This extra information helps the student model generalize better.

2. **Coherent Gradient Updates**: The gradients provided by the teacher model's soft targets help in more stable and coherent updates during training, resulting in a smoother training process for the student model.
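
To illustrate the second point, a small sketch (assuming PyTorch; all numbers are made up) comparing the gradient of the cross-entropy loss with respect to the student's logits under a hard one-hot label versus a hypothetical teacher's soft target:

```python
import torch
import torch.nn.functional as F

# Made-up student logits for one example over 3 classes.
student_logits = torch.tensor([1.0, 0.5, -0.5], requires_grad=True)

# A hard one-hot label versus a hypothetical teacher's soft target.
hard_target = torch.tensor([1.0, 0.0, 0.0])
soft_target = torch.tensor([0.6, 0.3, 0.1])

# The gradient of cross entropy w.r.t. the logits is softmax(logits) - target,
# so a full teacher distribution usually produces smaller, smoother per-class
# errors than a one-hot label does.
for name, target in [("hard", hard_target), ("soft", soft_target)]:
    loss = -(target * F.log_softmax(student_logits, dim=0)).sum()
    (grad,) = torch.autograd.grad(loss, student_logits)
    print(name, grad)  # the "soft" gradient is noticeably smaller in this toy case
```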
@johko (Owner)

Similar to the above, maybe you can explain what you mean by soft targets.

@johko (Owner) commented Jul 19, 2024

And feel free to add your name to the Vision Transformer part of this file, @ghassen-fatnassi: https://github.com/johko/computer-vision-course/blob/main/chapters/en/unit0/welcome/welcome.mdx

@johko (Owner) commented Aug 14, 2024

Hey @ghassen-fatnassi , just wanted to check if you have any time to go through my requested changes soon? I'm currently cleaning up the PRs a bit, so it would be great if you could have a look.

@ghassen-fatnassi (Author)

> Hey @ghassen-fatnassi , just wanted to check if you have any time to go through my requested changes soon? I'm currently cleaning up the PRs a bit, so it would be great if you could have a look.

Yes, I'll try my best to do the requested changes today!

@ghassen-fatnassi (Author)

Done!

@johko (Owner) left a comment

Thanks for the changes @ghassen-fatnassi, LGTM 👍

@mmhamdy (Collaborator) left a comment

Hey @ghassen-fatnassi, thanks for this amazing elaboration! 😀 My main issue, as I mentioned previously, is the "entropy gain" section. I think the term is a little bit confusing. In this section, you're explaining what entropy is, but it's not clear how this relates to knowledge distillation. I understand you're trying to convey the important role that soft targets play in the student's learning and how they contain much more information than hard labels, but it would be great to make this connection explicit in this section.

@mmhamdy (Collaborator) left a comment

Thanks! Looks good! 👍

@ghassen-fatnassi changed the base branch from main to stage on August 23, 2024, 15:45