Update knowledge-distillation.mdx #318
base: stage
Conversation
Just some knowledge I learned; thought it was good to share. Thanks, HF!
Thanks for the suggestions! Just a couple of comments.
> ### Why Does Knowledge Distillation Work?
>
> 1. **Entropy Gain**: Distillation helps in transferring the "dark knowledge" (i.e., soft probabilities) from the teacher model to the student model, which contains more information than hard labels. This extra information helps the student model generalize better.
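For readers wondering what "soft probabilities" look like in practice, here is a minimal pure-Python sketch. The class names, logit values, and temperature are invented for illustration; the point is that a temperature-scaled softmax over the teacher's logits yields a distribution, not a one-hot vector:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher temperature -> softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for classes [cat, dog, car]
teacher_logits = [8.0, 6.0, 1.0]

hard_label = [1, 0, 0]  # one-hot: says nothing about how dog compares to car
soft_probs = softmax(teacher_logits, temperature=4.0)

# The soft distribution is the "dark knowledge": it reveals that, to the
# teacher, a dog looks far more cat-like than a car does.
print(soft_probs)
```

Raising the temperature flattens the distribution, exposing more of the teacher's inter-class similarity structure to the student.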
What do you mean by "Entropy Gain"?
Thanks a lot for your fast reply. You're right, "entropy gain" doesn't make sense; the more appropriate term is likely "information gain", since by absorbing the cross-entropy loss between the two distributions (the student model's output and the teacher model's output), the student gains more information than by absorbing the cross-entropy loss between the student's output and the ground truth.
I got confused between the two terms.
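The information argument above can be made concrete with a tiny sketch (the probability values are invented): a one-hot ground-truth label has zero Shannon entropy, while a teacher's soft output has positive entropy, i.e., it carries extra signal about inter-class similarity that the student can absorb:

```python
import math

def entropy(p):
    """Shannon entropy in bits; the 0 * log(0) terms are taken as 0."""
    return -sum(q * math.log2(q) for q in p if q > 0)

hard_label = [1.0, 0.0, 0.0]      # ground truth: zero entropy, no similarity info
teacher_soft = [0.7, 0.25, 0.05]  # hypothetical teacher output over the same classes

print(entropy(hard_label))    # 0.0 bits
print(entropy(teacher_soft))  # positive: the extra information the student sees
```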
Thanks! That makes sense. But may I suggest something like "Information Transfer" rather than "gain", in order not to confuse it with the metric.
chapters/en/unit3/vision-transformers/knowledge-distillation.mdx
Thanks for the contribution and I really like the facts you added 👍
I left some suggestions to make it even more informative for people going through the course.
> To see this loss function implemented in Python, along with a fully worked-out example, check out the [notebook for this section](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/KnowledgeDistillation.ipynb).
>
> <a target="_blank" href="https://colab.research.google.com/github/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/KnowledgeDistillation.ipynb">
> <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
> </a>
>
> ## So How Can We Use This to Our Advantage?
>
> Knowledge distillation became very useful when the machine learning community realized there was great potential in deploying AI models on edge devices. However, we can't just deploy a model that's 1 GB in size and has a 1-second latency for users to use—it’s too big and too slow. These drawbacks are mainly caused by the large size of the model. From that moment, people started distilling models and reducing their parameter count by more than 90% with negligible performance drops.
**Suggested change:** replace "These drawbacks are mainly caused by the large size of the model." with "These drawbacks are mainly caused by the big number of weights of the model."
> ### Why Does Knowledge Distillation Work?
>
> 1. **Entropy Gain**: One way distillation works is by transferring the "dark knowledge" (i.e., soft probabilities) from the teacher model to the student model, which contains more information than hard labels. This extra information helps the student model generalize better.
This is a really interesting fact, could you maybe explain soft probabilities a bit more? Our target audience are mainly beginners, so I think it would help to give some more details about that.
> 1. **Entropy Gain**: One way distillation works is by transferring the "dark knowledge" (i.e., soft probabilities) from the teacher model to the student model, which contains more information than hard labels. This extra information helps the student model generalize better.
>
> 2. **Coherent Gradient Updates**: The gradients provided by the teacher model's soft targets help in more stable and coherent updates during training, resulting in a smoother training process for the student model.
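The two points quoted above come together in the standard distillation objective from Hinton et al. (2015): a weighted sum of the usual cross-entropy with the hard labels and a temperature-scaled cross-entropy against the teacher's soft targets. Below is a minimal pure-Python sketch; the logit values and the `alpha`/`T` hyperparameters are invented for illustration, not taken from the chapter:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i)."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

def distillation_loss(student_logits, teacher_logits, hard_label,
                      alpha=0.5, T=4.0):
    # Hard-label term: ordinary cross-entropy at temperature 1
    hard_term = cross_entropy(hard_label, softmax(student_logits))
    # Soft-target term: cross-entropy between teacher and student at
    # temperature T, scaled by T^2 to keep gradient magnitudes comparable
    soft_term = cross_entropy(softmax(teacher_logits, T),
                              softmax(student_logits, T))
    return alpha * hard_term + (1 - alpha) * (T ** 2) * soft_term

# Hypothetical logits: the student roughly follows the teacher's ranking
loss = distillation_loss([2.0, 1.0, 0.1], [8.0, 6.0, 1.0], [1, 0, 0])
print(loss)
```

Because the soft-target term is smallest when the student's tempered distribution matches the teacher's, its gradients consistently pull the student toward the teacher's full output distribution rather than toward a single hard class.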
Similar to above, maybe you can explain what you mean by soft targets
And feel free to add your name in this file in the Vision Transformer part @ghassen-fatnassi: https://github.com/johko/computer-vision-course/blob/main/chapters/en/unit0/welcome/welcome.mdx
Hey @ghassen-fatnassi , just wanted to check if you have any time to go through my requested changes soon? I'm currently cleaning up the PRs a bit, so it would be great if you could have a look.
Yes, I'll try my best to do the requested changes today!
better explanation hopefully
Done!
Thanks for the changes @ghassen-fatnassi , LGTM 👍
Hey @ghassen-fatnassi, thanks for this amazing elaboration! 😀 My main issue, as I mentioned previously, is the "entropy gain" section. I think the term is a little bit confusing. In this section, you're explaining what entropy is, but it's not clear how this relates to knowledge distillation. I understand you're trying to convey the important role that soft targets play in the student's learning, and how they contain much more information than hard labels, but it would be great to make this connection explicit in this section.
Thanks! Looks good! 👍