Update knowledge-distillation.mdx #318
base: stage
Conversation
Just some knowledge I learned; thought it was good to share. Thanks, HF!
Thanks for the suggestions! Just a couple of comments.
> ### Why Does Knowledge Distillation Work?
>
> 1. **Entropy Gain**: Distillation helps in transferring the "dark knowledge" (i.e., soft probabilities) from the teacher model to the student model, which contains more information than hard labels. This extra information helps the student model generalize better.
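For readers wondering what "soft probabilities" look like in practice, here is a minimal pure-Python sketch. The class names, logit values, and temperature are invented for illustration; the point is that a temperature-scaled softmax over the teacher's logits yields a distribution, not a one-hot vector:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher temperature -> softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for classes [cat, dog, car]
teacher_logits = [8.0, 6.0, 1.0]

hard_label = [1, 0, 0]  # one-hot: says nothing about how dog compares to car
soft_probs = softmax(teacher_logits, temperature=4.0)

# The soft distribution is the "dark knowledge": it reveals that, to the
# teacher, a dog looks far more cat-like than a car does.
print(soft_probs)
```

Raising the temperature flattens the distribution, exposing more of the teacher's inter-class similarity structure to the student.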
What do you mean by "Entropy Gain"?
Thanks a lot for your fast reply. You're right, "entropy gain" doesn't make sense; the more appropriate term is likely "information gain", since by absorbing the cross-entropy loss between the two distributions (the student model's output and the teacher model's output), the student gains more information than by absorbing the cross-entropy loss between the student's output and the ground truth.
I got confused between the two terms.
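The information argument above can be made concrete with a tiny sketch (the probability values are invented): a one-hot ground-truth label has zero Shannon entropy, while a teacher's soft output has positive entropy, i.e., it carries extra signal about inter-class similarity that the student can absorb:

```python
import math

def entropy(p):
    """Shannon entropy in bits; the 0 * log(0) terms are taken as 0."""
    return -sum(q * math.log2(q) for q in p if q > 0)

hard_label = [1.0, 0.0, 0.0]      # ground truth: zero entropy, no similarity info
teacher_soft = [0.7, 0.25, 0.05]  # hypothetical teacher output over the same classes

print(entropy(hard_label))    # 0.0 bits
print(entropy(teacher_soft))  # positive: the extra information the student sees
```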
Thanks! That makes sense. But may I suggest something like "Information Transfer" rather than "gain", in order not to confuse it with the metric.
chapters/en/unit3/vision-transformers/knowledge-distillation.mdx
Thanks for the contribution and I really like the facts you added 👍
I left some suggestions to make it even more informative for people going through the course.
> To see this loss function implemented in Python, along with a fully worked-out example, check out the [notebook for this section](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/KnowledgeDistillation.ipynb).
>
> <a target="_blank" href="https://colab.research.google.com/github/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/KnowledgeDistillation.ipynb">
> <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
> </a>
>
> ## So How Can We Use This to Our Advantage?
>
> Knowledge distillation became very useful when the machine learning community realized there was great potential in deploying AI models on edge devices. However, we can't just deploy a model that's 1 GB in size and has a 1-second latency for users to use—it’s too big and too slow. These drawbacks are mainly caused by the large size of the model. From that moment, people started distilling models and reducing their parameter count by more than 90% with negligible performance drops.
**Suggested change:** replace "These drawbacks are mainly caused by the large size of the model." with "These drawbacks are mainly caused by the big number of weights of the model."
> ### Why Does Knowledge Distillation Work?
>
> 1. **Entropy Gain**: One way distillation works is by transferring the "dark knowledge" (i.e., soft probabilities) from the teacher model to the student model, which contains more information than hard labels. This extra information helps the student model generalize better.
This is a really interesting fact, could you maybe explain soft probabilities a bit more? Our target audience are mainly beginners, so I think it would help to give some more details about that.
> 1. **Entropy Gain**: One way distillation works is by transferring the "dark knowledge" (i.e., soft probabilities) from the teacher model to the student model, which contains more information than hard labels. This extra information helps the student model generalize better.
>
> 2. **Coherent Gradient Updates**: The gradients provided by the teacher model's soft targets help in more stable and coherent updates during training, resulting in a smoother training process for the student model.
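The two points quoted above come together in the standard distillation objective from Hinton et al. (2015): a weighted sum of the usual cross-entropy with the hard labels and a temperature-scaled cross-entropy against the teacher's soft targets. Below is a minimal pure-Python sketch; the logit values and the `alpha`/`T` hyperparameters are invented for illustration, not taken from the chapter:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i)."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

def distillation_loss(student_logits, teacher_logits, hard_label,
                      alpha=0.5, T=4.0):
    # Hard-label term: ordinary cross-entropy at temperature 1
    hard_term = cross_entropy(hard_label, softmax(student_logits))
    # Soft-target term: cross-entropy between teacher and student at
    # temperature T, scaled by T^2 to keep gradient magnitudes comparable
    soft_term = cross_entropy(softmax(teacher_logits, T),
                              softmax(student_logits, T))
    return alpha * hard_term + (1 - alpha) * (T ** 2) * soft_term

# Hypothetical logits: the student roughly follows the teacher's ranking
loss = distillation_loss([2.0, 1.0, 0.1], [8.0, 6.0, 1.0], [1, 0, 0])
print(loss)
```

Because the soft-target term is smallest when the student's tempered distribution matches the teacher's, its gradients consistently pull the student toward the teacher's full output distribution rather than toward a single hard class.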
Similar to above, maybe you can explain what you mean by soft targets
And feel free to add your name in this file in the Vision Transformer part @ghassen-fatnassi: https://github.com/johko/computer-vision-course/blob/main/chapters/en/unit0/welcome/welcome.mdx
Hey @ghassen-fatnassi , just wanted to check if you have any time to go through my requested changes soon? I'm currently cleaning up the PRs a bit, so it would be great if you could have a look.
Yes, I'll try my best to do the requested changes today!
better explanation hopefully
Done!
Thanks for the changes @ghassen-fatnassi , LGTM 👍
Hey @ghassen-fatnassi, thanks for this amazing elaboration! 😀 My main issue, as I mentioned previously, is the "entropy gain" section. I think the term is a little bit confusing. In this section, you're explaining what entropy is, but it's not clear how this relates to knowledge distillation. I understand you're trying to convey the important role that soft targets play in the student's learning, and how they contain much more information than hard labels, but it would be great to make this connection explicit in this section.
Thanks! Looks good! 👍