Update knowledge-distillation.mdx #318

Open

wants to merge 22 commits into base: stage

Commits (22)
d82e037
Update introduction-to-video.mdx
jungnerd May 30, 2024
d9211f5
Update introduction-to-video.mdx
jungnerd May 30, 2024
0c1da84
Added credits as a writer
jungnerd Jun 18, 2024
475d3ff
Apply suggestions from code review
jungnerd Jun 20, 2024
c14f309
Update knowledge-distillation.mdx
ghassen-fatnassi Jun 24, 2024
50b22b5
Update chapters/en/unit3/vision-transformers/knowledge-distillation.mdx
ghassen-fatnassi Jun 25, 2024
158a120
Apply suggestions from code review
jungnerd Jul 20, 2024
a412664
Merge branch 'main' into main
jungnerd Jul 22, 2024
bc2fd88
Merge branch 'main' into typo-grammar-wording
ericoulster Aug 1, 2024
26a1833
Fixed Various Grammatical Issues Across Course
ericoulster Aug 1, 2024
c33fca7
Merge branch 'main' into typo-grammar-wording
sergiopaniego Aug 12, 2024
a9713a8
Merge branch 'stage'
Aug 14, 2024
59e8b16
Merge branch 'stage'
Aug 14, 2024
58deb91
Merge branch 'main' into typo-grammar-wording
johko Aug 14, 2024
b240fba
Merge branch 'main' into main
johko Aug 14, 2024
eb3d94d
Merge pull request #314 from jungnerd/main
johko Aug 14, 2024
e1b45a4
Staged change in toctree.yml from name change in title
ericoulster Aug 14, 2024
1d04893
Merge branch 'main' into typo-grammar-wording
johko Aug 14, 2024
c3f6f7e
Merge pull request #326 from ericoulster/typo-grammar-wording
johko Aug 14, 2024
1490f12
Update 2 knowledge-distillation.mdx
ghassen-fatnassi Aug 15, 2024
bba3c60
Merge branch 'main' into patch-1
ghassen-fatnassi Aug 15, 2024
d57d7e4
Update knowledge-distillation.mdx
ghassen-fatnassi Aug 22, 2024
2 changes: 1 addition & 1 deletion chapters/en/_toctree.yml
@@ -57,7 +57,7 @@
- title: MobileViT v2
local: "unit3/vision-transformers/mobilevit"
- title: FineTuning Vision Transformer for Object Detection
local: "unit3/vision-transformers/vision-transformer-for-objection-detection"
local: "unit3/vision-transformers/vision-transformer-for-object-detection"
- title: DEtection TRansformer (DETR)
local: "unit3/vision-transformers/detr"
- title: Vision Transformers for Image Segmentation
2 changes: 1 addition & 1 deletion chapters/en/unit0/welcome/welcome.mdx
@@ -126,7 +126,7 @@ Our goal was to create a computer vision course that is beginner-friendly and th
**Unit 7 - Video and Video Processing**

- Reviewers: [Ameed Taylor](https://github.com/atayloraerospace)
- Writers: [Diwakar Basnet](https://github.com/DiwakarBasnet)
- Writers: [Diwakar Basnet](https://github.com/DiwakarBasnet), [Woojun Jung](https://github.com/jungnerd)

**Unit 8 - 3D Vision, Scene Rendering, and Reconstruction**

2 changes: 1 addition & 1 deletion chapters/en/unit13/hyena.mdx
@@ -12,7 +12,7 @@ Developed by Hazy Research, it features a subquadratic computational efficiency,

Long convolutions are similar to standard convolutions except the kernel is the size of the input.
It is equivalent to having a global receptive field instead of a local one.
Having an implicitly parametrized convultion means that the convolution filters values are not directly learnt, instead, learning a function that can recover thoses values is prefered.
Having an implicitly parametrized convolution means that the convolution filter's values are not directly learned. Instead, learning a function that can recover those values is preferred.

</Tip>

41 changes: 27 additions & 14 deletions chapters/en/unit3/vision-transformers/knowledge-distillation.mdx
@@ -1,10 +1,8 @@
# Knowledge Distillation with Vision Transformers

We are going to learn about Knowledge Distillation, the method behind [distilGPT](https://huggingface.co/distilgpt2) and [distilbert](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english), two of *the most downloaded models on the Hugging Face Hub!*
We are going to learn about Knowledge Distillation, the method behind [distilGPT](https://huggingface.co/distilgpt2) and [distilbert](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english), two of *the most downloaded models on the Hugging Face Hub!*

Presumably, we've all had teachers who "teach" by simply providing us the correct answers and then testing us on questions we haven't seen before, analogous to supervised learning
of machine learning models where we provide a labeled dataset to train on. Instead of having a model train on labels, however,
we can pursue [Knowledge Distillation](https://arxiv.org/abs/1503.02531) as an alternative to arriving at a much smaller model that can perform comparably to the larger model and much faster to boot.
Presumably, we've all had teachers who "teach" by simply providing us the correct answers and then testing us on questions we haven't seen before, analogous to supervised learning of machine learning models where we provide a labeled dataset to train on. Instead of having a model train on labels, however, we can pursue [Knowledge Distillation](https://arxiv.org/abs/1503.02531) as an alternative, arriving at a much smaller model that can perform comparably to the larger model and run much faster to boot.

## Intuition Behind Knowledge Distillation

@@ -14,29 +12,44 @@ Imagine you were given this multiple-choice question:

If you had someone just tell you, "The answer is Draco Malfoy," that doesn't teach you a whole lot about each of the characters' relative relationships with Harry Potter.

On the other hand, if someone tells you, "I am very confident it is not Ron Weasley, I am somewhat confident it is not Neville Longbottom, and
I am very confident that it *is* Draco Malfoy", this gives you some information about these characters' relationships to Harry Potter!
This is precisely the kind of information that gets passed down to our student model under the Knowledge Distillation paradigm.
On the other hand, if someone tells you, "I am very confident it is not Ron Weasley, I am somewhat confident it is not Neville Longbottom, and I am very confident that it *is* Draco Malfoy," this gives you some information about these characters' relationships to Harry Potter! This is precisely the kind of information that gets passed down to our student model under the Knowledge Distillation paradigm.

## Distilling the Knowledge in a Neural Network

In the paper [*Distilling the Knowledge in a Neural Network*](https://arxiv.org/abs/1503.02531), Hinton et al. introduced the training methodology known as knowledge distillation,
taking inspiration from *insects*, of all things. Just as insects transition from larval to adult forms that are optimized for different tasks, large-scale machine learning models can
initially be cumbersome, like larvae, for extracting structure from data but can distill their knowledge into smaller, more efficient models for deployment.
In the paper [*Distilling the Knowledge in a Neural Network*](https://arxiv.org/abs/1503.02531), Hinton et al. introduced the training methodology known as knowledge distillation, taking inspiration from *insects*, of all things. Just as insects transition from larval to adult forms that are optimized for different tasks, large-scale machine learning models can initially be cumbersome, like larvae, for extracting structure from data but can distill their knowledge into smaller, more efficient models for deployment.

The essence of Knowledge Distillation is using the predicted logits from a teacher network to pass information to a smaller, more efficient student model. We do this
by re-writing the loss function to contain a *distillation loss*, which encourages the student model's distribution over the output space to approximate the teacher's.
The essence of Knowledge Distillation is using the predicted logits from a teacher network to pass information to a smaller, more efficient student model. We do this by re-writing the loss function to contain a *distillation loss*, which encourages the student model's distribution over the output space to approximate the teacher's.

The distillation loss is formulated as:

![Distillation Loss](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/KL-Loss.png)

The KL loss refers to the [Kullback-Leibler Divergence](https://en.wikipedia.org/wiki/Kullback–Leibler_divergence) between the teacher and the student's output distributions.
The overall loss for the student model is then formulated as the sum of this distillation loss with the standard cross-entropy loss over the ground-truth labels.
The KL loss refers to the [Kullback-Leibler Divergence](https://en.wikipedia.org/wiki/Kullback–Leibler_divergence) between the teacher and the student's output distributions. The overall loss for the student model is then formulated as the sum of this distillation loss with the standard cross-entropy loss over the ground-truth labels.
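
As a rough sketch of how this combined objective can look in PyTorch (the temperature `T`, the weighting factor `alpha`, and the function name are illustrative assumptions rather than the notebook's exact implementation):

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft part: KL divergence between temperature-scaled teacher and student
    # distributions; the T**2 factor keeps gradient magnitudes comparable.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T**2)

    # Hard part: standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    # Overall student loss: distillation loss combined with the cross-entropy loss
    # (written here as a weighted sum).
    return alpha * soft_loss + (1 - alpha) * hard_loss
```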

To see this loss function implemented in Python, along with a fully worked-out example, check out the [notebook for this section](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/KnowledgeDistillation.ipynb).

<a target="_blank" href="https://colab.research.google.com/github/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/KnowledgeDistillation.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Leveraging Knowledge Distillation for Edge Devices

Knowledge distillation has become increasingly crucial as AI models are deployed on edge devices. Deploying a large-scale model, say one that is 1 GB in size with a latency of 1 second, is impractical for real-time applications due to high computational and memory requirements, and these limitations are primarily attributable to the model's size. As a result, the field has embraced knowledge distillation, a technique that can reduce model parameters by over 90% with minimal performance degradation.

## The Mechanisms Behind Knowledge Distillation

### 1. Entropy Gain
In the context of information theory, entropy is analogous to its counterpart in physics, where it measures the "chaos" or disorder within a system. In our scenario, it quantifies the amount of information a distribution contains. Consider the following example:

- Which is harder to remember: `[0, 1, 0, 0]` or `[0.2, 0.5, 0.2, 0.1]`?

The first vector, `[0, 1, 0, 0]`, is easier to remember and compress because it contains less information: it can be summarized as "a 1 in the second position." On the other hand, `[0.2, 0.5, 0.2, 0.1]` contains more information, which means it conveys the "inter-class structure" that the model can learn from, as discussed earlier.
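
To make the difference concrete, here is a minimal sketch (not from the original chapter) that computes the Shannon entropy of both vectors:

```python
import math


def entropy(probs):
    # Shannon entropy in bits; terms with probability 0 contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)


print(entropy([0, 1, 0, 0]))          # 0.0 bits: a hard label carries no extra structure
print(entropy([0.2, 0.5, 0.2, 0.1]))  # ~1.76 bits: soft targets carry inter-class information
```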

### 2. Coherent Gradient Updates
Models learn iteratively by minimizing a loss function and updating their parameters through gradient descent. Consider a set of parameters `P = {w1, w2, w3, ..., wn}` whose role is to activate when the model detects a sample of class A. If an ambiguous sample resembles class A but actually belongs to class B, training on the hard label alone produces an aggressive gradient update after the misclassification, leading to instability. In contrast, the teacher model's soft targets promote more stable and coherent gradient updates during distillation, resulting in a smoother learning process for the student model.
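
One way to see this, sketched below under the simplifying assumption of a plain cross-entropy objective, is that the gradient of the loss with respect to the logits is `softmax(logits) - target`, so a soft teacher target yields a much gentler correction on an ambiguous sample than a hard label does:

```python
import torch
import torch.nn.functional as F

# Student logits for an ambiguous sample that strongly resembles class A (index 0)
# but whose ground-truth label is class B (index 1).
logits = torch.tensor([2.0, 1.0, -1.0])

hard_target = torch.tensor([0.0, 1.0, 0.0])     # one-hot ground-truth label
soft_target = torch.tensor([0.45, 0.50, 0.05])  # hypothetical teacher distribution

# Gradient of cross-entropy w.r.t. the logits is softmax(logits) - target.
probs = F.softmax(logits, dim=-1)
print(probs - hard_target)  # large, "aggressive" correction on classes A and B
print(probs - soft_target)  # much smaller, smoother correction
```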

### 3. Ability to Train on Unlabeled Data
The presence of a teacher model allows the student model to train on unlabeled data. The teacher model can generate pseudo-labels for these unlabeled samples, which the student model can then use for training. This approach significantly increases the amount of usable training data.
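
A minimal sketch of this idea (the `teacher`, `student`, `unlabeled_loader`, and `optimizer` objects are assumed to exist and are not defined in this chapter):

```python
import torch
import torch.nn.functional as F


def distill_on_unlabeled(teacher, student, unlabeled_loader, optimizer, T=2.0):
    teacher.eval()
    student.train()
    for images in unlabeled_loader:
        # The teacher's soft predictions act as pseudo-labels for the unlabeled images.
        with torch.no_grad():
            teacher_probs = F.softmax(teacher(images) / T, dim=-1)

        student_log_probs = F.log_softmax(student(images) / T, dim=-1)
        loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T**2

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```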

### 4. A Shift in Perspective
Deep learning models are typically trained with the assumption that providing enough data will allow them to approximate a function `F` that accurately represents the underlying phenomenon. However, in many cases, data scarcity makes this assumption unrealistic. The traditional approach involves building larger models and fine-tuning them iteratively to achieve optimal results. In contrast, knowledge distillation shifts this perspective: given that we already have a well-trained teacher model `F`, the goal becomes approximating `F` using a smaller model `f`.
@@ -29,7 +29,7 @@ To deepen your understanding of the ins-and-outs of object detection, check out

### The Need to Fine-tune Models in Object Detection 🤔

That is an awesome question. Training an object detection model from scratch means:
Should you build a new model, or alter an existing one? That is an awesome question. Training an object detection model from scratch means:

- Redoing research that has already been done, over and over again.
- Writing repetitive model code, training models, and maintaining different repositories for different use cases.
@@ -59,7 +59,7 @@ So, we are going to fine-tune a lightweight object detection model for doing jus

### Dataset

For the above scenario, we will use the [hardhat](https://huggingface.co/datasets/hf-vision/hardhat) dataset provided by [Northeaster University China](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/7CBGOS). We can download and load this dataset with 🤗 `datasets`.
For the above scenario, we will use the [hardhat](https://huggingface.co/datasets/hf-vision/hardhat) dataset provided by [Northeastern University China](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/7CBGOS). We can download and load this dataset with 🤗 `datasets`.

```python
from datasets import load_dataset

# Load the hardhat dataset referenced above.
dataset = load_dataset("hf-vision/hardhat")
```
5 changes: 2 additions & 3 deletions chapters/en/unit4/multimodal-models/a_multimodal_world.mdx
@@ -48,8 +48,7 @@ A dataset consisting of multiple modalities is a multimodal dataset. Out of the
- Vision + Audio: [VGG-Sound Dataset](https://www.robots.ox.ac.uk/~vgg/data/vggsound/), [RAVDESS Dataset](https://zenodo.org/records/1188976), [Audio-Visual Identity Database (AVID)](https://www.avid.wiki/Main_Page).
- Vision + Audio + Text: [RECOLA Database](https://diuf.unifr.ch/main/diva/recola/), [IEMOCAP Dataset](https://sail.usc.edu/iemocap/).

Now let us see what kind of tasks can be performed using a multimodal dataset? There are many examples, but we will focus generally on tasks that contains the visual and textual
A multimodal dataset will require a model which is able to process data from multiple modalities, such a model is a multimodal model.
Now, let us see what kind of tasks can be performed using a multimodal dataset. There are many examples, but we will generally focus on tasks that contain both visual and textual elements. A multimodal dataset requires a model that is able to process data from multiple modalities. Such a model is called a multimodal model.

## Multimodal Tasks and Models

@@ -89,7 +88,7 @@ A detailed section on multimodal tasks and models with a focus on Vision and Tex
## An application of multimodality: Multimodal Search 🔎📲💻

Internet search was the one key advantage Google had, but with the introduction of ChatGPT by OpenAI, Microsoft started out with
powering up their Bing search engine so that they can crush the competition. It was only restricted initially to LLMs, looking into large corpus of text data but the world around us, mainly social media content, web articles and all possible form of online content is largely multimodal. When we search about an image, the image pops up with a corresponding text to describe it. Won't it be super cool to have another powerful multimodal model which involved both Vision and Text at the same time? This can revolutionize the search landscape hugely, and the core tech involved in this is multimodal learning. We know that many companies also have a large database which is multimodal and mostly unstructured in nature. Multimodal models might help companies with internal search, interactive documentation (chatbots), and many such use cases. This is another domain of Enterprise AI where we leverage AI for organizational intelligence.
powering up their Bing search engine so that they could crush the competition. Initially this was restricted to LLMs looking into large corpora of text data, but the world around us, mainly social media content, web articles, and all possible forms of online content, is largely multimodal. When we search for an image, the image pops up with corresponding text to describe it. Wouldn't it be super cool to have a powerful multimodal model that handles both Vision and Text at the same time? This could hugely revolutionize the search landscape, and the core tech involved is multimodal learning. Many companies also have large databases that are multimodal and mostly unstructured in nature. Multimodal models might help companies with internal search, interactive documentation (chatbots), and many such use cases. This is another domain of Enterprise AI where we leverage AI for organizational intelligence.

Vision Language Models (VLMs) are models that can understand and process both vision and text modalities. The joint understanding of both modalities leads VLMs to perform various tasks efficiently, such as Visual Question Answering and text-to-image search. VLMs can thus serve as one of the best candidates for multimodal search. Overall, VLMs need some way to map text and image pairs to a joint embedding space in which each text-image pair is present as an embedding. We can perform various downstream tasks using these embeddings, which can also be used for search. The idea of such a joint space is that image and text embeddings that are similar in meaning will lie close together, enabling us to search for images based on text (text-to-image search) or vice-versa.
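
As a rough illustration of this joint-embedding idea (the CLIP checkpoint, the captions, and the image path below are assumptions chosen for the example, not something this chapter prescribes), a few lines with 🤗 `transformers` can rank captions against an image:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["a worker wearing a hardhat", "a cat sleeping on a sofa"]
image = Image.open("example.jpg")  # hypothetical local image

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Text and image embeddings live in the same space, so their similarity
# (exposed as logits_per_image) tells us which caption best matches the image.
print(outputs.logits_per_image.softmax(dim=-1))
```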
