Update knowledge-distillation.mdx #318

Open

wants to merge 22 commits into base: stage

Commits (22)
d82e037
Update introduction-to-video.mdx
jungnerd May 30, 2024
d9211f5
Update introduction-to-video.mdx
jungnerd May 30, 2024
0c1da84
Added credits as a writer
jungnerd Jun 18, 2024
475d3ff
Apply suggestions from code review
jungnerd Jun 20, 2024
c14f309
Update knowledge-distillation.mdx
ghassen-fatnassi Jun 24, 2024
50b22b5
Update chapters/en/unit3/vision-transformers/knowledge-distillation.mdx
ghassen-fatnassi Jun 25, 2024
158a120
Apply suggestions from code review
jungnerd Jul 20, 2024
a412664
Merge branch 'main' into main
jungnerd Jul 22, 2024
bc2fd88
Merge branch 'main' into typo-grammar-wording
ericoulster Aug 1, 2024
26a1833
Fixed Various Grammatical Issues Across Course
ericoulster Aug 1, 2024
c33fca7
Merge branch 'main' into typo-grammar-wording
sergiopaniego Aug 12, 2024
a9713a8
Merge branch 'stage'
Aug 14, 2024
59e8b16
Merge branch 'stage'
Aug 14, 2024
58deb91
Merge branch 'main' into typo-grammar-wording
johko Aug 14, 2024
b240fba
Merge branch 'main' into main
johko Aug 14, 2024
eb3d94d
Merge pull request #314 from jungnerd/main
johko Aug 14, 2024
e1b45a4
Staged change in toctree.yml from name change in title
ericoulster Aug 14, 2024
1d04893
Merge branch 'main' into typo-grammar-wording
johko Aug 14, 2024
c3f6f7e
Merge pull request #326 from ericoulster/typo-grammar-wording
johko Aug 14, 2024
1490f12
Update 2 knowledge-distillation.mdx
ghassen-fatnassi Aug 15, 2024
bba3c60
Merge branch 'main' into patch-1
ghassen-fatnassi Aug 15, 2024
d57d7e4
Update knowledge-distillation.mdx
ghassen-fatnassi Aug 22, 2024
2 changes: 1 addition & 1 deletion chapters/en/_toctree.yml
@@ -57,7 +57,7 @@
- title: MobileViT v2
local: "unit3/vision-transformers/mobilevit"
- title: FineTuning Vision Transformer for Object Detection
local: "unit3/vision-transformers/vision-transformer-for-objection-detection"
local: "unit3/vision-transformers/vision-transformer-for-object-detection"
- title: DEtection TRansformer (DETR)
local: "unit3/vision-transformers/detr"
- title: Vision Transformers for Image Segmentation
2 changes: 1 addition & 1 deletion chapters/en/unit0/welcome/welcome.mdx
@@ -126,7 +126,7 @@ Our goal was to create a computer vision course that is beginner-friendly and th
**Unit 7 - Video and Video Processing**

- Reviewers: [Ameed Taylor](https://github.com/atayloraerospace)
- Writers: [Diwakar Basnet](https://github.com/DiwakarBasnet)
- Writers: [Diwakar Basnet](https://github.com/DiwakarBasnet), [Woojun Jung](https://github.com/jungnerd)

**Unit 8 - 3D Vision, Scene Rendering, and Reconstruction**

2 changes: 1 addition & 1 deletion chapters/en/unit13/hyena.mdx
@@ -12,7 +12,7 @@ Developed by Hazy Research, it features a subquadratic computational efficiency,

Long convolutions are similar to standard convolutions except the kernel is the size of the input.
It is equivalent to having a global receptive field instead of a local one.
Having an implicitly parametrized convultion means that the convolution filters values are not directly learnt, instead, learning a function that can recover thoses values is prefered.
Having an implicitly parametrized convolution means that the convolution filter's values are not directly learned. Instead, learning a function that can recover those values is preferred.

</Tip>

41 changes: 27 additions & 14 deletions chapters/en/unit3/vision-transformers/knowledge-distillation.mdx
@@ -1,10 +1,8 @@
# Knowledge Distillation with Vision Transformers

We are going to learn about Knowledge Distillation, the method behind [distilGPT](https://huggingface.co/distilgpt2) and [distilbert](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english), two of *the most downloaded models on the Hugging Face Hub!*
We are going to learn about Knowledge Distillation, the method behind [distilGPT](https://huggingface.co/distilgpt2) and [distilbert](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english), two of *the most downloaded models on the Hugging Face Hub!*

Presumably, we've all had teachers who "teach" by simply providing us the correct answers and then testing us on questions we haven't seen before, analogous to supervised learning
of machine learning models where we provide a labeled dataset to train on. Instead of having a model train on labels, however,
we can pursue [Knowledge Distillation](https://arxiv.org/abs/1503.02531) as an alternative to arriving at a much smaller model that can perform comparably to the larger model and much faster to boot.
Presumably, we've all had teachers who "teach" by simply providing us the correct answers and then testing us on questions we haven't seen before, analogous to supervised learning of machine learning models where we provide a labeled dataset to train on. Instead of having a model train on labels, however, we can pursue [Knowledge Distillation](https://arxiv.org/abs/1503.02531) as an alternative, arriving at a much smaller model that can perform comparably to the larger model and run much faster to boot.

## Intuition Behind Knowledge Distillation

@@ -14,29 +12,44 @@ Imagine you were given this multiple-choice question:

If you had someone just tell you, "The answer is Draco Malfoy," that doesn't teach you a whole lot about each of the characters' relative relationships with Harry Potter.

On the other hand, if someone tells you, "I am very confident it is not Ron Weasley, I am somewhat confident it is not Neville Longbottom, and
I am very confident that it *is* Draco Malfoy", this gives you some information about these characters' relationships to Harry Potter!
This is precisely the kind of information that gets passed down to our student model under the Knowledge Distillation paradigm.
On the other hand, if someone tells you, "I am very confident it is not Ron Weasley, I am somewhat confident it is not Neville Longbottom, and I am very confident that it *is* Draco Malfoy," this gives you some information about these characters' relationships to Harry Potter! This is precisely the kind of information that gets passed down to our student model under the Knowledge Distillation paradigm.

## Distilling the Knowledge in a Neural Network

In the paper [*Distilling the Knowledge in a Neural Network*](https://arxiv.org/abs/1503.02531), Hinton et al. introduced the training methodology known as knowledge distillation,
taking inspiration from *insects*, of all things. Just as insects transition from larval to adult forms that are optimized for different tasks, large-scale machine learning models can
initially be cumbersome, like larvae, for extracting structure from data but can distill their knowledge into smaller, more efficient models for deployment.
In the paper [*Distilling the Knowledge in a Neural Network*](https://arxiv.org/abs/1503.02531), Hinton et al. introduced the training methodology known as knowledge distillation, taking inspiration from *insects*, of all things. Just as insects transition from larval to adult forms that are optimized for different tasks, large-scale machine learning models can initially be cumbersome, like larvae, for extracting structure from data but can distill their knowledge into smaller, more efficient models for deployment.

The essence of Knowledge Distillation is using the predicted logits from a teacher network to pass information to a smaller, more efficient student model. We do this
by re-writing the loss function to contain a *distillation loss*, which encourages the student model's distribution over the output space to approximate the teacher's.
The essence of Knowledge Distillation is using the predicted logits from a teacher network to pass information to a smaller, more efficient student model. We do this by re-writing the loss function to contain a *distillation loss*, which encourages the student model's distribution over the output space to approximate the teacher's.

The distillation loss is formulated as:

![Distillation Loss](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/KL-Loss.png)

The KL loss refers to the [Kullback-Leibler Divergence](https://en.wikipedia.org/wiki/Kullback–Leibler_divergence) between the teacher and the student's output distributions.
The overall loss for the student model is then formulated as the sum of this distillation loss with the standard cross-entropy loss over the ground-truth labels.
The KL loss refers to the [Kullback-Leibler Divergence](https://en.wikipedia.org/wiki/Kullback–Leibler_divergence) between the teacher and the student's output distributions. The overall loss for the student model is then formulated as the sum of this distillation loss with the standard cross-entropy loss over the ground-truth labels.
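
As a rough sketch of how this combined objective can look in PyTorch (the temperature `T`, the weighting factor `alpha`, and the function name are illustrative assumptions rather than the notebook's exact implementation):

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft part: KL divergence between temperature-scaled teacher and student
    # distributions; the T**2 factor keeps gradient magnitudes comparable.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T**2)

    # Hard part: standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    # Overall student loss: distillation loss combined with the cross-entropy loss
    # (written here as a weighted sum).
    return alpha * soft_loss + (1 - alpha) * hard_loss
```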

To see this loss function implemented in Python, along with a fully worked-out example, check out the [notebook for this section](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/KnowledgeDistillation.ipynb).

<a target="_blank" href="https://colab.research.google.com/github/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/KnowledgeDistillation.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Leveraging Knowledge Distillation for Edge Devices

Knowledge distillation has become increasingly crucial as AI models are deployed on edge devices. Deploying a large-scale model, say one that is 1 GB in size with a latency of 1 second, is impractical for real-time applications due to high computational and memory requirements, and these limitations are primarily attributable to the model's size. As a result, the field has embraced knowledge distillation, a technique that can reduce model parameters by over 90% with minimal performance degradation.

## The Mechanisms Behind Knowledge Distillation

### 1. Entropy Gain
In the context of information theory, entropy is analogous to its counterpart in physics, where it measures the "chaos" or disorder within a system. In our scenario, it quantifies the amount of information a distribution contains. Consider the following example:

- Which is harder to remember: `[0, 1, 0, 0]` or `[0.2, 0.5, 0.2, 0.1]`?

The first vector, `[0, 1, 0, 0]`, is easier to remember and compress because it contains less information: it can be summarized as "a 1 in the second position." On the other hand, `[0.2, 0.5, 0.2, 0.1]` contains more information, which means it conveys the "inter-class structure" that the model can learn from, as discussed earlier.
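
To make the difference concrete, here is a minimal sketch (not from the original chapter) that computes the Shannon entropy of both vectors:

```python
import math


def entropy(probs):
    # Shannon entropy in bits; terms with probability 0 contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)


print(entropy([0, 1, 0, 0]))          # 0.0 bits: a hard label carries no extra structure
print(entropy([0.2, 0.5, 0.2, 0.1]))  # ~1.76 bits: soft targets carry inter-class information
```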

### 2. Coherent Gradient Updates
Models learn iteratively by minimizing a loss function and updating their parameters through gradient descent. Consider a set of parameters `P = {w1, w2, w3, ..., wn}` whose role is to activate when the model detects a sample of class A. If an ambiguous sample resembles class A but actually belongs to class B, training on the hard label alone produces an aggressive gradient update after the misclassification, leading to instability. In contrast, the teacher model's soft targets promote more stable and coherent gradient updates during distillation, resulting in a smoother learning process for the student model.
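
One way to see this, sketched below under the simplifying assumption of a plain cross-entropy objective, is that the gradient of the loss with respect to the logits is `softmax(logits) - target`, so a soft teacher target yields a much gentler correction on an ambiguous sample than a hard label does:

```python
import torch
import torch.nn.functional as F

# Student logits for an ambiguous sample that strongly resembles class A (index 0)
# but whose ground-truth label is class B (index 1).
logits = torch.tensor([2.0, 1.0, -1.0])

hard_target = torch.tensor([0.0, 1.0, 0.0])     # one-hot ground-truth label
soft_target = torch.tensor([0.45, 0.50, 0.05])  # hypothetical teacher distribution

# Gradient of cross-entropy w.r.t. the logits is softmax(logits) - target.
probs = F.softmax(logits, dim=-1)
print(probs - hard_target)  # large, "aggressive" correction on classes A and B
print(probs - soft_target)  # much smaller, smoother correction
```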

### 3. Ability to Train on Unlabeled Data
The presence of a teacher model allows the student model to train on unlabeled data. The teacher model can generate pseudo-labels for these unlabeled samples, which the student model can then use for training. This approach significantly increases the amount of usable training data.
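
A minimal sketch of this idea (the `teacher`, `student`, `unlabeled_loader`, and `optimizer` objects are assumed to exist and are not defined in this chapter):

```python
import torch
import torch.nn.functional as F


def distill_on_unlabeled(teacher, student, unlabeled_loader, optimizer, T=2.0):
    teacher.eval()
    student.train()
    for images in unlabeled_loader:
        # The teacher's soft predictions act as pseudo-labels for the unlabeled images.
        with torch.no_grad():
            teacher_probs = F.softmax(teacher(images) / T, dim=-1)

        student_log_probs = F.log_softmax(student(images) / T, dim=-1)
        loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T**2

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```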

### 4. A Shift in Perspective
Deep learning models are typically trained with the assumption that providing enough data will allow them to approximate a function `F` that accurately represents the underlying phenomenon. However, in many cases, data scarcity makes this assumption unrealistic. The traditional approach involves building larger models and fine-tuning them iteratively to achieve optimal results. In contrast, knowledge distillation shifts this perspective: given that we already have a well-trained teacher model `F`, the goal becomes approximating `F` using a smaller model `f`.
@@ -29,7 +29,7 @@ To deepen your understanding of the ins-and-outs of object detection, check out

### The Need to Fine-tune Models in Object Detection 🤔

That is an awesome question. Training an object detection model from scratch means:
Should you build a new model, or alter an existing one? That is an awesome question. Training an object detection model from scratch means:

- Redoing research that has already been done, over and over again.
- Writing repetitive model code, training models, and maintaining different repositories for different use cases.
@@ -59,7 +59,7 @@ So, we are going to fine-tune a lightweight object detection model for doing jus

### Dataset

For the above scenario, we will use the [hardhat](https://huggingface.co/datasets/hf-vision/hardhat) dataset provided by [Northeaster University China](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/7CBGOS). We can download and load this dataset with 🤗 `datasets`.
For the above scenario, we will use the [hardhat](https://huggingface.co/datasets/hf-vision/hardhat) dataset provided by [Northeastern University China](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/7CBGOS). We can download and load this dataset with 🤗 `datasets`.

```python
from datasets import load_dataset

# Load the hardhat dataset referenced above.
dataset = load_dataset("hf-vision/hardhat")
```
5 changes: 2 additions & 3 deletions chapters/en/unit4/multimodal-models/a_multimodal_world.mdx
@@ -48,8 +48,7 @@ A dataset consisting of multiple modalities is a multimodal dataset. Out of the
- Vision + Audio: [VGG-Sound Dataset](https://www.robots.ox.ac.uk/~vgg/data/vggsound/), [RAVDESS Dataset](https://zenodo.org/records/1188976), [Audio-Visual Identity Database (AVID)](https://www.avid.wiki/Main_Page).
- Vision + Audio + Text: [RECOLA Database](https://diuf.unifr.ch/main/diva/recola/), [IEMOCAP Dataset](https://sail.usc.edu/iemocap/).

Now let us see what kind of tasks can be performed using a multimodal dataset? There are many examples, but we will focus generally on tasks that contains the visual and textual
A multimodal dataset will require a model which is able to process data from multiple modalities, such a model is a multimodal model.
Now, let us see what kind of tasks can be performed using a multimodal dataset. There are many examples, but we will generally focus on tasks that contain both visual and textual elements. A multimodal dataset requires a model that is able to process data from multiple modalities. Such a model is called a multimodal model.

## Multimodal Tasks and Models

@@ -89,7 +88,7 @@ A detailed section on multimodal tasks and models with a focus on Vision and Tex
## An application of multimodality: Multimodal Search 🔎📲💻

Internet search was the one key advantage Google had, but with the introduction of ChatGPT by OpenAI, Microsoft started out with
powering up their Bing search engine so that they can crush the competition. It was only restricted initially to LLMs, looking into large corpus of text data but the world around us, mainly social media content, web articles and all possible form of online content is largely multimodal. When we search about an image, the image pops up with a corresponding text to describe it. Won't it be super cool to have another powerful multimodal model which involved both Vision and Text at the same time? This can revolutionize the search landscape hugely, and the core tech involved in this is multimodal learning. We know that many companies also have a large database which is multimodal and mostly unstructured in nature. Multimodal models might help companies with internal search, interactive documentation (chatbots), and many such use cases. This is another domain of Enterprise AI where we leverage AI for organizational intelligence.
powering up their Bing search engine so that they could crush the competition. Initially this was restricted to LLMs looking into large corpora of text data, but the world around us, mainly social media content, web articles, and all possible forms of online content, is largely multimodal. When we search for an image, the image pops up with corresponding text to describe it. Wouldn't it be super cool to have a powerful multimodal model that handles both Vision and Text at the same time? This could hugely revolutionize the search landscape, and the core tech involved is multimodal learning. Many companies also have large databases that are multimodal and mostly unstructured in nature. Multimodal models might help companies with internal search, interactive documentation (chatbots), and many such use cases. This is another domain of Enterprise AI where we leverage AI for organizational intelligence.

Vision Language Models (VLMs) are models that can understand and process both vision and text modalities. The joint understanding of both modalities leads VLMs to perform various tasks efficiently, such as Visual Question Answering and text-to-image search. VLMs can thus serve as one of the best candidates for multimodal search. Overall, VLMs need some way to map text and image pairs to a joint embedding space in which each text-image pair is present as an embedding. We can perform various downstream tasks using these embeddings, which can also be used for search. The idea of such a joint space is that image and text embeddings that are similar in meaning will lie close together, enabling us to search for images based on text (text-to-image search) or vice-versa.
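
As a rough illustration of this joint-embedding idea (the CLIP checkpoint, the captions, and the image path below are assumptions chosen for the example, not something this chapter prescribes), a few lines with 🤗 `transformers` can rank captions against an image:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["a worker wearing a hardhat", "a cat sleeping on a sofa"]
image = Image.open("example.jpg")  # hypothetical local image

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Text and image embeddings live in the same space, so their similarity
# (exposed as logits_per_image) tells us which caption best matches the image.
print(outputs.logits_per_image.softmax(dim=-1))
```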
