Update welcome.mdx #333

Open · wants to merge 18 commits into `stage`
2 changes: 1 addition & 1 deletion chapters/en/_toctree.yml
@@ -57,7 +57,7 @@
- title: MobileViT v2
local: "unit3/vision-transformers/mobilevit"
- title: FineTuning Vision Transformer for Object Detection
local: "unit3/vision-transformers/vision-transformer-for-objection-detection"
local: "unit3/vision-transformers/vision-transformer-for-object-detection"
- title: DEtection TRansformer (DETR)
local: "unit3/vision-transformers/detr"
- title: Vision Transformers for Image Segmentation
4 changes: 2 additions & 2 deletions chapters/en/unit0/welcome/welcome.mdx
@@ -106,7 +106,7 @@ Our goal was to create a computer vision course that is beginner-friendly and th
**Unit 3 - Vision Transformers**

- Reviewers: [Ratan Prasad](https://github.com/ratan), [Mohammed Hamdy](https://github.com/mmhamdy), [Ameed Taylor](https://github.com/atayloraerospace), [Sezan](https://github.com/sezan92)
- Writers: [Surya Guthikonda](https://github.com/SuryaKrishna02), [Ker Lee Yap](https://github.com/klyap), [Anindyadeep Sannigrahi](https://bento.me/anindyadeep), [Celina Hanouti](https://github.com/hanouticelina), [Malcolm Krolick](https://github.com/Mkrolick), [Alvin Li](https://github.com/alvanli), [Shreyas Daniel Gaddam](https://shreydan.github.io), [Anthony Susevski](https://github.com/asusevski), [Alan Ahmet](https://github.com/alanahmet)
- Writers: [Surya Guthikonda](https://github.com/SuryaKrishna02), [Ker Lee Yap](https://github.com/klyap), [Anindyadeep Sannigrahi](https://bento.me/anindyadeep), [Celina Hanouti](https://github.com/hanouticelina), [Malcolm Krolick](https://github.com/Mkrolick), [Alvin Li](https://github.com/alvanli), [Shreyas Daniel Gaddam](https://shreydan.github.io), [Anthony Susevski](https://github.com/asusevski), [Alan Ahmet](https://github.com/alanahmet), [Ghassen Fatnassi](https://github.com/ghassen-fatnassi)

**Unit 4 - Multimodal Models**

@@ -126,7 +126,7 @@ Our goal was to create a computer vision course that is beginner-friendly and th
**Unit 7 - Video and Video Processing**

- Reviewers: [Ameed Taylor](https://github.com/atayloraerospace)
- Writers: [Diwakar Basnet](https://github.com/DiwakarBasnet)
- Writers: [Diwakar Basnet](https://github.com/DiwakarBasnet), [Woojun Jung](https://github.com/jungnerd)

**Unit 8 - 3D Vision, Scene Rendering, and Reconstruction**

2 changes: 1 addition & 1 deletion chapters/en/unit13/hyena.mdx
@@ -12,7 +12,7 @@ Developed by Hazy Research, it features a subquadratic computational efficiency,

Long convolutions are similar to standard convolutions except the kernel is the size of the input.
It is equivalent to having a global receptive field instead of a local one.
Having an implicitly parametrized convultion means that the convolution filters values are not directly learnt, instead, learning a function that can recover thoses values is prefered.
Having an implicitly parametrized convolution means that the convolution filter values are not directly learned. Instead, learning a function that can recover those values is preferred.

</Tip>
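
To make the idea concrete, here is a minimal sketch (not the actual Hyena implementation, and the layer sizes are arbitrary) of an implicitly parametrized long convolution: a small MLP maps each position to a filter value, and the convolution is applied with FFTs so the cost stays subquadratic.

```python
# Sketch of an implicitly parametrized long convolution.
# The filter is never stored directly; a small MLP "recovers" its values
# from normalized positions, and the kernel spans the whole input.
import torch
import torch.nn as nn


class ImplicitLongConv(nn.Module):
    def __init__(self, hidden_dim: int = 16):
        super().__init__()
        # The learned function mapping a position in [0, 1] to a filter value.
        self.filter_fn = nn.Sequential(
            nn.Linear(1, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len); the kernel is as long as the input,
        # i.e. a global receptive field.
        seq_len = x.shape[-1]
        positions = torch.linspace(0, 1, seq_len).unsqueeze(-1)  # (seq_len, 1)
        k = self.filter_fn(positions).squeeze(-1)                # (seq_len,)
        # FFT-based convolution keeps the cost subquadratic in seq_len.
        n = 2 * seq_len
        y = torch.fft.irfft(torch.fft.rfft(x, n=n) * torch.fft.rfft(k, n=n), n=n)
        return y[..., :seq_len]


x = torch.randn(2, 128)
print(ImplicitLongConv()(x).shape)  # torch.Size([2, 128])
```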

@@ -29,7 +29,7 @@ To deepen your understanding of the ins-and-outs of object detection, check out

### The Need to Fine-tune Models in Object Detection 🤔

That is an awesome question. Training an object detection model from scratch means:
Should you build a new model, or alter an existing one? That is an awesome question. Training an object detection model from scratch means:

- Redoing research that has already been done.
- Writing repetitive model code, training models, and maintaining different repositories for different use cases.
@@ -59,7 +59,7 @@ So, we are going to fine-tune a lightweight object detection model for doing jus

### Dataset

For the above scenario, we will use the [hardhat](https://huggingface.co/datasets/hf-vision/hardhat) dataset provided by [Northeaster University China](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/7CBGOS). We can download and load this dataset with 🤗 `datasets`.
For the above scenario, we will use the [hardhat](https://huggingface.co/datasets/hf-vision/hardhat) dataset provided by [Northeastern University China](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/7CBGOS). We can download and load this dataset with 🤗 `datasets`.

```python
from datasets import load_dataset
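
# The lines below are a minimal sketch of the load; the exact splits and
# fields are assumptions, so check the dataset card before relying on them.
dataset = load_dataset("hf-vision/hardhat")
print(dataset)  # inspect the available splits and features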
```
5 changes: 2 additions & 3 deletions chapters/en/unit4/multimodal-models/a_multimodal_world.mdx
@@ -48,8 +48,7 @@ A dataset consisting of multiple modalities is a multimodal dataset. Out of the
- Vision + Audio: [VGG-Sound Dataset](https://www.robots.ox.ac.uk/~vgg/data/vggsound/), [RAVDESS Dataset](https://zenodo.org/records/1188976), [Audio-Visual Identity Database (AVID)](https://www.avid.wiki/Main_Page).
- Vision + Audio + Text: [RECOLA Database](https://diuf.unifr.ch/main/diva/recola/), [IEMOCAP Dataset](https://sail.usc.edu/iemocap/).

Now let us see what kind of tasks can be performed using a multimodal dataset? There are many examples, but we will focus generally on tasks that contains the visual and textual
A multimodal dataset will require a model which is able to process data from multiple modalities, such a model is a multimodal model.
Now, let us see what kind of tasks can be performed using a multimodal dataset. There are many examples, but we will generally focus on tasks that contain both visual and textual elements. A multimodal dataset requires a model that is able to process data from multiple modalities. Such a model is called a multimodal model.

## Multimodal Tasks and Models

@@ -89,7 +88,7 @@ A detailed section on multimodal tasks and models with a focus on Vision and Tex
## An application of multimodality: Multimodal Search 🔎📲💻

Internet search was the one key advantage Google had, but with the introduction of ChatGPT by OpenAI, Microsoft started out with
powering up their Bing search engine so that they can crush the competition. It was only restricted initially to LLMs, looking into large corpus of text data but the world around us, mainly social media content, web articles and all possible form of online content is largely multimodal. When we search about an image, the image pops up with a corresponding text to describe it. Won't it be super cool to have another powerful multimodal model which involved both Vision and Text at the same time? This can revolutionize the search landscape hugely, and the core tech involved in this is multimodal learning. We know that many companies also have a large database which is multimodal and mostly unstructured in nature. Multimodal models might help companies with internal search, interactive documentation (chatbots), and many such use cases. This is another domain of Enterprise AI where we leverage AI for organizational intelligence.
powering up their Bing search engine to outpace the competition. Initially this was restricted to LLMs looking at large corpora of text data, but the world around us, mainly social media content, web articles, and all other forms of online content, is largely multimodal. When we search for an image, the image pops up with corresponding text describing it. Wouldn't it be super cool to have a powerful multimodal model that handles both Vision and Text at the same time? This could hugely revolutionize the search landscape, and the core technology involved is multimodal learning. Many companies also have large databases that are multimodal and mostly unstructured in nature. Multimodal models might help such companies with internal search, interactive documentation (chatbots), and many similar use cases. This is another domain of Enterprise AI where we leverage AI for organizational intelligence.

Vision Language Models (VLMs) are models that can understand and process both vision and text modalities. The joint understanding of both modalities leads VLMs to perform various tasks efficiently, such as Visual Question Answering and text-to-image search. VLMs are thus among the best candidates for multimodal search. Overall, a VLM needs to map text and image pairs into a joint embedding space, where each text-image pair is present as an embedding. We can perform various downstream tasks using these embeddings, and they can also be used for search. The idea of such a joint space is that image and text embeddings that are similar in meaning will lie close together, enabling us to search for images based on text (text-to-image search) or vice versa.
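
As a small, hedged illustration of such a joint embedding space (the checkpoint name, the example image URL, and the toy captions below are assumptions for the sketch, not part of the course material), here is how CLIP scores a set of text queries against an image with 🤗 `transformers`:

```python
# Minimal text-to-image matching sketch with CLIP.
# Images and captions are embedded into the same space, and similarity
# scores tell us which caption best matches the image.
from PIL import Image
import requests
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)
queries = ["two cats sleeping on a couch", "a plane on a runway"]

inputs = processor(text=queries, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher probability means the text lies closer to the image in the joint space.
print(outputs.logits_per_image.softmax(dim=-1))
```

In a real text-to-image search system, you would precompute image embeddings for your whole collection and rank them against the embedded text query, typically with an approximate nearest-neighbor index.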

43 changes: 41 additions & 2 deletions chapters/en/unit7/video-processing/introduction-to-video.mdx
@@ -5,8 +5,6 @@ Of course, the real world of Computer Vision has a lot more to offer. Videos are

Given their importance in our society and research, we also want to talk about them here in our course. In this introduction chapter, you will learn some very basic theory behind videos before going on to have a closer look at video processing.

Let's go! 🤓

## What is a Video?

An image is a static, two-dimensional (2D) representation of visual data. A video is a multimedia format that displays a sequence of these frames, or images, over time.
@@ -36,3 +34,44 @@ Codecs, short for “compressor-decompressor” are software or hardware compone
There are two main types of codecs: lossless codecs and lossy codecs. Lossless codecs compress data without any loss of quality, while lossy codecs compress by discarding some of the data, which results in a loss of quality.
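
As a minimal sketch (assuming OpenCV is installed and the `mp4v` codec is available in your build), this is how a codec is selected when writing video frames:

```python
# Write synthetic frames with a lossy codec chosen via a FourCC code.
# "mp4v" is an assumption; which codecs are available depends on your OpenCV build.
import cv2
import numpy as np

fourcc = cv2.VideoWriter_fourcc(*"mp4v")               # selects the (lossy) codec
writer = cv2.VideoWriter("demo.mp4", fourcc, 30.0, (640, 480))

for i in range(90):  # 3 seconds at 30 fps
    frame = np.full((480, 640, 3), fill_value=i % 255, dtype=np.uint8)
    writer.write(frame)

writer.release()
```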

In summary, a video is a dynamic multimedia format that combines a series of individual frames, audio, and often additional metadata. It is used in a wide range of applications and can be tailored for different purposes, whether for entertainment, education, communication, or analysis.

## What is Video Processing?

In the research field of Computer Vision (CV) and Artificial Intelligence (AI), video processing involves automatically analyzing video data to understand and interpret both temporal and spatial features. Video data is simply a sequence of time-varying images, where the information is digitized both spatially and temporally. This allows us to perform detailed analysis and manipulation of the content within each frame of the video.
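
As a minimal sketch (assuming OpenCV and a local file named `sample.mp4`), here is what treating a video as a time-varying sequence of images looks like in code: each frame is a 2D spatial grid of pixels, and the frame index carries the temporal information.

```python
# Read a video frame by frame: the spatial dimension is each frame's pixel
# grid, the temporal dimension is the frame index / timestamp.
import cv2

cap = cv2.VideoCapture("sample.mp4")  # assumed local file
fps = cap.get(cv2.CAP_PROP_FPS)
prev_gray = None
frame_idx = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if prev_gray is not None:
        # A crude temporal feature: mean absolute change between frames.
        motion = cv2.absdiff(gray, prev_gray).mean()
        print(f"frame {frame_idx} @ {frame_idx / fps:.2f}s, motion={motion:.2f}")
    prev_gray = gray
    frame_idx += 1

cap.release()
```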



Video processing has become increasingly important in today's technology-driven world, thanks to the rapid advancements in Deep Learning (DL) and AI. Traditionally, DL research has focused on images, speech, and text, but video data offers a unique and valuable opportunity for research due to its extensive size and complexity. With millions of videos uploaded daily on platforms like YouTube, video data has become a rich resource, driving AI research and enabling groundbreaking applications.


### Applications of Video Processing

- **Surveillance Systems:**
Video processing plays a critical role in public safety, crime prevention, and traffic monitoring. It enables the automated detection of suspicious activities, helps identify individuals, and enhances the efficiency of surveillance systems.

- **Autonomous Driving:**
In the realm of autonomous driving, video processing is essential for navigation, obstacle detection, and decision-making processes. It allows self-driving cars to understand their surroundings, recognize road signs, and react to changing environments, ensuring safe and efficient transportation.

- **Healthcare:**
Video processing has significant applications in healthcare, including medical diagnostics, surgery, and patient monitoring. It helps analyze medical images, provides real-time feedback during surgical procedures, and continuously monitors patients to detect any abnormalities or emergencies.

### Challenges in Video Processing

- **Computational Demands:**
Real-time video analysis requires substantial processing power, which poses a significant challenge in developing and deploying efficient video processing systems. High-performance computing resources are essential to meet these demands.

- **Storage Requirements:**
High-resolution videos generate large volumes of data, leading to storage challenges. Efficient data compression and management techniques are necessary to handle the vast amounts of video data.

- **Privacy and Ethical Concerns:**
Video processing, especially in surveillance and healthcare, involves handling sensitive information. Ensuring privacy and addressing ethical concerns related to the misuse of video data are crucial considerations that must be carefully managed.

## Conclusion

Video processing is a dynamic and vital area within AI and CV, offering numerous applications and presenting unique challenges. Its importance in modern technology continues to grow, fueled by advancements in deep learning and the increasing availability of video data. In the following sections, we will dive deeper into deep learning for video processing. You'll explore state-of-the-art models including 3D CNNs and Transformers.



Additionally, we'll cover various tasks such as object tracking, action recognition, video stabilization, captioning, summarization, and background subtraction. These topics will provide you with a comprehensive understanding of how deep learning models are applied to different video processing challenges and applications.

Let's go! 🤓
2 changes: 1 addition & 1 deletion chapters/en/unit9/intro_to_model_optimization.mdx
@@ -32,7 +32,7 @@ A trade-off exists between accuracy, performance, and resource usage when deploy
2. Performance is the model's speed and efficiency (latency). This is important so the model can make predictions quickly, even in real time. However, optimizing performance will usually result in decreasing accuracy.
3. Resource usage is the computational resources needed to perform inference on the model, such as CPU, memory, and storage. Efficient resource usage is crucial if we want to deploy models on devices with certain limitations, such as smartphones or IoT devices.

Image below shows a common computer vision model in terms of model size, accuracy, and latency. Bigger model has high accuracy, but needs more time for inference and big size.
The image below shows a common computer vision model in terms of model size, accuracy, and latency. A bigger model has high accuracy, but needs more time for inference and has a larger file size.

![Model Size VS Accuracy](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/model_size_vs_accuracy.png)
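
To get a feel for this trade-off, here is a small, hedged sketch (the two `torchvision` models and the timing loop are arbitrary choices for illustration) that compares parameter count and CPU latency:

```python
# Compare a small and a large classifier on parameter count and CPU latency.
# resnet18/resnet152 are example choices; the numbers will vary per machine.
import time
import torch
from torchvision import models

x = torch.randn(1, 3, 224, 224)

for name, ctor in [("resnet18", models.resnet18), ("resnet152", models.resnet152)]:
    model = ctor(weights=None).eval()
    n_params = sum(p.numel() for p in model.parameters()) / 1e6
    with torch.no_grad():
        model(x)  # warm-up
        start = time.perf_counter()
        for _ in range(10):
            model(x)
        latency_ms = (time.perf_counter() - start) / 10 * 1000
    print(f"{name}: {n_params:.1f}M params, ~{latency_ms:.0f} ms per image (CPU)")
```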

2 changes: 1 addition & 1 deletion chapters/en/unit9/tools_and_frameworks.mdx
@@ -6,7 +6,7 @@

The TensorFlow Model Optimization Toolkit is a suite of tools for optimizing machine learning models for deployment.
The TensorFlow Lite post-training quantization tool enables users to convert weights to 8-bit precision, which reduces the trained model size by about 4 times.
The tools also include API for pruning and quantization during training is post-training quantization is insufficient.
The tools also include APIs for pruning and quantization during training if post-training quantization is insufficient.
These help users reduce latency and inference cost, deploy models to edge devices with restricted resources, and optimize execution for existing hardware or new special-purpose accelerators.
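
As a minimal sketch of post-training quantization with the TFLite converter (the `my_saved_model/` path is a placeholder for an already-trained model):

```python
# Post-training dynamic-range quantization: weights are stored in 8-bit
# precision, typically shrinking the model roughly 4x.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model/")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```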

### Setup guide