diff --git a/chapters/en/_toctree.yml b/chapters/en/_toctree.yml
index a9bd6ec08..685fd1247 100644
--- a/chapters/en/_toctree.yml
+++ b/chapters/en/_toctree.yml
@@ -57,7 +57,7 @@
     - title: MobileViT v2
       local: "unit3/vision-transformers/mobilevit"
     - title: FineTuning Vision Transformer for Object Detection
-      local: "unit3/vision-transformers/vision-transformer-for-objection-detection"
+      local: "unit3/vision-transformers/vision-transformer-for-object-detection"
     - title: DEtection TRansformer (DETR)
       local: "unit3/vision-transformers/detr"
     - title: Vision Transformers for Image Segmentation
diff --git a/chapters/en/unit13/hyena.mdx b/chapters/en/unit13/hyena.mdx
index 63535df3b..895d31ff2 100644
--- a/chapters/en/unit13/hyena.mdx
+++ b/chapters/en/unit13/hyena.mdx
@@ -12,7 +12,7 @@ Developed by Hazy Research, it features a subquadratic computational efficiency,
 Long convolutions are similar to standard convolutions except the kernel is the size of the input. It is equivalent to having a global receptive field instead of a local one.
-Having an implicitly parametrized convultion means that the convolution filters values are not directly learnt, instead, learning a function that can recover thoses values is prefered.
+Having an implicitly parametrized convolution means that the convolution filter values are not directly learned. Instead, learning a function that can recover those values is preferred.
diff --git a/chapters/en/unit3/vision-transformers/vision-transformer-for-objection-detection.mdx b/chapters/en/unit3/vision-transformers/vision-transformer-for-object-detection.mdx
similarity index 98%
rename from chapters/en/unit3/vision-transformers/vision-transformer-for-objection-detection.mdx
rename to chapters/en/unit3/vision-transformers/vision-transformer-for-object-detection.mdx
index 9b341fc9c..c1638d693 100644
--- a/chapters/en/unit3/vision-transformers/vision-transformer-for-objection-detection.mdx
+++ b/chapters/en/unit3/vision-transformers/vision-transformer-for-object-detection.mdx
@@ -29,7 +29,7 @@ To deepen your understanding of the ins-and-outs of object detection, check out
 ### The Need to Fine-tune Models in Object Detection 🤔
-That is an awesome question. Training an object detection model from scratch means:
+Should you build a new model, or alter an existing one? That is an awesome question. Training an object detection model from scratch means:
 - Doing already done research over and over again.
 - Writing repetitive model code, training them, and maintaining different repositories for different use cases.
@@ -59,7 +59,7 @@ So, we are going to fine-tune a lightweight object detection model for doing jus
 ### Dataset
-For the above scenario, we will use the [hardhat](https://huggingface.co/datasets/hf-vision/hardhat) dataset provided by [Northeaster University China](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/7CBGOS). We can download and load this dataset with 🤗 `datasets`.
+For the above scenario, we will use the [hardhat](https://huggingface.co/datasets/hf-vision/hardhat) dataset provided by [Northeastern University China](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/7CBGOS). We can download and load this dataset with 🤗 `datasets`.
 ```python
 from datasets import load_dataset
diff --git a/chapters/en/unit4/multimodal-models/a_multimodal_world.mdx b/chapters/en/unit4/multimodal-models/a_multimodal_world.mdx
index 3b03a42cb..5516b363c 100644
--- a/chapters/en/unit4/multimodal-models/a_multimodal_world.mdx
+++ b/chapters/en/unit4/multimodal-models/a_multimodal_world.mdx
@@ -88,7 +88,7 @@ A detailed section on multimodal tasks and models with a focus on Vision and Tex
 ## An application of multimodality: Multimodal Search 🔎📲💻
 Internet search was the one key advantage Google had, but with the introduction of ChatGPT by OpenAI, Microsoft started out with
-powering up their Bing search engine so that they can crush the competition. It was only restricted initially to LLMs, looking into large corpus of text data but the world around us, mainly social media content, web articles and all possible form of online content is largely multimodal. When we search about an image, the image pops up with a corresponding text to describe it. Won't it be super cool to have another powerful multimodal model which involved both Vision and Text at the same time? This can revolutionize the search landscape hugely, and the core tech involved in this is multimodal learning. We know that many companies also have a large database which is multimodal and mostly unstructured in nature. Multimodal models might help companies with internal search, interactive documentation (chatbots), and many such use cases. This is another domain of Enterprise AI where we leverage AI for organizational intelligence.
+powering up their Bing search engine so that they can crush the competition. It was initially restricted to LLMs, looking into a large corpus of text data, but the world around us, mainly social media content, web articles and all possible forms of online content are largely multimodal. When we search for an image, the image pops up with a corresponding text to describe it. Wouldn't it be super cool to have another powerful multimodal model which involved both Vision and Text at the same time? This can revolutionize the search landscape hugely, and the core tech involved in this is multimodal learning. We know that many companies also have a large database which is multimodal and mostly unstructured in nature. Multimodal models might help companies with internal search, interactive documentation (chatbots), and many such use cases. This is another domain of Enterprise AI where we leverage AI for organizational intelligence.
 Vision Language Models (VLMs) are models that can understand and process both vision and text modalities. The joint understanding of both modalities lead VLMs to perform various tasks efficiently like Visual Question Answering, Text-to-image search etc. VLMs thus can serve as one of the best candidates for multimodal search. So overall, VLMs should find some way to map text and image pairs to a joint embedding space where each text-image pair is present as an embedding. We can perform various downstream tasks using these embeddings, which can also be used for search. The idea of such a joint space is that image and text embeddings that are similar in meaning will lie close together, enabling us to do searches for images based on text (text-to-image search) or vice-versa.
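The joint text-image embedding space described in the hunk above is easier to see with a small example. The course text does not prescribe a model, so the sketch below assumes CLIP loaded through 🤗 `transformers` (checkpoint `openai/clip-vit-base-patch32`) and a few hypothetical local image files; it only illustrates how text-to-image search reduces to a similarity lookup in the shared space.

```python
# Hedged sketch of text-to-image search over a joint embedding space.
# The checkpoint and the image file names are illustrative assumptions,
# not something specified by the course text above.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(path) for path in ["cat.jpg", "dog.jpg", "car.jpg"]]  # hypothetical files
query = "a photo of a dog"

with torch.no_grad():
    image_embeds = model.get_image_features(**processor(images=images, return_tensors="pt"))
    text_embeds = model.get_text_features(**processor(text=[query], return_tensors="pt", padding=True))

# Normalize and compare: pairs that are similar in meaning lie close together in the joint space.
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
scores = (text_embeds @ image_embeds.T).squeeze(0)
print("best match:", int(scores.argmax()))
```

Any VLM that exposes separate text and image encoders could stand in for CLIP here; in a real search system the image embeddings would be precomputed and stored in an index rather than recomputed per query.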
diff --git a/chapters/en/unit9/intro_to_model_optimization.mdx b/chapters/en/unit9/intro_to_model_optimization.mdx
index 2356de8d1..213419fa4 100644
--- a/chapters/en/unit9/intro_to_model_optimization.mdx
+++ b/chapters/en/unit9/intro_to_model_optimization.mdx
@@ -32,7 +32,7 @@ A trade-off exists between accuracy, performance, and resource usage when deploy
 2. Performance is the model's speed and efficiency (latency). This is important so the model can make predictions quickly, even in real time. However, optimizing performance will usually result in decreasing accuracy.
 3. Resource usage is the computational resources needed to perform inference on the model, such as CPU, memory, and storage. Efficient resource usage is crucial if we want to deploy models on devices with certain limitations, such as smartphones or IoT devices.
-Image below shows a common computer vision model in terms of model size, accuracy, and latency. Bigger model has high accuracy, but needs more time for inference and big size.
+The image below shows a common computer vision model in terms of model size, accuracy, and latency. A bigger model has higher accuracy, but needs more time for inference and has a larger file size.
 ![Model Size VS Accuracy](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/model_size_vs_accuracy.png)
diff --git a/chapters/en/unit9/tools_and_frameworks.mdx b/chapters/en/unit9/tools_and_frameworks.mdx
index 21eeb5291..63d51a1d8 100644
--- a/chapters/en/unit9/tools_and_frameworks.mdx
+++ b/chapters/en/unit9/tools_and_frameworks.mdx
@@ -6,7 +6,7 @@
 The TensorFlow Model Optimization Toolkit is a suite of tools for optimizing machine learning models for deployment. The TensorFlow Lite post-training quantization tool enable users to convert weights to 8 bit precision which reduces the trained model size by about 4 times.
-The tools also include API for pruning and quantization during training is post-training quantization is insufficient.
+The tools also include APIs for pruning and quantization during training if post-training quantization is insufficient.
 These help user to reduce latency and inference cost, deploy models to edge devices with restricted resources and optimized execution for existing hardware or new special purpose accelerators.
 ### Setup guide
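The post-training quantization that this last hunk refers to can be sketched in a few lines. This is a minimal, hedged example rather than part of the course text: it assumes a TensorFlow model already exported to a hypothetical `saved_model/` directory and applies TensorFlow Lite dynamic-range quantization, which stores weights in 8-bit precision and accounts for the roughly 4x size reduction quoted above.

```python
# Hedged sketch of TensorFlow Lite post-training (dynamic-range) quantization.
# "saved_model/" is a hypothetical placeholder path, not taken from the diff.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # quantize weights to 8-bit
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```

If this post-training step costs too much accuracy, the pruning and quantization-aware-training APIs the hunk mentions are applied during training and live in the separate `tensorflow_model_optimization` package.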