Copy-Editing/Grammatical Fixes: Chapters 2-13 #326

Merged
merged 6 commits on Aug 14, 2024
2 changes: 1 addition & 1 deletion chapters/en/_toctree.yml
@@ -57,7 +57,7 @@
- title: MobileViT v2
local: "unit3/vision-transformers/mobilevit"
- title: FineTuning Vision Transformer for Object Detection
local: "unit3/vision-transformers/vision-transformer-for-objection-detection"
local: "unit3/vision-transformers/vision-transformer-for-object-detection"
- title: DEtection TRansformer (DETR)
local: "unit3/vision-transformers/detr"
- title: Vision Transformers for Image Segmentation
2 changes: 1 addition & 1 deletion chapters/en/unit13/hyena.mdx
@@ -12,7 +12,7 @@ Developed by Hazy Research, it features subquadratic computational efficiency,

Long convolutions are similar to standard convolutions except the kernel is the size of the input.
It is equivalent to having a global receptive field instead of a local one.
Having an implicitly parametrized convultion means that the convolution filters values are not directly learnt, instead, learning a function that can recover thoses values is prefered.
Having an implicitly parametrized convolution means that the convolution filter's values are not directly learned. Instead, learning a function that can recover those values is preferred.

</Tip>
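
A minimal PyTorch sketch of an implicitly parametrized long convolution, assuming an illustrative `filter_mlp`; this is a toy illustration of the idea, not Hyena's actual implementation:

```python
import torch
import torch.nn as nn

seq_len = 1024
x = torch.randn(2, seq_len)  # (batch, sequence)

# The filter is not a learned weight tensor: a small network maps each
# position in [0, 1] to a filter value, so the kernel can be as long as
# the input (a global receptive field).
filter_mlp = nn.Sequential(nn.Linear(1, 32), nn.GELU(), nn.Linear(32, 1))
positions = torch.linspace(0, 1, seq_len).unsqueeze(-1)
k = filter_mlp(positions).squeeze(-1)  # (seq_len,) implicit long filter

# FFT-based convolution applies the length-L kernel in O(L log L) time.
n = 2 * seq_len
y = torch.fft.irfft(torch.fft.rfft(x, n=n) * torch.fft.rfft(k, n=n), n=n)[..., :seq_len]
```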

@@ -29,7 +29,7 @@ To deepen your understanding of the ins-and-outs of object detection, check out

### The Need to Fine-tune Models in Object Detection 🤔

That is an awesome question. Training an object detection model from scratch means:
Should you build a new model, or alter an existing one? That is an awesome question. Training an object detection model from scratch means:

- Redoing research that has already been done, over and over again.
- Writing repetitive model code, training models, and maintaining different repositories for different use cases.
@@ -59,7 +59,7 @@ So, we are going to fine-tune a lightweight object detection model for doing jus

### Dataset

For the above scenario, we will use the [hardhat](https://huggingface.co/datasets/hf-vision/hardhat) dataset provided by [Northeaster University China](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/7CBGOS). We can download and load this dataset with 🤗 `datasets`.
For the above scenario, we will use the [hardhat](https://huggingface.co/datasets/hf-vision/hardhat) dataset provided by [Northeastern University China](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/7CBGOS). We can download and load this dataset with 🤗 `datasets`.

```python
from datasets import load_dataset
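# A minimal sketch of how the loading might continue; anything beyond
# the import is an assumption, not the course's exact code.
dataset = load_dataset("hf-vision/hardhat")
print(dataset)
```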
@@ -88,7 +88,7 @@ A detailed section on multimodal tasks and models with a focus on Vision and Tex
## An application of multimodality: Multimodal Search 🔎📲💻

Internet search was the one key advantage Google had, but with the introduction of ChatGPT by OpenAI, Microsoft started out with
powering up their Bing search engine so that they can crush the competition. It was only restricted initially to LLMs, looking into large corpus of text data but the world around us, mainly social media content, web articles and all possible form of online content is largely multimodal. When we search about an image, the image pops up with a corresponding text to describe it. Won't it be super cool to have another powerful multimodal model which involved both Vision and Text at the same time? This can revolutionize the search landscape hugely, and the core tech involved in this is multimodal learning. We know that many companies also have a large database which is multimodal and mostly unstructured in nature. Multimodal models might help companies with internal search, interactive documentation (chatbots), and many such use cases. This is another domain of Enterprise AI where we leverage AI for organizational intelligence.
powering up their Bing search engine so that it could crush the competition. Initially it was restricted to LLMs, which look at a large corpus of text data, but the world around us, mainly social media content, web articles, and all possible forms of online content, is largely multimodal. When we search for an image, the image pops up with a corresponding text to describe it. Wouldn't it be super cool to have a powerful multimodal model that involves both Vision and Text at the same time? This could revolutionize the search landscape, and the core tech involved is multimodal learning. We know that many companies also have a large database which is multimodal and mostly unstructured in nature. Multimodal models might help companies with internal search, interactive documentation (chatbots), and many such use cases. This is another domain of Enterprise AI where we leverage AI for organizational intelligence.

Vision Language Models (VLMs) are models that can understand and process both vision and text modalities. The joint understanding of both modalities leads VLMs to perform various tasks efficiently, like Visual Question Answering and text-to-image search. VLMs can thus serve as one of the best candidates for multimodal search. So overall, VLMs should find some way to map text and image pairs to a joint embedding space where each text-image pair is present as an embedding. We can perform various downstream tasks using these embeddings, which can also be used for search. The idea of such a joint space is that image and text embeddings that are similar in meaning will lie close together, enabling us to search for images based on text (text-to-image search) or vice versa.

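As a rough sketch of how such a joint embedding space enables text-to-image search, here is a hedged example using the openly available CLIP checkpoint via 🤗 `transformers` (the model choice, file names, and scoring are illustrative assumptions, not something this unit prescribes):

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open("hardhat.jpg"), Image.open("street.jpg")]  # placeholder files
query = "a worker wearing a hardhat"

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Text and image embeddings live in the same space, so cosine similarity
# ranks images against the text query.
image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
scores = (text_emb @ image_emb.T).squeeze(0)
print(scores.argsort(descending=True))  # best-matching images first
```
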
2 changes: 1 addition & 1 deletion chapters/en/unit9/intro_to_model_optimization.mdx
@@ -32,7 +32,7 @@ A trade-off exists between accuracy, performance, and resource usage when deploy
2. Performance is the model's speed and efficiency (latency). This is important so the model can make predictions quickly, even in real time. However, optimizing performance will usually result in decreasing accuracy.
3. Resource usage is the computational resources needed to perform inference on the model, such as CPU, memory, and storage. Efficient resource usage is crucial if we want to deploy models on devices with certain limitations, such as smartphones or IoT devices.

Image below shows a common computer vision model in terms of model size, accuracy, and latency. Bigger model has high accuracy, but needs more time for inference and big size.
The image below shows common computer vision models in terms of model size, accuracy, and latency. A bigger model has higher accuracy, but needs more time for inference and has a larger file size.

![Model Size VS Accuracy](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/model_size_vs_accuracy.png)
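
To make this trade-off concrete, here is a minimal sketch of measuring inference latency and parameter count for one model (PyTorch and torchvision are assumed purely for illustration; numbers will vary by hardware):

```python
import time
import torch
import torchvision.models as models

model = models.resnet50(weights=None).eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    for _ in range(3):  # warm-up runs
        model(x)
    start = time.perf_counter()
    for _ in range(10):
        model(x)
    latency_ms = (time.perf_counter() - start) / 10 * 1000

n_params = sum(p.numel() for p in model.parameters())
print(f"params: {n_params / 1e6:.1f}M, latency: {latency_ms:.1f} ms")
```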

2 changes: 1 addition & 1 deletion chapters/en/unit9/tools_and_frameworks.mdx
@@ -6,7 +6,7 @@

The TensorFlow Model Optimization Toolkit is a suite of tools for optimizing machine learning models for deployment.
The TensorFlow Lite post-training quantization tool enables users to convert weights to 8-bit precision, which reduces the trained model size by about 4 times.
The tools also include API for pruning and quantization during training is post-training quantization is insufficient.
The tools also include APIs for pruning and quantization during training if post-training quantization is insufficient.
These help users reduce latency and inference cost, deploy models to edge devices with restricted resources, and optimize execution for existing hardware or new special-purpose accelerators.
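
As a minimal sketch of the post-training quantization flow described above (the SavedModel path and output file name are placeholder assumptions):

```python
import tensorflow as tf

# Convert a trained SavedModel to TensorFlow Lite with default
# post-training quantization, storing weights in 8-bit precision.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```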

### Setup guide