diff --git a/chapters/en/unit0/welcome/welcome.mdx b/chapters/en/unit0/welcome/welcome.mdx index 21a73e5b2..6d6820b9a 100644 --- a/chapters/en/unit0/welcome/welcome.mdx +++ b/chapters/en/unit0/welcome/welcome.mdx @@ -12,7 +12,7 @@ On this page, you can find how to join the learners community, make a submission To obtain your certification for completing the course, complete the following assignments: -1. Training/fine-tuning a Model +1. Training/fine-tuning a model 2. Building an application and hosting it on Hugging Face Spaces ### Training/fine-tuning a Model @@ -21,7 +21,8 @@ There are notebooks under the Notebooks/Vision Transformers section. As of now, The model repository needs to have the following: -1. A properly filled model card, you can check out [here for more information](https://huggingface.co/docs/hub/en/model-cards) + +1. A properly filled model card, you can check out [here for more information](https://huggingface.co/docs/hub/en/model-cards). 2. If you trained a model with transformers and pushed it to Hub, the model card will be generated. In that case, edit the card and fill in more details. 3. Add the dataset’s ID to the model card to link the model repository to the dataset repository. @@ -34,7 +35,7 @@ In this assignment section, you'll be building a Gradio-based application for yo ## Certification 🥇 -Once you've finished the assignments — Training/fine-tuning a Model and Creating a Space — please complete the [form](https://forms.gle/isiVSw59oiiHP6pN9) with your name, email, and links to your model and Space repositories to receive your certificate +Once you've finished the assignments — Training/fine-tuning a Model and Creating a Space — please complete the [form](https://forms.gle/isiVSw59oiiHP6pN9) with your name, email, and links to your model and Space repositories to receive your certificate. ## Join the community! @@ -50,8 +51,8 @@ There are many channels focused on various topics on our Discord server. You wil As a computer vision course learner, you may find the following set of channels particularly relevant: -- `#computer-vision`: a catch-all channel for everything related to computer vision. -- `#cv-study-group`: a place to exchange ideas, ask questions about specific posts and start discussions. +- `#computer-vision`: a catch-all channel for everything related to computer vision +- `#cv-study-group`: a place to exchange ideas, ask questions about specific posts and start discussions - `#3d`: a channel to discuss aspects of computer vision specific to 3D computer vision If you are interested in generative AI, we also invite you to join all channels related to the Diffusion Models: #core-announcements, #discussions, #dev-discussions, and #diff-i-made-this. diff --git a/chapters/en/unit1/feature-extraction/feature-matching.mdx b/chapters/en/unit1/feature-extraction/feature-matching.mdx index ec527f02c..c412aa12b 100644 --- a/chapters/en/unit1/feature-extraction/feature-matching.mdx +++ b/chapters/en/unit1/feature-extraction/feature-matching.mdx @@ -8,7 +8,7 @@ Imagine you have a giant box of puzzle pieces, and you're trying to find a speci Now that we have an intuitive idea of how brute-force matches are found, let's dive into the algorithms. We are going to use the descriptors that we learned about in the previous chapter to find the matching features in two images. -First install and load libraries +First install and load libraries. 
```bash !pip install opencv-python @@ -137,13 +137,13 @@ We also create a dictionary to specify the maximum leafs to visit as follows. search_params = dict(checks=50) ``` -Initiate SIFT detector +Initiate SIFT detector. ```python sift = cv2.SIFT_create() ``` -Find the keypoints and descriptors with SIFT +Find the keypoints and descriptors with SIFT. ```python kp1, des1 = sift.detectAndCompute(img1, None) @@ -259,7 +259,7 @@ Fm, inliers = cv2.findFundamentalMat(mkpts0, mkpts1, cv2.USAC_MAGSAC, 0.5, 0.999 inliers = inliers > 0 ``` -Finally, we can visualize the matches +Finally, we can visualize the matches. ```python draw_LAF_matches( diff --git a/chapters/en/unit1/image_and_imaging/examples-preprocess.mdx b/chapters/en/unit1/image_and_imaging/examples-preprocess.mdx index 21fe9222a..05f58658e 100644 --- a/chapters/en/unit1/image_and_imaging/examples-preprocess.mdx +++ b/chapters/en/unit1/image_and_imaging/examples-preprocess.mdx @@ -9,7 +9,7 @@ In digital image processing, operations on images are diverse and can be categor - Statistical - Geometrical - Mathematical -- Transform operations. +- Transform operations Each category encompasses different techniques, such as morphological operations under logical operations or fourier transforms and principal component analysis (PCA) under transforms. In this context, we refer to morphology as the group of operations that use structuring elements to generate images of the same size by looking into the values of the pixel neighborhood. Understanding the distinction between element-wise and matrix operations is important in image manipulation. Element-wise operations, such as raising an image to a power or dividing it by another image, involve processing each pixel individually. This pixel-based approach contrasts with matrix operations, which utilize matrix theory for image manipulation. Having said that, you can do whatever you want with images, as they are matrices containing numbers! diff --git a/chapters/en/unit1/image_and_imaging/imaging.mdx b/chapters/en/unit1/image_and_imaging/imaging.mdx index d718071ae..59f497341 100644 --- a/chapters/en/unit1/image_and_imaging/imaging.mdx +++ b/chapters/en/unit1/image_and_imaging/imaging.mdx @@ -16,7 +16,7 @@ The core of digital image formation is the function \\(f(x,y)\\), which is deter In transmission-based imaging, such as X-rays, transmissivity takes the place of reflectivity. The digital representation of an image is essentially a matrix or array of numerical values, each corresponding to a pixel. The process of transforming continuous image data into a digital format is twofold: -- Sampling, which digitizes the coordinate values +- Sampling, which digitizes the coordinate values. - Quantization, which converts amplitude values into discrete quantities. 
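To make these two steps concrete, here is a minimal sketch using NumPy and Pillow (both assumed to be installed; `photo.jpg` is only a placeholder file name): coarser sampling is simulated by keeping every fourth pixel, and coarser quantization by reducing 256 grey levels to 8.

```python
import numpy as np
from PIL import Image

# Load an image and treat it as the discrete function f(x, y) on a grid of pixels
img = np.array(Image.open("photo.jpg").convert("L"))  # greyscale, values 0-255

# Sampling: keep every 4th coordinate in both directions (a coarser spatial grid)
sampled = img[::4, ::4]

# Quantization: map the 256 possible amplitudes down to 8 discrete levels
levels = 8
step = 256 // levels
quantized = (img // step) * step

print(img.shape, "->", sampled.shape)  # e.g. (480, 640) -> (120, 160)
print(np.unique(quantized).size)       # at most 8 distinct intensity values
```

Reducing either the sampling density or the number of quantization levels visibly degrades the result, which leads directly into the factors listed next.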
The resolution and quality of a digital image significantly depend on the following: diff --git a/chapters/en/unit10/blenderProc.mdx b/chapters/en/unit10/blenderProc.mdx index 8110ad70e..8ff6d2a8c 100644 --- a/chapters/en/unit10/blenderProc.mdx +++ b/chapters/en/unit10/blenderProc.mdx @@ -98,16 +98,20 @@ It is specifically created to help in the generation of realistic looking images You can install BlenderProc via pip: ```bash - pip install blenderProc +pip install blenderProc ``` Alternately, you can clone the official [BlenderProc repository](https://github.com/DLR-RM/BlenderProc) from GitHub using Git: -`git clone https://github.com/DLR-RM/BlenderProc` +```bash +git clone https://github.com/DLR-RM/BlenderProc +``` BlenderProc must be run inside the blender python environment (bpy), as this is the only way to access the Blender API. -`blenderproc run ` +```bash +blenderproc run +``` You can check out this notebook to try BlenderProc in Google Colab, demos the basic examples provided [here](https://github.com/DLR-RM/BlenderProc/tree/main/examples/basics). Here are some images rendered with the basic example: diff --git a/chapters/en/unit10/datagen-diffusion-models.mdx b/chapters/en/unit10/datagen-diffusion-models.mdx index fc8772f03..b9ad71c77 100644 --- a/chapters/en/unit10/datagen-diffusion-models.mdx +++ b/chapters/en/unit10/datagen-diffusion-models.mdx @@ -59,7 +59,7 @@ This means we have many tools under our belt to generate synthetic data! ## Approaches to Synthetic Data Generation -There are generally three cases for needing synthetic data, +There are generally three cases for needing synthetic data: **Extending an existing dataset:** diff --git a/chapters/en/unit10/point_clouds.mdx b/chapters/en/unit10/point_clouds.mdx index e5ec8007c..6a470c3f7 100644 --- a/chapters/en/unit10/point_clouds.mdx +++ b/chapters/en/unit10/point_clouds.mdx @@ -22,22 +22,22 @@ The 3D Point Data is mainly used in self-driving capabilities, but now other AI ## Generation and Data Representation -We will be using the python library [point-cloud-utils](https://github.com/fwilliams/point-cloud-utils), and [open-3d](https://github.com/isl-org/Open3D), which can be installed by +We will be using the python library [point-cloud-utils](https://github.com/fwilliams/point-cloud-utils), and [open-3d](https://github.com/isl-org/Open3D), which can be installed by: ```bash - pip install point-cloud-utils +pip install point-cloud-utils ``` -We will be also using the python library open-3d, which can be installed by +We will be also using the python library open-3d, which can be installed by: ```bash - pip install open3d +pip install open3d ``` -OR a Smaller CPU only version +OR a Smaller CPU only version: ```bash - pip install open3d-cpu +pip install open3d-cpu ``` Now, first we need to understand the formats in which these point clouds are stored in, and for that, we need to look at mesh cloud. @@ -53,13 +53,13 @@ The type of file is inferred from its file extension. Some of the extensions sup - A simple PLY object consists of a collection of elements for representation of the object. It consists of a list of (x,y,z) triplets of a vertex and a list of faces that are actually indices into the list of vertices. - Vertices and faces are two examples of elements and the majority of the PLY file consists of these two elements. 
-- New properties can also be created and attached to the elements of an object, but these should be added in such a way that old programs do not break when these new properties are encountered +- New properties can also be created and attached to the elements of an object, but these should be added in such a way that old programs do not break when these new properties are encountered. ** STL (Standard Tessellation Language) ** - This format approximates the surfaces of a solid model with triangles. - These triangles are also known as facets, where each facet is described by a perpendicular direction and three points representing the vertices of the triangle. -- However, these files have no description of Color and Texture +- However, these files have no description of Color and Texture. ** OFF (Object File Format) ** @@ -77,11 +77,11 @@ The type of file is inferred from its file extension. Some of the extensions sup - X3D is an XML based 3D graphics file format for presentation of 3D information. It is a modular standard and is defined through several ISO specifications. - The format supports vector and raster graphics, transparency, lighting effects, and animation settings including rotations, fades, and swings. -- X3D has the advantage of encoding color information (unlike STL) that is used during printing the model on a color 3D printer +- X3D has the advantage of encoding color information (unlike STL) that is used during printing the model on a color 3D printer. ** DAE (Digital Asset Exchange) ** - This is an XML schema which is an open standard XML schema, from which DAE files are built. -- This file format is based on the COLLADA (COLLAborative Design Activity) XML schema which is an open standard XML schema for the exchange of digital assets among graphics software applications +- This file format is based on the COLLADA (COLLAborative Design Activity) XML schema which is an open standard XML schema for the exchange of digital assets among graphics software applications. - The format's biggest selling point is its compatibility across multiple platforms. - COLLADA files aren't restricted to one program or manufacturer. Instead, they offer a standard way to store 3D assets. diff --git a/chapters/en/unit10/synthetic-lung-images.mdx b/chapters/en/unit10/synthetic-lung-images.mdx index 3a2cb2e90..1527306db 100644 --- a/chapters/en/unit10/synthetic-lung-images.mdx +++ b/chapters/en/unit10/synthetic-lung-images.mdx @@ -15,7 +15,7 @@ The generator has the following model architecture: - Conv2D layer - Batch Normalization layer - ReLU activation -- Conv2D layer with Tanh activation +- Conv2D layer with Tanh activation. The discriminator has the following model architecture: @@ -27,7 +27,7 @@ The discriminator has the following model architecture: - Conv2D layer - Batch Normalization layer - Leaky ReLU activation -- Conv2D layer with Sigmoid +- Conv2D layer with Sigmoid. 
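As a rough illustration of the discriminator pattern listed above, here is a minimal PyTorch sketch; the image size, channel counts, kernel sizes, and strides are assumptions for illustration rather than the exact values used in this work.

```python
import torch
import torch.nn as nn


class Discriminator(nn.Module):
    """Conv2D -> BatchNorm -> LeakyReLU blocks, ending in a Conv2D + Sigmoid, as listed above."""

    def __init__(self, channels=1, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, base, 4, stride=2, padding=1),      # 64x64 -> 32x32
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1),      # 32x32 -> 16x16
            nn.BatchNorm2d(base * 2),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1),  # 16x16 -> 8x8
            nn.BatchNorm2d(base * 4),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 4, 1, 8),                              # 8x8 -> 1x1 real/fake score
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x).view(-1)


fake_batch = torch.randn(4, 1, 64, 64)    # e.g. 4 greyscale 64x64 images
print(Discriminator()(fake_batch).shape)  # torch.Size([4])
```

The generator mirrors this idea with the Conv2D/BatchNorm/ReLU blocks from the first list and a final Tanh, usually combined with upsampling (for example, transposed convolutions) so the output grows back to image resolution.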
**Data Collection** diff --git a/chapters/en/unit10/synthetic_datasets.mdx b/chapters/en/unit10/synthetic_datasets.mdx index cff284d2f..1d912689c 100644 --- a/chapters/en/unit10/synthetic_datasets.mdx +++ b/chapters/en/unit10/synthetic_datasets.mdx @@ -40,7 +40,7 @@ Semantic segmentation is vital for autonomous vehicles to interpret and navigate | Name | Year | Description | Paper | | Additional Links | |---------------------|--------------|-------------|----------------|---------------------|---------------------| | Virtual KITTI 2 | 2020 | Virtual Worlds as Proxy for Multi-Object Tracking Analysis | [Virtual KITTI 2](https://arxiv.org/pdf/2001.10773.pdf) | | [Website](https://europe.naverlabs.com/Research/Computer-Vision/Proxy-Virtual-Worlds/) | -| ApolloScape | 2019 | Compared with existing public datasets from real scenes, e.g. KITTI [2] or Cityscapes [3], ApolloScape contains much large and richer labeling including holistic semantic dense point cloud for each site, stereo, per-pixel semantic labeling, lane-mark labeling, instance segmentation, 3D car instance, high accurate location for every frame in various driving videos from multiple sites, cities, and daytimes | [The ApolloScape Open Dataset for Autonomous Driving and its Application](https://arxiv.org/abs/1803.06184) | | [Website](https://apolloscape.auto/) | +| ApolloScape | 2019 | Compared with existing public datasets from real scenes, e.g. KITTI [2] or Cityscapes [3], ApolloScape contains much large and richer labeling including holistic semantic dense point cloud for each site, stereo, per-pixel semantic labeling, lane-mark labeling, instance segmentation, 3D car instance, high accurate location for every frame in various driving videos from multiple sites, cities, and daytimes. | [The ApolloScape Open Dataset for Autonomous Driving and its Application](https://arxiv.org/abs/1803.06184) | | [Website](https://apolloscape.auto/) | | Driving in the Matrix | 2017 | The core idea behind "Driving in the Matrix" is to use photo-realistic computer-generated images from a simulation engine to produce annotated data quickly. | [Driving in the Matrix: Can Virtual Worlds Replace Human-Generated Annotations for Real World Tasks?](https://arxiv.org/pdf/1610.01983.pdf) | | [GitHub](https://github.com/umautobots/driving-in-the-matrix) ![GitHub stars](https://img.shields.io/github/stars/umautobots/driving-in-the-matrix.svg?style=social&label=Star) | | CARLA | 2017 | **CARLA** (CAR Learning to Act) is an open simulator for urban driving, developed as an open-source layer over Unreal Engine 4. Technically, it operates similarly to, as an open source layer over Unreal Engine 4 that provides sensors in the form of RGB cameras (with customizable positions), ground truth depth maps, ground truth semantic segmentation maps with 12 semantic classes designed for driving (road, lane marking, traffic sign, sidewalk and so on), bounding boxes for dynamic objects in the environment, and measurements of the agent itself (vehicle location and orientation). | [CARLA: An Open Urban Driving Simulator](https://arxiv.org/pdf/1711.03938v1.pdf) | | [Website](https://carla.org/) | | Synthia | 2016 | A large collection of synthetic images for semantic segmentation of urban scenes. SYNTHIA consists of a collection of photo-realistic frames rendered from a virtual city and comes with precise pixel-level semantic annotations for 13 classes: misc, sky, building, road, sidewalk, fence, vegetation, pole, car, sign, pedestrian, cyclist, lane-marking. 
| [The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes](https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Ros_The_SYNTHIA_Dataset_CVPR_2016_paper.html) | | [Website](https://synthia-dataset.net/) | diff --git a/chapters/en/unit12/conclusion.mdx b/chapters/en/unit12/conclusion.mdx index 5e08e413a..6cd518d88 100644 --- a/chapters/en/unit12/conclusion.mdx +++ b/chapters/en/unit12/conclusion.mdx @@ -67,7 +67,7 @@ This is work that highlights and explores techniques for making machine learning ### 🧑‍🤝‍🧑 Inclusive These are projects which broaden the scope of who builds and benefits in the machine learning world. Some examples: -- Curating diverse datasets that increase the representation of underserved groups +- Curating diverse datasets that increase the representation of underserved groups. - Training language models on languages that aren't yet available on the Hugging Face Hub. - Creating no-code and low-code frameworks that allow non-technical folk to engage with AI. diff --git a/chapters/en/unit13/hyena.mdx b/chapters/en/unit13/hyena.mdx index 87812dd46..63535df3b 100644 --- a/chapters/en/unit13/hyena.mdx +++ b/chapters/en/unit13/hyena.mdx @@ -91,8 +91,8 @@ Some work has been conducted to speed up this computation like FastFFTConv based ![nd_hyena.png](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/outlook_hyena_images/nd_hyena.png) In essence, Hyena can be performed in two steps: -1. Compute a set of N+1 linear projections similarly of attention (it can be more than 3 projections) -2. Mixing up the projections: The matrix \\(H(u)\\) is defined by a combination of matrix multiplications +1. Compute a set of N+1 linear projections similarly of attention (it can be more than 3 projections). +2. Mixing up the projections: The matrix \\(H(u)\\) is defined by a combination of matrix multiplications. ## Why Hyena Matters @@ -113,7 +113,7 @@ Hyena has been applied to N-Dimensional data with the Hyena N-D layer and can be here is a noticeable enhancement in GPU memory efficiency with the increase in the number of image patches. Hyena Hierarchy facilitates the development of larger, more efficient convolution models for long sequences. -The potential for Hyena type models for computer vision would be a more efficient GPU memory consumption of patches, that would allow : +The potential for Hyena type models for computer vision would be a more efficient GPU memory consumption of patches, that would allow: - The processing of larger, higher-resolution images - The use of smaller patches, allowing a fine-graine feature representation diff --git a/chapters/en/unit2/cnns/convnext.mdx b/chapters/en/unit2/cnns/convnext.mdx index 35000590f..4d67f0cc9 100644 --- a/chapters/en/unit2/cnns/convnext.mdx +++ b/chapters/en/unit2/cnns/convnext.mdx @@ -9,12 +9,13 @@ ConvNext represents a significant improvement to pure convolution models by inco ## Key Improvements The author of the ConvNeXT paper starts building the model with a regular ResNet (ResNet-50), then modernizes and improves the architecture step-by-step to imitate the hierarchical structure of Vision Transformers. The key improvements are: -- Training Techniques -- Macro Design +- Training techniques +- Macro design - ResNeXt-ify -- Inverted Bottleneck -- Large Kernel Sizes -- Micro Design +- Inverted bottleneck +- Large kernel sizes +- Micro design + We will go through each of the key improvements. These designs are not novel in itself. 
However, you can learn how researchers adapt and modify designs systematically to improve existing models. To show the effectiveness of each improvement, we will compare the model's accuracy before and after the modification on ImageNet-1K. diff --git a/chapters/en/unit3/vision-transformers/cvt.mdx b/chapters/en/unit3/vision-transformers/cvt.mdx index 740bd952e..281754573 100644 --- a/chapters/en/unit3/vision-transformers/cvt.mdx +++ b/chapters/en/unit3/vision-transformers/cvt.mdx @@ -61,7 +61,7 @@ from einops import rearrange from einops.layers.torch import Rearrange ``` -2. Implementation of **Convolutional Projection**. +2. Implementation of **Convolutional Projection** ```python def _build_projection(self, dim_in, dim_out, kernel_size, padding, stride, method): diff --git a/chapters/en/unit3/vision-transformers/detr.mdx b/chapters/en/unit3/vision-transformers/detr.mdx index f0bfdf98e..75548178a 100644 --- a/chapters/en/unit3/vision-transformers/detr.mdx +++ b/chapters/en/unit3/vision-transformers/detr.mdx @@ -138,12 +138,12 @@ class DETR(nn.Module): ``` ### Going line by line in the `forward` function: **Backbone** -The input image is first put through a ResNet backbone and then a convolution layer, which reduces the dimension to the `hidden_dim` +The input image is first put through a ResNet backbone and then a convolution layer, which reduces the dimension to the `hidden_dim`. ```python x = self.backbone(inputs) h = self.conv(x) ``` -they are declared in the `__init__` function +They are declared in the `__init__` function. ```python self.backbone = nn.Sequential(*list(resnet50(pretrained=True).children())[:-2]) self.conv = nn.Conv2d(2048, hidden_dim, 1) @@ -171,7 +171,7 @@ self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2)) self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2)) ``` **Resize** -Before going into the transformer, the features with size `(batch size, hidden_dim, H, W)` are reshaped to `(hidden_dim, batch size, H*W)`. This makes them a sequential input for the transformer +Before going into the transformer, the features with size `(batch size, hidden_dim, H, W)` are reshaped to `(hidden_dim, batch size, H*W)`. This makes them a sequential input for the transformer. ```python h.flatten(2).permute(2, 0, 1) ``` @@ -185,7 +185,7 @@ In the end, the outputs, which is a tensor of size `(query_pos_dim, batch size, ```python return self.linear_class(h), self.linear_bbox(h).sigmoid() ``` -The first of which predicts the class. An additional class is added for the `No Object` class +The first of which predicts the class. An additional class is added for the `No Object` class. ```python self.linear_class = nn.Linear(hidden_dim, num_classes + 1) ``` diff --git a/chapters/en/unit3/vision-transformers/mobilevit.mdx b/chapters/en/unit3/vision-transformers/mobilevit.mdx index f3bee3942..f5492e5fa 100644 --- a/chapters/en/unit3/vision-transformers/mobilevit.mdx +++ b/chapters/en/unit3/vision-transformers/mobilevit.mdx @@ -23,7 +23,7 @@ A diagram of the MobileViT Block is shown below: Okay, that's a lot to take in. Let's break that down. - The block takes in an image with multiple channels. Let's say for an RGB image 3 channels, so the block takes in a three channeled image. -- It then performs a N by N convolution on the channels appending them to the existing channels +- It then performs a N by N convolution on the channels appending them to the existing channels. 
- The block then creates a linear combination of these channels and adds them to the existing stack of channels. - For each channel these images are unfolded into flattened patches. - Then these flattened patches are passed through a transformer to project them into new patches. diff --git a/chapters/en/unit3/vision-transformers/swin-transformer.mdx b/chapters/en/unit3/vision-transformers/swin-transformer.mdx index 3a9f748b4..6d52f8709 100644 --- a/chapters/en/unit3/vision-transformers/swin-transformer.mdx +++ b/chapters/en/unit3/vision-transformers/swin-transformer.mdx @@ -40,7 +40,7 @@ Key parts of the [implementation of Swin from the original paper](https://github 1. **Initialize Parameters**. Among various other dropout and normalization parameters, these parameters include: - `window_size`: Size of the windows for local self-attention. - `ape (bool)`: If True, add absolute position embedding to the patch embedding. - - `fused_window_process`: Optional hardware optimization + - `fused_window_process`: Optional hardware optimization. 2. **Apply Patch Embedding**: Similar to ViT, Images are split into non-overlapping patches and linearly embedded using `Conv2D`. @@ -52,7 +52,7 @@ Key parts of the [implementation of Swin from the original paper](https://github - The model is composed of multiple layers (`BasicLayer`) of `SwinTransformerBlock`s, each downsampling the feature map for hierarchical processing using `PatchMerging`. - The dimensionality of features and resolution of feature maps change across layers. -7. **Classification Head**: Similar to ViT, it uses an Multi-Layer Perceptron (MLP) head for classification tasks, as defined in `self.head`, as the last step +7. **Classification Head**: Similar to ViT, it uses an Multi-Layer Perceptron (MLP) head for classification tasks, as defined in `self.head`, as the last step. ```python class SwinTransformer(nn.Module): @@ -379,9 +379,9 @@ The feature map is partitioned into windows via `window_partition`. A **cyclic s Cyclic shift allows the model to capture relationships between adjacent windows, enhancing its ability to learn spatial contexts beyond the local scope of individual windows. -2. **Windowed attention**: Perform attention using window-based multi-head self attention (W-MSA) module +2. **Windowed attention**: Perform attention using window-based multi-head self attention (W-MSA) module. -3. **Merge Patches**: Patches are merged via `PatchMerging` +3. **Merge Patches**: Patches are merged via `PatchMerging`. 4. **Reverse cyclic shift**: After attention is done, the window partitioning is undone via `reverse_window`, and the cyclic shift operation is reversed, so that the feature map retains its original form. diff --git a/chapters/en/unit3/vision-transformers/vision-transformer-for-objection-detection.mdx b/chapters/en/unit3/vision-transformers/vision-transformer-for-objection-detection.mdx index 1d25c6037..9b341fc9c 100644 --- a/chapters/en/unit3/vision-transformers/vision-transformer-for-objection-detection.mdx +++ b/chapters/en/unit3/vision-transformers/vision-transformer-for-objection-detection.mdx @@ -15,17 +15,17 @@ This section will describe how object detection tasks are achieved using Vision Object detection is a computer vision task that involves identifying and localizing objects within an image or video. It consists of two main steps: -- First, recognizing the types of objects present (such as cars, people, or animals), +- First, recognizing the types of objects present (such as cars, people, or animals). 
- Second, determining their precise locations by drawing bounding boxes around them. These models typically receive images (static or frames from videos) as their inputs, with multiple objects present in each image. For example, consider an image containing several objects such as cars, people, bicycles, and so on. Upon processing the input, these models produce a set of numbers that convey the following information: -- Location of the object (XY coordinates of the bounding box) +- Location of the object (XY coordinates of the bounding box). - Class of the object. There are a lot of of applications around object detection. One of the most significant examples is in the field of autonomous driving, where object detection is used to detect different objects (like pedestrians, road signs, traffic lights, etc) around the car that become one of the inputs for taking decisions. -To deepen your understanding of the ins-and-outs of object detection, check out our [dedicated chapter](/chapters/en/Unit%206%20-%20Basic%20CV%20Tasks/object_detection.mdx) on Object Detection 🤗 +To deepen your understanding of the ins-and-outs of object detection, check out our [dedicated chapter](https://huggingface.co/learn/computer-vision-course/unit6/basic-cv-tasks/object_detection) on Object Detection 🤗. ### The Need to Fine-tune Models in Object Detection 🤔 diff --git a/chapters/en/unit3/vision-transformers/vision-transformers-for-image-segmentation.mdx b/chapters/en/unit3/vision-transformers/vision-transformers-for-image-segmentation.mdx index d4cd6c662..d807fb5ac 100644 --- a/chapters/en/unit3/vision-transformers/vision-transformers-for-image-segmentation.mdx +++ b/chapters/en/unit3/vision-transformers/vision-transformers-for-image-segmentation.mdx @@ -40,7 +40,7 @@ The architecture is composed of three components: **Segmentation Module**: Generates class probability predictions and mask embeddings for each segment using a linear classifier and a Multi-Layer Perceptron (MLP), respectively. The mask embeddings are used in combination with per-pixel embeddings to predict binary masks for each segment. -The model is trained with a binary mask loss, the same one as [DETR](https://github.com/johko/computer-vision-course/blob/9ad9b01f2383377ac9482dcbe02c91465b573b0b/chapters/en/Unit%203%20-%20Vision%20Transformers/Common%20Vision%20Transformers%20-%20DETR.mdx), and a cross-entropy classification loss per predicted segment. +The model is trained with a binary mask loss, the same one as [DETR](https://huggingface.co/learn/computer-vision-course/unit3/vision-transformers/detr), and a cross-entropy classification loss per predicted segment. ### Panoptic Segmentation Inference Example with Hugging Face Transformers diff --git a/chapters/en/unit4/multimodal-models/a_multimodal_world.mdx b/chapters/en/unit4/multimodal-models/a_multimodal_world.mdx index bb7ac3801..ebe7c18ec 100644 --- a/chapters/en/unit4/multimodal-models/a_multimodal_world.mdx +++ b/chapters/en/unit4/multimodal-models/a_multimodal_world.mdx @@ -38,7 +38,7 @@ For the machines around us to be more intelligent, better at communicating with - Vision + Text : Infographics, Memes, Articles, Blogs. - Vision + Audio: A Skype call with your friend, dyadic conversations. - Vision + Audio + Text: Watching YouTube videos or movies with captions, social media content in general is multimodal. -- Audio + Text: Voice notes, music files with lyrics +- Audio + Text: Voice notes, music files with lyrics. 
## Multimodal Datasets @@ -64,7 +64,7 @@ Hugging Face supports a wide variety of multimodal tasks. Let us look into some - [Visual Question Answering or VQA](https://huggingface.co/tasks/visual-question-answering): Aiding visually impaired persons, efficient image retrieval, video search, Video Question Answering, Document VQA. - [Image to Text](https://huggingface.co/tasks/image-to-text): Image Captioning, Optical Character Recognition (OCR), Pix2Struct. -- [Text to Image](https://huggingface.co/tasks/text-to-image): Image Generation +- [Text to Image](https://huggingface.co/tasks/text-to-image): Image Generation. - [Text to Video](https://huggingface.co/tasks/text-to-video): Text-to-video editing, Text-to-video search, Video Translation, Text-driven Video Prediction. 2. Audio + Text: diff --git a/chapters/en/unit4/multimodal-models/clip-and-relatives/clip.mdx b/chapters/en/unit4/multimodal-models/clip-and-relatives/clip.mdx index 8186b5b9e..0eb51e3a3 100644 --- a/chapters/en/unit4/multimodal-models/clip-and-relatives/clip.mdx +++ b/chapters/en/unit4/multimodal-models/clip-and-relatives/clip.mdx @@ -56,7 +56,7 @@ probs = logits_per_image.softmax(dim=1) After executing this code, we got the following probabilities: - "a photo of a cat": 99.49% -- "a photo of a dog": 0.51% +- "a photo of a dog": 0.51% ## Limitations diff --git a/chapters/en/unit4/multimodal-models/tasks-models-part1.mdx b/chapters/en/unit4/multimodal-models/tasks-models-part1.mdx index 5b69c539f..4cae51578 100644 --- a/chapters/en/unit4/multimodal-models/tasks-models-part1.mdx +++ b/chapters/en/unit4/multimodal-models/tasks-models-part1.mdx @@ -187,7 +187,7 @@ Learn more about how to train and use DocVQA models in HuggingFace `transformers *Example of Input (Image) and Output (Text) for the Image Captioning Model. [[1]](#pretraining-paper)* - **Inputs:** - Image: Image in various formats (e.g., JPEG, PNG). - - Pre-trained image feature extractor (optional): A pre-trained neural network that can extract meaningful features from images, such as a convolutional neural network (CNN) + - Pre-trained image feature extractor (optional): A pre-trained neural network that can extract meaningful features from images, such as a convolutional neural network (CNN). - **Outputs:** Textual captions: Single Sentence or Paragraph that accurately describe the content of the input images, capturing objects, actions, relationships, and overall context. See the above example for the reference. - **Task:** To automatically generate natural language descriptions of images. This involves: (1) Understanding the visual content of the image (objects, actions, relationships). (2) Encoding this information into a meaningful representation. (3) Decoding this representation into a coherent, grammatically correct, and informative sentence or phrase. @@ -350,7 +350,7 @@ You can try out the Grounding DINO model in the Google Colab [here](https://cola Now, let's how can we use text-image generation models in HuggingFace. 
-Install `diffusers` library +Install `diffusers` library: ```bash pip install diffusers --upgrade ``` diff --git a/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx b/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx index 15434c11d..f53b7a5be 100644 --- a/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx +++ b/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx @@ -3,16 +3,16 @@ What you will learn from this chapter: - What are diffusion models and how do they differ from GANs -- Major sub categories of Diffusion models -- Use cases of Diffusion models -- Drawback in Diffusion models +- Major sub categories of diffusion models +- Use cases of diffusion models +- Drawback in diffusion models ## Diffusion Models and their Difference from GANs Diffusion models are a new and exciting area in computer vision that has shown impressive results in creating images. These generative models work on two stages, a forward diffusion stage and a reverse diffusion stage: first, they slightly change the input data by adding some noise, and then they try to undo these changes to get back to the original data. This process of making changes and then undoing them helps generate realistic images. -These generative models raised the bar to a new level in the area of generative modeling, particularly referring to models such as [Imagen](https://imagen.research.google/) and [Latent Diffusion Models](https://arxiv.org/abs/2112.10752)(LDMs). For instance consider the below images generated via such models +These generative models raised the bar to a new level in the area of generative modeling, particularly referring to models such as [Imagen](https://imagen.research.google/) and [Latent Diffusion Models](https://arxiv.org/abs/2112.10752)(LDMs). For instance consider the below images generated via such models. ![Example images generated using diffusion models](https://huggingface.co/datasets/hwaseem04/Documentation-files/resolve/main/CV-Course/diffusion-eg.png) @@ -33,9 +33,9 @@ In diffusion models, Gaussian noise is added step-by-step to the training images ## Major Variants of Diffusion models -There are 3 major diffusion modelling frameworks +There are 3 major diffusion modelling frameworks: - Denoising diffusion probabilistic models (DDPMs): - - DDPMs are models that employ latent variables to estimate the probability distribution. From this point of view, DDPMs can be viewed as a special kind of variational auto-encoders (VAEs), where the forward diffusion stage corresponds to the encoding process inside VAE, while the reverse diffusion stage corresponds to the decoding process + - DDPMs are models that employ latent variables to estimate the probability distribution. From this point of view, DDPMs can be viewed as a special kind of variational auto-encoders (VAEs), where the forward diffusion stage corresponds to the encoding process inside VAE, while the reverse diffusion stage corresponds to the decoding process. - Noise conditioned score networks (NCSNs): - It is based on training a shared neural network via score matching to estimate the score function (defined as the gradient of the log density) of the perturbed data distribution at different noise levels. 
- Stochastic differential equations (SDEs): @@ -46,12 +46,12 @@ There are 3 major diffusion modelling frameworks ## Use Cases of Diffusion Models Diffusion is used in a variety of tasks including, but not limited to: -- Image generation - Generating images based on prompts -- Image super-resolution - Increasing resolution of images -- Image inpainting - Filling up a degraded portion of an image based on prompts +- Image generation - Generating images based on prompts. +- Image super-resolution - Increasing resolution of images. +- Image inpainting - Filling up a degraded portion of an image based on prompts. - Image editing - Editing specific/entire part of the image without losing its visual identity. -- Image-to-image translation - This includes changing background, attributes of the location etc -- Learned Latent representation from diffusion models can also be used for +- Image-to-image translation - This includes changing background, attributes of the location etc. +- Learned Latent representation from diffusion models can also be used for. - Image segmentation - Classification - Anomaly detection diff --git a/chapters/en/unit5/generative-models/diffusion-models/stable-diffusion.mdx b/chapters/en/unit5/generative-models/diffusion-models/stable-diffusion.mdx index 8dae27d25..f7c783070 100644 --- a/chapters/en/unit5/generative-models/diffusion-models/stable-diffusion.mdx +++ b/chapters/en/unit5/generative-models/diffusion-models/stable-diffusion.mdx @@ -4,7 +4,7 @@ This chapter introduces the building blocks of Stable Diffusion which is a gener What will you learn from this chapter? - Fundamental components of Stable Diffusion -- How to use `text-to-image`, `image2image`, inpainting pipelines +- How to use `text-to-image`, `image2image`, inpainting pipelines ## What Do We Need for Stable Diffusion to Work? To make this section interesting we will try to answer some questions to understand the basic components of the Stable Diffusion process. @@ -25,7 +25,7 @@ Latent diffusion models address the high computational demands of processing lar - How are we fusing texts with images since we are using prompts? We know that during inference time, we can feed in the description of an image we'd like to see and some pure noise as a starting point, and the model does its best to 'denoise' the random input into something that matches the caption. -SD leverages a pre-trained transformer model based on something called [CLIP](https://github.com/johko/computer-vision-course/blob/main/chapters/en/Unit%204%20-%20Mulitmodal%20Models/CLIP%20and%20relatives/clip.mdx). CLIP's text encoder was designed to process image captions into a form that could be used to compare images and text, so it is well suited to the task of creating useful representations from image descriptions. An input prompt is first tokenized (based on a large vocabulary where each word or sub-word is assigned a specific token) and then fed through the CLIP text encoder, producing a 768-dimensional (in the case of SD 1.X) or 1024-dimensional (SD 2.X) vector for each token. To keep things consistent prompts are always padded/truncated to be 77 tokens long, and so the final representation which we use as conditioning is a tensor of shape 77x1024 per prompt. +SD leverages a pre-trained transformer model based on something called [CLIP](https://huggingface.co/learn/computer-vision-course/unit4/multimodal-models/clip-and-relatives/clip). 
CLIP's text encoder was designed to process image captions into a form that could be used to compare images and text, so it is well suited to the task of creating useful representations from image descriptions. An input prompt is first tokenized (based on a large vocabulary where each word or sub-word is assigned a specific token) and then fed through the CLIP text encoder, producing a 768-dimensional (in the case of SD 1.X) or 1024-dimensional (SD 2.X) vector for each token. To keep things consistent prompts are always padded/truncated to be 77 tokens long, and so the final representation which we use as conditioning is a tensor of shape 77x1024 per prompt. - How can we add-in good inductive biases? diff --git a/chapters/en/unit5/generative-models/gans-vaes/stylegan.mdx b/chapters/en/unit5/generative-models/gans-vaes/stylegan.mdx index 9d2339025..0dfeda2f7 100644 --- a/chapters/en/unit5/generative-models/gans-vaes/stylegan.mdx +++ b/chapters/en/unit5/generative-models/gans-vaes/stylegan.mdx @@ -3,10 +3,10 @@ What you will learn in this chapter: - What is missing in Vanilla GAN -- StyleGAN1 components and benifits +- StyleGAN1 components and benefits - Drawback of StyleGAN1 and the need for StyleGAN2 - Drawback of StyleGAN2 and the need for StyleGAN3 -- Usecases of StyleGAN +- Use cases of StyleGAN ## What is missing in Vanilla GAN Generative Adversarial Networks(GANs) are a class of generative models that produce realistic images. But it is very evident that you don't have any control over how the images are generated. In Vanilla GANs, you have two networks (i) A Generator, and (ii) A Discriminator. A Discriminator takes an image as input and returns whether it is a real image or a synthetically generated image by the generator. A Generator takes in noise vector (generally sampled from a multivariate Gaussian) and tries to produce images that look similar but not exactly the same as the ones available in the training samples, initially, it will be a junk image but in a long run the aim of the Generator is to fool the Discriminator into believing that the images generated by the generator are real. @@ -22,7 +22,7 @@ TL DR; StyleGAN is a special modification made to the architectural style of the Let us just dive into the special components introduced in StyleGAN that give StyleGAN the power which we described above. Don't get intimidated by the figure above, it is one of the simplest yet powerful ideas which you can easily understand. As I already said, StyleGAN only modifies Generator and the Discriminator remains the same, hence it is not mentioned above. Diagram (a) corresponds to the structure of ProgessiveGAN. ProgessiveGAN is just a Vanilla GAN, but instead of generating images of a fixed resolution, it progressively generates images of higher resolution in aim of generating realistic high resolution images, i.e., block 1 of generator generates image of resolution 4 by 4, block 2 of generator generates image of resolution 8 by 8 and so on. -Diagram (b) is the proposed StyleGAN architecture. It has the following main components; +Diagram (b) is the proposed StyleGAN architecture. It has the following main components: 1. A mapping network 2. AdaIN (Adaptive Instance Normalisation) 3. Concatenation of Noise vector @@ -30,7 +30,7 @@ Diagram (b) is the proposed StyleGAN architecture. It has the following main com Let's break it down one by one. 
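Before walking through each component in the sections below, here is a highly simplified structural sketch of how the three pieces interact; the dimensions are assumptions for illustration, and this is not the official StyleGAN implementation.

```python
import torch
import torch.nn as nn


class MappingNetwork(nn.Module):
    """z -> w through 8 fully connected layers (component 1)."""

    def __init__(self, dim=512, n_layers=8):
        super().__init__()
        layers = []
        for _ in range(n_layers):
            layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)


class AdaIN(nn.Module):
    """Scale and shift the normalized feature map with a per-block style from w (component 2)."""

    def __init__(self, w_dim, channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels)
        self.style = nn.Linear(w_dim, channels * 2)  # learned affine: w -> (y_s, y_b)

    def forward(self, x, w):
        y_s, y_b = self.style(w).chunk(2, dim=1)
        return y_s[:, :, None, None] * self.norm(x) + y_b[:, :, None, None]


# Component 3: per-pixel noise is added to the feature maps inside every synthesis block
z = torch.randn(4, 512)
w = MappingNetwork()(z)
feat = torch.randn(4, 256, 8, 8)            # a feature map inside the synthesis network
feat = feat + 0.1 * torch.randn_like(feat)  # noise injection (the scale is learned in practice)
print(AdaIN(512, 256)(feat, w).shape)       # torch.Size([4, 256, 8, 8])
```

The key point is that `w` enters every block through AdaIN rather than only at the input, which is what gives StyleGAN its per-resolution control over the generated image.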
### Mapping Network -Instead of passing the latent code (also known as the noise vector) z directly to the generator as done in traditional GANs, now it is mapped to w by a series of 8 MLP layers. The produced latent code w is not just passed as input to the first layer of the Generator, like in ProgessiveGAN, rather it is passed on to each block of the Generator Network (In StyleGAN terms, it is called a Synthesis Network). There are two major ideas here; +Instead of passing the latent code (also known as the noise vector) z directly to the generator as done in traditional GANs, now it is mapped to w by a series of 8 MLP layers. The produced latent code w is not just passed as input to the first layer of the Generator, like in ProgessiveGAN, rather it is passed on to each block of the Generator Network (In StyleGAN terms, it is called a Synthesis Network). There are two major ideas here: - Mapping the latent code from z to w disentangles the feature space. By disentanglement what we mean here is in a latent code of dimension 512, if you change just one of its feature values (say out of 512 values, you just increase or decrease the 4th value), then ideally in disentangled feature space, only one of the real world feature should change. If the 4th feature value corresponds to the real-world feature 'smile', then changing the 4th value of the 512-dimension latent code should generate images that are smiling/not smiling/something in between. - Passing latent code to each layer has a profound effect on the kind of the real features controlled. For instance, the effect of passing latent code w to lower blocks of the Synthetis network has control over high-level aspects such as pose, general hairstyle, face shape, and eyeglasses, and the effect of passing latent code w to blocks of the higher resolution of the synthetis network has control over smaller scale facial features, hairstyle, eyes open/closed etc. @@ -44,7 +44,7 @@ AdaIN modifies the instance Normalization by allowing the normalization paramete In StyleGAN, the latent code is not directly passed on to synthesis network rather affine transformer w, i.e y is passed to different blocks. y is called the 'style' representation. Here, \\(y_{s,i}\\) and \\(y_{b,i}\\) are the mean and standard deviation of the style representation y, and \\(mu(x_i)\\) and \\(sigma(x_i)\\) are the mean and standard deviation of the feature map x. -AdaIN enables the generator to modulate its behavior during the generation process dynamically. This is particularly useful in scenarios where different parts of the generated output may require different styles or characteristics +AdaIN enables the generator to modulate its behavior during the generation process dynamically. This is particularly useful in scenarios where different parts of the generated output may require different styles or characteristics. ### Concatenation of Noise vector @@ -67,7 +67,7 @@ You can see the blob structure in the above image, which the authors claim to ha ![Demodulation](https://huggingface.co/datasets/hwaseem04/Documentation-files/resolve/main/CV-Course/stylegan2_demod.png) -(ii) Fixing strong location preference artifact in Progessive GAN structure +(ii) Fixing strong location preference artifact in Progessive GAN structure. 
![Phase Artifact](https://huggingface.co/datasets/hwaseem04/Documentation-files/resolve/main/CV-Course/progress.png) @@ -77,7 +77,7 @@ A skip generator and a residual discriminator was used to overcome the issue, wi There are also other changes introduced in StyleGAN2, but the above two are important to know at first hand. -## Drawbacks of StyleGAN2 and the need for StyleGAN3; +## Drawbacks of StyleGAN2 and the need for StyleGAN3 The same set of authors of StyleGAN2 figured out the dependence of the synthesis network on absolute pixel coordinates in an unhealthy manner. This leads to the phenomenon called the aliasing effect. ![Animation of aliasing](https://huggingface.co/datasets/hwaseem04/Documentation-files/resolve/main/CV-Course/MP4%20to%20GIF%20conversion.gif) @@ -99,7 +99,7 @@ StyleGAN's ability to generate photorealistic images has opened doors for divers **Creative explorations** -- Generating fashion designs: StyleGAN can be used to generate realistic and diverse fashion designs +- Generating fashion designs: StyleGAN can be used to generate realistic and diverse fashion designs. - Creating immersive experiences: StyleGAN can be used to create realistic virtual environments for gaming, education, and other applications. For instance, Stylenerf: A style-based. 3d aware generator for high-resolution image synthesis. These are just a non-exhaustive list. diff --git a/chapters/en/unit5/generative-models/gans.mdx b/chapters/en/unit5/generative-models/gans.mdx index c6a842636..b92472a40 100644 --- a/chapters/en/unit5/generative-models/gans.mdx +++ b/chapters/en/unit5/generative-models/gans.mdx @@ -2,7 +2,7 @@ ## Introduction Generative Adversarial Networks (GANs) are a class of deep learning models introduced by [Ian Goodfellow](https://scholar.google.ca/citations?user=iYN86KEAAAAJ&hl=en) and his colleagues in 2014. The core idea behind GANs is to train a generator network to produce data that is indistinguishable from real data, while simultaneously training a discriminator network to differentiate between real and generated data. -* **Architecture overview:** GANs consist of two main components: `the generator` and `the discriminator` +* **Architecture overview:** GANs consist of two main components: `the generator` and `the discriminator`. * **Generator:** The generator takes random noise \\(z\\) as input and generates synthetic data samples. Its goal is to create data that is realistic enough to deceive the discriminator. * **Discriminator:** The discriminator, akin to a detective, evaluates whether a given sample is real (from the actual dataset) or fake (generated by the generator). Its objective is to become increasingly accurate in distinguishing between real and generated samples. @@ -19,12 +19,12 @@ GANs and VAEs are both popular generative models in machine learning, but they h * **Example:** A GAN-generated image of a bedroom is likely to be indistinguishable from a real one, while a VAE-generated bedroom might appear blurry or have unrealistic lighting. ![Example of GAN-Generated bedrooms taken from Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, 2015](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/generative_models/bedroom.png) - **VAEs:** - * **Strengths:** Easier to train and more stable than GANs + * **Strengths:** Easier to train and more stable than GANs. * **Weaknesses:** May generate blurry, less detailed images with unrealistic features. 
* **Other Tasks:** - **GANs:** * **Strengths:** Can be used for tasks like super-resolution and image-to-image translation. - * **Weaknesses:** May not be the best choice for tasks that require a smooth transition between data points + * **Weaknesses:** May not be the best choice for tasks that require a smooth transition between data points. - **VAEs:** * **Strengths:** Widely used for tasks like image denoising and anomaly detection. * **Weaknesses:** May not be as effective as GANs for tasks that require high-quality image generation. diff --git a/chapters/en/unit5/generative-models/introduction/introduction.mdx b/chapters/en/unit5/generative-models/introduction/introduction.mdx index 5e9966523..ad5dbe5a4 100644 --- a/chapters/en/unit5/generative-models/introduction/introduction.mdx +++ b/chapters/en/unit5/generative-models/introduction/introduction.mdx @@ -33,7 +33,7 @@ Some other metrics you might come across are SSIM, PSNR, IS(Inception Score), an * PSNR (peak signal-to-noise ratio) can be interpreted almost as mean-squared-error. Generally, values from [25,34] are okay results while 34+ is very good. -* SSIM (Structural Similarity Index) is a metric in the range [0, 1] where 1 is a perfect match. The final index is calculated from 3 components: luminance, contrast, and structure. [this paper](https://arxiv.org/pdf/2006.13846.pdf) analyzes SSIM and its components if you're really interested +* SSIM (Structural Similarity Index) is a metric in the range [0, 1] where 1 is a perfect match. The final index is calculated from 3 components: luminance, contrast, and structure. [this paper](https://arxiv.org/pdf/2006.13846.pdf) analyzes SSIM and its components if you're really interested. * Inception score was introduced in [Improved Techniques for Training GANs](https://arxiv.org/pdf/1606.03498.pdf). It is calculated using the features on the inceptionv3 model. The higher the better. It is a mathematically very interesting metric, but has recently fallen out of favor. diff --git a/chapters/en/unit5/generative-models/practical-applications/ethical-issues.mdx b/chapters/en/unit5/generative-models/practical-applications/ethical-issues.mdx index acb4f0924..d2a976022 100644 --- a/chapters/en/unit5/generative-models/practical-applications/ethical-issues.mdx +++ b/chapters/en/unit5/generative-models/practical-applications/ethical-issues.mdx @@ -30,7 +30,7 @@ The future of AI-edited images will likely involve: - **Advanced detection and mitigation techniques:** Researchers will ideally develop more advanced techniques for detecting and mitigating the harms associated with AI-edited images. But is like a cat-and-mouse game where one group develops sophisticated realistic images generation algorithms, whereas another group develops methods to identify them. - **Public awareness and education:** Public awareness campaigns and educational initiatives will be crucial in promoting responsible use of AI-edited images and combating the spread of misinformation. -- **Protecting rights of image artist:** Companies like OpenAI, Google, StabiltyAI that trains large text-to-image models are facing slew of lawsuits because of scraping works of artists from internet without crediting them in anyway. Techniques like image poisoning is an emerging research problem where an artists' image is added with human-eye-invisible noise-like pixel changes before uploading on internet. This potentially corrupts the training data and hence model's image generation capability if scraped directly. 
You can read about this more from - [here](https://www.technologyreview.com/2023/10/23/1082189/data-poisoning-artists-fight-generative-ai/), and [here](https://arxiv.org/abs/2310.13828) +- **Protecting rights of image artist:** Companies like OpenAI, Google, StabiltyAI that trains large text-to-image models are facing slew of lawsuits because of scraping works of artists from internet without crediting them in anyway. Techniques like image poisoning is an emerging research problem where an artists' image is added with human-eye-invisible noise-like pixel changes before uploading on internet. This potentially corrupts the training data and hence model's image generation capability if scraped directly. You can read about this more from - [here](https://www.technologyreview.com/2023/10/23/1082189/data-poisoning-artists-fight-generative-ai/), and [here](https://arxiv.org/abs/2310.13828). This is a rapidly evolving field, and it is crucial to stay informed about the latest developments. diff --git a/chapters/en/unit5/generative-models/variational_autoencoders.mdx b/chapters/en/unit5/generative-models/variational_autoencoders.mdx index e1be4a308..6c045f94b 100644 --- a/chapters/en/unit5/generative-models/variational_autoencoders.mdx +++ b/chapters/en/unit5/generative-models/variational_autoencoders.mdx @@ -1,7 +1,7 @@ # Variational Autoencoders ## Introduction to Autoencoders -Autoencoders are a class of neural networks primarily used for unsupervised learning and dimensionality reduction. The fundamental idea behind autoencoders is to encode input data into a lower-dimensional representation and then decode it back to the original data, aiming to minimize the reconstruction error. The basic architecture of an autoencoder consists of two main components - `the encoder` and `the decoder` +Autoencoders are a class of neural networks primarily used for unsupervised learning and dimensionality reduction. The fundamental idea behind autoencoders is to encode input data into a lower-dimensional representation and then decode it back to the original data, aiming to minimize the reconstruction error. The basic architecture of an autoencoder consists of two main components - `the encoder` and `the decoder`. * **Encoder:** The encoder is responsible for transforming the input data into a compressed or latent representation. It typically consists of one or more layers of neurons that progressively reduce the dimensions of the input. * **Decoder:** The decoder, on the other hand, takes the compressed representation produced by the encoder and attempts to reconstruct the original input data. Like the encoder, it often consists of one or more layers, but in the reverse order, gradually increasing the dimensions. @@ -24,7 +24,7 @@ In the context of Vanilla Autoencoders (AE), the smile feature is encapsulated a ## Mathematics Behind VAEs Understanding the mathematical concepts behind VAEs involves grasping the principles of probabilistic modeling and variational inference. ![Variational Autoencoder - Lilian Weng Blog](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/generative_models/vae.png) -* **Probabilistic Modeling:** In VAEs, the latent space is modeled as a probability distribution, often assumed to be a multivariate Gaussian. This distribution is parameterized by the mean and standard deviation vectors, which are outputs of the probabilistic encoder \\( q_\phi(z|x) \\). 
+* **Probabilistic Modeling:** In VAEs, the latent space is modeled as a probability distribution, often assumed to be a multivariate Gaussian. This distribution is parameterized by the mean and standard deviation vectors, which are outputs of the probabilistic encoder \\( q_\phi(z|x) \\). This comprises our learned representation \\(z\\), which is then used to sample from the decoder \\(p_\theta(x|z)\\).
* **Loss Function:** The loss function for VAEs comprises two components: the reconstruction loss (measuring how well the model reconstructs the input) similar to the vanilla autoencoders and the KL divergence (measuring how closely the learned distribution resembles a chosen prior distribution, usually gaussian). The combination of these components encourages the model to learn a latent representation that captures both the data distribution and the specified prior.
* **Encouraging Meaningful Latent Representations:** By incorporating the KL divergence term into the loss function, VAEs are encouraged to learn a latent space where similar data points are closer, ensuring a meaningful and structured representation.

The autoencoder's loss function aims to minimize both the reconstruction loss and the latent loss. A smaller latent loss implies a limited encoding of information that would otherwise enhance the reconstruction loss. Consequently, the Variational Autoencoder (VAE) finds itself in a delicate balance between the latent loss and the reconstruction loss. This equilibrium becomes pivotal, as a `smaller latent loss` tends to result in generated images closely resembling those present in the training set but lacking in visual quality. Conversely, a `smaller reconstruction loss` leads to well-reconstructed images during training but hampers the generation of novel images that deviate significantly from the training set. Striking a harmonious balance between these two aspects becomes imperative to achieve desirable outcomes in both image reconstruction and generation.
diff --git a/chapters/en/unit8/3d-vision/nvs.mdx b/chapters/en/unit8/3d-vision/nvs.mdx
index 7b06e890c..f3112f680 100644
--- a/chapters/en/unit8/3d-vision/nvs.mdx
+++ b/chapters/en/unit8/3d-vision/nvs.mdx
@@ -56,7 +56,7 @@ A model was trained separately on each class of object (e.g. planes, benches, ca
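Stepping back to the VAE objective described in the `variational_autoencoders.mdx` hunk above, here is a minimal PyTorch sketch of the two loss terms (reconstruction plus KL divergence against a unit Gaussian prior) together with the reparameterization trick used to sample \\(z\\) from \\( q_\phi(z|x) \\). The encoder and decoder modules and the `beta` weighting factor are placeholders for illustration, not code from the course notebooks.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Reconstruction + KL divergence, summed over the batch.

    mu and logvar are the outputs of the probabilistic encoder q_phi(z|x);
    the prior p(z) is assumed to be a unit Gaussian N(0, I).
    """
    # Reconstruction term: how well the decoder p_theta(x|z) reproduces the input
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # Closed-form KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

def reparameterize(mu, logvar):
    """Sample z ~ q_phi(z|x) in a differentiable way."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + eps * std
```

Setting `beta` above 1 weights the KL term more heavily, which is exactly the trade-off between the latent loss and the reconstruction loss discussed above.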

Image from: PixelNeRF

-The PixelNeRF code can be found on [GitHub](https://github.com/sxyu/pixel-nerf) +The PixelNeRF code can be found on [GitHub](https://github.com/sxyu/pixel-nerf). ### Related methods @@ -64,7 +64,7 @@ In the [ObjaverseXL](https://arxiv.org/pdf/2307.05663.pdf) paper, PixelNeRF was See also - [Generative Query Networks](https://deepmind.google/discover/blog/neural-scene-representation-and-rendering/), [Scene Representation Networks](https://www.vincentsitzmann.com/srns/), -[LRM](https://arxiv.org/pdf/2311.04400.pdf) +[LRM](https://arxiv.org/pdf/2311.04400.pdf). ## Zero123 (or Zero-1-to-3) @@ -78,7 +78,7 @@ Zero123 is built upon the [Stable Diffusion](https://arxiv.org/abs/2112.10752) a However, it adds a few new twists. The model actually starts with the weights from [Stable Diffusion Image Variations](https://huggingface.co/spaces/lambdalabs/stable-diffusion-image-variations), which uses the CLIP image embeddings (the final hidden state) of the input image to condition the diffusion U-Net, instead of a text prompt. However, here these CLIP image embeddings are concatenated with the relative viewpoint transformation between the input and novel views. -(This viewpoint change is represented in terms of spherical polar coordinates.) +(This viewpoint change is represented in terms of spherical polar coordinates).
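To make that conditioning signal concrete, here is a minimal NumPy sketch (not the Zero123 implementation) that turns two camera positions around an object into a relative (polar, azimuth, radius) change of the kind that is concatenated with the CLIP image embedding. The camera positions and the exact parameterization are illustrative assumptions.

```python
import numpy as np

def cartesian_to_spherical(xyz):
    """Return (polar angle theta, azimuth phi, radius r) for a camera position."""
    x, y, z = xyz
    r = np.linalg.norm(xyz)
    theta = np.arccos(z / r)   # angle from the z-axis
    phi = np.arctan2(y, x)     # angle in the x-y plane
    return theta, phi, r

def relative_viewpoint(cam_input, cam_target):
    """Viewpoint change between the input view and the novel view."""
    theta_i, phi_i, r_i = cartesian_to_spherical(cam_input)
    theta_t, phi_t, r_t = cartesian_to_spherical(cam_target)
    d_theta = theta_t - theta_i
    d_phi = (phi_t - phi_i) % (2 * np.pi)  # keep the azimuth difference in [0, 2*pi)
    d_r = r_t - r_i
    return np.array([d_theta, d_phi, d_r])

# Example: target camera rotated 90 degrees around the object at the same height
print(relative_viewpoint(np.array([1.5, 0.0, 0.5]), np.array([0.0, 1.5, 0.5])))
```

In practice the azimuth difference is usually embedded via its sine and cosine so that the wrap-around at \\(2\pi\\) does not introduce a discontinuity.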
Zero123 @@ -88,11 +88,12 @@ However, here these CLIP image embeddings are concatenated with the relative vie The rest of the architecture is the same as Stable Diffusion. However, the latent representation of the input image is concatenated channel-wise with the noisy latents before being input into the denoising U-Net. -To explore this model further, see the [Live Demo](https://huggingface.co/spaces/cvlab/zero123-live) +To explore this model further, see the [Live Demo](https://huggingface.co/spaces/cvlab/zero123-live). ### Related methods [3DiM](https://3d-diffusion.github.io/) - X-UNet architecture, with cross-attention between input and noisy frames. -[Zero123-XL](https://arxiv.org/pdf/2311.13617.pdf) - Trained on the larger objaverseXL dataset. See also [Stable Zero 123](https://huggingface.co/stabilityai/stable-zero123) -[Zero123++](https://arxiv.org/abs/2310.15110) - Generates 6 new fixed views, at fixed relative positions to the input view, with reference attention between input and generated images +[Zero123-XL](https://arxiv.org/pdf/2311.13617.pdf) - Trained on the larger objaverseXL dataset. See also [Stable Zero 123](https://huggingface.co/stabilityai/stable-zero123). + +[Zero123++](https://arxiv.org/abs/2310.15110) - Generates 6 new fixed views, at fixed relative positions to the input view, with reference attention between input and generated images. diff --git a/chapters/en/unit8/3d_measurements_stereo_vision.mdx b/chapters/en/unit8/3d_measurements_stereo_vision.mdx index e9dbdbfe1..75a40bc61 100644 --- a/chapters/en/unit8/3d_measurements_stereo_vision.mdx +++ b/chapters/en/unit8/3d_measurements_stereo_vision.mdx @@ -17,7 +17,7 @@ We aim to solve the problem of determining the 3D structure of objects. In our p Let's assume we are given the following information: 1. Single image of a scene point P -2. Pixel coordinates of point P in the image +2. Pixel coordinates of point P in the image 3. Position and orientation of the camera used to capture the image. For simplicity, we can also place an XYZ coordinate system at the location of the pinhole, with the z-axis perpendicular to the image place and the x-axis, and y-axis parallel to the image plane like in Figure 1. 4. Internal parameters of the camera, such as focal length and location of principal point. The principal point is where the optical axis intersects the image plane. Its location in the image plane is usually denoted as (Ox,Oy). @@ -32,9 +32,9 @@ With the information provided above, we can find a 3D line that originates from Given 2 lines in 3D, there are are three possibilities for their intersection: -1. Intersect at exactly 1 point -2. Intersect at infinite number of points -3. Do not intersect +1. Intersect at exactly 1 point +2. Intersect at infinite number of points +3. Do not intersect If both images (with original and new camera positions) contain point P, we can conclude that the 3D lines must intersect at least once and that the intersection point is point P. Furthermore, we can envision infinite points where both lines intersect only if the two lines are collinear. This is achievable if the pinhole at the new camera position lies somewhere on the original 3D line. For all other positions and orientations of the new camera location, the two 3D lines must intersect precisely at one point, where point P lies. @@ -54,10 +54,10 @@ Since there are many different positions and orientations for the camera locatio 4. We also have X and Y directions in a 2D image. 
X is the horizontal direction and Y is the vertical direction. We will refer to these directions in the image plane as u and v respectively. Therefore, pixel coordinates of a point are defined using (u,v) values. 5. X axis of the coordinate system is defined as the u direction / horizontal direction in the image plane. 6. Similarly Y axis of the coordinate system is defined as the v direction / vertical direction in the image plane. -7. Second camera (more precisely the pinhole of the second camera) is placed at a distance b called baseline in the positive x direction to the right of the first camera. Therefore, x,y,z coordinates of pinhole of second camera are (b,0,0) +7. Second camera (more precisely the pinhole of the second camera) is placed at a distance b called baseline in the positive x direction to the right of the first camera. Therefore, x,y,z coordinates of pinhole of second camera are (b,0,0). 5. Image plane of the second camera is oriented parallel to the image plane of the first camera. -6. u and v directions in the image plane of second/right camera are aligned with the u and v directions in the image plane of the first/left camera -7. Both left and right cameras are assumed to have the same intrinsic parameters like focal length and location of principal point +6. u and v directions in the image plane of second/right camera are aligned with the u and v directions in the image plane of the first/left camera. +7. Both left and right cameras are assumed to have the same intrinsic parameters like focal length and location of principal point. With the above configuration in place, we have the below equations which map a point in 3D to the image plane in 2D. @@ -70,12 +70,12 @@ With the above configuration in place, we have the below equations which map a p 2. \\(v\_right = f\_y * \frac{y}{z} + O\_y\\) Different symbols used in above equations are defined below: -* \\(u\_left\\), \\(v\_left\\) refer to pixel coordinates of point P in the left image -* \\(u\_right\\), \\(v\_right\\) refer to pixel coordinates of point P in the right image +* \\(u\_left\\), \\(v\_left\\) refer to pixel coordinates of point P in the left image. +* \\(u\_right\\), \\(v\_right\\) refer to pixel coordinates of point P in the right image. * \\(f\_x\\) refers to the focal length (in pixels) in x direction and \\(f\_y\\) refers to the focal length (in pixels) in y direction. Actually, there is only 1 focal length for a camera which is the distance between the pinhole (optical center of the lens) to the image plane. However, pixels may be rectangular and not perfect squares, resulting in different fx and fy values when we represent f in terms of pixels. -* x,y,z are 3D coordinates of the point P (any unit like cm, feet, etc can be used) -* \\(O\_x\\) and \\(O\_y\\) refer to pixel coordinates of the principal point -* b is called the baseline and refers to the distance between the left and right cameras. Same units are used for both b and x,y,z coordinates (any unit like cm, feet, etc can be used) +* x, y, z are 3D coordinates of the point P (any unit like cm, feet, etc can be used). +* \\(O\_x\\) and \\(O\_y\\) refer to pixel coordinates of the principal point. +* b is called the baseline and refers to the distance between the left and right cameras. Same units are used for both b and x,y,z coordinates (any unit like cm, feet, etc can be used). We have 4 equations above and 3 unknowns - x, y and z coordinates of a 3D point P. 
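As a worked example of these four projection equations, here is a small Python sketch that recovers (x, y, z) from a pair of corresponding pixels; the focal lengths, principal point, baseline, and pixel values below are made-up numbers used only for illustration.

```python
def triangulate(u_left, v_left, u_right, fx, fy, ox, oy, b):
    """Recover (x, y, z) of point P from a rectified stereo pair.

    From the projection equations above:
      u_left - u_right = fx * b / z   (the disparity)
      => z = fx * b / (u_left - u_right)
      => x = (u_left - ox) * z / fx,  y = (v_left - oy) * z / fy
    """
    disparity = u_left - u_right
    z = fx * b / disparity
    x = (u_left - ox) * z / fx
    y = (v_left - oy) * z / fy
    return x, y, z

# Hypothetical values: 700-pixel focal lengths, principal point at (320, 240),
# 10 cm baseline, and a 35-pixel disparity give a depth of 200 cm.
print(triangulate(u_left=425, v_left=310, u_right=390, fx=700, fy=700, ox=320, oy=240, b=10))
```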
Intrinsic camera parameters - focal lengths and principal point are assumed to be known. Equations 1.2 and 2.2 indicate that the v coordinate value in the left and right images is the same. @@ -188,7 +188,7 @@ We can also compute 3D distances between different points using their (x,y,z) va | d5(9-10) | 16.9 | 16.7 | 1.2 | | d6(9-11) | 23.8 | 24 | 0.83 | -Calculated Dimension Results +Calculated Dimension Results ![Calculated Dimension Results](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/calculated_dim_results.png?download=true) ## Conclusion diff --git a/chapters/en/unit8/introduction/brief_history.mdx b/chapters/en/unit8/introduction/brief_history.mdx index f50f6ca04..6d9aa63ab 100644 --- a/chapters/en/unit8/introduction/brief_history.mdx +++ b/chapters/en/unit8/introduction/brief_history.mdx @@ -2,7 +2,7 @@ ## 1838: Stereoscopy -- **Inventor**: Sir Charles Wheatstone +- **Inventor**: Sir Charles Wheatstone. - **Technique**: Presenting offset images to each eye through a stereoscope, creating depth perception. ## 1853: Anaglyph 3D @@ -12,7 +12,7 @@ ## 1936: Polarized 3D -- **Developer**: Edwin H. Land +- **Developer**: Edwin H. Land. - **Approach**: Utilizing polarized light technology in 3D movies, with glasses that filter light in specific directions. ## 1960s: Virtual Reality diff --git a/chapters/en/unit8/nerf.mdx b/chapters/en/unit8/nerf.mdx index 86a770bf1..9a95d88f7 100644 --- a/chapters/en/unit8/nerf.mdx +++ b/chapters/en/unit8/nerf.mdx @@ -8,7 +8,7 @@ Furthermore, it allows us to store large scenes with a smaller memory footprint ## Short History 📖 The field of NeRFs is relatively young with the first publication by [Mildenhall et al.](https://www.matthewtancik.com/nerf) appearing in 2020. Since then, a vast number of papers have been published and fast advancements have been made. -Since 2020, more than 620 preprints and publications have been released, with more than 250 repositories on GitHub. *(as of Dec 2023, statistics from [paperswithcode.com](https://paperswithcode.com/method/nerf))* +Since 2020, more than 620 preprints and publications have been released, with more than 250 repositories on GitHub. *(as of Dec 2023, statistics from [paperswithcode.com](https://paperswithcode.com/method/nerf))*. Since the first formulation of NeRFs requires long training times (up to days on beefy GPUs), there have been a lot of advancements towards faster training and inference. An important leap was NVIDIA's [Instant-ngp](https://nvlabs.github.io/instant-ngp/), which was released in 2022. @@ -18,7 +18,7 @@ This novel approach was faster to train and query while performing on par qualit [Mipnerf-360](https://jonbarron.info/mipnerf360/), which was also released in 2022, is also worth mentioning. Again, the model architecture is the same as for most NeRFs, but the authors introduced a novel scene contraction that allows us to represent scenes that are unbounded in all directions, which is important for real-world applications. [Zip-NeRF](https://jonbarron.info/zipnerf/), released in 2023, combines recent advancements like the encoding from [Instant-ngp](https://nvlabs.github.io/instant-ngp/) and the scene contraction from [Mipnerf-360](https://jonbarron.info/mipnerf360/) to handle real-world situation whilst decreasing training times to under an hour. -*(this is still measured on beefy GPUs to be fair)* +*(this is still measured on beefy GPUs to be fair)*. 
Since the field of NeRFs is rapidly evolving, we added a section at the end where we will tease the latest research and the possible future direction of NeRFs. diff --git a/chapters/en/unit8/terminologies/camera-models.mdx b/chapters/en/unit8/terminologies/camera-models.mdx index 440ea7414..3dee184e0 100644 --- a/chapters/en/unit8/terminologies/camera-models.mdx +++ b/chapters/en/unit8/terminologies/camera-models.mdx @@ -15,11 +15,11 @@ There are a number of different conventions for the direction of the camera axes ### Pinhole camera coordinate transformation ![Pinhole transformation](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/Pinhole_transform.png) -Each point in 3D space maps to a single point on the 2D plane. To find the map between 3D and 2D coordinates, we first need to know the intrinsics of the camera, which for a pinhole camera are - - the focal lengths, \\(f_x\\) and \\(f_y\\) +Each point in 3D space maps to a single point on the 2D plane. To find the map between 3D and 2D coordinates, we first need to know the intrinsics of the camera, which for a pinhole camera are: + - the focal lengths, \\(f_x\\) and \\(f_y\\). - the coordinates of the principle point, \\(c_x\\)and \\(c_y\\), which is the optical centre of the image. This point is where the optical axis intersects the image plane. -Using these intrinsic parameters, we construct the camera matrix +Using these intrinsic parameters, we construct the camera matrix: $$ K = \begin{pmatrix} @@ -31,7 +31,7 @@ $$ In order to apply this to a point \\( p=[x,y,z]\\) to a point in 3D space, we multiply the point by the camera matrix \\( K @ p \\) to give a new 3x1 vector \\( [u,v,w]\\). This is a homogeneous vector in 2D, but where the last component isn't 1. To find the position of the point in the image plane we have to divide the first two coordinates by the last one, to give the point \\([u/w, v/w]\\). -Whilst this is the textbook definition of the camera matrix, if we use the Blender camera convention it will flip the image left to right and up-down (as points in front of the camera will have negative z-values). One potential way to fix this is to change the signs of some of the elements of the camera matrix +Whilst this is the textbook definition of the camera matrix, if we use the Blender camera convention it will flip the image left to right and up-down (as points in front of the camera will have negative z-values). One potential way to fix this is to change the signs of some of the elements of the camera matrix: $$ K = \begin{pmatrix} diff --git a/chapters/en/unit8/terminologies/linear-algebra.mdx b/chapters/en/unit8/terminologies/linear-algebra.mdx index 6f59970ed..47db71b66 100644 --- a/chapters/en/unit8/terminologies/linear-algebra.mdx +++ b/chapters/en/unit8/terminologies/linear-algebra.mdx @@ -139,7 +139,7 @@ The output should look something like this: Rotations around an axis are another commonly used transformation. There are a number of different ways of representing rotations, including Euler angles and quaternions, which can be very useful in some applications. Again, libraries such as Pytorch3d include a wide range of functionalities for performing rotations. However, as a simple example, we will just show how to construct rotations about each of the three axes. 
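Before turning to the rotation matrices in the following hunks, here is a minimal NumPy sketch of the pinhole projection described in the `camera-models.mdx` hunk above: multiply a 3D point by \\(K\\), then divide by the last component. The intrinsics are made-up values, and the textbook sign convention is used rather than the Blender-flipped variant.

```python
import numpy as np

# Hypothetical intrinsics: fx = fy = 500 px, principal point at (320, 240)
K = np.array([
    [500.0,   0.0, 320.0],
    [  0.0, 500.0, 240.0],
    [  0.0,   0.0,   1.0],
])

def project(p, K):
    """Project a 3D point p = [x, y, z] into pixel coordinates."""
    u, v, w = K @ p                  # homogeneous 2D vector, last component not equal to 1
    return np.array([u / w, v / w])  # divide by the last component

print(project(np.array([0.2, -0.1, 2.0]), K))  # -> [370. 215.]
```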
-- Rotation around the X-axis +- Rotation around the X-axis: $$ R_x(\alpha) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos \alpha & -\sin \alpha & 0 \\ 0 & \sin \alpha & \cos \alpha & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} $$ @@ -176,7 +176,7 @@ The output should look something like this: output_rotation
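As a quick sanity check of the \\(R_x(\alpha)\\) matrix above, here is a small NumPy helper that builds the homogeneous rotation and applies it to a point; the course examples rely on Pytorch3d, so this standalone sketch is only illustrative.

```python
import numpy as np

def rotation_x(alpha):
    """4x4 homogeneous rotation about the X-axis by angle alpha (radians)."""
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([
        [1, 0,  0, 0],
        [0, c, -s, 0],
        [0, s,  c, 0],
        [0, 0,  0, 1],
    ])

# Rotating the point (0, 1, 0) by 90 degrees about X moves it onto the +Z axis
point = np.array([0.0, 1.0, 0.0, 1.0])   # homogeneous coordinates
print(rotation_x(np.pi / 2) @ point)      # -> approximately [0., 0., 1., 1.]
```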
-- Rotation around the Y-axis
+- Rotation around the Y-axis:

$$
R_y(\beta) = \begin{pmatrix} \cos \beta & 0 & \sin \beta & 0 \\ 0 & 1 & 0 & 0 \\ -\sin \beta & 0 & \cos \beta & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
$$
diff --git a/chapters/en/unit9/tools_and_frameworks.mdx b/chapters/en/unit9/tools_and_frameworks.mdx
index 985ac7392..62012207d 100644
--- a/chapters/en/unit9/tools_and_frameworks.mdx
+++ b/chapters/en/unit9/tools_and_frameworks.mdx
@@ -7,7 +7,7 @@
The TensorFlow Model Optimization Toolkit is a suite of tools for optimizing machine learning models for deployment. The TensorFlow Lite post-training quantization tool enable users to convert weights to 8 bit precision which reduces the trained model size by about 4 times. The tools also include API for pruning and quantization during training is post-training quantization is insufficient.
-These help user to reduce latency and inference cost, deploy models to edge devices with restricted resources and optimized execution for existing hardware or new special purpose accelerators
+These help users to reduce latency and inference cost, deploy models to edge devices with restricted resources, and optimize execution for existing hardware or new special-purpose accelerators.

### Setup guide

@@ -18,7 +18,7 @@
pip install -U tensorflow-model-optimization
```

### Hands-on guide

-For a hands-on guide on how to use the Tensorflow Model Optimization Toolkit, refer this [notebook](https://colab.research.google.com/drive/1t1Tq6i0JZbOwloyhkSjg8uTTVX9iUkgj#scrollTo=D_MCHp6cwCFb)
+For a hands-on guide on how to use the TensorFlow Model Optimization Toolkit, refer to this [notebook](https://colab.research.google.com/drive/1t1Tq6i0JZbOwloyhkSjg8uTTVX9iUkgj#scrollTo=D_MCHp6cwCFb).

## Pytorch Quantization
### Overview
@@ -40,7 +40,7 @@ import torch.quantization
```

## Hands-on guide
-For a hands-on guide on how to use the Pytorch Quantization, refer this [notebook](https://colab.research.google.com/drive/1toyS6IUsFvjuSK71oeLZZ51mm8hVnlZv
+For a hands-on guide on how to use PyTorch quantization, refer to this [notebook](https://colab.research.google.com/drive/1toyS6IUsFvjuSK71oeLZZ51mm8hVnlZv).

## ONNX Runtime

@@ -49,12 +49,12 @@ For a hands-on guide on how to use the Pytorch Quantization, refer this [noteboo

ONNX Runtime is a cross-platform machine-learning model accelerator, with a flexible interface to integrate hardware-specific libraries. ONNX Runtime can be used with models from PyTorch, Tensorflow/Keras, TFLite, scikit-learn, and other frameworks.

The benefits of using ONNX Runtime for Inferencing are as follows:
-- Improve inference performance for a wide variety of ML models
-- Run on different hardware and operating systems
-- Train in Python but deploy into a C#/C++/Java app
-- Train and perform inference with models created in different frameworks
+- Improve inference performance for a wide variety of ML models.
+- Run on different hardware and operating systems.
+- Train in Python but deploy into a C#/C++/Java app.
+- Train and perform inference with models created in different frameworks.

-For more details on ONNX Runtime, see [here](https://onnxruntime.ai/docs/)
+For more details on ONNX Runtime, see [here](https://onnxruntime.ai/docs/).
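To make the inference workflow concrete, here is a minimal ONNX Runtime sketch. The model file name and input shape are placeholders; a real model exported from PyTorch or TensorFlow defines its own input names, which is why they are queried from the session rather than hard-coded.

```python
import numpy as np
import onnxruntime as ort

# Load an exported model (the file name is a placeholder)
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Query the input signature instead of hard-coding it
input_name = session.get_inputs()[0].name

# Dummy image batch; adjust the shape to whatever the model expects
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)

outputs = session.run(None, {input_name: dummy_input})
print(outputs[0].shape)
```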
### Setup guide

@@ -72,7 +72,7 @@
pip install onnxruntime-gpu
```

### Hands-on guide

-For a hands-on guide on how to use the ONNX Runtime, refer this [notebook](https://colab.research.google.com/drive/1A-qYPX52V2q-7fXHaLeNRJqPUk3a4Qkd)
+For a hands-on guide on how to use ONNX Runtime, refer to this [notebook](https://colab.research.google.com/drive/1A-qYPX52V2q-7fXHaLeNRJqPUk3a4Qkd).

## TensorRT

@@ -88,11 +88,11 @@ TensorRT is available as a pip package, `tensorrt`. To install the package, run
```
pip install tensorrt
```
-for other installation methods, see [here](https://docs.nvidia.com/deeplearning/tensorrt/quick-start-guide/index.html#install)
+For other installation methods, see [here](https://docs.nvidia.com/deeplearning/tensorrt/quick-start-guide/index.html#install).

### Hands-on guide

-For a hands-on guide on how to use the TensorRT, refer this [notebook](https://colab.research.google.com/drive/1b8ueEEwgRc9fGqky1f6ZPx5A2ak82FE1)
+For a hands-on guide on how to use TensorRT, refer to this [notebook](https://colab.research.google.com/drive/1b8ueEEwgRc9fGqky1f6ZPx5A2ak82FE1).

## OpenVINO

@@ -101,8 +101,8 @@ For a hands-on guide on how to use the TensorRT, refer this [notebook](https://c

The OpenVINO™ toolkit enables user to optimize a deep learning model from almost any framework and deploy it with best-in-class performance on a range of Intel® processors and other hardware platforms. The benefits of using OpenVINO includes:
- link directly with OpenVINO Runtime to run inference locally or use OpenVINO Model Server to serve model inference from a separate server or within Kubernetes environment
-- Write an application once, deploy it anywhere on your preferred device, language and OS.
-- has minimal external dependencies
+- Write an application once, deploy it anywhere on your preferred device, language and OS
+- has minimal external dependencies
- Reduces first-inference latency by using the CPU for initial inference and then switching to another device once the model has been compiled and loaded to memory

### Setup guide

@@ -112,11 +112,11 @@ Openvino is available as a pip package, `openvino`. To install the package, run
```
pip install openvino
```

-For other installation methods, see [here](https://docs.openvino.ai/2023.2/openvino_docs_install_guides_overview.html?VERSION=v_2023_2_0&OP_SYSTEM=LINUX&DISTRIBUTION=ARCHIVE)
+For other installation methods, see [here](https://docs.openvino.ai/2023.2/openvino_docs_install_guides_overview.html?VERSION=v_2023_2_0&OP_SYSTEM=LINUX&DISTRIBUTION=ARCHIVE).

### Hands-on guide

-For a hands-on guide on how to use the OpenVINO, refer this [notebook](https://colab.research.google.com/drive/1FWD0CloFt6gIEd0WBSMBDDKzA7YUE8Wz)
+For a hands-on guide on how to use OpenVINO, refer to this [notebook](https://colab.research.google.com/drive/1FWD0CloFt6gIEd0WBSMBDDKzA7YUE8Wz).

## Optimum

@@ -142,11 +142,11 @@ Optimum is available as a pip package, `optimum`. To install the package, run th
```
pip install optimum
```

-For installation of accelerator-specific features, see [here](https://huggingface.co/docs/optimum/installation)
+For installation of accelerator-specific features, see [here](https://huggingface.co/docs/optimum/installation).
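As a rough illustration of what quantization with Optimum's ONNX Runtime backend looks like, here is a hedged sketch based on the `optimum.onnxruntime` API. The checkpoint, the save directories, and arguments such as `export=True` are assumptions that may differ between `optimum` versions, so treat this as a sketch rather than the course's reference code.

```python
from optimum.onnxruntime import ORTModelForImageClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export a vision checkpoint to ONNX (checkpoint chosen only as an example)
model = ORTModelForImageClassification.from_pretrained("google/vit-base-patch16-224", export=True)
model.save_pretrained("vit-onnx")

# Dynamic (weight-only) int8 quantization targeting AVX2 CPUs
quantizer = ORTQuantizer.from_pretrained("vit-onnx")
qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)
quantizer.quantize(save_dir="vit-onnx-quantized", quantization_config=qconfig)
```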
### Hands-on guide

-For a hands-on guide on how to use Optimum for quantization, refer this [notebook](https://colab.research.google.com/drive/1tz4eHqSZzGlXXS3oBUc2NRbuRCn2HjdN)
+For a hands-on guide on how to use Optimum for quantization, refer to this [notebook](https://colab.research.google.com/drive/1tz4eHqSZzGlXXS3oBUc2NRbuRCn2HjdN).

## EdgeTPU

@@ -156,10 +156,10 @@ Edge TPU is Google’s purpose-built ASIC designed to run AI at the edge. It del
The benefits of using EdgeTPU includes:
- Complements Cloud TPU and Google Cloud services to provide an end-to-end, cloud-to-edge, hardware + software infrastructure for AI-based solutions deployment
- High performance in a small physical and power footprint
-- Combined custom hardware, open software, and state-of-the-art AI algorithms to provide high-quality, easy to deploy AI solutions for the edge.
+- Combined custom hardware, open software, and state-of-the-art AI algorithms to provide high-quality, easy to deploy AI solutions for the edge

For more details on EdgeTPU, see [here](https://cloud.google.com/edge-tpu)
-For guide on how to setup and use EdgeTPU, refer this [notebook](https://colab.research.google.com/drive/1aMEZE2sI9aMLLBVJNSS37ltMwmtEbMKl)
+For a guide on how to set up and use EdgeTPU, refer to this [notebook](https://colab.research.google.com/drive/1aMEZE2sI9aMLLBVJNSS37ltMwmtEbMKl).