diff --git a/chapters/en/_toctree.yml b/chapters/en/_toctree.yml index 0fdf195c5..a9bd6ec08 100644 --- a/chapters/en/_toctree.yml +++ b/chapters/en/_toctree.yml @@ -42,6 +42,8 @@ local: "unit2/cnns/intro-transfer-learning" - title: Lets Dive Further with MobileNet local: "unit2/cnns/mobilenetextra" + - title: Resnet + local: "unit2/cnns/resnet" - title: Unit 3 - Vision Transformers sections: - title: Vision Transformers for Image Classification diff --git a/chapters/en/unit1/chapter1/applications.mdx b/chapters/en/unit1/chapter1/applications.mdx index 27a0718c8..954b94cd8 100644 --- a/chapters/en/unit1/chapter1/applications.mdx +++ b/chapters/en/unit1/chapter1/applications.mdx @@ -38,7 +38,7 @@ Medical image analysis involves the application of computer vision and machine l - **Diagnostic Assistance**: Computer vision aids in diagnosing diseases and conditions by analyzing medical images. For instance, in radiology, algorithms can detect abnormalities such as tumors and fractures in X-rays or MRIs. These systems assist healthcare professionals by highlighting areas of concern or providing quantitative data that helps decision-making. -- **Segmentation and Detection:**: Medical image analysis involves segmenting and detecting specific structures or anomalies within the images. This process helps isolate organs, tissues, or pathologies for closer examination. For example, in cancer detection, computer vision algorithms can segment and analyze tumors from MRI or CT scans, assisting in treatment planning and monitoring. +- **Segmentation and Detection**: Medical image analysis involves segmenting and detecting specific structures or anomalies within the images. This process helps isolate organs, tissues, or pathologies for closer examination. For example, in cancer detection, computer vision algorithms can segment and analyze tumors from MRI or CT scans, assisting in treatment planning and monitoring. - **Treatment Planning and Monitoring**: Computer vision contributes to treatment planning by providing precise measurements, tracking changes over time, and assisting in surgical planning. It helps doctors understand the extent and progression of a disease, enabling them to plan and adjust treatment strategies accordingly. Doctors were already capable of doing most of these tasks, but they needed to do them by hand. CV systems can do it automatically, which frees us doctors to do other tasks. diff --git a/chapters/en/unit1/chapter1/definition.mdx b/chapters/en/unit1/chapter1/definition.mdx index 7b649fb8c..7309406c8 100644 --- a/chapters/en/unit1/chapter1/definition.mdx +++ b/chapters/en/unit1/chapter1/definition.mdx @@ -14,7 +14,7 @@ Computer vision is the science and technology of making machines see. It involve The evolution of computer vision has been marked by a series of incremental advancements in and across its interdisciplinary fields, where each step forward gave rise to breakthrough algorithms, hardware, and data, giving it more power and flexibility. One such leap was the jump to the widespread use of deep learning methods. -Initially, to extract and learn information in an image, you extract features through image-preprocessing techniques (chapter 3). Once you have a group of features describing your image, you use a classical machine learning algorithm on your dataset of features. It is a strategy that already simplifies things from the hard-coded rules, but it still relies on domain knowledge and exhaustive feature engineering. 
A more state-of-the-art approach arises when deep learning methods and large datasets meet. Deep learning (DL) allows machines to automatically learn complex features from the raw data. This paradigm shift allowed us to build more adaptive and sophisticated models, causing a renaissance in the field. +Initially, to extract and learn information in an image, you extract features through image-preprocessing techniques (Pre-processing for Computer Vision Tasks). Once you have a group of features describing your image, you use a classical machine learning algorithm on your dataset of features. It is a strategy that already simplifies things from the hard-coded rules, but it still relies on domain knowledge and exhaustive feature engineering. A more state-of-the-art approach arises when deep learning methods and large datasets meet. Deep learning (DL) allows machines to automatically learn complex features from the raw data. This paradigm shift allowed us to build more adaptive and sophisticated models, causing a renaissance in the field. The seeds of computer vision were sown long before the rise of deep learning models during 1960's, pioneers like David Marr and Hans Moravec wrestled with the fundamental question: Can we get machines to see? Early breakthroughs like edge detection algorithms, object recognition were achived with a mix of cleverness and brute-force which laid the ground work for this developing computer vision systems. Over time, as research and development advanced and hardware capabilities improved, the computer vision community expanded exponentially. This vibrant community is composed of researchers,engineers, data scientists, and passionate hobbyists across the globe coming from a vast arrayof disciplines. With open-source and community driven projects we are witnessing democratized access to cutting-edge tools and technologies helping to create a renaissance in this field. @@ -63,5 +63,5 @@ You will read more about the core tasks of computer vision in the Computer Visio The complexity of a given task in the realm of image analysis and computer vision is not solely determined by how noble or difficult a question or task may seem to an informed audience. Instead, it primarily hinges on the properties of the image or data being analyzed. Take, for example, the task of identifying a pedestrian in an image. To a human observer, this might appear straightforward and relatively simple, as we are adept at recognizing people. However, from a computational perspective, the complexity of this task can vary significantly based on factors such as lighting conditions, the presence of occlusions, the resolution of the image, and the quality of the camera. In low-light conditions or with pixelated images, even the seemingly basic task of pedestrian detection can become exceedingly complex for computer vision algorithms,requiring advanced image enhancement and machine learning techniques. Therefore, the challenge in image analysis and computer vision often lies not in the inherent nobility of a task, but in the intricacies of the visual data and the computational methods required to extract meaningful insights from it. ## Link to computer vision applications -As a field, computer vision has a growing importance in society. There are many ethical considerations regarding its applications. For example, a model that is deployed to detect cancer can have terrible consequences if it classifies a cancer sample as healthy. 
Surveillance technology, such as models that are capable of tracking people, also raises a lot of privacy concerns. This will be discussed in detail in Chapter 14- Applications of Computer Vision and real-world Use Cases, but we will give you a taste of some of its applications. +As a field, computer vision has a growing importance in society. There are many ethical considerations regarding its applications. For example, a model that is deployed to detect cancer can have terrible consequences if it classifies a cancer sample as healthy. Surveillance technology, such as models that are capable of tracking people, also raises a lot of privacy concerns. This will be discussed in detail in "Unit 12 - Ethics and Biases". We will give you a taste of some of its cool applications in "Applications of Computer Vision". diff --git a/chapters/en/unit1/chapter1/motivation.mdx b/chapters/en/unit1/chapter1/motivation.mdx index 5a5f00186..e8f76f01a 100644 --- a/chapters/en/unit1/chapter1/motivation.mdx +++ b/chapters/en/unit1/chapter1/motivation.mdx @@ -14,7 +14,7 @@ If you ever spontaneously kicked a ball, your brain performs a myriad of tasks u Shockingly, we don't need any formal education for this. We don't attend classes for most of the decisions we make daily. No mental math 101 can estimate the foot strength required for kicking a ball. We learned that from trial and error growing up. And some of us might never have learned at all. This is a striking contrast to the way we built programs. Programs are mostly rule-based. -Let’s try to replicate just the first task that our brain did: detecting that there is a ball. One way to do it is to define what a ball is and then exhaustively search for one in the image. Defining what a ball is is actually difficult. Balls can be as small as tennis balls but as big as Zorb balls, so size won’t help us much. We could try to describe its shape, but some balls, like rugby, are not always perfectly spherical. Not everything spherical is a ball either, otherwise ranges, bubbles, candies, and even our planet would all be considered balls. +Let’s try to replicate just the first task that our brain did: detecting that there is a ball. One way to do it is to define what a ball is and then exhaustively search for one in the image. Defining what a ball is is actually difficult. Balls can be as small as tennis balls but as big as Zorb balls, so size won’t help us much. We could try to describe its shape, but some balls, like rugby, are not always perfectly spherical. Not everything spherical is a ball either, otherwise bubbles, candies, and even our planet would all be considered balls.
Balls diff --git a/chapters/en/unit1/feature-extraction/feature-matching.mdx b/chapters/en/unit1/feature-extraction/feature-matching.mdx index 805919dbd..c412aa12b 100644 --- a/chapters/en/unit1/feature-extraction/feature-matching.mdx +++ b/chapters/en/unit1/feature-extraction/feature-matching.mdx @@ -24,7 +24,7 @@ import numpy as np Let's start by initializing SIFT detector. ```python -sift = cv.SIFT_create() +sift = cv2.SIFT_create() ``` Find the keypoints and descriptors with SIFT. @@ -37,7 +37,7 @@ kp2, des2 = sift.detectAndCompute(img2, None) Find matches using k nearest neighbors. ```python -bf = cv.BFMatcher() +bf = cv2.BFMatcher() matches = bf.knnMatch(des1, des2, k=2) ``` @@ -53,8 +53,8 @@ for m, n in matches: Draw the matches. ```python -img3 = cv.drawMatchesKnn( - img1, kp1, img2, kp2, good, None, flags=cv.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS +img3 = cv2.drawMatchesKnn( + img1, kp1, img2, kp2, good, None, flags=cv2.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS ) ``` @@ -67,7 +67,7 @@ img3 = cv.drawMatchesKnn( Initialize the ORB descriptor. ```python -orb = cv.ORB_create() +orb = cv2.ORB_create() ``` Find keypoints and descriptors. @@ -81,7 +81,7 @@ Because ORB is a binary descriptor, we find matches using [Hamming Distance](htt which is a measure of the difference between two strings of equal length. ```python -bf = cv.BFMatcher(cv.NORM_HAMMING, crossCheck=True) +bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True) ``` We will now find the matches. @@ -99,14 +99,14 @@ matches = sorted(matches, key=lambda x: x.distance) Draw first n matches. ```python -img3 = cv.drawMatches( +img3 = cv2.drawMatches( img1, kp1, img2, kp2, matches[:n], None, - flags=cv.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS, + flags=cv2.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS, ) ``` @@ -140,7 +140,7 @@ search_params = dict(checks=50) Initiate SIFT detector. ```python -sift = cv.SIFT_create() +sift = cv2.SIFT_create() ``` Find the keypoints and descriptors with SIFT. @@ -156,7 +156,7 @@ We will now define the FLANN parameters. Here, trees is the number of bins you w FLANN_INDEX_KDTREE = 1 index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=5) search_params = dict(checks=50) -flann = cv.FlannBasedMatcher(index_params, search_params) +flann = cv2.FlannBasedMatcher(index_params, search_params) matches = flann.knnMatch(des1, des2, k=2) ``` @@ -182,10 +182,10 @@ draw_params = dict( matchColor=(0, 255, 0), singlePointColor=(255, 0, 0), matchesMask=matchesMask, - flags=cv.DrawMatchesFlags_DEFAULT, + flags=cv2.DrawMatchesFlags_DEFAULT, ) -img3 = cv.drawMatchesKnn(img1, kp1, img2, kp2, matches, None, **draw_params) +img3 = cv2.drawMatchesKnn(img1, kp1, img2, kp2, matches, None, **draw_params) ``` ![FLANN](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/feature-extraction-feature-matching/FLANN.png) diff --git a/chapters/en/unit1/feature-extraction/feature_description.mdx b/chapters/en/unit1/feature-extraction/feature_description.mdx index 70f81af63..fe54b03a4 100644 --- a/chapters/en/unit1/feature-extraction/feature_description.mdx +++ b/chapters/en/unit1/feature-extraction/feature_description.mdx @@ -1,6 +1,6 @@ # Feature Description -Features are attributes of the instances learnt by the model to be later used to recognize new instances. +Features are attributes of the instances learned by the model to be later used to recognize new instances. ## How Can We Represent Features In Data Structures? 
diff --git a/chapters/en/unit1/image_and_imaging/extension-image.mdx b/chapters/en/unit1/image_and_imaging/extension-image.mdx index e0651500c..018fbde32 100644 --- a/chapters/en/unit1/image_and_imaging/extension-image.mdx +++ b/chapters/en/unit1/image_and_imaging/extension-image.mdx @@ -6,7 +6,7 @@ The litter of kittens is a simple story, but it reflects why it is so hard to im ![Cat kisses showing distortion based on the distance from the object](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/cat_kiss.gif) -It is tempting to think that if we just had a better camera, one that responds more rapidly with a high resolution all would be solved. We would get the adorable pictures we want. Moreover, we will use the knowledge in this course to do more than just capture all of the adorable kittens, we will want to build a model on a nanny cam that checks if the kittens are still together with their mommy so we know they are all safe and sound. Sounds perfect, right? +It is tempting to think that if we just had a better camera, one that responds more rapidly with a high resolution and then all would be solved. We would get the adorable pictures we want. Moreover, we will use the knowledge in this course to do more than just capture all of the adorable kittens, we will want to build a model on a nanny cam that checks if the kittens are still together with their mommy so we know they are all safe and sound. Sounds perfect, right? Before we go out to buy the newest flashiest new camera in the market thinking we will have better data. It will be super easy to train a model. We will have a super-accurate model. Out-of-this-world performance on the kitten tracking market. This paragraph is here to guide you in a more productive direction and possibly save you a lot of time and money. A higher resolution is not the answer to all your problems. For starters, a typical neural network model for dealing with images is a convolution neural network (CNN). CNNs expect an image of a given size. A large image needs a large model. Training will take a longer time. Chances are that your computers are also limited in RAM. A larger image size will mean fewer images to train on because the RAM will be limited for each iteration. @@ -32,11 +32,11 @@ To see more than what Mother Nature has given us, we need sensors capturing beyo We then directed our colossal lenses outwards toward the sky, using them to envision what was once unseen and unknown. We also pointed them out to the minuscule realm by building images of the DNA structure and individual atoms. Both of these instruments operate on the idea of manipulating light. We use different types of mirrors or lenses, bend and focus light in the specific ways we are interested in. -We are so obsessive about seeing things that scientists have even changed the DNA sequence of certain animals so they can tag proteins of interest with a special type of protein, called green fluorescence protein. As the name suggests, when a green wavelength of light illuminates the sample, the GFP emits a fluorescent signal back. Now, it is easier to know where the protein of interest is being expressed because scientists can image it. +We are so obsessive about seeing things that scientists have even changed the DNA sequence of certain animals so they can tag proteins of interest with a special type of protein (green fluorescence protein, GFP). As the name suggests, when a green wavelength of light illuminates the sample, the GFP emits a fluorescent signal back. 
Now, it is easier to know where the protein of interest is being expressed because scientists can image it. After that, it was a matter of improving this system to get more channels in place, in a longer timescale, in a better resolution. A great example of this is how microscopes now generate terabytes of data overnight. -A great example of this combined effort is the video below. In it, you see the time lapse of the projection of the 3D image of a developping embryo of a fished tagged a fluorescent protein. Each colored dot you see on the image represents an individual cell. +A great example of this combined effort is the video below. In it, you see the time lapse of the projection of the 3D image of a developping embryo of a fished tagged with a fluorescent protein. Each colored dot you see on the image represents an individual cell. ![Fisho Embryo Image adapted from https://www.biorxiv.org/content/10.1101/2023.03.06.531398v2.supplementary-material ](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/fish.gif) @@ -96,4 +96,4 @@ That is not the only scenario where the coordinates system comes into play. Anot Lastly, image acquisition comes with its own set of biases. We can loosely define bias here as an undesired characteristic of the dataset, either because it is noise or because it changes the model behavior. There are many sources of bias, but a relevant one in image acquisition is measurement bias. Measurement bias happens when the dataset used to train your model varies too much from the dataset that your model actually sees, like our previous example of a high-resolution kitten image and the nanny cam. There can be other sources of measurement bias, such as the measurement coming from the labelers themselves (i.e different groups and different people label images differently), or from the context of the image (i.e. in trying to classify dogs and cats, if all the pictures of cats are on the sofa, the model might learn to distinguish sofa from non-sofa instead of cats and dogs). -All of that is to say that recognizing and addressing the characteristics of images originating from different instruments is a good first step into building a computer vision model. Preprocessing techniques and strategies to address the problems we identify in this case can be used to mitigate its impact on the model. The next chapter will delve deeper into specific preprocessing methods used to enhance model performance. +All of that is to say that recognizing and addressing the characteristics of images originating from different instruments is a good first step into building a computer vision model. Preprocessing techniques and strategies to address the problems we identify in this case can be used to mitigate its impact on the model. The "Preprocessing for Computer Vision Tasks" chapter will address deeper into specific preprocessing methods used to enhance model performance. diff --git a/chapters/en/unit1/image_and_imaging/image.mdx b/chapters/en/unit1/image_and_imaging/image.mdx index e98267470..353fd0a13 100644 --- a/chapters/en/unit1/image_and_imaging/image.mdx +++ b/chapters/en/unit1/image_and_imaging/image.mdx @@ -35,11 +35,11 @@ If you've been tuned in, you may have caught on to the idea that videos are a v Images can naturally have a hidden component in time. They are, after all, taken at a specific point in time, and different images may be related in time, too. However, images and videos differ in how they sample this temporal information. 
An image is a static representation at a single point in time, while a video is a sequence of images played at a rate that creates an illusion of motion. This rate is what we can call frames per second. -This is so fundamental, that this course has a dedicated chapter to video. There, we will go over the adaptions required to deal with this added dimension. +This is so fundamental, that this course has a dedicated chapter to video. There, we will go over the adaptations required to deal with this added dimension. ### Images vs Tabular Data -In tabular data, dimensionality is usually defined by the number of features (columns) describing one data point. In visual data, dimensionality usually refers to the number of dimensions that describe your data. For a 2D image, we usually refer to numbers \\(x_i\\) and \\(Y_i\\) as the image size. +In tabular data, dimensionality is usually defined by the number of features (columns) describing one data point. In visual data, dimensionality usually refers to the number of dimensions that describe your data. For a 2D image, we usually refer to numbers \\(x_i\\) and \\(y_i\\) as the image size. Another aspect is the generation of features that describe visual data. They are generated by traditional preprocessing or learned through deep learning methods. We refer to this as feature extraction. It involves different algorithms discussed in more detail in the feature extraction chapter. It contrasts with the feature engineering for tabular data, where new features are built upon existing ones. diff --git a/chapters/en/unit2/cnns/googlenet.mdx b/chapters/en/unit2/cnns/googlenet.mdx index 9086921bc..4e3ae0d36 100644 --- a/chapters/en/unit2/cnns/googlenet.mdx +++ b/chapters/en/unit2/cnns/googlenet.mdx @@ -7,7 +7,6 @@ In this chapter we will go through a convolutional architecture called GoogleNet The Inception architecture, a convolutional neural network (CNN) designed for tasks in computer vision such as classification and detection, stands out due to its efficiency. It contains fewer than 7 million parameters and is significantly more compact than its predecessors, being 9 times smaller than AlexNet and 22 times smaller than VGG16. This architecture gained recognition in the ImageNet 2014 challenge, where Google's adaptation, named GoogLeNet (a tribute to LeNet), set new benchmarks in performance while utilizing fewer parameters compared to previous leading methods. - ### Architectural Innovations Before the advent of the Inception architecture, models like AlexNet and VGG demonstrated the benefits of deeper network structures. However, deeper networks typically entail more computational steps and can lead to issues such as overfitting and the vanishing gradient problem. The Inception architecture offers a solution, enabling the training of complex CNNs with a reduced count of floating-point parameters. @@ -16,18 +15,18 @@ Before the advent of the Inception architecture, models like AlexNet and VGG dem In prior networks, such as AlexNet or VGG, the fundamental block is the convolution layer itself. However, Lin et al. 2013, introduced the concept of Network In Network, arguing that a single convolution is not necessarily a correct fundamental building block. It ought to be more complex. So, inspired by that, the Inception model authors decided to have a more complex building block called the Inception Module, aptly named after the famous movie - "The Inception" (dream in dream). 
-The Inception Module insists on applying convolution filters of different kernel sizes for feature extraction at multiple scales. For any input feature map, it applies a $1 \times 1$ convolution, a 3x3 convolution, and a 5x5 convolution in parallel. In addition to convolution a max pooling operation is also applied. All four operations have padding and stride in such a way as to have the same spatial dimension. These features are concatenated and form the input to the next stage. See Figure 1. +The Inception Module insists on applying convolution filters of different kernel sizes for feature extraction at multiple scales. For any input feature map, it applies a \\(1 \times 1\\) convolution, a 3x3 convolution, and a 5x5 convolution in parallel. In addition to convolution a max pooling operation is also applied. All four operations have padding and stride in such a way as to have the same spatial dimension. These features are concatenated and form the input to the next stage. See Figure 1. ![inception_naive](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/inception_naive.png) Figure 1: Naive Inception Module -As we can see applying multiple convolutions at multiple scales with bigger kernel sizes, like 5x5, can increase the number of parameters drastically. This problem is pronounced as the input feature size (channel size) increases. So as we go deep in the network stacking these "Inception Modules", the computation will increase drastically. The simple solution is to reduce the number of features wherever computational requirements seem to increase. The major pain points of high computation are the convolution layers. The feature dimension is reduced by a computationally inexpensive $1 \times 1$ convolution just before the 3x3 and 5x5 convolution. Let's see it with an example. +As we can see applying multiple convolutions at multiple scales with bigger kernel sizes, like 5x5, can increase the number of parameters drastically. This problem is pronounced as the input feature size (channel size) increases. So as we go deep in the network stacking these "Inception Modules", the computation will increase drastically. The simple solution is to reduce the number of features wherever computational requirements seem to increase. The major pain points of high computation are the convolution layers. The feature dimension is reduced by a computationally inexpensive \\(1 \times 1\\) convolution just before the 3x3 and 5x5 convolution. Let's see it with an example. -We want to convert a feature map of (\\ S \times S \times 128 \\) to (\\ S \times S \times 256 \\) via a 5x5 convolution. The number of parameters (excluding biases) is 5*5*128*256 = 819,200. However, if we reduce the feature dimension first by a $1 \times 1$ convolution to 64, then the number of parameters(excluding biases) is (\\ 1\times 1\times 128\times 64 + 5\times 5\times 64\times 256 = 8,192 + 409,600 = 417,792 \\). That means the number of parameters was reduced by almost half! +We want to convert a feature map of \\( S \times S \times 128 \\) to \\( S \times S \times 256 \\) via a 5x5 convolution. The number of parameters (excluding biases) is 5\*5\*128\*256 = 819,200. However, if we reduce the feature dimension first by a \\(1 \times 1\\) convolution to 64, then the number of parameters(excluding biases) is \\( 1\times 1\times 128\times 64 + 5\times 5\times 64\times 256 = 8,192 + 409,600 = 417,792 \\). That means the number of parameters was reduced by almost half! 
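To double-check this arithmetic, the counts can be reproduced with a couple of PyTorch layers (a quick illustrative sketch, not part of the original chapter; the 128 to 256 channels and the 64-channel bottleneck follow the example above):

```python
import torch.nn as nn

def n_params(module):
    """Count learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

# Direct 5x5 convolution: 5 * 5 * 128 * 256 weights
direct = nn.Conv2d(128, 256, kernel_size=5, padding=2, bias=False)

# 1x1 reduction to 64 channels, then the 5x5 convolution
bottleneck = nn.Sequential(
    nn.Conv2d(128, 64, kernel_size=1, bias=False),             # 1 * 1 * 128 * 64 = 8,192
    nn.Conv2d(64, 256, kernel_size=5, padding=2, bias=False),  # 5 * 5 * 64 * 256 = 409,600
)

print(n_params(direct))      # 819200
print(n_params(bottleneck))  # 417792
```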
-We would also want to reduce the output features of max pooling before concatenating with the output feature map. So, we add one more (\\ 1\times 1 \\) convolution after the max-pooling layer. We also add a ReLU activation after each (\\ 1\times 1 \\) convolution increasing non-linearity and complexity of the module. See Figure 2. +We would also want to reduce the output features of max pooling before concatenating with the output feature map. So, we add one more \\( 1\times 1 \\) convolution after the max-pooling layer. We also add a ReLU activation after each \\( 1\times 1 \\) convolution increasing non-linearity and complexity of the module. See Figure 2. ![inception_reduced](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/inception_reduced.png) @@ -38,11 +37,6 @@ Also, because of the parallel operations of convolutions at multiple scales, we #### Average Pooling In prior networks, like AlexNet or VGG, the final layers would be a few fully connected layers. These fully connected layers, due to their large number of units would contribute to most of the parameters in a network. For example, 89% of the parameters of VGG16 are in the final three fully connected layers. 95% of parameters in AlexNet are in the final fully connected layers. This need can be attributed to the premise that a Convolutional layer is not necessarily complex enough. -### Architectural Innovations - -Before the advent of the Inception architecture, models like AlexNet and VGG demonstrated the benefits of deeper network structures. However, deeper networks typically entail more computational steps and can lead to issues such as overfitting and the vanishing gradient problem. The Inception architecture offers a solution, enabling the training of complex CNNs with a reduced count of floating-point parameters. - -In a conventional CNN design, layers are typically categorized as either pooling or convolution layers, with specific sizes for convolution filters. Although layering different sizes of convolution filters is beneficial for various tasks, it can rapidly increase the total number of parameters. The Inception architecture takes a different approach by running the convolution filters of various sizes (1x1, 3x3, 5x5) in parallel. That means it is possible to get different lower-dimensional embeddings -and hence, more information- from the same higher-dimensional features using these parallel processes! These are then integrated with max pooling into a unified component known as the Inception module. The GoogLeNet architecture is composed of a series of 9 such Inception modules. This configuration allows the network to maintain flexibility and learn complex tasks without a substantial increase in depth. However, with an Inception block at our disposal, we do not need fully connected layers and a simple average pooling along the spatial dimensions should be enough. This was also derived from the Network in Network paper. However, GoogLeNet included one fully connected layer. They reported an increase of 0.6% in top-1 accuracy. 
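To get a feel for how much a global average pooling head saves compared to a fully connected head, here is a rough, hypothetical comparison (the 7x7x512 feature map and the 1000 output classes are assumed only for illustration and are not taken from the text):

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

# VGG-style head: flatten a 7x7x512 feature map into large fully connected layers
fc_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(7 * 7 * 512, 4096),
    nn.ReLU(True),
    nn.Linear(4096, 1000),
)

# GoogLeNet-style head: average pool to 1x1, then a single linear layer
gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(512, 1000),
)

print(f"{n_params(fc_head):,}")   # 106,861,544 parameters
print(f"{n_params(gap_head):,}")  # 513,000 parameters
```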
@@ -103,11 +97,6 @@ class InceptionModule(nn.Module): self.b1 = BaseConv2d(in_channels, n1x1, kernel_size=1) - self.b1 = nn.Sequential( - nn.Conv2d(in_channels, n1x1, kernel_size=1), - nn.ReLU(True), - ) - self.b2 = nn.Sequential( BaseConv2d(in_channels, n3x3red, kernel_size=1), BaseConv2d(n3x3red, n3x3, kernel_size=3, padding=1), @@ -196,29 +185,6 @@ class GoogLeNet(nn.Module): self.dropout = nn.Dropout(0.4) self.fc = nn.Linear(1024, 1000) - self.pre_layers = nn.Sequential( - nn.Conv2d(3, 64, kernel_size=3, padding=1), - nn.ReLU(True), - ) - - self.inception_blocks = nn.Sequential( - InceptionModule(64, 16, 32, 32, 16, 8, 8), - InceptionModule(64, 24, 32, 48, 16, 12, 12), - nn.MaxPool2d(3, stride=2, padding=1), - InceptionModule(96, 24, 32, 48, 16, 12, 12), - InceptionModule(96, 16, 32, 48, 16, 16, 16), - InceptionModule(96, 16, 32, 48, 16, 16, 16), - InceptionModule(96, 16, 32, 48, 16, 16, 16), - InceptionModule(96, 32, 32, 48, 16, 24, 24), - nn.MaxPool2d(3, stride=2, padding=1), - InceptionModule(128, 32, 48, 64, 16, 16, 16), - InceptionModule(128, 32, 48, 64, 16, 16, 16), - ) - - self.output_net = nn.Sequential( - nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(), nn.Linear(128, 100) - ) - def forward(self, x): ## block 1 x = self.conv1(x) diff --git a/chapters/en/unit2/cnns/introduction.mdx b/chapters/en/unit2/cnns/introduction.mdx index e890389f2..8f0382797 100644 --- a/chapters/en/unit2/cnns/introduction.mdx +++ b/chapters/en/unit2/cnns/introduction.mdx @@ -8,7 +8,7 @@ In this unit, we will learn about Convolutional Neural Networks, an important st ## Convolution: Basic Ideas -Convolution is an operation used to extract features from data. The data can be 1D, 2D or 3D. We'll explain the operation with a solid example. All you need to know now is that the operation simply takes a matrix made of numbers, moves it through the data, and takes the sum of products between the data and that matrix. This matrix is called kernel or filter. You might say, "What does it have to do with the feature extraction, and how am I supposed to apply it? +Convolution is an operation used to extract features from data. The data can be 1D, 2D or 3D. We'll explain the operation with a solid example. All you need to know now is that the operation simply takes a matrix made of numbers, moves it through the data, and takes the sum of products between the data and that matrix. This matrix is called kernel or filter. You might say, "What does it have to do with the feature extraction, and how am I supposed to apply it?" Don’t panic! We’re getting to it. To illustrate the intuition, let's take a look at this example. We have this 1D data, and we visualize it. Visualization will help understand the effects of convolution operation. diff --git a/chapters/en/unit2/cnns/resnet.mdx b/chapters/en/unit2/cnns/resnet.mdx new file mode 100644 index 000000000..4b53d792f --- /dev/null +++ b/chapters/en/unit2/cnns/resnet.mdx @@ -0,0 +1,90 @@ +# ResNet (Residual Network) + + + + +Neural networks with more layers were assumed to be more effective because adding more layers improves the model performance. + +As the networks became deeper, the extracted features could be further enriched, such as seen with VGG16 and VGG19. + +A question arose: "Is learning networks as easy as stacking more layers"? +An obstacle to answering this question, the gradient vanishing problem, was addressed by normalized initializations and intermediate normalization layers. + +However, a new issue emerged: the degradation problem. 
As the neural networks became deeper, accuracy saturated and degraded rapidly. An experiment comparing shallow and deep plain networks revealed that deeper models exhibited higher training and test errors, suggesting a fundamental challenge in training deeper architectures effectively. This degradation was not because of overfitting but because the training error increased when the network became deeper. The added layers did not approximate the identity function. + + +ResNet’s residual connections unlocked the potential of the extreme depth, propelling the accuracy upwards compared to the previous architectures. + + + +## ResNet Architecture + +- A Residual Block. Source: ResNet Paper +![residual](https://huggingface.co/datasets/hf-vision/course-assets/blob/main/ResnetBlock.png) + +ResNet’s building blocks designed as identity functions, preserve input information while enabling learning. This approach ensures efficient weight optimization and prevents degradation as the network becomes deeper. + +The building block of ResNet can be shown in the picture, source ResNet paper. + +![resnet_building_block](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/ResnetBlock.png) + +Shortcut connections perform identity mapping and their output is added to the output of the stacked layers. Identity shortcut connections add neither extra parameters nor +computational complexity, these connections bypass layers, creating direct paths for information flow, and they enable neural networks to learn the residual function (F). + +We can summarize ResNet Network -> Plain Network + Shortcuts! + +For operations \(F(x) + x\), \(F(x)\) and \(x\) should have identical dimensions. +ResNet employs two techniques to achieve this: + +- Zero-padding shortcuts that add channels with zero values, maintaining dimensions without introducing extra parameters to be learned. +- Projection shortcuts that use 1x1 convolutions to adjust dimensions when necessary, involving some additional learnable parameters. + +In deeper ResNet architectures like ResNet 50, 101, and 152, a specialized "bottleneck building block" is employed to manage parameter complexity and maintain efficiency while enabling even deeper learning. + +## ResNet Code + + + +### Deep Residual Networks Pre-trained on ImageNet +Below you can see how to load pre-trained ResNet with an image classification head using transformers library. +```python +from transformers import ResNetForImageClassification + +model = ResNetForImageClassification.from_pretrained("microsoft/resnet-50") + +model.eval() +``` +All pre-trained models expect input images normalized similarly, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224. The images have to be loaded into a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]. + + +Here's a sample execution. This example is available in the [Hugging Face documentation](https://huggingface.co/docs/transformers/v4.18.0/en/model_doc/resnet). 
+ +```python +from transformers import AutoFeatureExtractor, ResNetForImageClassification +import torch +from datasets import load_dataset + +dataset = load_dataset("huggingface/cats-image") +image = dataset["test"]["image"][0] + +feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/resnet-50") +model = ResNetForImageClassification.from_pretrained("microsoft/resnet-50") + +inputs = feature_extractor(image, return_tensors="pt") + +with torch.no_grad(): + logits = model(**inputs).logits + +# model predicts one of the 1000 ImageNet classes +predicted_label = logits.argmax(-1).item() +print(model.config.id2label[predicted_label]) +``` + + +## References + +- [PyTorch docs](https://pytorch.org/hub/pytorch_vision_resnet/) +- [ResNet: Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) + +- [Resnet Architecture Source: ResNet Paper](https://arxiv.org/abs/1512.03385) +- [Hugging Face Documentation on ResNet](https://huggingface.co/docs/transformers/en/model_doc/resnet) \ No newline at end of file diff --git a/chapters/en/unit4/multimodal-models/a_multimodal_world.mdx b/chapters/en/unit4/multimodal-models/a_multimodal_world.mdx index b4f962ccf..ebe7c18ec 100644 --- a/chapters/en/unit4/multimodal-models/a_multimodal_world.mdx +++ b/chapters/en/unit4/multimodal-models/a_multimodal_world.mdx @@ -26,7 +26,6 @@ _An infographic on multimodality and why it is important to capture the overall Many times communication between 2 people gets really awkward in textual mode, slightly improves when voices are involved but greatly improves when you are able to visualize body language and facial expressions as well. This has been studied in detail by the American Psychologist, Albert Mehrabian who stated this as the 7-38-55 rule of communication, the rule states: "In communication, 7% of the overall meaning is conveyed through verbal mode (spoken words), 38% through voice and tone and 55% through body language and facial expressions." -![Funny Image + Text Meme example](https://huggingface.co/datasets/hf-vision/course-assets/main/resolve/multimodal_fusion_text_vision/bigbang.jpg) To be more general, in the context of AI, 7% of the meaning conveyed is through textual modality, 38% through audio modality and 55% through vision modality. Within the context of deep learning, we would refer each modality as a way data arrives to a deep learning model for processing and predictions. The most commonly used modalities in deep learning are: vision, audio and text. Other modalities can also be considered for specific use cases like LIDAR, EEG Data, eye tracking data etc. diff --git a/chapters/en/unit4/multimodal-models/tasks-models-part1.mdx b/chapters/en/unit4/multimodal-models/tasks-models-part1.mdx index 7f652c134..4cae51578 100644 --- a/chapters/en/unit4/multimodal-models/tasks-models-part1.mdx +++ b/chapters/en/unit4/multimodal-models/tasks-models-part1.mdx @@ -5,7 +5,7 @@ In this section, we will briefly look at the different multimodal tasks involvin ## Examples of Tasks Before looking into specific models, it's crucial to understand the diverse range of tasks involving image and text. These tasks include but are not limited to: -- **Visual Question Anwering (VQA) and Visual Reasoning:** Imagine a machine that looks at a picture and understands your questions about it. Visual Question Answering (VQA) is just that! It trains computers to extract meaning from images and answer questions like "Who's driving the car?" 
while Visual Reasoning is the secret sauce, enabling the machine to go beyond simple recognition and infer relationships, compare objects, and understand scene context to give accurate answers. It's like asking a detective to read the clues in a picture, only much faster and better! +- **Visual Question Answering (VQA) and Visual Reasoning:** Imagine a machine that looks at a picture and understands your questions about it. Visual Question Answering (VQA) is just that! It trains computers to extract meaning from images and answer questions like "Who's driving the car?" while Visual Reasoning is the secret sauce, enabling the machine to go beyond simple recognition and infer relationships, compare objects, and understand scene context to give accurate answers. It's like asking a detective to read the clues in a picture, only much faster and better! - **Document Visual Question Answering (DocVQA):** Imagine a computer understanding both the text and layout of a document, like a map or contract, and then answering questions about it directly from the image. That's Document Visual Question Answering (DocVQA) in a nutshell. It combines computer vision for processing image elements and natural language processing to interpret text, allowing machines to "read" and answer questions about documents just like humans do. Think of it as supercharging document search with AI to unlock all the information trapped within those images. @@ -392,4 +392,4 @@ Congratulations! you made it till the end. Now on to the next section for more o ## References 1. [Vision-Language Pre-training: Basics, Recent Advances, and Future Trends](https://arxiv.org/abs/2210.09263) 2. [Document Collection Visual Question Answering](https://arxiv.org/abs/2104.14336) -3. [Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection](https://arxiv.org/abs/2303.05499) \ No newline at end of file +3. [Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection](https://arxiv.org/abs/2303.05499) diff --git a/chapters/en/unit8/3d-vision/nvs.mdx b/chapters/en/unit8/3d-vision/nvs.mdx index 4ff173dff..be9cbff8a 100644 --- a/chapters/en/unit8/3d-vision/nvs.mdx +++ b/chapters/en/unit8/3d-vision/nvs.mdx @@ -25,17 +25,18 @@ PixelNeRF is a method that directly generates the parameters of a NeRF from one In other words, it conditions the NeRF on the input images. Unlike the original NeRF, which trains a MLP which takes spatial points to a density and color, PixelNeRF uses spatial features generated from the input images. -![PixelNeRF diagram](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_pipeline.png) -image from https://alexyu.net/pixelnerf - +
+![PixelNeRF diagram](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_pipeline.png)
+
+Image from: [PixelNeRF](https://alexyu.net/pixelnerf)
+
The method first passes the input images through a convolutional neural network (ResNet34), bilinearly upsampling features from multiple layers to the same resolution as the input images. As in a standard NeRF, the new view is generated by volume rendering. However, the NeRF itself has a slightly unusual structure. -At each query point $x$ in the rendered volume, the corresponding point in the input image(s) is found (by projecting it using the input image camera transformation $\pi$). -The input image features at this point, $W(\pi x)$ are then found by bilinear interpolation. -Like in the original NeRF, the query point $x$ is positionally encoded and concatentated with the viewing direction $d$. -The NeRF network consists of a set of ResNet blocks; the input image features $W(\pi(x))$ pass through a linear layer, and are added to the features at the start of each of the first three residual blocks. +At each query point \\( x \\) in the rendered volume, the corresponding point in the input image(s) is found (by projecting it using the input image camera transformation \\( \pi \\) ). +The input image features at this point, \\( W(\pi x) \\) are then found by bilinear interpolation. +Like in the original NeRF, the query point \\( x \\) is positionally encoded and concatentated with the viewing direction \\( d\\). +The NeRF network consists of a set of ResNet blocks; the input image features \\( W(\pi(x)) \\) pass through a linear layer, and are added to the features at the start of each of the first three residual blocks. There are then two more residual blocks to further process these features, before an output layer reduces the number of channels to four (RGB+density). When multiple input views are supplied, these are processed independently for the first three residual blocks, and then the features are averaged before the last two blocks. @@ -46,11 +47,14 @@ A model was trained separately on each class of object (e.g. planes, benches, ca ### Results (from the PixelNeRF website) -![Input image of a chair](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_input.png) -![Rotating gif animation of rendered novel views](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_output.gif) - -Image from https://alexyu.net/pixelnerf. +
+![Input image of a chair](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_input.png)
+![Rotating gif animation of rendered novel views](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_output.gif)
+
+Image from: [PixelNeRF](https://alexyu.net/pixelnerf)
+
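Before looking at the full implementation, here is a heavily simplified sketch of the conditioning idea described above (an illustrative toy version with assumed layer sizes, not the actual pixelNeRF code): the image feature W(π(x)) sampled at the projected query point is added to the features at the start of each residual block.

```python
import torch
import torch.nn as nn

class ConditionedResBlock(nn.Module):
    """Residual block whose input is shifted by projected image features."""

    def __init__(self, dim, feat_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.feat_proj = nn.Linear(feat_dim, dim)

    def forward(self, h, img_feat):
        h = h + self.feat_proj(img_feat)  # inject W(pi(x)) at the start of the block
        return h + self.fc2(torch.relu(self.fc1(torch.relu(h))))

class TinyPixelNeRFHead(nn.Module):
    def __init__(self, dim=128, in_dim=63 + 3, feat_dim=512):
        # in_dim: positionally encoded x concatenated with viewing direction d (sizes assumed)
        super().__init__()
        self.inp = nn.Linear(in_dim, dim)
        self.blocks = nn.ModuleList(ConditionedResBlock(dim, feat_dim) for _ in range(3))
        self.out = nn.Linear(dim, 4)  # RGB + density

    def forward(self, x_encoded, img_feat):
        h = self.inp(x_encoded)
        for block in self.blocks:
            h = block(h, img_feat)
        return self.out(h)

head = TinyPixelNeRFHead()
rgb_sigma = head(torch.randn(1024, 66), torch.randn(1024, 512))  # 1024 query points
print(rgb_sigma.shape)  # torch.Size([1024, 4])
```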
The PixelNeRF code can be found on [GitHub](https://github.com/sxyu/pixel-nerf). @@ -76,9 +80,10 @@ The model actually starts with the weights from [Stable Diffusion Image Variatio However, here these CLIP image embeddings are concatenated with the relative viewpoint transformation between the input and novel views. (This viewpoint change is represented in terms of spherical polar coordinates). -![Zero123](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/Zero123.png) -image from https://zero123.cs.columbia.edu. - +
+![Zero123](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/Zero123.png)
+
+Image from: https://zero123.cs.columbia.edu
+
The rest of the architecture is the same as Stable Diffusion. However, the latent representation of the input image is concatenated channel-wise with the noisy latents before being input into the denoising U-Net. diff --git a/chapters/en/unit8/3d_measurements_stereo_vision.mdx b/chapters/en/unit8/3d_measurements_stereo_vision.mdx index 4ad2112a6..584aae2e4 100644 --- a/chapters/en/unit8/3d_measurements_stereo_vision.mdx +++ b/chapters/en/unit8/3d_measurements_stereo_vision.mdx @@ -8,9 +8,10 @@ Now, let's say we are given this 2D image and the location of the pixel coordina We aim to solve the problem of determining the 3D structure of objects. In our problem statement, we can represent an object in 3D as a set of 3D points. Finding the 3D coordinates of each of these points helps us determine the 3D structure of the object. -![Figure 1: Image formation using single camera](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/image_formation_single_camera.png?download=true) - -Figure 1: Image formation using single camera +
+![Figure 1: Image formation using single camera](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/image_formation_single_camera.png?download=true)
+
+Figure 1: Image formation using single camera
+
## Solution Let's assume we are given the following information: @@ -42,9 +43,10 @@ Therefore, using 2 images of the same scene point P, known positions and orienta ## Simplified Solution Since there are many different positions and orientations for the camera locations which can be selected, we can select a location that makes the math simpler, less complex, and reduces computational processing when running on a computer or an embedded device. One configuration that is popular and generally used is shown in Figure 2. We use 2 cameras in this configuration, which is equivalent to a single camera for capturing 2 images from 2 different locations. -![Figure 2: Image formation using 2 cameras](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/image_formation_simple_stereo.jpg?download=true) - -Figure 2: Image formation using 2 cameras. +
+![Figure 2: Image formation using 2 cameras](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/image_formation_simple_stereo.jpg?download=true)
+
+Figure 2: Image formation using 2 cameras
+
1. Origin of the coordinate system is placed at the pinhole of the first camera which is usually the left camera. 2. Z axis of the coordinate system is defined perpendicular to the image plane. @@ -97,14 +99,20 @@ We'll work through an example, capture some images, and perform some calculation ### Raw Left and Right Images The left and right cameras in OAK-D Lite are oriented similarly to the geometry of the simplified solution detailed above. The baseline distance between the left and right cameras is 7.5cm. Left and right images of a scene captured using this device are shown below. The figure also shows these images stacked horizontally with a red line drawn at a constant height (i.e. at a constant v value ). We'll refer to the horizontal x-axis as u and the vertical y-axis as v. -Raw Left Image. -![Raw Left Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/unrectified_left_frame.jpg?download=true) +Raw Left Image +
+![Raw Left Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/unrectified_left_frame.jpg?download=true)
-Raw Right Image. -![Raw Right Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/unrectified_right_frame.jpg?download=true) +Raw Right Image +
+![Raw Right Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/unrectified_right_frame.jpg?download=true)
-![Raw Stacked Left and Right Images ](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/unrectified_stacked_frames.jpg?download=true) Raw Stacked Left and Right Images +
+![Raw Stacked Left and Right Images](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/unrectified_stacked_frames.jpg?download=true)
Let's focus on a single point - the top left corner of the laptop. As per equation 3 above, \\(v\_left = v\_right\\) for the same point in the left and right images. However, notice that the red line, which is at a constant v value, touches the top-left corner of the laptop in the left image but misses this point by a few pixels in the right image. There are two main reasons for this discrepancy: @@ -115,27 +123,39 @@ Let's focus on a single point - the top left corner of the laptop. As per equati We can perform image rectification/post-processing to correct for differences in intrinsic parameters and orientations of the left and right cameras. This process involves performing 3x3 matrix transformations. In the OAK-D Lite API, a stereo node performs these calculations and outputs the rectified left and right images. Details and source code can be viewed [here](https://github.com/luxonis/depthai-experiments/blob/master/gen2-stereo-on-host/main.py). In this specific implementation, correction for intrinsic parameters is performed using intrinsic camera matrices, and correction for orientation is performed using rotation matrices(part of calibration parameters) for the left and right cameras. The rectified left image is transformed as if the left camera had the same intrinsic parameters as the right one. Therefore, in all our following calculations, we'll use the intrinsic parameters for the right camera i.e. focal length of 452.9 and principal point at (298.85, 245.52). In the rectified and stacked images below, notice that the red line at constant v touches the top-left corner of the laptop in both the left and right images. Rectified Left Image -![Rectified Left Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_left_frame.jpg?download=true) +
+![Rectified Left Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_left_frame.jpg?download=true)
-Rectified Right Image -![Rectified Right Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_right_frame.jpg?download=true) +Rectified Right Image +
+![Rectified Right Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_right_frame.jpg?download=true)
-Rectified and Stacked Left and Right Images -![Rectified and Stacked Left and Right Images](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_stacked_frames.jpg?download=true) +Rectified and Stacked Left and Right Images +
+![Rectified and Stacked Left and Right Images](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_stacked_frames.jpg?download=true)
Let's also overlap the rectified left and right images to see the difference. We can see that the v values for different points remain mostly constant in the left and right images. However, the u values change, and this difference in the u values helps us find the depth information for different points in the scene, as shown in Equation 6 above. This difference in 'u' values \\(u\_left - u\_right\\) is called disparity, and we can notice that the disparity for points near the camera is greater compared to points further away. Depth z and disparity \\(u\_left - u\_right\\) are inversely proportional, as shown in equation 6. -Rectified and Overlapped Left and Right Images -![Rectified and Overlapped Left and Right Images](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_overlapping_frames.jpg?download=true) +Rectified and Overlapped Left and Right Images +
+![Rectified and Overlapped Left and Right Images](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_overlapping_frames.jpg?download=true)
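To make the relation between depth and disparity concrete, here is a small sketch (assumed to follow the standard simplified-stereo relations referred to as equations 4, 5, and 6; the focal length 452.9 px, baseline 7.5 cm, and principal point (298.85, 245.52) come from the text above, while the pixel coordinates below are made-up examples):

```python
def xyz_from_stereo(u_left, v_left, u_right, f=452.9, b=0.075, cx=298.85, cy=245.52):
    """Triangulate a 3D point (in meters, left-camera frame) from rectified stereo pixels."""
    disparity = u_left - u_right  # pixels; larger for points closer to the camera
    z = f * b / disparity         # depth is inversely proportional to disparity (equation 6)
    x = (u_left - cx) * z / f     # equation 4 (assumed standard pinhole back-projection)
    y = (v_left - cy) * z / f     # equation 5
    return x, y, z

# Made-up corresponding pixel coordinates for illustration
print(xyz_from_stereo(u_left=350.0, v_left=200.0, u_right=310.0))
# -> roughly (0.10, -0.09, 0.85): about 0.85 m in front of the left camera
```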
### Annotated Left and Right Rectified Images Let's find the 3D coordinates for some points in the scene. A few points are selected and manually annotated with their (u,v) values, as shown in the figures below. Instead of manual annotations, we can also use template-based matching, feature detection algorithms like SIFT, etc for finding corresponding points in left and right images. -Annotated Left Image -![Annotated Left Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/annotated_left_img.jpg?download=true) +Annotated Left Image +
+![Annotated Left Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/annotated_left_img.jpg?download=true)
-Annotated Right Image -![Annotated Right Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/annotated_right_img.jpg?download=true) +Annotated Right Image +
+![Annotated Right Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/annotated_right_img.jpg?download=true)
### 3D Coordinate Calculations Twelve points are selected in the scene, and their (u,v) values in the left and right images are tabulated below. Using equations 4, 5, and 6, (x,y,z) coordinates for these points are also calculated and tabulated below. X and Y coordinates concerning the left camera, and the origin is at the left camera's pinhole (or optical center of the lens). Therefore, 3D points left and above the pinhole have negative X and Y values, respectively. diff --git a/chapters/en/unit8/nerf.mdx b/chapters/en/unit8/nerf.mdx index 6e9ef2492..9a95d88f7 100644 --- a/chapters/en/unit8/nerf.mdx +++ b/chapters/en/unit8/nerf.mdx @@ -20,19 +20,20 @@ Again, the model architecture is the same as for most NeRFs, but the authors int [Zip-NeRF](https://jonbarron.info/zipnerf/), released in 2023, combines recent advancements like the encoding from [Instant-ngp](https://nvlabs.github.io/instant-ngp/) and the scene contraction from [Mipnerf-360](https://jonbarron.info/mipnerf360/) to handle real-world situation whilst decreasing training times to under an hour. *(this is still measured on beefy GPUs to be fair)*. -Since the field of NeRFs is rapidly evolving, we added a section `sota` at the end where we will tease the latest research and the possible future direction of NeRFs. +Since the field of NeRFs is rapidly evolving, we added a section at the end where we will tease the latest research and the possible future direction of NeRFs. But now enough with the history, let's dive into the intrinsics of NeRFs! 🚀🚀 ## Underlying approach (Vanilla NeRF) 📘🔍 -The fundamental idea behind NeRFs is to represent a scene as a continuous function that maps a position, $\mathbf{x} \in \mathbb{R}^{3}$, and a viewing direction, $\boldsymbol{\theta} \in \mathbb{R}^{2}$, to a colour $\mathbf{c} \in \mathbb{R}^{3}$ and volume density $\sigma \in \mathbb{R}^{1}$. -As neural networks can serve as universal function approximators, we can approximate this continuous function that represents the scene with a simple Multi-Layer Perceptron (MLP) $F_{\mathrm{\Theta}} : (\mathbf{x}, \boldsymbol{\theta}) \to (\mathbf{c},\sigma)$. +The fundamental idea behind NeRFs is to represent a scene as a continuous function that maps a position, \\( \mathbf{x} \in \mathbb{R}^{3} \\), and a viewing direction, \\( \boldsymbol{\theta} \in \mathbb{R}^{2} \\), to a colour \\( \mathbf{c} \in \mathbb{R}^{3} \\) and volume density \\( \sigma \in \mathbb{R}^{1}\\). +As neural networks can serve as universal function approximators, we can approximate this continuous function that represents the scene with a simple Multi-Layer Perceptron (MLP) \\( F_{\mathrm{\Theta}} : (\mathbf{x}, \boldsymbol{\theta}) \to (\mathbf{c},\sigma) \\). A simple NeRF pipeline can be summarized with the following picture: -![nerf_pipeline](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/nerf_pipeline.png) - -Image from: [Mildenhall et al. (2020)](https://www.matthewtancik.com/nerf). +
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/nerf_pipeline.png" alt="nerf_pipeline"/>
+    <p>Image from: <a href="https://www.matthewtancik.com/nerf">Mildenhall et al. (2020)</a></p>
+</div>
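Before stepping through the pipeline, here is a minimal sketch (in PyTorch, with the viewing direction passed as a 3D unit vector, as it is in the ray equation further down) of what such a scene MLP could look like. It only illustrates the input/output signature \\( (\mathbf{x}, \boldsymbol{\theta}) \to (\mathbf{c},\sigma) \\); the actual NeRF model additionally applies positional encoding to its inputs, uses skip connections, and feeds the viewing direction only into the colour head.

```python
import torch
import torch.nn as nn

class TinySceneMLP(nn.Module):
    """Maps a 3D position and a viewing direction to (RGB colour, density)."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        # Input: 3 position coordinates + 3 components of the (unit) view direction.
        self.net = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 3 colour channels + 1 density
        )

    def forward(self, x: torch.Tensor, d: torch.Tensor):
        out = self.net(torch.cat([x, d], dim=-1))
        colour = torch.sigmoid(out[..., :3])   # keep RGB in [0, 1]
        sigma = torch.relu(out[..., 3:])       # density must be non-negative
        return colour, sigma

# Query the scene at 1024 random sample points with random view directions.
model = TinySceneMLP()
x = torch.rand(1024, 3)
d = torch.nn.functional.normalize(torch.rand(1024, 3), dim=-1)
colour, sigma = model(x, d)
print(colour.shape, sigma.shape)  # torch.Size([1024, 3]) torch.Size([1024, 1])
```

During training, the colours and densities predicted along each ray are combined by the volume rendering step described next.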
 **(a)** Sample points and viewing directions along camera rays and pass them through the network.
@@ -51,9 +52,9 @@ What is important for the use case of NeRFs is that this step is **differentiabl

 $$\mathbf{C}(\mathbf{r}) = \int_{t_n}^{t_f}T(t)\sigma(\mathbf{r}(t))\mathbf{c}(\mathbf{r}(t),\mathbf{d})dt$$

-In the equation above, $\mathbf{C}(\mathbf{r})$ is the expected colour of a camera ray $\mathbf{r}(t)=\mathbf{o}+t\mathbf{d}$, where $\mathbf{o} \in \mathbb{R}^{3}$ is the origin of the camera, $\boldsymbol{d} \in \mathbb{R}^{3}$ is the viewing direction as a 3D unit vector and $t \in \mathbb{R}_+$ is the distance along the ray.
-$t_n$ and $t_f$ stand for the near and far bounds of the ray, respectively.
-$T(t)$ denotes the accumulated transmittance along ray $\mathbf{r}(t)$ from $t_n$ to $t$.
+In the equation above, \\( \mathbf{C}(\mathbf{r}) \\) is the expected colour of a camera ray \\( \mathbf{r}(t)=\mathbf{o}+t\mathbf{d} \\), where \\( \mathbf{o} \in \mathbb{R}^{3} \\) is the origin of the camera, \\( \boldsymbol{d} \in \mathbb{R}^{3} \\) is the viewing direction as a 3D unit vector, and \\( t \in \mathbb{R}_+ \\) is the distance along the ray.
+\\( t_n \\) and \\( t_f \\) stand for the near and far bounds of the ray, respectively.
+\\( T(t) \\) denotes the accumulated transmittance along ray \\( \mathbf{r}(t) \\) from \\( t_n \\) to \\( t \\).

 After discretization, the equation above can be computed as the following sum:

@@ -70,7 +71,7 @@ Many NeRF approaches use a pixel-wise error term that can be written as follows:

 $$\mathcal{L}_{\rm recon}(\boldsymbol{\hat{C}},\boldsymbol{C^*}) = \left\|\boldsymbol{\hat{C}}-\boldsymbol{C^*}\right\|^2$$

-,where $\boldsymbol{\hat{C}}$ is the rendered pixel colour and $\boldsymbol{C}^*$ is the ground truth pixel colour.
+where \\( \boldsymbol{\hat{C}} \\) is the rendered pixel colour and \\( \boldsymbol{C}^* \\) is the ground truth pixel colour.

 **Additional remarks**

@@ -155,7 +156,7 @@ If you are interested in the inner workings and optimisation of such a *proposal

 To get the full experience when training your first NeRF, I recommend taking a look at the awesome [Google Colab notebook from the nerfstudio team](https://colab.research.google.com/github/nerfstudio-project/nerfstudio/blob/main/colab/demo.ipynb). There, you can upload images of a scene of your choice and train a NeRF. You could for example fit a model to represent your living room. 🎉🎉

-## Current advancements in the field[[sota]]
+## Current advancements in the field

 The field is rapidly evolving and the number of new publications is almost exploding. Concerning training and rendering speed, [VR-NeRF](https://vr-nerf.github.io) and [SMERF](https://smerf-3d.github.io) show very promising results. We believe that we will soon be able to stream a real-world scene in real-time on an edge device, and this is a huge leap towards a realistic *Metaverse*.
diff --git a/chapters/en/unit8/terminologies/camera-models.mdx b/chapters/en/unit8/terminologies/camera-models.mdx
index e6049019b..3dee184e0 100644
--- a/chapters/en/unit8/terminologies/camera-models.mdx
+++ b/chapters/en/unit8/terminologies/camera-models.mdx
@@ -1,8 +1,10 @@
 # Camera models

 ## Pinhole Cameras
+<div class="flex justify-center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/Pinhole-camera.png" alt="Pinhole camera from https://commons.wikimedia.org/wiki/File:Pinhole-camera.svg"/>
+</div>
-![Pinhole camera from https://commons.wikimedia.org/wiki/File:Pinhole-camera.svg](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/Pinhole-camera.png)
 The simplest kind of camera - perhaps one that you have made yourself - consists of a lightproof box, with a small hole made in one side and a screen or a photographic film on the other. Light rays passing through the hole generate an inverted image on the rear wall of the box. This simple model for a camera is commonly used in 3D graphics applications.

 ### Camera axes conventions
@@ -27,7 +29,7 @@ f_x & 0 & c_x \\
 \end{pmatrix}
 $$

-In order to apply this to a point \\(p=[x,y,z]\\) to a point in 3D space, we multiply the point by the camera matrix $K @ p$ to give a new 3x1 vector \\([u,v,w]\\). This is a homogeneous vector in 2D, but where the last component isn't 1. To find the position of the point in the image plane we have to divide the first two coordinates by the last one, to give the point \\([u/w, v/w]\\).
+To apply this to a point \\( p=[x,y,z] \\) in 3D space, we multiply the point by the camera matrix, \\( K @ p \\), to give a new 3x1 vector \\( [u,v,w] \\). This is a homogeneous vector in 2D, but one where the last component isn't 1. To find the position of the point in the image plane, we divide the first two coordinates by the last one, giving the point \\( [u/w, v/w] \\).

 Whilst this is the textbook definition of the camera matrix, if we use the Blender camera convention it will flip the image left to right and up-down (as points in front of the camera will have negative z-values). One potential way to fix this is to change the signs of some of the elements of the camera matrix:
diff --git a/chapters/en/unit8/terminologies/linear-algebra.mdx b/chapters/en/unit8/terminologies/linear-algebra.mdx
index ac36f575c..47db71b66 100644
--- a/chapters/en/unit8/terminologies/linear-algebra.mdx
+++ b/chapters/en/unit8/terminologies/linear-algebra.mdx
@@ -99,7 +99,9 @@ plot_cube(ax, translated_cube, label="Translated", color="red")

 The output should look something like this:

-![output_translation](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/translation.png)
+<div class="flex justify-center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/translation.png" alt="output_translation"/>
+</div>
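As a quick standalone check of what the translation above does to a single point (independent of the cube and plotting helpers defined earlier in this chapter, and with made-up offsets), you can multiply a homogeneous point by the 4x4 translation matrix directly:

```python
import numpy as np

# 4x4 homogeneous translation by (tx, ty, tz) -- hypothetical offsets.
tx, ty, tz = 2.0, -1.0, 0.5
T = np.array([
    [1, 0, 0, tx],
    [0, 1, 0, ty],
    [0, 0, 1, tz],
    [0, 0, 0, 1],
], dtype=float)

p = np.array([1.0, 1.0, 1.0, 1.0])  # a cube corner in homogeneous coordinates
print(T @ p)                         # -> [3.  0.  1.5 1. ]
```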
 ### Scaling
@@ -129,7 +131,9 @@ plot_cube(ax, scaled_cube, label="Scaled", color="green")

 The output should look something like this:

-![output_scaling](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/scaling.png)
+<div class="flex justify-center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/scaling.png" alt="output_scaling"/>
+</div>
 ### Rotations
@@ -168,7 +172,9 @@ plot_cube(ax, rotated_cube, label="Rotated", color="orange")

 The output should look something like this:

-![output_rotation](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/rotation.png)
+<div class="flex justify-center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/rotation.png" alt="output_rotation"/>
+</div>
 - Rotation around the Y-axis:
@@ -178,16 +184,16 @@ We are sure you can use the example snippet above and figure out how to implemen

 - Rotation around the Z-axis

-$$ R_y(\beta) = \begin{pmatrix} \cos \beta & 0 & \sin \beta & 0 \\ 0 & 1 & 0 & 0 \\ -\sin \beta & 0 & \cos \beta & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} $$
+$$ R_z(\beta) = \begin{pmatrix} \cos \beta & -\sin \beta & 0 & 0 \\ \sin \beta & \cos \beta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} $$

 Again, can you use the last code snippet and implement a rotation around the Z-axis❓

 Note that the standard convention is that a positive rotation angle corresponds to an anti-clockwise rotation when the axis of rotation is pointing toward the viewer. Also note that in most libraries the cosine function requires the angle to be in radians. To convert from
-degrees to radians, multiply by \\(pi/180\\).
+degrees to radians, multiply by \\( \pi/180 \\).

 ### Combining transformations

-Multiple transformations can be combined by multiplying together their matrices. Note that the order that matrices are multiplied matters - with the matrices being applied right to left. To make a matrix that applies the transforms P, Q, and R, in that order, the composite transformation is given by \\(X = R @ Q @ P\\).
+Multiple transformations can be combined by multiplying their matrices together. Note that the order in which the matrices are multiplied matters: they are applied right to left. To make a matrix that applies the transforms P, Q, and R, in that order, the composite transformation is given by \\( X = R @ Q @ P \\).

 If we want to do the translation, then the rotation, and then the scaling that we did above in one operation, it looks as follows:

@@ -207,4 +213,6 @@ plot_cube(ax, final_result, label="Combined", color="violet")

 The output should look something like the following.

-![output_rotation](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/combined.png)
+<div class="flex justify-center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/combined.png" alt="output_combined"/>
+</div>
\ No newline at end of file
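As a sanity check of the two conventions mentioned above (this is only a sketch, not the chapter's reference solution to the exercise): a positive rotation about the Z-axis should move a point on the +X axis towards the +Y axis, and the angle must be converted to radians first.

```python
import numpy as np

def rotation_z(angle_deg: float) -> np.ndarray:
    """4x4 homogeneous rotation about the Z-axis (positive = anti-clockwise)."""
    a = np.deg2rad(angle_deg)  # np.cos/np.sin expect radians (angle * pi/180)
    return np.array([
        [np.cos(a), -np.sin(a), 0, 0],
        [np.sin(a),  np.cos(a), 0, 0],
        [0,          0,         1, 0],
        [0,          0,         0, 1],
    ])

p = np.array([1.0, 0.0, 0.0, 1.0])      # a point on the +X axis
print(np.round(rotation_z(90) @ p, 3))  # -> [0. 1. 0. 1.], i.e. rotated onto +Y

# Combining transforms: to apply P, then Q, then R to a point, use (R @ Q @ P) @ point.
```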
diff --git a/chapters/en/unit8/terminologies/representations.mdx b/chapters/en/unit8/terminologies/representations.mdx
index 237a21863..b7d4f8833 100644
--- a/chapters/en/unit8/terminologies/representations.mdx
+++ b/chapters/en/unit8/terminologies/representations.mdx
@@ -15,10 +15,10 @@ The Python `trimesh` package contains many useful functions for working with mes

 ## Volumetric Data

-Volumetric data is commonly used to encode information about transparent objects, such as clouds and fire. Fundamentally, it takes the form of a function $f(x,y,z)$ mapping positions in space to a density, color, and possibly other attributes. One simple method of representing such data is as a volumetric grid, where the data at each point is found by trilinear interpolation from the eight corners of the voxel containing it.
+Volumetric data is commonly used to encode information about transparent objects, such as clouds and fire. Fundamentally, it takes the form of a function \\( f(x,y,z) \\) mapping positions in space to a density, color, and possibly other attributes. One simple method of representing such data is as a volumetric grid, where the data at each point is found by trilinear interpolation from the eight corners of the voxel containing it.

 As will be seen later in the NeRF chapter, volumetric representations can also be effectively used to represent solid objects. More sophisticated representations can also be used, such as a small MLP, or complex hash-grids such as in InstantNGP.

 ## Implicit Surfaces

-Sometimes the flexibility of a volumetric representation is desirable, but the surface of the object itself is of interest. Implicit surfaces are like volumetric data, but where the function $f(x,y,z)$ maps each point in space to a single number, and where the surface is the zero of this function. For computational efficiency, it can be useful to require that this function is actually a signed distance function (SDF), where the function $f(x,y,z)$ indicates the sortest distance to the surface, with positive values outside the object and negative values inside (this sign is a convention and may vary). Maintaining this constraint is more difficult, but it allows intersections between straight lines and the surface to be calculated more quickly, using an algorithm known as sphere tracing.
\ No newline at end of file
+Sometimes the flexibility of a volumetric representation is desirable, but the surface of the object itself is of interest. Implicit surfaces are like volumetric data, but where the function \\( f(x,y,z) \\) maps each point in space to a single number, and where the surface is the zero level set of this function. For computational efficiency, it can be useful to require that this function is actually a signed distance function (SDF), where the function \\( f(x,y,z) \\) indicates the shortest distance to the surface, with positive values outside the object and negative values inside (this sign is a convention and may vary). Maintaining this constraint is more difficult, but it allows intersections between straight lines and the surface to be calculated more quickly, using an algorithm known as sphere tracing.
\ No newline at end of file
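Sphere tracing is straightforward to sketch for a simple SDF. The following hypothetical example (a unit sphere at the origin and a single ray) shows why the signed-distance property matters: at every step, the ray can safely advance by the distance returned by \\( f(x,y,z) \\) without passing through the surface.

```python
import numpy as np

def sdf_sphere(p: np.ndarray, radius: float = 1.0) -> float:
    """Signed distance to a sphere at the origin: negative inside, positive outside."""
    return np.linalg.norm(p) - radius

def sphere_trace(origin, direction, sdf, max_steps=100, eps=1e-4, max_dist=100.0):
    """March along the ray, each time stepping by the (safe) distance to the surface."""
    t = 0.0
    for _ in range(max_steps):
        p = origin + t * direction
        d = sdf(p)
        if d < eps:          # close enough: we hit the surface
            return t
        t += d               # the SDF guarantees the surface is at least this far away
        if t > max_dist:     # ray escaped the scene
            break
    return None

origin = np.array([0.0, 0.0, -3.0])
direction = np.array([0.0, 0.0, 1.0])   # unit vector pointing at the sphere
print(sphere_trace(origin, direction, sdf_sphere))  # ~2.0: the near surface of the sphere
```

Because each step uses the distance to the nearest surface, the march takes large steps through empty space and only slows down near geometry, which is exactly the speed-up the paragraph above refers to.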
diff --git a/chapters/en/unit9/intro_to_model_optimization.mdx b/chapters/en/unit9/intro_to_model_optimization.mdx
index 5e21693b8..2356de8d1 100644
--- a/chapters/en/unit9/intro_to_model_optimization.mdx
+++ b/chapters/en/unit9/intro_to_model_optimization.mdx
@@ -28,7 +28,7 @@ There are several techniques in the model optimization, which will be explained
 ## Trade-offs between accuracy, performance, and resource usage

 A trade-off exists between accuracy, performance, and resource usage when deploying a model. That's when we have to decide which aspect to prioritize so that the model delivers the most value in the case at hand.
-1. Accuracy is the model's ability to predict correctly. High accuracy is needed in all applications, which also causes higher performance and resource usage. Complex models with high accuracy usually require a lot of memory, so there will be limitations if they are deployed on resource-constrained devices. Complex models with high accuracy usually require a lot of memory, so there will be limitations if they are deployed on resource-constrained devices.
+1. Accuracy is the model's ability to predict correctly. High accuracy is desirable in most applications, but it usually comes with greater computational cost and resource usage. Complex models with high accuracy typically require a lot of memory, so there will be limitations if they are deployed on resource-constrained devices.
 2. Performance is the model's speed and efficiency (latency). This is important so the model can make predictions quickly, even in real time. However, optimizing performance will usually result in decreasing accuracy.
 3. Resource usage is the computational resources needed to perform inference on the model, such as CPU, memory, and storage. Efficient resource usage is crucial if we want to deploy models on devices with certain limitations, such as smartphones or IoT devices. A quick way to gauge latency and model size is sketched below.
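To make these trade-offs concrete, here is a small, hypothetical sketch (using PyTorch, with a toy model standing in for whatever network you actually plan to deploy) of how the latency and the memory footprint of the weights might be estimated before deployment:

```python
import time
import torch
import torch.nn as nn

# A toy model standing in for the network you actually want to deploy.
model = nn.Sequential(nn.Linear(3 * 224 * 224, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
x = torch.randn(1, 3 * 224 * 224)

# Resource usage: parameter count and (approximate) size of the weights in memory.
n_params = sum(p.numel() for p in model.parameters())
size_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6
print(f"{n_params:,} parameters, ~{size_mb:.1f} MB of weights")

# Performance: average single-sample inference latency.
with torch.no_grad():
    for _ in range(10):          # warm-up runs
        model(x)
    runs = 100
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    latency_ms = (time.perf_counter() - start) / runs * 1e3
print(f"average latency: {latency_ms:.2f} ms per sample")
```

Numbers like these, measured on the target hardware, are what the optimization techniques in the rest of this unit aim to improve without giving up too much accuracy.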