unit 8 fixes - latex stuff and broken link minor corrections #285

Merged · 1 commit · Apr 30, 2024
8 changes: 4 additions & 4 deletions chapters/en/unit8/3d-vision/nvs.mdx
@@ -32,10 +32,10 @@ image from https://alexyu.net/pixelnerf
The method first passes the input images through a convolutional neural network (ResNet34), bilinearly upsampling features from multiple layers to the same resolution as the input images.
As in a standard NeRF, the new view is generated by volume rendering.
However, the NeRF itself has a slightly unusual structure.
-At each query point $x$ in the rendered volume, the corresponding point in the input image(s) is found (by projecting it using the input image camera transformation $\pi$).
-The input image features at this point, $W(\pi x)$ are then found by bilinear interpolation.
-Like in the original NeRF, the query point $x$ is positionally encoded and concatentated with the viewing direction $d$.
-The NeRF network consists of a set of ResNet blocks; the input image features $W(\pi(x))$ pass through a linear layer, and are added to the features at the start of each of the first three residual blocks.
+At each query point \\( x \\) in the rendered volume, the corresponding point in the input image(s) is found (by projecting it using the input image camera transformation \\( \pi \\)).
+The input image features at this point, \\( W(\pi x) \\), are then found by bilinear interpolation.
+Like in the original NeRF, the query point \\( x \\) is positionally encoded and concatenated with the viewing direction \\( d \\).
+The NeRF network consists of a set of ResNet blocks; the input image features \\( W(\pi(x)) \\) pass through a linear layer, and are added to the features at the start of each of the first three residual blocks.
There are then two more residual blocks to further process these features, before an output layer reduces the number of channels to four (RGB+density).
When multiple input views are supplied, these are processed independently for the first three residual blocks, and then the features are averaged before the last two blocks.
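To make the projection-and-sampling step concrete, here is a minimal PyTorch sketch of conditioning on input-image features. It is an illustration, not the pixelNeRF implementation; the tensor names and shapes (`features`, `K`, `world2cam`) are assumptions.

```python
import torch
import torch.nn.functional as F

def sample_image_features(features, query_points, K, world2cam):
    """Project 3D query points into an input view and bilinearly sample its feature map.

    features:     (1, C, H, W) feature map W from the image encoder
    query_points: (N, 3) points x in world coordinates
    K:            (3, 3) camera intrinsics
    world2cam:    (4, 4) world-to-camera transform (pi)
    """
    N = query_points.shape[0]
    # Transform the points into the camera frame.
    homog = torch.cat([query_points, torch.ones(N, 1)], dim=-1)       # (N, 4)
    cam = (world2cam @ homog.T).T[:, :3]                              # (N, 3)
    # Perspective projection onto the image plane.
    uv = (K @ cam.T).T                                                # (N, 3)
    uv = uv[:, :2] / uv[:, 2:3]                                       # (N, 2) pixel coordinates
    # Normalise to [-1, 1] for grid_sample.
    H, W_ = features.shape[-2:]
    uv_norm = torch.stack([2 * uv[:, 0] / (W_ - 1) - 1,
                           2 * uv[:, 1] / (H - 1) - 1], dim=-1)
    grid = uv_norm.view(1, N, 1, 2)
    sampled = F.grid_sample(features, grid, align_corners=True)       # (1, C, N, 1)
    return sampled[0, :, :, 0].T                                      # (N, C) features W(pi(x))
```

In pixelNeRF, these per-point features are then injected into the residual blocks of the NeRF MLP, as described above.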

18 changes: 9 additions & 9 deletions chapters/en/unit8/nerf.mdx
@@ -20,13 +20,13 @@ Again, the model architecture is the same as for most NeRFs, but the authors int
[Zip-NeRF](https://jonbarron.info/zipnerf/), released in 2023, combines recent advancements like the encoding from [Instant-ngp](https://nvlabs.github.io/instant-ngp/) and the scene contraction from [Mipnerf-360](https://jonbarron.info/mipnerf360/) to handle real-world situations whilst decreasing training times to under an hour.
*(this is still measured on beefy GPUs to be fair)*

-Since the field of NeRFs is rapidly evolving, we added a section `sota` at the end where we will tease the latest research and the possible future direction of NeRFs.
+Since the field of NeRFs is rapidly evolving, we added a section at the end where we will tease the latest research and the possible future direction of NeRFs.

But now, enough with the history: let's dive into the inner workings of NeRFs! 🚀🚀

## Underlying approach (Vanilla NeRF) 📘🔍
-The fundamental idea behind NeRFs is to represent a scene as a continuous function that maps a position, $\mathbf{x} \in \mathbb{R}^{3}$, and a viewing direction, $\boldsymbol{\theta} \in \mathbb{R}^{2}$, to a colour $\mathbf{c} \in \mathbb{R}^{3}$ and volume density $\sigma \in \mathbb{R}^{1}$.
-As neural networks can serve as universal function approximators, we can approximate this continuous function that represents the scene with a simple Multi-Layer Perceptron (MLP) $F_{\mathrm{\Theta}} : (\mathbf{x}, \boldsymbol{\theta}) \to (\mathbf{c},\sigma)$.
+The fundamental idea behind NeRFs is to represent a scene as a continuous function that maps a position, \\( \mathbf{x} \in \mathbb{R}^{3} \\), and a viewing direction, \\( \boldsymbol{\theta} \in \mathbb{R}^{2} \\), to a colour \\( \mathbf{c} \in \mathbb{R}^{3} \\) and volume density \\( \sigma \in \mathbb{R}^{1} \\).
+As neural networks can serve as universal function approximators, we can approximate this continuous function that represents the scene with a simple Multi-Layer Perceptron (MLP) \\( F_{\mathrm{\Theta}} : (\mathbf{x}, \boldsymbol{\theta}) \to (\mathbf{c},\sigma) \\).
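As a rough sketch of such an MLP (simplified relative to the original architecture, which uses positional encoding, eight hidden layers and a skip connection; here the viewing direction is assumed to be passed as a 3D vector):

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Minimal MLP mapping (position, view direction) to (colour, density)."""

    def __init__(self, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        self.colour_head = nn.Sequential(
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, x, d):
        h = self.backbone(x)                       # features from the position x
        sigma = torch.relu(self.density_head(h))   # non-negative volume density
        c = self.colour_head(torch.cat([h, d], dim=-1))  # colour depends on view direction
        return c, sigma
```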

A simple NeRF pipeline can be summarized with the following picture:

@@ -51,9 +51,9 @@ What is important for the use case of NeRFs is that this step is **differentiabl

$$\mathbf{C}(\mathbf{r}) = \int_{t_n}^{t_f}T(t)\sigma(\mathbf{r}(t))\mathbf{c}(\mathbf{r}(t),\mathbf{d})dt$$

-In the equation above, $\mathbf{C}(\mathbf{r})$ is the expected colour of a camera ray $\mathbf{r}(t)=\mathbf{o}+t\mathbf{d}$, where $\mathbf{o} \in \mathbb{R}^{3}$ is the origin of the camera, $\boldsymbol{d} \in \mathbb{R}^{3}$ is the viewing direction as a 3D unit vector and $t \in \mathbb{R}_+$ is the distance along the ray.
-$t_n$ and $t_f$ stand for the near and far bounds of the ray, respectively.
-$T(t)$ denotes the accumulated transmittance along ray $\mathbf{r}(t)$ from $t_n$ to $t$.
+In the equation above, \\( \mathbf{C}(\mathbf{r}) \\) is the expected colour of a camera ray \\( \mathbf{r}(t)=\mathbf{o}+t\mathbf{d} \\), where \\( \mathbf{o} \in \mathbb{R}^{3} \\) is the origin of the camera, \\( \mathbf{d} \in \mathbb{R}^{3} \\) is the viewing direction as a 3D unit vector, and \\( t \in \mathbb{R}_+ \\) is the distance along the ray.
+\\( t_n \\) and \\( t_f \\) stand for the near and far bounds of the ray, respectively.
+\\( T(t) \\) denotes the accumulated transmittance along ray \\( \mathbf{r}(t) \\) from \\( t_n \\) to \\( t \\).
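In code, this rendering step is usually implemented with the discrete quadrature introduced just below. A minimal sketch, assuming per-ray tensors of densities, colours, and sample spacings:

```python
import torch

def render_ray(sigmas, colours, deltas):
    """Numerically approximate C(r) along one ray.

    sigmas:  (S,) densities at the S sampled points
    colours: (S, 3) RGB values at the sampled points
    deltas:  (S,) distances between consecutive samples
    """
    alpha = 1.0 - torch.exp(-sigmas * deltas)             # opacity of each segment
    # Accumulated transmittance T_i = prod_{j<i} (1 - alpha_j), with T_1 = 1.
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)
    trans = torch.cat([torch.ones(1), trans[:-1]])
    weights = trans * alpha                               # contribution of each sample
    return (weights.unsqueeze(-1) * colours).sum(dim=0)   # expected colour, shape (3,)
```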

After discretization, the equation above can be computed as the following sum:

@@ -70,7 +70,7 @@ Many NeRF approaches use a pixel-wise error term that can be written as follows:

$$\mathcal{L}_{\rm recon}(\boldsymbol{\hat{C}},\boldsymbol{C^*}) = \left\|\boldsymbol{\hat{C}}-\boldsymbol{C^*}\right\|^2$$

-,where $\boldsymbol{\hat{C}}$ is the rendered pixel colour and $\boldsymbol{C}^*$ is the ground truth pixel colour.
+where \\( \boldsymbol{\hat{C}} \\) is the rendered pixel colour and \\( \boldsymbol{C}^* \\) is the ground truth pixel colour.
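In code, this is simply a squared error averaged over a batch of rays; a minimal sketch with assumed shapes:

```python
import torch

def reconstruction_loss(rendered, target):
    """Pixel-wise squared error between rendered and ground-truth colours, each of shape (B, 3)."""
    return ((rendered - target) ** 2).sum(dim=-1).mean()
```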

**Additional remarks**

@@ -144,7 +144,7 @@ visualize_grid(grid, encoded_grid, resolution)

The output should look something like the image below:

-![encoding](https://huggingface.co/datasets/hf-vision/course-assets/blob/main/nerf_encodings.png)
+![encoding](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/nerf_encodings.png)

The second trick worth mentioning is that most methods use smart approaches to sample points in space.
Essentially, we want to avoid sampling in regions where the scene is empty.
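One common flavour of this idea is hierarchical (importance) sampling: a coarse pass produces weights along each ray, and additional samples are drawn from the distribution defined by those weights. A simplified sketch of the inverse-CDF step (shapes and epsilon values are arbitrary choices):

```python
import torch

def importance_sample(bin_edges, weights, n_samples):
    """Draw n_samples new t-values along a ray, proportional to per-bin weights.

    bin_edges: (S + 1,) boundaries of the coarse bins along the ray
    weights:   (S,) non-negative weights from the coarse pass
    """
    pdf = weights + 1e-5
    pdf = pdf / pdf.sum()
    cdf = torch.cat([torch.zeros(1), torch.cumsum(pdf, dim=0)])    # (S + 1,)
    u = torch.rand(n_samples)                                      # uniform samples in [0, 1)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, len(weights))
    # Linearly interpolate within the selected bin.
    cdf_lo, cdf_hi = cdf[idx - 1], cdf[idx]
    frac = (u - cdf_lo) / (cdf_hi - cdf_lo + 1e-10)
    return bin_edges[idx - 1] + frac * (bin_edges[idx] - bin_edges[idx - 1])
```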
@@ -155,7 +155,7 @@ If you are interested in the inner workings and optimisation of such a *proposal
To get the full experience when training your first NeRF, I recommend taking a look at the awesome [Google Colab notebook from the nerfstudio team](https://colab.research.google.com/github/nerfstudio-project/nerfstudio/blob/main/colab/demo.ipynb).
There, you can upload images of a scene of your choice and train a NeRF. You could, for example, fit a model to represent your living room. 🎉🎉

-## Current advancements in the field[[sota]]
+## Current advancements in the field
The field is rapidly evolving and the number of new publications is exploding.
Concerning training and rendering speed, [VR-NeRF](https://vr-nerf.github.io) and [SMERF](https://smerf-3d.github.io) show very promising results.
We believe that we will soon be able to stream a real-world scene in real-time on an edge device, and this is a huge leap towards a realistic *Metaverse*.
2 changes: 1 addition & 1 deletion chapters/en/unit8/terminologies/camera-models.mdx
@@ -27,7 +27,7 @@ f_x & 0 & c_x \\
\end{pmatrix}
$$

-In order to apply this to a point \\(p=[x,y,z]\\) to a point in 3D space, we multiply the point by the camera matrix $K @ p$ to give a new 3x1 vector \\([u,v,w]\\). This is a homogeneous vector in 2D, but where the last component isn't 1. To find the position of the point in the image plane we have to divide the first two coordinates by the last one, to give the point \\([u/w, v/w]\\).
+In order to apply this to a point \\( p=[x,y,z] \\) in 3D space, we multiply the point by the camera matrix, \\( K @ p \\), to give a new 3x1 vector \\( [u,v,w] \\). This is a homogeneous vector in 2D, but where the last component isn't 1. To find the position of the point in the image plane, we have to divide the first two coordinates by the last one, to give the point \\( [u/w, v/w] \\).
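A short numpy sketch of this projection (the intrinsics values here are made up for illustration):

```python
import numpy as np

def project(K, p):
    """Project a 3D point p (in camera coordinates) to pixel coordinates using intrinsics K."""
    uvw = K @ p                 # homogeneous 2D point [u, v, w]
    return uvw[:2] / uvw[2]     # divide by the last component

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
p = np.array([0.2, -0.1, 2.0])  # a point 2 m in front of the camera
print(project(K, p))            # [370. 215.]
```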

Whilst this is the textbook definition of the camera matrix, if we use the Blender camera convention it will flip the image left to right and up-down (as points in front of the camera will have negative z-values). One potential way to fix this is to change the signs of some of the elements of the camera matrix

6 changes: 3 additions & 3 deletions chapters/en/unit8/terminologies/linear-algebra.mdx
@@ -178,16 +178,16 @@ We are sure you can use the example snippet above and figure out how to implemen

- Rotation around the Z-axis

-$$ R_y(\beta) = \begin{pmatrix} \cos \beta & 0 & \sin \beta & 0 \\ 0 & 1 & 0 & 0 \\ -\sin \beta & 0 & \cos \beta & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} $$
+$$ R_z(\beta) = \begin{pmatrix} \cos \beta & -\sin \beta & 0 & 0 \\ \sin \beta & \cos \beta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} $$

Again, can you use the last code snippet and implement a rotation around the Z-axis❓

Note that the standard convention is that a positive rotation angle corresponds to an anti-clockwise rotation when the axis of rotation is pointing toward the viewer. Also note that in most libraries the cosine function requires the angle to be in radians. To convert from
-degrees to radians, multiply by \\(pi/180\\).
+degrees to radians, multiply by \\( \pi/180 \\).

### Combining transformations

-Multiple transformations can be combined by multiplying together their matrices. Note that the order that matrices are multiplied matters - with the matrices being applied right to left. To make a matrix that applies the transforms P, Q, and R, in that order, the composite transformation is given by \\(X = R @ Q @ P\\).
+Multiple transformations can be combined by multiplying their matrices together. Note that the order in which the matrices are multiplied matters, with the matrices being applied right to left. To make a matrix that applies the transforms P, Q, and R, in that order, the composite transformation is given by \\( X = R @ Q @ P \\).
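As a small sketch of why the order matters, consider a rotation and a translation applied in the two possible orders (hypothetical matrices, using the homogeneous 4x4 convention from above):

```python
import numpy as np

# A 90-degree rotation about the Z-axis and a translation by (1, 0, 0).
angle = np.pi / 2
R_rot = np.array([[np.cos(angle), -np.sin(angle), 0, 0],
                  [np.sin(angle),  np.cos(angle), 0, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]])
T = np.array([[1, 0, 0, 1],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)

p = np.array([0.0, 0.0, 0.0, 1.0])   # the origin, in homogeneous coordinates
print(R_rot @ T @ p)   # translate first, then rotate: approximately [0, 1, 0, 1]
print(T @ R_rot @ p)   # rotate first, then translate: [1, 0, 0, 1]
```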

If we want to first do the translation, then the rotation, and then the scaling that we did above, all in one operation, it looks as follows:

4 changes: 2 additions & 2 deletions chapters/en/unit8/terminologies/representations.mdx
@@ -15,10 +15,10 @@ The Python `trimesh` package contains many useful functions for working with mes

## Volumetric Data

-Volumetric data is commonly used to encode information about transparent objects, such as clouds and fire. Fundamentally, it takes the form of a function $f(x,y,z)$ mapping positions in space to a density, color, and possibly other attributes. One simple method of representing such data is as a volumetric grid, where the data at each point is found by trilinear interpolation from the eight corners of the voxel containing it.
+Volumetric data is commonly used to encode information about transparent objects, such as clouds and fire. Fundamentally, it takes the form of a function \\( f(x,y,z) \\) mapping positions in space to a density, color, and possibly other attributes. One simple method of representing such data is as a volumetric grid, where the data at each point is found by trilinear interpolation from the eight corners of the voxel containing it.
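A minimal numpy sketch of such a trilinear lookup (the dense `(X, Y, Z)` array layout is an assumption; real volumetric formats add bounds handling and metadata):

```python
import numpy as np

def trilinear_lookup(grid, x, y, z):
    """Interpolate a value at continuous voxel coordinates (x, y, z).

    grid: (X, Y, Z) array of densities (or any per-voxel attribute)
    """
    x0, y0, z0 = int(np.floor(x)), int(np.floor(y)), int(np.floor(z))
    dx, dy, dz = x - x0, y - y0, z - z0
    value = 0.0
    # Blend the eight corners of the voxel containing the point.
    for i, j, k in np.ndindex(2, 2, 2):
        weight = ((dx if i else 1 - dx) *
                  (dy if j else 1 - dy) *
                  (dz if k else 1 - dz))
        value += weight * grid[x0 + i, y0 + j, z0 + k]
    return value

grid = np.random.rand(4, 4, 4)
print(trilinear_lookup(grid, 1.5, 2.25, 0.75))
```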

As will be seen later in the NeRF chapter, volumetric representations can also be used effectively to represent solid objects. More sophisticated representations are also possible, such as a small MLP or the complex hash grids used in InstantNGP.

## Implicit Surfaces

-Sometimes the flexibility of a volumetric representation is desirable, but the surface of the object itself is of interest. Implicit surfaces are like volumetric data, but where the function $f(x,y,z)$ maps each point in space to a single number, and where the surface is the zero of this function. For computational efficiency, it can be useful to require that this function is actually a signed distance function (SDF), where the function $f(x,y,z)$ indicates the sortest distance to the surface, with positive values outside the object and negative values inside (this sign is a convention and may vary). Maintaining this constraint is more difficult, but it allows intersections between straight lines and the surface to be calculated more quickly, using an algorithm known as sphere tracing.
+Sometimes the flexibility of a volumetric representation is desirable, but the surface of the object itself is of interest. Implicit surfaces are like volumetric data, but where the function \\( f(x,y,z) \\) maps each point in space to a single number, and where the surface is the zero level set of this function. For computational efficiency, it can be useful to require that this function is actually a signed distance function (SDF), where the function \\( f(x,y,z) \\) indicates the shortest distance to the surface, with positive values outside the object and negative values inside (this sign is a convention and may vary). Maintaining this constraint is more difficult, but it allows intersections between straight lines and the surface to be calculated more quickly, using an algorithm known as sphere tracing.
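A hedged sketch of sphere tracing against an SDF, using a sphere SDF as a stand-in; the step limit and tolerance are arbitrary choices:

```python
import numpy as np

def sphere_sdf(p, centre=np.array([0.0, 0.0, 3.0]), radius=1.0):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(p - centre) - radius

def sphere_trace(origin, direction, sdf, max_steps=64, eps=1e-4, max_dist=100.0):
    """March along a ray, stepping by the SDF value, until we hit the surface or give up."""
    t = 0.0
    for _ in range(max_steps):
        d = sdf(origin + t * direction)
        if d < eps:
            return t            # hit: distance along the ray to the surface
        t += d                  # the SDF guarantees this step cannot overshoot the surface
        if t > max_dist:
            break
    return None                 # miss

hit = sphere_trace(np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]), sphere_sdf)
print(hit)                      # 2.0, the near surface of the sphere
```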