vision.tex

\part{Vision}

Notes on the usage of Transformers for vision tasks.

\section{Vision Transformers \label{sec_vit}}

The original application of the Transformers architecture \cite{dosovitskiy2021imageworth16x16words}
divides 2D images into patches of size $ P\times P $, e.g. flattening a three-channel $ i _{xyc} $
image to shape $ f _{ s d  } $ where $ d \in \left \{ 0, \ldots , P ^{ 2 }C-1 \right \} $ and the
effective sequence length runs over $ s \in \left \{ 0, L ^{ 2 }C /P ^{ 2 }-1 \right \} $, for an $
L\times L $ sized image\footnote{Example: for a $ 256\times 256 $, three-channel image with a $
16\times 16 $ patch size, the effective sequence length is 768.}. A linear projection converts the
effective hidden dimension here to match match the model's hidden dimension. These are known as
\textbf{Patch Embeddings}.

Since there is no notion of causality, no causal mask is needed. A special \pyinline{[CLS]} token is
prepended and used to generate the final representations $ z _{ bd } $ for a batch of images. This
can be used for classification, for instance, by adding a classification head.  The original
training objective was just that: standard classification tasks.


\section{CLIP \label{sec_clip}}

CLIP (Contrastive Language-Image Pre-Training) \cite{radford2021learningtransferablevisualmodels} is
a technique for generating semantically meaningful representations of images. The method is not
necessarily Transformers specific, but the typical implementations are based on this architecture.

The core of CLIP is its training objective. The dataset consists of image-caption pairs (which are
relatively easy to extract; a core motivation), the CLIP processes many such pairs and then tries to
predict which images match to which captions. This is thought to inject more semantic meaning into
the image embeddings as compared with, say, those generated from the standard classification task.

A typical implementation will use separate models for encoding the text and image inputs. The two
outputs are $ t _{ bd } $ and $ i _{ bd } $ shaped\footnote{There may also be another linear
projection from the actual model outputs to a common space, too. Obviously, this is also necessary
if the hidden dimensions of the two models differ.}, respectively, with batch and hidden dimensions,
and are canonically trained so that the similarity score between any two elements is a function of
their dot-product.

The original CLIP recipe:
\begin{enumerate}
    \item Process the text bracketed with \pyinline{[SOS]} and \pyinline{[EOS]} insertions, use a
        normal Transformer architecture\footnote{The original CLIP paper keeps the causal mask.},
        and extract the last output from the \pyinline{[EOS]} token as the text embedding: $ i _{ bd
        }= z _{ bsd }\big|_{ s=-1 } $.
    \item Process the image with a vision transformer network.
    \item Project to a common dimensionality space, if needed.
    \item Compute the logits through cosine similarity: $ \ell _{ b b' } = i _{ bd }t _{ b'd }/ |i _{ b }||t _{ b }| $. These are used to
        define both possible conditional probabilities\footnote{They differ by what is summed over
        the in the denominator, i.e., which dimension the \pyinline{Softmax} is over.}:
        \begin{align}
         P(i_b|t _{ b' }) =  \frac{ e ^{ \ell _{b b'} } }{ \sum _{ b  } e ^{ \ell _{b b'} } }  \ ,
         \quad P(t _{ b' }| i _{ b }) =  \frac{ e ^{ \ell _{b b'} } }{ \sum _{ b' }  e ^{ \ell _{b b'} } }
        \end{align}
    \item Compute the cross-entropy losses in both directions and average:
        \begin{align}
           \Lcal  &= \frac{ 1 }{ 2B }\sum _{ b } \left (\ln P \left ( i_b|t_b \right ) + \ln P \left ( t_b|i_b \right )\right ) \label{eq_clip_loss} \ .
        \end{align}
\end{enumerate}
They also add a temperature to the loss, which they also train.

Post-training, the CLIP models can be used in many ways:
\begin{enumerate}
    \item Using the vision model as a general purpose feature extractor. This is how many
        vision-language models work: the CLIP image embeddings form part of the VLM inputs.
    \item Classification works by comparing the logits for a given image across embedded sentences
        of the form \texttt{This is an image of a <CLASS HERE>.}
\end{enumerate}