Adopting your notation in figure 5(d): you initialize the final linear layer W with the text embeddings V, and you also keep the visual projection W_v. When you do LP (linear probing), do you train both W and W_v, or just W? From your code, it looks like you set requires_grad to False for visual.proj, so I guess you only trained W.
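To make sure I'm reading the setup correctly, here is a minimal PyTorch sketch of what I believe the code does (shapes and names are hypothetical stand-ins, not taken from the repo):

```python
import torch

# Hypothetical dimensions: frozen-feature dim, embedding dim, number of classes.
d_vis, d_emb, n_cls = 512, 256, 10

W_v = torch.nn.Linear(d_vis, d_emb, bias=False)  # visual projection (visual.proj)
W = torch.nn.Linear(d_emb, n_cls, bias=False)    # final linear layer

with torch.no_grad():
    # Stand-in for the language init: copy text embeddings V into W.
    W.weight.copy_(torch.randn(n_cls, d_emb))

# Freeze the visual projection, as visual.proj appears to be in the repo.
W_v.weight.requires_grad_(False)

# Only W's weight survives the trainable-parameter filter.
params = [p for p in [W_v.weight, W.weight] if p.requires_grad]
```

If this matches your code, then LP only optimizes W, which is the premise of my next point.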
In section 5.1, you reach the conclusion that "with the proposed language-init method, one can ensure that few-shot performance is always better than zero-shot". This is not precise: if you only train W, the optimization problem is convex, so an optimizer run to convergence reaches essentially the same solution regardless of initialization, and the initialization should not matter (up to the optimizer's implicit bias). I therefore suspect that LP beating zero-shot in figure 6 comes from the implicit regularization of the optimizer, e.g. early stopping, which acts like an L2 penalty pulling the solution toward the initialization. The CLIP paper solved LP with L-BFGS run to convergence, so initialization really didn't matter for them.
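A minimal NumPy sketch of the point, using a synthetic toy problem (the data, dimensions, and the "text init" here are all made up for illustration): with an explicit L2 term the logistic-loss linear probe is strongly convex, so gradient descent run to convergence forgets the initialization, while an early-stopped run does not.

```python
import numpy as np

rng = np.random.default_rng(0)
# Tiny synthetic "few-shot" problem: frozen features X, binary labels y.
X = rng.normal(size=(40, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
lam = 0.1  # explicit L2 weight decay -> strongly convex, unique minimizer

def grad(W):
    # Gradient of the L2-regularized logistic loss for a linear probe.
    p = 1.0 / (1.0 + np.exp(-X @ W))
    return X.T @ (p - y) / len(y) + lam * W

def train(W0, steps, lr=0.5):
    W = W0.copy()
    for _ in range(steps):
        W -= lr * grad(W)
    return W

W_zero = np.zeros(8)         # naive init
W_text = rng.normal(size=8)  # stand-in for a language (text-embedding) init

# Run to convergence: both inits reach the same unique minimizer.
W_a = train(W_zero, steps=20000)
W_b = train(W_text, steps=20000)
print(np.linalg.norm(W_a - W_b))  # ~0: the initialization is forgotten

# Early-stopped: the solution is still pulled toward its initialization.
W_c = train(W_text, steps=5)
print(np.linalg.norm(W_c - W_a))
```

With early stopping the language init plausibly helps; once the convex problem is solved exactly (as with L-BFGS), it cannot.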