arXiv
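- Motivation: detect objects that have no annotations.
- Method: integrate semantic attributes and visual features jointly during training.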
- Architecture
  - feature extraction module
  - object localization module
  - semantic prediction module
    - For each testing bounding box proposal we can obtain its semantic representation.
    - As ZS-YOLO is trained end-to-end, the loss of this layer will back propagate so that the learned visual representations will also be influenced by similarities in the semantic domain.
  - objectiveness confidence prediction module
    - Uses semantic, visual, and location information together to classify.
- Loss
  - Object Localization Loss
  - Semantic Loss: learns a cosine similarity
    $$ L_{attr}=\sum_k[\lambda_{obj}\mathbb I_k^{obj}(S(\hat y_k,y_k)-1)^2+\lambda_{noobj}\mathbb I_k^{noobj}(\max_{c\in C_{seen}} S(\hat y_k,y_c)-0)^2] $$
    - $S(\vec{a},\vec{b})=\frac{\vec{a}\cdot \vec{b}}{\|\vec{a}\|\,\|\vec{b}\|}$ is the cosine similarity.
    - $\lambda_{obj}=5,\ \lambda_{noobj}=1$ compensate for the foreground/background imbalance.
    - $\mathbb I_k^{noobj}=1$ if and only if the $j$-th anchor has zero overlap with the gt box.
    - $\mathbb I_k^{obj}=1$ if and only if all of the following hold:
      - the $k$-th box is predicted by the $j$-th anchor;
      - the center of the gt box falls into the $j$-th anchor (falls into cell $j$);
      - among the 5 boxes predicted by the $j$-th anchor, the $k$-th box has the largest overlap with the gt box.
  - Confidence Loss
The key component is the semantic prediction module: in effect it learns a cosine similarity whose ground-truth target is 1 for objects and 0 for background.
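
The semantic loss $L_{attr}$ can be sketched in NumPy as follows. This is only an illustrative sketch, not the authors' implementation: the function names (`cosine_matrix`, `semantic_loss`), the array layout (one row per box), and the assumption that the indicator masks $\mathbb I_k^{obj}$ and $\mathbb I_k^{noobj}$ are precomputed as 0/1 vectors are all assumptions made for the example. Only the weights $\lambda_{obj}=5$, $\lambda_{noobj}=1$ and the form of the two terms come from the notes.

```python
import numpy as np

LAMBDA_OBJ = 5.0    # weight of the object term (from the notes)
LAMBDA_NOOBJ = 1.0  # weight of the background term (from the notes)


def _normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm; all-zero rows stay zero."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def cosine_matrix(a, b):
    """Pairwise cosine similarity between rows of a (K, D) and rows of b (C, D)."""
    return _normalize(a) @ _normalize(b).T


def semantic_loss(y_hat, y_true, seen_attrs, obj_mask, noobj_mask):
    """Hypothetical sketch of L_attr.

    y_hat      : (K, D) predicted semantic vectors, one per box
    y_true     : (K, D) ground-truth attribute vectors (rows with obj_mask == 0 are ignored)
    seen_attrs : (C, D) attribute vectors of the seen classes
    obj_mask   : (K,) 0/1 indicator I_k^obj, assumed precomputed
    noobj_mask : (K,) 0/1 indicator I_k^noobj, assumed precomputed
    """
    # S(y_hat_k, y_k): similarity of each box with its own ground truth, target 1.
    sim_gt = np.sum(_normalize(y_hat) * _normalize(y_true), axis=1)
    # max over seen classes of S(y_hat_k, y_c): pushed towards 0 for background boxes.
    sim_seen_max = cosine_matrix(y_hat, seen_attrs).max(axis=1)

    obj_term = LAMBDA_OBJ * obj_mask * (sim_gt - 1.0) ** 2
    noobj_term = LAMBDA_NOOBJ * noobj_mask * (sim_seen_max - 0.0) ** 2
    return float(np.sum(obj_term + noobj_term))
```

With this layout, a box whose prediction matches its ground-truth attribute direction contributes nothing to the object term, while a background box is penalized in proportion to its squared similarity with the closest seen class.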