PR-325: Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
paper link: https://arxiv.org/abs/2004.00849
youtube link: https://youtu.be/Kgh88DLHHTo
3. 1. Research Background
Pre-training mechanism for cross-modality tasks
• The combination of self-attention and self-supervision makes it possible to learn context from massive amounts of data without labels, and this approach performs well not only in language and vision individually but also in cross-modality domains (Visual Question Answering, image captioning, image retrieval).
[Figure: examples of Visual Question Answering (VQA) and text-to-image retrieval]
Park, Gwangbeen, and Woobin Im. arXiv:1612.08354
4. 1. Research Background
Cross-modality learning in vision and language
• Pre-trained CNN features, originally trained for visual classification, are used as the visual representation
Ben-Younes, Hedi, et al. "Mutan: Multimodal
tucker fusion for visual question answering."
ICCV. (2017)
Timeline: Visual Genome Dataset released → CNN features used as the visual representation → Faster R-CNN adopted as the visual backbone → dense connections between the two modalities learned with a Transformer
5. 1. Research Background
Cross-modality learning in vision and language
https://visualgenome.org/
Krishna, Ranjay, et al. "Visual genome: Connecting
language and vision using crowdsourced dense image
annotations." IJCV. (2017)
6. 1. Research Background
Cross-modality learning in vision and language
* PR-012 (by JinWon Lee)
Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn:
Towards real-time object detection with region
proposal networks. NIPS. (2015)
Anderson, Peter, et al. "Bottom-up and top-down attention for image captioning and
visual question answering." CVPR. (2018)
Faster R-CNN
7. 1. Research Background
Cross-modality learning in vision and language
Through self-attention, the embedding of each token in the input sentence is computed while taking all the other tokens into account (see the sketch below)
https://jalammar.github.io/illustrated-transformer/
Li, Xiujun, et al. "Oscar: Object-semantics aligned pre-training for vision-language
tasks." ECCV. (2020)
• The Transformer's self-attention and feed-forward network form dense intra-domain and inter-domain connections
Self-attention in Transformer
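Below is a minimal sketch of the scaled dot-product self-attention referenced above, where every token embedding is recomputed as a weighted combination of all tokens; the shapes and projection matrices are illustrative placeholders, not values from any of the cited models.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model) token embeddings; w_*: (d_model, d_head) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                     # queries, keys, values per token
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # pairwise token affinities
    attn = F.softmax(scores, dim=-1)                         # each token attends to every token
    return attn @ v                                          # context-aware token embeddings

# Toy usage: 5 tokens, d_model = d_head = 8
x = torch.randn(5, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)                       # (5, 8)
```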
8. 1. Research Background
• Two-stream neural network approach
• Visual and language information are first processed in two separate streams, which are then fused through Transformer layers
• ViLBERT
• Single-stream neural network approach
• Sentence embedding features and bounding box features are concatenated, and a bi-directional joint distribution is learned with BERT
Lu, J., et al. "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks." NeurIPS (2019)
Get better joint representation for vision and language tasks
9. 1. Research Background
• Two-stream neural network approach
• Visual and language information are first processed in two separate streams, which are then fused through Transformer layers
• Single-stream neural network approach
• Sentence embedding features and bounding box features are concatenated, and a bi-directional joint distribution is learned with BERT (sketched below)
• VL-BERT
Get better joint representation for vision and language tasks
Su, Weijie, et al. "VL-BERT: Pre-training of generic visual-linguistic representations." ICLR (2020)
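As a rough illustration of the single-stream idea above, the sketch below only builds the joint input sequence: detector region features are projected to the text embedding size and concatenated with the word embeddings before being processed by one BERT-style encoder. The dimensions and the `visual_proj` layer are assumptions for illustration, not VL-BERT's actual implementation.

```python
import torch
import torch.nn as nn

visual_proj = nn.Linear(2048, 768)         # project 2048-dim region features to BERT's hidden size

text_emb = torch.randn(1, 20, 768)         # [CLS] w1 ... wN [SEP] token embeddings (placeholder)
region_feats = torch.randn(1, 36, 2048)    # bounding-box features from an object detector

# Single stream: one concatenated sequence is processed by a single BERT-style Transformer.
joint_input = torch.cat([text_emb, visual_proj(region_feats)], dim=1)   # (1, 56, 768)
```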
10. 1. Research Background
Limitations of methods using region-based visual feature extractors
• The visual features are extracted for a specific task (object detection), so the argument is that they cannot capture image features rich enough for language understanding
• Loss of visual information: object shapes, relational information where objects overlap, and information carried by the background or the overall mood of the image
11. 1. Research Background
Objective & Approach
• We step out of the bounding box to make full use of the visual information in images for vision-and-language learning.
• We propose Pixel-BERT that learns to align image pixels with text to build a more thorough
semantic embedding between visual and textual information.
• Word-level token embedding based on BERT
• CNN that takes image pixels as input for visual embedding learning
• Multi-modal Transformer for joint learning
• Downstream tasks (VQA, retrieval, …)
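The sketch below strings these components together end to end, assuming a ResNet-50 backbone, a linear projection of the 2048-dim feature pixels, and a generic Transformer encoder; it is a rough approximation of the pipeline above, not the released Pixel-BERT code.

```python
import torch
import torch.nn as nn
import torchvision

d_model = 768
# CNN that takes raw pixels as input (ImageNet-pre-trained ResNet-50 without avgpool/fc)
backbone = nn.Sequential(*list(torchvision.models.resnet50(weights="IMAGENET1K_V1").children())[:-2])
visual_proj = nn.Linear(2048, d_model)                    # feature pixels -> embedding size
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True), num_layers=12)

image = torch.randn(1, 3, 224, 224)
feat_map = backbone(image)                                # (1, 2048, 7, 7) spatial feature map
pixel_tokens = feat_map.flatten(2).transpose(1, 2)        # (1, 49, 2048): one token per spatial position
text_emb = torch.randn(1, 16, d_model)                    # BERT word-level token embeddings (placeholder)
joint = torch.cat([text_emb, visual_proj(pixel_tokens)], dim=1)
joint_repr = encoder(joint)                               # fed to the downstream task head (VQA, retrieval, ...)
```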
14. 2. Methods
Architectures
① We learn from pixels to represent an image instead of using bounding boxes.
② Feature pixels are randomly sampled (100 features) during pre-training, for robustness and to reduce computation cost.
③ The Transformer's self-attention and feed-forward network form dense intra-domain and inter-domain connections.
[Architecture figure: the sentence is tokenized into BERT word-level embeddings; the image is encoded by a ResNet-50 or ResNeXt-152 backbone pre-trained on ImageNet (①); 100 feature pixels are randomly sampled (②); both embeddings are fed into the multi-modal Transformer (③)]
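A small sketch of the random pixel sampling step (②): during pre-training, 100 spatial positions of the CNN feature map are kept per image. The indexing details below are my own; the paper only specifies that 100 feature pixels are sampled.

```python
import torch

def sample_feature_pixels(feat_map, num_keep=100):
    """feat_map: (B, C, H, W) CNN output. Keep `num_keep` random spatial positions per image."""
    b, c, h, w = feat_map.shape
    pixels = feat_map.flatten(2).transpose(1, 2)                              # (B, H*W, C) feature pixels
    idx = torch.stack([torch.randperm(h * w)[:num_keep] for _ in range(b)])   # per-image random sample
    return pixels[torch.arange(b).unsqueeze(1), idx]                          # (B, num_keep, C)

sampled = sample_feature_pixels(torch.randn(2, 2048, 12, 19))                 # e.g. 228 positions -> 100 kept
```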
15. 2. Methods
Pre-trained with MLM and ITM tasks
• Pre-training datasets
- MS-COCO captions (Chen, Xinlei, et al. arXiv:1504.00325, 2015)
- Visual Genome (VG)
• Masked Language Modeling (MLM)
- Visual features are used to predict the masked tokens, which encourages a mapping between the two modalities
- Language tokens are randomly masked with a probability of 0.15
- Also used in UNITER (ECCV 2020), VL-BERT (ICLR 2020), and LXMERT (EMNLP 2019)
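A minimal sketch of the MLM corruption step described above (mask tokens with probability 0.15). BERT's 80/10/10 replacement split is omitted, and the token ids and [MASK] id are placeholders.

```python
import torch

def mask_tokens(token_ids, mask_token_id, mask_prob=0.15, ignore_index=-100):
    """Randomly mask language tokens; labels keep the original ids only at masked positions."""
    token_ids = token_ids.clone()
    masked = torch.rand(token_ids.shape) < mask_prob                # Bernoulli(0.15) per token
    labels = torch.where(masked, token_ids, torch.full_like(token_ids, ignore_index))
    token_ids[masked] = mask_token_id                               # replace with [MASK]
    return token_ids, labels    # the model predicts `labels` from the remaining text + visual features

inputs, labels = mask_tokens(torch.randint(1000, 30000, (1, 16)), mask_token_id=103)
```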
16. 2. Methods
Pre-trained with MLM and ITM tasks
• Image-Text Matching (ITM)
- A binary classification task: predict whether the sentence fed into the Transformer actually describes the image it is paired with
- An equal number of negative (unmatched image-sentence pairs) and positive samples is used
[Figure: example pairs - "The man at bat readies to swing at the pitch while the umpire looks on." (True) vs. "A man taking a picture behind the girl" (False)]
• Pre-training setup (see the optimizer sketch below)
- For the CNN: SGD with learning rate 1e-2 and weight decay 5e-4
- For the Transformer: AdamW with learning rate 1e-4 and weight decay 1e-2
- Pre-training: 64 NVIDIA Tesla V100 GPUs with a batch size of 4096 samples for 40 epochs
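A rough sketch of assembling ITM training pairs with the same number of positives and negatives. Pairing each image with a caption drawn from a different image is one simple way to form unmatched pairs; that choice is an assumption here, not necessarily the paper's exact sampling procedure.

```python
import random

def build_itm_pairs(image_caption_pairs):
    """image_caption_pairs: list of (image, caption). Returns (image, caption, label) triples."""
    pairs = []
    n = len(image_caption_pairs)
    for i, (image, caption) in enumerate(image_caption_pairs):
        pairs.append((image, caption, 1))                     # matched pair -> label 1 (True)
        j = random.choice([k for k in range(n) if k != i])    # caption taken from another image
        pairs.append((image, image_caption_pairs[j][1], 0))   # unmatched pair -> label 0 (False)
    return pairs
```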
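The pre-training optimizer setup listed above maps directly onto two standard PyTorch optimizers; `cnn_backbone` and `transformer` below are stand-in modules, not the actual Pixel-BERT components.

```python
import torch
import torch.nn as nn

cnn_backbone = nn.Conv2d(3, 2048, kernel_size=7)    # stand-in for the ResNet/ResNeXt visual backbone
transformer = nn.Linear(768, 768)                    # stand-in for the multi-modal Transformer

cnn_opt = torch.optim.SGD(cnn_backbone.parameters(), lr=1e-2, weight_decay=5e-4)
transformer_opt = torch.optim.AdamW(transformer.parameters(), lr=1e-4, weight_decay=1e-2)
```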
19. 3. Experimental Results
Downstream task - Visual Question Answering (VQA)
[Figure: [CLS] Question [SEP] Image is fed into the pre-trained Pixel-BERT with a prediction head on top]
• Fine-tuned for 18 epochs on 16 NVIDIA Tesla V100 GPUs with batch size 256
• Learning rate decay: by a factor of 10 at the 12th and 16th epochs
• Learning the image representation at the pixel level helps improve performance
[Table: VQA comparison - rows grouped as "CNN feature, no transformer", "Faster R-CNN, no transformer", and "Faster R-CNN, transformer, pretrained model", plus Pixel-BERT with ResNet-50 and ResNeXt-152 backbones]
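The fine-tuning schedule above (decay by a factor of 10 at the 12th and 16th of 18 epochs) can be expressed with a standard PyTorch scheduler; the model head and optimizer below are placeholders.

```python
import torch
import torch.nn as nn

head = nn.Linear(768, 10)                # placeholder fine-tuning head
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[12, 16], gamma=0.1)

for epoch in range(18):
    # ... per-batch forward/backward and optimizer.step() omitted ...
    scheduler.step()                     # lr drops by 10x after the 12th and 16th epochs
```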
20. 3. Experimental Results
Downstream task - Natural Language for Visual Reasoning for Real (NLVR2)
https://lil.nlp.cornell.edu/nlvr/
[Figure: [CLS] Q [SEP] Image1 and [CLS] Q [SEP] Image2 are each fed through Pixel-BERT; the two [CLS] outputs p1_cls and p2_cls are concatenated and classified (0 / 1) with a cross-entropy loss]
• Fine-tuned for 18 epochs on 16 NVIDIA Tesla V100 GPUs with batch size 128
• Learning rate decay: by a factor of 10 at the 12th and 16th epochs
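A rough sketch of the NLVR2 head drawn above: the question is paired with each image, both pairs go through the (shared) pre-trained Pixel-BERT, the two [CLS] outputs are concatenated, and a classifier is trained with cross-entropy. `pixel_bert` is a placeholder callable that returns a [CLS] embedding.

```python
import torch
import torch.nn as nn

d_model = 768
classifier = nn.Linear(2 * d_model, 2)       # predicts 0 / 1 from the concatenated [CLS] pair
criterion = nn.CrossEntropyLoss()

def pixel_bert(question, image):
    """Placeholder for the pre-trained Pixel-BERT; returns a (batch, d_model) [CLS] embedding."""
    return torch.randn(question.shape[0], d_model)

def nlvr2_loss(question, image1, image2, label):
    p1_cls = pixel_bert(question, image1)     # [CLS] Q [SEP] Image1
    p2_cls = pixel_bert(question, image2)     # [CLS] Q [SEP] Image2
    logits = classifier(torch.cat([p1_cls, p2_cls], dim=-1))   # Concat. -> classifier
    return criterion(logits, label)           # cross-entropy loss

loss = nlvr2_loss(torch.zeros(4, 16), torch.zeros(4), torch.zeros(4), torch.randint(2, (4,)))
```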
21. 3. Experimental Results
• Pixel-BERT also performs well on the NLVR2 task, which takes two images as input
Downstream task - Natural Language for Visual Reasoning for Real (NLVR2)
22. 3. Experimental Results
Downstream task - Image-Text Retrieval
• The query (text) and an image are fed into Pixel-BERT to compute a relevance score, and the top-scoring results are returned
• Compared with Unicoder-VL and UNITER, Pixel-BERT shows a meaningful improvement on every subtask
• The IR subtask requires understanding the global description of an image, and here Pixel-BERT benefits from learning attention directly between image pixels and language
1K testing results on Flickr30K; 5-fold 1K and 5K testing results on MS-COCO (12-layer Transformer)
TR: Image-to-text retrieval - given an image, retrieve the matching text
IR: Text-to-image retrieval - given a text, retrieve the matching image
R@k: Recall at k - whether the relevant item is included among the model's top-k retrieved results
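A minimal sketch of the R@k metric defined above: the fraction of queries whose relevant item appears among the top-k results ranked by relevance score. The score matrix here is random; in practice it would come from Pixel-BERT's image-text relevance scores.

```python
import torch

def recall_at_k(scores, gt_index, k):
    """scores: (num_queries, num_items) relevance scores; gt_index: relevant item index per query."""
    topk = scores.topk(k, dim=1).indices                    # indices of the k highest-scoring items
    hits = (topk == gt_index.unsqueeze(1)).any(dim=1)       # relevant item retrieved in the top k?
    return hits.float().mean().item()

scores = torch.randn(1000, 1000)                            # e.g. 1K text queries vs. 1K images
print(recall_at_k(scores, torch.arange(1000), k=10))        # R@10
```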
23. 3. Experimental Results
Ablation Study
• Confirms that ideas such as the pre-training tasks and random pixel sampling contribute to improving subtask performance
1) ITM and MLM pre-training tasks
2) Random pixel sampling method
3) Using a stronger visual backbone
• Using a stronger visual backbone model is expected to further improve Pixel-BERT's performance
24. 3. Experimental Results
Visualization
• Pixel-BERT can learn region-level visual representations well through cross-modality learning
• It captures regions that were difficult to represent with bounding-box-based models
26. 4. Conclusions
1. Proposed Pixel-BERT, which combines a CNN-based visual encoder with a multi-modal Transformer
2. Built a pre-training model based on Pixel-BERT and verified its performance on downstream tasks
• Masked language modeling and image-text matching are the two tasks designed for pre-training
• Evaluated on four downstream vision-and-language tasks, achieving the best performance on most of them
3. Proposed a random pixel sampling mechanism for robustness
4. Future work
• Pre-train Pixel-BERT on the Conceptual Captions dataset
• Investigate whether self-supervised tasks can be applied to Pixel-BERT
29. 4. Conclusions
Inefficiency of pixel feature embedding (Pixel-BERT vs. Oscar)
• Pixel-BERT: 100 randomly selected feature pixels, each a 2048-dimensional vector
• Oscar: a region feature is a P-dimensional vector (P = 2048) and a region position is an R-dimensional vector (R = 4 or 6)
• Oscar: single Tesla P100 (16GB)