PR-325
Sunghoon Joo, Samsung SDS
2021. 06. 13.
1. Research Background
Pre-training mechanism for cross-modality tasks
• Combining self-attention with self-supervision makes it possible to learn context from very large amounts of unlabeled data. This has brought strong performance not only in language and vision individually but also in cross-modal tasks (Visual Question Answering, image captioning, image retrieval).
Figure examples: Visual Question Answering (VQA); text-to-image retrieval
Park, Gwangbeen, and Woobin Im. arXiv:1612.08354
Cross-modality learning in vision and language
Timeline: CNN features used → Visual Genome dataset released → Faster R-CNN used as the visual backbone → Transformers learn dense connections between the two modalities
• Pre-trained CNN features for visual classification are used as the visual representation
Ben-Younes, Hedi, et al. "Mutan: Multimodal
tucker fusion for visual question answering."
ICCV. (2017)
Visual Genome dataset released
https://visualgenome.org/
Krishna, Ranjay, et al. "Visual genome: Connecting
language and vision using crowdsourced dense image
annotations." IJCV. (2017)
Faster R-CNN used as the visual backbone
* PR-012 (By JinWon lee)
Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn:
Towards real-time object detection with region
proposal networks. NIPS. (2015)
Anderson, Peter, et al. "Bottom-up and top-down attention for image captioning and
visual question answering." CVPR. (2018)
Transformers learn dense connections between the two modalities
Self-attention in Transformer
• With self-attention, the embedding of each token in the input sentence is computed by taking the other tokens into account
https://jalammar.github.io/illustrated-transformer/
• The Transformer's self-attention and feed-forward networks form dense intra-domain and inter-domain connections
Li, Xiujun, et al. "Oscar: Object-semantics aligned pre-training for vision-language
tasks." ECCV. (2020)
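The self-attention idea on this slide can be illustrated with a minimal pure-Python sketch. It assumes a single head and identity query/key/value projections (real Transformers learn these projections), and the toy joint sequence of "text" and "visual" embeddings is hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections.

    Assumption for brevity: no learned projections, no multi-head split.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product score of this token against every token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # New embedding = attention-weighted sum of all token embeddings.
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

# Hypothetical joint sequence: two "text" embeddings followed by two "visual" embeddings.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
contextual = self_attention(sequence)
```

Because every position attends to every other position, a text token's output mixes in visual embeddings (inter-domain) as well as other text embeddings (intra-domain).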
Get better joint representation for vision and language tasks
• Two-stream neural network approach
- Visual and language information are processed by separate streams, which are then fused through Transformer layers
- ViLBERT
Lu, J. et al., Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. NIPS (2019)
• Single-stream neural network approach
- Sentence embedding features and bounding box features are combined, and BERT then learns a bi-directional joint distribution over them
- VL-BERT
Su, Weijie, et al. "Vl-bert: Pre-training of generic visual-linguistic representations." ICLR 2020.
Limitations of methods that use region-based visual feature extractors
• The visual features are extracted for one specific task, object detection, so the authors argue that they cannot capture enough image information for language understanding
• Loss of visual information: object shapes, relational information where objects overlap, and information conveyed by the background or the overall mood of the image
Objective & Approach
• We step out of the bounding box to make the full power of visual information in images for
vision and language learning.
• We propose Pixel-BERT that learns to align image pixels with text to build a more thorough
semantic embedding between visual and textual information.
Architecture overview: word-level token embedding based on BERT; CNN that takes image pixels as input for visual embedding learning; multi-modal transformers for joint learning; task (VQA, retrieval, …)
2. Methods
Approach
1) Pre-training Pixel-BERT
2) Training for downstream tasks
Architectures
① An image is represented by learning from pixels instead of using bounding boxes.
② Feature pixels are randomly sampled (100 features) during pre-training, for robustness and lower computation cost.
③ The Transformer's self-attention and feed-forward networks form dense intra-domain and inter-domain connections.
Figure labels: Tokenize (text input); ResNet-50, ResNeXt-152, pre-trained on ImageNet (visual backbone); 100 features
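Step ② can be sketched as sampling a fixed number of feature vectors from the flattened CNN feature map. This is only an illustration under assumptions: the function name, the 20x20 grid, and the 4-dimensional features are hypothetical, and the slides do not say how maps with fewer than 100 feature pixels are handled (here they are kept whole).

```python
import random

def sample_feature_pixels(features, k=100, rng=random):
    """Return k feature vectors sampled without replacement.

    Assumption: if the map has k or fewer feature pixels, keep all of them.
    """
    if len(features) <= k:
        return list(features)
    return rng.sample(features, k)

# Hypothetical CNN output: a 20x20 grid flattened to 400 feature pixels of dim 4.
grid = [[float(i)] * 4 for i in range(400)]
sampled = sample_feature_pixels(grid, k=100, rng=random.Random(0))
```

Sampling shortens the visual part of the Transformer input from 400 to 100 elements in this toy case, which is where the computation saving comes from.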
Pre-trained by MLM and ITM tasks
• Masked Language Modeling (MLM)
- Visual features are used to predict the masked tokens, which encourages a mapping between the two modalities
- Language tokens are randomly masked with a probability of 0.15
- The same method is applied in UNITER (2020, ECCV), VL-BERT (2020, ICLR), and LXMERT (2019, EMNLP)
Pre-training datasets
• MS-COCO captions
Chen, Xinlei, et al. arXiv:1504.00325 (2015).
• Visual Genome (VG)
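The masking step of MLM can be sketched as follows. This is a simplified illustration: it only covers the "mask with probability 0.15" rule from the slide and, as an assumption, always substitutes a `[MASK]` token (BERT-style variants that sometimes keep or randomize the token are omitted). The function name is hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, prob=0.15, rng=random):
    """Replace each token with [MASK] with probability `prob`.

    Returns the masked sequence and a {position: original token} map of
    prediction targets.
    """
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

caption = "the man at bat readies to swing at the pitch".split()
masked, targets = mask_tokens(caption, prob=0.15, rng=random.Random(1))
```

During pre-training the model would be asked to recover `targets` from `masked` together with the visual features, which is what ties the two modalities together.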
• Image-Text Matching (ITM)
- A binary classification task that predicts whether the sentence fed to the Transformer correctly describes the image fed with it
- Equal numbers of negative (unmatched image-sentence pairs) and positive samples are used
- Example: "The man at bat readies to swing at the pitch while the umpire looks on." → True; "A man taking a picture behind the girl" → False
• Pre-training setup
- CNN: SGD with learning rate 1e-2 and weight decay 5e-4
- Transformer: AdamW with learning rate 1e-4 and weight decay 1e-2
- 64 NVIDIA Tesla V100 GPUs with a batch size of 4096 samples for 40 epochs
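The balanced positive/negative sampling for ITM could look roughly like this. It is a sketch under assumptions: the function name and image ids are hypothetical, each positive gets exactly one negative drawn from another pair's sentence, and a sentence from a different pair is assumed not to describe the image.

```python
import random

def build_itm_pairs(pairs, rng=random):
    """Build labeled ITM examples from matched (image_id, sentence) pairs.

    One mismatched negative is created per positive, so both classes have
    equal counts. Requires at least two pairs.
    """
    examples = []
    for idx, (image_id, sentence) in enumerate(pairs):
        examples.append((image_id, sentence, 1))  # matched pair
        # Pick a sentence from a different pair as the negative.
        other = rng.choice([j for j in range(len(pairs)) if j != idx])
        examples.append((image_id, pairs[other][1], 0))  # unmatched pair
    return examples

# Hypothetical image ids paired with the two captions shown on the slide.
pairs = [
    ("img_baseball", "The man at bat readies to swing at the pitch while the umpire looks on."),
    ("img_photo", "A man taking a picture behind the girl"),
]
examples = build_itm_pairs(pairs, rng=random.Random(0))
```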
3. Experimental Results
Downstream task - Visual Question Answering (VQA)
Figure labels: Question [SEP] Image → Pretrained Pixel-BERT → [CLS] → 0 / 1
• 18 epochs on 16 NVIDIA Tesla V100 GPUs with batch size 256
• Learning rate decay: by a factor of 10 at the 12th and 16th epochs
• Learning image representations at the pixel level helps improve performance
Results table annotations: CNN feature, no transformer; Faster-RCNN, no transformer; Faster-RCNN, transformer, pretrained model; ResNet-50; ResNeXt-152
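The fine-tuning schedule above (decay by 10 at the 12th and 16th epoch) can be written as a small step function. This sketch assumes 1-indexed epochs and that the decay takes effect from the milestone epoch onward; the slides do not state the base fine-tuning learning rate, so `base_lr=1.0` is only a placeholder that exposes the multiplier.

```python
def learning_rate(epoch, base_lr, milestones=(12, 16), factor=10.0):
    """Learning rate for a 1-indexed epoch.

    The rate is divided by `factor` once for every milestone already reached.
    """
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** passed)

# Placeholder base rate of 1.0, so the values are the decay multipliers.
schedule = {epoch: learning_rate(epoch, base_lr=1.0) for epoch in range(1, 19)}
```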
Downstream task - Natural Language for Visual Reasoning for Real (NLVR2)
https://lil.nlp.cornell.edu/nlvr/
• Two inputs, Q [SEP] Image1 and Q [SEP] Image2, are each passed through Pixel-BERT; the two [CLS] outputs (labeled p1 cls and p2 cls in the figure) are concatenated and used for a 0 / 1 prediction trained with cross-entropy loss
• 18 epochs on 16 NVIDIA Tesla V100 GPUs with batch size 128
• Learning rate decay: by a factor of 10 at the 12th and 16th epochs
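The NLVR2 setup (two passes, concatenated [CLS] outputs, cross-entropy on a 0 / 1 label) can be sketched in a few lines. Everything here is illustrative: the [CLS] vectors are toy values rather than Pixel-BERT outputs, and the single linear layer with a sigmoid is an assumed stand-in for whatever classification head is actually used.

```python
import math

def nlvr2_logit(p1_cls, p2_cls, weights, bias=0.0):
    """Concatenate the two [CLS] vectors and apply a linear layer (hypothetical head)."""
    joint = list(p1_cls) + list(p2_cls)
    return sum(w * x for w, x in zip(weights, joint)) + bias

def binary_cross_entropy(logit, label):
    """Cross-entropy for a single example with label 0 or 1."""
    prob = 1.0 / (1.0 + math.exp(-logit))
    return -math.log(prob) if label == 1 else -math.log(1.0 - prob)

# Toy [CLS] outputs for (Q, Image1) and (Q, Image2).
p1 = [0.2, -0.1]
p2 = [0.4, 0.3]
logit = nlvr2_logit(p1, p2, weights=[1.0, 1.0, 1.0, 1.0])
loss_if_true = binary_cross_entropy(logit, 1)
loss_if_false = binary_cross_entropy(logit, 0)
```

With a positive logit, the loss is lower when the label is 1 than when it is 0, which is the signal used to train the head and the underlying model jointly.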
• Pixel-BERT also performs well on the NLVR2 task, which takes two images as input
Downstream task - Image-Text Retrieval
• The query (text) and an image are fed into Pixel-BERT to compute a relevance score, and the top-scoring results are returned
• Compared with Unicoder-VL and UNITER, Pixel-BERT shows meaningful gains on all subtasks
• The IR subtask requires understanding a global description of the image; here Pixel-BERT benefits from learning attention between image pixels and language
Results tables (12-layer Transformer): 1K testing results on Flickr30K; 5-fold 1K testing and 5K testing results on MS-COCO
TR: Image-to-text retrieval. Given an image, retrieve the matching text
IR: Text-to-image retrieval. Given a text, retrieve the matching image
R@k: Recall at k. Indicates how many of the actually relevant items are included when the model returns k results.
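Ranking by relevance score and the R@k metric can be sketched as below. The scores and ids are made up for illustration; in the actual setup each score would come from feeding a (text, image) pair through the model.

```python
def rank_by_score(scores):
    """Sort candidate ids by descending relevance score."""
    return [cid for cid, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)]

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear among the top-k ranked results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Hypothetical relevance scores of three candidate images for one text query.
scores = {"img_a": 0.1, "img_b": 0.9, "img_c": 0.4}
ranked = rank_by_score(scores)
r_at_1 = recall_at_k(ranked, {"img_c"}, k=1)
r_at_2 = recall_at_k(ranked, {"img_c"}, k=2)
```

Here the relevant image is ranked second, so it is missed at k=1 and found at k=2.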
Ablation Study
1) ITM and MLM pre-training tasks
2) Random pixel sampling method
3) Using a better visual backbone
• The ablation confirms that the pre-training tasks and random pixel sampling each contribute to subtask performance
• A stronger visual backbone model is expected to further improve Pixel-BERT's performance
Visualization
• Pixel-BERT can learn region-level visual representations well through cross-modality learning
• The visualization highlights parts that were difficult to represent with bounding-box models
4. Conclusion
1. Proposes Pixel-BERT, which combines a CNN-based visual encoder with a multimodal Transformer
2. Builds a pre-training model based on Pixel-BERT and verifies its performance on downstream tasks
• Masked language modeling and image-text matching are the two tasks designed for pre-training.
• Four downstream vision-and-language tasks are evaluated, with the best performance on most of them
3. Proposes "a random pixel sampling mechanism" for robustness
4. Future work
• Pre-training Pixel-BERT on the Conceptual Captions dataset
• Investigating whether self-supervised tasks can be applied to Pixel-BERT
Thank you
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Inefficiency of pixel feature embedding
• Pixel-BERT: 100 randomly selected features (2048 dim)
• Oscar: each region feature is a P-dimensional vector (i.e., P = 2048), and the region position z is an R-dimensional vector (i.e., R = 4 or 6)
Pixel-BERT
Oscar: Single Tesla P100 (16GB)

PR-313 Training BatchNorm and Only BatchNorm: On the Expressive Power of Rand...PR-313 Training BatchNorm and Only BatchNorm: On the Expressive Power of Rand...
PR-313 Training BatchNorm and Only BatchNorm: On the Expressive Power of Rand...
 
PR-298 PARADE: Passage representation aggregation for document reranking
PR-298 PARADE: Passage representation aggregation for document rerankingPR-298 PARADE: Passage representation aggregation for document reranking
PR-298 PARADE: Passage representation aggregation for document reranking
 
PR-285 Leveraging Semantic and Lexical Matching to Improve the Recall of Docu...
PR-285 Leveraging Semantic and Lexical Matching to Improve the Recall of Docu...PR-285 Leveraging Semantic and Lexical Matching to Improve the Recall of Docu...
PR-285 Leveraging Semantic and Lexical Matching to Improve the Recall of Docu...
 
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector Quantization
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector QuantizationPR-272: Accelerating Large-Scale Inference with Anisotropic Vector Quantization
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector Quantization
 
PR-246: A deep learning system for differential diagnosis of skin diseases
PR-246: A deep learning system for differential diagnosis of skin diseasesPR-246: A deep learning system for differential diagnosis of skin diseases
PR-246: A deep learning system for differential diagnosis of skin diseases
 
PR-232: AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
PR-232:  AutoML-Zero:Evolving Machine Learning Algorithms From ScratchPR-232:  AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
PR-232: AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
 
PR-218: MFAS: Multimodal Fusion Architecture Search
PR-218: MFAS: Multimodal Fusion Architecture SearchPR-218: MFAS: Multimodal Fusion Architecture Search
PR-218: MFAS: Multimodal Fusion Architecture Search
 
PR-203: Class-Balanced Loss Based on Effective Number of Samples
PR-203: Class-Balanced Loss Based on Effective Number of SamplesPR-203: Class-Balanced Loss Based on Effective Number of Samples
PR-203: Class-Balanced Loss Based on Effective Number of Samples
 
PR-187 : MorphNet: Fast & Simple Resource-Constrained Structure Learning of D...
PR-187 : MorphNet: Fast & Simple Resource-Constrained Structure Learning of D...PR-187 : MorphNet: Fast & Simple Resource-Constrained Structure Learning of D...
PR-187 : MorphNet: Fast & Simple Resource-Constrained Structure Learning of D...
 
PR173 : Automatic Chemical Design Using a Data-Driven Continuous Representati...
PR173 : Automatic Chemical Design Using a Data-Driven Continuous Representati...PR173 : Automatic Chemical Design Using a Data-Driven Continuous Representati...
PR173 : Automatic Chemical Design Using a Data-Driven Continuous Representati...
 
PR-159 : Synergistic Image and Feature Adaptation: Towards Cross-Modality Dom...
PR-159 : Synergistic Image and Feature Adaptation: Towards Cross-Modality Dom...PR-159 : Synergistic Image and Feature Adaptation: Towards Cross-Modality Dom...
PR-159 : Synergistic Image and Feature Adaptation: Towards Cross-Modality Dom...
 

Último

AIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsAIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsvanyagupta248
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptNANDHAKUMARA10
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdfKamal Acharya
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaOmar Fathy
 
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...Health
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersMairaAshraf6
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapRishantSharmaFr
 
Air Compressor reciprocating single stage
Air Compressor reciprocating single stageAir Compressor reciprocating single stage
Air Compressor reciprocating single stageAbc194748
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxSCMS School of Architecture
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdfKamal Acharya
 
Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086anil_gaur
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXssuser89054b
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityMorshed Ahmed Rahath
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . pptDineshKumar4165
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfJiananWang21
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"mphochane1998
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdfKamal Acharya
 

Último (20)

AIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsAIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech students
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
 
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to Computers
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Air Compressor reciprocating single stage
Air Compressor reciprocating single stageAir Compressor reciprocating single stage
Air Compressor reciprocating single stage
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdf
 

[PR-325] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

  • 3. 1. Research Background: Pre-training mechanism for cross-modality tasks • Combining self-attention with self-supervision has made it possible to learn context from very large amounts of data without labels. This has produced strong results not only in language and vision individually but also in cross-modality tasks (Visual Question Answering, image captioning, image retrieval). Examples: Visual Question Answering (VQA); text-to-image retrieval. Park, Gwangbeen, and Woobin Im. arXiv:1612.08354
  • 4. 1. Research Background: Cross-modality learning in vision and language • CNN features pre-trained for visual classification are used as the visual representation. Ben-Younes, Hedi, et al. "Mutan: Multimodal tucker fusion for visual question answering." ICCV. (2017) Milestones: Visual Genome dataset released; CNN features used; Faster R-CNN used as the visual backbone; Transformers used to learn dense connections between the two modalities.
  • 5. 1. Research Background: Cross-modality learning in vision and language • https://visualgenome.org/ Krishna, Ranjay, et al. "Visual genome: Connecting language and vision using crowdsourced dense image annotations." IJCV. (2017) Milestones: Visual Genome dataset released; CNN features used; Faster R-CNN used as the visual backbone; Transformers used to learn dense connections between the two modalities.
  • 6. 1. Research Background: Cross-modality learning in vision and language • Faster R-CNN (see PR-012, by JinWon Lee). Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. NIPS. (2015) Anderson, Peter, et al. "Bottom-up and top-down attention for image captioning and visual question answering." CVPR. (2018) Milestones: Visual Genome dataset released; CNN features used; Faster R-CNN used as the visual backbone; Transformers used to learn dense connections between the two modalities.
  • 7. 1. Research Background: Cross-modality learning in vision and language • Self-attention in Transformer: self-attention lets the embedding of each token in the input sentence be computed while taking the other tokens into account. https://jalammar.github.io/illustrated-transformer/ • The Transformer's self-attention and feed-forward network form dense connections both within each domain (intra-domain) and across domains (inter-domain). Li, Xiujun, et al. "Oscar: Object-semantics aligned pre-training for vision-language tasks." ECCV. (2020) Milestones: Visual Genome dataset released; CNN features used; Faster R-CNN used as the visual backbone; Transformers used to learn dense connections between the two modalities.
  • 8. 1. Research Background: Get better joint representation for vision and language tasks • Two-stream neural network approach: visual and language information are processed by separate streams, which are then merged through transformer layers. Example: ViLBERT • Single-stream neural network approach: sentence embedding features and bounding-box features are combined, and BERT then learns their bi-directional joint distribution. Lu, J. et al., Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. NeurIPS (2019)
  • 9. 1. Research Background: Get better joint representation for vision and language tasks • Two-stream neural network approach: visual and language information are processed by separate streams, which are then merged through transformer layers • Single-stream neural network approach: sentence embedding features and bounding-box features are combined, and BERT then learns their bi-directional joint distribution. Example: VL-BERT. Su, Weijie, et al. "Vl-bert: Pre-training of generic visual-linguistic representations." ICLR 2020.
  • 10. 1. Research Background: Limitations of methods that use region-based visual feature extractors • The argument is that, because these visual features are extracted for one specific task (object detection), they do not provide image features sufficient for language understanding • Loss of visual information: object shapes, relational information where objects overlap, and information available from the background or the overall mood of the image
  • 11. 1. Research Background: Objective & Approach • We step out of the bounding box to make the full power of visual information in images for vision and language learning. • We propose Pixel-BERT that learns to align image pixels with text to build a more thorough semantic embedding between visual and textual information. Figure labels: word-level token embedding based on BERT; CNN that takes image pixels as input for visual embedding learning; multi-modal transformers for joint learning; task (VQA, retrieval, …)
  • 13. 2. Methods: Approach 1) Pre-training Pixel-BERT 2) Training for downstream tasks
  • 14. 2. Methods: Architectures ① We learn from pixels to represent an image instead of using bounding boxes. ② Feature pixels are randomly sampled (100 features) during pre-training, for robustness and lower computation cost. ③ The Transformer's self-attention and feed-forward network form intra-domain and inter-domain dense connections. Figure labels: Tokenize; 100 features; ResNet-50, ResNeXt-152 (models pre-trained on ImageNet)
  • 15. 2. Methods: pre-trained by MLM and ITM tasks • Masked Language Modeling (MLM) - Because visual features are used to predict the masked tokens, the model is encouraged to learn a mapping between the two modalities - Language tokens are randomly masked with a probability of 0.15 - The same method is used in UNITER (ECCV 2020), VL-BERT (ICLR 2020), and LXMERT (EMNLP 2019) • MS-COCO captions. Chen, Xinlei, et al. arXiv:1504.00325 (2015). • Visual Genome (VG)
  • 16. 2. Methods: pre-trained by MLM and ITM tasks • Image-Text Matching (ITM) - A binary classification task that predicts whether the sentence fed to the Transformer correctly describes the image fed in with it - Equal numbers of negative (unmatched image-sentence pairs) and positive samples are used • Pre-training setup • For CNN: SGD with learning rate 1e-2 and weight decay 5e-4. For Transformer: AdamW with learning rate 1e-4 and weight decay 1e-2 • Pre-training: 64 NVIDIA Tesla V100 GPUs with a batch size of 4096 samples for 40 epochs. Example captions: "The man at bat readies to swing at the pitch while the umpire looks on." and "A man taking a picture behind the girl" (labels shown: True, False)
  • 17. 2. Methods: pre-trained by MLM and ITM tasks • Masked Language Modeling (MLM) - Because visual features are used to predict the masked tokens, the model is encouraged to learn a mapping between the two modalities - Language tokens are randomly masked with a probability of 0.15 • Image-Text Matching (ITM) - Equal numbers of negative (unmatched image-sentence pairs) and positive samples are used • MS-COCO captions. Chen, Xinlei, et al. arXiv:1504.00325 (2015). • Visual Genome (VG) • Pre-training setup • For CNN: SGD with learning rate 1e-2 and weight decay 5e-4 to optimize the CNN backbone. For Transformer: AdamW with learning rate 1e-4 and weight decay 1e-2 • Pre-training: 64 NVIDIA Tesla V100 GPUs with a batch size of 4096 samples for 40 epochs.
  • 19. 3. Experimental Results: Downstream task - Visual Question Answering (VQA) • Figure labels: [CLS], Question, [SEP], Image; Pretrained Pixel-BERT; 0 / 1 • 18 epochs on 16 NVIDIA Tesla V100 GPUs with batch size 256 • Learning rate decay: divided by 10 at the 12th and 16th epochs • Learning image representations at the pixel level helps improve performance. Table annotations: ResNet-50; ResNeXt-152; CNN feature, no transformer; Faster-RCNN, no transformer; Faster-RCNN, transformer, pretrained model
  • 20. 3. Experimental Results: Downstream task - Natural Language for Visual Reasoning for Real (NLVR2) • Figure: [CLS] Q [SEP] Image1 and [CLS] Q [SEP] Image2 are each fed to Pixel-BERT, giving p_1^cls and p_2^cls; the two are concatenated (Concat.) and classified as 0 / 1 with a cross-entropy loss • 18 epochs on 16 NVIDIA Tesla V100 GPUs with batch size 128 • Learning rate decay: divided by 10 at the 12th and 16th epochs. https://lil.nlp.cornell.edu/nlvr/
  • 21. 3. Experimental Results: Downstream task - Natural Language for Visual Reasoning for Real (NLVR2) • Pixel-BERT also performs well on the NLVR2 task, which takes two images as input
  • 22. 3. Experimental Results: Downstream task - Image-Text Retrieval • The query (text) and an image are fed to Pixel-BERT to compute a relevance score, and the top-ranked results are returned • Compared with Unicoder-VL and UNITER, Pixel-BERT shows meaningful gains on all subtasks • The IR subtask requires understanding a global description of the image; here Pixel-BERT has the advantage of learning attention between image pixels and language. Tables: 1K testing results on Flickr30K; 5-fold 1K testing and 5K testing results on MS-COCO (12-layer Transformer). TR: image-to-text retrieval; given an image, retrieve the matching text. IR: text-to-image retrieval; given a text, retrieve the matching image. R@k: recall at k; how many of the actually relevant items are included when the model returns k results.
  • 23. 3. Experimental Results: Ablation Study • The ablation confirms that the pre-training tasks and random pixel sampling contribute to performance on the downstream tasks • Using a stronger visual backbone is expected to further improve Pixel-BERT. Table annotations: 1) ITM and MLM pre-training tasks, random pixel sampling 2) random pixel sampling method 3) use of a stronger visual backbone
  • 24. 3. Experimental Results: Visualization • Pixel-BERT can well learn the visual representation in region level with cross-modality learning. Annotation: regions that were hard to represent with bounding-box models
  • 26. 4. Conclusions 1. Proposes Pixel-BERT, which combines a CNN-based visual encoder with a multimodal Transformer 2. Builds a Pixel-BERT-based pre-training model and evaluates it on downstream tasks • Masked language modeling and image-text matching are the two tasks designed for pre-training. • Four downstream vision-and-language tasks are evaluated, with the best performance on most of them 3. Proposes "a random pixel sampling mechanism" for robustness 4. Future work • Pre-training Pixel-BERT on the Conceptual Captions dataset • Investigating whether self-supervised tasks can be applied to Pixel-BERT
  • 28. 4. Conclusions: Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
  • 29. 4. Conclusions: Inefficiency of pixel feature embedding • Pixel-BERT: 100 randomly selected features (2048-dim) • Oscar: each region feature is a P-dimensional vector (i.e., P = 2048), and the region position z is an R-dimensional vector (i.e., R = 4 or 6). Figure labels: Pixel-BERT; Oscar; Single Tesla P100 (16GB)
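
The random pixel sampling step from slide 14 (100 feature pixels kept during pre-training) can be sketched in a few lines. This is only an illustration, not the authors' code: the function name sample_pixel_features, the flattened-list input, and the choice to keep the sampled features in their original spatial order are all assumptions made for this example.

```python
import random


def sample_pixel_features(features, k=100, rng=None):
    """Pick k feature vectors uniformly at random, without replacement.

    `features` is a flattened list of per-location CNN feature vectors
    (hypothetical layout: one vector per cell of the H x W feature map).
    """
    if rng is None:
        rng = random.Random()
    if len(features) <= k:
        # Fewer locations than requested: keep everything.
        return list(features)
    # Sorting the indices keeps the original spatial order (an assumption).
    idx = sorted(rng.sample(range(len(features)), k))
    return [features[i] for i in idx]


# Toy 20 x 20 feature map with 4-dim features (real features are 2048-dim).
grid = [[float(i)] * 4 for i in range(400)]
sampled = sample_pixel_features(grid, k=100, rng=random.Random(0))
print(len(sampled))
```

Sampling is described on the slides as a pre-training mechanism; how features are handled at fine-tuning time is not stated there, so this sketch does not cover it.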
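
The MLM masking rule from slide 15 (each language token masked with probability 0.15) can be illustrated as follows. This is a simplified sketch: it always substitutes a "[MASK]" string and does not model any further replacement rules that a real BERT-style pipeline may apply; the names mask_tokens and MASK are hypothetical.

```python
import random

MASK = "[MASK]"  # placeholder string, assumed for this sketch


def mask_tokens(tokens, p=0.15, rng=None):
    """Mask each token independently with probability p.

    Returns (masked_tokens, labels); labels[i] holds the original token
    at masked positions and None elsewhere, so a loss would only be
    computed where a label is present.
    """
    if rng is None:
        rng = random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < p:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


caption = "the man at bat readies to swing at the pitch".split()
masked, labels = mask_tokens(caption, p=0.15, rng=random.Random(0))
print(masked)
```

In Pixel-BERT the prediction for a masked position can draw on the image features as well as the remaining text, which is what pushes the model to align the two modalities.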
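
The fine-tuning schedule on slides 19 and 20 (learning rate divided by 10 at the 12th and 16th epochs, 18 epochs total) amounts to a step decay. The sketch below assumes epochs are counted from 0 and that the drop takes effect at the start of the milestone epoch; the slides do not specify this, nor the base learning rate, so base_lr is a free parameter here.

```python
def step_decay_lr(base_lr, epoch, milestones=(12, 16), factor=0.1):
    """Return the learning rate for a given epoch under step decay.

    The rate is multiplied by `factor` once for every milestone that
    has been reached (epoch >= milestone).
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * (factor ** drops)


# Print the schedule for an 18-epoch run with a placeholder base rate of 1.0.
for epoch in range(18):
    print(epoch, step_decay_lr(1.0, epoch))
```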
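
The R@k metric defined on slide 22 can be computed as below. This follows the slide's wording (the share of relevant items that appear among the top k results for one query); retrieval benchmarks often report a variant that counts the fraction of queries with at least one relevant item in the top k, so treat the function as one possible reading. The names recall_at_k and the toy IDs are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(relevant & set(ranked_ids[:k]))
    return hits / len(relevant)


# Toy text-to-image query: images ranked by relevance score, best first.
ranking = ["img_7", "img_2", "img_9", "img_4"]
print(recall_at_k(ranking, ["img_2"], k=1))
print(recall_at_k(ranking, ["img_2"], k=5))
```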