MULTI MODALITY FOR THE UNINITIATED
– SIDDHARTH SHARMA, HARMFUL INFO PROBLEM, CI
FEATURE TYPES
https://www.internalfb.com/intern/wiki/Ads-ranking-feature/#type-of-features
SIGNAL DIVERSITY
Deep Neural Networks for YouTube Recommendations
Feed Ranking
FUSION
A LONG TIME AGO IN A GALAXY FAR, FAR AWAY....
Wide & Deep Learning for Recommender Systems
HOW TO MERGE THESE FEATURES ?
Simplest Approach – Concatenate
https://www.internalfb.com/intern/wiki/Facebook_AI_Multimodal_(FAIM)/Model_Architectures/Non-temporal_Models/ConcatMLP_Fusion_Model/
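A minimal sketch of concatenation fusion, assuming NumPy, made-up feature sizes and random untrained weights. It is not the ConcatMLP model from the linked wiki; every name and dimension below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def concat_mlp_fusion(features, w1, b1, w2, b2):
    """Concatenate per-modality feature vectors, then apply a two-layer MLP."""
    fused = np.concatenate(features, axis=-1)   # (batch, sum of feature dims)
    hidden = np.maximum(fused @ w1 + b1, 0.0)   # ReLU hidden layer
    return hidden @ w2 + b2                     # (batch, 1) logit

batch = 4
text_emb = rng.normal(size=(batch, 8))     # hypothetical text embedding
image_emb = rng.normal(size=(batch, 16))   # hypothetical image embedding
dense = rng.normal(size=(batch, 3))        # hypothetical dense (numeric) features

in_dim = 8 + 16 + 3
w1, b1 = rng.normal(size=(in_dim, 32)) * 0.1, np.zeros(32)
w2, b2 = rng.normal(size=(32, 1)) * 0.1, np.zeros(1)

logits = concat_mlp_fusion([text_emb, image_emb, dense], w1, b1, w2, b2)
print(logits.shape)  # (4, 1)
```

The MLP sees all modalities only as one flat vector, so any cross-modal interaction has to be learned by the hidden layer.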
SPARSE – SPARSE INTERACTION
Demystify CTR_MBL_FEED_MODEL and learn modeling techniques step by step
https://fb.quip.com/fwEFAoD4rDBs
DENSE-SPARSE INTERACTION
https://fb.quip.com/fwEFAoD4rDBs#YVJACAZa0SO
TEXT IMAGE SPECIFIC TASKS
A Hummingbird
MMF, a PyTorch powered MultiModal Framework
https://www.youtube.com/watch?v=igAF-48Pwnc
POPULAR DATA SETS
VISUALBERT
1. Architecture
   1. Image regions and language are combined with a Transformer so that self-attention can discover implicit alignments between language and vision.
   2. Uses BERT weights for initialization, including the BERT word embeddings.
   3. Visual token embedding (sum of three representations):
      1. A visual feature representation of the bounding region (Faster R-CNN)
      2. Segment embedding
      3. Position embedding
2. Dataset: Flickr30k contains 31,000 images collected from Flickr, each with 5 reference sentences provided by human annotators.
VisualBERT: A Simple and Performant Baseline for Vision and Language
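The visual token embedding (sum of three representations) can be sketched as below. This is an illustration under assumptions, not the VisualBERT code: the sizes are tiny, the lookup tables are random, and the projection matrix `proj` is a hypothetical stand-in for whatever maps detector features to the model width.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden = 12       # hypothetical model width (BERT-base uses 768)
region_dim = 20   # hypothetical detector feature size
num_regions = 5

# Stand-in for Faster R-CNN region features.
region_feats = rng.normal(size=(num_regions, region_dim))
# Hypothetical projection from detector features to the model width.
proj = rng.normal(size=(region_dim, hidden)) * 0.1
# Segment table: row 0 = text segment, row 1 = image segment.
segment_table = rng.normal(size=(2, hidden))
# One position embedding per region slot.
position_table = rng.normal(size=(num_regions, hidden))

visual_feature = region_feats @ proj                     # (regions, hidden)
segment = segment_table[1]                               # broadcast to all regions
position = position_table[np.arange(num_regions)]        # (regions, hidden)

# Each visual token is the sum of the three representations.
visual_tokens = visual_feature + segment + position
print(visual_tokens.shape)  # (5, 12)
```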
ENTITY GROUNDING
Attending to the bounding regions that correspond to entities in the sentence.
For each entity in the sentence and for each attention head in VisualBERT, look at the bounding region that receives the most attention weight.
For this evaluation, the head’s attention to other words was masked out.
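A small sketch of that lookup, assuming the attention weights are available as a NumPy array and that the sequence is laid out as words followed by regions (both are assumptions for illustration; `grounded_region` is a hypothetical helper).

```python
import numpy as np

def grounded_region(attn, entity_idx, num_words):
    """attn: (heads, seq, seq) attention weights over [words..., regions...].

    Returns, per head, the index of the region (counted from 0) that the
    entity token attends to most, ignoring attention to words.
    """
    row = attn[:, entity_idx, :].copy()      # (heads, seq): attention from the entity
    row[:, :num_words] = -np.inf             # mask out attention to other words
    return row.argmax(axis=-1) - num_words   # shift so region indices start at 0

# Toy example: 1 head, 2 words followed by 3 regions.
attn = np.array([[[0.5, 0.1, 0.10, 0.20, 0.1],
                  [0.1, 0.6, 0.05, 0.05, 0.2],
                  [0.2, 0.2, 0.20, 0.20, 0.2],
                  [0.2, 0.2, 0.20, 0.20, 0.2],
                  [0.2, 0.2, 0.20, 0.20, 0.2]]])

print(grounded_region(attn, entity_idx=0, num_words=2))  # [1]
print(grounded_region(attn, entity_idx=1, num_words=2))  # [2]
```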
SYNTACTIC GROUNDING
• Find whether the model is learning syntactic relations between words (by analysing the weights of attention heads)
• Parse all sentences in Flickr30k using AllenNLP’s dependency parser
• For each attention head in VisualBERT, given that two words have a particular dependency relationship and one of them has a ground-truth grounding in Flickr30k, compute how accurately the head’s attention weights predict the ground-truth grounding.
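The per-head accuracy described above might be computed roughly as follows. This is a simplified sketch: it assumes the word-to-region attention is already extracted, that dependency parsing has been done elsewhere, and that a prediction counts as correct only when the top-attended region is exactly the gold region. The function and argument names are hypothetical.

```python
import numpy as np

def head_accuracy(attn_to_regions, examples):
    """attn_to_regions: (heads, words, regions) attention from words to regions.

    examples: (word_idx, gold_region) pairs for one dependency type, where
    word_idx is the word linked by the dependency to a grounded word and
    gold_region is that grounded word's ground-truth region.
    Returns the accuracy of each head as a (heads,) array.
    """
    words = np.array([w for w, _ in examples])
    gold = np.array([r for _, r in examples])
    predicted = attn_to_regions[:, words, :].argmax(axis=-1)  # (heads, n_examples)
    return (predicted == gold).mean(axis=-1)

# Toy example: 2 heads, 3 words, 2 regions.
attn_to_regions = np.array([
    [[0.90, 0.10], [0.2, 0.8], [0.60, 0.40]],   # head 0
    [[0.30, 0.70], [0.4, 0.6], [0.45, 0.55]],   # head 1
])
examples = [(0, 0), (1, 1), (2, 0)]
acc = head_accuracy(attn_to_regions, examples)   # head 0: 3/3, head 1: 1/3
```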
ISSUES
• Issues with just concatenating features of the linguistic and visual modalities
  • VisualBERT treats inputs from both modalities identically
  • The two modalities need different pre-processing and are at different levels of abstraction
• Forcing pretrained BERT weights to accommodate the large set of additional visual tokens may damage the learned BERT language model
VILBERT: PRETRAINING TASK-AGNOSTIC VISIOLINGUISTIC REPRESENTATIONS FOR VISION-AND-LANGUAGE TASKS
• Key contribution:
  • Two parallel streams for visual and linguistic processing that interact through novel co-attentional transformer layers.
• Dataset:
  • Conceptual Captions, ~3.3 million images
• Proxy tasks:
  • Predicting masked words and image regions
  • Predicting whether an image and a text segment correspond
VILBERT
Method:
• Develop a two-stream architecture that models each modality separately and then fuses them through a small set of attention-based interactions.
• The approach allows for variable network depth for each modality and enables cross-modal connections at different depths.
VILBERT: INPUT REPRESENTATIONS
• Image features
  • Generated by extracting bounding boxes and their visual features from a pre-trained object detection network (Faster R-CNN with a ResNet-101 backbone).
  • Spatial information is encoded in a 5-d vector from the region position (normalized top-left and bottom-right coordinates and the fraction of image area covered).
  • This vector is then projected to match the dimension of the visual features, and the two are summed.
• Word embeddings are initialized from BERT base, pretrained on BookCorpus and Wikipedia
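The 5-d spatial encoding and the sum with the visual feature can be sketched like this. Box format (pixel corners), the feature size and the random projection `w_loc` are assumptions made for the example.

```python
import numpy as np

def spatial_vector(box, img_w, img_h):
    """Encode a box (x1, y1, x2, y2) in pixels as a 5-d vector:
    normalized top-left, normalized bottom-right, fraction of image area."""
    x1, y1, x2, y2 = box
    area_frac = ((x2 - x1) * (y2 - y1)) / (img_w * img_h)
    return np.array([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h, area_frac])

loc = spatial_vector((50, 20, 150, 120), img_w=200, img_h=200)
# loc holds 0.25, 0.1, 0.75, 0.6 and an area fraction of 0.25

rng = np.random.default_rng(0)
feat_dim = 16                                   # hypothetical visual feature size
region_feat = rng.normal(size=feat_dim)         # stand-in for a detector feature
w_loc = rng.normal(size=(5, feat_dim)) * 0.1    # hypothetical learned projection

# Project the 5-d vector to the feature dimension and sum with the visual feature.
image_token = region_feat + loc @ w_loc
```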
NOVELTY
• Co-TRM: co-attentional transformer layers enable information exchange between modalities.
  • The keys and values from each modality are passed as input to the other modality’s multi-headed attention block.
  • The exchange between the two streams is restricted to specific layers.
• The text stream has significantly more processing before interacting with visual features
  • Visual features are already fairly high level and require limited context aggregation compared to words in a sentence
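A single-head sketch of the co-attention exchange: each stream forms queries from its own tokens and takes keys and values from the other stream. It leaves out multiple heads, residual connections, layer norm and feed-forward blocks, and it assumes both streams share one width, which is a simplification for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries_from, keys_values_from, wq, wk, wv):
    """Queries come from one stream, keys and values from the other."""
    q = queries_from @ wq
    k = keys_values_from @ wk
    v = keys_values_from @ wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (n_query, n_key)
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 8                                  # hypothetical shared width
text = rng.normal(size=(6, d))         # 6 word tokens
regions = rng.normal(size=(4, d))      # 4 image regions

def make_weight():
    return rng.normal(size=(d, d)) * 0.1

# Text stream attends over image regions; visual stream attends over words.
text_out, text_w = cross_attention(text, regions, make_weight(), make_weight(), make_weight())
vis_out, vis_w = cross_attention(regions, text, make_weight(), make_weight(), make_weight())
print(text_w.shape, vis_w.shape)  # (6, 4) (4, 6)
```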
TRAINING TASKS
• Alignment task
  • The model is presented with an image and text pair
  • Given {IMG, v1, …, vt, CLS, w1, …, wt, SEP}, it predicts whether the image and text are aligned.
  • The outputs for IMG and CLS are holistic representations of the image and text inputs.
  • The overall representation is computed as the element-wise product of the IMG and CLS representations, and a linear layer on top makes the binary prediction.
• Masked modeling task
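For the alignment task, the prediction head can be sketched as below. The vectors and weights are hand-picked toy values, and the sketch assumes the IMG and CLS outputs have the same dimension.

```python
import numpy as np

def alignment_logit(h_img, h_cls, w, b):
    """Element-wise product of the IMG and CLS outputs, then a linear layer."""
    joint = h_img * h_cls
    return joint @ w + b

h_img = np.array([1.0, 2.0, -1.0])   # toy IMG output
h_cls = np.array([0.5, 0.5, 2.0])    # toy CLS output
w = np.array([1.0, 1.0, 1.0])        # toy linear weights
b = 0.0

logit = alignment_logit(h_img, h_cls, w, b)   # 0.5 + 1.0 - 2.0 = -0.5
prob_aligned = 1.0 / (1.0 + np.exp(-logit))   # sigmoid for the binary prediction
```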
ISSUES WITH VILBERT
• Cannot incorporate pretrained unimodal representations
• Cannot work for arbitrary sequences of dense vectors
FB: SUPERVISED MULTIMODAL BITRANSFORMERS FOR CLASSIFYING IMAGES AND TEXT
• Jointly fine-tunes unimodally pretrained text and image encoders by projecting image embeddings into the text token space
• Easier to incorporate pretrained unimodal models in this architecture
MMBT: IMAGE ENCODER
• Get feature maps from ResNet-152
  • Use ResNet-152 with average pooling over a K x M grid in the image, yielding N = KM output vectors of 2048 dimensions
• Learn weights to project each of the N image embeddings to the D-dimensional token input embedding space
  • In effect, image embeddings are mapped to BERT’s token space using a set of randomly initialized mappings
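The pooling and projection steps can be sketched as follows. A random array stands in for the ResNet-152 feature map, the grid is assumed to divide the map evenly, and D = 768 is assumed to match BERT-base; `grid_pool` is a hypothetical helper.

```python
import numpy as np

def grid_pool(feature_map, k, m):
    """Average-pool a (C, H, W) feature map over a k x m grid -> (k*m, C)."""
    c, h, w = feature_map.shape
    assert h % k == 0 and w % m == 0, "sketch assumes the grid divides the map evenly"
    cells = feature_map.reshape(c, k, h // k, m, w // m).mean(axis=(2, 4))  # (C, k, m)
    return cells.reshape(c, k * m).T                                        # (k*m, C)

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(2048, 6, 6))   # stand-in for a ResNet-152 feature map
k, m = 3, 1                                   # grid size, so N = K * M = 3
d_model = 768                                 # assumed BERT-base token width

pooled = grid_pool(feature_map, k, m)                      # (N, 2048)
w_proj = rng.normal(size=(2048, d_model)) * 0.01           # randomly initialized mapping
image_tokens = pooled @ w_proj                             # (N, D) tokens for the text model
print(pooled.shape, image_tokens.shape)  # (3, 2048) (3, 768)
```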
EVALUATION
• Surprisingly competitive with ViLBERT
• Create hard test sets
  • Construct hard test sets by taking the examples where the BERT and IMG classifier predictions are most different from the ground-truth classes in the test set
• Compare with
  • Text-only BERT
  • Image-only model
  • Concat BOW + Image
  • Late fusion
  • Concat BERT + Img
    • Concatenate the outputs of the BERT and image baselines (2048 + 768) and apply a linear classifier on top
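One way to read the hard test set construction is sketched below. The difficulty score (how much probability both unimodal classifiers withhold from the true class) is an assumption for illustration, not necessarily the paper's exact criterion, and `hard_subset` is a hypothetical helper.

```python
import numpy as np

def hard_subset(p_text, p_image, labels, n_hard):
    """p_text, p_image: (n, classes) predicted probabilities; labels: (n,) true classes.

    Returns the indices of the n_hard examples where the two unimodal
    classifiers are jointly furthest from the ground truth.
    """
    idx = np.arange(len(labels))
    error = (1.0 - p_text[idx, labels]) + (1.0 - p_image[idx, labels])
    return np.argsort(-error, kind="stable")[:n_hard]

p_text = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
p_image = np.array([[0.8, 0.2], [0.1, 0.9], [0.7, 0.3], [0.6, 0.4]])
labels = np.array([0, 0, 1, 1])

hard = hard_subset(p_text, p_image, labels, n_hard=2)
print(hard)  # [1 2]
```

Examples 1 and 2 are selected because both unimodal models put most of their probability on the wrong class, which is where a multimodal model has the most room to help.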
