Screen2Vec: Semantic Embedding of GUI Screens and GUI Components
1. Screen2Vec: Semantic Embedding of GUI Screens and GUI Components
Toby Li, Lindsay Popowski, Tom M. Mitchell, Brad A. Myers
2021 CHI Conference on Human Factors in Computing Systems
2. Background: Semantic representations of GUI screens and components
• Existing approaches to representing GUI screens are limited
◦ Some capture only the text on the screen
▪ Missing the information encoded in the layout and design patterns
◦ Others focus on visual design patterns and GUI layouts
▪ Not capturing the content in the GUI
• Prior approaches use supervised learning with large datasets for specific task objectives
◦ Requiring labeling effort
◦ Inapplicable to different downstream tasks
3. Contribution
• Presenting a self-supervised technique that requires no human-labeled data
• Generating more comprehensive semantic embeddings of GUI screens and components using
◦ Textual content
◦ Visual design
◦ Layout patterns
◦ App metadata
• Training an open-sourced GUI embedding model using Screen2Vec on the RICO dataset
• Providing sample downstream tasks such as
◦ Nearest neighbor retrieval
◦ Composability-based retrieval
◦ Representing mobile tasks
5. Architecture of Screen2Vec: GUI Component Level
• Input
◦ 768-dimensional embedding vector of the text label of the GUI component
▪ Encoded using a pre-trained Sentence-BERT model
◦ 6-dimensional class embedding vector
▪ Representing the class type of the GUI component
• Optimizing the weights of the class embeddings and of the linear layer (text + class)
• Output
◦ 768-dimensional embedding vector (see the sketch after this slide)
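The component-level encoder can be sketched in a few lines of PyTorch. This is an illustrative reconstruction from the dimensions on this slide (26 class types, a 6-d class embedding, a 768-d Sentence-BERT text embedding, a 768-d output), not the authors' released code; all class and variable names are mine.

```python
import torch
import torch.nn as nn

class ComponentEncoder(nn.Module):
    def __init__(self, num_classes=26, class_dim=6, text_dim=768, out_dim=768):
        super().__init__()
        # Trainable class-type embeddings: 26 categories -> 6-d vectors
        self.class_embedding = nn.Embedding(num_classes, class_dim)
        # Linear layer over the concatenated text + class vector (774 -> 768)
        self.combine = nn.Linear(text_dim + class_dim, out_dim)

    def forward(self, text_emb, class_id):
        # text_emb: (batch, 768) Sentence-BERT embedding of the component's text label
        # class_id: (batch,) integer index of the component's class type
        class_emb = self.class_embedding(class_id)            # (batch, 6)
        combined = torch.cat([text_emb, class_emb], dim=-1)   # (batch, 774)
        return self.combine(combined)                         # (batch, 768)

# One component with a random text embedding and class id 3
vec = ComponentEncoder()(torch.randn(1, 768), torch.tensor([3]))
print(vec.shape)  # torch.Size([1, 768])
```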
6. Architecture of Screen2Vec: GUI Screen Level
1) Collection of GUI component embedding vectors
◦ Combined into a 768-dimensional vector using an RNN
2) 64-dimensional layout embedding vector
◦ Encoding the screen's visual layout
3) 768-dimensional embedding vector of the textual App Store description
◦ Encoded with a pre-trained Sentence-BERT model
• The component (1) and layout (2) vectors are combined using a linear layer → 768-dimensional embedding vector (a module sketch follows this slide)
• After training, the description (3) vector is concatenated → 1536-dimensional embedding vector
• The weights of the RNN and of the linear layer are trained on a Continuous Bag of Words prediction task
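A minimal sketch of the screen-level encoder, again reconstructed from the stated dimensions: an RNN summarizes the 768-d component vectors, a linear layer fuses that summary with the 64-d layout vector into a 768-d screen vector, and the 768-d description vector is concatenated afterwards. The choice of a GRU as the RNN and all names are assumptions.

```python
import torch
import torch.nn as nn

class ScreenEncoder(nn.Module):
    def __init__(self, comp_dim=768, layout_dim=64, out_dim=768):
        super().__init__()
        # RNN over the screen's component embeddings (GRU chosen arbitrarily here)
        self.rnn = nn.GRU(comp_dim, comp_dim, batch_first=True)
        # Linear layer fusing the RNN summary with the layout embedding (832 -> 768)
        self.combine = nn.Linear(comp_dim + layout_dim, out_dim)

    def forward(self, component_embs, layout_emb, description_emb):
        # component_embs: (1, n, 768) in pre-order traversal order
        # layout_emb: (1, 64); description_emb: (1, 768)
        _, hidden = self.rnn(component_embs)                              # (1, 1, 768)
        content = hidden.squeeze(0)                                       # (1, 768)
        screen = self.combine(torch.cat([content, layout_emb], dim=-1))   # (1, 768)
        # The description embedding is only concatenated after training
        return torch.cat([screen, description_emb], dim=-1)               # (1, 1536)

out = ScreenEncoder()(torch.randn(1, 12, 768), torch.randn(1, 64), torch.randn(1, 768))
print(out.shape)  # torch.Size([1, 1536])
```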
7. Dataset
• RICO dataset
◦ Containing interaction traces on 66,261 unique GUI screens
◦ From 9,384 free Android apps
• Specifics
◦ Each data point includes a screenshot image
◦ The screen's "view hierarchy" (analogous to a DOM tree in HTML), stored as a JSON file (a parsing sketch follows this slide)
▪ Each node includes
- Class type
- Textual content
- Location, as a bounding box on the screen
- Properties such as whether it is clickable, focused, or scrollable
◦ Each interaction trace is represented as a sequence of GUI screens
▪ Along with the location that was clicked or swiped
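For illustration, walking a view-hierarchy tree to collect each node's class, text, and bounding box might look like the snippet below. The field names ("class", "text", "bounds", "clickable", "children") are assumptions about the RICO JSON schema made for this sketch, not verified details.

```python
def iter_components(node):
    """Yield (class_name, text, bounds, clickable) for each node, in pre-order."""
    yield (node.get("class"), node.get("text"),
           node.get("bounds"), node.get("clickable", False))
    for child in node.get("children") or []:
        if child is not None:
            yield from iter_components(child)

# Tiny hand-made node in the assumed format (real data would be loaded from
# the per-screen JSON files in RICO)
root = {
    "class": "android.widget.FrameLayout",
    "bounds": [0, 0, 1440, 2560],
    "children": [
        {"class": "android.widget.TextView", "text": "Sign in",
         "bounds": [100, 200, 500, 300], "clickable": True, "children": []},
    ],
}
for cls, text, bounds, clickable in iter_components(root):
    print(cls, text, bounds, clickable)
```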
8. Implementation Details: GUI Class Type Embeddings
• Encoding 26 class categories into a vector space
• Mapping each category to a continuous 6-dimensional vector
• Optimizing the embedding values by training on the GUI component prediction task
◦ Categories that are semantically similar end up close in the vector space
9. Implementation Details: GUI Component Context
• Defining the context of a component as its 16 nearest components
• Measures of screen distance for determining the context
◦ Euclidean: straight-line distance on the screen, measured in pixels (sketched after this slide)
◦ Hierarchical: distance between two GUI components in the view hierarchy tree
▪ A parent and its child are at distance 1
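A sketch of the Euclidean context: pick the 16 components whose on-screen distance to the target is smallest. Measuring distance between bounding-box centers is an assumption; the slide only specifies straight-line pixel distance.

```python
import math

def center(bounds):
    # bounds: [left, top, right, bottom] in pixels
    left, top, right, bottom = bounds
    return ((left + right) / 2, (top + bottom) / 2)

def euclidean_context(target_bounds, all_bounds, k=16):
    """Return the indices of the k components closest to the target on screen."""
    tx, ty = center(target_bounds)
    distances = []
    for i, b in enumerate(all_bounds):
        x, y = center(b)
        distances.append((math.hypot(x - tx, y - ty), i))
    return [i for _, i in sorted(distances)[:k]]

# Toy example: the first candidate is closer to the target than the second
print(euclidean_context([0, 0, 100, 50], [[0, 60, 100, 110], [500, 0, 600, 50]], k=1))
```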
10. Implementation Details: Linear Layer
• Combining multiple vectors into a lower-dimensional vector
• GUI component level
◦ Concatenating the 768-dimensional text vector with the 6-dimensional class vector
◦ Shrinking the result down to 768 dimensions
◦ Creating a 774 × 768 weight matrix
• GUI screen level
◦ Combining the 768-dimensional content vector and the 64-dimensional layout vector
◦ Producing a 768-dimensional vector for screen content and layout
11. Implementation Details: Text Embeddings
• Using a pre-trained Sentence-BERT language model (see the sketch after this slide)
◦ Fine-tuned on the SNLI and Multi-Genre NLI datasets with mean pooling
• Encoding text labels and app descriptions into 768-dimensional vectors
• Mapping semantically similar sentences and phrases to nearby vectors
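With the sentence-transformers library, producing such 768-d embeddings might look like this. The checkpoint name 'bert-base-nli-mean-tokens' is an assumption: it is a public Sentence-BERT model trained on SNLI/MultiNLI with mean pooling, which matches the description above, but it may not be the exact model the authors used.

```python
from sentence_transformers import SentenceTransformer

# Sentence-BERT checkpoint trained on NLI data with mean pooling (assumed stand-in)
model = SentenceTransformer("bert-base-nli-mean-tokens")
embeddings = model.encode(["Sign in", "Add to cart"])
print(embeddings.shape)  # (2, 768)
```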
12. Implementation Details: Layout Embeddings
• Extracting the layout from a screenshot
• Differentiating between text and non-text GUI components
• Using an autoencoder to encode each layout image into a 64-dimensional embedding vector (sketched after this slide)
• Encoder's input dimension: 11,200
• Two hidden layers of 2,048 and 256 units
• Applying ReLU activations, which zero out negative inputs
• Loss measured by MSE on the reconstruction
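A minimal PyTorch sketch of the layout autoencoder with the stated sizes (11,200-d input, hidden layers of 2,048 and 256, a 64-d code, ReLU activations, MSE reconstruction loss). The exact layer arrangement is an assumption.

```python
import torch
import torch.nn as nn

class LayoutAutoencoder(nn.Module):
    def __init__(self, in_dim=11200, h1=2048, h2=256, code_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, h1), nn.ReLU(),
            nn.Linear(h1, h2), nn.ReLU(),
            nn.Linear(h2, code_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, h2), nn.ReLU(),
            nn.Linear(h2, h1), nn.ReLU(),
            nn.Linear(h1, in_dim),
        )

    def forward(self, x):
        code = self.encoder(x)            # 64-d layout embedding
        return self.decoder(code), code

model = LayoutAutoencoder()
x = torch.rand(8, 11200)                  # a batch of flattened layout images
reconstruction, code = model(x)
loss = nn.functional.mse_loss(reconstruction, x)   # reconstruction loss
print(code.shape, loss.item())
```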
13. Implementation Details: GUI Embedding Combining Layer
• Combining the embedding vectors of multiple GUI components
• GUI component embeddings are fed into the RNN
◦ In pre-order traversal order of the hierarchy tree
• The RNN starts with a hidden state of zero, and its (n − 1)th output is fed into the linear layer
14. Training Configuration
• Training: 90% of the data; validation: 10%
• Cross-entropy loss function with the Adam optimizer
• Learning rate: 0.001; batch size: 256
• GUI component model: 120 epochs; GUI screen model: 80-120 epochs
• Total loss
◦ Component
▪ Total loss = Loss(text prediction) + Loss(class type prediction)
◦ Screen
▪ Negative sampling (sketched after this slide)
▪ The prediction is compared against the correct screen and a sample of negative data
▪ Negatives: a random sample of 128 other screens from the same app
- To help differentiate between different screens of the same app
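The screen-level loss with negative sampling could be sketched as follows: the predicted context vector is scored against the correct screen plus 128 negative screens from the same app, and cross entropy is applied over those scores. Dot-product scoring is an assumption made for illustration.

```python
import torch
import torch.nn.functional as F

def screen_prediction_loss(pred, positive, negatives):
    # pred:      (768,) screen vector predicted from the context screens
    # positive:  (768,) embedding of the correct target screen
    # negatives: (128, 768) screens randomly sampled from the same app
    candidates = torch.cat([positive.unsqueeze(0), negatives], dim=0)  # (129, 768)
    scores = candidates @ pred                                          # (129,)
    target = torch.zeros(1, dtype=torch.long)   # index 0 = the correct screen
    return F.cross_entropy(scores.unsqueeze(0), target)

loss = screen_prediction_loss(torch.randn(768), torch.randn(768), torch.randn(128, 768))
print(loss.item())
```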
15. Baselines
• Text Embedding Only (similar textual context)
◦ The screen embedding method used in SOVITE
◦ Computed by averaging the text embedding vectors for all the text on the screen
• Layout Embedding Only (similar layout)
◦ The screen embedding method used in the original RICO paper
◦ Computed by the layout autoencoder to represent the screen
• Visual Embedding Only (similar visuals)
◦ Uses the screenshot image directly instead of the layout
◦ Inspired by VASTA, Sikuli, and HILC
16. Results
• Predicting each GUI screen in all the GUI interaction traces in the RICO dataset from its context
◦ Three versions compared
▪ EUCLIDEAN, with the locations of GUI components and the screen layouts
▪ HIERARCHICAL, with the same spatial info
▪ EUCLIDEAN, without spatial info
17. Sample Downstream Tasks: Nearest Neighbors
• The main purpose is to produce distributed vector representations that encode useful semantic, layout, and design properties
• Comparing the similarity of the nearest-neighbor results produced by the different models
Methods
• Selected 50 screens from a variety of apps and app domains
• Retrieved the top-5 most similar screens using each of the 3 models (retrieval sketched after this slide)
• 79 Mechanical Turk workers participated
• Each worker saw the top-5 most similar screens for 5 source screens, produced by the 3 models
• The questionnaire covered the following
◦ (1) App similarity, (2) screen type similarity, (3) content similarity
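The top-5 retrieval itself is a plain nearest-neighbor lookup over the screen embeddings. A sketch using cosine similarity follows; the distance metric is not stated on the slide and is assumed here.

```python
import numpy as np

def top_k_neighbors(query, screens, k=5):
    # query: (d,) embedding of the source screen; screens: (n, d) embedding matrix
    q = query / np.linalg.norm(query)
    s = screens / np.linalg.norm(screens, axis=1, keepdims=True)
    return np.argsort(-(s @ q))[:k]          # indices of the k most similar screens

screens = np.random.rand(1000, 1536)         # placeholder Screen2Vec embeddings
print(top_k_neighbors(screens[42], screens)) # the source screen itself ranks first
```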
18. Sample Downstream Tasks: Nearest Neighbors
Results
• The differences between the mean ratings of the Screen2Vec model and those of both the TextOnly and LayoutOnly models are significant (non-parametric Mann-Whitney U test)
19. Sample Downstream Tasks: Nearest Neighbors
Observation
• Screen2Vec generates more comprehensive representations
◦ For the "Request ride" screen in Lyft, the nearest neighbors include
▪ "Get direction" in Uber Driver
▪ "Select navigation type" in Waze
▪ "Request ride" in Free Now
◦ A MapView takes up the majority of each of these screens
◦ All feature a menu/information card in the bottom 1/3 to 1/4 of the screen
• The TextOnly model's results are instead semantically similar to "payment"
• The LayoutOnly model's results score lower on content and app-context similarity
20. Sample Downstream Tasks: Embedding Composability
Word2Vec
• "Man is to woman as brother is to sister"
• (brother − man + woman) results in an embedding vector representing sister
Screen2Vec
• Marriott app's "hotel booking" screen + (Cheapoair app's "search result" screen − Cheapoair app's "hotel booking" screen)
• The top result is the "search result" screen in the Marriott app (sketched after this slide)
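Numerically, the composability query is plain vector arithmetic followed by a nearest-neighbor lookup. The matrix and indices below are random placeholders standing in for real Screen2Vec embeddings of the named screens.

```python
import numpy as np

screens = np.random.rand(1000, 1536)        # placeholder screen embedding matrix
marriott_booking  = screens[10]             # Marriott "hotel booking" (hypothetical index)
cheapoair_results = screens[20]             # Cheapoair "search result" (hypothetical index)
cheapoair_booking = screens[21]             # Cheapoair "hotel booking" (hypothetical index)

composed = marriott_booking + (cheapoair_results - cheapoair_booking)
sims = (screens @ composed) / (np.linalg.norm(screens, axis=1) * np.linalg.norm(composed))
# With trained embeddings, Marriott's "search result" screen should rank at or near the top
print(np.argsort(-sims)[:5])
```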
21. Sample Downstream Tasks: Screen Embedding Sequences for Representing Mobile Tasks
• A preliminary evaluation of the effectiveness of embedding mobile tasks as sequences of Screen2Vec screen embedding vectors
• Recording scripts of completing 10 common smartphone tasks
• Representing each task as the average of its screens' Screen2Vec vectors (sketched after this slide)
• Querying for the nearest neighbor among 20 task variations yields 18/20 accuracy
◦ TextOnly: 14/20 accuracy
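A task embedding is then just the mean of its screens' vectors, matched against the recorded task variations by nearest neighbor. A toy sketch with placeholder data:

```python
import numpy as np

def task_embedding(screen_vectors):
    # Average the Screen2Vec vectors of the screens visited while performing the task
    return np.mean(screen_vectors, axis=0)

task_library = np.random.rand(20, 1536)              # the 20 task variations (placeholder)
query = task_embedding(np.random.rand(6, 1536))      # a new recording spanning 6 screens
nearest = np.argmin(np.linalg.norm(task_library - query, axis=1))
print(nearest)                                       # index of the most similar known task
```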
22. Potential Application
• Designers can query for example designs that display similar content, or for screens from apps in a similar domain
• Composability helps find a specific page of an app
◦ Suppose a designer searches for the checkout page of app A
◦ A's order page + (App B's checkout page − App B's order page)
• LayoutGAN can generate realistic GUI layouts based on user-specified constraints
◦ Screen2Vec could be applied to incorporate the semantics of GUIs and the context of user interaction
23. Limitations
• Only trained and tested on Android app GUIs
• RICO dataset
◦ Contains interaction traces within single apps → needs to generalize to multi-app traces
◦ Does not contain paid apps
• Screen2Vec does not encode the semantics of graphic icons that have no textual information
Editor's Notes
The correct GUI component is among the top 0.01% of the prediction results
Aggregating textual information is useful for representing the topic of a screen → good top 0.1% and 1% / NRMSE scores
Textual content, visual design, layout pattern, and app context
Add, subtract, and average embeddings to form meaningful new ones