4. Graph Representation Learning
Node Embedding / Graph Embedding
Representation learning is learning representations of input data, typically by transforming it or extracting features from it (by some means), making it easier to perform a task like classification or prediction. [Yoshua Bengio 2014]
Embedding is ALL you need:
word2vec, doc2vec, node2vec, item2vec, struc2vec…
5. Tasks on Graphs
Node Classification
- Predict the type of a given node.
Edge Classification / Link Prediction
- Predict whether two nodes are linked, or the type of the link.
Graph Classification
- Identify densely linked clusters of nodes.
Network Similarity
- How similar are two (sub)networks?
6. Node Embedding
Goal: Encode nodes with a function f so that similarity in the embedding space (e.g., dot product) approximates similarity in the original network.
1. Define an encoder
2. Define a node similarity function
3. Optimize the parameters of the encoder
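In symbols (a sketch in the same plain notation used later on these slides; z_u denotes the embedding of node u):
similarity(u, v) ≈ z_v^T z_u, where z_u = ENC(u)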
7. Shallow Encoding —— an embedding-lookup table
ENC(v) = Zv, where Z is a d × |V| embedding matrix and v is the one-hot indicator vector of the node.
Methods: DeepWalk [Perozzi et al. 2014 KDD], Node2vec [Grover et al. 2016 KDD], etc.
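A minimal sketch of the lookup table, assuming integer node ids (the sizes below are illustrative, not from the slides):

```python
import numpy as np

num_nodes, dim = 1000, 64                     # assumed sizes
Z = 0.01 * np.random.randn(num_nodes, dim)    # one trainable embedding per node

def ENC(v):
    """Shallow encoding: simply return the row of Z stored for node id v."""
    return Z[v]
```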
8. Shallow Methods Framework
Generate 'sentences' from the graph (different walk strategies):
• unbiased walk: DeepWalk
• biased walk: Node2vec
Idea: Optimize the node embeddings so that nodes have similar embeddings if they tend to co-occur on
short random walks over the graph.
9. DeepWalk
1. Run short fixed-length random walks starting from each node of the graph, using some strategy R.
2. For each node u, collect N(u), the multiset of nodes visited on random walks starting from u.
3. Optimize the embeddings for the task: given node u, predict its neighbors N(u).
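A minimal end-to-end sketch of these three steps, assuming networkx and gensim are available (the library choice, walk counts, and dimensions are illustrative, not from the slides):

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walk(G, start, walk_length=10):
    """Step 1: one unbiased fixed-length random walk starting from `start`."""
    walk = [start]
    for _ in range(walk_length - 1):
        nbrs = list(G.neighbors(walk[-1]))
        if not nbrs:
            break
        walk.append(random.choice(nbrs))
    return [str(n) for n in walk]                 # Word2Vec expects string tokens

def deepwalk(G, num_walks=10, walk_length=10, dim=64):
    # Step 2: collect N(u) implicitly by gathering walks from every node
    walks = [random_walk(G, n, walk_length)
             for _ in range(num_walks) for n in G.nodes()]
    # Step 3: optimize embeddings with skip-gram (negative sampling, see next slide)
    model = Word2Vec(walks, vector_size=dim, window=5,
                     min_count=0, sg=1, negative=5, workers=2)
    return {n: model.wv[str(n)] for n in G.nodes()}

embeddings = deepwalk(nx.karate_club_graph())     # e.g., 34 nodes -> 64-dim vectors
```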
10. DeepWalk Optimization
Evaluating this loss naively is slow because:
1. the nested sum over nodes gives O(|V|^2) complexity
2. the normalization term of the softmax sums over all nodes
Solution: negative sampling
• Use k negative nodes, sampled with probability proportional to degree, instead of all nodes.
• k trades off predictive accuracy against computational efficiency.
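A rough sketch of the resulting per-pair objective (not from the slides; the embedding matrix Z, the degree vector, and k are assumed, and negatives are pushed away via the standard log σ(-z·z) term):

```python
import numpy as np

def neg_sampling_loss(Z, u, v, degrees, k=5, rng=None):
    """Approximate the softmax term for a co-occurring pair (u, v) using
    k negative nodes sampled with probability proportional to degree.
    Z: |V| x d embedding matrix, degrees: length-|V| degree vector."""
    if rng is None:
        rng = np.random.default_rng(0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    prob = degrees / degrees.sum()                        # sample negatives ~ degree
    negatives = rng.choice(len(degrees), size=k, p=prob)
    loss = -np.log(sigmoid(Z[u] @ Z[v]))                  # pull the co-occurring pair together
    loss -= np.log(sigmoid(-(Z[negatives] @ Z[u]))).sum() # push sampled negatives away from u
    return loss
```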
11. Node2Vec —— Let’s generate biased walks
Idea: A flexible notion of a node's network neighborhood leads to richer node embeddings.
12. Node2Vec —— Explore neighborhoods in a BFS as well as DFS fashion.
Two parameters:
• return parameter p: return back to the previous node
• 'walk away' parameter q: moving outwards (DFS) vs. inwards (BFS); intuitively, q is the ratio of BFS vs. DFS
The walker just traversed edge (s1, w) and is now at w.
Neighbors of w can only be:
- s2: same distance to s1
- s1: back to s1
- s3/s4: farther from s1
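A minimal sketch of the resulting biased transition weights for an unweighted networkx-style graph (illustrative; `prev` is the node the walker came from, here s1, and `cur` is w):

```python
def transition_weights(G, prev, cur, p=1.0, q=1.0):
    """Unnormalized probabilities for the walker's next step from `cur`."""
    weights = {}
    for nbr in G.neighbors(cur):
        if nbr == prev:                  # back to s1: controlled by return parameter p
            weights[nbr] = 1.0 / p
        elif G.has_edge(nbr, prev):      # same distance to s1 (e.g., s2): BFS-like step
            weights[nbr] = 1.0
        else:                            # farther from s1 (e.g., s3/s4): DFS-like, 1/q
            weights[nbr] = 1.0 / q
    return weights
```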
13. Limitations of Shallow Encoders
• O(|V|) parameters are needed:
• Each node has a unique embedding.
• No sharing of parameters between nodes.
• Inherently “transductive”:
• Either not possible or very time consuming to generate
embeddings for nodes not seen during training.
• Does not incorporate node features
• many graphs have features that we can and should leverage
14. Graph Convolutional Networks
Idea:
A node's neighborhood defines a computation graph.
To obtain node representations, use a neural network to aggregate information from neighbors recursively (a depth-limited BFS).
15. Graph Convolutional Networks
• Each layer corresponds to one level of depth in the BFS.
• Nodes have an embedding at each layer.
• The layer-0 embedding of node u is its input features.
• The layer-K embedding aggregates information from nodes up to K hops away; these are the final embeddings.
So we need:
1. AGG: an aggregator for collecting information from a node's neighborhood.
2. NNs: neural networks for the neighborhood representation (e.g., NN W1) and for the node's self embedding (e.g., NN B1).
3. A loss function for optimization.
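A minimal numpy sketch of one such layer, where W_k and B_k play the role of the slide's NN W1 and NN B1 (the shapes and the ReLU nonlinearity are illustrative choices, not the canonical GCN formulation):

```python
import numpy as np

def gcn_layer(H, A, W_k, B_k):
    """One round of neighborhood aggregation.
    H:   |V| x d_in embeddings from the previous layer (layer 0 = input features)
    A:   |V| x |V| adjacency matrix (no self-loops)
    W_k: d_in x d_out weights for the aggregated neighborhood (the slide's NN W1)
    B_k: d_in x d_out weights for the node's own embedding (the slide's NN B1)"""
    deg = A.sum(axis=1, keepdims=True).clip(min=1)      # node degrees, avoiding /0
    neigh_avg = (A @ H) / deg                           # AGG: average neighbor messages
    return np.maximum(0.0, neigh_avg @ W_k + H @ B_k)   # nonlinearity (ReLU here)
```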
17. Supervised Training vs Unsupervised Training
For the shallow methods, we train the models in an unsupervised manner:
• use only the graph structure
• similar nodes have similar embeddings
• feed the 'sentences' into a skip-gram model.
For GCN, we directly train the model for a supervised task, like node classification.
We can feed the embeddings into any loss function and run stochastic gradient descent to train the parameters.
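For instance, a node-classification head can be a softmax cross-entropy over the final embeddings (a sketch only; W_out and the integer labels are assumptions, not from the slides):

```python
import numpy as np

def node_classification_loss(Z, labels, W_out):
    """Z: |V| x d final node embeddings, labels: length-|V| integer class ids,
    W_out: d x num_classes classifier weights trained jointly with the GCN."""
    logits = Z @ W_out
    logits -= logits.max(axis=1, keepdims=True)                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True) # softmax
    return -np.log(probs[np.arange(len(labels)), labels]).mean()       # cross-entropy
```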
18. Inductive capability
1. In many real applications, new nodes are often added to the graph.
We need to generate embeddings for new nodes without retraining.
This is hard to do with shallow methods.
2. In GCNs, the same aggregation parameters are shared across all nodes, so the number of model parameters is sublinear in |V| and the model generalizes to unseen nodes.
19. GraphSAGE —— Graph SAmple and aggreGatE
GCN aggregates neighbor messages by simply taking a (weighted) average. How can we do better?
Idea: Generalize the neighborhood aggregation function and concatenate the aggregated neighborhood representation with the node's own embedding.
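A minimal sketch of this idea (illustrative shapes; W is assumed to be a single 2·d_in × d_out weight matrix applied after concatenation):

```python
import numpy as np

def mean_aggregator(H, A):
    """AGG: average of each node's neighbor embeddings."""
    deg = A.sum(axis=1, keepdims=True).clip(min=1)
    return (A @ H) / deg

def graphsage_layer(H, A, W, aggregate=mean_aggregator):
    agg = aggregate(H, A)                          # generalized neighborhood aggregation
    concat = np.concatenate([H, agg], axis=1)      # [self embedding || neighborhood]
    return np.maximum(0.0, concat @ W)             # nonlinearity (ReLU here)
```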
20. Neighborhood Aggregator
Mean: take a (weighted) average of the neighbors' embeddings.
Pooling: element-wise mean or max pooling.
LSTM: apply an LSTM to a reshuffled (random-order) sequence of neighbors.
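For example, a max-pooling aggregator can be dropped into the same layer sketch above (illustrative only; GraphSAGE's pooling aggregator also transforms neighbors with a small MLP first, and the LSTM aggregator runs a recurrent network over the shuffled neighbor sequence — both omitted here for brevity):

```python
import numpy as np

def max_pool_aggregator(H, A):
    """AGG: element-wise max over each node's neighbor embeddings."""
    out = np.zeros_like(H)
    for v in range(H.shape[0]):
        nbrs = np.nonzero(A[v])[0]
        if len(nbrs):
            out[v] = H[nbrs].max(axis=0)   # element-wise max pooling
    return out
```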
21. Recap for GCN, GraphSAGE
Key Idea: Generate node embeddings based on local neighborhoods using neural networks
Graph Convolutional Network:
• Average neighborhood information and stack neural network layers
GraphSAGE:
• Generalized neighborhood aggregation (AVG, POOLING, LSTM, etc.)
22. Graph Attention Network —— Learnable Aggregator for GCN
Idea: Borrow the idea of attention mechanisms and learn to assign different weights to different
neighbors in the aggregation process.
Attention Is All You Need [Vaswani et al. 2017 NIPS]
23. Graph Attention Network —— Learnable Aggregator for GCN
a is the attention mechanism function.
e_uv indicates the importance of node u's message to node v.
α_uv are the coefficients obtained by normalizing e_uv with a softmax over node v's neighborhood.
Compute the embedding of each node in the graph following an attention strategy:
• Nodes attend over their neighbors' messages.
• This implicitly specifies different weights for different nodes in a neighborhood.
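A minimal single-head sketch of these coefficients (illustrative; a is taken to be a single-layer attention vector with a LeakyReLU, as in the original paper, and A is assumed to include self-loops so every node also attends to itself):

```python
import numpy as np

def gat_layer(H, A, W, a):
    """H: |V| x d_in features, A: adjacency with self-loops, W: d_in x d_out,
    a: attention vector of length 2*d_out. Returns |V| x d_out embeddings."""
    Z = H @ W
    out = np.zeros_like(Z)
    for v in range(Z.shape[0]):
        nbrs = np.nonzero(A[v])[0]
        # e_uv: importance of node u's message to node v
        e = np.array([a @ np.concatenate([Z[v], Z[u]]) for u in nbrs])
        e = np.where(e > 0, e, 0.2 * e)                    # LeakyReLU
        alpha = np.exp(e - e.max()); alpha /= alpha.sum()  # softmax over v's neighborhood
        out[v] = alpha @ Z[nbrs]                           # attention-weighted aggregation
    return np.maximum(0.0, out)                            # nonlinearity
```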
24. Attention Mechanism
Attention mechanism a:
The approach is agnostic to the choice of a
• The original paper uses a simple single-layer neural network
• Multi-head attention can stabilize the learning process of the attention mechanism
• a can have parameters, which need to be estimated
Parameters of a are trained jointly:
• learn the parameters together with weight matrices in an end-to-end fashion
Benefits:
• Computationally efficient:
computation of attentional coefficients can be parallelized across all edges of the graph
aggregation may be parallelized across all nodes
• Storage efficient:
sparse matrix operations do not require more than O(V+E) entries to be stored
Fixed number of parameters, irrespective of graph size
• Trivially localized:
only attends over local network neighborhoods (masked attention).
• Inductive capability:
it is a shared edge-wise mechanism
it does not depend on the global graph structure.
25. Applications —— PinSage
Challenge for Pinterest:
Scaling up GCN-based node embeddings for training and inference is difficult:
300M+ users, 4B+ pins, and 2B+ boards.
Innovations:
• Importance-based neighborhood sampling: simulate random walks and select the neighbors with the highest visit counts (importance pooling).
• Aggregating from a fixed number of selected nodes keeps the memory footprint of the algorithm under control during training.
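A rough sketch of the importance-based sampling idea (illustrative only; the walk counts, walk length, and neighborhood size T are assumptions, and this is far simpler than the production system):

```python
import random
from collections import Counter

def important_neighbors(G, node, num_walks=200, walk_length=2, T=10):
    """Simulate short random walks from `node` and keep the T most-visited
    nodes as its importance-weighted neighborhood."""
    visits = Counter()
    for _ in range(num_walks):
        cur = node
        for _ in range(walk_length):
            nbrs = list(G.neighbors(cur))
            if not nbrs:
                break
            cur = random.choice(nbrs)
            visits[cur] += 1
    visits.pop(node, None)                       # exclude the node itself
    return [n for n, _ in visits.most_common(T)] # highest visit counts first
```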