Parsing Natural Scenes and Natural Language with Recursive Neural Networks
1. Parsing Natural Scenes and Natural Language with Recursive Neural Networks
Richard Socher, Cliff Chiung-Yu Lin, Andrew Y. Ng, Christopher D. Manning
ICML 2011
Jie Cao
2. Outline
• Context
• Recursive Neural Network Definition
• Input Representation
• Output
• Greedy Structure-Predicting RNNs
• Loss Function
• Max-Margin Framework
• Backpropagation Through Structure
• L-BFGS
• Experiments and Improved RNN
7. Word Embedding Matrix
• Each word is represented as a dense vector: a column of a word embedding matrix whose vectors capture corpus co-occurrence statistics (Collobert & Weston, 2008).
Collobert, R. and Weston, J. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML, 2008
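A minimal NumPy sketch of the lookup (sizes and names such as L and word_vector are assumptions for illustration, not from the slides): a word vector is just a column of the embedding matrix.

    import numpy as np

    n, vocab_size = 100, 50000                     # assumed dimensions
    rng = np.random.default_rng(0)
    L = rng.normal(0.0, 0.01, (n, vocab_size))     # word embedding matrix

    def word_vector(i):
        # x_i = L @ e_i: multiplying by the one-hot vector e_i
        # reduces to selecting column i of L
        return L[:, i]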
8. Input Representation for Scene Image
• Compute visual features F_i for each segment i = 1, ..., N_segs in an image: 78 segments per image, with 119 features for every segment (Gould et al., 2009).
• Map the features into the "semantic" n-dimensional space:
a_i = f(W_sem F_i + b_sem)
• W_sem is the matrix of parameters we want to learn; b_sem is a bias.
• f is applied element-wise and can be any sigmoid-like function; the paper uses the original sigmoid f(x) = 1/(1 + e^(-x)).
Gould, S., Fulton, R., and Koller, D. Decomposing a Scene into Geometric and Semantically Consistent Regions. In ICCV, 2009
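A sketch of this mapping with the shapes the slide gives (119 raw features in, n semantic dimensions out); variable names and n = 100 are assumptions:

    import numpy as np

    n = 100                                        # assumed semantic dimension
    rng = np.random.default_rng(0)
    W_sem = rng.normal(0.0, 0.01, (n, 119))        # parameters to learn
    b_sem = np.zeros(n)                            # bias

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))            # the "original" sigmoid

    def segment_activation(F_i):
        # map one segment's 119 raw features into the semantic space
        return sigmoid(W_sem @ F_i + b_sem)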
9. f: X → Y (the output Y)
• For the visual parser:
• A visual tree is correct if all adjacent segments that belong to the same class (all segments are labeled) are merged into one super segment before any merges occur with super segments of different classes.
• The definition leaves open how object parts are internally merged, and how complete, neighboring objects are merged into the full scene image.
• Hence there is a set of correct trees.
• For the language parser:
• Y(x) has only one element, the annotated ground-truth tree: Y(x) = {y}.
How do we evaluate the error between the true trees Y and a predicted tree ŷ? (Loss function)
10. Recursive NN Definition
• For a potential adjacent pair of children (c_i, c_j), compute the new representation of the parent p(i,j):
p = f(W [c_i; c_j] + b)
• and the new score of that parent:
s = W_score p
• Parsing proceeds greedily over the set C of potential adjacent pairs: the highest-scoring pair is merged, the new merged parent is added back to C, and the adjacency matrix is updated; this repeats recursively until one node covers the whole input.
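A toy sketch of one greedy parsing pass under these definitions (the 4-segment adjacency, tanh as the sigmoid-like f, and all names are made up for illustration):

    import numpy as np

    n = 100
    rng = np.random.default_rng(0)
    W = rng.normal(0.0, 0.01, (n, 2 * n))          # merge weights, R^{n x 2n}
    b = np.zeros(n)
    W_score = rng.normal(0.0, 0.01, (1, n))        # scoring layer

    def merge(c_i, c_j):
        # parent representation and merge score for two children
        p = np.tanh(W @ np.concatenate([c_i, c_j]) + b)
        return p, float(W_score @ p)

    nodes = {i: rng.normal(size=n) for i in range(4)}   # toy activations
    pairs = {(0, 1), (1, 2), (2, 3)}                    # toy adjacency
    nxt = 4
    while pairs:
        # pick the highest-scoring potential adjacent pair
        (i, j), (p, s) = max(((pr, merge(nodes[pr[0]], nodes[pr[1]]))
                              for pr in pairs), key=lambda t: t[1][1])
        nodes[nxt] = p                                  # add merged parent
        # update adjacency: the parent inherits its children's neighbours
        pairs = {(nxt if a in (i, j) else a, nxt if c in (i, j) else c)
                 for (a, c) in pairs if (a, c) != (i, j)}
        pairs = {(a, c) for (a, c) in pairs if a != c}
        nxt += 1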
15. Category Classification in the RNN
Each node of the tree built by the RNN has a distributed feature representation p associated with it. We can leverage this representation by adding to each RNN parent node (after removing the scoring layer) a simple softmax layer to predict class labels:
label_p = softmax(W_label p)
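A hedged sketch of such a labeling layer (W_label's initialization and the class count are assumptions):

    import numpy as np

    n, n_classes = 100, 8                          # assumed sizes
    W_label = np.random.default_rng(0).normal(0.0, 0.01, (n_classes, n))

    def predict_label(p):
        # softmax class distribution for a node representation p
        z = W_label @ p
        z -= z.max()                               # numerical stability
        e = np.exp(z)
        return e / e.sum()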
17. Loss Function for Language
For a constituency parser (phrase-structure parser), a constituent (non-terminal) is correct only if:
1. it dominates exactly the correct span of words, and
2. it is the correct type of constituent.
Two candidate parses of "Jim ate the cookies in the bowl", differing in where the PP attaches:
(S[1:7]
(NP[1:1] Jim)
(VP[2:2] ate)
(NP[3:4] the cookies)
(PP[5:7] in
(NP[6:7] the bowl)
)
)
(S[1:7]
(NP[1:1] Jim)
(VP[2:7] ate
(NP[3:7] the cookies
(PP[5:7] in
(NP[6:7] the bowl)
)
)
)
)
The loss can then be computed Hamming-distance style: count the constituents whose (label, span) pairs disagree between the two trees, as in the sketch below.
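To make the count concrete, here is a small sketch that reads the labeled spans off the two trees above and counts the disagreements (plain Python; names are illustrative):

    # (label, start, end) triples from the two bracketings above
    tree_a = {("S",1,7), ("NP",1,1), ("VP",2,2), ("NP",3,4), ("PP",5,7), ("NP",6,7)}
    tree_b = {("S",1,7), ("NP",1,1), ("VP",2,7), ("NP",3,7), ("PP",5,7), ("NP",6,7)}

    # a constituent is correct only if span AND label both match
    diff = tree_a ^ tree_b                         # symmetric difference
    print(len(diff))                               # 4 disagreeing constituents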
18. Loss Function for Image
For the visual parser there is a set of correct trees, so the loss for proposing a parse ŷ for input x with labels l counts the incorrect subtrees:
Δ(x, l, ŷ) = Σ_{d ∈ N(ŷ)} κ · 1{subTree(d) ∉ Y(x, l)}
where N(ŷ) is the set of non-terminal nodes of ŷ and κ is a per-node penalty.
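A direct transcription of this loss into code (the container types and the value of κ are left as assumptions):

    def delta(predicted_nonterminals, correct_subtrees, kappa):
        # penalty kappa for each subtree of the proposal that is not
        # a subtree of any correct tree for (x, l)
        return kappa * sum(1 for d in predicted_nonterminals
                           if d not in correct_subtrees)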
19. RNN for Structure Prediction
Given the training set, we search for a function f with small expected loss on unseen inputs:
f_θ(x) = arg max_{ŷ ∈ T(x)} s(RNN(θ, x, ŷ))
where T(x) is the set of possibly correct trees for x. We assume the problem can be described in terms of a computationally tractable max over the score function s. How do we define the margin?
20. Max-Margin
Hard margin: each correct tree must outscore every other tree by at least the size of that tree's loss:
s(RNN(θ, x_i, y_i)) ≥ s(RNN(θ, x_i, ŷ)) + Δ(x_i, l_i, ŷ) for all ŷ ∈ T(x_i)
Soft margin: add a slack variable to handle non-separable data. Minimizing the slack gives the structured hinge loss that we need to minimize:
r_i(θ) = max_{ŷ ∈ T(x_i)} ( s(RNN(θ, x_i, ŷ)) + Δ(x_i, l_i, ŷ) ) − max_{y ∈ Y(x_i, l_i)} s(RNN(θ, x_i, y))
The max over the true trees Y(x_i, l_i) appears because an image has more than one correct tree.
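The hinge risk is just a difference of two maxes; a minimal sketch with hypothetical score/loss callables and candidate sets:

    def hinge_risk(score, loss, candidates, gold_trees):
        # r_i = max over all candidate trees of (score + loss),
        # minus the best score among the correct trees
        worst_violator = max(score(t) + loss(t) for t in candidates)
        best_gold = max(score(t) for t in gold_trees)
        return worst_violator - best_gold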
27. Experiments in ICML 2011
The final unlabeled bracketing F-measure of our language parser is 90.29%, compared to 91.63% for the widely used Berkeley parser (Petrov et al., 2006); development F1 is virtually identical (92.06% for the RNN, 92.08% for the Berkeley parser).
Unlike most previous systems, our parser does not provide
a parent with information about the syntactic categories of
its children. This shows that our learned, continuous
representations capture enough syntactic information to
make good parsing decisions.