Learning to Grow Structured Visual Summaries for Document Collections
1. Learning to Grow Structured Visual Summaries for Document Collections
Daniil Mirylenka, Andrea Passerini
University of Trento, Italy
Machine learning seminar, Waikato University, 2013
4. Building the topic graph:
Overview
1. Map documents to Wikipedia articles
2. Retrieve the parent categories
3. Link categories to each other
4. Merge similar topics
5. Break cycles in the graph
5. Building the topic graph:
Mapping the documents to Wikipedia articles
“…we propose a method of summarizing collections of documents with concise topic hierarchies, and show how it can be applied to visualization and browsing of academic search results.”
⇓
“…we propose a method of summarizing collections of documents with concise [[Topic (linguistics)|topic]] [[Hierarchy|hierarchies]], and show how it can be applied to [[Visualization (computer graphics)|visualization]] and [[Web browser|browsing]] of [[List of academic databases and search engines|academic search]] results.”
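The mapping step above can be sketched as a minimal dictionary-based wikifier: known phrases are wrapped in `[[Article|anchor]]` wiki-link markup. The `ANCHORS` phrase-to-article table is an assumed illustration; real wikification systems disambiguate candidates statistically.

```python
# Hypothetical sketch of wikification: wrap known phrases in wiki-link markup.
# The ANCHORS mapping is an assumed toy dictionary, not the authors' system.
import re

ANCHORS = {  # assumed phrase -> Wikipedia article mapping
    "topic": "Topic (linguistics)",
    "hierarchies": "Hierarchy",
    "visualization": "Visualization (computer graphics)",
}

def wikify(text: str) -> str:
    """Wrap each known phrase in [[Article|anchor]] markup."""
    for phrase, article in ANCHORS.items():
        text = re.sub(rf"\b{re.escape(phrase)}\b",
                      f"[[{article}|{phrase}]]", text)
    return text

print(wikify("concise topic hierarchies"))
# -> concise [[Topic (linguistics)|topic]] [[Hierarchy|hierarchies]]
```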
10. Building the topic graph:
Example of an actual topic graph built from 100 abstracts
11–13. Summarizing the topic graph
Reflection
What is a summary?
- a set of nodes (topics).
What is a good summary?
- subjective ⇒ let’s learn from examples!
14–16. Summarizing the topic graph
The first attempt
Structured prediction:
    Ĝ_T = arg max_{G_T} F(G, G_T)
Problem: evaluation over C(|G|, T) subgraphs
- Example: a 300-node topic graph, a 10-node summary
  ⇒ 1 398 320 233 241 701 770 possible subgraphs
  (at 1 million graphs per second ⇒ ≈ 44 311 years)
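The count on the slide is the binomial coefficient C(300, 10), which can be checked directly:

```python
# Checking the slide's combinatorics: a 10-node summary chosen from a
# 300-node topic graph gives C(300, 10) candidate subgraphs.
import math

n_candidates = math.comb(300, 10)
print(n_candidates)            # 1398320233241701770
seconds = n_candidates / 1e6   # at 1 million graphs per second
years = seconds / (365.25 * 24 * 3600)
print(years)                   # roughly 44 thousand years
```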
17–19. Summarizing the topic graph
Key idea
Restriction: summaries should be nested:
    ∅ = G_0 ⊂ G_1 ⊂ · · · ⊂ G_T
Now we can build summaries sequentially:
    G_t = G_{t−1} ∪ {v_t}
Still a supervised learning problem
- training data: summary sequences (G, G_1, G_2, · · · , G_T)
- or topic sequences: (G, v_1, v_2, · · · , v_T)
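The nesting restriction turns the search over C(|G|, T) subgraphs into T greedy choices of one topic at a time. A minimal sketch, where `score` is a hypothetical stand-in for the learned scoring function:

```python
# Sketch of nested summary growth: G_t = G_{t-1} union {v_t}.
# `score` is an assumed placeholder for the learned scoring function.

def grow_summary(topics, T, score):
    """Greedily grow a nested summary, one topic per step."""
    summary = set()                      # G_0 = empty set
    for _ in range(T):
        candidates = [v for v in topics if v not in summary]
        v_t = max(candidates, key=lambda v: score(summary, v))
        summary.add(v_t)                 # G_t = G_{t-1} union {v_t}
    return summary

# toy usage: a dummy score that just prefers longer topic names
picked = grow_summary(["ai", "ml", "nlp", "graphs"], 2,
                      score=lambda s, v: len(v))
print(sorted(picked))                    # ['graphs', 'nlp']
```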
20–21. Learning to grow summaries as imitation learning
Imitation learning (racing analogy)
- destination: the finish line
- states: the sequence of positions along the track
- actions: the driver’s actions (steering, etc.)
- goal: copy the behaviour
[Figure: supervised training on expert trajectories yields a learned policy π̂_sup; borrowed from the presentation of Stephane Ross]
Our problem
- destination: the summary G_T
- states: intermediate summaries G_0, G_1, · · · , G_{T−1}
- actions: topics v_1, v_2, · · · , v_T added to the summaries
- goal: copy the behaviour
22–24. Learning to grow summaries
How can we do that?
Straightforward approach
- choose a classifier π : (G, G_{t−1}) → v_t
- train it on the ‘ground truth’ examples ((G, G_{t−1}), v_t)
- apply it sequentially to new graphs:
    ∅ = Ĝ_0 —π(G,·)→ Ĝ_1 —π(G,·)→ · · · —π(G,·)→ Ĝ_T
Will it work? No.
(it is unable to recover from its own mistakes)
25–27. Learning to grow summaries
DAgger (dataset aggregation)
S. Ross, G. J. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. Journal of Machine Learning Research - Proceedings Track, 15:627–635, 2011.
Idea: train on the states we are actually going to encounter
(the states generated by our own policy)
How can we do that? We haven’t trained the classifier yet!
We will do it iteratively (for i = 0, 1, …):
- train the classifier π_i on the dataset D_i
- generate trajectories using π_i
- add the new states to the dataset D_{i+1}
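The iteration described above can be sketched as a small loop. `train`, `rollout`, and `expert_action` are assumed helpers standing in for the classifier, the policy roll-out, and the expert labelling, not the authors' code:

```python
# Hedged sketch of the DAgger loop (Ross et al., 2011): train on an
# aggregated dataset, roll out the current policy, and label the visited
# states with expert actions. All three helpers are assumed placeholders.

def dagger(initial_dataset, n_iters, train, rollout, expert_action):
    dataset = list(initial_dataset)      # D_0: ground-truth pairs
    policy = train(dataset)              # pi_0
    for _ in range(n_iters):
        states = rollout(policy)         # states pi_i actually visits
        dataset += [(s, expert_action(s)) for s in states]  # build D_{i+1}
        policy = train(dataset)          # pi_{i+1}
    return policy
```

A toy instantiation (a lookup-table "classifier") is enough to exercise the loop end to end.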
28–29. Learning to grow summaries
Collecting the actions
DAgger (dataset aggregation)
- iterating, we collect states
- but we also need actions
“Let the expert steer”
Q: Which action is optimal?
A: The one that brings us closest to the optimal trajectory.
[Figure: DAgger collects new trajectories with steering from the expert; borrowed from the presentation of Stephane Ross]
30. Learning to grow summaries
Recap of the algorithm
The algorithm
- ‘ground truth’ dataset: (state, action) pairs
- train π on the ‘ground truth’ dataset
- apply π to the initial states to generate trajectories
- generate the expert’s actions for the visited states
- add the new state-action pairs to the dataset
- repeat
31. Learning to grow summaries
Training the classifier
Classifier:
    π : (G, G_{t−1}) → v_t
Scoring function:
    F(G, G_{t−1}, v_t) = ⟨w, Ψ(G, G_{t−1}, v_t)⟩
Prediction:
    v_t = arg max_v F(G, G_{t−1}, v)
Learning: SVMstruct
- ensures that optimal topics score best
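The prediction step is a linear score followed by an argmax over candidate topics. A minimal sketch, where the feature map `psi` is a hypothetical stand-in (the slides learn w with SVMstruct):

```python
# Sketch of the prediction rule: v_t = argmax_v <w, Psi(G, G_prev, v)>.
# `psi` is an assumed toy feature map, not the authors' feature set.
import numpy as np

def predict_topic(w, psi, G, G_prev, candidates):
    """Return the candidate topic with the highest linear score."""
    scores = [np.dot(w, psi(G, G_prev, v)) for v in candidates]
    return candidates[int(np.argmax(scores))]

# toy usage: a feature map whose first feature is the topic id itself
w = np.array([1.0, 0.0])
psi = lambda G, G_prev, v: np.array([float(v), 1.0])
print(predict_topic(w, psi, None, None, [3, 7, 5]))  # 7
```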
32. Learning to grow summaries
Providing the expert’s actions
Expert’s action
- brings us closest to the optimal trajectory
Technically: it minimizes a loss against the optimal summary:
    v_t = arg min_v Δ(G_{t−1} ∪ {v}, G_t^opt)
Loss functions
- treating graphs as plain topic sets leads to redundancy
- key: consider the similarity between topics
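The expert's choice can be sketched as an argmin over candidate additions. The symmetric-difference loss below is a simplified stand-in treating summaries as plain sets, which is exactly the case the slide warns leads to redundancy; the actual losses also account for topic similarity:

```python
# Sketch of the expert action: pick the topic whose addition minimizes a
# loss against the optimal summary G_t^opt. The set loss is a simplistic
# stand-in; the slides' losses also consider similarity between topics.

def expert_action(G_prev, candidates, G_opt, loss):
    """v_t = argmin_v loss(G_prev | {v}, G_opt)."""
    return min(candidates, key=lambda v: loss(G_prev | {v}, G_opt))

def symmetric_diff(A, B):
    """Toy loss: size of the symmetric difference of two topic sets."""
    return len(A ^ B)

print(expert_action({"a"}, ["b", "c"], {"a", "b"}, symmetric_diff))  # b
```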
33. Learning to grow summaries
Graph features
Some of the features:
document coverage
transitive document coverage
average and max. overlap between topics
average and max. parent-child overlap
the height of the graph
the number of connected components
...
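Two of the listed features can be sketched directly, under the assumption that each topic carries the set of documents it covers; the data layout here is illustrative, not the authors' representation:

```python
# Rough sketch of two listed graph features, assuming `docs_of` maps each
# topic to the set of documents it covers (an assumed toy representation).
from itertools import combinations

def document_coverage(summary, docs_of):
    """Fraction of all documents covered by the summary's topics."""
    covered = set().union(*(docs_of[t] for t in summary)) if summary else set()
    total = set().union(*docs_of.values())
    return len(covered) / len(total)

def max_topic_overlap(summary, docs_of):
    """Maximum Jaccard overlap between any two topics in the summary."""
    return max((len(docs_of[a] & docs_of[b]) / len(docs_of[a] | docs_of[b])
                for a, b in combinations(summary, 2)), default=0.0)

docs = {"ml": {1, 2}, "ai": {2, 3}, "db": {4}}
print(document_coverage(["ml", "ai"], docs))   # 0.75
print(max_topic_overlap(["ml", "ai"], docs))   # 1/3
```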