The document analyzes the explainability of GraphSum, an abstractive multi-document summarization model, by examining its attention weights. It finds that attention weights from the later decoding layers correlate more strongly with the relevance of the input text segments, which makes them more useful for explanation. It also finds that GraphSum performs better on the news domain when paragraphs rather than sentences are used as input units, because paragraphs serve as a structural aid rather than separating topics in news articles. The document concludes that attention weights and expert annotations may provide better insight into abstractive summarization than ROUGE scores alone.
Analysis of GraphSum's Attention Weights to Improve the Explainability of Multi-Document Summarization
1. Analysis of GraphSum’s Attention Weights to Improve the Explainability of Multi-Document Summarization
06.04.2022
M.L. Hickmann, F. Wurzberger, M. Hoxhalli, A. Lochner, J. Töllich and A. Scherp
2. Extractive vs. Abstractive MDS
[Diagram: extractive vs. abstractive MDS — in both cases, input documents are fed to a model that produces a summary]
4. Research Questions
[Diagram: research questions — (1) Quality: does using sentences vs. paragraphs as textual units change summary quality? (2) Explainability: can we explain how the model derives the summary from the input documents?]
5. GraphSum
Source: Li et al. “Leveraging Graph to Improve Abstractive Multi-Document Summarization” (2020)
8. Pre-Processing
[Pipeline: extraction → truncation / padding → TF-IDF graph construction]
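A minimal sketch of the TF-IDF graph-construction step, assuming scikit-learn; the function name and the similarity threshold are illustrative choices, not values taken from the GraphSum code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def build_tfidf_graph(paragraphs, threshold=0.02):
    """Return a symmetric TF-IDF cosine-similarity matrix over textual units."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(paragraphs)
    sim = cosine_similarity(tfidf)      # pairwise cosine similarities (dense matrix)
    np.fill_diagonal(sim, 0.0)          # drop self-loops
    sim[sim < threshold] = 0.0          # sparsify: keep only sufficiently similar pairs
    return sim

paragraphs = [
    "The storm hit the coast on Monday, officials said.",
    "Officials reported heavy damage along the coast after the storm.",
    "The city council approved a new housing budget.",
]
print(build_tfidf_graph(paragraphs).round(2))
```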
9. GraphSum Training Procedure
[Pipeline: build TF-IDF graph → train GraphSum model → evaluate performance]
Architecture and hyper-parameters as suggested by Li et al., “Leveraging Graph to Improve Abstractive Multi-Document Summarization” (2020)
Use similarity graph generated by pre-processing
Use multiple batch sizes
Same number of input tokens
Train / validation / test split
10. ROUGE Score
ROUGE-2: Overlapping bi-grams
ROUGE-L: Longest common subsequence
Final score based on F-score, as proposed by Chin-Yew Lin, “ROUGE: A Package for Automatic Evaluation of Summaries” (2004)
[Diagram: reference vs. candidate overlap for ROUGE-2 and ROUGE-L]
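A minimal sketch of ROUGE-2 as a bigram-overlap F-score (Lin, 2004); the helper names are ours, and the full toolkit additionally handles stemming, stop words, and multiple references.

```python
from collections import Counter

def bigrams(tokens):
    """Count the bigrams occurring in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))

def rouge_2_f(candidate, reference):
    """F-score over bigrams shared between candidate and reference summaries."""
    cand, ref = bigrams(candidate.split()), bigrams(reference.split())
    overlap = sum((cand & ref).values())   # clipped count of matching bigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge_2_f("the cat sat on the mat", "the cat lay on the mat"))
```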
12. Approach for Explainability Improvement
13. Data Sets
Sentence vs. paragraph comparison: MultiNews only
Explainability analysis: MultiNews and WikiSum
MultiNews:
Human-written news summaries by professionals (60,000 documents)
WikiSum:
Wikipedia articles and their references, framed as an MDS task (2.3 million articles)
14. Results: Textual Unit Comparison
19. Correlation between Attention Weights and Reference Metric
[Scatter plots on MultiNews: attention weights vs. reference metric — layer 6 shows high correlation, layer 3 shows low correlation]
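A hedged sketch of the correlation analysis, assuming SciPy: for each decoding layer, the attention weight a paragraph receives is correlated (Pearson) with a reference relevance score, e.g. that paragraph's ROUGE against the gold summary. Array shapes and names are illustrative, not GraphSum's internal format.

```python
import numpy as np
from scipy.stats import pearsonr

def layerwise_correlation(attention, relevance):
    """attention: (num_layers, num_paragraphs) graph-attention weights per layer;
    relevance:  (num_paragraphs,) reference scores, e.g. per-paragraph ROUGE."""
    return [pearsonr(layer, relevance)[0] for layer in attention]

rng = np.random.default_rng(0)
attn = rng.random((6, 30))        # toy example: 6 decoder layers, 30 paragraphs
rel = rng.random(30)
for i, r in enumerate(layerwise_correlation(attn, rel), start=1):
    print(f"layer {i}: Pearson r = {r:+.2f}")
```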
21. Conclusion
Paragraphs perform better than sentences for news domain
Paragraphs are used as structural aid, not for topic separation
Other domains may show different behaviour
Attention weights improve explainability of MDS
Attention weights provide source origin information
Later decoding layers are more suitable
ROUGE score might not be fully applicable as metric for abstractive MDS
ROUGE score is not well suited to, e.g., paraphrased sentences
Expert annotated source information could provide better insights
Code available on GitHub: https://github.com/arnelochner/GBTBMDS
Editor's notes
Paragraphs:
- Leveraging inter-paragraph relations can provide the model additional information for detecting contextual relations between topics.
Sentences:
- Our rationale is that with sentences as textual units, the graph structure represents inter-sentence relations, which may provide more detailed information within topics and thus may improve the results.
Batch Sizes
GraphSum model hyperparameters as proposed by Li et al.
Use tokenizer for extraction
Same number of tokens
We used ROUGE scores as the reference.
Pearson Correlation
WikiSum was not used for the sentence vs. paragraph comparison due to resource limitations.
Averaged Runs
MultiNews example
Based on these findings, we aggregated the attention weights of the multi-heads in the subsequent analysis.
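A hedged sketch of one plausible way to perform the multi-head aggregation mentioned above: averaging attention over heads and decoded target tokens to obtain a single weight per input paragraph. The exact aggregation scheme used in the work may differ, and the shapes below are illustrative.

```python
import numpy as np

def aggregate_attention(attn):
    """attn: (num_heads, num_target_tokens, num_paragraphs) attention weights.
    Returns one aggregated weight per input paragraph."""
    per_head = attn.mean(axis=1)    # average over decoded target tokens
    return per_head.mean(axis=0)    # average over attention heads

weights = aggregate_attention(np.random.rand(8, 30, 5))  # toy: 8 heads, 30 tokens, 5 paragraphs
print(weights)
```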