[poster] Detecting Incongruity Between News Headline and Body Text via a Deep Hierarchical Encoder
1. Performance comparison
Performance over long text input
The proposed models showed robustness in handling long articles
Evaluation in the real world supports the validity of the dataset / approach
- Manual validation
- Surveys on AMT workers
Detecting Incongruity Between News Headline and Body Text
via a Deep Hierarchical Encoder
Seunghyun Yoon*, Kunwoo Park*, Joongbo Shin, Hongjun Lim, Seungpil Won, Meeyoung Cha and Kyomin Jung
Seoul National University, KAIST
Full paper in the emerging track on Artificial Intelligence for Social Impact (oral presentation on 29 Jan, 14:00-15:30)
Research Problem
Detecting incongruity between news headline and body text
(i.e., when a news headline does not correctly represent the story, as in
advertisements, clickbait, fake news, hijacked stories, etc.)
Why is it important?
- People are less likely to read the whole content and often read only the headline
- News headlines play an important role in forming readers' first impressions,
which persist even after reading the whole article
Most news is shared on online platforms based on headlines
Previous efforts
- Public competition aiming to facilitate research on developing algorithms
(e.g., Fake News Challenge in 2017)
- Feature-based approaches to detecting ambiguous headlines (Wei et al. 2017)
No sophisticated model could be proposed or fully evaluated
because no suitable corpus was available for research
Proposed approaches
- Attentive Hierarchical Dual Encoder (AHDE): encodes the full news article
from the word level to the paragraph level with an attention mechanism
- Independent Paragraph (IP) Method: splits paragraphs in the body text and
learns the relationship between each paragraph and headline
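The IP method can be illustrated with a minimal sketch. It assumes that paragraphs are separated by blank lines and that the per-paragraph scores are combined by taking the maximum; both are assumptions of this example, not details stated on the poster. `toy_scorer` is a hypothetical stand-in for a trained headline-paragraph model.

```python
from typing import Callable, List


def split_paragraphs(body: str) -> List[str]:
    """Split body text on blank lines and drop empty paragraphs."""
    return [p.strip() for p in body.split("\n\n") if p.strip()]


def ip_score(headline: str, body: str,
             pair_scorer: Callable[[str, str], float]) -> float:
    """Score every (headline, paragraph) pair and return the maximum.

    Max aggregation is an assumption of this sketch.
    """
    paragraphs = split_paragraphs(body)
    if not paragraphs:
        raise ValueError("body has no paragraphs")
    return max(pair_scorer(headline, p) for p in paragraphs)


def toy_scorer(headline: str, paragraph: str) -> float:
    """Hypothetical scorer: 1 minus the Jaccard word overlap."""
    h = set(headline.lower().split())
    p = set(paragraph.lower().split())
    union = h | p
    return 1.0 - len(h & p) / len(union) if union else 0.0
```

With this design, a single unrelated paragraph is enough to raise the score of the whole article, which matches the intuition that one injected paragraph makes a headline incongruent.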
Experimental Results
Dataset Generation
Data source
- Main dataset: a near-complete crawl of news articles published in South
Korea (period: January 2017 – October 2017)
- Supplementary dataset: 136K news articles in English (Horne et al. 2018)
Two types of dataset
- Whole-corpus: Inject paragraphs that are negatively sampled from other
articles into an original article to make the headline become incongruent
- Paragraph-corpus: Transform a pair of headline and body text into multiple
sub-pairs of headline and paragraph
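The two corpus types can be sketched as follows. This is a rough illustration under stated assumptions: the number of injected paragraphs, the random insertion positions, and the paragraph-level labels (1 for an injected paragraph, 0 for an original one) are choices made for this example, and the function names are hypothetical.

```python
import random
from typing import List, Tuple

Article = Tuple[str, List[str]]  # (headline, paragraphs)


def make_incongruent(article: Article, others: List[Article],
                     n_inject: int, rng: random.Random
                     ) -> Tuple[str, List[Tuple[str, int]]]:
    """Whole-corpus style sample: inject paragraphs from other articles.

    Returns the headline and a list of (paragraph, label) pairs, where
    label 1 marks an injected paragraph (an assumption of this sketch).
    """
    headline, paragraphs = article
    pool = [p for _, paras in others for p in paras]
    injected = rng.sample(pool, n_inject)  # negative sampling
    labeled = [(p, 0) for p in paragraphs]
    for p in injected:
        pos = rng.randint(0, len(labeled))  # any position, including the end
        labeled.insert(pos, (p, 1))
    return headline, labeled


def to_paragraph_corpus(headline: str, labeled: List[Tuple[str, int]]
                        ) -> List[Tuple[str, str, int]]:
    """Paragraph-corpus style: one (headline, paragraph, label) per paragraph."""
    return [(headline, p, label) for p, label in labeled]
```

Original paragraphs keep their relative order, so only the injected paragraphs disturb the flow of the article.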
Methodology
Baseline approaches
- XGBoost: the winning model in Fake News Challenge (FNC-1)
- Recurrent Dual Encoder (RDE): independently learns sequential patterns of
headline and body text via dual RNNs
- Convolutional Dual Encoder (CDE): independently learns textual patterns of
headline and body text via dual CNNs
[Figure: passage encoding and attention layers (tanh, ∑)]
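A paragraph-level attention step of the kind AHDE uses could look like the sketch below. It assumes a standard additive attention form (a tanh over projected paragraph and headline vectors, a softmax over the scores, then a weighted sum), which may differ from the exact formulation in the paper. The parameter names `w_p`, `w_h`, and `v` are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Multiply a matrix by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def attend(paragraphs: List[Vector], headline: Vector,
           w_p: Matrix, w_h: Matrix, v: Vector) -> Tuple[Vector, Vector]:
    """Additive attention over paragraph vectors, conditioned on the headline.

    Returns (attention weights, weighted sum of paragraph vectors).
    """
    h_proj = matvec(w_h, headline)
    scores = []
    for u in paragraphs:
        p_proj = matvec(w_p, u)
        hidden = [math.tanh(a + b) for a, b in zip(p_proj, h_proj)]
        scores.append(sum(a * b for a, b in zip(v, hidden)))
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(paragraphs[0])
    context = [sum(w * u[i] for w, u in zip(weights, paragraphs))
               for i in range(dim)]
    return weights, context
```

The weights sum to one, so the context vector is a convex combination of the paragraph vectors.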
Without IP: AHDE was the best
With IP: AHDE was (still) the best
Enjoy more advantages compared to hierarchical models
[Figure panels: Without IP, With IP]
Summary of contributions
RELEASE a million-scale dataset for research on misinformation
PROPOSE two neural networks that efficiently learn the textual relationship
between headline and body text
SHOW that the models trained on the released corpus achieve decent performance
on the real-world dataset
The code, data, and paper are available at http://david-yoon.github.io