Building a Microblog Corpus for Search Result Diversification
1. Building a Microblog Corpus for Search Result Diversification
AIRS 2013, Singapore, December 10
Ke Tao, Claudia Hauff, Geert-Jan Houben
Web Information Systems, TU Delft, the Netherlands
2. Research Challenges
1. Diversification needed: users tend to issue short,
underspecified queries when searching microblogs
2. Lack of a corpus for diversification studies: how can
one build a microblog corpus for evaluating search
result diversification?
[Diagram: a query retrieves tweets into a search result; a diversification strategy, guided by diversity judgments, turns it into a diversified result.]
3. Methodology – Overview
1. Data Source
• How can we find a good representative Twitter dataset?
2. Topic Selection
• How do we select the search topics?
3. Tweets Pooling
• Which tweets are we going to annotate?
4. Diversity Annotation
• How do we annotate the tweets with diversity characteristics?
4. Methodology – Data source
• From where?
• Twitter sampling API: around 1% of the complete Twitter stream
• Duration
• From February 1st to March 31st, 2013
• Coincides with the TREC 2013 Microblog Track
• Tools
• Twitter Public Stream Sampling Tools by @lintool
• Amazon EC2 in EU
TREC 2013 Microblog Guideline: https://github.com/lintool/twitter-tools/wiki/TREC-2013-Track-Guidelines
Twitter Public Stream Sampling Tool: https://github.com/lintool/twitter-tools/wiki/Sampling-the-public-Twitter-stream
5. Methodology – Topic Selection
How do we select the search topics?
• Candidates from the Wikipedia Current Events Portal
• Sufficiently important
• Of more than merely local interest
• Temporal characteristics
• Evenly distributed over the two-month period
• Enables further analysis of temporal characteristics
• Selected
• 50 topics on trending news events
Wikipedia Current Events Portal: http://en.wikipedia.org/wiki/Portal:Current_events
6. Methodology – Tweets Pooling – 1/2
Maximize coverage & Minimize effort
• Challenge in adopting existing pooling solutions
• Lack of access to multiple retrieval systems
• Topic Expansion
• Manually created queries for each topic
• Aim for maximum coverage of tweets relevant to the topic
• Duplicate Filtering
• Filter out duplicate tweets (cosine similarity > 0.9)
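The duplicate-filtering step could be sketched as follows. The 0.9 cosine-similarity threshold comes from the slides; the whitespace tokenization, the greedy keep-first policy, and all function names are illustrative assumptions:

```python
import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    # Cosine similarity over simple term-frequency vectors
    # (tokenization by whitespace is an assumption for illustration).
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def filter_duplicates(tweets: list[str], threshold: float = 0.9) -> list[str]:
    # Greedily keep a tweet only if it is not near-identical
    # (cosine similarity > threshold) to any already kept tweet.
    kept: list[str] = []
    for tweet in tweets:
        if all(cosine_sim(tweet, k) <= threshold for k in kept):
            kept.append(tweet)
    return kept
```

In a real pooling pipeline, the vector representation (e.g., TF-IDF weighting and proper tweet tokenization) would likely be more elaborate than this sketch.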
7. Methodology – Tweets Pooling – 2/2
Topic Expansion Example
Topic: “Hillary Clinton steps down as United States Secretary of State”
→ the expanded query must cover the possible variety of expressions used in tweets
8. Methodology – Diversity Annotation
Annotation Efforts
• 500 tweets for each topic
• No identification of subtopics beforehand
• Tweets about the general topic only (= no added value) are judged non-relevant
• URLs are not checked further, since links may become unavailable over time
• 50 topics split between 2 annotators
• Subjective process
• Comparative results reported later
• 3 topics dropped, e.g. due to insufficient diversity or too few relevant documents
9. Topic Analysis – The Topics and Subtopics 1/2

                          All topics   Annotator 1   Annotator 2
Avg. #subtopics           9.27         8.59          9.88
Std. dev. #subtopics      3.88         5.11          2.14
Min. #subtopics           2            2             6
Max. #subtopics           21           21            13
On average, we found 9 subtopics per topic.
The subjectivity of the annotation is confirmed by the
difference between the two annotators in the standard
deviation of the number of subtopics per topic.
10. Topic Analysis – The Topics and Subtopics 2/2
On average, the annotators spent 6.6 seconds to
annotate a tweet. Most tweets are assigned
exactly one subtopic.
11. Topic Analysis – The Relevance Judgments 1/2
• Varying diversity across topics
• 25 topics have fewer than 100 tweets with subtopics
• 6 topics have more than 350 tweets with subtopics
• Difference between the 2 annotators
• On average, 96 vs. 181 tweets with subtopic assignments
[Chart: number of relevant vs. non-relevant documents (0–500) per topic.]
12. Topic Analysis – The Relevance Judgments 2/2
• Temporal persistence
• Some topics are active during the entire timespan
• Northern Mali conflict
• Syrian civil war
• As short as 24 hours for other topics
• BBC Twitter account hacked
• Eiffel Tower evacuated due to bomb threat
[Chart: per-topic difference in days (0–60), showing each topic’s temporal persistence.]
13. Topic Analysis – Diversity Difficulty
• How difficult is it to diversify the search results? It depends on:
• Ambiguity or under-specification of topics
• Diverse content available in the corpus
• Golbus et al. proposed the diversity difficulty measure dd
• dd > 0.9: an arbitrarily ranked list is likely to cover all subtopics
• dd < 0.5: subtopics are hard to discover with an untuned retrieval system

                               All topics   Annotator 1   Annotator 2
Avg. diversity difficulty      0.71         0.72          0.70
Std. dev. diversity difficulty 0.07         0.06          0.07
Golbus et al.: Increasing evaluation sensitivity to diversity. Information Retrieval (2013) 16
14. Topic Analysis – Diversity Difficulty
• How difficult is it to diversify the search results? It depends on:
• Ambiguity or under-specification of topics
• Diverse content available in the corpus
• Golbus et al. proposed the diversity difficulty measure dd
• dd > 0.9 indicates a diverse query
• dd < 0.5: subtopics are hard to discover with an untuned retrieval system
• Difference between long- and short-term topics
• Topics with a longer timespan (>50 days) are easier in terms of diversity
difficulty (0.73 > 0.70)
Golbus et al.: Increasing evaluation sensitivity to diversity. Information Retrieval (2013) 16
15. Diversification by De-Duplicating – 1/6
Lower redundancy, but higher diversity?
• In previous work, we were motivated by the fact that
• 20% of search results contain duplicate information to varying extents
• Therefore, we proposed to remove duplicates in order to
achieve lower redundancy in the top-k results
• Implemented with a machine learning framework
• Makes use of syntactical, semantic, and contextual features
• Eliminates identified duplicates at the lower rank in the search results
Can this also achieve higher diversity?
Tao et al.: Groundhog Day: Near-duplicate Detection on Twitter. In Proceedings of the 22nd
International World Wide Web Conference.
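The rank-based elimination step could be sketched as below. The duplicate classifier itself (built on syntactical, semantic, and contextual features in the cited work) is abstracted here as an `is_duplicate` predicate; that predicate and the function name are illustrative assumptions:

```python
from typing import Callable

def deduplicate_ranking(ranked_tweets: list[str],
                        is_duplicate: Callable[[str, str], bool]) -> list[str]:
    # Walk the ranking top-down; drop any tweet the classifier flags
    # as a duplicate of a higher-ranked tweet that was already kept,
    # so the lower-ranked copy is always the one eliminated.
    kept: list[str] = []
    for tweet in ranked_tweets:
        if not any(is_duplicate(higher, tweet) for higher in kept):
            kept.append(tweet)
    return kept
```

The question the slide raises is whether this redundancy-driven pruning also improves diversity measures, which the following slides examine.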
16. Diversification by De-Duplicating – 2/6
Measures
• We adopt the following measures:
• alpha-(n)DCG
• Precision-IA
• Subtopic-Recall
• Redundancy
Clarke et al.: Novelty and Diversity in Information Retrieval Evaluation. In Proceedings of
SIGIR, 2008.
Agrawal et al.: Diversifying Search Results. In Proceedings of WSDM, 2009.
Zhai et al.: Beyond Independent Relevance: Methods and Evaluation Metrics for Subtopic
Retrieval. In Proceedings of SIGIR, 2003.
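Two of the listed measures can be sketched from their published definitions. The per-document subtopic sets and function names below are illustrative assumptions:

```python
import math

def subtopic_recall_at_k(ranking: list[set[str]],
                         all_subtopics: set[str], k: int) -> float:
    # Subtopic recall @k (Zhai et al., 2003): fraction of a topic's
    # subtopics covered by the union of the top-k results. Each entry
    # of `ranking` is the set of subtopics that document is judged
    # relevant to (empty set = non-relevant).
    covered: set[str] = set()
    for subtopics in ranking[:k]:
        covered |= subtopics
    return len(covered & all_subtopics) / len(all_subtopics)

def alpha_dcg_at_k(ranking: list[set[str]],
                   alpha: float = 0.5, k: int = 10) -> float:
    # alpha-DCG@k (Clarke et al., 2008): a document's gain for a subtopic
    # decays by a factor (1 - alpha) for each higher-ranked document that
    # already covered that subtopic; gains are discounted by log2(rank+1).
    seen: dict[str, int] = {}   # subtopic -> times covered so far
    score = 0.0
    for i, subtopics in enumerate(ranking[:k]):
        gain = sum((1 - alpha) ** seen.get(s, 0) for s in subtopics)
        for s in subtopics:
            seen[s] = seen.get(s, 0) + 1
        score += gain / math.log2(i + 2)
    return score
```

Normalizing alpha-DCG by the score of an ideal reordering of the judged documents yields alpha-nDCG.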
17. Diversification by De-Duplicating – 3/6
Baseline and De-Duplication Strategies
• Baseline Strategies
• Automatic run: standard queries (no more than 3 terms)
• Filtered Auto: duplicates filtered out based on cosine similarity
• Manual run: manually created complex queries with automatic filtering
• De-Duplication Strategies
• Sy = Syntactical, Se = Semantic, Co = Contextual
• Four strategies: Sy, SyCo, SySe, SySeCo
18. Diversification by De-Duplicating – 4/6
Overall comparison
Overall, the de-duplication strategies did achieve
lower redundancy. However, they did not achieve
higher diversity.
20. Diversification by De-Duplicating – 5/6
Influence of Annotator Subjectivity
We observe the same general trends for both annotators.
The higher alpha-nDCG scores for Annotator 2 can be
explained by Annotator 2 judging, on average, more
documents as relevant.
22. Diversification by De-Duplicating – 6/6
Influence of Temporal Persistence
De-duplication strategies can help for long-term
topics, because their vocabulary is richer, whereas
only a small set of terms is used for
short-term topics.
23. Conclusions
• What we have done:
• Created a microblog-based corpus for search result diversification
• Conducted comprehensive analysis and showed its suitability
• Confirmed considerable subjectivity among annotators, although the trends
w.r.t. the different evaluation measures were largely independent of
annotators
• We have made the corpus available via:
• http://wis.ewi.tudelft.nl/airs2013/
• What we will do:
• Apply the diversification approaches that have been shown to perform well
in the Web search setting.
• Propose diversification approaches specifically designed for search on
microblogging platforms.