Talk at the ACM SIGKDD - Austin Chapter Meeting, March 21, 2012. Paper by Hohyon Ryu, Matthew Lease, and Nicholas Woodward, at the 23rd ACM Conference on Hypertext and Social Media, 2012.
Discovering Memes in Social Media
1. Discovering Memes in Social Media
Matt Lease
School of Information
University of Texas at Austin
ml@ischool.utexas.edu
@mattlease
Joint Work with
Hohyon Ryu & Nicholas Woodward
Research paper to appear at the 23rd ACM Conference on Hypertext and Social Media, 2012
2. Memes
• Short, similar phrases found in many different sources
– Re-use, shared temporal context
• Evolutionary mutation & propagation as they transmit from source to source
• Reveal implicit connections between the sources, individuals and communities involved
March 21, 2012 ACM SIGKDD - Austin Chapter Meeting 2
4. Google/NYT Living Stories
livingstories.googlelabs.com
5. Related Work
• Jure Leskovec et al. (KDD’09): blogs
– quotations only: http://memetracker.org
• Steven Skiena, Stony Brook NY: blogs
– Named-entities only: http://www.textmap.com
• O. Kolak and B. Schilit (HT’08): scanned books
– Mine “popular passages” from complete texts
– MapReduce “shingling” approach
– Popular passages found are local, not global
6. MapReduce @ UT
• UT LIFT Award to Lease, Baldridge, & Xu in Sept.’10
• New hard disks @ TACC Longhorn installed Dec.’10
– 48 Dell R610 nodes
• 2 Intel Nehalem quad-core processors (8 cores) @ 2.53 GHz
• 48GB RAM with ~1.5TB disk per node
• With 1 NameNode & 47 DataNodes, up to 376 parallel Mappers
– 16 Dell R710 (same CPU configuration)
• 144GB RAM with ~0.8TB disk per node
– Set up Hadoop, testing, benchmarking, etc.
• Baldridge & Lease teach MapReduce class Fall’11
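The 376-mapper figure for the R610 nodes is consistent with one mapper slot per core on each DataNode. The slide does not state the slot configuration, so that one-slot-per-core setting is an assumption here; the short sketch below only reproduces the arithmetic.

```python
# Node counts and CPU layout as listed on the slide for the Dell R610 nodes.
total_nodes = 48
namenodes = 1
cores_per_node = 2 * 4  # 2 quad-core processors per node

# Assumption (not stated on the slide): one mapper slot per core.
datanodes = total_nodes - namenodes
max_parallel_mappers = datanodes * cores_per_node

print(datanodes, max_parallel_mappers)  # 47 376
```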
7. Datasets
• TREC Blogs08 Collection
– http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html
– 28M permalinks (January 2008 – January 2009)
– 250 GB compressed
• ICWSM 2009 Spinn3r Blog Dataset
– http://www.icwsm.org/data/
– 44 million blog posts (August - September, 2008)
– 27 GB compressed
• ICWSM 2011 Spinn3r Blog Dataset
8. Processing Architecture
• Blogs08 Test Collection: 28M posts, 1.4TB
• Preprocessing (Pseudo-MapReduce)
– Decruft & Language Identification
– HTML Strip & Near-Duplicate Detection: 16M posts, 960GB
• Common Phrase Extraction (3 MapReduce Stages): 15K posts, 43GB
• Common Phrase Ranking (1 MapReduce Process)
– Daily Top 200 Phrases: 6.2M phrases, 2GB
• Common Phrase Clustering (1 MapReduce Process): 75K phrases, 2.6MB
• Meme Browser: 68K memes
9. Creating the Shingle Table
• e.g. trigram shingles for: what do you think of
– what do you
– do you think
– you think of
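The trigram example above can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the talk's MapReduce implementation: the function names `shingles` and `build_shingle_table`, the whitespace tokenization, and the choice to store a set of document IDs per shingle are all assumptions made for the sketch.

```python
from collections import defaultdict


def shingles(text, n=3):
    """Return the word n-gram shingles of `text` in order (trigrams by default)."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_shingle_table(docs, n=3):
    """Map each shingle to the set of document IDs containing it.

    `docs` is a hypothetical {doc_id: text} dictionary.
    """
    table = defaultdict(set)
    for doc_id, text in docs.items():
        for shingle in shingles(text, n):
            table[shingle].add(doc_id)
    return table


print(shingles("what do you think of"))
# ['what do you', 'do you think', 'you think of']
```

Shingles that appear in only one document cannot be part of a phrase shared across sources, so a table of this kind gives a cheap first filter before any phrase merging.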
11. Common Phrase (CP) Detection
• Mapper: Merge adjacent shingles into memes (ignoring small gaps)
• Reducer: Find set of documents in which each meme occurs
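The mapper and reducer roles above can be illustrated with plain Python functions. This is a sketch under stated assumptions rather than the authors' Hadoop code: the names `map_document` and `reduce_phrases`, the `common_shingles` input (shingles already known to occur in several documents), and the `max_gap` parameter that defines a "small gap" are all hypothetical.

```python
from collections import defaultdict


def map_document(doc_id, text, common_shingles, n=3, max_gap=1):
    """Mapper sketch: merge runs of common shingles into phrases.

    Yields (phrase, doc_id) pairs. Up to `max_gap` consecutive
    non-common shingles inside a run are tolerated (assumed gap rule).
    """
    tokens = text.lower().split()
    # Start positions of shingles that are known to be common.
    hits = [i for i in range(len(tokens) - n + 1)
            if " ".join(tokens[i:i + n]) in common_shingles]
    if not hits:
        return
    start = prev = hits[0]
    for pos in hits[1:]:
        if pos - prev > max_gap + 1:  # gap too large: close the current run
            yield " ".join(tokens[start:prev + n]), doc_id
            start = pos
        prev = pos
    yield " ".join(tokens[start:prev + n]), doc_id


def reduce_phrases(pairs):
    """Reducer sketch: collect the set of documents for each phrase."""
    docs_by_phrase = defaultdict(set)
    for phrase, doc_id in pairs:
        docs_by_phrase[phrase].add(doc_id)
    return dict(docs_by_phrase)


docs = {"a": "so what do you think of this",
        "b": "well what do you think of that"}
common = {"what do you", "do you think", "you think of"}
pairs = [p for d, t in docs.items() for p in map_document(d, t, common)]
print(reduce_phrases(pairs))
```

In a real MapReduce job the grouping by phrase is done by the framework's shuffle step, so the reducer would see one phrase with its list of document IDs at a time; the dictionary here stands in for that grouping.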
18. Thank You!
• Joint Work with
– Hohyon (Will) Ryu: InfoChimps (Summer’11), Indeed.com (Summer’12)
– Nicholas Woodward (TACC): Latin American Network Information Center (LANIC)
• Support
– FCT of Portugal / UT CoLab
– Amazon Web Services
– UT Austin LIFT Award
– John P. Commons Fellowship
Matt Lease
ml@ischool.utexas.edu
www.ischool.utexas.edu/~ml
@mattlease