Discovering Memes in Social Media

                              Matt Lease
                        School of Information
                      University of Texas at Austin
                        ml@ischool.utexas.edu
                              @mattlease

                             Joint Work with
                     Hohyon Ryu & Nicholas Woodward


Research paper to appear at the 23rd ACM Conference on Hypertext and Social Media, 2012
Memes
• Short, similar phrases found in
  many different sources
  – Re-use, shared temporal context
• Evolutionary mutation &
  propagation as they spread
  from source to source
• Reveal implicit connections
  between the sources, individuals,
  and communities involved
  March 21, 2012   ACM SIGKDD - Austin Chapter Meeting   2
MemeBrowser & Critical Literacy




Google/NYT Living Stories




                 livingstories.googlelabs.com
Related Work
• Jure Leskovec et al. (KDD’09): blogs
     – quotations only: http://memetracker.org
• Steven Skiena, Stony Brook NY: blogs
     – Named-entities only: http://www.textmap.com
• O. Kolak and B. Schilit (HT’08): scanned books
     – Mine “popular passages” from complete texts
     – MapReduce “shingling” approach
     – Popular passages found are local, not global

MapReduce @ UT
• UT LIFT Award to Lease, Baldridge, & Xu in Sept.’10
• New hard disks @ TACC Longhorn installed Dec.’10
   – 48 Dell R610 nodes
         • 2 Intel Nehalem quad-core processors (8 cores) @ 2.53 GHz
         • 48GB RAM with ~1.5TB disk per node
         • With 1 NameNode & 47 DataNodes, up to 376 parallel Mappers
   – 16 Dell R710 (same CPU configuration)
         • 144GB RAM with ~0.8TB disk per node
   – Set up Hadoop, testing, benchmarking, etc.
• Baldridge & Lease teach MapReduce class Fall’11
Datasets
• TREC Blogs08 Collection
     – http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html
     – 28M permalinks (January 2008 – January 2009)
     – 250 GB compressed
• ICWSM 2009 Spinn3r Blog Dataset
     – http://www.icwsm.org/data/
     – 44 million blog posts (August - September, 2008)
     – 27 GB compressed
• ICWSM 2011 Spinn3r Blog Dataset

Processing Architecture
• Blogs08 Test Collection
   – 28M posts, 1.4TB
• Preprocessing (Pseudo-MapReduce)
   – Decruft & Language Identification
   – HTML Strip & Near-Duplicate Detection
   – 16M posts, 960GB
• Common Phrase Extraction
   – 3 MapReduce Stages
   – 15K posts, 43GB
• Common Phrase Ranking
   – Daily Top 200 Phrases
   – 1 MapReduce Process
   – 6.2M phrases, 2GB
• Common Phrase Clustering
   – 1 MapReduce Process
   – 75K phrases, 2.6MB
• Meme Browser
   – 68K memes

Creating the Shingle Table
• e.g. trigram shingles for: what do you think of

  – what do you
  – do you think
  – you think of
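
The trigram example above can be reproduced with a short sketch. This is an illustration only, assuming whitespace tokenization and lowercasing; the function names `shingles` and `build_shingle_table` are hypothetical and not taken from the talk.

```python
def shingles(text, n=3):
    """Return the word n-gram shingles of `text`, in order of appearance."""
    words = text.lower().split()  # assumption: simple whitespace tokenization
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_shingle_table(docs, n=3):
    """Map each shingle to the sorted list of document IDs that contain it."""
    table = {}
    for doc_id, text in docs.items():
        for shingle in set(shingles(text, n)):
            table.setdefault(shingle, set()).add(doc_id)
    return {shingle: sorted(ids) for shingle, ids in table.items()}


if __name__ == "__main__":
    print(shingles("what do you think of"))
    docs = {"d1": "what do you think of", "d2": "do you think so"}
    print(build_shingle_table(docs)["do you think"])
```

A shingle that maps to more than one document ID is a candidate for a phrase shared across sources.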




Grouping Shingles by Document
• Mapper: trivial grouping; Reducer: Identity
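
One way to picture this stage is as a re-keying job, simulated here in memory. The sketch assumes the input is a shingle table of the form shingle → document IDs and that the mapper simply emits (document ID, shingle) pairs; the slide does not spell out the record format, so `map_row`, `identity_reduce`, and `run_job` are hypothetical names.

```python
from collections import defaultdict


def map_row(shingle, doc_ids):
    """Mapper: re-key one shingle-table row by document ID."""
    for doc_id in doc_ids:
        yield doc_id, shingle


def identity_reduce(doc_id, shingle_list):
    """Reducer: pass the grouped values through unchanged."""
    return doc_id, list(shingle_list)


def run_job(shingle_table):
    """Simulate map, shuffle (group by key), and reduce on one machine."""
    groups = defaultdict(list)
    for shingle, doc_ids in shingle_table.items():
        for key, value in map_row(shingle, doc_ids):
            groups[key].append(value)  # the shuffle does the actual grouping
    return dict(identity_reduce(k, v) for k, v in groups.items())


if __name__ == "__main__":
    table = {"do you think": ["d1", "d2"], "what do you": ["d1"]}
    print(run_job(table))
```

In a real MapReduce job the framework's shuffle performs the grouping, which is why the reducer can be the identity.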




Common Phrase (CP) Detection
• Mapper:
  Merge adjacent
  shingles into memes
  (ignoring small gaps)

• Reducer:
  Find set of
  documents in which
  each meme occurs
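
The two steps above can be sketched on a single machine as follows. This is a minimal illustration, not the authors' implementation: it assumes `common` is the set of shingles already known to occur in several documents, and the gap tolerance `max_gap` (in words) is a hypothetical parameter.

```python
from collections import defaultdict


def merge_shingles(words, common, n=3, max_gap=1):
    """Mapper logic: merge overlapping or nearby common shingles into phrases."""
    phrases, start, end = [], None, None
    for i in range(len(words) - n + 1):
        if " ".join(words[i:i + n]) not in common:
            continue
        if start is not None and i <= end + max_gap:
            end = i + n  # extend the current phrase across the overlap or gap
        else:
            if start is not None:
                phrases.append(" ".join(words[start:end]))
            start, end = i, i + n  # open a new phrase
    if start is not None:
        phrases.append(" ".join(words[start:end]))
    return phrases


def detect_common_phrases(docs, common, n=3, max_gap=1):
    """Reducer logic: collect the set of documents in which each phrase occurs."""
    occurrences = defaultdict(set)
    for doc_id, text in docs.items():
        for phrase in merge_shingles(text.lower().split(), common, n, max_gap):
            occurrences[phrase].add(doc_id)
    return dict(occurrences)


if __name__ == "__main__":
    common = {"what do you", "do you think", "you think of"}
    docs = {"d1": "what do you think of", "d2": "so what do you think of it"}
    print(detect_common_phrases(docs, common))
```

Setting `max_gap=0` merges only overlapping or directly adjacent shingles; larger values let a phrase absorb a few words that do not belong to any common shingle.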
Ranking Memes




Clustering Memes
• Mapper:
  Single-link
  hierarchical
  clustering with
  cosine similarity
• Reducer:
  create/merge
  clusters
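
A single-machine sketch of single-link clustering with cosine similarity is shown below. It assumes bag-of-words term counts as the phrase vectors and a fixed similarity cutoff; the `threshold` value and the function names are illustrative assumptions, and the sketch does not reproduce the MapReduce split described on the slide. Cutting a single-link hierarchy at a fixed similarity is equivalent to taking connected components of the "similar enough" graph, which is what the union-find structure computes.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def single_link_clusters(phrases, threshold=0.5):
    """Group phrases so that any pair with cosine >= threshold shares a cluster."""
    vectors = [Counter(p.lower().split()) for p in phrases]
    parent = list(range(len(phrases)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two clusters

    clusters = {}
    for i, phrase in enumerate(phrases):
        clusters.setdefault(find(i), []).append(phrase)
    return list(clusters.values())


if __name__ == "__main__":
    sample = ["yes we can", "yes we can change",
              "lipstick on a pig", "put lipstick on a pig"]
    print(single_link_clusters(sample))
```

The all-pairs loop is quadratic in the number of phrases, which is the cost that sparse vectors, indexing, and distribution are meant to reduce.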


Efficiency: Meme Clustering



• From WEKA ARFF format to sparse representation
   – From ~96 hours → 11 hours
• Indexed vs. un-indexed
   – From 11 hours → 16 minutes (single core)
   – From 34 minutes → 3 minutes (136 cores)
• Distributed vs. single core
   – From 11 hours → 34 minutes (un-indexed)
   – From 16 minutes → 3 minutes (indexed)
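
The speedup factors implied by the runtimes above can be worked out directly. The figures are the ones on the slide, converted to minutes; the configuration labels and the `speedup` helper are only illustrative, and the ~96 hour figure is approximate in the source.

```python
# Runtimes reported on the slide, in minutes.
RUNTIMES = {
    "arff_single_core": 96 * 60,             # ~96 hours (approximate)
    "sparse_unindexed_single_core": 11 * 60,
    "sparse_unindexed_136_cores": 34,
    "sparse_indexed_single_core": 16,
    "sparse_indexed_136_cores": 3,
}


def speedup(before, after):
    """Return how many times faster `after` is than `before`."""
    return RUNTIMES[before] / RUNTIMES[after]


if __name__ == "__main__":
    pairs = [
        ("arff_single_core", "sparse_unindexed_single_core"),
        ("sparse_unindexed_single_core", "sparse_indexed_single_core"),
        ("sparse_unindexed_136_cores", "sparse_indexed_136_cores"),
        ("sparse_unindexed_single_core", "sparse_unindexed_136_cores"),
        ("sparse_indexed_single_core", "sparse_indexed_136_cores"),
    ]
    for before, after in pairs:
        print(f"{before} -> {after}: {speedup(before, after):.1f}x")
```

On these numbers, indexing on a single core (about 41x) gives a larger gain than distributing the un-indexed job over 136 cores (about 19x), and the two combined reduce roughly 11 hours to 3 minutes.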
Meme Browser: Original Interface




Meme Browser: Current Interface




Meme Evolution (Leskovec et al.’09)




Thank You!
• Joint Work with
  – Hohyon (Will) Ryu
     • InfoChimps (Summer’11)
     • Indeed.com (Summer’12)
  – Nicholas Woodward (TACC)
     • Latin American Network Information Center (LANIC)
• Matt Lease
  – ml@ischool.utexas.edu
  – www.ischool.utexas.edu/~ml
  – @mattlease
• Support
  – FCT of Portugal / UT CoLab
  – Amazon Web Services
  – UT Austin LIFT Award
  – John P. Commons Fellowship
