Elastic MapReduce
   Wikipedia
http://ohkura.com

• 2008                  1
•
•              blog
•              2007
Python




     Wikipedia       (   120   )
MapReduce
• Hadoop
  o

• Hadoop Streaming
  o Mapper Reducer


  o                  OK   Python
  o            IO
• Amazon AWS (S3, EC2)
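
As a rough illustration of the Hadoop Streaming model above, the sketch below mimics the map, sort, reduce pipeline in a single Python process. The function names (`mapper`, `reducer`, `run_local`) and the word-count task are assumptions chosen for illustration; on a real cluster the mapper and reducer would be separate scripts reading stdin and writing tab-separated records to stdout.

```python
import itertools


def mapper(lines):
    """Emit one "word<TAB>1" record per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reducer(sorted_records):
    """Sum the values of consecutive records that share a key."""
    pairs = (rec.split("\t", 1) for rec in sorted_records)
    for key, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(value) for _, value in group))


def run_local(lines):
    """Mimic map -> shuffle/sort -> reduce without a cluster."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for record in run_local(["a b a", "b a"]):
        print(record)
```

Running a job this way on a small sample is a cheap way to debug a mapper before paying for EC2 instances.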
Elastic MapReduce

• Amazon's cloud computing service
• Runs MapReduce jobs on Hadoop
• Master and Worker nodes run on EC2
• Data is stored in S3

• http://aws.amazon.com/elasticmapreduce/
Step0: Preparation

• AWS
• Elastic MapReduce                                    1
• S3 client
  o   s3sync (Ruby)
      http://s3sync.net/wiki
• elastic-mapreduce
  o   Amazon's Ruby command-line client
  o   http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264
Step1: Prepare the data

• Wikipedia
    o wget "http://download.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2"
    o bunzip2 jawiki-latest-pages-articles.xml.bz2
• Split the dump
    o   One file per 20000 <page> elements
    o   So that Hadoop Streaming can hand the files out to workers
• Upload to S3
    o   ohkura-wikipedia:jawiki/articles/part-00000, 00001, ...
    o   EC2
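
The splitting step could look roughly like the sketch below, assuming the dump is cut after every 20000 closing `</page>` tags and that chunks are named like the S3 keys above. `split_pages` and `part_name` are hypothetical helpers, not the author's actual script.

```python
def split_pages(lines, pages_per_chunk=20000):
    """Yield lists of lines, each holding up to pages_per_chunk complete pages."""
    chunk, pages = [], 0
    for line in lines:
        chunk.append(line)
        if "</page>" in line:
            pages += 1
            if pages == pages_per_chunk:
                yield chunk
                chunk, pages = [], 0
    if chunk:
        # Trailing partial chunk (also carries any footer lines of the dump).
        yield chunk


def part_name(index):
    """Name chunks part-00000, part-00001, ... (assumed naming scheme)."""
    return "part-%05d" % index
```

Cutting only at `</page>` boundaries keeps every article whole inside a single file, so a mapper never sees half an article.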
Step2: Count links

Mapper
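 import re
 import sys

 # Capture the target of [[target]], [[target|label]] or [[target#section]]
 link_pat = re.compile(r"\[\[([^\]|#]*?)[\]|#]")

 for line in sys.stdin:
    for link in link_pat.findall(line):
        if ":" not in link:  # skip namespaced links such as Category: or File:
            # The "LongValueSum:" prefix asks the aggregate reducer to sum the values
            print("LongValueSum:%s\t1" % link)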

Reducer
  aggregate (Hadoop's built-in Reducer)
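
To check the link pattern and the `LongValueSum:` convention without a cluster, something like the following can be run locally. `aggregate_sum` is only an assumed stand-in for what the built-in aggregate reducer does with `LongValueSum` records (sum the values per key and drop the prefix); it is not Hadoop's implementation.

```python
import re
from collections import Counter

link_pat = re.compile(r"\[\[([^\]|#]*?)[\]|#]")
PREFIX = "LongValueSum:"


def map_links(lines):
    """Emit one tagged record per non-namespaced wiki link."""
    for line in lines:
        for link in link_pat.findall(line):
            if ":" not in link:
                yield "%s%s\t1" % (PREFIX, link)


def aggregate_sum(records):
    """Sum values per key for LongValueSum records, dropping the prefix."""
    totals = Counter()
    for rec in records:
        key, value = rec.split("\t", 1)
        if key.startswith(PREFIX):
            totals[key[len(PREFIX):]] += int(value)
    return dict(totals)


if __name__ == "__main__":
    sample = ["[[2007]] and [[2008|year]]", "[[Category:Years]] [[2007#Events]]"]
    print(aggregate_sum(map_links(sample)))
```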
2007     92008
2006     88376
2008     82821
2005     77964
       76111
2000     68078
2004     64921
                 63660
        58081
2001     57419
2003     57130
Step3: Count pages by year of last edit

Mapper
 import datetime
 import re
 import sys

 timestamp_pat = re.compile("<timestamp>(.+?)</timestamp>")
 # ArticleExtractor is a helper not shown here; it is assumed to
 # yield each article as an iterable of lines
 articles = ArticleExtractor(sys.stdin)
 for article in articles:
    for line in article:
       m = timestamp_pat.search(line)
       if m:
          dt = m.group(1)
          # e.g. 2009-10-08T05:55:49Z
          t = datetime.datetime.strptime(dt, "%Y-%m-%dT%H:%M:%SZ")
          print("LongValueSum:%s\t1" % t.year)


Reducer
  aggregate (Hadoop's built-in Reducer)
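
The mapper relies on an `ArticleExtractor` whose code is not included. A minimal hypothetical version, assuming each article is delimited by `<page>` and `</page>` lines in the dump, might look like this:

```python
class ArticleExtractor(object):
    """Hypothetical stand-in: groups input lines into one list per page."""

    def __init__(self, stream):
        self.stream = stream

    def __iter__(self):
        article = None  # None while we are outside any <page> element
        for line in self.stream:
            if "<page>" in line:
                article = []
            if article is not None:
                article.append(line)
                if "</page>" in line:
                    yield article
                    article = None
```

Lines outside any `<page>` element (the dump header and footer) are skipped, and a page cut off at the end of the input is dropped rather than emitted half-finished.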
JSON                   Wizard




$ elastic-mapreduce --create --num-instances 4 \
                  --instance-type m1.small \
                  --json count-year-jobflow.json
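
The contents of count-year-jobflow.json are not shown. As a sketch only, the snippet below generates a job flow description in the shape the Ruby client is understood to accept: a list of steps, each with a `HadoopJarStep`. The streaming jar path, the output and mapper locations, and the `build_jobflow` helper are all assumptions and would need checking against the client's own samples.

```python
import json

# Assumed location of the streaming jar on the cluster; verify before use.
STREAMING_JAR = "/home/hadoop/contrib/streaming/hadoop-0.18-streaming.jar"


def build_jobflow(input_uri, output_uri, mapper_uri, name="count-year"):
    """Return a JSON string describing a single streaming step."""
    steps = [{
        "Name": name,
        "ActionOnFailure": "TERMINATE_JOB_FLOW",
        "HadoopJarStep": {
            "Jar": STREAMING_JAR,
            "Args": [
                "-input", input_uri,
                "-output", output_uri,
                "-mapper", mapper_uri,
                "-reducer", "aggregate",
            ],
        },
    }]
    return json.dumps(steps, indent=2)


if __name__ == "__main__":
    # Placeholder URIs: only the input bucket name comes from Step1.
    print(build_jobflow(
        "s3n://ohkura-wikipedia/jawiki/articles",
        "s3n://example-bucket/count-year/output",
        "s3n://example-bucket/count-year/mapper.py",
    ))
```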
2002: 1
2003: 4107
2004: 19630
2005: 44766
2006: 103018
2007: 151382
2008: 217252
2009: 683079
Step4: PageRank

•
    o       1        1/
    o                     /
    o
    o   2       10
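
One common way to write this kind of iteration is given below, assuming the simplified update without a damping factor and an initial score of 1 per page (both are assumptions; the variant used here may differ). $s_k(p)$ is the score of page $p$ after $k$ rounds, $\mathrm{In}(p)$ is the set of pages linking to $p$, and $\mathrm{Out}(q)$ is the set of links leaving $q$.

```latex
s_0(p) = 1, \qquad
s_{k+1}(p) = \sum_{q \in \mathrm{In}(p)} \frac{s_k(q)}{|\mathrm{Out}(q)|},
\qquad k = 0, 1, \dots, 9
```

Each page splits its current score evenly among the pages it links to, and a page's new score is the sum of the shares it receives.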
PageRank with MapReduce

• Step1
    o   Wikipedia
    o   M:
    o   R: Identity
• Step2
    o   M:          /
    o   R:
• Repeat Step2 10 times
    o   HDFS
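
A local sketch of how Step2 could be expressed as a map and a reduce is shown below. It assumes the simplified update (no damping factor, initial score 1.0) and a record layout of `(page, score, outlinks)`; the names `map_scores`, `reduce_scores` and `pagerank` are hypothetical, and pages without outgoing links simply lose their score in the next round.

```python
from collections import defaultdict


def map_scores(records):
    """For each (page, score, outlinks): pass links through, emit score shares."""
    for page, score, links in records:
        yield page, ("links", links)
        for target in links:
            yield target, ("share", score / len(links))


def reduce_scores(pairs):
    """Sum the shares received by each page and re-attach its link list."""
    links = defaultdict(list)
    scores = defaultdict(float)
    for page, (kind, value) in pairs:
        if kind == "links":
            links[page] = value
            scores[page] += 0.0  # make sure pages with no inlinks still appear
        else:
            scores[page] += value
    return [(page, scores[page], links[page]) for page in sorted(scores)]


def pagerank(graph, iterations=10):
    """graph maps page -> list of outlinks; returns page -> score."""
    records = [(page, 1.0, links) for page, links in sorted(graph.items())]
    for _ in range(iterations):
        records = reduce_scores(map_scores(records))
    return {page: score for page, score, _ in records}


if __name__ == "__main__":
    print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```

Passing the link list through the mapper alongside the score shares is what lets the output of one round serve directly as the input of the next.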
1803.63759701
1568.19638967
1029.67219551 2006
991.646816399 2007
930.652982148 2005
885.892964893
866.358526418 2008
798.668799871 2004
779.443042817
.
.
1803.63759701
1568.19638967
885.892964893
779.443042817
755.488775376
728.882441149
682.257070166
623.000478660
580.347125978
569.411885196
...
779.443042817
728.882441149
682.257070166
580.347125978
522.618667481
495.986145911
452.646283200
444.036370473
443.043952427
441.486349135
392.427995635
=100

0.00682557409174                785       ...
0.00682555111099 JR   700
0.00682544488688
0.00682540998664
0.00682540375114
0.00682528989653      (     )
0.00682524117061
0.00682521978481      (               )
0.00682521236658
0.00682517459662
0.00682512260620
• Wikipedia (JA)
  o 1,900,000 articles
  o 4.2GB
  o 20
  o   ~30
• Blog          from   blogeye.jp
  o   200,000,000 articles
  o   800GB
  o   80
  o   70
•
    o
    o                 Master
•
    o
    o
    o
    o
    o   1   1   0.1   1    100   1000
Q&A

Quick Wikipedia Mining using Elastic Map Reduce