SlideShare una empresa de Scribd logo
1 de 34
Elastic MapReduce
   Wikipedia
http://ohkura.com

• 2008                  1
•
•              blog
•              2007
Python




     Wikipedia       (   120   )
MapReduce
• Hadoop
  o

• Hadoop Streaming
  o Mapper Reducer


  o                  OK   Python
  o            IO
• Amazon AWS (S3, EC2)
Elastic MapReduce

• Amazon          Cloud Computing
• MapReduce                    Hadoop


• Master                       Worker EC2
• S3

• http://aws.amazon.com/elasticmapreduce/
Step0:

• AWS
• Elastic MapReduce                                    1
• S3
  o   Ruby                     s3sync
      http://s3sync.net/wiki
• elastic-mapreduce
   o Amazon                       Ruby
  o   http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264
Step1:

• Wikipedia
    o wget "http://download.wikimedia.org/jawiki/latest/jawiki-
      latest-pages-articles.xml.bz2"
    o bunzip2 jawiki-latest-pages-articles.xml.bz2
•
    o   <page>      20000
    o   Hadoop Streaming        worker
• S3
    o   ohkura-wikipedia:jawiki/articles/part-00000, 00001, ...
    o   EC2
Step2:
Step2:

Mapper
 link_pat = re.compile(r"[[([^]|#]*?)[]|#]")

 for line in sys.stdin:
    for link in link_pat.findall(line):
        if ":" not in link:
            print "LongValueSum:%st1" % link

Reducer
  aggregate (Hadoop                   Reducer)
2007     92008
2006     88376
2008     82821
2005     77964
       76111
2000     68078
2004     64921
                 63660
        58081
2001     57419
2003     57130
Step3:
Step3:

Mapper
 timestamp_pat = re.compile("<timestamp>(.+?)</timestamp>")
 articles = ArticleExtractor(sys.stdin)
 for article in articles:
   for line in article:
      m = timestamp_pat.search(line)
      if m:
         dt = m.groups(0)[0]
         # eg. 2009-10-08T05:55:49Z
         t = datetime.datetime.strptime(dt, "%Y-%m-%dT%H:%M:%SZ")
         print "LongValueSum:%s t1" % t.year


Reducer
  aggregate (Hadoop                    Reducer)
JSON                   Wizard




$ elastic-mapreduce --create --num-instances 4
                  --instance-type m1.small
                  --json count-year-jobflow.json
2002: 1
2003: 4107
2004: 19630
2005: 44766
2006: 103018
2007: 151382
2008: 217252
2009: 683079
Step4: PageRank
Step4: PageRank

•
    o       1        1/
    o                     /
    o
    o   2       10
Step4: PageRank

•
    o       1        1/
    o                     /
    o
    o   2       10
Step4: PageRank

•
    o       1        1/
    o                     /
    o
    o   2       10
Step4: PageRank

•
    o       1        1/
    o                     /
    o
    o   2       10
Step4: PageRank

•
    o       1        1/
    o                     /
    o
    o   2       10
PageRank                 MapReduce

• Step1
    o          Wikipedia
    o   M:


    o   R: Identity
• Step2
  o M:          /
    o   R:
•         Step2     10
    o                    HDFS
1803.63759701
1568.19638967
1029.67219551 2006
991.646816399 2007
930.652982148 2005
885.892964893
866.358526418 2008
798.668799871 2004
779.443042817
.
.
1803.63759701
1568.19638967
885.892964893
779.443042817
755.488775376
728.882441149
682.257070166
623.000478660
580.347125978
569.411885196
...
779.443042817
728.882441149
682.257070166
580.347125978
522.618667481
495.986145911
452.646283200
444.036370473
443.043952427
441.486349135
392.427995635
=100

0.00682557409174                785       ...
0.00682555111099 JR   700
0.00682544488688
0.00682540998664
0.00682540375114
0.00682528989653      (     )
0.00682524117061
0.00682521978481      (               )
0.00682521236658
0.00682517459662
0.00682512260620
• Wikipedia (JA)
  o 1,900,000 articles
  o 4.2GB
  o 20
  o   ~30
• Blog          from   blogeye.jp
  o   200,000,000 articles
  o   800GB
  o   80
  o   70
•
    o
    o                 Master
•
    o
    o
    o
    o
    o   1   1   0.1   1    100   1000
Q&A

Más contenido relacionado

Similar a Quick Wikipedia Mining using Elastic Map Reduce

Traffic Analyzer for GPRS UMTS Networks (TAN)
Traffic Analyzer for GPRS UMTS Networks (TAN)Traffic Analyzer for GPRS UMTS Networks (TAN)
Traffic Analyzer for GPRS UMTS Networks (TAN)Muhannad Aulama
 
Ruby Outside Rails 2 (southfest)
Ruby Outside Rails 2 (southfest)Ruby Outside Rails 2 (southfest)
Ruby Outside Rails 2 (southfest)Victor Petrenko
 
marko_go_in_badoo
marko_go_in_badoomarko_go_in_badoo
marko_go_in_badooMarko Kevac
 
(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in Ruby(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in RubyAmazon Web Services
 
Panoramic Video in Environmental Monitoring Software Development and Applica...
Panoramic Video in Environmental Monitoring Software Development and Applica...Panoramic Video in Environmental Monitoring Software Development and Applica...
Panoramic Video in Environmental Monitoring Software Development and Applica...pycontw
 
mastodon API
mastodon APImastodon API
mastodon APItreby
 
Machine Learning and Logging for Monitoring Microservices
Machine Learning and Logging for Monitoring Microservices Machine Learning and Logging for Monitoring Microservices
Machine Learning and Logging for Monitoring Microservices Daniel Berman
 
Edge trends mizuno-template
Edge trends mizuno-templateEdge trends mizuno-template
Edge trends mizuno-templateshintaro mizuno
 
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...MongoDB
 
.NET Fest 2019. Łukasz Pyrzyk. Daily Performance Fuckups
.NET Fest 2019. Łukasz Pyrzyk. Daily Performance Fuckups.NET Fest 2019. Łukasz Pyrzyk. Daily Performance Fuckups
.NET Fest 2019. Łukasz Pyrzyk. Daily Performance FuckupsNETFest
 
MongoDB Chunks - Distribution, Splitting, and Merging
MongoDB Chunks - Distribution, Splitting, and MergingMongoDB Chunks - Distribution, Splitting, and Merging
MongoDB Chunks - Distribution, Splitting, and MergingJason Terpko
 
Log files: The Overlooked Source of SEO Opportunities
Log files: The Overlooked Source of SEO OpportunitiesLog files: The Overlooked Source of SEO Opportunities
Log files: The Overlooked Source of SEO OpportunitiesRobin Rozhon
 
Django REST Framework における API 実装プラクティス | PyCon JP 2018
Django REST Framework における API 実装プラクティス | PyCon JP 2018Django REST Framework における API 実装プラクティス | PyCon JP 2018
Django REST Framework における API 実装プラクティス | PyCon JP 2018Masashi Shibata
 
1404 app dev series - session 8 - monitoring & performance tuning
1404   app dev series - session 8 - monitoring & performance tuning1404   app dev series - session 8 - monitoring & performance tuning
1404 app dev series - session 8 - monitoring & performance tuningMongoDB
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nlbartzon
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nltieleman
 
クラウドを支えるハードウェア・ソフトウェア基盤技術
クラウドを支えるハードウェア・ソフトウェア基盤技術クラウドを支えるハードウェア・ソフトウェア基盤技術
クラウドを支えるハードウェア・ソフトウェア基盤技術Ryousei Takano
 
The Seven Wastes of Software Development
The Seven Wastes of Software DevelopmentThe Seven Wastes of Software Development
The Seven Wastes of Software DevelopmentMatt Stine
 
Serhiy Korolenko - The Strength of Ukrainian Users’ P@ssw0rds2017
Serhiy Korolenko - The Strength of Ukrainian Users’ P@ssw0rds2017Serhiy Korolenko - The Strength of Ukrainian Users’ P@ssw0rds2017
Serhiy Korolenko - The Strength of Ukrainian Users’ P@ssw0rds2017OWASP Kyiv
 
Weaving a Semantic Web across OSS repositories - a spotlight on bts-link, UDD...
Weaving a Semantic Web across OSS repositories - a spotlight on bts-link, UDD...Weaving a Semantic Web across OSS repositories - a spotlight on bts-link, UDD...
Weaving a Semantic Web across OSS repositories - a spotlight on bts-link, UDD...olberger
 

Similar a Quick Wikipedia Mining using Elastic Map Reduce (20)

Traffic Analyzer for GPRS UMTS Networks (TAN)
Traffic Analyzer for GPRS UMTS Networks (TAN)Traffic Analyzer for GPRS UMTS Networks (TAN)
Traffic Analyzer for GPRS UMTS Networks (TAN)
 
Ruby Outside Rails 2 (southfest)
Ruby Outside Rails 2 (southfest)Ruby Outside Rails 2 (southfest)
Ruby Outside Rails 2 (southfest)
 
marko_go_in_badoo
marko_go_in_badoomarko_go_in_badoo
marko_go_in_badoo
 
(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in Ruby(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in Ruby
 
Panoramic Video in Environmental Monitoring Software Development and Applica...
Panoramic Video in Environmental Monitoring Software Development and Applica...Panoramic Video in Environmental Monitoring Software Development and Applica...
Panoramic Video in Environmental Monitoring Software Development and Applica...
 
mastodon API
mastodon APImastodon API
mastodon API
 
Machine Learning and Logging for Monitoring Microservices
Machine Learning and Logging for Monitoring Microservices Machine Learning and Logging for Monitoring Microservices
Machine Learning and Logging for Monitoring Microservices
 
Edge trends mizuno-template
Edge trends mizuno-templateEdge trends mizuno-template
Edge trends mizuno-template
 
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...
 
.NET Fest 2019. Łukasz Pyrzyk. Daily Performance Fuckups
.NET Fest 2019. Łukasz Pyrzyk. Daily Performance Fuckups.NET Fest 2019. Łukasz Pyrzyk. Daily Performance Fuckups
.NET Fest 2019. Łukasz Pyrzyk. Daily Performance Fuckups
 
MongoDB Chunks - Distribution, Splitting, and Merging
MongoDB Chunks - Distribution, Splitting, and MergingMongoDB Chunks - Distribution, Splitting, and Merging
MongoDB Chunks - Distribution, Splitting, and Merging
 
Log files: The Overlooked Source of SEO Opportunities
Log files: The Overlooked Source of SEO OpportunitiesLog files: The Overlooked Source of SEO Opportunities
Log files: The Overlooked Source of SEO Opportunities
 
Django REST Framework における API 実装プラクティス | PyCon JP 2018
Django REST Framework における API 実装プラクティス | PyCon JP 2018Django REST Framework における API 実装プラクティス | PyCon JP 2018
Django REST Framework における API 実装プラクティス | PyCon JP 2018
 
1404 app dev series - session 8 - monitoring & performance tuning
1404   app dev series - session 8 - monitoring & performance tuning1404   app dev series - session 8 - monitoring & performance tuning
1404 app dev series - session 8 - monitoring & performance tuning
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
 
クラウドを支えるハードウェア・ソフトウェア基盤技術
クラウドを支えるハードウェア・ソフトウェア基盤技術クラウドを支えるハードウェア・ソフトウェア基盤技術
クラウドを支えるハードウェア・ソフトウェア基盤技術
 
The Seven Wastes of Software Development
The Seven Wastes of Software DevelopmentThe Seven Wastes of Software Development
The Seven Wastes of Software Development
 
Serhiy Korolenko - The Strength of Ukrainian Users’ P@ssw0rds2017
Serhiy Korolenko - The Strength of Ukrainian Users’ P@ssw0rds2017Serhiy Korolenko - The Strength of Ukrainian Users’ P@ssw0rds2017
Serhiy Korolenko - The Strength of Ukrainian Users’ P@ssw0rds2017
 
Weaving a Semantic Web across OSS repositories - a spotlight on bts-link, UDD...
Weaving a Semantic Web across OSS repositories - a spotlight on bts-link, UDD...Weaving a Semantic Web across OSS repositories - a spotlight on bts-link, UDD...
Weaving a Semantic Web across OSS repositories - a spotlight on bts-link, UDD...
 

Último

ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctBrainSell Technologies
 
Using IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandUsing IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandIES VE
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastUXDXConf
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfFIDO Alliance
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceSamy Fodil
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfSrushith Repakula
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...FIDO Alliance
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...FIDO Alliance
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Patrick Viafore
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераMark Opanasiuk
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentationyogeshlabana357357
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfFIDO Alliance
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024Stephanie Beckett
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...CzechDreamin
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform EngineeringMarcus Vechiato
 
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties ReimaginedEasier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties Reimaginedpanagenda
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FIDO Alliance
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxJennifer Lim
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsLeah Henrickson
 

Último (20)

ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage Intacct
 
Using IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandUsing IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & Ireland
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at Comcast
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentation
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform Engineering
 
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties ReimaginedEasier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
 

Quick Wikipedia Mining using Elastic Map Reduce