Aaronlin
How KKBOX uses mrjob to link Python, Hadoop, and AWS
About KKBOX!
 
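Through innovation in networking and technology, give singers and artists, and their music, more platforms and channels for promotion

Create the most comprehensive music experience for music lovers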
 
•  Aaron Lin
–  Head of the research center
–  aaronlin@kkbox.com
–  http://about.me/aaron.yclin
•  Past work of the KKBOX research center
About me!
Why is this talk happening today?
It all comes from
the "KK" problems that "KK" technology ran into
MORE THAN
10 MILLION
USERS
MORE THAN
10 MILLION
SONGS
•  Need to use map-reduce to perform experiments
–  map-reduce: map → sort → reduce
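The map → sort → reduce flow can be sketched in plain Python without any cluster. This is only an illustration of the three stages using word counting; the function names are made up for this sketch and are not part of mrjob or Hadoop.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Map: emit one (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reduce_phase(sorted_pairs):
    """Reduce: sum the counts of each run of identical keys."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    # Sort: brings equal keys next to each other so the reducer
    # sees all values for one key in a single run.
    pairs = sorted(map_phase(lines))
    return dict(reduce_phase(pairs))


if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))
```

The sort step is what lets the reducer work on one key at a time, which is why it sits between the two phases.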
When two masses of big data meet!
•  What is mrjob
–  Open source project founded by Yelp
•  https://github.com/Yelp/mrjob
•  Docs: https://pythonhosted.org/mrjob/
–  A Python library for writing map-reduce jobs
–  Works with a Hadoop cluster and AWS very easily
Why use mrjob?!
•  Why python?
–  Because we love Python
•  Why AWS Elastic MapReduce (EMR)?
–  If the Hadoop cluster has no resources left, use EMR
–  If the Hadoop cluster cannot finish the job in time, use EMR
–  mrjob can audit the cost and efficiency of each job
Why use mrjob?!
•  Three steps
–  Frame your question as a map-reduce problem
–  Write your mapper(s)
–  Write your reducer(s)
•  That’s it!
First mrjob program!
First mrjob program!
•  mrjob can run in three ways
–  Locally
–  Hadoop
–  AWS EMR
First mrjob program!
•  Either way works
–  python wordcount.py news.txt
–  cat news.txt | python wordcount.py
–  cat news.txt | python wordcount.py --mapper | sort | python wordcount.py --reducer
Run mrjob locally!
•  Easy to test, since the mapper and reducer can be run individually
–  cat news.txt | python wordcount.py --mapper
–  cat news.txt | python wordcount.py --mapper | sort | python wordcount.py --reducer
•  Good for development
Run mrjob locally!
•  Write a .mrjob.conf file in your HOME folder
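The configuration shown on the slide was an image and is not recoverable. As a rough sketch, the snippet below builds a small EMR section and prints it as JSON, which mrjob can read as a config file. Every value is an illustrative placeholder, not the speaker's settings. The option names follow recent mrjob releases; older 0.4.x releases used differently named options (for example ec2_core_instance_type), so check the documentation for the installed version.

```python
import json

# All values are placeholders for illustration only.
conf = {
    "runners": {
        "emr": {
            "region": "ap-northeast-1",          # assumed region
            "master_instance_type": "m1.small",  # assumed master node type
            "core_instance_type": "c3.2xlarge",  # assumed core node type
            "num_core_instances": 2,             # assumed cluster size
        }
    }
}


def render_conf(conf):
    """Serialize the config dictionary as indented JSON text."""
    return json.dumps(conf, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Redirect this output to ~/.mrjob.conf to try it out.
    print(render_conf(conf))
```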
Run mrjob in EMR!
Instance type of each group!
[Figure: instance type of each instance group; surviving labels: task, c3.2xlarge, c3.2xlarge, m1.small]
•  Use -r to specify the runner
–  python wordcount.py -r emr news.txt	
–  python wordcount.py -r emr s3://xxxx/news.txt	
Run mrjob in EMR!
Run mrjob in EMR!
•  How to audit EMR usage
–  mrjob audit-emr-usage
•  If you get a ValueError caused by a mismatched datetime format
–  Fix it in mrjob folder/audit_usage.py
Run mrjob in EMR!
But in practice there are still some problems to solve first
•  Write a cool program to compute it
•  But we don’t know which AWS instance type is
the best
Tragedy!
•  http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-instances.html
If you check the official documentation!
I like brute force…!
[Figure: comparison of instance families: memory optimized, compute optimized, general purpose]
•  For instances with similar cost and the same number of vCPUs, the current-generation instance is better
Focus on compute optimized instance!
•  The configured number of mappers/reducers differs between instance types
Focus on compute optimized instance!
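The brute-force comparison of instance types comes down to running the same job on each candidate and comparing cost per run. A minimal sketch of that bookkeeping follows. The instance names, prices, and runtimes are hypothetical, and billing each started hour in full is an assumption about the pricing model rather than something stated in the talk.

```python
import math
from dataclasses import dataclass


@dataclass
class Trial:
    instance_type: str      # label of the candidate (hypothetical)
    hourly_price: float     # price per instance-hour (hypothetical)
    instances: int          # number of instances in the cluster
    runtime_minutes: float  # measured wall-clock time of one run


def run_cost(trial):
    """Cost of one run, charging every started hour in full (assumed)."""
    billed_hours = math.ceil(trial.runtime_minutes / 60)
    return trial.hourly_price * trial.instances * billed_hours


def cheapest(trials):
    """Return the trial with the lowest cost per run."""
    return min(trials, key=run_cost)


if __name__ == "__main__":
    trials = [
        Trial("type-a", 0.50, 4, 130),
        Trial("type-b", 1.00, 4, 55),
    ]
    for t in trials:
        print(t.instance_type, run_cost(t))
    print("cheapest:", cheapest(trials).instance_type)
```

With hourly rounding, a pricier instance that finishes within one hour can beat a cheaper one that spills into a third hour, which is why measuring beats guessing from the price list.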
•  Evaluation is specific to this task
•  Brute-force search is too lazy...
•  Costs about 1,500 NTD per run...
•  Hadoop/AWS are buzzwords
–  The money you spend is real
–  Buying some low-cost computers is always an option
Conclusion!
•  mrjob
–  https://github.com/Yelp/mrjob
–  Docs: https://pythonhosted.org/mrjob/
•  Hardware spec of each instance type
–  http://aws.amazon.com/ec2/instance-types/
–  http://aws.amazon.com/ec2/previous-generation/
•  Number of mappers/reducers per instance type
–  http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/TaskConfiguration_H1.0.3.html
Reference!
•  Slides and script
–  https://github.com/KKBOX/coscup.tw.2014
Reference!
We are hiring!
http://www.kkbox.com/jobs/
