Hadoop at Last.fm June 2010
About us Last.fm you say?
Last.fm is a music discovery website powered by scrobbling that provides personalized radio
Music discovery website
Each month we get:
  • over 40M unique visitors
  • over 500M pageviews
Each pageview leads to at least one log line. Clicks and other interactions sometimes lead to log lines too.

Powered by scrobbling
scrobble: skrob·bul (ˈskrɒbəll) [verb] To automatically add the tracks you play to your Last.fm profile with a piece of software called a Scrobbler
Stats:
  • Up to 800 scrobbles per second
  • More than 40 million scrobbles per day
  • Over 40 billion scrobbles so far
Each scrobble leads to a log line.

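To put the scrobble stats above in proportion, here is a rough back-of-the-envelope sketch. It uses only the round figures quoted on the slide, so the results are lower bounds or approximations rather than measured values. The last figure assumes, purely for illustration, that the daily rate had always been today's.

```python
SECONDS_PER_DAY = 24 * 60 * 60

scrobbles_per_day = 40 * 10**6   # "more than 40 million scrobbles per day"
peak_per_second = 800            # "up to 800 scrobbles per second"
total_scrobbles = 40 * 10**9     # "over 40 billion scrobbles so far"

# Average rate implied by the daily figure.
average_per_second = scrobbles_per_day / SECONDS_PER_DAY

# How much higher the quoted peak is than that average.
peak_to_average = peak_per_second / average_per_second

# Hypothetical: days needed to reach the total at a constant current daily rate.
days_at_current_rate = total_scrobbles / scrobbles_per_day

print(round(average_per_second), round(peak_to_average, 3), days_at_current_rate)
```

So the quoted peak is less than twice the implied daily average, which suggests a fairly steady log stream rather than a very bursty one.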
Personalized radio
Via Flash player, Xbox, desktop and mobile apps
Stats:
  • Over 10 million streaming hours per month
  • Over 400 thousand unique stations per day
Each stream leads to at least one log line.

And it’s not just logs…
So we gather a lot of logs, but also:
  • Tags
  • Shouts
  • Journals
  • Wikis
  • Friend connections
  • Fingerprints
  • …
Hadoop is the infrastructure we use for storing and processing our flood of data.

OUR SETUP How many nodes?
Our herd of elephants
Current specs of our production cluster:
  • 44 nodes
  • 8 cores per node
  • 16 GB memory per node
  • 4 disks of 1 TB spinning at 7200 RPM per node
  • Unpatched CDH2 using:
      • Fair scheduler with preemption
      • Slightly patched hadoop-lzo
      • RecordIO, Avro
      • Hive, Dumbo, Pig

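The per-node specs above can be rolled up into cluster totals with a small sketch. The replication factor of 3 is an assumption (it is the HDFS default, but the slides do not state what this cluster used), and the usable figure ignores compression, non-DFS disk usage and reserved space.

```python
NODES = 44
CORES_PER_NODE = 8
MEMORY_GB_PER_NODE = 16
DISKS_PER_NODE = 4
TB_PER_DISK = 1

# Assumption: HDFS default replication factor; not stated in the slides.
REPLICATION_FACTOR = 3

total_cores = NODES * CORES_PER_NODE
total_memory_gb = NODES * MEMORY_GB_PER_NODE
raw_disk_tb = NODES * DISKS_PER_NODE * TB_PER_DISK

# Every block is stored REPLICATION_FACTOR times, so the space left for
# distinct data is the raw capacity divided by that factor.
usable_disk_tb = raw_disk_tb / REPLICATION_FACTOR

print(total_cores, total_memory_gb, raw_disk_tb, round(usable_disk_tb, 1))
```

Under that assumption, only about a third of the raw disk space holds distinct data.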
We often avoid Java with Dumbo

    def mapper(key, value):
        for word in value.split():
            yield word, 1

    def reducer(key, values):
        yield key, sum(values)

    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper, reducer, combiner=reducer)

Or go even more high-level with Hive

    hive> CREATE TABLE name_counts (gender STRING, name STRING, occurrences INT);
    hive> INSERT OVERWRITE TABLE name_counts
          SELECT lower(sex), lower(split(realname, ' ')[0]), count(1)
          FROM meta_user_info
          WHERE lower(sex) <> 'n'
          GROUP BY lower(sex), lower(split(realname, ' ')[0]);
    hive> CREATE TABLE gender_likelihoods (name STRING, gender STRING, likelihood FLOAT);
    hive> INSERT OVERWRITE TABLE gender_likelihoods
          SELECT b.name, b.gender, b.occurrences / a.occurrences
          FROM (SELECT name, sum(occurrences) AS occurrences
                FROM name_counts GROUP BY name) a
          JOIN name_counts b ON (a.name = b.name);
    hive> SELECT * FROM gender_likelihoods WHERE (name = 'klaas') OR (name = 'sam');
    klaas   m   0.99038464
    klaas   f   0.009615385
    sam     m   0.7578873
    sam     f   0.24211268

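For readers less familiar with HiveQL, the same computation can be sketched in plain Python on a small in-memory list. This illustrates the logic of the two INSERT queries above and is not how the job ran at Last.fm. The function name `gender_likelihoods` and the sample users are invented for the example.

```python
from collections import Counter


def gender_likelihoods(users):
    """Map (name, gender) to the share of users with that first name and gender.

    `users` is an iterable of (sex, realname) pairs, mirroring the
    meta_user_info columns used in the Hive queries.
    """
    # First query: count occurrences per (gender, first name), skipping sex 'n'.
    name_counts = Counter()
    for sex, realname in users:
        gender = sex.lower()
        if gender == "n":
            continue
        first_name = realname.split(" ")[0].lower()
        name_counts[(gender, first_name)] += 1

    # Subquery "a": total occurrences per first name.
    totals = Counter()
    for (gender, name), occurrences in name_counts.items():
        totals[name] += occurrences

    # Join and divide: likelihood of each gender given the name.
    return {
        (name, gender): occurrences / totals[name]
        for (gender, name), occurrences in name_counts.items()
    }


# Made-up sample rows, for illustration only.
sample_users = [
    ("m", "Sam Example"),
    ("M", "sam example"),
    ("m", "Sam Other"),
    ("f", "Sam Person"),
    ("n", "Sam Unknown"),
    ("m", "Klaas Example"),
]
likelihoods = gender_likelihoods(sample_users)
print(likelihoods[("sam", "m")], likelihoods[("sam", "f")])
```

The Hive version does the same thing, but it expresses the grouping and the join declaratively and runs them as MapReduce jobs over the full user table.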
Mixed usage of tools is common

    import os
    import dumbo

    def starter(prog):
        month = prog.delopt("month")  # is expected to be YYYY/MM
        hql = "INSERT OVERWRITE DIRECTORY 'cool/stuff_hive' ...".format(month)
        if os.system('hive -e "{0}"'.format(hql)) != 0:
            raise dumbo.Error("hive query failed")
        prog.addopt("input", "cool/stuff_hive")  # will be text delimited by '\x01'
        prog.addopt("output", "cool/stuff/" + month)

    …

    if __name__ == "__main__":
        dumbo.main(runner, starter)

Running out of DFS space is common too
[Chart annotations: data deleted; new nodes; more compression and nodes; not finalized yet after upgrade]
Possible solutions:
  • Bigger and/or more disks
  • HDFS RAID

Hitting I/O and CPU limits is less common
[Chart annotation: upgraded to 0.20]
Our cluster can be pretty busy at times, but DFS space is our main worry.

USE CASES What do you use it for?
Things we do with Hadoop
  • Site stats and metrics
  • Charts
  • Reporting
  • Metadata corrections
  • Neighbours
  • Recommendations
  • Indexing for search
  • Evaluations
  • Data insights

And also scaring our ops…
Example: Website traffic stats
We compute a lot of site metrics, mostly from Apache logs.
[Chart annotation: Google Chrome is gaining ground]

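As an illustration of what such a metric job could look like, here is a minimal Dumbo-style mapper and reducer that count pageviews per browser. This is a sketch under assumptions, not the actual Last.fm job. It assumes the Apache "combined" log format, in which the user agent is the last quoted field. The browser list, the `run_locally` helper and the sample log lines are all invented for the example.

```python
import re
from collections import defaultdict

# Assumption: Apache "combined" log format, user agent is the last quoted field.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')

# Order matters: Chrome user agents also contain the word "Safari".
BROWSERS = ["Chrome", "Firefox", "MSIE", "Opera", "Safari"]


def mapper(key, value):
    """Emit (browser, 1) for each log line that has a user agent field."""
    match = USER_AGENT_RE.search(value)
    if match is None:
        return  # malformed line, skip it
    agent = match.group(1)
    for browser in BROWSERS:
        if browser in agent:
            yield browser, 1
            return
    yield "Other", 1


def reducer(key, values):
    yield key, sum(values)


def run_locally(lines):
    """Hypothetical helper: simulate the map, shuffle and reduce steps in memory."""
    grouped = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in mapper(offset, line):
            grouped[key].append(value)
    return dict(pair for key in grouped for pair in reducer(key, grouped[key]))


# Invented sample lines, for illustration only.
sample_lines = [
    '10.0.0.1 - - [01/Jun/2010:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (Windows; U) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.55 Safari/533.4"',
    '10.0.0.2 - - [01/Jun/2010:10:00:01 +0000] "GET /music HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3"',
    '10.0.0.3 - - [01/Jun/2010:10:00:02 +0000] "GET /user HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0 (Macintosh; U) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.55 Safari/533.4"',
    "a malformed line without quoted fields",
]
browser_counts = run_locally(sample_lines)
print(browser_counts)
```

On a real cluster the same mapper and reducer would be handed to `dumbo.run`, as in the word count example earlier in the deck, with the reducer doubling as combiner.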
Example: Overall charts
Charts for a single user can be shown in real time and are computed on the fly. But computing overall charts is a pretty big job and is done on Hadoop.

Example: World charts This “world chart” for the Belgian band “Hooverphonic” also required Hadoop because it’s based on data from many different users
Example: Overall wave graphs
Overall visualizations also typically require Hadoop for getting the data they visualize. The “wave graphs” in our “Best of 2009” newspaper were good examples.

Example: Death and scrobbles graphs
[Graph legend: scrobbles; listeners]

Example: Radio stats
Graphs for several metrics that can be broken down by various attributes. Used extensively for A/B testing.
[Chart annotations: significant differences; overheated data centre; DB maintenance that went bad]

Thanks!
klaas@last.fm  @klbostee
marc@last.fm  @lant
tims@last.fm  @roserpens
