Hadoop at Musicmetric

     Dr Jameel Syed
         April 2012
Music has moved online
• The world has changed
  –   Do you buy vinyl/tapes/CDs of music?
  –   Do you buy music downloads?
  –   Do you download illegal content from BitTorrent?
  –   Do you listen to music on YouTube?
  –   Do you “like” bands on Facebook?
  –   Do you subscribe to Spotify?
  –   Do you listen to the weekly charts on the radio on a
      Sunday afternoon?
• What’s happening online?
How popular am I?
Who are my fans?
Where are my fans?
What is the press saying?
Who is popular?
Data Science in the Music Industry
• Raw Data
    – Social media/networks (Facebook, YouTube,
      Twitter, Last.fm...)
    – BitTorrent
    – Online reviews
• Raw Data -> Derived Data -> Insight
    – Who is popular right now/in the immediate
      future?
    – What was the effect of appearing at a festival?
    – Which artists are (becoming) popular with
      listeners with certain demographics (in a
      region)?
• Data processing, machine learning &
  statistical methods
    –   Sentiment analysis
    –   Named Entity Recognition
    –   Ranking
    –   Segmentation
Data Pipeline - Overview

Raw Data -> Data Processing (Anomaly Detection, Aggregation) -> Key-Value Store -> API -> Web Application

• Engineering approach
  – KISS
  – Decoupled components
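
One way to picture the KISS/decoupled approach is as a chain of small functions, each depending only on the previous stage's output. The sketch below is illustrative and not Musicmetric's implementation: the record layout, the median-based anomaly rule, and the in-memory dict standing in for the key-value store are all assumptions.

```python
from collections import defaultdict
from statistics import median

# Hypothetical record layout (artist, hour, count); not taken from the talk.
RAW = [
    ("artist-a", 0, 10), ("artist-a", 1, 12), ("artist-a", 2, 11),
    ("artist-a", 3, 900),  # spike that the assumed rule below rejects
    ("artist-b", 0, 5), ("artist-b", 1, 7),
]

def drop_anomalies(records, factor=10):
    """Drop counts above `factor` times the artist's median (assumed rule)."""
    by_artist = defaultdict(list)
    for artist, _hour, count in records:
        by_artist[artist].append(count)
    medians = {artist: median(counts) for artist, counts in by_artist.items()}
    return [r for r in records if r[2] <= factor * medians[r[0]]]

def aggregate(records):
    """Sum the counts per artist."""
    totals = defaultdict(int)
    for artist, _hour, count in records:
        totals[artist] += count
    return dict(totals)

def to_kv_store(totals, store):
    """Write totals into a dict that stands in for the key-value store."""
    for artist, total in totals.items():
        store[f"total:{artist}"] = total
    return store

# Each stage can be replaced or rerun without touching the others.
store = to_kv_store(aggregate(drop_anomalies(RAW)), {})
print(store)
```

Because every stage takes plain data in and returns plain data out, any one of them could be swapped (for example, a different anomaly rule) without changing its neighbours.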
Data Pipeline - Input

• Input
  – Distributed data collection from public internet
    sources
      • Real-time system constraints: 24/7 hourly data
      • Changing format, scope
  – Customers providing private data feeds
      • e.g. sales and streaming data
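
The 24/7 hourly constraint implies that gaps in collection need to be noticed. A minimal sketch of such a check, assuming collected data points carry a timestamp; the function name and sample values are hypothetical and not from the talk.

```python
from datetime import datetime, timedelta

def missing_hours(seen, start, end):
    """Return every whole hour in [start, end) with no collected data point."""
    seen_hours = {t.replace(minute=0, second=0, microsecond=0) for t in seen}
    gaps = []
    hour = start
    while hour < end:
        if hour not in seen_hours:
            gaps.append(hour)
        hour += timedelta(hours=1)
    return gaps

# Hypothetical collection timestamps: nothing arrived during the 02:00 hour.
collected = [
    datetime(2012, 4, 1, 0, 5),
    datetime(2012, 4, 1, 1, 2),
    datetime(2012, 4, 1, 3, 59),
]
gaps = missing_hours(collected, datetime(2012, 4, 1, 0), datetime(2012, 4, 1, 4))
print(gaps)
```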
Data Pipeline - Output

• Output
  – Sparse data requests about hundreds of thousands of artists
  – Timeliness
  – Lots of combinations (by country/city, by release/track,
    diff/cumulative, hourly/daily/weekly, charts…)
  – Need to reprocess over EVERYTHING (new metadata,
    re-delivery of data, anomaly detection)
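
To make the "lots of combinations" point concrete, the sketch below derives daily and cumulative views from a single hourly series. It is an illustration under assumptions: the key layout `(day, hour)` and the sample counts are invented, and the talk does not describe how these views were actually computed.

```python
from collections import defaultdict
from itertools import accumulate

# Hypothetical per-hour ("diff") counts for one artist: (day, hour) -> count
hourly_diff = {
    ("2012-04-01", 0): 3, ("2012-04-01", 1): 5, ("2012-04-01", 2): 2,
    ("2012-04-02", 0): 4, ("2012-04-02", 1): 1,
}

def to_daily(hourly):
    """Roll hourly counts up to one count per day."""
    daily = defaultdict(int)
    for (day, _hour), count in hourly.items():
        daily[day] += count
    return dict(daily)

def to_cumulative(series):
    """Turn a diff series into a running total, in key order."""
    keys = sorted(series)
    return dict(zip(keys, accumulate(series[k] for k in keys)))

daily_diff = to_daily(hourly_diff)
daily_cumulative = to_cumulative(daily_diff)
print(daily_diff)
print(daily_cumulative)
```

Each further dimension (country/city, release/track, weekly) multiplies the number of such derived series, which is why a change in metadata can force reprocessing over all of the data.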
Why Hadoop?
• Outgrew initial solution for data processing
  over existing data
  – How long should daily processing take?
  – I/O (disk seeks)
• Additional data
  – BitTorrent scale-up
  – iTunes sales
  – Spotify plays
Hadoop Cluster
• 12 physical servers + 2 KVM virtual machines
• Cloudera CDH3/Ubuntu 10.04 LTS
• 2x Quad Core Xeon E5620 2.4GHz (HT, 32nm)
• 24GB RAM, 4x 2TB WD
• Gb Ethernet (no link aggregation yet)
• ~2.5kW (max 4kW)

• Cluster layout (private Hadoop network)
  – mm-addax: Edge Server
  – mm-rhino-01: Primary Name Node, Job Tracker, ZooKeeper
  – mm-rhino-02: Secondary Name Node
  – mm-impala, mm-gazelle: NFS Server, DHCP/PXE/DNS
  – mm-rhino-03 … mm-rhino-11: Data Node 01 … Data Node 09
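
A rough storage estimate follows from the figures on this slide. The sketch assumes that all nine data nodes carry the listed 4x 2TB disks, that all of that space is given to HDFS, and that the HDFS default replication factor of 3 is in use; the talk states none of these.

```python
DATA_NODES = 9       # Data Node 01 to Data Node 09 on the slide
DISKS_PER_NODE = 4   # "4x 2TB WD"
TB_PER_DISK = 2
REPLICATION = 3      # HDFS default dfs.replication; assumed, not stated in the talk

raw_tb = DATA_NODES * DISKS_PER_NODE * TB_PER_DISK
usable_tb = raw_tb / REPLICATION

print(f"raw: {raw_tb} TB, usable at {REPLICATION}x replication: {usable_tb:.0f} TB")
```

The real usable figure would be lower once space for the operating system, intermediate MapReduce output, and free-space headroom is set aside.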
Data Storage & Processing
• Data flow: Private Data / Public Data -> Hadoop (Raw data -> Processed -> Time series) -> Voldemort
• Steps: Push To Hadoop -> Preprocess -> Generate timeseries -> HDFS to KVS
• RabbitMQ: To Hadoop, Preprocess, Timeseries, To_KVS
• E.g. BitTorrent, per 1TB of input data:
  – Pre-processed: 200GB
  – Raw time series: 37GB
  – Filtered/artist data: 2.5GB
  – KVS: 1.9GB
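
The sizes above can be restated as shares of the input. The short calculation below uses only the figures from the slide, with decimal units (1TB = 1000GB) as an assumption.

```python
GB_PER_TB = 1000  # decimal units assumed

# Stage sizes in GB, as given on the slide for 1TB of BitTorrent input.
stages = [
    ("BitTorrent input", 1 * GB_PER_TB),
    ("Pre-processed", 200),
    ("Raw time series", 37),
    ("Filtered/artist data", 2.5),
    ("KVS", 1.9),
]

input_gb = stages[0][1]
share = {name: size_gb / input_gb for name, size_gb in stages}

for name, size_gb in stages:
    print(f"{name:22s} {size_gb:8.1f} GB  {100 * share[name]:6.2f}% of input")

# How many times smaller the key-value store payload is than the input.
overall_reduction = input_gb / stages[-1][1]
print(f"input is about {overall_reduction:.0f}x larger than the KVS data")
```

Most of the reduction happens in the first two steps; what finally lands in the key-value store is well under 1% of the raw input.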
Opportunities
• Hive/Pig/HBase
• Mahout
• Nutch
Open Questions & Challenges
• Organizational readiness
    – Planning
    – Access
    – Experience
• Cluster maintenance
    – Unlikely to replicate production setup
    – 24/7 (ish)
    – What can be switched off when (and is it handled automatically)?
• Resource scheduling
• Workflow
• Amazon EMR vs own hardware?
    – Predictable workload/cost?
    – In for a penny, in for a pound
    – Hotel California
• DBA equivalent on Hadoop? HDA
We are hiring

jobs@musicmetric.com
      @tilapia
