Hourglass: a Library for Incremental Processing on Hadoop
IEEE BigData 2013
October 9th
Matthew Hayes
©2013 LinkedIn Corporation. All Rights Reserved.
Matthew Hayes
Staff Software Engineer
www.linkedin.com/in/matthewterencehayes/
• 3+ Years on Applied Data Team at LinkedIn
• Skills
• Endorsements
• DataFu
• White Elephant
Agenda
• Motivation
• Design
• Experiments
• Q&A
Motivation
Event Collection in an Online System
• Online websites typically have instrumented services that collect events
• Events are stored in an offline system (such as Hadoop) for later analysis
• Using events, we can build dashboards with metrics such as:
– # of page views over the last month
– # of active users over the last month
• Metrics derived from events can also be useful in recommendation pipelines
– e.g. impression discounting
Event Storage
• Events can be categorized into topics, for example:
– page view
– user login
– ad impression/click
• Store events by topic and by day:
– /data/page_view/daily/2013/10/08
– /data/page_view/daily/2013/10/09
– ...
– /data/ad_click/daily/2013/10/08
• Computations can now be performed over specific time windows
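As a rough illustration of how the daily layout above maps to time windows, the following Python sketch lists the input paths for a date range. The helper names `daily_paths` and `fixed_length_window` and the `root` parameter are assumptions made for this example; they are not part of Hourglass.

```python
from datetime import date, timedelta


def daily_paths(topic, start, end, root="/data"):
    """Return one path per day in the inclusive range [start, end]."""
    num_days = (end - start).days + 1
    paths = []
    for i in range(num_days):
        day = start + timedelta(days=i)
        paths.append(f"{root}/{topic}/daily/{day:%Y/%m/%d}")
    return paths


def fixed_length_window(end, length):
    """Return (start, end) for a window covering the last `length` days."""
    return end - timedelta(days=length - 1), end


# Two days of page view events, matching the layout on the slide.
print(daily_paths("page_view", date(2013, 10, 8), date(2013, 10, 9)))

# A 30-day window ending on 2013/10/09.
start, end = fixed_length_window(date(2013, 10, 9), 30)
print(len(daily_paths("page_view", start, end)))
```

A fixed-start window would keep `start` constant and advance only `end`, so each day adds exactly one path to the list.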
Computation Over Time Windows
• In practice, many of our computations over time windows use either a fixed-start window or a fixed-length window
Recognizing Inefficiencies
• But jobs typically compute these daily
• From one day to the next, the input changes little
• A fixed-start window includes just one new day
Recognizing Inefficiencies
• A fixed-length window includes one new day and drops the oldest day
Recognizing Inefficiencies
• Jobs repeatedly process the same input data
• This wastes cluster resources
• Better to process new data only
• How can we do better?
Hourglass Design
Design Goals
• Address use cases:
– Fixed-start and fixed-length window computations
– Daily partitioned data
• Reduce resource usage
• Reduce wall clock time
• Run on standard Hadoop
Improving Fixed-Start Computations
• Suppose we must compute page view counts per member
• The job consumes all days of available input, producing one output
• We call this a partition-collapsing job
• But if the job runs again tomorrow, it has to reprocess the same data
Improving Fixed-Start Computations
• Solution: merge new data with the previous output
• We can do this because counting is an arithmetic operation: the new day's counts can be added to the previous totals
• Hourglass provides a partition-collapsing job that supports output reuse
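The idea of merging a new day into the previous output can be sketched in a few lines of Python. This is only an in-memory illustration under assumed names (`count_page_views`, `merge`) with made-up member IDs; it is not the Hourglass job itself, which runs as MapReduce on Hadoop.

```python
from collections import Counter


def count_page_views(events):
    """Count page views per member for one batch of events (a list of member IDs)."""
    return Counter(events)


def merge(previous_output, new_day_output):
    """Add the new day's counts to the previous totals."""
    return previous_output + new_day_output


day1 = ["alice", "bob", "alice"]
day2 = ["bob", "carol"]

# Yesterday's run produced an output covering day1.
previous = count_page_views(day1)

# Today only day2 is read; the rest comes from the previous output.
incremental = merge(previous, count_page_views(day2))

# Reprocessing everything gives the same result.
full = count_page_views(day1 + day2)
print(incremental == full)
```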
Partition-Collapsing Job Architecture (Fixed-Start)
• When applied to a fixed-start window computation, the job merges the new day's input with the previous output
Improving Fixed-Length Computations
• For a fixed-length job, we can reuse output using a similar trick:
– Add the new day to the previous output
– Subtract the old day from the result
• We can subtract the old day because the operation is arithmetic
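The add-then-subtract step for a fixed-length window might look like the following Python sketch. The function name `slide_window` and the per-day counts are hypothetical, and the data is held in memory purely for illustration.

```python
from collections import Counter


def slide_window(previous_output, newest_day, oldest_day):
    """Advance a fixed-length window by one day."""
    updated = Counter(previous_output)
    updated.update(newest_day)    # add the new day
    updated.subtract(oldest_day)  # subtract the day that left the window
    return +updated               # unary plus drops keys whose count fell to zero


# Per-day page view counts (made-up values).
days = [
    Counter(alice=2, bob=1),
    Counter(bob=1),
    Counter(alice=1, carol=3),
]

# Two-day window over days[0] and days[1].
previous = days[0] + days[1]

# Slide to days[1] and days[2] without rereading days[1].
current = slide_window(previous, newest_day=days[2], oldest_day=days[0])
print(current == days[1] + days[2])
```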
Partition-Collapsing Job Architecture (Fixed-Length)
• When applied to a fixed-length window computation, the job adds the new day to the previous output and subtracts the oldest day
Improving Fixed-Length Computations
• But for some operations, old data cannot be subtracted
– Example: max(), min()
• The previous output cannot be reused, so how do we reduce computation?
• Solution: a partition-preserving job
• Partitioned input data, partitioned output data
• Essentially: aggregate the data in advance
• Aggregating in advance can be useful even when you can reuse output
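To make the partition-preserving idea concrete for max(), here is a small Python sketch: each day is aggregated once into a per-day output, and any window is then computed from those small per-day outputs instead of the raw events. The function names, the (member, value) event shape, and the sample values are assumptions for illustration only.

```python
def aggregate_day(events):
    """Partition-preserving step: reduce one day's (member_id, value) events to a max per member."""
    result = {}
    for member_id, value in events:
        result[member_id] = max(value, result.get(member_id, value))
    return result


def collapse(daily_aggregates):
    """Partition-collapsing step: combine per-day maxima into one output for the window."""
    result = {}
    for aggregate in daily_aggregates:
        for member_id, value in aggregate.items():
            result[member_id] = max(value, result.get(member_id, value))
    return result


raw = {
    "2013/10/07": [("alice", 5), ("bob", 2)],
    "2013/10/08": [("alice", 3)],
    "2013/10/09": [("bob", 7), ("alice", 1)],
}

# Each day is aggregated once and the result is kept.
daily = {day: aggregate_day(events) for day, events in raw.items()}

# Two-day windows are built from the saved per-day outputs.
window_a = collapse([daily["2013/10/07"], daily["2013/10/08"]])
window_b = collapse([daily["2013/10/08"], daily["2013/10/09"]])
print(window_a, window_b)
```

When the window slides, only the newest day has to be aggregated from raw events; the older per-day outputs are simply read again.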
Partition-Preserving Job Architecture
MapReduce in Hourglass
• MapReduce is a fairly general programming model
• Hourglass requires:
– reduce() must output a (key, value) pair
– reduce() must produce at most one value
– reduce() is implemented by an accumulator
Building Blocks
• Two types of jobs:
– Partition-preserving: consume partitioned input data, produce partitioned output data
– Partition-collapsing: consume partitioned input data, produce a single output
• Must provide to jobs:
– Input and output paths
– Desired time range
• Must implement:
– map()
– accumulate()
• May implement if necessary:
– merge()
– unmerge()
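The four operations listed above can be pictured with the following Python sketch. It borrows only the method names from the slide; the class `PageViewCount`, the helpers `run_partition` and `combine`, and the record shape are hypothetical and do not reflect the actual Hourglass (Java) API.

```python
class PageViewCount:
    """Hypothetical job definition; only the method names come from the slide."""

    def map(self, record):
        # Emit one (key, value) pair per input record.
        return record["member_id"], 1

    def accumulate(self, state, value):
        # Fold one mapped value into the running state for a key.
        return value if state is None else state + value

    def merge(self, previous, new):
        # Combine a previous output value with a new one.
        return previous + new

    def unmerge(self, current, old):
        # Remove an old value from the current output value.
        return current - old


def run_partition(job, records):
    """Apply map() and accumulate() to one day's records."""
    output = {}
    for record in records:
        key, value = job.map(record)
        output[key] = job.accumulate(output.get(key), value)
    return output


def combine(op, left, right):
    """Apply op key by key; keys present only in `right` are copied as-is."""
    result = dict(left)
    for key, value in right.items():
        result[key] = op(result[key], value) if key in result else value
    return result


job = PageViewCount()
days = [
    [{"member_id": "a"}, {"member_id": "b"}],
    [{"member_id": "a"}],
    [{"member_id": "b"}, {"member_id": "a"}],
]
daily = [run_partition(job, day) for day in days]

window = combine(job.merge, daily[0], daily[1])   # window over days 0-1
window = combine(job.merge, window, daily[2])     # add the newest day
window = combine(job.unmerge, window, daily[0])   # remove the oldest day
print(window)
```

An operation such as max() would implement `map()`, `accumulate()`, and `merge()` but leave out `unmerge()`, which is why it is listed as optional.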
Experiments
Metrics for Evaluation
• Wall clock time
– Amount of time that elapses until the job completes
• Total task time
– Sum of execution times for all tasks
– Represents usage of cluster resources
• Compare each against a baseline non-incremental job
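The two metrics differ because tasks run in parallel. A small Python sketch, using made-up (start, end) task times in seconds rather than any measured data, shows how each could be computed and compared against a baseline:

```python
def wall_clock_time(tasks):
    """Elapsed time from the first task start to the last task end."""
    return max(end for _, end in tasks) - min(start for start, _ in tasks)


def total_task_time(tasks):
    """Sum of the execution times of all tasks."""
    return sum(end - start for start, end in tasks)


def percent_reduction(baseline, incremental):
    """Reduction of the incremental job relative to the baseline, in percent."""
    return 100.0 * (baseline - incremental) / baseline


# Hypothetical tasks: two run in parallel, a third runs afterwards.
tasks = [(0, 40), (0, 55), (60, 90)]

print(wall_clock_time(tasks))   # 90
print(total_task_time(tasks))   # 125
print(percent_reduction(baseline=200, incremental=10))  # 95.0
```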
Experiment: Page Views per Member
• Goal: count page views per member over the last n days
• Chain a partition-preserving job and a partition-collapsing job
• The previous output can be reused
Experiment: Page Views per Member
Member Count Estimation
• Goal: estimate the number of members visiting the site over the past n days
• Use HyperLogLog cardinality estimation (trades space for accuracy)
• The output cannot be reused, but a partition-preserving job can save per-day state
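One way to picture the saved state is a per-day HyperLogLog sketch: sketches for different days can be merged by taking the register-wise maximum, so a window estimate does not need the raw events again. The Python below is a simplified HyperLogLog written for illustration; the precision `P`, the SHA-1 based hash, and all function names are assumptions, and it is not the implementation used in the experiments.

```python
import hashlib
import math

P = 10          # assumed precision: 2**10 = 1024 registers
M = 1 << P


def new_sketch():
    return [0] * M


def add(sketch, item):
    """Record one item in the sketch."""
    h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")  # 64-bit hash
    index = h >> (64 - P)                      # first P bits select a register
    rest = h & ((1 << (64 - P)) - 1)           # remaining bits
    rank = (64 - P) - rest.bit_length() + 1    # position of the leftmost 1-bit
    sketch[index] = max(sketch[index], rank)


def merge(a, b):
    """Merging two sketches is a register-wise max."""
    return [max(x, y) for x, y in zip(a, b)]


def estimate(sketch):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)         # small-range correction
    return raw


def sketch_of(items):
    sketch = new_sketch()
    for item in items:
        add(sketch, item)
    return sketch


# Made-up member IDs for three days, with overlap between days.
days = [
    [f"member{i}" for i in range(0, 300)],
    [f"member{i}" for i in range(200, 500)],
    [f"member{i}" for i in range(400, 600)],
]

# Per-day state saved by the partition-preserving step.
daily = [sketch_of(day) for day in days]

# Window over the last two days: merge saved sketches, no raw events needed.
window = merge(daily[1], daily[2])
print(round(estimate(window)))   # an approximation of the 400 distinct members
```

Because max() has no inverse, an old day cannot be removed from a merged sketch, which is why the window is rebuilt from the per-day sketches each time.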
Member Count Estimation: Results
Conclusion
• Computations over sliding windows are quite common
• Implementations are typically inefficient
• Incrementalizing Hadoop jobs can in some cases yield:
– 95-98% reductions in total task time
– 20-40% reductions in wall clock time
Learning More
datafu.org
