Design Patterns for Efficient Graph Algorithms in MapReduce
Jimmy Lin and Michael Schatz, University of Maryland
Tuesday, June 29, 2010
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
@lintool
Talk Outline
- Graph algorithms
- Graph algorithms in MapReduce
- Making it efficient
- Experimental results
Punch line: per-iteration running time -69% on a 1.4b-link webgraph!
What’s a graph?
- G = (V, E), where
  - V represents the set of vertices (nodes)
  - E represents the set of edges (links)
  - Both vertices and edges may contain additional information
- Graphs are everywhere: e.g., hyperlink structure of the web, interstate highway system, social networks, etc.
- Graph problems are everywhere: e.g., random walks, shortest paths, MST, max flow, bipartite matching, clustering, etc.
Source: Wikipedia (Königsberg)
Graph Representation
- G = (V, E)
- Typically represented as adjacency lists: each node is associated with its neighbors (via outgoing edges)
  1: 2, 4
  2: 1, 3, 4
  3: 1
  4: 1, 3
[Figure: the same graph drawn with nodes 1, 2, 3, 4]
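The adjacency-list representation above maps directly onto a dictionary. The following is a minimal Python sketch using the four-node example from the slide; the helper names `edges` and `out_degree` are illustrative and not part of the talk.

```python
# Adjacency lists for the four-node example: node -> outgoing neighbors.
graph = {
    1: [2, 4],
    2: [1, 3, 4],
    3: [1],
    4: [1, 3],
}


def edges(adj):
    """Expand adjacency lists into a sorted list of directed (src, dst) edges."""
    return sorted((src, dst) for src, nbrs in adj.items() for dst in nbrs)


def out_degree(adj, node):
    """Number of outgoing edges of a node."""
    return len(adj[node])


edge_list = edges(graph)
```

Storing only outgoing edges keeps each record self-contained, which is what lets a mapper process one node at a time.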
“Message Passing” Graph Algorithms
- Large class of iterative algorithms on sparse, directed graphs
- At each iteration:
  - Computations at each vertex
  - Partial results (“messages”) passed (usually) along directed edges
  - Computations at each vertex: messages aggregate to alter state
- Iterate until convergence
A Few Examples…
- Parallel breadth-first search (SSSP)
  - Messages are distances from source
  - Each node emits current distance + 1
  - Aggregation = MIN
- PageRank
  - Messages are partial PageRank mass
  - Each node evenly distributes mass to neighbors
  - Aggregation = SUM
- DNA sequence assembly
  - Michael Schatz’s dissertation
Slide callouts: “Boring!” “Still boring!”
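As an illustration of the first example, here is a small single-process Python sketch of parallel breadth-first search as message passing: each reached node emits its current distance + 1 along its outgoing edges, and each node aggregates incoming messages with MIN. The function names are hypothetical, and the graph is the four-node adjacency-list example from the Graph Representation slide.

```python
import math
from collections import defaultdict


def bfs_iteration(adj, dist):
    """One round of message passing over all nodes."""
    inbox = defaultdict(list)
    for node, d in dist.items():
        if d < math.inf:  # only nodes already reached have a distance to send
            for nbr in adj[node]:
                inbox[nbr].append(d + 1)
    # Aggregation = MIN over the current distance and all incoming messages.
    return {node: min([d] + inbox[node]) for node, d in dist.items()}


def parallel_bfs(adj, source):
    """Iterate until no distance changes (convergence)."""
    dist = {node: math.inf for node in adj}
    dist[source] = 0
    while True:
        new_dist = bfs_iteration(adj, dist)
        if new_dist == dist:
            return dist
        dist = new_dist


graph = {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}
distances = parallel_bfs(graph, source=3)
```

In a MapReduce setting each call to `bfs_iteration` would correspond to one job, with the mapper emitting the messages and the reducer taking the minimum.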
PageRank in a nutshell…
- Random surfer model:
  - User starts at a random Web page
  - User randomly clicks on links, surfing from page to page
  - With some probability, user randomly jumps around
- PageRank…
  - Characterizes the amount of time spent on any given page
  - Mathematically, a probability distribution over pages
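PageRank: Defined
Given page x with inlinks t_1 … t_n, where
- C(t) is the out-degree of t
- α is the probability of a random jump
- N is the total number of nodes in the graph
$PR(x) = \alpha \left( \frac{1}{N} \right) + (1 - \alpha) \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}$
[Figure: pages t_1, t_2, …, t_n each linking to page X]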
Sample PageRank Iteration (1)
[Figure: iteration 1 on a five-node graph. Before: n1 = n2 = n3 = n4 = n5 = 0.2. After: n1 = 0.066, n2 = 0.166, n3 = 0.166, n4 = 0.3, n5 = 0.3. Edge labels show the mass sent along each link: 0.1 (four edges), 0.066 (three edges), 0.2 (two edges).]
Sample PageRank Iteration (2)
[Figure: iteration 2 on the same graph. Before: n1 = 0.066, n2 = 0.166, n3 = 0.166, n4 = 0.3, n5 = 0.3. After: n1 = 0.1, n2 = 0.133, n3 = 0.183, n4 = 0.2, n5 = 0.383. Edge labels: 0.033 (two edges), 0.083 (two edges), 0.1 (three edges), 0.166, 0.3.]
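The numbers in the two sample iterations can be reproduced with a few lines of Python. This is a sketch under two assumptions: the edge structure below is inferred from the masses shown on the slides (the link structure itself did not survive in the text), and the update is the simplified one with no random jump, which is what the slide values are consistent with.

```python
def pagerank_iteration(adj, rank):
    """Simplified update: each node splits its mass evenly over its out-links."""
    new_rank = {node: 0.0 for node in adj}
    for node, nbrs in adj.items():
        share = rank[node] / len(nbrs)  # mass sent along each outgoing edge
        for nbr in nbrs:
            new_rank[nbr] += share  # Aggregation = SUM
    return new_rank


# ASSUMPTION: edges inferred from the per-edge masses in the sample iterations.
graph = {
    "n1": ["n2", "n4"],
    "n2": ["n3", "n5"],
    "n3": ["n4"],
    "n4": ["n5"],
    "n5": ["n1", "n2", "n3"],
}

r0 = {node: 0.2 for node in graph}
r1 = pagerank_iteration(graph, r0)  # matches "Iteration 1" up to rounding
r2 = pagerank_iteration(graph, r1)  # matches "Iteration 2" up to rounding
```

Because every node here has at least one out-link, total mass stays at 1.0 across iterations; a graph with dangling nodes would need extra handling that this sketch leaves out.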
PageRank in MapReduce
[Figure: Map and Reduce stages of one PageRank iteration over nodes n1 to n5]
PageRank Pseudo-Code
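The pseudo-code on this slide did not survive as text. As a stand-in, here is a hedged Python sketch of how a basic PageRank iteration is commonly expressed in MapReduce: the mapper emits both the node structure and one mass message per out-link, and the reducer sums the mass and re-attaches the structure. It is not the authors' code. The function names, the tagged-tuple encoding, and the in-process `run_iteration` shuffle are assumptions for illustration, the sample graph is inferred from the sample-iteration slides, and the random jump and dangling nodes are omitted.

```python
from collections import defaultdict


def pagerank_map(node_id, node):
    """Emit the node structure itself plus one mass message per out-link."""
    yield node_id, ("structure", node["adj"])
    share = node["rank"] / len(node["adj"])
    for nbr in node["adj"]:
        yield nbr, ("mass", share)


def pagerank_reduce(node_id, values):
    """Sum incoming mass and re-attach the adjacency list for the next iteration."""
    adj, total = [], 0.0
    for kind, payload in values:
        if kind == "structure":
            adj = payload
        else:
            total += payload
    return node_id, {"rank": total, "adj": adj}


def run_iteration(nodes):
    """Simulate map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for node_id, node in nodes.items():
        for key, value in pagerank_map(node_id, node):
            groups[key].append(value)
    return dict(pagerank_reduce(key, values) for key, values in groups.items())


# ASSUMPTION: sample graph inferred from the sample-iteration slides.
nodes = {
    "n1": {"rank": 0.2, "adj": ["n2", "n4"]},
    "n2": {"rank": 0.2, "adj": ["n3", "n5"]},
    "n3": {"rank": 0.2, "adj": ["n4"]},
    "n4": {"rank": 0.2, "adj": ["n5"]},
    "n5": {"rank": 0.2, "adj": ["n1", "n2", "n3"]},
}
after = run_iteration(nodes)
```

Note that the graph structure travels through the shuffle alongside the messages in this basic version.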
Why don’t distributed algorithms scale?
Source: http://www.flickr.com/photos/fusedforces/4324320625/
Three Design Patterns
- In-mapper combining: efficient local aggregation
- Smarter partitioning: create more opportunities for local aggregation
- Schimmy: avoid shuffling the graph
In-Mapper Combining
- Use combiners
  - Perform local aggregation on map output
  - Downside: intermediate data is still materialized
- Better: in-mapper combining
  - Preserve state across multiple map calls, aggregate messages in buffer, emit buffer contents at end
  - Downside: requires memory management
[Figure: a buffer shared across the configure, map, and close methods]
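The buffer lifecycle described above (configure, map, close) can be sketched in Python as follows. The method names mirror the labels on the slide; the class itself, the dictionary-based node records, and the sample nodes are illustrative assumptions rather than the Hadoop API.

```python
from collections import defaultdict


class InMapperCombiningMapper:
    """Aggregates PageRank mass per destination before anything is emitted."""

    def configure(self):
        # Called once per map task: set up state that survives across map() calls.
        self.buffer = defaultdict(float)

    def map(self, node_id, node):
        # Instead of emitting one message per out-link, add into the buffer.
        share = node["rank"] / len(node["adj"])
        for nbr in node["adj"]:
            self.buffer[nbr] += share

    def close(self):
        # Called once at the end of the task: emit the aggregated messages.
        emitted = sorted(self.buffer.items())
        self.buffer.clear()
        return emitted


mapper = InMapperCombiningMapper()
mapper.configure()
mapper.map("n1", {"rank": 0.2, "adj": ["n2", "n4"]})
mapper.map("n2", {"rank": 0.2, "adj": ["n3", "n5"]})
mapper.map("n5", {"rank": 0.2, "adj": ["n1", "n2", "n3"]})
emitted = mapper.close()
```

Here seven per-edge messages collapse into five, one per distinct destination. The sketch ignores the memory-management downside noted on the slide: a real task would need some policy for flushing the buffer before it outgrows available memory.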
Better Partitioning
- Default: hash partitioning
  - Randomly assign nodes to partitions
- Observation: many graphs exhibit local structure
  - E.g., communities in social networks
  - Better partitioning creates more opportunities for local aggregation
- Unfortunately… partitioning is hard!
  - Sometimes, chicken-and-egg
  - But in some domains (e.g., webgraphs), take advantage of cheap heuristics
  - For webgraphs: range partition on domain-sorted URLs
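To make the webgraph heuristic concrete, the sketch below contrasts hash partitioning with range partitioning over domain-sorted URLs. It is an approximation for illustration only: the sort key (reversed host labels, then path), the use of CRC32 as the hash, and the `example.com` / `example.org` URLs are all assumptions, not details from the talk.

```python
import zlib
from urllib.parse import urlsplit


def domain_sort_key(url):
    """Sort key that groups pages of one site together: reversed host labels, then path."""
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return (host, parts.path)


def range_partition(urls, num_partitions):
    """Assign contiguous ranges of the domain-sorted URLs to the same partition."""
    ordered = sorted(urls, key=domain_sort_key)
    size = -(-len(ordered) // num_partitions)  # ceiling division
    return {url: index // size for index, url in enumerate(ordered)}


def hash_partition(urls, num_partitions):
    """Default-style partitioning: placement ignores which site a page belongs to."""
    return {url: zlib.crc32(url.encode()) % num_partitions for url in urls}


urls = [
    "http://www.example.org/b",
    "http://www.example.com/a",
    "http://www.example.org/a",
    "http://www.example.com/b",
]
by_range = range_partition(urls, num_partitions=2)
by_hash = hash_partition(urls, num_partitions=2)
```

With range partitioning, both `example.com` pages land in one partition and both `example.org` pages in the other, so links within a site stay local and can be aggregated before the shuffle.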
Schimmy Design Pattern
- Basic implementation contains two dataflows:
  - Messages (actual computations)
  - Graph structure (“bookkeeping”)
- Schimmy: separate the two dataflows, shuffle only the messages
  - Basic idea: merge join between graph structure and messages
[Figure: relations S and T, both sorted by join key]
[Figure: partitions S1/T1, S2/T2, S3/T3, both relations consistently partitioned and sorted by join key]
Do the Schimmy!
- Schimmy = reduce-side parallel merge join between graph structure and messages
  - Consistent partitioning between input and intermediate data
  - Mappers emit only messages (actual computation)
  - Reducers read graph structure directly from HDFS
[Figure: three reducers, each merging intermediate data (messages) with its graph-structure partition read from HDFS: S1/T1, S2/T2, S3/T3]
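The reduce-side merge join can be sketched in Python as a single pass over two sorted inputs. This is a simplified illustration: in-memory lists stand in for the graph-structure file a reducer would read from HDFS and for the shuffled message groups, and the function name and record layout are assumptions.

```python
def schimmy_reduce(structure_partition, grouped_messages):
    """Merge join of one partition; both inputs must be sorted by node id.

    structure_partition: iterable of (node_id, adjacency_list)
    grouped_messages:    iterable of (node_id, [mass, ...]) from the shuffle
    """
    messages = iter(grouped_messages)
    pending = next(messages, None)
    for node_id, adj in structure_partition:
        # Skip message groups for ids that are not in this structure partition.
        while pending is not None and pending[0] < node_id:
            pending = next(messages, None)
        total = 0.0
        if pending is not None and pending[0] == node_id:
            total = sum(pending[1])
            pending = next(messages, None)
        yield node_id, {"rank": total, "adj": adj}


structure = [("n1", ["n2", "n4"]), ("n2", ["n3", "n5"]), ("n3", ["n4"])]
messages = [("n1", [0.1]), ("n3", [0.083, 0.1])]
joined = list(schimmy_reduce(structure, messages))
```

Because only the messages go through the shuffle, the adjacency lists never leave the file system. The join is correct only if the structure file and the messages are partitioned and sorted the same way, which is the consistency requirement stated above.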
Experiments
- Cluster setup:
  - 10 workers, each with 2 cores (3.2 GHz Xeon), 4 GB RAM, 367 GB disk
  - Hadoop 0.20.0 on RHELS 5.3
- Dataset:
  - First English segment of ClueWeb09 collection
  - 50.2m web pages (1.53 TB uncompressed, 247 GB compressed)
  - Extracted webgraph: 1.4 billion links, 7.0 GB
  - Dataset arranged in crawl order
- Setup:
  - Measured per-iteration running time (5 iterations)
  - 100 partitions
Results
[Figure: per-iteration running time for each configuration, starting from the “best practices” baseline. Chart annotations, added slide by slide: +18%, -15%, -60%, -69%; 1.4b, 674m, 86m.]
Take-Away Messages
- Lots of interesting graph problems!
  - Social network analysis
  - Bioinformatics
- Reducing intermediate data is key
  - Local aggregation
  - Better partitioning
  - Less bookkeeping
Complete details in:
Jimmy Lin and Michael Schatz. Design Patterns for Efficient Graph Algorithms in MapReduce. Proceedings of the 2010 Workshop on Mining and Learning with Graphs (MLG-2010), July 2010, Washington, D.C.
http://mapreduce.me/
Source code available in Cloud9: http://cloud9lib.org/
@lintool

Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 

Recently uploaded (20)

Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 

Design Patterns for Efficient Graph Algorithms in MapReduce__HadoopSummit2010

  • 1. Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland Tuesday, June 29, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
  • 3. Talk Outline Graph algorithms Graph algorithms in MapReduce Making it efficient Experimental results Punch line: per-iteration running time -69% on 1.4b link webgraph!
  • 4. What’s a graph? G = (V, E), where V represents the set of vertices (nodes) E represents the set of edges (links) Both vertices and edges may contain additional information Graphs are everywhere: E.g., hyperlink structure of the web, interstate highway system, social networks, etc. Graph problems are everywhere: E.g., random walks, shortest paths, MST, max flow, bipartite matching, clustering, etc.
  • 6. Graph Representation G = (V, E) Typically represented as adjacency lists: Each node is associated with its neighbors (via outgoing edges) 2 1: 2, 4 2: 1, 3, 4 3: 1 4: 1, 3 1 3 4
  • 7. “Message Passing” Graph Algorithms Large class of iterative algorithms on sparse, directed graphs At each iteration: Computations at each vertex Partial results (“messages”) passed (usually) along directed edges Computations at each vertex: messages aggregate to alter state Iterate until convergence
  • 8. A Few Examples… Parallel breadth-first search (SSSP) Messages are distances from source Each node emits current distance + 1 Aggregation = MIN PageRank Messages are partial PageRank mass Each node evenly distributes mass to neighbors Aggregation = SUM DNA Sequence assembly Michael Schatz’s dissertation Boring! Still boring!
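The parallel breadth-first search pattern above (each node emits its current distance + 1, aggregation = MIN) can be sketched in a few lines of plain Python. This is a minimal single-process illustration, not the authors' Hadoop code: the function names `bfs_iteration` and `parallel_bfs` are hypothetical, and the graph is the four-node adjacency list from the Graph Representation slide.

```python
from collections import defaultdict

INF = float("inf")

# Adjacency lists from the "Graph Representation" slide.
GRAPH = {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}


def bfs_iteration(graph, dist):
    """One message-passing round: emit distance + 1, aggregate with MIN."""
    messages = defaultdict(list)
    for node, neighbors in graph.items():
        if dist[node] < INF:  # unreached nodes have nothing to send yet
            for nbr in neighbors:
                messages[nbr].append(dist[node] + 1)
    # Each vertex keeps the minimum of its old distance and incoming messages.
    return {
        node: min([dist[node]] + messages.get(node, []))
        for node in graph
    }


def parallel_bfs(graph, source):
    """Iterate until no distance changes (convergence)."""
    dist = {node: INF for node in graph}
    dist[source] = 0
    while True:
        new_dist = bfs_iteration(graph, dist)
        if new_dist == dist:
            return dist
        dist = new_dist
```

Calling `parallel_bfs(GRAPH, 1)` runs rounds until the distances stop changing; in MapReduce each round would be one job.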
  • 9. PageRank in a nutshell…. Random surfer model: User starts at a random Web page User randomly clicks on links, surfing from page to page With some probability, user randomly jumps around PageRank… Characterizes the amount of time spent on any given page Mathematically, a probability distribution over pages
  • 10. PageRank: Defined Given page x with inlinks t1…tn, where C(t) is the out-degree of t, α is the probability of a random jump, and N is the total number of nodes in the graph: $PR(x) = \alpha \left( \frac{1}{N} \right) + (1 - \alpha) \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}$
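As a small worked example of the definition on this slide, the sketch below computes the PageRank of a single page from its inlinks. It assumes the standard random-surfer form PR(x) = α(1/N) + (1 − α) Σ PR(t_i)/C(t_i); the function name `pagerank_of_page` and the sample numbers are illustrative assumptions, not taken from the talk.

```python
def pagerank_of_page(alpha, num_nodes, inlinks):
    """PageRank of one page x.

    alpha     -- probability of a random jump
    num_nodes -- N, the total number of nodes in the graph
    inlinks   -- list of (PR(t_i), C(t_i)) pairs, one per inlinking page t_i
    """
    random_jump = alpha * (1.0 / num_nodes)
    followed_links = sum(rank / out_degree for rank, out_degree in inlinks)
    return random_jump + (1.0 - alpha) * followed_links


# Hypothetical page with two inlinks: one from a page with rank 0.2 and
# out-degree 2, one from a page with rank 0.2 and out-degree 3.
example = pagerank_of_page(alpha=0.15, num_nodes=5, inlinks=[(0.2, 2), (0.2, 3)])
```

With α = 0 the random-jump term vanishes and the value is just the sum of the mass received along inlinks.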
  • 11. Sample PageRank Iteration (1) Iteration 1 n2 (0.2) n2 (0.166) 0.1 n1 (0.2) 0.1 0.1 n1 (0.066) 0.1 0.066 0.066 0.066 n5 (0.2) n5 (0.3) n3 (0.2) n3 (0.166) 0.2 0.2 n4 (0.2) n4 (0.3)
  • 12. Sample PageRank Iteration (2) Iteration 2 n2 (0.166) n2 (0.133) 0.033 0.083 n1 (0.066) 0.083 n1 (0.1) 0.033 0.1 0.1 0.1 n5 (0.3) n5 (0.383) n3 (0.166) n3 (0.183) 0.3 0.166 n4 (0.3) n4 (0.2)
  • 13. PageRank in MapReduce Map n2 n4 n3 n5 n1 n2 n3 n4 n5 n2 n4 n3 n5 n1 n2 n3 n4 n5 Reduce
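The map and reduce phases of the basic PageRank job can be simulated in plain Python as follows. This is a sketch under stated assumptions: the edge list is an assumption chosen to be consistent with the values shown on the Sample PageRank Iteration slides, the random jump is omitted as in those samples, and the names `pagerank_map`, `pagerank_reduce`, and `run_iteration` are hypothetical. Note that the mapper passes the graph structure along with the messages, which is the bookkeeping the Schimmy pattern later removes.

```python
from collections import defaultdict

# Assumed edges, consistent with the sample iteration values on the slides.
GRAPH = {
    "n1": ["n2", "n4"],
    "n2": ["n3", "n5"],
    "n3": ["n4"],
    "n4": ["n5"],
    "n5": ["n1", "n2", "n3"],
}


def pagerank_map(node_id, rank, neighbors):
    """Emit the node's own structure, then an even share of mass per outlink."""
    yield node_id, ("structure", neighbors)
    for nbr in neighbors:
        yield nbr, ("mass", rank / len(neighbors))


def pagerank_reduce(node_id, values):
    """Sum incoming mass and recover the adjacency list for the next round."""
    neighbors, total = [], 0.0
    for kind, payload in values:
        if kind == "structure":
            neighbors = payload
        else:
            total += payload
    return node_id, total, neighbors


def run_iteration(graph, ranks):
    """Simulate map, shuffle (group by key), and reduce in one process."""
    shuffled = defaultdict(list)
    for node_id, neighbors in graph.items():
        for key, value in pagerank_map(node_id, ranks[node_id], neighbors):
            shuffled[key].append(value)
    new_ranks = {}
    for key in sorted(shuffled):
        node_id, total, _ = pagerank_reduce(key, shuffled[key])
        new_ranks[node_id] = total
    return new_ranks
```

Starting from 0.2 at every node, two calls to `run_iteration` give values close to those on the two sample iteration slides.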
  • 15. Why don’t distributed algorithms scale?
  • 17. Three Design Patterns In-mapper combining: efficient local aggregation Smarter partitioning: create more opportunities Schimmy: avoid shuffling the graph
  • 18. In-Mapper Combining Use combiners Perform local aggregation on map output Downside: intermediate data is still materialized Better: in-mapper combining Preserve state across multiple map calls, aggregate messages in buffer, emit buffer contents at end Downside: requires memory management buffer configure map close
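A rough sketch of in-mapper combining, using the configure / map / close lifecycle named on the slide. The class `InMapperCombiningMapper`, the `emit` callback, the `max_buffer_size` threshold, and the helper `run_mapper` are all hypothetical stand-ins for the Hadoop machinery; the edge list is an assumed five-node example graph.

```python
from collections import defaultdict

# Assumed five-node example graph.
GRAPH = {
    "n1": ["n2", "n4"],
    "n2": ["n3", "n5"],
    "n3": ["n4"],
    "n4": ["n5"],
    "n5": ["n1", "n2", "n3"],
}


class InMapperCombiningMapper:
    def __init__(self, emit, max_buffer_size=100_000):
        self.emit = emit
        self.max_buffer_size = max_buffer_size  # crude memory management
        self.buffer = None

    def configure(self):
        # State that survives across map() calls.
        self.buffer = defaultdict(float)

    def map(self, node_id, rank, neighbors):
        # Aggregate messages in the buffer instead of emitting each one.
        for nbr in neighbors:
            self.buffer[nbr] += rank / len(neighbors)
        if len(self.buffer) >= self.max_buffer_size:
            self._flush()

    def close(self):
        # Emit whatever is still buffered at the end of the input split.
        self._flush()

    def _flush(self):
        for key, mass in self.buffer.items():
            self.emit(key, mass)
        self.buffer.clear()


def run_mapper(graph, ranks):
    """Drive one mapper over the whole graph and collect its output."""
    emitted = []
    mapper = InMapperCombiningMapper(lambda k, v: emitted.append((k, v)))
    mapper.configure()
    for node_id, neighbors in graph.items():
        mapper.map(node_id, ranks[node_id], neighbors)
    mapper.close()
    return emitted
```

On this graph a plain mapper would emit one message per edge (nine in total), while the buffered version emits one pre-summed message per destination node.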
  • 19. Better Partitioning Default: hash partitioning Randomly assign nodes to partitions Observation: many graphs exhibit local structure E.g., communities in social networks Better partitioning creates more opportunities for local aggregation Unfortunately… partitioning is hard! Sometimes, chicken-and-egg But in some domains (e.g., webgraphs) take advantage of cheap heuristics For webgraphs: range partition on domain-sorted URLs
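One way to picture range partitioning on domain-sorted URLs is the sketch below. It is an illustration only: sorting by reversed host name (so that pages of one domain are adjacent) is an assumption about what "domain-sorted" means, and the function names `domain_sort_key` and `range_partition` are hypothetical.

```python
from urllib.parse import urlsplit


def domain_sort_key(url):
    """Sort key that places pages from the same domain next to each other."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Assumed convention: "a.example.com" becomes "com.example.a".
    reversed_host = ".".join(reversed(host.split(".")))
    return (reversed_host, parts.path)


def range_partition(urls, num_partitions):
    """Assign contiguous ranges of the domain-sorted URLs to partitions."""
    ordered = sorted(urls, key=domain_sort_key)
    range_size = -(-len(ordered) // num_partitions)  # ceiling division
    return {url: index // range_size for index, url in enumerate(ordered)}


URLS = [
    "http://a.example.com/1",
    "http://b.org/x",
    "http://a.example.com/2",
    "http://b.org/y",
]
ASSIGNMENT = range_partition(URLS, num_partitions=2)
```

Because most links on the web stay within a domain, keeping a domain's pages in one partition gives combiners and in-mapper buffers more messages with the same destination to aggregate locally.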
  • 20. Schimmy Design Pattern Basic implementation contains two dataflows: Messages (actual computations) Graph structure (“bookkeeping”) Schimmy: separate the two data flows, shuffle only the messages Basic idea: merge join between graph structure and messages both relations sorted by join key both relations consistently partitioned and sorted by join key S T S1 T1 S2 T2 S3 T3
  • 21. Do the Schimmy! Schimmy = reduce-side parallel merge join between graph structure and messages Consistent partitioning between input and intermediate data Mappers emit only messages (actual computation) Reducers read graph structure directly from HDFS [Figure: three reducers, each merging one partition pair (S1/T1, S2/T2, S3/T3) of intermediate data (messages) and graph structure read from HDFS]
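The reduce-side merge join at the heart of Schimmy can be sketched as a single pass over two sorted streams. This is a simplified illustration, not the Cloud9 implementation: `schimmy_reduce` is a hypothetical name, the graph partition stands in for a file read directly from HDFS, and both inputs are assumed to be sorted by node id and partitioned consistently.

```python
def schimmy_reduce(graph_partition, grouped_messages):
    """Merge join between graph structure and shuffled messages.

    graph_partition  -- iterable of (node_id, neighbors), sorted by node_id
                        (stand-in for this reducer's graph file on HDFS)
    grouped_messages -- iterable of (node_id, [mass, ...]), sorted by node_id
                        (what the shuffle delivers to this reducer)
    """
    messages = iter(grouped_messages)
    current = next(messages, None)
    output = []
    for node_id, neighbors in graph_partition:
        # Advance past message keys smaller than this node; with consistent
        # partitioning such keys should not occur.
        while current is not None and current[0] < node_id:
            current = next(messages, None)
        if current is not None and current[0] == node_id:
            rank = sum(current[1])
            current = next(messages, None)
        else:
            rank = 0.0  # node received no messages this iteration
        output.append((node_id, rank, neighbors))
    return output


# Made-up partition and messages for illustration.
PARTITION = [("n1", ["n2", "n4"]), ("n2", ["n3", "n5"]), ("n3", ["n4"])]
MESSAGES = [("n1", [0.25]), ("n3", [0.5, 0.25])]
RESULT = schimmy_reduce(PARTITION, MESSAGES)
```

Since the adjacency lists never enter the shuffle, only the messages are sorted and transferred between map and reduce.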
  • 22. Experiments Cluster setup: 10 workers, each 2 cores (3.2 GHz Xeon), 4GB RAM, 367 GB disk Hadoop 0.20.0 on RHELS 5.3 Dataset: First English segment of ClueWeb09 collection 50.2m web pages (1.53 TB uncompressed, 247 GB compressed) Extracted webgraph: 1.4 billion links, 7.0 GB Dataset arranged in crawl order Setup: Measured per-iteration running time (5 iterations) 100 partitions
  • 25. Results +18% 1.4b 674m -15%
  • 26. Results +18% 1.4b 674m -15% -60% 86m
  • 27. Results +18% 1.4b 674m -15% -60% -69% 86m
  • 28. Take-Away Messages Lots of interesting graph problems! Social network analysis Bioinformatics Reducing intermediate data is key Local aggregation Better partitioning Less bookkeeping
  • 29. Complete details in Jimmy Lin and Michael Schatz. Design Patterns for Efficient Graph Algorithms in MapReduce. Proceedings of the 2010 Workshop on Mining and Learning with Graphs (MLG-2010), July 2010, Washington, D.C. http://mapreduce.me/ Source code available in Cloud9 http://cloud9lib.org/ @lintool