RAMP: A System for Capturing and Tracing
Provenance in MapReduce Workflows
Hyunjung Park, Robert Ikeda, Jennifer Widom
Stanford University
MapReduce Workflow
• Directed acyclic graph composed of MapReduce jobs
• Popular for large-scale data processing
  - Hadoop
  - Higher-level platforms: Pig, Hive, Jaql, …
• Debugging can be difficult
Provenance
• Where data came from, how it was processed, …
• Uses: drill-down, verification, debugging, …
• RAMP: fine-grained provenance of data elements
  - Backward tracing: find the input subsets that contributed to a given output element
  - Forward tracing: determine which output elements were derived from a particular input element
Outline of Talk
• MapReduce provenance: definition and examples
• RAMP system: overview and some details
• RAMP experimental evaluation
• Provenance-enabled Pig using RAMP
• Conclusion
MapReduce Provenance
• Mapper
  - M(I) = ∪_{i ∈ I} M({i})
  - The provenance of an output element o ∈ M(I) is the input element i ∈ I that produced o, i.e., o ∈ M({i})
• Reducer (and Combiner)
  - R(I) = ∪_{1 ≤ j ≤ n} R(I_j), where I_1, …, I_n partition I on the reduce key
  - The provenance of an output element o ∈ R(I) is the group I_k ⊆ I that produced o, i.e., o ∈ R(I_k)
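To make the definitions concrete, here is a minimal Python sketch (illustrative only, not RAMP code; `mapper` and `reducer` stand in for arbitrary user functions) that computes provenance directly from the definitions above:

```python
from collections import defaultdict

def trace_map(mapper, inputs):
    """For each map output o, record the input element(s) i with o in M({i})."""
    prov = defaultdict(list)
    for i in inputs:
        for o in mapper(i):              # M(I) = union of M({i}) over i in I
            prov[o].append(i)
    return prov

def trace_reduce(reducer, pairs):
    """For each reduce output o, record the whole group I_k that produced it."""
    groups = defaultdict(list)           # partition of I on the reduce key
    for k, v in pairs:
        groups[k].append(v)
    prov = {}
    for k, vs in groups.items():
        for o in reducer(k, vs):         # R(I) = union of R(I_j) over the groups
            prov[o] = [(k, v) for v in vs]
    return prov
```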
MapReduce Provenance
• MapReduce workflows
  - Intuitive recursive definition
• “Replay” property
  - Replay the entire workflow with the provenance of an output element o
  - Does the result include the element o? Usually YES, but not always
[Diagram: a workflow of map (M) and reduce (R) jobs over input I producing output O, with the provenance I* ⊆ I of an output element o ∈ O replayed through the workflow]
MapReduce Provenance: Wordcount
Input (split 1): "Apache Hadoop", "Hadoop MapReduce", "Apache Pig"
Input (split 2): "Hadoop Summit", "MapReduce Tutorial"

Mapper (Tokenizer):
  Split 1: (Apache, 1) (Hadoop, 1) (Hadoop, 1) (MapReduce, 1) (Apache, 1) (Pig, 1)
  Split 2: (Hadoop, 1) (Summit, 1) (MapReduce, 1) (Tutorial, 1)
Combiner (IntSum):
  Split 1: (Apache, 2) (Hadoop, 2) (MapReduce, 1) (Pig, 1)
  Split 2: (Hadoop, 1) (Summit, 1) (MapReduce, 1) (Tutorial, 1)
Reducer (IntSum):
  (Apache, 2) (Hadoop, 3) (MapReduce, 2) (Pig, 1) (Summit, 1) (Tutorial, 1)
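As a quick illustration of backward tracing on this example (a sketch in Python, not RAMP's implementation), tracing the output element (Hadoop, 3) yields exactly the input lines containing "Hadoop":

```python
def tokenize(line):
    """The wordcount mapper: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]

def backward_trace(word, lines):
    """All input lines that contributed to the count for `word`."""
    return [ln for ln in lines if any(w == word for w, _ in tokenize(ln))]

lines = ["Apache Hadoop", "Hadoop MapReduce", "Apache Pig",
         "Hadoop Summit", "MapReduce Tutorial"]
print(backward_trace("Hadoop", lines))
# -> ['Apache Hadoop', 'Hadoop MapReduce', 'Hadoop Summit']
```

This input subset is what the next slide replays to check the replay property.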
Replay Property: Wordcount
Replaying the workflow on the provenance of the output element (Hadoop, 3), i.e., the input lines "Apache Hadoop", "Hadoop MapReduce", and "Hadoop Summit":

Mapper (Tokenizer):
  Split 1: (Apache, 1) (Hadoop, 1) (Hadoop, 1) (MapReduce, 1)
  Split 2: (Hadoop, 1) (Summit, 1)
Combiner (IntSum):
  Split 1: (Apache, 1) (Hadoop, 2) (MapReduce, 1)
  Split 2: (Hadoop, 1) (Summit, 1)
Reducer (IntSum):
  (Apache, 1) (Hadoop, 3) (MapReduce, 1) (Summit, 1)

The replayed result contains (Hadoop, 3), so the replay property holds here; the extra elements appear because whole input lines are replayed.
System Overview
• Built as an extension (library) to Hadoop 0.21
  - Some changes in the core part as well
• Generic wrapper for provenance capture
  - Captures provenance for one MapReduce job at a time
  - Transparent
    - Provenance stored separately from the input/output data
    - Retains Hadoop's parallel execution and fault tolerance
  - Wrapped components
    - RecordReader
    - Mapper, Combiner, Reducer
    - RecordWriter
System Overview
• Pluggable schemes for provenance capture (see the sketch after this list)
  - Element ID scheme
  - Provenance storage scheme (OutputFormat)
  - Applicable to any MapReduce workflow
• Provenance tracing program
  - Stand-alone
  - Depends on the pluggable schemes
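A rough sketch of what the two pluggable interfaces amount to (illustrative Python; RAMP's actual interfaces are Java classes in the Hadoop API, and all names here are assumptions):

```python
from typing import Iterator, Protocol

class ElementIdScheme(Protocol):
    """Assigns a compact, unique ID to each input/output element."""
    def element_id(self, filename: str, offset: int) -> bytes: ...

class ProvenanceStorage(Protocol):
    """Persists provenance separately from the job's own input/output data."""
    def write_map_provenance(self, group_id: int, input_id: bytes) -> None: ...
    def write_reduce_provenance(self, output_pos: int, group_id: int) -> None: ...
    def trace_backward(self, output_pos: int) -> Iterator[bytes]: ...
```

The stand-alone tracing program then only needs to understand these two schemes, not the wrapped job itself.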
Provenance Capture
[Diagram: map-side provenance capture]
Without wrappers:
  Input → RecordReader → (k_i, v_i) → Mapper → (k_m, v_m) → Map Output
With wrappers:
  Input → RecordReader wrapper: attaches the element ID p, emitting (k_i, ⟨v_i, p⟩)
  → Mapper wrapper: strips p, passes (k_i, v_i) to the Mapper, and re-attaches p to each map output
  → (k_m, ⟨v_m, p⟩) → Map Output
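In Python-flavored pseudocode (a sketch under the diagram's assumptions; RAMP itself wraps Hadoop's Java RecordReader and Mapper classes), the map-side wrapping looks roughly like this:

```python
def wrapped_record_reader(records, filename):
    """Tag each input record with its element ID p = (filename, offset)."""
    for offset, k_i, v_i in records:
        yield k_i, (v_i, (filename, offset))

def wrapped_mapper(mapper, tagged_records):
    """Strip p before calling the user's mapper; re-attach p to every output."""
    for k_i, (v_i, p) in tagged_records:
        for k_m, v_m in mapper(k_i, v_i):   # the user's mapper never sees p
            yield k_m, (v_m, p)
```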
Provenance Capture
[Diagram: reduce-side provenance capture]
Without wrappers:
  Map Output → (k_m, [v_m^1, …, v_m^n]) → Reducer → (k_o, v_o) → RecordWriter → Output
With wrappers:
  Map Output → (k_m, [⟨v_m^1, p_1⟩, …, ⟨v_m^n, p_n⟩]) → Reducer wrapper: strips the p_j, passes (k_m, [v_m^1, …, v_m^n]) to the Reducer, assigns a reduce group ID k_m^ID, and stores each pair (k_m^ID, p_j)
  → (k_o, ⟨v_o, k_m^ID⟩) → RecordWriter wrapper: writes (k_o, v_o) to Output at position q and stores (q, k_m^ID) as provenance
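Continuing the sketch on the reduce side (again illustrative Python; `store` stands in for the pluggable provenance storage scheme, and group IDs would need to be unique across reduce tasks in a real implementation):

```python
def wrapped_reducer(reducer, groups, store):
    """Record group ID -> input ID pairs; tag each output with its group ID."""
    for group_id, (k_m, tagged_values) in enumerate(groups):
        values = [v for v, _ in tagged_values]        # strip provenance
        for _, p in tagged_values:
            store.write_map_provenance(group_id, p)   # k_m^ID -> (file, offset)
        for k_o, v_o in reducer(k_m, values):         # reducer never sees IDs
            yield k_o, (v_o, group_id)

def wrapped_record_writer(write, tagged_outputs, store):
    """Write the clean output; record output position q -> group ID."""
    for k_o, (v_o, group_id) in tagged_outputs:
        q = write(k_o, v_o)                           # q: offset of the record
        store.write_reduce_provenance(q, group_id)
```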
Default Scheme for File Input/Output
• Element ID
  - (filename, offset)
  - Element IDs increase as elements are appended
    - Reduce provenance stored in ascending key order
    - Efficient backward tracing without special indexes
• Provenance storage
  - Reduce provenance: offset → reduce group ID
  - Map provenance: reduce group ID → (filename, offset)
  - Compaction (sketched below)
    - Filename: replaced by file ID
    - Integer ID, offset: variable-length encoded
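The variable-length encoding mentioned here is presumably a standard base-128 varint; a minimal sketch (the exact on-disk format is RAMP's and is not shown in the talk):

```python
def encode_varint(n):
    """Base-128 varint: 7 payload bits per byte, high bit marks continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf, pos=0):
    """Return (value, position just past the varint)."""
    result = shift = 0
    while True:
        b = buf[pos]; pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

assert encode_varint(300) == b"\xac\x02"   # small offsets cost 1-2 bytes, not 8
assert decode_varint(b"\xac\x02") == (300, 2)
```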
Experimental Evaluation
• 51 large EC2 instances (Thank you, Amazon!)
• Two MapReduce “workflows”
  - Wordcount
    - Many-one with large fan-in
    - Input sizes: 100, 300, 500 GB
  - Terasort
    - One-one
    - Input sizes: 93, 279, 466 GB
Experimental Results: Wordcount
[Charts: wordcount results for 100/300/500 GB inputs. Left: execution time, map finish time, and avg. reduce task time in seconds (0 to 6000), without provenance vs. with the time overhead of provenance capture. Right: map output, intermediate, and output data sizes in GB (0 to 2000), without provenance vs. with the space overhead.]
Experimental Results: Terasort
[Charts: Terasort results for 93/279/466 GB inputs. Left: execution time, map finish time, and avg. reduce task time in seconds (0 to 2000), without provenance vs. with the time overhead of provenance capture. Right: map output, intermediate, and output data sizes in GB (0 to 1000), without provenance vs. with the space overhead.]
Experimental Results: Summary
• Provenance capture
  - Wordcount: 76% time overhead; space overhead depends directly on fan-in
  - Terasort: 20% time overhead, 21% space overhead
• Backward tracing
  - Wordcount: 1, 3, 5 minutes (for 100, 300, 500 GB input sizes)
  - Terasort: 1.5 seconds
Instrumenting Pig for Provenance
• Can we run real-world MapReduce workflows on top of RAMP/Hadoop?
• Pig 0.8
  - Added a (file, offset) based element ID scheme: ~100 LOC
  - Default provenance storage scheme
  - Default provenance tracing program
Pig Script: Sentiment Analysis
• Input data sets
  - Tweets collected in 2009
  - 478 highest-grossing movie titles from IMDb
• For each tweet:
  - Infer a 1-5 overall sentiment rating
  - Generate all n-grams and join them with movie titles
• Output data set
  - (Movie title, Rating, #Tweets in November, #Tweets in December)
Pig Script: Sentiment Analysis
-- Load movie titles (tab-delimited) and normalize to lower case
raw_movie = LOAD 'movies.txt' USING PigStorage('\t') AS (title: chararray, year: int);
movie = FOREACH raw_movie GENERATE LOWER(title) as title;

-- Load tweets and normalize text
raw_tweet = LOAD 'tweets.txt' USING PigStorage('\t') AS (datetime: chararray, url, tweet: chararray);
tweet = FOREACH raw_tweet GENERATE datetime, url, LOWER(tweet) as tweet;

-- Infer a 1-5 sentiment rating per tweet, then expand each tweet into n-grams
rated = FOREACH tweet GENERATE datetime, url, tweet, InferRating(tweet) as rating;
ngramed = FOREACH rated GENERATE datetime, url, flatten(GenerateNGram(tweet)) as ngram, rating;
ngram_rating = DISTINCT ngramed;

-- Join n-grams with movie titles ('replicated' = map-side join)
title_rating = JOIN ngram_rating BY ngram, movie BY title USING 'replicated';
title_rating_month = FOREACH title_rating GENERATE title, rating, SUBSTRING(datetime, 5, 7) as month;

-- Count tweets per (title, rating, month)
grouped = GROUP title_rating_month BY (title, rating, month);
title_rating_month_count = FOREACH grouped GENERATE flatten($0), COUNT($1);

-- Split by month and outer-join November counts against December counts
november_count = FILTER title_rating_month_count BY month eq '11';
december_count = FILTER title_rating_month_count BY month eq '12';
outer_joined = JOIN november_count BY (title, rating) FULL OUTER, december_count BY (title, rating);
result = FOREACH outer_joined GENERATE
    (($0 is null) ? $4 : $0) as title,
    (($1 is null) ? $5 : $1) as rating,
    (($3 is null) ? 0 : $3) as november,
    (($7 is null) ? 0 : $7) as december;
STORE result INTO '/sentiment-analysis-result' USING PigStorage();
Backward Tracing: Sentiment Analysis
Limitations
• Definition of MapReduce provenance
  - Reducer treated as a black box
  - Many-many reducers
    - Provenance may contain more input elements than desired
    - E.g., identity reducer for sorting records with duplicates
• Wrapper-based approach for provenance capture
  - Assumes “pure” mapper, combiner, and reducer
  - Assumes “standard” input/output channels
Conclusion
• RAMP transparently captures provenance with reasonable time and space overhead
• RAMP provides a convenient and efficient means of drilling down into and verifying output elements