Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. To aid this effort, we built Titian, a library that adds data provenance support to Apache Spark, enabling data to be tracked through transformations.
4. ๏ Debugging data processing logic in Data-Intensive Scalable Computing
(DISC) systems is difficult
๏ Analysis tools are still in their “infancy”
๏ Today’s large-scale jobs are black boxes:
• Job submitted to a cluster
• Results come back minutes to hours later
• No visibility into running algorithm
Big Data Debugging
10. ๏ Easy-to-use GDB-like debugger [ICSE 16] (not covered in this talk)
๏ Visibility into the data of a running workflow
• E.g., what (input) data led to this (outlier) result?
๏ Selectively replaying a portion of the data processing steps on the subsets
of intermediate data leading to outlier results
๏ Interactive program analysis
Big Data Debugging - Desiderata
11. ๏ Visibility of data -> Tracking the dependencies between
individual input and output records
๏ Selective replay -> Storage of intermediate results:
• Dataset shared among running job and analysis tool
๏ Interactivity -> Implementation Constraints:
• Latency constraint - In memory computation
• Programming interface constraint - Integration with Spark DSL
Big Data Debugging - Challenges
12. ๏ Well known technique in databases
๏ Two granularities of provenance
• Transformation (coarse-grained) provenance
– Records the complete workflow of the derivation of a dataset
– Spark RDD lineage is an example of this form of provenance
• Data (fine-grained) provenance
– Records data dependencies between input and output records
– The type of provenance Titian focuses on
Data Provenance (Lineage)
13. Sensors:
Tuple-ID  Time  Sensor-ID  Temperature
T1        11AM  1          34
T2        11AM  2          35
T3        11AM  3          35
T4        12PM  1          35
T5        12PM  2          35
T6        12PM  3          100
T7        1PM   1          35
T8        1PM   2          35
T9        1PM   3          80

SELECT AVG(temp), time
FROM sensors
GROUP BY time

Result-ID  Time  AVG(temp)
ID-1       11AM  34.6
ID-2       12PM  56.6  <- Outlier
ID-3       1PM   50    <- Outlier

Why do ID-2 and ID-3 have those high averages?
Data Provenance - Example
16. ๏ They use external storage systems (HDFS in
RAMP [CIDR-11], DBMS in Newt [SOCC-13]) to
retain lineage data -> High overhead
๏ Data provenance queries are supported in a
separate programming interface -> Low interactivity
Previous Data Provenance DISC Systems
19. ๏ Word Count job
๏ RAMP is up to 4X slower than Spark
๏ Newt is up to 86X slower
[Chart: running time (s, log scale) vs. dataset size (GB) for Spark, Newt, and RAMP]
Experience with Newt and RAMP
21. Loads error messages from a log, counts the
number of occurrences of each error, and returns a report
containing the description of each error
val lc = new LineageContext(sc)
val lines = lc.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("error"))
val codes = errors.map(_.split("\t")(1))
val pairs = codes.map(word => (word, 1))
val counts = pairs.reduceByKey(_ + _)
val reports = counts.map(kv => (dscr(kv._1), kv._2))
reports.collect.foreach(println)
Example: Log Analysis
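To make the shape of this job concrete, here is a plain-Scala (collections) analogue of the pipeline above; the log lines and the `dscr` lookup table are hypothetical examples, not part of the talk.

```scala
// Plain-Scala (collections) analogue of the Spark log-analysis pipeline.
// The log format and the `dscr` table are made-up examples.
val log = List(
  "error\t404\tpage not found",
  "info\tserver started",
  "error\t500\tinternal error",
  "error\t404\tpage not found")

val dscr = Map("404" -> "Not Found", "500" -> "Internal Server Error")

val lines   = log
val errors  = lines.filter(_.startsWith("error"))          // keep error lines
val codes   = errors.map(_.split("\t")(1))                 // extract the error code
val pairs   = codes.map(word => (word, 1))                 // (code, 1) pairs
val counts  = pairs.groupBy(_._1)                          // collections version of
  .map { case (k, vs) => (k, vs.map(_._2).sum) }           // reduceByKey(_ + _)
val reports = counts.map { case (k, v) => (dscr(k), v) }   // attach descriptions
```

The same five transformations run distributed in Spark; the collections version only illustrates the dataflow.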
23. Given the result of the previous example, select the
most frequent error and trace back to the input
lines containing it
val frequentPair = reports.sortBy(_._2, false).take(1)
val frequent = reports.filter(_ == frequentPair(0))
val lineage = frequent.getLineage()
val input = lineage.goBackAll()
input.collect().foreach(println)
Example: Backward Tracing
25. Return the error codes generated from the network
sub-system (indicated in the log by a “NETWORK” tag)
val network = errors.filter(_.contains("NETWORK"))
val lineage = network.getLineage()
val output = lineage.goNextAll()
output.collect().foreach(println)
Example: Forward Tracing
27. Return the error distribution excluding the errors caused by
the Guest user
val lineage = reports.getLineage()
val inputLines = lineage.goBackAll()
val noGuest = inputLines.filter(l => !l.contains("Guest") && l.startsWith("error"))
val newCodes = noGuest.map(_.split("\t")(1))
val newPairs = newCodes.map(word => (word, 1))
val newCounts = newPairs.reduceByKey(_ + _)
val newRep = newCounts.map(kv => (dscr(kv._1), kv._2))
newRep.collect
Example: Selective Replay
29. ๏ LineageContext wraps SparkContext
• Providing visibility into the submitted job
๏ LineageRDDs are instrumented at stage boundaries
• They wrap native RDDs
• A specific LineageRDD implementation is chosen based on the instrumented transformation
๏ Provenance data is buffered inside LineageRDDs
• Saved into the Spark BlockManager for querying
Provenance Capturing
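The capture mechanism can be illustrated with a minimal sketch, assuming a much-simplified setting: a wrapper around a map-like transformation assigns an ID to every output record and buffers (input ID, output ID) pairs, the way a LineageRDD buffers provenance before handing it to the BlockManager. This is not Titian's actual implementation; `Tagged` and `CaptureMap` are hypothetical names.

```scala
// Minimal lineage-capture sketch (not Titian's actual code): wrap a map
// transformation, tag every output record with a fresh ID, and buffer
// the (input ID, output ID) dependency pairs.
import scala.collection.mutable.ArrayBuffer

case class Tagged[T](id: Long, value: T)

class CaptureMap[A, B](f: A => B) {
  val lineage = ArrayBuffer.empty[(Long, Long)] // buffered (input ID, output ID) pairs
  private var nextId = 0L

  def apply(in: Seq[Tagged[A]]): Seq[Tagged[B]] = in.map { rec =>
    val out = Tagged(nextId, f(rec.value))      // tag the output record
    nextId += 1
    lineage += ((rec.id, out.id))               // record the dependency
    out
  }
}

val input  = Seq(Tagged(100L, "error\t404"), Tagged(101L, "error\t500"))
val stage  = new CaptureMap[String, String](_.split("\t")(1))
val output = stage(input)
```

In Titian the buffered pairs would then be saved into the BlockManager so later lineage queries can join over them.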
33. Lineage Capture Runtime Overheads
[Chart: running time (s, log scale) vs. dataset size (GB) for Spark, Titian, Newt, and RAMP]
๏ Same Word Count job
๏ Titian is on average 1.3X slower than Spark
34. Hadoop
Input ID  Output ID
offset1   id1
offset2   id2
offset3   id3

Combiner
Input ID      Output ID
{ id1, id3 }  400
{ id2 }       4

Reducer
Input ID  Output ID
[p1, p2]  400
[ p1 ]    4

Stage
Input ID  Output ID
400       id1
4         id2

Example: Captured Data Lineage
35. Tracing backward from an output record is a recursive join over the
captured lineage tables (as in the previous slide), from the last agent
back to the first:
• Stage.Input ID ⋈ Reducer.Output ID
• Reducer.Output ID ⋈ Combiner.Output ID
• Combiner.Input ID ⋈ Hadoop.Output ID
Example: Trace Back
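The join chain above can be sketched with plain Scala collections; this is a toy, single-machine version under the assumption that each agent's lineage is a list of ({input IDs}, output ID) rows with the IDs from the slides (the reducer's partition table is elided).

```scala
// Toy backward trace: each agent's captured lineage is a list of
// (input IDs, output ID) rows, mirroring the tables in the slides.
val hadoop   = List((Set("offset1"), "id1"), (Set("offset2"), "id2"), (Set("offset3"), "id3"))
val combiner = List((Set("id1", "id3"), "400"), (Set("id2"), "4"))
val stage    = List((Set("400"), "id1"), (Set("4"), "id2"))

// One join step: from a set of output IDs of an agent, recover the
// input IDs that produced them.
def goBack(table: List[(Set[String], String)], ids: Set[String]): Set[String] =
  table.filter { case (_, out) => ids(out) }.flatMap(_._1).toSet

// Trace the final record id1 back to the raw input offsets.
val step1 = goBack(stage, Set("id1"))
val step2 = goBack(combiner, step1)
val step3 = goBack(hadoop, step2)
```

Forward tracing (goNextAll) is the same join performed in the opposite direction, from input IDs to output IDs.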
Now let’s do it for real!
39. The same tables, distributed across workers:

Worker3 (Reducer / Stage):
Reducer                    Stage
Input ID  Output ID        Input ID  Output ID
[p1, p2]  400              400       id1
[ p1 ]    4                4         id2

Worker1 (Hadoop / Combiner):
Hadoop                     Combiner
Input ID  Output ID        Input ID      Output ID
offset1   id1              { id1, id3 }  400
offset2   id2              { id2 }       4
offset3   id3

Worker2 (Hadoop / Combiner):
Hadoop                     Combiner
Input ID  Output ID        Input ID    Output ID
offset1   id1              { id1, … }  400
…         …

Example: Trace Back

42. On Worker3, join Stage.Input ID ⋈ Reducer.Output ID

44. Targeted Shuffle: the matched reducer input partition IDs are sent
only to the workers holding those partitions
Worker1:                   Worker2:
Input ID  Output ID        Input ID  Output ID
p1        400              p1        400

46. On Worker1 and Worker2, join Combiner.Output ID ⋈ Reducer.Output ID

48. Finally, join Combiner.Input ID ⋈ Hadoop.Output ID
Example: Trace Back
49. Tracing Performance
๏ Word Count job
๏ Tracing one record backward takes < 1 sec for
datasets < 100GB
๏ 18 sec for a 500GB dataset
50. Vega: Optimizations
for Selective Replay
Matteo Interlandi, Sai Deep Tetali, Muhammad Ali Gulzar, Joseph Noor
Miryung Kim, Todd Millstein, Tyson Condie
Under Submission
51. Debugging workflow
๏ Run program
๏ Understand the cause of bugs / outliers:
• Lineage (Titian [VLDB 2016])
• Breakpoints/watchpoints, crash culprit (BigDebug [ICSE 2016])
๏ Fix bug
• Fast selective replay
53. Incremental Plan
input .map(x => (x, 1)) .reduceByKey(_ + _)
lines -> pairs -> counts (Stage 1 | shuffle | Stage 2)

Input:   aa, b, c, aa, c
Map:     (aa, 1), (b, 1), (c, 1), (aa, 1), (c, 1)
Shuffle: (aa, [1, 1]), (b, [1]), (c, [1, 1])
Reduce:  (aa, 2), (b, 1), (c, 2)
54. Incremental Plan
Inject a filter into the workflow
input .filter(x => x != "c") .map(x => (x, 1)) .reduceByKey(_ + _)
lines -> filter -> pairs -> counts (Stage 1 | shuffle | Stage 2)
55. Incremental Plan
input .filter(x => x != "c") .map(x => (x, 1)) .reduceByKey(_ + _)

Input:   aa, b, c, aa, c
Filter:  aa, b, aa
Map:     (aa, 1), (b, 1), (aa, 1)
Shuffle: (aa, [1, 1]), (b, [1])
Reduce:  (aa, 2), (b, 1)
57. Incremental Plan
Instead of re-running, propagate only the deltas produced by the new filter
(original Map/Shuffle/Reduce outputs as in slide 53):

Input:    aa, b, c, aa, c
Filter:   aa, b, c, aa, c
δFilter:  —c, —c
∆Map:     —(c, 1), —(c, 1)
∆Shuffle: (c, [—1, —1])
∆Reduce:  —(c, 2)
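The delta propagation above can be sketched in plain Scala; this is a toy version under the assumption that deletions are represented as negative counts pushed through map and reduce, not Vega's actual implementation.

```scala
// Toy delta propagation for the word-count example: push the records
// removed by the injected filter through map/reduce as negative counts,
// then apply them to the previously computed result.
val input  = List("aa", "b", "c", "aa", "c")
val counts = input.groupBy(identity).map { case (w, ws) => (w, ws.size) } // original result

val deleted    = input.filterNot(_ != "c")              // δFilter: records the new filter drops
val deltaPairs = deleted.map(w => (w, -1))              // ∆Map: negative unit counts
val deltas     = deltaPairs.groupBy(_._1)               // ∆Shuffle + ∆Reduce:
  .map { case (w, ps) => (w, ps.map(_._2).sum) }        // sum the negative counts per key

// Apply the deltas; keys whose count reaches 0 disappear from the result.
val newCounts = deltas.foldLeft(counts) { case (acc, (w, d)) =>
  val n = acc.getOrElse(w, 0) + d
  if (n == 0) acc - w else acc.updated(w, n)
}
```

Only the keys touched by the deltas are reprocessed; everything else reuses the old result.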
62. Performance
๏ Good up to a certain point
๏ Two factors dominate:
• Space utilization
• Time to shuffle deltas
๏ Insight:
• The more downstream the filter is placed, the better the incremental
performance
• Especially beneficial if we can place it past the shuffle
65. Commutative Rewrite
Goal: apply filter(x => x != "c") after the shuffle instead of at the input
(pipeline as in slide 53).
But the input to the filter there is (word, 1) pairs, not words:
we cannot use the filter as written.

66. Commutative Rewrite
Observe that the map x => (x, 1) is invertible:
we can use the old filter by composing it with the inverse of the map
(i.e., by testing the key of each pair).
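The rewrite can be sketched with plain collections; a minimal sketch, assuming the shuffle output was already materialized, in which the original predicate is applied to the key recovered through the inverse of `x => (x, 1)`.

```scala
// Commutative-rewrite sketch: since x => (x, 1) is invertible on the key,
// the original filter x != "c" can run after the shuffle by testing keys.
val input = List("aa", "b", "c", "aa", "c")
val pred  = (x: String) => x != "c"        // the injected filter's predicate

// Materialized shuffle output of the original (unfiltered) run:
val shuffled = input.map(x => (x, 1))
  .groupBy(_._1)
  .map { case (k, ps) => (k, ps.map(_._2)) }

// Rewritten filter: pred composed with the map's inverse, which recovers
// the original word from the key of each shuffled group.
val filtered = shuffled.filter { case (k, _) => pred(k) }
val result   = filtered.map { case (k, vs) => (k, vs.sum) }
```

Because the filter now runs past the shuffle, no re-shuffle of the surviving records is needed.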
70. Why does it scale so well?
๏ Runtime is on the order of the output size
๏ Output size depends on the number of unique words
๏ Unique words << total words
71. Combining Strategies
๏ Push the changed transform past as many shuffles
as possible with rewrites
• The new transform can be placed only after materialization
points
• By default we materialize shuffle output
• Efficient because Spark already saves shuffle output for fault
tolerance
๏ Use delta computation for the remaining workflow
72. Vega
๏ Built on Spark and Spark SQL (only filter rewrite)
๏ Spark SQL API is unchanged
๏ Spark API includes:
• Functions with inverses (for maps)
• Inverse values (for incremental reduce)
๏ Automatically rewrites workflows using commutativity
and incremental evaluation
73. ๏ Titian gives Spark users the ability to trace through program execution
๏ Features:
• Intermediate results are shared in memory
• Tight integration with the Spark API (LineageRDD)
• Low job overhead
• Efficient lineage queries
๏ Vega provides 1–3 orders of magnitude performance gains over rerunning the
computation from scratch
๏ Both provide results in a few seconds for many workflows, allowing interactive
usage
Conclusions
76. Configuration
๏ Two sets of experiments:
• Unstructured - grep and word count
• Structured - PigMix queries
๏ Datasets:
• Unstructured: 500MB to 500GB files containing words generated using a
Zipf distribution over a dictionary of 8000 words
• Structured: we used the PigMix generator to create datasets of sizes ranging
from 1GB to 1TB
๏ Configuration:
• 16 machines, each with 4 cores (2 hyper-threads per core), 32GB of RAM, and a 1TB disk
• Spark 1.2.1
79. ๏ Titian gives Spark users the ability to trace through
program execution at interactive speed
๏ Features:
• Intermediate results are shared in memory
• Tight integration with the Spark API (LineageRDD)
• Low job overhead
• Efficient lineage queries
๏ We believe Titian will open the door to program logic debugging,
iterative data (and program) cleaning, and exploratory analysis
Titian: Data Provenance in Spark
107. Combiner Probe Phase
The Combiner LineageRDD (pairs) sits between the input and output records
and maintains two hash tables, keyed by the record key:

Key  Input IDs         Key  Agg Value
400  { id1, id3 }      400  2
4    { id2 }           4    1

An incoming record's ID (e.g., id3, obtained from the TaskContext) is added
to the Input IDs of its key. When an output record is produced — (400, 2),
then (4, 1) — its output ID is fetched and the captured lineage table is
emitted:

Input ID      Output ID
{ id1, id3 }  400
{ id2 }       4
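The probe phase can be sketched with a toy hash-based combiner; a minimal sketch, not Titian's actual code, assuming the output ID is simply the key and that the aggregate is a count.

```scala
// Toy combiner probe phase: while aggregating, keep a second hash table
// from key to the set of input record IDs, so each combiner output record
// carries its provenance.
import scala.collection.mutable

val incoming = Seq(("id1", 400), ("id2", 4), ("id3", 400)) // (input ID, key)

val agg      = mutable.Map.empty[Int, Int]          // key -> aggregate value (count)
val inputIds = mutable.Map.empty[Int, Set[String]]  // key -> input IDs seen so far

for ((id, key) <- incoming) {
  agg(key)      = agg.getOrElse(key, 0) + 1
  inputIds(key) = inputIds.getOrElse(key, Set.empty) + id
}

// On output, each (key, aggregate) record is paired with its input-ID set,
// producing the captured ({input IDs}, output ID) lineage table.
val lineage = agg.toMap.map { case (key, _) => (inputIds(key), key) }
```

The probe lookup on output is what ties each aggregate record back to the exact input records that contributed to it.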
124. Capturing: StageLineageRDD
The Stage LineageRDD sits between the input and output records of the
final stage. For the output record (Bad request, 7), it gets the output ID
(id1), obtains the input ID 400 from the TaskContext, and records:

Input ID  Output ID
400       id1