Marimba - A MapReduce-based Programming Model for Self-maintainable Aggregate Views

A MapReduce-based
Programming Model for
Self-maintainable Aggregate Views
2012-08-31

Johannes Schildgen
TU Kaiserslautern
schildgen@cs.uni-kl.de

11 26 14 23 37
39 41 26
19 8
25
19 22 15 18 10 16
27
8 9 12 14 15

Kinderriegel: 26
Balisto: 39
Hanuta: 14
Snickers: 19
Ritter Sport: 41
Pickup: 12
…

Kinderriegel: 26
Balisto: 39
Hanuta: 14
Snickers: 19
Ritter Sport: 41
Δ
Balisto: -1
Pickup: 12
…

Kinderriegel: 26
Balisto: 39
Hanuta: 14
Snickers: 19
Ritter Sport: 41
Δ
Balisto: -8
Pickup: 12 Snickers: +24
… Ritter Sport: -7

Increment Installation

Kinderriegel: 26
Balisto: 31
Hanuta: 14
Snickers: 43
Ritter Sport: 34
Δ
Balisto: -8
Pickup: 12 Snickers: +24
… Ritter Sport: -7

Overwrite Installation

Kinderriegel: 26
Balisto: 39
Hanuta: 14
Snickers: 19
Δ
Balisto: -8
Ritter Sport: 41Kinderriegel: 26
Pickup: 12 Balisto: 31 Snickers: +24
Hanuta: 14 Ritter Sport: -7
… Snickers: 43
Ritter Sport: 34
Pickup: 12

Fundamentals &
The Marimba Framework Evaluation
Related Work

public class WordCount extends Configured implements Tool {

public static class WordCountMapper extends
Mapper<LongWritable, Text, ImmutableBytesWritable,
LongWritable> {

private Text word = new Text();

@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
this.word.set(tokenizer.nextToken());

A A E
B E F
C F B
D C D

> create 'person', 'default'
> put 'person', 'p27', 'default:forename', 'Anton'
> put 'person', 'p27', 'default:surname', 'Schmidt'

> get 'person', 'p27'
COLUMN CELL
default:forname timestamp=1338991497408, value=Schmidt
default:surname timestamp=1338991436688, value=Anton
2 row(s) in 0.0640 seconds

banane
die
iss
nimm
3
4
1
1
Δ
kaufe 1
schale 1
die 0
schäle 1
banane 1
schmeiß 1
schmeiß -1
weg 1
schale -1
weg -1

banane
die
iss
kaufe
nimm
4
4
1
1
1
Δ
kaufe 1
increment()
schale 0 die 0
schäle 1 banane 1
schmeiß 0 schmeiß -1
weg 0 schale -1
weg -1

banane
die
iss
nimm
3
4
1
1
Δ
kaufe 1
schale 1 overwrite() die 0
schäle 1
banane 1
schmeiß 1
schmeiß -1
weg 1
schale -1
weg -1

banane
die
iss
kaufe
nimm
4
4
1
1
1
Δ
kaufe 1
schale 0 overwrite() die 0
schäle 1 banane 1
schmeiß 0 schmeiß -1
weg 0 schale -1
weg -1

void map(key, value) {
if(value is inserted) {
for(word : value.split(" ")) {
write(word, 1);
}
else if(value is deleted) {
write(word, -1);
}
}

}

void map(key, value) {
if(value is inserted) {
write(word, 1);
}
else if(value is deleted) {
write(word, -1);
}
}
else { // old result
write(key, value);
}
}

Overwrite Installation

void reduce(key, values) {
sum = 0;
for(value : value) {
sum += value;
}
put = new Put(key);
put.add("fam", "col", sum);
context.write(key, put);
}

Increment Installation

void reduce(key, values) {
sum = 0;
for(value : value) {
sum += value;
}
inc = new Increment(key);
inc.add("fam", "col", sum);
context.write(key, inc);
}

Core functionality:
Distributed computations
with MapReduce

I care about:
IncDec, Overwrite,
reading old results,
producing of Increments,…

I tell you how to
read input data,
Core functionality: aggregate, invert and
write the output
Incremental computations

public class WordTranslator extends
Translator<LongWritable, Text> {
public void translate(…) {
…
}

IncJob job = new IncJobOverwrite(conf);
job.setTranslatorClass(
WordTranslator.class);
job.setAbelianClass(WordAbelian.class);

public class WordAbelian implements
Abelian<WordAbelian> {
WordAbelian invert() { … }
WordAbelian aggregate(WordAbelian
other) { … }
WordAbelian neutral() { … }
boolean isNeutral() { … }
Writable extractKey() { … }
void write(…) { … }
void readFields(…) { … }
}

public class WordSerializer
implements Serializer<WordAbelian> {

Writable serialize(Writable key,
WordAbelian v) {
…
}
WordAbelian deserializeHBase(
byte[] rowId, byte[] colFamily,
byte[] qualifier, byte[] value) {
…
}
}

How To Write A Marimba-Job

1. Abelian-Class
2. Translator-Class
3. Serializer-Class
4. Write a Hadoop-Job and use the
class IncJob

Implementation

setInputTable(…)

IncJob setOutputTable(…)

IncJobFull
IncJobIncDec IncJobOverwrite
Recomputation

setResultInputTable(…)

NeutralOutputStrategy
(for IncJobOverwrite)

public interface Abelian<T extends
Abelian<?>> extends
WritableComparable<BinaryComparable>{

T invert();
T aggregate(T other);
T neutral();
boolean isNeutral();
Writable extractKey();
void write(…);
void readFields(…);
}

public interface Serialzer<T extends
Abelian<?>> {

Writable serialize(T obj);
T deserializeHBase(
byte[] rowId, byte[] colFamily,
byte[] qualifier, byte[] value);
}

public abstract class Translator
<KEYIN, VALUEIN> {

public abstract void translate
(KEYIN key, VALUEIN value,
Context context);

this.mapContext.write(
abelianValue.extractKey(),
this.invertValue ?
abelianValue.invert() :
abelianValue);

GenericMapper

From InputFormat: Value

OverwriteResult InsertedValue DeletedValue PreservedValue

deserialize translate set invertValue=true; ignore
translate

GenericReducer

1. aggregate
2. serialize
3. write

IncDec: Overwrite:
PUT → write
putToIncrement(…) IGNORE → don‘t write
DELETE → putToDelete(...)

GenericCombiner

„Write A Combiner“
-- 7 Tips for Improving MapReduce Performance, (Tipp 4)

1. aggregate

Example:
1. WordCount
void translate(key, value) {
WordAbelian invert() {
return new WordAbelian(
for(word : value.split(" ")) { this.word,
write( -1 * this.count);
new WordAbelian(word, 1)); }
}
} WordAbelian aggregate(
WordAbelian other) {
return new WordAbelian(
Writable serialize( this.word,
WordAbelian w) { this.count
Put p = new Put( + other.count);
w.getWord()); }
p.add(…);
return p; boolean neutral() {
} return new WordAbelian(
this.word, 0);
}

boolean isNeutral() {

Translator }
return (this.count == 0);

Serializer WordAbelian

Example:
2. Friends Of Friends
FRIENDS

A
D
B FRIENDS OF FRIENDS

C
E

Example:
2. Friends Of Friends
translate(person, friends):

aggregate(…):
Merge friends-of-friends-
sets

Example:
3. Reverse WebLink-Graph

REVERSE WEB LINK GRAPH
(Row-ID -> Columns)

Google -> {eBay, Wikipedia}

aggregate(…): -> {Google, Wikipedia}
eBay

Merge link-sets Mensa-KL -> {Google}
Facebook -> {Google, Mensa-
KL, Uni-KL}

Wikipedia -> {Google}

Uni-KL -> {Google, Wikipedia}

Example:
4. Bigrams
Hi, kannst du mich ___?___ am
Bahnhof abholen? So in etwa
10 ___?___. Viele liebe ___?__.
P.S. Ich habe viel ___?___.

Idea:
Analize large amount of
text data

Example:
4. Bigrams
extractKey()
a invert():
count*=-1
b NGramAbelian
count
aggregate(… other):
write(…) count+=other.count
neutral():
isNeutral(): count=0
count==0

NGramStep2Abelian

Beispielanwendungen:
4. Bigrams
extractKey()
a invert():
count*=-1
b NGramAbelian
count
aggregate(… other):
write(…) count+=other.count
neutral():
isNeutral(): count=0
count==0

NGramStep2Abelian

bitte
Hi, kannst du mich ___ ___ am
nicht
Bahnhof abholen? So in etwa
<num>Minuten
<num>
Jahre
10 ___ ___. Grüße
dich
Viele liebe ___ __.
zu
Spaß
P.S. Ich habe viel ___ ___.

WordCount

01:10

01:00

00:50

00:40
Zeit [hh:mm]

00:30 FULL
INCDEC
OVERWRITE
00:20

00:10

00:00
0% 10% 20% 30% 40% 50% 60% 70% 80%
Änderungen

Reverse Weblink-Graph

02:51

02:41

02:31

02:21

02:11

02:00

01:50

01:40
Zeit [hh:mm]

01:30
FULL
01:20
INCDEC
01:10
OVERWRITE
01:00

00:50

00:40

00:30

00:20

00:10

00:00
0% 10% 20% 30% 40% 50% 60% 70% 80%
Änderungen

Conclusion
Full Recomputation
IncDec / Overwrite

Images
Folie 5-9: Folie 37-44:
Flammen und Smilie: Microsoft Office 2010 Puzzle: http://www.flickr.com/photos/dps/136565237/

Folie 10: Folie 46 - 48:
Google: http://www.google.de Junge: Microsoft Office 2010

Folie 11: Folie 49:
Amazon: http://www.amazon.de Google: http://www.google.de
eBay: http://www.ebay.de
Folie 12: Mensa-KL: http://www.mensa-kl.de
Hadoop: http://hadoop.apache.org facebook: http://www.facebook.de
Casio Wristwatch: Wikipedia: http://de.wikipedia.org
http://www.flickr.com/photos/andresrueda/3448240252 TU Kaiserslautern: http://www.uni-kl.de

Folie 16: Folie 50-51:
Hadoop: http://hadoop.apache.org Handy: Microsoft Office 2010

Folie 17: Folie 56:
Hadoop: http://hadoop.apache.org Wikipedia: http://de.wikipedia.org
Notebook: Microsoft Office 2010 Twitter: http://www.twitter.com

Folie 18: Folie 57:
HBase: http://hbase.apache.org Handy: Microsoft Office 2010

Folie 31: Folie 58:
Hadoop: http://hadoop.apache.org Hadoop: http://hadoop.apache.org
Casio Wristwatch: Casio Wristwatch: http://www.flickr.com/photos/andresrueda/3448240252
http://www.flickr.com/photos/andresrueda/3448240252

Folie 32:
Gerüst: http://www.flickr.com/photos/michale/94538528/
Hadoop: http://hadoop.apache.org
Junge: Microsoft Office 2010

Bibliography (1/2)
[0] Johannes Schildgen. Ein MapReduce-basiertes Programmiermodell für selbstwartbare Aggregatsichten.
Masterarbeit, TU Kaiserslautern, August 2012

[1] Apache Hadoop project. http://hadoop.apache.org/.
[2] Virga: Incremental Recomputations in MapReduce. http://wwwlgis.informatik.uni-kl.de/cms/?id=526.
[3] Philippe Adjiman. Hadoop Tutorial Series, Issue #4: To Use Or Not To Use A Combiner,2010.
http://www.philippeadjiman.com/blog/2010/01/14/hadoop-tutorial-series-issue-4-to-use-or-not-to-use-a-combiner/.
[4] Kai Biermann. Big Data: Twitter wird zum Fieberthermometer der Gesellschaft, April 2012.
http://www.zeit.de/digital/internet/2012-04/twitter-krankheiten-nowcast.
[5] Julie Bort. 8 Crazy Things IBM Scientists Have Learned Studying Twitter, January 2012.
http://www.businessinsider.com/8-crazy-things-ibm-scientists-have-learned-studying-twitter-2012-1.
[6] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clus-ters. OSDI, pages 137–150, 2004.
[7] Lars George. HBase: The Definitive Guide. O’Reilly Media, 1 edition, 2011.
[8] Brown University Data Management Group. A Comparison of Approaches to Large-Scale Data Analysis.
http://database.cs.brown.edu/projects/mapreduce-vs-dbms/.
[9] Ricky Ho. Map/Reduce to recommend people connection, August 2010.
http://horicky.blogspot.de/2010/08/mapreduce-to-recommend-people.html.
[10] Yong Hu. Efficiently Extracting Change Data from HBase. April 2012.
[11] Thomas Jörg, Roya Parvizi, Hu Yong, and Stefan Dessloch. Can mapreduce learnform materialized views?
In LADIS 2011, pages 1 – 5, 9 2011.
[12] Thomas Jörg, Roya Parvizi, Hu Yong, and Stefan Dessloch. Incremental recomputations in mapreduce. In CloudDB 2011, 10 2011.
[13] Steve Krenzel. MapReduce: Finding Friends, 2010. http://stevekrenzel.com/finding-friends-with-mapreduce.
[14] Todd Lipcon. 7 Tips for Improving MapReduce Performance, 2009.http://www.cloudera.com/blog/2009/12/7-tips-for-improving-
mapreduce-performance/.

Bibliography (2/2)
[15] TouchType Ltd. SwiftKey X - Android Apps auf Google Play, February 2012.
http://play.google.com/store/apps/details?id=com.touchtype.swiftkey.
[16] Karl H. Marbaise. Hadoop - Think Large!, 2011. http://www.soebes.de/files/RuhrJUGEssenHadoop-20110217.pdf.
[17] Arnab Nandi, Cong Yu, Philip Bohannon, and Raghu Ramakrishnan. Distributed Cube Materia-lization on Holistic Measures.
ICDE, pages 183–194, 2011.
[18] Alexander Neumann. Studie: Hadoop wird ähnlich erfolgreich wie Linux, Mai 2012.
http://heise.de/-1569837.
[19] Owen O’Malley, Jack Hebert, Lohit Vijayarenu, and Amar Kamat. Partitioning your job into maps and reduces, September 2009.
http://wiki.apache.org/hadoop/HowManyMapsAndReduces?action=recall&rev=7.
[20] Roya Parvizi. Inkrementelle Neuberechnungen mit MapReduce. Bachelorarbeit, TU Kaiserslautern,
Juni 2011.
[21] Arnd Poetzsch-Heffter. Konzepte objektorientierter Programmierung. eXamen.press.
Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.
[22] Dave Rosenberg. Hadoop, the elephant in the enterprise, June 2012.
http://news.cnet.com/8301-1001 3-57452061-92/hadoop-the-elephant-in-the-enterprise/.
[23] Marc Schäfer. Inkrementelle Wartung von Data Cubes. Bachelorarbeit, TU Kaiserslautern, Januar 2012.
[24] Sanjay Sharma. Advanced Hadoop Tuning and Optimizations, 2009.
http://www.slideshare.net/ImpetusInfo/ppt-on-advanced-hadoop-tuning-n-optimisation.
[25] Jason Venner. Pro Hadoop. Apress, Berkeley, CA, 2009.
[26] DickWeisinger. Big Data: Think of NoSQL As Complementary to Traditional RDBMS, Juni 2012.
http://www.formtek.com/blog/?p=3032.
[27] Tom White. 10 MapReduce Tips, May 2009. http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/.60

Marimba - A MapReduce-based Programming Model for Self-maintainable Aggregate Views

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (8)

Similar to Marimba - A MapReduce-based Programming Model for Self-maintainable Aggregate Views

Similar to Marimba - A MapReduce-based Programming Model for Self-maintainable Aggregate Views (20)

More from Johannes Schildgen

More from Johannes Schildgen (6)

Recently uploaded

Recently uploaded (20)

Marimba - A MapReduce-based Programming Model for Self-maintainable Aggregate Views