Big Data Architecture for Efficient MapReduce Pipelines

Optimized Big Data: Efficient
Architectures for Building
MapReduce Pipelines
Fabiane Bizinella Nardon
@fabianenardon

HOW TO TELL WHETHER YOUR DATA
IS REALLY BIG:
Your data does not all fit
on a single machine
by Fernando Stankuns

REALLY BIG:
You talk in terabytes more often
than in gigabytes

REALLY BIG:
The amount of data you process grows
constantly, and it is expected to double next year.
by Saulo Cruz

FOR EVERYTHING ELSE:
KEEP IT SIMPLE!

Hadoop
HBase
Hive
Crunch
HDFS
Cascading
Pig
Mahout
Redis
MongoDB
MySQL
Cassandra

Pipeline (Example)
http://www.tailtarget.com/home/
http://cnn.com/news
http://www.tailtarget.com/about/
http://www.tailtarget.com/home/ - Technology
http://cnn.com/news - News
http://www.tailtarget.com/about/ - Technology
u=0C010003 - Technology
u=12070002 - News
u=00AD0e12 - Technology
u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
u=0C010003 - http://www.tailtarget.com/home/
u=12070002 - http://cnn.com/news
u=00AD0e12 - http://www.tailtarget.com/about/
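The sample records above can be read as stages of one flow: raw log lines lose their IP address, each URL receives a category, and the category is finally attached to the user. The sketch below walks that flow in plain Java on a single machine, only to make the data shapes concrete. The `classify` method is a hypothetical stand-in (the slides do not say how URLs are classified), and the class and method names are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PipelineExample {
    // Hypothetical classifier: the slides do not show how categories are assigned.
    static String classify(String url) {
        return url.contains("cnn.com") ? "News" : "Technology";
    }

    // Turns raw lines of the form "u=<user> - <url> - <ip>" into user -> category.
    static Map<String, String> categoriesByUser(String[] logLines) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : logLines) {
            String[] parts = line.split(" - ");
            if (parts.length < 2) {
                continue; // skip malformed lines
            }
            String user = parts[0];
            String url = parts[1]; // the IP address in parts[2] is dropped here
            result.put(user, classify(url));
        }
        return result;
    }

    public static void main(String[] args) {
        String[] log = {
            "u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194",
            "u=12070002 - http://cnn.com/news - 189.19.123.161",
            "u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127"
        };
        categoriesByUser(log).forEach((user, category) ->
            System.out.println(user + " - " + category));
    }
}
```

In a real deployment each of these steps would be a distributed stage rather than a loop, which is what the rest of the deck is about.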

MapReduce Pipelines
- Tools -
Orchestrate
Chain
Optimize
Hadoop
HBase
Hive
Crunch
HDFS
Cascading
Pig
Mahout
Redis
MongoDB
MySQL
Cassandra

Apache Crunch
Library for building MapReduce pipelines
on Hadoop
Interleaves and orchestrates different
MapReduce functions
As a bonus, it optimizes and simplifies the implementation of
MapReduce
FlumeJava: Easy, Efficient Data-Parallel
Pipelines (Google, 2010)

Crunch - Anatomy of a Pipeline
[Diagram: Data Source (HDFS) -> DoFn 1 -> PCollection* -> DoFn 2 -> PTable* -> Write -> Data Target (HDFS)]
* PCollection, PTable or PGroupedTable
[Diagram, built up over four slides: parallelDo() applies a DoFn to a collection, and copies of DoFn 1 and DoFn 2 run on Hadoop Node 1 and Hadoop Node 2]

tailtarget.com - 2
cnn.com - 1

// Build a pipeline that runs on Hadoop MapReduce.
Pipeline pipeline = new MRPipeline(SimpleNaiveMapReduce.class, getConf());
PCollection<String> lines = pipeline.readTextFile("my/file");
// Map each log line to a (host, 1) pair.
PTable<String, Integer> visitors = lines.parallelDo("Count Visitors",
    new NaiveCountVisitors(),
    Writables.tableOf(Writables.strings(), Writables.ints()));
// Group by host and sum the ones.
PGroupedTable<String, Integer> grouped = visitors.groupByKey();
PTable<String, Integer> counts = grouped.combineValues(Aggregators.SUM_INTS());
pipeline.writeTextFile(counts, "my/output/file");
// Plan and execute the underlying MapReduce jobs.
PipelineResult pipelineResult = pipeline.done();

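import java.net.MalformedURLException;
import java.net.URL;
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.Pair;

public class NaiveCountVisitors extends DoFn<String, Pair<String, Integer>> {
    @Override
    public void process(String line, Emitter<Pair<String, Integer>> emitter) {
        // Log line format: "u=<user> - <url> - <ip>", so the URL is the third token.
        String[] parts = line.split(" ");
        try {
            emitter.emit(Pair.of(new URL(parts[2]).getHost(), 1));
        } catch (MalformedURLException e) {
            // Skip lines whose URL cannot be parsed.
        }
    }
}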
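To see what the counting step produces without a cluster, the sketch below applies the same extraction (third space-separated token, then the host name) to the three sample log lines and sums per host in memory. It is a single-machine illustration, not Crunch code. It uses `java.net.URI` in place of `java.net.URL` because the `URL` constructors are deprecated in recent JDKs, and the class and method names are illustrative.

```java
import java.net.URI;
import java.util.Map;
import java.util.TreeMap;

final class LocalVisitorCount {
    // Counts log lines per host; lines that cannot be parsed are skipped.
    static Map<String, Integer> countByHost(String[] lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(" ");
            if (parts.length < 3) {
                continue; // not a "u=<user> - <url> - <ip>" line
            }
            String host;
            try {
                host = URI.create(parts[2]).getHost();
            } catch (IllegalArgumentException e) {
                continue; // malformed URL
            }
            if (host != null) {
                counts.merge(host, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] log = {
            "u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194",
            "u=12070002 - http://cnn.com/news - 189.19.123.161",
            "u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127"
        };
        System.out.println(countByHost(log));
    }
}
```

Note that `getHost()` keeps the `www.` prefix, so the key for the first and third lines is `www.tailtarget.com` rather than the shortened `tailtarget.com` shown on the slide.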

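// The same job written directly against the Hadoop MapReduce API.
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private Text page = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] parts = line.split(" ");
        // MalformedURLException is an IOException, so a bad URL fails the task.
        page.set(new URL(parts[2]).getHost());
        context.write(page, one);
    }
}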

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable counter = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
            count = count + value.get();
        }
        counter.set(count);
        context.write(key, counter);
    }
}
// Crunch equivalent of the reducer above:
PGroupedTable<String, Integer> grouped = visitors.groupByKey();
PTable<String, Integer> counts = grouped.combineValues(Aggregators.SUM_INTS());

[Diagram: MapReduce data flow. HDFS (Chunk 1, Chunk 2) -> Record Reader -> Map -> Combine -> Local Storage -> Copy -> Sort -> Reduce]

Job counters: plain MapReduce vs. with Crunch
Counter | MapReduce: Map | MapReduce: Reduce | MapReduce: Total | With Crunch: Map | With Crunch: Reduce | With Crunch: Total
SLOTS_MILLIS_MAPS | 0 | 0 | 1,635,906 | 0 | 0 | 1,434,544
SLOTS_MILLIS_REDUCES | 0 | 0 | 870,082 | 0 | 0 | 384,755
FILE_BYTES_WRITTEN | 1,907,284,471 | 956,106,354 | 2,863,390,825 | 3,776,871 | 681,575 | 4,458,446
Map input records | 33,809,720 | 0 | 33,809,720 | 33,809,720 | 0 | 33,809,720
Map output records | 33,661,880 | 0 | 33,661,880 | 33,661,880 | 0 | 33,661,880
Combine input records | 0 | 0 | 0 | 33,714,223 | 0 | 33,714,223
Combine output records | 0 | 0 | 0 | 74,295 | 0 | 74,295
Reduce input records | 0 | 33,661,880 | 33,661,880 | 0 | 21,952 | 21,952
Reduce output records | 0 | 343 | 343 | 0 | 343 | 343
Map output bytes | 888,738,480 | 0 | 888,738,480 | 888,738,480 | 0 | 888,738,480
Map output materialized bytes | 956,063,008 | 0 | 956,063,008 | 657,536 | 0 | 657,536
Reduce shuffle bytes | 0 | 940,985,238 | 940,985,238 | 0 | 657,536 | 657,536
Physical memory (bytes) snapshot | 11,734,376,448 | 527,491,072 | 12,261,867,520 | 12,008,472,576 | 86,654,976 | 12,095,127,552
Spilled Records | 67,103,496 | 33,661,880 | 100,765,376 | 74,295 | 21,952 | 96,247
Total committed heap usage (bytes) | 10,188,226,560 | 396,902,400 | 10,585,128,960 | 10,188,226,560 | 59,441,152 | 10,247,667,712
CPU time spent (ms) | 456,03 | 79,84 | 535,87 | 450,26 | 6,23 | 456,49
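The counters above show a combiner running in the Crunch job (non-zero Combine input and output records) and none in the plain MapReduce job. A small calculation makes the effect on network traffic explicit. The sketch below only divides numbers copied from the table; the class and method names are illustrative.

```java
final class CombinerEffect {
    // How many times smaller the "after" figure is compared with the "before" figure.
    static double reductionFactor(long before, long after) {
        return (double) before / after;
    }

    public static void main(String[] args) {
        // Values copied from the counter table (plain MapReduce vs. with Crunch).
        long shuffleBytesPlain = 940_985_238L;
        long shuffleBytesCrunch = 657_536L;
        long reduceInputPlain = 33_661_880L;
        long reduceInputCrunch = 21_952L;

        System.out.printf("Reduce shuffle bytes shrink by a factor of %.0f%n",
            reductionFactor(shuffleBytesPlain, shuffleBytesCrunch));
        System.out.printf("Reduce input records shrink by a factor of %.0f%n",
            reductionFactor(reduceInputPlain, reduceInputCrunch));
    }
}
```

Both factors are above one thousand, while the map side still reads the same 33,809,720 input records in both runs.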

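// Members of a DoFn<String, Pair<String, Integer>> that pre-combines counts in memory.
private Map<String, Integer> items = null;

@Override
public void initialize() {
    items = new HashMap<String, Integer>();
}

@Override
public void process(String line, Emitter<Pair<String, Integer>> emitter) {
    String[] parts = line.split(" ");
    try {
        String host = new URL(parts[2]).getHost();
        Integer value = items.get(host);
        items.put(host, value == null ? 1 : value + 1);
    } catch (MalformedURLException e) {
        // Skip lines whose URL cannot be parsed.
    }
}

// Emit one pair per host, only after this task has seen all of its input.
@Override
public void cleanup(Emitter<Pair<String, Integer>> emitter) {
    for (Map.Entry<String, Integer> item : items.entrySet()) {
        emitter.emit(Pair.of(item.getKey(), item.getValue()));
    }
}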
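The difference between the naive function and the pre-combining one is how many pairs leave the map side. The sketch below contrasts the two strategies on an in-memory list of host names, outside of Crunch; the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PreCombineDemo {
    // Naive strategy: one (host, 1) pair is emitted for every input record.
    static List<Map.Entry<String, Integer>> naive(List<String> hosts) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String host : hosts) {
            emitted.add(Map.entry(host, 1));
        }
        return emitted;
    }

    // Pre-combining: accumulate counts in memory, emit one pair per distinct host.
    static List<Map.Entry<String, Integer>> preCombined(List<String> hosts) {
        Map<String, Integer> items = new HashMap<>();
        for (String host : hosts) {
            items.merge(host, 1, Integer::sum);
        }
        return new ArrayList<>(items.entrySet());
    }

    public static void main(String[] args) {
        List<String> hosts = List.of("cnn.com", "tailtarget.com", "cnn.com", "cnn.com");
        System.out.println("naive pairs emitted: " + naive(hosts).size());
        System.out.println("pre-combined pairs emitted: " + preCombined(hosts).size());
    }
}
```

The trade-off is memory: the in-memory map must hold every distinct key a task sees, so this approach suits jobs with few distinct keys relative to the number of records.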

Job counters: with Crunch vs. with pre-combining
Counter | With Crunch: Map | With Crunch: Reduce | With Crunch: Total | With Pre-Combining: Map | With Pre-Combining: Reduce | With Pre-Combining: Total
SLOTS_MILLIS_MAPS | 0 | 0 | 1,434,544 | 0 | 0 | 1,151,037
SLOTS_MILLIS_REDUCES | 0 | 0 | 384,755 | 0 | 0 | 614,053
FILE_BYTES_WRITTEN | 3,776,871 | 681,575 | 4,458,446 | 2,225,432 | 706,037 | 2,931,469
Map input records | 33,809,720 | 0 | 33,809,720 | 33,809,720 | 0 | 33,809,720
Map output records | 33,661,880 | 0 | 33,661,880 | 21,952 | 0 | 21,952
Combine input records | 33,714,223 | 0 | 33,714,223 | 21,952 | 0 | 21,952
Combine output records | 74,295 | 0 | 74,295 | 21,952 | 0 | 21,952
Reduce input records | 0 | 21,952 | 21,952 | 0 | 21,952 | 21,952
Reduce output records | 0 | 343 | 343 | 0 | 343 | 343
Map output bytes | 888,738,480 | 0 | 888,738,480 | 613,248 | 0 | 613,248
Map output materialized bytes | 657,536 | 0 | 657,536 | 657,92 | 0 | 657,92
Reduce shuffle bytes | 0 | 657,536 | 657,536 | 0 | 657,92 | 657,92
Physical memory (bytes) snapshot | 12,008,472,576 | 86,654,976 | 12,095,127,552 | 12,065,873,920 | 171,278,336 | 12,237,152,256
Spilled Records | 74,295 | 21,952 | 96,247 | 21,952 | 21,952 | 43,904
Total committed heap usage (bytes) | 10,188,226,560 | 59,441,152 | 10,247,667,712 | 10,188,226,560 | 118,882,304 | 10,307,108,864
CPU time spent (ms) | 450,26 | 6,23 | 456,49 | 295,79 | 9,86 | 305,65

Bottlenecks are usually
caused by the amount of
data transferred over the network

Architecting the Pipeline
[Diagram: numbered pipeline stages (1 to 6) processing http://cnn.com/news, with a Merge step]

{http://www.tailtarget.com/home/, [u=0C010003], Technology}
{http://cnn.com/news, [u=12070002], News}
{http://www.tailtarget.com/about/, [u=00AD0e12], Technology}
{http://www.tailtarget.com/home/, [u=0C010003]}
{http://cnn.com/news, [u=12070002]}
{http://www.tailtarget.com/about/, [u=00AD0e12]}

Do as much as you can with
the data you already have at hand
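One reading of the `{url, [users]}` and `{url, [users], category}` records shown earlier is that the user list travels with the URL, so each URL is classified once and the category reaches its users without a second pass over the log. The sketch below illustrates that idea on a single machine. The `classify` method and its call counter are hypothetical helpers added only to make the saving visible.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CarryUsersWithUrl {
    static int classifyCalls = 0; // instrumentation for the demo only

    // Hypothetical classifier: stands in for an expensive classification step.
    static String classify(String url) {
        classifyCalls++;
        return url.contains("cnn.com") ? "News" : "Technology";
    }

    // Each visit is a {user, url} pair; the result maps user -> category.
    static Map<String, String> categoryByUser(List<String[]> visits) {
        // Step 1: build {url, [users]} records.
        Map<String, List<String>> usersByUrl = new LinkedHashMap<>();
        for (String[] visit : visits) {
            usersByUrl.computeIfAbsent(visit[1], k -> new ArrayList<>()).add(visit[0]);
        }
        // Step 2: classify each URL once and fan the category out to its users.
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : usersByUrl.entrySet()) {
            String category = classify(entry.getKey());
            for (String user : entry.getValue()) {
                result.put(user, category);
            }
        }
        return result;
    }
}
```

With this shape, the number of classification calls follows the number of distinct URLs rather than the number of log lines.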

[Diagram: numbered pipeline stages processing http://cnn.com/news, with a Merge step and Redis]

[Diagram: numbered pipeline stages processing http://cnn.com/news, with Redis, split into two pipelines]
Pipeline A:
Input: 1
Output: 2, 4
Pipeline B:
Input: 1
Output: 3, 6

// Export the execution plan that Crunch generated as a DOT file for inspection.
String dotFileContents = pipeline.getConfiguration()
    .get(PlanningParameters.PIPELINE_PLAN_DOTFILE);
FileUtils.writeStringToFile(
    new File("/tmp/logpipelinegraph.dot"), dotFileContents);

Nothing has more impact
on the performance of your application
than optimizing
your own code

*It's big data, remember?
Hadoop/HDFS do not work well
with small files.
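One common way to act on this is to consolidate many small files into fewer large ones before loading them. The slides do not say how the author handled it, so the sketch below is only an assumption: it concatenates local files with the standard `java.nio.file` API before they would be uploaded. It assumes line-oriented files that each end with a newline and an output path outside the input directory; the class and method names are illustrative.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class SmallFileMerger {
    // Concatenates every regular file in inputDir (sorted by name) into output.
    // Returns the number of bytes written.
    static long merge(Path inputDir, Path output) throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(inputDir)) {
            files = listing.filter(Files::isRegularFile)
                           .sorted()
                           .collect(Collectors.toList());
        }
        long total = 0;
        try (OutputStream out = Files.newOutputStream(output)) {
            for (Path file : files) {
                total += Files.copy(file, out); // appends the file's bytes to out
            }
        }
        return total;
    }
}
```

A call such as `SmallFileMerger.merge(Path.of("logs"), Path.of("merged.log"))` would produce one file to upload instead of many; the directory and file names here are placeholders.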

What if the source is not a text file?
Pipeline pipeline = new MRPipeline(SimpleNaiveMapReduce.class, getConf());
DataBaseSource<Vistors> dbsrc = new DataBaseSource.Builder<Vistors>(Vistors.class)
    .setDriverClass(org.h2.Driver.class)
    .setUrl("jdbc://…").setUsername("root").setPassword("")
    .selectSQLQuery("SELECT URL, UID FROM TEST")
    .countSQLQuery("select count(*) from Test").build();
PCollection<Vistors> rows = pipeline.read(dbsrc);
// The rest of the pipeline is built as before; the DoFn must accept Vistors records instead of text lines.
PTable<String, Integer> visitors = rows.parallelDo("Count Visitors",
    new NaiveCountVisitors(),
    Writables.tableOf(Writables.strings(), Writables.ints()));
…

Or create your own data source…
public class RedisSource<T extends Writable> implements Source<T> {
    @Override
    public void configureSource(org.apache.hadoop.mapreduce.Job job, int inputId)
            throws IOException {
        Configuration configuration = job.getConfiguration();
        RedisConfiguration.configureDB(configuration,
            redisMasters, dbNumber, dataStructure);
        job.setInputFormatClass(RedisInputFormat.class);
        RedisInputFormat.setInput(job, inputClass, redisMasters,
            dbNumber, sliceExpression, maxRecordsToReturn,
            dataStructure, null, null);
    }
}
- Read
- getSplits
- Write
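Of the three pieces listed, `getSplits` is the one that decides how much parallelism the source allows, because each split can be read by a separate task. The slides do not show how the Redis input format slices its data, so the sketch below is a generic illustration under that caveat: it divides a known record count into contiguous ranges. The class, method, and parameter names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class RangeSplitter {
    // Returns [start, end) ranges that cover 0..totalRecords using at most `slices` ranges.
    static List<long[]> split(long totalRecords, int slices) {
        List<long[]> ranges = new ArrayList<>();
        if (totalRecords <= 0 || slices <= 0) {
            return ranges; // nothing to read
        }
        long size = (totalRecords + slices - 1) / slices; // ceiling division
        for (long start = 0; start < totalRecords; start += size) {
            ranges.add(new long[] {start, Math.min(start + size, totalRecords)});
        }
        return ranges;
    }
}
```

Each returned range would correspond to one input split, and a record reader would then fetch only the records inside its own range.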

Make sure your Big Data is
really Big
Optimize your code. It will be
run MANY times
Design your pipeline to
maximize parallelism

In the cloud you have virtually
unlimited resources.
But the cost is virtually unlimited too.
