Optimized Big Data: Efficient Architectures for Building MapReduce Pipelines
Fabiane Bizinella Nardon
@fabianenardon
Me and Big Data
How big is BIG?
HOW TO KNOW IF YOUR DATA IS REALLY BIG:
Your data no longer fits on a single machine
by Fernando Stankuns
HOW TO KNOW IF YOUR DATA IS REALLY BIG:
You talk more in Terabytes than in Gigabytes
HOW TO KNOW IF YOUR DATA IS REALLY BIG:
The amount of data you process grows constantly. And it is expected to double next year.
by Saulo Cruz
FOR EVERYTHING ELSE:
KEEP IT SIMPLE!
Hadoop, HBase, Hive, Crunch, HDFS, Cascading, Pig, Mahout, Redis, MongoDB, MySQL, Cassandra
[Diagram: Data → Map → Reduce → New Data, which in turn feeds another Map → Reduce stage]
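Chaining jobs by hand, without a pipeline tool, looks roughly like this (a minimal sketch with hypothetical paths and job names; mapper and reducer setup omitted): the output directory of the first job becomes the input of the next.

Configuration conf = new Configuration();

// Job 1: process the raw data
Job first = Job.getInstance(conf, "stage-1");
FileInputFormat.addInputPath(first, new Path("/data/input"));
FileOutputFormat.setOutputPath(first, new Path("/data/stage-1"));
first.waitForCompletion(true);

// Job 2: consume the output of job 1 as its input
Job second = Job.getInstance(conf, "stage-2");
FileInputFormat.addInputPath(second, new Path("/data/stage-1"));
FileOutputFormat.setOutputPath(second, new Path("/data/output"));
second.waitForCompletion(true);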
Pipeline (Example)
http://www.tailtarget.com/home/
http://cnn.com/news
http://www.tailtarget.com/about/
http://www.tailtarget.com/home/ - Technology
http://cnn.com/news - News
http://www.tailtarget.com/about/ - Technology
u=0C010003 - Technology
u=12070002 - News
u=00AD0e12 - Technology
u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
u=0C010003 - http://www.tailtarget.com/home/
u=12070002 - http://cnn.com/news
u=00AD0e12 - http://www.tailtarget.com/about/
MapReduce Pipelines
- Tools -
Orchestrate
Chain
Optimize
Hadoop, HBase, Hive, Crunch, HDFS, Cascading, Pig, Mahout, Redis, MongoDB, MySQL, Cassandra
Apache Crunch
A library for building MapReduce pipelines on top of Hadoop
Interleaves and orchestrates different MapReduce functions
As a bonus, it optimizes and simplifies the MapReduce implementation
Based on FlumeJava: Easy, Efficient Data-Parallel Pipelines (Google, 2010)
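For example (a minimal sketch; ExtractUrlFn and HostToOneFn are hypothetical DoFns, not code from this talk), two chained parallelDo() calls do not produce two jobs: the planner can fuse them into a single map phase, so no intermediate data is materialized between them.

PCollection<String> lines = pipeline.readTextFile("my/file");
PCollection<String> urls = lines.parallelDo(new ExtractUrlFn(), Writables.strings());
PTable<String, Integer> hosts = urls.parallelDo(new HostToOneFn(),
        Writables.tableOf(Writables.strings(), Writables.ints()));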
Crunch – Anatomy of a Pipeline
[Diagram, built up over four slides: a Data Source in HDFS is read and processed by DoFn functions invoked through parallelDo(); each step produces a PCollection*; chained DoFns run in parallel across Hadoop Node 1 and Hadoop Node 2, and a Write step sends the results to one or more Data Targets in HDFS]
* PCollection, PTable or PGroupedTable
u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
tailtarget.com – 2
cnn.com - 1
// Read a text file, count visits per host, and write the counts back to HDFS
Pipeline pipeline = new MRPipeline(SimpleNaiveMapReduce.class, getConf());
PCollection<String> lines = pipeline.readTextFile("my/file");
PTable<String, Integer> visitors = lines.parallelDo("Count Visitors",
        new NaiveCountVisitors(),
        Writables.tableOf(Writables.strings(), Writables.ints()));
PGroupedTable<String, Integer> grouped = visitors.groupByKey();
PTable<String, Integer> counts = grouped.combineValues(Aggregators.<String>SUM_INTS());
pipeline.writeTextFile(counts, "my/output/file");
PipelineResult pipelineResult = pipeline.done();
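A minimal sketch (assuming SimpleNaiveMapReduce extends Configured and implements Tool, which the getConf() call suggests but the slides do not show) of how such a pipeline is typically launched:

public class SimpleNaiveMapReduce extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // ... the pipeline code above goes here ...
        return 0;
    }
    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new SimpleNaiveMapReduce(), args));
    }
}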
public class NaiveCountVisitors extends DoFn<String, Pair<String, Integer>> {
    public NaiveCountVisitors() {
    }
    public void process(String line, Emitter<Pair<String, Integer>> emitter) {
        try {
            String[] parts = line.split(" ");
            URL url = new URL(parts[2]);
            // Emit one count per visit, keyed by the page's host
            emitter.emit(Pair.of(url.getHost(), 1));
        } catch (MalformedURLException e) {
            // Skip malformed log lines
        }
    }
}
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private Text page = new Text();
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] parts = line.split(" ");
        try {
            page.set(new URL(parts[2]).getHost());
            context.write(page, one);
        } catch (MalformedURLException e) {
            // Skip malformed log lines
        }
    }
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable counter = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
            count = count + value.get();
        }
        counter.set(count);
        context.write(key, counter);
    }
}
In Crunch, the same shuffle and aggregation collapse to:
PGroupedTable<String, Integer> grouped = visitors.groupByKey();
PTable<String, Integer> counts = grouped.combineValues(Aggregators.<String>SUM_INTS());
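In Crunch, the Aggregator passed to combineValues() is used both as a map-side combiner and in the reduce phase, so no hand-written Reducer is needed. Other built-in aggregators can be dropped in the same way (a small sketch following the same pattern as above, assuming the same grouped table):

PTable<String, Integer> maxPerHost = grouped.combineValues(Aggregators.<String>MAX_INTS());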
MapReduce (without Crunch)
MapReduce (with Crunch)
[Diagram: the MapReduce data flow — HDFS chunks → Record Reader → Map → Combine → Local Storage → Copy → Sort → Reduce]
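The Combine step in the diagram above is what Crunch's combineValues() sets up automatically; in hand-written MapReduce you would enable it yourself, roughly like this (a minimal sketch of the job setup, reusing the Map and Reduce classes shown earlier; the job name is hypothetical):

Job job = Job.getInstance(getConf(), "count-visitors");
job.setMapperClass(Map.class);
job.setCombinerClass(Reduce.class);   // local pre-aggregation before data crosses the network
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);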
Counters: plain MapReduce vs. with Crunch (values are Map / Reduce / Total)
Counter | MapReduce | With Crunch
SLOTS_MILLIS_MAPS | 0 / 0 / 1,635,906 | 0 / 0 / 1,434,544
SLOTS_MILLIS_REDUCES | 0 / 0 / 870,082 | 0 / 0 / 384,755
FILE_BYTES_WRITTEN | 1,907,284,471 / 956,106,354 / 2,863,390,825 | 3,776,871 / 681,575 / 4,458,446
Map input records | 33,809,720 / 0 / 33,809,720 | 33,809,720 / 0 / 33,809,720
Map output records | 33,661,880 / 0 / 33,661,880 | 33,661,880 / 0 / 33,661,880
Combine input records | 0 / 0 / 0 | 33,714,223 / 0 / 33,714,223
Combine output records | 0 / 0 / 0 | 74,295 / 0 / 74,295
Reduce input records | 0 / 33,661,880 / 33,661,880 | 0 / 21,952 / 21,952
Reduce output records | 0 / 343 / 343 | 0 / 343 / 343
Map output bytes | 888,738,480 / 0 / 888,738,480 | 888,738,480 / 0 / 888,738,480
Map output materialized bytes | 956,063,008 / 0 / 956,063,008 | 657,536 / 0 / 657,536
Reduce shuffle bytes | 0 / 940,985,238 / 940,985,238 | 0 / 657,536 / 657,536
Physical memory (bytes) snapshot | 11,734,376,448 / 527,491,072 / 12,261,867,520 | 12,008,472,576 / 86,654,976 / 12,095,127,552
Spilled Records | 67,103,496 / 33,661,880 / 100,765,376 | 74,295 / 21,952 / 96,247
Total committed heap usage (bytes) | 10,188,226,560 / 396,902,400 / 10,585,128,960 | 10,188,226,560 / 59,441,152 / 10,247,667,712
CPU time spent (ms) | 456,03 / 79,84 / 535,87 | 450,26 / 6,23 / 456,49
(Note how FILE_BYTES_WRITTEN and the shuffle bytes drop dramatically once Crunch adds the combine step.)
private Map<String, Integer> items = null;

public void initialize() {
    items = new HashMap<String, Integer>();
}

public void process(String line, Emitter<Pair<String, Integer>> emitter) {
    String[] parts = line.split(" ");
    String host = new URL(parts[2]).getHost();
    // Pre-combine in memory: keep one partial count per host instead of emitting one pair per line
    Integer value = items.get(host);
    if (value == null) {
        items.put(host, 1);
    } else {
        items.put(host, value + 1);
    }
}

public void cleanup(Emitter<Pair<String, Integer>> emitter) {
    // Emit each host's partial count only once, at the end of the map task
    for (Entry<String, Integer> item : items.entrySet()) {
        emitter.emit(Pair.of(item.getKey(), item.getValue()));
    }
}
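To use it, the pre-combining function simply replaces NaiveCountVisitors in the same pipeline (a small sketch; PreCombiningCountVisitors is a hypothetical name for a DoFn containing the methods above):

PTable<String, Integer> visitors = lines.parallelDo("Count Visitors",
        new PreCombiningCountVisitors(),
        Writables.tableOf(Writables.strings(), Writables.ints()));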
Crunch with pre-combining
Counters: with Crunch vs. with pre-combining (values are Map / Reduce / Total)
Counter | With Crunch | With Pre-Combining
SLOTS_MILLIS_MAPS | 0 / 0 / 1,434,544 | 0 / 0 / 1,151,037
SLOTS_MILLIS_REDUCES | 0 / 0 / 384,755 | 0 / 0 / 614,053
FILE_BYTES_WRITTEN | 3,776,871 / 681,575 / 4,458,446 | 2,225,432 / 706,037 / 2,931,469
Map input records | 33,809,720 / 0 / 33,809,720 | 33,809,720 / 0 / 33,809,720
Map output records | 33,661,880 / 0 / 33,661,880 | 21,952 / 0 / 21,952
Combine input records | 33,714,223 / 0 / 33,714,223 | 21,952 / 0 / 21,952
Combine output records | 74,295 / 0 / 74,295 | 21,952 / 0 / 21,952
Reduce input records | 0 / 21,952 / 21,952 | 0 / 21,952 / 21,952
Reduce output records | 0 / 343 / 343 | 0 / 343 / 343
Map output bytes | 888,738,480 / 0 / 888,738,480 | 613,248 / 0 / 613,248
Map output materialized bytes | 657,536 / 0 / 657,536 | 657,92 / 0 / 657,92
Reduce shuffle bytes | 0 / 657,536 / 657,536 | 0 / 657,92 / 657,92
Physical memory (bytes) snapshot | 12,008,472,576 / 86,654,976 / 12,095,127,552 | 12,065,873,920 / 171,278,336 / 12,237,152,256
Spilled Records | 74,295 / 21,952 / 96,247 | 21,952 / 21,952 / 43,904
Total committed heap usage (bytes) | 10,188,226,560 / 59,441,152 / 10,247,667,712 | 10,188,226,560 / 118,882,304 / 10,307,108,864
CPU time spent (ms) | 450,26 / 6,23 / 456,49 | 295,79 / 9,86 / 305,65
(Map output records drop from 33,661,880 to 21,952, so almost nothing has to be spilled or shuffled.)
Bottlenecks are usually caused by the amount of data that travels over the network
Architecting the Pipeline
http://www.tailtarget.com/home/
http://cnn.com/news
http://www.tailtarget.com/about/
http://www.tailtarget.com/home/ - Technology
http://cnn.com/news - News
http://www.tailtarget.com/about/ - Technology
u=0C010003 - Technology
u=12070002 - News
u=00AD0e12 - Technology
u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
u=0C010003 - http://www.tailtarget.com/home/
u=12070002 - http://cnn.com/news
u=00AD0e12 - http://www.tailtarget.com/about/
Architecting the Pipeline
http://www.tailtarget.com/home/
http://cnn.com/news
http://www.tailtarget.com/about/
http://www.tailtarget.com/home/ - Technology
http://cnn.com/news - News
http://www.tailtarget.com/about/ - Technology
u=0C010003 - Technology
u=12070002 - News
u=00AD0e12 - Technology
u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
u=0C010003 - http://www.tailtarget.com/home/
u=12070002 - http://cnn.com/news
u=00AD0e12 - http://www.tailtarget.com/about/
Merge
[Diagram: pipeline stages numbered 1–6, with a Merge step combining intermediate outputs]
Architecting the Pipeline
{http://www.tailtarget.com/home/, [u=0C010003], Technology}
{http://cnn.com/news, [u=12070002], News}
{http://www.tailtarget.com/about/, [u=00AD0e12], Technology}
{http://www.tailtarget.com/home/, [u=0C010003]}
{http://cnn.com/news, [u=12070002]}
{http://www.tailtarget.com/about/, [u=00AD0e12]}
u=0C010003 - Technology
u=12070002 - News
u=00AD0e12 - Technology
u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
[Diagram: the data sets above numbered 1–4]
Do as much as you can with the data you already have in hand
Architecting the Pipeline
http://www.tailtarget.com/home/
http://cnn.com/news
http://www.tailtarget.com/about/
http://www.tailtarget.com/home/ - Technology
http://cnn.com/news - News
http://www.tailtarget.com/about/ - Technology
u=0C010003 - Technology
u=12070002 - News
u=00AD0e12 - Technology
u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
u=0C010003 - http://www.tailtarget.com/home/
u=12070002 - http://cnn.com/news
u=00AD0e12 - http://www.tailtarget.com/about/
Merge
Architecting the Pipeline
http://www.tailtarget.com/home/
http://cnn.com/news
http://www.tailtarget.com/home/ - Technology
http://cnn.com/news - News
u=0C010003 - Technology
u=12070002 - News
u=00AD0e12 - Technology
Redis
[Diagram: the re-architected pipeline, stages numbered 1–6, with Redis used as an intermediate data store]
(1) u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
    u=12070002 - http://cnn.com/news - 189.19.123.161
    u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
(2) u=0C010003 - http://www.tailtarget.com/home/
    u=12070002 - http://cnn.com/news
    u=00AD0e12 - http://www.tailtarget.com/about/
Architecting the Pipeline
u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
u=0C010003 - http://www.tailtarget.com/home/
u=12070002 - http://cnn.com/news
u=00AD0e12 - http://www.tailtarget.com/about/
http://www.tailtarget.com/home/
http://cnn.com/news
http://www.tailtarget.com/home/ - Technology
http://cnn.com/news - News
u=0C010003 - Technology
u=12070002 - News
u=00AD0e12 - Technology
Redis
[Diagram: the numbered data sets shared by two pipelines]
Pipeline A: Input: 1; Output: 2, 4
Pipeline B: Input: 1; Output: 3, 6
// After the pipeline runs, dump Crunch's execution plan as a Graphviz .dot file
PipelineResult pipelineResult = pipeline.done();
String dotFileContents = pipeline.getConfiguration()
        .get(PlanningParameters.PIPELINE_PLAN_DOTFILE);
FileUtils.writeStringToFile(
        new File("/tmp/logpipelinegraph.dot"), dotFileContents);
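The resulting file can be rendered with Graphviz to see how Crunch grouped the DoFns into MapReduce jobs, for example:

dot -Tpng /tmp/logpipelinegraph.dot -o pipeline-plan.png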
Maximize parallelism
Nothing has more impact on your application's performance than optimizing your own code
SmallData
*It's big data, remember?
Hadoop/HDFS don't handle small files well.
What if the source is not a text file?
Pipeline pipeline = new MRPipeline(SimpleNaiveMapReduce.class, getConf());
DataBaseSource<Vistors> dbsrc = new DataBaseSource.Builder<Vistors>(Vistors.class)
        .setDriverClass(org.h2.Driver.class)
        .setUrl("jdbc://…").setUsername("root").setPassword("")
        .selectSQLQuery("SELECT URL, UID FROM TEST")
        .countSQLQuery("select count(*) from Test").build();
PCollection<Vistors> visitors = pipeline.read(dbsrc);
PTable<String, Integer> counts = visitors.parallelDo("Count Visitors",
        new NaiveCountVisitors(),
        Writables.tableOf(Writables.strings(), Writables.ints()));
…
PipelineResult pipelineResult = pipeline.done();
Or create your own data source…
public class RedisSource<T extends Writable> implements Source<T> {
    @Override
    public void configureSource(org.apache.hadoop.mapreduce.Job job, int inputId)
            throws IOException {
        Configuration configuration = job.getConfiguration();
        // RedisConfiguration and RedisInputFormat are custom classes written for this source
        RedisConfiguration.configureDB(configuration,
                redisMasters, dbNumber, dataStructure);
        job.setInputFormatClass(RedisInputFormat.class);
        RedisInputFormat.setInput(job, inputClass, redisMasters,
                dbNumber, sliceExpression, maxRecordsToReturn,
                dataStructure, null, null);
    }
}
What you need to implement (sketched below):
- Read
- getSplits
- Write
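As a rough illustration of the Read and getSplits pieces listed above, a custom Hadoop InputFormat has this shape (a minimal sketch; RedisSplit, RedisRecordReader and RedisConfiguration.getMasters are hypothetical names, not the real implementation):

public class RedisInputFormat<T extends Writable> extends InputFormat<LongWritable, T> {
    @Override
    public List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException {
        // One split per Redis master, so each mapper reads from a different node
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (String master : RedisConfiguration.getMasters(context.getConfiguration())) {
            splits.add(new RedisSplit(master));
        }
        return splits;
    }

    @Override
    public RecordReader<LongWritable, T> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        // Iterates the keys and values stored in the node assigned to this split
        return new RedisRecordReader<T>();
    }
}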
Make sure your Big Data really is Big
Optimize your code. It will be executed MANY times
Design your pipeline to maximize parallelism
In the cloud you have virtually unlimited resources.
But the cost is unlimited too.
Optimized Big Data: Efficient Architectures for Building MapReduce Pipelines
Fabiane Bizinella Nardon
@fabianenardon