Dr. Rubén Casado 
ruben.casado@treelogic.com 
ruben_casado 
Processing paradigms in Big Data: current state, trends and opportunities 
Universidad Complutense de Madrid 
19 November 2014
Agenda 
1. Big Data processing 
2. Batch processing 
3. Streaming processing 
4. Hybrid computation model 
5. Open Issues & Conclusions
Academics 
• PhD in Software Engineering 
• MSc in Computer Science 
• BSc in Computer Science 
Work Experience
Agenda 
1. Big Data processing 
2. Batch processing 
3. Streaming processing 
4. Hybrid computation model 
5. Open Issues & Conclusions
What is Big Data? 
A massive volume of both structured and unstructured data that is so large that it is difficult to process with traditional database and software techniques 
Big Data are high-volume, high-velocity and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization 
- Gartner IT Glossary -
3 problems 
Volume 
Variety 
Velocity
3 solutions 
Batch processing 
NoSQL 
Streaming processing
Volume 
Variety 
Velocity 
Science or Engineering?
Science or Engineering? 
Volume 
Variety 
Value 
Velocity
Science or Engineering? 
Volume 
Variety 
Value 
Velocity 
Software Engineering
Data Science
NoSQL 
• Relational Databases 
  o Schema based 
  o ACID (Atomicity, Consistency, Isolation, Durability) 
  o Performance penalty 
  o Scalability issues 
• NoSQL 
  o Not Only SQL 
  o Families of solutions 
  o Google BigTable, Amazon Dynamo 
  o BASE = Basically Available, Soft state, Eventually consistent 
  o CAP = Consistency, Availability, Partition tolerance
NoSQL 
• Key-value 
  o Key: ID; Value: associated data 
  o Dictionary model 
  o LinkedIn Voldemort, Riak, Redis, Memcache, Membase 
  o Example: CR7: 'Cristiano Ronaldo' 
• Document 
  o More complex than key-value 
  o Documents are indexed by ID 
  o Multiple indexes 
  o MongoDB, CouchDB 
  o Example: CR7: {Name: 'Cristiano', Surname: 'Ronaldo', Age: 29} 
• Column 
  o Tables with predefined families of fields 
  o Fields within families are flexible 
  o Vertical and horizontal partitioning 
  o HBase, Cassandra 
  o Example: CR7: [Personal: {Name: 'Cristiano', Surname: 'Ronaldo', Age: 29}, Job: {Team: 'R. Madrid', Salary: 20,000,000}] 
• Graph 
  o Nodes and relationships 
  o Neo4j, FlockDB, OrientDB 
  o Example: [CR] -is_named-> [Cristiano], [CR] -plays_for-> [R.Madrid]
Batch processing (Volume) 
• Scalable 
• Large amount of static data 
• Distributed 
• Parallel 
• Fault tolerant 
• High latency
Streaming processing (Velocity) 
• Low latency 
• Continuous unbounded streams of data 
• Distributed 
• Parallel 
• Fault-tolerant
Hybrid computation model (Volume + Velocity) 
• Low latency: real-time 
• Massive data-at-rest + data-in-motion 
• Scalable 
• Combine batch and streaming results
Hybrid computation model 
• All data -> Batch processing -> Batch results 
• New data -> Streaming processing -> Stream results 
• Batch results + Stream results -> Combination -> Final results
Processing Paradigms 
• Inception: 2003 
• 1st Generation (2006): Batch processing. Large amount of static data, scalable solution. Volume 
• 2nd Generation (2010): Streaming processing. Computing streaming data, low latency. Velocity 
• 3rd Generation (2014): Hybrid computation. Lambda Architecture. Volume + Velocity
+10 years of Big Data processing technologies (Batch, Streaming, Hybrid) 
• 2003: The Google File System 
• 2004: MapReduce: Simplified Data Processing on Large Clusters 
• 2005: Doug Cutting starts developing Hadoop 
• 2006: Yahoo! starts working on Hadoop 
• 2008: Apache Hadoop is in production 
• 2009: Facebook creates Hive; Yahoo! creates Pig 
• 2010: Yahoo! creates S4; Cloudera presents Flume 
• 2011: Nathan Marz creates Storm; LinkedIn presents Kafka 
• 2012: Nathan Marz defines the Lambda Architecture 
• 2013: MillWheel: Fault-Tolerant Stream Processing at Internet Scale; LinkedIn presents Samza 
• 2014: Spark stack is open sourced; Lambdoop & Summingbird first steps; Stratosphere becomes Apache Flink
Processing Pipeline 
DATA ACQUISITION -> DATA STORAGE -> DATA ANALYSIS -> RESULTS
Air Quality case study 
• Static stations and mobile sensors in Asturias sending streaming data 
• Historical data of > 10 years 
• Monitoring, trends identification, predictions
Agenda 
1. Big Data processing overview 
2. Batch processing 
3. Real-time processing 
4. Hybrid computation model 
5. Open Issues & Conclusions
Batch processing technologies 
• DATA ACQUISITION: HDFS commands, Sqoop, Flume, Scribe 
• DATA STORAGE: HDFS, HBase 
• DATA ANALYSIS: MapReduce, Hive, Pig, Cascading, Spark, Spark SQL (Shark) 
• RESULTS
HDFS commands (DATA ACQUISITION / BATCH) 
• Import to HDFS 
hadoop dfs -copyFromLocal <path-to-local> <path-to-remote> 
hadoop dfs -copyFromLocal /home/hduser/AirQuality/ /hdfs/AirQuality/
Sqoop (DATA ACQUISITION / BATCH) 
• Tool designed for transferring data between HDFS/HBase and structured datastores 
• Based on MapReduce 
• Includes connectors for multiple databases 
  o MySQL, PostgreSQL, Oracle, SQL Server and DB2 
  o Generic JDBC connector 
• Java API
Sqoop (DATA ACQUISITION / BATCH) 
1) Import data from the database to HDFS 
sqoop import-all-tables --connect jdbc:mysql://localhost/testDatabase --target-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1 
2) Analyze the data (Hadoop) 
3) Export the results to the database 
sqoop export --connect jdbc:mysql://localhost/testDatabase --export-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1
Flume (DATA ACQUISITION / BATCH) 
• Service for collecting, aggregating, and moving large amounts of log data 
• Simple and flexible architecture based on streaming data flows 
• Reliability, scalability, extensibility, manageability 
• Supported log stream types 
  o Avro 
  o Syslog 
  o Netcat
Flume (DATA ACQUISITION / BATCH) 
• Architecture 
  o Source: waits for events 
  o Channel: stores the information until it is consumed by the sink 
  o Sink: sends the information towards another agent or system 
• Built-in components 
  o Sources: Avro, Thrift, Exec, JMS, NetCat, Syslog TCP/UDP, HTTP, Custom 
  o Channels: Memory, JDBC, File 
  o Sinks: HDFS, Logger, Avro, Thrift, IRC, File Roll, Null, HBase, Custom
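A minimal agent configuration wiring one source, one channel and one sink could look like the sketch below; the agent name, syslog port and HDFS path are illustrative assumptions, not taken from the original setup. 
# Hypothetical flume.conf sketch (names and port assumed) 
agent.sources = syslog-src 
agent.channels = mem-ch 
agent.sinks = hdfs-sink 
# Syslog TCP source where the stations push their records 
agent.sources.syslog-src.type = syslogtcp 
agent.sources.syslog-src.host = 0.0.0.0 
agent.sources.syslog-src.port = 5140 
agent.sources.syslog-src.channels = mem-ch 
# In-memory channel buffering events until the sink consumes them 
agent.channels.mem-ch.type = memory 
agent.channels.mem-ch.capacity = 10000 
# HDFS sink writing into the folder used by the batch examples 
agent.sinks.hdfs-sink.type = hdfs 
agent.sinks.hdfs-sink.hdfs.path = /hdfs/AirQuality/ 
agent.sinks.hdfs-sink.hdfs.fileType = DataStream 
agent.sinks.hdfs-sink.channel = mem-ch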
Flume (DATA ACQUISITION / BATCH) 
• Air quality syslogs: stations send the information to the servers. Flume collects this information and moves it into HDFS for further analysis. 
Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";
Scribe (DATA ACQUISITION / BATCH) 
• Server for aggregating log data streamed in real time from a large number of servers 
• There is a Scribe server running on every node in the system, configured to aggregate messages and send them to a central Scribe server (or servers) in larger groups 
• The central Scribe server(s) can write the messages to the files that are their final destination
Scribe (DATA ACQUISITION / BATCH) 
• Sending a sensor message to a Scribe server 
category = 'mobile' 
# '1; 43.5298; -5.6734; 2000-01-01; 23; 89; 1.97; …' 
message = sensor_log.readLine() 
log_entry = scribe.LogEntry(category, message) 
# Create a Scribe client 
client = scribe.Client(iprot=protocol, oprot=protocol) 
transport.open() 
result = client.Log(messages=[log_entry]) 
transport.close()
HDFS (DATA STORAGE / BATCH) 
• Distributed file system for Hadoop 
• Master-slave architecture (NameNode - DataNodes) 
  o NameNode: manages the directory tree and regulates access to files by clients 
  o DataNodes: store the data 
• Files are split into blocks of the same size and these blocks are stored and replicated in a set of DataNodes
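The same import shown with the shell commands can be done through the Hadoop FileSystem Java API; a minimal sketch, assuming the Hadoop client libraries are on the classpath and reusing the illustrative paths from the earlier slide: 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 

public class HdfsCopy { 
    public static void main(String[] args) throws Exception { 
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml 
        FileSystem fs = FileSystem.get(conf);       // client talking to the NameNode 
        // Copy the local air-quality folder into HDFS (paths are illustrative) 
        fs.copyFromLocalFile(new Path("/home/hduser/AirQuality"), new Path("/hdfs/AirQuality")); 
        fs.close(); 
    } 
}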
HBase (DATA STORAGE / BATCH) 
• Open-source, non-relational, distributed, column-oriented database modeled after Google's BigTable 
• Random, real-time read/write access to the data 
• Not a relational database 
  o Very light «schema» 
• Rows are stored in sorted order
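A minimal sketch of writing and reading one air-quality measure through the HBase Java client, assuming a table 'air_quality' with a column family 'measures' already exists; table, family and row key names are illustrative. 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.hbase.HBaseConfiguration; 
import org.apache.hadoop.hbase.client.Get; 
import org.apache.hadoop.hbase.client.HTable; 
import org.apache.hadoop.hbase.client.Put; 
import org.apache.hadoop.hbase.client.Result; 
import org.apache.hadoop.hbase.util.Bytes; 

public class AirQualityHBase { 
    public static void main(String[] args) throws Exception { 
        Configuration conf = HBaseConfiguration.create(); 
        HTable table = new HTable(conf, "air_quality");              // assumed table name 
        // Write one SO2 value for station 1 (row key = station + date) 
        Put put = new Put(Bytes.toBytes("1-2001-01-01")); 
        put.add(Bytes.toBytes("measures"), Bytes.toBytes("SO2"), Bytes.toBytes("7")); 
        table.put(put); 
        // Random real-time read of the same row 
        Result row = table.get(new Get(Bytes.toBytes("1-2001-01-01"))); 
        System.out.println(Bytes.toString(row.getValue(Bytes.toBytes("measures"), Bytes.toBytes("SO2")))); 
        table.close(); 
    } 
}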
MapReduce (DATA ANALYTICS / BATCH) 
• Framework for processing large amounts of data in parallel across a distributed cluster 
• Slightly inspired by the classic Divide and Conquer (D&C) strategy 
• The developer has to implement Map and Reduce functions: 
  o Map: takes the input, partitions it into smaller sub-problems, and distributes them to worker nodes, parsed into the format <K, V> 
  o Reduce: collects the <K, List(V)> pairs and generates the results
MapReduce (DATA ANALYTICS / BATCH) 
• Design Patterns 
  o Joins: reduce-side join, replicated join, semi join 
  o Sorting: secondary sort, total order sort 
  o Filtering 
  o Statistics: AVG, VAR, Count, ... 
  o Top-K 
  o Binning 
  o ...
MapReduce (DATA ANALYTICS / BATCH) 
• Obtain the SO2 average of each station 
Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23"; 
MapReduce (DATA ANALYTICS / BATCH) 
• Mappers read the input records and emit the SO2 value as <Station_ID, SO2_value> pairs (e.g. <1, 6>, <3, 1>, <2, 6>, ...), which are then shuffled so that all values of a station reach the same reducer
MapReduce (DATA ANALYTICS / BATCH) 
• Each reducer receives <Station_ID, List<SO2_value>> pairs, sums the values and divides by their count to produce the average of the station, e.g.: 
Station_ID, AVG_SO2 
1, 2.013 
2, 2.695 
3, 3.562
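A compact Java sketch of this job (mapper and reducer only; job setup omitted). It assumes the semicolon-separated format shown above, with the station id in field 0 and SO2 in field 5; class names are illustrative. 
import java.io.IOException; 
import org.apache.hadoop.io.DoubleWritable; 
import org.apache.hadoop.io.LongWritable; 
import org.apache.hadoop.io.Text; 
import org.apache.hadoop.mapreduce.Mapper; 
import org.apache.hadoop.mapreduce.Reducer; 

public class AvgSO2 { 
    public static class SO2Mapper extends Mapper<LongWritable, Text, Text, DoubleWritable> { 
        @Override 
        protected void map(LongWritable key, Text value, Context ctx) throws IOException, InterruptedException { 
            String[] f = value.toString().split(";"); 
            if (f.length > 5) { 
                String station = f[0].replace("\"", "").trim(); 
                double so2 = Double.parseDouble(f[5].replace("\"", "").trim()); 
                ctx.write(new Text(station), new DoubleWritable(so2));   // <Station_ID, SO2_value> 
            } 
        } 
    } 
    public static class AvgReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> { 
        @Override 
        protected void reduce(Text station, Iterable<DoubleWritable> values, Context ctx) throws IOException, InterruptedException { 
            double sum = 0; 
            long count = 0; 
            for (DoubleWritable v : values) { sum += v.get(); count++; } 
            ctx.write(station, new DoubleWritable(sum / count));          // average per station 
        } 
    } 
}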
Hive (DATA ANALYTICS / BATCH) 
• Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets 
• Abstraction layer on top of MapReduce 
• SQL-like language called HiveQL 
• Metastore: central repository of Hive metadata
Hive (DATA ANALYTICS / BATCH) 
• Obtain the SO2 average of each station 
CREATE TABLE air_quality (Estacion INT, Titulo STRING, latitud DOUBLE, longitud DOUBLE, Fecha STRING, SO2 INT, NO INT, CO FLOAT, …) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' 
LINES TERMINATED BY '\n' 
STORED AS TEXTFILE; 

LOAD DATA INPATH '/CalidadAire_Gijon' OVERWRITE INTO TABLE air_quality; 

SELECT Titulo, avg(SO2) 
FROM air_quality 
GROUP BY Titulo;
Pig (DATA ANALYTICS / BATCH) 
• Platform for analyzing large data sets 
• High-level language for expressing data analysis programs: Pig Latin, a data flow programming language 
• Abstraction layer on top of MapReduce 
• Procedural language
Pig (DATA ANALYTICS / BATCH) 
• Obtain the SO2 average of each station 
air_quality = LOAD '/CalidadAire_Gijon' USING PigStorage(';') AS (estacion:chararray, titulo:chararray, latitud:chararray, longitud:chararray, fecha:chararray, so2:double, no:chararray, co:chararray, pm10:chararray, o3:chararray, dd:chararray, vv:chararray, tmp:chararray, hr:chararray, prb:chararray, rs:chararray, ll:chararray, ben:chararray, tol:chararray, mxil:chararray, pm25:chararray); 
grouped = GROUP air_quality BY estacion; 
avg = FOREACH grouped GENERATE group, AVG(air_quality.so2); 
DUMP avg;
Cascading (DATA ANALYTICS / BATCH) 
• Cascading is a data processing API and processing query planner used for defining, sharing, and executing data-processing workflows 
• Makes development of complex Hadoop MapReduce workflows easy 
• Plays a similar role to Pig
Cascading (DATA ANALYTICS / BATCH) 
• Obtain the SO2 average of each station 
// define source and sink Taps 
Tap source = new Hfs(sourceScheme, inputPath); 
Scheme sinkScheme = new TextLine(new Fields("Estacion", "SO2")); 
Tap sink = new Hfs(sinkScheme, outputPath, SinkMode.REPLACE); 
Pipe assembly = new Pipe("avgSO2"); 
// group the tuples by station 
assembly = new GroupBy(assembly, new Fields("Estacion")); 
// for every tuple group, compute the SO2 average 
Aggregator avg = new Average(new Fields("SO2")); 
assembly = new Every(assembly, avg); 
// tell Hadoop which jar file to use and connect the flow 
Flow flow = flowConnector.connect("avg-SO2", source, sink, assembly); 
// execute the flow, block until complete 
flow.complete();
Spark (DATA ANALYTICS / BATCH) 
• Cluster computing system for faster data analytics 
• Not a modified version of Hadoop 
• Compatible with HDFS 
• In-memory data storage for very fast iterative processing 
• MapReduce-like engine 
• API in Scala, Java and Python
Spark (DATA ANALYTICS / BATCH) 
• Hadoop is slow due to replication, serialization and I/O tasks 
• Spark is 10x-100x faster
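A rough Java sketch of the same per-station SO2 average on Spark; the input path, field layout and class name are assumptions, and error handling is omitted. 
import org.apache.spark.SparkConf; 
import org.apache.spark.api.java.JavaPairRDD; 
import org.apache.spark.api.java.JavaRDD; 
import org.apache.spark.api.java.JavaSparkContext; 
import scala.Tuple2; 

public class SparkAvgSO2 { 
    public static void main(String[] args) { 
        SparkConf conf = new SparkConf().setAppName("avg-so2"); 
        JavaSparkContext sc = new JavaSparkContext(conf); 
        JavaRDD<String> lines = sc.textFile("/hdfs/AirQuality/");          // illustrative path 
        // <Station_ID, SO2_value> pairs, grouped per station and averaged 
        JavaPairRDD<String, Double> so2 = lines.mapToPair(line -> { 
            String[] f = line.split(";"); 
            String station = f[0].replace("\"", "").trim(); 
            double value = Double.parseDouble(f[5].replace("\"", "").trim()); 
            return new Tuple2<>(station, value); 
        }); 
        JavaPairRDD<String, Double> avg = so2.groupByKey().mapValues(values -> { 
            double sum = 0; 
            int count = 0; 
            for (double v : values) { sum += v; count++; } 
            return sum / count; 
        }); 
        avg.saveAsTextFile("/hdfs/AirQuality/avg_so2");                    // illustrative path 
        sc.stop(); 
    } 
}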
Spark SQL (DATA ANALYTICS / BATCH) 
• Large-scale data warehouse system for Spark 
• SQL on top of Spark (aka Shark) 
• Actually HiveQL over Spark 
• Up to 100x faster than Hive
Spark (DATA ANALYTICS / BATCH) 
Pros 
• Faster than the Hadoop ecosystem 
• Easier to develop new applications (Scala, Java and Python APIs) 
Cons 
• Not tested in extremely large clusters yet 
• Problems when the reducer's data does not fit in memory
Agenda 
1. Big Data processing 
2. Batch processing 
3. Streaming processing 
4. Hybrid computation model 
5. Open Issues & Conclusions
Real-time processing technologies 
• DATA ACQUISITION: Flume 
• DATA STORAGE: Kafka, Kestrel 
• DATA ANALYSIS: Flume, Storm, Trident, S4, Spark Streaming 
• RESULTS
Flume (DATA ACQUISITION / STREAM)
Kafka (DATA STORAGE / STREAM) 
• Kafka is a distributed, partitioned, replicated commit log service 
  o Producer/Consumer model 
  o Kafka maintains feeds of messages in categories called topics 
  o Kafka is run as a cluster
Kafka (DATA STORAGE / STREAM) 
• Insert the AirQuality sensor log file into the Kafka cluster and consume the info 
// new Producer 
Producer<String, String> producer = new Producer<String, String>(config); 
// open the sensor log file 
BufferedReader br = … 
String line; 
while (true) { 
    line = br.readLine(); 
    if (line == null) 
        … // wait 
    else 
        producer.send(new KeyedMessage<String, String>(topic, line)); 
}
Kafka (DATA STORAGE / STREAM) 
• AirQuality consumer 
ConsumerConnector consumer = Consumer.createJavaConsumerConnector(config); 
Map<String, Integer> topicCountMap = new HashMap<String, Integer>(); 
topicCountMap.put(topic, new Integer(1)); 
Map<String, List<KafkaMessageStream>> consumerMap = consumer.createMessageStreams(topicCountMap); 
KafkaMessageStream stream = consumerMap.get(topic).get(0); 
ConsumerIterator it = stream.iterator(); 
while (it.hasNext()) { 
    // consume it.next() 
}
Kestrel (DATA STORAGE / STREAM) 
• Simple distributed message queue 
• A single Kestrel server has a set of queues (strictly-ordered FIFO) 
• In a cluster of Kestrel servers, they don't know about each other and don't do any cross communication 
• Kestrel vs Kafka 
  o Kafka consumers are cheaper (basically just the bandwidth usage) 
  o Kestrel does not depend on ZooKeeper, which means it is operationally less complex if you don't already have a ZooKeeper installation 
  o Kafka has significantly better throughput 
  o Kestrel does not support ordered consumption
Flume (DATA ANALYTICS / STREAM) 
• Interceptor 
  o Interface org.apache.flume.interceptor.Interceptor 
  o Can modify or even drop events based on any criteria 
  o Flume supports chaining of interceptors 
  o Types: Timestamp, Host, Static, UUID, Morphline, Regex Filtering, Regex Extractor
Flume (DATA ANALYTICS / STREAM) 
• The sensors' information must be filtered by "Station 2" 
  o An interceptor will filter the information between Source and Channel 
Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982"; 
"2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35";"981"; "23"; 
"3";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23"; 
"2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23"; 
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62";"983"; "23"; 
Flume (DATA ANALYTICS / STREAM) 
# Write format can be text or writable 
… 
# Defining the channel (memory type) 
… 
# Defining the source (syslog) 
… 
# Defining the sink (HDFS) 
… 
# Defining the interceptor 
agent.sources.source.interceptors = i1 
agent.sources.source.interceptors.i1.type = org.apache.flume.interceptor.StationFilter 

// Sketch of the custom interceptor 
class StationFilter implements Interceptor 
… 
// keep only the events whose station field is "2" 
if (!station.equals("2")) 
    discard the event; 
else 
    keep the event;
Storm (DATA ANALYTICS / STREAM) 
• Distributed and scalable realtime computation system 
• Doing for real-time processing what Hadoop did for batch processing 
• Hadoop vs Storm: JobTracker vs Nimbus, TaskTracker vs Supervisor, Job vs Topology 
• Topology: processing graph. Each node contains processing logic (spouts and bolts). Links between nodes are streams of data 
  o Spout: source of streams. Reads a data source and emits the data into the topology as a stream 
  o Bolt: processing unit. Reads data from several streams, does some processing and possibly emits new streams 
  o Stream: unbounded sequence of tuples. Tuples can contain any serializable object
Storm (DATA ANALYTICS / STREAM) 
• AirQuality average values 
  o Step 1: build the topology 
  o CAReader (Spout) -> LineProcessor (Bolt) -> AvgValues (Bolt)
Storm (DATA ANALYTICS / STREAM) 
• AirQuality average values 
  o Step 1: build the topology 
TopologyBuilder AirAVG = new TopologyBuilder(); 
AirAVG.setSpout("ca-reader", new CAReader(), 1); 
// shuffleGrouping -> even distribution 
AirAVG.setBolt("ca-line-processor", new LineProcessor(), 3) 
    .shuffleGrouping("ca-reader"); 
// fieldsGrouping -> fields with the same value go to the same task 
AirAVG.setBolt("ca-avg-values", new AvgValues(), 2) 
    .fieldsGrouping("ca-line-processor", new Fields("id"));
Storm (DATA ANALYTICS / STREAM) 
• AirQuality average values 
  o Step 2: CAReader implementation (IRichSpout interface) 
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) { 
    // initialize the input file 
    BufferedReader br = new … 
    … 
} 
public void nextTuple() { 
    String line = br.readLine(); 
    if (line == null) { 
        return; 
    } else 
        collector.emit(new Values(line)); 
}
Storm (DATA ANALYTICS / STREAM) 
• AirQuality average values 
  o Step 3: LineProcessor implementation (IBasicBolt interface) 
public void declareOutputFields(OutputFieldsDeclarer declarer) { 
    declarer.declare(new Fields("id", "stationName", "lat", … 
} 
public void execute(Tuple input, BasicOutputCollector collector) { 
    collector.emit(new Values(input.getString(0).split(";"))); 
}
Storm (DATA ANALYTICS / STREAM) 
• AirQuality average values 
  o Step 4: AvgValues implementation (IBasicBolt interface) 
public void execute(Tuple input, BasicOutputCollector collector) { 
    // totals and counts are hashmaps with the accumulated values of each station 
    if (totals.containsKey(id)) { 
        item = totals.get(id); 
        count = counts.get(id); 
    } else { 
        // create a new item 
    } 
    // update values 
    item.setSo2(item.getSo2() + Integer.parseInt(input.getStringByField("so2"))); 
    item.setNo(item.getNo() + Integer.parseInt(input.getStringByField("no"))); 
    … 
}
66
Trident (DATA ANALYTICS / STREAM) 
• High-level abstraction on top of Storm 
  o Provides high-level operations (joins, filters, projections, aggregations, functions, ...) 
Pros 
  o Easy, powerful and flexible 
  o Incremental topology development 
  o Exactly-once semantics 
Cons 
  o Very few built-in functions 
  o Lower performance and higher latency than Storm
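A minimal Trident sketch for the air-quality stream, assuming a spout that emits one field "line" per sensor record and Storm 0.9.x package names; it maintains an exactly-once count of records per station (an average would need a custom aggregator). 
import backtype.storm.topology.IRichSpout; 
import backtype.storm.tuple.Fields; 
import backtype.storm.tuple.Values; 
import storm.trident.TridentTopology; 
import storm.trident.operation.BaseFunction; 
import storm.trident.operation.TridentCollector; 
import storm.trident.operation.builtin.Count; 
import storm.trident.testing.MemoryMapState; 
import storm.trident.tuple.TridentTuple; 

public class CATridentTopology { 
    // Extract the station id from the raw semicolon-separated record 
    public static class ExtractStation extends BaseFunction { 
        @Override 
        public void execute(TridentTuple tuple, TridentCollector collector) { 
            String[] f = tuple.getString(0).split(";"); 
            collector.emit(new Values(f[0].replace("\"", "").trim())); 
        } 
    } 

    public static TridentTopology build(IRichSpout caSpout) { 
        TridentTopology topology = new TridentTopology(); 
        topology.newStream("ca-reader", caSpout) 
                .each(new Fields("line"), new ExtractStation(), new Fields("station")) 
                .groupBy(new Fields("station")) 
                // incrementally maintained, exactly-once record count per station 
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")); 
        return topology; 
    } 
}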
S4 (DATA ANALYTICS / STREAM) 
• Simple Scalable Streaming System 
• Distributed, scalable, fault-tolerant platform for processing continuous unbounded streams of data 
• Inspired by MapReduce and the Actor model of computation 
  o Data processing is based on Processing Elements (PEs) 
  o Messages are transmitted between PEs in the form of events (Key, Attributes) 
  o Processing Nodes are the logical hosts of PEs
S4 (DATA ANALYTICS / STREAM) 
• AirQuality average values 
… 
<bean id="split" class="SplitPE"> 
  <property name="dispatcher" ref="dispatcher"/> 
  <property name="keys"> 
    <!-- Listen for the raw log lines --> 
    <list> 
      <value>LogLines *</value> 
    </list> 
  </property> 
</bean> 
<bean id="average" class="AveragePE"> 
  <property name="keys"> 
    <list> 
      <value>CAItem stationId</value> 
    </list> 
  </property> 
</bean> 
…
Spark Streaming (DATA ANALYTICS / STREAM) 
• Spark for real-time processing 
• Streaming computation as a series of very short batch jobs (windows) 
• Keeps state in memory 
• API similar to Spark
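A small Java sketch of the micro-batch model, assuming the sensor feed is reachable as a plain text socket; the host, port and batch size are illustrative. 
import org.apache.spark.SparkConf; 
import org.apache.spark.streaming.Duration; 
import org.apache.spark.streaming.api.java.JavaDStream; 
import org.apache.spark.streaming.api.java.JavaStreamingContext; 

public class StreamingSO2 { 
    public static void main(String[] args) throws Exception { 
        SparkConf conf = new SparkConf().setAppName("so2-stream"); 
        // each micro-batch (window) covers 10 seconds of sensor data 
        JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(10000)); 
        JavaDStream<String> lines = ssc.socketTextStream("sensor-gateway", 9999);   // assumed endpoint 
        // keep only the records of station 2, as in the Flume interceptor example 
        JavaDStream<String> station2 = lines.filter(l -> l.split(";")[0].replace("\"", "").trim().equals("2")); 
        station2.print(); 
        ssc.start(); 
        ssc.awaitTermination(); 
    } 
}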
Agenda 
1. Big Data processing 
2. Batch processing 
3. Streaming processing 
4. Hybrid computation model 
5. Open Issues & Conclusions
Hybrid Computation Model 
• We are at the beginning of this generation 
• Short-term Big Data processing goal 
• Abstraction layer over the Lambda Architecture 
• Promising technologies 
  o SummingBird 
  o Lambdoop
SummingBird (HYBRID COMPUTATION MODEL) 
• Library to write MapReduce-like processes that can be executed on Hadoop, Storm or a hybrid model 
• Scala syntax 
• The same logic can be executed in batch, real-time and hybrid batch/real-time mode
SummingBird (HYBRID COMPUTATION MODEL)
SummingBird (HYBRID COMPUTATION MODEL) 
Pros 
• Hybrid computation model 
• Same programming model for all processing paradigms 
• Extensible 
Cons 
• MapReduce-like programming 
• Scala 
• Not as abstract as some users would like
Lambdoop (HYBRID COMPUTATION MODEL) 
• Software abstraction layer over Open Source technologies 
  o Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident 
• Common patterns and operations (aggregation, filtering, statistics, ...) already implemented; no MapReduce-like processes 
• The same single API for the three processing paradigms 
  o Batch processing similar to Pig / Cascading 
  o Real-time processing using built-in functions, easier than Trident 
  o Hybrid computation model transparent for the developer
Lambdoop (HYBRID COMPUTATION MODEL) 
• Building blocks: Data (static data or streaming data), Operation, Workflow
Lambdoop (HYBRID COMPUTATION MODEL) 
• Batch processing 
DataInput db_historical = new StaticCSVInput(URI_db); 
Data historical = new Data(db_historical); 
Workflow batch = new Workflow(historical); 
Operation filter = new Filter("Station", "=", 2); 
Operation select = new Select("Titulo", "SO2"); 
Operation group = new Group("Titulo"); 
Operation average = new Average("SO2"); 
batch.add(filter); 
batch.add(select); 
batch.add(group); 
batch.add(average); 
batch.run(); 
Data results = batch.getResults(); 
…
Lambdoop (HYBRID COMPUTATION MODEL) 
• Streaming processing 
DataInput stream_sensor = new StreamXMLInput(URI_sensor); 
Data sensor = new Data(stream_sensor); 
Workflow streaming = new Workflow(sensor, new WindowsTime(100)); 
Operation filter = new Filter("Station", "=", 2); 
Operation select = new Select("Titulo", "SO2"); 
Operation group = new Group("Titulo"); 
Operation average = new Average("SO2"); 
streaming.add(filter); 
streaming.add(select); 
streaming.add(group); 
streaming.add(average); 
streaming.run(); 
while (true) { 
    Data live_results = streaming.getResults(); 
    … 
}
Lambdoop (HYBRID COMPUTATION MODEL) 
• Hybrid computation 
DataInput historical = new StaticCSVInput(URI_folder); 
DataInput stream_sensor = new StreamXMLInput(URI_sensor); 
Data all_info = new Data(historical, stream_sensor); 
Workflow hybrid = new Workflow(all_info, new WindowsTime(1000)); 
Operation filter = new Filter("Station", "=", 2); 
Operation select = new Select("Titulo", "SO2"); 
Operation group = new Group("Titulo"); 
Operation average = new Average("SO2"); 
hybrid.add(filter); 
hybrid.add(select); 
hybrid.add(group); 
hybrid.add(average); 
hybrid.run(); 
Data updated_results = hybrid.getResults();
Lambdoop (HYBRID COMPUTATION MODEL) 
Pros 
• High abstraction layer for all processing models 
• Covers all steps in the data processing pipeline 
• Same Java API for all programming paradigms 
• Extensible 
Cons 
• Ongoing project 
• Not open-source yet 
• Not tested in large clusters yet
Agenda 
1. Big Data processing 
2. Batch processing 
3. Streaming processing 
4. Hybrid computation model 
5. Open Issues & Conclusions
Open Issues 
• Interoperability between well-known techniques / technologies (SQL, R) and Big Data platforms (Hadoop, Spark) 
• European technologies (Stratosphere / Apache Flink) 
• Massive streaming machine learning 
• Real-time interactive visual analytics 
• Vertical (domain-driven) solutions
Conclusions 
Casado R., Younas M. Emerging trends and technologies in big data processing. Concurrency Computat.: Pract. Exper. 2014
Conclusions 
• Big Data is not only Hadoop 
• Identify the processing requirements of your project 
• Analyze the alternatives for all steps in the data pipeline 
• The battle for real-time processing is open 
• Stay tuned for the hybrid computation model
Thanks for your attention! Questions? 
ruben.casado@treelogic.com 
ruben_casado

2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 

Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportunidades

  • 20. +10 years of Big Data processing technologies (2003-2014), from batch to streaming to hybrid: The Google File System (2003); MapReduce: Simplified Data Processing on Large Clusters (2004); Doug Cutting starts developing Hadoop; Yahoo! starts working on Hadoop (2006); Apache Hadoop is in production; Facebook creates Hive and Yahoo! creates Pig (2009); Yahoo! creates S4; Nathan Marz creates Storm; LinkedIn presents Kafka; Cloudera presents Flume; Nathan Marz defines the Lambda Architecture (2012); MillWheel: Fault-Tolerant Stream Processing at Internet Scale; LinkedIn presents Samza; Spark stack is open sourced, Lambdoop & Summingbird first steps, Stratosphere becomes Apache Flink (2014)
  • 21. Processing Pipeline DATA ACQUISITION DATA STORAGE DATA ANALYSIS RESULTS
  • 22.  Static stations and mobile sensors in Asturias sending streaming data  Historical data of > 10 years  Monitoring, trends identification, predictions Air Quality case study
  • 23. 1. Big Data processing overview 2. Batch processing 3. Real-time processing 4. Hybrid computation model 5. Open Issues & Conclusions Agenda
  • 24. Batch processing technologies. DATA ACQUISITION: HDFS commands, Sqoop, Flume, Scribe. DATA STORAGE: HDFS, HBase. DATA ANALYSIS: MapReduce, Hive, Pig, Cascading, Spark, SparkSQL (Shark). RESULTS.
  • 25. • Import to HDFS:
hadoop dfs -copyFromLocal <path-to-local> <path-to-remote>
hadoop dfs -copyFromLocal /home/hduser/AirQuality/ /hdfs/AirQuality/
HDFS commands DATA ACQUISITION BATCH
  • 26. • Tool designed for transferring data between HDFS/HBase and structured datastores • Based on MapReduce • Includes connectors for multiple databases o MySQL o PostgreSQL o Oracle o SQL Server o DB2 o Generic JDBC connector • Java API Sqoop DATA ACQUISITION BATCH
  • 27. 1) Import data from the database to HDFS:
sqoop import-all-tables --connect jdbc:mysql://localhost/testDatabase --target-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1
2) Analyze data (HADOOP)
3) Export results to the database:
sqoop export --connect jdbc:mysql://localhost/testDatabase --export-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1
Sqoop DATA ACQUISITION BATCH
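The slide above also mentions Sqoop's Java API. As a rough sketch (not from the original deck; the "air_quality" table name and the connection details are illustrative), the single-table import tool can be launched programmatically with Sqoop.runTool:

import org.apache.sqoop.Sqoop;

public class ImportAirQuality {
  public static void main(String[] args) {
    // Same idea as the CLI import above, but launched from Java code.
    String[] sqoopArgs = new String[] {
        "import",
        "--connect", "jdbc:mysql://localhost/testDatabase",
        "--table", "air_quality",
        "--target-dir", "hdfs://rootHDFS/air_quality",
        "--username", "user1",
        "--password", "pass1",
        "-m", "1"
    };
    int exitCode = Sqoop.runTool(sqoopArgs);   // 0 means the underlying MapReduce job finished OK
    System.exit(exitCode);
  }
}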
  • 28. • Service for collecting, aggregating, and moving large amounts of log data • Simple and flexible architecture based on streaming data flows • Reliability, scalability, extensibility, manageability • Supported log stream types o Avro o Syslog o Netcat Flume DATA ACQUISITION BATCH
  • 29. Sources: Avro, Thrift, Exec, JMS, NetCat, Syslog TCP/UDP, HTTP, Custom. Channels: Memory, JDBC, File, Custom. Sinks: HDFS, Logger, Avro, Thrift, IRC, File Roll, Null, HBase, Custom. • Architecture o Source: waits for events o Sink: sends the information towards another agent or system o Channel: stores the information until it is consumed by the sink Flume DATA ACQUISITION BATCH
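A minimal agent configuration wiring those three pieces together might look like the sketch below (the agent, source, channel and sink names, the port and the HDFS path are illustrative, not from the original deck):

# flume.conf: one agent, syslog source -> memory channel -> HDFS sink
agent.sources = sensors
agent.channels = mem
agent.sinks = hdfs-sink

agent.sources.sensors.type = syslogtcp
agent.sources.sensors.host = 0.0.0.0
agent.sources.sensors.port = 5140
agent.sources.sensors.channels = mem

agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode/AirQuality/raw/
agent.sinks.hdfs-sink.channel = mem

Started with something like: bin/flume-ng agent --conf conf --conf-file flume.conf --name agent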
  • 30. Stations send the information to the servers. Flume collects this information and moves it into HDFS for further analysis.  Air quality syslogs Flume DATA ACQUISITION BATCH
Station; Tittle;latitude; longitude; Date ; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982";
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23";
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23";
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23";
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";
  • 31. • Server for aggregating log data streamed in real time from a large number of servers • There is a Scribe server running on every node in the system, configured to aggregate messages and send them to a central Scribe server (or servers) in larger groups • The central Scribe server(s) can write the messages to the files that are their final destination Scribe DATA ACQUISITION BATCH
  • 32. • Sending a sensor message to a Scribe server:
category = 'mobile';
// '1; 43.5298; -5.6734; 2000-01-01; 23; 89; 1.97; …'
message = sensor_log.readLine();
log_entry = scribe.LogEntry(category, message)
// Create a Scribe client
client = scribe.Client(iprot=protocol, oprot=protocol)
transport.open()
result = client.Log(messages=[log_entry])
transport.close()
Scribe DATA ACQUISITION BATCH
  • 33. • Distributed FileSystem for Hadoop • Master-Slaves Architecture (NameNode–DataNodes) o NameNode: manages the directory tree and regulates access to files by clients o DataNodes: store the data • Files are split into blocks of the same size and these blocks are stored and replicated in a set of DataNodes HDFS DATA STORAGE BATCH
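Besides the shell commands shown earlier, the same upload can be done from Java through the HDFS FileSystem API. A small sketch (the NameNode address and the paths are illustrative assumptions):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Connect to the NameNode that manages the directory tree
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
    // Upload the local air quality folder; its blocks are replicated across DataNodes
    fs.copyFromLocalFile(new Path("/home/hduser/AirQuality/"), new Path("/hdfs/AirQuality/"));
    // List what was written
    for (FileStatus status : fs.listStatus(new Path("/hdfs/AirQuality/"))) {
      System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
    }
    fs.close();
  }
}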
  • 34. • Open-source non-relational distributed column-oriented database modeled after Google's BigTable • Random, real-time read/write access to the data • Not a relational database o Very light «schema» • Rows are stored in sorted order DATA STORAGE BATCH HBase
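As an illustration of that light schema and of random read/write access, a sketch against the classic (0.9x-era) HBase client API; the table name, column family and row-key layout are assumptions, not from the deck:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class AirQualityHBase {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "air_quality");
    // Row key: station id + date; one 'measures' column family with flexible qualifiers
    Put put = new Put(Bytes.toBytes("1-2001-01-01"));
    put.add(Bytes.toBytes("measures"), Bytes.toBytes("so2"), Bytes.toBytes("7"));
    put.add(Bytes.toBytes("measures"), Bytes.toBytes("no"), Bytes.toBytes("8"));
    table.put(put);
    // Real-time random read of the same row
    Result row = table.get(new Get(Bytes.toBytes("1-2001-01-01")));
    System.out.println(Bytes.toString(row.getValue(Bytes.toBytes("measures"), Bytes.toBytes("so2"))));
    table.close();
  }
}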
  • 35. • Framework for processing large amounts of data in parallel across a distributed cluster • Slightly inspired by the classic Divide and Conquer (D&C) strategy • The developer has to implement Map and Reduce functions: o Map: takes the input, partitions it into smaller sub-problems and distributes them to worker nodes, parsed into the format <K, V> o Reduce: collects the <K, List(V)> and generates the results MapReduce DATA ANALYTICS BATCH
  • 36. • Design Patterns o Joins: reduce-side join, replicated join, semi join o Sorting: secondary sort, total order sort o Filtering o Statistics: AVG, VAR, Count, … o Top-K o Binning o … MapReduce DATA ANALYTICS BATCH
  • 37. • Obtain the SO2 average of each station MapReduce
Station; Tittle;latitude; longitude; Date ; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982";
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23";
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23";
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23";
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";
DATA ANALYTICS BATCH
  • 38. • Maps get records and produce the SO2 value as <Station_ID, SO2_value> pairs, e.g. <1, 6>, <3, 1>, <2, 6>, <1, 9>, …; the pairs are then shuffled by station (diagram: Input Data, Mappers, Shuffling). MapReduce DATA ANALYTICS BATCH
  • 39. • The Reducer receives <Station_ID, [SO2_1, SO2_2, …, SO2_n]> and computes the average for the station (sum, then divide), e.g. Station_ID 1: 2,013; 2: 2,695; 3: 3,562. MapReduce DATA ANALYTICS BATCH
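Put together, a compact MapReduce job for this SO2 average could look like the sketch below. The field positions and paths follow the CSV sample shown earlier; class and path names are otherwise illustrative:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AvgSO2 {

  // Map: one CSV line -> <Station_ID, SO2_value>
  public static class SO2Mapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    protected void map(LongWritable key, Text value, Context ctx) throws IOException, InterruptedException {
      String[] fields = value.toString().split(";");
      if (fields.length < 6) return;                      // malformed line
      try {
        String station = fields[0].replace("\"", "").trim();
        double so2 = Double.parseDouble(fields[5].replace("\"", "").trim());
        ctx.write(new Text(station), new DoubleWritable(so2));
      } catch (NumberFormatException e) {
        // header line or bad value: skip it
      }
    }
  }

  // Reduce: <Station_ID, [SO2_1, SO2_2, ...]> -> <Station_ID, average>
  public static class AvgReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context ctx) throws IOException, InterruptedException {
      double sum = 0;
      long count = 0;
      for (DoubleWritable v : values) { sum += v.get(); count++; }
      ctx.write(key, new DoubleWritable(sum / count));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "avg-so2");
    job.setJarByClass(AvgSO2.class);
    job.setMapperClass(SO2Mapper.class);
    job.setReducerClass(AvgReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path("/hdfs/AirQuality/"));
    FileOutputFormat.setOutputPath(job, new Path("/hdfs/AirQuality-avg/"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that a plain average cannot be reused as a combiner; the usual trick is to combine partial (sum, count) pairs instead.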
  • 40. Hive • Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries and the analysis of large datasets • Abstraction layer on top of MapReduce • SQL-like language called HiveQL • Metastore: central repository of Hive metadata DATA ANALYTICS BATCH
  • 41. • Obtain the SO2 average of each station:
CREATE TABLE air_quality (Estacion int, Titulo string, latitud double, longitud double, Fecha string, SO2 int, NO int, CO float, …)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
LOAD DATA INPATH '/CalidadAire_Gijon' OVERWRITE INTO TABLE air_quality;
SELECT Titulo, avg(SO2) FROM air_quality GROUP BY Titulo;
Hive DATA ANALYTICS BATCH
  • 42. • Platform for analyzing large data sets • High-level language for expressing data analysis programs: Pig Latin, a data flow programming language • Abstraction layer on top of MapReduce • Procedural language Pig DATA ANALYTICS BATCH
  • 43. Pig DATA ANALYTICS BATCH • Obtain the SO2 average of each station:
air_quality = LOAD '/CalidadAire_Gijon' USING PigStorage(';') AS (estacion:chararray, titulo:chararray, latitud:chararray, longitud:chararray, fecha:chararray, so2:double, no:chararray, co:chararray, pm10:chararray, o3:chararray, dd:chararray, vv:chararray, tmp:chararray, hr:chararray, prb:chararray, rs:chararray, ll:chararray, ben:chararray, tol:chararray, mxil:chararray, pm25:chararray);
grouped = GROUP air_quality BY estacion;
avg = FOREACH grouped GENERATE group, AVG(air_quality.so2);
DUMP avg;
  • 44. • Cascading is a data processing API and processing query planner used for defining, sharing, and executing data-processing workflows • Makes development of complex Hadoop MapReduce workflows easy • In the same spirit as Pig DATA ANALYTICS BATCH Cascading
  • 45. // define source and sink Taps
Tap source = new Hfs( sourceScheme, inputPath );
Scheme sinkScheme = new TextLine( new Fields( "Estacion", "SO2" ) );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );
Pipe assembly = new Pipe( "avgSO2" );
assembly = new GroupBy( assembly, new Fields( "Estacion" ) );
// for every tuple group, compute the SO2 average
Aggregator avg = new Average( new Fields( "SO2" ) );
assembly = new Every( assembly, avg );
// tell Hadoop which jar file to use
Flow flow = flowConnector.connect( "avg-SO2", source, sink, assembly );
// execute the flow, block until complete
flow.complete();
DATA ANALYTICS BATCH • Obtain the SO2 average of each station Cascading
  • 46. Spark • Cluster computing system for faster data analytics • Not a modified version of Hadoop • Compatible with HDFS • In-memory data storage for very fast iterative processing • MapReduce-like engine • API in Scala, Java and Python DATA ANALYTICS BATCH
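For comparison with the MapReduce version of the SO2 average, roughly the same job with Spark's Java API (a sketch: the paths and field positions follow the CSV sample, Java 8 lambdas are assumed):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkAvgSO2 {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("avg-so2"));
    JavaRDD<String> lines = sc.textFile("hdfs:///AirQuality/")
        .filter(line -> !line.startsWith("Station"));          // drop the CSV header

    // (station, [so2, count]) pairs, summed per station in memory
    JavaPairRDD<String, double[]> sums = lines
        .mapToPair(line -> {
          String[] f = line.split(";");
          String station = f[0].replace("\"", "").trim();
          double so2 = Double.parseDouble(f[5].replace("\"", "").trim());
          return new Tuple2<>(station, new double[]{so2, 1d});
        })
        .reduceByKey((a, b) -> new double[]{a[0] + b[0], a[1] + b[1]});

    // sum / count = average per station
    sums.mapValues(acc -> acc[0] / acc[1])
        .collect()
        .forEach(t -> System.out.println(t._1() + " -> " + t._2()));

    sc.stop();
  }
}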
  • 47. Spark DATA ANALYTICS BATCH • Hadoop is slow due to replication, serialization and IO tasks
  • 48. Spark DATA ANALYTICS BATCH • 10x-100x faster
  • 49. Spark SQL • Large-scale data warehouse system for Spark • SQL on top of Spark (aka Shark) • Actually HiveQL over Spark • Up to 100x faster than Hive DATA ANALYTICS BATCH
  • 50. Pros • Faster than the Hadoop ecosystem • Easier to develop new applications o (Scala, Java and Python APIs) Cons • Not tested in extremely large clusters yet • Problems when the Reducer's data does not fit in memory DATA ANALYTICS BATCH Spark
  • 51. 1. Big Data processing 2. Batch processing 3. Streaming processing 4. Hybrid computation model 5. Open Issues & Conclusions Agenda
  • 52. Real-time processing technologies. DATA ACQUISITION: Flume. DATA STORAGE: Kafka, Kestrel. DATA ANALYSIS: Flume, Storm, Trident, S4, Spark Streaming. RESULTS.
  • 54. • Kafka is a distributed, partitioned, replicated commit log service o Producer/Consumer model o Kafka maintains feeds of messages in categories called topics o Kafka is run as a cluster Kafka DATA STORAGE STREAM
  • 55. Insert the AirQuality sensor log file into a Kafka cluster and consume the info.
// new Producer
Producer<String, String> producer = new Producer<String, String>(config);
// open the sensor log file
BufferedReader br = …
String line;
while (true) {
  line = br.readLine();
  if (line == null) … // wait
  else producer.send(new KeyedMessage<String, String>(topic, line));
}
Kafka DATA STORAGE STREAM
  • 56. AirQuality consumer
ConsumerConnector consumer = Consumer.createJavaConsumerConnector(config);
Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(topic, new Integer(1));
Map<String, List<KafkaMessageStream>> consumerMap = consumer.createMessageStreams(topicCountMap);
KafkaMessageStream stream = consumerMap.get(topic).get(0);
ConsumerIterator it = stream.iterator();
while (it.hasNext()) {
  // consume it.next()
}
Kafka DATA STORAGE STREAM
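For the producer and consumer above to work, the topic has to exist on the cluster. Creating it typically looks like the following for a 0.8.x-era Kafka (the ZooKeeper address, partition and replication counts, and the topic name are illustrative):

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 3 --topic airquality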
  • 57. • Simple distributed message queue • A single Kestrel server has a set of queues (strictly-ordered FIFO) • On a cluster of Kestrel servers, they don't know about each other and don't do any cross communication • Kestrel vs Kafka o Kafka consumers are cheaper (basically just the bandwidth usage) o Kestrel does not depend on ZooKeeper, which means it is operationally less complex if you don't already have a ZooKeeper installation o Kafka has significantly better throughput o Kestrel does not support ordered consumption Kestrel DATA STORAGE STREAM
  • 58. Interceptor • Interface org.apache.flume.interceptor.Interceptor • Can modify or even drop events based on any criteria • Flume supports chaining of interceptors • Types: o Timestamp interceptor o Host interceptor o Static interceptor o UUID interceptor o Morphline interceptor o Regex Filtering interceptor o Regex Extractor interceptor DATA ANALYTICS STREAM Flume
  • 59. • The sensors' information must be filtered by "Station 2" o An interceptor will filter information between the Source and the Channel.
Station; Tittle;latitude; longitude; Date ; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982";
"2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23";
"3";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23";
"2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23";
"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";
Flume DATA ANALYTICS STREAM
  • 60. # Write format can be text or writable …
# Defining channel (memory type) …
# Defining source (syslog) …
# Defining sink (HDFS) …
# Defining the interceptor
agent.sources.source.interceptors = i1
agent.sources.source.interceptors.i1.type = org.apache.flume.interceptor.StationFilter
class StationFilter implements Interceptor … if the event's station id is not "2", discard the event; else keep it
Flume DATA ANALYTICS STREAM
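A fuller sketch of that custom interceptor, dropping every event whose station id is not "2" before it reaches the channel (the class is an assumption, not from the deck; in the agent configuration the interceptor type would point to its Builder, e.g. com.example.StationFilter$Builder):

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class StationFilter implements Interceptor {

  @Override public void initialize() { }

  // Keep only events whose first CSV field (the station id) equals "2"
  @Override public Event intercept(Event event) {
    String line = new String(event.getBody(), StandardCharsets.UTF_8);
    String station = line.split(";")[0].replace("\"", "").trim();
    return "2".equals(station) ? event : null;   // returning null drops the event
  }

  @Override public List<Event> intercept(List<Event> events) {
    List<Event> kept = new ArrayList<Event>(events.size());
    for (Event e : events) {
      Event out = intercept(e);
      if (out != null) kept.add(out);
    }
    return kept;
  }

  @Override public void close() { }

  public static class Builder implements Interceptor.Builder {
    @Override public Interceptor build() { return new StationFilter(); }
    @Override public void configure(Context context) { }
  }
}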
  • 61. Hadoop vs Storm: JobTracker ↔ Nimbus, TaskTracker ↔ Supervisor, Job ↔ Topology • Distributed and scalable realtime computation system • Doing for real-time processing what Hadoop did for batch processing • Topology: processing graph. Each node contains processing logic (spouts and bolts); links between nodes are streams of data o Spout: source of streams. Reads a data source and emits the data into the topology as a stream o Bolt: processing unit. Reads data from several streams, does some processing and possibly emits new streams o Stream: unbounded sequence of tuples. Tuples can contain any serializable object Storm DATA ANALYTICS STREAM
  • 62. • AirQuality average values o Step 1: build the topology: CAReader (Spout) -> LineProcessor (Bolt) -> AvgValues (Bolt) Storm DATA ANALYTICS STREAM
  • 63. • AirQuality average values o Step 1: build the topology
TopologyBuilder AirAVG = new TopologyBuilder();
AirAVG.setSpout("ca-reader", new CAReader(), 1);
// shuffleGrouping -> even distribution
AirAVG.setBolt("ca-line-processor", new LineProcessor(), 3).shuffleGrouping("ca-reader");
// fieldsGrouping -> tuples with the same field value go to the same task
AirAVG.setBolt("ca-avg-values", new AvgValues(), 2).fieldsGrouping("ca-line-processor", new Fields("id"));
Storm DATA ANALYTICS STREAM
  • 64. • AirQuality average values o Step 2: CAReader implementation (IRichSpout interface)
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
  // initialize the input file
  BufferedReader br = new …
  …
}
public void nextTuple() {
  String line = br.readLine();
  if (line == null) { return; }
  else collector.emit(new Values(line));
}
Storm DATA ANALYTICS STREAM
  • 65. • AirQuality average values o Step 3: LineProcessor implementation (IBasicBolt interface)
public void declareOutputFields(OutputFieldsDeclarer declarer) {
  declarer.declare(new Fields("id", "stationName", "lat", …
}
public void execute(Tuple input, BasicOutputCollector collector) {
  collector.emit(new Values(input.getString(0).split(";")));
}
Storm DATA ANALYTICS STREAM
  • 66. • AirQuality average values o Step 4: AvgValues implementation (IBasicBolt interface)
public void execute(Tuple input, BasicOutputCollector collector) {
  // totals and counts are hashmaps with each station's accumulated values
  if (totals.containsKey(id)) {
    item = totals.get(id);
    count = counts.get(id);
  } else {
    // create a new item
  }
  // update values
  item.setSo2(item.getSo2() + Integer.parseInt(input.getStringByField("so2")));
  item.setNo(item.getNo() + Integer.parseInt(input.getStringByField("no")));
  …
}
Storm DATA ANALYTICS STREAM
  • 67. • High level abstraction on top of Storm o Provides high level operations (joins, filters, projections, aggregations, functions…) Pros o Easy, powerful and flexible o Incremental topology development o Exactly-once semantics Cons o Very few built-in functions o Lower performance and higher latency than Storm Trident DATA ANALYTICS STREAM
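To give a feel for the Trident API, a small sketch that keeps an exactly-once running count of readings per station (ParseLine is a hypothetical helper function; none of this is from the original deck, which discusses Trident only conceptually):

import backtype.storm.topology.IRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.MemoryMapState;
import storm.trident.tuple.TridentTuple;

public class TridentAirQuality {

  // Hypothetical helper: splits a CSV line into (station, so2)
  public static class ParseLine extends BaseFunction {
    public void execute(TridentTuple tuple, TridentCollector collector) {
      String[] f = tuple.getString(0).split(";");
      collector.emit(new Values(f[0].replace("\"", "").trim(), f[5].replace("\"", "").trim()));
    }
  }

  public static TridentTopology build(IRichSpout caSpout) {
    TridentTopology topology = new TridentTopology();
    topology.newStream("ca-reader", caSpout)
        .each(new Fields("line"), new ParseLine(), new Fields("station", "so2"))
        .groupBy(new Fields("station"))
        // persistent, exactly-once state: number of readings seen per station
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("readings"));
    return topology;
  }
}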
  • 68.  Simple Scalable Streaming System  Distributed, Scalable, Fault-tolerant platform for processing continuous unbounded streams of data  Inspired by MapReduce and Actor models of computation o Data processing is based on Processing Elements (PE) o Messages are transmitted between PEs in the form of events (Key, Attributes) o Processing Nodes are the logical hosts to PEs S4 DATA ANALYTICS STREAM
  • 69. … <bean id="split" class="SplitPE"> <property name="dispatcher" ref="dispatcher"/> <property name="keys"> <!--Listen for both words and sentences --> <list> <value>LogLines *</value> </list> </property> </bean> <bean id="average" class="AveragePE"> <property name="keys"> <list> <value>CAItem stationId</value> </list> </property> </bean> … • AirQuality average values S4 DATA ANALYTICS STREAM
  • 70. Spark Streaming • Spark for real-time processing • Streaming computation as a series of very short batch jobs (windows) • Keep state in memory • API similar to Spark DATA ANALYTICS STREAM
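A minimal sketch of what that looks like with the Java API: a windowed SO2 sum per station computed over micro-batches of sensor lines (the host, port, window sizes and field positions are illustrative assumptions):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingSO2 {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("so2-stream");
    // micro-batches of 10 seconds: the streaming job is a series of very short batch jobs
    JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(10000));

    JavaReceiverInputDStream<String> lines = jssc.socketTextStream("sensor-gateway", 9999);

    JavaPairDStream<String, Double> so2 = lines.mapToPair(line -> {
      String[] f = line.split(";");
      return new Tuple2<>(f[0].replace("\"", "").trim(),
                          Double.parseDouble(f[5].replace("\"", "").trim()));
    });

    // SO2 sum per station over the last minute, refreshed every 10 seconds
    so2.reduceByKeyAndWindow((a, b) -> a + b, new Duration(60000), new Duration(10000))
       .print();

    jssc.start();
    jssc.awaitTermination();
  }
}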
  • 71. 1. Big Data processing 2. Batch processing 3. Streaming processing 4. Hybrid computation model 5. Open Issues & Conclusions Agenda
  • 72. • We are at the beginning of this generation • Short-term Big Data processing goal • Abstraction layer over the Lambda Architecture • Promising technologies o SummingBird o Lambdoop Hybrid Computation Model
  • 73. SummingBird • Library to write MapReduce-like processes that can be executed on Hadoop, Storm or a hybrid model • Scala syntax • The same logic can be executed in batch, real-time and hybrid batch/real-time mode HYBRID COMPUTATION MODEL
  • 75. Pros • Hybrid computation model • Same programming model for all processing paradigms • Extensible Cons • MapReduce-like programming • Scala • Not as abstract as some users would like SummingBird HYBRID COMPUTATION MODEL
  • 76.  Software abstraction layer over Open Source technologies o Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident  Common patterns and operations (aggregation, filtering, statistics…) already implemented. No MapReduce-like processes  Same single API for the three processing paradigms o Batch processing similar to Pig / Cascading o Real-time processing using built-in functions, easier than Trident o Hybrid computation model transparent for the developer Lambdoop HYBRID COMPUTATION MODEL
  • 77. Lambdoop abstractions: Data, Operation, Workflow, over both streaming data and static data HYBRID COMPUTATION MODEL
  • 78. DataInput db_historical = new StaticCSVInput(URI_db);
Data historical = new Data(db_historical);
Workflow batch = new Workflow(historical);
Operation filter = new Filter("Station", "=", 2);
Operation select = new Select("Titulo", "SO2");
Operation group = new Group("Titulo");
Operation average = new Average("SO2");
batch.add(filter);
batch.add(select);
batch.add(group);
batch.add(average);
batch.run();
Data results = batch.getResults();
…
Lambdoop HYBRID COMPUTATION MODEL
  • 79. DataInput stream_sensor = new StreamXMLInput(URI_sensor);
Data sensor = new Data(stream_sensor);
Workflow streaming = new Workflow(sensor, new WindowsTime(100));
Operation filter = new Filter("Station", "=", 2);
Operation select = new Select("Titulo", "SO2");
Operation group = new Group("Titulo");
Operation average = new Average("SO2");
streaming.add(filter);
streaming.add(select);
streaming.add(group);
streaming.add(average);
streaming.run();
while (true) {
  Data live_results = streaming.getResults();
  …
}
Lambdoop HYBRID COMPUTATION MODEL
  • 80. DataInput historical = new StaticCSVInput(URI_folder);
DataInput stream_sensor = new StreamXMLInput(URI_sensor);
Data all_info = new Data(historical, stream_sensor);
Workflow hybrid = new Workflow(all_info, new WindowsTime(1000));
Operation filter = new Filter("Station", "=", 2);
Operation select = new Select("Titulo", "SO2");
Operation group = new Group("Titulo");
Operation average = new Average("SO2");
hybrid.add(filter);
hybrid.add(select);
hybrid.add(group);
hybrid.add(average);
hybrid.run();
Data updated_results = hybrid.getResults();
Lambdoop HYBRID COMPUTATION MODEL
  • 81. Pros • High abstraction layer for all processing models • All steps in the data processing pipeline • Same Java API for all programming paradigms • Extensible Cons • Ongoing project • Not open-source yet • Not tested in large clusters yet Lambdoop HYBRID COMPUTATION MODEL
  • 82. 1. Big Data processing 2. Batch processing 3. Streaming processing 4. Hybrid computation model 5. Open Issues & Conclusions Agenda
  • 83. Open Issues • Interoperability between well-known techniques / technologies (SQL, R) and Big Data platforms (Hadoop, Spark) • European technologies (Stratosphere / Apache Flink) • Massive Streaming Machine Learning • Real-time Interactive Visual Analytics • Vertical (domain-driven) solutions
  • 84. Conclusions Casado R., Younas M. Emerging trends and technologies in big data processing. Concurrency Computat.: Pract. Exper. 2014
  • 85. Conclusions • Big Data is not only Hadoop • Identify the processing requirements of your project • Analyze the alternatives for all steps in the data pipeline • The battle for real-time processing is open • Stay tuned for the hybrid computation model
  • 86. Thanks for your attention! Questions? ruben.casado@treelogic.com ruben_casado