Web-scale data processing: practical approaches for low-latency and batch
$>whoami
Edward Capriolo
● Developer @ dstillery (the company formerly known as m6d, aka media6degrees)
● Hive: Project Management Committee
● Hadoop'in it since 0.17.2
● Cassandra'in it since 0.6.X
● Hive'in it since 0.3.X
● Incredibly skilled with PowerPoint
Agenda for this talk
● Batch processing via Hadoop
● Stream processing
● Relational databases and NoSQL
● Life lessons, quips, and other perspectives
Before we talk tech...
● Let's talk math!
● Yay! Math fun! (as people start leaving the room)
● Don't worry. It is only a couple of slides.
● Wanted to talk about relational algebra since it is the foundation of relational databases
● Even in the NoSQL age, relational algebra is alive and well
Relational algebra...
A big slide with many words
● Relational algebra received little attention outside of pure mathematics until the publication of E.F. Codd's relational model of data in 1970. Codd proposed such an algebra as a basis for database query languages.
● In computer science, relational algebra is an offshoot of first-order logic and of algebra of sets concerned with operations over finitary relations, usually made more convenient to work with by identifying the components of a tuple by a name (called attribute) rather than by a numeric column index, which is called a relation in database terminology.
http://en.wikipedia.org/wiki/Relational_algebra
Operators of Relational algebra:
Projection
● SELECT Age, Weight ...
Extended projections
● SELECT Age+Weight as X ...
● SELECT ROUND(Weight), Age+1 as X ...
Selection
● SELECT * FROM Person
● SELECT * FROM Person WHERE Age >= 34
● SELECT * FROM Person WHERE Age = Weight
Joins
● SELECT * FROM Car JOIN Boat on (CarPrice >= BoatPrice)
● SELECT * FROM Car JOIN Boat on (CarPrice = BoatPrice)
Aggregate
● SELECT sum(C) FROM r
● SELECT A, sum(C) FROM r GROUP BY A
http://www.cbcb.umd.edu/confcour/CMSC424/Relational_algebra.pdf
Other Operators
● Set operations
  – Intersection
  – Union
  – Cartesian Product
● Outer joins
  – RIGHT
  – LEFT
  – FULL
● Semi Join / Exists
Batch Processing and Big Data
● When Hadoop came on the scene it was a game changer because:
  – It was a viable implementation of Google's MapReduce white paper
  – It worked with commodity hardware
  – It had no exorbitant software fees
  – It scaled processing and storage with growing companies, without typically needing processes to be redesigned
Archetype Hadoop deployment (circa Facebook 2009)
[Architecture diagram: Web Servers → Scribe MidTier → Scribe Writers → Realtime Hadoop Cluster → Hadoop Hive Warehouse, with rollups flowing to Oracle RAC and MySQL]
http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html
The Hadoop archetype
● A component generating events (web servers)
● A component collecting logs into Hadoop (Scribe)
● Translation of raw data using Hadoop and Hive
● Output of rollups to Oracle and other data systems
  – Feedback loops (MySQL <-> Hive)
Use case: Book store
● Our book store will be named (say it with me!):
  – Web scale,
  – Big Data,
  – No SQL,
  – Real Time Analytics,
  – Books!
● One more time!
  – Web scale, Big Data, No SQL, Real Time Analytics, Books
● (A buzzword bingo company)
Domain model
{
  "id": "00001",
  "refer": "http://affiliate1.superbooks.com",
  "ip": "209.191.139.200",
  "status": "ACCEPTED",
  "eventTimeInMillis": 1383011801439,
  "credit_hash": "ab45de21",
  "email": "bob@compuserv.com",
  "purchases": [
    { "name": "Programming Hive", "cost": 30.0 },
    { "name": "frAgile Software Development", "cost": 0.2 }
  ]
}
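A minimal sketch (not from the original deck) of how such a payload might be deserialized in Java with Jackson; the class and field names below simply mirror the JSON above, and everything else is an assumption.

// Hypothetical sketch: mapping the transaction payload with Jackson.
// Only the field names come from the JSON above; the rest is assumed.
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;

public class StoreTransaction {
    public String id;
    public String refer;
    public String ip;
    public String status;
    public long eventTimeInMillis;
    @JsonProperty("credit_hash")
    public String creditHash;
    public String email;
    public List<Purchase> purchases;

    public static class Purchase {
        public String name;
        public double cost;
    }

    public static void main(String[] args) throws Exception {
        String json = "{\"id\":\"00001\",\"refer\":\"http://affiliate1.superbooks.com\","
                + "\"purchases\":[{\"name\":\"Programming Hive\",\"cost\":30.0}]}";
        ObjectMapper mapper = new ObjectMapper()
                // keep the sketch short: ignore fields we did not model
                .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
        StoreTransaction t = mapper.readValue(json, StoreTransaction.class);
        System.out.println(t.refer + " -> " + t.purchases.get(0).cost);
    }
}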
Complex serialized payloads
● The “web logs” processed in Facebook's case were NOT always tab-delimited text files
● In many cases Scribe was logging complex structures in Thrift format
● Hadoop (and Hive) can work with complex records not typical in an RDBMS
Log collection/ingestion
http://flume.apache.org/FlumeUserGuide.html
Several ingestion approaches
● Scribe never took off
● Chukwa (hangs around, not sexy)
● Log servers logging directly with the HDFS API
● A duct-taped-together set of shell scripts
● Flume seems to be the most widely used, feature-rich, and supported system
Left up to the user...
● What format do you want the raw data in?
● How should the data be staged in HDFS?
  – Hourly directories
  – By host
● How to monitor?
  – Semantics of what the pipeline should do if files stop appearing
  – Application-specific sanity checks
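To make the "stage it yourself" options above concrete, here is a minimal, hedged sketch of a log server writing directly to HDFS into hourly, per-host staging directories. The path layout and file naming are assumptions, not something prescribed by the deck.

// Hedged sketch: writing a log record straight to HDFS, staged into
// hourly, per-host directories. The directory layout is an assumption.
import java.net.InetAddress;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStagingWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml on the classpath
        FileSystem fs = FileSystem.get(conf);

        String host = InetAddress.getLocalHost().getHostName();
        String hour = new SimpleDateFormat("yyyy-MM-dd/HH").format(new Date());
        // e.g. /staging/store_transactions/2013-10-29/01/web01.log
        Path target = new Path("/staging/store_transactions/" + hour + "/" + host + ".log");

        FSDataOutputStream out = fs.create(target, true /* overwrite */);
        out.writeBytes("{\"id\":\"00001\",\"status\":\"ACCEPTED\"}\n");
        out.close();
    }
}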
Unleash the hounds!
Hive and relational algebra
SELECT refer,                                        <- Projection
       sum(purchase.cost)                            <- Aggregation
FROM store_transaction
LATERAL VIEW explode(purchases) plist AS purchase    <- Hive sexiness
WHERE refer = 'y'                                    <- Selection
GROUP BY refer                                       <- Aggregation
Hadoop/Hive's parallel implementation
Drawbacks of the batch approach
● Not efficient/possible on small time windows
  – Jobs have start-up time and overhead
● Late data can be troublesome
  – Resulting in a full rerun
  – Re-runs of dependent jobs
● Failures can set processing back hours (or maybe days)
● Scheduling of dependent tasks
  – Not a huge consensus around the proper tool
    ● Oozie
    ● Azkaban
    ● Cron ... pause not
More drawbacks of batch data
● Interactive analysis of results
● Detecting sanity of input
● Result data is typically moved into other systems for interactive analysis (post-processing)
● Most computational steps spill/persist to disk
  – Components of a job can be pipelined, but between two jobs sits persistent storage that needs to be re-read for the next batch
Stream Processing
Stream processing
● My first “stream processing” job was reading in Associated Press data
  – Connecting to a terminal server connected to a serial modem
  – Writing this information to a database
● My definition: processing data across one or more coordinated data channels
● Like “Big Data”, stream processing is:
  – Whatever you say it is
Common components of stream processing
● Message queue – a system that delivers a never-ending stream of data
● Processing engine – manages streams and connects data to processing
● External/internal persistence – some data may live outside the stream
  – It could be transient or persistent
Message Queues
Why most Message Queue software does not 'scale'
● MQ 'guarantees':
  – In-order delivery
  – Acknowledgments
● MQs typically optimize by keeping all data in memory
  – Semantics around what happens when memory is full:
    ● Block
    ● Persist to disk
    ● Throw away
● Not trashing message queues here. Many of their guarantees are hard to deliver at scale, and not always needed.
Kafka – A high-throughput distributed messaging system
● Publish-subscribe messaging re-thought as a distributed commit log
Distributed
● Data streams are partitioned and spread over a cluster of machines
Durable and fast
● Messages are always persisted to disk!
● Consumers track their own position in the log files
● Kafka uses the sendfile system call for performance
Consumer Groups
● Multiple groups can subscribe to an event stream
● Producers can determine event partitioning
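The deck's later examples use the old Kafka 0.7 producer API. As a hedged illustration of the "producers can determine event partitioning" point, here is a minimal sketch using the newer kafka-clients producer, where the record key (a user id here) decides which partition an event lands on; the topic name and broker address are placeholders.

// Minimal sketch (assumes the modern org.apache.kafka:kafka-clients producer,
// not the 0.7-era API the deck was written against).
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Both events use user id "1" as the key, so they hash to the same
            // partition and one consumer sees that user's full stream in order.
            producer.send(new ProducerRecord<>("store_events", "1", "user|1:edward"));
            producer.send(new ProducerRecord<>("store_events", "1", "cart|1:saw:2.00"));
        }
    }
}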
Great! You have streaming data. How do you process it?
● Storm - https://github.com/nathanmarz/storm
● Samza - samza.incubator.apache.org
● S4 - http://incubator.apache.org/s4/
● IBM InfoSphere Streams - http://www03.ibm.com/software/products/us/en/infospherestreams/
● Heck, even I wrote one! IronCount - https://github.com/edwardcapriolo/IronCount
Before you have a holy war over this software decision...
Storm
● Distributed and fault-tolerant realtime computation: stream processing, continuous computation, distributed RPC
Storm (Trident) API
● Data comes from spouts
● Spouts/streams produce tuples
● FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 1,
      new Values("line one"),
      new Values("line two"));
https://github.com/nathanmarz/storm/wiki/Trident-tutorial
(Extended) Projection
● A stream can be processed into another stream
● Here a line is split into words:
● Stream words = stream.each(new Fields("sentence"), new Split(), new Fields("word"));
● (Similar to Hive's LATERAL VIEW)
Grouping and Aggregation
● GroupedStream groupByWord = words.groupBy(new Fields("word"));
● TridentState groupByState = groupByWord.persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
Great! We just did distributed stream processing!
● But where are the results?
● groupByWord.persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
● In memory... aka nowhere :)
● We can change that...
● But first, some math/science/drivel I stole from Wikipedia in an attempt to sound smart!
Temporal database
● A temporal database is a database with built-in support for handling data involving time, for example a temporal data model and a temporal version of Structured Query Language (SQL).
● Temporal databases are in contrast to current databases, which store only facts which are believed to be true at the current time.
Batch/Hadoop was easy (temporally speaking)
● Input data is typically in write-once HDFS files*
● Output data typically goes to write-once output files*
● The reduce phase does not start until map/shuffle is done
● Output data is typically not available until the entire job is done*
● Idempotent computation
* Going to qualify everything with "typically", because of computational idempotency
The real “real time”
● “Real time” is often misused
● Anecdotally, people usually mean:
  – Low latency
  – Small windows of time (sub-minute & sub-second)
● Our bookstore wants “real time” stats
  – Aggregations and data stores updated incrementally as data is processed
● One way to implement this is discrete columns bucketed by time
Tempor-alizing data
● In an earlier example we aggregated revenue by referrer like this:
  SELECT refer, sum(purchase.cost) ... GROUP BY refer
● Now we include the time:
  SELECT date(eventtime), hour(eventtime), minute(eventtime), refer, sum(purchase.cost) ...
  GROUP BY date(eventtime), hour(eventtime), minute(eventtime), refer
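On the streaming side the same bucketing happens per event rather than per query. Below is a small hedged Java sketch of deriving a minute-level bucket from the eventTimeInMillis field in the domain model; the key layout is purely an assumption.

// Sketch: build a "refer + minute bucket" key from eventTimeInMillis,
// mirroring the date/hour/minute GROUP BY above. Key format is an assumption.
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class TimeBuckets {
    public static String minuteBucket(String refer, long eventTimeInMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return refer + "|" + fmt.format(new Date(eventTimeInMillis));
    }

    public static void main(String[] args) {
        // 1383011801439 is the eventTimeInMillis value from the domain model slide.
        System.out.println(minuteBucket("http://affiliate1.superbooks.com", 1383011801439L));
        // prints the referrer plus its minute bucket, e.g. "...|2013-10-29 01:56" (UTC)
    }
}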
Storing data in Cassandra
● Horizontally scalable (hundreds of nodes)
● No single point of failure
● Integrated replication
● Writes like lightning (structured log storage)
● Reads like thunder (LevelDB & BigTable inspired storage)
Scalable time series made easy with Cassandra
● Create a table with one row per day per refer, sorted by time:
  CREATE TABLE purchase_by_refer (
    refer text,
    dt date,
    event_time timestamp,
    tot counter,
    PRIMARY KEY ((refer, dt), event_time));
● UPDATE purchase_by_refer SET tot = tot + 1
  WHERE refer = 'store1' AND dt = '2013-01-12' AND event_time = '2013-01-12 07:03:00';
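If the counter update were driven from application code rather than cqlsh, a hedged sketch with the DataStax Java driver (3.x-era API; the contact point and keyspace name are placeholders) might look like this:

// Hedged sketch: incrementing the per-minute counter with the DataStax Java
// driver. Table and columns come from the slide; host/keyspace are assumptions.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.LocalDate;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import java.util.Date;

public class PurchaseCounter {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("store");   // placeholder keyspace

        PreparedStatement inc = session.prepare(
                "UPDATE purchase_by_refer SET tot = tot + 1 "
              + "WHERE refer = ? AND dt = ? AND event_time = ?");

        // One increment per purchase event, bucketed to the event's minute.
        session.execute(inc.bind("store1",
                LocalDate.fromYearMonthDay(2013, 1, 12),
                new Date(1357974180000L)));           // 2013-01-12 07:03:00 UTC

        cluster.close();
    }
}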
If you want C* and Storm
● https://github.com/hmsonline/storm-cassandra
● Uses Cassandra as a persistence model for Storm
● Good documentation
The home stretch: Joining streams and caching data
● Some use cases of distributed streaming involve keeping local caches
● Streaming algorithms require memory of recent events and do not want to query a datastore each time an event is received
● Kafka is useful in this case because the user can dictate the partition the data is sent to
Streaming Recommendation System
https://github.com/edwardcapriolo/IronCount
Input Streams
Stream 1: users        Stream 2: items
user|1:edward          cart|1:saw:2.00
user|2:nate            cart|1:hammer:3.00
user|3:stacey          cart|3:puppy:1.00
● Both streams are merged (union)
● The field after the pipe is the user id (projection)
● User id should be the partition key when sent on (aggregation)
Handle message and route by id
public void handleMessage(MessageAndMetadata<Message> m) {
  String line = getMessage(m.message());
  // "|" is a regex metacharacter, so it must be escaped for String.split()
  String[] parts = line.split("\\|");
  String table = parts[0];
  String row = parts[1];
  String[] columns = row.split(":");
  // re-publish to the "reduce" topic keyed by user id, so every event for a
  // given user lands on the same partition
  producer.send(new ProducerData<String, String>(
      "reduce", columns[0], Arrays.asList(table + "|" + row)));
}
Update in memory copy
public class ReduceHandler implements MessageHandler {
  // per-user cache of recent items; evicts old entries to bound memory
  HashMap<User, ArrayList<Item>> data = new EvictingHashMap<User, ArrayList<Item>>();
  ...
  public void handleMessage(MessageAndMetadata<Message> m) {
    // parse table|row and resolve the User u, as in the routing handler (elided on the slide)
    if (table.equals("cart")) {
      Item i = new Item();
      i.parse(columns);
      incrementItemCounter(u);
      incrementDollarByUser(u, i);
    }
    suggestNewItemsForUser(u);
  }
}
Challenges of streaming
● Replay of data could double-count or miss events
● New, evolving APIs
  – You may have to build support for your stack
● Distributed computation is harder to log/debug
● Monitoring consumption on topics to avoid falling behind
● Monitoring topics to notice if data stops
El fin
