Web-scale data processing: practical approaches for low-latency and batch
1. $> whoami
● Edward Capriolo
● Developer @ Dstillery (the company formerly known as m6d, aka Media6Degrees)
● Hive: Project Management Committee
● Hadoop'in it since 0.17.2
● Cassandra'in it since 0.6.X
● Hive'in it since 0.3.X
● Incredibly skilled with PowerPoint
2. Agenda for this talk
● Batch processing via Hadoop
● Stream processing
● Relational databases and NoSQL
● Life lessons, quips, and other perspectives
3. Before we talk tech...
● Let's talk math!
● Yay! Math fun! (as people start leaving the room)
● Don't worry, it is only a couple of slides.
● Wanted to talk about relational algebra since it is the foundation of relational databases
● Even in the NoSQL age, relational algebra is alive and well
4. Relational algebra...
A big slide with many words
● Relational algebra received little attention outside of pure mathematics until the publication of E.F. Codd's relational model of data in 1970. Codd proposed such an algebra as a basis for database query languages.
● In computer science, relational algebra is an offshoot of first-order logic and of the algebra of sets concerned with operations over finitary relations, usually made more convenient to work with by identifying the components of a tuple by a name (called an attribute) rather than by a numeric column index, which is called a relation in database terminology.
http://en.wikipedia.org/wiki/Relational_algebra
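A minimal worked example (the Purchase relation and its attributes are made up here): selection (σ) filters rows and projection (π) picks columns, which is exactly what SQL's WHERE and SELECT do:

\pi_{refer}\bigl(\sigma_{cost > 10}(Purchase)\bigr) \;\equiv\; \texttt{SELECT refer FROM Purchase WHERE cost > 10}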
11. Batch Processing and Big Data
● When Hadoop came on the scene it was a game changer because:
  – Viable implementation of Google's MapReduce white paper
  – Worked with commodity hardware
  – Had no exorbitant software fees
  – Scaled processing and storage with growing companies, without typically needing processes to be redesigned
13. The Hadoop archetype
● Component generating events (web servers)
● Component collecting logs into Hadoop (Scribe)
● Translation of raw data using Hadoop and Hive (see the sketch below)
● Output of rollups to Oracle and other data systems
  – Feedback loops (MySQL <-> Hive)
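A minimal sketch of the "translation of raw data" step as a plain Java MapReduce job. Assumptions (not from the deck): tab-delimited web logs with the referrer in the third field and a purchase amount in the sixth, and a hypothetical class name.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RevenueByReferrer {
  // map: one log line in, (referrer, cost) out
  public static class RevMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      ctx.write(new Text(fields[2]), new DoubleWritable(Double.parseDouble(fields[5])));
    }
  }
  // reduce: sum the costs for each referrer
  public static class RevReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    protected void reduce(Text key, Iterable<DoubleWritable> vals, Context ctx)
        throws IOException, InterruptedException {
      double sum = 0;
      for (DoubleWritable v : vals) sum += v.get();
      ctx.write(key, new DoubleWritable(sum));
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "revenue-by-referrer");
    job.setJarByClass(RevenueByReferrer.class);
    job.setMapperClass(RevMapper.class);
    job.setReducerClass(RevReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // raw logs staged in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // rollup for export downstream
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}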
14. Use case: Book store
● Our book store will be named (say it with me!):
  – Web scale,
  – Big Data,
  – No SQL,
  – Real Time Analytics,
  – Books!
● One more time!
  – Web scale, Big Data, No SQL, Real Time Analytics, Books
● (A buzzword bingo company)
16. Complex serialized payloads
● "Process web logs" in Facebook's case did NOT always mean tab-delimited text files
● In many cases Scribe was logging complex structures in Thrift format
● Hadoop (and Hive) can work with complex records not typical in an RDBMS
18. Several ingestion approaches
● Scribe never took off
● Chukwa (hangs around, not sexy)
● Log servers logging directly with the HDFS API (see the sketch below)
● A duct-taped-up set of shell scripts
● Flume seems to be the most widely used, feature-rich, and supported system
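A minimal sketch of what "log servers log directly with the HDFS API" looks like in Java, assuming a reachable HDFS; the paths and log line are made up, staged hourly and per host the way the next slide suggests.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirectHdfsLogger {
  public static void main(String[] args) throws Exception {
    // picks up fs.defaultFS from core-site.xml on the classpath
    FileSystem fs = FileSystem.get(new Configuration());
    // hourly directory, one file per host
    Path dest = new Path("/logs/web/2013-01-12/07/host1.log");
    FSDataOutputStream out = fs.create(dest);
    out.writeBytes("GET /books/1 HTTP/1.1\n"); // one event per line
    out.close();
    fs.close();
  }
}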
19. Left up to the user...
● What format do you want the raw data in?
● How should the data be staged in HDFS?
  – Hourly directories
  – By host
● How to monitor?
  – Semantics of what the pipeline should do if files stop appearing
  – Application-specific sanity checks
23. Drawbacks of the batch approach
● Not efficient/possible on small time windows
  – Jobs have start-up time and overhead
● Late data can be troublesome
  – Resulting in full reruns
  – Re-runs of dependent jobs
● Failures can set processing hours (or maybe days) back
● Scheduling of dependent tasks
  – Not a huge consensus around the proper tool
    ● Oozie
    ● Azkaban
    ● Cron ... (pause) ... not!
24. More drawbacks of batch data
● Interactive analysis of results
● Detecting sanity of input
● Result data typically moved into other systems for interactive analysis (post-process)
● Most computational steps spill/persist to disk
  – Components of a job can be pipelined, but between two jobs sits persistent storage that needs to be re-read for the next batch
26. Stream processing
● My first "stream processing" job was reading in Associated Press data
  – Connecting to a terminal server connected to a serial modem
  – Writing this information to a database
● My definition: processing data across one or more coordinated data channels
● Like "Big Data", stream processing is:
  – Whatever you say it is
27. Common components of stream processing
● Message queue – A system that delivers a never-ending stream of data
● Processing engine – Manages streams and connects data to processing
● External/internal persistence – Some data may live outside the stream
  – It could be transient or persistent
29. Why most Message Queue software does not 'scale'
● MQ 'guarantees'
  – In-order delivery
  – Acknowledgments
● MQs typically optimize by keeping all data in memory
  – Semantics around what happens when memory is full:
    ● Block
    ● Persist to disk
    ● Throw away
● Not trashing message queues here. Many of their guarantees are hard to deliver at scale, and not always needed.
30. Kafka – A high-throughput distributed messaging system
Publish-subscribe messaging re-thought as a distributed commit log
32. Durable and fast
● Messages are always persisted to disk!
● Consumers track their own position in the log files
● Kafka uses the sendfile system call for performance
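A minimal producer sketch against the Kafka 0.7-style Java API that the handler code later in this deck uses; the ZooKeeper address, topic name, and class name are made up.

import java.util.Arrays;
import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.javaapi.producer.ProducerData;
import kafka.producer.ProducerConfig;

public class EventProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("zk.connect", "zk1:2181"); // 0.7 producers discover brokers via ZooKeeper
    props.put("serializer.class", "kafka.serializer.StringEncoder");
    Producer<String, String> producer =
        new Producer<String, String>(new ProducerConfig(props));
    // the key ("1", a userid) picks the partition, so one user's events stay ordered
    producer.send(new ProducerData<String, String>(
        "events", "1", Arrays.asList("user|1:edward")));
    producer.close();
  }
}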
34. Great! You have streaming data. How do you process it?
● Storm - https://github.com/nathanmarz/storm
● Samza - samza.incubator.apache.org
● S4 - http://incubator.apache.org/s4/
● IBM InfoSphere Streams - http://www-03.ibm.com/software/products/us/en/infosphere-streams/
● Heck, even I wrote one! IronCount - https://github.com/edwardcapriolo/IronCount
37. Storm (Trident) API
● Data comes from spouts
● Spouts/streams produce tuples
● FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 1,
      new Values("line one"),
      new Values("line two"));
https://github.com/nathanmarz/storm/wiki/Trident-tutorial
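To actually run the spout you wire it into a topology; a short sketch following the tutorial linked above (the stream name "spout1" is arbitrary):

// the name given to newStream identifies this stream's spout state in ZooKeeper
TridentTopology topology = new TridentTopology();
Stream stream = topology.newStream("spout1", spout);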
38. (extended) Projection
● A stream can be processed into another stream
● Here a line is split into words (Split is sketched below)
● Stream words = stream.each(new Fields("sentence"), new Split(), new Fields("word"));
● (Similar to Hive's LATERAL VIEW)
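Split above is not built in; it is a user-supplied Trident function, roughly as the tutorial defines it:

public class Split extends BaseFunction {
  public void execute(TridentTuple tuple, TridentCollector collector) {
    String sentence = tuple.getString(0);
    for (String word : sentence.split(" ")) {
      collector.emit(new Values(word)); // one output tuple per word
    }
  }
}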
39. Grouping and Aggregation
● GroupedStream groupByWord = words.groupBy(new Fields("word"));
● TridentState groupByState = groupByWord.persistentAggregate(
      new MemoryMapState.Factory(), new Count(), new Fields("count"));
40. Great! We just did distributed stream processing!
● But where are the results?
● groupByWord.persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
● In memory... aka nowhere :)
● We can change that...
● But first some math/science/dribble I stole from Wikipedia in an attempt to sound smart!
41. Temporal database
● A temporal database is a database with built-in support for handling data involving time, for example a temporal data model and a temporal version of Structured Query Language (SQL).
● Temporal databases are in contrast to current databases, which store only facts which are believed to be true at the current time.
42. Batch/Hadoop was easy (temporally speaking)
● Input data is typically in write-once HDFS files*
● Output data typically goes to write-once output files*
● Reduce phase does not start until map/shuffle is done
● Output data typically not available until the entire job is done*
● Idempotent computation
*Going to qualify everything with typically, because of computational idempotency
43. The real "real time"
● Real time is often misused
● Anecdotally, people usually mean
  – Low latency
  – Small windows of time (sub-minute & sub-second)
● Our bookstore wants "real time" stats
  – Aggregations and data stores updated incrementally as data is processed
● One way to implement this is discrete columns bucketed by time
44. Tempor-alizing data
● In an earlier example we aggregated revenue by referrer like this:
● SELECT refer, sum(purchase.cost) ... GROUP BY refer
● Now we include the time:
● SELECT date(eventtime), hour(eventtime), minute(eventtime), refer, sum(purchase.cost) ...
  GROUP BY date(eventtime), hour(eventtime), minute(eventtime), refer
45. Storing data in Cassandra
● Horizontally scalable (hundreds of nodes)
● No single point of failure
● Integrated replication
● Writes like lightning (log-structured storage)
● Reads like thunder (LevelDB & BigTable inspired storage)
46. Scalable time series made easy with Cassandra
● Create a table with one row per day per refer, sorted by time
● CREATE TABLE purchase_by_refer (
      refer text,
      dt date,
      event_time timestamp,
      tot counter,
      PRIMARY KEY ((refer, dt), event_time));
● UPDATE purchase_by_refer SET tot = tot + 1
  WHERE refer = 'store1' AND dt = '2013-01-12'
  AND event_time = '2013-01-12 07:03:00';
47. If you want C* and Storm
● https://github.com/hmsonline/storm-cassandra
● Uses Cassandra as a persistence model for Storm
● Good documentation
48. The home stretch: Joining streams and caching data
● Some use cases of distributed streaming involve keeping local caches
● Streaming algorithms require memory of recent events and do not want to query a datastore each time an event is received
● Kafka is useful in this case because the user can dictate the partition the data is sent to
50. Input Streams
Stream 1: users      Stream 2: items
user|1:edward        cart|1:saw:2.00
user|2:nate          cart|1:hammer:3.00
user|3:stacey        cart|3:puppy:1.00
● Both streams merged (union)
● The field after the pipe is the userid (projection)
● User id should be the partition key when sent on (aggregation)
51. Handle message and route by id
public void handleMessage(MessageAndMetadata<Message> m) {
  String line = getMessage(m.message());
  // split() takes a regex, so the pipe must be escaped
  String[] parts = line.split("\\|");
  String table = parts[0];   // "user" or "cart"
  String row = parts[1];     // e.g. "1:edward"
  String[] columns = row.split(":");
  // key on the userid (columns[0]) so all of a user's events
  // land on the same partition of the "reduce" topic
  producer.send(new ProducerData<String, String>(
      "reduce", columns[0], Arrays.asList(table + "|" + row)));
}
52. Update in-memory copy
public class ReduceHandler implements MessageHandler {
  // an evicting map keeps the cache bounded: only recent users stay resident
  Map<User, ArrayList<Item>> data = new EvictingHashMap<User, ArrayList<Item>>();
  ...
  public void handleMessage(MessageAndMetadata<Message> m) {
    // (parsing of table, columns, and user u elided, as on the previous slide)
    if (table.equals("cart")) {
      Item i = new Item();
      i.parse(columns);
      incrementItemCounter(u);
      incrementDollarByUser(u, i);
    }
    suggestNewItemsForUser(u);
53. Challenges of streaming
● Replay of data could double-count or miss-count
● New, evolving APIs
  – You may have to build support for your stack
● Distributed computation is harder to log/debug
● Monitoring consumption on topics to avoid falling behind
● Monitoring topics to notice if data stops