This document discusses big data and strategies for processing large datasets. It begins by defining big data as datasets that cannot be processed by a single machine and require distributed architectures. It then discusses common big data processing tasks like create, read, update, and delete operations. It also introduces the concept of shared-nothing parallel databases, NoSQL key-value stores, and map-reduce frameworks as approaches to distributed and parallel processing of big data.
1. Everything you've always wanted to
know about Big Data
(But were afraid to ask)
Howie Rosenshine
howie.rosenshine@gmail.com
PhillyDB – 6/19/2012
2. Administrivia
Why the title of the talk?
When I use the unqualified term database I
probably mean DBMS. Bad habit.
When I use the unqualified term database I
probably mean RDBMS (not NoSQL).
Another habit... maybe not bad... it will be for you
to decide.
Howie Rosenshine - Ergo Analytics
3. What is Big Data?
Size is such that it cannot (easily/economically)
be processed within a single node (or single
"shared something" cluster)
Or any smaller architecture that is capable of
scaling to the above Big Data definition
And what exactly do we mean by “processed” ?
4. CRUD
Create Read Update Destroy (plus a potentially
huge amount of actual computing)
Big Data examples tend to come from “machine
generated” domain e.g. web crawling or
tracking, realtime sensor data, logfiles, etc.
So for Big Data, CRUD ⇒ CRud. Or perhaps:
CRAP ⇒ Create Read Analytical Processing
5. Why is all this CRAP a problem?
Because tools and architectures that have
grown to support not(B) do not scale well to B,
where B=Big Data.
Why is this so?
Scalability? No such thing. Bottlenecks!
6. Bottlenecks (Scalability)
Consider the single-node RDB example (single
node = shared everything)
Bottlenecks can be hardware or software
(probably software more often than not) e.g.
kernel locks for I/O contention will probably bite
before you run out of disks to attach or PCI
bandwidth, etc.
But one or the other will bite eventually.
8. Shared Something
Shared logical disk implemented with pretty
extensive inter-machine ipc locking mechanism.
This will typically bottleneck long before any
aggregated hardware limits
Nevertheless, it is good enough to become a
dominant force in the OLTP industry.
Note: this is not to say that you can’t do serious
analytical processing on such an architecture
But what happens when your "really big data"
exceeds this limit?
9. Big Data Strategies
Shared nothing parallel relational database
NoSQL (key/value stores)
Map Reduce
Note: Embarrassingly parallel problems require
none of these. Examples:
Static web pages.
Wikipedia (at least w/o edits).
Google maps/earth.
11. Shared Nothing
Well, “Nothing but Net”, that is
Network should be fast, certainly for bandwidth,
preferably for latency as well
At least for some queries (see next section)
12. Partitioning/Sharding
Ideally little/no inter-shard/inter-node
communication (local/localized join)
Data distribution/redistribution among shards
Redundancy also allows for orthogonal
sharding
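As a rough illustration of the "local/localized join" point (the data, shard count, and helper names here are all made up), hash partitioning both tables on the join key guarantees that matching rows land on the same shard, so each shard can join with no inter-node communication:

```python
# Hypothetical sketch of hash partitioning: rows are routed to a shard by
# hashing the join key, so rows that join always land on the same shard
# and the join stays shard-local.
import hashlib

NUM_SHARDS = 4

def shard_for(key):
    """Pick a shard by hashing the key (stable across runs, unlike hash())."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

orders = [(101, "widget"), (202, "gadget"), (101, "gizmo")]
customers = [(101, "Alice"), (202, "Bob")]

# Route both tables by the same key: a join on customer id is then
# shard-local, because matching rows always share a shard.
shards = [{"orders": [], "customers": []} for _ in range(NUM_SHARDS)]
for cust_id, item in orders:
    shards[shard_for(cust_id)]["orders"].append((cust_id, item))
for cust_id, name in customers:
    shards[shard_for(cust_id)]["customers"].append((cust_id, name))

# Each shard joins only its own data -- no cross-shard traffic needed.
joined = []
for shard in shards:
    names = dict(shard["customers"])
    for cust_id, item in shard["orders"]:
        joined.append((names[cust_id], item))
```

A join on a non-partitioning key would force a redistribution of one or both tables first, which is exactly the data distribution/redistribution cost noted above.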
13. Columnar Store
Shared nothing analytic RDBMSs are, for the
most part at this point, columnar stores
“Some RDBMS are born columnar, and some
have columnarness thrust upon them”
Strong advantage for aggregation
Also advantageous for compression
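A toy sketch of both advantages (illustrative data only): an aggregate reads just the one column it needs rather than dragging whole rows through memory, and a sorted low-cardinality column run-length encodes very well:

```python
# Row store: a list of (region, amount) tuples.
rows = [("east", 10), ("east", 12), ("west", 7), ("east", 11)]

# Column store: each column is stored contiguously on its own.
regions = [r for r, _ in rows]
amounts = [a for _, a in rows]

# Aggregation touches only the column it needs, not entire rows.
total = sum(amounts)

# Compression: a sorted/low-cardinality column run-length encodes well.
def run_length_encode(column):
    encoded = []
    for value in column:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded

rle = run_length_encode(sorted(regions))  # [['east', 3], ['west', 1]]
```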
15. NoSQL (Key/Value store)
Characteristics
Relatively low latency, targeted at transaction-
oriented data (simple transactions)
Typically not ACID
Typically no joins
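A minimal sketch of that interface (the class and key names are hypothetical): the store offers put/get by key and nothing more, so anything join-like becomes extra lookups done by the application:

```python
# Minimal key/value interface: put/get by key only -- no joins, no
# multi-key ACID transactions, just opaque values behind keys.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KVStore()
store.put("user:42", {"name": "Alice", "cart": ["widget"]})
profile = store.get("user:42")
# Any "join" (e.g. resolving cart items to prices) is further get()
# calls made by the application, not by the store.
```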
16. Database vs Datastore?
Is it ACID?
Must a database be an instantiation of DBMS?
“I shall not attempt to further define the
characteristics of a database, but I know it
when I see it, and this isn’t it”
17. Map Reduce
(“And now for something completely different”)
Practical general purpose (or as close as
anyone has come) implicit parallel
programming paradigm
Attributed to Google, who published the
original Map Reduce white paper.
Open Source Hadoop - Doug Cutting, Yahoo
Note: Hadoop is an “ecosystem”, not a “product”,
however the unqualified use of Hadoop is typically
taken to mean the use of Hadoop map reduce
18. Hadoop characteristics
Hadoop addresses the crAP partition
Hadoop map reduce is composed primarily of
HDFS and map reduce itself.
Not just Java ⇒ streaming interface
Python, Ruby, ...
Unix: utilities, pipes, filters, shell
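As a sketch of the streaming idea (not Hadoop's actual API): word count written as two plain filters over lines of text, which is the shape a Python mapper and reducer take under the streaming interface, where any language that reads stdin and writes stdout can plug in:

```python
# Word count in the streaming style: the mapper and reducer are plain
# filters over tab-separated lines, as they would be under Hadoop
# Streaming (here driven in-process for illustration).
def mapper(lines):
    """Emit one tab-separated (word, 1) pair per word."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    """Sum counts per word; input must arrive sorted by key (the shuffle)."""
    current, count = None, 0
    for pair in sorted_pairs:
        word, n = pair.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield f"{current}\t{count}"

lines = ["big data big", "data data"]
# sorted() stands in for the framework's shuffle/sort phase.
counts = list(reducer(sorted(mapper(lines))))  # ['big\t2', 'data\t3']
```

The same two filters could be run as separate scripts joined by Unix pipes and sort, which is exactly the utilities/pipes/filters point above.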
20. General Purpose
Use your imagination: If you can make the
shoe fit, Hadoop will wear it
Hive ⇒ RDBMS
RDBMS X...new and improved, 100% fortified
with Hadoop ⇒ ETL
21. Big Picture “Scalability”
Order of magnitude comparison 1/10/100/1000
⇒ Single node/shared something rdb/
shared nothing rdb/map reduce
This is not necessarily a good inter-platform
performance comparison, though it may be
reasonable for intra-platform comparison
22. Further Reading:
dbms2.com - Curt Monash
dbmsmusings.blogspot.com - Daniel Abadi