This document discusses big data and strategies for processing large datasets. It begins by defining big data as datasets that cannot be processed by a single machine and require distributed architectures. It then discusses common big data processing tasks like create, read, update, and delete operations. It also introduces the concept of shared-nothing parallel databases, NoSQL key-value stores, and map-reduce frameworks as approaches to distributed and parallel processing of big data.
1. Everything you've always wanted to
know about Big Data
(But were afraid to ask)
Howie Rosenshine
howie.rosenshine@gmail.com
PhillyDB – 6/19/2012
2. Administrivia
Why the title of the talk?
When I use the unqualified term database I
probably mean DBMS. Bad habit.
When I use the unqualified term database I
probably mean RDBMS (not NoSQL).
Another habit... maybe not bad... it will be for you
to decide.
Howie Rosenshine - Ergo Analytics
3. What is Big Data?
Size is such that it cannot (easily/economically)
be processed within a single node (or single
"shared something" cluster)
Or any smaller architecture that is capable of
scaling to the above Big Data definition
And what exactly do we mean by “processed” ?
4. CRUD
Create Read Update Destroy (plus a potentially
huge amount of actual computing)
Big Data examples tend to come from “machine
generated” domain e.g. web crawling or
tracking, realtime sensor data, logfiles, etc.
So for Big Data, CRUD ⇒ CRud. Or perhaps:
CRAP ⇒ Create Read Analytical Processing
5. Why is all this CRAP a problem?
Because tools and architectures that have
grown to support not(B) do not scale well to B,
where B=Big Data.
Why is this so?
Scalability? No such thing. Bottlenecks!
6. Bottlenecks (Scalability)
Consider the single-node RDB example (single
node = shared everything)
Bottlenecks can be hardware or software
(probably software more often than not) e.g.
kernel locks for I/O contention will probably bite
before you run out of disks to attach or PCI
bandwidth, etc.
But one or the other will bite eventually.
8. Shared Something
Shared logical disk implemented with pretty
extensive inter-machine ipc locking mechanism.
This will typically bottleneck long before any
aggregated hardware limits
Nevertheless, it is good enough to become a
dominant force in the OLTP industry.
Note: this is not to say that you can’t do serious
analytical processing on such an architecture
But what happens when your "really big data"
exceeds this limit?
9. Big Data Strategies
Shared nothing parallel relational database
NoSQL (key/value stores)
Map Reduce
Note: Embarrassingly parallel problems require
none of these. Examples:
Static web pages.
Wikipedia (at least w/o edits).
Google maps/earth.
11. Shared Nothing
Well, “Nothing but Net”, that is
Network should be fast, certainly for bandwidth,
preferably for latency as well
At least for some queries (see next section)
12. Partitioning/Sharding
Ideally little/no inter-shard/inter-node
communication (local/localized join)
Data distribution/redistribution among shards
Redundancy also allows for orthogonal
sharding
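As a rough illustration of the "local/localized join" point (the data, shard count, and helper names here are all made up), hash partitioning both tables on the join key guarantees that matching rows land on the same shard, so each shard can join with no inter-node communication:

```python
# Hypothetical sketch of hash partitioning: rows are routed to a shard by
# hashing the join key, so rows that join always land on the same shard
# and the join stays shard-local.
import hashlib

NUM_SHARDS = 4

def shard_for(key):
    """Pick a shard by hashing the key (stable across runs, unlike hash())."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

orders = [(101, "widget"), (202, "gadget"), (101, "gizmo")]
customers = [(101, "Alice"), (202, "Bob")]

# Route both tables by the same key: a join on customer id is then
# shard-local, because matching rows always share a shard.
shards = [{"orders": [], "customers": []} for _ in range(NUM_SHARDS)]
for cust_id, item in orders:
    shards[shard_for(cust_id)]["orders"].append((cust_id, item))
for cust_id, name in customers:
    shards[shard_for(cust_id)]["customers"].append((cust_id, name))

# Each shard joins only its own data -- no cross-shard traffic needed.
joined = []
for shard in shards:
    names = dict(shard["customers"])
    for cust_id, item in shard["orders"]:
        joined.append((names[cust_id], item))
```

A join on a non-partitioning key would force a redistribution of one or both tables first, which is exactly the data distribution/redistribution cost noted above.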
13. Columnar Store
Shared nothing analytic RDBMSs are, for the
most part at this point, columnar stores
“Some RDBMS are born columnar, and some
have columnarness thrust upon them”
Strong advantage for aggregation
Also advantageous for compression
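A toy sketch of both advantages (illustrative data only): an aggregate reads just the one column it needs rather than dragging whole rows through memory, and a sorted low-cardinality column run-length encodes very well:

```python
# Row store: a list of (region, amount) tuples.
rows = [("east", 10), ("east", 12), ("west", 7), ("east", 11)]

# Column store: each column is stored contiguously on its own.
regions = [r for r, _ in rows]
amounts = [a for _, a in rows]

# Aggregation touches only the column it needs, not entire rows.
total = sum(amounts)

# Compression: a sorted/low-cardinality column run-length encodes well.
def run_length_encode(column):
    encoded = []
    for value in column:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded

rle = run_length_encode(sorted(regions))  # [['east', 3], ['west', 1]]
```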
15. NoSQL (Key/Value store)
Characteristics
Relatively low latency, targeted at transaction-
oriented data (simple transactions)
Typically not ACID
Typically no joins
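A minimal sketch of that interface (the class and key names are hypothetical): the store offers put/get by key and nothing more, so anything join-like becomes extra lookups done by the application:

```python
# Minimal key/value interface: put/get by key only -- no joins, no
# multi-key ACID transactions, just opaque values behind keys.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KVStore()
store.put("user:42", {"name": "Alice", "cart": ["widget"]})
profile = store.get("user:42")
# Any "join" (e.g. resolving cart items to prices) is further get()
# calls made by the application, not by the store.
```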
16. Database vs Datastore?
Is it ACID?
Must a database be an instantiation of DBMS?
“I shall not attempt to further define the
characteristics of a database, but I know it
when I see it, and this isn’t it”
17. Map Reduce
(“And now for something completely different”)
Practical general purpose (or as close as
anyone has come) implicit parallel
programming paradigm
Attributed to Google, who published the
original Map Reduce white paper.
Open Source Hadoop - Doug Cutting, Yahoo
Note: Hadoop is an “ecosystem”, not a “product”,
however the unqualified use of Hadoop is typically
taken to mean the use of Hadoop map reduce
18. Hadoop characteristics
Hadoop addresses the crAP partition
Hadoop map reduce is composed primarily of
HDFS and map reduce itself.
Not just Java ⇒ streaming interface
Python, Ruby, ...
Unix: utilities, pipes, filters, shell
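As a sketch of the streaming idea (not Hadoop's actual API): word count written as two plain filters over lines of text, which is the shape a Python mapper and reducer take under the streaming interface, where any language that reads stdin and writes stdout can plug in:

```python
# Word count in the streaming style: the mapper and reducer are plain
# filters over tab-separated lines, as they would be under Hadoop
# Streaming (here driven in-process for illustration).
def mapper(lines):
    """Emit one tab-separated (word, 1) pair per word."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    """Sum counts per word; input must arrive sorted by key (the shuffle)."""
    current, count = None, 0
    for pair in sorted_pairs:
        word, n = pair.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield f"{current}\t{count}"

lines = ["big data big", "data data"]
# sorted() stands in for the framework's shuffle/sort phase.
counts = list(reducer(sorted(mapper(lines))))  # ['big\t2', 'data\t3']
```

The same two filters could be run as separate scripts joined by Unix pipes and sort, which is exactly the utilities/pipes/filters point above.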
20. General Purpose
Use your imagination: If you can make the
shoe fit, Hadoop will wear it
Hive ⇒ RDBMS
RDBMS X...new and improved, 100% fortified
with Hadoop ⇒ ETL
21. Big Picture “Scalability”
Order of magnitude comparison 1/10/100/1000
⇒ Single node/shared something rdb/
shared nothing rdb/map reduce
This is not necessarily a good inter-platform
performance comparison, though it may be
reasonable for intra-platform comparison
22. Further Reading:
dbms2.com - Curt Monash
dbmsmusings.blogspot.com - Daniel Abadi