On Storing Big Data

Ilias Flaounas
Intelligent Systems Lab
30 October 2012

Storing Big Data
Data play an increasingly important role in business and
science.

Storing, searching, sharing, analysing and visualising big data has
become a challenge.

Data storage in particular is often overlooked as an issue.

Note that sometimes a MySQL database is not enough.

Hadoop offers an out-of-the-box distributed filesystem for storing data
files. However, the challenge appears when one needs database
capabilities, frequent updates or real-time processing.

The Problems
Traditional relational databases can reach their performance
limits.

Data keep arriving at high velocity, in high volume and in high variety.

Common practices to increase performance fail after a while: buying a
faster server, getting more RAM, using materialised views, fine-tuning
queries...

Furthermore, “alter table” doesn’t really work with lots of data.

Backups and data availability become issues.

NoSQL Movement
The term is too broad and too new to define precisely.

Wikipedia: “NoSQL (Not only SQL) DB systems are often highly
optimized for retrieve and append operations and often offer little
functionality beyond record storage.”

No schema

No joins between tables

No common query language (like SQL)

No ACID (atomicity, consistency, isolation, durability)

On the other hand, you gain horizontal scalability and high performance.
Also, most NoSQL systems are Map/Reduce ready and/or bind with
Hadoop.
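
To make the Map/Reduce idea concrete, here is a minimal single-process sketch in Python. It is not tied to Hadoop or to any particular NoSQL system; the names `map_doc`, `reduce_counts`, `map_reduce` and the sample records are invented for illustration. The records are schema-free (one of them has no "Tags" field), and the job counts how often each tag occurs.

```python
from collections import defaultdict

# Hypothetical schema-free records: fields may differ between documents.
DOCS = [
    {"Title": "A", "Tags": ["News", "Scotland"]},
    {"Title": "B", "Tags": ["News"]},
    {"Title": "C"},  # no "Tags" field at all
]

def map_doc(doc):
    """Map step: emit one (tag, 1) pair per tag in a document."""
    for tag in doc.get("Tags", []):
        yield tag, 1

def reduce_counts(key, values):
    """Reduce step: combine all values emitted for one key."""
    return sum(values)

def map_reduce(docs, mapper, reducer):
    """Group the mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

tag_counts = map_reduce(DOCS, map_doc, reduce_counts)
print(tag_counts)  # {'News': 2, 'Scotland': 1}
```

In a distributed system the map calls would run on the machines that hold the data, and only the grouped intermediate pairs would travel over the network.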

NoSQL DBs
There are lots of different systems under the NoSQL ‘umbrella’. Each one
is optimised for different application scenarios and makes different
trade-offs.

Document based: CouchDB, MongoDB,...

Key-value: Cassandra, Dynamo, Riak,...

Tabular based: BigTable, HBase,...

Memory based: Memcached, Redis, others optimised for solid-state
disks...

Specialised for graphs: Neo4j, InfiniteGraph,...

Specialised for full-text search: Lucene, Solr...

Understand your requirements and then make a choice.
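
As a rough illustration of how the data models in the categories above differ, the sketch below stores the same fact (an article published by outlet 14) in four shapes using plain Python structures. This only mimics the models; it uses no real database client, and every key, column and edge name is made up for the example.

```python
import json

# Key-value: the store sees an opaque value; the client must decode it.
key_value = {"article:1": json.dumps({"Title": "A", "OutletID": 14})}
title_from_kv = json.loads(key_value["article:1"])["Title"]

# Document: fields are visible to the store, so they can be filtered on.
documents = [{"_id": 1, "Title": "A", "OutletID": 14}]
titles_for_outlet = [d["Title"] for d in documents if d["OutletID"] == 14]

# Tabular: row key -> column name -> value; rows may have different columns.
table = {"article:1": {"meta:Title": "A", "meta:OutletID": 14}}
title_from_table = table["article:1"]["meta:Title"]

# Graph: nodes connected by labelled edges.
edges = [("article:1", "PUBLISHED_BY", "outlet:14")]
outlets = [dst for src, label, dst in edges if src == "article:1"]

print(title_from_kv, titles_for_outlet, title_from_table, outlets)
```

Which shape fits best depends on the queries you need: lookups by key, filters on fields, scans over columns, or traversals over relationships.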

Oracle response

May 2011: Oracle issues a white paper titled “Debunking the NoSQL
Hype”.

The conclusion:
“Go for the tried and true path. Don’t be risking your data on NoSQL
databases.”

October 2011: Oracle releases the “Oracle NoSQL Database”. The white
paper is now reachable only via Google archives.

Example: MongoDB
MongoDB (from “humongous”) is an open source, high-performance,
schema-free, document-oriented database.

Document-Oriented storage

No predefined schema

High Performance

Easy to add new “columns” in data rows

Easy to scale horizontally: Auto-Sharding

Automatic fail-over: invisible to applications

Full Index Support

Map/Reduce ready - Can bind with Hadoop

Eventually consistent (when reading from replicas)

Open source, but developed and maintained by the company “10gen”

Document based DB
A document is represented in a JSON-like format:
{
  "_id" : 12345678,
  "Link" : "http://news.scotsman.com/abc.html",
  "Title" : "Blah blah blah",
  "Content" : "More blah blah",
  "OutletID" : 14,
  "Date" : ISODate("2011-11-17T20:33:15.097Z"),
  "_Hash" : 550973592,
  "Tags" : [ "International", "News", "Scotland" ]
}
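
The same document can be sketched as a plain Python dictionary, which also shows why adding a new “column” needs no schema change. This is an illustration using only the standard library, not a MongoDB client call. The spellings `_id` and `_Hash` are assumptions about the field names on the slide, and the `Language` field and the `to_json` helper are hypothetical.

```python
import json
from datetime import datetime, timezone

# Field names follow the slide; "_id" and "_Hash" are assumed spellings.
article = {
    "_id": 12345678,
    "Link": "http://news.scotsman.com/abc.html",
    "Title": "Blah blah blah",
    "Content": "More blah blah",
    "OutletID": 14,
    "Date": datetime(2011, 11, 17, 20, 33, 15, 97000, tzinfo=timezone.utc),
    "_Hash": 550973592,
    "Tags": ["International", "News", "Scotland"],
}

# Adding a new "column" needs no schema change: just set another key.
article["Language"] = "en"  # hypothetical extra field

def to_json(doc):
    """Serialise a document; datetimes become ISO 8601 strings."""
    return json.dumps(doc, default=lambda value: value.isoformat(), indent=2)

text = to_json(article)
restored = json.loads(text)
print(text)
```

Plain JSON has no date type, which is why the date is written as an ISO 8601 string here, whereas the slide shows it wrapped in `ISODate(...)`.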

Single Server
A single machine stores the DB, e.g. MySQL.

Master/Slave
Two machines in Master/Slave configuration.

MongoDB - Replication
Automatic failover - the Master is elected from among the servers.
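
The idea of electing a new Master can be sketched as follows. This is a deliberately simplified illustration and not MongoDB's actual election protocol: the `Member` class, the `last_op` counter and the rule "the most up-to-date reachable server wins, provided a majority is reachable" are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Member:
    name: str
    healthy: bool
    last_op: int  # position of the last replicated write (illustrative)

def elect_master(members):
    """Pick a new master, or None if no majority of members is reachable."""
    reachable = [m for m in members if m.healthy]
    if len(reachable) * 2 <= len(members):
        return None  # without a majority, no election takes place
    # Prefer the most up-to-date member; break ties by name for determinism.
    return max(reachable, key=lambda m: (m.last_op, m.name))

replica_set = [
    Member("db1", healthy=False, last_op=120),  # the failed master
    Member("db2", healthy=True, last_op=118),
    Member("db3", healthy=True, last_op=120),
]
new_master = elect_master(replica_set)
print(new_master.name)  # db3
```

Requiring a majority is what stops two halves of a partitioned cluster from each electing their own Master.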

MongoDB - Sharding
Data is spread horizontally.
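
One way to spread data horizontally is to give each shard a range of key values. The sketch below routes documents to shards by such ranges; it illustrates the general technique only, and the split points, the shard names and the choice of `OutletID` as the key are all hypothetical.

```python
from bisect import bisect_right

# Exclusive upper bounds of the key ranges owned by the first two shards;
# the last shard takes every key from 2000 upwards (illustrative values).
SPLIT_POINTS = [1000, 2000]
SHARDS = ["shard1", "shard2", "shard3"]

def shard_for(key):
    """Return the shard whose key range contains `key`."""
    return SHARDS[bisect_right(SPLIT_POINTS, key)]

docs = [{"OutletID": 14}, {"OutletID": 1000}, {"OutletID": 5000}]
placement = [shard_for(doc["OutletID"]) for doc in docs]
print(placement)  # ['shard1', 'shard2', 'shard3']
```

Each read or write touches only the shard that owns the key, so load is divided among the machines.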

MongoDB
If a new shard is added, data is balanced automatically.
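
Automatic balancing can be pictured as moving blocks of data ("chunks") from the fullest shard to the emptiest one until the counts are even. The sketch below is a toy version under that assumption, not MongoDB's balancer; the `balance` function and the cluster layout are invented for illustration.

```python
def balance(shards):
    """Move chunks from the fullest to the emptiest shard until even.

    `shards` maps a shard name to the list of chunk ids it holds.
    Returns the list of (chunk, source, target) migrations performed.
    """
    moves = []
    while True:
        fullest = max(shards, key=lambda name: len(shards[name]))
        emptiest = min(shards, key=lambda name: len(shards[name]))
        if len(shards[fullest]) - len(shards[emptiest]) <= 1:
            return moves
        chunk = shards[fullest].pop()
        shards[emptiest].append(chunk)
        moves.append((chunk, fullest, emptiest))

# Two shards holding six chunks between them (hypothetical layout).
cluster = {"shard1": [0, 1, 2], "shard2": [3, 4, 5]}
cluster["shard3"] = []  # a new, empty shard joins the cluster
migrations = balance(cluster)
print(cluster)     # {'shard1': [0, 1], 'shard2': [3, 4], 'shard3': [2, 5]}
print(migrations)  # [(2, 'shard1', 'shard3'), (5, 'shard2', 'shard3')]
```

Only two of the six chunks move, so most data stays where it is while the new shard takes its share.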

MongoDB
No single point of failure, distributed reads/writes.

Big Data come with Big Problems
Maintenance of infrastructure - It is easier to manage one server
than 10

Need to adapt legacy software

Training people on the new technologies

Designing the DB - splitting data among machines for maximum I/O

Bugs may be present, ‘simple’ features may be missing, and new versions
come out too often...

Security

Thank you!
