1. On Storing Big Data
Ilias Flaounas
Intelligent Systems Lab
30 October 2012
I. Flaounas (Intelligent Systems Lab) 30 October 2012 1 / 16
6. Storing Big Data
Data start to play an increasingly important role in business and
science.
Storing, searching, sharing, analysing and visualising big data has
become a challenge.
Storage in particular is often overlooked as an issue.
Note that sometimes a MySQL database is not enough.
Hadoop offers an out-of-the-box distributed filesystem for storing data
files. However, the challenge appears when someone needs DB
capabilities, frequent updates or real-time processing.
11. The Problems
Nowadays traditional relational databases can reach their limit in
performance.
Data keep arriving at high velocity, in high volume, and with high variety.
Common practices to increase performance fail after a while: buying a
faster server, getting more RAM, using materialised views, fine-tuning
queries...
Furthermore, “alter table” becomes impractical on very large tables.
Backups and data availability become an issue.
18. NoSQL Movement
The term is too broad and new to really define it.
Wikipedia: “NoSQL (Not only SQL) DB systems are often highly
optimized for retrieve and append operations and often offer little
functionality beyond record storage.”
No schema
No joins between tables
No common scripting language (like SQL)
No ACID (atomicity, consistency, isolation, durability)
On the other hand you gain horizontal scalability and high performance.
Also, most NoSQL systems are Map/Reduce ready and/or bind with
Hadoop.
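The Map/Reduce pattern mentioned above can be sketched in plain Python, without any database behind it. This is only an illustration of the idea: the `OutletID` field and the function names `map_phase` and `reduce_phase` are assumptions made for the example, not the API of any NoSQL system or of Hadoop.

```python
from collections import defaultdict


def map_phase(documents):
    """Emit one (key, value) pair per document: (outlet id, 1)."""
    for doc in documents:
        yield doc["OutletID"], 1


def reduce_phase(pairs):
    """Sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical input: three news articles from two outlets.
docs = [{"OutletID": 14}, {"OutletID": 14}, {"OutletID": 7}]
counts = reduce_phase(map_phase(docs))
```

In a real cluster the map step would run on each machine over its local data, and only the emitted pairs would travel over the network to the reduce step.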
26. NoSQL DBs
There are lots of different systems under the NoSQL ‘umbrella’. Each one
is optimised with different application scenarios in mind, and with different
choices on trade-offs.
Document based: CouchDB, MongoDB,...
Key-value: Cassandra, Dynamo, Riak,...
Tabular based: BigTable, HBase,...
Memory based: Memcached, Redis, others optimised for solid-state
disks...
Specialised for graphs: Neo4j, InfiniteGraph,...
Specialised for full-text search: Lucene, Solr...
Understand your requirements and then make a choice.
30. Oracle response
May 2011: Oracle issues a white paper titled “Debunking the NoSQL
Hype”.
The conclusion:
“Go for the tried and true path. Don’t be risking your data on NoSQL
databases.”
October 2011: Oracle releases the “Oracle NoSQL Database”. The white
paper is now reachable only via Google archives.
42. Example: MongoDB
MongoDB (from “humongous”) is an open source, high-performance,
schema-free, document-oriented database.
Document-Oriented storage
No predefined schema
High Performance
Easy to add new “columns” in data rows
No joins between tables
Easy to scale horizontally: Auto-Sharding
Automatic fail-over: invisible to applications
Full Index Support
Map/Reduce ready - Can bind with Hadoop
Eventually consistent
Open source, but developed and maintained by the company “10gen”
43. Document based DB
A document is represented in a JSON-like format:
{
  "_id" : 12345678,
  "Link" : "http://news.scotsman.com/abc.html",
  "Title" : "Blah blah blah",
  "Content" : "More blah blah",
  "OutletID" : 14,
  "Date" : ISODate("2011-11-17T20:33:15.097Z"),
  "_Hash" : 550973592,
  "Tags" : [ "International", "News", "Scotland" ]
}
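To make the "no predefined schema" point concrete, here is a minimal sketch in plain Python that treats documents as dictionaries. It does not talk to MongoDB; the in-memory `collection` list, the `find` helper and the `Language` field are hypothetical names introduced only for this example.

```python
import json

# A document is just a nested key/value structure.
article = {
    "_id": 12345678,
    "Link": "http://news.scotsman.com/abc.html",
    "Title": "Blah blah blah",
    "OutletID": 14,
    "Tags": ["International", "News", "Scotland"],
}

# A second document may carry a field the first one lacks:
# no "alter table" step is needed to add the new "column".
article_with_language = dict(article, _id=12345679, Language="en")

# Stand-in for a collection; a real document store would persist and index this.
collection = [article, article_with_language]


def find(collection, **criteria):
    """Return documents whose fields equal all the given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


english = find(collection, Language="en")
as_json = json.dumps(article)  # documents serialise naturally to JSON
```

Documents that lack the `Language` field are simply skipped by the query rather than causing an error, which is the practical meaning of schema-free storage.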
44. Single Server
A single machine stores the DB, e.g. MySQL.
45. Master/Slave
Two machines in Master/Slave configuration.
46. MongoDB - Replication
Automatic Fail Over - The Master is elected among servers.
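The idea of electing a master can be illustrated with a toy rule: a master is chosen only if a strict majority of servers is reachable, and the reachable server with the most recent data wins. This is a heavily simplified sketch and does not reproduce MongoDB's actual election protocol; the function `elect_master` and the fields `name`, `reachable` and `last_op` are assumptions made for the example.

```python
def elect_master(members):
    """Pick a master among reachable members, or None without a majority.

    members: list of dicts with hypothetical keys
             'name' (str), 'reachable' (bool), 'last_op' (int timestamp).
    """
    alive = [m for m in members if m["reachable"]]
    # Require a strict majority of the whole set, so that two network
    # partitions can never both elect their own master.
    if len(alive) * 2 <= len(members):
        return None
    # Prefer the most up-to-date member; break ties by name for determinism.
    winner = max(alive, key=lambda m: (m["last_op"], m["name"]))
    return winner["name"]


servers = [
    {"name": "a", "reachable": False, "last_op": 15},  # the failed master
    {"name": "b", "reachable": True, "last_op": 10},
    {"name": "c", "reachable": True, "last_op": 12},
]
new_master = elect_master(servers)
```

The majority requirement is the reason replica groups are usually deployed with an odd number of servers.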
47. MongoDB - Sharding
Data is spread horizontally.
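Spreading data horizontally means that each document lives on exactly one shard, chosen from a key. The sketch below uses hash-based routing purely for illustration; it is not claimed to be how MongoDB assigns documents to shards, and the function name `shard_for` is an assumption.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a document key to a shard index in [0, num_shards)."""
    # md5 is used only as a stable, well-mixed hash (not for security);
    # Python's built-in hash() can differ between runs for strings.
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Route 1000 hypothetical document ids across 4 shards.
placement = {}
for doc_id in range(1000):
    placement.setdefault(shard_for(doc_id, 4), []).append(doc_id)
```

Because the mapping depends only on the key, any client can compute where a document lives without consulting the other shards.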
48. MongoDB
If a new shard is added, data is rebalanced automatically.
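One way to picture automatic balancing is to treat the data as chunks and to move chunks from the fullest shard to the emptiest one until the counts are roughly even. The following is a simplified sketch under that assumption, not MongoDB's actual balancer; `rebalance` and the shard names are hypothetical.

```python
def rebalance(shards):
    """Even out chunk counts in place and return the list of moves.

    shards: dict mapping shard name -> list of chunk ids.
    Each move is a tuple (chunk, source shard, destination shard).
    """
    moves = []
    while True:
        biggest = max(shards, key=lambda s: len(shards[s]))
        smallest = min(shards, key=lambda s: len(shards[s]))
        # Stop once no shard holds more than one chunk above any other.
        if len(shards[biggest]) - len(shards[smallest]) <= 1:
            return moves
        chunk = shards[biggest].pop()
        shards[smallest].append(chunk)
        moves.append((chunk, biggest, smallest))


# Two loaded shards, then an empty third shard "C" joins the cluster.
cluster = {"A": list(range(0, 6)), "B": list(range(6, 12)), "C": []}
migrations = rebalance(cluster)
```

Only the chunks that actually move are copied between machines, so adding a shard does not require reloading the whole data set.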
49. MongoDB
No single point of failure, distributed reads/writes.
55. Big Data come with Big Problems
Maintenance of infrastructure - it is easier to manage one server than
ten
Need to adapt legacy software
Training people on the new technologies
Designing the DB - splitting data among machines for maximum I/O
There may be bugs, ‘simple’ features may be missing, and new versions
come out too often...
Security