A perspective on the rise of NoSQL systems and a comparison between RDBMS and NoSQL technologies.
The basic idea of the presentation originated while trying to understand the different alternatives available for managing data while building a fast, highly scalable, available, and reliable enterprise application.
2. ο RDBMS - The origins
ο Concepts, Architecture and Principles
ο Golden Age - Way of life.
ο Changing Times - New Problems, New Needs
ο Attack on the citadel - Revisiting the norms
ο Ignited Minds - Working towards NoSQL Solutions
ο Way Ahead - It is Cloudy out there
3. ο Girish Narasimha Raghavan
ο Over 15 years of experience building distributed, large
scale and highly available enterprise systems.
ο Current interests include building SAC (Social, Big Data
Analytics, and Cloud) solutions.
ο Likes to write about and discuss technologies and their
applications to solve real world problems.
ο http://randomtechthought.blogspot.com
5. ο In the world data abounds. Always has and always will.
ο Record keeping is as old as the human race.
ο Consistent quest to improve storing, accessing, and analyzing
records
ο The early machines had serious shortcomings.
ο Only a very limited amount of program code and data could be stored
in memory.
ο Electromagnetic data storage was feasible only at an extremely high
cost.
ο Storing data was an issue
ο Organizations had to store data related to administration,
research, and operations.
ο Data stored in proprietary formats - database systems did not exist
ο Plagued by data integrity issues
ο Non-standard application logic for accessing stored data
6. ο First attempt: File based systems
ο Data sets were growing and accumulating.
ο Data had to be managed at a detailed transaction level.
ο Computing systems started to be used for critical business
needs.
ο Data inconsistency and redundancy.
ο Enter Database Systems
ο Attempts to standardize the processes and rules to store and
access data.
ο Intention to reuse, resell and redeploy solutions across
organizations (with significant customizations).
ο Attempt to proactively manage Data Integrity and Quality.
7. ο Database Systems and concepts Evolve
ο Hierarchical DBMS
ο Information represented using parent/child relationships
ο Tree structure is primary data structure.
ο Network DBMS
ο Relationships are represented in the form of a network.
ο Graph is the primary data structure.
ο Challenges Galore
ο Hardware Dependency - Software strongly dependent on the
underlying hardware.
ο Modeling challenges - Representing data under a common
structure.
ο Integration issues - Integrating across dependent packages was a
nightmare.
ο Introducing new functionality and updates - Solution providers
struggled with it across customized software deployment.
8. Father of the Relational
Database model
Edgar F Codd
A British Computer Scientist
who made significant
contributions to the theory of
Relational Databases while
working for IBM.
9. ο Landmark Paper by Codd - "A Relational Model of Data for
Large Shared Data Banks".
ο Independence of data from the hardware and storage
implementation.
ο Automatic navigation to the data set through a high level
nonprocedural language for data access.
ο Concept of keys (primary, secondary).
ο A theoretical proposal, with no practical design or implementation.
ο Codd's 12 rules for a relational database management system
ο http://cims.clayton.edu/booth/ITDB%204201/Codd%20PDF.pdf
11. [Diagram] Components shown: Application 1, Application 2, Application 3;
Reporting Solutions; Future Applications;
Database Management Systems (DBMS); Databases (Data Storage)
12. ο Data Definition
ο For describing data and data structures for handling the data
ο Data Manipulation
ο For describing the operations associated with the data like storage, query, change,
etc.
ο Data Security and Integrity
ο For ensuring secure and controlled access to storage and manipulation of data.
ο For ensuring correctness, consistency and reliability of the data stored.
ο Data Recovery and Concurrency
ο For providing and enforcing recovery and concurrency controls.
ο Data Dictionary
ο For providing information about the data stored.
ο For liaising between the conceptual model and the physical storage.
ο Performance
ο For ensuring all the above mentioned operations are performed efficiently and
effectively
13. External/User
How the user accesses and sees the data
[Tables, Views]
Conceptual/Logical
How data is organized logically
[Table Spaces]
Physical/Internal
How data is stored internally
[Data Files]
14. ο Relation (Table) - Set of tuples that have the
same attributes.
ο Tuple (Row) - A tuple usually represents an
object and information about that object.
ο Attribute (Column) - Represents a particular
characteristic of that object.
ο Domain - A domain describes the set of permitted values for a given attribute.
It is the set from which the values of an attribute can be defined.
ο Constraints - Constraints make it possible to further restrict the domain of an
attribute. Constraints help in binding the attribute to a set of rules.
ο Primary Key - A primary key is an attribute (or set of attributes) that uniquely
identifies a tuple within a relation.
ο Foreign Key - A foreign key references the primary key of another table and is used to cross-reference tables.
ο Cardinality - Expresses the number of instances of an entity to which another
entity can be associated via a relation.
ο Index - An index is a mechanism for providing quicker access to data. Indices
can be created on any combination of attributes on a relation.
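The concepts above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the department and employee tables, their columns, and the helper try_insert are hypothetical names chosen for the example, not something from the original deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE department (
    dept_id INTEGER PRIMARY KEY,          -- primary key: identifies each tuple
    name    TEXT NOT NULL UNIQUE          -- constraint on the attribute
);
CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    salary  INTEGER CHECK (salary >= 0),  -- constraint narrowing the domain
    dept_id INTEGER NOT NULL REFERENCES department(dept_id)  -- foreign key
);
-- index: quicker lookup of employees by department
CREATE INDEX idx_employee_dept ON employee(dept_id);
""")

conn.execute("INSERT INTO department VALUES (1, 'Engineering')")
conn.execute("INSERT INTO employee VALUES (10, 'Asha', 5000, 1)")


def try_insert(sql):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# Each of these violates one rule and should be rejected by the database.
rejected_fk = not try_insert("INSERT INTO employee VALUES (11, 'Ravi', 4000, 99)")
rejected_check = not try_insert("INSERT INTO employee VALUES (12, 'Mei', -1, 1)")
rejected_pk = not try_insert("INSERT INTO employee VALUES (10, 'Dup', 1, 1)")
employee_count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
```

The database, not the application, rejects the row with an unknown department, the negative salary, and the duplicate key.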
15. ο Based on the perception that the real world can be modeled around
base objects (entities) and the relationships among them.
ο Modeling of data in a top down fashion
ο Conceptual Model - The highest and least granular
model; it defines the master reference data entities that are
commonly used in the problem space.
ο Logical Model - Generally builds on the conceptual
model by adding granular details like operational and
transactional data entities.
ο Physical Model - Specifies relational database objects such
as database tables, database indexes such as unique key indexes,
and database constraints.
ο The models can be visualized through what is commonly known
as ER-Diagrams.
16. ο Process for organizing the attributes and tables of a relational
database to minimize redundancy and dependency.
ο Objectives (as specified by Codd)
ο To free the collection of relations from undesirable insertion, update
and deletion dependencies.
ο To reduce the need for restructuring the collection of relations, as new
types of data are introduced, and thus increase the life span of
application programs.
ο To make the relational model more informative to users.
ο To make the collection of relations neutral to the query statistics, where
these statistics are liable to change as time goes by.
ο Normal Forms (NF)
ο 1NF - every attribute contains atomic values only
ο 2NF - 1NF + every non-key attribute is fully dependent on the whole primary key
ο 3NF - 2NF + every non-key attribute is non-transitively dependent on
the primary key
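A small sketch of the update dependency that normalization is meant to remove, using plain Python structures. The employee and department data are invented for illustration. In the flat layout the city depends on the department, which in turn depends on the employee (a transitive dependency), so the same fact is stored more than once.

```python
# Denormalized: the department's city is repeated on every employee row.
flat = [
    {"emp": "Asha", "dept": "Eng", "dept_city": "Pune"},
    {"emp": "Ravi", "dept": "Eng", "dept_city": "Pune"},
]

# Updating only one copy leaves the data contradicting itself (update anomaly).
flat[0]["dept_city"] = "Chennai"
cities = {row["dept_city"] for row in flat if row["dept"] == "Eng"}

# Normalized (3NF): the city is stored once, keyed by department.
employees = [{"emp": "Asha", "dept": "Eng"}, {"emp": "Ravi", "dept": "Eng"}]
departments = {"Eng": {"city": "Pune"}}

# A single update is now visible through every employee.
departments["Eng"]["city"] = "Chennai"
normalized_cities = {departments[e["dept"]]["city"] for e in employees}
```

After the partial update the flat layout reports two different cities for the same department, while the normalized layout can only ever report one.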
17. ο Properties that guarantee that database transactions are processed
reliably.
ο A single logical operation (possibly involving multiple steps) is called a transaction.
ο Properties
ο Atomicity - "All or Nothing" - If one part of the transaction fails, the entire
transaction fails.
ο Consistency - Any data written to the database must be valid according
to all defined rules and constraints.
ο Isolation - Even during concurrent execution, the system ends in a
state that is the same as the state that would be obtained if the transactions were executed
serially.
ο Durability - Once a transaction has been committed, the results will
be stored permanently irrespective of errors and crashes that can occur
post commit.
ο In RDBMS, ACID properties are implemented using various
techniques like locking and multiversioning
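Atomicity can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The account table and the transfer and balances helpers are hypothetical names for this example. The credit is deliberately applied before the debit, so that a failing debit shows the earlier step being rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(src, dst, amount):
    """Move money between accounts as one transaction; True on success."""
    try:
        # 'with conn' commits if the block succeeds and rolls back on an exception.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The debit broke the CHECK constraint, so the credit was undone as well.
        return False


def balances():
    """Return the current balances as {account id: balance}."""
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))


ok = transfer(1, 2, 30)           # both updates are applied
after_ok = balances()
failed = transfer(1, 2, 500)      # debit would go negative: nothing is applied
after_failed = balances()
```

The second transfer leaves the balances exactly as they were before it started, even though its first update had already run.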
20. ο RDBMS based solutions are generally the first choice for
database storage/access needs
ο RDBMS solutions are now mature and predictable.
ο An army of skilled specialists exists for using,
managing and maintaining RDBMS based systems
ο RDBMS has spawned an ecosystem of products that
makes choosing RDBMS a no-brainer
21. ο Ensures Consistent behavior
ο With the table structure as the base, RDBMS provides a consistent mechanism for
storing and accessing different data sets.
ο Removes Redundancies
ο Through normal forms, redundancies in the data are removed, thereby addressing
the errors that can arise from inconsistency in the data stored
ο Avoid errors
ο Ensures Data integrity and quality by ensuring consistent storage, enforcing
constraints and relationships and with ability to check data as they are entered
ο Facilitates Easy analysis
ο With SQL based queries as the foundation, analyzing different data sets is seamless.
Also, given the history of RDBMS, users have a vast repository of tools with which to
perform analysis.
ο Ensures Robust Maintenance and Management
ο Database administrators are provided with tools that enable them to easily
maintain, test, repair and back up the databases housed in the system.
ο Is Secure
ο Offers a good level of security and access control. The whole or part of the data can be
securely shared across multiple users (or applications) based on the privileges granted
to them.
23. ο Rise of social networks during the early 2000s
ο World Wide Web acts as the foundation
ο Shift in communication patterns
ο Sharing of personal information and usage of the same
ο Everyone turned into a publisher
ο Increased focus around personalization
ο Recommendations, Ratings, Preferences and providing
Personalized interfaces
ο Big Data Flood
ο More data is being generated now than was generated in all of
prior human history
ο Need to store and process unstructured or semi-structured data at
volumes not previously anticipated and at frequencies not
previously encountered
25. ο Accessible by users across the globe
ο Geography is irrelevant
ο Facebook, Google, Yahoo, Twitter, etc. have users across the world
ο Highly networked and distributed systems
ο Systems are accessed and connected over the Internet
ο Need to be highly scalable
ο Should be able to handle additional load without redesign
ο Amazon sees a manifold increase in traffic to the site during the holiday seasons
ο Expected to be highly available
ο Systems will be available for access and operations always
ο Google will incur a huge revenue and credibility loss if the site goes down
ο Handle large data sets hitting the systems with high frequency
ο The data needs to be stored and processed very quickly
ο The number of likes and comments on Facebook has exceeded 2.7 billion per day
27. ο Brewer's CAP Theorem
ο You can get only two out of the following three
ο Consistency - Every node sees the same data at the same time; a read returns the most recent write.
ο Availability - Every request receives a response; the system is always available for operations.
ο Partition Tolerance - The system keeps working when some nodes are not
accessible.
ο RDBMS were essentially designed for CA
ο Latency (response times) is an unfortunate tradeoff for
consistency
ο Partition tolerance becomes essential in distributed
systems
28. ο Beyond a point you cannot afford to Scale up storage
ο It becomes very expensive to keep scaling up.
ο Is strict consistency really so important?
ο Ensuring consistency slows the system
ο Google found that moving from a 10-result page loading in 0.4 seconds to
a 30-result page loading in 0.9 seconds decreased traffic and ad revenues
by 20% (Linden 2006)
ο Redundancy can be managed
ο Joins across normalized database tables are less efficient than reading
from a denormalized data store
ο Not All data is relational
ο Fitting every kind of data under the Rigid Schema structure of RDBMS is
a challenge
ο Remodeling data read from an RDBMS back into its original model (say tree,
graph, or key-value) puts significant stress on computing resources.
ο Attributes (columns) are restricted by domain to store similar data.
ο Managing semi-structured and unstructured data like documents becomes a
challenge.
29. ο CRUD (Create, Read, Update, Delete) is crude
ο Updates and deletes should never be allowed as they destroy
information.
ο Logical and physical separation of concerns ignored
ο Relational model is a logical model
ο Database products implemented the relational model at the physical
level as a set of B-tree files with multiple indexes.
ο Induces artificial overhead onto managing the database.
ο It is built around spinning disks
ο All RDBMS implementations assume that the data is coming from
disk
ο Legacy of an era when memory was expensive.
ο Memory based systems will be faster
ο Databases are big and slow
ο Fundamentally not designed for big data sets
ο Long queries get slower with more data
31. ο Core Tenets of BASE
ο Basically Available
ο The system seems to work all the time
ο Soft State
ο It doesn't have to be consistent all the time
ο Eventual Consistency
ο Becomes consistent eventually (at some later time)
ο Significance
ο BASE is diametrically opposed to ACID.
ο ACID is pessimistic and forces consistency at the end of every operation
ο BASE is optimistic and accepts that the database consistency will be in a
state of flux.
ο The availability is achieved through supporting partial failures
without total system failure
ο It is OK for the system to be available for 80% of users and limit failure
to 20% of the users.
ο Users should understand the implications of eventual consistency
ο It factors in a probability of data loss; safety of the data is the tradeoff
ο Need to understand how eventual "eventual" is
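Eventual consistency can be made concrete with a toy, single-process sketch. The classes Replica and EventuallyConsistentStore and the explicit sync step are hypothetical simplifications. Real systems replicate over a network and also have to resolve conflicting writes, which this sketch ignores.

```python
class Replica:
    """One copy of the data plus the writes it has not applied yet."""

    def __init__(self):
        self.data = {}
        self.pending = []


class EventuallyConsistentStore:
    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]

    def write(self, key, value):
        # The write is acknowledged once the first replica has it;
        # the other replicas only queue it for later (soft state).
        first, *rest = self.replicas
        first.data[key] = value
        for replica in rest:
            replica.pending.append((key, value))

    def read(self, key, replica_index):
        # A read may hit any replica, so it can return stale data.
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        # Background replication: every replica applies its queued writes.
        for replica in self.replicas:
            for key, value in replica.pending:
                replica.data[key] = value
            replica.pending.clear()


store = EventuallyConsistentStore()
store.write("likes", 1)
fresh = store.read("likes", 0)          # replica that took the write
stale = store.read("likes", 2)          # replica that has not caught up yet
store.sync()
converged = store.read("likes", 2)      # same answer everywhere after sync
```

The gap between the write and the sync is the window in which readers can see different answers. "How eventual is eventual" is a question about the length of that window.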
32. ο NoSQL - Not Only SQL
ο It is not SQL and it is not Relational
ο Essential Feature set
ο Elastic Scaling - Rely on Scale out rather than Scale up
ο Big Data - Handle High Volume, High Velocity, High Variability
ο Commoditize Manageability - Reduce dependence on highly skilled
DBAs and lower administration costs
ο Economics - Build over commodity hardware
ο Flexible data model - Remove data model based restrictions.
ο Applicability
ο Performance and real time nature over consistency
ο High scalability
ο Store and retrieve large data sets
ο Does not require a relational model
33. ο Key Value
ο Idea is to use a hash table where there is a unique key and a pointer to a
particular item of data. Simplest to implement.
ο It is inefficient when you are only interested in querying or updating part
of a value
ο Column Store
ο Created to store and process very large amounts of data distributed over
many machines
ο Keys are still used, but they point to multiple columns.
ο The columns are arranged by column family.
ο Document
ο The model is basically versioned documents that are collections of other
key-value collections.
ο The semi-structured documents are stored in formats like JSON,
allowing nested values associated with each key.
ο Document databases support querying more efficiently than key-value stores.
ο Graph
ο A flexible graph model is used which, again, can scale across multiple
machines
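The difference between the key-value and document models can be sketched in plain Python. The dictionaries kv and docs and the helper find are hypothetical stand-ins for a real store. The point is that a key-value store treats the value as an opaque blob, while a document store understands its nested structure.

```python
import json

# Key-value model: the store sees the value only as an opaque string.
kv = {}
kv["user:1"] = json.dumps({"name": "Asha", "address": {"city": "Pune"}})

# Changing one field means reading, decoding, editing and rewriting the whole value.
profile = json.loads(kv["user:1"])
profile["address"]["city"] = "Chennai"
kv["user:1"] = json.dumps(profile)

# Document model: the store keeps the nested structure and can query inside it.
docs = {
    "user:1": {"name": "Asha", "address": {"city": "Chennai"}},
    "user:2": {"name": "Ravi", "address": {"city": "Pune"}},
    "user:3": {"name": "Mei", "address": {"city": "Pune"}},
}


def find(collection, path, expected):
    """Return ids of documents whose nested field at a dotted path equals expected."""
    matches = []
    for doc_id, doc in collection.items():
        node = doc
        for part in path.split("."):
            node = node.get(part) if isinstance(node, dict) else None
            if node is None:
                break
        if node == expected:
            matches.append(doc_id)
    return matches


in_pune = find(docs, "address.city", "Pune")
updated_city = json.loads(kv["user:1"])["address"]["city"]
```

A real document database would typically back such a query with an index rather than a scan. The scan here only shows that the query is expressible at all.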
34. Access Interfaces: REST/HTTP, Thrift, Map Reduce, Language Specific API
Logical Data Model: Key Value, Column Family Store, Document, Graph
Support and Distribution: CAP Support, Multi Data Center Support, Dynamic Provisioning, Proactive Monitoring
Data Persistence: Memory Based, Disk Based, Combination of Memory and Disk
35. NoSQL
Key Value: MemCached, Redis, SimpleDB, Tokyo Cabinet, Dynamo, Voldemort
Column Store: SimpleDB, BigTable, HBase, Cassandra, HyperTable, Azure TS
Document: CouchDB, MongoDB, Lotus Domino, Riak
Graph: Neo4J, InfoGrid, FlockDB, InfiniteGraph
37. ο It is not Mature
ο RDBMS is mature, stable and functionally rich.
ο Most NoSQL alternatives are in pre-production versions with many key
features yet to be implemented.
ο Support
ο Most NoSQL systems are open source projects.
ο Support mostly offered by startup companies, with reach and
credibility not on par with RDBMS Vendors.
ο Analytics
ο NoSQL databases offer few facilities for ad-hoc query and analysis.
ο Even a simple query requires significant programming expertise.
ο At present, commonly used BI tools do not provide credible
connectivity to NoSQL.
ο Administration and Maintenance
ο The desired goal of zero maintenance is far away.
ο In reality, significant effort is required to maintain the systems.
ο Expertise
ο Currently very limited awareness and knowledge
38. ο Scalability
ο Master Slave - One master many slaves
ο Write to master; Read from any of the slaves
ο Partitioning - Group and localize related functions across nodes
ο Partition Vertically (by functions) or Horizontally (by keys)
ο Caching - Memory based cache in front of the Database
ο Address scaling issues due to read and write loads
ο High Availability
ο Clustering - Group of systems responsible for a service
ο Build redundancy into a cluster to eliminate single points of failure
ο Mirroring and Replication - Maintain a hot standby
ο Handle planned or unplanned downtimes
ο Recovery Solutions - dependable data backup, restore, and
recovery procedures
ο Combine process with tools
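Horizontal partitioning by key, as listed under Scalability above, can be sketched as follows. ShardedStore and its methods are hypothetical names, and each in-memory dictionary stands in for a separate database node. A production setup would also need a strategy for adding or removing shards, which plain modulo hashing handles poorly.

```python
import hashlib


class ShardedStore:
    """Routes each key to one of several shards based on a hash of the key."""

    def __init__(self, shard_count):
        # Each dict stands in for an independent database node.
        self.shards = [{} for _ in range(shard_count)]

    def shard_for(self, key):
        # A stable hash keeps the same key on the same shard across runs.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        self.shards[self.shard_for(key)][key] = value

    def get(self, key):
        return self.shards[self.shard_for(key)].get(key)


store = ShardedStore(shard_count=4)
for user_id in range(100):
    store.put(f"user:{user_id}", {"id": user_id})

rows_per_shard = [len(shard) for shard in store.shards]
sample = store.get("user:42")
```

Each read or write touches exactly one shard, which is why the load spreads across nodes. Queries that span many keys, by contrast, have to visit several shards.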
39. ο Performance
ο Be open to denormalization - and accelerate reads
ο Allow redundancy and duplicates to reduce joins
ο Optimize your costly queries - Analyze and optimize the expensive
queries
ο Use a mix of design strategy, indices, and analysis from query optimization tools
ο Invest in better hardware - storage and memory
ο It is not a bad bet - Storage and memory costs have dropped significantly
ο Rigid Schemas - Not all data is relational
ο Even the most schema-less model has some schema
ο The world revolves around structure
ο If a key-value kind of store is needed, you can build the same in any
RDBMS
ο The RDBMS will provide the added advantage of structured access and queries
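The last point, building a key-value store on top of an RDBMS, can be sketched with Python's built-in sqlite3 module. The table kv and the helpers put and get are hypothetical names for this example. INSERT OR REPLACE is SQLite-specific syntax, and other databases spell the upsert differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A two-column table is all a key-value store needs.
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT NOT NULL)")


def put(key, value):
    """Insert the key, or overwrite its value if it already exists."""
    with conn:
        conn.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))


def get(key):
    """Return the stored value, or None if the key is absent."""
    row = conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    return row[0] if row else None


put("session:1", "alice")
put("session:2", "bob")
put("session:1", "carol")   # overwrites the earlier value

# The added advantage: structured queries over the keys come for free.
sessions = conn.execute(
    "SELECT k, v FROM kv WHERE k LIKE 'session:%' ORDER BY k"
).fetchall()
```

The put and get calls behave like a plain key-value API, while the final query shows the structured access that a bare key-value store would not offer.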
40. ο Systems eventually will gravitate towards one of these three
ο Fast, agile, highly scalable data stores
ο Handlers of complex transactional semantics
ο Analytical processors and facilitators
ο The world is never binary
ο It is never either this or that.
ο Why fight over technicalities
ο Drive decisions based on use cases
ο Choose a model based on the use cases and scenarios
ο Research and understand what your application needs
ο Stay away from substituting "Hard work" with "Rhetoric"
ο Be open to experimentation
42. ο http://www.guug.de/lokal/muenchen/2007-05-14/rdbmsc.pdf
ο http://ansonalex.com/infographics/twitter-usage-statistics-2012-infographic/
ο http://www.mountainman.com.au/software/history/it1.html
ο http://www.slideshare.net/renguzi/codd
ο http://cims.clayton.edu/booth/ITDB%204201/Codd%20PDF.pdf
ο http://www.scribd.com/doc/19381895/RDBMS-Concepts
ο http://www.gitta.info/DBSysConcept/en/text/DBSysConcept.pdf
ο http://en.wikipedia.org/wiki/Relational_database
ο http://en.wikipedia.org/wiki/ACID
ο http://blogs.hbr.org/now-new-next/2009/05/the-social-data-revolution.html
ο http://www.go-gulf.com/blog/60-seconds
ο http://en.wikipedia.org/wiki/CAP_theorem
ο http://highscalability.com/drop-acid-and-think-about-data
ο http://queue.acm.org/detail.cfm?id=1394128
ο http://www.bailis.org/blog/safety-and-liveness-eventual-consistency-is-not-safe/
ο http://www.techrepublic.com/blog/10things/10-things-you-should-know-about-nosql-databases/1772
ο http://rebelic.nl/engineering/the-four-categories-of-nosql-databases/
ο http://www.slideshare.net/ksankar/nosql-4559402
ο http://www.thevirtualcircle.com/2008/11/10/6-reasons-why-relational-database-will-be-superseded/
ο http://www.slideshare.net/sbtourist/scale-your-database-and-be-happy
ο Note:
Many images used in the deck were found through Google image search. Although I have not been able to
credit the sources of all the images individually, I extend my sincere thanks to the owners of the images for making
them available on the net.