1. RDBMS and Hadoop - Co-existence or competition
Ram Mohan
Copyright © 2011 Flytxt B.V. All rights reserved 1/16/2012
2. Session Agenda!
Introduction to RDBMS
What is Hadoop and Map-Reduce
Hadoop and RDBMS – A comparison
Co-Existence – Practical Example - Master Website
Q&A
3. Relational DBMS
Based on the mathematical theory of relations (set theory and predicate logic)
Data is represented in terms of rows and columns of a table
Relational Terminology
◦ Tuple (Row)
◦ Attribute (Column)
◦ Relation (Table)
Integrity Constraints
◦ Primary Key
◦ Foreign Key
◦ Alternate Key
ACID Test
◦ Atomicity
◦ Consistency
◦ Isolation
◦ Durability
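To make atomicity concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `account` table, the `transfer` helper and the balances are hypothetical and not part of the talk. A transfer credits one account and then debits the other; if the debit violates a constraint, the credit is rolled back as well.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing unit (hypothetical schema)."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            # If this debit breaks the CHECK constraint, the credit above is undone too.
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account ("
        "id TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 50)])
    conn.commit()

    ok = transfer(conn, "A", "B", 30)       # succeeds: A=70, B=80
    failed = transfer(conn, "A", "B", 500)  # debit would go negative, so nothing changes
    balances = dict(conn.execute("SELECT id, balance FROM account"))
    return ok, failed, balances
```

Calling `demo()` leaves the balances exactly as they were after the first, successful transfer; the half-applied credit from the second attempt never becomes visible.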
4. Normalization
Normalization - the process of removing data redundancy by decomposing the
relations in a database.
Denormalization - carefully introducing redundancy to improve query
performance.
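The two definitions can be sketched in a few lines of Python. The rows and the helper names below are invented for illustration: the flat rows repeat a customer's city on every order, `normalize` decomposes them into two relations, and `denormalize` reintroduces the redundancy so that a reader does not need a join.

```python
# Hypothetical denormalized rows: the customer's city is repeated on every order.
orders_flat = [
    {"order_id": 1, "customer": "Smith", "city": "London", "item": "Nut"},
    {"order_id": 2, "customer": "Smith", "city": "London", "item": "Bolt"},
    {"order_id": 3, "customer": "Jones", "city": "Paris", "item": "Screw"},
]


def normalize(rows):
    """Decompose flat rows into a customer relation and an order relation."""
    customers = {}  # customer -> city, stored exactly once
    orders = []
    for row in rows:
        customers[row["customer"]] = row["city"]
        orders.append(
            {"order_id": row["order_id"], "customer": row["customer"], "item": row["item"]}
        )
    return customers, orders


def denormalize(customers, orders):
    """Copy the city back onto each order, trading redundancy for a join-free read."""
    return [{**order, "city": customers[order["customer"]]} for order in orders]
```

After `normalize(orders_flat)`, changing Smith's city means updating one entry rather than every order row; `denormalize` rebuilds the original flat rows.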
5. Relational DBMS
6. Example Data
S#  SNAME  STATUS  CITY
S1  Smith  20      London
S2  Jones  10      Paris
S3  Blake  30      Paris

P#  PNAME  COLOR  WEIGHT  CITY
P1  Nut    Red    12      London
P2  Bolt   Green  17      Paris
P3  Screw  Blue   17      Rome
P4  Screw  Red    14      London

S#  P#  QTY
S1  P1  300
S1  P2  200
S1  P3  400
S2  P1  300
S2  P2  400
S3  P2  200
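As an illustration of how these three relations are queried together, the sketch below loads the example data into an in-memory SQLite database and joins suppliers to their shipments. The table names `s`, `p`, `sp` and the column names `sno`, `pno` are assumptions chosen for this sketch (the slide only shows the headers S# and P#).

```python
import sqlite3


def load_example(conn):
    """Create the three example relations and load the rows from the slide."""
    conn.executescript("""
        CREATE TABLE s  (sno TEXT PRIMARY KEY, sname TEXT, status INTEGER, city TEXT);
        CREATE TABLE p  (pno TEXT PRIMARY KEY, pname TEXT, color TEXT,
                         weight INTEGER, city TEXT);
        CREATE TABLE sp (sno TEXT REFERENCES s(sno), pno TEXT REFERENCES p(pno),
                         qty INTEGER, PRIMARY KEY (sno, pno));
    """)
    conn.executemany("INSERT INTO s VALUES (?, ?, ?, ?)", [
        ("S1", "Smith", 20, "London"),
        ("S2", "Jones", 10, "Paris"),
        ("S3", "Blake", 30, "Paris"),
    ])
    conn.executemany("INSERT INTO p VALUES (?, ?, ?, ?, ?)", [
        ("P1", "Nut", "Red", 12, "London"),
        ("P2", "Bolt", "Green", 17, "Paris"),
        ("P3", "Screw", "Blue", 17, "Rome"),
        ("P4", "Screw", "Red", 14, "London"),
    ])
    conn.executemany("INSERT INTO sp VALUES (?, ?, ?)", [
        ("S1", "P1", 300), ("S1", "P2", 200), ("S1", "P3", 400),
        ("S2", "P1", 300), ("S2", "P2", 400), ("S3", "P2", 200),
    ])
    conn.commit()


def quantity_per_supplier(conn):
    """Total shipped quantity per supplier, joining s with sp."""
    return conn.execute("""
        SELECT s.sname, SUM(sp.qty)
        FROM s JOIN sp ON sp.sno = s.sno
        GROUP BY s.sno, s.sname
        ORDER BY s.sno
    """).fetchall()


def parts_for_supplier(conn, sno):
    """Part names and quantities shipped by one supplier, joining sp with p."""
    return conn.execute("""
        SELECT p.pname, sp.qty
        FROM sp JOIN p ON p.pno = sp.pno
        WHERE sp.sno = ?
        ORDER BY p.pno
    """, (sno,)).fetchall()
```

With the data above, `quantity_per_supplier` returns 900 for Smith, 700 for Jones and 200 for Blake.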
7. Five computers & a 640k ;-)
"I think there is a world market for about five computers"
Thomas Watson 1943, Chairman of the board of IBM
Moore's Law
"640k ought to be enough for anybody"
Attributed to Bill Gates in 1981.
8. The Big Data Challenges
The sources of data and the amount of data to analyze are growing
exponentially
Stale data exists because DW solutions cannot ingest the vast amounts of
data fast enough
Lack of performance for advanced analytics and complex queries
The number of users and the number of concurrent users are increasing rapidly
10. Hadoop – HDFS (Hadoop Distributed File System)
Reliably store petabytes of replicated data across thousands of nodes
◦ Data divided into 64 MB blocks, each block replicated three times
Master/Slave architecture
◦ Master NameNode contains block locations
◦ Slave Datanode manages blocks on local FS
Built on local commodity hardware
◦ No RAID required
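The block and replication figures above translate into simple arithmetic. The sketch below is only a back-of-the-envelope helper (the function name and the 1000 MB file are made up); it assumes the 64 MB block size and three replicas quoted on the slide, and that the last block of a file only occupies as much space as the data it holds.

```python
import math

BLOCK_SIZE_MB = 64  # block size quoted on the slide
REPLICATION = 3     # replicas per block quoted on the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw MB stored) for a single file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    block_replicas = blocks * replication
    # The final block is not padded to full size, so raw usage is size x replication.
    raw_mb = file_size_mb * replication
    return blocks, block_replicas, raw_mb
```

For a hypothetical 1000 MB file this gives 16 blocks, 48 block replicas spread over the DataNodes, and 3000 MB of raw disk, which is the price paid for doing without RAID.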
12. Map-Reduce Model
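As a rough illustration of the model, the sketch below runs the three phases (map, shuffle, reduce) of a word count in a single Python process. It does not use the Hadoop API; the function names are invented, and on a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

The application author writes only `map_phase` and `reduce_phase`; distribution, grouping and fault tolerance are the framework's job.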
13. Hadoop – Limitations
Is not intended for realtime querying.
Does not support random access.
Significant learning curve
Provides barebones functionality out of the box but scaling is built-in and
inexpensive
14. Where SQL Makes life easy
Joining
◦ In a single query, get all products in an order with their product information
Secondary Indexing
◦ Get CustomerId by e-mail
Referential Integrity
Realtime Analysis.
Millions are trained in SQL and relational data modelling
RDBMS provides tremendous functionality, but is extremely difficult and
costly to scale
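The joining and secondary-indexing bullets above can be sketched with SQLite from Python's standard library. The schema, the index name `idx_customer_email` and the sample rows are hypothetical; the point is only that each task is a single declarative query.

```python
import sqlite3


def build(conn):
    """Create a small hypothetical shop schema with a secondary index on e-mail."""
    conn.executescript("""
        CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, email TEXT NOT NULL);
        CREATE INDEX idx_customer_email ON customer (email);
        CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT, price REAL);
        CREATE TABLE order_item (
            order_id   INTEGER,
            product_id INTEGER REFERENCES product(product_id),
            qty        INTEGER
        );
    """)
    conn.executemany("INSERT INTO customer VALUES (?, ?)",
                     [(1, "a@example.com"), (2, "b@example.com")])
    conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                     [(10, "Nut", 0.5), (11, "Bolt", 0.75)])
    conn.executemany("INSERT INTO order_item VALUES (?, ?, ?)",
                     [(100, 10, 4), (100, 11, 2), (101, 10, 1)])
    conn.commit()


def customer_id_by_email(conn, email):
    """Secondary indexing: look up the CustomerId by e-mail."""
    row = conn.execute(
        "SELECT customer_id FROM customer WHERE email = ?", (email,)
    ).fetchone()
    return row[0] if row else None


def products_in_order(conn, order_id):
    """Joining: all products in an order with their product information."""
    return conn.execute("""
        SELECT p.name, p.price, oi.qty
        FROM order_item AS oi JOIN product AS p ON p.product_id = oi.product_id
        WHERE oi.order_id = ?
        ORDER BY p.product_id
    """, (order_id,)).fetchall()


def lookup_plan(conn):
    """Return SQLite's query plan text for the e-mail lookup."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT customer_id FROM customer WHERE email = ?",
        ("a@example.com",),
    ).fetchall()
    return " ".join(str(row[-1]) for row in rows)
```

`lookup_plan` shows that the e-mail lookup is served by the index rather than a full table scan. In plain Map-Reduce, both the join and the lookup would have to be hand-written as jobs.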
15. Master Website – A Practical Example
16. Master Website – RDBMS Use Cases
Profile information – provided during sign-up
Generated intelligence, i.e. the output of the analytic jobs
Online purchase records and account management
Reporting tools
17. Master Website – Hadoop Use Cases
Generating Intelligence from the continuous stream of data
◦ Wall Posts on Facebook
Adding new tags by reprocessing the old logs when new requirements
arise
18. A Practical Example – Facebook Architecture
19. THANK YOU
Editor's Notes
No centralized control. Data redundancy. Data inconsistency. Data cannot be shared. Standards cannot be enforced. Security issues. Integrity cannot be maintained. Data dependence.
Centralized control. No data redundancy. Data consistency. Data can be shared. Standards can be enforced. Security can be enforced. Integrity can be maintained. Data independence.
Can all the data be structured? Will we be able to store all the data in tables, i.e. can we model all the data? Should we discard the data after extracting the required structured data from the log files, or should we archive it?
Take the example of students using the facilities provided by a college.
Two core components – HDFS & Map-Reduce
◦ Machines are unreliable
◦ Separates distributed fault-tolerant computing code from application logic
◦ No need to worry about the identity of a machine; lets you interact with a cluster, not a bunch of machines
◦ Analysis workloads span multiple machines
◦ Runs as a cloud (cluster) and possibly on a cloud (EC2)
Consumer interested in
◦ Social networking
◦ Online purchasing/booking
Service provider interested in
◦ Data
◦ Advertisements or revenue generation
◦ Reporting – for internal housekeeping
Challenges
◦ Recommendation – publishing those advertisements which the consumer looks at as information or is interested in