SlideShare una empresa de Scribd logo
1 de 41
Distributed Databases
Daniel Marcous
What?
Introduction
A distributed database is a database
in which storage devices are not all
attached to a common processing
unit such as the CPU, controlled by a
distributed database management
system.
Definitions
● RDBMS - Relational Database Management System
● DDB - Distributed Database
● Node - a unit in a distributed system (mainly a single server)
● DDBMS - Distributed Database Management System
○ In charge of managing the different DDB nodes as one integrated system
● Centralized System - data is stored in one place
● Homogenous system - built of parts (nodes) that all act the same way / consist of
the same hardware (Opposite of Heterogeneous).
Understanding the vocabulary
Basic Concepts
Distributed Database Concepts
● Number of processing elements (database nodes)
● Connection between nodes over a computer network
● Logical interrelation between different database nodes
● Absence of node homogeneity
Types of Distributed Databases
Multiprocessing Systems
● Parallel Systems
○ Shared Memory (tightly coupled) - multiple processors share the same main memory
○ Shared Disk (loosely coupled) - multiple processors share the same secondary disk storage
● Truly Distributed Systems
○ Shared Nothing - each processor with its own memory and disk,
interrelations are only through network (no SPOF)
● Distribution - Data and software distributed over multiple nodes
● Autonomy - Provision DBMS as one whole VS multiple standalone DBMSs
● Heterogeneity - use of different software / hardware on different nodes
Classification of Distributed Systems
Why?
The power of distribution
Reasons for choosing a distributed
database over a “plain” centralized
database.
Advantages
● More computing power
○ CPU
○ Memory
○ Storage
○ Network bandwidth
● Parallelism
○ Inter-query
○ Intra-query
Performance
Ease of use / development
● Transparency
● Geographically distributed sites
● Backups
● Elasticity
○ Growing
○ Shrinking
Challenges
● Transparency - One software (Ring) to rule them all
○ Management - one command
○ Data - one query
● Autonomy - Degree of Independence
○ Different settings / configurations / Cache size
○ “Master” node / Master Election
● Keeping track of data distribution
○ which server has the table / partition I need?
Management Challenges
● Reliability - Probability of failures
○ Does one server failure affects the whole system? (“Freeze”)
● Availability - Percent of time when a data source is available
○ If a node goes down, does it’s data get lost? unavailable until its up again?
● Recovery
○ What is a single point of time?
○ Nodes clocks Synchronisation (NTP)
● Transaction Management - Server X must assure that the data is “safe” and no
Complex Features Implementation
Scaling
● Synchronisation Overhead
CAP Theorem
● Eric Brewer (Berkeley->Yahoo->Google)
○ C - a read see all previously completed writes
○ A - reads and writes always succeed
○ P - read and write while network is down
● Choose 2! (2000)
● Sorry, actually only C or A… (2012)
How?
Internals
How does a distributed database
work?
● Advanced Concepts
● Architectures
Advanced Concepts
Replication
● Assumptions
○ Nodes will fail
○ Commodity Hardware - prone to failure
● Settings
○ Replication Factor
○ Data / Actions /Apply logs
○ Synchronous / ASynchronous
○ Delay
Fragmentation
● Dividing a single Data Object (Table/ File) into multiple parts
● Types
○ Horizontal - row wise
○ Vertical - column wise (Vertica/ Parquet)
○ Hybrid - both
● Advantages
○ Reports on part of the data - horizontal
○ Increased parallelism - multiple physical files
Distributed Processing
● Access by key Only!
○ Using Hash Tables
■ keys are hashed and spread (=sharded) across nodes
■ result of hash tells you which node to access
■ Hash maps exist on every node / client
● Batch Processing
○ MapReduce
■ Map - partition by key
Data Locality
● Local storage (VS centralised storage controller)
○ Bring the processing to the data
○ Free bandwidth
● Smart Load Balancing
○ Route users to the “closest” node with the data (replication duh..)
● Data sorted by Key /Hash Key
○ Same / Close enough key = Same node
○ “Process” all the rides in the TLV area
ACID
BASE● Atomicity
○ Transactions
● Consistency
○ Locked until done
● Isolation
○ No interference
● Durability
○ Completed = Persistent
● Basic Availability
○ Response to every request
● Soft State
○ States change, results are
not determinant
● Eventual Consistency
○ Consistent state may take
time but is promised
○ (CAS - Compare & Swap
Operations exist)
Architectures
Plain Old Centralized Database
● Oracle
● SQLServer (MS)
● DB2 (IBM)
● MySQL
● PostgreSQL
Relational (ACID) “Distributed” Database
● Oracle RAC (Real Applications Cluster)
● DB2 Data Sharing
● PostgresXL
Federated Database System
● IBM IIDR
Data Warehouse
● Oracle Exadata
● Teradata
● SQL Data Warehouse (MS)
● Vertica (HP)
● Greenplum (EMC)
Interactive Multiple Parallel Processing (MPP)
● Dremel (Big Query, Google)
● Redshift (Amazon)
● Presto (Facebook)
● Impala (Cloudera)
NoSQL (BASE) Shared Nothing Database
● MongoDB
● CouchBase
● Cassandra
● HBase
When?/Where?
History and Present
Where did the ideas come from and
what do we have present for use
nowadays?
The Founding Fathers
Articles
● Old School
○ Fundamentals of Database Systems (1989)
○ Principles of Distributed Database Systems (1991)
● Distributed File System
○ The Google file system (2003)
● Distributed Processing
○ MapReduce: simplified data processing on large clusters (2004)
● Interactive Querying on large scale
Adopters
● Document DB (Mostly JSON)
○MongoDB
○CouchBase
● Key-Value DB
○Cassandra
○HBase
● Graph DB
○Neo4J
NoSQL – Database Types
Known Users
Big Guys
● Google - Inside tools
○ MapReduce
○ Dremel -> Big Query
○ Flume -> DataFlow
● Facebook - Inside tools open-sourced and modified
○ Cassandra -> HBase
○ Presto
● Yahoo - Hadoop / HBase
● IDF
● Waze
● Viber - Couchbase
● Liveperson - MongoDB, CouchBase
● SimilarWeb - HBase
Israel
Distribution is
awesome, but
requires complex
skills to do right.
Don’t overkill it.

Más contenido relacionado

La actualidad más candente

Distributed Database System
Distributed Database SystemDistributed Database System
Distributed Database SystemSulemang
 
Bigdata analytics K.kiruthika 2nd M.Sc.,computer science Bon secoures college...
Bigdata analytics K.kiruthika 2nd M.Sc.,computer science Bon secoures college...Bigdata analytics K.kiruthika 2nd M.Sc.,computer science Bon secoures college...
Bigdata analytics K.kiruthika 2nd M.Sc.,computer science Bon secoures college...Kiruthikak14
 
Distributed database system
Distributed database systemDistributed database system
Distributed database systemM. Ahmad Mahmood
 
Distributed database
Distributed databaseDistributed database
Distributed databasesanjay joshi
 
Lecture 11 - distributed database
Lecture 11 - distributed databaseLecture 11 - distributed database
Lecture 11 - distributed databaseHoneySah
 
Database , 1 Introduction
 Database , 1 Introduction Database , 1 Introduction
Database , 1 IntroductionAli Usman
 
Ddb 1.6-design issues
Ddb 1.6-design issuesDdb 1.6-design issues
Ddb 1.6-design issuesEsar Qasmi
 
Dynamo and BigTable - Review and Comparison
Dynamo and BigTable - Review and ComparisonDynamo and BigTable - Review and Comparison
Dynamo and BigTable - Review and ComparisonGrisha Weintraub
 
Distributed Database
Distributed DatabaseDistributed Database
Distributed DatabaseJovyLee4
 
Introduction to distributed database
Introduction to distributed databaseIntroduction to distributed database
Introduction to distributed databaseSonia Panesar
 
Big data analytics K.Kiruthika II-M.Sc.,Computer Science Bonsecours college f...
Big data analytics K.Kiruthika II-M.Sc.,Computer Science Bonsecours college f...Big data analytics K.Kiruthika II-M.Sc.,Computer Science Bonsecours college f...
Big data analytics K.Kiruthika II-M.Sc.,Computer Science Bonsecours college f...Kiruthikak14
 
Difference between Homogeneous and Heterogeneous
Difference between Homogeneous  and    HeterogeneousDifference between Homogeneous  and    Heterogeneous
Difference between Homogeneous and HeterogeneousFaraz Qaisrani
 
A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.Navdeep Charan
 

La actualidad más candente (20)

Distributed database
Distributed databaseDistributed database
Distributed database
 
Distributed Database System
Distributed Database SystemDistributed Database System
Distributed Database System
 
Bigdata analytics K.kiruthika 2nd M.Sc.,computer science Bon secoures college...
Bigdata analytics K.kiruthika 2nd M.Sc.,computer science Bon secoures college...Bigdata analytics K.kiruthika 2nd M.Sc.,computer science Bon secoures college...
Bigdata analytics K.kiruthika 2nd M.Sc.,computer science Bon secoures college...
 
Database System Architectures
Database System ArchitecturesDatabase System Architectures
Database System Architectures
 
Distributed database system
Distributed database systemDistributed database system
Distributed database system
 
Distributed database
Distributed databaseDistributed database
Distributed database
 
Lecture 11 - distributed database
Lecture 11 - distributed databaseLecture 11 - distributed database
Lecture 11 - distributed database
 
Nosql databases
Nosql databasesNosql databases
Nosql databases
 
Database , 1 Introduction
 Database , 1 Introduction Database , 1 Introduction
Database , 1 Introduction
 
Ddb 1.6-design issues
Ddb 1.6-design issuesDdb 1.6-design issues
Ddb 1.6-design issues
 
Distributed DBMS - Unit 1 - Introduction
Distributed DBMS - Unit 1 - IntroductionDistributed DBMS - Unit 1 - Introduction
Distributed DBMS - Unit 1 - Introduction
 
Dynamo and BigTable - Review and Comparison
Dynamo and BigTable - Review and ComparisonDynamo and BigTable - Review and Comparison
Dynamo and BigTable - Review and Comparison
 
Distributed Database
Distributed DatabaseDistributed Database
Distributed Database
 
Introduction to distributed database
Introduction to distributed databaseIntroduction to distributed database
Introduction to distributed database
 
Big data analytics K.Kiruthika II-M.Sc.,Computer Science Bonsecours college f...
Big data analytics K.Kiruthika II-M.Sc.,Computer Science Bonsecours college f...Big data analytics K.Kiruthika II-M.Sc.,Computer Science Bonsecours college f...
Big data analytics K.Kiruthika II-M.Sc.,Computer Science Bonsecours college f...
 
Parallel databases
Parallel databasesParallel databases
Parallel databases
 
Cassandra
CassandraCassandra
Cassandra
 
Difference between Homogeneous and Heterogeneous
Difference between Homogeneous  and    HeterogeneousDifference between Homogeneous  and    Heterogeneous
Difference between Homogeneous and Heterogeneous
 
Key-Value NoSQL Database
Key-Value NoSQL DatabaseKey-Value NoSQL Database
Key-Value NoSQL Database
 
A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.
 

Similar a Distributed Databases - Concepts & Architectures

An Introduction to Apache Cassandra
An Introduction to Apache CassandraAn Introduction to Apache Cassandra
An Introduction to Apache CassandraSaeid Zebardast
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriDemi Ben-Ari
 
The Hows and Whys of a Distributed SQL Database - Strange Loop 2017
The Hows and Whys of a Distributed SQL Database - Strange Loop 2017The Hows and Whys of a Distributed SQL Database - Strange Loop 2017
The Hows and Whys of a Distributed SQL Database - Strange Loop 2017Alex Robinson
 
Mesos - A Platform for Fine-Grained Resource Sharing in the Data Center
Mesos - A Platform for Fine-Grained Resource Sharing in the Data CenterMesos - A Platform for Fine-Grained Resource Sharing in the Data Center
Mesos - A Platform for Fine-Grained Resource Sharing in the Data CenterAnkur Chauhan
 
Cassandra background-and-architecture
Cassandra background-and-architectureCassandra background-and-architecture
Cassandra background-and-architectureMarkus Klems
 
Handling the growth of data
Handling the growth of dataHandling the growth of data
Handling the growth of dataPiyush Katariya
 
Productionizing dl from the ground up
Productionizing dl from the ground upProductionizing dl from the ground up
Productionizing dl from the ground upAdam Gibson
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysDemi Ben-Ari
 
Scalability broad strokes
Scalability   broad strokesScalability   broad strokes
Scalability broad strokesGagan Bajpai
 
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...raghdooosh
 
MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017Severalnines
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoopMohit Tare
 
A tour of Amazon Redshift
A tour of Amazon RedshiftA tour of Amazon Redshift
A tour of Amazon RedshiftKel Graham
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.pptvijayapraba1
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAsLuis Marques
 
9780538469685 ppt ch12 1er exa
9780538469685 ppt ch12 1er exa9780538469685 ppt ch12 1er exa
9780538469685 ppt ch12 1er exacarldevsco63
 

Similar a Distributed Databases - Concepts & Architectures (20)

An Introduction to Apache Cassandra
An Introduction to Apache CassandraAn Introduction to Apache Cassandra
An Introduction to Apache Cassandra
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
The Hows and Whys of a Distributed SQL Database - Strange Loop 2017
The Hows and Whys of a Distributed SQL Database - Strange Loop 2017The Hows and Whys of a Distributed SQL Database - Strange Loop 2017
The Hows and Whys of a Distributed SQL Database - Strange Loop 2017
 
Mesos - A Platform for Fine-Grained Resource Sharing in the Data Center
Mesos - A Platform for Fine-Grained Resource Sharing in the Data CenterMesos - A Platform for Fine-Grained Resource Sharing in the Data Center
Mesos - A Platform for Fine-Grained Resource Sharing in the Data Center
 
Hadoop-2.6.0 Slides
Hadoop-2.6.0 SlidesHadoop-2.6.0 Slides
Hadoop-2.6.0 Slides
 
Cassandra background-and-architecture
Cassandra background-and-architectureCassandra background-and-architecture
Cassandra background-and-architecture
 
Handling the growth of data
Handling the growth of dataHandling the growth of data
Handling the growth of data
 
Productionizing dl from the ground up
Productionizing dl from the ground upProductionizing dl from the ground up
Productionizing dl from the ground up
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
 
Scalability broad strokes
Scalability   broad strokesScalability   broad strokes
Scalability broad strokes
 
NoSQL.pptx
NoSQL.pptxNoSQL.pptx
NoSQL.pptx
 
NoSQL Evolution
NoSQL EvolutionNoSQL Evolution
NoSQL Evolution
 
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
 
MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
A tour of Amazon Redshift
A tour of Amazon RedshiftA tour of Amazon Redshift
A tour of Amazon Redshift
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAs
 
9780538469685 ppt ch12 1er exa
9780538469685 ppt ch12 1er exa9780538469685 ppt ch12 1er exa
9780538469685 ppt ch12 1er exa
 
NOsql Presentation.pdf
NOsql Presentation.pdfNOsql Presentation.pdf
NOsql Presentation.pdf
 

Más de Daniel Marcous

Cloud AI Platform Notebooks - Kaggle IL
Cloud AI Platform Notebooks - Kaggle ILCloud AI Platform Notebooks - Kaggle IL
Cloud AI Platform Notebooks - Kaggle ILDaniel Marcous
 
Towards Smart Transportation DSS 2018
Towards Smart Transportation DSS 2018Towards Smart Transportation DSS 2018
Towards Smart Transportation DSS 2018Daniel Marcous
 
Prediction of taxi rides ETA
Prediction of taxi rides ETAPrediction of taxi rides ETA
Prediction of taxi rides ETADaniel Marcous
 
Distributed K-Betweenness (Spark)
Distributed K-Betweenness (Spark)Distributed K-Betweenness (Spark)
Distributed K-Betweenness (Spark)Daniel Marcous
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroDaniel Marcous
 
Big Data - Big Insights - Waze @Google
Big Data - Big Insights - Waze @GoogleBig Data - Big Insights - Waze @Google
Big Data - Big Insights - Waze @GoogleDaniel Marcous
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architecturesDaniel Marcous
 

Más de Daniel Marcous (10)

Cloud AI Platform Notebooks - Kaggle IL
Cloud AI Platform Notebooks - Kaggle ILCloud AI Platform Notebooks - Kaggle IL
Cloud AI Platform Notebooks - Kaggle IL
 
S2
S2S2
S2
 
Towards Smart Transportation DSS 2018
Towards Smart Transportation DSS 2018Towards Smart Transportation DSS 2018
Towards Smart Transportation DSS 2018
 
Prediction of taxi rides ETA
Prediction of taxi rides ETAPrediction of taxi rides ETA
Prediction of taxi rides ETA
 
Distributed K-Betweenness (Spark)
Distributed K-Betweenness (Spark)Distributed K-Betweenness (Spark)
Distributed K-Betweenness (Spark)
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
 
Big Data - Big Insights - Waze @Google
Big Data - Big Insights - Waze @GoogleBig Data - Big Insights - Waze @Google
Big Data - Big Insights - Waze @Google
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
Data Visualisation
Data VisualisationData Visualisation
Data Visualisation
 
Geo data analytics
Geo data analyticsGeo data analytics
Geo data analytics
 

Último

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfSubhamKumar3239
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 

Último (20)

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdf
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 

Distributed Databases - Concepts & Architectures

  • 2. What? Introduction A distributed database is a database in which storage devices are not all attached to a common processing unit such as the CPU, controlled by a distributed database management system.
  • 4. ● RDBMS - Relational Database Management System ● DDB - Distributed Database ● Node - a unit in a distributed system (mainly a single server) ● DDBMS - Distributed Database Management System ○ In charge of managing the different DDB nodes as one integrated system ● Centralized System - data is stored in one place ● Homogenous system - built of parts (nodes) that all act the same way / consist of the same hardware (Opposite of Heterogeneous). Understanding the vocabulary
  • 6. Distributed Database Concepts ● Number of processing elements (database nodes) ● Connection between nodes over a computer network ● Logical interrelation between different database nodes ● Absence of node homogeneity
  • 8. Multiprocessing Systems ● Parallel Systems ○ Shared Memory (tightly coupled) - multiple processors share the same main memory ○ Shared Disk (loosely coupled) - multiple processors share the same secondary disk storage ● Truly Distributed Systems ○ Shared Nothing - each processor with its own memory and disk, interrelations are only through network (no SPOF)
  • 9. ● Distribution - Data and software distributed over multiple nodes ● Autonomy - Provision DBMS as one whole VS multiple standalone DBMSs ● Heterogeneity - use of different software / hardware on different nodes Classification of Distributed Systems
  • 10. Why? The power of distribution Reasons for choosing a distributed database over a “plain” centralized database.
  • 12. ● More computing power ○ CPU ○ Memory ○ Storage ○ Network bandwidth ● Parallelism ○ Inter-query ○ Intra-query Performance
  • 13. Ease of use / development ● Transparency ● Geographically distributed sites ● Backups ● Elasticity ○ Growing ○ Shrinking
  • 15. ● Transparency - One software (Ring) to rule them all ○ Management - one command ○ Data - one query ● Autonomy - Degree of Independence ○ Different settings / configurations / Cache size ○ “Master” node / Master Election ● Keeping track of data distribution ○ which server has the table / partition I need? Management Challenges
  • 16. ● Reliability - Probability of failures ○ Does one server failure affects the whole system? (“Freeze”) ● Availability - Percent of time when a data source is available ○ If a node goes down, does it’s data get lost? unavailable until its up again? ● Recovery ○ What is a single point of time? ○ Nodes clocks Synchronisation (NTP) ● Transaction Management - Server X must assure that the data is “safe” and no Complex Features Implementation
  • 18. CAP Theorem ● Eric Brewer (Berkeley->Yahoo->Google) ○ C - a read see all previously completed writes ○ A - reads and writes always succeed ○ P - read and write while network is down ● Choose 2! (2000) ● Sorry, actually only C or A… (2012)
  • 19. How? Internals How does a distributed database work? ● Advanced Concepts ● Architectures
  • 21. Replication ● Assumptions ○ Nodes will fail ○ Commodity Hardware - prone to failure ● Settings ○ Replication Factor ○ Data / Actions /Apply logs ○ Synchronous / ASynchronous ○ Delay
  • 22. Fragmentation ● Dividing a single Data Object (Table/ File) into multiple parts ● Types ○ Horizontal - row wise ○ Vertical - column wise (Vertica/ Parquet) ○ Hybrid - both ● Advantages ○ Reports on part of the data - horizontal ○ Increased parallelism - multiple physical files
  • 23. Distributed Processing ● Access by key Only! ○ Using Hash Tables ■ keys are hashed and spread (=sharded) across nodes ■ result of hash tells you which node to access ■ Hash maps exist on every node / client ● Batch Processing ○ MapReduce ■ Map - partition by key
  • 24. Data Locality ● Local storage (VS centralised storage controller) ○ Bring the processing to the data ○ Free bandwidth ● Smart Load Balancing ○ Route users to the “closest” node with the data (replication duh..) ● Data sorted by Key /Hash Key ○ Same / Close enough key = Same node ○ “Process” all the rides in the TLV area
  • 25. ACID BASE● Atomicity ○ Transactions ● Consistency ○ Locked until done ● Isolation ○ No interference ● Durability ○ Completed = Persistent ● Basic Availability ○ Response to every request ● Soft State ○ States change, results are not determinant ● Eventual Consistency ○ Consistent state may take time but is promised ○ (CAS - Compare & Swap Operations exist)
  • 27. Plain Old Centralized Database ● Oracle ● SQLServer (MS) ● DB2 (IBM) ● MySQL ● PostgreSQL
  • 28. Relational (ACID) “Distributed” Database ● Oracle RAC (Real Applications Cluster) ● DB2 Data Sharing ● PostgresXL
  • 30. Data Warehouse ● Oracle Exadata ● Teradata ● SQL Data Warehouse (MS) ● Vertica (HP) ● Greenplum (EMC)
  • 31. Interactive Multiple Parallel Processing (MPP) ● Dremel (Big Query, Google) ● Redshift (Amazon) ● Presto (Facebook) ● Impala (Cloudera)
  • 32. NoSQL (BASE) Shared Nothing Database ● MongoDB ● CouchBase ● Cassandra ● HBase
  • 33. When?/Where? History and Present Where did the ideas come from and what do we have present for use nowadays?
  • 35. Articles ● Old School ○ Fundamentals of Database Systems (1989) ○ Principles of Distributed Database Systems (1991) ● Distributed File System ○ The Google file system (2003) ● Distributed Processing ○ MapReduce: simplified data processing on large clusters (2004) ● Interactive Querying on large scale
  • 37. ● Document DB (Mostly JSON) ○MongoDB ○CouchBase ● Key-Value DB ○Cassandra ○HBase ● Graph DB ○Neo4J NoSQL – Database Types
  • 39. Big Guys ● Google - Inside tools ○ MapReduce ○ Dremel -> Big Query ○ Flume -> DataFlow ● Facebook - Inside tools open-sourced and modified ○ Cassandra -> HBase ○ Presto ● Yahoo - Hadoop / HBase
  • 40. ● IDF ● Waze ● Viber - Couchbase ● Liveperson - MongoDB, CouchBase ● SimilarWeb - HBase Israel
  • 41. Distribution is awesome, but requires complex skills to do right. Don’t overkill it.