Concepts, architectures and uses of distributed databases. A gentle introduction to get you up to speed and help you understand the value and potential of distributed databases.
2. What?
Introduction
A distributed database is a database in which the storage devices are not all attached to a common processing unit such as a CPU.
It is controlled by a distributed database management system (DDBMS).
4. Understanding the vocabulary
● RDBMS - Relational Database Management System
● DDB - Distributed Database
● Node - a unit in a distributed system (mainly a single server)
● DDBMS - Distributed Database Management System
○ In charge of managing the different DDB nodes as one integrated system
● Centralized System - data is stored in one place
● Homogeneous system - built of parts (nodes) that all act the same way / consist of the same hardware (opposite of Heterogeneous)
6. Distributed Database Concepts
● Number of processing elements (database nodes)
● Connection between nodes over a computer network
● Logical interrelation between different database nodes
● Absence of a homogeneity constraint - nodes need not be identical in data, hardware or software
8. Multiprocessing Systems
● Parallel Systems
○ Shared Memory (tightly coupled) - multiple processors share the same main memory
○ Shared Disk (loosely coupled) - multiple processors share the same secondary disk storage
● Truly Distributed Systems
○ Shared Nothing - each processor has its own memory and disk; nodes interact only over the network (no single point of failure)
9. Classification of Distributed Systems
● Distribution - Data and software distributed over multiple nodes
● Autonomy - Provision DBMS as one whole VS multiple standalone DBMSs
● Heterogeneity - use of different software / hardware on different nodes
10. Why?
The power of distribution
Reasons for choosing a distributed database over a “plain” centralized database.
15. Management Challenges
● Transparency - One software (Ring) to rule them all
○ Management - one command
○ Data - one query
● Autonomy - Degree of Independence
○ Different settings / configurations / Cache size
○ “Master” node / Master Election
● Keeping track of data distribution
○ which server has the table / partition I need?
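Keeping track of data distribution can be pictured as a lookup in a global catalog. The sketch below is only an illustration: the `Catalog` class, the table name `rides` and the node names are hypothetical, not taken from any particular DDBMS.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Hypothetical global catalog: maps (table, partition) to the node storing it."""
    locations: dict = field(default_factory=dict)

    def register(self, table, partition, node):
        # Record which node holds this partition of the table.
        self.locations[(table, partition)] = node

    def locate(self, table, partition):
        # Raises KeyError if the partition was never registered.
        return self.locations[(table, partition)]


catalog = Catalog()
catalog.register("rides", "2023-01", "node-a")
catalog.register("rides", "2023-02", "node-b")
print(catalog.locate("rides", "2023-02"))  # node-b
```

With such a catalog the client asks one question ("where is this partition?") and never needs to know the cluster layout itself, which is the transparency goal listed above.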
16. Complex Features Implementation
● Reliability - Probability of failures
○ Does one server failure affect the whole system? (“Freeze”)
● Availability - Percent of time when a data source is available
○ If a node goes down, is its data lost, or just unavailable until the node is up again?
● Recovery
○ What is a single point in time?
○ Node clock synchronisation (NTP)
● Transaction Management - Server X must assure that the data is “safe” and no
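The availability bullet can be made concrete with a little arithmetic. The sketch below assumes that replicas fail independently (a simplification that real clusters rarely satisfy) and that the data is available as long as at least one replica is up. The function names and the 99% figure are illustrative assumptions.

```python
def replicated_availability(node_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent copies is up."""
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # All replicas are down with probability (1 - a) ** n.
    return 1.0 - (1.0 - node_availability) ** replicas


def downtime_hours_per_year(availability: float) -> float:
    """Expected unavailable hours in a 365-day year."""
    return (1.0 - availability) * 365 * 24


for n in (1, 2, 3):
    a = replicated_availability(0.99, n)
    print(f"{n} replica(s): availability {a:.6f}, "
          f"about {downtime_hours_per_year(a):.3f} hours down per year")
```

Under these assumptions every extra replica multiplies the unavailable fraction by the single-node failure probability, which is why replication shows up so often as the answer to the availability question.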
18. CAP Theorem
● Eric Brewer (Berkeley->Yahoo->Google)
○ C (Consistency) - a read sees all previously completed writes
○ A (Availability) - reads and writes always succeed
○ P (Partition tolerance) - reads and writes keep working while the network between nodes is down
● Choose 2! (2000)
● Sorry, actually only C or A… (2012)
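The "C or A" choice only shows up once the network is partitioned. The toy model below is a sketch under strong assumptions: two in-memory replicas, synchronous replication when healthy, and a `mode` flag (`"CP"` or `"AP"`) that is purely illustrative. It does not model any real database.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}


class TwoNodeStore:
    """Toy two-replica store; `mode` is 'CP' or 'AP' (hypothetical names)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False  # True while the replicas cannot talk

    def write(self, replica, key, value):
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Consistency first: refuse the write rather than diverge.
                return False
            # Availability first: accept locally, replicas may now disagree.
            replica.data[key] = value
            return True
        # Healthy network: apply the write on both replicas.
        replica.data[key] = value
        other.data[key] = value
        return True

    def read(self, replica, key):
        return replica.data.get(key)


for mode in ("CP", "AP"):
    store = TwoNodeStore(mode)
    store.write(store.a, "x", 1)
    store.partitioned = True
    accepted = store.write(store.a, "x", 2)
    print(mode, accepted, store.read(store.a, "x"), store.read(store.b, "x"))
# CP False 1 1
# AP True 2 1
```

In CP mode the write is rejected and both replicas still agree; in AP mode the write succeeds but a read from the other replica returns the old value.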
22. Fragmentation
● Dividing a single Data Object (Table/ File) into multiple parts
● Types
○ Horizontal - row wise
○ Vertical - column wise (Vertica/ Parquet)
○ Hybrid - both
● Advantages
○ Reports on part of the data - horizontal
○ Increased parallelism - multiple physical files
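The two basic fragmentation types can be sketched on a small in-memory table. The rows, column names and helper functions below are made up for illustration; a real system would fragment files or table partitions rather than Python lists.

```python
rows = [
    {"id": 1, "city": "TLV", "fare": 30},
    {"id": 2, "city": "NYC", "fare": 55},
    {"id": 3, "city": "TLV", "fare": 42},
]


def horizontal_fragments(rows, column):
    """Row-wise split: group whole rows by the value of one column."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[column], []).append(row)
    return fragments


def vertical_fragments(rows, key, column_groups):
    """Column-wise split: each fragment keeps the key plus a subset of columns."""
    return [
        [{key: row[key], **{c: row[c] for c in cols}} for row in rows]
        for cols in column_groups
    ]


by_city = horizontal_fragments(rows, "city")
by_columns = vertical_fragments(rows, "id", [["city"], ["fare"]])

print(by_city["TLV"])   # only the TLV rows, all columns
print(by_columns[1])    # all rows, only id and fare
```

A report on TLV rides touches a single horizontal fragment, while a report that sums fares touches a single vertical fragment; keeping the key in every vertical fragment is what allows the original rows to be rebuilt.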
23. Distributed Processing
● Access by key Only!
○ Using Hash Tables
■ keys are hashed and spread (=sharded) across nodes
■ result of hash tells you which node to access
■ Hash maps exist on every node / client
● Batch Processing
○ MapReduce
■ Map - partition by key
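Hash-based routing and the map step can be sketched together. This is a minimal illustration under assumptions: the node names are hypothetical, SHA-256 with a modulo is just one possible hash scheme, and the reduce step is added here only to complete the picture, since the slide lists the map step alone.

```python
import hashlib
from collections import defaultdict

NODES = ["node-0", "node-1", "node-2"]  # hypothetical node names


def node_for(key: str, nodes=NODES) -> str:
    """Hash the key and use the result to pick the node that owns it."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def map_partition(records, nodes=NODES):
    """Map step: route each (key, value) record to its owning node's bucket."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[node_for(key, nodes)].append((key, value))
    return buckets


def reduce_sum(bucket):
    """Reduce step, run per node: sum the values of each key in the bucket."""
    totals = defaultdict(int)
    for key, value in bucket:
        totals[key] += value
    return dict(totals)


records = [("tlv", 1), ("nyc", 2), ("tlv", 3)]
buckets = map_partition(records)
results = {node: reduce_sum(bucket) for node, bucket in buckets.items()}
print(results)
```

Because every client computes the same hash, any of them can find the owning node without asking a coordinator, and all records with the same key end up in the same bucket, so each node can reduce its keys independently.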
24. Data Locality
● Local storage (VS centralised storage controller)
○ Bring the processing to the data
○ Free bandwidth
● Smart Load Balancing
○ Route users to the “closest” node with the data (replication duh..)
● Data sorted by Key /Hash Key
○ Same / Close enough key = Same node
○ “Process” all the rides in the TLV area
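"Same / close enough key = same node" is what range partitioning on a sorted key gives you. The sketch below assumes ride keys of the form `area:ride_id` and two hypothetical split points; both are illustrative choices, not something the slides specify.

```python
import bisect

# Hypothetical split points: keys below "h" go to node-0,
# keys from "h" up to (not including) "p" go to node-1, the rest to node-2.
SPLIT_POINTS = ["h", "p"]
NODES = ["node-0", "node-1", "node-2"]


def node_for_key(key: str) -> str:
    """Find the key's range in the sorted split points and return its node."""
    return NODES[bisect.bisect_right(SPLIT_POINTS, key)]


rides = ["nyc:001", "tlv:001", "tlv:002", "ber:001"]
placement = {ride: node_for_key(ride) for ride in rides}

# All rides sharing the "tlv:" prefix sort next to each other,
# so with these split points they land on a single node.
tlv_nodes = {node for ride, node in placement.items() if ride.startswith("tlv:")}
print(placement)
print(tlv_nodes)  # {'node-2'}
```

A job that processes all TLV rides can then be sent to that one node instead of pulling the rows over the network. In general a key range can straddle a split point, in which case the job runs on the few adjacent nodes that cover it.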
25. ACID
● Atomicity
○ Transactions
● Consistency
○ Locked until done
● Isolation
○ No interference
● Durability
○ Completed = Persistent
BASE
● Basically Available
○ Response to every request
● Soft State
○ State may change over time, results are not deterministic
● Eventual Consistency
○ Reaching a consistent state may take time, but it is promised
○ (CAS - Compare & Swap operations exist)
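Compare & Swap lets a writer update a value only if nobody changed it since the writer last read it. The sketch below simulates that on a single in-process register guarded by a lock; `VersionedRegister` and `increment` are hypothetical names, and a real store would expose CAS through its own API.

```python
import threading


class VersionedRegister:
    """Single value updated through compare-and-swap (hypothetical class)."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Set the value to `new` only if it still equals `expected`."""
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def increment(register):
    """Retry loop: re-read and try again whenever another writer got in first."""
    while True:
        current = register.read()
        if register.compare_and_swap(current, current + 1):
            return current + 1


def worker(register, times):
    for _ in range(times):
        increment(register)


register = VersionedRegister(0)
threads = [threading.Thread(target=worker, args=(register, 250)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(register.read())  # 1000
```

No update is lost even though the writers never hold a lock across the read and the write: a stale writer simply fails its swap and retries with the fresh value.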
35. Articles
● Old School
○ Fundamentals of Database Systems (1989)
○ Principles of Distributed Database Systems (1991)
● Distributed File System
○ The Google File System (2003)
● Distributed Processing
○ MapReduce: Simplified Data Processing on Large Clusters (2004)
● Interactive Querying on large scale