1. What database?
- a practical guide to selection from
NoSQL, SQL and Polyglot data stores
Regunath B
twitter.com/RegunathB
github.com/regunathb
Engineering @ CureFit-HealthFace, ex-Flipkart Infra services, Built Aadhaar
6. Wire-protocol, Standard interfaces
• Wire Protocol
• Custom protocols over TCP/IP, HTTP, gRPC
• Support popular database protocols
• Postgres - e.g. CockroachDB
• Memcached - e.g. Couchbase
• Standard Interfaces
• JDBC - e.g. Apache Hive, Phoenix for HBase, Vitess
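To make "supporting a popular wire protocol" concrete, here is a minimal sketch of the request and response framing in the memcached text protocol. The function names are hypothetical and only the basic `set` and `get` commands are covered; a store that accepts these bytes can be reached by existing memcached clients without a custom driver.

```python
# Sketch only: encodes/decodes a small subset of the memcached text protocol.
# Function names are illustrative, not part of any client library.

def encode_set(key, value, flags=0, exptime=0):
    """Build a 'set' request: header line, raw data block, trailing CRLF."""
    header = f"set {key} {flags} {exptime} {len(value)}\r\n".encode("ascii")
    return header + value + b"\r\n"


def encode_get(key):
    """Build a 'get' request for a single key."""
    return f"get {key}\r\n".encode("ascii")


def parse_get_response(raw):
    """Parse 'VALUE <key> <flags> <bytes>' blocks terminated by 'END'."""
    result = {}
    pos = 0
    while True:
        line_end = raw.index(b"\r\n", pos)
        line = raw[pos:line_end]
        if line == b"END":
            return result
        parts = line.split(b" ")
        key, size = parts[1].decode("ascii"), int(parts[3])
        start = line_end + 2
        result[key] = raw[start:start + size]
        pos = start + size + 2  # skip the data block and its trailing CRLF
```

The bytes from `encode_set` and `encode_get` would be written to a TCP socket; the sketch stops at framing so that it runs without a server.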
7. Schema(less) support
Schema is not Evil
• Why Schema-less
• Sparse-Metric and Entity-Attribute-Value storage needs
• Frequent changes
• Why Schema
• Understanding structure of data
• Referential data integrity; quality of data controlled by a data dictionary
• In-between
• Schema-less but requires column indexes (e.g. ColumnFamily model of KV stores)
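The trade-off above can be sketched with SQLite from the Python standard library: a fixed, schema-controlled entity table gives referential integrity, while an Entity-Attribute-Value side table holds sparse attributes that change often. The table and column names are hypothetical, chosen only for illustration.

```python
import sqlite3

# Sketch only: table and column names are made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE patient_attr (              -- Entity-Attribute-Value rows
    patient_id INTEGER NOT NULL REFERENCES patient(id),
    attr       TEXT NOT NULL,
    value      TEXT NOT NULL,
    PRIMARY KEY (patient_id, attr)
);
""")

conn.execute("INSERT INTO patient (id, name) VALUES (1, 'Asha')")
conn.executemany(
    "INSERT INTO patient_attr VALUES (?, ?, ?)",
    [(1, "blood_group", "O+"), (1, "allergy", "penicillin")],
)


def attributes(patient_id):
    """Return the sparse attributes of one entity as a dict."""
    rows = conn.execute(
        "SELECT attr, value FROM patient_attr WHERE patient_id = ?",
        (patient_id,),
    )
    return dict(rows.fetchall())


# The schema still protects integrity: an attribute for an unknown
# patient is rejected by the foreign key.
try:
    conn.execute("INSERT INTO patient_attr VALUES (99, 'allergy', 'dust')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

New attributes need no schema change, but their names and value types are no longer described by a data dictionary, which is the quality cost the slide points to.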
8. CAP theorem critique
• “one example of a fundamental trade-off between safety and liveness in fault-prone systems” [3]
• Too simplistic [4]
• Choosing CA is mostly impractical (it implies a single-node database); the critique therefore applies to CP or AP
• CAP-Availability and CAP-Consistency are each a spectrum, not a binary choice
• e.g. AP-Reads, AP-Writes, Strong Consistency vs. Eventual Consistency
• Define application trade-offs, validate impact on NFRs - Latency, Throughput
• Good starting point for considering Polyglot persistence
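One way to see consistency and availability as a spectrum is the quorum arithmetic used by Dynamo-style replicated stores. This is an assumption brought in for illustration, not something the slides prescribe: with N replicas, a read quorum R and a write quorum W, reads are guaranteed to see the latest acknowledged write only when R + W > N.

```python
# Sketch only: quorum arithmetic for an assumed Dynamo-style store.

def quorums_overlap(n, r, w):
    """True when every read quorum intersects every write quorum."""
    return r + w > n


def tolerated_failures(n, quorum):
    """Replicas that may be down while the operation can still succeed."""
    return n - quorum


N = 3
settings = {
    "fast reads, slow writes": (1, 3),
    "balanced quorum": (2, 2),
    "eventually consistent": (1, 1),
}
summary = {
    name: (quorums_overlap(N, r, w),
           tolerated_failures(N, r),
           tolerated_failures(N, w))
    for name, (r, w) in settings.items()
}
```

Each setting trades read availability, write availability and consistency differently, which is the per-operation choice (AP-Reads, AP-Writes) that an application has to make explicit.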
11. Polyglot Persistence - pluggable storage, secondary indices
• Healthcare Graph data (Conditions, Symptoms) on Titan
• Mostly Read-only queries - Point lookups, one-hop traversals
• AP-Read data (Storage engine: Cassandra)
• Also query by properties of Vertex/Edge (Secondary indices in ES)
Source: CureFit Symptoms & Conditions datastore
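The access patterns on this slide can be sketched with plain in-memory structures: one dictionary stands in for the primary storage engine (point lookups and one-hop traversals) and a second stands in for the external secondary index (property queries). The class and method names are hypothetical, and this is not the graph database's API.

```python
from collections import defaultdict


class GraphStoreSketch:
    """Illustrative stand-in for a graph layer over pluggable stores."""

    def __init__(self):
        self.vertices = {}               # primary store: id -> properties
        self.edges = defaultdict(set)    # primary store: id -> neighbour ids
        self.index = defaultdict(set)    # secondary index: (prop, value) -> ids

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props
        for item in props.items():       # keep the secondary index in step
            self.index[item].add(vid)

    def add_edge(self, src, dst):
        self.edges[src].add(dst)

    def get(self, vid):
        """Point lookup by id."""
        return self.vertices[vid]

    def neighbours(self, vid):
        """One-hop traversal."""
        return sorted(self.edges.get(vid, ()))

    def find(self, prop, value):
        """Query by vertex property through the secondary index."""
        return sorted(self.index.get((prop, value), ()))


g = GraphStoreSketch()
g.add_vertex("flu", kind="condition")
g.add_vertex("fever", kind="symptom")
g.add_vertex("cough", kind="symptom")
g.add_edge("flu", "fever")
g.add_edge("flu", "cough")
```

In a real deployment the two structures live in different systems, so index updates lag the primary write; that is acceptable here because the data is mostly read-only.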
12. Others
• Performance benchmarks - Latency, Throughput, Concurrency - e.g. Graph DBs benchmark [6]
• Operations & Maintenance - e.g. MySQL as backend data store for Facebook TAO [7], LinkedIn Espresso [8]
• Support - Paid (single vendor vs. multiple), Community (size, composition)
• Hosted service - on public clouds as a managed service
14. Database Type
• Relational
• All field values of a row stored together
• Common storage format: B-Tree
• Better suited for OLTP
• Columnar
• All values of a column stored together
• More efficient data compression
• OLAP queries perform better
Source: https://gerardnico.com/wiki/relation/structure/column_store
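A small sketch of the two layouts, using made-up order data. It shows why a column layout compresses better (adjacent values of one column repeat, so run-length encoding works well) and why an aggregate touches less data (only one column is read).

```python
from itertools import groupby

# Sketch only: the rows and column names are invented for illustration.
COLUMNS = ("order_id", "country", "amount")
rows = [
    ("o1", "IN", 120),
    ("o2", "IN", 80),
    ("o3", "IN", 45),
    ("o4", "US", 300),
]

# Row layout: all field values of a row sit together (good for OLTP,
# where one request reads or writes a whole row).
row_store = list(rows)

# Column layout: all values of a column sit together.
column_store = {
    name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)
}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


compressed_country = run_length_encode(column_store["country"])

# An OLAP-style aggregate reads a single column instead of every row.
total_amount = sum(column_store["amount"])
```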
15. Database Type
• Document
• Sub-class of a KV store
• Often hierarchical (DB -> Collection -> Document)
• Often have challenges in optimising storage - due to lack of Data Dictionary (schema-free)
• KV
• Often RAM based
• Durability through replication (sync) and persistence to disk
• Preference for LSM over in-place updates when designed for SSD
Source: https://blog.mlab.com/2014/01/how-big-is-your-mongodb/
Source: http://www.aerospike.com/technologies/
16. Data Organisation
• B-Tree
• Better suited for in-place updates
• Log-Structured Merge (LSM) tree
• Better suited for high insert volume
• Better suited for SSD (for reducing write amplification)
• Achieves high data locality of reference through good row-key design [9]
Source: http://www.programering.com/a/MTMwAzMwATM.html
Source: http://www.cyanny.com/2014/03/13/hbase-architecture-analysis-part1-logical-architecture/
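A toy sketch of the LSM idea, to contrast with in-place B-Tree updates: writes go to an in-memory table, which is flushed as an immutable sorted run; reads check the newest data first; compaction merges runs. The class and its parameters are hypothetical, and real engines add a write-ahead log, Bloom filters, tombstones and levelled compaction.

```python
import bisect


class TinyLSM:
    """Illustrative LSM store: memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []               # oldest first; each a sorted list
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value       # no in-place update on disk
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as one sorted, immutable run."""
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):      # newest run wins
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None

    def compact(self):
        """Merge all runs into one, keeping the newest value per key."""
        merged = {}
        for table in self.sstables:                # older first, newer overwrite
            merged.update(table)
        self.sstables = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)      # reaches the limit and triggers a flush
db.put("a", 3)      # newer value lives in the memtable
```

Because each run is sorted by key, rows whose keys share a prefix end up next to each other, which is where good row-key design yields locality of reference.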
17. Replication, Consensus
• Replication
• Sync vs. Async
• No. of Replicas, Min. Replicas, Journalling, Guaranteed writes with hinted handoff
• Single master read-write (CP) vs. Replica reads (AP)
• Consensus
• Used in
• Leader election
• Committing transactions/Log replication
• Strength of protocol - Paxos, Raft, Zab etc.
• Jepsen Tests (https://jepsen.io/) - Tests ‘Safety’ of distributed databases
• e.g. CockroachDB, MongoDB, VoltDB, Solr, Elasticsearch etc.
Source: https://martin.kleppmann.com
Source: https://raft.github.io/raft.pdf
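As a sketch of how consensus is used for log replication, the function below computes the highest log index that a majority of nodes has stored, which is the core of how a Raft leader decides an entry is committed. It is a simplification: real Raft additionally requires that the entry belongs to the leader's current term, and the function names are illustrative.

```python
# Sketch only: majority-based commit decision, simplified from Raft.

def majority(cluster_size):
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


def commit_index(match_index):
    """Highest log index replicated on a majority of nodes.

    match_index holds, for every node including the leader, the index of
    the last log entry known to be stored on that node.
    """
    ranked = sorted(match_index, reverse=True)
    return ranked[majority(len(ranked)) - 1]
```

For a five-node cluster with `match_index = [5, 5, 3, 2, 1]`, three nodes hold entries up to index 3, so index 3 is the commit point even though two nodes lag behind.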
18. Operations
• Data export & restore (RPO, RTO) - Disaster Recovery(DR)
• Tools for full export vs incremental snapshots
• Tools for restoring from exports, logs
• Piggy-back on XDC replication support to create continuous/ongoing backup & restore
• Large scale data migration [10]
• Mean Time to Recovery (MTTR) - Node failure/Minor outages
• e.g. promoting hot-standby to master
• Tools to detect failure, validate data, promote new master/leader
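The failover steps above (detect failure, validate data, promote a new master) can be sketched as a selection rule: ignore standbys that have stopped sending heartbeats, then promote the one that has applied the most of the replication log. The field names, the timeout and the function are assumptions for illustration, not any specific tool's behaviour.

```python
# Sketch only: hypothetical replica records and selection rule.

def pick_new_master(replicas, now, heartbeat_timeout=5.0):
    """Return the name of the healthiest, most caught-up standby, or None."""
    healthy = [
        r for r in replicas
        if now - r["last_heartbeat"] <= heartbeat_timeout
    ]
    if not healthy:
        return None
    # The replica with the highest applied log position loses the least data.
    return max(healthy, key=lambda r: r["applied_lsn"])["name"]


replicas = [
    {"name": "standby-1", "last_heartbeat": 99.0, "applied_lsn": 1040},
    {"name": "standby-2", "last_heartbeat": 98.5, "applied_lsn": 1055},
    {"name": "standby-3", "last_heartbeat": 80.0, "applied_lsn": 1060},  # stale
]
chosen = pick_new_master(replicas, now=100.0)
```

The gap between the failed master's last position and the chosen standby's `applied_lsn` bounds the data lost, and the time to run this procedure is a large part of the MTTR.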
19. Cost
• Disk-Memory ratio
• Database architecture to support disk storage, size of on-disk data w.r.t RAM
• Compute required
• No. of compute nodes required to keep data on-line
• Power Consumption
• SSD-based databases are generally more energy-efficient than HDD-based ones
• Density of storage
• Relevant when storing large data over extended periods of time
• e.g. Aadhaar enrolment raw data, Facebook photos [11]
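A back-of-the-envelope sketch of how the disk-memory ratio drives node count. All figures and the function are hypothetical: the ratio is the amount of on-disk data a database can serve per unit of RAM, so a RAM-only store has a ratio of 1.

```python
import math

# Sketch only: sizing arithmetic with invented hardware figures.

def nodes_required(data_gb, replication_factor,
                   disk_per_node_gb, ram_per_node_gb, disk_to_ram_ratio):
    """Nodes needed to keep all replicas of the data on-line."""
    total_gb = data_gb * replication_factor
    # A node can hold no more than its disk, and no more than its RAM
    # allows at the given disk-to-RAM ratio.
    usable_gb = min(disk_per_node_gb, ram_per_node_gb * disk_to_ram_ratio)
    return math.ceil(total_gb / usable_gb)


# 10 TB of data, 3 replicas, nodes with 2 TB disk and 64 GB RAM.
ram_only = nodes_required(10_000, 3, 2_000, 64, disk_to_ram_ratio=1)
disk_backed = nodes_required(10_000, 3, 2_000, 64, disk_to_ram_ratio=30)
```

Under these assumptions the RAM-only design needs 469 nodes and the disk-backed design needs 16, which is why the architecture's support for on-disk data dominates cost for large, long-lived datasets.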
20. DB-specific Optimisations to leverage RAM, reduce Disk I/O
• Data block-cache/buffer-pool
• Reduces disk I/O
• Provides lower latency on repeat reads
• Provides potentially lower latency for reads on high data locality of reference
• Bloom Filters
• Reduces disk I/O and row scanning in random key lookups
Source: https://sematext.com
Source: Cloudera
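A minimal Bloom filter sketch showing how a random key lookup can skip disk I/O: if the filter says a key is absent, the data file is not read at all. The sizes, the hashing scheme and the class name are illustrative choices, not those of any particular database.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: no false negatives, some false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all(
            self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key)
        )


bloom = BloomFilter()
for row_key in ("user:1", "user:2", "user:3"):
    bloom.add(row_key)
```

A store would keep one such filter per on-disk file and consult it before reading: a `False` answer avoids the read entirely, while a `True` answer still needs the read to confirm.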
21. References
• [1] - Google Spanner becoming a SQL System
• [2] - CRDTs in Riak
• [3] - Perspectives on the CAP Theorem
• [4] - Martin Kleppmann CP or AP
• [5] - Flipkart Catalog System, Datastore
• [6] - Do We Need Specialised Graph Databases?
• [7] - Facebook TAO social graph data store
• [8] - LinkedIn Espresso
• [9] - Facebook style notifications using HBase
• [10] - Flipkart DC migration
• [11] - Facebook cold storage system