APACHE CASSANDRA
Scalability, Performance and Fault Tolerance
in Distributed databases
Jihyun.An (jihyun.an@kt.com)
18 June 2013
TABLE OF CONTENTS
 Preface
 Basic Concepts
 P2P Architecture
 Primitive Data Model & Architecture
 Basic Operations
 Fault Management
 Consistency
 Performance
 Problem handling
TABLE OF CONTENTS (NEXT TIME)
 Maintaining
 Cluster Management
 Node Management
 Problem Handling
 Tuning
 Playing (for development, from the client's perspective)
 Designing
 Client
 Thrift
 Native
 CQL
 3rd party
 Hector
 OCM
 Extension
 Baas.io
 Hadoop
PREFACE
OUR WORLD
 Traditional DBMS is still very valuable
 Storage (+memory) and computational resource costs are cheaper than before
 But we face a new landscape
 Big data
 (near) Real time
 Complex and varied requirements
 Recommendation
 Find FOAF
 …
 Event-driven triggering
 User Session
 …
OUR WORLD (CONT)
 Complex applications combine different types of problems
 Different languages -> more productivity
 ex: functional languages, languages optimized for multiprocessing
 Polyglot persistent layer
 Performance vs Durability?
 Reliability?
 …
TRADITIONAL DBMS
 Relational Model
 Well-defined Schema
 Access with Selection/Projection
 Derived from Joining/Grouping/Aggregating(Counting..)
 Small, refined data
 …
 But
 Painful data model changes
 Hard to scale out
 Ineffective at handling large volumes of data
 Hardware characteristics not taken into account
 …
TRADITIONAL DBMS (CONT)
 Has many constraints for ACID
 PK/FK & checking
 Domain Type checking
 .. checking, checking
 Lots of IO / processing
 OODBMS, ORDBMS
 Good, but … even more checking / processing
 Does not play well with disk IO
NOSQL
 Key-value store
 Column : Cassandra, HBase, Bigtable …
 Others : Redis, Dynamo, Voldemort, Hazelcast …
 Document oriented
 MongoDB, CouchDB …
 Graph store
 Neo4j, OrientDB, BigOWL, FlockDB ..
NOSQL (CONT)
Benefits
 Higher performance
 Higher scalability
 Flexible Datamodel
 More effective for some cases
 Less administrative overhead
Drawbacks
 Limited Transactions
 Relaxed Consistency
 Unconstrained data
 Limited ad-hoc query capabilities
 Limited administrative aid tools
CAP
Brewer's theorem: we can pick only two of
 Consistency
 Availability
 Partition tolerance
(diagram: the C-A-P triangle)
 AP : Amazon Dynamo derivatives — Cassandra, Voldemort, CouchDB, Riak
 CP : Neo4j, Bigtable and its derivatives — MongoDB, HBase, Hypertable, Redis
 CA : Relational — MySQL, MSSQL, Postgres
Cassandra = Dynamo (architecture) + BigTable (data model)
(Apache) Cassandra is a free, open-source, highly scalable,
distributed database system for managing large amounts of data
Written in Java
Runs on the JVM
References :
BigTable (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf)
Dynamo (http://web.archive.org/web/20120129154946/http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf)
DESIGN GOALS
 Simple key/value (column) store
 Limited storage model
 No support for anything (aggregating, grouping, …) beyond basic operations
(CRUD, range access)
 But extendable
 Hadoop (MR, HDFS, Pig, Hive ..)
 ESP
 Distributed Processing Interface (ex: BSP, MR)
 Baas.io
 …
DESIGN GOALS (CONT)
 High Availability
 Decentralized
 Every node can serve any request
 Replication and replica access
 Multi-DC support
 Eventual consistency
 Less write complexity
 Audit and repair on read
 Tunable trade-offs between consistency, durability and latency
DESIGN GOALS (CONT)
 Incremental scalability
 Equal members
 Linear scalability
 Unlimited space
 Write/read throughput increases linearly as nodes (members) are added
 Low total cost
 Minimized administrative work
 Automatic partitioning
 Flush / compaction
 Data balancing / moving
 Virtual nodes (since v1.2)
 Mid-range nodes deliver good performance
 Working together, they provide high performance and huge capacity
FOUNDER & HISTORY
 Founder
 Avinash Lakshman (one of the authors of Amazon's Dynamo)
 Prashant Malik ( Facebook Engineer )
 Developer
 About 50
 History
 Open sourced by Facebook in July 2008
 Became an Apache Incubator project in March 2009
 Graduated to a top-level project in Feb 2010
 0.6 released (added support for integrated caching, and Apache Hadoop MapReduce) in Apr 2010
 0.7 released (added secondary indexes and online schema change) in Jan 2011
 0.8 released (added the Cassandra Query Language (CQL), self-tuning memtables, and support for zero-downtime upgrades) in Jun 2011
 1.0 released (added integrated compression, leveled compaction, and improved read performance) in Oct 2011
 1.1 released (added self-tuning caches, row-level isolation, and support for mixed ssd/spinning disk deployments) in Apr 2012
 1.2 released (added clustering across virtual nodes, inter-node communication, atomic batches, and request tracing) in Jan 2013
PROMINENT USERS
User | Cluster size | Node count | Usage | Now
Facebook | >200 | ? | Inbox search | Abandoned, moved to HBase
Cisco WebEx | ? | ? | User feed, activity | OK
Netflix | ? | ? | Backend | OK
Formspring | ? (26 million accounts, 10 M responses per day) | ? | Social-graph data | OK
Also: Urban Airship, Rackspace, OpenX, Twitter (preparing a move)
BASIC CONCEPTS
P2P ARCHITECTURE
 All nodes are equal
 No single point of failure / decentralized
 Compare with
 MongoDB
 Broker structures (CUBRID …)
 Master / slave
 …
P2P ARCHITECTURE
 Drives linear scalability
References :
http://dev.kthcorp.com/2011/12/07/cassandra-on-aws-100-million-writ/
PRIMITIVE DATA MODEL & ARCHITECTURE
COLUMN
 Basic and primitive type (the smallest increment of data)
 A tuple containing a name, a value and a timestamp
 Timestamp is important
 Provided by the client
 Determines the most recent version
 On a collision, the DBMS chooses the one with the latest timestamp
(diagram: column = name + value + timestamp)
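The timestamp rule above can be sketched as a last-write-wins merge (an illustrative Python model, not Cassandra's actual code; the names are made up):

```python
# A column is a (name, value, timestamp) tuple; the client supplies the timestamp.
from collections import namedtuple

Column = namedtuple("Column", ["name", "value", "timestamp"])

def resolve(a, b):
    """On a collision, keep the column with the latest timestamp."""
    assert a.name == b.name
    return a if a.timestamp >= b.timestamp else b

# Two conflicting writes to the same column name:
old = Column("fullname", "smith", 1000)
new = Column("fullname", "mike", 2000)
assert resolve(old, new).value == "mike"   # latest timestamp wins
assert resolve(new, old).value == "mike"   # comparison order does not matter
```

Because the client supplies the timestamp, clock skew between clients directly affects which write wins.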
COLUMN (CONT)
 Types
 Standard: A column has a name (UUID or UTF8 …)
 Composite: A column has composite name (UUID+UTF8 …)
 Expiring: TTL marked
 Counter: Only has name and value, timestamp managed by server
 Super: Used to manage wide rows; inferior to composite
columns (DO NOT USE — all sub-columns are serialized together)
(diagram: a counter column has only a name and value; standard columns have name, value and timestamp)
COLUMN (CONT)
 Types (CQL3 based)
 Standard: Has one primary key.
 Composite: Has more than one primary key,
recommended for managing wide rows.
 Expiring: Gets deleted during compaction.
 Counter: Counts occurrences of an event.
 Super: Used to manage wide rows; inferior to composite columns (DO NOT USE — all sub-columns are serialized together)
DDL :
CREATE TABLE test (
    user_id varchar,
    article_id uuid,
    content varchar,
    PRIMARY KEY (user_id, article_id)
);
<Logical>
user_id | article_id | content
Smith | <uuid1> | Blah1..
Smith | <uuid2> | Blah2..

<Physical>
Row key "Smith" holds columns {uuid1,content} = Blah1… and {uuid2,content} = Blah2…, each with its own timestamp.

SELECT user_id, article_id FROM test ORDER BY article_id DESC LIMIT 1;
ROWS
 A row contains a row key and a set of columns
 A row key must be unique (usually a UUID)
 Supports up to 2 billion columns per (physical) row
 Columns are sorted by name (the column name is indexed)
 Primitive
 Secondary index
 Direct column access
COLUMN FAMILY
 Container for columns and rows
 No fixed schema
 Each row is uniquely identified by its row key
 Each row can have a different set of columns
 Rows are sorted by row key
 Comparator / Validator
 Static/Dynamic CF
 If the column type is super column, the CF is called a “Super Column Family”
 Like “Table” in Relational world
DISTRIBUTION
(diagram: many rows, four servers — how do we map rows to nodes?)
TOKEN RING
 A node is an instance (typically one per server)
 The ring is used to map each row to a node
 Tokens range from 0 to 2^127 − 1
 A row key is associated with a token
 Node
 Assigned a unique token (ex: token 5 to node 5)
 A node's range runs from the previous node's token (exclusive) to its own token (inclusive)
 token 4 < node 5's range <= token 5
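The token-ring mapping above can be sketched as follows (an illustrative model; the example uses a tiny hand-picked token space rather than 0..2^127 − 1, and these node names are made up):

```python
import bisect
import hashlib

def token_for(row_key, ring_size=2**127):
    """Hash a row key onto the ring (the RandomPartitioner uses MD5)."""
    digest = hashlib.md5(row_key.encode()).digest()
    return int.from_bytes(digest, "big") % ring_size

def owner(token, node_tokens):
    """A node owns the range (previous node's token, its own token].
    node_tokens maps node name -> token."""
    tokens = sorted(node_tokens.values())
    names = {t: n for n, t in node_tokens.items()}
    # the first node token >= the row's token, wrapping around the ring
    i = bisect.bisect_left(tokens, token)
    return names[tokens[i % len(tokens)]]

nodes = {"node1": 100, "node2": 200, "node3": 300}
assert owner(150, nodes) == "node2"   # 100 < 150 <= 200
assert owner(200, nodes) == "node2"   # range includes the node's own token
assert owner(301, nodes) == "node1"   # wraps around past the highest token
```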
PARTITIONING
Row key → partitioner → position on the ring
 Random partitioners (MD5, Murmur3) — the default
 Order Preserving Partitioner / Byte Ordered Partitioner
REPLICATION
 The node that handles a client's read/write request is called the
coordinator node
 A locator determines where the replicas are placed
 Replicas are used for
 Consistency checks
 Repair
 Ensure W + R > N for consistency
 Local cache (row cache)
(diagram: replication factor 4 — the coordinator locates the first replica, then N−1 more copies are placed around the ring; the simple locator treats ring order as proximity)
REPLICATION (CONT)
 Multi-DC support
 Allows specifying how many replicas go in each DC
 Within a DC, replicas are placed on different racks
 Relies on a snitch to place replicas
 Strategies (provided by the snitch)
 Simple (single DC)
 RackInferringSnitch
 PropertyFileSnitch
 EC2Snitch
 EC2MultiRegionSnitch
ADD / REMOVE NODE
 Data transfer between nodes is called “streaming”
 If node 5 is added,
nodes 3, 4 and 1 (assuming RF is 2) are involved in streaming
 If node 2 is removed,
node 3 (which takes the next-higher token and holds node 2's replicas) serves its range instead
(diagram: the ring before and after adding node 5, and after removing node 2)
VIRTUAL NODES
 Supported since v1.2
 Real-time migration support?
 Shuffle utility
 One node has many tokens
 => one node owns many ranges
(diagram: a two-node cluster with 4 tokens per node)
VIRTUAL NODES (CONT)
 Less administrative work
 Saves cost
 When adding/removing a node
 Many nodes cooperate
 No need to determine the token
 Shuffle to re-balance
 Less time spent rebalancing
 Smart balancing
 No need to balance manually
(the number of tokens per node should be sufficiently high)
(diagram: adding node 3 to a two-node cluster with 4 tokens per node)
KEYSPACE
 A namespace for column families
 Authorization
 Contains CFs
 Replication settings
 Key-oriented schema (see below)
{
  "row_key1": {
    "Users": {
      "emailAddress": {"name": "emailAddress", "value": "foo@bar.com"},
      "webSite": {"name": "webSite", "value": "http://bar.com"}
    },
    "Stats": {
      "visits": {"name": "visits", "value": "243"}
    }
  },
  "row_key2": {
    "Users": {
      "emailAddress": {"name": "emailAddress", "value": "user2@bar.com"},
      "twitter": {"name": "twitter", "value": "user2"}
    }
  }
}
(labels: row key → column family → column)
CLUSTER
 The total amount of data managed by the cluster is represented as a
ring
 A cluster of nodes
 Has one or more keyspaces
 Defines the partitioning strategy
 Authentication
GOSSIP
 The gossip protocol is used for cluster membership.
 Failure detection at the service level (alive or not)
 Responsibility: every node in the system knows every other node's status
 Implemented as
 Sync -> Ack -> Ack2
 Information: status, load, bootstrapping
 Basic statuses are Alive / Dead / Join
 Runs every second
 Status disseminates in O(log N) rounds (N is the number of nodes)
 Seed nodes
 PHI is used to judge dead or alive within a time window
(PHI threshold 5 -> detection in 15~16 s)
 Data structure
 HeartBeat < ApplicationState < EndpointState < EndpointStateMap
BASIC OPERATIONS
WRITE / UPDATE
 CommitLog
 Abstracted mmapped type
 File & memory sync -> your safety net on system failure
 Java NIO
 Uses the C-heap (= native heap)
 Log data (a write followed by a delete still exists in the log)
 Rolling segment structure
 Memtable
 In-memory buffer and workspace
 Sorted by row key
 When a size threshold or time period is reached, it is written to disk as a persistent table
structure (SSTable)
WRITE / UPDATE (LOCAL LEVEL)
1 Write to the CommitLog
2 Write/update the Memtable
3 Flush to disk (SSTable)

Example: after Write “1”:{“name”:”fullname”,”value”:”smith”}, Write “2”:{…”mike”}, Delete “1”, Write “3”:{…”osang”}, the CommitLog holds all four mutations while the Memtable holds the resulting rows, sorted by key.
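The three-step local write path can be sketched as follows (an illustrative toy model; real Cassandra structures, thresholds and flush triggers are far more involved):

```python
class Node:
    def __init__(self, flush_threshold=2):
        self.commitlog = []        # append-only durability log
        self.memtable = {}         # in-memory write buffer
        self.sstables = []         # immutable on-disk tables
        self.flush_threshold = flush_threshold

    def write(self, key, column):
        self.commitlog.append(("write", key, column))   # 1. commit log first
        self.memtable[key] = column                      # 2. then the memtable
        if len(self.memtable) >= self.flush_threshold:   # 3. flush on threshold
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

node = Node()
node.write("1", {"name": "fullname", "value": "smith"})
node.write("2", {"name": "fullname", "value": "mike"})   # triggers a flush
assert node.memtable == {} and len(node.sstables) == 1
assert len(node.commitlog) == 2   # the log still holds every mutation
```

On crash recovery, replaying the commit log rebuilds any memtable contents that had not yet been flushed.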
SSTABLE
 SSTable stands for Sorted String Table
 Best for a log-structured DB
 Stores large numbers of key-value pairs
 Immutable
 Created by a “flush”
 Merged by (major/minor) compaction
 Different SSTables may hold different versions (timestamps) of the same column
 The most recent one is chosen
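Merging column versions across SSTables can be sketched like this (illustrative; each toy SSTable maps a key to a (value, timestamp) pair):

```python
def compact(sstables):
    """Merge several SSTables; for each key keep only the version with
    the latest timestamp, and emit one table sorted by key."""
    merged = {}
    for table in sstables:
        for key, (value, ts) in table.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    return dict(sorted(merged.items()))

old = {"1": ("smith", 100), "2": ("mike", 100)}
new = {"1": ("john", 200), "3": ("osang", 150)}
result = compact([old, new])
assert result == {"1": ("john", 200), "2": ("mike", 100), "3": ("osang", 150)}
```

After compaction the two input tables would be deleted, leaving only the merged one.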
READ (LOCAL LEVEL)
(diagram: a read consults the Memtable and each SSTable; per-SSTable Bloom filters (BF) and indexes (IDX) limit disk seeks)
READ (CLUSTER LEVEL, +READ REPAIR)
1 The coordinator's locator selects replicas; data is transferred from the original/replica nodes according to the consistency level
2 Digests are compared; if they differ, the most recent value is chosen
3 The stale replica is recovered (read repair)
DELETE
 Adds a tombstone (a special type of column)
 Garbage collected during compaction
 GC grace seconds : 864000 (default, 10 days)
 Issue
 If a failed node recovers after GCGraceSeconds, the deleted data can
be resurrected
FAULT MANAGEMENT
DETECTION
 Dynamic threshold for marking nodes
 An accrual detection mechanism calculates a per-node threshold
 Automatically takes into account network conditions, workload and
other factors that might affect the perceived heartbeat rate
 From 3rd-party clients
 Hector
 Failover
HINTED-HANDOFF
 The coordinator stores a hint if a node is down or fails to
acknowledge a write
 A hint consists of the target replica and the mutation (column
object) to be replayed
 Uses the Java heap (may move off-heap next)
 Hints are only saved for a limited time (default 1 hour) after a replica fails
 When the failed node comes back, the missed writes are streamed
to it
REPAIR
 Three methods are supported
 CommitLog replaying (by an administrator)
 Read repair (real time)
 Anti-entropy repair (by an administrator)
READ REPAIR
 Background work
 Configured per CF
 If replicas are inconsistent, the most recently written value is chosen
and the stale replicas are replaced
ANTI-ENTROPY REPAIR
 Ensures all data on a replica is made consistent
 A Merkle tree is used
 A tree of hashes over data blocks
 Verifies inconsistencies
 The repairing node requests Merkle hashes (per piece of a CF)
from replicas and compares them; inconsistent ranges are streamed
from a replica, as in read repair
(diagram: a Merkle tree — leaf hashes of CF data blocks combined pairwise up to a root hash)
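The Merkle-tree comparison can be sketched as follows (an illustrative toy; Cassandra builds its trees over token ranges, not raw blocks):

```python
import hashlib

def h(data):
    return hashlib.sha256(data).hexdigest()

def merkle_root(blocks):
    """Hash each data block, then combine hashes pairwise up to one root."""
    level = [h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:                      # odd count: carry the last hash up
            level.append(level[-1])
        level = [h((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]

a = [b"block1", b"block2", b"block3", b"block4"]
b = [b"block1", b"blockX", b"block3", b"block4"]   # one block differs
assert merkle_root(a) == merkle_root(list(a))      # same data, same root
assert merkle_root(a) != merkle_root(b)            # any difference changes the root
```

Comparing roots detects any divergence cheaply; descending into subtrees whose hashes differ pinpoints which ranges need streaming.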
CONSISTENCY
BASIC
 Full ACID compliance in a distributed system is a bad idea
(network partitions, … )
 Single-row updates are atomic (including internal indexes);
everything else is not
 Relaxing consistency does not equal data corruption
 Tunable consistency
 Speed vs precision
 Every read and write operation decides (from the client) how consistent
the requested data should be
CONDITION
 Consistency is ensured if
 (W + R) > N
 W is the number of nodes a write succeeded on
 R is the number of nodes read from
 N is the replication factor
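The condition can be checked directly; quorum is (replication factor / 2) + 1 rounded down, which is integer division (an illustrative sketch):

```python
def quorum(replication_factor):
    # Quorum = (replication factor / 2) + 1, rounded down to a whole number
    return replication_factor // 2 + 1

def is_strongly_consistent(w, r, n):
    # W + R > N guarantees the read set overlaps the write set
    # in at least one replica, so at least one latest value is read.
    return w + r > n

n = 3
assert quorum(n) == 2
assert is_strongly_consistent(quorum(n), quorum(n), n)   # QUORUM write + QUORUM read
assert not is_strongly_consistent(1, 1, n)               # ONE + ONE can read stale data
assert is_strongly_consistent(1, n, n)                   # ONE write + ALL read also works
```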
CONDITION (CONT)
N is 3
Operations: 1. Write 3   2. Write 5   3. Write 1 (latest)
Worst case: with W = 1, only one replica is guaranteed to hold the latest value (1); a read with R = 1 or R = 2 may return a stale value (3 or 5), while R = 3 always includes the latest. With W = 2, a read with R = 2 must overlap at least one up-to-date replica.
(W + R) > N ensures that at least one latest value can be selected.
This is eventual consistency.
READ CONSISTENCY LEVELS
 One
 Two
 Three
 Quorum
 Local Quorum
 Each Quorum
 All
Specifies how many replicas must respond
before a result is returned to the client
Quorum : (replication factor / 2) + 1
Local Quorum / Each Quorum are used with
multiple DCs
Quorum is rounded down to a whole number
(once satisfied, the result is returned right away)
WRITE CONSISTENCY LEVELS
 ANY
 One
 Two
 Three
 Quorum
 Local Quorum
 Each Quorum
 All
Specifies how many replicas must succeed
before an acknowledgement is returned to the client
Quorum : (replication factor / 2) + 1
Local Quorum / Each Quorum are used with
multiple DCs
The ANY level counts a hinted handoff as success
Quorum is rounded down to a whole number
(once satisfied, the result is returned right away)
PERFORMANCE
CACHE
 The key/row caches can persist their data to files
 Key cache
 For frequently accessed keys
 Holds the location of keys (pointing to columns)
 In memory, on the JVM heap
 Row cache
 Optional
 Holds all columns of the row
 In memory, off-heap (since v1.1) or on the JVM heap
 If you have huge rows, this can cause an OOME (OutOfMemoryError)
CACHE
 Mmapped disk access
 On a 64-bit JVM, used for data and the index summary (default)
 Provides virtual mmapped space in memory for SSTables
 On the C-heap (native heap)
 GC effectively makes this a cache
 Frequently accessed data lives a long time; otherwise GC will purge it
 If the data is already in memory, it is returned directly (= cache)
 (Problem) the C-heap is only collected when it is full
 (Problem) it must cover open SSTables — Cassandra can map the entire size
of the open SSTables, otherwise a native OOME occurs
 For efficient key/row/mmapped-access caches, add
sufficient nodes to the cluster
BLOOM FILTERS
 Each SSTable has one
 Used to check whether a requested row key exists in the SSTable before
doing any (disk) seeks
 Per row key, several hashes are generated and the corresponding buckets
are marked
 On lookup, each bucket for the key's hashes is checked; if any is empty the key
does not exist
 False positives are possible, but false negatives are not
(diagram: two keys marking overlapping hash buckets)
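A minimal Bloom filter can be sketched like this (illustrative only; Cassandra's real implementation uses murmur hashing and carefully sized bit arrays):

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _buckets(self, key):
        # Derive several bucket positions per key from salted hashes
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, key):
        for b in self._buckets(key):
            self.bits[b] = True

    def might_contain(self, key):
        # If any bucket is empty, the key definitely does not exist;
        # if all are set, the key *might* exist (false positives possible)
        return all(self.bits[b] for b in self._buckets(key))

bf = BloomFilter()
bf.add("row-key-1")
assert bf.might_contain("row-key-1")   # no false negatives, ever
# "row-key-2" was never added; a True here would be a (rare) false positive
```

A read only touches the SSTable's index and data files when the filter answers "might contain", which is what saves the disk seeks mentioned above.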
INDEX
 Primary index
 Per CF
 An index of the CF's row keys
 Efficient access via the index summary (1 row key out of every 128 is
sampled)
 In memory, on the JVM heap (moving off-heap in the next release)
(diagram: read path — Bloom filter → key cache → index summary → primary index → offset → SSTable)
INDEX (CONT)
 Secondary index
 For column values
 Supports composite types
 Implemented as a hidden CF
 Keyed by the indexed column value; its values point back to the rows
 Write/update/delete operations on it are atomic
 Values shared by many rows index well
 Conversely, highly unique values index poorly (use a dynamic CF for such
indexing instead)
COMPACTION
 Combines data from SSTables
 Merges row fragments
 Rebuilds primary and secondary indexes
 Removes expired columns marked with tombstones
 Deletes the old SSTables on completion
 “Minor” compactions merge SSTables of similar size; “major” compactions
merge all SSTables in a given CF
 Size-tiered compaction
 Leveled compaction
 Since v1.0
 Based on LevelDB
 Temporarily uses up to twice the space and causes a spike in disk IO
ARCHITECTURE
 Write : no race conditions, not bound by disk IO
 Read : slower than write, but still fast (DHT, caches …)
 Load balancing
 Virtual nodes
 Replication
 Multi-DC
BENCHMARK
References :
YCSB paper (http://68.180.206.246/files/ycsb.pdf)
Workload A—update heavy: (a) read
operations, (b) update operations.
Throughput in this (and
all figures) represents total operations
per second, including reads and
writes.
Workload B—read heavy: (a) read
operations, (b) update operations
By YCSB (Yahoo Cloud Serving Benchmark)
BENCHMARK (CONT)
References :
YCSB paper (http://68.180.206.246/files/ycsb.pdf)
Workload E—short scans.
By YCSB (Yahoo Cloud Serving Benchmark)
Read performance as cluster size increases.
BENCHMARK (CONT)
Elastic speedup:
Time series showing
impact of adding
servers online.
By YCSB (Yahoo Cloud Serving Benchmark)
References :
YCSB paper (http://68.180.206.246/files/ycsb.pdf)
BENCHMARK (CONT) By NoSQLBenchmarking.com
References :
http://www.nosqlbenchmarking.com/2011/02/new-results-for-cassandra-0-7-2//
BENCHMARK (CONT) By Cubrid
References :
http://www.cubrid.org/blog/dev-platform/nosql-benchmarking/
BENCHMARK (CONT) By VLDB
References :
http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf/
Throughput (95% read, 5% write) · Read latency · Write latency
BENCHMARK (LAST) By VLDB
References :
http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf/
Throughput (50% read, 50% write) Throughput (100% write)
PROBLEM HANDLING
RESOURCE
 Memory
 Off-heap & Heap
 OOME Problem
 CPU
 GC
 Hashing
 Compression / Compaction
 Network Handling
 Context Switching
 Lazy Problem
 IO
 Bottleneck for everything
MEMORY
 Heap (GC management)
 Permanent (-XX:PermSize, -XX:MaxPermSize)
 JVM Heap (-Xmx, -Xms, -Xmn)
 C-Heap (=Native Heap)
 OS Shared
 Thread Stack (-Xss)
 Objects that access with JNI
 Off-Heap
 OS Shared
 GC managed by Cassandra
MEMORY (CONT)
 Heap
 Permanent
 JVM heap
 Memtable
 KeyCache
 IndexSummary (moves off-heap in the next release)
 Buffers
 Transport
 Socket
 Disk
 C-heap
 Thread stacks
 File memory map (virtual space)
 Data / index buffers (default)
 CommitLog
 Off-heap (OS shared)
 RowCache
 BloomFilter
 Index -> CompressionMetaData -> ChunkOffset
(v1.2)
MEMORY (CONT)
 Memtable
 Managed
 Total size (default 1/3 of the JVM heap; the largest memtable per CF is flushed when reached)
 Emergency: if heap usage stays above a fraction of the max after a full GC (CMS), the
largest memtable is flushed (each time) -> prevents full GC / OOME
 KeyCache
 Managed
 Total size (100 MB or 5% of the max)
 Emergency: if heap usage stays above a fraction of the max after a full GC (CMS), the
max cache size is reduced -> prevents full GC / OOME
 RowCache / CommitLog
 Managed
 Total size (disabled by default) -> prevents OOME
MEMORY (CONT)
 Thread stacks
 Not managed
 But -Xss is set to 180k by default
 Check the Thrift (transport-level RPC server) serving type (sync,
hsha, async (has bugs))
 Set min/max threads for connections (default unlimited)
v1.2
MEMORY (CONT)
 Transport buffer
 Thrift
 Supports many languages and cross-language calls
 Provides server/client interfaces and serialization
 An Apache project, created by Facebook
 Framed buffer (default max 16M, variable size)
 4k, 16k, 32k, … 16M
 Determined by the client
 Per connection
 Adjust the max frame buffer size (client and server)
 Set min/max threads for connections (default unlimited)
(v1.2)
(diagram: client ↔ Thrift ↔ data service)
MEMORY (LAST)
 C-heap / off-heap
 OS shared -> other applications can cause problems
 File memory map (virtual space)
 Collected on full GC
 0 <= total size <= the size of the opened SSTables
 If it cannot be allocated -> native OOME
 But
 Generally only a limited portion of each SSTable is accessed
 GC makes space
 Worst case (if an OOME occurs)?
 cassandra.yaml -> disk_access_mode : standard (restart required)
 Add sufficient nodes
 cassandra.yaml -> disk_access_mode : auto after joining
v1.2
CPU
 GC
 CMS
 Marking phase: low thread priority -> but a high usage rate (not a problem)
 CMSInitiatingOccupancyFraction is 75 (default)
 UseCMSInitiatingOccupancyOnly
 Full GC
 Frequency is what matters -> may indicate a problem (eg: Thrift transport buffers)
 Add nodes, or analyze memory usage and adjust the configuration
 Minor GC
 It's OK
 Compaction
 Running it slowly is okay
 So lower its priority with “-XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Dcassandra.compaction.priority=1”
 Sustained high CPU load? -> that is when you need to add nodes
SWAPPING
 Swapping causes big problems for real-time applications
 IO block -> thread block -> gossip/compaction/flush … delayed ->
causes other problems
 Disable swapping, or keep it to a minimum
 Disable the swap partition
 Or enable JNA + kernel configuration
 JNA : mlockall (keeps heap memory in physical memory)
 Kernel
 vm.swappiness=0 (under memory pressure, swapping is still possible)
 vm.overcommit_memory=1
 Or vm.overcommit_memory=2 (overcommit managed)
 vm.overcommit_ratio=? (eg 0.75)
 Max memory = swap partition size + ratio * physical memory size
 Eg: 8G = 2G + 0.75 * 8G
MONITORING
 System Monitoring
 CPU / Memory / Disk
 Nagios, Ganglia, Cacti, Zabbix
 Network Monitoring
 Per Client
 NfSen (network flow monitoring, see:
http://nfsen.sourceforge.net/#mozTocId376385)
 Cluster Monitoring / Maintaining
 OpsCenter
CHECK THREAD
 Run the “top” command
 Press “H” to show individual threads
 Press “P” to sort by CPU usage rate
 Pick the PID of the heaviest thread
 Convert the PID to hex (http://www.binaryhexconverter.com/decimal-to-hex-converter)
 Run “jstack <parent PID> > filename.log” to save the Java stack to a file
 Search for the PID in hex (e.g. 313C)
CHECK HEAP
 Use a heap dump file from “jmap” or produced on OOME
 Use “jhat” or another tool to analyze it
 Check [B (byte arrays)
 and the objects referencing them
For development and maintenance topics —
sorry, I had just two days to write this presentation.
Next time I will write them up and present them.
See you next time.
Questions, or talk about anything related to Cassandra
Thank you
If you have any problems or questions, please contact me by email.
jihyun.an@kt.com
From legacy to DDD (slides for the screencast)From legacy to DDD (slides for the screencast)
From legacy to DDD (slides for the screencast)Andrzej Krzywda
 
Evolving legacy to microservices and ddd
Evolving legacy to microservices and dddEvolving legacy to microservices and ddd
Evolving legacy to microservices and dddMarcos Vinícius
 
Simplifying your design with higher-order functions
Simplifying your design with higher-order functionsSimplifying your design with higher-order functions
Simplifying your design with higher-order functionsSamir Talwar
 
How to write good comments
How to write good commentsHow to write good comments
How to write good commentsPeter Hilton
 
I T.A.K.E. talk: "When DDD meets FP, good things happen"
I T.A.K.E. talk: "When DDD meets FP, good things happen"I T.A.K.E. talk: "When DDD meets FP, good things happen"
I T.A.K.E. talk: "When DDD meets FP, good things happen"Cyrille Martraire
 
DDD session BrownBagLunch (FR)
DDD session BrownBagLunch (FR)DDD session BrownBagLunch (FR)
DDD session BrownBagLunch (FR)Cyrille Martraire
 
Living Documentation (NCrafts Paris 2015, DDDx London 2015, BDX.io 2015, Code...
Living Documentation (NCrafts Paris 2015, DDDx London 2015, BDX.io 2015, Code...Living Documentation (NCrafts Paris 2015, DDDx London 2015, BDX.io 2015, Code...
Living Documentation (NCrafts Paris 2015, DDDx London 2015, BDX.io 2015, Code...Cyrille Martraire
 
How to name things: the hardest problem in programming
How to name things: the hardest problem in programmingHow to name things: the hardest problem in programming
How to name things: the hardest problem in programmingPeter Hilton
 
Legacy Code: Evolve or Rewrite?
Legacy Code: Evolve or Rewrite?Legacy Code: Evolve or Rewrite?
Legacy Code: Evolve or Rewrite?Cyrille Martraire
 
DATA SCIENCE IS CATALYZING BUSINESS AND INNOVATION
DATA SCIENCE IS CATALYZING BUSINESS AND INNOVATION DATA SCIENCE IS CATALYZING BUSINESS AND INNOVATION
DATA SCIENCE IS CATALYZING BUSINESS AND INNOVATION Elvis Muyanja
 
DDD patterns that were not in the book
DDD patterns that were not in the bookDDD patterns that were not in the book
DDD patterns that were not in the bookCyrille Martraire
 

Destacado (20)

Writing the docs
Writing the docsWriting the docs
Writing the docs
 
HTTP demystified for web developers
HTTP demystified for web developersHTTP demystified for web developers
HTTP demystified for web developers
 
From legacy to DDD - 5 starting steps
From legacy to DDD - 5 starting stepsFrom legacy to DDD - 5 starting steps
From legacy to DDD - 5 starting steps
 
Tom and jef’s awesome modellathon
Tom and jef’s awesome modellathonTom and jef’s awesome modellathon
Tom and jef’s awesome modellathon
 
Documentation avoidance for developers
Documentation avoidance for developersDocumentation avoidance for developers
Documentation avoidance for developers
 
Selling ddd
Selling dddSelling ddd
Selling ddd
 
How to write maintainable code
How to write maintainable codeHow to write maintainable code
How to write maintainable code
 
Death to project documentation with eXtreme Programming
Death to project documentation with eXtreme ProgrammingDeath to project documentation with eXtreme Programming
Death to project documentation with eXtreme Programming
 
Domain-Driven Design in legacy application
Domain-Driven Design in legacy applicationDomain-Driven Design in legacy application
Domain-Driven Design in legacy application
 
From legacy to DDD (slides for the screencast)
From legacy to DDD (slides for the screencast)From legacy to DDD (slides for the screencast)
From legacy to DDD (slides for the screencast)
 
Evolving legacy to microservices and ddd
Evolving legacy to microservices and dddEvolving legacy to microservices and ddd
Evolving legacy to microservices and ddd
 
Simplifying your design with higher-order functions
Simplifying your design with higher-order functionsSimplifying your design with higher-order functions
Simplifying your design with higher-order functions
 
How to write good comments
How to write good commentsHow to write good comments
How to write good comments
 
I T.A.K.E. talk: "When DDD meets FP, good things happen"
I T.A.K.E. talk: "When DDD meets FP, good things happen"I T.A.K.E. talk: "When DDD meets FP, good things happen"
I T.A.K.E. talk: "When DDD meets FP, good things happen"
 
DDD session BrownBagLunch (FR)
DDD session BrownBagLunch (FR)DDD session BrownBagLunch (FR)
DDD session BrownBagLunch (FR)
 
Living Documentation (NCrafts Paris 2015, DDDx London 2015, BDX.io 2015, Code...
Living Documentation (NCrafts Paris 2015, DDDx London 2015, BDX.io 2015, Code...Living Documentation (NCrafts Paris 2015, DDDx London 2015, BDX.io 2015, Code...
Living Documentation (NCrafts Paris 2015, DDDx London 2015, BDX.io 2015, Code...
 
How to name things: the hardest problem in programming
How to name things: the hardest problem in programmingHow to name things: the hardest problem in programming
How to name things: the hardest problem in programming
 
Legacy Code: Evolve or Rewrite?
Legacy Code: Evolve or Rewrite?Legacy Code: Evolve or Rewrite?
Legacy Code: Evolve or Rewrite?
 
DATA SCIENCE IS CATALYZING BUSINESS AND INNOVATION
DATA SCIENCE IS CATALYZING BUSINESS AND INNOVATION DATA SCIENCE IS CATALYZING BUSINESS AND INNOVATION
DATA SCIENCE IS CATALYZING BUSINESS AND INNOVATION
 
DDD patterns that were not in the book
DDD patterns that were not in the bookDDD patterns that were not in the book
DDD patterns that were not in the book
 

Similar a Apache Cassandra Scalability, Performance and Fault Tolerance in Distributed Databases

Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...javier ramirez
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 Andrey Vykhodtsev
 
NOSQL Database: Apache Cassandra
NOSQL Database: Apache CassandraNOSQL Database: Apache Cassandra
NOSQL Database: Apache CassandraFolio3 Software
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupesh Bansal
 
Big data technology unit 3
Big data technology unit 3Big data technology unit 3
Big data technology unit 3RojaT4
 
PASS Summit - SQL Server 2017 Deep Dive
PASS Summit - SQL Server 2017 Deep DivePASS Summit - SQL Server 2017 Deep Dive
PASS Summit - SQL Server 2017 Deep DiveTravis Wright
 
Brk2051 sql server on linux and docker
Brk2051 sql server on linux and dockerBrk2051 sql server on linux and docker
Brk2051 sql server on linux and dockerBob Ward
 
Google Megastore
Google MegastoreGoogle Megastore
Google Megastorebergwolf
 
Introduction of MariaDB AX / TX
Introduction of MariaDB AX / TXIntroduction of MariaDB AX / TX
Introduction of MariaDB AX / TXGOTO Satoru
 
Learning Cassandra NoSQL
Learning Cassandra NoSQLLearning Cassandra NoSQL
Learning Cassandra NoSQLPankaj Khattar
 
Nonrelational Databases
Nonrelational DatabasesNonrelational Databases
Nonrelational DatabasesUdi Bauman
 
Storage cassandra
Storage   cassandraStorage   cassandra
Storage cassandraPL dream
 
"Big Data" Bioinformatics
"Big Data" Bioinformatics"Big Data" Bioinformatics
"Big Data" BioinformaticsBrian Repko
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
 

Similar a Apache Cassandra Scalability, Performance and Fault Tolerance in Distributed Databases (20)

Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3
 
Oracle's history
Oracle's historyOracle's history
Oracle's history
 
Nosql seminar
Nosql seminarNosql seminar
Nosql seminar
 
NOSQL Database: Apache Cassandra
NOSQL Database: Apache CassandraNOSQL Database: Apache Cassandra
NOSQL Database: Apache Cassandra
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata
 
Big data technology unit 3
Big data technology unit 3Big data technology unit 3
Big data technology unit 3
 
PASS Summit - SQL Server 2017 Deep Dive
PASS Summit - SQL Server 2017 Deep DivePASS Summit - SQL Server 2017 Deep Dive
PASS Summit - SQL Server 2017 Deep Dive
 
Brk2051 sql server on linux and docker
Brk2051 sql server on linux and dockerBrk2051 sql server on linux and docker
Brk2051 sql server on linux and docker
 
Google Megastore
Google MegastoreGoogle Megastore
Google Megastore
 
Introduction of MariaDB AX / TX
Introduction of MariaDB AX / TXIntroduction of MariaDB AX / TX
Introduction of MariaDB AX / TX
 
Learning Cassandra NoSQL
Learning Cassandra NoSQLLearning Cassandra NoSQL
Learning Cassandra NoSQL
 
Cassandra
CassandraCassandra
Cassandra
 
Nonrelational Databases
Nonrelational DatabasesNonrelational Databases
Nonrelational Databases
 
Storage cassandra
Storage   cassandraStorage   cassandra
Storage cassandra
 
No sql databases
No sql databasesNo sql databases
No sql databases
 
"Big Data" Bioinformatics
"Big Data" Bioinformatics"Big Data" Bioinformatics
"Big Data" Bioinformatics
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 

Último

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 

Último (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 

Apache Cassandra Scalability, Performance and Fault Tolerance in Distributed Databases

 Higher scalability
 Flexible data model
 More effective for some cases
 Less administrative overhead
Drawbacks
 Limited transactions
 Relaxed consistency
 Unconstrained data
 Limited ad-hoc query capabilities
 Limited administrative aid tools
CAP (BREWER'S THEOREM)
 We can pick only two of:
   Consistency
   Availability
   Partition tolerance
 AP: Amazon Dynamo derivatives (Cassandra, Voldemort, CouchDB, Riak)
 CP: Neo4j, Bigtable and Bigtable derivatives (MongoDB, HBase, Hypertable), Redis
 CA: Relational (MySQL, MSSQL, Postgres)
CASSANDRA
 Dynamo (architecture) + BigTable (data model)
 Apache Cassandra is a free, open-source, highly scalable, distributed database system for managing large amounts of data
 Written in Java, running on the JVM
References:
BigTable (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf)
Dynamo (http://web.archive.org/web/20120129154946/http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf)
DESIGN GOALS
 Simple key/value (column) store
   Limited storage model
   No support for aggregation, grouping etc., only basic operations (CRUD, range access)
 But extendable
   Hadoop (MR, HDFS, Pig, Hive ..)
   ESP
   Distributed processing interfaces (ex: BSP, MR)
   Baas.io
   …
DESIGN GOALS (CONT)
 High availability
   Decentralized: any node can accept any request
   Replication and replica access
   Multi-DC support
 Eventual consistency
   Less write complexity
   Audit and repair on read
   Tunable: trade-offs between consistency, durability and latency
DESIGN GOALS (CONT)
 Incremental scalability
   Equal members
   Linear scalability
     Unlimited space
     Write/read throughput increases linearly as nodes are added
 Low total cost
   Minimal administrative work
     Automatic partitioning
     Flush / compaction
     Data balancing / moving
     Virtual nodes (since v1.2)
   Mid-range nodes deliver good performance; collaborating nodes provide high performance and huge space
FOUNDER & HISTORY
 Founders
   Avinash Lakshman (one of the authors of Amazon's Dynamo)
   Prashant Malik (Facebook engineer)
 Developers
   About 50
 History
   Open-sourced by Facebook in Jul 2008
   Became an Apache Incubator project in Mar 2009
   Graduated to a top-level project in Feb 2010
   0.6 released (added integrated caching and Apache Hadoop MapReduce support) in Apr 2010
   0.7 released (added secondary indexes and online schema changes) in Jan 2011
   0.8 released (added the Cassandra Query Language (CQL), self-tuning memtables, and support for zero-downtime upgrades) in Jun 2011
   1.0 released (added integrated compression, leveled compaction, and improved read performance) in Oct 2011
   1.1 released (added self-tuning caches, row-level isolation, and support for mixed SSD/spinning-disk deployments) in Apr 2012
   1.2 released (added clustering across virtual nodes, improved inter-node communication, atomic batches, and request tracing) in Jan 2013
PROMINENT USERS
User        | Cluster size | Node count | Usage               | Now
Facebook    | >200         | ?          | Inbox search        | Abandoned, moved to HBase
Cisco WebEx | ?            | ?          | User feed, activity | OK
Netflix     | ?            | ?          | Backend             | OK
Formspring  | ? (26 million accounts, 10 M responses per day) | ? | Social-graph data | OK
Also: Urban Airship, Rackspace, OpenX, Twitter (preparing to move)
P2P ARCHITECTURE
 All nodes are the same (all are equal)
 No single point of failure / decentralized
 Compare with
   MongoDB
   Broker structures (CUBRID …)
   Master / slave
   …
P2P ARCHITECTURE (CONT)
 Delivers linear scalability
References: http://dev.kthcorp.com/2011/12/07/cassandra-on-aws-100-million-writ/
PRIMITIVE DATA MODEL & ARCHITECTURE
COLUMN
 The basic, primitive type (the smallest increment of data)
 A tuple containing a name, a value and a timestamp
 The timestamp is important
   Provided by the client
   Determines the most recent value
   On a collision, the DBMS keeps the value with the latest timestamp
 Layout: (name, value, timestamp)
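The timestamp rule above can be sketched as a toy last-write-wins merge. This is an illustrative sketch, not Cassandra's actual code; the `Column` type and `resolve` helper are invented here:

```python
from collections import namedtuple

# A column is a (name, value, timestamp) tuple; the client supplies the timestamp.
Column = namedtuple("Column", ["name", "value", "timestamp"])

def resolve(a, b):
    """On a collision (same name), keep the column with the latest timestamp."""
    assert a.name == b.name
    return a if a.timestamp >= b.timestamp else b

old = Column("email", "foo@bar.com", 1000)
new = Column("email", "foo@baz.com", 2000)
print(resolve(old, new).value)  # the later write wins, regardless of arrival order
```

Because resolution depends only on the timestamps, replicas can apply writes in any order and still converge on the same value.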
COLUMN (CONT)
 Types
   Standard: a column with a simple name (UUID or UTF8 …)
   Composite: a column with a composite name (UUID+UTF8 …)
   Expiring: marked with a TTL
   Counter: has only a name and a value; the timestamp is managed by the server
   Super: used to manage wide rows, inferior to using composite columns (DO NOT USE, all sub-columns are serialized together)
COLUMN (CONT)
 Types (CQL3 based)
   Standard: has one primary key
   Composite: has more than one primary key column, recommended for managing wide rows
   Expiring: gets deleted during compaction
   Counter: counts occurrences of an event
   Super: used to manage wide rows, inferior to using composite columns (DO NOT USE, all sub-columns are serialized together)
 DDL:
CREATE TABLE test (
    user_id varchar,
    article_id uuid,
    content varchar,
    PRIMARY KEY (user_id, article_id)
);
 Logical view: (user_id, article_id, content) rows, e.g. (Smith, <uuid1>, Blah1..) and (Smith, <uuid2>, Blah2..)
 Physical view: a single row keyed "Smith" holding columns {uuid1,content} = Blah1.. and {uuid2,content} = Blah2.., each with its timestamp
 Example query:
SELECT user_id, article_id FROM test ORDER BY article_id DESC LIMIT 1;
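The logical-to-physical mapping above can be illustrated with a small sketch (illustrative Python, not Cassandra's storage engine; the `storage` dict and `insert` helper are invented for this example):

```python
from bisect import insort

# Physical storage: partition key -> list of ((clustering key, column name), value),
# kept sorted by the composite column name, as columns are sorted within a row.
storage = {}

def insert(user_id, article_id, content):
    row = storage.setdefault(user_id, [])
    insort(row, ((article_id, "content"), content))  # keeps the row sorted

insert("Smith", "uuid2", "Blah2..")
insert("Smith", "uuid1", "Blah1..")

# One physical row per user_id; columns are ordered by (article_id, name),
# which is what makes ORDER BY article_id a cheap sequential read.
print(storage["Smith"])
```

Note that the second insert lands before the first in the physical row: ordering comes from the composite column name, not from insertion order.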
ROWS
 A row contains a row key and a set of columns
 A row key must be unique (usually a UUID)
 Supports up to 2 billion columns per (physical) row
 Columns are sorted by their name (the column name is indexed)
 Access paths
   Primitive (by row key)
   Secondary index
   Direct column access
COLUMN FAMILY
 Container for columns and rows
   No fixed schema
   Each row is uniquely identified by its row key
   Each row can have a different set of columns
   Rows are sorted by row key
 Comparator / validator
 Static / dynamic CF
 If the column type is super column, the CF is called a "Super Column Family"
 Like a "table" in the relational world
TOKEN RING
 A node is an instance (typically one per server)
 Used to map each row to a node
 Token range: 0 to 2^127 - 1
 A token is associated with each row key
 Node
   Assigned a unique token (ex: token 5 to node 5)
   A node's range runs from the previous node's token (exclusive) to its own token (inclusive)
     token 4 < node 5's range <= token 5
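The range rule above can be sketched as a lookup on a sorted token list (toy token values instead of 0..2^127-1; the `ring` list and `owner` helper are invented for this sketch):

```python
from bisect import bisect_left

# Tokens assigned to nodes, sorted around the ring.
ring = [(10, "node1"), (20, "node2"), (30, "node3"), (40, "node4")]

def owner(token):
    """A row belongs to the first node whose token is >= the row's token;
    past the last token we wrap around: each range is (previous, own]."""
    tokens = [t for t, _ in ring]
    i = bisect_left(tokens, token)
    return ring[i % len(ring)][1]

print(owner(15))  # falls in (10, 20] -> node2
print(owner(20))  # a node's own token is inclusive -> node2
print(owner(45))  # wraps around past the last token -> node1
```

The wrap-around is what makes the token space a ring rather than a line.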
REPLICATION
 Any node serving a client's read/write is called the coordinator node
 The locator determines where the replicas are placed
 Replicas are used for
   Consistency checks
   Repair
     Ensure W + R > N for consistency
   Local cache (row cache)
 With replication factor N, the original plus N-1 copies are stored
 The simple locator treats ring order as proximity: the first replica is located on the owning node, the rest on the following nodes
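The W + R > N rule can be checked with simple arithmetic: a write acknowledged by W replicas and a read contacting R replicas must share at least one replica whenever W + R exceeds N. A minimal sketch (the `overlapping` helper is invented here):

```python
def overlapping(w, r, n):
    """With N replicas, writes acked by W nodes and reads from R nodes
    are guaranteed to intersect in at least one replica iff W + R > N."""
    return w + r > n

# Quorum reads and writes on 3 replicas always overlap (2 + 2 > 3) ...
print(overlapping(2, 2, 3))   # True: every read sees the latest write
# ... but ONE/ONE on 3 replicas may miss the latest write (1 + 1 <= 3).
print(overlapping(1, 1, 3))   # False: stale reads are possible
```

This is the knob behind the "possible tuning" mentioned earlier: picking W and R per request trades consistency against latency.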
REPLICATION (CONT)
 Multi-DC support
   Allows specifying how many replicas go in each DC
   Within a DC, replicas are placed on different racks
   Relies on a snitch to place replicas
 Strategies (provided by snitches)
   Simple (single DC)
   RackInferringSnitch
   PropertyFileSnitch
   EC2Snitch
   EC2MultiRegionSnitch
ADD / REMOVE NODE
 Data transfer between nodes is called "streaming"
 If node 5 is added, nodes 3, 4 and 1 (supposing RF is 2) are involved in streaming
 If node 2 is removed, node 3 (which holds the next-higher token and the replicas) serves its range instead
  • 33. VIRTUAL NODES  Supported since v1.2  Real-time migration support?  Shuffle utility  One node has many tokens  => one node has many ranges Node 1 Node 2 Number of tokens is 4 Cluster Node 2 Node 1
  • 34. VIRTUAL NODES (CONT)  Less administrative work  Saves cost  When adding/removing a node  many nodes share the streaming work  No need to pick tokens by hand  Shuffle to re-balance  Less time spent rebalancing  Smart balancing  No manual balancing needed (provided the number of tokens is sufficiently high) Number of tokens is 4 Node 2 Node 1 Cluster Node 2 Node 1 Node 3 Add node 3
  • 35. KEYSPACE  A namespace for column families  Authorization  CF? yeah  Replication  Key-oriented schema (see right) { "row_key1": { "Users": { "emailAddress": {"name":"emailAddress", "value":"foo@bar.com"}, "webSite": {"name":"webSite", "value":"http://bar.com"} }, "Stats": { "visits": {"name":"visits", "value":"243"} } }, "row_key2": { "Users": { "emailAddress": {"name":"emailAddress", "value":"user2@bar.com"}, "twitter": {"name":"twitter", "value":"user2"} } } } Row Key Column Family Column
  • 36. CLUSTER  The total amount of data managed by the cluster is represented as a ring  Cluster of nodes  Has one or more keyspaces  Partitioning strategy defined  Authentication
  • 37. GOSSIP  The gossip protocol is used for cluster membership  Failure detection at the service level (alive or not)  Responsibility  Every node in the system knows every other node’s status  Implemented as  Sync -> Ack -> Ack2  Information: status, load, bootstrapping  Basic status is Alive/Dead/Join  Runs every second  Status disseminates in O(logN) (N is the number of nodes)  Seed  Phi (accrual failure detection) decides dead or alive within a time window (5 -> detection in 15~16 s)  Data structure  HeartBeat < Application Status < Endpoint Status < Endpoint StatusMap N1 N2 N3 N4 N6 N5
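The phi accrual idea can be illustrated with a minimal sketch. It assumes exponentially distributed heartbeat intervals for simplicity; the real detector tracks a sliding window of observed arrival intervals, so `phi` here is a toy model, not Cassandra's implementation:

```python
import math

def phi(time_since_last, mean_interval):
    """Suspicion level grows continuously with silence: phi is
    -log10 of the probability that a heartbeat is merely late,
    under an exponential inter-arrival model."""
    p_still_alive = math.exp(-time_since_last / mean_interval)
    return -math.log10(p_still_alive)

# with 1 s heartbeats, a short gap is harmless, long silence convicts
phi(1.0, 1.0)    # ~0.43: healthy
phi(18.0, 1.0)   # ~7.8: well past a typical conviction threshold
```

Because phi is a continuous value rather than a yes/no timeout, the threshold automatically adapts as the observed mean interval stretches under load or network congestion.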
  • 39. WRITE / UPDATE  CommitLog  Abstracted mmap’ed segments  File and memory kept in sync -> your safety net on system failure  Java NIO  C-heap used (= native heap)  Append-only log (even a delete is recorded as a write)  Rolling segment structure  Memtable  In-memory buffer and workspace  Sorted by row key  When a size threshold or time period is reached, written to disk as a persistent table structure (SSTable)
  • 40. WRITE / UPDATE (LOCAL LEVEL) Write CommitLog Write : “1”:{“name”:”fullname”,”value”:”smith”} Write : “2”:{“name”:”fullname”,”value”:”mike”} Delete : “1” Write : “3”:{“name”:”fullname”,”value”:”osang”} … Key Name Value 1 fullname smith 2 fullname mike 3 fullname osang … … … Memtable SSTable SSTable SSTable 1 Write to CommitLog 2 Write/Update to Memtable 3 Write to disk (flush)
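The three numbered steps in the diagram can be sketched as follows. This is a toy model of the local write path, not Cassandra internals; the class name and threshold are invented for illustration:

```python
class WritePath:
    """Sketch of the local write path: append to the commit log for
    durability, update the in-memory memtable, and flush to an
    immutable sorted SSTable when the memtable fills up."""
    def __init__(self, flush_threshold=2):
        self.commitlog = []
        self.memtable = {}
        self.sstables = []            # immutable, sorted-by-key tables
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commitlog.append((key, value))   # 1. durability first
        self.memtable[key] = value            # 2. in-memory update
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                      # 3. sequential disk write

    def flush(self):
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commitlog = []   # flushed segments can be recycled

wp = WritePath()
wp.write("1", "smith")
wp.write("2", "mike")   # hits the threshold -> flushed to an SSTable
```

Note that the write path never reads or seeks on disk: both the commit log append and the SSTable flush are sequential, which is why writes are so fast.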
  • 41. SSTABLE  SSTable is a Sorted String Table  Well suited to a log-structured DB  Stores large numbers of key-value pairs  Immutable  Created by a “flush”  Merged by (major/minor) compaction  A column may exist in several SSTables with different versions (timestamps)  The most recent one wins
  • 42. READ (LOCAL LEVEL) Key Name Value 2 fullname mike 3 fullname Osang … … … SSTable BF IDX SSTable BF IDX Read Memtable
  • 43. READ (CLUSTER LEVEL, +READ REPAIR) Replica (Original, Right) Replica (Right) Replica (Wrong) Digest Comparing Choose the right one if digests differ (the most recent) Recover Read Operation Coordinator Locator 1 Transferred from original/replica node (with consistency level) 2 3
  • 44. DELETE  Adds a tombstone (a special type of column)  Garbage collected during compaction  GC grace seconds: 864000 (default, 10 days)  Issue  If a failed node recovers after GCGraceSeconds, deleted data can be resurrected
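The tombstone lifecycle can be sketched as below. The helper names are hypothetical; the point is that a delete is just a write of a marker, and the marker is only purged after `gc_grace_seconds` so every replica has had a chance to learn about the deletion:

```python
GC_GRACE_SECONDS = 864000      # default: 10 days
TOMBSTONE = object()           # sentinel marking a deleted column

def delete(memtable, key, now):
    # a delete is simply a write of the tombstone marker
    memtable[key] = (TOMBSTONE, now)

def compact_entry(key, value, written_at, now):
    """During compaction, drop a tombstone only once gc_grace has
    elapsed; keeping it that long lets repair propagate the delete."""
    if value is TOMBSTONE and now - written_at > GC_GRACE_SECONDS:
        return None                      # safe to purge
    return (key, value, written_at)      # keep (live data or young tombstone)

compact_entry("1", TOMBSTONE, 0, 100)                    # kept: too young
compact_entry("1", TOMBSTONE, 0, GC_GRACE_SECONDS + 1)   # purged -> None
```

This also shows why the resurrection issue on the slide exists: a node that was down for longer than the grace period never saw the tombstone, and its stale copy of the row wins the next repair.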
  • 46. DETECTION  Dynamic threshold for marking nodes  The accrual detection mechanism calculates a per-node threshold  Automatically accounts for network conditions, workload and anything else that might affect the perceived heartbeat rate  From 3rd-party clients  Hector  Failover
  • 47. HINTED-HANDOFF  The coordinator stores a hint if a replica node is down or fails to acknowledge a write  A hint consists of the target replica and the mutation (column object) to be replayed  Uses the Java heap (may move off-heap in a later release)  Hints are only saved for a limited time (default 1 hour) after a replica fails  When the failed node comes back, the missed writes are streamed to it
  • 48. REPAIR  Three methods are supported  CommitLog replaying (by administrator)  Read repair (real time)  Anti-entropy repair (by administrator)
  • 49. READ REPAIR  Background work  Configured per CF  If replicas are inconsistent, the most recently written value is chosen and pushed to the out-of-date replicas
  • 50. ANTI-ENTROPY REPAIR  Ensures all data on a replica is made consistent  A Merkle tree is used  A tree of data-block hashes  Verifying inconsistency  The repairing node requests Merkle hashes (per CF range) from the replicas and compares them; inconsistent ranges are streamed from a replica, read-repair style Block 1 Block 2 Block 3 … CF hash hash hash hash hash hash hash
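The hash-tree comparison above can be sketched as follows. This is a minimal illustration of the Merkle idea, not Cassandra's repair code; block boundaries and hash choice are assumptions:

```python
import hashlib

def _h(data):
    return hashlib.sha256(data).digest()

def merkle_root(blocks):
    """Hash each data block, then pairwise-hash upward to one root.
    Equal roots mean the replicas agree on the whole range; differing
    roots mean some block differs, and comparing subtree hashes
    narrows down which one without transferring the data itself."""
    level = [_h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:                 # odd count: duplicate last hash
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

replica_a = [b"row1", b"row2", b"row3", b"row4"]
replica_b = [b"row1", b"rowX", b"row3", b"row4"]   # block 2 diverged
merkle_root(replica_a) != merkle_root(replica_b)   # roots differ -> repair
```

Comparing trees costs O(log n) hash exchanges to locate a bad block, which is why repair can validate huge CFs while streaming only the ranges that actually differ.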
  • 52. BASIC  Full ACID compliance in a distributed system is impractical (network partitions, …)  Single-row updates are atomic (including internal indexes), everything else is not  Relaxing consistency does not equal data corruption  Tunable consistency  Speed vs precision  Each read or write operation decides (from the client) how consistent the requested data should be
  • 53. CONDITION  Consistency is ensured if  (W + R) > N  W = number of nodes that acknowledged the write  R = number of nodes read  N = replication factor
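The overlap rule and the quorum formula from the following slides can be captured in two one-liners. These helper names are invented for illustration:

```python
def quorum(replication_factor):
    # QUORUM = floor(RF / 2) + 1, i.e. a strict majority of replicas
    return replication_factor // 2 + 1

def is_strongly_consistent(w, r, n):
    """The write set and read set must overlap in at least one
    replica, so every read sees at least one latest value."""
    return w + r > n

n = 3
w = r = quorum(n)                  # 2
is_strongly_consistent(w, r, n)    # True: QUORUM reads + QUORUM writes
is_strongly_consistent(1, 1, n)    # False: eventual consistency only
```

With ONE/ONE the cluster is fastest but a read may hit a replica the write never reached; QUORUM on both sides buys strong consistency at the cost of waiting for a majority.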
  • 54. CONDITION (CONT) N is 3 Operations 1. Write 3 2. Write 5 3. Write 1 Worst-case replica states: W=1 -> (1, 5, 1); W=2 -> (3, 1, 1) or (1, 1, 1) Possible read results: R=1 -> 3, 5 or 1; R=2 -> at least one replica returns 1; R=3 -> 1 Written Read (W+R)>N ensures that at least one latest value can be selected This is eventual consistency
  • 55. READ CONSISTENCY LEVELS  One  Two  Three  Quorum  Local Quorum  Each Quorum  All Specifies how many replicas must respond before a result is returned to the client Quorum : (Replication Factor / 2) + 1, rounded down to a whole number Local Quorum / Each Quorum are used in multi-DC setups (returns as soon as the level is satisfied)
  • 56. WRITE CONSISTENCY LEVELS  ANY  One  Two  Three  Quorum  Local Quorum  Each Quorum  All Specifies how many replicas must succeed before an acknowledgement is returned to the client Quorum : (Replication Factor / 2) + 1, rounded down to a whole number Local Quorum / Each Quorum are used in multi-DC setups The ANY level counts a hinted handoff as a success (returns as soon as the level is satisfied)
  • 58. CACHE  Key/row caches can persist their data to files  Key Cache  Frequently accessed  Holds the on-disk location of keys (pointing to their columns)  In memory, on the JVM heap  Row Cache  Optional  Holds all columns of the row  In memory, off-heap (since v1.1) or on the JVM heap  With huge rows this can cause an OOME (OutOfMemoryError)
  • 59. CACHE  Mmap’ed disk access  On a 64-bit JVM, used for data and index summary (default)  Provides virtual mmap’ed space in memory for SSTables  On the C-heap (native heap)  The GC makes this behave as a cache  Frequently accessed data stays resident; the rest gets purged  If the data is already in memory, it is returned from there (= cache)  (Problem) the C-heap is reclaimed only when full  (Problem) maps open SSTables, meaning Cassandra can map the entire size of the open SSTables; otherwise a native OOME  For an efficient key/row/mmap’ed-access cache, add sufficient nodes to the cluster
  • 60. BLOOM FILTERS  Each SSTable has one  Used to check whether a requested row key exists in the SSTable before doing any (disk) seeks  Per row key, several hashes are generated and the corresponding buckets marked  On lookup, each bucket for the key’s hashes is checked; if any is empty the key does not exist  False positives are possible, but false negatives are not Key 1 Key 2 Key 2 Hash A Hash B Hash C 1 1 1 Same hashes Only has
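The mark-and-check scheme above fits in a few lines. This is a generic Bloom filter sketch (the sizes, hash count and MD5-based hashing are arbitrary choices for illustration, not what Cassandra uses):

```python
import hashlib

class BloomFilter:
    """False positives possible, false negatives impossible."""
    def __init__(self, m=1024, k=3):
        self.bits = [False] * m      # m buckets
        self.m, self.k = m, k        # k hash functions per key

    def _hashes(self, key):
        # derive k positions by salting one hash function
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for pos in self._hashes(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # if any bucket is empty, the key was definitely never added
        return all(self.bits[pos] for pos in self._hashes(key))

bf = BloomFilter()
bf.add("row-key-1")
bf.might_contain("row-key-1")   # True: do the disk seek
bf.might_contain("row-key-2")   # almost certainly False: skip the SSTable
```

For the read path the asymmetry is exactly what is needed: a false positive only costs one wasted seek, while the guaranteed absence of false negatives means no row can ever be missed.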
  • 61. INDEX  Primary Index  Per CF  An index over the CF’s row keys  Efficient access via the index summary (1 row key out of every 128 is sampled)  In memory, on the JVM heap (moving off-heap in a later release) Read BF KeyCache SSTable Index Summary Primary Index Offset Calculator
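The 1-in-128 sampling can be sketched as follows; `build_summary` and `locate` are hypothetical helpers showing why the summary keeps memory small while still bounding the on-disk index scan:

```python
from bisect import bisect_right

def build_summary(sorted_keys, interval=128):
    # keep 1 key out of every `interval`, with its position in the index
    return [(sorted_keys[i], i) for i in range(0, len(sorted_keys), interval)]

def locate(summary, key):
    """Binary-search the in-memory summary to find where in the
    on-disk primary index to start scanning for `key`."""
    i = bisect_right([k for k, _ in summary], key) - 1
    return summary[max(i, 0)][1]

keys = [f"key{n:05d}" for n in range(1000)]
summary = build_summary(keys)     # only 8 entries held in memory
locate(summary, "key00300")       # start at offset 256, scan <= 128 entries
```

So a lookup costs a binary search over the small summary plus a bounded scan of one index segment, instead of keeping the full primary index resident.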
  • 62. INDEX (CONT)  Secondary Index  On column values  Supports composite types  Hidden CF  Implemented as a hidden CF (the indexed value becomes the key)  Write/Update/Delete operations are atomic  Values shared by many rows index well  Conversely, near-unique values index poorly (-> use a dynamic CF for indexing instead)
  • 63. COMPACTION  Combines data from SSTables  Merges row fragments  Rebuilds primary and secondary indexes  Removes expired columns marked with a tombstone  Deletes the old SSTables when complete  “Minor” compactions only merge SSTables of similar size; “major” compactions merge all SSTables in a given CF  Size-tiered compaction  Leveled compaction  Since v1.0  Based on LevelDB  Temporarily uses up to twice the space and causes spikes in disk IO
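The row-fragment merge at the heart of compaction is last-write-wins by timestamp, which can be sketched as (a simplified model: real SSTables reconcile per column, and tombstones follow the gc_grace rule from the DELETE slide):

```python
def compact(*sstables):
    """Merge several SSTables ({key: (value, timestamp)}) into one:
    for each key, the entry with the newest timestamp wins."""
    merged = {}
    for table in sstables:
        for key, (value, ts) in table.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    return merged

old = {"1": ("smith", 100), "2": ("mike", 100)}
new = {"1": ("smith-updated", 200)}
compact(old, new)
# -> {"1": ("smith-updated", 200), "2": ("mike", 100)}
```

Because SSTables are immutable, this reconciliation is the only place old versions ever disappear, which is why read performance degrades when compaction falls behind.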
  • 64. ARCHITECTURE  Write: no race conditions, not bound by random disk IO  Read: slower than write, but still fast (DHT, caches, …)  Load balancing  Virtual nodes  Replication  Multi-DC
  • 65. BENCHMARK References : http://68.180.206.246/files/ycsb.pdf Workload A—update heavy: (a) read operations, (b) update operations. Throughput in this (and all figures) represents total operations per second, including reads and writes. Workload B—read heavy: (a) read operations, (b) update operations By YCSB (Yahoo Cloud Serving Benchmark)
  • 67. BENCHMARK (CONT) Elastic speedup: Time series showing impact of adding servers online. By YCSB (Yahoo Cloud Serving Benchmark) References : http://68.180.206.246/files/ycsb.pdf
  • 68. BENCHMARK (CONT) By NoSQLBenchmarking.com References : http://www.nosqlbenchmarking.com/2011/02/new-results-for-cassandra-0-7-2/
  • 69. BENCHMARK (CONT) By Cubrid References : http://www.cubrid.org/blog/dev-platform/nosql-benchmarking/
  • 70. BENCHMARK (CONT) By VLDB References : http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf Read latency Write latency Throughput (95% read, 5% write)
  • 71. BENCHMARK (LAST) By VLDB References : http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf Throughput (50% read, 50% write) Throughput (100% write)
  • 73. RESOURCE  Memory  Off-heap & Heap  OOME Problem  CPU  GC  Hashing  Compression / Compaction  Network Handling  Context Switching  Lazy Problem  IO  Bottleneck for everything
  • 74. MEMORY  Heap (GC managed)  Permanent (-XX:PermSize, -XX:MaxPermSize)  JVM heap (-Xmx, -Xms, -Xmn)  C-Heap (= native heap)  OS shared  Thread stack (-Xss)  Objects accessed via JNI  Off-Heap  OS shared  GC managed by Cassandra
  • 75. MEMORY (CONT)  Heap  Permanent  JVM Heap  Memtable  KeyCache  IndexSummary (moves off-heap in the next release)  Buffer  Transport  Socket  Disk  C-Heap  Thread stack  File memory map (virtual space)  Data / index buffer (default)  CommitLog v1.2  Off-Heap (OS shared)  RowCache  BloomFilter  Index -> CompressionMetadata -> ChunkOffsets
  • 76. MEMORY (CONT)  Memtable  Managed  Total size (default 1/3 of the JVM heap; the largest memtable per CF is flushed when reached)  Emergency: if heap usage stays above a fraction of the max after a full GC (CMS) -> flush the largest memtable (each time) -> prevents full GC / OOME  KeyCache  Managed  Total size (100M or 5% of the max heap)  Emergency: if heap usage stays above a fraction of the max after a full GC (CMS) -> reduce the max cache size -> prevents full GC / OOME  RowCache/CommitLog  Managed  Total size (disabled by default) -> prevents OOME
  • 77. MEMORY (CONT)  Thread stack  Not managed  But -Xss is set to 180k (default)  Check the Thrift (transport level) RPC server’s serving type (sync, hsha, async (has bugs))  Set min/max threads for connections (default unlimited) v1.2
  • 78. MEMORY (CONT)  Transport buffer  Thrift  Supports many languages and cross-language calls  Provides server/client interfaces and serialization  Apache project, created by Facebook  Framed buffer (default max 16M, variable size)  4k, 16k, 32k, … 16M  Determined by the client  Per connection  Adjust the max frame buffer size (client, server)  Set min/max threads for connections (default unlimited) v1.2 Data Service Client Data Service Thrift
  • 79. MEMORY (LAST)  C-Heap/Off-Heap  OS shared -> other applications can cause problems  File memory map (virtual space)  Reclaimed on full GC  0 <= total size <= the size of the open SSTables  If allocation fails? -> native OOME  But  In practice only a limited portion of each SSTable is accessed  GC frees space  Worst case? (If an OOME occurs)  yaml -> disk_access_mode : standard (restart required)  Add sufficient nodes  yaml -> disk_access_mode : auto after joining v1.2
  • 80. CPU  GC  CMS  Marking phase: low thread priority but a high usage rate (that is not a problem)  CMSInitiatingOccupancyFraction is 75 (default)  UseCMSInitiatingOccupancyOnly  Full GC  Frequency is what matters -> may indicate a problem (eg: Thrift transport buffer)  Add nodes, or analyze memory usage and adjust the configuration  Minor GC  It’s OK  Compaction  Running slowly is fine  So lower its priority with “-XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Dcassandra.compaction.priority=1”  Sustained high CPU load -> time to add nodes
  • 81. SWAPPING  Swapping causes serious problems for real-time applications  IO block -> thread block -> gossip/compaction/flush … delays -> cascading problems  Disable or minimize swapping  Disable the swap partition  Or enable JNA + kernel configuration  JNA : mlockall (keeps heap memory in physical memory)  Kernel  vm.swappiness=0 (under memory pressure swapping is still possible)  vm.overcommit_memory=1  Or vm.overcommit_memory=2 (overcommit managed)  vm.overcommit_ratio=? (eg 0.75)  Max memory = swap partition size + ratio * physical memory size  Eg: 8G = 2G + 0.75*8G
  • 82. MONITORING  System monitoring  CPU / Memory / Disk  Nagios, Ganglia, Cacti, Zabbix  Network monitoring  Per client  NfSen (network flow monitoring, see: http://nfsen.sourceforge.net/#mozTocId376385)  Cluster monitoring / maintenance  OpsCenter
  • 83. CHECK THREAD  “top” command  “H” key to spread per thread  “P” key to sort by CPU usage rate  Pick the PID of the heaviest thread  Convert the PID to hex (http://www.binaryhexconverter.com/decimal-to-hex-converter)  “jstack <Parent PID> > filename.log” command to save the Java stack to a file  Search for the PID in hex 313C
  • 84. CHECK HEAP  Use a dump file from “jmap” or an OOME  Use “jhat” or another tool to analyze it  Check byte arrays ([B)  and the objects referencing them
  • 85. For development, maintaining Sorry, I had only two days to write this presentation. I will cover these topics next time. See you next time
  • 86. Questions, or talk about anything Cassandra
  • 87. Thank you If you have any problem or question for me, please contact my email. jihyun.an@kt.com