Senior Engineer, Ruckus Wireless

       Boris Yen
Expert Session B:
A Brief Introduction to Apache Cassandra
Outline
•   Cassandra vs SQL Server
•   Overview
•   Data in Cassandra
•   Data Partitioning
•   Data Replication
•   Data Consistency
•   Client Libraries
Cassandra vs SQL Server
•   Cassandra
    o More servers = More capacity.
    o Scaling concerns are transparent to the application.
    o No single point of failure.
    o Horizontal scale.

•   SQL Server
    o More powerful machine = More capacity.
    o Adding capacity requires manual labor from ops people
      and substantial downtime.
    o There is a limit to how big you can go.
    o Vertical scale, Moore’s law scaling.
Overview
•   Features are derived from Dynamo and BigTable
•   Distributed
    o   Data partitioned among all nodes
•   Extremely Scalable
    o Add new node = Add more capacity
    o Easy to add new node
•   Fault tolerant
    o All nodes are the same
    o Read/Write anywhere
    o Automatic Data replication
•   High Performance
Overview
[Figures: benchmark charts taken from the sources below]
http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-performance
http://www.cubrid.org/blog/dev-platform/nosql-benchmarking/
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Data in Cassandra
•   Keyspace ~ Database in RDBMS
•   Column Family ~ Table in RDBMS

    Keyspace
      ColumnFamily
        Key: Boris    ID    Addr          Phone
                      1     ... Taiwan    09.....

    Column (e.g. Phone):
        {
            column: Phone,
            value: 09...,
            timestamp: 1000
        }
    The timestamp is used to resolve conflicts.
Data in Cassandra
•    Keyspace
      o   Where the replication strategy and replication factor
          are defined.
           CREATE KEYSPACE keyspace_name WITH
           strategy_class = 'SimpleStrategy'
           AND strategy_options:replication_factor=2;



•    ColumnFamily
    CREATE COLUMNFAMILY user (
        id uuid PRIMARY KEY, address text, userName text
    ) WITH comment='' AND comparator=text
      AND read_repair_chance=0.100000 AND gc_grace_seconds=864000
      AND default_validation=text
      AND min_compaction_threshold=4 AND max_compaction_threshold=32
      AND replicate_on_write=True
      AND compaction_strategy_class='SizeTieredCompactionStrategy'
      AND compression_parameters:sstable_compression='org.apache.cassandra.io.compress.SnappyCompressor';
Data in Cassandra
•   Commit log
    o   Captures every write, which assures data durability.
•   Memtable
    o   Stores the most recent writes in memory.
•   SSTable
    o   When a memtable is flushed to disk, it becomes an
        SSTable.
Data Read/Write
•   Write

            Data  ->  Commitlog  ->  Memtable  --(flushed)-->  SSTable


•   Read
    o Search the row cache first. On a hit, return the result;
      no further action is needed.
    o On a miss, read from the memtable(s) and SSTable(s) that
      might contain the requested key, collate the results,
      and return.
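The write and read paths above can be sketched as a toy, single-node store. This is only an illustration under stated assumptions, not Cassandra's implementation: the class ToyStore, its methods, and the flushThreshold parameter are hypothetical names, the row cache is left out, and reads simply prefer the newest table instead of collating columns by timestamp.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy single-node store (hypothetical): commit log, memtable, and SSTables. */
final class ToyStore {
    private final List<String> commitLog = new ArrayList<>();              // append-only record of writes
    private TreeMap<String, String> memtable = new TreeMap<>();            // most recent writes, sorted by key
    private final List<Map<String, String>> sstables = new ArrayList<>();  // flushed tables, oldest first
    private final int flushThreshold;                                      // assumed flush trigger (entry count)

    ToyStore(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    /** Write path: append to the commit log, then update the memtable. */
    void put(String key, String value) {
        commitLog.add(key + "=" + value);
        memtable.put(key, value);
        if (memtable.size() >= flushThreshold) {
            flush();
        }
    }

    /** A flushed memtable becomes a read-only SSTable; a fresh memtable takes over. */
    void flush() {
        if (memtable.isEmpty()) {
            return;
        }
        sstables.add(Collections.unmodifiableMap(memtable));
        memtable = new TreeMap<>();
    }

    /** Read path: check the memtable, then the SSTables from newest to oldest. */
    String get(String key) {
        String value = memtable.get(key);
        if (value != null) {
            return value;
        }
        for (int i = sstables.size() - 1; i >= 0; i--) {
            value = sstables.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    int sstableCount() {
        return sstables.size();
    }

    int commitLogSize() {
        return commitLog.size();
    }

    public static void main(String[] args) {
        ToyStore store = new ToyStore(2);
        store.put("boris", "tainan");
        store.put("john", "taipei");   // reaches the threshold and triggers a flush
        store.put("boris", "taiwan");  // newer value lives only in the memtable
        System.out.println(store.get("boris"));
        System.out.println(store.get("john"));
        System.out.println(store.sstableCount());
    }
}
```

Appending to the commit log before touching the memtable is what lets a restarted node replay writes that were never flushed.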
Data Compaction
t2 > t1

sstable1   Boris: { name: boris (t1), phone: 092xxx (t1), addr: tainan (t1) }
sstable2   Boris: { name: boris.yen (t2), sex: male (t2), email: y@gmail (t2) }
  .
  .
  .
compacted into
sstableX   Boris: { addr: tainan (t1), email: y@gmail (t2), name: boris.yen (t2),
                    phone: 092xxx (t1), sex: male (t2) }
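The merge shown on this slide can be sketched in a few lines. The sketch assumes one row per SSTable and keeps, for each column, the value with the newest timestamp; CompactionSketch, Cell, and compact are hypothetical names, and deletions (tombstones) are left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CompactionSketch {
    /** A column value tagged with its write timestamp. */
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return value + " (t" + timestamp + ")";
        }
    }

    /** Merges one row's columns from several SSTables; the newest timestamp wins. */
    static Map<String, Cell> compact(List<Map<String, Cell>> sstables) {
        Map<String, Cell> merged = new TreeMap<>();
        for (Map<String, Cell> sstable : sstables) {
            for (Map.Entry<String, Cell> entry : sstable.entrySet()) {
                Cell current = merged.get(entry.getKey());
                if (current == null || entry.getValue().timestamp > current.timestamp) {
                    merged.put(entry.getKey(), entry.getValue());
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Cell> sstable1 = new TreeMap<>();
        sstable1.put("name", new Cell("boris", 1));
        sstable1.put("phone", new Cell("092xxx", 1));
        sstable1.put("addr", new Cell("tainan", 1));

        Map<String, Cell> sstable2 = new TreeMap<>();
        sstable2.put("name", new Cell("boris.yen", 2));
        sstable2.put("sex", new Cell("male", 2));
        sstable2.put("email", new Cell("y@gmail", 2));

        System.out.println(compact(Arrays.asList(sstable1, sstable2)));
    }
}
```

With the two rows from the slide, the merged row keeps addr and phone from t1 and takes name, sex, and email from t2, which matches sstableX.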
Data Partitioning
•   The total data managed by the cluster is
    represented as a circular space or ring.
•   Before a node can join the ring, it must be assigned
    a token.
•   The token determines the node’s position on the
    ring and the range of data it is responsible for.
•   Partitioning strategy
     o Random Partitioning
         Default and recommended
     o Ordered Partitioning
         Sequential writes can cause hot spots
         More administrative overhead to load balance the
          cluster
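Data Partitioning
[Figure: Random Partitioning. A ring of five nodes with tokens t1 to t5. The
positions hash(k1), hash(k2), hash(k3), and hash(k4) on the ring determine
which node stores each key. Labels: "Data: k1" next to t5, "Data: k3" next
to t2.]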
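A minimal sketch of how a key can be mapped to a node on the ring, assuming the owner of a key is the first node whose token is greater than or equal to the key's hashed token, wrapping around at the end. TokenRing and its methods are hypothetical names. MD5 is used in the spirit of random partitioning, but the token range here (0 to 2^128 - 1) is a simplification chosen for the example.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

final class TokenRing {
    private final TreeMap<BigInteger, String> ring = new TreeMap<>(); // token -> node name

    /** A node joins the ring at the position given by its token. */
    void addNode(BigInteger token, String node) {
        ring.put(token, node);
    }

    /** Hashes a key to a non-negative token with MD5. */
    static BigInteger tokenFor(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return new BigInteger(1, md5.digest(key.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is unavailable", e);
        }
    }

    /** Owner = first node whose token is >= the key token, wrapping to the lowest token. */
    String ownerOf(BigInteger keyToken) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<BigInteger, String> entry = ring.ceilingEntry(keyToken);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        // Five nodes with evenly spaced tokens, so each owns about a fifth of the ring.
        TokenRing ring = new TokenRing();
        BigInteger step = BigInteger.ONE.shiftLeft(128).divide(BigInteger.valueOf(5));
        for (int i = 0; i < 5; i++) {
            ring.addNode(step.multiply(BigInteger.valueOf(i)), "t" + (i + 1));
        }
        for (String key : new String[] {"k1", "k2", "k3", "k4"}) {
            System.out.println(key + " -> " + ring.ownerOf(tokenFor(key)));
        }
    }
}
```

Because the hash spreads keys uniformly, evenly spaced tokens give each node a similar share of the data, which is why sequential keys do not create hot spots under random partitioning.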
Data Replication
•   Replication ensures fault tolerance and avoids a single
    point of failure.
•   Replication is controlled by two parameters of a keyspace:
    the replication factor and the replication strategy.
•   The replication factor controls how many copies
    of a row are stored in the cluster.
•   The replication strategy controls how those copies
    are placed across the cluster.
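Data Replication
[Figure: Random Partitioning with RF=3. A ring of five nodes with tokens t1
to t5. Data k1 arrives at node t5, which acts as the coordinator; hash(k1)
is marked on the ring.]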
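Replica placement for a replication factor such as RF=3 can be sketched as a walk around the ring. The sketch assumes SimpleStrategy-style placement: the first copy goes to the node that owns the key's token, and the remaining copies go to the next nodes clockwise. SimpleReplicaPlacement and replicasFor are hypothetical names, and tokens are plain long values to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class SimpleReplicaPlacement {
    /**
     * Returns the nodes that hold a key: the owner of keyToken first,
     * then the following nodes clockwise until rf copies are placed.
     */
    static List<String> replicasFor(TreeMap<Long, String> ring, long keyToken, int rf) {
        List<String> replicas = new ArrayList<>();
        if (ring.isEmpty()) {
            return replicas;
        }
        int copies = Math.min(rf, ring.size()); // cannot place more copies than nodes
        Long token = ring.ceilingKey(keyToken);
        if (token == null) {
            token = ring.firstKey();            // wrap around the ring
        }
        while (replicas.size() < copies) {
            replicas.add(ring.get(token));
            token = ring.higherKey(token);
            if (token == null) {
                token = ring.firstKey();
            }
        }
        return replicas;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> ring = new TreeMap<>();
        ring.put(10L, "t1");
        ring.put(20L, "t2");
        ring.put(30L, "t3");
        ring.put(40L, "t4");
        ring.put(50L, "t5");
        // A key whose token falls between t1 and t2, stored with RF=3.
        System.out.println(replicasFor(ring, 15L, 3));
    }
}
```

Any node can act as the coordinator for a request; it uses the same ring information to forward the write to the replicas computed this way.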
Data Consistency
•   Cassandra supports tunable data
    consistency.
•   Choose between strong and eventual
    consistency depending on the need.
•   The choice is made per operation, for
    both reads and writes.
•   Handles multi-data center operations.
Consistency Level

    Write           Read
    Any             -
    One             One
    Quorum          Quorum
    Local_Quorum    Local_Quorum
    Each_Quorum     Each_Quorum
    All             All
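The trade-off between these levels comes down to simple arithmetic. A quorum is a majority of the replicas, and a read is guaranteed to overlap the latest acknowledged write when the number of replicas written plus the number of replicas read exceeds the replication factor. The class and method names below are hypothetical; this is a sketch of the arithmetic, not of any client API.

```java
final class ConsistencyMath {
    /** Quorum = a majority of the replicas: floor(RF / 2) + 1. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** Strong consistency holds when the read and write replica sets must overlap: W + R > RF. */
    static boolean isStronglyConsistent(int writeReplicas, int readReplicas, int replicationFactor) {
        return writeReplicas + readReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int rf = 3;
        int q = quorum(rf);
        System.out.println("quorum for RF=3: " + q);
        System.out.println("Quorum write + Quorum read: " + isStronglyConsistent(q, q, rf));
        System.out.println("One write + One read: " + isStronglyConsistent(1, 1, rf));
        System.out.println("All write + One read: " + isStronglyConsistent(rf, 1, rf));
    }
}
```

With RF=3, writing and reading at Quorum touches two replicas each, so at least one replica is in both sets. Writing and reading at One gives lower latency but only eventual consistency.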
Built-in Consistency Repair Features

•   Read Repair
•   Hinted Handoff
•   Anti-Entropy Node Repair




http://www.datastax.com/docs/0.8/dml/data_consistency#builtin-consistency
Client Library for Java
•   Hector
    o https://github.com/hector-client/hector.git
    o https://github.com/hector-client/hector/wiki/User-Guide
•   Astyanax
    o https://github.com/Netflix/astyanax.git
•   CQL + JDBC
    o   http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/
Hector
•   High-level, simple object-oriented
    interface to Cassandra
•   Failover behavior on the client side
•   Connection pooling for improved
    performance and scalability
•   Automatic retry of downed hosts
.
.
.
Hector
// Assumes ko / keyspace is a Hector Keyspace, cf is a column family name,
// and se / stringSerializer is StringSerializer.get().

// slice query: read three named columns from one row
SliceQuery<String, String, String> q = HFactory.createSliceQuery(ko, se, se, se);
q.setColumnFamily(cf)
 .setKey("jsmith")
 .setColumnNames("first", "last", "middle");
QueryResult<ColumnSlice<String, String>> r = q.execute();

// multi-get: read up to 3 columns from each of several rows
MultigetSliceQuery<String, String, String> multigetSliceQuery =
    HFactory.createMultigetSliceQuery(keyspace, stringSerializer, stringSerializer,
        stringSerializer);
multigetSliceQuery.setColumnFamily("Standard1");
multigetSliceQuery.setKeys("fake_key_0", "fake_key_1",
    "fake_key_2", "fake_key_3", "fake_key_4");
multigetSliceQuery.setRange("", "", false, 3);
QueryResult<Rows<String, String, String>> result = multigetSliceQuery.execute();

// batch operation: queue three insertions for one row, then send them together
Mutator<String> mutator = HFactory.createMutator(keyspace, stringSerializer);
mutator.addInsertion("jsmith", "Standard1", HFactory.createStringColumn("first", "John"))
       .addInsertion("jsmith", "Standard1", HFactory.createStringColumn("last", "Smith"))
       .addInsertion("jsmith", "Standard1", HFactory.createStringColumn("middle", "Q"));
mutator.execute();
https://github.com/hector-client/hector/wiki/User-Guide
CQL+JDBC
// HOST, PORT, and KEYSPACE are constants defined elsewhere in the test class.
// Load the Cassandra JDBC driver and connect to the "system" keyspace.
Class.forName("org.apache.cassandra.cql.jdbc.CassandraDriver");
String URL = String.format("jdbc:cassandra://%s:%d/%s", HOST, PORT, "system");
System.out.println("Connection URL = '" + URL + "'");

Connection con = DriverManager.getConnection(URL);
Statement stmt = con.createStatement();

// Create KeySpace
String createKS = String.format("CREATE KEYSPACE %s WITH strategy_class = "
    + "SimpleStrategy AND strategy_options:replication_factor = 1;", KEYSPACE);
stmt.execute(createKS);

// Create the target Column family
String createCF = "CREATE COLUMNFAMILY RegressionTest (keyname text PRIMARY KEY, "
    + "bValue boolean, "
    + "iValue int "
    + ") WITH comparator = ascii AND default_validation = bigint;";
stmt.execute(createCF);

https://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/browse/src/test/java/org/apache/cassandra/cql/jdbc/JdbcRegressionTest.java
CQL+JDBC
Statement statement = con.createStatement();

String truncate = "TRUNCATE RegressionTest;";
statement.execute(truncate);

String insert1 = "INSERT INTO RegressionTest (keyname,bValue,iValue) "
    + "VALUES ('key0',true, 2000);";
statement.executeUpdate(insert1);

String insert2 = "INSERT INTO RegressionTest (keyname,bValue) VALUES( 'key1',false);";
statement.executeUpdate(insert2);

String select = "SELECT * from RegressionTest;";
ResultSet result = statement.executeQuery(select);
ResultSetMetaData metadata = result.getMetaData();
.
.
.


https://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/browse/src/test/java/org/apache/cassandra/cql/jdbc/JdbcRegressionTest.java
Useful Tools
•   cassandra-cli
    o <cassandra-dir>/bin
    o http://www.datastax.com/docs/1.0/dml/using_cli

•   cqlsh
    o <cassandra-dir>/bin
    o http://www.datastax.com/docs/1.0/references/cql/index

•   nodetool
    o <cassandra-dir>/bin
    o http://www.datastax.com/docs/1.0/references/nodetool

•   stress
    o <cassandra-dir>/tools/bin
    o http://www.datastax.com/docs/1.0/references/stress_java
Useful Tools
•   OpsCenter
    o    http://www.datastax.com/products/opscenter
•   sstableloader
    o    <cassandra-dir>/bin
    o    http://www.datastax.com/dev/blog/bulk-loading
•   More tools
        http://en.wikipedia.org/wiki/Apache_Cassandra#Tools_for_Cassandra
Questions?

presentation ICT roal in 21st century education
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

Introduce Apache Cassandra - JavaTwo Taiwan, 2012

  • 3. Outline • Cassandra vs SQL Server • Overview • Data in Cassandra • Data Partitioning • Data Replication • Data Consistency • Client Libraries
  • 4. Cassandra vs SQL Server • Cassandra o More servers = more capacity. o Scaling concerns are transparent to the application. o No single point of failure. o Horizontal scaling. • SQL Server o More powerful machine = more capacity. o Adding capacity requires manual labor from ops people and substantial downtime. o There is a limit to how big you can go. o Vertical scaling, Moore’s law scaling.
  • 5. Overview • Features come from Dynamo and BigTable • Distributed o Data is partitioned among all nodes • Extremely scalable o Add a new node = add more capacity o Easy to add a new node • Fault tolerant o All nodes are the same o Read/write anywhere o Automatic data replication • High performance
  • 6. Overview http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-performance http://www.cubrid.org/blog/dev-platform/nosql-benchmarking/ http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
  • 7. Data in Cassandra • Keyspace ~ Database in RDBMS • Column Family ~ Table in RDBMS • Diagram: a Keyspace contains a ColumnFamily; the row with Key: Boris has the columns ID = 1, Addr = ... Taiwan, Phone = 09..... • Each column is stored as { column: Phone, value: 09..., timestamp: 1000 }; the timestamp is used to resolve conflicts.
  • 8. Data in Cassandra
      • Keyspace
          o Where the replication strategy and replication factor are defined.

            CREATE KEYSPACE keyspace_name
              WITH strategy_class = 'SimpleStrategy'
              AND strategy_options:replication_factor=2;

      • ColumnFamily

            CREATE COLUMNFAMILY user (
              id uuid PRIMARY KEY,
              address text,
              userName text
            ) WITH comment=''
              AND comparator=text
              AND read_repair_chance=0.100000
              AND gc_grace_seconds=864000
              AND default_validation=text
              AND min_compaction_threshold=4
              AND max_compaction_threshold=32
              AND replicate_on_write=True
              AND compaction_strategy_class='SizeTieredCompactionStrategy'
              AND compression_parameters:sstable_compression='org.apache.cassandra.io.compress.SnappyCompressor';
  • 9. Data in Cassandra • Commit log o Captures every write, which ensures data durability. • Memtable o Stores the most recent writes in memory. • SSTable o When a memtable is flushed to disk, it becomes an SSTable.
  • 10. Data Read/Write • Write o Data → Commitlog → Memtable → (flushed) → SSTable • Read o Search the row cache; on a hit, return the result with no further action. o On a row cache miss, read from the Memtable(s) and SSTable(s) that might contain the requested key, collate the results, and return.
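The write and read paths on this slide can be sketched in a few lines of Java. This is a toy model under stated assumptions, not Cassandra's implementation: the class `WritePathSketch`, its `flushThreshold` parameter, and all method names are hypothetical. The row cache is left out, and instead of collating by timestamp the read simply takes the newest table that holds the key.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of the write path: commit log -> memtable -> (flush) -> sstable.
class WritePathSketch {
    private final List<String> commitLog = new ArrayList<>();              // append-only record of writes
    private Map<String, String> memtable = new TreeMap<>();                // most recent writes, sorted by key
    private final List<Map<String, String>> sstables = new ArrayList<>();  // flushed tables, oldest first
    private final int flushThreshold;                                      // hypothetical flush trigger

    WritePathSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void write(String key, String value) {
        commitLog.add(key + "=" + value);   // 1. record the write for durability
        memtable.put(key, value);           // 2. apply it to the in-memory table
        if (memtable.size() >= flushThreshold) {
            flush();                        // 3. a full memtable becomes an sstable
        }
    }

    void flush() {
        if (memtable.isEmpty()) {
            return;
        }
        sstables.add(Collections.unmodifiableMap(memtable)); // never modified after flush
        memtable = new TreeMap<>();
    }

    // Check the memtable first, then the sstables from newest to oldest.
    String read(String key) {
        if (memtable.containsKey(key)) {
            return memtable.get(key);
        }
        for (int i = sstables.size() - 1; i >= 0; i--) {
            String value = sstables.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    int sstableCount() {
        return sstables.size();
    }

    public static void main(String[] args) {
        WritePathSketch node = new WritePathSketch(2);
        node.write("boris", "tainan");
        node.write("john", "taipei");       // second key reaches the threshold and triggers a flush
        node.write("boris", "taichung");    // newer value lives in the fresh memtable
        System.out.println(node.read("boris") + ", sstables=" + node.sstableCount());
    }
}
```

Because a flushed table is never modified, an overwritten key can exist in several tables at once, which is why a read has to consult more than one of them.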
  • 11. Data Compaction (t2 > t1)
      sstable1: Boris:{ name: boris (t1), phone: 092xxx (t1), addr: tainan (t1) }
      sstable2: Boris:{ name: boris.yen (t2), sex: male (t2), email: y@gmail (t2) }
      compacted into sstableX: Boris:{ addr: tainan (t1), email: y@gmail (t2), name: boris.yen (t2), phone: 092xxx (t1), sex: male (t2) }
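The merge on the compaction slide comes down to "for each column, keep the value with the newest timestamp". The Java sketch below reproduces the slide's example under that assumption. `CompactionSketch`, `Cell`, and `merge` are hypothetical names, and keeping the first value on a timestamp tie is a choice made here for simplicity, not a statement about Cassandra's tie-breaking.

```java
import java.util.Map;
import java.util.TreeMap;

// Merges two versions of one row, column by column, by timestamp.
class CompactionSketch {
    /** A column value plus the write timestamp used to resolve conflicts. */
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** For every column name, the cell with the strictly newer timestamp wins. */
    static Map<String, Cell> merge(Map<String, Cell> left, Map<String, Cell> right) {
        Map<String, Cell> merged = new TreeMap<>(left);
        for (Map.Entry<String, Cell> e : right.entrySet()) {
            merged.merge(e.getKey(), e.getValue(),
                    (a, b) -> b.timestamp > a.timestamp ? b : a);
        }
        return merged;
    }

    public static void main(String[] args) {
        long t1 = 1, t2 = 2; // t2 > t1, as on the slide

        Map<String, Cell> sstable1 = new TreeMap<>();
        sstable1.put("name", new Cell("boris", t1));
        sstable1.put("phone", new Cell("092xxx", t1));
        sstable1.put("addr", new Cell("tainan", t1));

        Map<String, Cell> sstable2 = new TreeMap<>();
        sstable2.put("name", new Cell("boris.yen", t2));
        sstable2.put("sex", new Cell("male", t2));
        sstable2.put("email", new Cell("y@gmail", t2));

        // Prints the five columns of sstableX; "name" carries the t2 value.
        for (Map.Entry<String, Cell> e : merge(sstable1, sstable2).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().value
                    + " (t" + e.getValue().timestamp + ")");
        }
    }
}
```

The same per-column rule is what the column timestamp is for when a read collates results from several tables.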
  • 12. Data Partitioning • The total data managed by the cluster is represented as a circular space, or ring. • Before a node can join the ring, it must be assigned a token. • The token determines the node’s position on the ring and the range of data it is responsible for. • Partitioning strategy o Random partitioning: the default, and recommended. o Ordered partitioning: sequential writes can cause hot spots, and load balancing the cluster takes more administrative overhead.
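  • 13. Data Partitioning • Random Partitioning • Diagram: a ring of five nodes with tokens t1 to t5; the keys k1 to k4 are placed on the ring at hash(k1) to hash(k4), and each key's data (for example k1 and k3) is stored on the node responsible for that part of the ring.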
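A token ring lookup can be sketched with a sorted map. The sketch makes three assumptions: keys are hashed with MD5 into a non-negative integer, a node owns the range from its predecessor's token (exclusive) up to its own token (inclusive), and the five tokens are evenly spaced. `TokenRingSketch` and its methods are hypothetical names, and the token arithmetic is illustrative rather than the exact range a real partitioner uses.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Maps a key to the node whose token range contains the key's hash.
class TokenRingSketch {
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    void addNode(String name, BigInteger token) {
        ring.put(token, name);
    }

    /** Hash a key to a position on the ring (MD5 read as an unsigned 128-bit integer). */
    static BigInteger tokenFor(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return new BigInteger(1, md5.digest(key.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** First node at or after the position; wrap to the first node past the last token. */
    String ownerOf(BigInteger position) {
        Map.Entry<BigInteger, String> entry = ring.ceilingEntry(position);
        if (entry == null) {
            entry = ring.firstEntry();
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        TokenRingSketch ring = new TokenRingSketch();
        BigInteger space = BigInteger.ONE.shiftLeft(128); // size of the hash space in this sketch
        for (int i = 1; i <= 5; i++) {
            // Evenly spaced tokens: node i sits at i/5 of the hash space.
            BigInteger token = space.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(5));
            ring.addNode("t" + i, token);
        }
        for (String key : new String[] {"k1", "k2", "k3", "k4"}) {
            System.out.println(key + " -> " + ring.ownerOf(tokenFor(key)));
        }
    }
}
```

Hashing spreads consecutive keys around the ring, which is the reason random partitioning avoids the hot spots that sequential writes cause with an order-preserving scheme.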
  • 14. Data Replication • Ensures fault tolerance and no single point of failure. • Replication is controlled by two keyspace parameters: the replication factor and the replication strategy. • The replication factor controls how many copies of a row are stored in the cluster. • The replication strategy controls how the data is replicated.
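  • 15. Data Replication • Random Partitioning, RF=3 • Diagram: the same five-node ring (t1 to t5); a coordinator node handles the request for k1, whose position on the ring is given by hash(k1).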
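With RF=3, one common placement rule is the one SimpleStrategy uses: the first replica goes to the node that owns the key's position, and the remaining replicas go to the next nodes clockwise on the ring. The sketch below assumes that rule and uses small `long` tokens for readability. `ReplicaPlacementSketch`, `replicasFor`, and the token values are hypothetical and not taken from the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Picks RF replicas by walking the ring clockwise from the key's owner.
class ReplicaPlacementSketch {
    static List<String> replicasFor(TreeMap<Long, String> ring, long keyToken, int rf) {
        List<String> replicas = new ArrayList<>();
        if (ring.isEmpty()) {
            return replicas;
        }
        int wanted = Math.min(rf, ring.size());   // cannot place more copies than nodes
        Long token = ring.ceilingKey(keyToken);   // owner: first node at or after the key
        if (token == null) {
            token = ring.firstKey();              // wrap around the ring
        }
        while (replicas.size() < wanted) {
            replicas.add(ring.get(token));
            token = ring.higherKey(token);        // next node clockwise
            if (token == null) {
                token = ring.firstKey();
            }
        }
        return replicas;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> ring = new TreeMap<>();
        for (int i = 1; i <= 5; i++) {
            ring.put(10L * i, "t" + i);           // illustrative tokens 10, 20, ..., 50
        }
        // A key hashing to 12 is owned by t2; RF=3 adds the next two nodes.
        System.out.println(replicasFor(ring, 12L, 3)); // prints [t2, t3, t4]
    }
}
```

Any node can act as coordinator for a request; it only needs the ring to work out which nodes hold the replicas.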
  • 16. Data Consistency • Cassandra supports tunable data consistency. • Choose between strong and eventual consistency depending on the need. • Can be set on a per-operation basis, for both reads and writes. • Handles multi-data-center operations.
  • 17. Consistency Level • Write: Any, One, Quorum, Local_Quorum, Each_Quorum, All • Read: One, Quorum, Local_Quorum, Each_Quorum, All
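The usual way to reason about these levels is by counting replicas: a quorum is a majority of the replication factor, and a read is guaranteed to overlap the latest acknowledged write when the number of replicas written plus the number read exceeds the replication factor. The sketch below only does that arithmetic. `ConsistencySketch` and its method names are hypothetical, and the data-center-aware levels (Local_Quorum, Each_Quorum) are not modelled.

```java
// Replica-count arithmetic behind tunable consistency.
class ConsistencySketch {
    /** A quorum is a majority of the replicas: floor(RF / 2) + 1. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** True when every read set must share at least one replica with every write set. */
    static boolean readOverlapsWrite(int writeReplicas, int readReplicas, int replicationFactor) {
        return writeReplicas + readReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int rf = 3;
        int q = quorum(rf);
        System.out.println("quorum for RF=3: " + q);                        // 2
        System.out.println("QUORUM write + QUORUM read: "
                + readOverlapsWrite(q, q, rf));                             // true
        System.out.println("ONE write + ONE read: "
                + readOverlapsWrite(1, 1, rf));                             // false
        System.out.println("ALL write + ONE read: "
                + readOverlapsWrite(rf, 1, rf));                            // true
    }
}
```

Pairs that fail the check, such as One with One, are faster and more available, but fall under eventual consistency and rely on the repair features listed on the next slide to converge.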
  • 18. Built-in Consistency Repair Features • Read Repair • Hinted Handoff • Anti-Entropy Node Repair http://www.datastax.com/docs/0.8/dml/data_consistency#builtin-consistency
  • 19. Client Library for Java • Hector o https://github.com/hector-client/hector.git o https://github.com/hector-client/hector/wiki/User-Guide • Astyanax o https://github.com/Netflix/astyanax.git • CQL + JDBC o http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/
  • 20. Hector • High-level, simple object-oriented interface to Cassandra • Failover behavior on the client side • Connection pooling for improved performance and scalability • Automatic retry of downed hosts . . .
  • 21. Hector

      // slice query
      SliceQuery<String, String> q = HFactory.createSliceQuery(ko, se, se, se);
      q.setColumnFamily(cf)
       .setKey("jsmith")
       .setColumnNames("first", "last", "middle");
      Result<ColumnSlice<String, String>> r = q.execute();

      // multi-get
      MultigetSliceQuery<String, String, String> multigetSliceQuery =
          HFactory.createMultigetSliceQuery(keyspace, stringSerializer, stringSerializer, stringSerializer);
      multigetSliceQuery.setColumnFamily("Standard1");
      multigetSliceQuery.setKeys("fake_key_0", "fake_key_1", "fake_key_2", "fake_key_3", "fake_key_4");
      multigetSliceQuery.setRange("", "", false, 3);
      Result<Rows<String, String, String>> result = multigetSliceQuery.execute();

      // batch operation
      Mutator<String> mutator = HFactory.createMutator(keyspace, stringSerializer);
      mutator.addInsertion("jsmith", "Standard1", HFactory.createStringColumn("first", "John"))
             .addInsertion("jsmith", "Standard1", HFactory.createStringColumn("last", "Smith"))
             .addInsertion("jsmith", "Standard1", HFactory.createStringColumn("middle", "Q"));
      mutator.execute();

      https://github.com/hector-client/hector/wiki/User-Guide
  • 22. CQL+JDBC

      Class.forName("org.apache.cassandra.cql.jdbc.CassandraDriver");
      String URL = String.format("jdbc:cassandra://%s:%d/%s", HOST, PORT, "system");
      System.out.println("Connection URL = '" + URL + "'");
      con = DriverManager.getConnection(URL);
      Statement stmt = con.createStatement();

      // Create KeySpace
      String createKS = String.format("CREATE KEYSPACE %s WITH strategy_class = SimpleStrategy AND strategy_options:replication_factor = 1;", KEYSPACE);
      stmt.execute(createKS);

      // Create the target Column family
      String createCF = "CREATE COLUMNFAMILY RegressionTest (keyname text PRIMARY KEY,"
          + "bValue boolean, "
          + "iValue int "
          + ") WITH comparator = ascii AND default_validation = bigint;";
      stmt.execute(createCF);

      https://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/browse/src/test/java/org/apache/cassandra/cql/jdbc/JdbcRegressionTest.java
  • 23. CQL+JDBC

      Statement statement = con.createStatement();

      String truncate = "TRUNCATE RegressionTest;";
      statement.execute(truncate);

      String insert1 = "INSERT INTO RegressionTest (keyname,bValue,iValue) VALUES ('key0',true, 2000);";
      statement.executeUpdate(insert1);

      String insert2 = "INSERT INTO RegressionTest (keyname,bValue) VALUES( 'key1',false);";
      statement.executeUpdate(insert2);

      String select = "SELECT * from RegressionTest;";
      ResultSet result = statement.executeQuery(select);
      ResultSetMetaData metadata = result.getMetaData();
      . . .

      https://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/browse/src/test/java/org/apache/cassandra/cql/jdbc/JdbcRegressionTest.java
  • 24. Useful Tools • cassandra-cli o <cassandra-dir>/bin o http://www.datastax.com/docs/1.0/dml/using_cli • cqlsh o <cassandra-dir>/bin o http://www.datastax.com/docs/1.0/references/cql/index • nodetool o <cassandra-dir>/bin o http://www.datastax.com/docs/1.0/references/nodetool • stress o <cassandra-dir>/tools/bin o http://www.datastax.com/docs/1.0/references/stress_java
  • 25. Useful Tools • OpsCenter o http://www.datastax.com/products/opscenter • sstableloader o <cassandra-dir>/bin o http://www.datastax.com/dev/blog/bulk-loading • More tools http://en.wikipedia.org/wiki/Apache_Cassandra#Tools_for_Cassandra