MySQL for Beginners Gary Dusbabek Rackspace April Fools!!!11
Apache Gary Dusbabek Rackspace
What is Cassandra? A key-value store (with some structure); highly scalable; eventually consistent; distributed; tunable (partitioning, replication).
Where did it come from? Created at Facebook. Dynamo: distribution architecture. BigTable: data model. Open-sourced in 2008; Apache incubator in early 2009; graduation in March 2010.
Who uses it? Rackspace, Facebook (of course), Twitter, Digg, Reddit, IBM, others…
What problems does it solve? Reliability at scale; no single point of failure (all nodes are identical); simple, linear scaling; high write throughput; large data sets.
What problems can’t it solve? No flexible indices; no querying on non-PK values; not good for big binary data (>64 MB) unless you chunk it; row contents must fit in available memory.
Concepts: CAP. CAP theorem: consistency, availability, partition tolerance.
Cassandra chooses A and P, but allows them to be tuned to get more C.
Concepts: Replication & Consistency. You specify the replication factor. You specify the consistency level for each read/write operation: ZERO, ONE, QUORUM, ALL, ANY.
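To make the trade-off between replication factor and consistency level concrete, here is a minimal sketch of the usual quorum arithmetic: a read is guaranteed to overlap the latest write when the replicas read plus the replicas written exceed the replication factor. The function names are illustrative, not part of any Cassandra API, and the write-only levels ZERO and ANY are left out of this sketch.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Replica responses needed for a level (simplified; ZERO/ANY omitted)."""
    table = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return table[level]


def read_sees_latest_write(read_level: str, write_level: str, rf: int) -> bool:
    """True when every read set must intersect every write set (R + W > RF)."""
    r = replicas_required(read_level, rf)
    w = replicas_required(write_level, rf)
    return r + w > rf
```

With RF=3, QUORUM reads plus QUORUM writes overlap (2 + 2 > 3), while ONE plus ONE does not (1 + 1 <= 3), which is where eventual consistency shows up.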
Ring Topology (RF=3). Storage ring: every node gets a token, which defines its place in the storage ring and which keys it is responsible for (its ranges). [Diagram: ring of nodes a, d, g, j]
Ring Topology (RF=2). The same storage ring with a replication factor of 2. [Diagram: ring of nodes a, d, g, j]
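A rough sketch of how a token ring maps a key to its replicas. It assumes keys are compared directly with tokens (as an order-preserving partitioner would), that a node owns the range ending at its own token, and that extra replicas are simply the next nodes around the ring. `replicas_for` is a hypothetical helper, not Cassandra code.

```python
from bisect import bisect_left


def replicas_for(key, tokens, rf):
    """Return the tokens of the nodes that hold `key`, primary first."""
    ring = sorted(tokens)
    # First node whose token is >= key; wrap to the start of the ring if none.
    start = bisect_left(ring, key) % len(ring)
    count = min(rf, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]


ring_tokens = ["a", "j", "d", "g"]
rf3 = replicas_for("e", ring_tokens, 3)   # g, then j, then a
rf2 = replicas_for("e", ring_tokens, 2)   # g, then j
```

Lowering RF from 3 to 2 only shortens the walk around the ring; the primary owner of the key does not change.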
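Ring: New Node (RF=3). A new node joins and the ranges are adjusted. [Diagram: ring of nodes a, d, g, j with new node m]
Ring: New Node (RF=2). A new node joins and the ranges are adjusted. [Diagram: ring of nodes a, d, g, j with new node m]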
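To see what "ranges are adjusted" means, the sketch below compares primary ownership before and after node m joins. It uses the same simplifying assumptions as a plain sorted ring (keys compared directly with tokens, a node owns the range ending at its token); `owner` and `moved_keys` are made-up helper names.

```python
from bisect import bisect_left


def owner(key, tokens):
    """Token of the node that is primary owner of `key`."""
    ring = sorted(tokens)
    return ring[bisect_left(ring, key) % len(ring)]


def moved_keys(keys, old_tokens, new_tokens):
    """Map each key whose owner changed to its (old owner, new owner) pair."""
    return {
        k: (owner(k, old_tokens), owner(k, new_tokens))
        for k in keys
        if owner(k, old_tokens) != owner(k, new_tokens)
    }


before = ["a", "d", "g", "j"]
after = before + ["m"]
changes = moved_keys(["b", "e", "h", "k", "l", "n"], before, after)
```

Only keys falling between j and m change hands (from a to m); every other range is untouched, which is why adding a node is cheap for the rest of the ring.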
Ring Partition (RF=3). A node dies or becomes isolated from the ring; hinted handoff covers writes it misses. [Diagram: ring of nodes a, d, g, j, m]
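A toy illustration of the hinted handoff idea: when a replica is unreachable, the write is remembered elsewhere and replayed once the replica returns. This is a deliberately simplified sketch, not how Cassandra implements it; the `Node` class, `write`, and `replay_hints` are hypothetical, and hints are kept on the coordinator purely for brevity.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (target node name, key, value) held for down nodes


def write(key, value, replicas, coordinator):
    """Write to live replicas; keep a hint for each replica that is down."""
    acks = 0
    for node in replicas:
        if node.up:
            node.data[key] = value
            acks += 1
        else:
            coordinator.hints.append((node.name, key, value))
    return acks


def replay_hints(coordinator, nodes_by_name):
    """Deliver stored hints to targets that are reachable again."""
    remaining = []
    for target, key, value in coordinator.hints:
        node = nodes_by_name[target]
        if node.up:
            node.data[key] = value
        else:
            remaining.append((target, key, value))
    coordinator.hints = remaining


d, g, j = Node("d"), Node("g"), Node("j")
g.up = False                                   # g is isolated from the ring
acks = write("k", "v", [d, g, j], coordinator=d)
g.up = True                                    # g rejoins
replay_hints(d, {"d": d, "g": g, "j": j})
```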
Data Model. A keyspace contains column families. A ColumnFamily is Standard or Super. Two levels of indexes (key and column name).
Data Model. Column and subcolumn sorting: specify your own comparator (TimeUUID, LexicalUUID, UTF8, Long, Bytes, or create your own).
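The choice of comparator changes the order in which columns come back. The sketch below stores column names as 8-byte big-endian values and sorts them two ways, as raw bytes and as signed longs. The helper names are illustrative, and a Python sort key stands in for a real comparator class.

```python
import struct


def pack_long(n: int) -> bytes:
    """Encode a signed 64-bit integer as 8 big-endian bytes."""
    return struct.pack(">q", n)


def bytes_comparator(name: bytes) -> bytes:
    """Order column names by their raw bytes."""
    return name


def long_comparator(name: bytes) -> int:
    """Order column names by the signed long they encode."""
    return struct.unpack(">q", name)[0]


names = [pack_long(n) for n in (256, -1, 1)]
by_bytes = [long_comparator(n) for n in sorted(names, key=bytes_comparator)]
by_long = [long_comparator(n) for n in sorted(names, key=long_comparator)]
```

Sorted as bytes, -1 lands last because its encoding is all 0xff bytes; sorted as longs it comes first. The same stored names give different slices depending on the comparator.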
Data Model Standard Column Family
Data Model Super Column Family
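One way to picture the two column family shapes is as nested maps: a standard column family is row key to column name to value, and a super column family adds one more level. The keyspace contents below ("Users", "Timelines", "alice", and so on) are invented purely for illustration.

```python
# keyspace -> column family -> row key -> column name -> value
keyspace = {
    # Standard column family: columns hang directly off the row key.
    "Users": {
        "alice": {"email": "alice@example.com", "city": "San Antonio"},
    },
    # Super column family: each super column holds its own subcolumns.
    "Timelines": {
        "alice": {
            "2010-04-01": {"text": "hello", "client": "web"},
        },
    },
}


def get_column(cf, key, column):
    """Look up one column in a standard column family."""
    return keyspace[cf][key][column]


def get_subcolumn(cf, key, super_column, column):
    """Look up one subcolumn in a super column family."""
    return keyspace[cf][key][super_column][column]
```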
Inserting: Overview. Simple: put(key, col, value). Complex: put(key, [col:value, …, col:value]). Batch: multi-key.
Inserting: Writes. Commit log for durability. Memtable: no disk access (no reads or seeks). SSTables are final (become read-only) and consist of an index, a Bloom filter, and the raw data. Atomic within a ColumnFamily. Bottom line: FAST!!!
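A minimal sketch of that write path, assuming a toy in-memory store: every put is appended to a commit log, applied to a memtable, and the memtable is flushed to an immutable SSTable once it holds enough rows. `ToyStore` and its fields are hypothetical, and a plain frozenset stands in for the Bloom filter.

```python
class ToyStore:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only; stands in for the on-disk log
        self.memtable = {}     # in memory: row key -> {column: value}
        self.sstables = []     # flushed tables, never modified afterwards
        self.memtable_limit = memtable_limit

    def put(self, key, column, value):
        self.commit_log.append((key, column, value))        # durability first
        self.memtable.setdefault(key, {})[column] = value   # no disk reads
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        rows = sorted(self.memtable.items())
        self.sstables.append({
            "data": tuple((k, tuple(sorted(c.items()))) for k, c in rows),
            "index": {k: i for i, (k, _) in enumerate(rows)},
            "bloom": frozenset(k for k, _ in rows),  # stand-in, not a real filter
        })
        self.memtable = {}


store = ToyStore(memtable_limit=2)
store.put("a", "x", "1")
store.put("b", "y", "2")   # second row triggers a flush
store.put("c", "z", "3")   # lands in a fresh memtable
```

Nothing in `put` reads from disk or seeks, which is the reason writes are fast: the only I/O is a sequential append and an occasional sequential flush.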
Querying: Overview. You need a key or keys. Single: key=‘a’. Range: key=‘a’ through ‘f’. And columns to retrieve. Slice: cols={bar through kite}. By name: key=‘b’ cols={bar, cat, llama}. Nothing like SQL “WHERE col=‘faz’”, but secondary indices are being worked on (see CASSANDRA-749).
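The three access patterns can be sketched against plain dictionaries. This assumes column names and row keys sort as ordinary strings (key ranges only make sense with an order-preserving layout); the function names are illustrative and do not mirror the Thrift API.

```python
from bisect import bisect_left, bisect_right


def get_slice(columns, start, finish):
    """Columns whose names fall in [start, finish], in sorted order."""
    names = sorted(columns)
    lo = bisect_left(names, start)
    hi = bisect_right(names, finish)
    return {name: columns[name] for name in names[lo:hi]}


def get_by_name(columns, wanted):
    """Only the named columns, skipping any that do not exist."""
    return {name: columns[name] for name in wanted if name in columns}


def get_key_range(rows, start_key, end_key):
    """Row keys in [start_key, end_key], in sorted order."""
    keys = sorted(rows)
    return keys[bisect_left(keys, start_key):bisect_right(keys, end_key)]


row_b = {"ant": 1, "bar": 2, "cat": 3, "kite": 4, "llama": 5}
sliced = get_slice(row_b, "bar", "kite")
named = get_by_name(row_b, ["bar", "cat", "llama"])
keys = get_key_range({"a": {}, "c": {}, "g": {}}, "a", "f")
```

Every lookup starts from a key and a column name or name range; there is no way here to ask "which rows have value X", which is the gap secondary indices are meant to fill.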
Querying: Reads. Not as fast as writes. Read repair when replicas are out of sync. New in 0.6: row cache (avoids the SSTable lookup) and key cache (avoids the index scan).
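Read repair can be pictured as: compare what the replicas return, keep the copy with the newest timestamp, and push it back to any replica that was behind. The sketch assumes each replica is a dict of key to (timestamp, value); `read_with_repair` is a hypothetical helper, not Cassandra's implementation.

```python
def read_with_repair(replicas, key):
    """Return the newest value for `key` and fix up stale replicas."""
    versions = [r[key] for r in replicas if key in r]
    if not versions:
        return None
    newest = max(versions, key=lambda version: version[0])  # highest timestamp
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest  # repair the out-of-sync copy
    return newest[1]


r1 = {"k": (1, "old")}
r2 = {"k": (2, "new")}
r3 = {}
value = read_with_repair([r1, r2, r3], "k")
```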
Client API (Low Level). Fat client: maybe too low-level, not well tested. Thrift (currently best supported): many language bindings, not much of a community, no streaming, fast transport. Avro: just getting started, shows promise.
Client API (High Level). Rapidly changing, getting feature-rich: connection pools, load balancing/failover. Reduces the verbosity of working with Thrift. For Java, see Hector (http://github.com/rantav/hector). Also Ruby, Python, C++, C#, Perl, PHP (http://wiki.apache.org/cassandra/ClientExamples).
Java Bits: JMX. Relatively easy to expose objects and services as MBeans. Simplifies aspects of cluster and node management. Easy monitoring. You choose the JMX-enabled system management tool (jconsole is alright).
Java Bits: available libraries. Excellent: Google Collections (Multimap, BiMap, Iterators), java.util.concurrent, NIO files (including mmap). Meh: NIO sockets.
Java Bits: Heap & GC. Cassandra tweaks the default GC settings quite a bit: -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:TargetSurvivorRatio=90 -XX:SurvivorRatio=128 -XX:MaxTenuringThreshold=0 -XX:+HeapDumpOnOutOfMemoryError -XX:+AggressiveOpts
Java Bits: code management. Library versioning: no standard way, mostly declarative, not readily queryable. Must ship every dependency, or use ant/mvn. Now you have two (or more!) problems.
Java Bits: daemonization. Java doesn’t make it easy re: stdout, stderr. After setting up, System.out and System.err are close()d. Windows: don’t ask.
Future Direction. Range delete (delete these cols from those keys); vector clocks (including server-side conflict resolution); altering keyspace/column family definitions on a live cluster; byte[] keys; compression; multi-tenant support; fewer memory restrictions.
Linky
wiki.apache.org/cassandra
cassandra.apache.org
Google BigTable: labs.google.com/papers/bigtable.html
Amazon Dynamo: s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
Facebook Cassandra: www.facebook.com/note.php?note_id=24413138919
Java tuning: java.sun.com/performance/reference/whitepapers/tuning.html
Java GC: java.sun.com/javase/technologies/hotspot/gc/index.jsp
Me: gdusbabek@gmail.com, gdusbabek on twitter and just about everything else.

Cassandra Presentation for San Antonio JUG

Editor's Notes

  1. Hello World
  2. RandomPartitioner: takes the key, uses its MD5 hash as the real key, then stores the row on the appropriate node. OrderPreservingPartitioner: gives cheap range scans. Takes more work.
  3. Eric Brewer
  4. Need to describe hinted handoff better.
  5. Keyspace == like a namespace. CF == like a table. Keyspace and Table are used interchangeably in the code.
  6. Key cache: keys whose locations are kept in memory to avoid an index scan. Row cache: entire rows kept in memory.
  7. Avro: Doug Cutting
  8. Mmap: index and data files (read-only).
  9. See java.sun.com/performance/reference/whitepapers/tuning.html and http://java.sun.com/javase/technologies/hotspot/gc/index.jsp. Goal is low pause times and high throughput:
     -XX:TargetSurvivorRatio=90: allows 90% of the survivor spaces to be occupied instead of the default 50%, allowing better utilization of the survivor space memory.
     -XX:SurvivorRatio=128: sets the survivor space ratio to 1:128, resulting in small survivor spaces. Smaller survivor spaces mean short-lived objects spend less time in the young generation (they die faster).
     -XX:+AggressiveOpts: turns on point optimizations that are expected to be on in later releases. Experimental, and sometimes reveals JDK bugs.
     -XX:+UseParNewGC -XX:+UseConcMarkSweepGC: parallel young generation collector. Similar to -XX:+UseParallelGC, except that it can be used with the concurrent collector. The benefits show on multiprocessor systems. Two short pauses instead of one long pause. First (mark): objects directly reachable (young). Second (remark): objects missed due to concurrent execution of threads.
     -XX:+CMSParallelRemarkEnabled: works with UseParNewGC to decrease the remark pauses.
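
Going back to note 2, the RandomPartitioner idea (hash the key with MD5, then place the hash on the ring) can be sketched as follows. This is a simplified illustration: the token here is the raw 128-bit MD5 value, which is an assumption of the sketch rather than Cassandra's exact token range, and the node names and evenly spaced tokens are made up.

```python
import hashlib
from bisect import bisect_left


def md5_token(key: str) -> int:
    """Interpret the MD5 digest of the key as a 128-bit unsigned integer."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def node_for(key: str, node_tokens: dict) -> str:
    """node_tokens maps node name -> integer token; returns the owning node."""
    ring = sorted((token, name) for name, token in node_tokens.items())
    tokens = [token for token, _ in ring]
    # First node whose token is >= the key's token, wrapping around the ring.
    index = bisect_left(tokens, md5_token(key)) % len(ring)
    return ring[index][1]


# Four hypothetical nodes with evenly spaced tokens across the 128-bit space.
nodes = {name: i * 2**128 // 4 for i, name in enumerate("adgj")}
owner_name = node_for("user42", nodes)
```

Because the hash scatters keys uniformly, load spreads evenly across nodes, but adjacent keys end up on unrelated nodes, which is why cheap range scans need the OrderPreservingPartitioner instead.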