Cassandra in Production
      2012.0π
The Presenter
●   Kjetil Valstadsve
    ●   Developer at openadex (Open AdExchange)
    ●   Various experience
The Task:
●   Handle 500 requests/sec
    ●   For now
●   Handle ~100 updates per request
    ●   For now
The Agenda
●   Cassandra Essentials and Data Model
●   What We Do and How
●   Scaling and Operations
●   Advice and Admonitions
Cassandra Essentials and Data Model
Cassandra Essentials
●   Inspired by BigTable (Google) and Dynamo
    (Amazon)
    ●   Eventually consistent
    ●   Multi-level map-like
    ●   Column store
●   Released by Facebook, adopted by Apache
●   Supported by DataStax
    ●   EC2 AMI
    ●   Commercial product on top: Brisk
Data Model in Brief
●   Atomic unit of storage: The Column
       –   Possibly stored in a Super Column
●   Collections of columns: The Row
       –   Or Super Columns
●   Collections of rows: The Column Family
       –   Or the Super Column Family
●   Collections of column families: The Keyspace
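The nesting above can be pictured as maps inside maps. The following is only an in-memory sketch of that shape, not Cassandra's storage format or any client API; the `Column` class and `get_column` helper are hypothetical names, and the sample values are the ones used on the following slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Atomic unit of storage: key, value and write timestamp."""
    key: str
    value: object
    timestamp: int


# keyspace -> column family -> row key -> column key -> Column
keyspace = {
    "YOUNG_AND_PROMISING": {
        "Kjetil": {
            "Age": Column("Age", 29, 1330945017654),
        },
    },
    "JUST_YOUNG": {},
}


def get_column(keyspace, column_family, row_key, column_key):
    """Walk the nested maps down to a single column."""
    return keyspace[column_family][row_key][column_key]


age = get_column(keyspace, "YOUNG_AND_PROMISING", "Kjetil", "Age")
print(age.value, age.timestamp)
```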
The Column
●   Key, value and timestamp:
    ●   Example: key Age, value 29, timestamp 1330945017654
The Row
●   Many (many, many) columns:
    ●   Columns are sorted on key, good for range queries
    ●   Scales wildly – just keep on adding columns
    ●   In practice, a persistent hash map
●   Rows can be stored sorted, or hashed
    ●   Example: row Kjetil holds column Age = 29 (timestamp 1330945017654)
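Why sorted column keys are good for range queries can be shown with a toy row. This is a sketch under assumptions: the `Row` class is hypothetical, it lives in memory only, and the integer keys of the form YYYYMMDDHH are made up for illustration.

```python
import bisect


class Row:
    """Toy row: columns are kept sorted by key."""

    def __init__(self):
        self._keys = []    # sorted column keys
        self._values = {}  # column key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def slice(self, start, end):
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


row = Row()
# Hypothetical hourly columns, inserted out of order.
for hour, hits in [(2012030510, 7), (2012030512, 3), (2012030511, 5), (2012030609, 1)]:
    row.put(hour, hits)

march_5 = row.slice(2012030500, 2012030600)
print(march_5)
```

Because the keys are sorted, the slice only needs two binary searches and one contiguous read, however many columns the row holds.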
The Column Family
●   Consists of many (many) rows:
    ●   Example: column family YOUNG_AND_PROMISING holds row Kjetil with column Age = 29 (timestamp 1330945017654)
The Keyspace
●   Consists of (many) column families:
    ●   Usually a statically known set
    ●   Example: JUST_YOUNG, YOUNG_AND_PROMISING
WTF a Super Column is
●   Columns holding (a few) other columns
●   Serialized as a single value. Does NOT scale wildly.
    ●   Example: super column with key Kjetil, timestamp 1330945017654
Can You Relate?
●   Concepts mapped to RDB data model levels
    ●   Keyspace        => Schema
    ●   Column family   => Table
    ●   Row    => Row, but without known columns
    ●   Column => Column name and value found in a row
●   RDB: Rows, column values are dynamic/data,
    column names are static/structure
●   NoSQL: Column keys are dynamic/data, too.
The Column Revisited
●   Columns are dynamic
    ●   Columns are data, not structure
●   Column keys don't have to be strings
    ●   Columns can be any supported, sortable primitive
        type, e.g. timestamps (Long)
    ●   Don't say column name, say column key
●   Columns are sorted
●   Some RDB unlearning required
What's in a Keyspace (Schema)?
●   Keyspace settings
    ●   Partitioning: Decides which node(s) will store rows
    ●   Replication factor
    ●   Custom strategies for partitioning, placement etc.
●   The set of Column Families
    ●   For each Column Family, the type of its keys
●   Optional meta-data:
    ●   Pre-defined columns
Data Model Notes/(Anti-)Patterns
●   Super columns are losing favor
    ●   Prefer “synthetic” columns (e.g. columns grouped by prefix)
    ●   Columns in super columns are schema, NOT data!
    ●   Cassandra devs hate them
●   Partitioning inside of rows is common
    ●   E.g. for x partitions, compute a hash value from the column
        name and take it modulo x, obtaining i. E.g. if “Age” hashes to
        2 (mod x), write to row name Kjetil[2]
    ●   Helps to distribute r/w traffic among nodes, for column
        families with busy/crowded rows
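The row-partitioning pattern above can be sketched as follows. The hash function is an assumption (the slides do not say which one was used); `zlib.crc32` is chosen here only because it gives the same result in every process, unlike Python's salted built-in `hash()` for strings. Both function names are hypothetical.

```python
import zlib


def partitioned_row_key(row_key, column_key, partitions):
    """Pick the physical row for a column: hash the column key, mod by x."""
    i = zlib.crc32(column_key.encode("utf-8")) % partitions
    return f"{row_key}[{i}]"


def all_partition_keys(row_key, partitions):
    """A reader has to visit every physical row of the logical row."""
    return [f"{row_key}[{i}]" for i in range(partitions)]


write_target = partitioned_row_key("Kjetil", "Age", 3)
read_targets = all_partition_keys("Kjetil", 3)
print(write_target, read_targets)
```

The trade-off is visible in the two functions: a write touches exactly one physical row, while a read of the whole logical row fans out to all x of them.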
What We Do and How
What We Do
●   Count displays of, and clicks on, ads
●   Use Cassandra to track # of hits, in time intervals:
    ●   Ads
    ●   Groups of ads
    ●   Advertiser campaigns
    ●   Display boxes
    ●   Publisher channels
    ●   Publisher sites
    ●   Other
    ●   ... and combinations thereof
One Hit, Two Boxes
Example List of Updates
●   Count +1 for:
    ●   6 ads, 6 ad groups, 6 campaigns. (No overlap.)
    ●   2 display boxes, 1 channel (in this case, same
        channel), 1 site
    ●   2 channel/ad combinations
    ●   Various secret sauce, e.g. another 4
●   28 updates
●   If click: 11 updates, count +1 for:
    ●   1 ad, ad group, campaign, box, channel, etc.
But wait, there's more!
●   Spec says “in time intervals” => +1 for each of:
    ●   The current hour
    ●   Today
    ●   This week
    ●   This month
    ●   This year
    ●   Total
●   Total: 6x28 = 168 updates
●   For average of 500 requests/sec, ~100 updates/req:
    ●   ~50,000 writes/second
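The arithmetic on the last two slides, spelled out. All figures come from the slides; only the variable names are made up.

```python
# Updates for the example hit, as listed on "Example List of Updates".
updates_per_hit = sum([
    6,  # ads
    6,  # ad groups
    6,  # campaigns
    2,  # display boxes
    1,  # channel
    1,  # site
    2,  # channel/ad combinations
    4,  # "secret sauce"
])

# Every update is repeated once per time interval.
intervals = ["hour", "day", "week", "month", "year", "total"]
writes_per_hit = len(intervals) * updates_per_hit

# Cluster-wide planning figure: average request rate times average updates.
requests_per_sec = 500
updates_per_request = 100
writes_per_sec = requests_per_sec * updates_per_request

print(updates_per_hit, writes_per_hit, writes_per_sec)
```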
Cassandra 1.0 Applied
●   New feature/godsend: Counter columns!
    ●   Like Long values, but
    ●   Accept updates that are increments to current value
●   Combined with batched updates
    ●   Phew!
●   Scale out for write traffic and workable read
    speed
    ●   Done!
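What makes counter columns convenient is that the client sends increments and never reads the current value first. The sketch below imitates that with a nested dictionary standing in for the store; it is not the Cassandra or Hector API, and `new_store`, `apply_batch` and the sample row and column keys are hypothetical.

```python
from collections import defaultdict


def new_store():
    """row key -> column key -> counter value, all counters starting at 0."""
    return defaultdict(lambda: defaultdict(int))


def apply_batch(store, increments):
    """Apply a batch of (row_key, column_key, delta) increments in one call."""
    for row_key, column_key, delta in increments:
        store[row_key][column_key] += delta


store = new_store()
# One hit produces many increments; they are collected and sent together.
batch = [
    ("H", "ad:e13083", 1),
    ("D", "ad:e13083", 1),
    ("D", "ad:e13083", 1),
]
apply_batch(store, batch)
print(store["D"]["ad:e13083"], store["H"]["ad:e13083"])
```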
Real data: Row and columns



●   D[0]
    ●   D: Daily interval, partition 0 (hashed from key)
●   20120121
    ●   The day: January 21 this year
●   channel_ad/Channel:b29-Ad:e13083
    ●   1 click, 7 hits for ad 13083 in channel 29 on that day
Stupid Pet Tricks for Sorting
●   Funny-looking values in the column key?
    ●   a1
    ●   b29
    ●   c432
    ●   d2345
    ●   e34345
●   Sortable, more compact and scalable than:
    ●   00000000029
    ●   00000000432
    ●   ...
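The funny-looking values follow a simple rule: a letter that encodes the number of digits, then the digits. A sketch of that rule, assuming non-negative integer IDs of at most 26 digits; the function names are hypothetical, and the column key layout is copied from the "Real data" slide.

```python
def encode_id(n):
    """Prefix the decimal digits with a letter for their count: 29 -> 'b29'."""
    digits = str(n)
    if n < 0 or len(digits) > 26:
        raise ValueError("only non-negative IDs of up to 26 digits are supported")
    return chr(ord("a") + len(digits) - 1) + digits


def decode_id(s):
    """Drop the length letter and parse the digits."""
    return int(s[1:])


def channel_ad_column_key(channel_id, ad_id):
    return f"channel_ad/Channel:{encode_id(channel_id)}-Ad:{encode_id(ad_id)}"


ids = [34345, 1, 432, 29, 2345]
encoded = sorted(encode_id(n) for n in ids)
print(encoded)
print(channel_ad_column_key(29, 13083))
```

Plain string sorting now agrees with numeric order, because shorter numbers get an earlier letter and numbers of equal length already sort correctly as strings. No zero padding is needed, so no maximum width has to be chosen up front.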
Given hit in channel 29 ...
●   Read from an application-configured set of rows
●   Example config: last 4 hours, 3 days, 2 weeks.
    ●   9 logical rows to read from
    ●   Assume 3 partitions for each logical row.
    ●   Read from 27 physical rows, all (or a minimum count of)
        columns beginning with:
        –   channel_ad/Channel:b29-Ad:
●   Obtain synthetic clicks/hits ratio for each ad
●   And channel_ad is just one of the ratios to use
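The read path above can be sketched in two steps: enumerate the physical rows, then aggregate the columns that share the prefix. Everything here is illustrative: the interval letters `H` and `W` are assumptions (only `D` for daily appears in the slides), the way clicks and hits are paired per column is a guess at the layout, and the sample numbers are made up.

```python
from collections import defaultdict

# Interval letter -> how many recent intervals to read (4 hours, 3 days, 2 weeks).
CONFIG = {"H": 4, "D": 3, "W": 2}
PARTITIONS = 3


def physical_rows(config, partitions):
    """One (interval, age, partition) triple per physical row to read."""
    return [
        (interval, age, p)
        for interval, count in config.items()
        for age in range(count)
        for p in range(partitions)
    ]


def ratios(columns, prefix):
    """Sum clicks and hits per ad under the prefix, then compute clicks/hits."""
    clicks = defaultdict(int)
    hits = defaultdict(int)
    for key, (c, h) in columns:
        if key.startswith(prefix):
            ad = key[len(prefix):]
            clicks[ad] += c
            hits[ad] += h
    return {ad: clicks[ad] / hits[ad] for ad in hits if hits[ad] > 0}


rows = physical_rows(CONFIG, PARTITIONS)

# Columns as they might come back from several rows: key -> (clicks, hits).
columns = [
    ("channel_ad/Channel:b29-Ad:e13083", (1, 7)),
    ("channel_ad/Channel:b29-Ad:e13083", (0, 3)),
    ("channel_ad/Channel:b30-Ad:e13083", (5, 5)),
]
result = ratios(columns, "channel_ad/Channel:b29-Ad:")
print(len(rows), result)
```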
Caching of Synthetic Ratios
●   Use ehcache
    ●   In-memory, fast
    ●   In-memory, clutters heap, provokes stop-the-world GC
●   Cache in Cassandra
    ●   Store synthetic reads back in Cassandra (on-demand “denormalization”)
    ●   Still sensitive to high Cassandra loads
●   Local Redis instance on each box
    ●   Stand-alone: Isolated from high Cassandra loads
    ●   Off-heap: Reduce stop-the-world GC
    ●   Fast: Configured for in-memory caching behavior
    ●   Typical time to retrieve a Java object: 200 µs to 2 ms
    ●   Good trade-off
Client Libraries
●   Out-of-the-box: Thrift
    ●   Usable, but should not be mixed up with business
        logic
●   Java recommendation: Hector
    ●   https://github.com/rantav/hector
    ●   Connection pooling
    ●   Just-above-Thrift-level
    ●   Type-safe(r) r/w
Scaling and Operations
Operations: Quickstart on EC2
●   DataStax AMI:
    ●   http://datastax.com/docs/1.0/install/install_ami
    ●   Readymade cluster of N nodes
    ●   Free OpsCenter
Operations: Scaling
●   Scaling Strategy:
    ●   Doubling/halving capacity is very convenient
    ●   => New nodes redistribute the load automatically
Operations: Backup
●   System-wide backups
    ●   Nodes can be asked to dump Snapshots
    ●   Recovery: New nodes started from Snapshots
●   Selective backups
    ●   Selected data can be dumped to/read from JSON
    ●   sstable2json/json2sstable
●   Incremental backups
Advice and Admonitions
Introducing Cassandra
●   Look for data that
    ●   Grows fast
    ●   Holds useful information, given time to analyze it
    ●   Can be reproduced from source data (e.g. log files)
●   Avoid business-critical data
    ●   Let RDBMS handle all that
Living with Cassandra
●   Columns are data that live in a context:
    ●   Sorted in pre-defined ways, determining query
        efficiency
    ●   Queried for by application in other ways
●   Columns are data coupled to your logic
    ●   Typical: Encoding and parsing column names
    ●   Queries will change in development/maintenance
        –   Persisted formats should change
        –   Code must change
Cost of Change
●   Your NoSQL data are, relative to your RDB data:
    ●   Bigger
    ●   More loosely-defined
    ●   More closely-coupled to application code
    ●   Harder to query (and easier queries => bigger data)
    ●   Less supported by mature tools
●   Affects cost of change
●   Rebuild-from-source-data is a better option than
    migrate-existing-data - if it's practical

Cassandra in production

  • 2. The Presenter ● Kjetil Valstadsve ● Developer at openadex (Open AdExchange) ● Various experience
  • 3. The Task: ● Handle 500 requests/sec ● For now ● Handle ~100 updates per request ● For now
  • 4. The Agenda ● Cassandra Essentials and Data Model ● What We Do and How ● Scaling and Operations ● Advice and Admonitions
  • 6. Cassandra Essentials ● Inspired by BigTable (Google) and Dynamo (Amazon) ● Eventually consistent ● Multi-level map-like ● Column store ● Released by Facebook, adopted by Apache ● Supported by DataStax ● EC2 AMI ● Commercial product on top: Brisk
  • 7. Data Model in Brief ● Atomic unit of storage: The Column – Possibly stored in a Super Column ● Collections of columns: The Row – Or Super Columns ● Collections of rows: The Column Family – Or the Super Column Family ● Collections of column families: The Keyspace
  • 8. The Column ● Key, value and timestamp: Age 29 1330945017654
  • 9. The Row ● Many (many, many) columns: ● Columns are sorted on key, good for range queries ● Scales wildly – just keep on adding columns ● In practice, a persistent hash map ● Rows can be stored sorted, or hashed Age Kjetil 29 1330945017654
  • 10. The Column Family ● Consists of many (many) rows: YOUNG_AND_PROMISING Age Kjetil 29 1330945017654
  • 11. The Keyspace ● Consists of (many) column families: JUST_YOUNG ● Usually a statically known set YOUNG_AND_PROMISING
  • 12. WTF a Super Column is ● Columns holding (a few) other columns: ● Serialized as single value. Do NOT scale wildly. Kjetil 1330945017654
  • 13. Can You Relate? ● Concepts mapped to RDB data model levels ● Keyspace => Schema ● Column family => Table ● Row => Row, but without known columns ● Column => Column name and value found in a row ● RDB: Rows, column values are dynamic/data, column names are static/structure ● NoSQL: Column keys are dynamic/data, too.
  • 14. The Column Revisited ● Columns are dynamic ● Columns are data, not structure ● Column keys don't have to be strings ● Columns can be any supported, sortable primitive type, e.g. timestamps (Long) ● Don't say column name, say column key ● Columns are sorted ● Some RDB unlearning required
  • 15. What's in a Keyspace Schema? ● Keyspace settings ● Partitioning: Decides which node(s) will store rows ● Replication factor ● Custom strategies for partitioning, placement etc. ● The set of Column Families ● For each Column Family, the type of its keys ● Optional meta-data: ● Pre-defined columns
  • 16. Data Model Notes/(Anti-)Patterns ● Super columns are losing favor ● Prefer “synthetic” columns (e.g. columns grouped by prefix) ● Columns in super columns are schema, NOT data! ● Cassandra devs hate them ● Partitioning inside of rows is common ● E.g. for x partitions, compute a hash value from the column key and take it modulo x, obtaining i. E.g. if “Age” hashes to 2 modulo x, write to row key Kjetil[2] ● Helps to distribute r/w traffic among nodes, for column families with busy/crowded rows
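
The row-partitioning idea on slide 16 can be sketched roughly as follows. This is only an illustration, not the presenter's code: the function names are hypothetical, and CRC32 is an assumed choice of hash (the slide does not say which hash was used).

```python
import zlib


def partitioned_row_key(row_key: str, column_key: str, partitions: int) -> str:
    """Pick one of `partitions` physical rows for a column of a logical row."""
    # Assumption: CRC32 as the hash. It is stable across processes,
    # unlike Python's built-in hash() for strings.
    index = zlib.crc32(column_key.encode("utf-8")) % partitions
    return f"{row_key}[{index}]"


def all_partition_keys(row_key: str, partitions: int) -> list[str]:
    """All physical row keys that must be read to see the whole logical row."""
    return [f"{row_key}[{i}]" for i in range(partitions)]


# A write for column "Age" of logical row "Kjetil" goes to exactly one physical row.
print(partitioned_row_key("Kjetil", "Age", 3))
# A read of the whole logical row has to visit every partition.
print(all_partition_keys("Kjetil", 3))
```

The trade-off is that writes spread over several rows (and thus nodes), while a full read of the logical row costs one read per partition.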
  • 17. What We Do and How
  • 18. What We Do ● Count displays of, and clicks on, ads ● Use Cassandra to track # of hits, in time intervals: ● Ads ● Groups of ads ● Advertiser campaigns ● Display boxes ● Publisher channels ● Publisher sites ● Other ● ... and combinations thereof
  • 19. One Hit, Two Boxes
  • 20. Example List of Updates ● Count +1 for: ● 6 ads, 6 ad groups, 6 campaigns. (No overlap.) ● 2 display boxes, 1 channel (in this case, same channel), 1 site ● 2 channel/ad combinations ● Various secret sauce, e.g. another 4 ● 28 updates ● If click: 11 updates, count +1 for: ● 1 ad, ad group, campaign, box, channel, etc.
  • 21. But wait, there's more! ● Spec says “in time intervals” => +1 for each of: ● The current hour ● Today ● This week ● This month ● This year ● Total ● Total: 6 x 28 = 168 updates ● For average of 500 requests/sec, ~100 updates/req: ● ~50,000 writes/second
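
The arithmetic on slides 20 and 21 can be checked with a few lines. The numbers come from the slides; the variable names are only for illustration.

```python
# Intervals that each counter is tracked in (slide 21).
INTERVALS = ["hour", "day", "week", "month", "year", "total"]

# Counters touched by the example hit on slide 20.
updates_per_hit = 28

# Every counter is incremented once per interval.
print(len(INTERVALS) * updates_per_hit)  # 168

# Average load from the slides: 500 requests/sec, ~100 updates per request.
requests_per_sec = 500
avg_updates_per_request = 100
print(requests_per_sec * avg_updates_per_request)  # 50000
```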
  • 22. Cassandra 1.0 Applied ● New feature/godsend: Counter columns! ● Like Long values, but ● Accept updates that are increments to current value ● Combined with batched updates ● Phew! ● Scale out for write traffic and workable read speed ● Done!
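
One possible way to combine counter increments with batching is sketched below. It is an assumption-laden illustration, not the presenter's implementation: `IncrementBatcher` and `send_batch` are hypothetical names, `send_batch` stands in for whatever client call writes a batch of counter increments, and merging increments client-side before sending is an extra step the slides do not mention.

```python
from collections import Counter


class IncrementBatcher:
    """Collects +N increments per (row key, column key) and sends them in batches."""

    def __init__(self, send_batch, max_pending=1000):
        # send_batch: hypothetical callable taking {(row_key, column_key): delta}.
        self._send_batch = send_batch
        self._max_pending = max_pending
        self._pending = Counter()

    def increment(self, row_key, column_key, delta=1):
        # Increments to the same counter are merged before they are sent.
        self._pending[(row_key, column_key)] += delta
        if len(self._pending) >= self._max_pending:
            self.flush()

    def flush(self):
        # Send whatever has accumulated; no-op when nothing is pending.
        if self._pending:
            batch = dict(self._pending)
            self._pending.clear()
            self._send_batch(batch)


# Usage: print each batch instead of writing it to a cluster.
batcher = IncrementBatcher(print, max_pending=2)
batcher.increment("D[0]", "channel_ad/Channel:b29-Ad:e13083")
batcher.increment("D[0]", "channel_ad/Channel:b29-Ad:e13083")
batcher.flush()
```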
  • 23. Real data: Row and columns ● D[0] ● D: Daily interval, partition 0 (hashed from key) ● 20120121 ● The day: January 21 this year ● channel_ad/Channel:b29-Ad:e13083 ● 1 click, 7 hits for ad 13083 in channel 29 on that day
  • 24. Stupid Pet Tricks for Sorting ● Funny-looking values in the column key? ● a1 ● b29 ● c432 ● d2345 ● e34345 ● Sortable, more compact and scalable than: ● 00000000029 ● 00000000432 ● ...
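
The values on slide 24 look like a length prefix: a letter encoding the number of digits, followed by the digits. A minimal sketch under that reading (the function names are hypothetical):

```python
import string


def sortable_id(n: int) -> str:
    """Encode a non-negative integer so that string order matches numeric order."""
    digits = str(n)
    if n < 0 or len(digits) > len(string.ascii_lowercase):
        raise ValueError("only non-negative integers of up to 26 digits are supported")
    # 'a' for 1 digit, 'b' for 2 digits, and so on.
    return string.ascii_lowercase[len(digits) - 1] + digits


def parse_sortable_id(key: str) -> int:
    """Decode a key produced by sortable_id."""
    return int(key[1:])


print([sortable_id(n) for n in (1, 29, 432, 2345, 34345)])
# ['a1', 'b29', 'c432', 'd2345', 'e34345']
```

Shorter numbers get an earlier letter, so they sort first; numbers of equal length share a letter and compare digit by digit. Compared with zero-padding, the key costs one extra character and needs no fixed maximum width short of 26 digits.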
  • 25. Given hit in channel 29 ... ● Read from an application-configured set of rows ● Example config: last 4 hours, 3 days, 2 weeks. ● 9 logical rows to read from ● Assume 3 partitions for each logical row. ● Read from 27 physical rows, all (or a minimum count of) columns beginning with: – channel_ad/Channel:b29-Ad: ● Obtain synthetic clicks/hits ratio for each ad ● And channel_ad is just one of the ratios to use
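
The read fan-out and ratio computation on slide 25 might look roughly like this. It is a sketch under stated assumptions: the function names are hypothetical, each fetched row is modelled as a plain dict, and each column value is modelled as a `(clicks, hits)` pair, which is a simplification of however the counters are actually stored.

```python
# Example config from the slide: last 4 hours, 3 days, 2 weeks.
READ_CONFIG = {"hour": 4, "day": 3, "week": 2}
PARTITIONS = 3


def physical_reads(config, partitions):
    """List (interval, steps_back, partition) triples, one per physical row."""
    return [
        (interval, back, part)
        for interval, count in config.items()
        for back in range(count)
        for part in range(partitions)
    ]


def ratios_by_column(rows, prefix):
    """Sum (clicks, hits) per column key starting with prefix; return clicks/hits."""
    totals = {}
    for columns in rows:
        for key, (clicks, hits) in columns.items():
            if key.startswith(prefix):
                c, h = totals.get(key, (0, 0))
                totals[key] = (c + clicks, h + hits)
    # Skip columns without hits to avoid dividing by zero.
    return {key: c / h for key, (c, h) in totals.items() if h > 0}


print(len(physical_reads(READ_CONFIG, PARTITIONS)))  # 27

rows = [
    {"channel_ad/Channel:b29-Ad:e13083": (1, 7)},
    {"channel_ad/Channel:b29-Ad:e13083": (1, 3)},
]
print(ratios_by_column(rows, "channel_ad/Channel:b29-Ad:"))
```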
  • 26. Caching of Synthetic Ratios ● Use ehcache ● In-memory, fast ● In-memory, clutters heap, provokes stop-the-world GC ● Cache in Cassandra ● Store synthetic reads back in Cassandra (on-demand “denormalization”) ● Still sensitive to high Cassandra loads ● Local Redis instance on each box ● Stand-alone: Isolated from high Cassandra loads ● Off-heap: Reduces stop-the-world GC ● Fast: Configured for in-memory caching behavior ● Typical time to retrieve a Java object: 200 µs to 2 ms ● Good trade-off
  • 27. Client Libraries ● Out-of-the-box: Thrift ● Usable, but should not be mixed up with business logic ● Java recommendation: Hector ● https://github.com/rantav/hector ● Connection pooling ● Just-above-Thrift-level ● Type-safe(r) r/w
  • 29. Operations: Quickstart on EC2 ● DataStax AMI: ● http://datastax.com/docs/1.0/install/install_ami ● Readymade cluster of N nodes ● Free OpsCenter
  • 30. Operations: Scaling ● Scaling Strategy: ● Doubling/halving capacity is very convenient ● => New nodes automatically redistribute load
  • 31. Operations: Backup ● System-wide backups ● Nodes can be asked to dump Snapshots ● Recovery: New nodes started from Snapshots ● Selective backups ● Selected data can be dumped to/read from JSON ● sstable2json/json2sstable ● Incremental backups
  • 33. Introducing Cassandra ● Look for data that ● Grows fast ● Holds useful information, given time to analyze it ● Can be reproduced from source data (e.g. log files) ● Avoid business-critical data ● Let RDBMS handle all that
  • 34. Living with Cassandra ● Columns are data that live in a context: ● Sorted in pre-defined ways, determining query efficiency ● Queried for by application in other ways ● Columns are data coupled to your logic ● Typical: Encoding and parsing column names ● Queries will change in development/maintenance – Persisted formats should change – Code must change
  • 35. Cost of Change ● Your NoSQL data are, relative to your RDB data: ● Bigger ● More loosely-defined ● More closely-coupled to application code ● Harder to query (and easier queries => bigger data) ● Less supported by mature tools ● Affects cost of change ● Rebuild-from-source-data is a better option than migrate-existing-data, if it's practical