2. Who am I?
• Ben Coverston
• DSE Architect
• DataStax since 2010
• Previous Experience in the Travel Industry
• Low Cost Airlines / Web Reservations
• Past: HP / Accenture
• Lived in Santa Catarina for a few years.
Monday, September 2, 13
5. What is NoSQL?
NoSQL is a term coined by Carlo Strozzi and
repurposed by Eric Evans to refer to “some”
storage systems. The NoSQL term should be used
as in the Not-Only-SQL and not as No to SQL or
Never SQL.
-- Alex Popescu
6. What is NoSQL (Cont.)
• It’s not
• No to SQL
• About performance
• About scaling
• ACID
• Eventual consistency
• Volume
• It is:
• About choice
7. Diversity in Data
• Big Data has the 3 (or 4 or 5) V’s
• Volume
• Variety
• Velocity
• Variability (sometimes)
• Value (other times)
8. Diversity in Data
• The V’s don’t cover everything
• Availability is important
• Your use case is important too
9. Is NoSQL Big Data?
• You can store Big Data with an RDBMS
• Is it easy?
• Is it cost effective?
• What kind of compromises do you have to make?
10. The Problems
• In general there are two classes of data problems
• OLTP (Real-Time)
• Analytics (Batch)
• Usually you want both
• No solution is perfect for everyone
• Popularity is no indication of fitness
11. Use Cases
• OLTP
• Low Latency
• High Throughput
• LOB Applications
• Batch
• Predictive Models
• Complex Queries
• Tomorrow (or precalculated, but now we need OLTP)
12. Where to put your ‘Stuff’
• Sharded RDBMS
• MPP -- Greenplum, Teradata
• Hadoop
• Key/Value
• Columnar
• Other
13. Why not just one?
• Analytics
• Optimize serial IO
• Limitations in Storage
• OLTP
• Working Set
• Distribution
• Availability
• Storage Medium
14. Why Do We Need Something Else?
• ACID semantics are often overkill
• ACID also makes the database layer brittle
• This means you get less Availability (CAP Theorem)
18. But Sharding
• Is Painful
• Requires ‘something else’
• Most NoSQL solutions auto-shard
• Sharding requires tradeoffs.
• Which means your application will need to change
20. Which should I choose?
• Analytics
• Hadoop (probably) if your data is big
• Spark, other (sometimes faster) solutions available now
• NoSQL
• Let’s talk!
21. Decisions are about tradeoffs, never a zero-sum game
Fast, Cheap, Good -- Choose Two
22. CAP Theorem
• More of two, less of one
• Consistency
• Availability
• Partition tolerance
• You have to accept P
• That leaves C and A
23. How To Scale Anything
• Partition By Function
• Split Horizontally
• Avoid Distributed Transactions
• Avoid Synchronous Coupling
• Virtualize Everywhere
• Cache Everything
24. Partition By Function
• Don’t put everything in the same database
• Physical
• Pools of Machines
• Geographical Distribution
• Automatic sharding (look for this)
• Make sure it works!
• Virtual
• Logical Tables, Schema
• Not 100% necessary, but schema is nice
25. Partitioning (cont.)
• Pros
• Isolate failure
• To a region
• To a service
• Simplify Failover
• Cons
• Your DB has to handle multi-region replication
• If you chose CP (CAP) you’re going to have a bad time
• AP systems do OK here (Cassandra actually excels)
• “Relational” part of databases becomes complex
• Everything gets denormalized
26. Split Horizontally
• Scaling Vertically is easy
• To a point, then it gets expensive... fast
• Easy if your system has no state to maintain
• Or if the states are known, and small
• Sharding over dependent fields complicates design
• Some things distribute themselves easily
• key/value stores
• Others not so much
• BTree indexes, foreign keys
• P2P architecture is helpful when splitting
• In other words, avoid masters
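One way to picture why key/value data distributes easily without a master is a consistent-hash ring. The sketch below is illustrative only: the class name `HashRing`, the node names, and the `vnodes` count are assumptions made for this example, not the placement logic of any particular product.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring: every node is a peer, there is no master."""

    def __init__(self, nodes, vnodes=64):
        self._vnodes = vnodes  # tokens per node (an arbitrary choice here)
        self._ring = []        # sorted list of (token, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _token(value):
        # MD5 is used only to spread keys evenly, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        # Each node claims several positions on the ring.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._token(f"{node}#{i}"), node))

    def node_for(self, key):
        # Walk clockwise to the first token at or after the key's token,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_left(self._ring, (self._token(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])  # hypothetical node names
owner = ring.node_for("user:42")
```

Any client that knows the node list can compute the owner of a key on its own, and adding a node only moves the keys that the new node takes over. Foreign keys and BTree indexes have no equally simple placement rule, which is why they are harder to split.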
27. Split Horizontally (cont)
• Pros
• Can be as fast or faster than traditional design
• Can scale up as long as you can afford more machines
• Scaling is easy if you avoid having masters
• Replication and failover don’t have to be special cases
• Cons
• Even logical pieces of your app are distributed over many machines
• example: your catalog is not all in one place
• Real time analytics is difficult, or slower
28. Avoid Distributed Transactions
• Have you tried this?
• Hard to do right
• Paxos gives us some hope
• CAS in Cassandra 2.0 looks promising
• Even then, it’s not good for everything!
• MVCC works for many use cases
• Compensating Mechanisms
• Customer Service (Amazon, inventory)
29. Avoid Distributed Transactions
• Pros
• Consistency in a distributed environment
• Cons
• Slow
• Overkill
• Did I say slow?
• We chose CP so we get less A
• What happens when they don’t succeed?
• Do we shut the whole thing down?
30. Avoid Synchronous Coupling
• What?
• A or B can be down
• A can be down, B continues to work
• B can suffer, while A continues to work
• If your recommendation engine fails, your customers can still buy stuff!
• Master/Slave failover is a good example of synchronous coupling
• Master is down, slave needs to take over, but in the meantime... what happens?
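The recommendation-engine point can be sketched in a few lines. This is a minimal illustration under stated assumptions: `build_product_page`, `broken_recommender`, and the page dictionary are hypothetical names invented for the example.

```python
def broken_recommender(user_id):
    # Hypothetical stand-in for a recommendation service that is down.
    raise ConnectionError("recommendation service unavailable")


def build_product_page(product, user_id, recommender):
    """Build the page; recommendations are optional, buying is not."""
    page = {"product": product, "can_buy": True, "recommendations": []}
    try:
        page["recommendations"] = recommender(user_id)
    except (ConnectionError, TimeoutError):
        # Service B failed, so degrade gracefully: the page still renders
        # and the customer can still buy, just without suggestions.
        pass
    return page


page = build_product_page("widget", 42, broken_recommender)
```

The purchase path never waits on, or fails because of, the optional service.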
31. Avoid Synchronous Coupling
• Pros
• Fewer shared dependencies means less failure
• Less failure means more total uptime
• For the whole
• Less coupling means that your application topology is more modular
• Introducing new, decoupled services is less risky
• Cons
• More duplication of your infrastructure
• e.g. now you have an application stack for each of your services.
32. Avoid Synchronous Processing Flows
• AKA
• Blocking Sockets
• Serialized Processes
• Locking in General
• Do what is important FIRST
• Take their money
• Modify Inventory
• Other less important stuff can be queued
• Triggers
• Joins
• Stored Procedures
• Consistency Checks
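A rough sketch of "important first, queue the rest", using only the Python standard library. The task names, the `events` list, and the order id are assumptions made for illustration; a real system would likely use a durable queue rather than an in-process one.

```python
import queue
import threading

deferred = queue.Queue()  # less important work waits here
events = []               # stand-in for real side effects


def place_order(order_id):
    # Critical path: do these synchronously, in order.
    events.append(("charge_card", order_id))
    events.append(("modify_inventory", order_id))
    # Nice-to-haves: enqueue and return immediately.
    deferred.put(("consistency_check", order_id))
    deferred.put(("send_receipt", order_id))
    return "confirmed"


def worker():
    # Drain the queue in the background; None is the shutdown signal.
    while True:
        task = deferred.get()
        try:
            if task is None:
                return
            events.append(task)
        finally:
            deferred.task_done()


def run_demo():
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    status = place_order(1001)
    deferred.join()     # wait until the queued work has been processed
    deferred.put(None)  # then tell the worker to stop
    t.join()
    return status
```

Calling `run_demo()` returns "confirmed" as soon as the two critical steps are done; the queued tasks are recorded afterwards by the worker thread.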
33. Avoid Synchronous Processing Flows
• Pros
• Critical operations will not block for nice-to-haves
• Queues are easy to monitor, and tasks can be assigned priorities
• Problem areas are easier to identify
• Cons
• Race conditions
• More up front development cost
34. Virtualize Everything
• DON’T
• Pick your database because it has a sexy API
• Pick your database because it worked for somebody else
• DO
• Pick a database that will fit with your use case
• Virtualize your data model
• Encourage manipulation of your logical models
• DO NOT force interaction with your database
• Good virtualization means that you can change your data store later...
• And most of your code will still work.
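A minimal sketch of what virtualizing the data model might look like in application code. `Customer`, `CustomerStore`, and `InMemoryCustomerStore` are hypothetical names chosen for the example; the in-memory backend stands in for whatever database is actually chosen.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Customer:
    customer_id: str
    name: str


class CustomerStore(ABC):
    """Logical model: application code depends on this, not on a database."""

    @abstractmethod
    def save(self, customer): ...

    @abstractmethod
    def get(self, customer_id): ...


class InMemoryCustomerStore(CustomerStore):
    """One possible backend; another class could wrap a real data store."""

    def __init__(self):
        self._rows = {}

    def save(self, customer):
        self._rows[customer.customer_id] = customer

    def get(self, customer_id):
        return self._rows[customer_id]


def rename_customer(store, customer_id, new_name):
    # Business logic manipulates the logical model only.
    customer = store.get(customer_id)
    customer.name = new_name
    store.save(customer)
    return customer
```

Swapping the data store later means writing another `CustomerStore` subclass; `rename_customer` and code like it stay unchanged.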
35. Virtualize Everything
• Virtualization isn’t just for the programmer
• Things fall apart
• Requests have to be re-routed
• Parts Replaced
• APIs change
• Good virtualization means you can make changes without impacting availability
36. Cache Appropriately
• You can’t cache everything
• But you can cache stuff that doesn’t change
• Or is expensive to retrieve
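As a small illustration of caching data that rarely changes, the sketch below wraps a slow lookup with `functools.lru_cache` from the Python standard library. The lookup function, its table, and the call counter are assumptions made for the example; the counter exists only to show that the slow path runs once per key.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts trips to the "expensive" backend

# Hypothetical reference data that rarely changes.
_AIRPORTS = {"GRU": "Sao Paulo/Guarulhos", "FLN": "Florianopolis"}


def load_airport_name(code):
    # Stand-in for an expensive query against the system of record.
    CALLS["count"] += 1
    return _AIRPORTS.get(code, "unknown")


@lru_cache(maxsize=1024)  # bounded, because not everything will fit
def airport_name(code):
    return load_airport_name(code)
```

Repeated calls such as `airport_name("GRU")` hit the backend only the first time. Data that changes often would also need an invalidation strategy, for example `airport_name.cache_clear()`.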
37. Cache Appropriately
• Pros
• Cache is fast (compared to traditional RDBMS access)
• Can give you a performance buffer
• Cons
• Cache Coherence
• Cache Dependency
• Is it a SPOF?
• What if it all doesn’t fit?
38. What about NoSQL?
• All of this applies
• Evaluate Products on their Strengths
• If easy things are easy
• The hard might be impossible
• Pick something that makes the hard things possible
39. What are the ‘easy’ things?
• Serialization Formats
• JSON/BSON
• Data Models
• HTTP/REST/JSON APIs
• NodeJS Drivers!
• etc.
40. The ‘hard’ things
• Automatic Sharding
• Where does the data go?
• How do I find it?
• How do I add another?
• Multi DC
• Replication
• No SPOF
• Anti-Entropy
• Continuous Availability
• Upgrades
• Failure
• Etc.
41. What should you use?
• Your decision
• Not every database is a fit for every problem.
42. DataStax Enterprise
• DSE
• Cassandra (OLTP)
• Analytics
• Search
• The hard things are possible
• We’re making the easy things easier