Until recently, agile business had to choose between the benefits of JSON-based NoSQL databases and the benefits of SQL-based querying. NoSQL provides schema flexibility, high performance, and elastic scaling, while SQL provides expressive, independent data access. Recent convergence allows developers and organizations to have the best of both worlds.
Developers need to deliver apps that readily evolve, perform, and scale, all to match changing business needs. Organizations need rapid access to their operational data, using standard analytical tools, for insight into their business. In this session, you will learn the ways that SQL can be applied to NoSQL databases (N1QL, SQL++, ODBC, JDBC, and others), and what additional features are needed to deal with JSON documents. SQL for JSON, JSON data modeling, indexing, and tool integration will be covered.
Please make sure to visit the sponsors and say thank you. They make this great event possible.
Couchbase is here too, if you have questions or just want to talk databases, please come on by!
Spend just a little time on why people are using NoSQL
Talk about how data is modeled differently in JSON
Let’s talk about why SQL is good and why SQL for JSON is needed
Let’s talk about the exciting stuff happening in the database ecosystem
Including but not limited to the stuff Couchbase is doing
If we have time, we’ll look at how a .NET developer (or Java developer, etc) would interact with SQL for JSON
What’s also interesting is that we’re seeing the use of NoSQL expand inside many of these companies. Orbitz, the online travel company, is a great example – they started using Couchbase to store their hotel rate data, and now they use Couchbase in many other ways.
Same with ebay, they recently presented at the Couchbase conference with a chart tracking how many instances of various nosql databases are in use, and we see growth in Cassandra, mongo, and couchbase has actually surpassed them within ebay
Marriott is using Couchbase, so you may be interacting with it right now
SQL (relational) databases are great. They give you LOT OF functionality.
Great set of abstractions (tables, columns, data types, constraints, triggers, SQL, ACID TRANSACTIONS, stored procedures and more) at a highly reasonable cost.
Change is inevitable
One thing RDBMS does not handle well is CHANGE.
Change of schema (both logical and physical), change of hardware, change of capacity.
NoSQL databases ESPECIALLY ONES DESIGNED TO BE DISTRIBUTED tend to help solve problems with: agility, scalability, performance, and availability
Let’s talk about what NoSQL is, first.
NoSQL generally refers to databases which lack SQL or don’t use a relational model
Once the SQL language, transaction became optional, flurry of databases were created using distinct approaches for common use-cases.
KEY-Value simply provided quick access to data for a given KEY.
Wide Column databases can store large number of arbitrary columns in each row
Graph databases store data and relationships as first class concepts
Document databases aggregate data into a hierarchical structure.
With JSON is a means to the end. Document databases provide flexible schema,built-in data types, rich structure, implicit relationships using JSON.
When we look at document databases, they originally came with a
Minimal set of APIs and features
But as they continue to mature, we’re seeing more features being added
And generally I’m seeing a convergent trend between SQL and NoSQL
But anyway, this set of minimal features, lacking a SQL language and tables gives us the buzzword “nosql”
Elastic scaling
Size your cluster for today
Scale out on demand
Cost effective scaling
Commodity hardware
On premise or on cloud
Scale OUT instead of Scale UP
[example: changing the channel to a soccer game or Game of Thrones, everyone makes the same API request in the same 5 minutes]
[example: TV show lets watchers vote during some period of the week, so you can scale up during that period of time]
[example: black Friday]
Elastic scaling
Size your cluster for today
Scale out on demand
Cost effective scaling
Commodity hardware
On premise or on cloud
Scale OUT instead of Scale UP
[example: changing the channel to a soccer game or Game of Thrones, everyone makes the same API request in the same 5 minutes]
[example: TV show lets watchers vote during some period of the week, so you can scale up during that period of time]
[example: black Friday]
Schema flexibility
Easier management of change in the business requirements
Easier management of change in the structure of the data
Sometimes you're pulling together data, integrating from different sources (e.g. ELT) and that flexibility helps
Document database means that you have no rigid schema. You can do whatever the heck you want.
That being said, you SHOULDN’T. You should still have discipline about your data.
NoSQL systems are optimized for specific access patterns
Low response time for web & mobile user experience
Millisecond latency
Consistently high throughput to handle growth
[perf measures can be subjective – talk about architecture, integrated cache, maybe mention MDS too]
If one tap for pepsi goes out, there’s others available. If one db node goes down, the others will compensate.
This is related to scaling
Built-in replication and fail-over
No application downtime when hardware fails
Online maintenance & upgrade
No application downtime
Let’s talk about data modeling a bit, because storing data in JSON
Is different that storing in tables.
So I want to compare the approaches over 4 key areas.
I’m going to fill in this table, traditional SQL on the left and JSON on the right
Let’s look at modeling Customer data. This is an example of what a customer might look like
There is a rich structure: attributes, potentially sub-attributes (first name and last name)
Relationships: to other data (other customers, to products perhaps)
Value evolution: Maybe we’d start with one billing option, change to multiple (data is updated)
Structure evolution: Maybe we start without connections and add those later, or we evolve name field to be more than first and last name (data is reshaped)
Rich Structure
In relational database, this customers data would be stored in five normalized tables.
Each time you want to construct a customer object, you JOIN the data in these tables;
Each time you persist, you find the appropriate rows in relevant tables and insert/update.
Relationship
Enforcement is via referential constraints. Objects are constructed by JOINS, EACH time.
Value Evolution
Additional values of the SAME TYPE (e.g. additional phone, additional address) is managed by additional ROWS in one of the tables.
Customer:contacts will have 1:n relationship.
Structure Evolution:
Imagine we didn't start with a billing table.
This is the most difficult part. Changing the structure is difficult, within a table, across tables.
While you can do these via ALTER TABLE, requires downtime, migration and application versioning.
This is one of the problem document databases try to handle by representing data in JSON.
Let’s see how to represent customer data in JSON.
The primary (CustomerID) becomes the DocumentKey
Column name-Column value becomes KEY-VALUE pair.
We aren’t normal form anymore
Rich Structure & Relationships
Billing information is stored as a sub-document
There could be more than a single credit card. So, use an array.
Value evolution
Simply add additional array element or update a value.
Structure evolution
Simply add new key-value pairs
No downtime to add new KV pairs
Applications can validate data
Structure evolution over time.
Relations via Reference
So, finally, you have a JSON document that represents a CUSTOMER.
In a single JSON document, relationship between the data is implicit by use of sub-structures and arrays and arrays of sub-structures.
So, finally, you have a JSON document that represents a CUSTOMER.
In a single JSON document, relationship between the data is implicit by use of sub-structures and arrays and arrays of sub-structures.
Reference slide
I think there's a piece that's missing in many NoSQL databases: robust query capabilities
NoSQL systems provide specialized APIs
Key-Value get and set
Script based query APIs
Limited declarative query
Map/reduce can be nice, but it does make certain operations harder.
Plus it may not feel as natural as SQL if you’ve been using that for decades. (joins, unions, etc)
This is how you might have to do it WITHOUT a query language like SQL
Complex code and logic
Inefficient processing on client side
I can do this with k/v calls, map/reduce calls sure, but it’s more complex, and maybe inefficient
Less complex, more natural to do this in SQL, more readable
SQL is meant for normalized data
so what do we do when we have denormalized data, like JSON documents?
So let’s look at how SQL for JSON should work
I’m going to use a different model as an example:
Travel data. Specifically airlines and routes.
I want users to be able to find flights between given destinations
On a specific day, etc
The data I’m looking at is actually a sample data set that ships with Couchbase
Called the “travel sample” bucket
It has airlines, routes, airports, and landmarks, but we’re only going to look at airlines and routes today.
Here is a sample route document
Notice it has a source and destination
It also has a “type”, which I won’t need but I may in the future
Notice it has an airlineid field, which is kinda like a foreign key
It has a schedule field, which is an array. In relational, this would typically
Be 3 rows in another table, but it’s denormalized into a single document
Here’s an airline sample document
It’s flat data, but it contains the name of the airline
It also has a “type” field that I also don’t need today but might in the future, or I might need for other use cases
Go fast over these, since it’s just SQL
Go fast over these, since it’s just SQL
ARRAY_AGG – array of the non MISSING values in the group, including nulls
ARRAY_AGG can optionally use DISTINCT
Go fast over these, since it’s just SQL
USE is like WHERE for document keys
Direct primary key lookup bypassing index scans
Ideal for hash-distributed datastore
Available in SELECT, UPDATE, DELETE
UNNEST is an “intra-document” join
UNNEST, flattens it out to the same level
Flattening JOIN that surfaces nested objects as top-level documents
Ideal for decomposing JSON hierarchies
UNNEST is an “intra-document” join
UNNEST, flattens it out to the same level
Flattening JOIN that surfaces nested objects as top-level documents
Ideal for decomposing JSON hierarchies
UNNEST is an “intra-document” join
UNNEST, flattens it out to the same level
Flattening JOIN that surfaces nested objects as top-level documents
Ideal for decomposing JSON hierarchies
UNNEST is an “intra-document” join
UNNEST, flattens it out to the same level
Flattening JOIN that surfaces nested objects as top-level documents
Ideal for decomposing JSON hierarchies
NEST, create a json array in the query result, it’s like the opposite of UNNEST
Special JOIN that embeds external child documents under their parent
Ideal for JSON encapsulation
NEST, create a json array in the query result, it’s like the opposite of UNNEST
Special JOIN that embeds external child documents under their parent
Ideal for JSON encapsulation
NEST, create a json array in the query result, it’s like the opposite of UNNEST
Special JOIN that embeds external child documents under their parent
Ideal for JSON encapsulation
I didn’t cover NEST, but it’s another type of JOIN
Dot notation for json documents, hierarchy
JOIN only works on document keys right now
UNNEST to flatten array, kinda like a join within a document itself… flattening
Go fast over these, since it’s just SQL
SQL language operates on set of relations (table) and then projects another relation (table)
Resultset, needs to be mapped to something in application (could be complex, or maybe not), and then you have to map that to complex objects, and then to json
Not as natural
N1QL is a superset of SQL designed for JSON.
Input is set of related sets of JSON documents.
It operates on these documents and produces another set of JSON documents.
DocumentDb is Microsoft’s document database azure offering
It has a SQL syntax available, and it’s pretty good
(side note: It has JOIN in its syntax but it’s an intra-document join
so it’s more like UNNEST)
N1QL is a superset of SQL
Additionally, N1QL extends the language to support:
-- Fully address, access and modify any part of the JSON document.
-- Handle the full flexible schema where schema is self-described by each document.
-- key-value pairs could be missing, data types could change and structure could change!
These are the additions to SQL to deal with JSON
Mention MISSING, it’s semantically different from a field with a null value. A field isn’t there (which is possible with a flexible schema)
Ranging over collections, deep traversal might not be in CosmosDb yet
Arrays and Objects are not in traditional SQL
--
-There are date function for string and numeric encodings - There is no “date” type in JSON, so we store as UNIX timestamp and N1QL has functions for that
- Well-defined semantics for ORDER BY and comparison operators
- Defined JSON expression semantics for all input data types, so goodbye type mismatch errors
- You can SELECT a string as an integer (conversion functions)
I’ve already mentioned the concept of “missing”
Couchbase can be used to store binary data, and there is some limited N1QL functionality associated with that
But that’s a pretty non-typical use case
Note: Couchbase provides per-document atomicity.
So the UPDATE is not all-or-nothing, failure/success is per document
We don’t have transactions
CosmosDb doesn't offer these
JSON in, JSON out
Indexes are important, just like in SQL
https://dzone.com/articles/index-first-and-query-faster
Functional indexes: on any data expression
Partial indexes
Some other profiling/monitoring tools coming in 5.0, you can check out a developer build of 5.0 now
Indexing in CosmosDb doesn’t use SQL, but it is available with, say, the .NET SDK
Just like SQL
(primary index is on the document key)
Secondary Index can be created on any combination of attribute names.
Useful in speeding up the queries.
Need to have matching indices with right key-ordering
Dotted notation here, CosmosDb uses a different syntax for indexing, which is more like xpath for json
Top:
Complex code and logic
Inefficient processing on client side
Bottom:
Proven and expressive query language
Leverage SQL skills and ecosystem
Extended for JSON - More readable
Reference/summary slide
This chart is not exhaustive, but I think it identifies key points of difference
Sprocs on CosmosDb are in Javascript, not SQL
UC San Diego: SQL++
Couchbase: N1QL
SQL++ is a paper that UC San Diego produced
N1QL is pretty much the first implementation of SQL++, validating their paper in a real database
Couchbase Analytics is currently a standalone developer preview of a feature that is coming to Couchbase Server.
This uses a language that looks like N1QL, is going to converge with N1QL, but has some additional features.
It’s intended for analytics, reports, trend analysis, so Couchbase Analytics is not meant to be operational: no direct reads and writes.
You can join any field to any other field
It’s a public PREVIEW right now, so it is available for you to get, but lots of stuff will change and we don’t recommend it for production.
JSON support in SQL
IBM Informix
Oracle
PostgreSQL
MySQL
All the sql vendors are introducing JSON features
SQL Server 2016 introduced JSON_VALUE and some other JSON functions
It’s not quite a SQL for JSON implementation, but it’s a step in that direction
Here’s a table with two rows, one is an nvarchar with a JSON string
JSON_VALUE can be applied to it to parse the JSON and get specific values
All of these tools are using JDBC/ODBC driver provided by Simba to connect to Couchbase
ODBC opens the door to a ton of integrations, these are just a few
Maybe mention Crystal Reports too
Couchbase can integrate directly with big data tools using DCP/XDCR instead of a query
These don't rely on ODBC but are built specifically to connect to Couchbase
For mongo to couchbase, we have ottoman, which has a very close API to mongoose
No ODBC driver afaik (yet)
But there are some interesting APIs being provided by CosmosDb and now by CosmosDb
DocumentDb provides a mongo API, so you can switch seamlessly
With CosmosDb, it's multi-model, so it can work as a graph database using the Gremlin language
It can work as a Table database using the Azure Table Storage API
So there may be some integrations out there that use those
It's important to remember that compared to Couchbase and many other document databases, DocumentDb is very new, so features will come in time. Similarly, compared to Oracle and SQL Server, document databases in general are relatively new, so there are new features
And innovations happening every year
Documentdb.com link still works as of 7/18/2017
Even though the name has been changed
Hosted by couchbase, but we have guests from all different companies to talk about nosql
We recently had Kirill Gavrylyuk from the DocumentDb team on an episode, so check that out
We've had guests on to talk about Raven, Marten, Mongo, Cassandra, Realm, etc
Start the animation
Open source apache license for community edition, enterprise edition on a faster release schedule, some advanced features, and support license.
Couchbase is software you can run in the cloud on a VM or on your own data center. CosmosDb is a manage cloud service, but there is a emulator you can run locally.
Couchbase is the company, Couchbase Server is the software
So imagine if Mongo and Redis companies formed a new company named RedMongo, and a new database combining Redis and Mongo together called RedMongo Server
Memory first: integrated cache, you don't need to put redis on top of couchbase
Master-master: easier scaling, better scaling
Auto-sharding: we call vBuckets, you don't have to come up with a sharding scheme, it's done by crc32
N1QL: SQL, mongo has a more limited query language and it's not SQL-like
Full Text Search: Using the bleve search engine, language aware FTS capabilities built in
Mobile & sync: Mongo has nothing like the offline-first and sync capabilities couchbase offers
Mongo DOES have a DbaaS cloud provider
Everything I've shown you today is available in Community edition
The only N1QL feature I can think of not in Community is INFER
The Enterprise features you probably don't need unless you are Enterprise developer.
Not yet. We've been talking about it at least as long as I've been with Couchbase.
It's partly a technical problem, may need additional features for multi-tenant.
It's partly (mostly) a business problem. Would this be worth it?
Couchbase IS in the Azure and AWS marketplaces, and there are some wizards to make config easy, but it runs on your VMs.