Intergen CTO Chris Auld (Microsoft MVP, Microsoft Regional Director) goes deep into Microsoft Azure DocumentDB, the new fully managed, highly-scalable, NoSQL document database service. You will learn the basics - including a single slide that will give you the most important things you should know.
2. Agenda
1) DocumentDB Refresher
2) CUs, RUs and Indexing
3) Polyglot Persistence and Data Modelling
4) Data Tier Programmability
5) Trading Off Consistency
4. A fully-managed, highly-scalable, NoSQL document database service.
• Schema-free storage, indexing and query of JSON documents
• Transaction-aware service-side programmability with JavaScript
• Write-optimized, SSD-backed and tuneable via indexing and consistency
• Built to be delivered as a service. Pay as you go. Achieve faster time to value.
5. DocumentDB in One Slide
• Simple HTTP RESTful model.
• Access can be via any client that supports HTTP. Libraries for Node, .NET, Python, JS.
• All resources are uniquely addressable by a URI.
• Partitioned for scale-out and replicated for HA. Tunable indexing & consistency.
• Granular access control through item-level permissions.
• Attachments are stored in Azure Blobs and support document lifecycle.
• T-SQL-like programmability.
• Customers buy storage and throughput capacity at account level.
HTTP verbs:
• POST an item resource to a feed URI: create a new resource, or execute a sproc/trigger/query
• PUT an item resource to an item URI: replace an existing resource
• DELETE an item URI: delete an existing resource
• GET a feed or item URI: read/query an existing resource
Resource URIs:
/dbs/{id}
  /colls/{id}
    /docs/{id}
      /attachments/{id}
    /sprocs/{id}
    /triggers/{id}
    /functions/{id}
  /users/{id}
Example:
POST http://myaccount.documents.azure.net/dbs
{ "name":"My Company Db"}
...
[201 Created]
{
"id": "My Company Db",
"_rid": "UoEi5w==",
"_self": "dbs/UoEi5w==/",
"_colls": "colls/",
"_users": "users/"
}
6. Capacity Units
• Customers provision one or more Database Accounts
• A database account can be configured with one to five Capacity Units (CUs). Call for more.
• A CU is a reserved unit of storage (in GB) and throughput (in Request Units, RUs)
• Reserved storage is allocated automatically but subject to a minimum allocation per collection of 3.3GB (1/3 of a CU) and a maximum amount stored per collection of 10GB (1 whole CU)
• Reserved throughput is automatically made available, in equal amounts, to all collections within the account subject to min/max of 667 RUs (1/3 of a CU) and 2000 RUs (1 whole CU)
• Throughput consumption levels above provisioned units are throttled
[Figure: provisioned capacity units, each made up of storage (GB) and throughput (RUs)]
* All limits noted above are the Preview Limitations. Subject to change
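The equal-share rule above can be sketched as a small calculation. This is only an illustration of the preview limits quoted on this slide: the function name `perCollectionThroughput` is hypothetical, and reading the 667 RU minimum as "at most three collections per CU" is an assumption, not documented service behaviour.

```javascript
// Preview-era figures quoted on the slide (subject to change).
const RUS_PER_CU = 2000;             // throughput of one whole CU
const MAX_RUS_PER_COLLECTION = 2000; // 1 whole CU
const MAX_COLLECTIONS_PER_CU = 3;    // assumption: the 1/3 CU minimum implies this

// Hypothetical helper: split reserved throughput equally across collections,
// capped at one whole CU per collection.
function perCollectionThroughput(capacityUnits, collectionCount) {
  if (capacityUnits < 1 || collectionCount < 1) {
    throw new RangeError("need at least one CU and one collection");
  }
  if (collectionCount > capacityUnits * MAX_COLLECTIONS_PER_CU) {
    // Each collection would fall below the 1/3 CU minimum.
    throw new RangeError("not enough CUs for this many collections");
  }
  const equalShare = (capacityUnits * RUS_PER_CU) / collectionCount;
  return Math.min(MAX_RUS_PER_COLLECTION, Math.round(equalShare));
}

console.log(perCollectionThroughput(1, 3)); // three collections sharing one CU
console.log(perCollectionThroughput(2, 1)); // one collection cannot use more than one CU
```

Under these assumptions a single collection never benefits from a second CU, which is why adding collections (or accounts) is the way to use more throughput.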
7. Request Units
• A CU includes the ability to execute up to 2000 Request Units per second
• I.e. with 1 CU, peak throughput needs to be below 2000 RUs/sec
• When reserved throughput is exceeded, any subsequent request will be pre-emptively ended
• Server will respond with HTTP status code 429
• Server response includes an x-ms-retry-after-ms header to indicate the amount of time the client must wait before retrying
• The .NET client SDK implicitly catches this response, respects the retry-after header and retries the request (3x)
• You can set up alert rules in the Azure portal to be notified when requests are throttled
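For clients without a built-in retry, the throttling behaviour above could be handled along these lines. This is a minimal sketch, not the SDK's implementation: `retryDelayMs`, `withThrottleRetry`, the `{ status, headers }` response shape (with lower-case header names) and the 1000 ms fallback are all assumptions made for illustration.

```javascript
const DEFAULT_BACKOFF_MS = 1000; // assumption: used only if the header is missing or invalid

// Hypothetical helper: how long to wait before retrying, or null for "do not retry".
function retryDelayMs(response, attempt, maxRetries) {
  if (response.status !== 429) return null; // not throttled
  if (attempt >= maxRetries) return null;   // retries exhausted, give up
  const ms = Number(response.headers["x-ms-retry-after-ms"]);
  return Number.isFinite(ms) && ms >= 0 ? ms : DEFAULT_BACKOFF_MS;
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical wrapper: `send` is any async function returning { status, headers }.
async function withThrottleRetry(send, maxRetries = 3, sleep = defaultSleep) {
  for (let attempt = 0; ; attempt++) {
    const response = await send();
    const delay = retryDelayMs(response, attempt, maxRetries);
    if (delay === null) return response; // success, other error, or out of retries
    await sleep(delay);                  // honour the server's requested wait
  }
}
```

With `maxRetries = 3` a persistently throttled request is sent four times in total (the original attempt plus three retries) before the 429 response is handed back to the caller.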
8. Request Units
DATABASE OPERATION | NUMBER OF RUs | NUMBER OF OP/s/CU
Reading a single document by _self | 1 | 2000
Inserting/Replacing/Deleting a single document | 4 | 500
Querying a collection with a simple predicate and returning a single document | 2 | 1000
Stored procedure with 50 document inserts | 100 | 20
Rough estimates: document size is 1KB consisting of 10 unique property values, with the default consistency level set to "Session" and all of the documents automatically indexed by DocumentDB.
As long as the database stays the same, the RUs consumed should stay the same.
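The OP/s/CU column is simply 2000 divided by the RU cost, and the same arithmetic gives a rough capacity estimate for a mixed workload. The sketch below uses only the rough figures from the table; the names `RU_COST`, `opsPerSecondPerCu`, `rusPerSecond` and `capacityUnitsNeeded` are hypothetical.

```javascript
const RUS_PER_SECOND_PER_CU = 2000;

// Rough RU costs from the table above (1KB documents, Session consistency).
const RU_COST = {
  readBySelfLink: 1,
  write: 4,              // insert, replace or delete
  simpleQuery: 2,
  sprocWith50Inserts: 100,
};

function costOf(operation) {
  if (!(operation in RU_COST)) throw new Error("unknown operation: " + operation);
  return RU_COST[operation];
}

// Operations per second that one CU can sustain for a single operation type.
function opsPerSecondPerCu(operation) {
  return RUS_PER_SECOND_PER_CU / costOf(operation);
}

// Total RUs per second for a mix such as { readBySelfLink: 1000, write: 200 }.
function rusPerSecond(mix) {
  return Object.entries(mix).reduce((sum, [op, perSec]) => sum + costOf(op) * perSec, 0);
}

function capacityUnitsNeeded(mix) {
  return Math.ceil(rusPerSecond(mix) / RUS_PER_SECOND_PER_CU);
}

const mix = { readBySelfLink: 1000, write: 200, simpleQuery: 100 };
console.log(rusPerSecond(mix), capacityUnitsNeeded(mix));
```

This ignores the per-collection limits, so treat the result as a lower bound on what needs to be provisioned.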
9. Cool Tool: DocumentDB Studio
Useful tool, with source, for sending queries to DocumentDB.
http://tiny.cc/docdbstudio
12. Indexing in DocumentDB
• By default everything is indexed
• Indexes are schema free
• Indexing is not a B-Tree and works really well under write pressure and at scale.
• Out of the box. It just works.
• But… it cannot read your mind all of the time…
13. Tuning Indexes
• We can change the way that DocumentDB indexes
• We're trading off
  • Write performance
    How long does it take? How many RUs does it use?
  • Read performance
    How long does it take? How many RUs does it use?
    Which queries will need a scan?
  • Storage
    How much space does the document + index require?
  • Complexity and flexibility
    Moving away from the pure schema-free model
14. Index Policy and Mode
• Index Policy
  • Defines index rules for that collection
• Index mode
  • Consistent
  • Lazy
• Automatic
  • True: documents automatically added (based on policy)
  • False: documents must be manually added via IndexingDirective on document PUT.
• Anything not indexed can only be retrieved via _self link (GET)
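// Requires: using Microsoft.Azure.Documents; using Microsoft.Azure.Documents.Client;
// Assumes an existing DocumentClient `client` and a database link `databaseLink`, inside an async method.
var collection = new DocumentCollection
{
    Id = "myCollection"
};
// Index lazily, and only index documents that opt in via IndexingDirective.
collection.IndexingPolicy.IndexingMode = IndexingMode.Lazy;
collection.IndexingPolicy.Automatic = false;
collection = await client.CreateDocumentCollectionAsync(databaseLink, collection);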
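The interaction between the Automatic flag and a per-document IndexingDirective can be modelled in a few lines. This is a toy model of the rules on this slide, not service code: the function names are hypothetical, and the directive values `"Default"`, `"Include"` and `"Exclude"` are assumed to mirror the SDK's IndexingDirective options.

```javascript
// Toy model: is a document indexed, given the collection's Automatic flag
// and the directive sent with that document?
function isIndexed(automatic, directive = "Default") {
  switch (directive) {
    case "Include": return true;       // explicit opt-in
    case "Exclude": return false;      // explicit opt-out
    case "Default": return automatic;  // fall back to the collection policy
    default: throw new Error("unknown directive: " + directive);
  }
}

// Per the slide: anything not indexed can only be fetched by its _self link.
function accessPaths(automatic, directive) {
  return isIndexed(automatic, directive)
    ? ["GET by _self", "query"]
    : ["GET by _self"];
}

console.log(accessPaths(false));            // Automatic = false, no directive
console.log(accessPaths(false, "Include")); // document opted in
```

With `Automatic = false`, as in the collection created above, a document is only queryable if it was written with an explicit include directive.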
15. Index Paths & Index Types
• Include/Exclude Paths
  • Include a specific path
  • Exclude sub paths
  • Exclude a specific path
• Specify Index Type
  • Hash (default)
  • Range (default for _ts); not on strings
• Specify Precision
  • Byte precision (1-7)
  • Affects storage overhead
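// Hash index on every path by default.
collection.IndexingPolicy.IncludedPaths.Add(new IndexingPath
{
    IndexType = IndexType.Hash,
    Path = "/",
});
// Range index, at maximum numeric precision, on the modifiedTimeStamp property.
collection.IndexingPolicy.IncludedPaths.Add(new IndexingPath
{
    IndexType = IndexType.Range,
    Path = @"/""modifiedTimeStamp""/?",
    NumericPrecision = 7
});
// Keep the large HTML property, and everything beneath it, out of the index.
collection.IndexingPolicy.ExcludedPaths.Add("/\"longHTML\"/*");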
18. Worth Reading: NoSQL Distilled
By Pramod J. Sadalage and Martin Fowler (of 'Refactoring' fame and fortune)
Provides a good background on characteristics of NoSQL style data stores and strategies for combining multiple stores.
http://tiny.cc/fowler-pp
19. DocumentDB
• schema-free data model
• transactional processing
• rich query
• managed as a service
• elastic scale
• internet accessible http/rest
• arbitrary data formats
20. Attachments
• Store large blobs/media outside core storage
• DocumentDB managed
  • Submit raw content in POST
  • DocumentDB stores into Azure Blob storage (2GB today)
  • DocumentDB manages lifecycle
• Self managed
  • Store content in service of your choice
  • Create Attachment providing URL to content
21. Storage Strategies
• Things to think about
  • How much storage do I use? Where? $$$?
  • How is my data being indexed?
    • Entropy & precision
    • Will it ever be queried? Should I exclude it?
  • How many network calls to save and retrieve?
  • Complexity of implementation & management
  • Consistency. The polyglot isn't consistent
23. Embed (De-Normalize) or Reference?
• Embed
  • Well suited to containment
  • Typically bounded 1:Few
  • Slowly changing data
  • M:N requires management of duplicates
  • One call to read all data
  • Write call must write whole document
• Reference
  • Think of this as 3NF
  • Provides M:N without duplicates
  • Allows unbounded 1:N
  • Multiple calls to read all data (hold that thought…)
  • Write call may write single referenced document
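The two shapes can be compared with plain JSON documents. The sketch below is illustrative only: the documents, property names (`sizes`, `reviewIds`) and the `readsNeeded` helper are hypothetical, and it counts one read per referenced document with no batching.

```javascript
// Embedded: bounded, slowly changing data lives inside the parent document.
const embeddedProduct = {
  id: "p1",
  name: "Mountain Bike",
  sizes: [
    { label: "S", sortOrder: 1 },
    { label: "M", sortOrder: 2 },
  ],
};

// Referenced: unbounded data lives in its own documents, linked by id.
const referencedProduct = { id: "p1", name: "Mountain Bike", reviewIds: ["r1", "r2", "r3"] };

// Hypothetical helper: calls needed to read a document plus everything it references.
function readsNeeded(doc, referenceField) {
  const references = referenceField ? doc[referenceField] || [] : [];
  return 1 + references.length;
}

console.log(readsNeeded(embeddedProduct));                // one call reads everything
console.log(readsNeeded(referencedProduct, "reviewIds")); // parent plus each review
```

The trade-off runs the other way for writes: adding a size rewrites the whole embedded product, while adding a review writes one small document (plus the id list, if it is kept on the product).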
24. How Do We Relate?
• ID or _self
  • A matter of taste.
  • _self will be more efficient (half as many RUs or better)
  • Consider using IndexingDirective.Exclude
• Direction
  • Manufacturer > Product. 1:N
    • We have to update manufacturer every time we add a new product
    • Products are unbounded
  • Product > Manufacturer. N:1
    • We have to update product if manufacturer changes
    • Manufacturers per product are bounded (1)
  • Sometimes both make sense.
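The direction of the reference decides which document is rewritten when a product is added. A minimal sketch, with hypothetical helper and property names (`productIds`, `manufacturerId`):

```javascript
// Manufacturer > Product (1:N): the manufacturer document carries the list,
// so it is rewritten, and grows, with every new product.
function addProductOnManufacturer(manufacturer, productId) {
  return { ...manufacturer, productIds: [...manufacturer.productIds, productId] };
}

// Product > Manufacturer (N:1): each product carries exactly one reference,
// so adding a product never touches the manufacturer document.
function createProductWithManufacturer(product, manufacturerId) {
  return { ...product, manufacturerId };
}

const acme = { id: "m1", name: "Acme", productIds: ["p1"] };
const updatedAcme = addProductOnManufacturer(acme, "p2");
const newProduct = createProductWithManufacturer({ id: "p2", name: "Road Bike" }, acme.id);

console.log(updatedAcme.productIds.length, newProduct.manufacturerId);
```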
25. The Canonical Polyglot Online Store
• Azure Web Site
• Azure SQL Database
• Storage blob
• Storage table
• DocumentDB
• Azure Search
26. A Product Catalog
• Product
• Name (String 100)
• SKU (String 100 YYYYCCCNNNNN e.g. ‘2013MTB13435’)
• Description (HTML up to 8kb)
• Manufacturer (String 100)
• Price (Amount + Currency)
• Images (0-N Images Up to 100kb)
• ProductSizes (0-N including a sort order)
• Reviews (0-N reviews, Reviewer + Up to 10kb text)
• Attributes (0-N strongly typed complex details)
• Name / Manufacturer
  • Probably want to search
  • Hash index is fine
  • May duplicate into Azure Search
• SKU
  • Probably a core lookup field. Needs a hash index.
  • How do we manage precision?
  • We could store reversed?
  • We could store a duplicate reversed and exclude.
  • We might want to pull Year out into another field and range index.
• Description
  • Probably want to index in Azure Search
  • Do we 'save space' and push to an attachment?
  • Do we often retrieve Product without description?
  • We probably do want to exclude it from the index
• Price
  • A sub document within DocumentDB will allow multiple base currencies.
  • How deep does the rabbit hole go?
  • Probably doesn't change much so de-normalize the currency identifier
  • We probably want price in Search… but…
  • If we include/are providing localized prices then we have consistency issues; huge churn when we change exchange rates
• Images
  • Attachments
• Reviews
  • Do we embed these?
  • Do we reference? On product? On reviewer/user? Both?
  • Do we reference and embed? Say embed last 10?
  • Which direction does the reference go?
  • Almost certainly push to search.
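The SKU ideas (pull the year into its own field, keep a reversed copy) can be sketched as a pre-save step. Everything here is an assumption made for illustration: the `parseSku` name, the output property names, and the reading of YYYYCCCNNNNN as four digits, three upper-case letters and five digits. Storing the reversed string is one reading of "store reversed", on the theory that the most variable characters then come first.

```javascript
// Hypothetical helper: derive index-friendly fields from a SKU like "2013MTB13435".
function parseSku(sku) {
  const match = /^(\d{4})([A-Z]{3})(\d{5})$/.exec(sku);
  if (!match) throw new Error("unexpected SKU format: " + sku);
  return {
    sku,
    skuReversed: [...sku].reverse().join(""), // high-entropy digits first
    year: Number(match[1]),                   // numeric, so it can take a range index
    category: match[2],
    number: match[3],
  };
}

const fields = parseSku("2013MTB13435");
console.log(fields.year, fields.category, fields.skuReversed);
```

Whether the duplicate is worth its storage depends on how the SKU is actually queried; the original could then be excluded from the index.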
27. The Promise of Schema Free
• Fully indexed complex type structures
• Ability to define schema independent of data store
• Reflect for editing and complex search filters
• Create templates to produce HTML from JSON for editing and rendering. E.g. Angular, Knockout
http://www.mchem.co.nz/msds/Tutti%20Frutti%20Disinfectant.pdf
http://www.toxinz.com/Demo
29. Programmability in DocumentDB
• Familiar constructs
  • Stored procs, UDFs, triggers
• Transactional
  • Each call to the service is in an ACID txn
  • Uncaught exception to rollback
• Sandboxed
  • No imports
  • No network calls
  • No eval()
• Resource governed & time bound
var helloWorldStoredProc = {
    id: "helloRealWorld",
    body: function () {
        var context = getContext();
        var response = context.getResponse();
        response.setBody("Hello, Welcome To The Real World");
        response.setBody("Here Be Dragons...");
        response.setBody("Oh... and network latency");
    }
};
30. Where To Use Programmability
• Reduce network calls
  • Send multiple documents & shred in a SPROC
• Multi-document transactions
  • Each call in ACID txn
  • No multi-statement txns. One REST call = one txn
• Transform & join
  • Pull content from multiple docs. Perform calculations
  • JOIN operator intradoc only
• Drive lazy processes
  • Write journal entries and process later
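Sending many documents in one call and inserting them inside a stored procedure might look like the sketch below. It is a minimal illustration in the style of the server-side JavaScript API (`getContext`, `getCollection`, `createDocument`), not a tested production sproc: the name `bulkInsert` and the choice to return the inserted count are assumptions. It would be registered as a stored procedure body and only runs inside the service, where `getContext` is provided.

```javascript
// Stored procedure body: insert every document in `docs` within one transaction.
function bulkInsert(docs) {
  var context = getContext();
  var collection = context.getCollection();
  var collectionLink = collection.getSelfLink();
  var created = 0;

  if (!docs || docs.length === 0) {
    context.getResponse().setBody(0);
    return;
  }
  insertNext();

  function insertNext() {
    var accepted = collection.createDocument(collectionLink, docs[created], function (err) {
      if (err) throw err; // uncaught error: the whole call rolls back
      created++;
      if (created < docs.length) {
        insertNext();
      } else {
        context.getResponse().setBody(created); // all documents inserted
      }
    });
    // If the request is not accepted (resource or time budget reached),
    // report progress so the client can resend the remaining documents.
    if (!accepted) context.getResponse().setBody(created);
  }
}
```

One REST call carries the whole batch, so the network round trips drop from one per document to one per batch, and either every accepted insert commits or none does.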
32. Worth Reading: Replicated Data Consistency Explained Through Baseball
By Doug Terry, MS Research
http://tiny.cc/cons-baseball
33. Tuning Consistency
• Database Accounts are configured with a default consistency level. Consistency level can be weakened per read/query request
• Four consistency levels
  • STRONG – all writes are visible to all readers. Writes are committed by a majority quorum of replicas and reads are acknowledged by the majority read quorum
  • BOUNDED STALENESS – guaranteed ordering of writes, reads adhere to minimum freshness. Writes are propagated asynchronously, reads are acknowledged by a majority quorum lagging writes by at most N seconds or operations (configurable)
  • SESSION (Default) – read your own writes. Writes are propagated asynchronously while reads for a session are issued against the single replica that can serve the requested version.
  • EVENTUAL – reads eventually converge with writes. Writes are propagated asynchronously while reads can be acknowledged by any replica. Readers may view older data than previously observed.
Level | Writes | Reads
Strong | sync quorum writes | quorum reads
Bounded | async replication | quorum reads
Session* | async replication | session-bound replica
Eventual | async replication | any replica
* Default
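The rule that a request may weaken, but not strengthen, the account default can be expressed as an ordering check. This is a sketch of the rule only; the level names and the `effectiveConsistency` helper are hypothetical and do not correspond to a specific SDK call.

```javascript
// Weakest to strongest, following the four levels above.
const CONSISTENCY_LEVELS = ["Eventual", "Session", "BoundedStaleness", "Strong"];

// Hypothetical helper: the level a read will run at, given the account
// default and an optional per-request override.
function effectiveConsistency(accountDefault, requested) {
  if (requested === undefined) return accountDefault;
  const defaultRank = CONSISTENCY_LEVELS.indexOf(accountDefault);
  const requestedRank = CONSISTENCY_LEVELS.indexOf(requested);
  if (defaultRank < 0 || requestedRank < 0) {
    throw new Error("unknown consistency level");
  }
  if (requestedRank > defaultRank) {
    throw new Error("a request may only weaken the account default");
  }
  return requested;
}

console.log(effectiveConsistency("Session"));             // no override
console.log(effectiveConsistency("Session", "Eventual")); // weakened for this read
```

A cheap, stale-tolerant read (a product listing, say) can drop to Eventual while the account keeps Session for everything else.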
35.
• DocumentDB is a preview service… expect and enjoy change over time
• Think outside the relational model… if what you really want is an RDBMS then use one of those…