In this session, we will take a look at Amazon DynamoDB and how you can get started building applications with it. We will look at table design and common access patterns, and compare it to a relational database.
To fully appreciate the need for NoSQL, let’s start by looking at how much data volume has grown: roughly 90% of the world’s data was generated in just the last few years.
1 TB vs. 40 PB… to put that into perspective, businesses that used to run multi-TB databases have exploded into multi-PB databases.
As data volume increased, we started innovating data processing systems that could scale to process it.
https://techjury.net/stats-about/big-data-statistics/
Note to presenters: the intent of this slide is to convey two different ideas:
1. The variety of databases needed for different data types.
2. AWS playing an active role in this rapidly changing industry. Use the animation in the order included, along with the notes below.
===============================================================
Different forms of data exist beyond the relational data that traditional relational databases can handle.
We have data that can be represented as key-value pairs, connected data that is best represented as a graph, documents, columnar data that is best suited for aggregations, and so on.
If you observe AWS, our database services have evolved according to the different data needs in the industry. Dynamo was described in a key-value store whitepaper in the fall of 2007. Later that year, we released SimpleDB, a managed NoSQL database service. By January 2012, AWS released DynamoDB, a NoSQL database service offering that provides seamless scaling and single-digit millisecond latency.
At the end of 2017, we released Neptune in preview, a managed service for graph databases. As you can see, we are constantly improving our managed database offerings to meet industry demand and lead data type trends.
Today we will focus on DynamoDB, which is our enterprise level managed NoSQL offering.
Let’s talk about scaling databases.
Relational – Data is normalized. To enable joins, you are tied to a single partition and a single system, so performance depends on the hardware specs of the primary server. To improve performance, you optimize, then move to a bigger box; you may still run out of headroom. You create read replicas; you will still run out. This is scaling UP, and it has a ceiling.
NoSQL – NoSQL databases were designed specifically to overcome these scalability issues. They scale data “out” across distributed clusters of low-cost hardware, delivering high throughput and low latency.
Therefore, using NoSQL, businesses can scale virtually without limit.
Explain partitions here
Use high-cardinality attributes. These are attributes that have distinct values for each item, such as email, employeeid, customerid, sessionid, orderid, and so on.
Use composite attributes. Try to combine more than one attribute to form a unique key if that meets your access pattern. For example, consider an orders table with customerid+productid+countrycode as the partition key and order_date as the sort key (see the sketch after these tips).
Items are stored based on the partition key. Avoid hot partition keys so your requests are distributed across partitions/shards. While adaptive capacity will enable you to continue writing to a partition without being throttled (as long as you are under the table limit), throttling can occur if a single partition receives more than 3,000 RCUs or 1,000 WCUs.
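To illustrate the composite-key tip above, here is a minimal boto3 sketch; the "Orders" table and attribute names are illustrative assumptions, not part of this deck:

import boto3

dynamodb = boto3.resource("dynamodb")
orders = dynamodb.Table("Orders")

customer_id, product_id, country_code = "C123", "P456", "US"
orders.put_item(
    Item={
        # High-cardinality composite partition key spreads writes evenly
        # across partitions.
        "pk": f"{customer_id}#{product_id}#{country_code}",
        "order_date": "2019-06-01",  # sort key
        "status": "PENDING",
    }
)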
Generic product catalog. Table relationships are normalized.
A product could be a book – say, the Harry Potter series – where there’s a 1:1 relationship. Or it could be a movie.
You can imagine the types of queries you’d have to execute: show me all the movies starring a given actor; show me the entire product catalog. This is resource intensive – you have to perform complex joins.
** With NoSQL, you have to ask: how will the application access the data?
Optimize for the costlier asset: no joins, just a select over hierarchical structures, designed with the access patterns in mind.
Via duplication of data (trading storage to optimize for compute), it is fast.
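As a sketch of “no joins, just a select”: in a hypothetical single-table "ProductCatalog" design, a product and all of its related items share one partition key, so one query replaces a multi-table join. The table name and key are assumptions for illustration:

import boto3
from boto3.dynamodb.conditions import Key

catalog = boto3.resource("dynamodb").Table("ProductCatalog")

# One query returns the product plus its related items, pre-joined by the
# data model instead of at read time.
response = catalog.query(
    KeyConditionExpression=Key("product_id").eq("B00EXAMPLE")
)
for item in response["Items"]:
    print(item)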
Businesses are starting to see scalability problems with relational databases. I once had a customer say they topped out with relational at around 3,000 requests per second and had to scale up to bigger hardware.
With NoSQL, we have a technology that can easily scale to hundreds of nodes, or even thousands, and the scalability bottleneck goes away.
Excellent for OLTP applications that scale: real-time data access, fast, low latency – the user cannot wait.
==
They store data in a denormalized, hierarchical view, which makes it faster and easier to access the data.
SQL is good for OLAP (and maybe HTAP), and NoSQL (at least DynamoDB) is best for OLTP at any scale. DynamoDB’s completely serverless nature and decoupling of compute make it a great choice for tiny workloads and massive workloads alike, with incredibly elastic scaling between the extremes.
We are seeing DynamoDB used in several types of applications and workloads. Here are some sample use cases, and we will look at a few of our customer workloads in the next few slides.
If you look at these use cases, you will see they have two things in common: they all need low latency, and they need to be able to scale seamlessly.
Take chat messages, for example: a messenger application needs to provide a real-time experience to end users and must be able to accommodate huge volumes.
IoT sensor data is very similar: accommodate huge volumes of data from the sensors and make the data available for real-time needs. Same with social media feeds.
Let’s look into the basics of DynamoDB next, and then we will see how some of our customers are taking advantage of DynamoDB’s capabilities to build these workloads.
With DynamoDB, what you are getting is:
-- A fully managed service that you can start using with just a few clicks in the AWS console
-- The ability to create a table that is highly scalable and gives you consistent performance at any scale
-- High availability and durability
-- Security: access control using fine-grained policies
-- Integration with other AWS services like Lambda and Redshift, enabling you to architect applications that automatically react to data changes
-- As with other AWS services, you only pay for what you use
Key takeaway: using DynamoDB, customers get consistent, single-digit millisecond latency at any scale.
Now let’s take a look at these capabilities in detail.
============== Notes for the speaker only – don’t say the below at this time ================
DynamoDB supports both document and key-value store models, and offers a range of features including global secondary indexes, fine-grained access control via AWS Identity and Access Management, support for event-driven programming, and more.
==
Fast, Consistent Performance
Amazon DynamoDB is designed to deliver consistent, fast performance at any scale for all applications. Average service-side latencies are typically single-digit milliseconds. As your data volumes grow and application performance demands increase, Amazon DynamoDB uses automatic partitioning and SSD technologies to meet your throughput requirements and deliver low latencies at any scale.
Fully Managed
Amazon DynamoDB is a fully managed cloud NoSQL database service – you simply create a database table, set your throughput, and let the service handle the rest. You no longer need to worry about database management tasks such as hardware or software provisioning, setup and configuration, software patching, operating a reliable, distributed database cluster, or partitioning data over multiple instances as you scale.
Flexible
Amazon DynamoDB supports both document and key-value data structures, giving you the flexibility to design the best architecture that is optimal for your application.
Highly Scalable
When creating a table, simply specify how much request capacity you require. If your throughput requirements change, simply update your table's request capacity using the AWS Management Console or the Amazon DynamoDB APIs. Amazon DynamoDB manages all the scaling behind the scenes, and you are still able to achieve your prior throughput levels while scaling is underway.
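A minimal boto3 sketch of such a capacity update via the API; the table name and values are placeholders:

import boto3

client = boto3.client("dynamodb")
client.update_table(
    TableName="Orders",
    ProvisionedThroughput={
        "ReadCapacityUnits": 200,   # new RCU target
        "WriteCapacityUnits": 100,  # new WCU target
    },
)
# The table remains fully available while DynamoDB rebalances behind the scenes.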
Event Driven Programming
Amazon DynamoDB integrates with AWS Lambda to provide triggers, which enable you to architect applications that automatically react to data changes.
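A minimal sketch of such a trigger: a Lambda handler receiving the standard DynamoDB Streams event shape. The business logic is a placeholder:

# Lambda function wired to a DynamoDB stream.
def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] == "INSERT":
            new_image = record["dynamodb"]["NewImage"]
            # React to the data change here, e.g. send a notification.
            print("New item:", new_image)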
Fine-grained Access Control
Amazon DynamoDB integrates with AWS Identity and Access Management (IAM) for fine-grained access control for users within your organization. You can assign unique security credentials to each user and control each user's access to services and resources.
Those of you who are involved in spinning up and managing your own servers surely realize how resource intensive it is to manage your own infrastructure. It is easy to underestimate the cost and complexity of maintaining it: you have to think about power, cooling, OS maintenance, and patching. Now imagine managing a 1,000-node cluster – this becomes very resource intensive.
Amazon EC2 is an AWS service that provides resizable compute capacity in the cloud. Hosting your database instance on an EC2 instance takes away some of the overhead, but you still need to think about scalability and availability.
This is the value that is built into DynamoDB. With DynamoDB, you get an easy-to-use database. You don’t have to spin up any servers. You can easily design serverless scalable applications with DynamoDB. You get scalability and multi-AZ replication without designing a distributed system. You get ongoing security upgrades, software improvements, cost reduction efforts, monitoring…without any effort at all.
DDB is a fully managed service; you get all of that benefit built into it. We built DynamoDB to just work, so you can focus on your app.
In any business, as you scale up, you need a way to easily scale to meet the traffic and to get consistent, predictable latency at any scale.
You also need a way to scale down as your business needs change. DynamoDB was designed to offer consistent, predictable, single-digit millisecond latency at any scale, and you only pay for what you use. No limit on throughput. No limit on size – petabytes of data, any number of items.
The latency characteristics of DynamoDB are under 10 milliseconds and highly consistent.
Most importantly, the data is durable in DynamoDB, constantly replicated across multiple data centers and persisted to SSD storage.
Predictable Performance
This is obviously something that’s important and valuable in any industry, whether it’s powering the New York Times recommendation engine, storing and retrieving game data for the game Fruit Ninja, or powering queries and fast data retrieval for Major League Baseball Advanced Media. Predictable performance at scale is a must-have for many web apps, and DynamoDB was designed specifically to deliver on this.
DynamoDB is built for high availability (a 99.99% availability SLA) and durability.
All “writes” are persisted to SSD storage and replicated across three Availability Zones.
Reads can be configured to be “strongly” or “eventually” consistent. There is no latency tradeoff with either configuration; however, the read capacity used is different – a strongly consistent read consumes twice the capacity of an eventually consistent one. We will talk about read and write capacity units in a few slides.
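A quick boto3 sketch of the two read modes, on a hypothetical "CustomerOrders" table:

import boto3

table = boto3.resource("dynamodb").Table("CustomerOrders")

# Eventually consistent (the default): 1 RCU covers two 4 KB reads per second.
table.get_item(Key={"customer_id": "C123"})

# Strongly consistent: same latency profile, twice the read capacity used.
table.get_item(Key={"customer_id": "C123"}, ConsistentRead=True)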
Now let’s see how the three-way replication looks.
Here we are inserting an item into the “CustomerOrders” table.
The item is replicated to three different Availability Zones; the hash of the CustomerId value determines which partition it lands on.
DynamoDB can back up your data with per-second granularity and restore to any single second from the time PITR was enabled, up to the prior 35 days.
EMPHASIZE: COMPLETELY AUTOMATED
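A minimal boto3 sketch of enabling PITR and restoring to a chosen second; the table names and timestamp are placeholders:

from datetime import datetime

import boto3

client = boto3.client("dynamodb")

# Enable point-in-time recovery on the table.
client.update_continuous_backups(
    TableName="CustomerOrders",
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
)

# Restore to a specific second into a new table.
client.restore_table_to_point_in_time(
    SourceTableName="CustomerOrders",
    TargetTableName="CustomerOrders-restored",
    RestoreDateTime=datetime(2019, 6, 1, 12, 30, 0),
)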
The following diagram illustrates how adaptive capacity works. The example table is provisioned with 400 write-capacity units (WCUs) evenly shared across four partitions, allowing each partition to sustain up to 100 WCUs per second. Partitions 1, 2, and 3 each receive write traffic of 50 WCU/sec. Partition 4 receives 150 WCU/sec. This hot partition can accept write traffic while it still has unused burst capacity, but eventually it will throttle traffic that exceeds 100 WCU/sec.
DynamoDB adaptive capacity responds by increasing partition 4's capacity so that it can sustain the higher workload of 150 WCU/sec without being throttled.
This feature used to have a delay before kicking in, but it is now also instantaneous since the May 23rd, 2019 update.
https://aws.amazon.com/about-aws/whats-new/2019/05/amazon-dynamodb-adaptive-capacity-is-now-instant/
99.99% availability
Talk about how there are no background ops included in the burst capacity calculation.
In almost all cases, only exceeding a partition’s max IOPS (or extended overuse of burst capacity) will lead to throttles.
Node, .NET, Python, Java SDKs with Go in the works
At table creation time you can specify:
Keys: partition/sort
WCU: write capacity unit (one 1 KB write per second)
RCU: read capacity unit (one strongly consistent 4 KB read per second, or two eventually consistent reads)
For example, reading a 6 KB item with strong consistency costs 2 RCUs; writing it costs 6 WCUs.
The size of the table automatically increases as you add more items. There is a 400 KB limit on item size; this is a hard limit.
As more items are added and the size increases, the table is partitioned automatically for you. The size and provisioned capacity of the table are distributed equally across all partitions, and new partitions are added when either the capacity or the size exceeds the formula above (historically documented as partitions = max(table size / 10 GB, RCU / 3,000 + WCU / 1,000), rounded up).
Here’s a screenshot that shows how you can configure DynamoDB auto scaling from the console. The same functionality is available through the CLI and SDKs as well.
You can choose to auto scale either one or both of them. Similar to min/max instances in an Auto Scaling group, you can specify minimum and maximum read/write capacity.
Additionally, you can choose to apply the same settings to global secondary indexes.
And of course, you have complete control: you grant DynamoDB permission to scale the provisioned capacity on your behalf through an IAM role. You can choose an existing role or create a new role that allows the scaling operations.
If you are familiar with auto scaling of EC2 instances, this is very similar to that: you take advantage of auto scaling to provision capacity when your application’s needs demand it, while avoiding unnecessary over-provisioning.
While in the EC2 case you control the number of instances in a fleet (an Auto Scaling group), here for DynamoDB the two knobs you are controlling are read capacity and write capacity – completely independent of each other and based on the target utilizations you set.
As a managed and automatic feature, this takes the guesswork out of provisioning capacity. Let’s see how it looks when done from the console.
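For reference, the same settings can be expressed through the Application Auto Scaling API. A sketch, assuming a hypothetical "Orders" table and a 70% read-utilization target (write capacity is registered the same way):

import boto3

autoscaling = boto3.client("application-autoscaling")

# Register min/max bounds for the table's read capacity.
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/Orders",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    MinCapacity=5,
    MaxCapacity=500,
)

# Attach a target-tracking policy that scales toward 70% utilization.
autoscaling.put_scaling_policy(
    PolicyName="OrdersReadScaling",
    ServiceNamespace="dynamodb",
    ResourceId="table/Orders",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # keep consumed/provisioned RCU near 70%
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
        },
    },
)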
On-demand capacity mode can scale much faster than auto scaling a provisioned table. Auto scaling rules monitor utilization of the table and scale it much like EC2 instances, which can take minutes. For spiky and unpredictable workloads, we suggest using on-demand mode for DynamoDB.
Dynamo is very cost effective.
Every month you get 25 GB of storage, enough capacity for about 200 million requests (25 write and 25 read capacity units), 2.5 million read requests from DynamoDB Streams, and global tables in up to 2 regions – for free. EVERY MONTH, not just the first 12 months after signing up.
You pay for capacity and storage independently.
Additionally, you get the benefit of auto scaling for your provisioned throughput, so you are not paying for unused provisioned capacity.
TTL is a concept that effectively helps you manage your table size without having to pay item-deletion charges (a quick sketch follows after this list).
And lastly, we have cost allocation tagging, which allows you to keep track of DynamoDB-related expenses, including costs for tables, indexes, global tables, etc., so you can keep tabs on your expenses for a given project or a given team.
I have a few additional slides about auto scaling, TTL and cost tagging that we will cover in a little bit.
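Here is the promised TTL sketch: enable TTL on a hypothetical "Sessions" table, then write items carrying an epoch-seconds expiry attribute. Names are assumptions:

import time

import boto3

client = boto3.client("dynamodb")
client.update_time_to_live(
    TableName="Sessions",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

sessions = boto3.resource("dynamodb").Table("Sessions")
sessions.put_item(
    Item={
        "session_id": "abc123",
        # DynamoDB deletes the item after this time, at no deletion cost.
        "expires_at": int(time.time()) + 24 * 60 * 60,
    }
)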
Data is stored in tables. Think of the table as the “database.”
Within a table we have “items.”
We have our first item. An item has attributes: 5 in this case.
Now let’s see what the next item is. Here you can see there are only two attributes.
As we add more items to the table, we can see that attributes can vary between items; each item can have a different set of attributes than the other items (as with any NoSQL database).
You can also see the primary key, or the partition key. It uniquely identifies each item and also determines how data is partitioned and stored. The partition key is mandatory.
Optional sort key – with it you have a composite key. Sort keys help create 1:many relationships and are useful in range queries.
Let’s consider an orders table. The partition key could be customer_id, and the sort key could be order_id, so one customer maps to many orders.
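A boto3 sketch of that 1:many pattern: one customer (partition key), many orders (sort key). It assumes, purely for illustration, that order IDs are date-prefixed so begins_with gives a range query:

import boto3
from boto3.dynamodb.conditions import Key

orders = boto3.resource("dynamodb").Table("Orders")

# Fetch one customer's 2019 orders in a single query.
response = orders.query(
    KeyConditionExpression=Key("customer_id").eq("C123")
    & Key("order_id").begins_with("2019-")
)
print(response["Items"])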
Sometimes you query data using the primary key, but sometimes you might need to query by an attribute that is not your primary/secondary key.
Let’s say we want to find a customer’s fulfilled orders. We would have to query for all of that customer’s orders and then look for fulfilled ones in the results – not very efficient with large tables. But we have “LSIs” to help us out. We can create an LSI with the same partition key (customer_id) and a different sort key (fulfilled). Now your query can be based on the keys of the LSI. Fast and efficient.
The LSI is collocated on the same partition as the item in the table, so this gives us consistency. When an item is updated, the LSI is updated, and then the write is ack’d.
The LSI is partitioned by the same partition key as the parent table, with a different sort key.
In the index, you can choose to have just the keys, project in other attributes, or include all attributes – depending on what attributes you want returned with the query.
There is a 10 GB storage limit per partition key value when a table has an LSI. Note that LSIs are local to the partition key.
And LSIs use the RCU/WCU of the original table.
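Putting the LSI pieces together, a boto3 sketch of creating the orders table with the “fulfilled” LSI described above; the index name, projection, and capacities are placeholder assumptions:

import boto3

client = boto3.client("dynamodb")
client.create_table(
    TableName="Orders",
    AttributeDefinitions=[
        {"AttributeName": "customer_id", "AttributeType": "S"},
        {"AttributeName": "order_id", "AttributeType": "S"},
        {"AttributeName": "fulfilled", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "customer_id", "KeyType": "HASH"},
        {"AttributeName": "order_id", "KeyType": "RANGE"},
    ],
    LocalSecondaryIndexes=[
        {
            "IndexName": "fulfilled-index",
            "KeySchema": [
                {"AttributeName": "customer_id", "KeyType": "HASH"},  # same partition key
                {"AttributeName": "fulfilled", "KeyType": "RANGE"},   # alternate sort key
            ],
            "Projection": {"ProjectionType": "KEYS_ONLY"},
        }
    ],
    # LSIs share the table's capacity, so only the table is provisioned.
    ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 10},
)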
Taking the concept of indexes a little further.
Some applications might need to perform many kinds of queries, using a variety of different attributes as query criteria – which doesn’t fit the existing partition key/sort key.
In this case you define a GSI.
Global secondary indexes – think of them as parallel, or secondary, tables.
A GSI can have a partition key that is different from the table’s. It can also have an alternate sort key.
Example: customers, orders, and date ranges. Partition the GSI by customer_id with order_date as the sort key, and query a customer’s orders for a date range.
Note: When you create a GSI, you must specify read and write capacity units for the expected workload on that index.
Similar to an LSI, you can choose to have just the keys, project in other attributes, or include all attributes – depending on what attributes you want returned with the query.
Think of this as a parallel table asynchronously populated by DynamoDB. Eventually consistent. GSI updates typically happen within a second.
Throughput for the GSI is important – it determines how soon the GSI will be updated.
1 Table update = 0, 1 or 2 GSI updates
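To make the mechanics concrete, a boto3 sketch of adding the date-range GSI above to an existing table (something an LSI cannot do); the index name, keys, and capacities are placeholders:

import boto3

client = boto3.client("dynamodb")
client.update_table(
    TableName="Orders",
    AttributeDefinitions=[
        {"AttributeName": "customer_id", "AttributeType": "S"},
        {"AttributeName": "order_date", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[
        {
            "Create": {
                "IndexName": "customer-date-index",
                "KeySchema": [
                    {"AttributeName": "customer_id", "KeyType": "HASH"},
                    {"AttributeName": "order_date", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "ALL"},
                # Unlike an LSI, a GSI carries its own capacity.
                "ProvisionedThroughput": {
                    "ReadCapacityUnits": 10,
                    "WriteCapacityUnits": 10,
                },
            }
        }
    ],
)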
Customers often ask whether they should use an LSI or a GSI. When should you use each?
There is more flexibility with a GSI.
With a local secondary index, there is a limit on item collection sizes: for every distinct partition key value, the total size of all table and index items cannot exceed 10 GB.
You can have only 5 LSIs but 20 GSIs; moreover, with GSIs you have the flexibility to create them after the table is created, while LSIs must be created when the table is defined.
An LSI can be modeled as a GSI.
If the data size in an item collection can exceed 10 GB (for example, many orders for a customer_id), a GSI is the only choice, because LSIs limit the data size in a particular partition.
If eventual consistency is okay for your scenario, use a GSI – it works for 99% of the scenarios out there.
Amazon’s path from Relational Databases to NoSQL reflects the journey many customers are now taking.
Amazon.com, the online retail business, runs on one of the world’s largest web infrastructures. Back in 2004, Amazon.com was using relational Oracle databases and was unable to scale them; maintenance and administration were painful. To keep Amazon.com highly scalable under all the incoming traffic, an internal project investigated options, asking: “If availability, durability, and scalability are the priority, what would the database look like?” This resulted in a whitepaper that described what that database should look like. The paper paved the way for many of the NoSQL technologies out there today, and it was also the beginning of DynamoDB.
Database as a Swiss Army knife – hundreds of applications built on RDBMS; poor scalability (Q4 was a pain); poor availability; exorbitantly high costs for hardware, software, and administration.
Dynamo = replicated DHT with consistency management.
A specialist tool with limited query capability and simpler consistency.
Problem: it required significant effort to maintain.
DynamoDB was designed to deliver consistently high performance at any scale:
Predictable Performance
Massively Scalable
Fully Managed
Low Cost
Now consider Prime Day 2017 to see how far we came.
Amazon DynamoDB requests from Alexa, the Amazon.com sites, and the Amazon fulfillment centers totaled 3.34 trillion, peaking at 12.9 million per second. According to the team, the extreme scale, consistent performance, and high availability of DynamoDB met the needs of Prime Day without breaking a sweat.
By Prime Day 2019, Amazon DynamoDB was supporting multiple high-traffic sites and systems including Alexa, the Amazon.com sites, and all 442 Amazon fulfillment centers. Across the 48 hours of Prime Day, these sources made 7.11 trillion calls to the DynamoDB API, peaking at 45.4 million requests per second.
If you are familiar with tagging AWS resources for keeping track of expenses, this screen will look familiar.
Here’s a screenshot of a daily report using the cost allocation tags.
You can clearly identify which services cost how much on each day. This feature allows you to see what you are spending on your tables, indexes, etc.
For example, you can tag DynamoDB tables for different environments using env=DEV, env=TEST, env=PROD, etc. Here the name of the tag is ‘env’ and the possible values are DEV/TEST/PROD, etc.
Then you can clearly see your spend for each environment.
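A small boto3 sketch of applying that ‘env’ tag through the API; the table name is a placeholder:

import boto3

client = boto3.client("dynamodb")

# Look up the table's ARN, then attach the cost allocation tag.
table_arn = client.describe_table(TableName="Orders")["Table"]["TableArn"]
client.tag_resource(
    ResourceArn=table_arn,
    Tags=[{"Key": "env", "Value": "DEV"}],  # likewise TEST, PROD, ...
)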
You can download and run DynamoDB on your local machine for Dev/Test.
-- No API charges
-- No Data transfer charges.
Except for the endpoint, applications that run against the downloadable version of DynamoDB will work with the DynamoDB web service – only minor changes are needed to point your application at the production DynamoDB endpoint.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBLocal.html
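A sketch of that endpoint switch in boto3; port 8000 is the downloadable version’s default, and the dummy credentials are only needed locally:

import boto3

# DynamoDB Local: everything is the same except the endpoint.
local = boto3.resource(
    "dynamodb",
    endpoint_url="http://localhost:8000",
    region_name="us-west-2",
    aws_access_key_id="local",
    aws_secret_access_key="local",
)
print(list(local.tables.all()))

# Production: drop endpoint_url and use real credentials.
# prod = boto3.resource("dynamodb", region_name="us-west-2")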
https://amazon-dynamodb-labs.com/