In this session, learn how Nike Digital migrated its large Cassandra and Couchbase clusters to fully managed Amazon DynamoDB. We share how Cassandra and Couchbase proved operationally challenging for engineering teams and failed to meet the need to scale for high-traffic product launches. We discuss how DynamoDB's flexible data model allows Nike to focus on innovating for our consumer experiences without managing database clusters. We also share the best practices we learned for effectively using DynamoDB's TTL, auto scaling, on-demand backups, point-in-time recovery, and adaptive capacity for applications that require scale, performance, and reliability to meet Nike's business requirements.
* Set context
* First talk about our group at Nike, what we do, where we started on our cloud journey.
* Then talk about SNKRs: premium experience for sneaker heads. Specific engineering challenges on DBs
* Adam will talk about Achievements, born with DynamoDB.
* He'll then shift and talk about one of our DynamoDB migration stories for timezone.
Nike's mission statement focuses on innovation and inspiration, because that's how we help athletes reach their full potential. Nike believes, "If you have a body, you are an athlete." Nike Digital Engineering powers the future of sport - serving athletes in our digital experiences.
* At this point, Nike digital is built around a microservices architecture.
* Built this up over many years
* Spans many different domains and experiences: activity, commerce, order management, logistics.
* Served from AWS: large footprint
* Chose AWS because of their vast array of offerings, including DynamoDB
* But, we didn't start this way
* To give perspective: started with monoliths in a traditional colo
* As the digital business grew => challenges scaling
* 6 months worth of changes going out in big batches
* 2013: started our cloud journey as an experiment w/ small number of teams
* Not mature yet: technology choices or automation
* Fast forward to 2015: we matured into a full-fledged, cloud-first organization
* Focusing on automation to scale deploys and teams
Migration into the cloud
As we migrated to the cloud, we established core principles grounded in a singular mission: serve athletes. We apply these principles to all of our work.
Responsive, Resilient, Elastic, Observable
We moved features incrementally instead of tackling everything at once.
With some features still running on the monolith, new services had to send data back to the old platform. We quickly learned its scalability limits and throttled writes back to the old service to prevent database pressure.
* Along with some of our core principles => starting direction
* Explicitly no lift and shift: rearchitect around cloud native
* Start with C* and Couchbase, taking cues from companies like Netflix
* C* was chosen to support multi-regional clusters in the future
* Had a lot of automation around cluster creation and some maintenance
* new model: dedicated databases, shared-nothing architecture
* DevOps: Each team operated their clusters
* As we grew, the choice of using a database cluster on EC2 provided some challenges
* More clusters given more teams building services
* Early engineering teams had deep knowledge of C*, lost some of that with growth
* Upgrades, backups and restore non-trivial
* We have a culture of sharing best practices between teams
* => But we still found C* to be operationally intensive
* C* and Couchbase were a large part of EC2 spend
* Many overprovisioned, underutilized
* As you'll see throughout the presentation, we've moved many workloads over to DynamoDB
* Let's talk about a specific example: SNKRs launch
* Start from the consumer end
* SNKRs is a premium experience where we launch our most coveted product
* Launches happen 52 weeks a year, around the globe
* From browsing, to checkout, to payment, the services supporting launches are all in AWS
* If you know anything about sneakerheads, you know how passionate they are
* The analog equivalent would be a massive line at a store
* Today, we can serve these athletes digitally
* We see our largest surges in traffic from launches
* Which presents specific engineering challenges around scale
* Here you see the traffic pattern we face
* Before the big spike, you're seeing Nike.com traffic
* NOTE: This is not zero!
* Largest spike at launch start time: 7am
* Aftershocks, but shows how massive launches are
* Again, 52 weeks a year, across geographies
* To make this work, we overprovision Cassandra and Couchbase clusters
* Ensure that consumers have a good experience
* To handle the spike, the cluster has excess capacity before and after the spike
* This translates into $$$ we could be spending elsewhere
* Provisioning and configuration are challenging in order to meet the spike
* Also highlight need to forecast
* All teams in the launch path migrated
* We migrated to DynamoDB to solve some specific challenges for launch
* Relieved of ongoing cluster maintenance
* DynamoDB's ability to horizontally scale
* With the migration, we did need to rethink some of our data models
* Some key DynamoDB features that enabled our data needs include...
* GSIs: before, we'd need to coordinate secondary indexing ourselves in code; GSIs move this to the database layer (see the sketch after this list)
* Conditional and partial updates with UpdateItem are a huge win
* Adam has a great story here with achievements
* Launch: optimistic locking for our fairness process
* Flexible schema without ALTER TABLE
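* As an illustration of that shift, here's a minimal sketch of querying a GSI with boto3. The table name (launch-entries), index name (byAthlete-index), and attributes are hypothetical, not our actual schema.

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("launch-entries")  # hypothetical table name

# Before, a secondary lookup meant keeping a second copy of the data in sync
# in application code; with a GSI the database maintains the index for us.
response = table.query(
    IndexName="byAthlete-index",  # hypothetical GSI name
    KeyConditionExpression=Key("athleteId").eq("athlete-123"),
)
for item in response["Items"]:
    print(item)
```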
* One particular case we had to solve for is compound hash keys
* In C* we get multi-attribute primary keys
* In DynamoDB we have to do this with string concatenation in code (sketched below)
* One example of a minor difference in the data modeling changes we needed to make
* DynamoDB is slightly less forgiving of hot keys => pay attention to hash key
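* Below is a minimal sketch of the compound hash key pattern described above; the table, attribute names, and delimiter are hypothetical.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("launch-entries")  # hypothetical table name

def compound_hash_key(product_id, region):
    # Cassandra gave us multi-attribute primary keys; in DynamoDB we build the
    # equivalent by concatenating the attributes into one string hash key.
    return f"{product_id}#{region}"

table.put_item(
    Item={
        "pk": compound_hash_key("AJ1-RETRO-HIGH", "US"),  # compound hash key
        "sk": "athlete-123",                              # range key
        "productId": "AJ1-RETRO-HIGH",
        "region": "US",
    }
)
```

* Picking a delimiter that never appears in the source attributes avoids ambiguous keys.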
* DynamoDB's ability to scale up and down horizontally is a key unlock for launch
* Previously, we'd need to scale up and down EC2s to vary capacity
* We'd need to do so in each AZ, in the right order
* No more idle capacity: we pay for what we use
* Horizontal scalability is not free: still need capacity planning
* In general, we also need enough headroom
* RCUs and WCUs vary per use case: read-heavy services might be product data, write-heavy ones checkout
* Observability from Cloudwatch metrics for DDB
* Run experiments to determine correct capacity for various expected loads
* DDB Autoscaling looks at the last 5min of activity to determine future scale
* Although we use autoscaling, still need to prescale
* In the case of launch, the orders of magnitude increase dictate that we increase this ourselves
* We happen to know when the spikes will happen, since we're the ones setting the date
* Here's the specific API call we use: the same Application Auto Scaling service lets us set up scheduled scaling actions (example below)
* Set the minimum value to the expected peak + headroom
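* A sketch of what a scheduled scaling action looks like through the Application Auto Scaling API in boto3; the table name, timestamps, and capacity numbers are illustrative, not our real values.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

TABLE = "table/launch-entries"      # hypothetical table
EXPECTED_PEAK_WCU = 40000           # illustrative expected peak
HEADROOM = 1.3                      # extra headroom on top of the peak

# Raise the auto scaling floor shortly before launch so the table cannot be
# scaled below the capacity we know the spike will need.
autoscaling.put_scheduled_action(
    ServiceNamespace="dynamodb",
    ScheduledActionName="prescale-for-launch",
    ResourceId=TABLE,
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    Schedule="at(2018-11-29T13:00:00)",  # illustrative time before the launch window
    ScalableTargetAction={
        "MinCapacity": int(EXPECTED_PEAK_WCU * HEADROOM),
        "MaxCapacity": int(EXPECTED_PEAK_WCU * HEADROOM * 2),
    },
)

# A second action after the spike lowers the floor so auto scaling can bring
# capacity back down and we stop paying for idle throughput.
autoscaling.put_scheduled_action(
    ServiceNamespace="dynamodb",
    ScheduledActionName="postscale-after-launch",
    ResourceId=TABLE,
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    Schedule="at(2018-11-29T17:00:00)",
    ScalableTargetAction={"MinCapacity": 500},
)
```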
* Capacity Units: evenly divided by partition
* High traffic in a short period of time on a single product
* Hot keys in product-centric services (product info or pricing)
* Our previous databases had in-node caching capabilities
* In DynamoDB, hot keys = partition-level throttling
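* To make the throttling math concrete, here's a purely illustrative calculation (the numbers are invented): when provisioned throughput is spread evenly across partitions, a single hot key can only draw on its own partition's share.

```python
# Illustrative numbers only.
table_rcu = 30000                             # provisioned RCUs for the whole table
partitions = 10                               # physical partitions behind the table
per_partition_rcu = table_rcu / partitions    # each partition gets 3,000 RCUs

# At launch, most reads target a single product's key, which lives on one partition.
hot_key_read_rate = 20000                     # reads/sec against that one key
print(hot_key_read_rate > per_partition_rcu)  # True -> partition-level throttling
```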
* We have a few strategies to solve this problem w/ low latency
* Our first tool we reach to is the CDN
* Used for slow-changing resources, such as product data
* Problem: Points of Presence burst at launch start
* A traditional approach to solve this is via a side cache
* Puts caching logic in the service (read through/write through/write behind/ etc)
* Some teams use this still
* In some cases, we can use DynamoDB streams to update the cache (sketched below)
* This effectively handles bursts
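* One way to wire this up is a Lambda function subscribed to the table's stream that refreshes the side cache on every change. This is a sketch under assumptions: a Redis-compatible cache (e.g. ElastiCache) and a hypothetical productId key attribute.

```python
import json
import os

import redis  # assuming a Redis-compatible side cache such as ElastiCache

cache = redis.Redis(host=os.environ["CACHE_HOST"], port=6379)

def handler(event, context):
    # Invoked by the DynamoDB stream; each record carries key and item images
    # in DynamoDB's attribute-value JSON format.
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            new_image = record["dynamodb"]["NewImage"]
            product_id = new_image["productId"]["S"]  # hypothetical key attribute
            # A short TTL keeps the cache from drifting too far from the table.
            cache.setex(f"product:{product_id}", 60, json.dumps(new_image))
        elif record["eventName"] == "REMOVE":
            keys = record["dynamodb"]["Keys"]
            cache.delete(f"product:{keys['productId']['S']}")
```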
* Ultimately the easiest option to solve hot key issues on read is DAX
* DAX is a provisioned cache cluster with the right logic for caching results
* Reduction of consistency issues and code complexity
* Problems arise between the query cache and the item cache; we usually use a short TTL to mitigate them
Achievements
-- As Zach mentioned earlier, the old platform is a monolithic software stack.
-- The old achievements system was backed by a series of Oracle RAC and many Coherence servers.
-- Achievements were implemented as code, difficult to scale, and deployments occurred every six months.
-- The system was taken down for database maintenance during the night; as low-traffic periods started to dry up, those maintenance windows became harder to find.
-- Coherence, the in-memory cache, locked state per athlete. The locks became contentious as background processes ran, creating a throughput bottleneck.
-- Processes were hard to debug and not observable. We got calls from athletes with no way of explaining what happened, short of updating the database by hand.
Release of redesigned app experiences
We started this journey in mid-2016, when Nike released its existing apps, NRC and NTC, as cloud-enabled experiences.
The new services were great, but this was only the start of our team's journey to the cloud.
Achievements was not in the cloud in the first release.
Both apps, NTC and NRC, have a feature called "achievements." It keeps athletes motivated, whether it's their first mile or their fifth marathon, cheering them on through achievements.
We re-designed the system around two key services.
The rules engine lets us define custom, localized achievements with various filters over activity data.
The ingest service processes incoming activity messages, applies the rules to determine what was earned, and writes it to the database.
Both the rules engine and the ingest service use distributed tracing to ensure they are observable.
We don't have time to dig deeper here, but one of our engineers wrote a fantastic article hosted on Medium under Nike Engineering.
The database behind the Achievements system was DynamoDB.
The team decided to use DynamoDB for a number of reasons.
Scalability: handling thousands of requests.
Durability: we never patch or update late at night during low-traffic times, which supports our core principle of being responsive.
Elasticity: being able to adjust our capacity dynamically, another core principle.
Cost: with the ability to auto-scale up and down, we only pay for what we use.
DynamoDB's UpdateItem API gives us conditional updates for optimistic locking.
Conditional updates and optimistic locking
When processing multiple updates to a single achievement at the same time, concurrency becomes a problem when trying to maintain the state of a single record.
First, what is optimistic locking?
It is a method of maintaining the state of records using a version number.
When a record is read, special attention is paid to the version number.
When attempting to write the record, if the version doesn't match the version that was read, then the state has changed while the record was being processed.
The record is reprocessed, given that its state has changed.
Other locking methodologies require additional infrastructure; this one leverages the database itself.
Without conditional updates
Imagine you’ve been training for months.
Today is race day.
Four hours and 45 minutes and 26.2 miles later, you've just completed your first marathon. Nike wants to celebrate your achievement, so let's look at what would happen if conditional updates were not used.
… Through steps
This is a last-write-wins model, which doesn't ensure consistency of the data and can result in a terrible experience for our runners.
With conditional updates
Let’s look at this scenario with conditional updates.
Each update now reads version 1 of the achievement record.
The quicker update successfully writes its change because the record's version is still 1.
The second write, now a conditional update, fails because the record's version has been updated to 2.
Achievements sends the message to be reprocessed.
The update comes back to achievements and sees that the activity has a distance of 26.2 miles.
Nike is there to celebrate your first marathon achievement.
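Below is a minimal sketch of the conditional update just walked through, assuming a hypothetical achievements table keyed by athleteId and achievementId with a numeric version attribute.

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("achievements")  # hypothetical table name

def write_achievement(athlete_id, achievement_id, new_state, read_version):
    """Write the update only if nobody else has written since we read the record."""
    try:
        table.update_item(
            Key={"athleteId": athlete_id, "achievementId": achievement_id},
            UpdateExpression="SET #state = :state, #ver = :next_version",
            ConditionExpression="#ver = :read_version",
            ExpressionAttributeNames={"#state": "state", "#ver": "version"},
            ExpressionAttributeValues={
                ":state": new_state,
                ":next_version": read_version + 1,
                ":read_version": read_version,
            },
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            # The version moved underneath us: send the message back for reprocessing.
            return False
        raise
```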
DynamoDB enables Achievements to be resilient, checking off the last core principle.
Challenges
Row size. To keep costs manageable, that meant keeping row size low. Since read capacity is also tied to the data size of a record, the team GZIP'd a large column to drop consumed read capacity with a quick code change.
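A sketch of that kind of change, assuming a hypothetical large detail attribute: compress it with GZIP before writing and decompress on read, so the stored item (and therefore the consumed capacity) is smaller.

```python
import gzip
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("achievements")  # hypothetical table and attribute names

def put_achievement(athlete_id, achievement_id, detail):
    # Compress the large attribute; boto3 stores bytes as a DynamoDB Binary value.
    compressed = gzip.compress(json.dumps(detail).encode("utf-8"))
    table.put_item(
        Item={
            "athleteId": athlete_id,
            "achievementId": achievement_id,
            "detail_gz": compressed,
        }
    )

def read_detail(item):
    # Decompress on read; the rest of the item is untouched.
    return json.loads(gzip.decompress(item["detail_gz"].value).decode("utf-8"))
```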
DAX.
We attempted to use DAX. Launch successfully used DAX, but the data and access patterns are very different: Launch is very read-heavy with small records, while Achievements is on average 97% read and 3% write with large records. With decreased visibility into the metrics of the DAX cluster, the team instead decided to rework the table schema by adding a second table to enable development of new features for the achievements service. (Need AWS wordsmithing here to best explain why DAX didn't fit our use case.)
Monolith system
Achievements has been in production for over a year and has been a success.
With the team's hard work, the monolith system was shut down earlier this year.
But this does not complete our journey in the cloud; it just finishes a chapter.
Achievements is one example where we started with DynamoDB, but what about services that didn't start with DynamoDB in their design?
Remember when Zach mentioned that the teams that moved quickly to the cloud used Cassandra as their data store? We were one of those teams.
One of the earliest services we moved to the cloud was Timezone
Timezone
Time is always an engineer's worst nightmare, right? For our athletes, sport never stops, and we need to support that with our apps. So we developed Timezone.
Timezone is a web service responsible for keeping a timeline of an athlete’s timezone journey as they move through the world.
Our running and training apps send activity data to our new cloud services, and the apps tell the platform what timezone the athlete is in. This data allows the platform to give the athlete timezone-adjusted data when needed, while handling all data in UTC for easier storage and processing.
Cassandra is a great datastore. It could scale to handle very large traffic loads with ease. But we've learned a lot since Timezone was launched in 2016.
Working with Cassandra for two years
We're a DevOps group, and Cassandra was the most demanding and time-consuming thing team members had to deal with.
Costs were rising. EC2 costs were going up as the team grew the cluster to accommodate the increasing data size.
Upgrading a Cassandra version meant a rolling push that could easily occupy a week of a team member’s time to push the code to production.
Retirement Notifications.
The retirement notification
There you are at your desk, working hard on some code. You're on-call, and your rotation has been awesome.
Ding! An email notification comes in and says….
EC2 retirement..
For all our other cloud services it’s easy. The team can let our auto-scale groups do the work for us, but Cassandra was a different story.
You quickly scramble to the AWS console
Head over to EC2 and to the Events section just hoping you don’t see a Cassandra node.
And your rotation was going so well!
Time to spend a couple hours going through the documented node replacement process.
The team began to ponder if we could move from Cassandra to DynamoDB. (Looking for a way out)
We set up the service to do dual writes to Cassandra and DynamoDB, and to read from Cassandra (the dual-write path is sketched after these migration steps).
We used a combination of technologies: AWS EMR, Spark, and the DynamoDB Spark Connector from AWS.
Show link to connector on slide
It took approximately three days to move all the data over.
Validated the migrated data
Updated the service to still perform dual writes, but instead read from DynamoDB.
Ensured the service was operating normally with the read path changed.
Updated the service to remove Cassandra
Terminated the Cassandra cluster
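The dual-write step above is the key safety net during the cutover. Here is a minimal sketch of that path, assuming hypothetical keyspace, table, and column names, with timestamps passed as ISO-8601 strings to keep both stores simple.

```python
import boto3
from cassandra.cluster import Cluster

# Hypothetical connection details and schema for illustration.
cassandra = Cluster(["cassandra-seed.internal"]).connect("timezone")
dynamo_table = boto3.resource("dynamodb").Table("timezone-journal")

insert_stmt = cassandra.prepare(
    "INSERT INTO journeys (athlete_id, recorded_at, tz) VALUES (?, ?, ?)"
)

def record_timezone(athlete_id, recorded_at_iso, tz):
    # During the migration window every write goes to both stores;
    # reads stay on Cassandra until the DynamoDB copy has been validated.
    cassandra.execute(insert_stmt, (athlete_id, recorded_at_iso, tz))
    dynamo_table.put_item(
        Item={"athleteId": athlete_id, "recordedAt": recorded_at_iso, "tz": tz}
    )
```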
Cost savings were substantial.
With DynamoDB, we dropped our overall database cost by 98%. Yes, we're spending only 2% of what we did with Cassandra.
Able to achieve this without giving up anything after migrating
Our DevOps work around managing the Timezone database dropped from being measured in hours or sometimes days to just minutes. (Engineering cost, loose cost)
DynamoDB is the new default
DynamoDB was a big win for Timezone, the team and for Nike.
Whether starting from scratch or starting with an existing datastore.
----------
* As you've seen, we can achieve the necessary levels of scaling
* For product launch, we have to optimize for peak throughput scaling
* For something like Timezone, we have scalability on storage
* Overall, DynamoDB has helped us grow the organization's footprint:
* less time spent scaling clusters means more services and experiences
-------
* DynamoDB continues to add features over time.
* DAX, Backups, Adaptive Capacity, TTL
* As a result, Nike can continue to innovate for our consumer
* For Nike, our experiences are the differentiator