DevOpsDays Rockies - Living in a Hybrid World
1. NORDSTROM TECHNOLOGY
Living In A Hybrid World
DevOpsDays Rockies
April 21st, 2016
COURTNEY KISSLER
Vice President of E-commerce
and Store Technologies
2. CUSTOMER CENTRIC STRATEGY
[Diagram: Brick & Mortar and Online (.com) channels; Full Price and Off Price]
Technology as the Key Enabler
• Strategic Flexibility
• Digital Experience
• In-Store Convenience
• Speed
• Reliability
4. MODERNIZATION JOURNEY
1
CRAWL: Defined path forward
WALK: Begin unlocking productivity and speed to market
RUN: Optimized, scalable site that can innovate at the speed of business
Not All Teams Transform At The Same Pace…..
• Invest in engineering
thought leadership
• Focus on shipping
product
• Microservices & Cloud
strategy definition
• Invest in Lean mindset
and practices
• DevOps adoption
• Microservices & Cloud
implementation
• On-demand releases
• Establish baseline metrics
• Spread talent across
organization
• Manage to metrics
• Optimize and extend
6. CASE STUDY: WEB PRODUCT PAGE MODERNIZATION
2
Initial Condition
• Lack of Trust from Customers
• Slow to respond to Business Needs
• Tightly Coupled Monolithic Code Base
• 5 Week Release Process
• Site Failures During Peak Traffic Events
Target Condition
• Trusted & Reliable Partner
• Speed & Flexibility
• Microservices
• CI/CD Pipelines
• Isolated On-Demand Releases
• Reliable & Scalable Environments
Unlock Business Value Thru Microservices, Isolation and Faster Deployments
Modernize and Isolate at Each Layer of the Stack
Presentation Tier: UX Component model to standardize user interface components on the site.
Microservices and Data: Service build-out to remove dependencies on shared service teams and isolate functionality.
Hosting and Deployment: Cloud-based Blue Green Deployments for elastic scalability and de-risking releases.
Test and Quality: Fully Automated Test Suite to improve speed of testing and quality of testing.
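The blue/green idea named on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `BlueGreenRouter` class, its `release` and `rollback` methods, and the `healthy` callback are hypothetical names, not the tooling the team used. Two identical environments exist, a new version goes to the idle one, and traffic switches only after a health check passes.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Environment:
    name: str
    version: str


class BlueGreenRouter:
    """Hypothetical router: all traffic goes to one environment, the other is idle."""

    def __init__(self, blue: Environment, green: Environment) -> None:
        self.envs: Dict[str, Environment] = {"blue": blue, "green": green}
        self.live = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def release(self, version: str, healthy: Callable[[Environment], bool]) -> bool:
        """Deploy to the idle environment; switch traffic only if it is healthy."""
        target = self.envs[self.idle]
        previous = target.version
        target.version = version
        if not healthy(target):
            target.version = previous  # live traffic was never touched
            return False
        self.live = self.idle
        return True

    def rollback(self) -> None:
        """Point traffic back at the previous environment."""
        self.live = self.idle


router = BlueGreenRouter(Environment("blue", "1.0"), Environment("green", "1.0"))
ok = router.release("1.1", healthy=lambda env: True)
failed = router.release("1.2", healthy=lambda env: False)
```

Because a failed health check never moves traffic, a bad build costs nothing on the live side, which is the de-risking benefit the slide refers to.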
7. CASE STUDY: WEB PRODUCT PAGE MODERNIZATION
2
Code & Infrastructure
From
• Centrally Managed Build & Source Control
• Centrally Managed Infrastructure in Physical Data Centers
• 5 Week Release Process
To
• Distributed team owned CI/CD Git Model with On-Demand Releases to the Cloud
• Infrastructure as Code
Benefit
• Ability to release multiple times per day
• Rollback capability
• React to both Major and Minor traffic fluctuations at any layer of the stack within 6 minutes
• Technology is not the constraint
Team Structure
From
• Multiple Dev Teams
• Centralized Support Team
• Centralized Infrastructure Team
To
• Distributed Full Stack DevOps Team
• Weekly On-Call Support Rotation
Benefit
• Ownership
• Accountability
• Operational Mindset
• Accelerated Feedback Loops
Our DevOps Journey
8. CASE STUDY: WEB PRODUCT PAGE MODERNIZATION
2
Incident Management
From
• Centralized Support
• Slow Escalation and Resolution
To
• Proactive Alerts
• Modern Incident Management Platform
• Modern Collaboration Tool
• On-Call Rotation among Team Members
Benefit
• Improved MTTD & MTTR
• Fast Response to Critical Incidents (often within 2-3 Hrs.)
• Fewer Critical Incidents due to proactive monitoring
Logging & Metrics
From
• Basic IIS Logs
• DOM Ready Page Performance
To
• Extensible Team Defined Common Logging Scheme
• User Ready Page Performance
Benefit
Dashboards that support the Team:
• App Specific
• Instance Specific
• Query Volume Trending
• Business Telemetry
Our DevOps Journey
9. CASE STUDY: WEB PRODUCT PAGE MODERNIZATION
2
What we learned from the DevOps Journey
Complete agreement with Jez Humble:
“If it hurts, do it more frequently, and bring the pain forward.”
Teams Pushed hard to Increase Release Frequency
• Developers are passionate about getting better each time
• Deploying smaller feature increments
• Increased frequency forced Continuous Improvement and New Processes
Eg: Experimentation Platform
• Can spin up new test branch in cloud in 5 – 10 mins
• Accessible by business, UX, PM, and Dev
• Ideate and course correct early in the process
• Eliminate need for big bug bashes
10. CASE STUDY: WEB PRODUCT PAGE MODERNIZATION
2
What we learned from the DevOps Journey
No More Big Sign Offs
• Accountability shifted to team (Business, UX, PM, and Dev) to be
actively involved in the lifecycle of the development process
Don’t Stop Investing in the Platform
• Modernization requires Continuous Improvement to ensure team can
always deliver with speed and flexibility
• Built Feature/Infrastructure Ratios into each release
Team Focused on Business Outcomes
• Accountability goes far beyond owning the code, it includes coming up
with solutions
• When given Business Goals, teams self-organized to figure out how to
improve experience and performance
11. CASE STUDY: WEB PRODUCT PAGE MODERNIZATION
2
What Challenges Remain in the Hybrid Environment?
• Cross team orchestration
• High level of effort to make site wide changes
• Not all engineers are fungible
• Only as fast as your slowest link
• Performance optimization challenges
• Cross team alerts
• Operational awareness
• Support for long tail of legacy code
• Monolithic & aging code base impedes developer productivity and limits innovation
12. LIVING IN A HYBRID ENVIRONMENT
YOU CAN’T MODERNIZE EVERYTHING AT ONCE.
When Living in a Hybrid Environment,
don’t forget to show your legacy systems a little love too!
14. CASE STUDY: WEB HARDENING VALUE STREAM MAPPING
3
Current Condition (2014)
• Hardening Sprint takes 2 weeks
• Hardening exceptions occur too frequently
• Testing is a time consuming manual process with inconsistent results
• Production Releases take several days and require heroics and long hours
Target Condition (2015)
• Hardening Sprint takes 1 week
• Reduce Hardening Exceptions by 75%
• Testing is automated and Feature Teams are held accountable for testing during the development phase
• Production Releases can be finished in a single 8-hr day
Improve Cycle Time by optimizing the Hardening phase of the release cycle
2016 Goal = Additional 20% Improvement
15. CASE STUDY: WEB HARDENING VSM TIMELINE (2015)
3
[Timeline figure: 2015 Release Schedule. Releases: R0 R1 R2 R3 R4 R5 R6 R7 R8 R9. Months: Feb Mar May Jun Aug Sept Oct Nov Dec Jan]
VSM Workshop
• 40 People
• 10 Teams
First Experiments
• Removed 20 Steps
• Saved 2,294 Hours
• Improved Collaboration
Break out Sub-Tracks
• Integration
• Exceptions
• Deployment
• Performance Testing
Quality Gates to reduce Exceptions
Incremental Schedule Reductions
Problem A3s by Sub-Track
Final 2015 Status
• Removed 5 days from Hardening
• Removed 3300 Hrs of waste
16. CASE STUDY: WEB HARDENING VALUE STREAM MAPPING
3
Outcome
Removed 1 week from process
Release process on track to complete in
a single 8 Hr day
• Removed 3300 Hrs of waste per
release
• 93% Exception time reduction
• 70% Testing time reduction
• 42% Deployment time reduction
Weekly On-Call Support Rotation
On-call person does not participate in Sprint but instead works on a Tech Investment Project to Continuously Improve infrastructure and automation. Projects are designed to be small enough chunks that a dev can get it done in a week. Sometimes the devs came up with their own improvement ideas and other times Matt would guide them. The concept built empowerment for the team members to have some say in what they worked on for CI.
Ownership – teams own end to end
Accountability - Everyone on team is accountable
Operational Mindset – Teams own Infrastructure as code
Infrastructure as Code
Teams accountable for optimizing infrastructure using strategies such as AMI Factories and right sizing of instances which reduce auto scale times.
React to both major and minor traffic fluctuations at any layer of the stack within 6 minutes.
When the team first moved to the cloud, they could auto-scale in 18 minutes, but through continuous improvement efforts they brought the time down to 6 minutes!
User Ready Page Performance - Measures critical elements of page
As teams took accountability, they started logging all kinds of telemetry data to AWS CloudWatch and Splunk, from which they built alerts and dashboards that worked for them, so they could see problems early and avoid getting a call in the middle of the night. Today, an alert sets off PagerDuty, the on-call person sets up a Slack channel, and the team works on the issue until it is fixed and ready to deploy. A critical bug does not last a week; resolution is usually on the order of hours. The team formed a lot of metrics to support all this, not bogus metrics handed down from somewhere else like DOM Ready. They measured how key parts of the page are performing, such as how long it takes for a critical section or even an individual component to render. These metrics supported the team so they could work better. Top-down mandates on measurement wouldn’t have panned out.
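One way to read a "User Ready" measurement is as the moment the last critical component finishes rendering, ignoring anything non-critical. The Python sketch below works under that assumption; the function name `user_ready_ms`, the component names, and the timings are all hypothetical and are not taken from the team's dashboards.

```python
from typing import Dict, Iterable


def user_ready_ms(render_done_ms: Dict[str, float], critical: Iterable[str]) -> float:
    """Return the time (ms since navigation start) when the last critical component rendered."""
    return max(render_done_ms[name] for name in critical)


# Hypothetical per-component render completion times for one page view.
timings = {
    "product_image": 820.0,
    "price": 640.0,
    "add_to_bag": 910.0,
    "reviews": 2400.0,  # below the fold, not treated as critical here
}
critical = ["product_image", "price", "add_to_bag"]

ready = user_ready_ms(timings, critical)
print(f"User Ready at {ready} ms")
```

In this made-up page view the slow reviews widget does not move the number, which is the kind of distinction a whole-document event such as DOM Ready cannot express.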
Deployed Smaller Feature Increments (Trending toward Single Piece Flow)
If it’s painful, do it more often until you get better at it
Don’t shy away from the pain. If a failed deployment brought the site down, find a way to fix that and keep deploying quickly. Don’t build in a big risk-aversion process. Drive to release more frequently, even though more issues arise as frequency increases. The team had to improve the dev process at every step along the way because we kept finding more pain points. Devs had to engineer their way out of issues, but they were passionate about getting better each time. This forced a lot of behaviors.
Some Features had a hard time hitting target date on deployment due to other features in release so the team shifted to deploying as things were ready
When deploying on a 5 week cadence, you can spend 2 weeks bug bashing. When deploying regularly, there is no time for bug bashes. Therefore, you have to improve your engineering processes. When we reached the point where we could release every day, we weren’t actually mature enough to pull that off. We couldn’t let the team spend half the day testing.
Upon reaching a point where they could deliver quickly, teams had to figure out how to do things differently. For example, they needed to build out an experimentation platform to test a bunch of hypotheses. This platform, called Spinable environments, provided the capability to build a unique environment in 5-10 mins. It was hosted in the cloud and could be shared with the team to demonstrate a small design feature for comment. These environments eliminated the need for big bug bashes. With these spinable environments, the teams could ideate and play with features early in the process. They didn’t have to wait until the feature was fully complete before they could iterate and course correct.
No More Big Sign Offs
Accountability shifted to Team (Bus, UX, PM, Dev) to be actively involved in the lifecycle of the dev process. Building together using spinable environments lets the entire team see the work as it evolves and adapt and adjust immediately, without waiting for big chunks of functionality to release at once. Ship the thing when it’s ready.
Modernizing the code base is not a single snapshot that is just done. It is a continuous process that needs improvement to ensure that the team can always deliver with agility. When planning a sprint, the team intentionally planned ratios for infrastructure and feature work. This helped avoid pressure to focus on features only and ensured continuous investment in our codebase and our infrastructure.
Performance is a huge priority. When figuring out how to go from current to future, they present info/data to the team. The team then self-organized and figured out how to solve the issues, e.g. the lazy load proposal from the team.
Cross team Orchestration: Still challenging to orchestrate major cross-company events and initiatives due to the hybrid release structure. As teams run at their own pace, there is a need for some orchestration. The different setups, branching and source control, monitoring, etc. that enable a team to run at its own pace cause interdependency issues between teams.
High level of effort to make site wide changes: If we wanted to globally change the UX, it is difficult to do in a hybrid world.
Engineers not quite Plug n Play: Lack of DevOps standardization across teams makes it difficult to move people between teams; resources are still not truly interchangeable.
Agility: Overall agility and throughput are still limited. Only as fast as the slowest link.
Performance Optimization Challenges: Latency issues with Ingress and Egress of data moving between cloud and on-prem systems.
Cross Team Alerts: Complex issue to manage alerts across multiple teams
Operational Awareness: DevOps teams have more monitoring capabilities but each team is using something different and there is often no shared truth. Need to work toward standardization of DevOps practices.
Site Reliability Engineering:
Although DevOps teams are accountable for their own systems, they don’t always know or understand the associated or dependent systems. They don’t know what they don’t know.
All AWS releases initially resulted in launch issues.
We formed a Site Reliability Engineering team to understand complete architectural design and formalize accountability so we could proactively manage impact on dependent systems and maintain site availability.
Support for long tail of legacy code: While much of the site has been modernized and the speed of releasing new features has improved, there is a long tail of legacy code which also needs to be cleaned up and migrated to the cloud. We have yet to shut down any legacy systems, and cannot until the long tail of legacy code is migrated.
Monolithic & Aging code base impedes developer productivity and limits innovation
Current Condition
The web currently releases every 5 weeks. Each release is comprised of 3 weeks of development and 2 weeks of ‘Hardening’. Hardening includes 4 main activities: integration testing, performance testing, operational readiness testing and implementation of any completed features/fixes into production.
Heroics and extra hours are consistently required in order to finish the Hardening Sprint, usually just in the nick of time – this creates high stress that burns out the team and lowers morale.
“Hardening Exceptions” occur frequently during Hardening and are hugely wasteful, requiring re-deployment and re-testing of applications, sometimes multiple times.
Testing is manual, resource intensive and time consuming, requiring several weeks to complete.
Production Implementations take several days, are resource intensive, highly manual, error prone, time consuming and painful for all involved.
Target Condition
The Hardening Sprint is reduced by 1 week and is able to finish without heroics or any negative impact to team work/life balance.
“Hardening Exceptions” are reduced by 75%.
Testing can be fully completed in available timeframes and is highly trusted.
Production Releases can be finished comfortably in a single 8-hour work day, during regular business hours with minimal human involvement, and are highly predictable, highly trusted and do not impact our customers.
VSM Workshop
In late January 2015 a cross-functional team conducted a week-long Value Stream Mapping workshop on the Nordstrom web site release process. We decided to focus on the Hardening sprint, which was the last major phase in the release process and therefore closest to the customer. In the workshop, we mapped out the current state, identified improvements and created a new target state.
First Set Of Experiments
In our first set of experiments we saved 20 steps and over 2,294 hrs (out of 3700) per release. We also improved collaboration between teams and removed much of the emotion from the discussion by relying on facts and a visual map of where time was being spent and wasted.
Break out Sub-tracks
After the first 2 releases we began to run out of low hanging fruit and participation was starting to drop off.
Starting with the May 2015 release (15020), we pivoted by creating 4 separate “tracks” focused on the four major areas of effort in Hardening, which together made up over 90% of the total effort. These tracks were:
Integration Testing
Hardening exceptions
Production Implementation
Performance Lab (deployment + testing)
Each track was assigned to a key leader that was fully empowered to bring in who they needed and make the necessary improvements to reach our target condition.
Implemented Hardening Exception Process
Track Lead (Dev Manager) implemented a set of Quality Gates
Socialized a Go/No Go checklist with dev teams
Started saying “No” in Go/No Go meetings
Drove a culture shift toward avoiding exceptions
Incremental Schedule Reductions
After the June 2015 (15030) release and about ½ way through the year, it was clear that while our current approach was reducing effort, we weren’t reducing duration. To combat this, we put a schedule in place to incrementally reduce the number of hardening days in each release as follows:
15050 -1 day (1 total)
15060 -2 days (3 total)
15070 -1 day (4 total)
15080 -1 day (5 total)
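The running totals in the schedule above are a cumulative sum of the per-release reductions. A small Python check, using only the release numbers and day counts listed (the variable names are illustrative):

```python
from itertools import accumulate

# Hardening days removed in each release, as listed in the schedule.
reductions = {"15050": 1, "15060": 2, "15070": 1, "15080": 1}

# Running total of days removed after each release.
running = list(accumulate(reductions.values()))

for release, removed, total in zip(reductions, reductions.values(), running):
    print(f"{release}: -{removed} day(s), {total} total")
```

The final running total is 5 days, which matches the "Removed 5 days from Hardening" figure on the timeline slide.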
Problem A3s
While progress was being made, it was difficult to get visibility into this progress by track. It was also hard to determine whether the work being undertaken would actually result in us reaching our desired target condition. As a result, each track owner started creating a simplified Problem A3 form that was posted on a visibility wall. This went into effect late in the August release (15050) and has greatly improved visibility into progress and ensured the work being undertaken really moved us towards the desired target condition.
Where we are now
With the completion of the January (15080) release, we have removed 4 days from the hardening sprint and over 3300 Hrs per release, out of an initial 3700 Hrs.
We are on track to remove 1 additional day in January and reach a total projected EOY savings of over 28,000 hrs for 2015.
What did we learn?
Strong and continued executive sponsorship is critical.
The VSM exercise is just the start – continued dedication and focus is needed to really reap the benefits!
Start small(er)
Removed 3300 Hrs of waste per release from an original total of 3700. As of final FY2015 release, hours per release was 391.
Behavior change – team empowered to continue to find opportunities for improvement
Leadership fully engaged - team felt supported and were able to be successful
Greater trust across teams – low trust initially across teams
Team Morale – process focused on making it easier for employees to get work done
Personal development – demonstrated growth from individual contributors
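The per-release figures quoted in these notes (an original total of 3700 hours, and 391 hours as of the final FY2015 release) can be turned into a percentage reduction. This is a short arithmetic sketch in Python; the variable names are illustrative and the only inputs are those two numbers.

```python
initial_hours = 3700  # original Hardening effort per release
final_hours = 391     # effort per release as of the final FY2015 release

removed = initial_hours - final_hours
pct = removed / initial_hours * 100

print(f"{removed} hours removed per release ({pct:.1f}% reduction)")
```

The result, 3309 hours or roughly 89%, is consistent with the "over 3300 Hrs" stated on the Outcome slide.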
Honor Reality
Become a student - Go & See (not Go & Tell)
Become a teacher – Learning Culture
Problem Solving
Improvement Kata
Lead by Example (actions matching words)
Ask Why and Articulate Why