SlideShare una empresa de Scribd logo
1 de 40
Service  Architectures  at  Scale      
Lessons  from  Google  and  eBay	
Randy Shoup
@randyshoup
linkedin.com/in/randyshoup
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/service-arch-scale-google-ebay
Presented at QCon London
www.qconlondon.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Architecture  
Evolution	
• eBay
•  5th generation today
•  Monolithic Perl à Monolithic C++ à Java à microservices
• Twitter
•  3rd generation today
•  Monolithic Rails à JS / Rails / Scala à microservices
• Amazon
•  Nth generation today
•  Monolithic C++ à Java / Scala à microservices
Service  Architectures    
at  Scale	
•  Ecosystem of Services
•  Building a Service
•  Operating a Service
•  Service Anti-Patterns
Service  Architectures    
at  Scale	
•  Ecosystem of Services
•  Building a Service
•  Operating a Service
•  Service Anti-Patterns
Ecosystem    
of  Services	
•  Hundreds to thousands of
independent services
•  Many layers of dependencies,
no strict tiers
•  Graph of relationships, not a
hierarchy
C	
B	
A	
 E	
F	
G	
D
Evolution,    
not  Intelligent  Design	
•  No centralized, top-down design of the system
•  Variation and Natural selection
o  Create / extract new services when needed to solve a problem
o  Deprecate services when no longer used
o  Services justify their existence through usage
•  Appearance of clean layering is an emergent
property
Google    
Service  Layering	
•  Cloud Datastore: NoSQL service
o  Highly scalable and resilient
o  Strong transactional consistency
o  SQL-like rich query capabilities
•  Megastore: geo-scale structured
database
o  Multi-row transactions
o  Synchronous cross-datacenter replication
•  Bigtable: cluster-level structured storage
o  (row, column, timestamp) -> cell contents
•  Colossus: next-generation clustered file
system
o  Block distribution and replication
•  Borg: cluster management infrastructure
o  Task scheduling, machine assignment
Cloud  
Datastore	
Megastore	
Bigtable	
Colossus	
Borg
Architecture  without  an  
Architect?	
•  No “Architect” title / role
•  (+) No central approval for technology decisions
o  Most technology decisions made locally instead of globally
o  Better decisions in the field
•  (-) eBay Architecture Review Board
o  Central approval body for large-scale projects
o  Usually far too late in the process to be valuable
o  Experienced engineers saying “no” after the fact vs. encoding knowledge
in a reusable library, tool, or service
Standardization	
•  Standardized communication
o  Network protocols
o  Data formats
o  Interface schema / specification
•  Standardized infrastructure
o  Source control
o  Configuration management
o  Cluster management
o  Monitoring, alerting, diagnosing, etc.
Standards become standards by
being better than the alternatives!
“Enforcing”  
Standardization	
•  Encouraged via
o  Libraries
o  Support in underlying services
o  Code reviews
o  Searchable code
The easiest way to encourage best
practices is with *code*!
Make it really easy to do the right
thing, and harder to do the wrong
thing!
Service  
Independence	
•  No standardization of service internals
o  Programming languages
o  Frameworks
o  Persistence mechanisms
In a mature ecosystem of services,
we standardize the arcs of the
graph, not the nodes!
Creating    
New  Services	
•  Spinning out a new service
o  Almost always built for particular use-case first
o  If successful and appropriate, form a team and generalize for multiple
use-cases
•  Pragmatism wins
•  Examples
o  Google File System
o  Bigtable
o  Megastore
o  Google App Engine
o  Gmail
Deprecating    
Old  Services	
•  What if a service is a failure?
o  Repurpose technologies for other uses
o  Redeploy people to other teams
•  Examples
o  Google Wave -> Google Apps
o  Multiple generations of core services
“Every service at Google is either
deprecated or not ready yet.”
-- Google engineering proverb
Service  Architectures    
at  Scale	
•  Ecosystem of Services
•  Building a Service
•  Operating a Service
•  Service Anti-Patterns
Characteristics  of  an  
Effective  Service	
•  Single-purpose
•  Simple, well-defined interface
•  Modular and independent
•  Isolated persistence (!)
A	
C	
 D	
 E	
B
Goals  of  a    
Service  Owner	
•  Meet the needs of my clients …
•  Functionality
•  Quality
•  Performance
•  Stability and reliability
•  Constant improvement over time
•  … at minimum cost and effort
•  Leverage common tools and infrastructure
•  Leverage other services
•  Automate building, deploying, and operating my service
•  Optimize for efficient use of resources
Responsibilities  of  a  
Service  Owner	
•  End-to-end Ownership
o  Team owns service from design to deployment to retirement
o  No separate maintenance or sustaining engineering team
o  DevOps philosophy of “You build it, you run it”
•  Autonomy and Accountability
o  Freedom to choose technology, methodology, working environment
o  Responsibility for the results of those choices
Service  as    
Bounded  Context	
•  Primary focus on my service
o  Clients which depend on my service
o  Services which my service depends on
o  Cognitive load is very bounded
•  Very little worry about
o  The complete ecosystem
o  The underlying infrastructure
•  è Small, nimble service teams
Service	
Client  
A	
Client  
B	
Client  
C
Service-­‐‑Service    
Relationships	
•  Vendor – Customer Relationship
o  Friendly and cooperative, but structured
o  Clear ownership and division of responsibility
o  Customer can choose to use service or not (!)
•  Service-Level Agreement (SLA)
o  Promise of service levels by the provider
o  Customer needs to be able to rely on the service, like a utility
Service-­‐‑Service    
Relationships	
•  Charging and Cost Allocation
o  Charge customers for *usage* of the service
o  Aligns economic incentives of customer and provider
o  Motivates both sides to optimize for efficiency
o  (+) Pre- / post-allocation at Google
Maintaining  
Service  Quality	
•  Small incremental changes
o  Easy to reason about and understand
o  Risk of code change is nonlinear in the size of the change
o  (-) Initial memcache service submission
•  Solid Development Practices
o  Code reviews before submission
o  Automated tests for everything
•  Google build and test system
o  Uses production cluster manager
o  Runs millions of tests per day in parallel
o  All acceptance tests run before code is accepted into source control
Maintaining    
Interface  Stability	
•  Backward / forward compatibility of interfaces
o  Can *never* break your clients’ code
o  Often multiple interface versions
o  Sometimes multiple deployments
o  Majority of changes don’t impact the interface in any way
•  Explicit deprecation policy
o  Strong incentive to wean customers off old versions (!)
Service  Architectures    
at  Scale	
•  Ecosystem of Services
•  Building a Service
•  Operating a Service
•  Service Anti-Patterns
Predictable  
Performance	
•  Services at scale highly exposed to performance
variability
•  Imagine an operation …
o  1ms median latency, but 1 second latency at 99.99%ile (1 in 10,000)
o  Service using one machine à 0.01% slow
o  Service using 5,000 machines à 50% slow
•  Predictability trumps average performance
o  Low latency + inconsistent performance != low latency
o  Far easier to program to consistent performance
o  Tail latencies are *much* more important than average latencies
Google  App  Engine    
Memcache  Service	
•  Periodic “hiccups” in latency at 99.99%ile and
beyond
•  Very difficult to detect and diagnose
•  è Slab memory allocation
Service  
Reliability	
•  Systems at scale highly exposed to failure
o  Software, hardware, service failures
o  Sharks and backhoes
o  Operator “oops”
•  Resilience in depth
o  Redundancy for machine / cluster / data center failures
o  Load-balancing and flow control for service invocations
o  Rapid rollback for “oops”
Service  Reliability:  
Deployment	
•  Incremental Deployment
o  Canary systems
o  Staged rollouts
o  Rapid rollback
•  eBay “Feature Flags”
o  Decouple code deployment from feature deployment
o  Rapidly turn on / off features without redeploying code
o  Typically deploy with feature turned off, then turn on as a separate step
Service  Reliability:  
Monitoring	
•  Instrumentation
o  Common monitoring service
o  Machine / instance statistics: CPU, memory, I/O
o  Request statistics: request rate, error rate, latency distribution
o  Application / service statistics
o  Downstream service invocations
•  Diagnosability
o  In-process web server with current statistics
o  Distributed tracing of requests through multiple service invocations
You can have too much alerting,
but you can never have too much
monitoring!
Service  Architectures    
at  Scale	
•  Ecosystem of Services
•  Building a Service
•  Operating a Service
•  Service Anti-Patterns
Service  
Anti-­‐‑PaQerns	
•  The “Mega-Service”
o  Overbroad area of responsibility is difficult to reason about, change
o  Leads to more upstream / downstream dependencies
•  Shared persistence
o  Breaks encapsulation, encourages “backdoor” interface violations
o  Unhealthy and near-invisible coupling of services
o  (-) Initial eBay SOA efforts
Thank  You!	
•  @randyshoup
•  linkedin.com/in/randyshoup
•  Slides will be at slideshare.net/randyshoup
Watch the video with slide synchronization on
InfoQ.com!
http://www.infoq.com/presentations/service-
arch-scale-google-ebay

Más contenido relacionado

Más de C4Media

Más de C4Media (20)

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Último (20)

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

Service Architectures at Scale: Lessons from Google and eBay

  • 1. Service  Architectures  at  Scale       Lessons  from  Google  and  eBay Randy Shoup @randyshoup linkedin.com/in/randyshoup
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /service-arch-scale-google-ebay
  • 3. Presented at QCon London www.qconlondon.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. Architecture   Evolution • eBay •  5th generation today •  Monolithic Perl à Monolithic C++ à Java à microservices • Twitter •  3rd generation today •  Monolithic Rails à JS / Rails / Scala à microservices • Amazon •  Nth generation today •  Monolithic C++ à Java / Scala à microservices
  • 5. Service  Architectures     at  Scale •  Ecosystem of Services •  Building a Service •  Operating a Service •  Service Anti-Patterns
  • 6. Service  Architectures     at  Scale •  Ecosystem of Services •  Building a Service •  Operating a Service •  Service Anti-Patterns
  • 7. Ecosystem     of  Services •  Hundreds to thousands of independent services •  Many layers of dependencies, no strict tiers •  Graph of relationships, not a hierarchy C B A E F G D
  • 8. Evolution,     not  Intelligent  Design •  No centralized, top-down design of the system •  Variation and Natural selection o  Create / extract new services when needed to solve a problem o  Deprecate services when no longer used o  Services justify their existence through usage •  Appearance of clean layering is an emergent property
  • 9. Google     Service  Layering •  Cloud Datastore: NoSQL service o  Highly scalable and resilient o  Strong transactional consistency o  SQL-like rich query capabilities •  Megastore: geo-scale structured database o  Multi-row transactions o  Synchronous cross-datacenter replication •  Bigtable: cluster-level structured storage o  (row, column, timestamp) -> cell contents •  Colossus: next-generation clustered file system o  Block distribution and replication •  Borg: cluster management infrastructure o  Task scheduling, machine assignment Cloud   Datastore Megastore Bigtable Colossus Borg
  • 10. Architecture  without  an   Architect? •  No “Architect” title / role •  (+) No central approval for technology decisions o  Most technology decisions made locally instead of globally o  Better decisions in the field •  (-) eBay Architecture Review Board o  Central approval body for large-scale projects o  Usually far too late in the process to be valuable o  Experienced engineers saying “no” after the fact vs. encoding knowledge in a reusable library, tool, or service
  • 11. Standardization •  Standardized communication o  Network protocols o  Data formats o  Interface schema / specification •  Standardized infrastructure o  Source control o  Configuration management o  Cluster management o  Monitoring, alerting, diagnosing, etc.
  • 12. Standards become standards by being better than the alternatives!
  • 13. “Enforcing”   Standardization •  Encouraged via o  Libraries o  Support in underlying services o  Code reviews o  Searchable code
  • 14. The easiest way to encourage best practices is with *code*!
  • 15. Make it really easy to do the right thing, and harder to do the wrong thing!
  • 16. Service   Independence •  No standardization of service internals o  Programming languages o  Frameworks o  Persistence mechanisms
  • 17. In a mature ecosystem of services, we standardize the arcs of the graph, not the nodes!
  • 18. Creating     New  Services •  Spinning out a new service o  Almost always built for particular use-case first o  If successful and appropriate, form a team and generalize for multiple use-cases •  Pragmatism wins •  Examples o  Google File System o  Bigtable o  Megastore o  Google App Engine o  Gmail
  • 19. Deprecating     Old  Services •  What if a service is a failure? o  Repurpose technologies for other uses o  Redeploy people to other teams •  Examples o  Google Wave -> Google Apps o  Multiple generations of core services
  • 20. “Every service at Google is either deprecated or not ready yet.” -- Google engineering proverb
  • 21. Service  Architectures     at  Scale •  Ecosystem of Services •  Building a Service •  Operating a Service •  Service Anti-Patterns
  • 22. Characteristics  of  an   Effective  Service •  Single-purpose •  Simple, well-defined interface •  Modular and independent •  Isolated persistence (!) A C D E B
  • 23. Goals  of  a     Service  Owner •  Meet the needs of my clients … •  Functionality •  Quality •  Performance •  Stability and reliability •  Constant improvement over time •  … at minimum cost and effort •  Leverage common tools and infrastructure •  Leverage other services •  Automate building, deploying, and operating my service •  Optimize for efficient use of resources
  • 24. Responsibilities  of  a   Service  Owner •  End-to-end Ownership o  Team owns service from design to deployment to retirement o  No separate maintenance or sustaining engineering team o  DevOps philosophy of “You build it, you run it” •  Autonomy and Accountability o  Freedom to choose technology, methodology, working environment o  Responsibility for the results of those choices
  • 25. Service  as     Bounded  Context •  Primary focus on my service o  Clients which depend on my service o  Services which my service depends on o  Cognitive load is very bounded •  Very little worry about o  The complete ecosystem o  The underlying infrastructure •  è Small, nimble service teams Service Client   A Client   B Client   C
  • 26. Service-­‐‑Service     Relationships •  Vendor – Customer Relationship o  Friendly and cooperative, but structured o  Clear ownership and division of responsibility o  Customer can choose to use service or not (!) •  Service-Level Agreement (SLA) o  Promise of service levels by the provider o  Customer needs to be able to rely on the service, like a utility
  • 27. Service-­‐‑Service     Relationships •  Charging and Cost Allocation o  Charge customers for *usage* of the service o  Aligns economic incentives of customer and provider o  Motivates both sides to optimize for efficiency o  (+) Pre- / post-allocation at Google
  • 28. Maintaining   Service  Quality •  Small incremental changes o  Easy to reason about and understand o  Risk of code change is nonlinear in the size of the change o  (-) Initial memcache service submission •  Solid Development Practices o  Code reviews before submission o  Automated tests for everything •  Google build and test system o  Uses production cluster manager o  Runs millions of tests per day in parallel o  All acceptance tests run before code is accepted into source control
  • 29. Maintaining     Interface  Stability •  Backward / forward compatibility of interfaces o  Can *never* break your clients’ code o  Often multiple interface versions o  Sometimes multiple deployments o  Majority of changes don’t impact the interface in any way •  Explicit deprecation policy o  Strong incentive to wean customers off old versions (!)
  • 30. Service  Architectures     at  Scale •  Ecosystem of Services •  Building a Service •  Operating a Service •  Service Anti-Patterns
  • 31. Predictable   Performance •  Services at scale highly exposed to performance variability •  Imagine an operation … o  1ms median latency, but 1 second latency at 99.99%ile (1 in 10,000) o  Service using one machine à 0.01% slow o  Service using 5,000 machines à 50% slow •  Predictability trumps average performance o  Low latency + inconsistent performance != low latency o  Far easier to program to consistent performance o  Tail latencies are *much* more important than average latencies
  • 32. Google  App  Engine     Memcache  Service •  Periodic “hiccups” in latency at 99.99%ile and beyond •  Very difficult to detect and diagnose •  è Slab memory allocation
  • 33. Service   Reliability •  Systems at scale highly exposed to failure o  Software, hardware, service failures o  Sharks and backhoes o  Operator “oops” •  Resilience in depth o  Redundancy for machine / cluster / data center failures o  Load-balancing and flow control for service invocations o  Rapid rollback for “oops”
  • 34. Service  Reliability:   Deployment •  Incremental Deployment o  Canary systems o  Staged rollouts o  Rapid rollback •  eBay “Feature Flags” o  Decouple code deployment from feature deployment o  Rapidly turn on / off features without redeploying code o  Typically deploy with feature turned off, then turn on as a separate step
  • 35. Service  Reliability:   Monitoring •  Instrumentation o  Common monitoring service o  Machine / instance statistics: CPU, memory, I/O o  Request statistics: request rate, error rate, latency distribution o  Application / service statistics o  Downstream service invocations •  Diagnosability o  In-process web server with current statistics o  Distributed tracing of requests through multiple service invocations
  • 36. You can have too much alerting, but you can never have too much monitoring!
  • 37. Service  Architectures     at  Scale •  Ecosystem of Services •  Building a Service •  Operating a Service •  Service Anti-Patterns
  • 38. Service   Anti-­‐‑PaQerns •  The “Mega-Service” o  Overbroad area of responsibility is difficult to reason about, change o  Leads to more upstream / downstream dependencies •  Shared persistence o  Breaks encapsulation, encourages “backdoor” interface violations o  Unhealthy and near-invisible coupling of services o  (-) Initial eBay SOA efforts
  • 39. Thank  You! •  @randyshoup •  linkedin.com/in/randyshoup •  Slides will be at slideshare.net/randyshoup
  • 40. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/service- arch-scale-google-ebay