Se ha denunciado esta presentación.
Utilizamos tu perfil de LinkedIn y tus datos de actividad para personalizar los anuncios y mostrarte publicidad más relevante. Puedes cambiar tus preferencias de publicidad en cualquier momento.
Josh	Evans	– Engineering	Leader
November	8,	2016
Mastering	Chaos
A	Netflix Guide	to	Microservices
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese an...
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practi...
Illness	in	the	Family
Myelin Sheathe
Autoimmune	disorder
Externally	trigger
Treatable
Guillain-Barré	Syndrome
Breathing	is	a	miraculous	act	of	
bravery
ELB
and	so	is	taking	traffic
Introductions
Microservice	Basics
Challenges	&	Solutions
Organization	&	Architecture
Our	Talk	Today
Introductions
Microservice	Basics
Challenges	&	Solutions
Organization	&	Architecture
Our	Talk	Today
1999	– 2009
Engineer	&	Engineering	Manager
Ecommerce	(DVD	à Streaming)	
2009	– 2013
Director	of	Engineering	- Playback	Ser...
Taking	time	off
Spending	time	with	family
Thinking	about	what’s	next
Today
Leader	in	subscription	internet	tv service
Hollywood,	indy,	local
Growing	slate	of	original	content
86	million	members
~19...
Introductions
Microservice	Basics
Challenges	&	Solutions
Organization	&	Architecture
Our	Talk	Today
Netflix	DVD	Data	Center	- 2000
Linux	Host
What	microservices are	not
Apache
Tomcat
Javaweb
STORE
Load	Balancer
BILLING
HTT...
What	is	a	microservice?
…the	microservice	architectural	style	is	an	
approach	to	developing	a	single	application	as	a	
suite	of	small	services,	ea...
Separation	of	concerns
Modularity,	encapsulation
Scalability
Horizontally	scaling
Workload	partitioning
Virtualization	&	e...
Organ	Systems
Each	organ	has	a	purpose
Organs	form	systems
Systems	form	an	organism
Edge
ELB
Zuul
NCCP
API
Middle	Tier	&	Platform
Product
• Bucket testing
• Subscriber
• Recommendations
Platform
• Routing
•...
Client Application
Client Library
EVCache Client Service Client
S S S S. . .
DB DB DB DB. . .
. . .
Microservices	are	an	a...
Introductions
Microservice	Basics
Challenges	&	Solutions
Organization	&	Architecture
Our	Talk	Today
Dependency
Scale
Variance
Change
Challenges	&	Solutions
Dependency
Scale
Variance
Change
Challenges	&	Solutions
Intra-service	requests
Client	libraries
Data	Persistence
Infrastructure
Use	Cases
Intra-service	Requests
Crossing	the	Chasm
Linux	Host
Linux	Host
Linux	Host
Linux	Host
Crossing	the	Chasm
Linux	Host
Apache Tomcat
Linux	Host
Apache Tomcat
Network	l...
Cascading	Failure
How	do	you	know	if	it	works?
Inoculation
Device Service	B	
Service	C
Internet EdgeZuul
Service	A	
ELB
FITSynthetic	transactions
Override	by	device	or	account
%	of	...
Device Service	B	
Service	C
Internet EdgeZuul
Service	A	
ELB
FIT
Fault	Injection	Testing	(FIT)
Enforced	throughout	the	cal...
ELB
API
How	do	we	constrain	testing	scope?
API	Gateway
App	1
App	2
App	4
App	5
App	6
App	3
App	7
App	8
99.99
99.99
99.99
99.99
99.99
99.99
99.99
99.99
Proxy
99.99 99...
Critical	Microservices
Client	Libraries
• Many	clients
• Common	business	logic
• Common	access	patterns
Return	of	the	Monolith
Heap	consumption
Logical	defects
Transitive	dependencies
Parasitic	Infestation
Client Application
Client Library
EVCache Client Service Client
S S S S. . .
DB DB DB DB. . .
. . .
Simple	Logic,	Common	P...
Persistence
In	the	presence	of	a	network	partition,	you	must	choose	
between	consistency	and	availability
CAP	Theorem
DB
DB
DB
Network...
Zone	A
Zone	B
Zone	C
Zone	B
Zone	C
Client
Zone	A
Local Quorum
(Typical)
100ms
Eventual	Consistency
Infrastructure
December	24th,	2012
US-East-1
Canada
No	place	to	go
US
Latin	America
US-East-1US-West-2 EU-West-1
#NetflixEverywhere	Global	Architecture
QCon London,	2016
https://www.infoq.com/presentations/netflix-failure-multiple-regi...
Dependency
Scale
Variance
Change
Challenges	&	Solutions
Stateless	services	
Stateful services
Hybrid	services
Use	Cases
Stateless	Services
Not	a	cache	or	a	database
Frequently	accessed	metadata
No	instance	affinity
Loss	a	node	is	a	non-event
What	is	a	stateless...
Minimum size
Desired capacity
Maximum size
Scale out as needed
S3AMI retrieved on demand
Compute efficiency
Node failure
T...
Cluster A Cluster D
Edge Cluster
Cluster B
Cluster C
Surviving Instance Failure
Stateful Services
Databases	&	caches
Custom	apps	which	hold	large	amounts	of	data
Loss	of	a	node	is	a	notable	event
What	is	a	stateful servi...
Dedicated	Shards	– An	Antipattern
Squid 1 Squid 2 Squid 3
Client Application
Subscriber Client Library
Cache Client Servic...
Redundancy	is	fundamental
Zone	A Zone	B Zone	C
.	.	..	.	..	.	.
EVCache Writes
Client	Application
Client	Library
EVCache	Client
Client	Application
Cl...
Zone	A
Client	Application
Client	Library
EVCache	Client
Zone	B
Client	Application
Client	Library
EVCache	Client
Zone	C
Cli...
Hybrid	Services
Client Application
Client Library
EVCache Client Service Client
S S S S. . .
DB DB DB DB. . .
. . .
Hybrid	Microservice
. ...
It’s	easy	to	take	EVCache	for	granted
30	million	requests/sec
2	trillion	requests	per	day	globally
Hundreds	of	billions	of...
Batch
S S S S. . .
DB DB DB DB. . .
. . . . . .
Member Path
Member Path
Member Path
Batch
Batch
Called	by	many	services
On...
Batch
S S S S. . .
DB DB DB DB. . .
. . . . . .
Member Path
Member Path
Member Path
Batch
Batch
Excessive	Load
X X
Batch
S S S S. . .
DB DB DB DB. . .
. . . . . .
Member Path
Member Path
Member Path
Batch
Batch
Workload	partitioning
Requ...
Dependency
Scale
Variance
Change
Challenges	&	Solutions
Operational	drift
Polyglot	&	containers
Use	Cases
Operational	Drift
(Unintentional	Variance)
Over	time
Alert	thresholds
Timeouts,	retries,	fallbacks
Throughput	(RPS)
Across	microservices
Reliability	best	practices
O...
Autonomic	Nervous
System
You don’t have to think about
digestion or breathing
Incident
Resolution
Review
Remediation
Analysis
Best	Practice
Automation
Adoption
Continuous Learning & Automation
Alerts
Apache	&	Tomcat
Automated	canary	analysis
Autoscaling
Chaos
Consistent	naming
ELB	config
Healthcheck
Immutable	mach...
Polyglot	&	Containers
(Intentional	Variance)
The	Paved	Road
Stash
Nebula/Gradle
BaseAMI/Ubuntu
Jenkins
Spinnaker
Runtime	Platform
In	the	Critical	Path
In	the	Critical	Path
Productivity	tooling
Insight	&	triage	capabilities
Base	image	fragmentation
Node	management	
Library/platform	duplication
...
Raise	awareness	of	costs
Constrain	centralized	support
Prioritize	by	impact
Seek	reusable	solutions
Strategic	Stance
Dependency
Scale
Variance
Change
Challenges	&	Solutions
How	do	we	achieve	velocity	with	confidence?
Global	Cloud	Management	&	Delivery
Integrated, Automated Practices
Conformity	checks
Red/black	pipelines
Automated	canaries
Staged	deployments
Squeeze	tests
Alerts
Apache	&	Tomcat
Automated	canary	analysis
Autoscaling
Chaos
Consistent	naming
ELB	config
Healthcheck
Immutable	mach...
https://www.youtube.com/watch?v=IkPb15FfuQU
Introductions
Microservice	Basics
Challenges	&	Solutions
Organization	&	Architecture
Our	Talk	Today
Customer	Device Netflix	Data	Center	- 2009
NCCP
Electronic	Delivery	- NRDP	1.x
Load	Balancer
Netflix	App
Security
Activati...
Netflix	API	- let	a	1000	flowers	bloom!
Netflix	Data	Center	- 2009
API
Netflix	API	– from	public	to	private
Load	Balancer
General REST API
JSON schema
HTTP respon...
Customer	Device
Netflix	Data	Center	– 2010
API
Hybrid	Architecture
LB
Netflix	App
Security
Activation
Playback
Platform	(N...
Josh:	what	is	the	right	long	term	architecture?
Peter:	do	you	care	about	the	organizational		
implications?
Conway’s	Law
Organizations	which	design	systems	are	constrained	to	
produce	designs	which	are	copies	of	the	
communication...
Conway’s	Law
If	you	have	four	teams	working	on	a	compiler	you	will	
end	up	with	a	four	pass	compiler
NCCP
API
Blade	Runner
Outcomes
Productivity	&	new	capabilities
Refactored	organization
Lessons
Solutions	first,	team	second
Reconfigure	teams	to...
Introductions
Microservice	Basics
Challenges	&	Solutions
Organization	&	Architecture
Recap
Our	Talk	Today
Microservice	architectures	are	
complex	and	organic
Health	depends	on	discipline	and	
chaos
Dependency
Circuit	breakers,	fallbacks,	chaos
Simple	clients
Eventual	consistency
Multi-region	failover
Scale
Auto-scaling...
netflix.github.io
techblog.netflix.com
Questions?
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
netflix-chaos-microservices
Mastering Chaos - A Netflix Guide to Microservices
Mastering Chaos - A Netflix Guide to Microservices
Mastering Chaos - A Netflix Guide to Microservices
Mastering Chaos - A Netflix Guide to Microservices
Mastering Chaos - A Netflix Guide to Microservices
Próxima SlideShare
Cargando en…5
×

Mastering Chaos - A Netflix Guide to Microservices

915 visualizaciones

Publicado el

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2hkPH7v.

Josh Evans talks about the chaotic and vibrant world of microservices at Netflix. He starts with the basics, - the anatomy of a microservice, the challenges around distributed systems, and the benefits. Then he builds on that foundation exploring the cultural, architectural, and operational methods that lead to microservice mastery. Filmed at qconsf.com.

Josh Evans is Director of Operations Engineering at Netflix, with experience in e-commerce, tools, testing, and operations. For the past three years he has led an organization that creates, integrates, and evangelizes proven technical solutions and practices like continuous delivery, real-time ​operational insight, and chaos engineering to achieve operational excellence at scale.

Publicado en: Tecnología
  • Sé el primero en comentar

Mastering Chaos - A Netflix Guide to Microservices

  1. 1. Josh Evans – Engineering Leader November 8, 2016 Mastering Chaos A Netflix Guide to Microservices
  2. 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ netflix-chaos-microservices
  3. 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  4. 4. Illness in the Family
  5. 5. Myelin Sheathe Autoimmune disorder Externally trigger Treatable Guillain-Barré Syndrome
  6. 6. Breathing is a miraculous act of bravery
  7. 7. ELB and so is taking traffic
  8. 8. Introductions Microservice Basics Challenges & Solutions Organization & Architecture Our Talk Today
  9. 9. Introductions Microservice Basics Challenges & Solutions Organization & Architecture Our Talk Today
  10. 10. 1999 – 2009 Engineer & Engineering Manager Ecommerce (DVD à Streaming) 2009 – 2013 Director of Engineering - Playback Services 2013 – 2016 Director of Operations Engineering Josh Evans
  11. 11. Taking time off Spending time with family Thinking about what’s next Today
  12. 12. Leader in subscription internet tv service Hollywood, indy, local Growing slate of original content 86 million members ~190 countries, 10s of languages 1000s of device types Microservices on AWS
  13. 13. Introductions Microservice Basics Challenges & Solutions Organization & Architecture Our Talk Today
  14. 14. Netflix DVD Data Center - 2000 Linux Host What microservices are not Apache Tomcat Javaweb STORE Load Balancer BILLING HTTP JDBC DB Link HTTP/S Monolithic code base Monolithic database Tightly coupled architecture
  15. 15. What is a microservice?
  16. 16. …the microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. - Martin Fowler
  17. 17. Separation of concerns Modularity, encapsulation Scalability Horizontally scaling Workload partitioning Virtualization & elasticity Automated operations On demand provisioning An Evolutionary Response
  18. 18. Organ Systems Each organ has a purpose Organs form systems Systems form an organism
  19. 19. Edge ELB Zuul NCCP API Middle Tier & Platform Product • Bucket testing • Subscriber • Recommendations Platform • Routing • Configuration • Crypto Persistence • Cache • Database
  20. 20. Client Application Client Library EVCache Client Service Client S S S S. . . DB DB DB DB. . . . . . Microservices are an abstraction . . . Microservice
  21. 21. Introductions Microservice Basics Challenges & Solutions Organization & Architecture Our Talk Today
  22. 22. Dependency Scale Variance Change Challenges & Solutions
  23. 23. Dependency Scale Variance Change Challenges & Solutions
  24. 24. Intra-service requests Client libraries Data Persistence Infrastructure Use Cases
  25. 25. Intra-service Requests
  26. 26. Crossing the Chasm
  27. 27. Linux Host Linux Host Linux Host Linux Host Crossing the Chasm Linux Host Apache Tomcat Linux Host Apache Tomcat Network latency, congestion, failure Logical or scaling failure Service A Service B
  28. 28. Cascading Failure
  29. 29. How do you know if it works?
  30. 30. Inoculation
  31. 31. Device Service B Service C Internet EdgeZuul Service A ELB FITSynthetic transactions Override by device or account % of live traffic up to 100% Fault Injection Testing (FIT)
  32. 32. Device Service B Service C Internet EdgeZuul Service A ELB FIT Fault Injection Testing (FIT) Enforced throughout the call path
  33. 33. ELB API
  34. 34. How do we constrain testing scope?
  35. 35. API Gateway App 1 App 2 App 4 App 5 App 6 App 3 App 7 App 8 99.99 99.99 99.99 99.99 99.99 99.99 99.99 99.99 Proxy 99.99 99.99 Combinatorial Math 99.9910 = 99.9
  36. 36. Critical Microservices
  37. 37. Client Libraries
  38. 38. • Many clients • Common business logic • Common access patterns
  39. 39. Return of the Monolith
  40. 40. Heap consumption Logical defects Transitive dependencies Parasitic Infestation
  41. 41. Client Application Client Library EVCache Client Service Client S S S S. . . DB DB DB DB. . . . . . Simple Logic, Common Patterns . . .
  42. 42. Persistence
  43. 43. In the presence of a network partition, you must choose between consistency and availability CAP Theorem DB DB DB Network B Network C Network D Service Network A X
  44. 44. Zone A Zone B Zone C Zone B Zone C Client Zone A Local Quorum (Typical) 100ms Eventual Consistency
  45. 45. Infrastructure
  46. 46. December 24th, 2012
  47. 47. US-East-1 Canada No place to go US Latin America
  48. 48. US-East-1US-West-2 EU-West-1
  49. 49. #NetflixEverywhere Global Architecture QCon London, 2016 https://www.infoq.com/presentations/netflix-failure-multiple-regions
  50. 50. Dependency Scale Variance Change Challenges & Solutions
  51. 51. Stateless services Stateful services Hybrid services Use Cases
  52. 52. Stateless Services
  53. 53. Not a cache or a database Frequently accessed metadata No instance affinity Loss a node is a non-event What is a stateless service?
  54. 54. Minimum size Desired capacity Maximum size Scale out as needed S3AMI retrieved on demand Compute efficiency Node failure Traffic spikes Performance bugs Auto Scaling Groups
  55. 55. Cluster A Cluster D Edge Cluster Cluster B Cluster C Surviving Instance Failure
  56. 56. Stateful Services
  57. 57. Databases & caches Custom apps which hold large amounts of data Loss of a node is a notable event What is a stateful service?
  58. 58. Dedicated Shards – An Antipattern Squid 1 Squid 2 Squid 3 Client Application Subscriber Client Library Cache Client Service Client S S S S. . . DB DB DB DB. . . Squid n HA Proxy Set 1 Set 2 Set 3 Set n X
  59. 59. Redundancy is fundamental
  60. 60. Zone A Zone B Zone C . . .. . .. . . EVCache Writes Client Application Client Library EVCache Client Client Application Client Library EVCache Client Client Application Client Library EVCache Client
  61. 61. Zone A Client Application Client Library EVCache Client Zone B Client Application Client Library EVCache Client Zone C Client Application Client Library EVCache Client . . .. . .. . . EVCache Reads
  62. 62. Hybrid Services
  63. 63. Client Application Client Library EVCache Client Service Client S S S S. . . DB DB DB DB. . . . . . Hybrid Microservice . . .
  64. 64. It’s easy to take EVCache for granted 30 million requests/sec 2 trillion requests per day globally Hundreds of billions of objects Tens of thousands of memcached instances Milliseconds of latency per request
  65. 65. Batch S S S S. . . DB DB DB DB. . . . . . . . . Member Path Member Path Member Path Batch Batch Called by many services Online & offline clients Called many times / request 800k – 1M RPS Fallback to service/db Excessive Load
  66. 66. Batch S S S S. . . DB DB DB DB. . . . . . . . . Member Path Member Path Member Path Batch Batch Excessive Load X X
  67. 67. Batch S S S S. . . DB DB DB DB. . . . . . . . . Member Path Member Path Member Path Batch Batch Workload partitioning Request-level caching Secure token fallback Chaos under load Solutions Online Offline
  68. 68. Dependency Scale Variance Change Challenges & Solutions
  69. 69. Operational drift Polyglot & containers Use Cases
  70. 70. Operational Drift (Unintentional Variance)
  71. 71. Over time Alert thresholds Timeouts, retries, fallbacks Throughput (RPS) Across microservices Reliability best practices Operational Drift
  72. 72. Autonomic Nervous System You don’t have to think about digestion or breathing
  73. 73. Incident Resolution Review Remediation Analysis Best Practice Automation Adoption Continuous Learning & Automation
  74. 74. Alerts Apache & Tomcat Automated canary analysis Autoscaling Chaos Consistent naming ELB config Healthcheck Immutable machine images Squeeze testing Staged, red/black deployments Timeouts, retries, fallbacks Production Ready
  75. 75. Polyglot & Containers (Intentional Variance)
  76. 76. The Paved Road Stash Nebula/Gradle BaseAMI/Ubuntu Jenkins Spinnaker Runtime Platform
  77. 77. In the Critical Path
  78. 78. In the Critical Path
  79. 79. Productivity tooling Insight & triage capabilities Base image fragmentation Node management Library/platform duplication Learning curve - production expertise Cost of Variance
  80. 80. Raise awareness of costs Constrain centralized support Prioritize by impact Seek reusable solutions Strategic Stance
  81. 81. Dependency Scale Variance Change Challenges & Solutions
  82. 82. How do we achieve velocity with confidence?
  83. 83. Global Cloud Management & Delivery
  84. 84. Integrated, Automated Practices Conformity checks Red/black pipelines Automated canaries Staged deployments Squeeze tests
  85. 85. Alerts Apache & Tomcat Automated canary analysis Autoscaling Chaos Consistent naming ELB config Healthcheck Immutable machine images Squeeze testing Staged, red/black deployments Timeouts, retries, fallbacks Production Ready
  86. 86. https://www.youtube.com/watch?v=IkPb15FfuQU
  87. 87. Introductions Microservice Basics Challenges & Solutions Organization & Architecture Our Talk Today
  88. 88. Customer Device Netflix Data Center - 2009 NCCP Electronic Delivery - NRDP 1.x Load Balancer Netflix App Security Activation Playback Platform (NRDP) UI Collaborative design XML payloads Custom responses Versioned firmware releases Long cycles Simple UI – “Queue Reader” ED
  89. 89. Netflix API - let a 1000 flowers bloom!
  90. 90. Netflix Data Center - 2009 API Netflix API – from public to private Load Balancer General REST API JSON schema HTTP response codes Oauth security model Content Metadata Content Metadata Application
  91. 91. Customer Device Netflix Data Center – 2010 API Hybrid Architecture LB Netflix App Security Activation Playback Platform (NRDP) UI Content Metadata NCCP ED LB Distinct • Services • Protocols • Schemas • Security
  92. 92. Josh: what is the right long term architecture? Peter: do you care about the organizational implications?
  93. 93. Conway’s Law Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations. Any piece of software reflects the organizational structure that produced it.
  94. 94. Conway’s Law If you have four teams working on a compiler you will end up with a four pass compiler
  95. 95. NCCP API
  96. 96. Blade Runner
  97. 97. Outcomes Productivity & new capabilities Refactored organization Lessons Solutions first, team second Reconfigure teams to best support your architecture Outcomes & Lessons
  98. 98. Introductions Microservice Basics Challenges & Solutions Organization & Architecture Recap Our Talk Today
  99. 99. Microservice architectures are complex and organic Health depends on discipline and chaos
  100. 100. Dependency Circuit breakers, fallbacks, chaos Simple clients Eventual consistency Multi-region failover Scale Auto-scaling Redundancy – avoid SPoF Partitioned workloads Failure-driven design Chaos under load Variance Engineered operations Understood cost of variance Prioritized support by impact Change Automated delivery Integrated practices Organization & Architecture Solutions first, team second
  101. 101. netflix.github.io
  102. 102. techblog.netflix.com
  103. 103. Questions?
  104. 104. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ netflix-chaos-microservices

×