Video and slides synchronized; mp3 and slide download available at http://bit.ly/1b6WvAi.
Sid Anand uses examples from LinkedIn, Netflix, and eBay to discuss common causes of outages and scaling issues, as well as modern availability and scaling practices at today's web sites. Filmed at qconnewyork.com.
Siddharth "Sid" Anand has deep experience designing and scaling high-traffic web sites. Currently, he is a senior member of LinkedIn's Data Infrastructure team focusing on analytics infrastructure. Prior to joining LinkedIn, he served as Netflix's Cloud Database Architect, Etsy's VP of Engineering, a search engineer and researcher at eBay, and a performance engineer at Siebel Systems.
4. About Me
Current Life…
• LinkedIn
  • Search, Network, and Analytics (SNA)
    • Search Infrastructure
      • Me
In a Previous Life…
• LinkedIn, Data Infrastructure, Architect
• Netflix, Cloud Database Architect
• eBay, Web Development, Research Lab, & Search Engine
And Many Years Prior…
• Studying Distributed Systems at Cornell University
5. Our mission
Connect the world’s professionals to make
them more productive and successful
6. Over 200M members and counting
The world’s largest professional network, growing at more than 2 members/sec
[Chart: LinkedIn members (millions): 2004: 2, 2005: 4, 2006: 8, 2007: 17, 2008: 32, 2009: 55, 2010: 90, 2011: 145, 2012: 200+]
Source: http://press.linkedin.com/about
7. The world’s largest professional network
• >88% of Fortune 100 companies use LinkedIn Talent Solutions to hire
• >2.9M company pages
• >5.7B professional searches in 2012
• 19 languages
• >30M students and new college grads (NCGs), the fastest-growing demographic
• Over 64% of members are now international
Source: http://press.linkedin.com/about
8. Other Company Facts
• Headquartered in Mountain View, Calif., with offices around the world
• As of June 1, 2013, LinkedIn has ~3,700 full-time employees located around the world
Source: http://press.linkedin.com/about
9. Agenda
✓ Company Overview
• Serving Architecture
• How Does LinkedIn Scale
  – Web Services
  – Databases
  – Messaging
  – Other
• Q & A
11. LinkedIn : Serving Architecture : Overview
• Our site runs primarily on Java, with some use of Scala for specific infrastructure
• The presentation tier is an exception: it runs on everything!
• What runs on Scala?
  • Network Graph Engine
  • Kafka
  • Some front ends (Play)
• Most of our services run on Jetty
12. LinkedIn : Serving Architecture
[Diagram: the presentation tier (Frontier) fronts a diverse set of front-end stacks: Play, Spring MVC, NodeJS, JRuby, Grails, Django, and USSR (Chrome V8 JS engine)]
Our presentation tier is composed of ATS (Apache Traffic Server) with 2 plugins:
• Fizzy
  • A content aggregator that unifies content across a diverse set of front-ends
  • An open-source JS templating framework
• USSR (a.k.a. Unified Server-Side Rendering)
  • Packages Google Chrome’s V8 JS engine as an ATS plugin
13. LinkedIn : Serving Architecture
[Diagram: a web page requests information A and B; the request flows down through four tiers, backed by Oracle masters/slaves, Memcached, Hadoop, and other systems]
• Presentation Tier → a thin layer focused on building the UI. It assembles the page by making parallel requests to BST services.
• Business Service Tier → encapsulates business logic. Can call other BST clusters and its own DST cluster.
• Data Service Tier → encapsulates DAL logic; each service is concerned with one Oracle schema.
• Data Infrastructure → concerned with the persistent storage of and easy access to data.
14. Serving Architecture : Other?
[Diagram: data change events flow from Oracle or Espresso to the search index, graph index, read replicas, and standardization services]
As I will discuss later, data that is committed to databases also needs to be made available to a host of other online serving systems:
• Search
• Standardization Services
  • These provide canonical names for your titles, companies, schools, skills, fields of study, etc.
• Graph engine
• Recommender Systems
This data change feed needs to be scalable, reliable, and fast. [ Databus ]
15. Serving Architecture : Hadoop
How do we use Hadoop to serve?
• Hadoop is central to our analytic infrastructure
• We ship data streams into Hadoop from our primary databases via Databus & from applications via Kafka
• Hadoop jobs take daily or hourly dumps of this data and compute data files that Voldemort can load!
• Voldemort loads these files and serves them on the site
16. Voldemort : RO Store Usage at LinkedIn
• People You May Know
• LinkedIn Skills
• Related Searches
• Viewers of this profile also viewed
• Events you may be interested in
• Jobs you may be interested in
(A client-side read sketch follows.)
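Reads against these RO stores go through the standard Voldemort client. A minimal sketch, assuming a hypothetical store named "people-you-may-know" and an illustrative bootstrap URL (not LinkedIn's actual configuration):

```java
import java.util.List;

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;

public class PymkReader {
    public static void main(String[] args) {
        // Bootstrap against any Voldemort node; the client fetches cluster metadata.
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://voldemort1.example.com:6666"));
        // Key: member id; value: ids of suggested connections, precomputed in Hadoop.
        StoreClient<Integer, List<Integer>> store =
                factory.getStoreClient("people-you-may-know");
        List<Integer> suggestions = store.getValue(12345);
        System.out.println("PYMK for member 12345: " + suggestions);
    }
}
```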
19. LinkedIn : Scaling Web Services
Problem
• How do 150+ web services communicate with each other to fulfill user requests in the most efficient and fault-tolerant manner?
• How do they handle slow downstream dependencies?
• For illustration's sake, consider the following scenario:
  • Service B has 2 hosts
  • Service C has 2 hosts
  • A machine in Service B sends a web request to a machine in Service C
[Diagram: A → B → C, two hosts per service]
20. LinkedIn : Scaling Web Services
What sorts of failure modes are we concerned about?
• A machine in Service C
  • has a long GC pause
  • calls a service that has a long GC pause
  • calls a service that calls a service that has a long GC pause
  • … see where I am going?
• A machine in Service C or in its downstream dependencies may be slow for any reason, not just GC (e.g. bottlenecks on CPU, IO, or memory; lock contention)
Goal: Given all of this, how can we ensure high uptime?
Hint: Pick the right architecture and implement best practices on top of it!
21. LinkedIn : Scaling Web Services
In the early days, LinkedIn made a big bet on Spring and Spring RPC.
Issues
1. Spring RPC is difficult to debug
  • You cannot call the service using simple command-line tools like curl
  • Since the RPC call is implemented as a binary payload over HTTP, HTTP access logs are not very useful
2. A Spring RPC-based architecture leads to high MTTR
  • Spring RPC is not flexible and pluggable -- we cannot use
    • custom client-side load-balancing strategies
    • custom fault-tolerance features
  • Instead, all we can do is put all of our service nodes behind a hardware load-balancer & pray!
  • If a Service C node experiences a slowness issue, a NOC engineer needs to be alerted and then manually remove it from the LB (MTTR > 30 minutes)
[Diagram: Service B hosts call Service C hosts through a hardware LB]
22. LinkedIn : Scaling Web Services
Solution
A better solution is one that we see often in both cloud-based architectures and NoSQL systems: Dynamic Discovery + client-side load-balancing
• Step 1: Service C nodes announce their availability to serve traffic to a ZooKeeper (ZK) registry
• Step 2: Service B nodes get updates from ZK
• Step 3: Service B nodes route traffic to Service C nodes
[Diagram: B and C hosts connected through a 3-node ZK ensemble at each step]
(A registration/discovery sketch follows.)
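The deck doesn't show LinkedIn's in-house client code, but the same pattern can be sketched with Apache Curator's service-discovery recipe; the hosts, ports, and the "service-c" name below are illustrative:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.curator.x.discovery.ServiceDiscovery;
import org.apache.curator.x.discovery.ServiceDiscoveryBuilder;
import org.apache.curator.x.discovery.ServiceInstance;
import org.apache.curator.x.discovery.ServiceProvider;
import org.apache.curator.x.discovery.strategies.RoundRobinStrategy;

public class DynamicDiscoverySketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        ServiceDiscovery<Void> discovery = ServiceDiscoveryBuilder.builder(Void.class)
                .client(zk).basePath("/services").build();
        discovery.start();

        // Step 1: a Service C node announces itself in the ZK registry.
        discovery.registerService(ServiceInstance.<Void>builder()
                .name("service-c").address("10.0.0.7").port(8080).build());

        // Steps 2 & 3: a Service B node watches the registry and picks a live
        // Service C instance using a client-side round-robin strategy.
        ServiceProvider<Void> provider = discovery.serviceProviderBuilder()
                .serviceName("service-c")
                .providerStrategy(new RoundRobinStrategy<Void>())
                .build();
        provider.start();
        ServiceInstance<Void> target = provider.getInstance();
        System.out.println("Route to " + target.getAddress() + ":" + target.getPort());
    }
}
```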
23. LinkedIn : Scaling Web Services
With this new paradigm for discovering services and routing requests to them, we can incorporate additional fault-tolerance features
24. LinkedIn : Scaling Web Services
Best Practices
• Fault-tolerance Support
1. No client should wait indefinitely for a response from a service
  • Issues
    • Waiting causes a traffic jam: all upstream clients end up also getting blocked
    • Each service has a fixed number of Jetty or Tomcat threads. Once those are all tied up waiting, no new requests can be handled
  • Solution
    • After a configurable timeout, return
    • Store a different SLA in ZK for each REST end-point
    • In other words, not all calls are the same and should not have the same read timeout (a timeout sketch follows)
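A minimal sketch of per-end-point read timeouts; the endpoint-to-SLA map below is hard-coded for illustration where a real client would load and watch the values in ZK:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Map;

public class SlaAwareClient {
    // endpoint -> read-timeout SLA in ms (in practice, loaded and watched from ZK)
    private final Map<String, Long> slaMs = Map.of(
            "/profile", 100L,  // cheap point read: tight SLA
            "/search", 500L);  // fan-out call: looser SLA

    private final HttpClient http = HttpClient.newHttpClient();

    public String call(String host, String endpoint) throws Exception {
        long timeout = slaMs.getOrDefault(endpoint, 250L); // default SLA
        HttpRequest req = HttpRequest.newBuilder(URI.create("http://" + host + endpoint))
                .timeout(Duration.ofMillis(timeout)) // per-end-point read timeout
                .build();
        // An HttpTimeoutException surfaces once the SLA is exceeded, so the
        // calling thread is released instead of waiting indefinitely.
        return http.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```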
25. LinkedIn : Scaling Web Services
Best Practices
• Fault-tolerance Support
2. Isolate calls to back-ends from one another
  • Issues
    • You depend on responses from independent services A and B. If A slows down, will you still be able to serve B?
  • Details
    • This is a common use-case for federated services and for shard-aggregators:
      • E.g. Search at LinkedIn is federated and will call people-search, job-search, group-search, etc. in parallel
      • E.g. People-search is itself sharded, so an additional shard-aggregation step needs to happen across 100s of shards
  • Solution
    • Use async requests or independent ExecutorServices for sync requests (one per shard or vertical), as sketched below
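A minimal bulkhead sketch, assuming hypothetical people-search and job-search calls: each vertical gets its own ExecutorService, so a slow vertical degrades only its own section of the page:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class FederatedSearch {
    // One pool per vertical: people-search backing up cannot starve job-search.
    private final ExecutorService peoplePool = Executors.newFixedThreadPool(10);
    private final ExecutorService jobsPool = Executors.newFixedThreadPool(10);

    public String[] search(String query) {
        CompletableFuture<String> people =
                CompletableFuture.supplyAsync(() -> callPeopleSearch(query), peoplePool);
        CompletableFuture<String> jobs =
                CompletableFuture.supplyAsync(() -> callJobSearch(query), jobsPool);

        // Await each vertical independently; a timeout empties one section
        // of the results page instead of blocking the whole response.
        return new String[] { getOrFallback(people, 300), getOrFallback(jobs, 300) };
    }

    private String getOrFallback(CompletableFuture<String> f, long ms) {
        try {
            return f.get(ms, TimeUnit.MILLISECONDS);
        } catch (Exception slowOrFailed) {
            f.cancel(true); // stop waiting and free the caller
            return "";      // degraded (empty) section
        }
    }

    private String callPeopleSearch(String q) { return "people:" + q; } // stub RPC
    private String callJobSearch(String q) { return "jobs:" + q; }      // stub RPC
}
```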
26. LinkedIn : Scaling Web Services
Best Practices
• Fault-tolerance Support
3. Cancel Unnecessary Work
  • Issues
    • Work issued down the call-graph is unnecessary if the clients at the top of the call-graph have already timed out
    • Imagine that as a call reaches half-way down your call-tree, the caller at the root times out
      • You will still issue work down the remaining half-depth of your tree unless you cancel it!
  • Solution
    • A possible approach (sketched below)
      • The root of the call-tree adds (<tree-UUID>, inProgress status) to Memcached
      • All services pass the tree-UUID down the call-tree (e.g. as a custom HTTP request header)
      • Servlet filters at each hop check whether inProgress == false. If so, they immediately respond with an empty response
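One possible shape for that servlet filter, as a minimal sketch: the header name is illustrative, and a ConcurrentHashMap stands in for the Memcached lookup so the example is self-contained:

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class CancellationFilter implements Filter {
    static final String TREE_ID_HEADER = "X-Call-Tree-UUID"; // hypothetical header
    // Stand-in for Memcached: the root writes (tree-UUID -> inProgress) here and
    // flips the flag to false when its own deadline expires.
    static final Map<String, Boolean> inProgress = new ConcurrentHashMap<>();

    @Override public void init(FilterConfig cfg) {}
    @Override public void destroy() {}

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String treeId = ((HttpServletRequest) req).getHeader(TREE_ID_HEADER);
        if (treeId != null && Boolean.FALSE.equals(inProgress.get(treeId))) {
            // The root already timed out: skip the work, return an empty response.
            ((HttpServletResponse) res).setStatus(HttpServletResponse.SC_NO_CONTENT);
            return;
        }
        chain.doFilter(req, res); // otherwise, keep going down the call-tree
    }
}
```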
27. LinkedIn : Scaling Web Services
Best Practices
• Fault-tolerance Support
4. Avoid Sending Requests to Hosts that are GCing
  • Issues
    • If a client sends a web request to a host in Service C and that host is experiencing a GC pause, the client will wait 50-200 ms, depending on the read timeout for the request
    • During that GC pause, other requests will also be sent to that node before they all eventually time out
  • Solution
    • Send a “GC scout” request before every “real” web request
28. LinkedIn : Scaling Web Services
Why is this a good idea?
• Scout requests are cheap and add negligible overhead to requests
• Step 1: A Service B node sends a cheap 1 msec TCP request to a dedicated “scout” Netty port
• Step 2: If the scout request comes back within 1 msec, send the real request to the Tomcat or Jetty port
• Step 3: Else, repeat with a different host in Service C
[Diagram: each Service C host runs a Netty scout port alongside its Tomcat/Jetty port; Service B discovers hosts via ZK]
(A scout sketch follows.)
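A minimal sketch of the scout probe, assuming an illustrative scout port and using a raw socket echo (a real implementation would pool connections and serve the echo from a dedicated Netty handler):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

public class GcScoutRouter {
    private static final int SCOUT_PORT = 9999;    // illustrative scout port
    private static final int SCOUT_TIMEOUT_MS = 1; // scout must answer within ~1 msec

    /** Returns the first Service C host whose scout port answers in time, else null. */
    public String pickHealthyHost(List<String> serviceCHosts) {
        for (String host : serviceCHosts) {
            if (scout(host)) {
                return host; // send the real request to this host's Tomcat/Jetty port
            }
            // Otherwise the host is likely paused (GC or not); try the next one.
        }
        return null; // every host looks paused; fail fast rather than queue up
    }

    private boolean scout(String host) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, SCOUT_PORT), SCOUT_TIMEOUT_MS);
            s.setSoTimeout(SCOUT_TIMEOUT_MS);
            // The kernel completes the TCP handshake even mid-GC, so wait for an
            // application-level echo that only a responsive JVM can produce.
            s.getOutputStream().write(1);
            return s.getInputStream().read() != -1;
        } catch (IOException timedOutOrRefused) {
            return false;
        }
    }
}
```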
29. LinkedIn : Scaling Web Services
Best Practices
• Fault-tolerance Support
5. Services should protect themselves from traffic bursts
  • Issues
    • Service nodes should protect themselves from being overwhelmed by requests
    • This will also protect their downstream servers from being overwhelmed
    • Simply setting the Tomcat or Jetty thread pool size is not always an option. Oftentimes, these are not configurable per application.
  • Solution (sketched below)
    • Use a sliding-window counter. If the counter exceeds a configured threshold, return immediately with a 503 (“service unavailable”)
    • Set the threshold below the Tomcat or Jetty thread pool size
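A minimal sketch of the sliding-window counter; the window length and threshold are illustrative, and a production version would likely use a ring of per-second counters instead of a deque:

```java
import java.util.concurrent.ConcurrentLinkedDeque;

public class SlidingWindowLimiter {
    private final long windowMs;
    private final int maxRequests; // keep this below the Tomcat/Jetty pool size
    private final ConcurrentLinkedDeque<Long> timestamps = new ConcurrentLinkedDeque<>();

    public SlidingWindowLimiter(long windowMs, int maxRequests) {
        this.windowMs = windowMs;
        this.maxRequests = maxRequests;
    }

    /** Returns true if the request may proceed; false means "respond 503 now". */
    public boolean tryAcquire() {
        long now = System.currentTimeMillis();
        Long oldest;
        // Evict timestamps that have slid out of the window.
        while ((oldest = timestamps.peekFirst()) != null && now - oldest > windowMs) {
            timestamps.pollFirst();
        }
        if (timestamps.size() >= maxRequests) {
            return false; // over threshold: shed load instead of queueing
        }
        timestamps.addLast(now);
        return true;
    }
}
```

In a servlet filter this becomes: if (!limiter.tryAcquire()) { response.sendError(503); return; }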
31. Espresso : Overview
Problem
• What do we do when we run out of QPS capacity on an Oracle database server?
• You can only buy yourself out of this problem so far (i.e. buy a bigger box)
• Read-replicas and memcached will help scale reads, but not writes!
Solution → Espresso
You need a horizontally-scalable database!
Espresso is LinkedIn’s newest NoSQL store. It offers the following features:
• Horizontal scalability
• Works on commodity hardware
• Document-centric
  • Avro documents supporting rich nested data models
  • Schema evolution is drama-free
• Extensions for Lucene indexing
• Supports transactions (within a partition, e.g. memberId)
• Supports conditional reads & writes using standard HTTP headers (e.g. If-Modified-Since), as sketched below
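Espresso's exact URL layout isn't shown in the deck, so the endpoint below is hypothetical; the sketch just illustrates that conditional reads ride on standard HTTP semantics (a 304 means "serve your cached copy"):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConditionalRead {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest req = HttpRequest.newBuilder(
                        URI.create("http://espresso.example.com/MemberDB/Profiles/12345")) // hypothetical
                .header("If-Modified-Since", "Tue, 11 Jun 2013 08:00:00 GMT")
                .build();
        HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());
        if (resp.statusCode() == 304) {
            System.out.println("Not modified: reuse the locally cached document");
        } else {
            System.out.println("Changed: " + resp.body());
        }
    }
}
```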
33. Espresso : Architecture
• Components
  • Request Routing Tier
    • Consults the Cluster Manager to discover the node to route to
    • Forwards the request to the appropriate storage node
  • Storage Tier
    • Data store (MySQL)
    • Local secondary index (Lucene)
  • Cluster Manager
    • Responsible for data set partitioning
    • Manages storage nodes
  • Relay Tier
    • Replicates data to consumers
35. DataBus : Overview
Problem
Our databases (Oracle & Espresso) are used for R/W web-site traffic. However, various services (Search, Graph DB, Standardization, etc.) need the ability to
• Read the data as it is changed in these OLTP stores
• Occasionally, scan the contents in order to rebuild their entire state
Solution → Databus
Databus provides a consistent, in-time-order stream of database changes that
• Scales horizontally
• Protects the source database from high read load
37. DataBus : Usage @ LinkedIn
[Diagram: data change events flow from Oracle or Espresso to the search index, graph index, read replicas, and standardization services]
A user updates the company, title, & school on his profile. He also accepts a connection
• The write is made to an Oracle or Espresso master, and DataBus replicates it:
  • the profile change is applied to the Standardization service
    • E.g. the many forms of IBM are canonicalized for search-friendliness and recommendation-friendliness
  • the profile change is applied to the Search Index service
    • Recruiters can find you immediately by new keywords
  • the connection change is applied to the Graph Index service
    • The user can now start receiving feed updates from his new connections immediately
38. DataBus : Architecture
[Diagram: the Relay captures changes from the DB into an event window; the Bootstrap service picks up on-line changes from the Relay and persists them]
DataBus consists of 2 services
• Relay Service
  • Sharded
  • Maintains an in-memory buffer per shard
  • Each shard polls Oracle and then deserializes transactions into Avro
• Bootstrap Service
  • Picks up online changes as they appear in the Relay
  • Supports 2 types of operations from clients
    • If a client falls behind and needs records older than what the relay has, Bootstrap can send consolidated deltas!
    • If a new client comes online or an existing client fell too far behind, Bootstrap can send a consistent snapshot
39. DataBus : Architecture
[Diagram: consumers embed the Databus client library; each client pulls on-line changes from the Relay, or a consolidated delta since time T / a consistent snapshot at time U from the Bootstrap service]
Guarantees
• Transactions
• In-commit-order delivery → commits are replicated in order
• Durability → you can replay the change stream at any time in the future
• Reliability → 0% data loss
• Low latency → if your consumers can keep up with the relay, sub-second response time
40. DataBus : Architecture
Cool Features
• Server-side (i.e. relay-side & bootstrap-side) filters
• Problem
  • Say that your consuming service is sharded 100 ways
    • e.g. member search indexes sharded by member_id % 100: index_0, index_1, …, index_99
  • However, you have a single member Databus stream
  • How do you avoid having every shard read data it is not interested in?
• Solution
  • Easy, Databus already understands the notion of server-side filters
  • It will only send a consumer instance the updates for the shard it is interested in (the predicate is sketched below)
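Databus applies the shard predicate on the relay/bootstrap side, so filtering happens before events ever cross the network. A hypothetical sketch of the predicate itself (not the actual Databus API):

```java
public class ModPartitionFilter {
    private final int numShards; // e.g. 100
    private final int myShard;   // e.g. 42 for a consumer that owns index_42

    public ModPartitionFilter(int numShards, int myShard) {
        this.numShards = numShards;
        this.myShard = myShard;
    }

    /** True if a change event for this member belongs to the shard this consumer owns. */
    public boolean accepts(long memberId) {
        return memberId % numShards == myShard;
    }
}
```

Each consumer instance registers its (numShards, myShard) pair with the server, and only matching events are streamed to it.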
42. Kafka : Overview
Problem
We have Databus to stream changes that were committed to a database. How do we capture and stream high-volume data if we relax the requirement that the data needs long-term durability?
• In other words, the data can have limited retention
Challenges
• Needs to handle a large volume of events
• Needs to be highly available, scalable, and low-latency
• Needs to provide limited durability guarantees (e.g. data retained for a week)
Solution → Kafka
Kafka is a messaging system that supports topics. Consumers can subscribe to topics and read all data within the retention window. Consumers are then notified of new messages as they appear!
43. Kafka : Usage @ LinkedIn
Kafka is used at LinkedIn for a variety of business-critical needs:
Examples:
• End-user activity tracking (a.k.a. web tracking)
  • Emails opened
  • Logins
  • Pages seen
  • Executed searches
  • Social gestures: likes, sharing, comments
• Data center operational metrics
  • Network & system metrics such as
    • TCP metrics (connection resets, message resends, etc.)
    • System metrics (iops, CPU, load average, etc.)
(A producer sketch follows.)
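Publishing a tracking event with the Kafka 0.8 producer API looks roughly like this; the topic name and event payload are illustrative, not LinkedIn's actual schema:

```java
import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class PageViewTracker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("request.required.acks", "1"); // leader ack is enough for tracking data

        Producer<String, String> producer = new Producer<>(new ProducerConfig(props));
        // Key by member id so one member's events land in one partition (ordering).
        producer.send(new KeyedMessage<>("page-views", "member-12345",
                "{\"page\":\"/profile\",\"ts\":1370950000000}"));
        producer.close();
    }
}
```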
44. Kafka : Architecture
[Diagram: the web tier pushes events to topics on the broker tier (100 MB/sec, sequential writes); consumers pull events via the Kafka client library (200 MB/sec, sendfile); Zookeeper handles message-id management and topic/partition ownership]
Features
• Pub/Sub
• Batch send/receive
• E2E compression
• System decoupling
Guarantees
• At-least-once delivery
• Very high throughput
• Low latency (0.8)
• Durability (for a time period)
• Horizontally scalable
Scale at LinkedIn
• Average unique messages @ peak:
  • writes/sec: 460k
  • reads/sec: 2.3M
  • # topics: 693
• 28 billion unique messages written per day
45. Kafka : Overview
Improvements in 0.8
• Low Latency Features
  • Kafka has always been designed for high throughput, but E2E latency could have been as high as 30 seconds
  • Feature 1: Long-polling
    • For high-throughput requests, a consumer’s request for data will always be fulfilled
    • For low-throughput requests, a consumer’s request will likely return 0 bytes, causing the consumer to back off and wait. What happens if data arrives on the broker in the meantime?
    • As of 0.8, a consumer can “park” a request on the broker for up to m milliseconds
    • If data arrives during this period, it is instantly returned to the consumer (consumer settings sketched below)
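On the 0.8 high-level consumer, long-polling behavior is controlled by the fetch settings below; the ZK address, group id, and values are illustrative:

```java
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.javaapi.consumer.ConsumerConnector;

public class LongPollingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181");
        props.put("group.id", "tracking-consumers");
        // Let the broker "park" our fetch for up to 100 ms when no data is available...
        props.put("fetch.wait.max.ms", "100");
        // ...but respond immediately once at least 1 byte of new data arrives.
        props.put("fetch.min.bytes", "1");

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        // connector.createMessageStreams(...) would follow in a real consumer.
    }
}
```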
46. Kafka : Overview
Improvements in 0.8
• Low Latency Features
  • In the past, data was not visible to a consumer until it was flushed to disk on the broker
  • Feature 2: New Commit Protocol
    • In 0.8, replicas and a new commit protocol have been introduced. As long as data has been replicated to the memory of all replicas, even if it has not been flushed to disk on any one of them, it is considered “committed” and becomes visible to consumers