©2013 LinkedIn Corporation. All Rights Reserved.
Reflecting a Year After Migrating to Apache Traffic Server
Have You Looked At Your Access Logs Lately?
Surviving by Proxy
Even Your Registrar Breaks Sometimes
How Apache Traffic Server Changed LinkedIn
Hello!
ATS: Apache Traffic Server
 Fast, scalable and extensible HTTP/1.1 compliant caching proxy server
 Single-process, multi-threaded
 Asynchronous I/O
 Plugin architecture
 Written by Inktomi more than 10 years ago; Yahoo acquired Inktomi, found the code on a system collecting dust in a cardboard box, and open-sourced it in 2010
ATS: Who’s using it?
When we started…
 4,000 QPS to www.linkedin.com
 120M members
 Citrix NetScaler used for all external load balancing (XLB)
– Load balances requests based on path to frontends
– SSL termination
– Monitors health per frontend
 Features were built as Tomcat filters
– Tomcat required; no solution for alternative frameworks
– >70 frontend services deployed across hundreds of hosts
Outgrowing the existing solution
 Need to support multiple frontend frameworks
– DoS protection
– Authentication
– Optimizations
 Complete control over features
– Cookie manipulation
– Advanced routing
 Deployment delays, security-related fixes took days if not weeks
 Even small changes required touching network gear
How about an intelligent HTTP proxy layer?
 Less (re)implementing features into multiple frameworks
 Make decisions higher in the stack
– Faster response time
– Reduce work on the application stack
 Rapid iteration
Where to start?
 Evaluated options
 Requirements:
– Mature
– Scalable
– Language we like
– Plugin support with hooks and documentation, shared libraries a big plus
– Shared runtime information between plugins
– In-house knowledge is a plus
 Apache Traffic Server matched our needs
Preparation
 4 patches out of the gate
 Audit traffic, build configs
 Build metrics, dashboards and alerts
– Huge blocker, new territory for non-Java @ LinkedIn
 Migrate traffic, one service at a time
Let’s migrate!
Started migration in October 2011
“We’ll be done by Christmas!”
- everyone
Original Plan
[Diagram: XLB → L1 Proxy (ATS) → VIP → Frontend]
Month 1: Public Profile
Request Rules Remap
Cookie-based routing
e.g. logged-in vs. logged-out
[Diagram: www.linkedin.com served from two data centers, Los Angeles and Chicago, each with XLB → L1 Proxy (ATS) → VIP → Frontend]
Request Rules Remap
if (request_cookie["foo"] starts_with "bar")
    return "host:chicago.linkedin.com:8888";
else
    return "host:losangeles.linkedin.com:8888";
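The rule above can be modeled outside ATS as a small function. This is only a Python sketch of the decision, not the request-rules-remap plugin itself; the cookie name "foo" and prefix "bar" are the slide's own placeholders, and the function name is made up for illustration.

```python
from http.cookies import SimpleCookie

# Targets copied from the slide's pseudo-rule.
CHICAGO = "host:chicago.linkedin.com:8888"
LOS_ANGELES = "host:losangeles.linkedin.com:8888"


def pick_origin(cookie_header: str) -> str:
    """Return the routing target for a raw Cookie header value."""
    cookies = SimpleCookie()
    cookies.load(cookie_header)
    morsel = cookies.get("foo")
    # Route to Chicago only when the cookie exists and has the prefix.
    if morsel is not None and morsel.value.startswith("bar"):
        return CHICAGO
    return LOS_ANGELES


if __name__ == "__main__":
    print(pick_origin("foo=bar123; lang=en"))
    print(pick_origin("lang=en"))
```

A request without the cookie falls through to the default target, which matches the else branch of the rule.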
Month 3: Sentinel (DoS protection)
Prevent abusive requests from reaching frontend
[Diagram: XLB → L1 Proxy (ATS) → VIP → Frontend]
Month 4: Picking up momentum
Largest frontends of the site done
– Homepage
– Profile
– Registration
New ATS tier, Fizzy!
New ATS tier, Fizzy!
 Edge Side Includes on steroids
 UI content aggregator
 Progressive Rendering
– Browser deferred rendering
– Browser deferred fetch
– Server
 Supports Server Side Rendering
of JavaScript templates via V8
Now with Fizzy!
[Diagram: XLB → L1 Proxy (ATS), which routes either to a VIP → Frontend (non-fizzy) or to a VIP → Fizzy (ATS) → VIP → Frontend]
Month 6: Most frontends migrated
 Config generators written
 Caught the attention of other teams
– New plugins developed
 Another new tier, QD Proxy!
Another new tier, QD Proxy!
Quick Deploy Proxy
– Define profiles for dev instances to route to
– Allows multiple users to use the same profile
– Develop without running the entire stack
Quick Deploy Proxy: Frontend
[Diagram: XLB, L1 Proxy (ATS), Fizzy (ATS), Frontend, and Backend, with QD Proxy (ATS) routing to "My Frontend"]
Quick Deploy Proxy: Backend
[Diagram: XLB, L1 Proxy (ATS), Fizzy (ATS), Frontend, and Backend, with QD Proxy (ATS) routing to "My Backend"]
Month 9: Ramping Fizzy to 100%
 Broke the site
 HA Proxy saves the day
– “The Reliable, High Performance TCP/HTTP Load Balancer”
– leverage the metadata in Range to generate configs
– reduce network hops by avoiding hardware load balancer
– deploy changes in minutes
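Generating HAProxy configs from host metadata might look roughly like the Python sketch below. The host list is a plain Python list standing in for whatever Range returns, and the backend name, hostnames, and port are hypothetical; only the `backend`, `balance`, and `server ... check` directives are standard HAProxy syntax.

```python
def render_backend(name, hosts, port):
    """Render one HAProxy backend stanza from a list of hostnames."""
    lines = [f"backend {name}", "    balance roundrobin"]
    for index, host in enumerate(hosts, start=1):
        # "check" enables HAProxy's own health checking for the server.
        lines.append(f"    server {name}{index} {host}:{port} check")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical metadata; a real generator would query the metadata store.
    print(render_backend("fizzy", ["fizzy01.example.com", "fizzy02.example.com"], 8080))
```

Because the stanza is derived entirely from metadata, adding or removing a host becomes a regenerate-and-deploy step rather than a hand edit.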
… and HA Proxy!
[Diagram: XLB in front of two L1 Proxy hosts and two Fizzy hosts, each running HAProxy alongside ATS, with Frontend and Frontend (non-fizzy) behind them]
After all that…
 October 2011: 4,000 QPS, 120M members
 August 2012: 15,000 QPS, 175M members
 Now: 67,000 QPS, 225M members
 Citrix NetScaler still in use
– Load balancing L1 proxy
– SSL termination
 Features built as ATS plugins
– Supports anything behind ATS tiers (L1 Proxy, Fizzy)
– Quick to deploy
Implementation
 October 2011 - August 2012 (10 months)
 Unexpected surprises aka outages
 Scope creep
– New tiers and architecture: Fizzy, HA Proxy
– Lots of new plugins
 It takes time to build…
– monitoring
– tooling
– configuration automation
Outages
 Hand edited configs with typos
 Misbehaving node in rotation
 Bad upgrade from 2.x to 3.x due to incompatible hostdb
 Missing slash for a config, sent requests to wrong frontend
 Bonus slash to a healthcheck taking all hosts down
 SysOps re-imaged experimental hosts, broke 10% of Profile
 Saturated load balancer due to additional ATS layer
 Sticky cookie conflict between frontends
 HA Proxy wasn’t started
 Random ATS crashes
 Coal in our stocking for Christmas
 Multiple issues with multiple plugins
 Log4cpp hard-coded to DEBUG at root level for one plugin, overwrote for all plugins
 FD per-user limit unexpectedly changed
 Keep-alive unexpectedly turned on with high timeouts
Outages (>0.1% requests affected)
[Chart: share of outages by cause (Plugin, ATS, Human) for 2011, 2012, and 2013, on a 0% to 100% scale]
How did we improve? Monitoring!
Monitoring: traffic_logstats
• per-origin breakdown:
– status
– method
– QPS
– bytes
– etc.
• Want JSON output? use -j
• results are COUNTER, and GAUGE if the key ends in _pct
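The COUNTER/GAUGE convention can be applied mechanically once the JSON is loaded. The Python sketch below assumes a nested-dictionary shape that is made up for illustration; the real layout of the `traffic_logstats -j` output may differ, and only the `_pct` suffix rule comes from the slide.

```python
def metric_type(key):
    """COUNTER by default; GAUGE when the key ends in _pct."""
    return "GAUGE" if key.endswith("_pct") else "COUNTER"


def flatten(node, prefix=""):
    """Yield (dotted_key, value) pairs from nested dictionaries."""
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


# Hypothetical sample; not the actual traffic_logstats schema.
sample = {
    "origin.example.com": {
        "status.2xx": {"req": "1388802", "req_pct": "93.94"},
    }
}

typed = {key: (metric_type(key), value) for key, value in flatten(sample)}

if __name__ == "__main__":
    for key, (kind, value) in typed.items():
        print(kind, key, value)
```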
Monitoring: traffic_logstats
HTTP return codes                 Count   Percent      Bytes   Percent
------------------------------------------------------------------------------
100 Continue                          0     0.00%     0.00KB     0.00%
200 OK                        1,383,361    93.57%     4.71GB    97.48%
201 Created                       5,429     0.37%     3.28MB     0.07%
202 Accepted                          0     0.00%     0.00KB     0.00%
203 Non-Authoritative Info            0     0.00%     0.00KB     0.00%
204 No content                       12     0.00%     5.63KB     0.00%
205 Reset Content                     0     0.00%     0.00KB     0.00%
206 Partial content                   0     0.00%     0.00KB     0.00%
2xx Total                     1,388,802    93.94%     4.71GB    97.54%
300 Multiple Choices                  0     0.00%     0.00KB     0.00%
301 Moved permanently             3,360     0.23%     3.47MB     0.07%
302 Found                        38,475     2.60%    35.09MB     0.71%
303 See Other                        11     0.00%     3.87KB     0.00%
304 Not modified                 29,262     1.98%    12.20MB     0.25%
305 Use Proxy                         0     0.00%     0.00KB     0.00%
307 Temporary Redirect                0     0.00%     0.00KB     0.00%
3xx Total                        71,108     4.81%    50.76MB     1.03%
...
Monitoring: traffic_line
• Swiss army knife for Traffic Server
• executable to read variables
Monitoring: {stat}
• prefer HTTP over shell?
records.config:
CONFIG proxy.config.http_ui_enabled INT 2
remap.config:
map /_stat/ http://{stat} @action=allow @src_ip=127.0.0.1
Monitoring: {stat}
proxy.node.restarts.manager.start_time
proxy.node.restarts.proxy.start_time
Monitoring: {stat}
proxy.node.current_client_connections
proxy.node.current_server_connections
Monitoring: {stat}
proxy.config.net.connections_throttle
 limit before ATS starts to drop connections
 based on the sum of client and server connections
proxy.process.net.connections_currently_open
 client + server connections
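An alert on these two stats could compare the open-connection count against the throttle limit. The Python sketch below assumes the stats have already been fetched into a dictionary keyed by stat name; the sample values and the 80% threshold are hypothetical.

```python
def throttle_usage(stats):
    """Fraction of the connection throttle currently in use."""
    limit = int(stats["proxy.config.net.connections_throttle"])
    open_now = int(stats["proxy.process.net.connections_currently_open"])
    return open_now / limit


def should_alert(stats, threshold=0.8):
    """True once usage reaches the (assumed) alerting threshold."""
    return throttle_usage(stats) >= threshold


if __name__ == "__main__":
    # Hypothetical readings, as strings the way a stats endpoint might return them.
    sample = {
        "proxy.config.net.connections_throttle": "30000",
        "proxy.process.net.connections_currently_open": "12000",
    }
    print(throttle_usage(sample), should_alert(sample))
```

Alerting well below the limit leaves time to react before ATS starts dropping connections.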
Monitoring: {stat}
Plugin specific
 reviewed before plugins go to production
Examples
 enforced vs. un-enforced DoS requests
 track cookie usage for a migration
 thread usage of a plugin
Monitoring: outside the app
Core dump rate
– generate crash reports with full stack trace
– monitor the file system for core dumps newer than 24 hours
– alert if > N
TCP
– capture states from netstat
– listen queue overflowing (net.core.somaxconn)
Proc
– review /proc/pid/status
– fetch VmSize and VmSwap
– count # of files in /proc/pid/fd for FD usage
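The /proc checks above can be sketched in a few lines of Python. This assumes a Linux host; the function names are illustrative, and how the pid of traffic_server is discovered is left out.

```python
import os


def parse_status(text):
    """Extract VmSize and VmSwap (in kB) from /proc/<pid>/status content."""
    wanted = {"VmSize", "VmSwap"}
    result = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            # Values look like "  2048000 kB"; keep only the number.
            result[key] = int(rest.split()[0])
    return result


def read_status(pid):
    """Read and parse /proc/<pid>/status for a live process."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_status(handle.read())


def fd_count(pid):
    """Count open file descriptors by listing /proc/<pid>/fd."""
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    pid = os.getpid()  # stand-in for the traffic_server pid
    print(read_status(pid), fd_count(pid))
```

Reading /proc/<pid>/fd for another user's process usually requires matching privileges, so a collector like this would typically run as the ATS user or root.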
Monitoring: logs
 I dislike the stock logs
 squid.log
– mimics squid access log
– more useful if you’re caching
 common.log, extended.log, extended2.log
– Netscape formats
– not enough detail
 custom logging!
Custom Logging
records.config
CONFIG proxy.config.log.custom_logs_enabled INT 1
logs_xml.config
<LogFormat>
<Name = "custom_access"/>
<Format = "%<chi> %<{X-Real-Client-IP}cqh> - %<caun> [%<cqtn>] "%<cqhm> %<cquuc> %<cqhv>" %<pssc> %<pscl> "%<{Referer}cqh>" "%<{User-Agent}cqh>" %<ttms>ms %<cquc> %<{X-LI-UUID}psh>"/>
</LogFormat>
<LogObject>
<Format = "custom_access"/>
<Filename = "access"/>
</LogObject>
Custom logging
%<chi>                      172.16.200.10
%<{X-Real-Client-IP}cqh>    65.16.225.8
%<caun>                     - (http auth'd username)
[%<cqtn>]                   [01/Nov/2011:23:59:59 +0000]
"%<cqhm> %<cquuc> %<cqhv>"  "GET /nhome/ HTTP/1.1"
%<pssc>                     200
%<pscl>                     34697
%<{Referer}cqh>             "http://www.linkedin.com/"
%<{User-Agent}cqh>          "Mozilla/4.0 (compatible; ...)"
%<ttms>                     327ms
%<cqu>                      http://origin:port/nhome/
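A log line in this custom format can be split back into fields with one regular expression. The Python sketch below follows the field order of the custom_access format; the group names are shortened for readability, and the sample line is assembled from the example values above plus a made-up UUID.

```python
import re

# One named group per field of the custom_access format.
LINE_RE = re.compile(
    r'(?P<chi>\S+) (?P<real_client_ip>\S+) - (?P<caun>\S+) '
    r'\[(?P<cqtn>[^\]]+)\] '
    r'"(?P<cqhm>\S+) (?P<cquuc>\S+) (?P<cqhv>[^"]+)" '
    r'(?P<pssc>\d{3}) (?P<pscl>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)" '
    r'(?P<ttms>\d+)ms (?P<cquc>\S+) (?P<uuid>\S+)'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LINE_RE.fullmatch(line.strip())
    return match.groupdict() if match else None


SAMPLE = (
    '172.16.200.10 65.16.225.8 - - [01/Nov/2011:23:59:59 +0000] '
    '"GET /nhome/ HTTP/1.1" 200 34697 "http://www.linkedin.com/" '
    '"Mozilla/4.0 (compatible; ...)" 327ms http://origin:port/nhome/ example-uuid'
)

if __name__ == "__main__":
    print(parse_line(SAMPLE))
```

Parsed this way, the status code and duration fields can feed the per-path and per-origin dashboards directly.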
Dashboard: overview
Internal ATS:
 client connections
 server connections
 traffic_cop uptime
 traffic_server uptime
 connection failed
 invalid request
Logs:
 2xx status
 3xx status
 4xx status
 5xx status
 HTTP methods
OS:
 cpu usage
 interface
 tcp state distribution
 # of core dumps
 ATS memory usage
 ATS swap usage
 ATS file descriptor usage
Dashboard: in-depth
 plugin-specific
 per-path histogram of request durations
 per-origin HTTP status breakdown
 HA Proxy
– current sessions
– denied requests
– error requests
– server status
How did we improve? Automation!
Configs are generated, not hand maintained
– Details about a service are stored in metadata store
– YAML configs supplement missing data
Deployment done by Salt
– All deployment actions and verifications are integrated with Informed
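A config generator in this spirit merges service metadata with overrides and emits remap.config lines. In the Python sketch below both inputs are plain dictionaries: `services` stands in for the metadata store and `overrides` for the parsed YAML. The service name, VIP hostname, and ports are hypothetical, and the rule ordering (alphabetical by service) is a simplification.

```python
def render_remap(services, overrides):
    """Emit one remap.config 'map' rule per service."""
    lines = []
    for name in sorted(services):
        # Override values win over metadata values for the same key.
        entry = {**services[name], **overrides.get(name, {})}
        lines.append(
            f"map http://www.linkedin.com{entry['path']} "
            f"http://{entry['vip']}:{entry['port']}{entry['path']}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    services = {
        "profile": {"path": "/profile/", "vip": "profile-vip.example.com", "port": 8080},
    }
    overrides = {"profile": {"port": 9090}}
    print(render_remap(services, overrides))
```

Generating the file removes the class of outage caused by a hand-typed missing or extra slash, since paths are written once in the metadata.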
Informed
Plugins!
header-rewrite
request-rules-remap
sentinel
lix-remap
host_override
postbuffer
mobileredirect
correctcookiedomain
qdproxy
boom
pagespeed
contentsecurityheader
authfilter
oauth-rewrite
stickyrouting
Plugins: header-rewrite
Manipulate headers at any point in the request lifecycle
– read request
– send request
– read response
– send response
 Can use as a remap plugin
– change path, destination, port
 Patched to include variables
Plugins: header-rewrite
cond %{READ_REQUEST_HDR_HOOK} [AND]
cond %{ACCESS:/var/healthcheck} [NOT]
rm-header Connection
add-header Connection "close"
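Read as plain logic, the rule says: when the healthcheck file is not accessible, replace any Connection header with "close". The Python sketch below models that behavior on a header dictionary; it is not how the header-rewrite plugin is implemented, and the function name is made up.

```python
import os


def apply_rule(headers, healthcheck_path="/var/healthcheck"):
    """Force 'Connection: close' when the healthcheck file is missing."""
    if not os.access(healthcheck_path, os.F_OK):
        # rm-header Connection, then add-header Connection "close".
        updated = {k: v for k, v in headers.items() if k.lower() != "connection"}
        updated["Connection"] = "close"
        return updated
    return dict(headers)


if __name__ == "__main__":
    print(apply_rule({"Host": "www.linkedin.com", "Connection": "keep-alive"}))
```

The likely intent is that removing the healthcheck file takes a host out of rotation and also stops it from holding on to keep-alive connections.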
Plugins: header-rewrite
cond %{SEND_RESPONSE_HDR_HOOK} [AND]
cond %{PATH} "/foo.js"
add-header Content-Type "text/javascript"
Plugins: lix-remap
Uses LinkedIn Experiments infrastructure (A/B testing) to make routing
decisions
 Enable NOC to easily send traffic to another data center
 Route specific users, LinkedIn employees or % of users to experimental
tiers
 Used for red-line performance testing of frontends
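Routing a percentage of users to an experimental tier is commonly done by hashing a stable identifier into buckets. The Python sketch below shows that general technique under assumed names; it is not the lix-remap plugin or the LinkedIn Experiments API, and the tier names and experiment key are hypothetical.

```python
import hashlib


def bucket(member_id, experiment):
    """Map a member to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{experiment}:{member_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def route(member_id, experiment, percent, experimental_tier, default_tier):
    """Send roughly `percent` percent of members to the experimental tier."""
    if bucket(member_id, experiment) < percent:
        return experimental_tier
    return default_tier


if __name__ == "__main__":
    for member in ("1001", "1002", "1003"):
        print(member, route(member, "new-profile", 10, "profile-exp", "profile"))
```

Hashing the experiment name together with the member id keeps a member's assignment stable within one experiment while decorrelating it across experiments.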
Plugins: Boom
We don’t want to show users this…
Plugins: Boom
… but based on status code, we can replace it with this:
Plugins: Host Override
Direct your request to a specific host through any ATS tier
Plugins: PageSpeed
Support on-the-fly operations before sending the response
Plugins: PageSpeed
HTML minification
– How many empty new lines are on Profile?
2703
– How many empty new lines are on Homepage?
9205
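Counting empty lines in a page, as in the questions above, takes only a few lines of Python. The sketch below also strips them and reports the gzip size so the effect can be measured; it is far simpler than what PageSpeed's minifier does, and the sample markup is made up.

```python
import gzip


def count_blank_lines(html):
    """Number of lines that are empty or whitespace-only."""
    return sum(1 for line in html.splitlines() if not line.strip())


def strip_blank_lines(html):
    """Drop empty and whitespace-only lines, keeping everything else."""
    return "\n".join(line for line in html.splitlines() if line.strip())


def gzip_size(text):
    """Size in bytes of the gzip-compressed UTF-8 text."""
    return len(gzip.compress(text.encode("utf-8")))


if __name__ == "__main__":
    page = "<html>\n\n  \n<body>\n\n</body>\n</html>"
    print(count_blank_lines(page))
    print(len(page), len(strip_blank_lines(page)))
    print(gzip_size(page), gzip_size(strip_blank_lines(page)))
```

Comparing both raw and compressed sizes matters because runs of whitespace compress well, so savings on the wire tend to be smaller than savings in raw bytes.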
Plugins: PageSpeed
HTML minification
Homepage: 78%
Profile: 72%
Plugins: PageSpeed
HTML minification
Homepage: 10%
Profile: 17%
[Chart: compressed bytes for Homepage and Profile, axis from 0 to 40,000]
Plugins: PageSpeed
Lazy loading of images below the fold
The awesome patches
 traffic_server gets restarted if FD > 32
 infinite emergency throttle
 buffer overflow in the stats system
Contributions back
28 fixes committed back to open-source
19 more pending
LinkedIn ATS committer, Brian Geffon
ATS C++ API
Simplifies the process of writing ATS plugins
https://github.com/linkedin/atscppapi
I wrote a transformation plugin that would probably
have taken me weeks, struggling with virtual I/O
buffers, in just a few hours. Now that I’ve done it
once, it would be even faster.
Doug Young
Sr. Staff Software Engineer
Almost forgot… Media Cache!
Serves profile pictures, cached external content
Pre-ATS
– NetApp filer CPU >50%
– Expected an outage during NetApp failover
Post-ATS
– 98% cache hit rate
– $30,000 in gear, saved $400,000
– Bought us time to re-architect the service
So what are the takeaways?
 ATS is a bad ass HTTP proxy
 Small details matter, fight for the users
 HA Proxy is a silver bullet
 Slow down, learn from your mistakes
 Don’t just use open-source, contribute
Meet the team
Manjesh Nilange, Brian Geffon, Thomas Jackson, Nick Berry
Office hours @ 1:15 PM
Exhibit Hall (Table 2)
Links
 This talk:
 Apache Traffic Server:
– http://trafficserver.apache.org
 ATS C++ API:
– https://github.com/linkedin/atscppapi
 New plugins:
– https://github.com/linkedin/ -- coming soon!
Goodbye!

Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 

Reflecting a Year After Migrating to Apache Traffic Server

  • 1. ©2013 LinkedIn Corporation. All Rights Reserved. Reflecting a Year After Migrating to Apache Traffic Server
  • 2. Have You Looked At Your Access Logs Lately?
  • 3. Surviving by Proxy
  • 4. Even Your Registrar Breaks Sometimes
  • 5. How Apache Traffic Server Changed LinkedIn
  • 6. Hello!
  • 7. ATS: Apache Traffic Server
      - Fast, scalable and extensible HTTP/1.1 compliant caching proxy server
      - Single-process, multi-threaded
      - Asynchronous I/O
      - Plugin architecture
      - Written by Inktomi >10 years ago, Yahoo acquired Inktomi, found the code on a system collecting dust in a cardboard box, and open-sourced in 2010
  • 8. ATS: Who’s using it?
  • 9. When we started…
      - 4,000 QPS to www.linkedin.com
      - 120M members
      - Citrix NetScaler used for all external load balancing (XLB)
          - Load balances requests based on path to frontends
          - SSL termination
          - Monitors health per frontend
      - Features were built as Tomcat filters
          - Tomcat required, no solution for alternates
          - >70 frontend services deployed across hundreds of hosts
  • 10. Outgrowing the existing solution
      - Need to support multiple frontend frameworks
          - DoS protection
          - Authentication
          - Optimizations
      - Complete control over features
          - Cookie manipulation
          - Advanced routing
      - Deployment delays, security-related fixes took days if not weeks
      - Even small changes required touching network gear
  • 11. How about an intelligent HTTP proxy layer?
      - Less (re)implementing features into multiple frameworks
      - Make decisions higher in the stack
          - Faster response time
          - Reduce work on the application stack
      - Rapid iteration
  • 12. Where to start?
      - Evaluated options
      - Requirements:
          - Mature
          - Scalable
          - Language we like
          - Plugin support with hooks and documentation, shared libraries a big plus
          - Shared runtime information between plugins
          - In-house knowledge is a plus
      - Apache Traffic Server matched our needs
  • 13. Preparation
      - 4 patches out of the gate
      - Audit traffic, build configs
      - Build metrics, dashboards and alerts
          - Huge blocker, new territory for non-Java @ LinkedIn
      - Migrate traffic, one service at a time
  • 14. Let’s migrate! Started migration in October 2011
  • 15. Let’s migrate! Started migration in October 2011. “We’ll be done by Christmas!” - everyone
  • 16. Original Plan (diagram labels: XLB, L1 Proxy (ATS), VIP, Frontend)
  • 17. Month 1: Public Profile
  • 18. Request Rules Remap: cookie-based routing, e.g. logged-in vs. logged-out (diagram labels: www.linkedin.com; Los Angeles: XLB, L1 Proxy (ATS), VIP, Frontend; Chicago: XLB, L1 Proxy (ATS), VIP, Frontend)
  • 19. Request Rules Remap
        if (request_cookie["foo"] starts_with "bar")
          return "host:chicago.linkedin.com:8888";
        else
          return "host:losangeles.linkedin.com:8888";
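The rule on slide 19 can be mimicked outside ATS to see how cookie-based routing behaves. The following is a minimal Python sketch, not the Request Rules Remap plugin itself: the function name `pick_origin` and its parameters are hypothetical, and only the cookie name, prefix, and the two origins come from the slide.

```python
from http.cookies import SimpleCookie

# Origins copied from the slide's example rule.
CHICAGO = "chicago.linkedin.com:8888"
LOS_ANGELES = "losangeles.linkedin.com:8888"


def pick_origin(cookie_header: str, name: str = "foo", prefix: str = "bar") -> str:
    """Return the origin for a request based on one cookie's value.

    Hypothetical helper: mirrors the slide's rule, which sends requests
    whose cookie `name` starts with `prefix` to Chicago and everything
    else to Los Angeles.
    """
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    morsel = cookies.get(name)
    if morsel is not None and morsel.value.startswith(prefix):
        return CHICAGO
    return LOS_ANGELES


if __name__ == "__main__":
    print(pick_origin("foo=barbaz; other=1"))
    print(pick_origin("foo=qux"))
```

Keeping the decision in one pure function makes it easy to unit test routing rules before they reach a proxy tier.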
  • 20. Month 3: Sentinel (DoS protection). Prevent abusive requests from reaching frontend (diagram labels: XLB, L1 Proxy (ATS), VIP, Frontend)
  • 21. Month 4: Picking up momentum
      - Largest frontends of the site done
          - Homepage
          - Profile
          - Registration
      - New ATS tier, Fizzy!
  • 22. New ATS tier, Fizzy!
      - Edge Side Includes on steroids
      - UI content aggregator
      - Progressive Rendering
          - Browser deferred rendering
          - Browser deferred fetch
          - Server
      - Supports Server Side Rendering of JavaScript templates via V8
  • 24. Now with Fizzy! (diagram labels: XLB, L1 Proxy (ATS), VIP, VIP, Frontend (non-fizzy), Fizzy (ATS), VIP, Frontend)
  • 25. Month 6: Most frontends migrated
      - Config generators written
      - Caught the attention of other teams
          - New plugins developed
      - Another new tier, QD Proxy!
  • 26. Another new tier, QD Proxy! Quick Deploy Proxy
      - Define profiles for dev instances to route to
      - Allows multiple users to use the same profile
      - Develop without running the entire stack
  • 27. Quick Deploy Proxy: Frontend (diagram labels: XLB, L1 Proxy (ATS), Frontend, Fizzy (ATS), Backend, QD Proxy (ATS), My Frontend)
  • 28. Quick Deploy Proxy: Backend (diagram labels: XLB, L1 Proxy (ATS), Frontend, Fizzy (ATS), Backend, QD Proxy (ATS), My Backend)
  • 29. Month 9: Ramping Fizzy to 100%
  • 30. Month 9: Ramping Fizzy to 100%
      - Broke the site
  • 31. Month 9: Ramping Fizzy to 100%
      - Broke the site
      - HA Proxy saves the day
          - “The Reliable, High Performance TCP/HTTP Load Balancer”
          - leverage the metadata in Range to generate configs
          - reduce network hops by avoiding hardware load balancer
          - deploy changes in minutes
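Generating HA Proxy configs from metadata, as slide 31 describes, might look roughly like the Python sketch below. It is an illustration under assumptions: the metadata lookup is replaced by a plain list of hosts, the function name `render_backend` and the hostnames are hypothetical, and the team's actual generator and Range queries are not shown in the slides. The emitted `backend`, `balance roundrobin`, and `server ... check` lines are standard HAProxy configuration keywords.

```python
from typing import Iterable


def render_backend(name: str, hosts: Iterable[str], port: int) -> str:
    """Render one HAProxy backend stanza from a list of hosts.

    Hypothetical helper: in a real generator the hosts would come from a
    metadata store rather than a literal list.
    """
    lines = [f"backend {name}", "    balance roundrobin"]
    # Sort so the generated file is stable between runs and diffs stay small.
    for index, host in enumerate(sorted(hosts), start=1):
        lines.append(f"    server {name}{index} {host}:{port} check")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hostnames and port are placeholders, not real LinkedIn hosts.
    print(render_backend("fizzy", ["fizzy-b.example", "fizzy-a.example"], 8080))
```

Because the output is deterministic for a given host list, regenerating the config after a topology change produces a reviewable diff instead of a hand edit.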
  • 32. … and HA Proxy! (diagram labels: XLB; L1 Proxy [HAProxy + ATS], shown twice; Fizzy [HAProxy + ATS], shown twice; Frontend; Frontend (non-fizzy))
  • 33.
  • 34. After all that…
      - October 2011: 4,000 QPS, 120M members
      - August 2012: 15,000 QPS, 175M members
      - Now: 67,000 QPS, 225M members
      - Citrix NetScaler still in use
          - Load balancing L1 proxy
          - SSL termination
      - Features built as ATS plugins
          - Supports anything behind ATS tiers (L1 Proxy, Fizzy)
          - Quick to deploy
  • 35. Implementation
      - October 2011 - August 2012 (10 months)
  • 36.
  • 37. Implementation
      - October 2011 - August 2012
      - Unexpected surprises aka outages
      - Scope creep
          - New tiers and architecture: Fizzy, HA Proxy
          - Lots of new plugins
      - It takes time to build…
          - monitoring
          - tooling
          - configuration automation
  • 38. Outages
      - Hand edited configs with typos
      - Misbehaving node in rotation
      - Bad upgrade from 2.x to 3.x due to incompatible hostdb
      - Missing slash for a config, sent requests to wrong frontend
      - Bonus slash to a healthcheck taking all hosts down
      - SysOps re-imaged experimental hosts, broke 10% of Profile
      - Saturated load balancer due to additional ATS layer
      - Sticky cookie conflict between frontends
      - HA Proxy wasn’t started
      - Random ATS crashes
      - Coal in our stocking for Christmas
      - Multiple issues with multiple plugins
      - Log4cpp hard-coded to DEBUG at root level for one plugin, overwrote for all plugins
      - FD per-user limit unexpectedly changed
      - Keep-alive unexpectedly turned on with high timeouts
  • 39. Outages (>0.1% requests affected) (chart: share of outages by cause, Plugin / ATS / Human, for 2011, 2012 and 2013, on a 0% to 100% axis)
  • 40. How did we improve?
  • 41. How did we improve? Monitoring!
  • 42.
  • 43. Monitoring: traffic_logstats
      - per-origin breakdown:
          - status
          - method
          - QPS
          - bytes
          - etc.
      - Want JSON output? use -j
      - results are COUNTER, and GAUGE if the key ends in _pct
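The COUNTER/GAUGE convention on slide 43 is simple to apply once the JSON has been loaded. The sketch below is a hedged Python illustration: the nested sample document is an assumed shape chosen for the example, not a verified copy of `traffic_logstats -j` output, and `classify` is a hypothetical helper name.

```python
import json


def classify(stats: dict, prefix: str = "") -> dict:
    """Flatten nested stats and tag each leaf as GAUGE or COUNTER.

    Rule taken from the slide: a key ending in "_pct" is a GAUGE,
    everything else is a COUNTER.
    """
    result = {}
    for key, value in stats.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            result.update(classify(value, path))
        else:
            kind = "GAUGE" if key.endswith("_pct") else "COUNTER"
            result[path] = (kind, float(value))
    return result


if __name__ == "__main__":
    # Assumed sample shape; numbers borrowed from the 2xx totals on the next slide.
    sample = json.loads(
        '{"origin.example": {"status.2xx": {"req": "1388802", "req_pct": "93.94"}}}'
    )
    for name, (kind, value) in sorted(classify(sample).items()):
        print(kind, name, value)
```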
  • 44. Monitoring: traffic_logstats
      HTTP return codes                Count  Percent     Bytes  Percent
      ------------------------------------------------------------------------------
      100 Continue                         0    0.00%    0.00KB    0.00%
      200 OK                       1,383,361   93.57%    4.71GB   97.48%
      201 Created                      5,429    0.37%    3.28MB    0.07%
      202 Accepted                         0    0.00%    0.00KB    0.00%
      203 Non-Authoritative Info           0    0.00%    0.00KB    0.00%
      204 No content                      12    0.00%    5.63KB    0.00%
      205 Reset Content                    0    0.00%    0.00KB    0.00%
      206 Partial content                  0    0.00%    0.00KB    0.00%
      2xx Total                    1,388,802   93.94%    4.71GB   97.54%
      300 Multiple Choices                 0    0.00%    0.00KB    0.00%
      301 Moved permanently            3,360    0.23%    3.47MB    0.07%
      302 Found                       38,475    2.60%   35.09MB    0.71%
      303 See Other                       11    0.00%    3.87KB    0.00%
      304 Not modified                29,262    1.98%   12.20MB    0.25%
      305 Use Proxy                        0    0.00%    0.00KB    0.00%
      307 Temporary Redirect               0    0.00%    0.00KB    0.00%
      3xx Total                       71,108    4.81%   50.76MB    1.03%
      ...
  • 45. Monitoring: traffic_line
      - Swiss army knife for Traffic Server
      - executable to read variables
  • 46. Monitoring: {stat}
      - prefer HTTP over shell?
      records.config:
        CONFIG proxy.config.http_ui_enabled INT 2
      remap.config:
        map /_stat/ http://{stat} @action=allow @src_ip=127.0.0.1
  • 47. Monitoring: {stat}
      proxy.node.restarts.manager.start_time
      proxy.node.restarts.proxy.start_time
  • 48. Monitoring: {stat}
      proxy.node.restarts.manager.start_time
      proxy.node.restarts.proxy.start_time
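The speaker notes say uptime is derived by subtracting `start_time` from `time.time()`. A minimal Python sketch of that arithmetic follows; the function names are hypothetical, and fetching the two `start_time` values from ATS is left out because the slides do not show how it was done.

```python
import time
from typing import Optional


def uptime_seconds(start_time: float, now: Optional[float] = None) -> float:
    """Uptime is the current epoch time minus the reported start_time."""
    current = time.time() if now is None else now
    # Clamp at zero in case of clock skew between hosts.
    return max(0.0, current - start_time)


def restarted_since(previous_start: float, current_start: float) -> bool:
    """A start_time that moved forward between polls means a restart happened."""
    return current_start > previous_start


if __name__ == "__main__":
    started = time.time() - 120  # pretend the process started two minutes ago
    print(round(uptime_seconds(started)))
```

Comparing successive `start_time` samples is what surfaces crashes and unannounced deployments on a graph, since uptime drops back to zero.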
  • 49. Monitoring: {stat}
      proxy.node.current_client_connections
      proxy.node.current_server_connections
  • 50. Monitoring: {stat}
      proxy.config.net.connections_throttle
          - limit before ATS starts to drop connections
          - based on the sum of client and server connections
      proxy.process.net.connections_currently_open
          - client + server connections
  • 51. Monitoring: {stat}
      proxy.config.net.connections_throttle
          - limit before ATS starts to drop connections
          - based on the sum of client and server connections
      proxy.process.net.connections_currently_open
          - client + server connections
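Tracking how close open connections are to the throttle reduces to one ratio of the two stats named above. The Python sketch below is illustrative only: the function names and the 0.8 alert threshold are assumptions, not values from the talk.

```python
def throttle_usage(currently_open: int, throttle_limit: int) -> float:
    """Fraction of the connection throttle currently in use.

    currently_open  -> proxy.process.net.connections_currently_open
    throttle_limit  -> proxy.config.net.connections_throttle
    """
    if throttle_limit <= 0:
        raise ValueError("throttle_limit must be positive")
    return currently_open / throttle_limit


def should_alert(currently_open: int, throttle_limit: int, threshold: float = 0.8) -> bool:
    """Alert once usage reaches the (hypothetical) threshold."""
    return throttle_usage(currently_open, throttle_limit) >= threshold


if __name__ == "__main__":
    print(throttle_usage(24000, 30000), should_alert(24000, 30000))
```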
  • 52. Monitoring: {stat}
      Plugin specific
          - reviewed before plugins go to production
      Examples
          - enforced vs. un-enforced DoS requests
          - track cookie usage for a migration
          - thread usage of a plugin
  • 53. Monitoring: outside the app
      Core dump rate
          - generate crash reports with full stack trace
          - monitoring file system for core dumps newer than -24 hours
          - alert if > N
      TCP
          - capture states from netstat
          - listen queue overflowing (net.core.somaxconn)
      Proc
          - review /proc/pid/status
          - fetch VmSize and VmSwap
          - count # of files in /proc/pid/fd for FD usage
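The Proc checks on slide 53 can be sketched in a few lines of Python. This is an illustration rather than the team's collector: the function names are hypothetical, and it assumes a Linux host where `/proc/<pid>/status` reports `VmSize` and `VmSwap` in kB.

```python
import os


def parse_proc_status(text: str, fields=("VmSize", "VmSwap")) -> dict:
    """Extract selected fields (values in kB) from /proc/<pid>/status text."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in fields and rest.strip():
            values[key] = int(rest.split()[0])  # rest looks like "  204800 kB"
    return values


def count_open_fds(pid: int) -> int:
    """Count entries in /proc/<pid>/fd (Linux only, needs read permission)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    pid = os.getpid()
    with open(f"/proc/{pid}/status") as handle:
        print(parse_proc_status(handle.read()))
    print(count_open_fds(pid))
```

Parsing is kept separate from file access so the parser can be tested with captured text on any platform.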
  • 54. Monitoring: logs
      - I HATE dislike the stock logs
      - squid.log
          - mimics squid access log
          - more useful if you’re caching
      - common.log, extended.log, extended2.log
          - Netscape formats
          - not enough detail
      - custom logging!
  • 55. Custom Logging
      records.config:
        CONFIG proxy.config.log.custom_logs_enabled INT 1
      logs_xml.config:
        <LogFormat>
          <Name = "custom_access"/>
          <Format = "%<chi> %<{X-Real-Client-IP}cqh> - %<caun> [%<cqtn>] "%<cqhm> %<cquuc> %<cqhv>" %<pssc> %<pscl> "%<{Referer}cqh>" "%<{User-Agent}cqh>" %<ttms>ms %<cquc> %<{X-LI-UUID}psh>"/>
        </LogFormat>
        <LogObject>
          <Format = "custom_access"/>
          <Filename = "access"/>
        </LogObject>
  • 56. Custom logging
      %<chi>                       172.16.200.10
      %<{X-Real-Client-IP}cqh>     65.16.225.8
      %<caun>                      - (http auth'd username)
      [%<cqtn>]                    [01/Nov/2011:23:59:59 +0000]
      "%<cqhm> %<cquuc> %<cqhv>"   "GET /nhome/ HTTP/1.1"
      %<pssc>                      200
      %<pscl>                      34697
      %<{Referer}cqh>              "http://www.linkedin.com/"
      %<{User-Agent}cqh>           "Mozilla/4.0 (compatible; ...)"
      %<ttms>                      327ms
      %<cqu>                       http://origin:port/nhome/
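The speaker notes mention tailing this log and reporting timing per path. A rough Python sketch of that aggregation is shown below. The regular expression is written against the custom format on the previous slide and the sample values above; the names `LINE_RE` and `mean_ttms_by_path` are hypothetical, and the real aggregation script is not part of the talk.

```python
import re
from collections import defaultdict

# Matches the leading fields of the custom_access format:
# chi, X-Real-Client-IP, "-", caun, [cqtn], "method url version",
# pssc, pscl, "referer", "user-agent", <ttms>ms
LINE_RE = re.compile(
    r'^(?P<chi>\S+) (?P<real_ip>\S+) - (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<version>[^"]+)" '
    r'(?P<status>\d{3}) (?P<length>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)" '
    r'(?P<ttms>\d+)ms'
)


def mean_ttms_by_path(lines) -> dict:
    """Average request duration in milliseconds per path; skips unparsable lines."""
    totals = defaultdict(lambda: [0, 0])  # path -> [sum of ms, count]
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue
        entry = totals[match["path"]]
        entry[0] += int(match["ttms"])
        entry[1] += 1
    return {path: total / count for path, (total, count) in totals.items()}


if __name__ == "__main__":
    sample = (
        '172.16.200.10 65.16.225.8 - - [01/Nov/2011:23:59:59 +0000] '
        '"GET /nhome/ HTTP/1.1" 200 34697 "http://www.linkedin.com/" '
        '"Mozilla/4.0 (compatible; ...)" 327ms http://origin:port/nhome/ -'
    )
    print(mean_ttms_by_path([sample]))
```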
  • 57. Dashboard: overview
      - Internal ATS: client connections, server connections, traffic_cop uptime, traffic_server uptime, connection failed, invalid request
      - Logs: 2xx status, 3xx status, 4xx status, 5xx status, HTTP methods
      - OS: cpu usage, interface, tcp state distribution, # of core dumps, ATS memory usage, ATS swap usage, ATS file descriptor usage
  • 58. Dashboard: in-depth
      - plugin-specific
      - per-path histogram of request durations
      - per-origin HTTP status breakdown
      - HA Proxy
          - current sessions
          - denied requests
          - error requests
          - server status
  • 59. How did we improve? Automation!
      - Configs are generated, not hand maintained
          - Details about a service are stored in metadata store
          - YAML configs supplement missing data
      - Deployment done by Salt
          - All deployment actions and verifications are integrated with Informed
  • 60. Informed
  • 61. Plugins! header-rewrite, request-rules-remap, sentinel, lix-remap, host_override, postbuffer, mobileredirect, correctcookiedomain, qdproxy, boom, pagespeed, contentsecurityheader, authfilter, oauth-rewrite, stickyrouting
  • 62. Plugins: header-rewrite
      - Manipulate headers at any point in the request lifecycle
          - read request
          - send request
          - read response
          - send response
      - Can use as a remap plugin
          - change path, destination, port
      - Patched to include variables
  • 63. Plugins: header-rewrite
        cond %{READ_REQUEST_HDR_HOOK} [AND]
        cond %{ACCESS:/var/healthcheck} [NOT]
        rm-header Connection
        add-header Connection "close"
  • 64. Plugins: header-rewrite
        cond %{SEND_RESPONSE_HDR_HOOK} [AND]
        cond %{PATH} "/foo.js"
        add-header Content-Type "text/javascript"
  • 65. Plugins: lix-remap. Uses LinkedIn Experiments infrastructure (A/B testing) to make routing decisions
      - Enable NOC to easily send traffic to another data center
      - Route specific users, LinkedIn employees or % of users to experimental tiers
      - Used for red-line performance testing of frontends
  • 66. Plugins: Boom. We don’t want to show users this…
  • 67. Plugins: Boom. … but based on status code, we can replace it with this:
  • 68. Plugins: Host Override. Direct your request to a specific host through any ATS tier
  • 69. Plugins: PageSpeed. Support on-the-fly operations before sending the response
  • 70. Plugins: PageSpeed, HTML minification
      - How many empty new lines are on Profile?
  • 71. Plugins: PageSpeed, HTML minification
      - How many empty new lines are on Profile? 2703
  • 72. Plugins: PageSpeed, HTML minification
      - How many empty new lines are on Profile? 2703
      - How many empty new lines are on Homepage?
  • 73. Plugins: PageSpeed, HTML minification
      - How many empty new lines are on Profile? 2703
      - How many empty new lines are on Homepage? 9205
  • 74. Plugins: PageSpeed, HTML minification. Homepage: 78%, Profile: 72%
  • 75. Plugins: PageSpeed, HTML minification. Homepage: 10%, Profile: 17% (chart: compressed bytes for Homepage and Profile, axis 0 to 40000)
  • 76. Plugins: PageSpeed. Lazy loading of images below the fold
  • 77. The awesome patches
  • 78. The awesome patches
      - traffic_server gets restarted if FD > 32
  • 79. The awesome patches
      - traffic_server gets restarted if FD > 32
      - infinite emergency throttle
  • 80. The awesome patches
      - traffic_server gets restarted if FD > 32
      - infinite emergency throttle
      - buffer overflow in the stats system
  • 81. Contributions back
      - 28 fixes committed back to open-source
      - 19 more pending
      - LinkedIn ATS committer, Brian Geffon
  • 82. ATS C++ API. Simplifies the process of writing ATS plugins. https://github.com/linkedin/atscppapi
      “I wrote a transformation plugin that would probably have taken me weeks, struggling with virtual I/O buffers, in just a few hours. Now that I’ve done it once, it would be even faster.” - Doug Young, Sr. Staff Software Engineer
  • 83. Almost forgot… Media Cache! Serves profile pictures, cached external content
      - Pre-ATS
          - NetApp filer CPU >50%
          - Expected an outage during NetApp failover
  • 84. Almost forgot… Media Cache! Serves profile pictures, cached external content
      - Pre-ATS
          - NetApp filer CPU >50%
          - Expected an outage during NetApp failover
      - Post-ATS
          - 98% cache hit rate
          - $30,000 in gear, saved $400,000
          - Bought us time to re-architect the service
  • 85. So what are the takeaways?
      - ATS is a bad ass HTTP proxy
      - Small details matter, fight for the users
      - HA Proxy is a silver bullet
      - Slow down, learn from your mistakes
      - Don’t just use open-source, contribute
  • 86. Meet the team: Manjesh Nilange, Brian Geffon, Thomas Jackson, Nick Berry. Office hours @ 1:15 PM, Exhibit Hall (Table 2)
  • 87. Links
      - This talk:
      - Apache Traffic Server: http://trafficserver.apache.org
      - ATS C++ API: https://github.com/linkedin/atscppapi
      - New plugins: https://github.com/linkedin/ -- coming soon!
  • 88. Goodbye!

Editor's notes

  1. https://iwww.corp.linkedin.com/wiki/cf/display/~niberry/Velocity+2013+Proposal http://velocityconf.com/velocity2013/public/schedule/detail/28461
  2. I like this one a lot
  3. Really this talk is about how introducing ATS into the LinkedIn stack completely changed how we tackled several complex problems.
  4. I started working at LinkedIn in August 2011. I manage the SRE team responsible for Core, Security and Presentation Infrastructure. We support Identity infrastructure, Growth/Registration, Engagement and several other systems that I'll spare you the details on since you’re here to learn about ATS… so what is it?
  5. Bad ass HTTP proxy. Multi-threaded. Non-blocking I/O. Pluggable. Well known for caching. Inktomi wrote it sometime in the mid-to-late 90s and Yahoo open-sourced it in 2010.
  6. So a few companies are using it… and these are just the people that bothered to put their logo on the Customers page
  7. If you wanted a feature, it would be built as a Tomcat filter and deployed out to the majority of the site. For anything not running in Tomcat, there was no solution. Lots of frontends on lots of hosts.
  8. LinkedIn started acquiring companies, and it was really difficult to integrate their stacks into our own. We’re in a heterogeneous environment; supporting features across multiple platforms is a requirement, and abstracting that from the frontend itself into an infrastructure tier completely changes the game. (Give a story?)
  9. Centralize the effort of these features. Acquisitions become first-ish class citizens. Need to make a change? Push out the plugin to the ATS tier instead of coordinating with all the service owners for weeks to update/deploy their code. Reduce the time to deploy.
  10. (Slow down the delivery on this one.) HA Proxy and Varnish were not considered. Nginx, ATS: maturity, scalability, modular, in-house knowledge. https://iwww.corp.linkedin.com/wiki/cf/display/ENGS/Comparison+of+TS+and+nginx The ability to dynamically load plugins without having to recompile a new server binary gave us the flexibility we needed to enable/disable features quickly.
  11. Before we even started, we had patches for ATS addressing an issue with keep-alive handling and adding support for remap_with_recv_port to allow routing requests from different incoming ports to different origins. There was no good source of truth for how to route requests, so we had to audit configs and access logs to build the config. Metrics for non-Java services at LinkedIn really didn’t exist. We built a Python-based framework to seamlessly fit into our monitoring model. After all this, we could start migrating!
  12. We were a little ambitious
  13. Ok, very ambitious
  14. L1 Proxy will be ATS with a few plugins
  15. First, we migrated our SEO optimized Profile pages to L1 Proxy. Allowed us to support routing unauthenticated users out of ECH3
  16. Certain requests are able to be served out of different data centers. So for public profile requests from our signed-out users, we can route them to our Chicago data center for improved RTT.
  17. if there is a cookie named foo and it starts with bar, route it to the moon.
  18. Drop the request at L1 Proxy instead of wasting cycles on the frontend. This allowed us to automatically deny requests based on limits, without the need for people to scan access logs and manually block IP addresses. We were prepared with our new Sentinel plugin before Christmas, but decided to delay enabling it until after the holidays. Since scrapers do not seem to celebrate New Year's, we were forced to enable it on New Year's Day 2012, and it worked!
  19. Hopefully you had a chance to attend Veena’s talk yesterday on The Curious Case of Dust JavaScript and Performance. In case you didn’t make it, I’ll explain what it does
  20. Call out Veena’s talk: you’ve missed it but you can see more information here… Add note about USSR + V8.
  21. At a high level, this allows your app to do specifically what it’s supposed to. Before Fizzy, LinkedIn would ship code into multiple frontends so they could render the module within the various services. Now… People You May Know’s module can be fetched and embedded into the Profile page, while an Ad can be pulled in from another frontend.
  22. Up until this point, configs were manually edited and deployed. This sucked, big time. When we started, there was no great source of truth for the data we needed. By this point, LinkedIn SRE had a metadata store in place with most of the data and we just needed to fill the gaps. This significantly improved config management and reduced the amount of human error. Teams across the company wanted to build ATS plugins. This was both good and bad: good that we were able to solve some difficult problems, bad that our proxy tier was becoming more complicated. The Mobile team wrote a plugin to detect when to issue redirects to Mobile pages, instead of handling it in the Tomcat frontends. Security started manipulating and enforcing cookies, addressing legacy issues that were traditionally difficult to track down. This also led to the development of QD Proxy…
  23. The key design behind Quick Deploy is that if you develop a service locally, you can initiate a request to that service, either directly or indirectly, by using QD Proxy in LinkedIn’s staging environment. All other components of my request go to the Staging environment.
  24. If I have a minor tweak to make and am not ready to commit, I can set up a QD Proxy profile to route the request to my dev box for the Frontend request, and my Frontend will talk back to QD Proxy for all the backend calls, which will be sent to the backends in Staging. Really freakin’ sweet.
  25. I can also have a frontend in Staging send the backend requests to my dev box based on my QD Proxy profile. Testing features against a complete environment before committing is now possible.
  26. Not because of Fizzy, but from a compounding loop through a single pair of load balancers causing the LB pair CPU to spike and drop requests. This sucks! We were so close to finishing the migrations and now we’ll have to buy new load balancers to handle the load… or will we? That's when HA Proxy came into our lives: “The Reliable, High Performance TCP/HTTP Load Balancer”. Since we already had all the data in Range, we could generate our HA Proxy configs in seconds and deploy them in minutes. This allows us to automate load balancing changes without having to make changes on network gear.
  27. Not because of Fizzy, but from a compounding loop through a single pair of load balancers causing the LB pair CPU to spike and drop requests. This sucks! We were so close to finishing the migration and now we’ll have to buy new load balancers to handle the load… or will we?
  28. That's when HA Proxy came into our lives: “The Reliable, High Performance TCP/HTTP Load Balancer”. Since we already had all the data in Range, we could generate our HA Proxy configs in seconds and deploy them in minutes. This allows us to automate load balancing changes without having to make changes on network gear.
  29. Moving to HA Proxy gave SRE: complete control over how we handled the load balancing; fewer requests to NetOps, which in turn reduced our turnaround time; fewer network hops between ATS and the Frontend; and no single points of failure between L1 Proxy and Fizzy, by eliminating the load balancer.
  30. Month 10: we did it! www.linkedin.com migrated behind L1 Proxy and Fizzy.
  31. Started out at 4000 QPS and 120M members; today we’re nearing 70000 QPS at 225M members. That’s approximately a 4x increase in QPS, year over year. NetScaler is still in play, but now only providing the load balancing to L1 Proxy as well as SSL termination. At this point, we’re able to consider removing the NetScaler altogether. Bug fixes for features in ATS can be rolled out in hours not days, and our acquisitions get all the goodness the rest of the site does.
  32. Stability: when you’re introducing a critical tier and forcing everyone onto it, customer service is key. We spent many hours debugging invalid (and some valid) escalations to help build confidence. Invalid requests: POST requests with no content-length and no body. Connection failed: clients using CONNECT for no reason!
  33. Here are 15 out of 30 outages since we set up L1 Proxy and Fizzy. Each outage reminded us of the impact from even the smallest of changes. There will be mistakes, there will be unexpected surprises. If you’re going to fail, do it quickly and recover fast. Learn from the mistakes, and avoid the repeaters.
  34. We’re doing more with ATS than ever before and the outage rate is not affected by it. Issues with plugins are now caught earlier in the development process. They’re performance tested before going to staging, and there is a deployment schedule with strict guidelines to ensure testing/verification is done before promoting to production. And you can see a downward trend with the Human factor.
  35. We strive to keep our graphs looking good, even when they’re bad… so much so that my team will draw on post-it notes to cover up nasty outages. So how did we do this… with a few different tools.
  36. I suggest summarizing some of the data, unless you’re prepared to consume all the metrics
  37. Great for reading variables (core + plugins) to monitor.
  38. Don’t want to shell out to gather metrics? There’s an HTTP endpoint! Awesome, right?
  39. we take start_time and use it to calculate uptime by subtracting start_time from time.time()
  40. tracking start_time helps highlight crashes, deployments, people doing things to the service that shouldn’t be
  41. monitor trends coming in and going out
  42. We track how close we’re getting to the throttle limit.
  43. that’s bad
  44. Core dump rate: monitoring the file system for core dumps < 24 hours old, alert if > N. TCP states: captured from netstat, watching for spikes in TIME_WAIT. Proc: memory usage, swap usage, file descriptor usage.
  45. They don’t give enough of a picture for the requests we’re processing. If you need to debug a problem, we need a combination of these familiar logging formats… fortunately there’s custom logging
  46. Log request headers, response headers, timing, originWe tail -f the log, aggregate and report timing for given paths (something we don’t get with traffic_logstats)
  47. If someone adds or removes a host from the deployment system’s topology, our config generators will pick it up. We even have some of these configs ready to be headless so changes will be automatically propagated. Salt is an open source remote execution framework written in Python. Since we can write Python modules to do whatever we want, we’re able to create the pre/post hooks necessary for rolling out changes:
      - take host out of rotation
      - bleed traffic
      - confirm it’s out of rotation
      - upgrade packages
      - install configs
      - restart trafficserver
      - verify process is running
      - review log files
      - go back into rotation
      Before, these steps were all done by a human, and ultimately led to mistakes. We now automate these tasks and iterate on them every time we learn how to better the process. (Quick plug on inFormed.)
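The rollout order in this note can be expressed as an ordered list of steps that halts at the first failure. The Python sketch below is only an illustration of that idea: it does not use Salt's API, the step names are taken from the note, and `roll_host` plus the always-succeeding placeholder actions are hypothetical.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[str], bool]]

# Step names copied from the note; the actions themselves are placeholders.
STEP_NAMES = (
    "take host out of rotation",
    "bleed traffic",
    "confirm it's out of rotation",
    "upgrade packages",
    "install configs",
    "restart trafficserver",
    "verify process is running",
    "review log files",
    "go back into rotation",
)


def roll_host(host: str, steps: List[Step]) -> List[str]:
    """Run steps in order for one host and stop at the first failure.

    Stopping early keeps a half-upgraded host out of rotation instead of
    returning it to service.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action(host):
            raise RuntimeError(f"{name!r} failed on {host}; completed: {completed}")
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder actions that always succeed; a real hook would do the work.
    steps = [(name, lambda host: True) for name in STEP_NAMES]
    print(roll_host("proxy01.example", steps))
```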
  48. inFormed is our in-house report of things happening in production. Fed through multiple bridges: JIRA ticketing system, IRC, deployments, whatever.
  49. available in the experimental section of plugins
  50. This is an example
  51. Google DWR was updated in the last couple of weeks and caused Chrome to bomb out on one of our JavaScript files. Within 20 minutes, we had a temporary fix deployed into production to issue the correct Content-Type header.
  52. Why would you need Boom?
  53. Enabled anyone to debug production issues against a single host instead of scanning for your request across 50+ servers. I can pin my requests to a specific L1 Proxy host, through a specific Fizzy host and then to a specific Profile host. Hell yes!
  54. This is even more awesome due to ATS’ non-blocking I/O, which avoids burning up threads on the frontends. Who has looked at LinkedIn’s “View as Source”?
  55. Saving 10% per request at the expense of idle CPU is a huge win!
  56. Being tested in our staging environment as we speak. Potential CDN savings still to be calculated.
  57. traffic_manager was unable to communicate with traffic_server because of a hard-coded file descriptor limit of 32 for the internal healthcheck, so traffic_server would restart every ~2 minutes and you would see an uptime graph like this…
  58. When ATS hit the connection_throttle limit, it would never come out of the throttle until ATS was restarted.
  59. As we started adding our own stats, there was no checking in place to prevent a plugin from creating too many variables/metrics, and the {stat} endpoint was not able to return the results within the given buffer.
    https://git-wip-us.apache.org/repos/asf?p=trafficserver.git&a=search&h=HEAD&st=author&s=briang
    https://git-wip-us.apache.org/repos/asf?p=trafficserver.git&a=search&h=HEAD&st=author&s=manjeshnilange
    https://issues.apache.org/jira/issues/?jql=project%20%3D%20TS%20AND%20reporter%20in%20(manjeshnilange%2C%20briang%2C%20manjesh)%20AND%20updated%20%3E%202011-06-01
  60. https://git-wip-us.apache.org/repos/asf?p=trafficserver.git&a=search&h=HEAD&st=author&s=briang
    https://git-wip-us.apache.org/repos/asf?p=trafficserver.git&a=search&h=HEAD&st=author&s=manjeshnilange
    https://issues.apache.org/jira/issues/?jql=project%20%3D%20TS%20AND%20reporter%20in%20(manjeshnilange%2C%20briang%2C%20manjesh)%20AND%20updated%20%3E%202011-06-01
  61. We have rewritten a few of our plugins to use this new API; one of them is literally half the code it was before. This has enabled us to grow from 2 engineers working on ATS plugins to 6 engineers, and the ramp-up time for plugin development is dramatically reduced. Doug’s comment on atscppapi: I'd say the main feedback is that it's *really* easy to use compared to the raw API. Hides all the grunge, and just lets you focus on your logic. I wrote a transform plugin that would probably have taken me weeks of struggling with virtual I/O buffers and so on in just a few hours, and that included learning the basics of the API. Now that I've done it once, it would be even faster. So far I haven't hit any limitations of the abstraction. It does a excellent job of providing the functionality of the ATS in a way that matches the plugin developer's mental view of the tasks to be performed, rather than going from the mindset of internal ATS implementation. As long as you understand the basic concept of the ATS state machine, writing a new plugin is almost trivial.
  62. Earlier this year, our Media origin for the CDN was nearing capacity. The NetApp filer’s CPU was over 50%, and if we had needed to fail over, we would not have been able to serve Media requests (profile pictures, cached external content). Since we had so much success with ATS as a reverse proxy, why not try using it for its bread and butter... caching.
  63. After a couple of weeks of tweaking config and $30,000 in gear, a caching layer was built on top of our Media origin. We had a 98% cache hit rate, serving requests in < 2 ms. This reduced our NetApp filer’s CPU to less than 1%. The team responsible for the Media origin thought the NetApp CPU graphs were broken (we kind of forgot to mention we finished migrating the traffic over to the new cache). It also saved the company $400,000 by avoiding having to upgrade our filers. Recap… $30,000 of commodity gear + ATS saved LinkedIn $400,000.
  64. Thank you for your time! Come meet the team behind ATS @ LinkedIn during our office hours at 1:15PM. We’re interested in answering any questions around our experiences of solving problems at LinkedIn with Apache Traffic Server.