SlideShare una empresa de Scribd logo
1 de 55
© Copyright Azul Systems 2015
© Copyright Azul Systems 2015
@azulsystems
Enabling Java
in
Latency Sensitive
Environments
Matt Schuetze
Azul Director of Product Management
Matt Schuetze, Product Manager, Azul Systems
Utah JUG, Murray UT, November 20, 2014
Dallas Java Developers Meetup
Dallas, Texas
5/3/20151
© Copyright Azul Systems 2015
High Level Agenda
 Intro, jitter vs. JITTER
 Java in a low latency application world
 The (historical) fundamental problems
 What people have done to try to get around
them
 What if the fundamental problems were
eliminated?
 What 2015 looks like for Low Latency Java
developers
 Real World Case Studies
Welcome to all Dallas Java Developers
5/3/20152
© Copyright Azul Systems 20155/3/20153
© Copyright Azul Systems 2015
Is “jitter” a proper word for this?
99%‘ile is
~60 usec
Max is ~30,000%
higher than
“typical”
Answer: no its not jitter at all. Its phase changes.
5/3/20154
© Copyright Azul Systems 2015
About Azul Systems
Vega
C4
 We make scalable Virtual
Machines
 Have built “whatever it
takes to get job done” since
2002
 3 generations of custom
SMP Multi-core HW (Vega)
 Now Pure software for
commodity x86 (Zing)
 Certified OpenJDK (Zulu)
 Known for Low Latency,
Consistent execution, and
Large data set excellence
Zing, Zulu, and everything about Java Virtual Machines
5/3/20155
© Copyright Azul Systems 2015
Java in the low latency world
5/3/20156
© Copyright Azul Systems 2015
Java in a low latency world
 Why do people use Java for low latency apps?
 Are they crazy?
 No. There are good, easy to articulate reasons
 Projected lifetime cost
 Developer productivity
 Time-to-product, Time-to-market, ...
 Leverage, ecosystem, ability to hire
Yep, low latency Java is goin’ down for real…
5/3/20157
© Copyright Azul Systems 2015
e.g. customer answer to:
“Why do you use Java in Algo Trading?”
 Strategies have a shelf life
 We have to keep developing and deploying new
ones
 Only one out of N is actually productive
 Profitability therefore depends on ability to
successfully deploy new strategies, and on the
cost of doing so
 Our developers seem to be able to produce 2x-3x
as much when using a Java environment as they
would with C++ ...
5/3/20158
© Copyright Azul Systems 2015
So what is the problem?
Is Java Slow?
 No
 A good programmer will get roughly the same
speed from both Java and C++
 A bad programmer won’t get you fast code on
either
 The 50%‘ile and 90%‘ile are typically excellent...
 It’s those pesky occasional stutters and stammers
and stalls that are the problem...
 Ever hear of Garbage Collection?
5/3/20159
© Copyright Azul Systems 2015
Java’s Achilles heel
5/3/201510
© Copyright Azul Systems 2015
Stop-The-World Garbage Collection:
How bad is it?
 Let’s ignore the bad multi-second pauses for now...
 Low latency applications regularly experience
“small”, “minor” GC events that range in the 10s of
msec
 Frequency directly related to allocation rate
 In turn, directly related to throughput
 So we have great 50%, 90%. Maybe even 99%
 But 99.9%, 99.99%, Max, all “suck”
 So bad that it affects risk, profitability, service
expectations, etc.
5/3/201511
© Copyright Azul Systems 2015
STW-GC effects in a low latency application
99%‘ile is
~60 usec
Max is ~30,000%
higher than
“typical”
5/3/201512
© Copyright Azul Systems 2015
One way to deal with Stop-The-World GC
I cannot see it, so it cannot see me.
5/3/201513
© Copyright Azul Systems 2015
More Stop-The-World GC avoidance
Time for a bigger rug.
5/3/201514
© Copyright Azul Systems 2015
What do actual low latency developers
do about it?
 They use “Java” instead of Java
 They write “in the Java syntax”
 They avoid allocation as much as possible
 E.g. They build their own object pools for
everything
 They write all the code they use (no 3rd party libs)
 They train developers for their local discipline
 In short: They revert to many of the practices that
hurt productivity. They lose out on much of Java.
5/3/201515
© Copyright Azul Systems 2015
Another way to cope: “Creative Language”
“Guarantee a worst case of 5 msec, 99% of the time”
Translation: “1% will be far worse than worst case”
“Mostly” Concurrent, “Mostly” Incremental
Translation: “Will at times exhibit long monolithic stop-the-
world pauses”
“Fairly Consistent”
Translation: “Will sometimes show results well outside this
range”
“Typical pauses in the tens of milliseconds”
Translation: “Some pauses are much longer than tens of
milliseconds”
Drawn from evil vendor marketing literature
5/3/201516
© Copyright Azul Systems 2015
What do low latency (Java) developers
get for all their effort?
 They still see pauses (usually ranging to tens of
msec)
 But they get fewer (as in less frequent) pauses
 And they see fewer people able to do the job
 And they have to write EVERYTHING themselves
 And they get to debug malloc/free patterns again
 And they can only use memory in certain ways
 ...
 Some call it “fun”... Others “duct tape
engineering”...
5/3/201517
© Copyright Azul Systems 2015
There is a fundamental problem:
Stop-The-World GC mechanisms
are contradictory to the fundamental
requirements of
low latency & low jitter apps
5/3/201518
© Copyright Azul Systems 2015
Unsustainable
Throughout
Sustainable Throughput
The throughput achieved while safely maintaining service levels
5/3/201519
© Copyright Azul Systems 2015
It’s an industry-wide
problem
5/3/201520
© Copyright Azul Systems 2015
It was an industry-wide
problem
It’s 2015... Now we have Zing®.
5/3/201521
© Copyright Azul Systems 2015
The common GC behavior across ALL
currently shipping (non-Zing) JVMs
 ALL use a Monolithic Stop-the-world NewGen
– “small” periodic pauses (small as in 10s of msec)
– pauses more frequent with higher throughput or allocation rates
 Development focus for ALL is on OldGen collectors
– Focus is on trying to address the many-second pause problem
– Usually by sweeping it farther and farther the rug
– “Mostly X” (e.g. “mostly concurrent”) hides the fact that they refer
only to the OldGen part of the collector
– E.g. CMS, G1, Balanced.... all are OldGen-only efforts
 ALL use a Fallback to Full Stop-the-world Collection
– Used to recover when other mechanisms (inevitably) fail
– Also hidden under the term “Mostly”...
5/3/201522
© Copyright Azul Systems 2015
At Azul, STW-GC was addressed head-on
 We decided to focus on the right core problems
– Scale & productivity being limited by responsiveness
– Even “short” GC pauses are considered a problem
 Responsiveness must be unlinked from key
metrics:
– Transaction Rate, Concurrent users, Data set size, etc.
– Heap size, Live Set size, Allocation rate, Mutation rate
– Responsiveness must be continually sustainable
– Can’t ignore “rare but periodic” events
 Eliminate ALL Stop-The-World Fallbacks
– Any STW fallback is a real-world failure
Trivia: Azul as a company founded predominantly around this one premise plaguing then Java servers
5/3/201523
© Copyright Azul Systems 2015
The Zing “C4” Collector
Continuously Concurrent Compacting Collector
 Concurrent, compacting old generation
 Concurrent, compacting new generation
 No stop-the-world fallback
– Always compacts, and always does so concurrently
5/3/201524
© Copyright Azul Systems 2015
Benefits
5/3/201525
© Copyright Azul Systems 20155/3/201526
Stay Responsive
Even when traffic patterns change without warning
7x Load
Increase
30 minute span shows
elevated load long after
event, yet no pauses.
© Copyright Azul Systems 20155/3/201527
Handle Real World traffic patterns
One second view of transactions. Not constant. Not random either. Bursty is normal.
Red line shows where
order pricing arrival rate
would be if constant
© Copyright Azul Systems 20155/3/201528
Achieve Measureable Benefits
 Zing helped LMAX tame GC-related
latency outlier pauses
– Highly-engineered system: 4ms every 30 seconds
down to 1ms every 2 hours
– Less well-tuned system: 50ms every 30 seconds down
to 3ms every 15 minutes
 No more unexpected/unwanted old-gen
pauses caused by external behavior
– CMS STW intra-day, generally ~500ms, gone
– Removed source of backpressure on latency critical
path.
– Pre-Azul these would occur less predictably, but
multiple times a week.
From joint LMAX/Azul talk at QCon London, March 2015
© Copyright Azul Systems 2015
This is not “just Theory”
jHiccup
A tool that measures and reports
(as your application is running)
if your JVM is running all the time
5/3/201529
© Copyright Azul Systems 2015
Discontinuities in Java execution - Easy To Measure
A telco
App with a
bit of a
“problem”
5/3/201530
We call these
“hiccups”
© Copyright Azul Systems 2015
Oracle HotSpot™ (pure
newgen)
Zing
Low latency trading application
5/3/201531
© Copyright Azul Systems 2015
Oracle HotSpot (pure newgen) Zing
Low latency trading application
5/3/201532
© Copyright Azul Systems 2015
Low latency - Drawn to scale
Oracle HotSpot (pure newgen) Zing
5/3/201533
© Copyright Azul Systems 2015
It’s not just for
Low Latency
Just as easy to demonstrate for human-
response-time apps
5/3/201534
© Copyright Azul Systems 2015
Portal Application, slow Ehcache “churn”
Oracle HotSpot CMS, 1GB in an 8GB heap Zing, 1GB in an 8GB heap
5/3/201535
© Copyright Azul Systems 2015
Portal Application, slow Ehcache “churn”
Oracle HotSpot CMS, 1GB in an 8GB heap Zing, 1GB in an 8GB heap
5/3/201536
© Copyright Azul Systems 2015
Portal Application - Drawn to scale
Oracle HotSpot CMS, 1GB in an 8GB heap Zing, 1GB in an 8GB heap
5/3/201537
© Copyright Azul Systems 2015
A Recent E-Commerce Case
Study
5/3/201538
© Copyright Azul Systems 2015
Cyber Monday comes earlier every year…
General trends of real world e-commerce traffic
5/3/201539
© Copyright Azul Systems 2015
Human-Time Real World Latency Case
 Web retail site faces spike loads every year over
Thanksgiving through Cyber Monday.
 Site latency suffers at peak viewing and buying
times, discouraging shoppers and leaving
abandoned carts.
 Hard to predict height of surge, just know its big,
far higher than regular traffic 362 other days of the
year.
 New features like gallery search (Solr/Lucene)
added higher memory footprint, longer GC times.
 Staff spent lots of effort tuning HotSpot.
Specific e-tail customer based in Salt Lake City, Utah.
5/3/201540
© Copyright Azul Systems 2015
Real World Latency Results
 Customer studied Azul, met at Strata, NYC
 Discussion led to Zing as viable alternative
 Customer ran pilot tests with positive results.
Needed one Linux adjustment, otherwise same
server gear.
 POC on customer live system showed better than
expected latency profiles.
 No more GC tuning!
 Experienced a stable and profitable Thanksgiving
2014 weekend.
Timeframe was fall 2014.
5/3/201541
© Copyright Azul Systems 2015
Remind me how GC tuning sucks
5/3/201542
© Copyright Azul Systems 2015
Java GC tuning is “hard”…
 Examples of actual command line GC tuning terms:
Java -Xmx12g -XX:MaxPermSize=64M -XX:PermSize=32M -XX:MaxNewSize=2g
-XX:NewSize=1g -XX:SurvivorRatio=128 -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:MaxTenuringThreshold=0
-XX:CMSInitiatingOccupancyFraction=60 -XX:+CMSParallelRemarkEnabled
-XX:+UseCMSInitiatingOccupancyOnly -XX:ParallelGCThreads=12
-XX:LargePageSizeInBytes=256m …
Java –Xms8g –Xmx8g –Xmn2g -XX:PermSize=64M -XX:MaxPermSize=256M
-XX:-OmitStackTraceInFastThrow -XX:SurvivorRatio=2 -XX:-UseAdaptiveSizePolicy
-XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled
-XX:+CMSParallelRemarkEnabled -XX:+CMSParallelSurvivorRemarkEnabled
-XX:CMSMaxAbortablePrecleanTime=10000 -XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=63 -XX:+UseParNewGC –Xnoclassgc …
5/3/201543
© Copyright Azul Systems 2015
A few GC tuning flags
Source: Word Cloud created by Frank Pavageau in his Devoxx FR 2012 presentation titled “Death by Pauses”
5/3/201544
© Copyright Azul Systems 2015
Complete guide to Zing GC tuning
java -Xmx40g
5/3/201545
© Copyright Azul Systems 2015
Any other problems beyond GC?
5/3/201546
© Copyright Azul Systems 2015
JVMs make many tradeoffs
often trading speed vs. outliers
 Some speed techniques come at extreme outlier
costs
– E.g. (“regular”) biased locking
– E.g. counted loops optimizations
 Deoptimization
 Lock deflation
 Weak References, Soft References, Finalizers
 Time To Safe Point (TTSP)
5/3/201547
© Copyright Azul Systems 2015
Time To Safepoint: Your new #1 enemy
 Many things in a JVM (still) use a global safepoint
 All threads brought to a halt, at a “safe to analyze”
point in code, and then released after work is
done.
 E.g. GC phase shifts, Deoptimization, Class
unloading, Thread Dumps, Lock Deflation, etc.
etc.
 A single thread with a long time-to-safepoint path
can cause an effective pause for all other threads.
Consider this a variation on Amdahl’s law.
 Many code paths in the JVM are long...
Once GC itself was taken care of
5/3/201548
© Copyright Azul Systems 2015
Time To Safepoint (TTSP),
the most common examples
 Array copies and object clone()
 Counted loops
 Many other variants in the runtime...
 Measure, Measure, Measure...
 Zing has a built-in TTSP profiler
 At Azul, the CTO walks around with a 0.5msec
beat down stick...
5/3/201549
© Copyright Azul Systems 2015
OS related stuff
 OS related hiccups tend to dominate once GC
and TTSP are removed as issues.
 Take scheduling pressure seriously (Duh?)
 Hyper-threading (good? bad?)
 Swapping (Duh!)
 Power management
 Transparent Huge Pages (THP).
 ...
Once GC and TTSP are taken care of
5/3/201550
© Copyright Azul Systems 2015
Takeaway: In 2015, “Real” Java is finally
viable for low latency applications
 GC is no longer a dominant issue, even for
outliers
 2-3 msec worst case with “easy” tuning
 < 1 msec worst case is very doable
 No need to code in special ways any more
– You can finally use “real” Java for everything
– You can finally 3rd party libraries without worries
– You can finally use as much memory as you want
– You can finally use regular (good) programmers
5/3/201551
© Copyright Azul Systems 2015
One-liner Takeaway:
Zing: the cure for your
Java hiccups
5/3/201552
© Copyright Azul Systems 2015
Compulsory Marketing Pitch
5/3/201553
© Copyright Azul Systems 2015
Azul Hot Topics
5/3/201554
Zing 15.05 imminent
 1TB heap
 ReadyNow!
 JMX
 Oracle Linux
Zing for Cloud
 Amazon AMIs
 Rackspace
OnMetal compat
 Docker in R&D
Zing for Big Data
 Cloudera CDH5 cert
 Cassandra paper
 Spark is in Zing open
source program
Zulu®
 Azure Gallery
 JSE Embedded
 8u45 in the chute
© Copyright Azul Systems 2015
Q&A and In Closing…
Go get some Zing today!
azul.com/trial
At very least download JHiccup.
azul.com/jhiccup
Grab a Zing Free Trial card.
Let’s talk about best BBQ in Texas
azul.com
5/3/201555
@schuetzematt

Más contenido relacionado

Destacado

Genomic analysis of diffuse intrinsic pontine gliomas identifies three molecu...
Genomic analysis of diffuse intrinsic pontine gliomas identifies three molecu...Genomic analysis of diffuse intrinsic pontine gliomas identifies three molecu...
Genomic analysis of diffuse intrinsic pontine gliomas identifies three molecu...
Joshua Mangerel
 

Destacado (13)

Psl
Psl Psl
Psl
 
Silicon Valley JUG: JVM Mechanics
Silicon Valley JUG: JVM MechanicsSilicon Valley JUG: JVM Mechanics
Silicon Valley JUG: JVM Mechanics
 
Portfolio
PortfolioPortfolio
Portfolio
 
Internet Governance by Its History (1966-2000)
Internet Governance by Its History (1966-2000)Internet Governance by Its History (1966-2000)
Internet Governance by Its History (1966-2000)
 
О себе
О себеО себе
О себе
 
Кружок "Математическая мозаика"
Кружок "Математическая мозаика"Кружок "Математическая мозаика"
Кружок "Математическая мозаика"
 
Genomic analysis of diffuse intrinsic pontine gliomas identifies three molecu...
Genomic analysis of diffuse intrinsic pontine gliomas identifies three molecu...Genomic analysis of diffuse intrinsic pontine gliomas identifies three molecu...
Genomic analysis of diffuse intrinsic pontine gliomas identifies three molecu...
 
In what way does your media product use , develop or challenge forms and conv...
In what way does your media product use , develop or challenge forms and conv...In what way does your media product use , develop or challenge forms and conv...
In what way does your media product use , develop or challenge forms and conv...
 
JVM Language Summit: Object layout workshop
JVM Language Summit: Object layout workshopJVM Language Summit: Object layout workshop
JVM Language Summit: Object layout workshop
 
Git 들여다보기(1)
Git 들여다보기(1)Git 들여다보기(1)
Git 들여다보기(1)
 
The Justification for an Analysis of Stakeholder Input in the National Inform...
The Justification for an Analysis of Stakeholder Input in the National Inform...The Justification for an Analysis of Stakeholder Input in the National Inform...
The Justification for an Analysis of Stakeholder Input in the National Inform...
 
Installation manual
Installation manualInstallation manual
Installation manual
 
Ch11 - Operating Systems
Ch11 - Operating SystemsCh11 - Operating Systems
Ch11 - Operating Systems
 

Similar a Enabling Java in Latency Sensitive Environments - Dallas JUG April 2015

The Why and How of Continuous Delivery
The Why and How of Continuous DeliveryThe Why and How of Continuous Delivery
The Why and How of Continuous Delivery
Nigel McNie
 
Asynchronous Event Streams – when java.util.stream met org.osgi.util.promise!...
Asynchronous Event Streams – when java.util.stream met org.osgi.util.promise!...Asynchronous Event Streams – when java.util.stream met org.osgi.util.promise!...
Asynchronous Event Streams – when java.util.stream met org.osgi.util.promise!...
mfrancis
 
Azul yandexjune010
Azul yandexjune010Azul yandexjune010
Azul yandexjune010
yaevents
 

Similar a Enabling Java in Latency Sensitive Environments - Dallas JUG April 2015 (20)

How NOT to Measure Latency
How NOT to Measure LatencyHow NOT to Measure Latency
How NOT to Measure Latency
 
Enabling Java in Latency-Sensitive Applications
Enabling Java in Latency-Sensitive ApplicationsEnabling Java in Latency-Sensitive Applications
Enabling Java in Latency-Sensitive Applications
 
DotCMS Bootcamp: Enabling Java in Latency Sensitivie Environments
DotCMS Bootcamp: Enabling Java in Latency Sensitivie EnvironmentsDotCMS Bootcamp: Enabling Java in Latency Sensitivie Environments
DotCMS Bootcamp: Enabling Java in Latency Sensitivie Environments
 
QCon London: Low latency Java in the real world - LMAX Exchange and the Zing JVM
QCon London: Low latency Java in the real world - LMAX Exchange and the Zing JVMQCon London: Low latency Java in the real world - LMAX Exchange and the Zing JVM
QCon London: Low latency Java in the real world - LMAX Exchange and the Zing JVM
 
Enabling Java in Latency Sensitive Applications by Gil Tene, CTO, Azul Systems
Enabling Java in Latency Sensitive Applications by Gil Tene, CTO, Azul SystemsEnabling Java in Latency Sensitive Applications by Gil Tene, CTO, Azul Systems
Enabling Java in Latency Sensitive Applications by Gil Tene, CTO, Azul Systems
 
JVMCON Java in the 21st Century: are you thinking far enough ahead?
JVMCON Java in the 21st Century: are you thinking far enough ahead?JVMCON Java in the 21st Century: are you thinking far enough ahead?
JVMCON Java in the 21st Century: are you thinking far enough ahead?
 
Vertafore: Database Evaluation - Selecting Apache Cassandra
Vertafore: Database Evaluation - Selecting Apache CassandraVertafore: Database Evaluation - Selecting Apache Cassandra
Vertafore: Database Evaluation - Selecting Apache Cassandra
 
The Cloud Native Journey
The Cloud Native JourneyThe Cloud Native Journey
The Cloud Native Journey
 
Shift Happens - Rapidly Rolling Forward During Production Failure
Shift Happens - Rapidly Rolling Forward During Production FailureShift Happens - Rapidly Rolling Forward During Production Failure
Shift Happens - Rapidly Rolling Forward During Production Failure
 
Software Development Methodologies By E2Logy
Software Development Methodologies By E2LogySoftware Development Methodologies By E2Logy
Software Development Methodologies By E2Logy
 
Silos Are For Farmers, Not IT
Silos Are For Farmers, Not ITSilos Are For Farmers, Not IT
Silos Are For Farmers, Not IT
 
Real World Problem Solving Using Application Performance Management 10
Real World Problem Solving Using Application Performance Management 10Real World Problem Solving Using Application Performance Management 10
Real World Problem Solving Using Application Performance Management 10
 
Understanding Hardware Transactional Memory
Understanding Hardware Transactional MemoryUnderstanding Hardware Transactional Memory
Understanding Hardware Transactional Memory
 
Cloud Native Empowered Culture
Cloud Native Empowered Culture Cloud Native Empowered Culture
Cloud Native Empowered Culture
 
The Why and How of Continuous Delivery
The Why and How of Continuous DeliveryThe Why and How of Continuous Delivery
The Why and How of Continuous Delivery
 
Asynchronous Event Streams – when java.util.stream met org.osgi.util.promise!...
Asynchronous Event Streams – when java.util.stream met org.osgi.util.promise!...Asynchronous Event Streams – when java.util.stream met org.osgi.util.promise!...
Asynchronous Event Streams – when java.util.stream met org.osgi.util.promise!...
 
Azul yandexjune010
Azul yandexjune010Azul yandexjune010
Azul yandexjune010
 
5 process synchronization
5 process synchronization5 process synchronization
5 process synchronization
 
Dev talks Cluj 2018 : Java in the 21 Century: Are you thinking far enough ahead?
Dev talks Cluj 2018 : Java in the 21 Century: Are you thinking far enough ahead?Dev talks Cluj 2018 : Java in the 21 Century: Are you thinking far enough ahead?
Dev talks Cluj 2018 : Java in the 21 Century: Are you thinking far enough ahead?
 
Comparison of various streaming technologies
Comparison of various streaming technologiesComparison of various streaming technologies
Comparison of various streaming technologies
 

Último

introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 

Último (20)

Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxBUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
Pharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodologyPharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodology
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 

Enabling Java in Latency Sensitive Environments - Dallas JUG April 2015

  • 1. © Copyright Azul Systems 2015 © Copyright Azul Systems 2015 @azulsystems Enabling Java in Latency Sensitive Environments Matt Schuetze Azul Director of Product Management Matt Schuetze, Product Manager, Azul Systems Utah JUG, Murray UT, November 20, 2014 Dallas Java Developers Meetup Dallas, Texas 5/3/20151
  • 2. © Copyright Azul Systems 2015 High Level Agenda  Intro, jitter vs. JITTER  Java in a low latency application world  The (historical) fundamental problems  What people have done to try to get around them  What if the fundamental problems were eliminated?  What 2015 looks like for Low Latency Java developers  Real World Case Studies Welcome to all Dallas Java Developers 5/3/20152
  • 3. © Copyright Azul Systems 20155/3/20153
  • 4. © Copyright Azul Systems 2015 Is “jitter” a proper word for this? 99%‘ile is ~60 usec Max is ~30,000% higher than “typical” Answer: no its not jitter at all. Its phase changes. 5/3/20154
  • 5. © Copyright Azul Systems 2015 About Azul Systems Vega C4  We make scalable Virtual Machines  Have built “whatever it takes to get job done” since 2002  3 generations of custom SMP Multi-core HW (Vega)  Now Pure software for commodity x86 (Zing)  Certified OpenJDK (Zulu)  Known for Low Latency, Consistent execution, and Large data set excellence Zing, Zulu, and everything about Java Virtual Machines 5/3/20155
  • 6. © Copyright Azul Systems 2015 Java in the low latency world 5/3/20156
  • 7. © Copyright Azul Systems 2015 Java in a low latency world  Why do people use Java for low latency apps?  Are they crazy?  No. There are good, easy to articulate reasons  Projected lifetime cost  Developer productivity  Time-to-product, Time-to-market, ...  Leverage, ecosystem, ability to hire Yep, low latency Java is goin’ down for real… 5/3/20157
  • 8. © Copyright Azul Systems 2015 e.g. customer answer to: “Why do you use Java in Algo Trading?”  Strategies have a shelf life  We have to keep developing and deploying new ones  Only one out of N is actually productive  Profitability therefore depends on ability to successfully deploy new strategies, and on the cost of doing so  Our developers seem to be able to produce 2x-3x as much when using a Java environment as they would with C++ ... 5/3/20158
  • 9. © Copyright Azul Systems 2015 So what is the problem? Is Java Slow?  No  A good programmer will get roughly the same speed from both Java and C++  A bad programmer won’t get you fast code on either  The 50%‘ile and 90%‘ile are typically excellent...  It’s those pesky occasional stutters and stammers and stalls that are the problem...  Ever hear of Garbage Collection? 5/3/20159
  • 10. © Copyright Azul Systems 2015 Java’s Achilles heel 5/3/201510
  • 11. © Copyright Azul Systems 2015 Stop-The-World Garbage Collection: How bad is it?  Let’s ignore the bad multi-second pauses for now...  Low latency applications regularly experience “small”, “minor” GC events that range in the 10s of msec  Frequency directly related to allocation rate  In turn, directly related to throughput  So we have great 50%, 90%. Maybe even 99%  But 99.9%, 99.99%, Max, all “suck”  So bad that it affects risk, profitability, service expectations, etc. 5/3/201511
  • 12. © Copyright Azul Systems 2015 STW-GC effects in a low latency application 99%‘ile is ~60 usec Max is ~30,000% higher than “typical” 5/3/201512
  • 13. © Copyright Azul Systems 2015 One way to deal with Stop-The-World GC I cannot see it, so it cannot see me. 5/3/201513
  • 14. © Copyright Azul Systems 2015 More Stop-The-World GC avoidance Time for a bigger rug. 5/3/201514
  • 15. © Copyright Azul Systems 2015 What do actual low latency developers do about it?  They use “Java” instead of Java  They write “in the Java syntax”  They avoid allocation as much as possible  E.g. They build their own object pools for everything  They write all the code they use (no 3rd party libs)  They train developers for their local discipline  In short: They revert to many of the practices that hurt productivity. They lose out on much of Java. 5/3/201515
  • 16. © Copyright Azul Systems 2015 Another way to cope: “Creative Language” “Guarantee a worst case of 5 msec, 99% of the time” Translation: “1% will be far worse than worst case” “Mostly” Concurrent, “Mostly” Incremental Translation: “Will at times exhibit long monolithic stop-the- world pauses” “Fairly Consistent” Translation: “Will sometimes show results well outside this range” “Typical pauses in the tens of milliseconds” Translation: “Some pauses are much longer than tens of milliseconds” Drawn from evil vendor marketing literature 5/3/201516
  • 17. © Copyright Azul Systems 2015 What do low latency (Java) developers get for all their effort?  They still see pauses (usually ranging to tens of msec)  But they get fewer (as in less frequent) pauses  And they see fewer people able to do the job  And they have to write EVERYTHING themselves  And they get to debug malloc/free patterns again  And they can only use memory in certain ways  ...  Some call it “fun”... Others “duct tape engineering”... 5/3/201517
  • 18. © Copyright Azul Systems 2015 There is a fundamental problem: Stop-The-World GC mechanisms are contradictory to the fundamental requirements of low latency & low jitter apps 5/3/201518
  • 19. © Copyright Azul Systems 2015 Unsustainable Throughout Sustainable Throughput The throughput achieved while safely maintaining service levels 5/3/201519
  • 20. © Copyright Azul Systems 2015 It’s an industry-wide problem 5/3/201520
  • 21. © Copyright Azul Systems 2015 It was an industry-wide problem It’s 2015... Now we have Zing®. 5/3/201521
  • 22. © Copyright Azul Systems 2015 The common GC behavior across ALL currently shipping (non-Zing) JVMs  ALL use a Monolithic Stop-the-world NewGen – “small” periodic pauses (small as in 10s of msec) – pauses more frequent with higher throughput or allocation rates  Development focus for ALL is on OldGen collectors – Focus is on trying to address the many-second pause problem – Usually by sweeping it farther and farther the rug – “Mostly X” (e.g. “mostly concurrent”) hides the fact that they refer only to the OldGen part of the collector – E.g. CMS, G1, Balanced.... all are OldGen-only efforts  ALL use a Fallback to Full Stop-the-world Collection – Used to recover when other mechanisms (inevitably) fail – Also hidden under the term “Mostly”... 5/3/201522
  • 23. © Copyright Azul Systems 2015 At Azul, STW-GC was addressed head-on  We decided to focus on the right core problems – Scale & productivity being limited by responsiveness – Even “short” GC pauses are considered a problem  Responsiveness must be unlinked from key metrics: – Transaction Rate, Concurrent users, Data set size, etc. – Heap size, Live Set size, Allocation rate, Mutation rate – Responsiveness must be continually sustainable – Can’t ignore “rare but periodic” events  Eliminate ALL Stop-The-World Fallbacks – Any STW fallback is a real-world failure Trivia: Azul as a company founded predominantly around this one premise plaguing then Java servers 5/3/201523
  • 24. © Copyright Azul Systems 2015 The Zing “C4” Collector Continuously Concurrent Compacting Collector  Concurrent, compacting old generation  Concurrent, compacting new generation  No stop-the-world fallback – Always compacts, and always does so concurrently 5/3/201524
  • 25. © Copyright Azul Systems 2015 Benefits 5/3/201525
  • 26. © Copyright Azul Systems 20155/3/201526 Stay Responsive Even when traffic patterns change without warning 7x Load Increase 30 minute span shows elevated load long after event, yet no pauses.
  • 27. © Copyright Azul Systems 20155/3/201527 Handle Real World traffic patterns One second view of transactions. Not constant. Not random either. Bursty is normal. Red line shows where order pricing arrival rate would be if constant
  • 28. © Copyright Azul Systems 20155/3/201528 Achieve Measureable Benefits  Zing helped LMAX tame GC-related latency outlier pauses – Highly-engineered system: 4ms every 30 seconds down to 1ms every 2 hours – Less well-tuned system: 50ms every 30 seconds down to 3ms every 15 minutes  No more unexpected/unwanted old-gen pauses caused by external behavior – CMS STW intra-day, generally ~500ms, gone – Removed source of backpressure on latency critical path. – Pre-Azul these would occur less predictably, but multiple times a week. From joint LMAX/Azul talk at QCon London, March 2015
  • 29. © Copyright Azul Systems 2015 This is not “just Theory” jHiccup A tool that measures and reports (as your application is running) if your JVM is running all the time 5/3/201529
  • 30. © Copyright Azul Systems 2015 Discontinuities in Java execution - Easy To Measure A telco App with a bit of a “problem” 5/3/201530 We call these “hiccups”
  • 31. © Copyright Azul Systems 2015 Oracle HotSpot™ (pure newgen) Zing Low latency trading application 5/3/201531
  • 32. © Copyright Azul Systems 2015 Oracle HotSpot (pure newgen) Zing Low latency trading application 5/3/201532
  • 33. © Copyright Azul Systems 2015 Low latency - Drawn to scale Oracle HotSpot (pure newgen) Zing 5/3/201533
  • 34. © Copyright Azul Systems 2015 It’s not just for Low Latency Just as easy to demonstrate for human- response-time apps 5/3/201534
  • 35. © Copyright Azul Systems 2015 Portal Application, slow Ehcache “churn” Oracle HotSpot CMS, 1GB in an 8GB heap Zing, 1GB in an 8GB heap 5/3/201535
  • 36. © Copyright Azul Systems 2015 Portal Application, slow Ehcache “churn” Oracle HotSpot CMS, 1GB in an 8GB heap Zing, 1GB in an 8GB heap 5/3/201536
  • 37. © Copyright Azul Systems 2015 Portal Application - Drawn to scale Oracle HotSpot CMS, 1GB in an 8GB heap Zing, 1GB in an 8GB heap 5/3/201537
  • 38. © Copyright Azul Systems 2015 A Recent E-Commerce Case Study 5/3/201538
  • 39. © Copyright Azul Systems 2015 Cyber Monday comes earlier every year… General trends of real world e-commerce traffic 5/3/201539
  • 40. © Copyright Azul Systems 2015 Human-Time Real World Latency Case  Web retail site faces spike loads every year over Thanksgiving through Cyber Monday.  Site latency suffers at peak viewing and buying times, discouraging shoppers and leaving abandoned carts.  Hard to predict height of surge, just know its big, far higher than regular traffic 362 other days of the year.  New features like gallery search (Solr/Lucene) added higher memory footprint, longer GC times.  Staff spent lots of effort tuning HotSpot. Specific e-tail customer based in Salt Lake City, Utah. 5/3/201540
  • 41. © Copyright Azul Systems 2015 Real World Latency Results  Customer studied Azul, met at Strata, NYC  Discussion led to Zing as viable alternative  Customer ran pilot tests with positive results. Needed one Linux adjustment, otherwise same server gear.  POC on customer live system showed better than expected latency profiles.  No more GC tuning!  Experienced a stable and profitable Thanksgiving 2014 weekend. Timeframe was fall 2014. 5/3/201541
  • 42. © Copyright Azul Systems 2015 Remind me how GC tuning sucks 5/3/201542
  • 43. © Copyright Azul Systems 2015 Java GC tuning is “hard”…  Examples of actual command line GC tuning terms: Java -Xmx12g -XX:MaxPermSize=64M -XX:PermSize=32M -XX:MaxNewSize=2g -XX:NewSize=1g -XX:SurvivorRatio=128 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:MaxTenuringThreshold=0 -XX:CMSInitiatingOccupancyFraction=60 -XX:+CMSParallelRemarkEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:ParallelGCThreads=12 -XX:LargePageSizeInBytes=256m … Java –Xms8g –Xmx8g –Xmn2g -XX:PermSize=64M -XX:MaxPermSize=256M -XX:-OmitStackTraceInFastThrow -XX:SurvivorRatio=2 -XX:-UseAdaptiveSizePolicy -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSParallelRemarkEnabled -XX:+CMSParallelSurvivorRemarkEnabled -XX:CMSMaxAbortablePrecleanTime=10000 -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=63 -XX:+UseParNewGC –Xnoclassgc … 5/3/201543
  • 44. © Copyright Azul Systems 2015 A few GC tuning flags Source: Word Cloud created by Frank Pavageau in his Devoxx FR 2012 presentation titled “Death by Pauses” 5/3/201544
  • 45. © Copyright Azul Systems 2015 Complete guide to Zing GC tuning java -Xmx40g 5/3/201545
  • 46. © Copyright Azul Systems 2015 Any other problems beyond GC? 5/3/201546
  • 47. © Copyright Azul Systems 2015 JVMs make many tradeoffs often trading speed vs. outliers  Some speed techniques come at extreme outlier costs – E.g. (“regular”) biased locking – E.g. counted loops optimizations  Deoptimization  Lock deflation  Weak References, Soft References, Finalizers  Time To Safe Point (TTSP) 5/3/201547
  • 48. © Copyright Azul Systems 2015 Time To Safepoint: Your new #1 enemy  Many things in a JVM (still) use a global safepoint  All threads brought to a halt, at a “safe to analyze” point in code, and then released after work is done.  E.g. GC phase shifts, Deoptimization, Class unloading, Thread Dumps, Lock Deflation, etc. etc.  A single thread with a long time-to-safepoint path can cause an effective pause for all other threads. Consider this a variation on Amdahl’s law.  Many code paths in the JVM are long... Once GC itself was taken care of 5/3/201548
  • 49. © Copyright Azul Systems 2015 Time To Safepoint (TTSP), the most common examples  Array copies and object clone()  Counted loops  Many other variants in the runtime...  Measure, Measure, Measure...  Zing has a built-in TTSP profiler  At Azul, the CTO walks around with a 0.5msec beat down stick... 5/3/201549
  • 50. © Copyright Azul Systems 2015 OS related stuff  OS related hiccups tend to dominate once GC and TTSP are removed as issues.  Take scheduling pressure seriously (Duh?)  Hyper-threading (good? bad?)  Swapping (Duh!)  Power management  Transparent Huge Pages (THP).  ... Once GC and TTSP are taken care of 5/3/201550
  • 51. © Copyright Azul Systems 2015 Takeaway: In 2015, “Real” Java is finally viable for low latency applications  GC is no longer a dominant issue, even for outliers  2-3 msec worst case with “easy” tuning  < 1 msec worst case is very doable  No need to code in special ways any more – You can finally use “real” Java for everything – You can finally 3rd party libraries without worries – You can finally use as much memory as you want – You can finally use regular (good) programmers 5/3/201551
  • 52. © Copyright Azul Systems 2015 One-liner Takeaway: Zing: the cure for your Java hiccups 5/3/201552
  • 53. © Copyright Azul Systems 2015 Compulsory Marketing Pitch 5/3/201553
  • 54. © Copyright Azul Systems 2015 Azul Hot Topics 5/3/201554 Zing 15.05 imminent  1TB heap  ReadyNow!  JMX  Oracle Linux Zing for Cloud  Amazon AMIs  Rackspace OnMetal compat  Docker in R&D Zing for Big Data  Cloudera CDH5 cert  Cassandra paper  Spark is in Zing open source program Zulu®  Azure Gallery  JSE Embedded  8u45 in the chute
  • 55. © Copyright Azul Systems 2015 Q&A and In Closing… Go get some Zing today! azul.com/trial At very least download JHiccup. azul.com/jhiccup Grab a Zing Free Trial card. Let’s talk about best BBQ in Texas azul.com 5/3/201555 @schuetzematt

Notas del editor

  1. Main chart is half an hour on one day. Inset is exchange rate between swiss franc and euro. Point of slide is no real warning. Events happened, so traffic moves rapidly. Ie. nothing was happening then a 7x times increase, then higher floor throughout the day. No warning. Even though, latencies stayed stable.
  2. A typical day’s one second worth of data. X axis is which millisecond within every second. Arrival rate is not smooth. Every second. Every 250ms Every 100 seconds Bursty traffic the norm. Red line shows average if arrival rate were constant