Slide 2
About the Speaker
I have been with VMware for the last 8 years, working on Java and vSphere
20 years of experience as a Software Engineer/Architect, with the last 15 years focused on Java development
Open source contributions
Prior work with Cisco, Oracle, and banking/trading systems
Authored the following books:
• Virtualizing and Tuning Large Scale Java Platforms
• Enterprise Java Applications Architecture on VMware
Slide 3
Disclaimer
This session may contain product features that are
currently under development.
This session/overview of the new technology represents
no commitment from VMware to deliver these features in
any generally available product.
Features are subject to change, and must not be included in
contracts, purchase orders, or sales agreements of any kind.
Technical feasibility and market demand will affect final delivery.
Pricing and packaging for any new technologies or features
discussed or presented have not been determined.
Slide 4
Agenda
Overview
Design and Sizing Java Platforms
Performance
Best Practices and Tuning
Customer Success Stories
Questions
Slide 6
Conventional Java Platforms
Java platforms are multitier and multi-org
Tiers: Load Balancer Tier, Web Server Tier, Java App Tier, DB Server Tier
Components: Load Balancers, Web Servers, Java Applications, DB Servers
Organizational key stakeholder departments: IT Operations Network Team, IT Operations Server Team, IT Apps – Java Dev Team, IT Ops & Apps Dev Team
Slide 7
Middleware Platform Architecture on vSphere
Shared, always-on infrastructure: VMware vSphere
Shared infrastructure services: Capacity on Demand, High Availability, Dynamic
Application services: Load Balancers, Web Servers, Java Applications, DB Servers
Result: high-uptime, scalable, and dynamic enterprise Java applications
Load balancers as VMs
Web servers
Java application servers
Slide 9
Design and Sizing of Java Platforms on vSphere
Step 1 – Establish Load Profile
From production logs/monitoring reports, measure:
Concurrent Users
Requests Per Second
Peak Response Time
Average Response Time
Establish your response-time SLA

Step 2 – Establish Benchmark
Iterate through the benchmark test until you are satisfied with the load-profile metrics and your intended SLA. After each benchmark iteration you may have to adjust the application configuration. Adjust the vSphere environment to scale out/up in order to achieve your desired number of VMs, number of vCPUs, and RAM configuration.

Step 3 – Size Production Environment
The size of the production environment will have been established in Step 2; hence you either roll out the environment from Step 2 or build a new one based on the numbers established.
Slide 10
Step 2 – Establish Benchmark
DETERMINE HOW MANY VMs
Establish horizontal scalability with a Scale Out Test:
How many VMs do you need to meet your response-time SLAs without reaching 70%-80% saturation of CPU?
Establish your horizontal scalability factor before bottlenecks appear in your application
Test flow: scale out across building-block VMs; if the SLA is OK, the test is complete. If not, investigate the bottlenecked layer (network, storage, application configuration, and vSphere). If the scale-out bottlenecked layer is removed, iterate the scale-out test; if it is a building-block app/VM configuration problem, adjust and iterate.

ESTABLISH BUILDING BLOCK VM
Establish vertical scalability with a Scale Up Test:
Establish how many JVMs go on a VM
Establish how large a VM should be in terms of vCPU and memory
Slide 11
Design and Sizing HotSpot JVMs on vSphere
VM Memory contains Guest OS Memory plus JVM Memory
JVM Memory comprises:
Max Heap (-Xmx) / Initial Heap (-Xms) – the non-direct "heap" memory
Perm Gen (-XX:MaxPermSize)
Java stacks (-Xss per thread)
Other mem – direct native memory, "off-the-heap"
Slide 12
Design and Sizing of HotSpot JVMs on vSphere
Guest OS Memory approx 1G (depends on OS/other processes)
Perm Size is an area additional to the -Xmx (Max Heap) value and is not GC-ed, because it contains class-level information.
"Other mem" is additional memory required for NIO buffers, the JIT code cache, classloaders, socket buffers (receive/send), JNI, and GC internal info.
JVM Memory = JVM Max Heap (-Xmx value) + JVM Perm Size (-XX:MaxPermSize) + NumberOfConcurrentThreads * (-Xss) + "other mem"
VM Memory = Guest OS Memory + JVM Memory
If you have multiple JVMs (N JVMs) on a VM, then:
• VM Memory = Guest OS Memory + N * JVM Memory
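The formulas above can be expressed as a small calculator; the class and method names are illustrative, not part of any VMware or JDK API:

```java
public class JvmSizing {
    // JVM Memory = -Xmx + -XX:MaxPermSize + threads * -Xss + "other mem"
    static int jvmMemoryMb(int maxHeapMb, int permSizeMb,
                           int concurrentThreads, double stackMb, int otherMemMb) {
        return (int) Math.round(maxHeapMb + permSizeMb
                                + concurrentThreads * stackMb + otherMemMb);
    }

    // VM Memory = Guest OS Memory + N * JVM Memory
    static int vmMemoryMb(int guestOsMb, int numJvms, int jvmMemoryMb) {
        return guestOsMb + numJvms * jvmMemoryMb;
    }

    public static void main(String[] args) {
        // Figures from the sizing-example slide: -Xmx4096m, 256m perm,
        // 100 threads at -Xss256k, ~217m other mem, ~500m for the guest OS.
        int jvm = jvmMemoryMb(4096, 256, 100, 0.25, 217);
        int vm = vmMemoryMb(500, 1, jvm);
        System.out.println("JVM ~" + jvm + "m, VM reservation ~" + vm + "m");
    }
}
```

Running this yields roughly the ~4.6GB JVM footprint and ~5GB VM reservation of the sizing example; set the VM's memory reservation to at least the computed VM memory.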
Slide 13
Sizing Example
JVM Max Heap: -Xmx (4096m), with -Xms (4096m)
Java stack: -Xss per thread (256k * 100 threads)
Perm Gen: -XX:MaxPermSize (256m)
Other mem (≈217m)
JVM Memory ≈ 4588m
Guest OS Memory: 500m used by the OS
VM Memory ≈ 5088m; set the memory reservation to 5088m
Slide 14
Larger JVMs for In-Memory Data Management Systems
JVM Max Heap: -Xmx (30g), with -Xms (30g)
Java stack: -Xss per thread (1M * 500 threads)
Perm Gen: -XX:MaxPermSize (0.5g)
Other mem (≈1g)
JVM Memory for SQLFire ≈ 32g
Guest OS Memory: 0.5-1g used by the OS
VM Memory for SQLFire ≈ 34g; set the memory reservation to 34g
Slide 15
NUMA Local Memory with Overhead Adjustment
Inputs: physical RAM on the vSphere host, number of VMs on the host, vSphere RAM overhead (~1% of RAM), and the number of sockets on the host
Per-NUMA-node local memory ≈ (physical RAM - vSphere RAM overhead) / number of sockets
Slide 16
Middleware ESXi Cluster
96GB RAM, 2 sockets, 8 pCPU per socket
Middleware components: 47GB RAM VMs with 8 vCPU
Locator/heartbeat for middleware: DO NOT vMotion
Memory available for all VMs = 96 * 0.98 - 1GB ≈ 94GB
Per-NUMA memory = 94GB / 2 ≈ 47GB
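A sketch of the overhead-adjusted arithmetic above; the 2% fraction and 1GB hypervisor allowance are taken from the slide, and the helper name is illustrative (real per-VM vSphere overhead varies by VM size):

```java
public class NumaSizing {
    // Memory available to VMs after a flat overhead fraction and a fixed
    // hypervisor allowance, divided evenly across NUMA nodes (one per socket).
    static double perNumaNodeGb(double hostRamGb, double overheadFraction,
                                double hypervisorGb, int sockets) {
        double available = hostRamGb * (1 - overheadFraction) - hypervisorGb;
        return available / sockets;
    }

    public static void main(String[] args) {
        // 96GB host, 2 sockets: roughly 47GB of local memory per NUMA node,
        // which is the slide's 47GB building-block VM size.
        System.out.println(perNumaNodeGb(96, 0.02, 1.0, 2));
    }
}
```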
Slide 17
96 GB RAM on server; each NUMA node has 94/2 ≈ 47GB
The ESXi scheduler keeps 8-vCPU VMs with less than 47GB RAM local to a NUMA node
If a VM is sized greater than 47GB or 8 vCPUs, then NUMA interleaving occurs and can cause a ~30% drop in memory throughput performance
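The NUMA fit check above can be written as a small predicate (class and method names are illustrative):

```java
public class NumaFit {
    // True when a VM fits within one NUMA node, per the slide's rule of thumb:
    // no more vCPUs than cores on one socket, and no more RAM than the node's
    // local memory. Oversized VMs trigger NUMA interleaving, which the slide
    // pegs at a ~30% drop in memory throughput.
    static boolean fitsNumaNode(int vmVcpus, double vmRamGb,
                                int coresPerSocket, double nodeLocalGb) {
        return vmVcpus <= coresPerSocket && vmRamGb <= nodeLocalGb;
    }

    public static void main(String[] args) {
        // 8 vCPU / 46GB VM on an 8-core-per-socket host with 47GB per node.
        System.out.println(fitsNumaNode(8, 46.0, 8, 47.0));
    }
}
```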
Slide 18
128 GB RAM on server
The ESXi scheduler keeps 2-vCPU VMs with less than 20GB RAM each local to a NUMA node
A 4-vCPU VM with 40GB RAM is split by ESXi into 2 NUMA clients (available in ESX 4.1)
Slide 19
Java Platform Categories – Category 1
Category 1: 100s to 1000s of JVMs
Smaller JVMs: < 4GB heap, 4.5GB Java process, and 5GB for the VM
vSphere hosts with < 96GB RAM are more suitable: by the time you stack the many JVM instances, you are likely to reach the CPU boundary before you can consume all of the RAM. For example, if instead you chose a vSphere host with 256GB RAM, then 256/4.5GB => 57 JVMs, which would clearly hit the CPU boundary.
Multiple JVMs per VM
Use resource pools to manage different LOBs (Resource Pool 1: Gold LOB 1; Resource Pool 2: Silver LOB 2)
Use 4-socket servers to get more cores
Slide 20
Most Common Sizing and Configuration Question
Option 1 – Scale out VM and JVM (best): more 2-vCPU VMs, each running one 2GB JVM
Option 2 – Scale up JVM heap size (2nd best): keep the same number of VMs and grow each JVM's heap (JVM-1A, JVM-2A)
Option 3 – Scale up VM and JVM (3rd best): fewer, larger VMs (4GB), each running multiple JVMs
Slide 21
What Else to Consider When Sizing?
Workloads can be scaled vertically (Job and Web components within the same JVM) or horizontally (dedicated JVMs for Job and for Web)
Mixed workloads: a job scheduler and a web app require different GC tuning
Job schedulers care about throughput
Web apps care about minimizing latency and response time
You can't have both reduced response time and increased throughput without compromise
Separate the concerns for optimal tuning
Slide 22
Java Platform Categories – Category 2
Category 2: a dozen or so very large JVMs
Fewer JVMs: < 20
Very large JVMs: 32GB to 128GB
Always deploy 1 VM per NUMA node and size it to fit perfectly
1 JVM per VM
Choose 2-socket vSphere hosts, and install ample memory: 128GB to 512GB
Examples are in-memory databases, like SQLFire and GemFire
Apply latency-sensitive best practices: disable interrupt coalescing on the pNIC and vNIC
Use a dedicated vSphere cluster
Use 2-socket servers to get larger NUMA nodes
Slide 23
Java Platform Categories – Category 3
Category 3: Category 1 accessing data from Category 2
Resource Pool 1: Gold LOB 1
Resource Pool 2: Silver LOB 2
Slide 25
Performance Perspective
See the Performance of Enterprise Java Applications on VMware
vSphere 4.1 and SpringSource tc Server at
http://www.vmware.com/resources/techresources/10158
Slide 26
Performance Perspective
Chart from the same study: response time (R/T) and % CPU versus user load, with an 80% CPU threshold marked.
Slide 27
SQLFire vs. Traditional RDBMS
SQLFire scaled 4x compared to RDBMS
Response times of SQLFire are 5x to 30x
faster than RDBMS
Response times on SQLFire are more
stable and constant with increased load
RDBMS response times increase with
increased load
Slide 28
Load Testing SpringTrader Using Client-Server Topology
SpringTrader Application Tier: 4 Application Services VMs (SpringTrader Application Service, SpringTrader Integration Services, Integration Patterns)
SpringTrader Data Tier: SQLFire Member 1, SQLFire Member 2, and redundant Locators
Slide 29
vFabric Reference Architecture Scalability Test
Chart: maximum passing users (up to ~12,000) and scaling relative to 1 Application Services VM versus the number of Application Services VMs (1 to 4)
With this topology: 10,400 user sessions
Slide 30
10k Users Load Test Response Time
Chart: 90th-percentile response time (seconds) per operation versus number of users (0 to 12,000), with four Application Services VMs
Operations: HomePage, Register, Login, DashboardTab, PortfolioTab, TradeTab, GetHoldingsPage, GetOrdersPage, SellOrder, GetQuote, BuyOrder, Logout, MarketSummary
At 10,400 user sessions: approx. 0.25 seconds response time
Slide 32
Most Common VM Size for Java Workloads
2-vCPU VM with 1 JVM for tier-1 production workloads
Maintain this ratio as you scale out or scale up, i.e. 1 JVM : 2 vCPU
Scale-out is preferred over scale-up, but both can work
You can diverge from this ratio for less critical workloads
Building block: 2-vCPU VM, 1 JVM (-Xmx 4096m), approx. 5GB RAM reservation
Slide 33
However for Large JVMs + CMS
For large JVMs: 4+ vCPU VM, 1 JVM (8-128GB)
Start with a 4+ vCPU VM with 1 JVM for tier-1 in-memory data management system types of production workloads
Likely increase the JVM size instead of launching a second JVM instance
A 4+ vCPU VM allows ParallelGCThreads to be allocated 50% of the available vCPUs to the JVM, i.e. 2+ GC threads
The ability to increase ParallelGCThreads is critical to YoungGen scalability for large JVMs
ParallelGCThreads should be allocated 50% of the vCPUs available to the JVM and not more; you want to ensure there are vCPUs available for other transactions
Slide 34
Which GC?
ESXi doesn't care which GC you select, because of the degree of independence of Java from the OS, and of the OS from the hypervisor
Slide 35
GC Policy Types
GC Policy Type – Description
Serial GC:
• Mark, sweep, and compact algorithm
• Both minor and full GC are stop-the-world
• Stop-the-world GC means the application is stopped while GC is executing
• Not a very scalable algorithm
• Suited for smaller (<200MB) JVMs, like client machines
Throughput GC:
• Parallel GC
• Similar to Serial GC, but uses multiple worker threads in parallel to increase throughput
• Both Young and Old Generation collection are multithreaded, but still stop-the-world
• Number of threads allocated by -XX:ParallelGCThreads=<nThreads>
• NOT concurrent: when the GC worker threads run, they pause your application threads. If this is a problem, move to CMS, where GC threads are concurrent.
Slide 36
GC Policy Types
Concurrent GC (CMS):
• Concurrent Mark and Sweep, no compaction
• Concurrent implies that when GC is running it doesn't pause your application threads – this is the key difference from throughput/parallel GC
• Suited for applications that care more about response time than throughput
• CMS does use more heap compared with throughput/parallel GC
• CMS works on the Old generation concurrently, but the Young generation is collected using ParNewGC, a version of the throughput collector
• Has multiple phases: initial mark (short pause), concurrent mark (no pause), pre-cleaning (no pause), re-mark (short pause), concurrent sweep (no pause)
G1:
• Only in Java 7 and mostly experimental; roughly CMS plus compaction
Slide 37
Tuning GC – Art Meets Science!
Either you tune for throughput or for latency, one at the cost of the other
Tuning to reduce latency (e.g. Web): improved R/T and reduced latency impact, at the cost of slightly reduced throughput
Tuning to increase throughput (e.g. Job): improved throughput, at the cost of longer R/T and increased latency impact
Slide 38
Parallel Young Gen and CMS Old Gen
Young Generation minor GC: parallel GC in YoungGen using -XX:+UseParNewGC and -XX:ParallelGCThreads; sized by -Xmn and including survivor spaces S0 and S1
Old Generation major GC: concurrent collection in OldGen using -XX:+UseConcMarkSweepGC; sized as -Xmx minus -Xmn
Timeline: application threads, minor GC threads, and concurrent mark-and-sweep GC threads interleave
Slide 39
High Level GC Tuning Recipe
Step A – Young Gen tuning: measure minor GC duration and frequency; adjust -Xmn (Young Gen size) and/or ParallelGCThreads
Step B – Old Gen tuning: measure major GC duration and frequency; adjust heap space (-Xmx)
Step C – Survivor spaces tuning: adjust -Xmn and/or survivor spaces
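The "measure" steps can be driven from verbose GC logs. A minimal sketch, assuming the classic HotSpot -verbose:gc line format with -XX:+PrintGCTimeStamps (exact log formats vary by JVM version and flags, so the regex is an assumption):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcLogStats {
    // Matches minor-collection lines such as
    // "12.345: [GC (Allocation Failure) 81920K->3325K(314368K), 0.0042660 secs]"
    static final Pattern MINOR =
            Pattern.compile("^(\\d+\\.\\d+): \\[GC .*?(\\d+\\.\\d+) secs\\]");

    // Returns { average minor-GC pause (s), average interval between minor GCs (s) }.
    static double[] minorGcStats(String[] logLines) {
        List<Double> stamps = new ArrayList<>();
        double pauseSum = 0;
        int pauseCount = 0;
        for (String line : logLines) {
            Matcher m = MINOR.matcher(line);
            if (m.find()) {
                stamps.add(Double.parseDouble(m.group(1)));   // timestamp
                pauseSum += Double.parseDouble(m.group(2));   // pause duration
                pauseCount++;
            }
        }
        double avgPause = pauseCount == 0 ? 0 : pauseSum / pauseCount;
        double avgInterval = stamps.size() > 1
                ? (stamps.get(stamps.size() - 1) - stamps.get(0)) / (stamps.size() - 1)
                : 0;
        return new double[] { avgPause, avgInterval };
    }

    public static void main(String[] args) {
        String[] sample = {
            "1.000: [GC (Allocation Failure) 81920K->3325K(314368K), 0.0100000 secs]",
            "3.000: [GC (Allocation Failure) 81920K->3300K(314368K), 0.0300000 secs]"
        };
        double[] s = minorGcStats(sample);
        System.out.printf("avg pause %.4fs, one minor GC every %.1fs%n", s[0], s[1]);
    }
}
```

Long pauses suggest shrinking -Xmn or raising ParallelGCThreads; very frequent minor GCs suggest growing -Xmn, per the recipe above.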
Slide 40
CMS Collector Example
java -Xms30g -Xmx30g -Xmn10g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+ScavengeBeforeFullGC -XX:TargetSurvivorRatio=80 -XX:SurvivorRatio=8
-XX:+UseBiasedLocking -XX:MaxTenuringThreshold=15 -XX:ParallelGCThreads=4
-XX:+UseCompressedOops -XX:+OptimizeStringConcat -XX:+UseCompressedStrings
-XX:+UseStringCache
This JVM configuration scales up and down effectively
-Xmx = -Xms, and -Xmn is 33% of -Xmx
-XX:ParallelGCThreads = minimum 2, but no more than 50% of the vCPUs available to the JVM. NOTE: ideally use it on VMs with 4+ vCPUs; if used on 2-vCPU VMs, drop the -XX:ParallelGCThreads option and let Java select it.
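The ParallelGCThreads rule of thumb above can be sketched as a tiny helper (the class and method names are illustrative, not a JDK API):

```java
import java.util.OptionalInt;

public class GcThreadSizing {
    // Rule of thumb from the slide: give the parallel collector about half the
    // vCPUs available to the JVM, with a floor of 2; on a 2-vCPU VM, omit the
    // flag entirely and let the JVM choose its own default.
    static OptionalInt parallelGcThreads(int vcpus) {
        if (vcpus <= 2) return OptionalInt.empty(); // drop -XX:ParallelGCThreads
        return OptionalInt.of(Math.max(2, vcpus / 2));
    }

    public static void main(String[] args) {
        // An 8-vCPU VM gets -XX:ParallelGCThreads=4 under this rule.
        System.out.println(parallelGcThreads(8));
    }
}
```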
Slide 41
IBM JVM – GC Choice
-Xgcpolicy:optthruput (default)
Performs the mark and sweep operations during garbage collection while the application is paused, to maximize application throughput. Mostly not suitable for multi-CPU machines.
Use for: apps that demand high throughput but are not very sensitive to the occasional long garbage collection pause (e.g. Job workloads)

-Xgcpolicy:optavgpause
Performs the mark and sweep concurrently while the application is running to minimize pause times; this provides the best application response times. There is still a stop-the-world GC, but the pause is significantly shorter. After GC, the app threads help out and sweep objects (concurrent sweep).
Use for: apps sensitive to long latencies, e.g. transaction-based systems where response times are expected to be stable (e.g. Web workloads)

-Xgcpolicy:gencon
Treats short-lived and long-lived objects differently to provide a combination of lower pause times and high application throughput. Before the heap fills up, each app thread helps out and marks objects (concurrent mark).
Use for: latency-sensitive apps where objects in the transaction don't survive beyond the transaction commit (e.g. Web workloads)
Slide 42
Middleware on VMware – Best Practices
Enterprise Java Applications on VMware Best Practices Guide: http://www.vmware.com/resources/techresources/1087
Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs: http://www.vmware.com/resources/techresources/10220
vFabric SQLFire Best Practices Guide: http://www.vmware.com/resources/techresources/10327
vFabric Reference Architecture: http://tinyurl.com/cjkvftt
Slide 43
Middleware on VMware – Best Practices Summary
Follow the design and sizing examples we discussed thus far
Set appropriate memory reservation
Leave HT enabled; size based on vCPU = 1.25 pCPU if needed
RHEL6 and SLES 11 SP1 have tickless kernel that does not rely on
a high frequency interrupt-based timer, and is therefore much
friendlier to virtualized latency-sensitive workloads
Do not overcommit memory
Locator/heartbeat processes should not be vMotion® migrated; otherwise this can lead to network split-brain problems
vMotion over 10Gbps when doing scheduled maintenance
Use Affinity and Anti-Affinity rules to avoid redundant copies on
the same VMware ESX®/ESXi host
Slide 44
Middleware on VMware – Best Practices
Disable NIC interrupt coalescing on physical and virtual NIC
Extremely helpful in reducing latency for latency-sensitive
virtual machines
Disable virtual interrupt coalescing for VMXNET3
• It can lead to some performance penalties for other virtual machines on the
ESXi host, as well as higher CPU utilization to deal with the higher rate of
interrupts from the physical NIC
This implies it is best to use a dedicated ESXi cluster for middleware platforms
• All hosts are configured the same way for latency sensitivity, and this ensures non-middleware workloads, such as other enterprise applications, are not negatively impacted
• This is applicable to Category 2 workloads
Slide 45
Middleware on VMware – Benefits
Flexibility to change compute resources, VM sizes, add more hosts
Ability to apply hardware and OS patches while
minimizing downtime
Create more manageable system through reduced
middleware sprawl
Ability to tune the entire stack within one platform
Ability to monitor the entire stack within one platform
Ability to handle seasonal workloads, commit resources when
they are needed and then remove them when not needed
Slide 47
NewEdge
Virtualized GemFire workload
Multiple geographic active-active datacenters
Multiple terabytes of data kept in memory
1000s of transactions per second
Multiple vSphere clusters, each with 4 vSphere hosts and 8 large 98GB+ JVMs
http://www.vmware.com/files/pdf/customers/VMware-Newedge-12Q4-EN-Case-Study.pdf
Slide 48
Cardinal Health Virtualization Journey
2005-2008 – Consolidation (centralized IT shared service; capital intensive, high response):
< 40% virtual, < 2,000 VMs, < 2,355 physical
Data center optimization: 30 DCs to 2 DCs; transition to blades
< 10% utilization, < 10:1 VM/physical
Low-criticality systems: 8x5 applications

2009-2011 – Internal cloud:
> 58% virtual, > 3,852 VMs, < 3,049 physical
Power remediation, P2Vs on refresh, HW commoditization
15% utilization, 30:1 VM/physical
Business-critical systems: SAP ~382, WebSphere ~290, Unix to Linux ~655

2012-2015 – Cloud resources (variable cost, subscription services):
> 90% virtual, > 8,000 VMs, < 800 physical
Optimizing DCs, internal disaster recovery, metered service offerings (SaaS, PaaS, IaaS)
Shrinking HW footprint, > 50% utilization, > 60:1 VM/physical
Heavy-lifting systems: database servers
Slide 49
Why Virtualize WebSphere on VMware?
DC strategy alignment
• Pooled resources capacity, ~15% utilization
• Elasticity for changing workloads
• Unix to Linux
• Disaster recovery
Simplification and manageability
• High availability for thousands, instead of thousands of high-availability solutions
• Network & system management in the DMZ
Five-year cost savings ~ $6 million
• Hardware savings ~ $660K
• WAS licensing ~ $862K
• Unix to Linux ~ $3.7M
• DMZ ports ~ > $1M
Slide 50
Thank you. Are there any questions?
Emad Benjamin,
ebenjamin@vmware.com
You can get the book here:
https://www.createspace.com/3632131
Slide 51
Second Book
Emad Benjamin,
ebenjamin@vmware.com
Preview chapter available at
VMworld bookstore
You can get the book here:
Safari: http://tinyurl.com/lj8dtjr
Later on Amazon
http://tinyurl.com/kez9trj
Slide 52
Other VMware Activities Related to This Session
HOL:
HOL-SDC-1304
vSphere Performance Optimization
Group Discussions:
VAPP1010-GD
Java with Emad Benjamin