Amazon EC2 provides a broad selection of instance types to deliver high performance for a diverse mix of applications. In this session, we review the drivers of system performance and discuss in depth how Amazon EC2 instances deliver system performance while also providing elasticity and complete control over your infrastructure. We also detail best practices and share performance tips for getting the most out of your Amazon EC2 instances.
5. What to Expect from the Session
• Defining system performance and how it is characterized for different workloads
• How Amazon EC2 instances deliver performance while providing flexibility and agility
• How to make the most of your EC2 instance experience through the lens of several instance types
7. Hiring a Server
• Servers are hired to do jobs
• Performance is measured differently depending on the job
8. Defining Performance: Perspective Matters
• What performance means depends on your perspective (see the sketch below):
  – Response time
  – Throughput
  – Consistency
[Diagram: the stack a request traverses: workload, application, system libraries, system calls, kernel, devices]
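To make these three perspectives concrete, here is a minimal sketch (not from the session) that summarizes a set of request latencies as response time, throughput, and a tail-latency view of consistency; the latency values and the one-second window are hypothetical stand-ins for real measurements.

```python
# Minimal sketch: summarizing request latencies from three perspectives.
# The latency values below are hypothetical stand-ins for real measurements.
import statistics

latencies_ms = [12.1, 11.8, 13.0, 12.4, 55.2, 12.2, 11.9, 12.5]  # one entry per request
window_seconds = 1.0                                              # measurement window

response_time_p50 = statistics.median(latencies_ms)              # typical response time
throughput_rps = len(latencies_ms) / window_seconds              # completed requests per second
consistency_p99 = sorted(latencies_ms)[int(0.99 * (len(latencies_ms) - 1))]  # tail latency

print(f"p50={response_time_p50:.1f} ms  throughput={throughput_rps:.0f} req/s  p99={consistency_p99:.1f} ms")
```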
9. Performance Factors

Resource          | Performance factors                                             | Key indicators
CPU               | Sockets, number of cores, clock frequency, bursting capability | CPU utilization, run queue length
Memory            | Memory capacity                                                 | Free memory, anonymous paging, thread swapping
Network interface | Max bandwidth, packet rate                                      | Receive and transmit throughput relative to max bandwidth
Disks             | I/O operations per second (IOPS), throughput                    | Wait queue length, device utilization, device errors
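As a rough illustration of how a couple of these key indicators surface inside a Linux guest, the following sketch reads the run queue length and available memory from /proc; it assumes a reasonably recent kernel (MemAvailable appears in 3.14+) and is not part of the original session.

```python
# Sketch: reading a few of the key indicators above from /proc on a Linux guest.
def run_queue_length():
    # /proc/loadavg looks like "0.16 0.05 0.06 1/180 12345"; the 4th field is running/total tasks.
    with open("/proc/loadavg") as f:
        running, _, _ = f.read().split()[3].partition("/")
    return int(running)

def available_memory_kib():
    # /proc/meminfo lines look like "MemAvailable:  7890123 kB".
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])

print("runnable tasks:", run_queue_length())
print("available memory (KiB):", available_memory_kib())
```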
10. Resource Utilization
• For a given level of performance, how efficiently are resources being used?
• A resource at 100% utilization can’t accept any more work
• Low utilization can indicate that more resources are being purchased than needed
11. Example: Web Application
• MediaWiki installed on Apache with 140 pages of content
• Load increased in intervals over time
16. Instance Selection = Performance Tuning
• Picking an instance is tantamount to resource performance tuning
• Give back instances as easily as you can acquire new ones
• Find an ideal instance type and workload combination
18. CPU Instructions and Protection Levels
• The CPU has at least two protection levels
• Privileged instructions can’t be executed in user mode, protecting the system; applications use system calls to enter the kernel
[Diagram: application running in user mode, kernel running at the privileged level]
20. x86 CPU Virtualization: Prior to Intel VT-x
• Binary translation for privileged instructions
• Para-virtualization (PV)
• PV requires going through the VMM, adding latency
• Applications that are system call-bound are most affected
[Diagram: application and kernel running above the VMM, with PV calls from the kernel into the VMM]
21. x86 CPU Virtualization: After Intel VT-x
• Hardware-assisted virtualization (HVM)
• PV-HVM uses PV drivers opportunistically for operations that are slow when emulated, e.g., network and block I/O
[Diagram: application and kernel running above the VMM, with PV-HVM drivers between the kernel and the VMM]
23. Timekeeping Explained
• Timekeeping in an instance is deceptively hard
• gettimeofday(), clock_gettime(), QueryPerformanceCounter()
• The TSC (time stamp counter)
  – A CPU counter, accessible from user space
  – Requires calibration and vDSO support
  – Invariant on Sandy Bridge+ processors
• Xen pvclock does not support vDSO
• On current-generation instances, use the TSC as the clocksource (see the sketch below)
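On a Linux guest the active clocksource is exposed through sysfs; a minimal sketch for checking it, and (as root) switching to tsc, assuming the standard /sys layout:

```python
# Sketch: inspect and (as root) set the kernel clocksource on a Linux guest.
BASE = "/sys/devices/system/clocksource/clocksource0"

with open(f"{BASE}/current_clocksource") as f:
    print("current:", f.read().strip())    # e.g. "xen" or "tsc"
with open(f"{BASE}/available_clocksource") as f:
    print("available:", f.read().strip())  # e.g. "xen tsc hpet acpi_pm"

# Switching requires root and a CPU with an invariant TSC:
# with open(f"{BASE}/current_clocksource", "w") as f:
#     f.write("tsc")
```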
25. CPU Performance and Scheduling
• The hypervisor ensures every guest receives CPU time
• Fixed allocation
  – Uncapped vs. capped
• Variable allocation
• Different schedulers can be used depending on the goal
  – Fairness
  – Response time / deadline
  – Shares
27. What’s New in C4: P-state and C-state Control
• By entering deeper idle states, non-idle cores can achieve up to 300 MHz higher clock frequencies
• But deeper idle states take longer to exit, so they may not be appropriate for latency-sensitive workloads (see the sketch below)
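As a rough way to see how deep the kernel is allowed to idle, the sketch below reads the intel_idle limit and the exposed C-states from sysfs; it assumes a Linux guest using the intel_idle driver. The usual lever for latency-sensitive workloads is limiting C-states at boot (e.g., the intel_idle.max_cstate=1 kernel parameter).

```python
# Sketch: inspect the allowed C-states on a Linux guest using the intel_idle driver.
import glob

with open("/sys/module/intel_idle/parameters/max_cstate") as f:
    print("intel_idle.max_cstate =", f.read().strip())

# Each cpuidle state directory describes one idle state and its exit latency.
for state in sorted(glob.glob("/sys/devices/system/cpu/cpu0/cpuidle/state*")):
    with open(f"{state}/name") as n, open(f"{state}/latency") as l:
        print(n.read().strip(), "exit latency:", l.read().strip(), "us")
```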
28. Tip: P-state Control for AVX2
• If an application makes heavy use of AVX2 on all cores, the processor may attempt to draw more power than its rated limit
• The processor will transparently reduce its frequency
• Frequent changes in CPU frequency can slow an application; a sketch for monitoring per-core frequency follows
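To spot this kind of throttling from inside the guest, one simple approach is to sample the per-core clock reported in /proc/cpuinfo; the sampling interval and sample count below are arbitrary choices, not recommendations from the session.

```python
# Sketch: sample per-core clock frequency to spot AVX2-induced throttling.
import time

def core_mhz():
    freqs = []
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("cpu MHz"):
                freqs.append(float(line.split(":")[1]))
    return freqs

for _ in range(10):                 # take ten one-second samples
    freqs = core_mhz()
    print(f"min={min(freqs):7.1f} MHz  max={max(freqs):7.1f} MHz")
    time.sleep(1)
```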
29. Review: T2 Instances
• Lowest-cost EC2 instance at $0.013 per hour
• Burstable performance
• Fixed allocation enforced with CPU credits

Model     | vCPU | CPU credits / hour | Memory (GiB) | Storage
t2.micro  | 1    | 6                  | 1            | EBS only
t2.small  | 1    | 12                 | 2            | EBS only
t2.medium | 2    | 24                 | 4            | EBS only
t2.large  | 2    | 36                 | 8            | EBS only
30. How Credits Work
• A CPU credit provides the performance of a full CPU core for one minute
• An instance earns CPU credits at a steady rate
• An instance consumes credits when active
• Credits expire (leak) after 24 hours
[Chart: credit balance over time, showing the baseline earn rate and the burst rate]
A worked example of the credit arithmetic follows.
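Using the t2.micro numbers from the table above (6 credits per hour, one credit = one full core for one minute), a small worked example of the arithmetic; the starting balance of 144 credits simply reflects 24 hours of accrual before credits expire.

```python
# Worked example: T2 credit arithmetic for a t2.micro (6 credits/hour).
credits_per_hour = 6
earn_rate = credits_per_hour / 60        # credits earned per minute (0.1)
baseline_utilization = earn_rate         # sustainable share of one core = 10%

credit_balance = 144                     # 24 hours of accrual before credits expire
burn_rate = 1.0                          # credits consumed per minute at 100% of one core
burst_minutes = credit_balance / (burn_rate - earn_rate)  # earning continues while bursting

print(f"baseline: {baseline_utilization:.0%} of one core")
print(f"{credit_balance} credits sustain a full-core burst for {burst_minutes:.0f} minutes")
```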
32. Monitoring CPU Performance in the Guest
• Indicators that work is being done:
  – User time
  – System time (kernel mode)
  – Wait I/O: threads blocked on disk I/O
  – Otherwise, idle
• What happens if the OS is scheduled off the CPU? (See the /proc/stat sketch below.)
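On a Linux guest these counters come from /proc/stat (cumulative jiffies, so you sample twice and take the difference); a minimal sketch that also reports the steal column discussed on the next slide:

```python
# Sketch: sample the aggregate CPU counters in /proc/stat and report percentages.
import time

FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]

def cpu_times():
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:9]]  # first line starts with "cpu"

before = cpu_times()
time.sleep(5)
after = cpu_times()

deltas = [a - b for a, b in zip(after, before)]
total = sum(deltas)
for name, delta in zip(FIELDS, deltas):
    print(f"{name:>8}: {100.0 * delta / total:5.1f}%")
```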
33. Tip: How to Interpret Steal Time
• Fixed CPU allocations can be offered through CPU caps
• Steal time occurs when the CPU cap is enforced
• Leverage CloudWatch metrics (a boto3 sketch follows)
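One way to correlate in-guest steal time with the enforced allocation is to pull the instance's CloudWatch metrics; a boto3 sketch, where the instance ID is a placeholder:

```python
# Sketch: fetch CPUUtilization and CPUCreditBalance from CloudWatch with boto3.
# "i-0123456789abcdef0" is a placeholder instance ID.
import boto3
from datetime import datetime, timedelta

cw = boto3.client("cloudwatch")
dimensions = [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}]

for metric in ("CPUUtilization", "CPUCreditBalance"):
    stats = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(metric, point["Timestamp"], point["Average"])
```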
35. Announced: X1 Instances
• Largest-memory EC2 instance with 2 TB of DRAM
• Quad-socket Intel E7 processors with 128 vCPUs

Model       | vCPU | Memory (GiB) | Local storage
x1.32xlarge | 128  | 1952         | 2 x 1920 GB
36. Virtualized Address Spaces
[Diagram: guest process virtual address spaces (0 to 4 GB) mapped onto the guest OS physical address space, containing virtual RAM, virtual ROM, virtual devices, and a virtual frame buffer] Source: [1]
38. Virtual Machine Memory
[Diagram: guest virtual address spaces map to the guest physical address space, which the hypervisor in turn maps onto the machine address space containing real RAM, ROM, devices, and the frame buffer] Source: [1]
39. Before Intel EPT: Shadow Page Tables
• The hypervisor maintains shadow page tables that map guest virtual pages directly to machine pages
• Guest modifications to its virtual-to-physical tables must be synced with the shadow page tables
• Shadow page tables are loaded into the MMU on context switch
41. Drawbacks: Shadow Page Tables
• Maintaining consistency between guest page tables and shadow page tables leads to hypervisor traps
• Loss of performance due to TLB flushes on every context switch
• Memory overhead due to shadow copies of guest page tables
42. Extended Page Tables
• Extended page tables (EPT) translate guest virtual addresses to machine addresses
  – No need to trap to the hypervisor when the guest OS updates its page tables
• TLB with virtual process identifiers
  – No need to flush the TLB on VM or hypervisor context switch
44. NUMA
• Non-uniform memory access
• Each processor in a multi-CPU system has local memory that is accessible through a fast interconnect
• Each processor can also access memory attached to other CPUs, but local memory access is much faster than remote access
• Performance is related to the number of CPU sockets and how they are connected, e.g., the Intel QuickPath Interconnect (QPI)
45. Tip: Kernel Support for NUMA Balancing
• An application performs best when the threads of its processes access memory on the same NUMA node
• NUMA balancing moves tasks closer to the memory they are accessing
• The Linux kernel does this automatically when automatic NUMA balancing is active (kernel 3.13+); see the sketch below
• Windows support for NUMA first appeared in the Enterprise and Datacenter SKUs of Windows Server 2003
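A quick way to confirm the topology and whether automatic NUMA balancing is enabled on a Linux guest, assuming the standard /proc and /sys layout:

```python
# Sketch: check NUMA node count and whether automatic NUMA balancing is enabled.
import glob

nodes = glob.glob("/sys/devices/system/node/node[0-9]*")
print("NUMA nodes:", len(nodes))

try:
    with open("/proc/sys/kernel/numa_balancing") as f:
        print("automatic NUMA balancing:", "on" if f.read().strip() == "1" else "off")
except FileNotFoundError:
    print("kernel has no automatic NUMA balancing (pre-3.13 or not built in)")
```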
47. I/O and Device Virtualization
• Scheduling I/O requests between virtual devices and shared physical hardware
• Split driver model
• Intel VT-d
  – Direct pass-through and IOMMU for dedicated devices
• Enhanced networking
48. Review: I2 Instances
• 16 vCPU: 3.2 TB SSD; 32 vCPU: 6.4 TB SSD
• 365K random read IOPS on the 32 vCPU instance

Model      | vCPU | Memory (GiB) | Storage        | Read IOPS | Write IOPS
i2.xlarge  | 4    | 30.5         | 1 x 800 GB SSD | 35,000    | 35,000
i2.2xlarge | 8    | 61           | 2 x 800 GB SSD | 75,000    | 75,000
i2.4xlarge | 16   | 122          | 4 x 800 GB SSD | 175,000   | 155,000
i2.8xlarge | 32   | 244          | 8 x 800 GB SSD | 365,000   | 315,000
49. Split Driver Model
[Diagram: an application in a guest domain issues socket I/O; the frontend driver in each guest communicates with the backend driver in the driver domain, which uses the physical device driver to reach the network device, while the VMM schedules physical CPUs and memory underneath]
50. Split Driver Model
• Each virtual device has two main components:
  – A communication ring buffer
  – An event channel signaling activity in the ring buffer
• Data is transferred through shared pages
• Shared pages require inter-domain permissions, known as granting
51. Granting in pre-3.8.0 Kernels
• Kernels prior to 3.8.0 require a “grant mapping” for each I/O request, e.g., read(fd, buffer, …)
• Grant mappings are expensive operations due to TLB flushes
52. Granting in 3.8.0+ Kernels: Persistent and Indirect Grants
• Grant mappings are set up in a pool once
• Data is copied in and out of the grant pool on each request, e.g., read(fd, buffer, …)
53. Tip: Use a 3.8+ Kernel
• Amazon Linux 2013.09 or later
• Ubuntu 14.04 or later
• RHEL 7 or later
• etc. (see the version check sketch below)
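A tiny sketch for checking that the running kernel is new enough for persistent grants:

```python
# Sketch: verify the running Linux kernel is 3.8 or newer (persistent grants).
import platform

release = platform.release()                     # e.g. "4.14.296-222.539.amzn2.x86_64"
major, minor = (int(x) for x in release.split(".")[:2])
print(release, "OK" if (major, minor) >= (3, 8) else "upgrade recommended")
```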
54. Device Pass-Through: Enhanced Networking
• SR-IOV eliminates the need for a driver domain
• The physical network device exposes a virtual function to the instance
• Requires a specialized driver, which means:
  – Your instance OS needs to know about it
  – EC2 needs to be told your instance can use it
55. After Enhanced Networking
[Diagram: with SR-IOV, the NIC driver in each guest domain talks directly to a virtual function of the network device, bypassing the driver domain; the VMM still handles CPU scheduling and memory]
56. Tip: Use Enhanced Networking
• Highest packets per second
• Lowest variance in latency
• The instance OS must support it
• Look for the SR-IOV property of the instance or image (a sketch for checking it follows)
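To confirm that EC2 has been told your instance can use enhanced networking, you can query the sriovNetSupport attribute; a boto3 sketch with a placeholder instance ID:

```python
# Sketch: check the sriovNetSupport attribute of an instance with boto3.
# "i-0123456789abcdef0" is a placeholder instance ID.
import boto3

ec2 = boto3.client("ec2")
attr = ec2.describe_instance_attribute(
    InstanceId="i-0123456789abcdef0",
    Attribute="sriovNetSupport",
)
value = attr.get("SriovNetSupport", {}).get("Value")
print("enhanced networking (SR-IOV):", "enabled" if value == "simple" else "not enabled")
```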
58. Instance Selection = Performance Tuning
• Find an instance type and workload combination
  – Define performance
  – Monitor resource utilization
  – Make changes
59. Virtualization Themes
• Bare-metal performance is the goal, and in many scenarios we are already there
• A history of eliminating hypervisor intermediation and driver domains:
  – Hardware-assisted virtualization
  – Scheduling and granting efficiencies
  – Device pass-through
60. Recap: Getting the Most Out of EC2 Instances
• PV-HVM
• Timekeeping: use the TSC
• C-state and P-state controls
• Monitor T2 CPU credits
• NUMA balancing
• Persistent grants for I/O performance
• Enhanced networking
61. Next steps
• Visit the Amazon EC2 documentation
• Come visit us in the Developer Chat to hear more