Amazon EC2 offers a broad selection of instance types to support diverse use cases. In this session we give an overview of the Amazon EC2 instance platform, its most important characteristics, and the concept of instance generations. We dive into the current-generation choices across the different instance families, including General Purpose, Compute Optimized, Storage Optimized, Memory Optimized, and Accelerated Computing (GPU and FPGA). We also cover best practices and share performance tips for getting the most out of your Amazon EC2 instances.
https://aws.amazon.com/pt/ec2/
2. What to expect from this session
Understanding the factors involved in choosing an EC2 instance
Defining system performance and how to characterize different workloads
How Amazon EC2 instances deliver performance with flexibility and agility
How to get the most out of Amazon EC2 instances across the different types available
8. EC2 instance families
General purpose: M4
Compute optimized: C3, C4
Storage optimized: I3, D2
Memory optimized: R4, X1
Accelerated computing: P2, G2, F1
9. What is a virtual CPU? (vCPU)
A vCPU is typically a hyper-threaded physical core*
On Linux, the "A" threads are enumerated before the "B" threads
On Windows, the threads are interleaved
Divide the vCPU count by two to get the number of physical cores (see the example below)
Cores per EC2 instance type & RDS DB:
https://aws.amazon.com/ec2/virtualcores/
* The "T" family is special
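A quick way to confirm the thread-to-core mapping on a running Linux instance (a minimal sketch; the exact lscpu output fields vary slightly by distribution):
lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'                  # logical CPUs, threads per core, cores per socket, sockets
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list    # the hyper-thread siblings that share core 0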
11. Disable hyper-threading if you need to
Useful for FPU-intensive applications
Use 'lscpu' to validate the layout
Take the "B" threads offline on the fly:
for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list \
  | cut -s -d, -f2- | tr ',' '\n' | sort -un); do
  echo 0 | sudo tee /sys/devices/system/cpu/cpu${cpunum}/online
done
Or configure GRUB to boot only the first half of the threads (a persistence sketch follows below):
maxcpus=64
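A sketch of how the maxcpus setting could be made persistent on a GRUB2-based distribution (the file path and the grub2-mkconfig location are assumptions that differ between distributions):
# In /etc/default/grub, append maxcpus to the kernel command line (64 shown here for a 128-vCPU instance)
GRUB_CMDLINE_LINUX="... maxcpus=64"
# Regenerate the GRUB configuration and reboot (Debian/Ubuntu use update-grub instead)
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot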
14. Resource allocation
All resources assigned to you are dedicated to your instance, with no oversubscription*
All vCPUs are dedicated to you
The memory allocated is assigned only to your instance
Network resources are partitioned to avoid "noisy neighbors"
Curious about the number of instances per host? Use Dedicated Hosts as a guide.
*Again, the "T" family is special
15. "Launching new instances and running the tests in parallel is easy... [when choosing an instance] there is no better substitute for measuring performance than your own application."
- EC2 documentation
16. Scalable infrastructure to handle nationwide audience peaks
ZAP, a Grupo Globo company, is the most complete, modern, and efficient real-estate portal in Brazil. We connect people interested in property on a business platform with the best information, analytics, and technology for the market.
"AWS allowed new business requirements to be met with agility, offering innovative, scalable, highly available services"
Adriano Aguiar, Infrastructure Manager
17. The challenge
Handle a high volume of traffic at low cost
Scale within a short time frame
Make efficient use of the broad AWS service portfolio: CloudFormation, Elastic Beanstalk, ElastiCache, Reserved and Spot Instances, Auto Scaling, backed by Enterprise Support
18. Scalable, low-cost solution
(Architecture diagram: Elastic Load Balancing in front of two Auto Scaling groups within an Availability Zone — one On-Demand/Reserved, one Spot — monitored by Amazon CloudWatch and provisioned with AWS CloudFormation plus AWS Elastic Beanstalk, using user data, a custom AMI, and .ebextensions.)
19. Timekeeping – explained
Getting the time on an instance is deceptively hard
gettimeofday(), clock_gettime(), QueryPerformanceCounter()
The TSC register
A CPU counter, accessible from userspace
Requires calibration, vDSO
Invariant on Sandy Bridge processors
Xen pvclock: does not support vDSO
On current-generation instances, use TSC as the clock source
20. Benchmarking – a time-call-intensive application (build and profiling commands follow the listing)
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>   /* gettimeofday() */
#define BILLION 1E9
int main(){
    double diff_ns;
    struct timespec start, end;
    int x;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for ( x = 0; x < 100000000; x++ ) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    diff_ns = (BILLION * (end.tv_sec - start.tv_sec)) + (end.tv_nsec - start.tv_nsec);
    printf("Elapsed time is %.4f seconds\n", diff_ns / BILLION);
    return 0;
}
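One way to build and profile the snippet above (the file name is illustrative). With the Xen clock source, the gettimeofday() calls show up as syscalls in the strace summary; with TSC, they mostly disappear:
gcc -O2 gettime_bench.c -o gettime_bench   # compile the benchmark
strace -c ./gettime_bench                  # -c prints a summary of syscall counts and time spent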
23. Tip: use TSC as the clock source
On Linux: change it at runtime, or add it in GRUB (see the sketch below)
On Windows 2008 R2 and later versions: not necessary, it picks the best clock source automatically
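The runtime and GRUB changes referenced above are likely along these lines (a sketch; the sysfs path is standard on current kernels, and the GRUB parameters shown are the commonly used ones):
cat /sys/devices/system/clocksource/clocksource0/available_clocksource   # clock sources the kernel can use
cat /sys/devices/system/clocksource/clocksource0/current_clocksource     # clock source in use right now
echo tsc | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource   # switch to TSC at runtime
# Or persist it across reboots by appending to the kernel command line in GRUB:
#   clocksource=tsc tsc=reliable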
24. P-state and C-state control
c4.8xlarge, d2.8xlarge, m4.10xlarge, m4.16xlarge, p2.16xlarge, x1.16xlarge, x1.32xlarge
When some cores enter deeper idle states, the non-idle cores can reach frequencies up to 300 MHz higher
But... deeper idle states take longer to wake from, so they may not be appropriate for latency-sensitive workloads
Linux: limit the C-state by adding "intel_idle.max_cstate=1" in GRUB (see the sketch below)
Windows: has no options to control the C-state
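A minimal sketch of applying and verifying the C-state limit above (the GRUB paths are assumptions that vary by distribution):
# In /etc/default/grub, add intel_idle.max_cstate=1 to GRUB_CMDLINE_LINUX, then:
sudo grub2-mkconfig -o /boot/grub2/grub.cfg && sudo reboot
# After the reboot, confirm the limit took effect
cat /sys/module/intel_idle/parameters/max_cstate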
25. Tip: P-state control for AVX2
If an application makes constant use of AVX2 on all cores, the processor may try to draw more power than it should
The processor will transparently reduce its frequency
Excessive changes in CPU frequency can slow your application down
sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"
See also: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processor_state_control.html
26. Review: T2 instances
Low-cost EC2 instances starting at $0.0065 per hour
Burstable performance
Credit-based allocation of CPU variation
Model      vCPU  Baseline  CPU credits/hour  Memory (GiB)  Storage
t2.nano    1     5%        3                 0.5           EBS only
t2.micro   1     10%       6                 1             EBS only
t2.small   1     20%       12                2             EBS only
t2.medium  2     40%**     24                4             EBS only
t2.large   2     60%**     36                8             EBS only
General purpose, web servers, development environments, small databases
27. How credits work
One CPU credit provides the performance of a full CPU core for one minute
The instance earns CPU credits constantly
The instance spends credits when they are needed
Credits expire after 24 hours
(Chart: baseline rate, credit balance, burst rate; a CloudWatch query to watch these is sketched below)
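The credit balance and usage can be watched through the CloudWatch metrics CPUCreditBalance and CPUCreditUsage; a sketch with the AWS CLI (the instance ID and time window are placeholders):
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2017-06-01T00:00:00Z --end-time 2017-06-01T06:00:00Z \
  --period 300 --statistics Average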
29. Review: X1 instances
Largest-memory instance, with 2 TB of DRAM
Quad-socket Intel E7 processors with 128 vCPUs
Model        vCPU  Memory (GiB)  Local storage    Network
x1.16xlarge  64    976           1x 1920 GB SSD   10 Gbps
x1.32xlarge  128   1952          2x 1920 GB SSD   20 Gbps
In-memory databases, big data processing, HPC
30. NUMA
Non-uniform memory access
Each processor in a multi-CPU system has local memory reachable over a fast connection
Each processor can also reach the memory of the other CPUs, but access to local memory is much faster than to remote memory
Performance depends on the number of CPU sockets and how they are connected – the Intel QuickPath Interconnect (QPI) (a numactl sketch follows below)
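To inspect the NUMA layout an instance actually exposes, numactl and numastat are the usual tools (a sketch; package names differ slightly by distribution):
sudo yum install -y numactl        # or: sudo apt-get install -y numactl
numactl --hardware                 # nodes, the CPUs and memory on each node, and node distances
numastat                           # per-node numa_hit / numa_miss allocation counters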
33. Tip: kernel support for NUMA balancing
An application performs best when all of its process threads access the same NUMA node.
NUMA balancing moves tasks closer to the memory they access.
The Linux kernel does this automatically when automatic NUMA balancing is enabled: kernel 3.8+.
Windows NUMA support first appeared in the Enterprise and Data Center SKUs of Windows Server 2003.
Set "numa=off" or use numactl to reduce NUMA paging if your application uses more memory than fits in a single socket or has threads that move between sockets.
34. Tip: kernel support for NUMA balancing
Linux kernel 3.8+ and Windows Datacenter 2003+ support NUMA balancing
It is not always the most efficient option
Memory-management overhead can slow your application down
Does your application use more memory than fits in a single socket?
Linux: set "numa=off" in GRUB to disable NUMA awareness
Do you have many processes, or a footprint smaller than a single socket?
Linux: use "numactl" to restrict them to specific cores or nodes
Example: numactl --cpunodebind=0 --membind=0 ./myapp.run
Windows: use processor affinity to pin applications to specific cores.
35. Huge pages on Linux
Disable transparent huge pages
# echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
# echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
Use explicit huge pages
$ sudo mkdir /dev/hugetlbfs
$ sudo mount -t hugetlbfs none /dev/hugetlbfs
$ sudo sysctl -w vm.nr_hugepages=10000
$ HUGETLB_MORECORE=yes LD_PRELOAD=libhugetlbfs.so numactl --cpunodebind=0 --membind=0 /path/to/application
See also: https://lwn.net/Articles/375096/ (a quick verification sketch follows below)
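A quick check that the huge-page settings took effect (a sketch; on most distributions the sysfs path is transparent_hugepage, the redhat_ prefix above is specific to RHEL 6):
cat /sys/kernel/mm/transparent_hugepage/enabled    # should show [never] after disabling
grep -i huge /proc/meminfo                         # HugePages_Total, HugePages_Free, Hugepagesize, AnonHugePages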
36. Operating systems impact performance
A memory-intensive application
Created a large number of threads
Rapidly allocated/deallocated memory
Performance was compared between RHEL6 and RHEL7
A large amount of "system" time was observed in top
A benchmark tool (ebizzy) with a similar performance profile was found
Performance was traced with "perf"
37. On RHEL6
[ec2-user@ip-172-31-12-150-RHEL6 ebizzy-0.3]$ sudo perf stat ./ebizzy -S 10
12,409 records/s
real 10.00 s
user 7.37 s
sys 341.22 s
Performance counter stats for './ebizzy -S 10':
361458.371052 task-clock (msec) # 35.880 CPUs utilized
10,343 context-switches # 0.029 K/sec
2,582 cpu-migrations # 0.007 K/sec
1,418,204 page-faults # 0.004 M/sec
10.074085097 seconds time elapsed
43. Device pass-through: Enhanced Networking
SR-IOV eliminates the need for the driver domain
The physical network card is exposed to the Amazon EC2 instance
Requires a special driver, which means:
Your instance's operating system needs to know about it
The EC2 instance needs to support this feature (checks sketched below)
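Ways to confirm that enhanced networking is actually in use, roughly as described in the EC2 documentation (the interface name and instance ID are placeholders):
ethtool -i eth0                # the driver field should report ixgbevf (or ena on ENA instances)
modinfo ixgbevf                # confirm the SR-IOV driver is present in the AMI
aws ec2 describe-instance-attribute --instance-id i-0123456789abcdef0 --attribute sriovNetSupport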
44. Hardware after Enhanced Networking
(Diagram: the application in the guest domain goes sockets → guest NIC driver → SR-IOV network device directly, bypassing the driver domain and the VMM; the VMM still provides CPU scheduling, virtual CPU, and virtual memory on top of the physical CPU and memory.)
45. Elastic Network Adapter
The next generation of Enhanced Networking
Hardware checksums
Multi-queue support
Receive-side steering
20 Gbps in placement groups
A new open-source network driver developed by Amazon
46. Network performance
20 Gigabit & 10 Gigabit
Measured one way; double that for bidirectional (full duplex)
High, Moderate, Low – a function of instance size and EBS optimization
Not all are created equal – test with iperf if it matters (see the sketch below)!
Use placement groups when you need high, consistent bandwidth between instances
All traffic is limited to 5 Gb/s when leaving EC2
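A sketch of the kind of iperf test the slide suggests, run between two instances in the same placement group (the address, stream count, and duration are placeholders):
# On the receiving instance
iperf3 -s
# On the sending instance: several parallel streams are needed to approach 10/20 Gbps
iperf3 -c 10.0.0.10 -P 8 -t 60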
47. ENA on R4 instances
r4.8xlarge delivers a consistent 10 Gbps
r4.16xlarge delivers a consistent 20 Gbps
Smaller sizes
Burst up to 10 Gbps from a baseline
Accumulate credits while network usage is below the baseline
Full bandwidth can also be reached without placement groups, but multiple streams are needed
A single stream is limited to 10 Gbps in a placement group
5 Gbps per stream between AZs, but you can still reach 20 Gbps in total
48. EBS performance
Instance size affects throughput
Match the volume size and type to the instance type
Use EBS optimization if EBS performance matters (see the measurement sketch below)
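If EBS throughput matters, it is worth measuring from the instance itself; a hedged example using fio against a dedicated, empty test volume (the device name and job parameters are illustrative, and the test reads the raw device, so do not point it at a volume holding data):
sudo yum install -y fio
sudo fio --name=ebs-randread --filename=/dev/xvdf --direct=1 \
         --rw=randread --bs=16k --iodepth=32 --numjobs=4 \
         --runtime=60 --time_based --group_reporting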
49. "AWS helped us raise our SLA and focus on e-commerce"
iSET is one of the leading cloud e-commerce platforms in Brazil, serving more than 8,000 customers.
Our systems and infrastructure are entirely developed in-house.
"AWS reduced operational and infrastructure costs and allowed us to raise our SLA to 99.9%"
- Paulo Pina, CEO
50. The challenge
Simplify the management of almost 50 dedicated servers
Solve availability problems once and for all
Build a new infrastructure that allows efficient scaling
Reduce and monitor the cost per active service
52. Summary: getting more out of your EC2 instance
Choose HVM AMIs
Timekeeping: use TSC
Control the C-state and P-state
Monitor the CPU credits of T2 instances
Use modern operating systems
NUMA balancing
Persistent grants for better I/O performance
Enhanced Networking
Understand your application's profile
53. Virtualization themes
The goal is bare-metal performance, and in many cases it has already been reached
A history of removing the hypervisor and driver domains from the data path
Hardware-assisted virtualization
Scheduling and granting efficiencies
Device pass-through
54. Next steps
Check out our Amazon EC2 instance documentation
Launch an instance and test your application!
56. Don't have the official AWS Summit São Paulo app yet?
http://amzn.to/2rOcsVy
Don't forget to rate the sessions in the app!
Speaker notes
Thank you for coming
Adam Boeglin – Solutions architect HPC
Talk to you about a subject I’m pretty passionate about which is EC2 performance
I’ve been a sysadmin – frustrated by not getting performance
Lately at AWS, working with customers doing CFD, Gene Sequencing, Semiconductor design
Performance really important for them and their workloads
Talk to you today
Things I’ve learned
Things my customers do
EC2 is a big subject
Talk about the Purchase Options
APIs & SDK’s
Networking
Talk today about the instances themselves
How they operate
Features
Options when you go to launch
Other Topics
List Recommended sessions at the end
Let’s start at the basics
What is an EC2 instance?
They are virtual machines
Guests
On a Hypervisor
On physical hardware
Launched in 2006
“an instance”
Didn’t have a name
Didn't get any choices
Like the Model T – any color as long as it's black
Eventually gave it a name
M1 instance
Customers wanted more choice
We’ve been iterating and growing ever since.
Not only adding instances, but changing how EC2 works
Launched the cc2 in 2011
Placement groups
Bandwidth and latency
Hardware assisted virtualization
Exposes more of the underlying hardware
Lets you get even more performance
EC2 is always growing and changing based on customer feedback
Always check our documentation for the latest as you’re building out
How we do things today may be different in the future
Go over how we talk about instances and name them
Get on the same page
First letter is the family
Stands for what it’s suited for or what resources it has
C for compute
R for Ram
I for IOPS
Number is the generation
Like a version number
Last is the instance size
T-Shirt size
You’ve got a lot of choices and flexibility when you go to launch
It can seem overwhelming
Trying to pick the right instance
Looking at just the families
First find what your application is constrained by
If you need memory, start with R3
CPU, go with C4
If balanced, look at general purpose M or T
Perspective of your constraint, it’s easy to pick the right family
Test to find the right size within that family
If you need a little help, check the documentation
List of workloads for each family.
When you’re looking at instances, you'll see something called vCPUs
On modern instances not in the T family
A hyperthreaded core
Hyperthreading is great to increase performance
It kinda lets your CPU do two things at once
Like waiting on IO
Real core count
Divide by two
Visit link, used for licensing.
To give a visual representation…
Output of LSTOPO on m4.10xlarge
Linux utility for enumerating hardware
Can run on any instance or physical server
Shows graphical output of hardware configuration
Sockets
Memory on each socket
L1-3 Cache
CPU thread to core mapping
Case of m4.10xlarge
40 threads
20 cores
Some applications don’t benefit from hyperthreading
Context switching may decrease performance
Typically compute heavy apps
Financial calculations & engineering simulations
These apps usually disable hyperthreading
If you’re not sure or don’t typically disable hyperthreading, don’t worry
If you do, try running with it disabled on EC2 and see if it improves performance
Easy on Linux, harder on Windows
Linux
The first set of threads on each core is listed first, and the second or B threads are listed after that
Disable the last half, which will be all the B threads
Two ways
Online
Great for no reboot
But it may cause instability
Disable processors where threads may be running
Won’t be persisted after a reboot
In grub,
Set max cpus to match physical cpu count minus 1
Safer – disabled when booting
But makes it harder when you switch size
Windows is harder
Interleaved
Have to use CPU affinity
Same m4.10xlarge with hyperthreading turned off
Only one CPU thread per core
Compared to the two that you saw earlier
Let’s dig into how instance sizes work
We build instances
Easy to scale vertically and horizontally
Look at the c4 family as an example
C4.8xlarge on the left
Largest instance size available
That single c4.8xlarge
Roughly equal to 2 c4.4xlarges
That c4.4xlarge has roughly half the
Number of vCPUS
Amount of ram
Available network bandwidth
Keeps following down the line
2x c4.4xlarge = 4x c4.2xlarge
And so on…
Reason is because of how we partition instances
Largest size is typically a full server
On the smaller ones you're running a fraction of it depending on the size
Virtualization historically has a bad reputation
Usually used to manage over utilization of resources
More virtual machines than physical resources
We use virtualization for a lot of other reasons
Security & Isolation
Dedicate specific resources to specific customers
vCPUS as an example
With exception of T
When you’re assigned a vCPU
only customer using it
Not sharing with anyone else on the box
Same applies to Memory & Network
We build with the goal of providing a consistent experience
No matter what else is happening
Last thing I want to say about choosing your instance
Cheesy to quote documentation
Good sentiment
Easy to get an app up and running
Don’t run synthetic benchmarks
Install your app and send some realistic load
Examples:
Mobile App
HPC application
BI database
Use a real workload to understand how your app will behave
Titans Group is a leader in telephony and internet provider services, operating with 40 carriers in 17 countries… one of its main products, called "X", is a file synchronization application for mobile and desktop
Titans Group needed to guarantee that the product's storage solution met the most demanding requirements for durability, availability, and confidentiality of user data.
It also needed to guarantee scalability, both technically and in terms of cost.
Used S3: 99.999999999% durability, 99.99% availability, and encryption features.
Focus on application development.
Lessons: deduplication, binary diff, direct upload and download to S3.
Other AWS services: EC2, EBS, ELB, RDS, VPC, Auto Scaling, ElastiCache, SNS, SES.
More than 18 million users in 17 countries.
Digging deeper into the OS…
On all systems, time keeping is important
Used for things like
Processing interrupts
Getting the time and date
Measuring performance
Most AMI’s on AWS use Xen clock by default
Compatible with all instance types
TSC was introduced in Sandy Bridge
Handled by bare metal
You’re talking to your processor
Not the hypervisor
And because of this, calls to it are going to be much quicker
To demonstrate this – simple application
It does two things
Performs a large number of get time of day calls
a bit of math
Don’t laugh at my code…
I’m a sysadmin, not a developer
Quick and dirty to test it out
These are results on Xen clock source
Profiled with Strace
Really great tool to use with any app, yours included
Shows the number of system calls make
& the time they took
Gettimeofday takes the most time, with a lot of calls
Overall, the test took about 12 seconds to run with Xen clock source
On the same system
Switched clocksource to TSC
Reran the test
Results look a lot different
Gettimeofday doesn’t show up
Run time reduced to two seconds
This is extreme for a simple app
I’ve seen apps improve by as much as 40%
It’s an easy change to make on Linux
Do it while the system is running
First command shows available clock sources
Second shows the current clock source
Third would change it to TSC
On windows, it’s handled automatically
If you’re running a recently released EC2 instance
Can improve a lot of apps
JVM debugging
Performance tracing
SAP applications
Recently change to the platform
added P & C state control to the platform with C4, now available on many more
First, let’s talk about C states
C states control the power savings features of a processor
Using c4.8xlarge as an example
Base clock speed of 2.9Ghz
Can turbo up to 3.5Ghz on one or two cores
Must let other cores idle down
Great when you need a few cores to have high frequencies
Letting them idle down
increase the time it takes for them to respond when you want to actually use them
So if you have an application where latency is important
You can limit how deep they’ll sleep
Setting cstate parameter in grub
You can use P state to set the desired running frequency of the cores
Some customers and some workloads
consistency is more important than performance
Some Game servers good example
Operate in loops
Loop needs to complete in the same time, every time
You can set the P state to prevent the processor from scaling up and down
Operates at the same frequency all the time
Next I want to talk about T2 and why they’re special
T2 instance are great general purpose instances
Lowest cost instance available on AWS at ~1/2 a cent per hour for t2.nano
Great for workloads where CPU demand varies over time
Websites
Developer environments
Small database
You start with a baseline level of performance
That you can see in the chart above
The magic of T2 is that you earn credits when the instance is idle
Allows you to burst above the baseline
We launched T2 because we saw that most workloads aren’t using 100% of CPU all of the time
T2 family is a great way to
Still get the performance you need when you need it
Don’t pay for it when you don’t
Let’s talk about how credits work
You can think of credits in a T2 like a bucket
When you boot the instance
Start with enough credits for OS & Application
When your app is up and running, you’ll use credits when you use CPU
A single credit will let you run 100% of one core for one minute
When the work dies down and instance becomes idle
Earning new credits that will start to fill up the bucket
Credits also expire after 24 hours if unused
To monitor those instances
Cloudwatch Metrics
Two available
The one in Orange is the credit usage
Spikes when usage is high
Shows you how many credits you’re using per minute
The Blue is the Balance
Keep this above zero if you want more performance than baseline
Monitoring your credit balance lets you ensure you’re getting consistent performance on a T2
What you’ll want to hook on if you’re using autoscaling
Recently launched the X1
Biggest instance
2TB of RAM
128 Virtual CPUs
Great for apps that need a huge memory footprint
Good for:
In memory databases
big data processing
some HPC
When you have that much memory
Managing is important
On any system with multiple sockets
Memory attached to local socket will be faster than remote
Concept is called NUMA
On Intel, there’s a QPI between sockets
It’s the bus that transfer memory from one to another
Look at the r3.8xlarge as an example
Two sockets
122GB of ram on each socket
Between are 2 QPI links
Application on the left reading from the right
Will go over the QPI
Fast, but not as fast as what’s attached directly
When you go to X1, things are more complex
X1 is a 4 socket system
Numa is more important
Compared to an r3.8xlarge
More memory per socket
Only one QPI between sockets
Memory transfers from one zone to another are going to take longer on X1
So what can we do?
If you’ve ever watched top on a linux system
shows threads moving from one core to another
Process scheduling to make sure work is balanced
Around 3.8 kernel, started to use NUMA affinity
Will try to keep processes in same NUMA zone
Will also try to move memory around to be close to the process
The downside is that this can actually slow down performance on some apps
Especially true if you have a large memory pool spanning sockets
The scheduler will be moving things around when it doesn’t need to be
To disable, set NUMA=off in grub
will disable memory transfers between zones
disable NUMA awareness for process scheduling
Alternative is to use numactl to lock processes to a specific zone
Only be reading and writing memory that’s local to them
Last memory related tip is to Disable transparent huge pages
Huge pages are a really big subject with a lot of different options
See article
It goes into detail about all the different options
Transparent is enabled by default on most recent distributions
Disabling transparent and using explicit
Can help significantly for apps that are accessing a lot of memory
Another thing to keep in mind
Operating system and libraries can effect application performance
Not only is running a modern Linux kernel important
Run as recent of a distro as you can
Recent customer visit
Custom Application using a large amount of memory
EC2 performance not as good as on premise
Their app was very complex and it was hard to get quick results when making changes
Found a benchmark tool (ebizzy) with similar behavior to test
Results of ebizzy on RHEL6
Used perf to profile and see what’s happening at a system level
Generated 12,000 requests/second
Lots of time in system space
1.5 million page faults
Generated flame graphs to see what’s happening
Created by Brendan Gregg, check out his site for more information
A really good way to understand
Paths the code is taking through the system
Time spent in specific calls
You can see ebizzy on the bottom
Making lots of madvise calls
End up with a xen hypercall
Compiled the same app on RHEL7 and tested on same instance type
Saw significantly better performance
RPS went from 12,000 – 425,000
Page faults went from 1.5 million to only 14,000
What happened?
This is where flamegraphs really shine
Same exact flame graph
Same Code
Same run type
Only difference is the OS version
What the flamegraph showed us
Glibc changed the path of memory calls on RHEL7
Instead of a long madvise with a trip to the hypervisor
A single Intel-optimized call for memory management
Recompile when moving to a different OS, it can make a big difference
Next, let’s talk about IO
We have a few families that are optimized for IO
I2 – IOPS – SSD Based
D2 – Dense storage – Magnetic
Need a modern kernel to get best storage performance
Reason is split driver model
Application on left doing some disk IO
Talks to the front end driver
Then back end
Then real driver
Then hardware
Data transfer happens through shared pages
Need permissions to be granted and released
Granting had lots of overhead in early kernels
Every time it needs to write to disk
Talks to VMM
Get permission to write to device
Fill a buffer with the data
Pass to backend
Wait for data to be written
Remove the grant
Really expensive process, lots of buffer flushing
Gets worse the more CPUs you have
Persistent grants created to solve this.
Permission to write is reused for all transactions between front and back
Grants don’t need to be unmapped
Translation buffer never flushed
Much better performance for IO operations
Validating grants is easy
Run dmesg and grep for blockfront
This is i2.8xlarge
All volumes have persistent grants enabled.
If I haven’t said it enough
Using a modern kernel is really important
Many customers still use Centos6
Just by switching OS’s to 3.10
Seen as much as a 60% improvement
2.6 Kernel in Centos6 released in 2009
Long time ago in the cloud computing world
Please use a modern Kernel & OS.
Same lines as split driver model
Released enhanced networking with C3
Uses Single Root IO Virtualization – SRIOV
Physical device exposed to instance itself
Has a few requirements
Needs a special driver installed in the OS
EC2 needs to be told to expose it that way
Network path is much simpler
Packets don't have to go through the VMM
Higher packets per second
Decreased jitter – talking to bare metal
It's free on all supported instances
Enabled by default in many AMI’s
Highly recommend it if you’re touching the network
And we’re not done improving the network
Still making constant improvements
Latest is with a new Network Adapter
Launched with the X1
Called Elastic Network Adapter – ENA
Built a new Amazon developed Open source Driver
Will grow with us as we’re adding new features to the network
Built to handle throughputs up to 20 Gigabits/second
This + Hardware checksums & RSS make it the fastest network available on AWS today.
Touch briefly on network
Touch on a few points about network performance
Attend the Deep dive to learn more
Easy to forget that network can be a bottleneck on smaller instance types
Customer doing S3 performance testing
Not getting good performance
Found out all network traffic was going through a T2 NAT
Largest instances should get closer to 5Gbps when leaving EC2 and talking to things like S3
When we list 10 & 20 Gigabit bandwidth
Instance bandwidth is bi-directional
On p2, X1, and m4
20Gb in and Out at the same time
But you need placement groups and multiple TCP streams
Mark
All r4 instances can get at least 10Gbps
THIS IS HUGE!!!!!
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/memory-optimized-instances.html#memory-network-perf
https://youtu.be/CBmSl3O-AhI?t=914
Just like network throughput, EBS throughput is function of size of the instance
Larger instance, more EBS traffic
EBS optimization by default on newest instances
Don’t have to worry about network and EBS competing
Look at the EBS documentation
Table of every EBS optimized instance
Throughput and max IOPS
Great place to go to look for specific performance out of EBS
Diagram of the base infrastructure
Diagram of the API infrastructure
In conclusion
Lots of things
Getting the most out of it
At bare minimum
Benchmark your app
Use a modern OS
Monitor Cloudwatch
Use enhanced networking
Goal is to make virtualization as transparent as possible
Eliminate any inefficiencies it may cause
Goal of bare metal like performance
Already there in a lot of ways
So if you have any questions, the EC2 documentation is a great resource and covers even more than I could today. Otherwise, launch an instance and start testing your app. Thank you!