© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Roberto Fuente, Technical Account Manager, AWS S...
What to expect from this session
• Introduction to AWS and EC2
• Defining the performance of a system and how it is
characterized for di...
Introduction to AWS
and EC2
AWS Global Infrastructure
Region
Edge Location
12 Regions
33 Availability Zones
54 Edge Locations
US West (OR)
AZ A AZ B
AZ C
GovCloud (US)
AZ A AZ B
US West (CA)
AZ A AZ B
AZ C
US East (VA)
AZ A AZ B
AZ C AZ D
AZ E
*A l...
What is Elastic Compute
Cloud (EC2)?
Amazon Elastic Compute Cloud (EC2)
Virtual servers
in the AWS cloud
Fast and easy
scalability,
as you need it
...
A wide variety of instance types
M4
General
purpose
Compute
optimized
C4
C3
Storage and IO
optimized
I2 G2
GPU
enabl...
Amazon EC2 lets you…
• Easily build highly available applications
• Distribute load across EC2 servers using AWS Elast...
Different pricing models
Reserved
Instances
Pay a minimal
upfront fee
Reserve capacity
Secure a ...
Defining performance
Selecting a server
• Servers are hired to do jobs
• Performance is measured differently...
• What performance means
depends on the perspective:
• Response time
• Throughput
• Consistency
Performance = pe...
Performance factors
Resource  Factors  Indicators
CPU  Sockets, number of cores,
clock frequency, capacity
Utilizat...
Resource utilization
• Each application has a resource utilization profile
for a given level of performance...
Example: web application
• MediaWiki installed on an Apache server with 140
pages of content
• Load increased over ...
Example: web application
• Memory statistics
Example: web application
• Disk statistics
Example: web application
• Network statistics
Example: web application
• CPU statistics
Instance selection = optimization
• Selecting an instance is equivalent to
optimizing for your resources
• D...
Delivering compute performance
on EC2
CPU instructions and protection levels
• The CPU has two protection levels: kernel and application
• Privileged instr...
Example: system calls from a web application
[ec2-user@ip-10-0-121-0 ~]$ sudo strace -c -p 2440
Process 2440 attache...
x86 CPU virtualization: before Intel VT-x
• Binary translation for privileged instructions
• Para-virtualization...
Applying Moore's Law
180 nm (1999), 130 nm (2001), 90 nm (2003), 65 nm (2005),
45 nm (2007), 32 nm (2009), 22 nm (2012), 14 nm (2014)
MOORE'S LAW...
Intel® Core™ microarchitecture tick-tock:
TOCK: new microarchitecture (Merom, 65 nm)
TICK: new process technology (Penryn, 45 nm)
Nehalem ...
NETWORK: data in motion
STORAGE: data at rest
COMPUTE: data being transformed
A common architecture for ...
x86 CPU virtualization: after Intel VT-x
• Hardware-assisted virtualization (HVM)
• PV-HVM uses PV drivers for...
Tip: Use PV-HVM AMIs with EBS
C4 instances
Custom Intel E5-2666 v3 at 2.9 GHz
P-state and C-state management
Model vCPU Memory (GiB) EBS (Mbps)
c4.large 2 ...
Instances: T2
• Lowest instance cost
• Burstable performance
• Fixed allocation of CPU credits
Model vCPU CPU Credit...
How Credits Work
• A CPU credit provides the
performance of a full CPU core
for one minute
• An instance earns ...
Tip: Monitor CPU credits
Tip: how to interpret steal time
• Fixed CPU allocations can be offered with
established CPU limits
• Stea...
Delivering I/O performance
on EC2
I/O and device virtualization
• Split driver model
• Each device has two components:
• Ring buffer for commun...
Split Driver Model: Network
Hardware
Driver Domain Guest Domain Guest Domain
VMM
Frontend
driver
Frontend
driver
Backend
driv...
Device passthrough: Enhanced Networking
• SR-IOV removes the need for the driver domain
• The physical device ...
Paso Directo al Dispositivo: Enhanced Networking
Hardware
Driver Domain Guest Domain Guest Domain
VMM
Frontend
driver
NIC
...
Tip: use Enhanced Networking
• More packets per second
• Lower variance in latency
• The operating system...
I2 instance review
I2 instances
• Provide SSD storage
• Provide IOPS at low cost
• Optimized for high random I/O demand
Mo...
Grants on kernels prior to version 3.8.0
• Prior to version 3.8.0, a grant map is required
• The grant map...
Grants on kernels after version 3.8.0
• The grant map is defined in a pool
• The information is copied ...
Tip: use kernels later than version 3.8.0
• Amazon Linux 13.09 or later
• Ubuntu 14.04 or later
• RHEL 7 or later
• Etc.
Summary
• Use PV-HVM
• Monitor T2 credits
• Use Enhanced Networking
• Use kernels later than version 3.8.0
Thank you!
EC2: Cloud compute in depth

EC2: Cloud compute in depth, presented at the 2016 AWS Summit Buenos Aires

  • A great pleasure to be at the first Summit in Buenos Aires.
    Support. I help my customers deep dive into performance issues on EC2 and AWS services.
  • This session is designed to be educational and consultative.
    I want you all to come away with something that can help you use EC2,
    starting with how you can define performance down to features and tips you can use to get more performance and how they work
    You all have specific things you care about and objectives, but if those aren’t covered in the talk, don’t be too disappointed.
    We’ve brought some great engineers into our booth to answer your questions after the session.


    The most important part is how to make the best use of the instances.
  • Amazon EC2’s simple web service interface allows you to obtain and configure capacity with minimal friction. It provides you with complete control of your computing resources and lets you run on Amazon’s proven computing environment. Amazon EC2 reduces the time required to obtain and boot new server instances to minutes, allowing you to quickly scale capacity, both up and down, as your computing requirements change. Amazon EC2 changes the economics of computing by allowing you to pay only for capacity that you actually use. Amazon EC2 provides developers the tools to build failure resilient applications and isolate themselves from common failure scenarios.
  • Our data center footprint is global, spanning 5 continents with highly redundant clusters of data centers in each region. Our footprint is expanding continuously as we increase capacity, redundancy and add locations to meet the needs of our customers around the world.
  • You can choose to deploy and run your applications in multiple physical locations within the AWS cloud.
    Our data center footprint is global, spanning 5 continents with highly redundant clusters of data centers in each region.
    Amazon Web Services are available in geographic Regions that are independent and separated as much as possible for data sovereignty, and as much as possible offer the same services.
    When you use AWS, you can specify the Region in which your data will be stored, instances run, queues started, and databases instantiated.
    Within each Region are Availability Zones (AZs).
    Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. By launching instances in separate Availability Zones, you can protect your applications from a failure (unlikely as it might be) that affects an entire zone. Regions consist of one or more Availability Zones, are geographically dispersed, and are in separate geographic areas or countries. The Amazon EC2 service level agreement commitment is 99.95% availability for each Amazon EC2 Region.

    Our footprint is expanding continuously as we increase capacity, redundancy and add locations to meet the needs of our customers around the world.

    AWS maintains Regions, which are major geographic areas, and Availability Zones (AZs), which are individual data centers, or clusters of data centers, that make up a Region. They are independent and separate for data sovereignty, yet as much as possible offer the same services.
    Today, AWS operates 9 Regions around the world. Each Region has a minimum of 2 AZs (separate power, flood plains, etc.) to allow customers to set up high-availability architectures and data redundancy. An AZ is an abstraction of a data center with fault isolation, but close enough to build high-availability architectures.
    In addition to Regions, AWS maintains edge locations that support Route 53 DNS and Amazon CloudFront (CDN) points of presence.


  • EC2 is designed to make web scale computing easier for developers. It has resizable compute capacity, configurable security and network access, and you have complete control of the resource.

    Resources can be started, terminated and monitored as needed, and you can increase availability by deploying instances across multiple physical locations.

  • Talk about instance families and really stress the breadth of our offering. This graphic does not speak to all the Instance Types but it does allow you to begin the conversation on the different types of families and the purpose we had in mind when AWS created the different Instance Types.
  • Amazon EC2 enables you to increase or decrease capacity within minutes, not hours or days. You can commission one, hundreds or even thousands of server instances simultaneously. Of course, because this is all controlled with web service APIs, your application can automatically scale itself up and down depending on its needs.

    Speaker Note: Describe ELB and auto scaling, the key use cases and how they can be interdependent but not necessarily.

    ELB benefits: HA, health checks, SSL offloading, sticky sessions, logging, etc.
    Detects the health of Amazon EC2 instances so that failing instances can be detected and removed
    Dynamically grows and shrinks resources based on traffic
    Seamlessly integrates with autoscaling to add and remove instances based on scaling activities
    Supports load balancing of applications using HTTP, HTTPS, SSL, and TCP protocols.
    Auto Scaling: automatically scale your Amazon EC2 fleet to optimize utilization based on your conditions and needs
    Scale customers' EC2 capacity automatically and shed unneeded Amazon EC2 instances automatically
    Good for apps that experience variability in usage
    Is enabled by Amazon CloudWatch and carries no additional fees

  • Explain how pricing works; integrate our new Spot model with 1-6 hour blocks. Great slide to just talk and whiteboard out how our offerings could be bought in a hybrid model: some Spot, some On-Demand, and some RIs.
  • In order to know how to improve performance you have to first know what it is and how to measure it.
    Performance can mean different things depending on what you’re talking about.
  • Servers are hired to do jobs, and what those jobs are depends on your business or personal objectives
    Defining performance for your application is the first step to knowing what you need out of your virtual machines on EC2
    Skipping that step can lead to overprovisioning or under provisioning, spending too much, or not spending enough and not meeting your customer promise

    Because EC2 offers lots of virtual server configurations on-demand, pay by the hour, the approach to right sizing is different and less stressful
    You aren’t stuck with it, and you can experiment easily
    The goal is to hire the right server for the job


    CPU bound, IO intensive etc.
  • The ways that you can generalize performance are the following:
    How quickly does a unit of work get done, or response time
    How much work is being done per unit of time, or throughput
    And how consistently over time is a level of performance achieved. Consistency can be very important.
    How quickly does a unit of work get done
    When you execute the database query, how quickly does it come back
    When you enter a website how quickly does the page load
    How much work are you getting done in a unit of time
    Web application: the number of requests per second handled within a tolerable response time
    Database: transactions per second
    Transcoding video: frames per second
    Machine learning: inferences per second, or number of training jobs per unit of time
    Going further down the stack, to the filesystem for example, you might look at filesystem cache hits.
    Down to the hardware resources that do the work, you’re paying attention to CPU, Memory, Disk, Network, and whether these resources are fully saturated or utilized.
    For instances, we think about performance at the resource level – the capabilities of those resources and how they are utilized
  • Performance indicators are resource indicators that show whether the full potential is being used or not.
    Explain CPU, and stuff.
    What is the correct utilization? The demand on each component depends on the type of application; remember whether it is an application that uses a lot of CPU, disk, etc.
  • Each application can have a different resource utilization profile for a given level of workload performance.
    Utilization: 100% utilization is usually a sign of a bottleneck.

    If we have low resource capacity.

    Performance of the application on EC2.
  • As an illustration we set up a simple mediawiki deployment – a PHP application using apache and mysql.
    We set it up with 140 pages of content and ran a load test
    We used siege and gradually upped the load over time
    On the server side, we collected some basic metrics using collectd and used a graphing tool to pull some charts together. The default interval of 10 seconds was used, so you get pretty good granularity
    It plugs into Apache, and here we show the apache requests per second rate from the web server status output


    Why not use CloudWatch metrics? The hypervisor only shows certain metrics. Let's have a look.

    //are we going to show any signs of capacity or are we just going to go over metrics?


  • Buffers for Filesystem metadata
    Cache for file cache to reduce disk accesses
    No swapping

    The information displayed in the memory section provides the same data about memory usage as the command free -m.
    The swpd or “swapped” column reports how much memory has been swapped out to a swap file or disk. The free column reports the amount of unallocated memory. The buff or “buffers” column reports the amount of allocated memory in use. The cache column reports the amount of allocated memory that could be swapped to disk or unallocated if the resources are needed for another task.

    The swap section reports the rate that memory is sent to or retrieved from the swap system. By reporting “swapping” separately from total disk activity, vmstat allows you to determine how much disk activity is related to the swap system.
    The si column reports the amount of memory that is moved from swap to “real” memory per second. The so column reports the amount of memory that is moved to swap from “real” memory per second.
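These memory and swap columns can be sketched with a tiny parser over a sample `vmstat` line; the header order below is the standard one, and the sample values are fabricated for illustration:

```python
# Parse a sample `vmstat 1` output line into named columns.
# The sample values are made up for illustration only.
HEADER = "r b swpd free buff cache si so bi bo in cs us sy id wa st".split()
SAMPLE = "1 0 0 301524 103956 886324 0 0 5 12 120 240 3 1 95 1 0"

def parse_vmstat(line):
    """Map each vmstat column name to its integer value."""
    return dict(zip(HEADER, (int(v) for v in line.split())))

row = parse_vmstat(SAMPLE)
# swpd: memory swapped out; free: unallocated; si/so: swap-in/out rate
print(row["swpd"], row["free"], row["si"], row["so"])  # -> 0 301524 0 0
```

A nonzero si/so pair in this parsed view is the quickest sign that the instance is swapping.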
  • I/O
    The io section reports the amount of input and output activity per second in terms of blocks read and blocks written.
    The bi column reports the number of blocks received, or “blocks in”, from a disk per second. The bo column reports the number of blocks sent, or “blocks out”, to a disk per second.

    r/s & w/s: Read and write requests per second. This is already post-merging, and in proper I/O setups reads will mean blocking random read (serial reads are quite often merged), and writes will mean non-blocking random write (as underlying cache can allow to serve the OS instantly). 

    rrqm/s & wrqm/s: How many requests were merged by block layer. In ideal world, there should be no merges at I/O level, because applications would have done it ages ago. Ideals differ though, for others it is good to have kernel doing this job, so they don’t have to do it inside application. 
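As a rough illustration of how rrqm/s relates to r/s, with fabricated numbers:

```python
# Given sample iostat -x readings (fabricated values), estimate what share of
# incoming read requests were merged away by the block layer.
rrqm_s = 40.0   # read requests merged per second
r_s = 160.0     # read requests issued to the device per second (post-merge)

# Requests arriving at the block layer = merged-away requests + issued requests.
arriving = rrqm_s + r_s
merge_share = rrqm_s / arriving
print(f"{merge_share:.0%} of incoming reads were merged")  # -> 20% of incoming reads were merged
```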
  • The cpu section reports on the use of the system’s CPU resources. The columns in this section always add to 100 and reflect “percentage of available time”.
    The us column reports the amount of time that the processor spends on userland tasks, or all non-kernel processes.
    The sy column reports the amount of time that the processor spends on kernel related tasks.
    The id column reports the amount of time that the processor spends idle.
    The wa column reports the amount of time that the processor spends waiting for IO operations to complete before being able to continue processing tasks.
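A minimal sketch of how these CPU columns relate (the values are fabricated; the invariant is that the shares add to 100):

```python
# The us/sy/id/wa columns of vmstat are shares of available CPU time and
# add up to (approximately) 100. Sample values are fabricated.
cpu = {"us": 62, "sy": 18, "id": 15, "wa": 5}

assert sum(cpu.values()) == 100   # percentages of available time
busy = cpu["us"] + cpu["sy"]      # time spent doing work (user + kernel)
stalled = cpu["wa"]               # time blocked waiting on I/O completion
print(busy, stalled)  # -> 80 5
```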
  • The nice thing is that you can terminate the instances once finished, or launch the tests on others.

  • Protection, system call performance,
    Scheduling and P and C state management
    Tips: HVM, which system calls to use for timekeeping, how to manage P and C states
  • CPU has at least two protection levels: Kernel mode and user mode
    CPU checks current protection level on each instruction
    Privileged instructions can’t be executed in user mode to protect system. They include:
    “Initiate I/O” Access I/O devices, such as network and disk
    “Access protected memory” Manipulate memory management unit
    Time keeping
    Halt CPU or change power state
    Done in user mode software through system calls – trap to kernel mode.
  • Took a sample of the system calls being done by httpd and here’s the list of the most frequently used
    Creating processes
    Input / output operations (file system operations)
    And mapping files and devices into memory

    These are generally some of the most popular system calls.

    If you have debugging enabled, for example, you’ll see an elevation in the number of gettimeofday() calls to put time stamps in the debug logs.

    Most time-related PHP functions use the system time. Since they use the system time, gettimeofday will be called a lot, so if you want to reduce the calls, reduce your time-related functions.

    If your application does a lot of I/O or you want to use debugging mode with lots of time checks, for example, you would start to care more about your system call performance.
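A quick, hedged way to see why heavy time stamping matters is to measure the per-call cost of a wall-clock read, the kind of call (gettimeofday/clock_gettime) that debug logging and time-related PHP functions hit constantly. This sketch uses only the Python standard library; the absolute number depends on the clocksource in use:

```python
import time

# Measure the rough per-call cost of a wall-clock read. On a vDSO-friendly
# clocksource (e.g. tsc) this stays in user mode; on the xen clocksource
# each read can be much more expensive.
N = 100_000
start = time.perf_counter()
for _ in range(N):
    time.time()            # wall-clock read, backed by clock_gettime
elapsed = time.perf_counter() - start
print(f"{elapsed / N * 1e9:.0f} ns per call")
```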


  • When virtualizing hardware it’s job of the hypervisor to enforce protections and schedule resources to offer a controlled virtual machine experience
    Else, OS and user land share the same ring
    The hypervisor must be able to trap and moderate any instruction that changes the hardware or state of the system. This is to provide isolation between virtual machines.
    Hypervisor is moderating system calls and sending it back to the OS. System call performance is poor.
    So you have a couple options
    So you can scan the instruction stream of each virtual machine for privileged instructions and do binary translation – performance is not ideal
    Ignore those instructions and provide “hypercalls” to replace instructions that lose their functionality – modified OS Kernel, compatibility and portability
    Use hardware-assisted virtualization technology, which provides a new CPU execution mode feature that allows the hypervisor to run in a new root mode below ring 0.
    Then there are complex devices that need to get emulated

    The hypervisor also provides hypercall interfaces for other critical kernel operations such as memory management, interrupt handling and time keeping.
    When virtualizing the CPU, one also has a choice of how to assign physical CPU cores to virtual CPUs.


  • The hypervisor must be able to trap and moderate any instruction that changes the hardware or state of the system. This is to provide isolation between virtual machines.
    Use hardware-assisted virtualization technology, which provides a new CPU execution mode feature that allows the hypervisor to run in a new root mode below ring 0.
    Then there are complex devices that need to get emulated

    The hypervisor also provides hypercall interfaces for other critical kernel operations such as memory management, interrupt handling and time keeping.
    When virtualizing the CPU, one also has a choice of how to assign physical CPU cores to virtual CPUs.


    But fully virtualized mode, even with PV drivers, has a number of things that are unnecessarily inefficient. One example is the interrupt controllers: fully virtualized mode provides the guest kernel with emulated interrupt controllers (APICs and IOAPICs). Each instruction that interacts with the APIC requires a trip up into Xen and a software instruction decode; and each interrupt delivered requires several of these emulations.

    With the introduction of PVHVM mode, we can start to see paravirtualization not as binary on or off, but as a spectrum. In PVHVM mode, the disk and network are paravirtualized, as are interrupts and timers. But the guest still boots with an emulated motherboard, PCI bus, and so on. It also goes through a legacy boot, starting with a BIOS and then booting into 16-bit mode. Privileged instructions are virtualized using the HVM extensions, and pagetables are fully virtualized, using either shadow pagetables, or the hardware assisted paging (HAP) available on more recent AMD and Intel processors.

    The "HVM callback vector" line shows that PV interrupts are enabled (from PVHVM), which is a big difference. On full HVM mode, emulated PCI interrupts are used for device I/O delivery, along with emulating the PCI bus, local APIC, and IO APIC. If you're doing a high rate of disk I/O or network packets – which is easy to do on today's networks – these emulation overheads add up. With vector callbacks instead of interrupts, the Xen hypervisor can call the destination guest driver directly, avoiding these overheads.


  • A fully virtualized system, like an OS running on bare hardware, relies on the timer interrupt for its time keeping. This means a number of things:
    An idle virtual machine still has to process hundreds of interrupts a second.
    Missed interrupts result in unstable time.

    On Linux there are two different time mechanisms
    Clock source and clock events
    With gettimeofday you are accessing a clock source; the same goes for QueryPerformanceCounter
    There are commands that let you see your clock source
    Usually by default it's going to be the xen clock source
     
    JVM tracing does very heavy get time of day calls
    Benchmarks tend to show this problem more than a lot of applications
     
    TSC is a hardware clocksource that gets rid of all of the software that has to run on top of things
    In Linux you can access the TSC without talking to the kernel
    Xen pvclock gives you compatibility with a wide range of hardware
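A small sketch for checking the active clocksource on a Linux guest; the sysfs path is the standard kernel location, and the fallback covers non-Linux or restricted environments:

```python
from pathlib import Path

# Kernel's currently selected clocksource, exposed via sysfs on Linux.
CLKSRC = Path("/sys/devices/system/cpu/clocksource/clocksource0/current_clocksource")

def current_clocksource():
    """Return the active clocksource name, or 'unknown' if unreadable."""
    try:
        return CLKSRC.read_text().strip()
    except OSError:
        return "unknown"   # non-Linux, container, or restricted environment

print(current_clocksource())   # e.g. "tsc", "xen", or "kvm-clock"
```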
  • If you want to see the differences, you need to use a timekeeping benchmark. In the real world this most often occurs when you are using a JVM with a high debug level enabled so the JVM does time-based tracing. Another classic example is SAP, because they do a large amount of timekeeping operations. High-fidelity trace records.

    Test before changing!
  • CPU customized for EC2.
    Overclocking / C-states.
    They are very fast; a C4.8xlarge instance can reach 3.5 GHz on a single core.

    C-state and P-state controls
    C-state
    Controls the sleep level that a core may enter
    Numbered from C0 (the core is working normally and executing instructions) to C6 (the core is powered off)
    P-state
    Controls the desired performance level of a core
    Numbered from P0 (the highest core performance, where it is allowed to use Intel Turbo Boost technology to increase the frequency), then from P1 (requests the maximum base frequency) to P15 (requests the lowest possible frequency)
  • Int level
  • You might want to change the C-state or P-state settings to increase processor performance consistency, reduce latency, or tune your instance for a specific workload. The default C-state and P-state settings provide maximum performance, which is optimal for most workloads. However, if your application would benefit from reduced latency at the cost of higher single- or dual-core frequencies, or from consistent performance at lower frequencies as opposed to bursty Turbo Boost frequencies, consider experimenting with the C-state or P-state settings that are available to these instances.
  • In this example, vCPUs 21 and 28 are running at their maximum Turbo Boost frequency because the other cores have entered the C6 sleep state to save power and provide both power and thermal headroom for the working cores. vCPUs 3 and 10 (each sharing a processor core with vCPUs 21 and 28) are in the C1 state, waiting for instruction.
  • C-states control the sleep levels that a core may enter when it is inactive. You may want to control C-states to tune your system for latency versus performance. Putting cores to sleep takes time, and although a sleeping core allows more headroom for another core to boost to a higher frequency, it takes time for that sleeping core to wake back up and perform work. For example, if a core that is assigned to handle network packet interrupts is asleep, there may be a delay in servicing that interrupt. You can configure the system to not use deeper C-states, which reduces the processor reaction latency, but that in turn also reduces the headroom available to other cores for Turbo Boost.

    A common scenario for disabling deeper sleep states is a Redis database application, which stores the database in system memory for the fastest possible query response time.
  • You can reduce the variability of processor frequency with P-states. P-states control the desired performance (in CPU frequency) from a core. Most workloads perform better in P0, which requests Turbo Boost. But you may want to tune your system for consistent performance rather than bursty performance that can happen when Turbo Boost frequencies are enabled.
    Intel Advanced Vector Extensions (AVX or AVX2) workloads can perform well at lower frequencies, and AVX instructions can use more power. Running the processor at a lower frequency, by disabling Turbo Boost, can reduce the amount of power used and keep the speed more consistent. For more information about optimizing your instance configuration and workload for AVX, see http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/performance-xeon-e5-v3-advanced-vector-extensions-paper.pdf.
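For completeness, a sketch that lists the C-states the kernel exposes for a given core via sysfs; it returns an empty list where cpuidle is unavailable (containers, non-Linux hosts):

```python
import glob
from pathlib import Path

def idle_states(cpu=0):
    """List the kernel-exposed idle (C) states for one CPU, shallowest first.

    Deeper states save more power but cost more wake-up latency.
    """
    names = []
    for p in sorted(glob.glob(f"/sys/devices/system/cpu/cpu{cpu}/cpuidle/state*/name")):
        names.append(Path(p).read_text().strip())
    return names   # [] where cpuidle is absent

print(idle_states())   # e.g. ['POLL', 'C1', 'C1E', 'C3', 'C6']
```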
  • T2.nano.
    This instance type was created because we had customers who used very little CPU.
    It works by bursting.
  • A CPU Credit provides the performance of a full CPU core for one minute
    Hefty initial CPU credit balance for good startup experience
    Use credits when active, accrue credits when idle
    Transparency on credit balances
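The credit mechanics above can be sketched as a toy model; the earn rate, starting balance, and cap below are illustrative assumptions, not official T2 numbers:

```python
# Toy model of T2 CPU credits (illustrative numbers, not official limits):
# the instance spends one credit per minute of full-core usage and accrues
# credits at a fixed hourly rate.
EARN_PER_HOUR = 6.0   # assumed earn rate, e.g. a 10% baseline instance

def balance_after(minutes_busy, minutes_idle, start=30.0, cap=144.0):
    """Credit balance after a busy stretch followed by an idle stretch."""
    bal = start - minutes_busy * 1.0                              # 1 credit per busy core-minute
    bal += (minutes_busy + minutes_idle) * EARN_PER_HOUR / 60.0   # steady accrual
    return max(0.0, min(bal, cap))                                # clamped at 0 and the cap

print(balance_after(minutes_busy=20, minutes_idle=60))  # -> 18.0
```

Running the balance to zero is the point where the instance is throttled back to its baseline, which is why monitoring the credit balance matters.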
  • Review the CloudWatch metrics to understand what utilization looks like.


    Hmm, we see steal time. Is someone stealing my CPU?
  • In order to do time accounting for each process, the OS measures time, schedules the process, runs a new process, and checks the time again, charging the process with the difference.
    The model assumes the OS is running 100% of the time; if the hypervisor takes away time from the instance, the OS doesn't know it, because it wouldn't be getting timer interrupts.
    The category of "steal time" is enabled by a paravirtual extension where the guest queries the hypervisor for time, and can figure out when time was taken away.
    Exists in Linux and not Windows.
    There's a caveat: when the guest calls a halt, the halted time doesn't get reported as steal time. So steal time doesn't always account for the time the hypervisor has taken from you.
  • “A common misconception about steal time (due to the unfortunate naming) is that it is a metric for showing the amount of CPU cycles stolen by other virtual machines in the same virtual host. No doubt that cloud service providers tend to oversell but steal time should not be the basis for this assumption.

    Steal time actually accounts for the cycles the local virtual machine is trying to go over its originally allocated resources. It should actually be named involuntary wait as mentioned in the Linux kernel documentation for /proc/stat.”

    There are a number of corner cases where the hypervisor is doing work on your behalf. Steal time can help you understand what's happening, but it doesn't indicate that your performance is worse. The big takeaway is that your performance is not necessarily being impacted.

    The goal of steal time is to correct process accounting.
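To make the accounting concrete, here is a small sketch of how steal percentage is usually derived from two samples of the `cpu` line in `/proc/stat` (field order: user, nice, system, idle, iowait, irq, softirq, steal). The sample lines and the `steal_pct` helper are illustrative; on a live Linux guest you would read `/proc/stat` twice.

```python
# Sketch: compute the steal-time percentage from two /proc/stat samples.
# Fields on the "cpu" line: user nice system idle iowait irq softirq
# steal [guest guest_nice]. Sample lines below are synthetic.

def cpu_times(stat_line):
    """Parse the numeric fields of a /proc/stat 'cpu' line."""
    return [int(x) for x in stat_line.split()[1:]]

def steal_pct(sample_a, sample_b):
    """Steal time as a percentage of all CPU time between two samples."""
    a, b = cpu_times(sample_a), cpu_times(sample_b)
    delta = [y - x for x, y in zip(a, b)]
    total = sum(delta[:8])   # ignore guest fields (already counted in user/nice)
    return 100.0 * delta[7] / total

t0 = "cpu  100 0 50 800 10 0 5 35"
t1 = "cpu  180 0 70 1500 15 0 10 125"
print(round(steal_pct(t0, t1), 1))  # 10.0
```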



  • Protection, system call performance,
    scheduling, and P-state and C-state management.
    Tips: use HVM, know which system calls to use for timekeeping, and know how to manage P-states and C-states.
  • Consistent device drivers provided in a split driver model allow for better portability of machine images across hardware generations. Hardware-specific drivers reside in a control operating system, and a simple front-end driver in the guest communicates with the back-end through ring buffers in shared memory pages. The multiplexing happens on the host, and it can require host CPU resources.
    The original challenge of assigning a device to a virtual machine has to do with direct memory access: a device can modify memory without bothering the CPU, which would be a serious security hole if allowed in the context of a multi-tenant host.
    The IOMMU can identify source devices and either deny or translate memory requests using IOMMU page tables. This enables the hypervisor to assign specific devices to a guest and restrict device memory access to pages owned by the guest. This is how we enable PCI passthrough for things like GPU instances and SR-IOV network devices.
  • Single Root I/O Virtualization

    The physical network device exposes a Virtual Function to the instance.

    The driver in your instance is a lightweight PCIe function driver, with limited configuration and direct access to the physical NIC.

    Packets are no longer processed in software.

    But it's a specialized driver, which means:
    your instance OS needs to know about it and be using it, and
    EC2 needs to be told your instance OS knows about it and can handle it.
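One practical way to verify the "specialized driver" point is to look at which driver the instance's NIC is using. The sketch below parses `ethtool -i eth0`-style output and checks for the SR-IOV virtual-function driver of this era (`ixgbevf`); the sample output string is illustrative, and on a live instance you would actually run `ethtool -i eth0` (and check the `sriovNetSupport` attribute from the EC2 API side).

```python
# Sketch: decide from `ethtool -i <iface>` output whether the instance
# is using the SR-IOV virtual-function driver (ixgbevf) or the
# paravirtual frontend (vif / xen_netfront). Sample output is synthetic.

SRIOV_DRIVERS = {"ixgbevf"}

def enhanced_networking(ethtool_output):
    """True if the reported driver is an SR-IOV VF driver."""
    for line in ethtool_output.splitlines():
        if line.startswith("driver:"):
            return line.split(":", 1)[1].strip() in SRIOV_DRIVERS
    return False

sample = "driver: ixgbevf\nversion: 2.14.2\nbus-info: 0000:00:03.0"
print(enhanced_networking(sample))  # True
```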
  • In a virtualized system, a virtual address points to a guest physical address, which points to a host physical address. You have this for both the I/O domain and the instance. A grant maps two different guest physical addresses to the same host physical address, with permissions. A grant always has to originate from the instance.

    If the request is a write operation, these grants are filled with the desired data to write to the disk and necessary permissions are given to the driver domain, so it can map the grants (either read only if the request is a write operation, or with write permissions if the request is a read operation). Once we have the grants set up, a reference (the grant reference) is added to the request, and the request is finally queued on the shared ring and the driver domain is notified that it has a pending request.

    When the driver domain reads the request, it parses the grant references on the message and maps the grants on the driver domain memory space. When that is done, the driver domain can access this memory as if it was local memory.

    The request is then fulfilled by the driver domain, and data is read or written to the grants. After the operation has completed, the grants are unmapped, a response is written to the shared ring, and the guest is notified.

    When the guest realizes it has a pending response, it reads it and removes the permissions on the related grants. After that, the operation is complete.

    As we can see from the above flow, there is no memory copy, but each request requires the driver domain to perform several mapping and unmapping operations, and each unmapping requires a TLB flush. TLB flushes are expensive operations, and the time required to perform a TLB flush increases with the number of CPUs.


  • To solve this problem, an extension to the block ring protocol has been added, called “persistent grants”. Persistent grants consist of reusing the same grants for all block-related transactions between the guest and the driver domain, so there is no need to unmap the grants in the driver domain, and TLB flushes are not performed (unless the device is disconnected and all mappings are removed). Furthermore, since grants are set up only once, there is no need to grab the driver domain's grant lock on every transaction.

    This, of course, doesn't come for free: since grants are no longer mapped and unmapped, data has to be copied to or from the persistently mapped grant. But for large numbers of guests, the overhead from TLB flushes and lock contention greatly outweighs the overhead of copying.
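The trade-off above can be illustrated with a toy cost model. The constants are made up for illustration (they are not measurements): classic grants pay a map, an unmap, and a TLB flush per request, with the flush cost growing with the number of CPUs, while persistent grants pay a fixed memory-copy cost per request.

```python
# Toy cost model (made-up constants, not measurements) for the grant
# trade-off: classic grants pay map + unmap + a TLB flush whose cost
# scales with CPU count; persistent grants pay a copy instead.

def classic_cost(requests, ncpus, map_us=1.0, flush_us=2.0):
    """Microseconds spent on map/unmap plus one TLB flush per request."""
    return requests * (2 * map_us + flush_us * ncpus)

def persistent_cost(requests, copy_us=6.0):
    """Microseconds spent copying data through persistent grants."""
    return requests * copy_us

for ncpus in (1, 8, 32):
    c, p = classic_cost(10_000, ncpus), persistent_cost(10_000)
    print(f"{ncpus:2d} CPUs: classic {c:>9.0f} us, persistent {p:>7.0f} us")
```

With these illustrative numbers, mapping wins on a tiny host, but the flush cost overtakes the copy cost as CPU count grows — the same qualitative conclusion the notes draw.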
  • EC2: Compute in the cloud, in depth

    1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Roberto Fuente, Technical Account Manager, AWS Support; Damian Traverso, Solutions Architect, AWS LATAM. April 28, 2016. EC2: Compute in the cloud, in depth
    2. What to expect from this session? • Introduction to AWS and EC2 • Defining the performance of a system and how it is characterized for different workloads • How EC2 instances deliver optimal performance while maintaining flexibility and agility • How to make the best use of EC2 instances
    3. Introduction to AWS and EC2
    4. AWS Global Infrastructure: Regions and Edge Locations — 12 Regions, 33 Availability Zones, 54 Edge Locations
    5. AWS Regions and Availability Zones (AZs): US West (OR): AZ A–C; GovCloud (US): AZ A–B; US West (CA): AZ A–C; US East (VA): AZ A–E; EU (Ireland): AZ A–C; EU (Frankfurt): AZ A–B; S. America (Sao Paulo): AZ A–B; Asia Pacific (Tokyo): AZ A–C; Asia Pacific (Singapore): AZ A–B; Asia Pacific (Sydney): AZ A–B; China (Beijing)*: AZ A–B. *A limited preview of the China (Beijing) Region is available to a select group of China-based and multinational companies with customers in China. These customers are required to create an AWS Account, with a set of credentials that are distinct and separate from other global AWS Accounts.
    6. What is Elastic Compute Cloud (EC2)?
    7. Amazon Elastic Compute Cloud (EC2): virtual servers in the AWS cloud; fast and easy scalability, as you need it; pay only for what you use; familiar operating systems: Linux and Windows
    8. Wide variety of instance types: general purpose (M4, M3), compute optimized (C4, C3), storage and I/O optimized (I2, D2), GPU enabled (G2), memory optimized (R3)
    9. Amazon EC2 lets you… • Easily build highly available (HA) applications • Distribute load across EC2 servers using AWS Elastic Load Balancers • Ensure high availability and scalability using Auto Scaling • Use multiple Availability Zones (AZs) • Choose among different pricing models
    10. Different pricing models. Reserved Instances: pay a minimal upfront fee, reserve the capacity, secure a lower hourly rate. On-Demand Instances: pay according to usage, flat hourly rate, no contracts or commitments. Spot Instances: place a bid, save up to 90% compared to On-Demand, launch thousands of instances
    11. Defining performance
    12. Selecting a server • Servers are provisioned to perform work • Performance is measured differently depending on the work being done
    13. What performance means depends on the perspective: • Response time • Throughput • Consistency. Performance = perspective. (Stack: application, system libraries, system calls, kernel, device; load)
    14. Performance factors — resource / factors / indicators: CPU — sockets, number of cores, clock frequency, capacity — CPU utilization, run queue length. Memory — capacity — free memory, paging, swapping. Network interface — maximum bandwidth, packets — packets received, packet transfer against maximum bandwidth. Disk — IOPS, throughput — wait queue length, device utilization, device errors
    15. Resource utilization • Each application has a resource utilization profile for a given performance level • A resource at 100% utilization cannot receive or serve more requests • Low utilization indicates that more resources were reserved than needed
    16. Example: web application • MediaWiki installed on an Apache server with 140 pages of content • Load increased in timed intervals
    17. Example: web application • Memory statistics
    18. Example: web application • Disk statistics
    19. Example: web application • Network statistics
    20. Example: web application • CPU statistics
    21. Instance selection = optimization • Selecting an instance is equivalent to optimizing resources • Terminating instances is as easy as acquiring new ones • Match the workload type to the optimal instance type
    22. Delivering compute performance on EC2
    23. CPU instructions and protection levels • The CPU has two protection levels: kernel and application • Privileged instructions cannot be executed in user mode, to protect the system • Applications rely on system calls into the kernel. Privileged instructions: • Initiating I/O • Device I/O access (network, disk) • Timekeeping • Halting the CPU
    24. Example: system calls of a web application. [ec2-user@ip-10-0-121-0 ~]$ sudo strace -c -p 2440 … (per-syscall call counts: read 931 (11 errors), write 887, open 121, close 154, stat 1357 (32 errors), fstat 341, lstat 99 (11 errors), poll 865, mmap 121, munmap 121, brk 220, rt_sigaction 11, rt_sigprocmask 11, writev 22, access 66 (22 errors))
    25. x86 CPU virtualization: before Intel VT-x • Binary translation for privileged instructions • Paravirtualization (PV) • PV requires passing through the VMM, introducing latency • Applications that are bound to system calls are the most affected. (Diagram: Application, Kernel, PV, VMM)
    26. Applying Moore's Law: 180 nm (1999), 130 nm (2001), 90 nm (2003), 65 nm (2005), 45 nm (2007), 32 nm (2009), 22 nm (2012), 14 nm (2014). MOORE'S LAW: enabling new devices with greater functionality and complexity, while controlling power, cost, and size (doubling integration every 2 years)
    27. Tick-tock model — evolution of the Xeon platforms, Intel® Core™ microarchitecture: Merom 65 nm (tock: new microarchitecture) → Penryn 45 nm (tick: new process technology) → Nehalem Xeon 5500 45 nm (tock) → Westmere Xeon 5600 32 nm (tick) → Sandy Bridge Xeon E5 32 nm (tock) → Ivy Bridge Xeon E5 v2 22 nm (tick) → Haswell Xeon E5 v3 22 nm (tock) → Broadwell Xeon E5 v4 14 nm (tick)
    28. A common architecture for the entire datacenter: NETWORK (data in motion), STORAGE (data at rest), COMPUTE (data being transformed). Bringing economies of scale to the whole infrastructure
    29. x86 CPU virtualization: after Intel VT-x • Hardware-assisted virtualization (HVM) • PV-HVM uses PV drivers for operations that are slow to emulate, e.g. network and disk I/O. (Diagram: Application, Kernel, PV-HVM, VMM)
    30. Tip: Use PV-HVM AMIs with EBS
    31. C4 instances: custom Intel E5-2666 v3 at 2.9 GHz; P-state and C-state management. Model / vCPU / Memory (GiB) / EBS (Mbps): c4.large 2 / 3.75 / 500; c4.xlarge 4 / 7.5 / 750; c4.2xlarge 8 / 15 / 1,000; c4.4xlarge 16 / 30 / 2,000; c4.8xlarge 36 / 60 / 4,000
    32. T2 instances • Lowest-cost instances • Burstable performance • Fixed allocation of CPU credits. Model / vCPU / CPU credits per hour / Memory (GiB) / Storage: t2.nano 1 / 3 / 0.5 / EBS only; t2.micro 1 / 6 / 1 / EBS only; t2.small 1 / 12 / 2 / EBS only; t2.medium 2 / 24 / 4 / EBS only; t2.large 2 / 36 / 8 / EBS only
    33. How credits work • A CPU credit provides the performance of a full CPU core for one minute • An instance earns CPU credits at a steady rate • An instance consumes credits when it is active • Credits expire (leak) after 24 hours. (Baseline rate, credit balance, burst rate)
    34. Tip: Monitor CPU credits
    35. Tip: How to interpret steal time • Fixed CPU allocations can be offered with set limits on the CPU • Steal time occurs when the CPU time limit has been exhausted • Review the CloudWatch metrics
    36. Delivering I/O performance on EC2
    37. I/O and device virtualization • Split driver model • Each device has two components: • a communication ring buffer • an event channel signaling ring buffer activity • Intel VT-d • Direct pass-through for dedicated devices • Enhanced Networking (SR-IOV)
    38–42. Split driver model: network (animation build across slides 38–42). Hardware: physical CPU, physical memory, network device. Driver domain: backend driver, device driver. Guest domains: application, sockets, frontend driver, virtual CPU, virtual memory. VMM: CPU scheduling
    43. Device pass-through: Enhanced Networking • SR-IOV eliminates the need for the driver domain • The physical network device exposes a virtual function to the instance • Requires a special driver: • the instance operating system needs to know about the driver • Enhanced Networking must be enabled in EC2
    44–46. Device pass-through: Enhanced Networking (animation build across slides 44–46). Hardware: physical CPU, physical memory, SR-IOV network device. Guest domain: application, sockets, NIC driver, virtual CPU, virtual memory. VMM: CPU scheduling
    47. Tip: Use Enhanced Networking • More packets per second • Lower latency variance • The instance operating system must support it
    48. I2 instance review
    49. I2 instances • Provide SSD storage • Provide IOPS at low cost • Optimized for high random I/O demand. Model / vCPU / Memory (GiB) / Storage / Read IOPS / Write IOPS: i2.xlarge 4 / 30.5 / 1 x 800 SSD / 35,000 / 35,000; i2.2xlarge 8 / 61 / 2 x 800 SSD / 75,000 / 75,000; i2.4xlarge 16 / 122 / 4 x 800 SSD / 175,000 / 155,000; i2.8xlarge 32 / 244 / 8 x 800 SSD / 365,000 / 315,000
    50. Grants on kernels prior to version 3.8.0 • Prior to 3.8.0, a grant map is required • The grant map requires costly operations due to TLB (Translation Lookaside Buffer) flushes. read(fd, buffer,…)
    51. Grants on kernels from version 3.8.0 onward • The grant map is defined in a pool • Data is copied to or extracted from the pool (copy to and from the grant pool)
    52. Tip: Use kernels newer than version 3.8.0 • Amazon Linux 13.09 or later • Ubuntu 14.04 or later • RHEL 7 or later • etc.
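A quick way to act on this tip is to compare the running kernel against the 3.8.0 cutoff. The sketch below parses a release string as produced by `uname -r` (or Python's `platform.release()`); the sample strings are illustrative.

```python
# Sketch: check whether a kernel release string (as from `uname -r`)
# meets the 3.8.0 cutoff for persistent grant support.

import re

def kernel_at_least(release, minimum=(3, 8, 0)):
    """Compare the leading version numbers of a release string."""
    nums = [int(x) for x in re.findall(r"\d+", release)[:3]]
    nums += [0] * (3 - len(nums))   # pad e.g. "4.4" -> (4, 4, 0)
    return tuple(nums) >= minimum

print(kernel_at_least("3.14.35-28.38.amzn1.x86_64"))  # True
print(kernel_at_least("3.4.110"))                     # False
```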
    53. Summary • Use PV-HVM • Monitor T2 credits • Use Enhanced Networking • Use kernels newer than version 3.8.0
    54. Thank you!
