Visit http://aws.amazon.com/hpc for more information about HPC on AWS.
High Performance Computing (HPC) allows scientists and engineers to solve complex science, engineering, and business problems using applications that require high bandwidth, low latency networking, and very high compute capabilities. AWS allows you to increase the speed of research by running high performance computing in the cloud and to reduce costs by providing Cluster Compute or Cluster GPU servers on-demand without large capital investments. You have access to a full-bisection, high bandwidth network for tightly-coupled, IO-intensive workloads, which enables you to scale out across thousands of cores for throughput-oriented applications.
Intro to High Performance Computing in the AWS Cloud
1. Ben Butler
Sr. Mgr., Big Data & HPC Marketing
Amazon Web Services
butlerb@amazon.com
@bensbutler
aws.amazon.com/hpc
High Performance Computing on AWS
14. Popular HPC workloads on AWS
Transcoding and Encoding
Monte Carlo Simulations
Computational Chemistry
Government and Educational Research
Modeling and Simulation
Genome processing
16. Jun 2014 Top 500 list
484.2 TFlop/s
26,496 cores in a cluster
of EC2 C3 instances
Intel Xeon E5-2680v2 10C 2.800GHz processors
LinPack Benchmark
TOP500: 76th fastest supercomputer on-demand
17. Benefits of Agility
[Figure: Rigid On-Premises Resources (Predicted Demand vs. Actual Demand, with Waste and Customer Dissatisfaction) compared with Elastic Cloud-Based Resources (Resources scaled to demand vs. Actual demand)]
18. Unilever: augmenting existing HPC capacity
“The key advantage that AWS has over running this workflow on Unilever’s existing cluster is the ability to scale up to a much larger number of parallel compute nodes on demand.”
Pete Keeley, Unilever Research’s eScience IT Lead for Cloud Solutions
• Unilever’s digital data program now processes genetic sequences twenty times faster
23. Schrodinger & CycleComputing: computational chemistry
Simulation by Mark Thompson of
the University of Southern
California to see which of
205,000 organic compounds
could be used for photovoltaic
cells for solar panel material.
Estimated computation time 264
years completed in 18 hours.
• 156,314 core cluster across 8 regions
• 1.21 petaFLOPS (Rpeak)
• $33,000 or 16¢ per molecule
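The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the 365-day year and the idea that every core ran for the full 18 hours are assumptions made for illustration, not statements from the slide.

```python
# Figures quoted on the slide.
compounds = 205_000
cost_usd = 33_000
estimated_years = 264
wall_clock_hours = 18
cores = 156_314

# Assumption: a year is 365 days of 24 hours.
HOURS_PER_YEAR = 365 * 24

cost_per_molecule = cost_usd / compounds
estimated_hours = estimated_years * HOURS_PER_YEAR
speedup = estimated_hours / wall_clock_hours

# Assumption: every core was busy for the whole run (an upper bound).
core_hours = cores * wall_clock_hours

print(f"cost per molecule: ${cost_per_molecule:.2f}")
print(f"estimated hours of computation: {estimated_hours:,}")
print(f"wall-clock speedup: {speedup:,.0f}x")
print(f"upper bound on core-hours used: {core_hours:,}")
```

The per-molecule cost agrees with the 16¢ quoted on the slide, and the upper bound on core-hours is of the same order as the 264-year estimate.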
24. Cost Benefits of HPC in the Cloud
Pay As You Go Model
Use only what you need
Multiple pricing models
On-Premises
Capital Expense Model
High upfront capital cost
High cost of ongoing support
25. Many pricing models to support different workloads
Reserved: Make a low, one-time payment and receive a significant discount on the hourly charge. For committed utilization.
Free Tier: Get started on AWS with free usage and no commitment. For POCs and getting started.
On-Demand: Pay for compute capacity by the hour with no long-term commitments. For spiky workloads, or to define needs.
Spot: Bid for unused capacity, charged at a Spot Price which fluctuates based on supply and demand. For time-insensitive or transient workloads.
Dedicated: Launch instances within Amazon VPC that run on hardware dedicated to a single customer. For highly sensitive or compliance-related workloads.
26. Optimize cost by using various EC2 instance pricing models
[Chart: Utilization Over Time (0% to 100%), with regions labeled Heavy Utilization Reserved Instances, Light RI, On-Demand, and Spot and On-Demand]
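One way to reason about mixing pricing models is to compare the cost of covering a steady baseline with reserved capacity and the overflow with on-demand capacity. The sketch below is illustrative only: the hourly rates are hypothetical placeholders rather than AWS prices, Spot is left out for simplicity, and `blended_cost` and `best_reserved_count` are names invented here.

```python
from typing import Sequence

# Hypothetical effective hourly rates in USD, chosen only for illustration.
RESERVED_RATE = 0.60   # paid for every hour, whether or not the instance is used
ON_DEMAND_RATE = 1.00  # paid only for the hours actually used


def blended_cost(demand: Sequence[int], reserved: int) -> float:
    """Cost of serving `demand` (instances needed per hour) with `reserved`
    always-on reserved instances plus on-demand instances for the overflow."""
    total = 0.0
    for needed in demand:
        overflow = max(0, needed - reserved)
        total += reserved * RESERVED_RATE + overflow * ON_DEMAND_RATE
    return total


def best_reserved_count(demand: Sequence[int]) -> int:
    """Reserved count in 0..max(demand) with the lowest blended cost."""
    return min(range(max(demand) + 1), key=lambda r: blended_cost(demand, r))


# A steady baseline of 4 instances with a short spike to 12.
demand = [4, 4, 4, 4, 12, 12, 4, 4, 4, 4]
best = best_reserved_count(demand)
print(best, round(blended_cost(demand, best), 2))
```

With these placeholder rates, reserving for the baseline and bursting on demand costs less than either all-reserved or all-on-demand capacity; real rates would shift the break-even point.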
27. Harvard Medical School: simulation development
“The combination of our approach to biomedical computing and AWS allowed us to focus our time and energy on simulation development, rather than technology, to get results quickly. Without the benefits of AWS, we certainly would not be as far along as we are.”
Dr. Peter Tonellato, LPM, Center for Biomedical Informatics, Harvard Medical School
• Leveraged EC2 Spot Instances in workflows
• One day of engineering effort resulted in roughly 50% cost savings
29. Characterizing HPC
Loosely Coupled: Embarrassingly parallel, Elastic, Batch workloads
Tightly Coupled: Interconnected jobs, Network sensitivity, Job specific algorithms
Supporting Services: Data management, Task distribution, Workflow management
31. Elastic Compute Cloud (EC2)
c3.8xlarge
g2.medium
m3.large
Basic unit of compute capacity, virtual machines
Range of CPU, memory & local disk options
Choice of instance types, from micro to cluster compute
Compute Services
32. CLI, API and Console
Scripted configurations
Automation & Control
35. What if you need to:
Implement MPI?
Code for GPUs?
36. Cluster compute instances
Implement HVM process execution
Intel® Xeon® processors
10 Gigabit Ethernet; C3 has Enhanced Networking (SR-IOV)
cc2.8xlarge: 32 vCPUs, 2.6 GHz Intel Xeon E5-2670 (Sandy Bridge), 60.5 GB RAM, 2 x 320 GB local SSD
c3.8xlarge: 32 vCPUs, 2.8 GHz Intel Xeon E5-2680v2 (Ivy Bridge), 60 GB RAM, 2 x 320 GB local SSD
Tightly coupled
37. Network placement groups
Cluster instances deployed in a Placement
Group enjoy low latency, full bisection
10 Gbps bandwidth
10Gbps
Tightly coupled
38. GPU compute instances
cg1.4xlarge
33.5 EC2 Compute Units
20GB RAM
2x NVIDIA GPU
448 Cores
3GB Mem
g2.2xlarge
26 EC2 Compute Units
16GB RAM
1x NVIDIA GPU
1536 Cores
4GB Mem
G2 instances
Intel® Xeon® E5-2670 processors
1 NVIDIA Kepler GK104 GPU
I/O Performance: Very High (10 Gigabit Ethernet)
CG1 instances
Intel® Xeon® X5570 processors
2 x NVIDIA Tesla “Fermi” M2050 GPUs
I/O Performance: Very High (10 Gigabit Ethernet)
Tightly coupled
39. National Taiwan University: shortest vector problem
“Our purpose is to break the record of solving the shortest vector problem (SVP) in Euclidean lattices…the vectors we found are considered the hardest SVP anyone has solved so far.”
Prof. Chen-Mou Cheng, Principal Investigator of Fast Crypto Lab
• $2,300 for using 100 Tesla M2050 GPUs for ten hours
40. CUDA & OpenCL
Massively parallel clusters running on GPUs
NVIDIA Tesla cards in specialized instance types
Tightly coupled
42. Data management
Fully-managed SQL, NoSQL, and object storage
Relational Database Service
Fully-managed database
(MySQL, Oracle, MSSQL,
PostgreSQL)
DynamoDB
NoSQL, Schemaless,
Provisioned throughput
database
S3
Object datastore up to
5TB per object
Internet accessibility
Supporting Services
43. “Big Data” changes dynamic of computation and data sharing
Collection
Direct Connect
Import/Export
S3
DynamoDB
Computation
EC2
GPUs
Elastic MapReduce
Collaboration
CloudFormation
Simple Workflow
S3
Moving compute closer to the data
44. Tradeworx: Market Information Data Analytics System (MIDAS)
“For the growing team of quant types now employed at the SEC, MIDAS is becoming the world’s greatest data sandbox. And the staff is planning to use it to make the SEC a leader in its use of market data”
Elisse B. Walter, Chairman of the SEC
• Powerful AWS-based system for market analytics
• 2M transaction messages/sec; 20B records and 1TB/day
45. Feeding workloads
Using highly available Simple Queue Service to feed EC2 nodes
Processing task/processing trigger
Amazon SQS
Processing results
Supporting Services
46. Coordinating workloads & task clusters
Handle long running processes across many nodes and task steps with Simple Workflow
Task A
Task B
(Auto-scaling)
Task C
Supporting Services
48. NYU School of Medicine: Transferring large data sets
“Transferring data is a large bottleneck; our datasets are extremely large, and it often takes more time to move the data than to generate it. Since our collaborators are all over the world, if we can’t move it they can’t use it.”
Dr. Stratos Efstathiadis, Technical Director of the HPC facility, NYU
• Uses Globus Online
• Data transfer speeds of up to 50MB/s
50. Bankinter: credit-risk simulation
“With AWS, we now have the power to decide how fast we want to obtain simulation results. More important, we have the ability to run simulations that were not possible before due to the large amount of infrastructure required.”
Javier Roldán, Director of Technological Innovation, Bankinter
• Reduced processing time of 5,000,000 simulations from 23 hours to 20 minutes
52. When to consider running HPC workloads on AWS
New ideas
New HPC project
Proof of concept
New application features
Training
Benchmarking algorithms
Remove the queue
Hardware refresh cycle
Reduce costs
Collaboration of results
Increase innovation speed
Reduce time to results
Improvement
56. Getting Started with HPC on AWS
aws.amazon.com/hpc
contact us, we are here to help
Sales and Solutions Architects
Enterprise Support
Trusted Advisor
Professional Services
57. cfncluster (“CloudFormation cluster”)
Command Line Interface Tool
Deploy and demo an HPC cluster
For more info:
aws.amazon.com/hpc/resources
Try out our HPC CloudFormation-based demo
59. Customers are using AWS for more and more HPC workloads
Oil and Gas: Seismic Data Processing; Reservoir Simulations, Modeling; Geospatial applications; Predictive Maintenance
Manufacturing & Engineering: Computational Fluid Dynamics (CFD); Finite Element Analysis (FEA); Wind Simulation
Life Sciences: Genome Analysis; Molecular Modeling; Protein Docking
Media & Entertainment: Transcoding and Encoding; DRM, Encryption; Rendering
Scientific Computing: Computational Chemistry; High Energy Physics; Stochastic Modeling; Quantum Analysis; Climate Models
Financial: Monte Carlo Simulations; Wealth Management Simulations; Portfolio, Credit Risk Analytics; High Frequency Trading Analytics
61. cyclopic energy: computational fluid dynamics
“AWS makes it possible for us to deliver state-of-the-art technologies to clients within timeframes that allow us to be dynamic, without having to make large investments in physical hardware.”
Rick Morgans, Technical Director (CTO), cyclopic energy
• Two months’ worth of simulations finished in two days
62. Mentor Graphics: virtual lab for design and simulation
“Thanks to AWS, the Mentor Graphics customer experience is now fast, fluid, and simple.”
Ron Fuller, Senior Director of Engineering, Mentor
• Developed a virtual lab for ASIC design and simulation for product evaluation and training
63. AeroDynamic Solutions: turbine engine simulation
“We’re delighted to be working closely with the U.S. Air Force and AWS to make time accurate simulation a reality for designers large and small.”
George Fan, CEO, AeroDynamic Solutions
• Time accurate simulation was turned around in 72 hours with infrastructure costs well below $1,000
64. HGST: molecular dynamics simulation
“HGST is using AWS for a higher performance, lower cost, faster deployed solution vs buying a huge on-site cluster.”
Steve Philpott, CIO, HGST
• Uses HPC on AWS for CAD, CFD, and CDA
65. Pfizer: large-scale data analytics and modeling
“AWS enables Pfizer’s Worldwide Research and Development to explore specific difficult or deep scientific questions in a timely, scalable manner and helps Pfizer make better decisions more quickly.”
Dr. Michael Miller, Head of HPC for R&D, Pfizer
• Pfizer avoids having to procure new HPC hardware by using AWS for peak workloads.
Credit to: David Pellerin, Dougal Ballantyne, Angel Pizarro, and Ryan Shuttleworth for a lot of the content for these slides.
Amy Sun for many of the AWS graphics.
Ian Meyers for the financial grid computing architecture.
Last Updated: July 26, 2014 – butlerb@
Point of contact: Ben Butler (butlerb@)
Tags: #hpc, #high performance computing, #cluster, #MPI, #SR-IOV, #bi-section, #bandwidth, #architecture, #instance
Unlimited Infrastructure – increased scalability and elasticity – go from 10’s to 1000’s of instances
Efficient clusters – our compute instances are highly efficient when actual performance is compared with theoretical performance (Rmax/Rpeak). This means you need fewer nodes than you would in an inefficient cluster, which is a potentially large cost saving. Tune your cluster instead of tuning to your cluster: with AWS you can pick the right EC2 instance type rather than being forced to optimize your workload for the fixed cluster you have in-house
Low Cost with Flexible pricing – multiple pricing models, pay as you go, no CAPEX
Increased collaboration – access to clusters and data can be from anywhere with an internet connection
Faster time to results – focus on your business/science and increase the efficiency of your people by not burdening them with IT. While you can benchmark the cluster on the performance of a running job, it is more important and comprehensive to benchmark the total time it takes to provision and use the cluster end-to-end
Concurrent Clusters on demand – no more waiting in a queue, run multiple jobs simultaneously with an API call
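The Rmax/Rpeak efficiency mentioned above can be estimated from the June 2014 TOP500 figures shown earlier in this deck. This is a rough sketch, not an official calculation: the value of 8 double-precision operations per core per cycle is an assumption about Ivy Bridge with AVX that does not appear in the slides.

```python
# Figures from the June 2014 TOP500 slide in this deck.
rmax_tflops = 484.2
cores = 26_496
clock_ghz = 2.8

# Assumption (not stated in the slides): each core retires 8 double-precision
# floating point operations per cycle.
FLOPS_PER_CYCLE = 8

# cores * GHz * flops/cycle gives GFlop/s; divide by 1000 for TFlop/s.
rpeak_tflops = cores * clock_ghz * FLOPS_PER_CYCLE / 1000
efficiency = rmax_tflops / rpeak_tflops

print(f"Rpeak is about {rpeak_tflops:.1f} TFlop/s")
print(f"Rmax/Rpeak is about {efficiency:.1%}")
```

Under that assumption the cluster delivers a little over four fifths of its theoretical peak on LINPACK.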
Other Use Cases:
Science-as-a-Service
Large-scale HTC (100,000+ core clusters)
Small to medium-scale MPI clusters (hundreds of nodes)
Many small MPI clusters working in parallel to explore parameter space
GPGPU workloads
Dev/test of MPI workloads prior to submitting to supercomputing centers
Collaborative research environments
On-demand academic training/lab environments
Ephemeral clusters, custom tailored to the task at hand, created for various stages of a pipeline
Need design graphics here
Slot in customers are here
Customer logos
*BEN: check with David Pellerin, Jamie and Dougal on the best verticals with the best customer reference for each vertical
Check on using screenshot of a website
http://news.cnet.com/8301-1001_3-57611919-92/supercomputing-simulation-employs-156000-amazon-processor-cores
http://blog.cyclecomputing.com/2013/11/back-to-the-future-121-petaflopsrpeak-156000-core-cyclecloud-hpc-runs-264-years-of-materials-science.html
Computational compound analysis
Solar panel material
Estimated computation time: 264 years
DESIGN/DEV:
Need a much better graphic to signify the idea of using multiple EC2 purchasing options to optimize cost.
Medium RI to replace Light RI in the chart.
http://aws.amazon.com/solutions/case-studies/harvard/
AWS Case Study: Harvard Medical School
About Harvard Medical School
The Laboratory for Personalized Medicine (LPM), of the Center for Biomedical Informatics at Harvard Medical School, run by Dr. Peter Tonellato, combined the power of high throughput sequencing and biomedical data collection technologies with the flexibility of Amazon Web Services (AWS) to develop innovative whole genome analysis testing models in record time. “The combination of our approach to biomedical computing and AWS allowed us to focus our time and energy on simulation development, rather than technology, to get results quickly,” said Tonellato. “Without the benefits of AWS, we certainly would not be as far along as we are.”
The Challenge
Tonellato’s lab focuses on personalized medicine—preventive healthcare for individuals based on their genetic characteristics—by creating models and simulations to assess the clinical value of new genetic tests.
Other projects include simulating large patient populations to aid in clinical trial simulations and predictions. To overcome the difficulty of finding enough real patient data for modeling, LPM creates patient avatars—literally “virtual” patients. The lab can create different sets of avatars for different genetic tests and then replicate huge numbers of them based on the characteristics of hospital populations.
Tonellato needed to find an efficient way to manipulate many avatars, sometimes as many as 100 million at a time. “In addition to being able to handle enormous amounts of data,” he said, “I wanted to devise a system where postdoctoral researchers can scope a genetic risk situation, determine the appropriate simulation and analysis to create the avatars, and then quickly build web applications to run the simulations, rather than spend their time troubleshooting computing technology.”
Why Amazon Web Services
In 2006, Tonellato turned to cloud computing to address the complex and highly variable computational need. “I evaluated several alternatives but found nothing as flexible and robust as Amazon Web Services,” he said. Having built datacenters previously, Tonellato could not afford the time he knew would be required to set up servers and then write code. Instead, he decided to conduct a test to see how fast his team could put together a series of custom Amazon Machine Images (AMIs) that would reflect the optimal development environment for researchers’ web applications.
Now, Tonellato’s lab has extended their efforts to integrate Spot Instances into their workflows so that they can stretch their grant money even further. According to Tonellato, “We leverage Spot Instances when running Amazon Elastic Compute Cloud (Amazon EC2) clusters to analyze entire genomes. We have the potential to run even more worker nodes at less cost when using Spot Instances, so it is a huge saving in both time and cost for us. To take advantage of these savings, it just took us a day of engineering, and saw roughly 50% savings in cost.” Tonellato’s lab leverages MIT’s StarCluster tools, which have built-in capabilities to manage an Oracle Grid Engine Cluster on Spot Instances. Erik Gafni, a programmer in Tonellato’s lab, performed the integration of StarCluster into the lab’s workflow. According to Gafni, “Using StarCluster, it was incredibly easy to configure, launch, and start using a running Spot Cluster in less than 10 minutes.”
In addition, the LPM recognized the need for published resources about how to effectively use cloud computing in an academic environment and published an educational primer in PLoS Computational Biology to address this need. “We believe this article clearly shows how an academic lab can effectively use AWS to manage their computing needs. It also demonstrates how to think about computational problems in relation to AWS costs and computing resources,” says Vincent Fusaro, lead author and senior research fellow in the LPM.
The Benefits
“The AWS solution is stable, robust, flexible, and low cost,” Tonellato commented. “It has everything to recommend it.”
Tonellato runs his simulations on Amazon EC2, which provides customers with scalable compute capacity in the cloud. Designed to make web-scale computing easier for developers, Amazon EC2 makes it possible to create and provision compute capacity in the cloud within minutes.
Tonellato’s lab is thrilled with their AWS solution. “The number of genetic tests available to doctors and hospitals is constantly increasing,” Tonellato explained, “and they can be very expensive. We’re interested in determining which tests will result in better patient care and better results.” He added, “We believe our models may dramatically reduce the time it usually takes to identify the tests, protocols, and trials that are worth pursuing aggressively for both FDA approval and clinical use.”
Next Step
To learn more about how AWS can help your big data needs, visit our Big Data details page: http://aws.amazon.com/big-data/.
*Ben: Slide Notes
Loosely Coupled
Embarrassingly parallel -
Elastic -
Batch workloads –
Tightly Coupled
Interconnected jobs – MPI description here
Network sensitivity – enhanced networking, high throughput, low latency, predictable performance
Job specific algorithms –
Supporting Services
Data management
Task distribution
Workflow management
Feature: Details
Flexible: Run Windows or Linux distributions
Scalable: Wide range of instance types from micro to cluster compute
Machine Images: Configurations can be saved as machine images (AMIs) from which new instances can be created
Full control: Full root or administrator rights
Secure: Full firewall control via Security Groups
Monitoring: Publishes metrics to CloudWatch
Inexpensive: On-Demand, Reserved, and Spot instance types
VM Import/Export: Import and export VM images to transfer configurations in and out of EC2
* DESIGN/DEV:
Graphic to signify console and CLI – using code to provision thousands of instances, instead of provisioning datacenters
Example CLI Line
ec2-run-instances ami-b232d0db \
  --instance-count 3000 \
  --availability-zone eu-west-1a \
  --instance-type m1.xlarge
Auto scaling is the tool that allows just-in time provisioning of compute resources based on the policies and events that you specify
This family includes the C1, CC2, and C3 instance types, and is optimized for applications that benefit from high compute power. Compute-optimized instances have a higher ratio of vCPUs to memory than other families, and the lowest cost per vCPU among all Amazon EC2 instance types. We recommend compute-optimized instances for running compute-bound scale out applications. Examples of such applications include high performance front-end fleets and web-servers, on-demand batch processing, distributed analytics, and high performance science and engineering applications.
C3 instances are the latest generation of compute-optimized instances, providing customers with the highest performing processors and the lowest price/compute performance available in EC2 currently.
Each virtual CPU (vCPU) on C3 instances is a hardware hyper-thread from a 2.8 GHz Intel Xeon E5-2680v2 (Ivy Bridge) processor allowing you to benefit from the latest generation of Intel processors.
C3 instances support Enhanced Networking, which delivers improved inter-instance latencies, lower network jitter, and significantly higher packet per second (PPS) performance. With this feature, high performance clusters of C3 instances can deliver hundreds of teraflops of floating point performance.
Compared to C1 instances, C3 instances provide faster processors, approximately double the memory per vCPU, support for clustering and SSD-based instance storage.
CC2 instances feature 2.6GHz Intel Xeon E5-2670 processors, high core count (32 vCPUs) and support for cluster networking. CC2 instances are optimized specifically for HPC applications that benefit from high bandwidth, low latency networking and high compute capacity.
Popular use cases: High-traffic web applications, ad serving, batch processing, video encoding, distributed analytics, high-energy physics, genome analysis, and computational fluid dynamics.
Nvidia GPUs (cg1.4xlarge)
Intel Nehalem (cc1.4xlarge)
2TB of SSD 120,000 IOPS (hi1.4xlarge)
Intel Sandy Bridge E5-2670 (cc2.8xlarge)
Sandy Bridge, NUMA, 240GB RAM (cr1.4xlarge)
48 TB of ephemeral storage (hs1.8xlarge)
2.6 GHz Sandy Bridge CPU w/ Turbo enabled
1 NVIDIA GK104 GPU (Kepler)
8 vCPUs, 15 GiB of RAM
60GB SSD storage
Ideally suited for remote desktop and 3D
Supports DirectX, OpenGL, CUDA, OpenCL
Wide range of platform partners including Citrix, Otoy, NICE Software
http://aws.amazon.com/solutions/case-studies/national-taiwan-university/
About National Taiwan University
Fast Crypto Lab is a research group within National Taiwan University, in Taiwan. The group’s research activities focus on the design and analysis of efficient algorithms to solve important mathematical problems, as well as the development and implementation of these algorithms on massively parallel computers.
Why Amazon Web Services
Prior to signing on with Amazon Web Services (AWS), the group used a private cloud and ran Hadoop on their own machines. Prof. Chen-Mou Cheng, the Principal Investigator of Fast Crypto Lab, explains why the research group made the switch to AWS: “It is quite easy to get started with AWS with its clear and flexible interface. Amazon Elastic Compute Cloud (Amazon EC2) provides a common measure of cost across problems of a different nature. For problems that are the same or similar, Amazon EC2 can also be used as a metric for comparing alternative or competing algorithms and their implementations.”
Chen-Mou adds, “When using Amazon EC2 as a metric, the parallelizability of the algorithm or the parallelization of the implementation is explicitly taken into account, as opposed to being assumed or unspecified. The Amazon EC2 metric is thus practical and easy to use.”
The group now uses Hadoop Streaming in their architecture, and runs their programs with Amazon Elastic MapReduce (Amazon EMR) and Cluster GPU Instances for Amazon EC2.
“Our purpose is to break the record of solving the shortest vector problem (SVP) in Euclidean lattices," Chen-Mou says. "The problem plays an important role in the field of information science. We estimated that we would need 1,000 cg1.4xlarge instance-hours. We ended up using 50 cg1.4xlarge instances for about 10 hours to solve our problem. Now, the vectors we found are considered the hardest SVP anyone has solved so far. We only spent $2,300 for using the 100 Tesla M2050 for 10 hours, which is quite a good deal.”
The Benefits
Since switching to AWS, the group indicates that the machine maintenance costs have been reduced, and they have experienced more stable and scalable computational power. The group’s favorite component of AWS is Amazon CloudWatch, which it uses to watch computer utilities while also improving their program.
Looking into the future, Chen-Mou says, “We want to increase our GPU Cluster quota and solve a higher dimension SVP. We are also considering renting an AWS machine for setting up an SVN server.”
Customer reference here?
Speaker Notes
SEC’s Market Information Data Analytics System (MIDAS) provided by AWS partner Tradeworx
Powerful AWS-based system for big data market analytics
2M transaction messages per second; 20B records and 1TB of data per day
From RFP to contract award to production in ~12 months
See http://sec.gov/marketstructure
“For the growing team of quant types now employed at the SEC, MIDAS is becoming the world’s greatest data sandbox. And the staff is planning to use it to make the SEC a leader in its use of market data”
– Elisse B. Walter, Chairman of the SEC
“This basically propels the SEC from zero to 60 in one fell swoop, going from being way behind even the most basic market participant to being on par if not ahead of the vast majority of market participants, in terms of their system and analytical capabilities.”
– Gregg E. Berman, Associate Director, Office of Analytics and Research, SEC
Flip diagram
Task Clusters is HPC specific terminology
http://media.amazonwebservices.com/architecturecenter/AWS_ac_ra_loganalysis_11.pdf
To upload large data sets into AWS, it is critical to make
the most of the available bandwidth. You can do so by
uploading data into Amazon Simple Storage Service (S3) in
parallel from multiple clients, each using multithreading to
enable concurrent uploads or multipart uploads for further
parallelization. TCP settings like window scaling and selective
acknowledgement can be adjusted to further enhance
throughput. With the proper optimizations, uploads of several
terabytes a day are possible. Another alternative for huge
data sets might be Amazon Import/Export, which supports
sending storage devices to AWS and inserting their contents
directly into Amazon S3 or Amazon EBS volumes.
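The parallel, multipart upload idea described above can be sketched as splitting the data into numbered parts and sending them concurrently. This is a minimal local illustration: `upload_part` is a hypothetical stand-in that copies bytes into an in-memory dict and makes no calls to Amazon S3 or any AWS SDK; the part size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Part = Tuple[int, int, int]  # (part_number, offset, length)


def split_into_parts(size: int, part_size: int) -> List[Part]:
    """Return parts covering `size` bytes; part numbers start at 1."""
    parts = []
    offset = 0
    number = 1
    while offset < size:
        length = min(part_size, size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


def upload_part(data: bytes, part: Part, store: Dict[int, bytes]) -> int:
    """Hypothetical stand-in for one part upload: stores the byte range in a
    dict instead of sending it to a storage service."""
    number, offset, length = part
    store[number] = data[offset:offset + length]
    return number


def parallel_upload(data: bytes, part_size: int, workers: int = 4) -> bytes:
    """Upload all parts concurrently, then reassemble them in part order."""
    store: Dict[int, bytes] = {}
    parts = split_into_parts(len(data), part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every part and re-raises any worker exception.
        list(pool.map(lambda p: upload_part(data, p, store), parts))
    return b"".join(store[n] for n in sorted(store))


payload = bytes(range(256)) * 100  # 25,600 bytes of sample data
assert parallel_upload(payload, part_size=4096) == payload
```

Because each part carries its own number, parts may finish in any order and still be reassembled correctly, which is what makes concurrent transfer safe.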
Parallel processing of large-scale jobs is critical, and
existing parallel applications can typically be run on
multiple Amazon Elastic Compute Cloud (EC2) instances.
A parallel application may sometimes assume large scratch
areas that all nodes can efficiently read and write from. S3
can be used as such a scratch area, either directly using
HTTP or using a FUSE layer (for example, s3fs or SubCloud)
if the application expects a POSIX-style file system.
Once the job has completed and the result data is
stored in Amazon S3, Amazon EC2 instances can be
shut down, and the result data set can be downloaded. The
output data can be shared with others, either by granting read
permissions to select users or to everyone or by using time
limited URLs.
Instead of using Amazon S3, you can use Amazon
EBS to stage the input set, act as a temporary storage
area, and/or capture the output set. During the upload, the
concepts of parallel upload streams and TCP tweaking also
apply. In addition, uploads that use UDP may increase speed
further. The result data set can be written into EBS volumes,
at which time snapshots of the volumes can be taken for
sharing.
http://media.amazonwebservices.com/architecturecenter/AWS_ac_ra_financialgrid_12.pdf
Data sources for market, trade, and counterparties are
installed on startup from on-premises data sources, or
from Amazon Simple Storage Service (Amazon S3).
AWS Direct Connect can be used to establish a low
latency and reliable connection between the corporate
data center site and AWS, in 1 to 10Gbit increments. For
situations with lower bandwidth requirements, a VPN
connection to the VPC Gateway can be established.
Private subnetworks are specifically created for
customer source data, compute grid clients, and the
grid controller and engines.
Application and corporate data can be securely stored
in the cloud using the Amazon Relational Database
Service (Amazon RDS).
Grid controllers and grid engines are running Amazon
Elastic Compute Cloud (Amazon EC2) instances
started on demand from Amazon Machine Images (AMIs)
that contain the operating system and grid software.
Static data such as holiday calendars and QA libraries
and additional gridlib bootstrapping data can be
downloaded on startup by grid engines from Amazon S3.
Grid engine results can be stored in Amazon
DynamoDB, a fully managed database providing
configurable read and write throughput, allowing scalability on
demand.
Results in Amazon DynamoDB are aggregated using
a map/reduce job in Amazon Elastic MapReduce
(Amazon EMR) and final output is stored in Amazon S3.
The compute grid client collects aggregate results from
Amazon S3.
Aggregate results can be archived using Amazon
Glacier, a low-cost, secure, and durable storage service
http://aws.amazon.com/solutions/case-studies/bankinter/
About Bankinter
Bankinter was founded in June, 1965 as a Spanish industrial bank through a joint venture by Banco de Santander and Bank of America. It is currently listed among the top ten banks in Spain. Bankinter has provided online banking services since 1996, when they pioneered the offering of real-time stock market operations. More than 60% of Bankinter transactions are performed through remote channels; 46% of those transactions are through the Internet. Today Bankinter.com and Bankinter brokerage services continue to lead the European banking industry in online financial operations.
The Challenge
Bankinter uses Amazon Web Services (AWS) as an integral part of their credit-risk simulation application, developing complex algorithms to simulate diverse scenarios in order to evaluate the financial health of Bankinter clients. "This requires high computational power,” says Bankinter Director of Technological Innovation, Javier Roldán. “We need to perform at least 5,000,000 simulations to get realistic results.”
Why Amazon Web Services
Bankinter uses the flexibility and power of Amazon Elastic Compute Cloud (Amazon EC2) to perform these simulations, subdividing processes through a grid of Amazon EC2 instances and implementing simulations in parallel on several Amazon EC2 instances to obtain the result in a very effective time period.
Bankinter used Java to develop their application and the Amazon Software Development Kit (SDK) to automate the provisioning process of AWS elements. Through the use of AWS, Bankinter decreased the average time-to-solution from 23 hours to 20 minutes and dramatically reduced processing, with the ability to reduce even further when required. Amazon EC2 also allowed Bankinter to adapt from a big batch process to a parallel paradigm, which was not previously possible. Costs were also dramatically reduced with this cloud-based approach.
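The approach of subdividing a large simulation run into chunks that execute in parallel can be sketched as follows. This is an illustrative toy, not Bankinter's application: the default-probability model, the function names, and the use of local threads in place of a grid of Amazon EC2 instances are all assumptions.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List


def chunk_sizes(total: int, workers: int) -> List[int]:
    """Split `total` simulations into `workers` near-equal chunks."""
    base, extra = divmod(total, workers)
    return [base + (1 if i < extra else 0) for i in range(workers)]


def run_chunk(seed: int, n: int, default_probability: float) -> int:
    """Hypothetical toy model: count simulated defaults in `n` independent trials."""
    rng = random.Random(seed)  # separate seeded generator per chunk
    return sum(1 for _ in range(n) if rng.random() < default_probability)


def simulate(total: int, workers: int, default_probability: float) -> float:
    """Run all chunks concurrently and return the observed default rate."""
    sizes = chunk_sizes(total, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = list(
            pool.map(lambda args: run_chunk(*args, default_probability), enumerate(sizes))
        )
    return sum(counts) / total


print(simulate(total=50_000, workers=4, default_probability=0.1))
```

Each chunk is independent, so on a real grid every chunk could go to a separate instance and the wall-clock time would shrink roughly in proportion to the number of workers. For scale, going from 23 hours to 20 minutes is a 69-fold reduction.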
The Benefits
Bankinter plans to expand their use of AWS for future applications and business units. “The AWS platform, with its unlimited and flexible computational power, is a good fit for our risk-simulation process requirements,” says Roldán. “With AWS, we now have the power to decide how fast we want to obtain simulation results. More important, we have the ability to run simulations that were not possible before due to the large amount of infrastructure required.”
http://media.amazonwebservices.com/architecturecenter/AWS_ac_ra_batch_03.pdf
Users interact with the Job Manager application, which
is deployed on an Amazon Elastic Compute Cloud
(EC2) instance. This component controls the process of
accepting, scheduling, starting, managing, and completing
batch jobs. It also provides access to the final results, job and
worker statistics, and job progress information.
Raw job data is uploaded to Amazon Simple Storage
Service (S3), a highly-available and persistent data
store.
Individual job tasks are inserted by the Job Manager in
an Amazon Simple Queue Service (SQS) input
queue on the user’s behalf.
Worker nodes are Amazon EC2 instances deployed
in an Auto Scaling group. This group is a container
that ensures health and scalability of worker nodes. Worker
nodes pick up job parts from the input queue automatically
and perform single tasks that are part of the list of batch
processing steps.
Interim results from worker nodes are stored in
Amazon S3.
Progress information and statistics are stored on the
analytics store. This component can be either an
Amazon DynamoDB table or a relational database such as
an Amazon Relational Database Service (RDS) instance.
Optionally, completed tasks can be inserted in an
Amazon SQS queue for chaining to a second
processing stage.
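The flow of this reference architecture can be sketched in a few lines of Python. The sketch makes no AWS calls: a dictionary stands in for Amazon S3, `queue.Queue` objects stand in for the Amazon SQS queues, and a small counter dictionary stands in for the analytics store. The class and method names are hypothetical and chosen only for illustration.

```python
import queue


class BatchPipeline:
    """In-memory stand-in for the batch processing architecture."""

    def __init__(self):
        self.storage = {}                   # stand-in for Amazon S3
        self.input_queue = queue.Queue()    # stand-in for the SQS input queue
        self.output_queue = queue.Queue()   # optional queue for a second stage
        self.analytics = {"submitted": 0, "completed": 0}  # analytics store

    def submit_job(self, job_id, raw_items):
        """Job Manager: store raw data, then enqueue one task per item."""
        for i, item in enumerate(raw_items):
            key = f"{job_id}/raw/{i}"
            self.storage[key] = item
            self.input_queue.put(key)
            self.analytics["submitted"] += 1

    def worker_step(self, process):
        """Worker node: take one task, process it, store the interim result.

        Returns False when the input queue is empty.
        """
        try:
            key = self.input_queue.get_nowait()
        except queue.Empty:
            return False
        result_key = key.replace("/raw/", "/result/")
        self.storage[result_key] = process(self.storage[key])
        self.analytics["completed"] += 1
        self.output_queue.put(result_key)   # chain to a second stage
        return True

    def progress(self):
        """Fraction of submitted tasks that have completed."""
        total = self.analytics["submitted"]
        return self.analytics["completed"] / total if total else 0.0


if __name__ == "__main__":
    pipeline = BatchPipeline()
    pipeline.submit_job("job-1", [1, 2, 3])
    while pipeline.worker_step(lambda x: x * x):
        pass
    print(pipeline.progress())
```

In a real deployment many workers would poll the queue concurrently, and the queue service, not the worker, would handle redelivery of tasks whose worker failed; the single-threaded loop here only shows the order of the steps.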
Engage Sales and Solutions Architects
You have long queues to use your cluster
Your jobs are varied enough that you spend more time optimizing for the cluster you have
You are benchmarking for your cluster
Find out how your HPC workloads can run on AWS
We can find the right partner to help manage your HPC workload for you
A centralized repository of public datasets
Seamless integration with cloud based applications
No charge to the community
Some of the datasets available today:
1000 Genomes Project
Ensembl
GenBank
Illumina – Jay Flatley Human Genome Dataset
YRI Trio Dataset
The Cannabis Sativa Genome
UniGene
Influenza Virus
PubChem
Better graphics needed for these groups
These can be their own slide
Stick these in an appendix
Analytics – log scanning and simulations
Financial Modeling and Analysis – wealth management simulations, value at risk, counterparty value analytics
Image and Media encoding – render and encode media assets
Testing – load, integration, canary, and security testing
Process big data – Twitter feeds, genomes, trend analysis
Geospatial analysis – rendering
Scientific computing – simulations, drug discovery
Web crawling
Particle Physics Simulations
Artificial Intelligence Research
Drug Discovery
Scientific Collaboration and Centralized Data management
GPU – Use Cases
Molecular Dynamics
Seismic Analysis
Genome Assembly and Alignment
Visualize molecules
Simulation
Floating point applications
Page ranking
Video transcoding
http://aws.amazon.com/solutions/case-studies/mentor-graphics/
Mentor Graphics is a global supplier of EDA and mechanical CAD tools
ASIC design and simulation, mask verification
Board-level design products
EM and CFD solvers
Business case: enable cloud-based deployments for product evaluation and training. Create a virtual lab environment with IP security and a responsive user experience.
Solution: Mentor Virtual Labs, built on AWS
Time accurate fluid dynamics
SBIR-funded project for the US Air Force Research Laboratory (AFRL)
SAS 70 Type II certification and VPN-level access required
Additional security measures:
Uploaded and downloaded data was encrypted
Dedicated EC2 cluster instances were provisioned
Data was purged upon completion of the run
“The results of this case were impressive. Using Amazon EC2 the large-scale, time accurate simulation was turned around in just 72 hours with computing infrastructure costs well below $1,000.”
http://aws.amazon.com/solutions/case-studies/aerodynamic-solutions/
HGST application roadmap:
Collaboration tools
Molecular dynamics
CAD, CFD, EDA
Big data for manufacturing yield analysis
http://aws.amazon.com/solutions/case-studies/pfizer/
About Pfizer
Pfizer, Inc. applies science and global resources to improve health and well-being at every stage of life. The company strives to set the standard for quality, safety, and value in the discovery, development, and manufacturing of medicines for people and animals.
The Challenge
Pfizer’s high performance computing (HPC) software and systems for worldwide research and development (WRD) support large-scale data analysis, research projects, clinical analytics, and modeling. Pfizer’s HPC services are used across the spectrum of WRD efforts, from the deep biological understanding of disease, to the design of safe, efficacious therapeutic agents.
Why Amazon Web Services
Dr. Michael Miller, Head of HPC for R&D at Pfizer explains why Pfizer initially considered using Amazon Web Services (AWS) to handle its peak computing needs: “The Amazon Virtual Private Cloud (Amazon VPC) was a unique option that offered an additional level of security and an ability to integrate with other aspects of our infrastructure.”
Pfizer has now set up an instance of the Amazon VPC to provide a secure environment with which to carry out computations for WRD. They say, “We accomplished this by customizing the ‘job scheduler’ in our HPC environment to recognize VPC workload, and start and stop instances as needed to address the workflow. Research can be unpredictable, especially as the on-going science raises new questions.” The VPC has enabled Pfizer to respond to these challenges by providing the means to compute beyond the capacity of the dedicated HPC systems, which provides answers in a timely manner.
Pfizer’s solution was written in C and is based on the Amazon Elastic Compute Cloud (Amazon EC2) command line tools. They say, “We are currently migrating this solution over to a commercial API that will enable additional provisioning and usage tracking capabilities.”
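The scheduler customization described above, starting and stopping instances as the workload requires, can be illustrated with a small sizing rule. This is a sketch in Python under stated assumptions, not Pfizer's C implementation: the function name, the slots-per-instance figure, and the instance cap are all hypothetical.

```python
import math


def plan_cloud_instances(pending_jobs, local_free_slots, running_cloud,
                         slots_per_instance=8, max_cloud=100):
    """Return how many cloud instances to start (positive) or stop (negative).

    Jobs fill the dedicated cluster first; only the overflow is sent to
    cloud instances, up to an assumed cap of max_cloud instances.
    """
    overflow = max(0, pending_jobs - local_free_slots)
    needed = min(max_cloud, math.ceil(overflow / slots_per_instance))
    return needed - running_cloud


if __name__ == "__main__":
    # 100 pending jobs, 20 free local slots, no cloud instances running
    print(plan_cloud_instances(100, 20, 0))
```

A scheduler hook could call such a function on each scheduling cycle and pass the result to whatever provisioning tool is in use. When the queue drains, the result turns negative and idle instances can be stopped.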
The Benefits
The primary cost savings has been in cost avoidance: “Pfizer did not have to invest in additional hardware and software, which is only used during peak loads; that savings allowed for investments in other WRD activities.”
For Pfizer, AWS is a fit-for-purpose solution. Dr. Miller explains, “It is not a replacement for, but rather an addition to our capabilities for HPC WRD activities, providing a unique augmentation to our computing capabilities.”
Overall, “AWS enables Pfizer’s WRD to explore specific difficult or deep scientific questions in a timely, scalable manner and helps Pfizer make better decisions more quickly.”
Looking ahead, Pfizer is interested in exploring Amazon Simple Storage Service (Amazon S3) for storing reference data to expand the type of computational problems they can address.