Big data technologies let you work with any velocity, volume, or variety of data in a highly productive environment. Join the General Manager of Amazon EMR, Peter Sirota, to learn how to scale your analytics, use Hadoop with Amazon EMR, write queries with Hive, develop real-world data flows with Pig, and understand the operational needs of a production data platform.
2. Agenda
• Amazon EMR: Hadoop in the cloud
• Hadoop Ecosystem on Amazon EMR
• Customer Use Cases
3. Hadoop is the right system for Big Data
• Scalable and fault tolerant
• Flexibility for multiple languages and data formats
• Open source
• Ecosystem of tools
• Batch and real-time analytics
4. Challenges with Hadoop
On premises
• Manage HDFS, upgrades, and system administration
• Pay for expensive support contracts
• Select hardware in advance and stick with predictions
On Amazon EC2
• Difficult to integrate with AWS storage services
• Independently manage and monitor clusters
5. Amazon EMR is the easiest way to run Hadoop in the cloud
6. Why Amazon EMR?
• Managed services
• Easy to tune clusters and trim costs
• Support for multiple data stores
• Unique features and ecosystem support
15. Choose your instance types
Try out different configurations to find your optimal architecture
• CPU: c1.xlarge, cc1.4xlarge, cc2.8xlarge
• Memory: m1.large, m2.2xlarge, m2.4xlarge
• Disk: hs1.8xlarge
16. Long running or transient clusters
Easy to run Hadoop clusters short-term or 24/7, and only pay for what you need
20. Resizable clusters
Easy to add and remove compute capacity on your cluster
Matched compute demands with cluster sizing (diagram label: 10 hours)
21. Use Spot and Reserved Instances
Minimize costs by supplementing on-demand pricing
22. Easy to use Spot Instances
Name-your-price supercomputing to minimize costs
• Spot for task nodes: up to 90% off Amazon EC2 on-demand pricing
• On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity
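The Spot/on-demand split on this slide can be sketched as a rough cost estimate. This is a minimal illustration, not an AWS pricing tool: the hourly price, the node counts, and the function name `hourly_cost` are assumptions made for the example, and the 90% figure is the slide's best case ("up to"), so actual Spot savings may be lower.

```python
# Assumed on-demand price in USD per instance-hour (hypothetical; real prices
# vary by instance type and region).
ON_DEMAND_PRICE = 0.50
# Best case quoted on the slide: Spot at "up to 90% off" on-demand pricing.
MAX_SPOT_DISCOUNT = 0.90


def hourly_cost(core_nodes, task_nodes, spot_discount=MAX_SPOT_DISCOUNT):
    """Estimate cluster cost per hour: core nodes on-demand, task nodes on Spot."""
    core = core_nodes * ON_DEMAND_PRICE
    task = task_nodes * ON_DEMAND_PRICE * (1.0 - spot_discount)
    return core + task


# Compare an all-on-demand cluster with one that runs its task nodes on Spot.
all_on_demand = hourly_cost(core_nodes=4, task_nodes=16, spot_discount=0.0)
mixed = hourly_cost(core_nodes=4, task_nodes=16)
print(f"all on-demand: ${all_on_demand:.2f}/hour")
print(f"core on-demand + task on Spot: ${mixed:.2f}/hour")
```

Keeping core nodes on on-demand capacity reflects the slide's split: only the task nodes are priced at the Spot rate in this estimate.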
23. 24/7 clusters on Reserved Instances
Minimize cost for consistent capacity
• Reserved Instances for long running clusters: up to 65% off on-demand pricing
24. Your data, your choice
Easy to integrate Amazon EMR with your data stores
26. Using Amazon S3 and HDFS
• Data Sources
• Data aggregated and stored in Amazon S3
• Transient EMR cluster for batch map/reduce jobs for daily reports
• Long running EMR cluster holding data in HDFS for Hive interactive queries
• Ad-hoc Query
• Weekly Report
27. Use Amazon EMR with Amazon Redshift and Amazon S3
• Data Sources
• Daily data aggregated in Amazon S3
• Amazon EMR cluster used to process data
• Processed data loaded into Amazon Redshift data warehouse
28. Use the Hadoop Ecosystem on Amazon EMR
Leverage a diverse set of tools to get the most out of your data
29. Hadoop 2.x
• Databases
• Machine learning
• Metadata stores
• Exchange formats
• Diverse query languages
and much more...
30. Use Hive on Amazon EMR to interact with your data in HDFS and Amazon S3
• Data warehouse for Hadoop
• Integration with Amazon S3 for better performance reading and writing to Amazon S3
• SQL-like query language to make iterative queries easier
• Easy to scale in HDFS on a persistent Amazon EMR cluster
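As an illustration of pointing Hive at data in Amazon S3, the sketch below builds a HiveQL `CREATE EXTERNAL TABLE` statement whose `LOCATION` is an S3 path. The helper name `external_table_ddl`, the table name, the columns, the tab-delimited layout, and the bucket are all hypothetical; the snippet only produces the statement text, which would then be run in Hive on the cluster.

```python
def external_table_ddl(table, columns, s3_location):
    """Build a HiveQL statement for an external table over tab-delimited files.

    columns is a list of (name, hive_type) pairs, e.g. ("user_id", "STRING").
    """
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'\n"
        f"LOCATION '{s3_location}';"
    )


# Hypothetical table and bucket, used only for illustration.
ddl = external_table_ddl(
    "clicks",
    [("user_id", "STRING"), ("ts", "BIGINT")],
    "s3://example-bucket/clicks/",
)
print(ddl)
```

Because the table is external, dropping it in Hive removes only the table definition and leaves the files in Amazon S3 in place.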
31. Use HBase on a persistent Amazon EMR cluster as a column-oriented scalable data store
• Billions of rows and millions of columns
• Backup to and restore from Amazon S3
• Flexible datatypes
• Modulate your HBase tables when adding new data to your system
33. Use ad-hoc queries on your cluster to drive insights in real-time
Spark / Shark
• In-memory MapReduce for faster queries
• Use HiveQL to interact with your data
Impala (coming soon!)
• Parallel database engine for Hadoop
• Use SQL to query data in HDFS on your cluster in real-time
34. “Hadoop-as-a-Service [Amazon EMR] offers a better price-performance ratio [than bare-metal Hadoop].”
1. Elastic clusters and cost optimization
2. Rapid, tuned provisioning
3. Agility for experimentation
4. Easy integration with diverse datastores
35. Diverse set of partners to build on Amazon EMR
BI / Visualization
Hadoop Distribution
Monitoring
Business Intelligence
Data Transformation
Data Transfer
Performance Tuning
Available on AWS Marketplace
BI / Visualization
ETL Tool
Graphical IDE
Available as a distribution in Amazon Elastic MapReduce
BI / Visualization
Encryption
Graphical IDE
55. Putting it into perspective …
• Billions of viewing hours of data
• Clusters of ~3,000 nodes
• A hundred billion events / day
• A few petabytes of data warehouse on Amazon S3
• Thousands of jobs / day
69. Channel 4 – Background
• Channel 4 is a public service, commercially funded, not-for-profit broadcaster.
• We have a remit to deliver innovative, experimental, distinctive, and diverse content across television, film, and digital media.
• We are funded predominantly by television advertising, competing with the other established UK commercial broadcasters, and increasingly with emerging, Internet-based providers.
• Our content is available across our portfolio of around 10 core and time-shift channels, and our on-demand service 4oD is accessible across multiple devices and platforms.
71. Business Intelligence at C4
• Well established Business Intelligence capability
• Based on industry standard proprietary products
• Real-time data warehousing
• Comprehensive business reporting
• Excellent internal skills
• Good external skills availability
72. Big Data Technology at C4
• 2011 - Embarked on Big Data initiative
– Ran in-house and cloud-based PoCs
– Selected Amazon EMR
• 2012 - Ran Amazon EMR in parallel with conventional BI
– Hive deployed to Data Analysts
– Amazon EMR workflows deployed to production
• 2013 - Amazon EMR confirmed as primary Big Data platform
– Amazon EMR usage growing, focus on automation
– Experimenting with Mahout for Machine Learning
73. What problems are we solving?
Single view of the viewer, recognising them across devices and serving relevant content
Personalising the viewer experience
74. How are we doing this?
• Principal tasks…
– Audience segmentation
– Personalisation
– Recommendations
• What data do we process…
– Website clickstream logs
– 4oD activity and viewing history
– Over 9m registered users
– Majority of activity now from “logged-in” users
76. High-Level Architecture
• Amazon EMR and existing BI technology are complementary
• Process billions of data rows in Amazon EMR, store millions of result rows in RDBMS
• No need to “rip and replace”: existing technology investment is protected
• Amazon EMR will continue to underpin major growth in data volumes and processing complexity
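The "process many rows, store few result rows" pattern on this slide can be sketched in miniature. This is an assumption-laden illustration, not Channel 4's pipeline: an in-memory `Counter` stands in for the Amazon EMR job, Python's built-in `sqlite3` stands in for the RDBMS, and the event fields, table name `user_views`, and function names are hypothetical.

```python
import sqlite3
from collections import Counter


def summarise(events):
    """Reduce raw (user_id, programme) events to a view count per user.

    Stand-in for the large-scale aggregation that would run on the cluster.
    """
    return Counter(user_id for user_id, _programme in events)


def store(conn, summary):
    """Write the small summary into a relational table for BI reporting."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user_views "
        "(user_id TEXT PRIMARY KEY, views INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO user_views VALUES (?, ?)", summary.items()
    )
    conn.commit()


# Tiny made-up sample; the real input would be billions of rows.
events = [("u1", "news"), ("u2", "film"), ("u1", "drama")]
conn = sqlite3.connect(":memory:")
store(conn, summarise(events))
rows = conn.execute(
    "SELECT user_id, views FROM user_views ORDER BY user_id"
).fetchall()
print(rows)
```

Only the aggregated rows reach the database, which is why the existing BI tools can keep reading from the RDBMS unchanged.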
77. Where Next?
• Continued growth in usage of Amazon EMR
• Migrate to Hadoop 2.x
• Adopt Amazon Redshift
• Improved integration between C4 and AWS
• Shift toward “near real-time” processing
78. Please give us your feedback on this presentation
BDT301