Big Data is everywhere these days. But what is it, and how can you use it to fuel your business? Data is as important to organizations as labour and capital. Organizations that can effectively capture, analyze, visualize and apply big data insights to their business goals can differentiate themselves from their competitors and outperform them in operational efficiency and the bottom line.
Join this session to understand the different AWS Big Data and Analytics services, such as Amazon Elastic MapReduce (Hadoop), Amazon Redshift (data warehousing) and Amazon Kinesis (streaming): when to use them and how they work together.
Reasons to attend:
- Learn how AWS can help you process and make better use of your data with meaningful insights.
- Learn about Amazon Elastic MapReduce, a managed Hadoop framework, and Amazon Redshift, a fully managed petabyte-scale data warehouse.
- Learn about real-time data processing with Amazon Kinesis.
10. JDBC/ODBC
[Diagram: a six-row table (ID, Name) accessed over JDBC/ODBC and distributed across compute nodes: rows 1 and 4 (John Smith, Pat Partridge) on one node, rows 2 and 5 (Jane Jones, Sarah Cyan) on another, rows 3 and 6 (Peter Black, Brian Snail) on a third.]
11. Dramatically reduces I/O
• Column storage
• Data compression
• Zone maps

ID  | Age | State | Amount
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375

• With row storage you do unnecessary I/O
• To get average Amount by State, you have to read everything
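The row-versus-column point above can be sketched in a few lines. This is an illustrative toy, not Redshift internals: the same four rows stored row-wise and column-wise, and the fields each layout must touch to compute average Amount by State.

```python
# Row layout: every field of every row travels together.
rows = [
    {"id": 123, "age": 20, "state": "CA", "amount": 500},
    {"id": 345, "age": 25, "state": "WA", "amount": 250},
    {"id": 678, "age": 40, "state": "FL", "amount": 125},
    {"id": 957, "age": 37, "state": "WA", "amount": 375},
]

# Column layout: one list per column.
columns = {
    "id": [123, 345, 678, 957],
    "age": [20, 25, 40, 37],
    "state": ["CA", "WA", "FL", "WA"],
    "amount": [500, 250, 125, 375],
}

def avg_amount_by_state_row(rows):
    # Row storage: id and age are read even though the query never uses them.
    totals, counts = {}, {}
    for row in rows:
        totals[row["state"]] = totals.get(row["state"], 0) + row["amount"]
        counts[row["state"]] = counts.get(row["state"], 0) + 1
    return {s: totals[s] / counts[s] for s in totals}

def avg_amount_by_state_col(columns):
    # Column storage: only the state and amount columns are read.
    totals, counts = {}, {}
    for state, amount in zip(columns["state"], columns["amount"]):
        totals[state] = totals.get(state, 0) + amount
        counts[state] = counts.get(state, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}

assert avg_amount_by_state_row(rows) == avg_amount_by_state_col(columns)
```

Both functions return the same answer; the columnar version simply never touches the ID and Age data, which is the I/O saving the slide is describing.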
12. Dramatically reduces I/O
• Column storage
• Data compression
• Zone maps

ID  | Age | State | Amount
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375

• With column storage, you only read the data you need
13. Dramatically reduces I/O

analyze compression listing;

Table   | Column         | Encoding
--------+----------------+----------
listing | listid         | delta
listing | sellerid       | delta32k
listing | eventid        | delta32k
listing | dateid         | bytedict
listing | numtickets     | bytedict
listing | priceperticket | delta32k
listing | totalprice     | mostly32
listing | listtime       | raw

• Column storage
• Data compression
• Zone maps
• COPY compresses automatically
• You can analyze and override
• More performance, less cost
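To see why encodings like `delta` compress sorted or nearly-sequential columns (such as `listid` above) so well, here is a toy sketch of delta encoding. This is a deliberate simplification of Redshift's actual on-disk format: store the first value, then only each value's difference from its predecessor, so the stored numbers stay small.

```python
# Toy delta encoding: [first value, then successive differences].
def delta_encode(values):
    if not values:
        return []
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)   # nearly-sequential data => tiny deltas
    return out

def delta_decode(encoded):
    out, total = [], 0
    for d in encoded:
        total += d               # running sum reconstructs the originals
        out.append(total)
    return out

listids = [100, 101, 103, 106, 110]      # nearly-sequential IDs
encoded = delta_encode(listids)          # [100, 1, 2, 3, 4]
assert delta_decode(encoded) == listids
```

The deltas fit in far fewer bits than the raw values, which is the "more performance, less cost" trade the slide points at.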
14. Dramatically reduces I/O
• Column storage
• Data compression
• Zone maps

[Diagram: a sorted column split into three blocks (10 … 324, 375 … 623, 637 … 959); the zone map records each block's minimum and maximum (10/324, 375/623, 637/959).]

• Track the minimum and maximum value for each block
• Skip over blocks that don't contain relevant data
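The zone-map idea above fits in a short sketch. This is illustrative rather than Redshift's implementation: each block keeps its min and max, and a range scan skips any block whose range cannot overlap the query.

```python
# Three blocks of a sorted column, matching the values on the slide.
blocks = [
    [10, 13, 14, 26, 100, 245, 324],
    [375, 393, 417, 512, 549, 623],
    [637, 712, 809, 834, 921, 959],
]
# Zone map: (min, max) per block.
zone_map = [(min(b), max(b)) for b in blocks]

def scan(blocks, zone_map, lo, hi):
    """Return values in [lo, hi], reading only blocks that might overlap."""
    hits, blocks_read = [], 0
    for block, (bmin, bmax) in zip(blocks, zone_map):
        if bmax < lo or bmin > hi:
            continue                     # skip: no relevant data in this block
        blocks_read += 1
        hits.extend(v for v in block if lo <= v <= hi)
    return hits, blocks_read

hits, blocks_read = scan(blocks, zone_map, 400, 600)
assert blocks_read == 1                  # only the middle block is read
assert hits == [417, 512, 549]
```

A query for values between 400 and 600 reads one block instead of three; on a real table that can eliminate the bulk of the I/O.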
15. Q3. What's good about it?
Performance, Scalability, Ease of Use, Cost
16. Performance Evaluation on 2B Rows

Traditional SQL Database | Amazon Redshift
Aggregate by month: 02:08:35 | 00:35:46 | 00:00:12
28. [Diagram: data flowing between S3 and an EMR cluster]
1. Put the data into S3
2. Choose: Hadoop distribution, # of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase
3. Launch the cluster using the EMR console, CLI, SDK, or APIs
4. Get the output from S3
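As one concrete way to do steps 2 and 3, the AWS CLI can launch a cluster in a single command. This is a sketch, not the deck's own example; the cluster name, instance sizing, release label, and S3 bucket below are placeholders to adjust for your account and region.

```shell
# Hypothetical EMR launch: pick the apps, node type/count, then create.
aws emr create-cluster \
  --name "demo-cluster" \
  --release-label emr-5.36.0 \
  --applications Name=Hive Name=Pig Name=HBase \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --log-uri s3://my-bucket/emr-logs/
```

The command returns the new cluster's ID, which the console, CLI, or SDK can then use to monitor and resize it.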
29. [Diagram: S3 feeding multiple EMR clusters]
• You can easily resize the cluster
• And launch parallel clusters using the same data
30. [Diagram: an EMR cluster with Spot nodes added]
• Use Spot nodes to save time and money
31. [Diagram: S3 and an EMR cluster]
• When processing is complete, you can terminate the cluster (and stop paying)
32. [Diagram: an EMR cluster]
• Or just store everything in HDFS (local disk)
33. Q3. What's good about it?
Scalability, Cost & Ease of Use
34. EMR with Spot Instances

Scenario #1 (without Spot), duration: 14 hours
• 4 instances * 14 hrs * $0.50 = $28.00

Scenario #2 (with Spot), duration: 7 hours
• 4 instances * 7 hrs * $0.50 = $14.00
• 5 instances * 7 hrs * $0.25 = $8.75
• Total = $22.75

Time savings: 50%
Cost savings: ~19%
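The arithmetic behind the two scenarios is worth spelling out, since it shows adding five extra Spot instances halves the runtime while still lowering the total bill. The rates are the ones on the slide.

```python
# Spot-vs-on-demand cost comparison, using the slide's hourly rates.
ON_DEMAND_RATE = 0.50   # $/instance-hour
SPOT_RATE = 0.25        # $/instance-hour (illustrative Spot price)

# Scenario #1: 4 on-demand instances for 14 hours.
cost_1 = 4 * 14 * ON_DEMAND_RATE                     # $28.00

# Scenario #2: same 4 on-demand instances plus 5 Spot instances,
# finishing in 7 hours instead of 14.
cost_2 = 4 * 7 * ON_DEMAND_RATE + 5 * 7 * SPOT_RATE  # $14.00 + $8.75 = $22.75

time_savings = 1 - 7 / 14        # 0.50 -> 50%
cost_savings = 1 - cost_2 / cost_1  # 0.1875 -> ~19%
assert cost_1 == 28.0 and cost_2 == 22.75
```

So the job finishes in half the time and costs $5.25 less, a saving of roughly 19%.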
35. [Diagram: an EMR cluster with a Master instance group, a Core instance group running HDFS, and a Task instance group (great for Spot Instances), with data in Amazon S3.]
38. Big Data Verticals

• Media/Advertising: targeted advertising, image and video processing
• Oil & Gas: seismic analysis
• Retail: recommendations, transaction analysis
• Life Sciences: genome analysis
• Financial Services: Monte Carlo simulations, risk analysis
• Security: anti-virus, fraud detection, image recognition
• Social Network/Gaming: user demographics, usage analysis, in-game metrics
40. Real-time Scenarios in Industry Segments

Segment | Log Ingest | Continual Metrics | Real-Time Data Analytics | Complex Stream Processing
Software/Technology | IT server logs ingestion | IT operational metrics dashboards | Devices/sensor operational intelligence |
Digital Ad Tech/Marketing | Advertising data aggregation | Advertising metrics like coverage, yield, conversion | Analytics on user engagement with ads | Optimized bid/buy engines
Financial Services | Market/financial transaction order data collection | Financial market data metrics | Fraud monitoring and Value-at-Risk assessment | Auditing of market order data
E-Commerce | Online customer engagement data aggregation | Consumer engagement metrics like page views, CTR | Customer clickstream analytics | Recommendation engines
45. [Diagram: a Kinesis stream spanning three Availability Zones; many data sources put records into the stream, and consumers read them for logging, metrics, analysis, and machine learning, feeding S3, DynamoDB, Redshift, and EMR.]
46. Putting data into Kinesis

• Each shard supports:
  • 1,000 transactions per second
  • 1 MB per second
  • 50 KB payload per transaction
• Messages are kept for 24 hours
• Simple PUT interface to store data in Kinesis
• A partition key is used to distribute the PUTs across shards
• A unique sequence number is created
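The partition-key routing above can be sketched as follows. This is an illustrative model, not the Kinesis service itself: Kinesis hashes the partition key with MD5 onto a 128-bit hash-key range that is split across the stream's shards, so all records with the same key land on the same shard.

```python
import hashlib

# Sketch of Kinesis-style shard routing: MD5(partition key) mapped onto the
# 128-bit hash-key range, which is split evenly across the shards.
NUM_SHARDS = 4                            # assumed stream size for this sketch
SHARD_SPAN = 2**128 // NUM_SHARDS

def shard_for(partition_key: str) -> int:
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return min(h // SHARD_SPAN, NUM_SHARDS - 1)

# Records with the same partition key always route to the same shard,
# which preserves per-key ordering.
assert shard_for("user-42") == shard_for("user-42")
assert 0 <= shard_for("user-7") < NUM_SHARDS
```

A good partition key (e.g. a user or device ID) spreads PUTs evenly across shards; a key with few distinct values would concentrate traffic on a handful of shards.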
47. Getting data out of Kinesis

Kinesis Client Library (KCL):
• Abstracts your code from individual shards
• Starts a Kinesis worker for each shard
• Increases and decreases the number of workers
• Tracks a worker's location in the stream
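What the KCL does can be modelled in miniature. This is not the real KCL API, just a sketch of its two core jobs under simplified assumptions: spreading shards across workers (lease balancing) and checkpointing each worker's position in its shard.

```python
from collections import defaultdict

def assign_shards(shard_ids, num_workers):
    """Round-robin shards across workers, roughly how KCL balances leases."""
    assignment = defaultdict(list)
    for i, shard in enumerate(shard_ids):
        assignment[i % num_workers].append(shard)
    return dict(assignment)

shards = ["shard-0", "shard-1", "shard-2", "shard-3"]

# Two workers each take two shards; four workers take one each.
assert assign_shards(shards, 2) == {0: ["shard-0", "shard-2"],
                                    1: ["shard-1", "shard-3"]}
assert assign_shards(shards, 4) == {0: ["shard-0"], 1: ["shard-1"],
                                    2: ["shard-2"], 3: ["shard-3"]}

# Checkpointing: remember the last processed sequence number per shard,
# so a restarted worker resumes where the previous one stopped.
checkpoints = {}
def checkpoint(shard_id, sequence_number):
    checkpoints[shard_id] = sequence_number

checkpoint("shard-0", "49590338271490256")
assert checkpoints["shard-0"] == "49590338271490256"
```

When shards are split or merged, or workers come and go, the real KCL rebalances these leases automatically; the sketch shows why per-shard ownership plus checkpoints is enough to resume cleanly.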
51. Online Labs | Training

Online Labs & Training:
Gain confidence and hands-on experience with AWS. Watch free instructional videos and explore self-paced labs.

Instructor-Led Classes:
Learn how to design, deploy and operate highly available, cost-effective and secure applications on AWS in courses led by qualified AWS instructors.

AWS Certification:
Validate your technical expertise with AWS and use practice exams to help you prepare for AWS Certification.

http://aws.amazon.com/training