The concept of big data is now familiar, but how to apply it to a business and get the most out of it still deserves careful thought. Being able to easily store, analyze, and visualize valuable data is a key step toward gaining business insight.
This talk introduces how to build simpler, faster big-data analytics services using the wide range of data analysis tools AWS provides, such as Amazon Elastic MapReduce, Amazon Redshift, and Amazon Kinesis.
3. What you will hear in this webinar
How to build simpler, faster big-data analytics services using the wide range of data analysis tools AWS provides, such as Amazon Elastic MapReduce, Amazon Redshift, and Amazon Kinesis.
4. Agenda
• AWS Big data building blocks
• AWS Big data platform
• Log data collection & storage
• Introducing Amazon Kinesis
• Data Analytics & Computation
• Collaboration & sharing
• Netflix Use-case
21. Collection of Data
• Sources: web servers, application servers, connected devices, mobile phones, etc.
• Aggregation tool: a scalable method to collect and aggregate (Flume, Kafka, Kinesis, or a queue)
• Data sink: a reliable and durable destination (or destinations)
23. Run your own log collector
Your application runs its own collector on Amazon EC2, writing logs to Amazon S3, DynamoDB, or any other data store.
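A minimal sketch of the self-managed collector pattern above: the application buffers log records and flushes them to a durable sink in batches. The `LogCollector` name, the batch size, and the list-backed sink are illustrative; a real sink would be an S3 upload call.

```python
class LogCollector:
    """Buffer log records in memory and flush them to a sink in batches."""

    def __init__(self, sink, batch_size=3):
        self.sink = sink          # callable taking a list of records (e.g. an S3 PUT)
        self.batch_size = batch_size
        self.buffer = []

    def collect(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(list(self.buffer))  # hand off a copy of the batch
            self.buffer.clear()

# Usage: batches land in `stored`; flush() at the end drains the partial batch.
stored = []
collector = LogCollector(stored.append, batch_size=3)
for i in range(7):
    collector.collect(f"log-{i}")
collector.flush()
# stored == [["log-0", "log-1", "log-2"], ["log-3", "log-4", "log-5"], ["log-6"]]
```

Batching is what makes the collector scale: one PUT per batch instead of one per record, at the cost of some buffering delay and the operational burden the next slides discuss.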
24. Use a Queue
Producers publish to Amazon Simple Queue Service (SQS); consumers drain the queue into Amazon S3, DynamoDB, or any other data store.
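The point of the queue is decoupling: producers only know the queue, not the data store. A toy sketch using Python's standard `queue.Queue` as a stand-in for SQS (function names and the list-backed store are illustrative):

```python
from queue import Queue

def produce(q, records):
    # Producers enqueue and move on; they never touch the data store.
    for r in records:
        q.put(r)

def drain(q, store):
    # A consumer pulls messages and writes them to the store. With SQS you
    # would delete a message only after a successful write, giving
    # at-least-once delivery; here the drain is synchronous and lossless.
    while not q.empty():
        store.append(q.get())

q = Queue()
produce(q, ["evt-1", "evt-2", "evt-3"])
datastore = []
drain(q, datastore)
# datastore == ["evt-1", "evt-2", "evt-3"]
```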
25. Agency Customer: Video Analytics on AWS
An Elastic Load Balancer fronts edge servers on EC2, which push logs through Amazon Simple Queue Service (SQS) to workers on EC2. Data lands in Amazon Simple Storage Service (S3), Amazon Elastic MapReduce processes it on an HDFS cluster, and the pipeline produces logs and reports.
26. Use a Tool like Flume, Kafka, Honu, etc.
Flume running on EC2 aggregates logs and delivers them to Amazon S3, HDFS, or any other data store.
27. Choice of tools
• (+) Pros / (-) Cons
• (+) Flexibility: customers select the most appropriate software and underlying infrastructure
• (+) Control: software and hardware can be tuned to meet specific business and scenario needs
• (-) Ongoing operational complexity: deploying and managing an end-to-end system
• (-) Infrastructure planning and maintenance: managing a reliable, scalable infrastructure
• (-) Developer/IT staff overhead: developer, DevOps, and IT staff time and energy expended
• (-) Unsupported software: deprecated and/or pre-version-1 open source software
• Future: the need to stream data in real time
32. Introducing Amazon Kinesis
Managed Service for Real-Time Processing of Big Data
Data sources send records through an AWS endpoint into a stream whose shards (Shard 1 ... Shard N) are replicated across multiple Availability Zones. Consuming applications read the stream: App.1 (aggregate & de-duplicate), App.2 (metric extraction), App.3 (sliding-window analysis), and App.4 (machine learning), delivering results to S3, DynamoDB, Redshift, and EMR.
33. Kinesis Architecture
• Millions of sources producing 100s of terabytes per hour
• A front end handles authentication and authorization
• Durable, highly consistent storage replicates data across three data centers (Availability Zones)
• An ordered stream of events supports multiple readers: real-time dashboards and alarms; machine-learning algorithms or sliding-window analytics; aggregation and archival to S3; aggregate analysis in Hadoop or a data warehouse
• Inexpensive: $0.028 per million puts
34. Putting data into Kinesis
Managed Service for Ingesting Fast Moving Data
• Streams are made of Shards
⁻ A Kinesis Stream is composed of multiple Shards
⁻ Each Shard ingests up to 1 MB/sec of data, and up to 1,000 TPS
⁻ Each Shard emits up to 2 MB/sec of data
⁻ All data is stored for 24 hours
⁻ You scale Kinesis streams by adding or removing Shards
• Simple PUT interface to store data in Kinesis
⁻ Producers use a PUT call to store data in a Stream
⁻ A Partition Key is used to distribute the PUTs across Shards
⁻ A unique Sequence # is returned to the Producer upon a successful PUT call
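The partition-key routing above works by taking the MD5 hash of the key and mapping the 128-bit result into per-shard hash-key ranges. A small sketch of that routing, assuming equal ranges per shard (the shard count and key names are illustrative):

```python
import hashlib

NUM_SHARDS = 4
KEY_SPACE = 2 ** 128  # Kinesis hashes partition keys into a 128-bit key space

def shard_for(partition_key, num_shards=NUM_SHARDS):
    """Map a partition key to a shard index: MD5 the key, then scale the
    128-bit hash into one of num_shards equal hash-key ranges."""
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return h * num_shards // KEY_SPACE

# The same key always lands on the same shard, preserving per-key ordering;
# many distinct keys spread PUTs across all shards.
assert shard_for("user-42") == shard_for("user-42")
counts = [0] * NUM_SHARDS
for i in range(1000):
    counts[shard_for(f"user-{i}")] += 1
```

This is why a skewed partition key (e.g. one hot customer ID) can bottleneck a single shard even when the stream as a whole has spare capacity.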
36. Building Kinesis Apps
Client library for fault-tolerant, at-least-once, real-time processing
KCL Workers 1 through n run across a fleet of EC2 instances, each worker reading from one of the stream's shards (Shard 1 ... Shard n).
• Key streaming application attributes:
• Be distributed, to handle multiple shards
• Be fault tolerant, to handle failures in hardware or software
• Scale up and down as the number of shards increases or decreases
• The Kinesis Client Library (KCL) helps with distributed processing:
• Automatically starts a Kinesis Worker for each shard
• Simplifies reading from the stream by abstracting away individual shards
• Increases / decreases Kinesis Workers as the number of shards changes
• Checkpoints to keep track of a Worker’s location in the stream
• Restarts Workers if they fail
• Use the KCL with Auto Scaling groups:
• Auto Scaling policies will restart EC2 instances if they fail
• Automatically add EC2 instances when load increases
• The KCL will redistribute Workers to use the new EC2 instances
OR
• Use the Get APIs for raw reads of Kinesis data streams
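The checkpointing behavior described above can be sketched as a toy worker loop. A dict stands in for the durable checkpoint table (the real KCL uses DynamoDB), and all names are illustrative; the point is that a restarted worker resumes from the last checkpoint, which yields at-least-once processing.

```python
def process_stream(records, checkpoint_store, fail_at=None):
    """Process records in order, checkpointing after each one.
    On restart, resume from the last checkpoint. A record processed just
    before a crash-without-checkpoint would be seen again (at-least-once)."""
    processed = []
    start = checkpoint_store.get("seq", 0)   # resume point
    for seq in range(start, len(records)):
        if seq == fail_at:
            raise RuntimeError("worker crashed")
        processed.append(records[seq])
        checkpoint_store["seq"] = seq + 1    # durable checkpoint
    return processed

records = [f"rec-{i}" for i in range(6)]
ckpt = {}
try:
    process_stream(records, ckpt, fail_at=4)  # first run crashes at record 4
except RuntimeError:
    pass
resumed = process_stream(records, ckpt)  # restarted worker resumes at checkpoint
# resumed == ["rec-4", "rec-5"]
```

Because processing happens before the checkpoint write, a crash between the two replays that record; downstream consumers therefore need idempotent handling.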
37. Amazon Kinesis: Key Developer Benefits
• Easy administration: a managed service for real-time streaming data collection, processing, and analysis. Simply create a new stream, set the desired level of capacity, and let the service handle the rest.
• Real-time performance: perform continual processing on streaming big data. Processing latencies fall to a few seconds, compared with the minutes or hours associated with batch processing.
• High throughput, elastic: seamlessly scale to match your data throughput rate and volume. You can easily scale up to gigabytes per second. The service will scale up or down based on your operational or business needs.
• S3, EMR, Storm, Redshift, & DynamoDB integration: reliably collect, process, and transform all of your data in real time and deliver it to the AWS data stores of your choice, with connectors for S3, Redshift, and DynamoDB.
• Build real-time applications: client libraries enable developers to design and operate real-time streaming data processing applications.
• Low cost: cost-efficient for workloads of any scale. You can get started by provisioning a small stream, and pay low hourly rates only for what you use.
38. Customers using Amazon Kinesis
• Mobile/Social Gaming: deliver continuous, real-time game-insight data from hundreds of game servers. Before: custom-built solutions were operationally complex to manage and not scalable, with delays in critical business data delivery, developer burden in building a reliable, scalable platform for real-time data ingestion/processing, and a slow-down of real-time customer insights. With Kinesis: accelerate time to market of elastic, real-time applications while minimizing operational overhead.
• Digital Advertising Tech: generate real-time metrics and KPIs on online ad performance for advertisers and publishers. Before: a store-and-forward fleet of log servers and a Hadoop-based processing pipeline, with data lost in the store/forward layer, operational burden in managing a reliable, scalable platform for real-time data ingestion/processing, and batch-driven rather than real-time customer insights. With Kinesis: generate the freshest analytics on advertiser performance to optimize marketing spend and increase responsiveness to clients.
39. Digital Ad. Tech Metering with Kinesis
Pipeline stages: continuous ad metrics extraction, incremental ad statistics computation, metering record archive, and an ad analytics dashboard.
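The "incremental ad statistics computation" stage above can be sketched as running counters updated one metering record at a time, rather than recomputed from scratch per batch. The `AdStats` class, field names, and sample records are illustrative, not from the talk.

```python
from collections import defaultdict

class AdStats:
    """Incrementally maintain per-advertiser impression and click counts
    as metering records stream in from Kinesis."""

    def __init__(self):
        self.impressions = defaultdict(int)
        self.clicks = defaultdict(int)

    def update(self, record):
        self.impressions[record["advertiser"]] += 1
        if record.get("clicked"):
            self.clicks[record["advertiser"]] += 1

    def ctr(self, advertiser):
        # Click-through rate, a typical dashboard KPI.
        shown = self.impressions[advertiser]
        return self.clicks[advertiser] / shown if shown else 0.0

stats = AdStats()
for rec in [{"advertiser": "acme", "clicked": True},
            {"advertiser": "acme"},
            {"advertiser": "globex"},
            {"advertiser": "acme", "clicked": True}]:
    stats.update(rec)
# stats.ctr("acme") == 2/3
```

Because each update is O(1), the dashboard stays current at stream speed; the full record history can still be archived to S3 for batch reconciliation.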
40. Collection of Data
• Sources: web servers, application servers, connected devices, mobile phones, etc.
• Aggregation tool: a scalable method to collect and aggregate (Flume, Kafka, Kinesis, or a queue)
• Data sink: a reliable and durable destination (or destinations)
43. Cloud Database and Storage Tier: Use the Right Tool for the Job!
Client tier, app/web tier, and a database & storage tier that spans search, Hadoop/HDFS, cache, blob store, SQL, and NoSQL.
44. Cloud Database and Storage Tier: Use the Right Tool for the Job!
The same tiers mapped to AWS services: Amazon RDS (SQL), Amazon DynamoDB (NoSQL), Amazon ElastiCache (cache), Amazon S3 (blob store), Amazon Glacier (archive), Amazon CloudSearch (search), and HDFS on Amazon EMR (Hadoop).
45. What Database and Storage Should I Use?
• Data structure
• Query complexity
• Data characteristics: hot, warm, cold
48. A chart plotting the storage and database options (Amazon ElastiCache, Amazon RDS, Amazon DynamoDB, Amazon CloudSearch, HDFS, Amazon S3, Amazon Glacier) along trade-off axes: request rate (high to low), cost/GB (high to low), latency (low to high), data volume (low to high), and structure (low to high).
49. What Data Store Should I Use?

| | Amazon ElastiCache | Amazon DynamoDB | Amazon RDS | Amazon CloudSearch | Amazon EMR (HDFS) | Amazon S3 | Amazon Glacier |
|---|---|---|---|---|---|---|---|
| Average latency | ms | ms | ms, sec | ms, sec | sec, min, hrs | ms, sec, min (~size) | hrs |
| Data volume | GB | GB–TBs (no limit) | GB–TB (3 TB max) | GB–TB | GB–PB (~nodes) | GB–PB (no limit) | GB–PB (no limit) |
| Item size | B–KB | KB (64 KB max) | KB (~row size) | KB (1 MB max) | MB–GB | KB–GB (5 TB max) | GB (40 TB max) |
| Request rate | Very high | Very high | High | High | Low–very high | Low–very high (no limit) | Very low (no limit) |
| Storage cost ($/GB/month) | $$ | ¢¢ | ¢¢ | $ | ¢ | ¢ | ¢ |
| Durability | Low–moderate | Very high | High | High | High | Very high | Very high |

The columns range from hot data on the left, through warm data, to cold data on the right.
50. Decouple your storage and analysis engine
1. Single Version of Truth
2. Choice of multiple analytics Tools
3. Parallel execution from different teams
4. Lower cost
Learning from Netflix
51. S3 as a “single source of truth”
Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
52. Ingest via Kinesis or Amazon SQS into Amazon S3, DynamoDB, or any SQL or NoSQL store; choose depending upon your design.
56. Batch Processing
• Take a large amount of cold data and ask questions
• Takes minutes or hours to get answers back
Example: generating hourly, daily, or weekly reports
58. Stream Processing (a.k.a. Real Time)
• Take a small amount of hot data and ask questions
• Takes a short amount of time to get your answer back
Example: 1-minute metrics
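The "1-minute metrics" example above is a tumbling-window aggregation: each event falls into exactly one minute-wide bucket, and the buckets are counted independently. A minimal sketch (the function name and sample events are illustrative):

```python
from collections import defaultdict

def one_minute_counts(events):
    """Tumbling-window aggregation: bucket events by the minute they
    occurred in (epoch-second timestamp // 60) and count per bucket."""
    counts = defaultdict(int)
    for ts, _value in events:   # ts is an epoch-seconds timestamp
        counts[ts // 60] += 1
    return dict(counts)

events = [(3, "a"), (59, "b"), (61, "c"), (125, "d"), (130, "e")]
# minute 0 holds 2 events, minute 1 holds 1, minute 2 holds 2
assert one_minute_counts(events) == {0: 2, 1: 1, 2: 2}
```

A sliding-window variant (as in the Kinesis App.3 example earlier) would instead keep each event in every window that overlaps it; the tumbling form is the simpler, cheaper case.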
60. Amplab Big Data Benchmark
Scan query, aggregate query, and join query results: https://amplab.cs.berkeley.edu/benchmark/
61. What Batch Processing Technology Should I Use?

| | Redshift | Impala | Presto | Spark | Hive |
|---|---|---|---|---|---|
| Query latency (low is better) | Low | Low | Low | Low–Medium | Medium–High |
| Durability | High | High | High | High | High |
| Data volume | 1.6 PB max | ~nodes | ~nodes | ~nodes | ~nodes |
| Managed | Yes | EMR bootstrap | EMR bootstrap | EMR bootstrap | Yes (EMR) |
| Storage | Native | HDFS | HDFS/S3 | HDFS/S3 | HDFS/S3 |
| # of BI tools | High | Medium | High | Low | High |
62. What Stream Processing Technology Should I Use?

| | Spark Streaming | Apache Storm + Trident | Kinesis Client Library |
|---|---|---|---|
| Scale/throughput | ~nodes | ~nodes | ~nodes |
| Data volume | ~nodes | ~nodes | ~nodes |
| Manageability | Yes (EMR bootstrap) | Do it yourself | EC2 + Auto Scaling |
| Fault tolerance | Built-in | Built-in | KCL checkpointing |
| Programming languages | Java, Python, Scala | Java, Scala, Clojure | Java, Python |
69. SQL based processing
Log aggregation tools feed Amazon SQS, Amazon S3, DynamoDB, or any SQL or NoSQL store; from there, data loads into Amazon Redshift, a petabyte-scale columnar data warehouse.
70. SQL based processing for unstructured data
The same pipeline, with Amazon EMR added as a pre-processing framework in front of Amazon Redshift.
71. Your choice of BI Tools on the cloud
Your choice of BI tools sits on top of the Amazon EMR and Amazon Redshift layer.
73. Collaboration and Sharing insights
The same ingestion layer (log aggregation tools into Amazon SQS, Amazon S3, DynamoDB, or any SQL or NoSQL store) feeds Amazon EMR and Amazon Redshift.
74. Sharing results and visualizations
A web app server with visualization tools is added on top of EMR and Redshift.
75. Sharing results and visualizations and scale
The web app server and visualization tools scale out as demand grows.
76. Sharing results and visualizations
Business intelligence tools connect to EMR and Redshift.
77. Geospatial Visualizations
GIS tools on Hadoop and standalone GIS tools join the visualization and business intelligence tools.
78. Rinse and Repeat
Amazon Data Pipeline orchestrates the stages so the whole cycle can be re-run.
79. The complete architecture
Log aggregation tools feed Amazon SQS, Amazon S3, DynamoDB, or any SQL or NoSQL store; Amazon EMR and Amazon Redshift process the data; visualization tools, business intelligence tools, and GIS tools (including GIS tools on Hadoop) consume the results; and Amazon Data Pipeline ties the stages together.
88. A variety of training programs
Online self-study and labs: learn the fundamentals of AWS and how to put it to use through a variety of online course materials and hands-on labs.
Instructor-led training: learn how to build highly available, cost-efficient, and secure applications on the AWS cloud in classes taught by professional AWS instructors. A range of in-person courses on architecture design and implementation is available.
AWS Certification: validate your cloud expertise and experience through certification exams, and present your professional development credentials.
http://aws.amazon.com/ko/training
89. Thank you for joining the AWS fundamentals webinar series!
We hope this webinar helped answer your questions.
Please share your thoughts on today's webinar in the survey that follows.
aws-korea-marketing@amazon.com
http://twitter.com/AWSKorea
http://facebook.com/AmazonWebServices.ko
http://youtube.com/user/AWSKorea
http://slideshare.net/AWSKorea