6. Existing challenges
• Significant amount of work required to analyze data in Amazon S3
• Users often only have access to aggregated data sets
• Managing a Hadoop cluster or data warehouse requires expertise
7. What is Amazon Athena?
Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL.
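Beyond the console, the same service endpoint can be driven from the AWS SDK. Below is a minimal sketch using boto3; the region, database, and results bucket are illustrative assumptions, not details from the deck.

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

def run_query(sql, database="default", output="s3://my-athena-results/"):
    # Submit the query, then poll until Athena reports a terminal state.
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},  # results land in S3
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(
            QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return qid, state
        time.sleep(1)

print(run_query("SELECT 1"))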
8. Amazon Athena features
Serverless
• No infrastructure or administration
• Zero spin-up time
• Transparent upgrades
Highly available
• Connect to a service endpoint or log into the console
• Uses warm compute pools across multiple AZs
• Your data is in Amazon S3
Easy to use
• Log into the console
• Create a table
• Type in a Hive DDL statement
• Use the console Add Table wizard
• Start querying
9. Query data in Amazon S3 directly
• No loading of data
• Query data in its raw format
• Text, CSV, JSON, weblogs, AWS service logs
• Convert to an optimized form like ORC or Parquet for the best performance and lowest cost (a conversion sketch follows this list)
• No ETL required
• Stream data directly from Amazon S3
• Take advantage of Amazon S3 durability and availability
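The deck does not name a conversion tool; one common approach is a short Spark job (for example on Amazon EMR). A hedged sketch, with hypothetical bucket paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read raw CSV logs from S3 and rewrite them as Parquet (columnar,
# snappy-compressed by default). Paths are hypothetical.
df = spark.read.csv("s3://my-raw-logs/2017/", header=True, inferSchema=True)
df.write.mode("overwrite").parquet("s3://my-optimized-logs/parquet/")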
10. Uses ANSI SQL
• Start writing ANSI SQL
• Support for complex joins, nested queries & window functions (example below)
• Support for complex data types (arrays, structs)
• Support for partitioning of data by any key (date, time, custom keys)
• e.g., Year, Month, Day, Hour or Customer Key, Date
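As an illustration, a hypothetical query combining a window function with partition pruning; the table and column names are invented, and run_query is the helper sketched earlier:

# The WHERE clause hits partition keys (year/month/day), so Athena only
# scans the matching partitions instead of the whole table.
sql = """
SELECT customer_key,
       order_date,
       amount,
       SUM(amount) OVER (PARTITION BY customer_key
                         ORDER BY order_date) AS running_total
FROM   sales_logs
WHERE  year = '2017' AND month = '04' AND day = '19'
"""
print(run_query(sql))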
11. Built on familiar technologies
Presto
• Used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
Hive
• Used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning
12. Data formats supported by Amazon Athena
• Text files, e.g., CSV, raw logs
• Apache Web Logs, TSV files
• JSON (simple, nested)
• Compressed files
• Columnar formats such as Apache Parquet & Apache ORC
• AVRO support – coming soon
13. Amazon Athena is fast
• Tuned for performance
• Automatically parallelizes queries
• Results are streamed to the console
• Results are also stored in S3
• To improve query performance:
• Compress your data
• Use columnar formats
14. Cost efficiency of Amazon Athena
• Pay per query
• $5 per TB scanned from S3
• DDL Queries and failed queries are free
• Save by using compression, columnar formats, partitions
16. Example data analytics pipeline
Ad-hoc access to raw data using SQL
17. Example data analytics pipeline
Ad-hoc access to data using Athena
Athena can query aggregated datasets as well
18. Solving the existing challenges
• Significant amount of work required to analyze data in Amazon S3
→ No ETL required. No loading of data. Query data where it lives.
• Users often only have access to aggregated data sets
→ Query data at whatever granularity you want.
• Managing a Hadoop cluster or data warehouse requires expertise
→ No infrastructure to manage.
28. Athena access through Amazon QuickSight
QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-premises sources, including Amazon RDS, Amazon S3, Amazon Redshift, and Amazon Athena.
30. Creating tables
• Create Table statements (or DDL) are written in Hive (see the sketch after this list)
• High degree of flexibility
• Schema on read
• Hive is SQL-like but allows other concepts such as "external tables" and partitioning of data
• Data formats supported – JSON, TXT, CSV, TSV, Parquet and ORC (via SerDes)
• Data is stored in Amazon S3
• Metadata is stored in a metadata store
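As a hedged example of such a statement, a partitioned external table over JSON logs; the bucket, table, and columns are invented, and it is submitted through the run_query helper sketched earlier:

# Schema on read: the table is a projection over files already in S3.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
  request_time string,
  client_ip    string,
  uri          string,
  status       int
)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-raw-logs/web/'
"""
print(run_query(ddl))  # DDL queries are free of charge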
31. Athena's internal metadata store
• Stores Metadata
• Table definition, column names, partitions
• Highly available and durable
• Requires no management
• Access via DDL statements
• Similar to a Hive Metastore
33. Apache Parquet and Apache ORC – columnar formats
PARQUET
• Columnar format
• Schema segregated into footer
• Column major format
• All data is pushed to the leaf
• Integrated compression and indexes
• Support for predicate pushdown (sketch after this slide)
ORC
• Apache top-level project
• Schema segregated into footer
• Column major with stripes
• Integrated compression, indexes, and stats
• Support for predicate pushdown
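To make predicate pushdown concrete: the reader consults per-row-group (Parquet) or per-stripe (ORC) statistics and skips chunks that cannot match a filter. A brief PySpark sketch, reusing the SparkSession from the conversion example; the path and column are hypothetical:

# Spark pushes the filter down to the Parquet reader, which uses
# min/max column statistics to skip non-matching row groups.
df = spark.read.parquet("s3://my-optimized-logs/parquet/")
df.filter(df.status >= 500).show()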
34. Cost per query – $5/TB scanned
• Pay by the amount of data scanned per query
• Ways to save costs
• Compress
• Convert to columnar format
• Use partitioning
• Free: DDL queries, failed queries

Dataset                               | Size on Amazon S3     | Query run time | Data scanned          | Cost
Logs stored as text files             | 1 TB                  | 237 seconds    | 1.15 TB               | $5.75
Logs stored in Apache Parquet format* | 130 GB                | 5.13 seconds   | 2.69 GB               | $0.013
Savings                               | 87% less with Parquet | 34x faster     | 99% less data scanned | 99.7% cheaper
38. ETL consumes the most time in analytics
ETL → Data Warehousing (Amazon Redshift) → Business Intelligence (Amazon QuickSight)
70% of time is spent on the ETL stage
39. The resulting data gap
[Chart: data volume from 1990 to 2020 – generated data grows far faster than data available for analysis; the widening difference is "The Data Gap"]
40. Glue automates ETL work
✓ Cataloging data sources
✓ Identifying data formats and data types
✓ Generating Extract, Transform, Load code
✓ Executing ETL jobs; managing dependencies
✓ Handling errors
✓ Managing and scaling resources
41. AWS Glue components
Data Catalog
• Hive metastore compatible metadata repository of data sources
• Crawls data sources to infer table, data type, partition format
Job Execution
• Runs jobs in Spark containers – automatic scaling based on SLA
• Serverless – only pay for the resources you consume
Job Authoring
• Generates Python code to move data from source to destination
• Edit with your favorite IDE; share code snippets using Git
43. Glue Data Catalog
Manage table metadata through a Hive metastore API or Hive SQL. Supported by tools such as Hive, Presto, Spark, etc.
We added a few extensions:
• Search metadata for data discovery
• Connection info – JDBC URLs, credentials
• Classification for identifying and parsing files
• Versioning of table metadata as schemas evolve and other metadata are updated
Populate using Hive DDL, bulk import, or automatically through crawlers.
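The catalog is also reachable from the AWS SDK; a small boto3 sketch with hypothetical database and table names:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Fetch a table definition and print its columns.
table = glue.get_table(DatabaseName="default", Name="web_logs")["Table"]
for col in table["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])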
44. Crawlers: automatic population of the Data Catalog
Automatic schema inference:
• Built-in classifiers detect file type and extract schema: record structure and data types
• Add your own or share with others in the Glue community – it's all Grok and Python
Auto-detects Hive-style partitions, grouping similar files into one table.
Run crawlers on a schedule to discover new data and schema changes.
Serverless – only pay when crawls run.
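A hedged boto3 sketch of defining and starting such a crawler; the crawler name, IAM role, and S3 path are invented:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# The crawler infers schemas under the S3 path and writes table
# definitions (including Hive-style partitions) into the catalog.
glue.create_crawler(
    Name="raw-logs-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical
    DatabaseName="default",
    Targets={"S3Targets": [{"Path": "s3://my-raw-logs/web/"}]},
    Schedule="cron(0 3 * * ? *)",  # daily, to catch new data and partitions
)
glue.start_crawler(Name="raw-logs-crawler")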
45. Job authoring in Glue
Make ETL job authoring like code development, using your own tools
46. Automatic code generation
1. Pick sources and targets from the Data Catalog
2. Glue generates a transformation graph and Python code
3. Specify the trigger condition, e.g., every Friday at 3PM GMT
[Diagram: source table @ Amazon S3 → Relationalize transform → Filter transform → target tables @ Amazon Redshift]
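For a feel of what the generated code looks like, a hedged sketch in the style of a Glue PySpark script; the database, table, connection name, and staging paths are hypothetical, and a real generated script also wires in job arguments and bookmarks:

from awsglue.context import GlueContext
from awsglue.transforms import Filter, Relationalize
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Source: a catalog table backed by Amazon S3.
src = glueContext.create_dynamic_frame.from_catalog(
    database="default", table_name="web_logs")

# Relationalize flattens nested records into a collection of flat tables.
flat = Relationalize.apply(frame=src, staging_path="s3://my-tmp/stage/",
                           name="root").select("root")

# Filter rows before loading.
errors = Filter.apply(frame=flat, f=lambda r: r["status"] >= 500)

# Target: Amazon Redshift, via a catalog connection.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=errors, catalog_connection="redshift-conn",
    connection_options={"dbtable": "web_errors", "database": "analytics"},
    redshift_tmp_dir="s3://my-tmp/redshift/")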
47. Flexibility of Glue ETL scripts
• Human-readable code run on a scalable platform, PySpark
• Forgiving in the face of failures – handles bad data and crashes
• Flexible: handles complex semi-structured data, and adapts to source schema changes
48. Git integration
Glue integrates job authoring and execution with your preferred Git services (e.g., AWS CodeCommit).
Push job code to your Git repository; Glue automatically pulls the latest on job invocation.
Customize ETL jobs in your favorite IDE – no need to learn new tools, no need to start from scratch.
50. Job composition and triggers
Compose jobs globally with event-based dependencies
• Easy to reuse and leverage work across organization boundaries
Multiple triggering mechanisms (see the sketch below)
• Schedule-based: e.g., time of day
• Event-based: e.g., data availability, job completion
• External sources: e.g., AWS Lambda
[Diagram: ad-click logs feed a data-based trigger (>10 MB new data) for Marketing's "ad-spend by customer segment" job; weekly sales data feeds a schedule-based trigger for Sales' "revenue by customer segment" job; both feed, via data-based triggers, Central's "ROI by customer segment" job]
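A hedged boto3 sketch of the first two trigger styles; job and trigger names are invented:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Schedule-based: run the sales job every Friday at 15:00 GMT.
glue.create_trigger(
    Name="weekly-sales",
    Type="SCHEDULED",
    Schedule="cron(0 15 ? * FRI *)",
    Actions=[{"JobName": "sales-by-segment"}],
)

# Event-based: run the downstream job once the upstream job succeeds.
glue.create_trigger(
    Name="after-sales",
    Type="CONDITIONAL",
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "JobName": "sales-by-segment",
        "State": "SUCCEEDED",
    }]},
    Actions=[{"JobName": "roi-by-segment"}],
)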
51. Dynamic orchestration
Example: a dynamic number of jobs based on application type and number of message types
[Diagram: click logs from Application #1 (3 message types), Application #2 (5 message types), and Application #3 (4 message types) are split by message type, then one "summarize message type" job runs per type]
• Add jobs dynamically as the graph unfolds – makes data-dependent orchestration possible (see the sketch below)
• Glue provides fault-tolerant orchestration – retries on job failure
• Monitoring and metrics – job run history and event tracking for debugging
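One way to realize this kind of fan-out from outside Glue is to create and start one job per discovered message type via the SDK; this is an illustrative sketch rather than Glue's built-in mechanism, and the names, role, and script locations are hypothetical:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

message_types = ["click", "impression", "conversion"]  # discovered at run time

for mtype in message_types:
    glue.create_job(
        Name=f"summarize-{mtype}",
        Role="arn:aws:iam::123456789012:role/GlueJobRole",  # hypothetical
        Command={
            "Name": "glueetl",
            "ScriptLocation": f"s3://my-etl-scripts/summarize_{mtype}.py",
        },
        DefaultArguments={"--message_type": mtype},
    )
    glue.start_job_run(JobName=f"summarize-{mtype}")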
52. Serverless job execution
• Warm pools: pre-configured fleets of instances to reduce job startup time
• Auto-configure VPC and role-based access
• Automatically scale resources to meet SLA and cost objectives
• You pay only for the resources you consume, while you consume them. There is no need to provision, configure, or manage servers
[Diagram: a warm pool of instances serving jobs in customer VPCs]
53. Sign up for the Glue preview
So that's the basics of what we are doing. You can sign up for a preview at aws.amazon.com/glue. We should start adding people soon.