As a result of the advantages shown on the previous slides, more people run their data lakes and analytics on AWS than anywhere else. This includes customers such as FINRA, Netflix, Nasdaq, Amazon.com, Atlassian, Sysco, Airbnb, iRobot, CrowdStrike, Viber, 21st Century Fox, Vanguard, Takeda, Movable Inc, Expedia.com, Zillow, Yelp, Amgen, JustGiving, NTT Docomo, to name a few. In later slides, we will share details on a number of these customers and how they use specific services to achieve their goals.
To enable this for customers, AWS provides a broader and deeper portfolio of database and analytics services than any other cloud provider, many of them named on the previous slide alongside the function they perform.
AWS offers at least:
10 data movement services
13 analytics services
18 machine learning and AI services
17 security and governance services
Maybe more since this slide was created!
<timing 2 minutes>
Backup notes
The way to deal with massive amounts of data is to use S3 to store it. S3 can store exabytes of data without breaking a sweat, so you can keep all your relational and non-relational data and never have to throw anything away because your database doesn't scale, as was the case on premises. We call this your Data Lake: it has all your data.

To get data into the Data Lake, we have data movement services and devices: Snowball devices bring in data from on-premises systems, Database Migration Service migrates databases into S3, and Kinesis handles IoT and real-time ingestion.

And we have a set of purpose-built tools that work directly on data in S3. Redshift is our exabyte-scale data warehouse that can query data directly in S3. QuickSight is our BI tool that can query data directly in S3. Athena is our ad hoc query service for data in S3, which you can use in place of Redshift when you need to run ad hoc queries against data sitting in S3. EMR supports Spark and Hadoop to make sense of data in S3 directly.
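To make the Athena piece concrete, here is a minimal sketch of submitting an ad hoc query against data already in S3. The table name (`weblogs`), database (`analytics`), and result bucket are placeholder assumptions, not names from the talk; the example assumes boto3 is installed and AWS credentials are configured.

```python
# Sketch: ad hoc Athena query over data sitting in S3 (placeholder names).

def build_top_pages_query(table: str, limit: int = 10) -> str:
    """Compose a simple aggregation query in Athena's SQL dialect."""
    return (
        f"SELECT page, COUNT(*) AS hits "
        f"FROM {table} "
        f"GROUP BY page "
        f"ORDER BY hits DESC "
        f"LIMIT {limit}"
    )

def run_query(sql: str, database: str, output_bucket: str) -> str:
    """Submit the query to Athena; returns the query execution id."""
    import boto3  # imported here so the pure helper above needs no AWS SDK
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": f"s3://{output_bucket}/results/"},
    )
    return resp["QueryExecutionId"]

if __name__ == "__main__":
    sql = build_top_pages_query("weblogs")
    print(sql)
    # run_query(sql, database="analytics", output_bucket="my-data-lake")
```

Because Athena reads S3 directly, there is no cluster to provision: you pay per query for the data scanned.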
And these tools are priced so inexpensively that you can make sense of all your data without deleting any of it to save money. No need to do an ROI analysis on data. Redshift costs $1,000/TB per year instead of the $10K to $50K of on-premises Teradata and Oracle systems. Athena can query data for half a cent per GB, and S3 can store a GB of data for a whole month for 2.3 cents. QuickSight can query data for 30 cents per 30-minute session, whereas legacy BI tools can cost far more per user per month. You can give all your users access to all your data without breaking the bank and truly be a data-driven organization.
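A quick back-of-the-envelope check of those numbers, using the rates quoted in this talk (verify against the current AWS pricing pages before reuse); the 10 TB / 2 TB workload sizes are illustrative assumptions:

```python
# Rates as quoted in the talk, converted to $ per GB.
S3_STANDARD_PER_GB_MONTH = 0.023   # 2.3 cents per GB-month
ATHENA_PER_GB_SCANNED = 0.005      # half a cent per GB scanned ($5/TB)

def monthly_lake_cost(stored_gb: float, scanned_gb: float) -> float:
    """Storage plus ad hoc query cost for one month, in dollars."""
    return (stored_gb * S3_STANDARD_PER_GB_MONTH
            + scanned_gb * ATHENA_PER_GB_SCANNED)

# Illustrative month: 10 TB stored, 2 TB scanned by ad hoc queries.
cost = monthly_lake_cost(10_000, 2_000)
print(f"${cost:,.2f}/month")  # 230 storage + 10 queries = $240.00
```

At these rates, storing and querying a 10 TB lake costs a few hundred dollars a month, which is the point of the "no ROI analysis on data" line above.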
AWS continues to innovate and find more efficient ways for you to analyze data. We have an emerging serverless analytics stack: you can put all these systems together with zero infrastructure to manage. This lets you pay per use, with close to zero cost when systems are idle. They scale automatically and are highly available and fault tolerant by default.
IoT data is a great example where an “always on” system provides continuous sensor data, but the analytics is on-demand and you pay for those services only when you use them.
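A sketch of the ingestion side of that pattern: sensors push readings into a Kinesis stream continuously, while the analytics downstream runs on demand. The stream name and record fields are placeholder assumptions; the example assumes boto3 and configured credentials.

```python
# Sketch: continuous sensor ingestion into Kinesis (placeholder names).
import json
from datetime import datetime, timezone

def make_sensor_record(sensor_id: str, temperature_c: float) -> str:
    """Serialize one sensor reading as a JSON document."""
    return json.dumps({
        "sensor_id": sensor_id,
        "temperature_c": temperature_c,
        "ts": datetime.now(timezone.utc).isoformat(),
    })

def send_reading(stream_name: str, sensor_id: str, temperature_c: float) -> None:
    """Put one reading on the stream, keyed by sensor for ordering."""
    import boto3  # imported here so the serializer stays SDK-free
    kinesis = boto3.client("kinesis")
    kinesis.put_record(
        StreamName=stream_name,
        Data=make_sensor_record(sensor_id, temperature_c).encode(),
        PartitionKey=sensor_id,
    )

if __name__ == "__main__":
    print(make_sensor_record("sensor-1", 21.5))
    # send_reading("iot-telemetry", "sensor-1", 21.5)
```

The "always on" part is just this producer loop; the query services that read the landed data bill only when you invoke them.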
We also deliver our AI/ML services this way. While we offer GPUs and frameworks for experts, we also make higher-level services for image and video analysis, transcription, translation, etc., available via APIs with nothing for you to manage. Simply put, we do a bunch of work behind the scenes so you don't have to.
As cloud service usage increases, cloud spending increases.
We have to find where we can improve and optimize.
Provide visibility to prove the insight and lead to action.
Consumption model: turn resources off when work stops -> pay only for what you use.
Measure: business output
Stop?
Analyze:
Managed services: lower operational cost