Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Full Stack Analytics on AWS
Ian Robinson
Specialist Solutions Architect, Data and Analytics, EMEA
November 17th, 2017
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Forces and Trends Prompting the Move to Cloud
Cost Optimization
Licenses
Hardware
Data center and operations
Dark Data
Prematurely discarding data
Agility
Experimentation (data & tools)
Democratised Access to Data
Time-to-first-results
Terminate failed experiments early
From BI to Data Science
In-house data science
From back office to product

Storage is the Gravity for Cloud Applications
Store all your data, for ever, at every stage of its lifecycle
Apply it using the right tool for the job

Object Storage is Foundational
• Standard: active data
• Standard - Infrequent Access: infrequently accessed data
• Amazon Glacier: archive data
• Events and Lifecycle Management: Create, Delete

S3 as the Data Lake Fabric
• Unlimited number of objects and volume
• 99.99% availability
• 99.999999999% durability
• Versioning
• Tiered storage via lifecycle policies
• SSL, client/server-side encryption at rest
• Low cost (just over $2700/month for 100TB)
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• Decouples storage and compute
• Run transient compute clusters (with Amazon EC2 Spot Instances)
• Multiple, heterogeneous clusters can use same data
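
The "tiered storage via lifecycle policies" bullet can be illustrated with a lifecycle rule that moves objects from Standard to Standard - Infrequent Access and then to Amazon Glacier. This is a minimal sketch, assuming boto3 is available for the final call; the prefix, the bucket name, and the 30/90 day thresholds are hypothetical and do not come from the talk.

```python
import json


def build_lifecycle_configuration(prefix, ia_after_days=30, glacier_after_days=90):
    """Build a lifecycle configuration that tiers objects under `prefix`.

    Objects move to Standard - Infrequent Access first, then to Glacier.
    The day thresholds are illustrative assumptions.
    """
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "Rules": [
            {
                "ID": "tier-" + prefix.strip("/").replace("/", "-"),
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_lifecycle(bucket, configuration):
    """Apply the configuration to a bucket (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above runs without boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=configuration
    )


if __name__ == "__main__":
    config = build_lifecycle_configuration("raw/events/")
    print(json.dumps(config, indent=2))
    # apply_lifecycle("example-data-lake-bucket", config)  # hypothetical bucket
```

Keeping the rule builder separate from the API call makes the policy easy to review and version alongside other infrastructure definitions.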

Automated Data Ingestion
Database Migration Service

Stream Events to S3 Using Kinesis Firehose
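
As a rough illustration of streaming events to S3 through Kinesis Firehose, the sketch below serialises events as newline-delimited JSON and sends them in batches. It assumes boto3 for the actual call; the delivery stream name is hypothetical, and the batch size reflects the 500-record limit of `PutRecordBatch`.

```python
import json

MAX_RECORDS_PER_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_firehose_records(events):
    """Serialise each event as one newline-terminated JSON record."""
    return [
        {"Data": (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")}
        for event in events
    ]


def chunk(records, size=MAX_RECORDS_PER_BATCH):
    """Yield successive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def send(stream_name, events):
    """Send events to a delivery stream; returns the number of failed records.

    Needs boto3 and AWS credentials. A production sender would retry the
    failed records reported in the response; this sketch only counts them.
    """
    import boto3  # imported lazily so the helpers above run without boto3

    firehose = boto3.client("firehose")
    failed = 0
    for batch in chunk(to_firehose_records(events)):
        response = firehose.put_record_batch(
            DeliveryStreamName=stream_name, Records=batch
        )
        failed += response["FailedPutCount"]
    return failed


if __name__ == "__main__":
    sample = [{"id": i, "type": "click"} for i in range(3)]
    for record in to_firehose_records(sample):
        print(record["Data"])
    # send("example-delivery-stream", sample)  # hypothetical stream name
```

The trailing newline on each record keeps the objects that Firehose writes to S3 readable line by line by tools such as Athena.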

Write Database Changes to S3 with DMS
• Full Load: <schema_name>/<table_name>/LOAD001.csv, <schema_name>/<table_name>/LOAD002.csv
• Change Data Capture: <schema_name>/<table_name>/<time-stamp>.csv
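
A downstream job often needs to tell full-load files from change-data-capture files. The sketch below classifies an S3 key using only the naming pattern shown on the slide; the example schema, table, and time-stamp values are made up, and the exact time-stamp format that DMS writes is not assumed.

```python
import re
from typing import NamedTuple, Optional


class DmsObject(NamedTuple):
    schema: str
    table: str
    kind: str       # "full_load" or "cdc"
    file_name: str


# Full-load files are named LOAD followed by a sequence number, e.g. LOAD001.csv.
_FULL_LOAD = re.compile(r"^LOAD[0-9A-Fa-f]+\.csv$")


def classify_dms_key(key: str) -> Optional[DmsObject]:
    """Classify a key of the form <schema_name>/<table_name>/<file>.csv.

    Any other .csv file name in that position is treated as a CDC file,
    since the slide only says those are named by time-stamp.
    """
    parts = key.split("/")
    if len(parts) < 3 or not parts[-1].endswith(".csv"):
        return None
    schema, table, file_name = parts[-3], parts[-2], parts[-1]
    kind = "full_load" if _FULL_LOAD.match(file_name) else "cdc"
    return DmsObject(schema, table, kind, file_name)


if __name__ == "__main__":
    for key in ("sales/orders/LOAD001.csv", "sales/orders/20171117-101500.csv"):
        print(classify_dms_key(key))
```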

Scalable (secure, versioned, durable) storage +
Immutable data at every stage of its lifecycle +
Versioned schema and metadata
=
Data discovery, lineage
Storage + Catalog

AWS Glue
• Data Catalog: discover and store metadata
• Job Authoring: auto-generated ETL code
• Job Execution: serverless scheduling and execution

Glue Data Catalog
Hive metastore-compatible, highly available metadata repository:
• Classification for identifying and parsing files
• Versioning of table metadata as schemas evolve
• Table definitions – usable by Redshift, Athena, Glue, EMR
Populate using Hive DDL, bulk import, or automatically through crawlers.

Crawlers: Automatic Schema Inference
• Enumerate S3 objects (file 1, file 2, …, file N)
• Identify file type and parse files
• Custom classifiers: app log parser, metrics parser, …
• System classifiers: JSON parser, CSV parser, Apache log parser, …
• Semi-structured per-file schema (nested int, char, bool, array, and struct fields)
• Semi-structured unified schema
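
The step from per-file schemas to a unified schema can be sketched in a few lines. This is a toy illustration of the idea, not the algorithm Glue crawlers actually use: it infers a nested schema from parsed JSON values and merges schemas field by field. The type names and the fallback to "string" on conflicting types are assumptions made for the example.

```python
def infer_schema(value):
    """Describe the type of a parsed JSON value as a nested structure."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        element = None  # None means the element type is still unknown
        for item in value:
            element = merge(element, infer_schema(item))
        return {"array": element}
    if isinstance(value, dict):
        return {"struct": {k: infer_schema(v) for k, v in value.items()}}
    return "null"


def merge(a, b):
    """Merge two schemas into one that can describe both."""
    if a is None or a == "null":
        return b
    if b is None or b == "null":
        return a
    if a == b:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        if "struct" in a and "struct" in b:
            fields = dict(a["struct"])
            for name, schema in b["struct"].items():
                fields[name] = merge(fields.get(name), schema)
            return {"struct": fields}
        if "array" in a and "array" in b:
            return {"array": merge(a["array"], b["array"])}
    if isinstance(a, str) and isinstance(b, str) and {a, b} == {"int", "double"}:
        return "double"  # widen int to double
    return "string"  # assumed fallback for irreconcilable types


def unify_schemas(records):
    """Infer one schema per record and fold them into a unified schema."""
    unified = None
    for record in records:
        unified = merge(unified, infer_schema(record))
    return unified


if __name__ == "__main__":
    file_1 = {"id": 1, "tags": ["a"], "user": {"name": "x"}}
    file_2 = {"id": 2, "active": True, "user": {"name": "y", "age": 30}}
    print(unify_schemas([file_1, file_2]))
```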

Indexing and Searching Using Metadata
• Amazon S3: ObjectCreated and ObjectDeleted events
• AWS Lambda: PutItem into the Metadata Index (Amazon DynamoDB)
• Update Stream: AWS Lambda extracts search fields and updates the index
• Search Index (Amazon Elasticsearch)
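
The first hop of this pipeline, from S3 event to metadata index, might look like the Lambda handler sketched below. The table name and its key layout (bucket plus key) are hypothetical. Note that S3 notifications report deletions with event names beginning `ObjectRemoved`, and that object keys arrive URL-encoded.

```python
from urllib.parse import unquote_plus

TABLE_NAME = "example-metadata-index"  # hypothetical DynamoDB table


def operations_from_s3_event(event):
    """Translate S3 notification records into ("put" | "delete", item) pairs."""
    operations = []
    for record in event.get("Records", []):
        event_name = record["eventName"]  # e.g. "ObjectCreated:Put"
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # decode URL-encoding
        if event_name.startswith("ObjectCreated"):
            operations.append(("put", {
                "bucket": bucket,
                "key": key,
                "size": record["s3"]["object"].get("size", 0),
                "event_time": record["eventTime"],
            }))
        elif event_name.startswith("ObjectRemoved"):
            operations.append(("delete", {"bucket": bucket, "key": key}))
    return operations


def handler(event, context):
    """Lambda entry point: mirror object creations and deletions in DynamoDB."""
    import boto3  # available in the Lambda runtime; imported lazily for local use

    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    for operation, item in operations_from_s3_event(event):
        if operation == "put":
            table.put_item(Item=item)
        else:
            table.delete_item(Key={"bucket": item["bucket"], "key": item["key"]})


if __name__ == "__main__":
    sample_event = {"Records": [{
        "eventName": "ObjectCreated:Put",
        "eventTime": "2017-11-17T10:15:00.000Z",
        "s3": {"bucket": {"name": "example-bucket"},
               "object": {"key": "logs/2017/my+file.csv", "size": 1024}},
    }]}
    print(operations_from_s3_event(sample_event))
```

A second function subscribed to the table's update stream could then extract search fields and update the search index, as the slide outlines.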

Data Access & Authorisation
Give your users easy and secure access
Storage & Catalog
Secure, cost-effective storage in Amazon S3.
Robust metadata in AWS Catalog
Protect and Secure
Use entitlements to ensure data is secure and users’ identities are verified

Identity and Access Management
• Manage users, groups, and roles
• Identity federation with Open ID
• Temporary credentials with Amazon Security Token Service (Amazon STS)
• Stored policy templates
• Powerful policy language
• Amazon S3 bucket policies
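
To give a flavour of the policy language, the sketch below builds an IAM policy document that grants read-only access to a single prefix of a data lake bucket. The bucket and prefix names are hypothetical; the actions and the `s3:prefix` condition key are standard S3 policy elements.

```python
import json


def read_only_prefix_policy(bucket, prefix):
    """Return a policy document allowing list and read access under `prefix`."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                # Listing is only allowed for keys under the prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


if __name__ == "__main__":
    policy = read_only_prefix_policy("example-data-lake-bucket", "curated/sales/")
    print(json.dumps(policy, indent=2))
```

A document like this can be attached to a user, group, or role, so that analysts see only the part of the lake they are entitled to.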

Security at the Data Level
• IAM: Service API Access
• Amazon S3, Amazon ElastiCache, Amazon DynamoDB, Amazon EMR, Amazon Kinesis, Amazon Athena

Storage Level Support for Access Logging and Audit
• Amazon S3: Access Logging
• AWS CloudTrail: API Logging
• Amazon Athena: Access Log Analytics
• IAM
• Amazon EMR
• Third Party Ecosystem Security Tools
http://amzn.to/2tSimHj
http://amzn.to/2si6RqS

Encryption Options
AWS Server-Side encryption
• AWS managed key infrastructure
AWS Key Management Service
• Automated key rotation & auditing
• Integration with other AWS services
AWS CloudHSM
• Dedicated Tenancy SafeNet Luna SA HSM Device
• Common Criteria EAL4+, NIST FIPS 140-2
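
The first two options can be selected per object when writing to S3. The sketch below builds the arguments for a `put_object` call that requests either S3-managed server-side encryption or encryption under an AWS KMS key; the bucket, object key, and KMS key alias are hypothetical.

```python
def encrypted_put_kwargs(bucket, key, body, kms_key_id=None):
    """Build put_object arguments that request server-side encryption.

    Without a KMS key the object is encrypted with S3-managed keys (AES256);
    with one, it is encrypted under that AWS KMS key.
    """
    kwargs = {"Bucket": bucket, "Key": key, "Body": body}
    if kms_key_id is None:
        kwargs["ServerSideEncryption"] = "AES256"
    else:
        kwargs["ServerSideEncryption"] = "aws:kms"
        kwargs["SSEKMSKeyId"] = kms_key_id
    return kwargs


def upload(bucket, key, body, kms_key_id=None):
    """Upload an object with encryption at rest (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder above runs without boto3

    return boto3.client("s3").put_object(
        **encrypted_put_kwargs(bucket, key, body, kms_key_id)
    )


if __name__ == "__main__":
    print(encrypted_put_kwargs("example-bucket", "raw/a.csv", b"id,name\n"))
    print(encrypted_put_kwargs("example-bucket", "raw/a.csv", b"id,name\n",
                               kms_key_id="alias/example-data-lake"))
```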

Serverless Processing and Analytics

Managed ETL with AWS Glue
• Python code generated by AWS Glue
• Connect a notebook or IDE to AWS Glue
• Existing code brought into AWS Glue

Job Execution with AWS Glue
• Schedule-based
• Event-based
• On demand
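
The three execution modes map roughly onto Glue triggers and direct job runs. The sketch below builds `create_trigger` requests for a schedule-based trigger and for a trigger that fires when another job succeeds (one reading of "event-based"), and starts a job on demand. Job names, trigger names, and the cron expression are hypothetical.

```python
def scheduled_trigger(name, job_name, cron_expression):
    """Request for a trigger that starts `job_name` on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron_expression})",
        "Actions": [{"JobName": job_name}],
    }


def conditional_trigger(name, job_name, upstream_job_name):
    """Request for a trigger that starts `job_name` after another job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job_name,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job_name}],
    }


def create_trigger(request):
    """Create the trigger in AWS Glue (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builders above run without boto3

    boto3.client("glue").create_trigger(**request)


def run_on_demand(job_name):
    """Start a job run immediately and return its run id."""
    import boto3

    return boto3.client("glue").start_job_run(JobName=job_name)["JobRunId"]


if __name__ == "__main__":
    print(scheduled_trigger("nightly-load", "example-etl-job", "0 2 * * ? *"))
    print(conditional_trigger("after-load", "example-report-job", "example-etl-job"))
```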

Amazon Kinesis Analytics
• Interact with streaming data in real time using SQL
• Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms

Amazon Kinesis Analytics – Simple SQL Interface
-- Count each author's matching tweets over the preceding minute
SELECT STREAM author,
  count(author) OVER ONE_MINUTE
FROM Tweets
WHERE text LIKE '%#BigDataSpain%'
WINDOW ONE_MINUTE AS
  (PARTITION BY author
   RANGE INTERVAL '1' MINUTE PRECEDING);
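
For readers less familiar with windowed streaming SQL, the query's behaviour can be approximated in plain Python: for each matching tweet, emit the author together with the number of that author's matching tweets in the preceding minute. This is a sketch of the semantics only. It assumes tweets arrive in time order as `(timestamp_seconds, author, text)` tuples, and it treats the window boundary as inclusive; the sample data is invented.

```python
from collections import defaultdict, deque


def windowed_author_counts(tweets, hashtag="#BigDataSpain", window_seconds=60):
    """Yield (author, count) for each tweet that contains `hashtag`.

    `count` is the number of matching tweets by the same author whose
    timestamp lies within `window_seconds` before the current tweet,
    including the current tweet itself.
    """
    recent = defaultdict(deque)  # author -> timestamps still inside the window
    for timestamp, author, text in tweets:
        if hashtag not in text:  # WHERE text LIKE '%#BigDataSpain%'
            continue
        times = recent[author]   # PARTITION BY author
        times.append(timestamp)
        while times[0] < timestamp - window_seconds:  # expire old rows
            times.popleft()
        yield author, len(times)


if __name__ == "__main__":
    sample = [
        (0, "ana", "Arrived at #BigDataSpain"),
        (10, "bob", "#BigDataSpain keynote"),
        (30, "ana", "Great talk #BigDataSpain"),
        (45, "ana", "Coffee break"),
        (61, "ana", "#BigDataSpain again"),
    ]
    for author, count in windowed_author_counts(sample):
        print(author, count)
```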

Amazon Athena – Analyze Data in S3
• Interactive queries
• ANSI SQL
• No infrastructure or administration
• Zero spin up time
• Query data in its raw format
• Avro, Text, CSV, JSON, weblogs, AWS service logs
• Convert to an optimized form like ORC or Parquet for the best performance and lowest cost
• No loading of data, no ETL required
• Stream data directly from Amazon S3, take advantage of Amazon S3 durability and availability
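
Besides the console, Athena can be queried programmatically. The sketch below submits a query, polls until it finishes, and returns the result rows. It assumes boto3 unless a client object is passed in; the database name, table name, and results location are hypothetical.

```python
import time


def run_athena_query(sql, database, output_location, poll_seconds=1.0, client=None):
    """Run a query in Athena and return the rows of the first result page.

    `client` defaults to a boto3 Athena client; passing a stand-in object
    with the same three methods makes the function usable in tests.
    """
    if client is None:
        import boto3  # needs AWS credentials

        client = boto3.client("athena")

    query_id = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a terminal state.
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")
    return client.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]


if __name__ == "__main__":
    # Hypothetical names; uncomment with real resources and credentials.
    # rows = run_athena_query(
    #     "SELECT author, count(*) FROM tweets GROUP BY author",
    #     database="example_db",
    #     output_location="s3://example-bucket/athena-results/",
    # )
    pass
```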

• Simple query editor with syntax highlighting and autocomplete
• Data Catalog
• Query History, Saved Queries, and Catalog Management

Using Amazon Athena with Amazon QuickSight
QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-premises sources including Amazon Athena
• Amazon RDS
• Amazon S3
• Amazon Redshift
• Amazon Athena

Add Machine Learning Capabilities
Amazon Machine Learning Service
• Batch and online predictions
• Train using data in S3, RDS and Redshift
Amazon EMR
• Comprehensive machine learning libraries (e.g. Spark MLlib, Anaconda)
• Provision analytics clusters in minutes, autoscale with data volume or query demand

Amazon AI Services
Amazon Polly – Lifelike Text-to-Speech
47 voices, 24 languages
Low-latency, real time
Amazon Rekognition – Image Analysis
Object and scene detection
Facial analysis
Amazon Lex – Conversational Engine
Speech and text recognition
Enterprise connectors

Facial Analysis with Rekognition
• Demographic Data
• Facial Landmarks
• Sentiment Expressed
• Image Quality (Brightness: 25.84, Sharpness: 160)
• General Attributes
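
The attributes on this slide correspond to fields in the response of Rekognition's `DetectFaces` operation. The sketch below condenses one face record into a short summary; it assumes boto3 for the actual call, and the bucket and image names are hypothetical. In the sample record only the brightness and sharpness values come from the slide; the remaining values are invented for illustration.

```python
def summarise_face(face):
    """Condense one FaceDetails entry into a small dictionary."""
    emotions = face.get("Emotions", [])
    top_emotion = max(emotions, key=lambda e: e["Confidence"], default=None)
    age = face.get("AgeRange")
    quality = face.get("Quality", {})
    return {
        "age_range": (age["Low"], age["High"]) if age else None,  # demographic data
        "gender": face.get("Gender", {}).get("Value"),
        "top_emotion": top_emotion["Type"] if top_emotion else None,  # sentiment
        "brightness": quality.get("Brightness"),                      # image quality
        "sharpness": quality.get("Sharpness"),
        "landmark_count": len(face.get("Landmarks", [])),             # facial landmarks
    }


def detect_faces(bucket, name):
    """Analyse an image stored in S3 (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so summarise_face runs without boto3

    response = boto3.client("rekognition").detect_faces(
        Image={"S3Object": {"Bucket": bucket, "Name": name}},
        Attributes=["ALL"],  # request all attributes, not just the default subset
    )
    return [summarise_face(face) for face in response["FaceDetails"]]


if __name__ == "__main__":
    sample_face = {
        "AgeRange": {"Low": 26, "High": 43},           # invented
        "Gender": {"Value": "Female", "Confidence": 99.0},  # invented
        "Emotions": [
            {"Type": "HAPPY", "Confidence": 92.1},     # invented
            {"Type": "CALM", "Confidence": 5.0},       # invented
        ],
        "Quality": {"Brightness": 25.84, "Sharpness": 160},  # values from the slide
        "Landmarks": [{"Type": "eyeLeft", "X": 0.3, "Y": 0.4}],  # invented
    }
    print(summarise_face(sample_face))
    # detect_faces("example-bucket", "photos/speaker.jpg")  # hypothetical names
```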

AWS Deep Learning AMI
• Up to ~40k CUDA cores
• Pre-configured CUDA drivers
• Jupyter notebook with Python2, Python3, Anaconda
• CloudFormation Template
• AWS Marketplace – one-click deploy

Data Ingestion
Get your data into S3 quickly and securely
Storage & Catalog
Secure, cost-effective storage in Amazon S3.
Robust metadata in AWS Catalog
Processing & Analytics
Use of predictive and prescriptive analytics to gain better understanding
Data Access & Authorisation
Give your users easy and secure access
Protect and Secure
Use entitlements to ensure data is secure and users’ identities are verified
Services: Kinesis Firehose, Athena Query Service, Glue, Machine Learning (predictive analytics), Amazon AI

Thank You
Full Stack Analytics on AWS
