Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018

© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Adir Sharabi
Solutions Architect, Amazon Web Services
Yair Weinberger
CTO and Co-Founder, Alooma
Big Data on AWS: To infinity and beyond!

Big Data Track Agenda
# | Time | Title | Speaker
1 | 13:15 – 14:00 | Big Data on AWS: To infinity and beyond! | Adir Sharabi – Solutions Architect
2 | 14:10 – 14:55 | Amazon Kinesis – Building Serverless real-time solution | Roy Ben Alta – Business Development Manager
3 | 15:05 – 15:50 | Data preparation and transformation: Spin your straw into gold | Daniel Haviv – Specialist Solutions Architect, Analytics
4 | 16:00 – 16:45 | Success has many query engines | Eden Perry – Partner Solutions Architect
5 | 16:50 – 17:30 | Connecting the dots: How Amazon Neptune and Graph Databases can transform your business | Andi Gutmans - GM, Neptune & Elasticsearch; Brad Bebee - Principal Prod Mgmt - Tech, AWS Neptune

Your Data Sources
Multiple sources and formats… and growing every day
Data types: Documents and files, Streams, Records
Sources shown: Amazon RDS, Amazon DynamoDB, Amazon Redshift, AWS IoT, on-premises databases, ERP, web, spreadsheets, infrastructure logs, clickstream data, mobile app data, social media data, device data, sensor data

Data Challenges
• Data visibility
• Multiple consumers and requirements: analysts, applications, data scientists, business users
• Multiple access mechanisms: API access, BI tools, notebooks
[Timeline axis: 1990, 2000, 2010, 2020]

Traditionally, Analytics Used to Look Like This
OLTP ERP CRM LOB
Data Warehouse
Business Intelligence
Relational data
Schema defined prior to data load
TBs-PBs Scale
Operational reporting and ad hoc
Large initial capex + $10K–$50K/TB/Year

Data Lakes Extend the Traditional Approach
Relational and non-relational data
Schema defined during analysis
Scale storage and compute independently
Diverse analytical engines to gain insights
Designed for low-cost storage and analytics
[Diagram labels: OLTP, ERP, CRM, and LOB sources; Data Warehouse; Business Intelligence; Data Lake; Devices, Web, Sensors, and Social sources; Catalog; Machine Learning; DW Queries; Big data processing; Interactive; Real-time]

Amazon S3 as Data Lakes Storage Layer
Many ways to bring all kinds of data
Unmatched durability and availability at EB scale
Best security, compliance, and audit capabilities
Integration with Big Data Tools
Run any analytics on the same data without movement
Cost effective - Store data at $0.023 / GB / Month
[Diagram labels: Snowball, Snowmobile, Kinesis Data Firehose, Kinesis Data Streams, Kinesis Video Streams, Amazon S3, Redshift, EMR, Athena, Kinesis, Elasticsearch Service, AI Services]
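To put the quoted storage price in yearly per-terabyte terms, here is a small arithmetic sketch. It covers storage only and assumes a flat price with 1 TB = 1024 GB; request, transfer, and compute charges are left out, so the result is not a like-for-like comparison with the full data warehouse cost quoted on the earlier slide.

```python
# Storage-only cost arithmetic using the price quoted on the slide.
S3_PRICE_PER_GB_MONTH = 0.023  # from the slide: $0.023 / GB / month
GB_PER_TB = 1024               # assumption: binary terabyte
MONTHS_PER_YEAR = 12


def s3_storage_cost_per_tb_year(price_per_gb_month: float = S3_PRICE_PER_GB_MONTH) -> float:
    """Yearly cost of keeping 1 TB stored at a flat per-GB monthly price."""
    return price_per_gb_month * GB_PER_TB * MONTHS_PER_YEAR


if __name__ == "__main__":
    cost = s3_storage_cost_per_tb_year()
    print(f"S3 storage: about ${cost:,.2f} per TB per year")
```

At the quoted price this comes to roughly $283 per TB per year for storage alone.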

Simplified Big Data Pipeline
Data sources: Transactions, Web logs / cookies, ERP, Connected devices
Ingest → Store (Amazon S3) → Process & Analyze → Consume

Lots of ingestion tools
Data sources: Transactions, Web logs / cookies, ERP, Connected devices
Ingest: Amazon S3 API, Amazon Kinesis Firehose, Direct Connect, Snowball, Database Migration Service
Store: Amazon S3
Process & Analyze
Consume

Variety of data processing tools
Data sources: Transactions, Web logs / cookies, ERP, Connected devices
Ingest: Amazon S3 API, Amazon Kinesis Firehose, Direct Connect, Snowball, Database Migration Service
Store: Amazon S3
Process & Analyze: Amazon Athena, Amazon EMR, Amazon Redshift & Spectrum, Amazon Elasticsearch, Amazon AI/ML/DL Services
Consume

Multiple ways to consume the data
Data sources: Transactions, Web logs / cookies, ERP, Connected devices
Ingest: Amazon S3 API, Amazon Kinesis Firehose, Direct Connect, Snowball, Database Migration Service
Store: Amazon S3
Process & Analyze: Amazon Athena, Amazon EMR, Amazon Redshift & Spectrum, Amazon Elasticsearch, Amazon AI/ML/DL Services
Consume: Amazon QuickSight; Jupyter, Zeppelin, HUE; Amazon API Gateway; BI Tools

Because data is NEVER perfect
• Clean
• Transform
• Concatenate
• Convert to better formats
• Schedule transformations
• Event-driven transformations
• Transformations expressed as code
Amazon EMR: Spark and Hive running on EMR
AWS Glue: Event-based serverless ETL engine
AWS Lambda: Trigger-based code execution
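As an illustration of an event-driven transformation expressed as code, the sketch below shows a Lambda-style handler that reads an S3 event notification and decides which new objects to convert. The event layout (Records, s3.bucket.name, s3.object.key) follows the S3 notification format; the ".json" rule and the returned "plan" are hypothetical stand-ins for a real transformation step.

```python
from typing import List, Tuple
from urllib.parse import unquote_plus


def objects_from_s3_event(event: dict) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs found in an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record.get("s3")
        if not s3:
            continue  # skip records that did not come from S3
        bucket = s3["bucket"]["name"]
        key = unquote_plus(s3["object"]["key"])  # object keys arrive URL-encoded
        pairs.append((bucket, key))
    return pairs


def lambda_handler(event, context):
    """Lambda entry point: plan a conversion for each newly written raw object."""
    planned = []
    for bucket, key in objects_from_s3_event(event):
        # Hypothetical rule for this sketch: only raw JSON files get converted.
        if key.endswith(".json"):
            planned.append({"source": f"s3://{bucket}/{key}", "action": "convert"})
    return planned
```

In a real pipeline the handler would start the actual work (for example a Glue job or an EMR step) instead of returning a plan.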

ETL when you need it
Data sources: Transactions, Web logs / cookies, ERP, Connected devices
Ingest: Amazon S3 API, Amazon Kinesis Firehose, Direct Connect, Snowball, Database Migration Service
Store: Amazon S3
Process & Analyze: Amazon Athena, Amazon EMR, Amazon Redshift & Spectrum, Amazon Elasticsearch, Amazon AI/ML/DL Services
Consume: Amazon QuickSight; Jupyter, Zeppelin, HUE; Amazon API Gateway; BI Tools

Real-time - in-stream processing
Data sources: Transactions, Web logs / cookies, ERP, Connected devices
Ingest: Amazon S3 API, Amazon Kinesis Firehose, Direct Connect, Snowball, Database Migration Service
In-stream processing: Spark Streaming & Flink, Amazon Kinesis Analytics
Store: Amazon S3
Process & Analyze: Amazon Athena, Amazon EMR, Amazon Redshift & Spectrum, Amazon Elasticsearch, Amazon AI/ML/DL Services
Consume: Amazon QuickSight; Jupyter, Zeppelin, HUE; Amazon API Gateway; BI Tools

AWS Glue Data Catalog
Central metadata catalog for the data lake
One per account
Allows you to share metadata between Amazon Athena, Amazon Redshift Spectrum, EMR & JDBC sources
Serverless
We added a few extensions:
• Search over metadata for data discovery
• Manage Connections – JDBC URLs, credentials
• Classification for identifying and parsing files
• Versioning of table metadata as schemas evolve and other metadata are updated

AWS Glue Data Catalog Crawlers
Crawlers catalog your data
Crawlers automatically build your Data Catalog and keep it in sync
Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive-style partitions on Amazon S3
Built-in classifiers for popular types; custom classifiers using Grok expressions
Run ad hoc or on a schedule; serverless – only pay when the crawler runs
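To make "Hive-style partitions" concrete: they are name=value path segments in the object key, such as year=2018/month=03. The sketch below pulls them out of a key. It only illustrates the naming convention and is not how Glue crawlers are implemented; the example keys are hypothetical.

```python
from typing import Dict


def hive_partitions(key: str) -> Dict[str, str]:
    """Extract name=value path segments from an S3 key, in path order."""
    partitions = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


if __name__ == "__main__":
    print(hive_partitions("tweets/year=2018/month=03/day=14/part-0000.parquet"))
```

A key without name=value segments, such as tweets/2018/03/file.json, yields no partitions under this convention.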

Write once, catalog once, read multiple, ETL Anywhere
Data sources: Transactions, Web logs / cookies, ERP, Connected devices
Ingest: Amazon S3 API, Amazon Kinesis Firehose, Direct Connect, Snowball, Database Migration Service
In-stream processing: Spark Streaming & Flink, Amazon Kinesis Analytics
Store: Amazon S3
Data Catalog
Process & Analyze: Amazon Athena, Amazon EMR, Amazon Redshift & Spectrum, Amazon Elasticsearch, Amazon AI/ML/DL Services
Consume: Amazon QuickSight; Jupyter, Zeppelin, HUE; Amazon API Gateway; BI Tools

Yair Weinberger
CTO and Co-Founder, Alooma

Data Pipeline as a Service
yair@alooma.com

Architecture
[Diagram: 100s of Data Sources → Alooma’s Data Pipeline → Data Destinations. Pipeline components shown: Incoming Queue, Mapper, Code Engine, Restream Queue]

Data Lake - The Goal
Emilykil (Own work) [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons

Data Lake - What Sometimes Happens
NatalieMaynor from Jackson, Mississippi, USA - Winter Ugliness, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=5503067

Data Lake vs. Data Mart
Emilykil (Own work) [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons; Abras2010 - Walmart, uploaded by SchuminWeb, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=14571617

Enjoy both worlds
[Diagram labels: EMR, Athena, Redshift Spectrum, Redshift, S3]

Alooma Usage of S3 as a Data Lake
● Separate the data of different tenants
○ IAM role-based access ensures data isolation
● Allow Alooma tenants to replay their data from any data source or time
● Staging area before loading into the data warehouse
● Storage for things that need infinite retention (e.g. audit logs)

Replay
Redshift
S3
File structure in S3:
s3://bucket/[shard]/input_name/date/random_
me
Sharding is a tradeoff between throughput and convenience
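The tradeoff can be sketched in code: a shard prefix spreads writes across key prefixes, but a replay of one input over a date range then has to list every shard. This is a minimal sketch, not Alooma's implementation. The shard count, the hash-based shard choice, and the object_name component are assumptions, since the last path component on the slide is cut off.

```python
import hashlib
from datetime import date, timedelta
from typing import Iterator

NUM_SHARDS = 16  # assumption: the slide does not give a shard count


def shard_for(name: str, num_shards: int = NUM_SHARDS) -> str:
    """Stable two-digit shard id derived from a hash of the object name."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return f"{int(digest, 16) % num_shards:02d}"


def object_key(input_name: str, day: date, object_name: str) -> str:
    """Build a key shaped like [shard]/input_name/date/<object name>."""
    return f"{shard_for(object_name)}/{input_name}/{day.isoformat()}/{object_name}"


def replay_prefixes(input_name: str, start: date, end: date,
                    num_shards: int = NUM_SHARDS) -> Iterator[str]:
    """Yield every prefix that must be listed to replay one input over a date range."""
    day = start
    while day <= end:
        for shard in range(num_shards):
            yield f"{shard:02d}/{input_name}/{day.isoformat()}/"
        day += timedelta(days=1)
```

With 16 shards, replaying two days of one input means listing 32 prefixes instead of 2: more write throughput, less convenience on the read side.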

S3 as Data Lake - Tips and Tricks
● Use server-side encryption to provide automatic encryption at rest, but note that it does impact performance
● Loading data in high volume
○ Keys in S3 are partitioned by prefix
○ Use randomly prefixed or at least sharded filenames
● Use Object Expiration to avoid storing unnecessary data
Important resource:
https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
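Object Expiration is configured as a bucket lifecycle rule. The sketch below builds such a rule as a plain dictionary in the shape of the S3 lifecycle configuration; the "staging/" prefix, the 30-day retention, and the rule ID are assumptions chosen for illustration.

```python
import json


def expiration_rule(prefix: str, days: int, rule_id: str = "expire-staging") -> dict:
    """Lifecycle configuration that expires objects under `prefix` after `days` days."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": rule_id,                    # hypothetical rule name
                "Filter": {"Prefix": prefix},     # only objects under this prefix
                "Status": "Enabled",
                "Expiration": {"Days": days},     # delete after this many days
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(expiration_rule("staging/", 30), indent=2))
```

With boto3, a dictionary like this can be passed as the LifecycleConfiguration argument of put_bucket_lifecycle_configuration. Data that needs infinite retention, such as audit logs, would simply sit outside the expiring prefix.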

Questions / Feedback
yair@alooma.com

Let’s take an example

AWS Tweets Example
Record-level data
Business Questions
• What’s the overall sentiment today?
• What’s the sentiment trend now?
• What’s the most popular language?
• What’s the effect of temperature on tweet sentiment?
Technical Requirements
• Scale
• Resilience
• Minimal operational overhead
• Agile
• Cost effective

[Architecture diagram. Stages: Ingest, Store, Process & Analyze, Consume. Layers: Speed Layer, Batch Layer.
Components shown: Kinesis Data Streams, Kinesis Firehose Delivery Streams, DynamoDB, AWS Lambda, Kinesis Analytics, Raw Bucket, Parquet Bucket, Athena, Redshift Spectrum, QuickSight, Glue Data Catalog, Spark/EMR, Glue ETL, Real-time Web UI]
Thank You!
