A talk given at a meetup of the AWS Taiwan User Group.
The registration page: https://bityl.co/7yRK
The promotion page: https://www.facebook.com/groups/awsugtw/permalink/4123481584394988/
7. ● an SAP tool for ETL
● SPSS for insight & stats modeling
Analysts
● license fee
● outdated front-end technology
Scientists
● got power from AWS computing
● explore potential ML applications
● unavailability of data
○ external
○ internal
● data difference
Data Engineers
● open-source tools for ETL
● productionize scientists’ inventions
● maintain & evolve existing services
● explore potential ML applications
Other BUs
● make wishes
● repetitive tasks on data expansion
● unavailability of data
10. Data Warehouses
[Diagram: External Data and Operational Data pass through ETL into a Data Warehouse, which serves BI and Reports]
- Built for BI and reporting
- No support for video, audio, text
- No support for data science, ML
- Limited support for streaming
- Closed & proprietary formats
11. Data Lakes
[Diagram: structured, semi-structured, and unstructured data land in a Data Lake; components around it are Data Prep and Validation, ETL, a Real-time Database, a Data Warehouse serving BI and Reports, Machine Learning, and Data Science]
- Poor BI support
- Complex to set up
- Poor performance
- Unreliable data swamps
12. Lakehouse
[Diagram: structured, semi-structured, and unstructured data land in a Data Lake, with a Lakehouse on top serving BI, Streaming Analytics, Data Science, and Machine Learning]
- One platform for every use case
- Data Lake for all your data
15. CloudFormation
● Over time, YAML/JSON templates grow larger
● Large YAML/JSON files are difficult to work with
○ High error rate when copying and pasting
○ It’s a text file, not a programming language
● Infrastructure AS code
● No abstraction
CDK
● IDE integration
○ Multiple languages
○ Syntax check, autocompletion, etc.
● Higher-level abstraction
○ Simplified statements
○ 500 lines of CFN become 30 lines of CDK code
● Infrastructure IS code
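The contrast above (hand-edited template text versus infrastructure written in a programming language) can be sketched without the CDK itself. What follows is plain Python that generates a CloudFormation template, not the CDK API. The helper names `bucket_resource` and `build_template`, the project name, and the zone names are all made up for illustration.

```python
import json


def bucket_resource(bucket_name: str, versioned: bool = True) -> dict:
    """Return one AWS::S3::Bucket resource (hypothetical helper, not a CDK construct)."""
    properties = {"BucketName": bucket_name}
    if versioned:
        properties["VersioningConfiguration"] = {"Status": "Enabled"}
    return {"Type": "AWS::S3::Bucket", "Properties": properties}


def build_template(project: str, zones: list) -> dict:
    """Build a CloudFormation template with one bucket per data lake zone."""
    resources = {}
    for zone in zones:
        # Logical IDs must be alphanumeric, so derive them from the zone name.
        logical_id = f"{zone.capitalize()}Bucket"
        resources[logical_id] = bucket_resource(f"{project}-{zone}")
    return {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}


# Hypothetical project and zone names.
template = build_template("example-lake", ["raw", "curated", "analytics"])
print(json.dumps(template, indent=2))
```

Adding a fourth zone is a one-word change in the list, where the hand-written template would need another copied and edited resource block. The CDK goes further by shipping typed constructs and synthesizing the template for you.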
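18. Why S3
Storage classes by access frequency, from frequent to infrequent:
S3 Standard, S3 INT (Intelligent-Tiering), S3 S-IA (Standard-Infrequent Access), S3 O-IA (One Zone-Infrequent Access), S3 Glacier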
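One way to act on the access-frequency axis above is a lifecycle rule that moves objects to colder storage classes as they age. The sketch below only builds the configuration dictionary. The helper name `lifecycle_rules`, the `logs/` prefix, and the 30/90-day thresholds are assumptions for illustration, not recommendations from the talk.

```python
def lifecycle_rules(prefix: str, ia_after_days: int = 30, glacier_after_days: int = 90) -> dict:
    """Build an S3 lifecycle configuration that tiers objects down as they age."""
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("expected 0 < ia_after_days < glacier_after_days")
    return {
        "Rules": [
            {
                "ID": f"tier-down-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # Frequent -> infrequent access after the first threshold.
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    # Infrequent -> archive after the second threshold.
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = lifecycle_rules("logs/")
print(config)

# With boto3 installed and credentials configured, the dictionary could be applied as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-bucket", LifecycleConfiguration=config)
# ("my-bucket" is a placeholder.)
```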
19. Apache Parquet
● Efficient, columnar data representation.
● Utilizes the record shredding and assembly algorithm.
● Supports schema evolution.
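The columnar representation described above can be illustrated with a toy example. This is a minimal sketch for flat records only, assuming every record has the same fields. It does not implement record shredding and assembly, which is what handles nested and repeated fields. The sample data and the helper `to_columns` are made up.

```python
rows = [
    {"user": "a", "country": "TW", "amount": 10},
    {"user": "b", "country": "JP", "amount": 25},
    {"user": "c", "country": "TW", "amount": 5},
]


def to_columns(records: list) -> dict:
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one column touches only that column's values,
# instead of reading every field of every row.
total = sum(columns["amount"])
print(columns["amount"], total)
```

Values of the same type also sit next to each other in this layout, which is what makes column-wise encoding and compression effective.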
20. Dataset | Size on Amazon S3 | Query Run Time | Data Scanned | Cost
Data stored as CSV files | 1 TB | 236 seconds | 1.15 TB | $5.75
Data stored in Apache Parquet format | 130 GB | 6.78 seconds | 2.51 GB | $0.01
Savings | 87% less when using Parquet | 34x faster | 99% less data scanned | 99.7% savings
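The savings row can be reproduced from the two data rows. The sketch below assumes 1 TB = 1024 GB and derives the price per TB scanned from the CSV row itself ($5.75 for 1.15 TB). The variable names are chosen for illustration.

```python
GB_PER_TB = 1024  # assumption: binary terabytes

csv_row = {"size_gb": 1 * GB_PER_TB, "seconds": 236, "scanned_gb": 1.15 * GB_PER_TB, "cost_usd": 5.75}
parquet_row = {"size_gb": 130, "seconds": 6.78, "scanned_gb": 2.51}

# Price per TB scanned implied by the CSV row: $5.75 / 1.15 TB.
usd_per_tb = csv_row["cost_usd"] / (csv_row["scanned_gb"] / GB_PER_TB)


def query_cost(scanned_gb: float) -> float:
    """Cost of one query when billing is proportional to data scanned."""
    return scanned_gb / GB_PER_TB * usd_per_tb


parquet_cost = query_cost(parquet_row["scanned_gb"])
size_saving = 1 - parquet_row["size_gb"] / csv_row["size_gb"]
speedup = csv_row["seconds"] / parquet_row["seconds"]
scan_saving = 1 - parquet_row["scanned_gb"] / csv_row["scanned_gb"]
cost_saving = 1 - parquet_cost / csv_row["cost_usd"]

print(f"price per TB scanned: ${usd_per_tb:.2f}")
print(f"Parquet query cost:   ${parquet_cost:.2f}")
print(f"storage saving:       {size_saving:.1%}")
print(f"speedup:              {speedup:.1f}x")
print(f"less data scanned:    {scan_saving:.2%}")
print(f"cost saving:          {cost_saving:.2%}")
```

Because cost is proportional to data scanned, the cost saving and the scan saving are the same ratio. The gain comes from both the smaller files and from reading only the columns the query needs.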
29. Feature | Delta Lake (open source) | Apache Iceberg | Apache Hudi
Transaction (ACID) | Y | Y | Y
MVCC | Y | Y | Y
Time travel | Y | Y | Y
Schema Evolution | Y | Y | Y
Data Mutation | Y (update/delete/merge/merge into) | N | Y (upsert)
Streaming | Sink and source for Spark Structured Streaming | Sink and source (wip) for Spark Structured Streaming, Flink (wip) | DeltaStreamer, HiveIncrementalPuller
File Format | Parquet | Parquet, ORC, AVRO | Parquet
Compaction/Cleanup | Manual | API available | Manual and Auto
Integration | DSv1, Delta connector | DSv2, InputFormat | DSv1, InputFormat
Multiple language support | Scala/Java/Python | Java/Python | Java/Python
Storage Abstraction | Y | Y | N
API dependency | Spark-bundled | Native/Engine bundled | DeltaStreamer
Data ingestion | Spark, Presto, Hive | Spark, Hive | DeltaStreamer
31. Delta Lake
● ACID transactions on Spark
● Scalable metadata handling
● Streaming and batch unification
● Schema enforcement
● Time travel
● Upserts and deletes
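To make "time travel" and "upserts and deletes" concrete, here is a toy in-memory model. It is not the Delta Lake API and not how Delta Lake stores data (Delta Lake keeps Parquet files plus a transaction log, whereas this toy copies a full snapshot per commit). The class `ToyVersionedTable` and the sample rows are hypothetical.

```python
class ToyVersionedTable:
    """Toy table keyed by "id"; every commit stores a new immutable snapshot."""

    def __init__(self):
        self._versions = [{}]  # version 0 is the empty table

    def _commit(self, snapshot: dict) -> int:
        self._versions.append(snapshot)
        return len(self._versions) - 1  # the new version number

    def upsert(self, rows: list) -> int:
        """Insert new rows and overwrite existing rows with the same id."""
        snapshot = dict(self._versions[-1])  # copy, so older versions stay untouched
        for row in rows:
            snapshot[row["id"]] = row
        return self._commit(snapshot)

    def delete(self, row_id) -> int:
        """Remove one row by id (a no-op commit if the id is absent)."""
        snapshot = dict(self._versions[-1])
        snapshot.pop(row_id, None)
        return self._commit(snapshot)

    def read(self, version=None) -> list:
        """Read the latest version, or an older one for time travel."""
        index = len(self._versions) - 1 if version is None else version
        return sorted(self._versions[index].values(), key=lambda row: row["id"])


table = ToyVersionedTable()
v1 = table.upsert([{"id": 1, "status": "new"}, {"id": 2, "status": "new"}])
v2 = table.upsert([{"id": 2, "status": "done"}, {"id": 3, "status": "new"}])
v3 = table.delete(1)

print(table.read())            # latest version
print(table.read(version=v1))  # time travel to the first commit
```

Readers of an old version are never affected by later commits, which is the property that time travel and multi-version concurrency control both rely on.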
33. Databricks
[Diagram: Databricks on AWS with AWS PrivateLink.
- Databricks AWS account: control plane network with the workspace web application, APIs, and other core services, exposed through two AWS VPC endpoint services.
- User AWS account: data plane in the user account, with one back-end VPC endpoint for the secure cluster connectivity relay and one back-end VPC endpoint for REST APIs, each over an AWS PrivateLink connection (back-end).
- User transit AWS account: VPC with a front-end VPC endpoint for user access to the web app and REST APIs, over an AWS PrivateLink connection (front-end).
- User on-premises or VPN network: user requests to the web app or REST APIs.]
34. Languages: Scala, Rust, Python, Ruby, Golang
Services, Connectors, Databases: Databricks, kafka-delta-ingest, Airbyte *
* Currently in development ** Coming soon
39. Glue ETL
1. Serverless Spark
2. DynamicFrame
3. Self-describing, no schema required initially
4. Some featured transforms
a. ResolveChoice
b. Unbox => similar to from_json
c. Spigot => similar to TABLESAMPLE
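What ResolveChoice and Unbox do can be sketched on plain Python dictionaries. This is not the awsglue API: the functions `unbox` and `resolve_choice_cast` and the sample records are hypothetical, and only mimic the behavior described above (forcing one type on a field that arrived with mixed types, and parsing a JSON string field into a nested structure).

```python
import json


def unbox(records: list, field: str) -> list:
    """Parse a JSON string field into a nested dict (in the spirit of Unbox / from_json)."""
    return [{**record, field: json.loads(record[field])} for record in records]


def resolve_choice_cast(records: list, field: str, cast) -> list:
    """Force a single type on a field whose values arrived with mixed types."""
    return [{**record, field: cast(record[field])} for record in records]


# "price" arrives as both str and int; "payload" is JSON embedded in a string.
records = [
    {"id": 1, "price": "10", "payload": '{"device": "ios"}'},
    {"id": 2, "price": 12, "payload": '{"device": "web"}'},
]

clean = resolve_choice_cast(unbox(records, "payload"), "price", int)
for record in clean:
    print(record)
```

In Glue the ambiguity is recorded as a choice type while the data is being read, which is why no schema is needed up front and the type decision can be deferred to a step like this one.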
43. Before Jun 20, 2019
[Diagram: the pipeline begins when new log data arrives and ends with the stats needed by requirements. Components in between: Start Glue crawler, Glue crawler that deals with metadata, CloudWatch event, Start Glue job, Glue job that executes the ETL]
44. Glue workflow
[Diagram: the pipeline begins when new log data arrives and ends with the stats needed by requirements. Components in between: Start Glue workflow, Glue job that executes the ETL, Glue crawler that deals with metadata]
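With a workflow, the "start" step shrinks to a single call. A minimal sketch, assuming the workflow already exists: `start_etl_workflow` is a hypothetical helper around the boto3 Glue client method `start_workflow_run`, the workflow name is made up, and `FakeGlueClient` is a stand-in so the example runs without AWS access.

```python
def start_etl_workflow(glue_client, workflow_name: str) -> str:
    """Start one run of an existing Glue workflow and return its run ID."""
    response = glue_client.start_workflow_run(Name=workflow_name)
    return response["RunId"]


class FakeGlueClient:
    """Stand-in for boto3.client("glue"); real run IDs look different."""

    def start_workflow_run(self, Name):
        return {"RunId": f"run-for-{Name}"}


# Hypothetical workflow name. With real credentials, pass boto3.client("glue") instead.
run_id = start_etl_workflow(FakeGlueClient(), "log-stats-workflow")
print(run_id)
```

The ordering of the job and the crawler then lives in the workflow definition, so the code that reacts to new log data no longer needs one starter per step.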
49. [Diagram: Amazon S3, AWS Snowball, AWS Snowmobile, Amazon Kinesis (Kinesis Video Streams, Kinesis Data Firehose, Kinesis Data Streams), Amazon Redshift, Amazon EMR, Amazon Athena, Amazon Elasticsearch Service, AI services]
● Any type of data
● Security, compliance, and audit capabilities across the data lake
● Empower all personas
● Democratize ML with SQL
● Unified analytics
50. Lakehouse Platform
[Diagram, listed from top layer to bottom:
- Data Engineering, BI & SQL Analytics, Real-time Data Applications, Data Science & Machine Learning
- Data Management & Governance
- Open Data Lake (S3)
- Structured, Semi-structured, Unstructured, Streaming]