Snowflake for Data Engineering
- 1. © 2020 Snowflake Inc. All Rights Reserved
FROM DATA TO INSIGHTS:
DATA ENGINEERING
WITH SNOWFLAKE
ScaleUp 360° Smart Data
Sept. 29, 2020
Harald Erb | harald.erb@snowflake.com
Sr. Solutions Engineer, Central Europe
- 2. © 2020 Snowflake Computing Inc. All Rights Reserved
ABOUT ME
Sr. Solutions Engineer
Central Europe
harald.erb@snowflake.com
linkedin.com/in/haralderb
Enthusiastic about Business Analytics &
Data Management for 20+ years
> Consulting: Delivered large-scale Data
Warehouse and BI projects as Developer,
Information Analyst, Solution Architect,
Project Lead at Oracle D/A/CH
> Presales #2 at Snowflake in Central
Europe with focus on Modern Data
Management & Analytics
> Worked with clients on Big Data & IoT
solutions as Architect and Solutions
Engineer at Oracle EMEA, Pentaho and
Hitachi Vantara
- 3. © 2020 Snowflake Computing Inc. All Rights Reserved
AGENDA
> Snowflake Cloud Data Platform – for Data Engineering
> Solution Study: Let's Build Something!
> Session Takeaway
- 4. © 2020 Snowflake Inc. All Rights Reserved.
SNOWFLAKE FOR
DATA ENGINEERING
- 5. © 2020 Snowflake Inc. All Rights Reserved
SNOWFLAKE CLOUD DATA PLATFORM
OLTP DATABASES
ENTERPRISE
APPLICATIONS
THIRD-PARTY
WEB/LOG DATA
IoT
DATA MONETIZATION
OPERATIONAL
REPORTING
AD HOC ANALYSIS
REAL-TIME ANALYTICS
DATA
SOURCES
DATA
CONSUMERS
Today's topic
- 6. © 2020 Snowflake Inc. All Rights Reserved
Rethink
transformation
with robust
and integrated
data pipelines
Simplify and
accelerate your
data lake with
one platform for
all your data
Develop apps
with fast and
scalable analytics
that delight
customers
Deliver
analytics at
scale with
a modern
data warehouse
Empower your
ecosystem
to securely
collaborate
across all data
Simplify and
accelerate
machine learning
and artificial
intelligence
ONE PLATFORM, ONE COPY OF DATA,
MANY WORKLOADS
- 7. © 2020 Snowflake Computing Inc. All Rights Reserved
OVERCOMING DATA SILOS WITH SNOWFLAKE
Data Sources Data Consumers
Structured Data
Semi-Structured Data
Web APIs
IoT Data
Data Visualization /
Reporting
Data Science
Ad hoc Queries
Data Zones
Enterprise data in one place (as much as possible), organized (e.g. in logical Data Zones) and accessible for all users
Work Area (Exploratory, AI / ML)
Persistent, user/team space, one or more Databases
Landing Zone
Transient, ELT processes, truncate/reload
Raw
Raw data, schema-
less (JSON…): no
transformations,
matches source data
Conformed
Raw +
de-duplicated, data
type standardization
(dates)
Reference
Master data,
manual mappings,
Business hierarchies
Modeled
Integrated, cleansed,
modeled data (3NF,
Data Vault,
Dimensional Model)
“Data Lake” “Data Warehouse”
- 8. © 2020 Snowflake Computing Inc. All Rights Reserved
ELASTIC SERVICE, SUPPORT FOR MULTIPLE WORKLOADS
Continuous
Loading (4TB/day)
S3
<5min SLA
Compute Cluster
“Medium”
Batch Data Loads
& Transformations
Compute Cluster
"Large”
Compute
Cluster
"2X-Large”
Customer
Analytics &
Segmentation
Interactive
Dashboard
50% < 1s
85% < 2s
95% < 5s
Compute Cluster
Auto Scale –
”X-Large” x 5
Prod DB
Snowflake Shared Data, Multi-Cluster Architecture: All data available in a central repository,
major workloads isolated, performance on demand, and easy data access for everybody via SQL
Benefit:
Deliver Reporting
SLAs
Benefit:
Add teams as needed,
support agile development &
a data-driven culture
Benefit:
Always fresh data
Benefit:
Complete more tasks
within same time frame
Structured & Semi-structured Data at Petabyte-Scale
(all encrypted, compressed)
- 9. © 2020 Snowflake Inc. All Rights Reserved
SUPPORTING CAPABILITIES FOR DATA ENGINEERING
Today's topic
- 10. © 2020 Snowflake Inc. All Rights Reserved.
SOLUTION STUDY:
LET'S BUILD SOMETHING!
- 11. © 2020 Snowflake Inc. All Rights Reserved
SOLUTION SCENARIO: INGESTING FUEL PRICE DATA FOR ANALYSIS
Source: tankerkoenig.de
- 12. © 2020 Snowflake Inc. All Rights Reserved
SOLUTION
ARCHITECTURE
Today's topic
- 13. © 2020 Snowflake Inc. All Rights Reserved 13
Key Steps
> Integrate with AWS S3 and connect
Snowflake via External Stage
> Create a Pipe for Automatic Data Ingestion
> Test Snowpipe with new data
SCENARIO - Part #1
DATA INGESTION WITH
SNOWPIPE
- 14. © 2020 Snowflake Inc. All Rights Reserved
INTEGRATE AWS S3 WITH SNOWFLAKE VIA EXTERNAL STAGE
What is… a Storage Integration and (External) Stage?
> Storage Integration: a Snowflake object that stores a generated identity and access management (IAM)
entity for external cloud storage, along with an optional set of allowed or blocked storage locations (Amazon
S3, Google Cloud Storage, or Microsoft Azure)
> (External) Stage: a Snowflake object which encapsulates all of the required information for staging files: S3
bucket where the files are staged; the named storage integration object or S3 credentials for the bucket (if it
is protected); an encryption key (if the files in the bucket have been encrypted)
SF Admin Task, typically
not done by developers!
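The two objects described above can be sketched in Snowflake SQL. All names, the role ARN, and the bucket URL below are hypothetical placeholders, not the deck's actual configuration:

```sql
-- Admin task: register an IAM-backed integration for the external bucket.
CREATE STORAGE INTEGRATION s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::001234567890:role/snowflake_access'
  STORAGE_ALLOWED_LOCATIONS = ('s3://fuel-price-bucket/prices/');

-- Developer-facing object: an external stage bound to that integration.
CREATE STAGE fuel_stage
  URL = 's3://fuel-price-bucket/prices/'
  STORAGE_INTEGRATION = s3_int
  FILE_FORMAT = (TYPE = 'JSON');
```

Binding the stage to a named integration keeps AWS credentials out of developers' hands; they only ever reference `@fuel_stage`.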
- 15. © 2020 Snowflake Inc. All Rights Reserved
IDENTIFY DATA TO BE LOADED FROM EXTERNAL STAGE
List the contents of an S3 bucket directly
from Snowflake and navigate the subfolder
structure.
Identify, inspect, and select the files to be
loaded using wildcards (“*”), regular expressions, etc.
Compute statistics
on files to be loaded
into Snowflake
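Assuming an external stage named `fuel_stage` (a hypothetical name) with a JSON file format, the steps above look like this:

```sql
-- Browse the bucket through the stage: subfolders and regex patterns.
LIST @fuel_stage/2020/09/;
LIST @fuel_stage PATTERN = '.*[.]json';

-- Peek at staged file contents before loading; the stage's JSON
-- file format parses each document into the $1 column.
SELECT $1 FROM @fuel_stage/2020/09/ LIMIT 10;
```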
- 16. © 2020 Snowflake Inc. All Rights Reserved
AUTOMATIC DATA INGESTION WITH SNOWPIPE
Bulk load
command
Target table to be updated
What is… Snowpipe?
> Snowpipe enables loading data from files as soon as
they’re available in a stage. Data can be loaded from files
in micro-batches, making it available to users within
minutes, rather than manually executing COPY statements
on a schedule to load larger batches.
> Alternative: Clients can call public Snowpipe REST
endpoints to load data and retrieve load history reports
Source location,
external stage
(e.g. S3 Bucket)
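A pipe wraps the bulk load command shown on the slide. A minimal sketch, with hypothetical table, stage, and pipe names:

```sql
-- Target table with a single VARIANT column for the raw JSON documents.
CREATE TABLE raw_fuel_prices (src VARIANT);

-- AUTO_INGEST = TRUE: S3 event notifications trigger the COPY,
-- so new files are loaded without any scheduled COPY statements.
CREATE PIPE fuel_pipe AUTO_INGEST = TRUE AS
  COPY INTO raw_fuel_prices
  FROM @fuel_stage
  FILE_FORMAT = (TYPE = 'JSON');

-- Check the pipe's execution state and pending file count.
SELECT SYSTEM$PIPE_STATUS('fuel_pipe');
```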
- 17. © 2020 Snowflake Inc. All Rights Reserved
UPLOAD NEW DATA TO S3 & CHECK STATUS OF SNOWPIPE
- 18. © 2020 Snowflake Inc. All Rights Reserved
VALIDATE RESULT IN SNOWSIGHT DASHBOARD
- 19. © 2020 Snowflake Inc. All Rights Reserved 19
Key Steps
> Integrate AWS Lambda Function
> Automate API Calls + store Payloads (JSON)
> Implement Change Data Capture
> Automate JSON flattening + Data Loading
SCENARIO - Part #2
AUTOMATED RETRIEVAL +
PROCESSING OF API DATA
- 20. © 2020 Snowflake Inc. All Rights Reserved
INTEGRATE AWS LAMBDA WITH SNOWFLAKE VIA EXTERNAL FUNCTION
What is… an API Integration and External Function?
> API Integration (Preview Feature): an object that stores information about an HTTPS proxy service, including
the cloud platform provider (e.g. Amazon AWS), the type of proxy service (in case the cloud platform provider
offers more than one type of proxy service), and an identifier and access credentials
> External Function (Preview Feature): Snowflake does not call a remote service directly. Instead, Snowflake calls
the remote service through a cloud provider’s native HTTPS proxy service, for example API Gateway on AWS
SF Admin Task, typically
not done by developers
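The pair of objects can be sketched as follows; the ARN, endpoint URL, and all names are hypothetical placeholders:

```sql
-- Admin task: register the HTTPS proxy (AWS API Gateway) with Snowflake.
CREATE API INTEGRATION fuel_api_int
  API_PROVIDER = aws_api_gateway
  API_AWS_ROLE_ARN = 'arn:aws:iam::001234567890:role/snowflake_ext_fn'
  API_ALLOWED_PREFIXES = ('https://abc123.execute-api.eu-central-1.amazonaws.com/prod/')
  ENABLED = TRUE;

-- Snowflake calls the Lambda only through the gateway, never directly.
CREATE EXTERNAL FUNCTION get_fuel_prices(region VARCHAR)
  RETURNS VARIANT
  API_INTEGRATION = fuel_api_int
  AS 'https://abc123.execute-api.eu-central-1.amazonaws.com/prod/prices';
```

Once created, `get_fuel_prices(...)` can be used in SQL like any other function.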
- 21. © 2020 Snowflake Inc. All Rights Reserved
AUTOMATE API REQUESTS WITH TASK #1
Automation Task with dedicated
compute (TASK_WAREHOUSE),
a schedule, and no dependencies
What is… a Task?
> User-defined tasks allow scheduled execution of SQL statements. Tasks run according
to a specified execution configuration, using any combination of a set interval and/or a
flexible schedule using a subset of familiar cron utility syntax.
> There is no event source that can trigger a task; instead, a task runs on a schedule,
which can be defined when creating a task (using CREATE TASK) or later
(using ALTER TASK)
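A task of this kind, calling the external function on a cron schedule, might look like the sketch below. Warehouse, table, and function names are hypothetical:

```sql
-- Task #1: call the API every 15 minutes and persist the JSON payload.
CREATE TASK fetch_prices_task
  WAREHOUSE = task_warehouse
  SCHEDULE = 'USING CRON */15 * * * * UTC'
AS
  INSERT INTO api_payloads (loaded_at, payload)
  SELECT CURRENT_TIMESTAMP(), get_fuel_prices('de');

-- Tasks are created suspended; resume to activate the schedule.
ALTER TASK fetch_prices_task RESUME;
```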
- 22. © 2020 Snowflake Inc. All Rights Reserved
API PAYLOAD RETRIEVED
Fuel price data of multiple
gas stations
ACTIVATE CHANGE DATA CAPTURE WITH STREAMS
Source table where
data record changes
should be tracked
SQL query on a table stream
to view which records have
been added, changed, deleted
What is… a Stream?
> An individual table stream
tracks the changes made
to rows in a source table. A
table stream makes a
“change table” available of
what changed, at the row
level, between two
transactional points of time
in a table.
> A stream itself does not
contain any table data, it
only stores the offset for
the source table and
returns CDC records by
leveraging the versioning
history.
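Creating and querying a stream is a two-liner; the source table name is hypothetical:

```sql
-- Track row-level changes on the payload table.
CREATE STREAM api_payloads_stream ON TABLE api_payloads;

-- A query on the stream returns the changed rows plus CDC metadata
-- columns describing what happened to each row since the last offset.
SELECT payload, METADATA$ACTION, METADATA$ISUPDATE, METADATA$ROW_ID
FROM api_payloads_stream;
```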
- 25. © 2020 Snowflake Inc. All Rights Reserved
AUTOMATE DELTA LOAD WITH STREAMS AND TASK #2
Task will only start if table stream
has new data records to process
→ saves compute resources!
Only CDC data
records of interest will
be processed and then
cleared from stream
when committed
Lateral view and flatten table function
used to split price data by Gas Station
and store as separate records in the
target table REMOTE_FUEL_PRICES
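Combining both ideas, a sketch of such a task (stream, warehouse, and column names are hypothetical; the target table name comes from the slide):

```sql
-- Task #2: runs on schedule, but only executes when the stream has
-- pending CDC records. Consuming the stream in a committed DML
-- statement advances its offset, i.e. "clears" it.
CREATE TASK load_prices_task
  WAREHOUSE = task_warehouse
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('api_payloads_stream')
AS
  INSERT INTO remote_fuel_prices (station_id, price_data, loaded_at)
  SELECT f.key, f.value, p.loaded_at
  FROM api_payloads_stream p,
       LATERAL FLATTEN(input => p.payload:prices) f;  -- one row per gas station
```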
- 26. © 2020 Snowflake Inc. All Rights Reserved
STREAM CLEARED & PRICE DATA READY FOR ANALYSIS!
New fuel prices prepared
and stored in target table
REMOTE_FUEL_PRICES
(still in JSON format)
Query of table stream returns no
rows because the stream was
cleared after successful INSERT
into target table (Auto committed)
- 27. © 2020 Snowflake Inc. All Rights Reserved 27
Key Steps
> Consolidate Data for Analysis
> Query + visualize data for a given Gas Station
in Germany
> Analyze Snowflake Consumption
SCENARIO - Part #3
DATA CONSOLIDATION +
VISUALIZATION
- 28. © 2020 Snowflake Inc. All Rights Reserved
COMBINING
HISTORIC DATA
WITH API DATA
Reading, formatting and
joining JSON price data
directly with master data
Putting it all together:
Historic data from
dimensional model
combined with
current price data
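A sketch of such a join, reading typed values straight out of the JSON; the dimension table and its columns are hypothetical, REMOTE_FUEL_PRICES is the target table from the previous step:

```sql
-- Cast JSON price fields to FLOAT and join with station master data.
SELECT st.station_name,
       st.city,
       fp.price_data:e5::FLOAT     AS e5_price,
       fp.price_data:diesel::FLOAT AS diesel_price,
       fp.loaded_at
FROM remote_fuel_prices fp
JOIN dim_gas_station st
  ON st.station_id = fp.station_id
ORDER BY fp.loaded_at DESC;
```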
- 29. © 2020 Snowflake Inc. All Rights Reserved
ANALYSIS & VISUALIZATION FOR A GIVEN GAS STATION
- 30. © 2020 Snowflake Inc. All Rights Reserved
PAY AS YOU USE + BUILT-IN COST TRANSPARENCY
Snowflake Default Billing & Usage Dashboard
Snowpipe Usage History queried via SQL
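For example, Snowpipe credit consumption can be queried via the Information Schema table function (pipe name hypothetical):

```sql
-- Snowpipe usage over the last 7 days: credits, bytes, and file counts.
SELECT start_time, end_time, pipe_name,
       credits_used, bytes_inserted, files_inserted
FROM TABLE(INFORMATION_SCHEMA.PIPE_USAGE_HISTORY(
       DATE_RANGE_START => DATEADD('day', -7, CURRENT_TIMESTAMP()),
       PIPE_NAME        => 'fuel_pipe'));
```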
- 31. © 2020 Snowflake Inc. All Rights Reserved.
SESSION TAKEAWAY
- 32. © 2020 Snowflake Inc. All Rights Reserved
A COMPLETE AND EASY-TO-USE DATA PLATFORM
Structured Data
Semi-Structured Data
Web APIs
IoT Data
Visualization /
Reporting
Data Science
Ad hoc Queries
Data Sources Stage
Presentation /
Consumers
JSON, AVRO
(VARIANT)
Hive Metastore
Integration
External Tables
Parquet
Load/Unload
ANSI SQL
Data Lake Warehouse Aggregation
Semantic /
Federated
Elastic Multi-
Cluster Compute
Data Vault,
3NF Modeling
ACID
Transactional
Consistency
Secure Views /
Data Masking
Materialized
Views
Zero Copy
Cloning
SSO
LDAP
OAUTH
SCIM
ODBC/JDBC
Python/R/Spark
Connector
End-to-End Security (RBAC, Encryption at Rest/in Motion)
Web UI
External
Functions
Data Sharing /
Marketplace
Streams (CDC) &
Tasks (Scheduler)
Time Travel
Kafka-Connector /
Snowpipe
Stored Procs /
UDFs
Geospatial
Snowflake supports Data Lake, Data Warehouse, and Data Engineering workloads
Dimensional
Modeling
Information
Schema
- 33. © 2020 Snowflake Inc. All Rights Reserved
SNOWFLAKE FOR DATA ENGINEERING
ALL DATA,
ANY SPEED
BETTER PRICE &
PERFORMANCE
NO SUPER POWERS
REQUIRED
Structured & Semi-Structured Data
Batch & Continuous Data Ingestion
Partner Ecosystems
Dedicated Resources
Auto Scaling
SQL-based
Single Platform with Near-Zero
Maintenance
Streams & Tasks