This document discusses the WALD stack, a modern and sustainable analytics stack combining Snowflake, Airbyte, dbt, Lightdash, and optionally Streamlit. It provides an example of analyzing Formula 1 racing data using these tools, with Airbyte ingesting data into Snowflake, dbt for transformations, Lightdash for visualization, and dbt with Snowpark for machine learning predictions. The speaker argues the WALD stack is flexible for many use cases from analytics to full data products at scale.
2. Mathematical Modelling
Modern Data Warehousing & Analytics
Personalisation & RecSys
Uncertainty Quantification & Causality
Python Data Stack
OSS Contributor & Creator of PyScaffold
Dr. Florian Wilhelm
• H E A D O F D ATA S C I E N C E
FlorianWilhelm.info
florian.wilhelm@inovex.de
Florian Wilhelm
@FlorianWilhelm
3. ‣ Application Development (Web Platforms,
Mobile Apps, Smart Devices and Robotics, UI/
UX design,Backend Services)
‣ Data Management and Analytics (Business
Intelligence, Big Data, Searches, Data Science
and Deep Learning, Machine Perception and
Artificial Intelligence)
‣ Scalable IT-Infrastructures (IT Engineering,
Cloud Services, DevOps, Replatforming,
Security)
‣ Training and Coaching (inovex Academy)
is an innovation and quality-
driven IT project house with a
focus on digital transformation.
Using technology to
inspire our clients.
And ourselves.
Berlin · Karlsruhe · Pforzheim · Stuttgart · München · Köln · Hamburg · Erlangen
www.inovex.de
5. C O M PA R I S O N
Data Lake vs Data Warehouse
Data Lake Data Warehouse
Data Structure Raw Processed
Purpose of Data Not yet determined Currently in use
Users Data scientists /
engineers
Business professionals
Accessibility Highly accessible and
quick to update
More complicated and
costly to make changes
6. C O M PA R I S O N
ETL — Extract, Transform, Load
S TA G I N G A R E A
I N D ATA L A K E
T R A N S F O R M
S Q L R D B M S
N O S Q L D B M S
X M L F I L E S
S A A S P L AT F O R M
E X T R A C T
D ATA
WA R E H O U S E
L O A D
A N A LY T I C S
A N A L Y S E
7. C O M PA R I S O N
ELT — Extract, Load, Transform
D ATA
WA R E H O U S E
T R A N S F O R M
S Q L R D B M S
N O S Q L D B M S
X M L F I L E S
S A A S P L AT F O R M
E X T R A C T L O A D
A N A LY T I C S
A N A L Y S E
8. P R O G R E S S
The Evolution of Databricks & Snowflake
D ATA L A K E
& E T L
D ATA WA R E H O U S E
& E LT
9. P R O G R E S S
The Evolution of Databricks & Snowflake
D ATA B R I C K S
L A K E H O U S E
S N OW F L A K E
D ATA C L O U D
U N I F I C AT I O N
O F D ATA WA R E H O U S E & L A K E
11. W A L D
S T A C K
WA L D S TA C K
A Modern & Sustainable Analytics Stack
Warehouse like
Snowflake, Google
BigQuery, Databricks
Airbyte for
Integration of
hundreds of data
sources
Lightdash for
Business
Intelligence
DBT for data
transformations
12. WA L D S TA C K
WALD based on Google Big Query
‣ fully managed enterprise data
warehouse
‣ only available on Google Cloud
Platform
‣ uses dbt-bigquery to run Python
UDFs with Dataproc using
PySpark
13. WA L D S TA C K
WALD based on Databricks
‣ uses Databrick’s SQL warehouse
‣ available on all major cloud
providers
‣ uses dbt-databricks to run Python
UDFs with PySpark
14. WA L D S TA C K
WALD based on Snowflake
‣ comes with all the benefits of
Snowflake, e.g. governance, easy
collaboration, security, near zero
maintenance
‣ runs on all major cloud providers
‣ tons of great documentation
‣ the easiest to use and full-
featured option to get started
with WALD
15. WA L D S TA C K
Modern Analytics Stack
DATA CLOUD
DATA SOURCES
E X T R A C T L O A D T R A N S F O R M A n a l y s e
SNOWPARK
16. WA L D S TA C K
Example
Use-Case
>find all code & details under
https://waldstack.org
17. F O R M U L A 1
Analysis and Position Prediction
1. Upload data using Airbyte to
Snowflake
2. Use dbt for some basic analysis
3. Use dbt with Snowpark for some
ML to predict future position
4. Visualise our results with
Lightdash
18. F O R M U L A 1
Rough Overview of the Data
‣ dimensions: drivers,
circuits, constructors,
races
‣ facts: lap_times,
pit_tops, results
19. F O R M U L A 1
Rough Overview of the Data
‣ dimensions: drivers,
circuits, constructors,
races
‣ facts: lap_times,
pit_tops, results
20. F O R M U L A 1
Rough Overview of the Data
‣ dimensions: drivers,
circuits, constructors,
races
‣ facts: lap_times,
pit_tops, results
21. F O R M U L A 1
Rough Overview of the Data
‣ dimensions: drivers,
circuits, constructors,
races
‣ facts: lap_times,
pit_tops, results
22. F O R M U L A 1
Rough Overview of the Data
‣ dimensions: drivers,
circuits, constructors,
races
‣ facts: lap_times,
pit_tops, results
23. F O R M U L A 1
Rough Overview of the Data
‣ dimensions: drivers,
circuits, constructors,
races
‣ facts: lap_times,
pit_tops, results
24. F O R M U L A 1
Rough Overview of the Data
‣ dimensions: drivers,
circuits, constructors,
races
‣ facts: lap_times,
pit_tops, results
25. F O R M U L A 1
Rough Overview of the Data
‣ dimensions: drivers,
circuits, constructors,
races
‣ facts: lap_times,
pit_tops, results
26. S N OW F L A K E
Snowflake Setup
‣ activate the Anaconda Packages
(Preview Feature)
‣ create a warehouse to be used
for dbt
‣ that’s it already!
27. A I R B Y T E
Ingesting the
Data with Airbyte
Simply select
‣ file as source
‣ Snowflake as destination
and hit Launch
28. P Y T H O N + S N OW PA R K
Setup Python/Snowpark for dbt
‣ just install along dbt also
‣ dbt-snowflake
‣ snowflake-snowpark-
python
‣ snowflake-connector-
python
‣ and configure Snowflake in your
profiles.yml
29. P Y T H O N + S N OW PA R K
Training a Python Model in dbt/Snowpark
30. D B T / S N OW PA R K
Predicting with a trained Model in dbt/Snowpark
31. L I G H T D A S H
Visualisation with Lightdash
Lightdash pulls dimensions and metrics from dbt.
Lightdash
‣ runs ad-hoc queries
‣ creates charts from queries
‣ creates dashboards from charts
‣ organises everything in spaces
What are the avg lap times split by year and grand prix?
M E T R I C D I M E N S I O N S
37. O N M O R E T H I N G O N …
Streamlit
‣ provides faster way to build and
share data apps
‣ was recently acquired by
Snowflake
‣ might be a nice addition to
Lightdash for interactive data
apps
38. Dr. Florian Wilhelm, 2023
WA L D S TA C K
Conclusion
The WALD stack combines 4 powerful
technologies that work very well
together.
The WALD stack is flexible enough for
many use-cases from analytics to
building a full-fledged end-to-end
data product at scale.
39. C H E E R S T O T H E C O M M U N I T Y
Credits & Resources
‣ python-snowpark-formula
Github repo by Hope Watson
from dbt labs
‣ kind support by Michael Gorkow
& Marko Cavar from Snowflake
‣ these awesome slides by
Michael Hofmann from inovex