IBM Cloud Data Lake Deep Dive
1. IBM Cloud Day 2021
Well Architected Data Lake
James Bennett, Offering Manager
Torsten Steinbach, Senior Technical Staff Member
2. Two Cloud Data Lake Sessions Today
• The Well Constructed Architecture of a Modern Data Lake
• Introductory Session
• What we provide, how you can consume it
• Light introduction to deeper architecture
• Deep Dive into Cloud Native Data Lakes with IBM Cloud
• Session led by Torsten
• Everything you need to know about building a Data Lake on IBM Cloud
• Includes our COVID-19 Data Lake implementation
3. Cloud Data Lake for the Enterprise
Organizations need the ability to:
o Visualize data and build data-driven applications
o Increase data flexibility and accessibility
o Provide data governance to retain data authenticity
o Gain speed with data insights
o Collect, explore and analyze data
For: data architects, business and data analysts, data scientists and application developers
4. Cloud Data Lake Evolutionary Context
The 1990s: Enterprise Data Warehouses, tightly integrated and optimized systems
The 2000s: Hadoop, introduced open data formats & easy scaling on commodity HW
Today: Cloud-Native Serverless Analytics-aaS
• Elasticity
• Pay-per-query
• Data in object store
• Disaggregated architecture
• Increasingly real-time first
5. Case Study
Business Problem
Need the ability to effectively analyze data from remote locations to gain insights with cost-effective, secure, on-demand analytics and long-term data retention.
Solution
o Nightly batch exports from operational production databases in factory locations are automatically uploaded to the data lake in the cloud (central COS bucket).
o LoB engineers subscribe to data in the data lake, which is then ETLed with SQL Query to tenant-specific zones (tenant-specific COS buckets).
o Future updates of data lake data in the central COS bucket are automatically ETLed right away to the tenant-specific COS buckets via Cloud Functions events.
o LoB engineers explore, experiment and do data preparation using SQL Query on tenant-specific buckets.
o LoB engineers use Watson Studio to run data science, visualize and present insights to executives.
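The update-propagation step above (central bucket write triggers tenant-specific ETL) can be sketched as a Cloud-Functions-style event handler. All names here are hypothetical placeholders, not IBM APIs: the subscription table, bucket names, and the injected `run_etl` callback stand in for the real SQL Query ETL submission.

```python
# Sketch of the event-driven ETL fan-out described above (assumed names,
# not an IBM API): when an object lands in the central COS bucket, trigger
# an ETL copy into every tenant bucket that subscribes to that dataset.

# Hypothetical subscription table: dataset prefix -> subscribing tenant buckets
SUBSCRIPTIONS = {
    "factory-a/": ["tenant-1-bucket", "tenant-3-bucket"],
    "factory-b/": ["tenant-2-bucket"],
}

def tenant_targets(object_key, subscriptions=SUBSCRIPTIONS):
    """Return the tenant buckets that should receive a copy of this object."""
    return [
        bucket
        for prefix, buckets in subscriptions.items()
        if object_key.startswith(prefix)
        for bucket in buckets
    ]

def handle_object_written(event, run_etl):
    """Cloud-Functions-style entry point for an 'object written' event.

    `run_etl(source_key, target_bucket)` would submit the tenant-specific
    SQL ETL job; it is injected here so the routing logic stays testable.
    Returns the number of tenant ETL jobs triggered.
    """
    key = event["key"]
    targets = tenant_targets(key)
    for bucket in targets:
        run_etl(key, bucket)
    return len(targets)
```

The routing table is kept separate from the handler so subscriptions can change without redeploying the function.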
6. Case Study
Business Problem
Need the ability to effectively ingest and analyze data from multiple vendors in various data formats to gain competitive insights.
Solution
Ø Ingest pricing data from 20+ external vendors and persist it in Cloud Object Storage.
Ø Data engineers prep the data by joining vendor data with the on-premises data warehouse.
Ø Data engineers then process result sets using Analytics Engine (Spark) and Db2 Warehouse on Cloud.
Ø LoB engineers explore, experiment and do data preparation using SQL Query on tenant-specific buckets.
Ø LoB engineers use Watson Studio (notebooks) to run data science, visualize and present actionable competitive insights to executives.
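The prep step above (joining vendor feeds with warehouse records) would typically run as a Spark or SQL join; as a minimal, library-free sketch of the same join logic, with all field names invented for illustration:

```python
# Minimal sketch of the vendor/warehouse join described above; in practice
# this runs as a Spark or SQL join. All field names are invented.

def join_vendor_prices(vendor_rows, warehouse_rows):
    """Inner-join vendor pricing with warehouse product records on 'sku',
    computing the price delta for competitive analysis."""
    by_sku = {row["sku"]: row for row in warehouse_rows}
    joined = []
    for v in vendor_rows:
        w = by_sku.get(v["sku"])
        if w is not None:
            joined.append({
                "sku": v["sku"],
                "vendor": v["vendor"],
                "vendor_price": v["price"],
                "our_price": w["price"],
                "delta": round(v["price"] - w["price"], 2),
            })
    return joined
```

Vendor rows without a matching warehouse SKU are dropped, mirroring an inner join.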
7. Use Cases
Replicate on-prem DB to cloud data lake for analytics
o Capture database change feed into Kafka in Cloud
o Land Kafka data to object storage
o Prepare replicated change feed for analytics
o Query for insights
o Present & visualize insights
Collect, historize & analyze IoT data
o Land IoT message data through Event Streams (Kafka)
o Prepare, cleanse, extract and enrich IoT data
o Query for insights
o Present & visualize insights
Move existing Hadoop workload to cloud
o Replace HDFS with cloud-native storage: object storage
o Run Hadoop processing in a fully managed Hadoop service: Analytics Engine
o Interactive analytics through Watson Studio
AIOps: gain operational & business insights from solution logs
o Collect full solution telemetry (logs)
o Prepare, cleanse, extract and enrich data from logs
o Query for insights
o Present & visualize insights
SQL in place: reduce cost and decouple workload from DWHs
o Use data lake as landing and preparation storage before data gets ingested to DWH
o Archive data from DWH to data lake for an affordable SQL-enabled archive
o Automate ETL and enable SQL federation across data lake and DWH
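The "SQL in place" archive pattern above maps to SQL Query statements that read from one COS location and write results back to another. A sketch built with a hypothetical helper function; the bucket URLs and column names are placeholders, and the `cos://` URI and `INTO ... STORED AS` clauses follow SQL Query's documented style, so verify against the current service docs before use:

```python
# Hypothetical helper that builds an archive ETL statement in the style of
# IBM SQL Query. Bucket URLs, columns, and the cutoff are placeholders.

def archive_statement(src_url, dst_url, cutoff):
    """Build a statement that archives old rows from a CSV landing zone
    into a Parquet archive zone on COS."""
    return (
        "SELECT order_id, customer_id, amount, order_ts\n"
        f"FROM {src_url} STORED AS CSV\n"
        f"WHERE order_ts < '{cutoff}'\n"
        f"INTO {dst_url} STORED AS PARQUET"
    )
```

Writing the archive as Parquet keeps it SQL-queryable while reducing scan cost versus the CSV source.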
8. Cloud Pak for Data as a Service
Built on IBM Cloud. Uses IBM Cloud Data Lake:
• Storage: COS
• Analytics: SQL Query
• Streaming: Event Streams
• Transformation: Spark
• Databases: Cloud Databases
9. IBM Cloud enables a secure, fully integrated set of Cloud Data Services
Scalability: Start small and grow large without overprovisioning for anticipated scale.
Efficiency and Speed: Get applications to market quickly, without worrying about underlying infrastructure costs, maintenance, and provider security.
Flexibility: Pick and choose services to fit their needs, customize applications and expand across geos seamlessly.
Security: Common security integrations with Identity and Access Management, customer-managed encryption keys, and a common compliance roadmap.
12. Key Business Outcome: DataOps
Enable safe self-service access to data across users with multiple skill levels, enabling them to use the power of AI securely at speed.
Data consumers: data science tooling, streaming analytics, analytical dashboards, AI applications, data prep tools
Hybrid data sources: object stores, data lake, databases, unstructured & streaming data
Integrated data governance: an intelligent data catalog to assess risk, discover data, self-serve find & 'deploy' data, enforce data privacy, and capture business meaning
• Extract greater value from your data assets through better data organization and intelligent data discovery
• Enable AI to help you derive better insights from your organized data
• Improve data risk strategies by assessing risks across your data estates
• Increase user productivity through safe self-service data access
• Unified end-user experience driven by seamlessly integrated services across the platform
13. Cloud Pak for Data as a Service
Built on IBM Cloud. Uses IBM Cloud Data Lake:
• Storage: COS
• Analytics: SQL Query
• Streaming: Event Streams
• Transformation: Spark
• Databases: Cloud Databases
14. Why IBM Cloud Data Lake?
• Industry-leading optimizations for SQL-native location & timeseries data and indexing of object storage data
• High velocity due to self-service data management, preparation & analytics, with an extremely low barrier of entry thanks to the serverless model
• Most secure data lake option in the cloud, due to unique BYOK and KYOK key services in IBM Cloud
• Enables cloud economics, resiliency and scale for big data
16. IBM Cloud Data Lake – Big Picture
Telemetry data streams into the Cloud Data Lake, where it is ETLed, prepared, enriched, explored, optimized and analyzed.
ü Seamless elasticity
ü Seamless scalability
ü Highly cost effective
ü Long-term retention
ü Any data formats
Optional: ETL into a DWH / databases for analytics
ü Response-time SLAs
ü Warm, high-quality data only
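The prepare/cleanse/enrich stage above, sketched on raw telemetry log lines; the log format, field names, and enrichment rule are invented for illustration:

```python
import re

# Hypothetical log line format (invented for illustration):
#   "2021-03-01T12:00:00Z ERROR payment timeout after 30s"
LOG_RE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<service>\S+) (?P<msg>.*)$")

def enrich(line):
    """Parse one telemetry line into a record; drop (cleanse) lines that
    don't match the expected format; add a derived 'is_error' field (enrich)."""
    m = LOG_RE.match(line.strip())
    if not m:
        return None  # cleansing: malformed line is discarded
    rec = m.groupdict()
    rec["is_error"] = rec["level"] in ("ERROR", "FATAL")  # enrichment
    return rec
```

In the deck's architecture, a step like this would run as a SQL Query or Spark transformation over log objects in COS rather than in plain Python.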
17. IBM Serverless Stack for Analytics
• Serverless storage (Object Storage): only pay for the volume of data that you really store
• Serverless runtimes (Cloud Functions): only pay for the CPU that you really consume
• Serverless analytics (SQL Query): only pay for the amount of data that you really scan
§ Properties of serverless:
– No management of resources, hosts and processes
– Auto-scaling and auto-provisioning based on actual load
– Precise billing based on really consumed system resources (memory, storage, CPU, network, I/O)
– High availability is always implicit
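The three pay-per-use dimensions above can be combined into a simple bill estimate. In this sketch, the $5/TB-scanned rate is the figure this deck quotes for SQL Query; the storage and compute rates are invented placeholders, so check current IBM Cloud pricing for real numbers.

```python
# Illustrative serverless cost model for the stack above. The scan rate
# comes from this deck; the other two rates are invented placeholders,
# not IBM list prices.
PRICE_PER_TB_SCANNED = 5.00      # SQL Query (quoted in the deck)
PRICE_PER_GB_MONTH = 0.02        # Object Storage (placeholder)
PRICE_PER_GB_SECOND = 0.000017   # Cloud Functions (placeholder)

def monthly_cost(stored_gb, scanned_tb, fn_gb_seconds):
    """Estimate a month's bill from the three serverless usage dimensions:
    data stored, data scanned by queries, and function compute consumed."""
    return round(
        stored_gb * PRICE_PER_GB_MONTH
        + scanned_tb * PRICE_PER_TB_SCANNED
        + fn_gb_seconds * PRICE_PER_GB_SECOND,
        2,
    )
```

The point of the model: with zero usage the bill is zero, which is exactly what "no overprovisioning" means in practice.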
18. IBM SQL Query – The Central Cloud Data Lake Service
Serverless SQL Query service for data ingestion, data transformation, data management and analytics over Object Storage and RDBMS, serving developers, data engineers, data analysts and data scientists.
ü Supports ad-hoc and unknown data structures
ü ETL & ELT support
ü 100% pay-as-you-go ($5/TB)
ü 100% API enabled
ü Automatic big data scale-out with Spark
ü 100% self-service, no setup
ü Built-in database catalog & data skipping
19. IBM SQL Query Architecture
Flow: 1. Application submits SQL → 2. Read data → 3. Write results → 4. Application reads results
Data sources: Cloud Object Storage, Db2 on Cloud, Event Streams
SQL capabilities: Geospatial SQL, Timeseries SQL, Data Skipping, Hive Metastore
• Uses the IBM Analytics Engine service (Spark clusters aaS)
• Large farm of Spark clusters auto-provisioned & auto-managed in the background
• Manages a hot pool of Spark applications (a.k.a. kernels, using Jupyter Kernel Gateway)
• SQL grammar sandbox
• Auto-scaling of each serverless SQL job inside large Spark clusters using dynamic resource allocation
• Intrinsically HA (dispatching across Spark environments in each availability zone)
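The four-step flow above (submit SQL, read data, write results, read results) is the usual asynchronous job pattern: the client submits, the service does the reading and writing internally, and the client later fetches the result location. A minimal in-memory mock of that lifecycle; the class and method names are invented, not the real SDK or REST API:

```python
import uuid

class MockSQLService:
    """Toy stand-in for the submit/read/write/read-results lifecycle in the
    architecture diagram. All names are invented; the real service is driven
    via its REST API or Python SDK."""

    def __init__(self):
        self.jobs = {}

    def submit_sql(self, sql):
        """Step 1: application submits SQL; a job ID is returned immediately."""
        job_id = str(uuid.uuid4())
        # Steps 2-3 happen inside the service: the engine reads the source
        # data and writes the result set to object storage. The mock records
        # a completed job with a result location right away.
        self.jobs[job_id] = {
            "status": "completed",
            "result_url": f"cos://results/{job_id}/part-0000.csv",
        }
        return job_id

    def get_status(self, job_id):
        """Poll the job until it reaches a terminal state."""
        return self.jobs[job_id]["status"]

    def result_location(self, job_id):
        """Step 4: the application reads results from the returned location."""
        return self.jobs[job_id]["result_url"]
```

Decoupling result delivery through object storage is what lets the client and the query engine scale independently.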
20. IBM SQL Query – Access Patterns
Create, explore, integrate and deploy queries through: SQL Console, Watson Studio Notebooks, Object Store Console, Event Streams Console, Python SDK, REST API, JDBC, and Cloud Functions.
21. IBM Cloud Data Lake – Meta Data
Serverless SQL over Object Storage and RDBMS, with metadata layers:
• Schema, partitioning, statistics: Hive Metastore, Kafka Schema Registry
• ACID table formats on Spark: Iceberg, Delta Lake
• Data skipping indexes: Xskipper
• Governance policies & lineage: Watson Knowledge Catalog
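Data skipping (Xskipper above) keeps per-object summary metadata, such as min/max values per column, so a query can avoid reading objects that cannot possibly match its predicate. A minimal sketch of min/max skipping; the index layout here is illustrative, not Xskipper's actual format:

```python
# Minimal min/max data-skipping sketch. Each object in the lake carries
# summary metadata; a range predicate prunes objects whose value range
# cannot overlap it. Layout is illustrative, not Xskipper's real format.

OBJECT_INDEX = [  # per-object min/max stats for a column, e.g. "temp"
    {"key": "iot/part-0000.parquet", "min": -5.0, "max": 12.0},
    {"key": "iot/part-0001.parquet", "min": 10.0, "max": 31.0},
    {"key": "iot/part-0002.parquet", "min": 33.0, "max": 48.0},
]

def objects_to_scan(lo, hi, index=OBJECT_INDEX):
    """Keep only objects whose [min, max] range overlaps the predicate
    range [lo, hi]; everything else is skipped without being read."""
    return [o["key"] for o in index if o["max"] >= lo and o["min"] <= hi]
```

On object storage, every skipped object is a GET (and scanned bytes) saved, which directly lowers the pay-per-scan bill.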
22. IBM Cloud Data Lake – 2021 Architecture
• Stream data landing: Event Streams → Object Storage (COS)
• Real-time queries, stream transformations & joins: SQL Query
• Batch queries, ETL & data preparation: SQL Query over COS
• Schema management & enforcement: integrated Hive Metastore + Kafka Schema Registry + ACID (Iceberg) metadata